The Complete Internet Marketer
How Search Engines Work
Under The Hood
by Jay Neuman


This article is an excerpt from: The Complete Internet Marketer: A Practical Guide To Everything You
Need To Know About Marketing Online

The key to successfully getting listed on search engines is to understand how natural search works.  
How do search engines find things people are looking for and how do they choose which results to
list first?  When you understand this, you will be able to make sure people find your website.

To understand how search engines work, there are really just two key things to learn: spiders and
ranking web pages.  Everything else revolves around these two things.  Spiders are automated
computer programs that crawl the Web and read all of the words on web pages they find.  These
words are then put into a database.  The words in the database are organized so it is easy to search
in the future.  For each word in the database, web pages will be ranked to tell which will be listed at
the top of search results.

The process of organizing the database to allow future searches is commonly called indexing.  This is
a little bit of a jargon shortcut.  Indexing and ranking web pages are actually two separate actions
that occur in the database.  These will be described below.

Being successful with search engines is a matter of knowing how to get spiders to find your web
pages, what the search engines are looking for when they rank order pages, and how to use that
knowledge to get your pages listed at the top.


Spiders Crawl the Web

It all starts with those digital creatures that could only exist on the World Wide Web.  Spiders are
small, automated computer programs that search for web pages and read all of the text on each
page they find.  To make more sense out of just what this means, we need to take a dive into the
world of Internet jargon.  This is where it starts to get pretty thick.

First, spiders are a type of program known as a robot, or bot for short.  Robots are automated
programs that are designed to carry out some kind of action on the Web.  Usually, this involves
searching for some type of information on web pages and then performing some action based on the
information they find.  They are called robots because, once they are turned on, they will keep doing
what they are programmed to do without human intervention.  For example, shopping bots search
online stores and can be used to find the best prices on products you are looking for.  Search engine
spiders simply read every bit of text they find on each page they go to.

So what do spiders do?  Well, of course they crawl the Web.  Here is how crawling works.  A spider
goes to a particular website.  The search engine will send it to the URL of a popular site to start on.
The spider reads every bit of text on that web page.  Also, the spider will collect some information
about that text.  For example: is it in a meta tag, is it in the page title or a section header, is it close
to the top of the page?  Once it has collected all of the information it was programmed to get, the
spider follows every link on that page.  This way, it will go through all of the pages on that website.
It will also go to all of the other websites that are linked to from that site.  Once it gets to another
website, it will follow the same process.  In this way a spider continues to go from website to
website, picking up all of the words on each one as it goes.  It is crawling through the World Wide
Web.
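
The crawling loop described above can be sketched in a few lines of Python.  This is only an
illustration, not how any particular search engine does it.  The three-page "Web" (TOY_WEB), the
example URLs and the names LinkAndTextParser and crawl are all made up for this sketch, and the
pages are kept in memory so nothing is actually downloaded.

```python
from collections import deque
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects the words and the link targets found on one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


# Hypothetical three-page Web, held in memory so the sketch runs offline.
TOY_WEB = {
    "http://a.example/": '<h1>Cameras</h1><a href="http://b.example/">film</a>',
    "http://b.example/": '<p>Film and lenses</p><a href="http://a.example/">home</a>'
                         '<a href="http://c.example/">more</a>',
    "http://c.example/": "<p>Lenses for sale</p>",
}


def crawl(start_url, fetch=TOY_WEB.get):
    """Read a page, then follow every link on it, visiting each URL once."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched; move on
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        pages[url] = parser.words
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


pages = crawl("http://a.example/")
for url in sorted(pages):
    print(url, pages[url])
```

Starting from the first page, the spider reaches the other two only because links lead to them.
A page that nothing links to would never be found this way.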


Filling the Database

What happens to all of that information the spiders collect?  It gets sent back to the search engine
computers and stored in their database.  You can think of the database as being a gigantic list of
words.  To make sense out of all the information collected, it has to be organized in some way.

Every word on the Web is in there, at least all the ones found by the spiders.  For each word, it
stores the URL of the web page where the word was found.  It also has information about that word
as it appeared on the page.  Was the word in a meta tag?  Then it will show which meta tag it was in.
(Meta tags will be discussed in a later section.)  Was it in a subject header on the page?  How close
was it to the top of the page?  Each word, with its corresponding URL and information, is a row in the
list.  If the word appears ten times on a given page, then there will be ten rows for that word with
that URL.
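
To make the "gigantic list of words" more concrete, here is a rough sketch of what one row might
hold.  The field names (WordRow, position, in_title, in_header) and the helper rows_for_page are
assumptions made for illustration.  Real search engines do not publish their database layouts.

```python
from typing import NamedTuple


class WordRow(NamedTuple):
    """One row in the word list: a word, where it was found, and how."""
    word: str
    url: str
    position: int    # 0 means the first word on the page
    in_title: bool
    in_header: bool


def rows_for_page(url, title, header, body):
    """Produce one row per word occurrence, in page order."""
    rows = []
    position = 0
    for section, text in (("title", title), ("header", header), ("body", body)):
        for word in text.lower().split():
            rows.append(
                WordRow(word, url, position, section == "title", section == "header")
            )
            position += 1
    return rows


rows = rows_for_page(
    "http://a.example/",
    title="Film cameras",
    header="Buying film",
    body="Film is sold in rolls",
)
film_rows = [row for row in rows if row.word == "film"]
for row in film_rows:
    print(row)
```

The word "film" appears three times on this made-up page, so it gets three rows, each with its
own position and flags.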

There is one more important piece of information that will be stored in the database.  For each web
page, the search engine will record all of the external websites which link to that page.  Remember,
the spider gets to a website by following links on other websites.  Each time a link takes the spider to
a given website, that is recorded in the database.  So the search engine knows how many times your
website is being linked to from other websites.  This will become important when it comes to rank
ordering your web pages.  Websites being linked to from many external websites are considered
more popular than those with few external links.
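
Counting external links can be sketched as follows.  The link list and the function name
external_inbound_counts are hypothetical.  Counting each linking website only once, and ignoring
links a site makes to itself, are design choices made for this sketch, not something the search
engines have confirmed.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def external_inbound_counts(link_pairs):
    """Count how many distinct outside websites link to each page."""
    linking_sites = defaultdict(set)
    for source, target in link_pairs:
        source_host = urlsplit(source).hostname
        target_host = urlsplit(target).hostname
        if source_host != target_host:  # skip links within the same website
            linking_sites[target].add(source_host)
    return {page: len(sites) for page, sites in linking_sites.items()}


# Hypothetical (from_url, to_url) pairs recorded while crawling.
LINKS = [
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://b.example/news", "http://c.example/"),   # same site as above
    ("http://c.example/about", "http://c.example/"),  # internal link
    ("http://c.example/", "http://a.example/"),
]

counts = external_inbound_counts(LINKS)
print(counts)
```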


Ranking Web Pages

Now the spiders have filled the search engine database with a list of words found on web pages.  The
next step is to organize the information in the database so it will return relevant listings.  All of the
URLs associated with each word in the database are rank ordered, so ones where a given word is
particularly important to the content of the web page will be listed higher than those where the word
just happens to appear there.  This is done by assigning weights to the web pages.

Every search engine has its own algorithm for ranking web pages.  In general, they all assign
weights based on certain information about the word as it appears on the page.  Think of it as a point
system.  The web page gains or loses points based on the algorithm.  There are four key
factors considered by search engines in assigning these points.

1.        Meta Tags

Most web pages contain sections in the HTML code set aside to give information about what is on
the page.  These are called meta tags.  This is where the website creator can tell what the page is
about.  Information in meta tags is not displayed in the body of the page, but it is read by search
engine spiders.

Three tags are especially important for every web page:  the Title, Description and Keywords.
(Strictly speaking, the Title is its own HTML element rather than a meta tag, but it sits in the same
part of the code and serves the same purpose.)  These three tags allow you to categorize your own
web page.  So they are also the first place search engines look when figuring out which words are
most important on that page.  There is also a fourth meta tag important for search engines.  That is
the Robots tag.  This meta tag allows you to instruct search engine spiders to leave your web page
out of their listings altogether.

Of course, people are going to look for ways to trick the system.  Sometimes websites will load a lot
of popular terms into their meta tags that do not really have that much to do with the content of the
page.  To account for this, search engines will take away points from pages that have the words in
the meta tags but not in the body of the page.  Search engines also take away points from pages
that simply have too many words of any kind in the meta tags.
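
Here is a small sketch of how a program might read these tags and spot keywords that never show
up in the body of the page.  The sample page, the class name HeadParser and the function
unsupported_keywords are invented for this example.  It only handles single-word keywords, and
how real search engines penalize such pages is not public.

```python
from html.parser import HTMLParser


class HeadParser(HTMLParser):
    """Collects the title, the named meta tags and the words in the body."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.body_words = set()
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body:
            self.body_words.update(data.lower().split())


# Hypothetical page whose Keywords tag is padded with unrelated popular terms.
PAGE = """<html><head>
<title>Film Cameras</title>
<meta name="description" content="A guide to film cameras">
<meta name="keywords" content="film, cameras, lottery, celebrity">
</head><body><h1>Film cameras</h1><p>Choosing film for old cameras.</p></body></html>"""


def unsupported_keywords(html):
    """Return the meta keywords that do not appear anywhere in the page body."""
    parser = HeadParser()
    parser.feed(html)
    raw = parser.meta.get("keywords", "")
    keywords = [k.strip().lower() for k in raw.split(",") if k.strip()]
    return [k for k in keywords if k not in parser.body_words]


print(unsupported_keywords(PAGE))
```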

2.        Keyword Density

The second thing search engines look for when ranking web pages is how often the word appears on
the page.  This is called keyword density.  In general, the more often the word appears on the web
page, the more relevant the word is to the overall content of the page.  Therefore, it will receive a
higher weight.

However, search engine companies know that some websites try to trick the system and load their
web pages with text that is just there to be found by search engines.  To get around this, the search
engines will start taking away points if the word appears too many times.
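
The idea of rewarding density up to a point and then taking points away can be sketched like this.
The 10 percent cap, the 10-point maximum and the straight-line shape of the penalty are all
assumptions chosen to keep the arithmetic simple.  Each search engine uses its own, unpublished
numbers.

```python
def keyword_density(word, page_words):
    """Fraction of the words on the page that are the given word."""
    if not page_words:
        return 0.0
    return page_words.count(word) / len(page_words)


def density_points(density, cap=0.10, max_points=10.0):
    """Points rise with density up to the cap, then fall back toward zero."""
    if density <= cap:
        return max_points * density / cap
    # Past the cap, take points away; never go below zero in this sketch.
    return max(0.0, max_points * (1 - (density - cap) / cap))


# Hypothetical pages: one mentions "film" naturally, one is stuffed with it.
natural_page = ["film"] * 2 + ["filler"] * 38    # 2 of 40 words
stuffed_page = ["film"] * 20 + ["filler"] * 20   # 20 of 40 words

natural_points = density_points(keyword_density("film", natural_page))
stuffed_points = density_points(keyword_density("film", stuffed_page))
print(natural_points > stuffed_points)
```

With these made-up numbers, the stuffed page ends up with fewer points than the page that uses
the word naturally, which is the behavior described above.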

3.        Position on Page

Another way search engines rank the relevance of words to the pages they are on is by where the
word appears.  The HTML code for the web page identifies subject headers and titles on the page.
These will usually be displayed in a larger font, in bold or in a different color.  When a word
appears in a title or subject header, that gives the page more points.

Also, when people design web pages, they want the most important information to be displayed at
the top of the page, where people will see it without scrolling down the page.  This is called being
above the fold.  Words that appear above the fold are considered more important than those below.
But there is no easy way to tell exactly where the fold will be for each page.  Instead, search
engines give more points the closer the word appears to the top of the page.
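
A position score along these lines might look like the sketch below.  The bonus values (3 points
for a title, 2 for a header, up to 1 for being near the top) and the function name position_points
are made up for illustration only.

```python
def position_points(first_index, total_words, in_title=False, in_header=False):
    """Give points for appearing in a title or header and for being near the top.

    first_index is the position of the word's first occurrence (0 = top of page).
    """
    points = 0.0
    if in_title:
        points += 3.0
    if in_header:
        points += 2.0
    if total_words > 0:
        # 1.0 at the very top of the page, shrinking toward 0.0 at the bottom.
        points += 1.0 - first_index / total_words
    return points


near_top = position_points(first_index=10, total_words=100)
near_bottom = position_points(first_index=90, total_words=100)
print(near_top > near_bottom)
```

Using the distance from the top as a fraction of the page length is one simple way to stand in for
"above the fold" when the fold itself cannot be known.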

4.        External Links

Finally, if all else is equal, search engines want to list more popular web pages above pages that are
not very popular.  They estimate how popular a web page is by how many times it is linked to from
external websites.

Once again, search engine companies know that people will try to work the system.  A common trick
is to set up a lot of useless or redundant websites and have them all link to the main site.  Search
engines have ways of identifying this trick.


Indexing the Database

As we said above, organizing the database to be searched is often referred to as indexing, or
creating the index.  Indexing is a technical database term.  It means building a lookup structure
over the information in the database, much like the index at the back of a book, so that a record
can be found without reading through every row.  There are a variety of indexing methods available.
Search engines have massive amounts of information that must be searched and retrieved in a matter
of seconds.  So you can imagine they use some of the most complex algorithms available.  Luckily,
you really do not need to know about them.  What is important to understand is how web pages are
ranked in the database.  It is just a nice thing to know that the database is also indexed to make it
run faster.  Anyone who actually needs to know about indexing will be a database professional and
already familiar with the concepts.

The main reason for discussing indexing at all is to avoid confusion.  In the common web jargon, the
terms indexing and ranking are often used together.  You should know what each of the two terms
actually means.
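
For the curious, the difference an index makes can be shown with a toy example.  The rows and the
function names scan and build_index are hypothetical, and a Python dictionary is only one simple
kind of index.  It is not meant to represent the methods search engines actually use.

```python
from collections import defaultdict

# Hypothetical (word, url) rows, as collected by a spider.
ROWS = [
    ("film", "http://a.example/"),
    ("cameras", "http://a.example/"),
    ("film", "http://b.example/"),
    ("lenses", "http://b.example/"),
    ("lenses", "http://c.example/"),
]


def scan(rows, word):
    """Without an index: look at every single row to find the word."""
    return [url for row_word, url in rows if row_word == word]


def build_index(rows):
    """With an index: group the URLs by word once, ahead of time."""
    index = defaultdict(list)
    for word, url in rows:
        index[word].append(url)
    return dict(index)


index = build_index(ROWS)

# Both approaches give the same answer; the index just gets there directly.
print(index.get("lenses", []) == scan(ROWS, "lenses"))
```

The scan has to touch every row each time someone searches.  The index is built once and then
each search goes straight to the right entry.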


Performing Searches

The search engine has an indexed database full of words.  All of the web pages where those words
appear are rank ordered based on relevance.  Now it is ready to be searched.  So what happens
when you type those words into the search box and click on the “Search” button?

It is easy to see what happens when you enter just a single word.  Let us say you want to find
information about the Kodak Corporation.  You type the word, “Kodak” into the search box and click
“Search.”  All of the web pages where the word Kodak appears have been weighted.  So the search
engine basically just returns web pages with the highest weight first.  But what happens when you
enter more than one word?  That is when it gets interesting.

To search for more than one term, the search engine uses what is called Boolean logic.  Boolean
logic connects multiple words using connectors such as AND, OR and NOT.  These are known as
Boolean operators in computer programming speak.  By stringing together multiple words into a
phrase, using these connectors, the search engine is able to find results based on the combination
of all the terms being searched.  Doing this basically involves searching for all of the words and
then using some kind of algorithm to calculate a weight based on the combination of all of the
words.  Once again, each search engine will use its own algorithm for calculating weights for the
combined search term.

Most people do not use Boolean operators when submitting their searches.  They just type in all of
the words.  The search engine then has to decide how to construct the Boolean expression before
submitting the search.  Typically, a search engine will try an expression where all the words are
present (connected by AND) first.  Then it will look for pages where some of the words are present
(connected by OR).  If you want to narrow your search beyond this, you can enter your own
expression.
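
The "all the words first, then some of the words" approach can be sketched with sets.  The tiny
index, the page names p1 to p5 and the search function are made up for this example, and pages
within each group are simply listed alphabetically rather than by a real weight.

```python
# Hypothetical index: each word maps to the set of pages that contain it.
INDEX = {
    "kodak": {"p1", "p2", "p3"},
    "film": {"p2", "p3", "p4"},
    "cameras": {"p3", "p5"},
}


def search(query, exclude="", index=INDEX):
    """List pages with ALL the words first, then pages with only SOME of them.

    Words passed in `exclude` act like NOT: pages containing them are dropped.
    """
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    all_terms = set.intersection(*postings)   # AND
    any_term = set.union(*postings)           # OR
    excluded = set().union(*(index.get(t, set()) for t in exclude.lower().split()))
    ordered = sorted(all_terms) + sorted(any_term - all_terms)
    return [page for page in ordered if page not in excluded]


print(search("kodak film"))
print(search("kodak", exclude="film"))
```

For the query "kodak film", the pages containing both words come first, followed by the pages
that contain only one of them.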


Submitting Web Pages to Search Engines

Now spiders are crawling web pages to build their database of words to be searched.  What do you
do if you have a small website that isn’t being linked to by popular sites?  Do you just have to wait
and hope the spiders find your website?  Luckily, most search engines will allow you to submit your
website to be spidered.

Each search engine has its own submission procedure.  To submit your website, just go to the search
engine where you want to be listed.  Their submission guidelines will be posted on the website.  If
you are using a Search Engine Marketing firm, they will usually be able to do this for you.


Blocking Robots

In most cases, you want spiders from the major search engines to find your web pages.  But
sometimes you do not.  There are a variety of reasons you may not want some of the pages on your
website to be crawled.  You may have pages on your website that are still under construction or are
simply intended as doorways from specific links.  Also, robots look like web users coming to your
web pages.  They perform actions (e.g. following links) just like web users.  They also leave a record
in your web logs like regular users.  But robots are not web users.  They are little automated
computer programs.  What if your web pages are running some kind of software that is triggered by
the actions of users on the page?  The actions of spiders on the page could throw a wrench in your
works.  Perhaps you just do not want your traffic logs being cluttered with robot traffic.  Is there
anything you can do about it?  Luckily, yes.

Robots usually identify themselves.  In most cases, the robots are benign.  They are there to
perform a legitimate service.  Companies have nothing to gain by hiding them.  In fact, they want
websites to know they are coming.  That makes it easier for the websites to optimize their pages to
be searched by the robots.  This also helps if you do not want to be searched by the robots.

The first good thing about this is it lets you automatically filter out robot traffic from your traffic
reports.  They are easily identified as robots and can simply be taken out of the calculations.  
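
A simple filter of this kind might look like the sketch below.  It assumes each log entry carries
the visitor's user-agent string, which is where robots usually identify themselves.  The log
entries, the "ExampleSpider" name and the list of marker words are hypothetical, and a short list
like this will not catch every robot.

```python
# Words that commonly appear in the user-agent strings of robots (illustrative only).
BOT_MARKERS = ("bot", "spider", "crawler")


def is_robot(user_agent):
    """Guess whether a visit came from a robot, based on its user-agent string."""
    text = user_agent.lower()
    return any(marker in text for marker in BOT_MARKERS)


# Hypothetical web log entries.
LOG = [
    {"path": "/", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    {"path": "/", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                                "+http://www.google.com/bot.html)"},
    {"path": "/about", "user_agent": "ExampleSpider/1.0"},
]

human_hits = [hit for hit in LOG if not is_robot(hit["user_agent"])]
print(len(human_hits), "of", len(LOG), "hits came from regular users")
```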

The second good thing is you can instruct the robots to skip your web page altogether.  There are
two ways to do this.  First, you can give spiders instructions for a specific web page by including a
special meta tag called the Robots tag.  This tells the spider not to include that page in the search
engine's listings.

A better way to block robots is to include a robots.txt file in the root directory of your website.  This
is a simple text file that lists all of the pages you do not want spiders to crawl.  You can also instruct
specific search engines not to crawl pages on your website in the robots.txt file.  Because the file
sits at the root directory of your website, robots can easily find it.  It is the first thing well-behaved
robots look for.  A missing robots.txt file is normally treated as permission to crawl the whole site,
but a request for the file that ends in a server error can cause a spider to hold off on crawling.
It is, therefore, a good practice to have the file present even if there is nothing in it.
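
To see how a robot reads this file, here is a short sketch using the robots.txt parser that comes
with Python's standard library.  The file contents, the robot names "ExampleBot" and
"AnyOtherBot" and the www.example.com addresses are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every robot out of /drafts/,
# and keep one specific robot out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/

User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

home_ok = parser.can_fetch("AnyOtherBot", "http://www.example.com/index.html")
draft_ok = parser.can_fetch("AnyOtherBot", "http://www.example.com/drafts/new.html")
blocked_bot_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")

print(home_ok, draft_ok, blocked_bot_ok)
```

Keep in mind that robots.txt is a request, not a lock.  Robots from the major search engines
honor it, but nothing forces a badly behaved robot to do so.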




==========================
This article is an excerpt from The Complete Internet Marketer: A Practical Guide To Everything You
Need To Know About Marketing Online by Jay Neuman.

Since 1994, Jay Neuman has been helping businesses as varied as Fortune 500 companies, startup
Dot-Coms and nonprofit organizations overcome their Internet Marketing and Database Marketing
challenges.  

Jay is currently Sole Proprietor of the KnExT Consulting Group, www.knextconsulting.com.
He can be reached at jay.neuman@knextconsulting.com