How Search Engines Work
Under The Hood
by Jay Neuman

This article is an excerpt from: The Complete Internet Marketer: A Practical
Guide To Everything You Need To Know About Marketing Online

The key to successfully getting listed on search engines is to understand how
natural search works.  How do search engines find things people are looking for
and how do they choose which results to list first?  When you understand this,
you will be able to make sure people find your website.  In this article, you will
learn how search engines work under the hood.

To understand how search engines work, there are really just two key things to
learn: spiders and ranking web pages.  Everything else revolves around these
two things.  Spiders are automated computer programs that crawl the Web and
read all of the words on the web pages they find.  These words are then put into
a database.  The words in the database are organized so they are easy to search
in the future.  For each word in the database, web pages are ranked to determine
which will be listed at the top of the search results.

The process of organizing the database to allow future searches is commonly
called indexing.  This is a little bit of a jargon shortcut.  Indexing and ranking
web pages are actually two separate actions that occur in the database.  These
will be described below.  

Being successful with search engines is a matter of knowing how to get spiders
to find your web pages, what the search engines look for when they rank pages,
and how to use that knowledge to get your pages listed at the top.


Spiders Crawl the Web

It all starts with those digital creatures that could only exist on the World Wide
Web.  Spiders are small, automated computer programs that search for web
pages and read all of the text on each page they find.  To make more sense out
of just what this means, we need to take a dive into the world of Internet
jargon.  This is where it starts to get pretty thick.

First, spiders are a type of program known as a robot, or bot for short.  Robots
are automated programs that are designed to carry out some kind of action on
the Web.  Usually, this involves searching for some type of information on web
pages and then performing some action based on the information they find.  They
are called robots, because once they are turned on, they will keep doing what
they are programmed to do without human intervention.  For example, shopping
bots search online stores and can be used to find the best prices on products
you are looking for.  Search engine spiders simply read every bit of text they
find on each page they go to.

So what do spiders do?  Well, of course they crawl the Web.  Here is how
crawling works.  A spider goes to a particular website.  The search engine will
send it to the URL of a popular site to start on.  The spider reads every bit of
text on that web page.  Also, the spider will collect some information about that
text.  For example: is it in a meta tag, is it in the page title or a section header,
is it close to the top of the page?  Once it has collected all of the information it
was programmed to get, the spider follows every link on that page.  This way, it
will go through all of the pages on that website.  It will also go to all of the other
websites that are linked to from that site.  Once it gets to another website, it will
follow the same processes.  In this way a spider continues to go from website to
website, picking up all of the words on each one as it goes.  It is crawling
through the World Wide Web.
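
The crawl described above can be sketched in a few lines of Python.  This is only
an illustration, not how any particular search engine does it.  The tiny WEB
dictionary stands in for real web pages, and the names WEB and crawl are made up
for this example.

```python
from collections import deque

# Hypothetical stand-in for the Web: URL -> (page text, links found on the page).
WEB = {
    "a.example/home": ("cameras and film", ["a.example/about", "b.example/shop"]),
    "a.example/about": ("about our film lab", ["a.example/home"]),
    "b.example/shop": ("buy cameras online", []),
}

def crawl(start_url):
    """Visit every page reachable from start_url and collect the words on each."""
    seen = {start_url}
    queue = deque([start_url])
    words_by_url = {}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        words_by_url[url] = text.split()   # read every bit of text on the page
        for link in links:                 # then follow every link on the page
            if link in seen or link not in WEB:
                continue                   # skip pages already visited or missing
            seen.add(link)
            queue.append(link)
    return words_by_url

pages = crawl("a.example/home")
```

Starting from one page, the spider reaches the other page on the same site and
the external site it links to.  The seen set keeps it from going around in
circles when pages link back to each other.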


Filling the Database

What happens to all of that information the spiders collect?  It gets sent back to
the search engine computers and stored in their database.  You can think of the
database as being a gigantic list of words.  To make sense out of all the
information collected, it has to be organized in some way.

Every word on the Web is in there, at least all the ones found by the spiders.  
For each word, it stores the URL of the web page where the word was found.  It
also has information about that word as it appeared on the page.  Was the word
in a meta tag?  Then it will show which meta tag it was in.  (Meta tags will be
discussed in a later section.)  Was it in a subject header on the page?  How close
was it to the top of the page?  Each word, with its corresponding URL and
information, is a row in the list.  If the word appears ten times on a given page,
then there will be ten rows for that word with that URL.
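
To picture the list, here is a rough Python sketch of what one row might hold.
The field names (word, url, also_in_title, position) are assumptions made for
this example.  Real search engines store far more, and in their own formats.

```python
from typing import NamedTuple

class WordRow(NamedTuple):
    word: str
    url: str
    also_in_title: bool   # does this word also appear in the page title?
    position: int         # 0 = first word on the page

def rows_for_page(url, title, body):
    """Return one row per word occurrence, so repeated words get repeated rows."""
    title_words = set(title.lower().split())
    rows = []
    for position, word in enumerate(body.lower().split()):
        rows.append(WordRow(word, url, word in title_words, position))
    return rows

rows = rows_for_page("a.example/film", "Film Lab", "film film lab prices")
```

The word "film" appears twice in the body, so it gets two rows with the same URL
but different positions.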

There is one more important piece of information that will be stored in the
database.  For each web page, the search engine will record all of the external
websites which link to that page.  Remember, the spider gets to a website by
following links on other websites.  Each time a link takes the spider to a given
website, that is recorded in the database.  So the search engine knows how
many times your website is being linked to from other websites.  This will
become important when it comes to rank ordering your web pages.  Websites
being linked to from many external websites are considered more popular than
those with few external links.
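
Counting external links can be sketched like this.  The list of links and the
helper names are hypothetical, and the toy URLs have no "http://" prefix so the
site name is simply everything before the first slash.

```python
from collections import defaultdict

# Hypothetical links recorded by the spider: (page the link is on, page it points to).
LINKS = [
    ("a.example/home", "c.example/"),
    ("b.example/blog", "c.example/"),
    ("c.example/news", "c.example/"),   # same site, so not an external link
    ("a.example/home", "b.example/blog"),
]

def site(url):
    """In these toy URLs, the site is everything before the first slash."""
    return url.split("/", 1)[0]

def external_link_counts(links):
    """Count, for each page, the links that come from a different site."""
    counts = defaultdict(int)
    for source, target in links:
        if site(source) != site(target):
            counts[target] += 1
    return dict(counts)

inbound = external_link_counts(LINKS)
```

Here the page c.example/ ends up with two external links.  The link from its own
news page is not counted.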


Ranking Web Pages

Now the spiders have filled the search engine database with a list of words found
on web pages.  The next step is to organize the information in the database so it
will return relevant listings.  All of the URLs associated with each word in the
database are rank ordered, so ones where a given word is particularly important
to the content of the web page will be listed higher than those where the word
just happens to appear there.  This is done by assigning weights to the web
pages.

Every search engine has its own algorithm for ranking web pages.  In general,
they all assign weights based on certain information about the word as it
appears on the page.  Think of it as a point system.  The web page gains or
loses points based on the algorithm.  There are four key factors considered
by search engines in assigning these points.

1.        Meta Tags

Every web page contains sections in the HTML code set aside to give information
about what is on the page.  These are called meta tags.  This is where the
website creator can tell what the page is about.  Information in meta tags is not
displayed on the page, but it is read by search engine spiders.

Three of these are especially important for every web page:  the Title,
Description and Keywords.  (Strictly speaking, the Title is its own HTML element
rather than a meta tag, and browsers show it in the window title bar, but it sits
in the same section of the HTML and serves the same purpose.)  These three tags
allow you to categorize your own web page.  So they are also the first place
search engines look when figuring out which words are most important on that
page.  There is also a fourth meta tag important for search engines.  That is the
Robots tag.  This meta tag allows you to instruct search engine spiders to skip
over your web page altogether.
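
Here is a small Python sketch of how a spider might pick these tags out of a
page, using the HTMLParser class from Python's standard library.  The sample page
and the HeadReader class are invented for this example.

```python
from html.parser import HTMLParser

class HeadReader(HTMLParser):
    """Collect the title and the named meta tags from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page with a title, description, keywords and robots tag.
PAGE = """<html><head>
<title>Film Lab Prices</title>
<meta name="description" content="Prices for film developing.">
<meta name="keywords" content="film, developing, prices">
<meta name="robots" content="noindex, nofollow">
</head><body>Our film developing prices.</body></html>"""

reader = HeadReader()
reader.feed(PAGE)
```

After feeding the page in, reader.title holds the title text and reader.meta
holds the description, keywords and robots values.  In this sample, the robots
value "noindex, nofollow" asks spiders not to list the page or follow its links.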

Of course, people are going to look for ways to trick the system.  Sometimes
websites will load a lot of popular terms into their meta tags that do not really
have that much to do with the content of the page.  To account for this, search
engines will take away points from pages that have the words in the meta tags
but not in the body of the page.  Search engines also take away points from
pages that simply have too many words of any kind in the meta tags.

2.        Keyword Density

The second thing search engines look for when ranking web pages is how often
the word appears on the page, relative to the total number of words on the
page.  This is called keyword density.  In general, the
more often the word appears on the web page, the more relevant the word is to
the overall content of the page.  Therefore, it will receive a higher weight.  

However, search engine companies know that some websites try to trick the
system and load their web pages with text that is just there to be found by
search engines.  To get around this, the search engines will start taking away
points if the word appears too many times.
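
As a rough sketch, keyword density and the penalty for overdoing it might look
like this in Python.  The cap of 0.25 and the shape of the penalty are pure
assumptions for illustration.  Search engines do not publish their actual
numbers.

```python
def keyword_density(word, text):
    """Fraction of the words on the page that are the given word."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(word.lower()) / len(words)

def density_points(density, cap=0.25):
    """Points rise with density up to a hypothetical cap, then fall away."""
    if density <= cap:
        return density
    return max(0.0, cap - (density - cap))

normal = "film prices for our film lab and studio today"   # "film" is 2 of 9 words
stuffed = "film film film film lab"                         # "film" is 4 of 5 words

normal_points = density_points(keyword_density("film", normal))
stuffed_points = density_points(keyword_density("film", stuffed))
```

The normal page keeps its points, while the stuffed page ends up with fewer
points even though the word appears more often.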

3.        Position on Page

Another way search engines rank the relevance of words to the pages they are on
is by where the word appears.  The HTML code for the web page identifies subject
headers and titles on the page.  These will usually be displayed in a larger font,
in bold or in a different color.  When a word appears in a title or subject header,
that gives the page more points.

Also, when people design web pages, they want the most important information
to be displayed at the top of the page, where people will see it without scrolling
down the page.  This is called being above the fold.  Words that appear above
the fold are considered more important than those below.  But there is no easy
way to tell exactly where the fold will be for each page.  Instead, search engines
give more points for how close the word appears to the top of the page.

4.        External Links

Finally, if all else is equal, search engines want to list more popular web pages
above pages that are not very popular.  They estimate how popular a web page
is by how many times it is linked to from external websites.

Once again, search engine companies know that people will try to work the
system.  A common trick is to set up a lot of useless or redundant websites and
have them all link to the main site.  Search engines have ways of identifying this
trick.
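
Putting the four factors together, the point system might be sketched like this
in Python.  Every number below is invented for illustration, as are the field
names.  Each search engine has its own algorithm and keeps the details to
itself.

```python
from dataclasses import dataclass

@dataclass
class WordOnPage:
    in_meta_tags: bool      # factor 1: word appears in the meta tags
    in_body: bool           # ...and also in the visible body of the page
    density: float          # factor 2: share of the page's words (0.0 to 1.0)
    in_header: bool         # factor 3: word appears in a title or subject header
    first_position: float   # factor 3: 0.0 = top of the page, 1.0 = bottom
    external_links: int     # factor 4: links from external websites

def weight(w, density_cap=0.25):
    """Add up hypothetical points for one word on one page."""
    points = 0.0
    if w.in_meta_tags:
        points += 2.0 if w.in_body else -2.0        # meta-only words lose points
    if w.density <= density_cap:
        points += 10.0 * w.density
    else:
        points -= 10.0 * (w.density - density_cap)  # too many repeats
    if w.in_header:
        points += 1.0
    points += 1.0 - w.first_position                # nearer the top scores higher
    points += 0.1 * min(w.external_links, 50)       # popularity, capped
    return points

honest = WordOnPage(True, True, 0.05, True, 0.1, 20)
stuffed = WordOnPage(True, False, 0.6, False, 0.0, 0)
```

With these made-up numbers, the honest page comes out well ahead of the page
that stuffs its meta tags and body with the same word.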


Indexing the Database

As we said above, organizing the database to be searched is often referred to
as indexing, or creating the index.  Indexing is a technical database term.  It
means to physically reorganize the information in the database.  Some
advanced mathematical algorithms are used to allow very fast search and
retrieval of data records.  There are a variety of indexing methods available.  
Search engines have massive amounts of information that must be searched
and retrieved in a matter of seconds.  So you can imagine they use some of the
most complex algorithms available.  Luckily, you really do not need to know
about them.  What is important to understand is how web pages are ranked in
the database.  It is just a nice thing to know that the database is also indexed to
make it run faster.  Anyone who actually needs to know about indexing will be a
database professional and already familiar with the concepts.
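
For the curious, the basic idea can be shown with a very simple Python sketch.
This is nothing like the algorithms real search engines use.  It only shows why
an index is faster than reading the whole list.  The rows and function names are
made up for the example.

```python
from collections import defaultdict

# Hypothetical rows from the word list: (word, URL where it was found).
ROWS = [
    ("film", "a.example/film"),
    ("lab", "a.example/film"),
    ("film", "b.example/shop"),
    ("cameras", "b.example/shop"),
]

def scan(rows, word):
    """Without an index: look at every row in the list."""
    return {url for w, url in rows if w == word}

def build_index(rows):
    """With an index: group the URLs by word once, ahead of time."""
    index = defaultdict(set)
    for word, url in rows:
        index[word].add(url)
    return index

index = build_index(ROWS)
```

Both approaches give the same answer for "film", but index["film"] jumps straight
to it instead of reading every row.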

The main reason for discussing indexing at all is to avoid confusion.  In
common web jargon, the terms indexing and ranking are often used together, as
if they were the same thing.  You should know what each term actually means.


Performing Searches

The search engine has an indexed database full of words.  All of the web pages
where those words appear are rank ordered based on relevance.  Now it is
ready to be searched.  So what happens when you type those words into the
search box and click on the “Search” button?

It is easy to see what happens when you enter just a single word.  Let us say
you want to find information about the Kodak Corporation.  You type the word
“Kodak” into the search box and click “Search.”  All of the web pages where the
word Kodak appears have been weighted.  So the search engine basically just
returns web pages with the highest weight first.  But what happens when you
enter more than one word?  That is when it gets interesting.
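
The single-word case really is that simple.  Here is a Python sketch, with
made-up URLs and weights:

```python
# Hypothetical weights already stored in the database: word -> {URL: weight}.
WEIGHTS = {
    "kodak": {
        "kodak.example/": 9.1,
        "photo.example/review": 4.2,
        "news.example/story": 6.5,
    },
}

def search_one(word):
    """Return the pages containing the word, highest weight first."""
    pages = WEIGHTS.get(word.lower(), {})
    return sorted(pages, key=pages.get, reverse=True)

results = search_one("Kodak")
```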

To search for more than one term, the search engine uses what is called
Boolean logic.  Boolean logic connects multiple words using connectors such as
and, or and not.  These are known as Boolean operators in computer programming
speak.  By stringing together multiple words into a phrase, using these
connectors, the search engine is able to find results based on the combination of
all the terms being searched.  Doing this basically involves searching for all of
the words and then using some kind of algorithm to calculate a weight based on
the combination of all of the words.  Once again, each search engine will use
its own algorithm for calculating weights for the combined search term.

Most people do not use Boolean operators when submitting their searches.  They
just type in all of the words.  The search engine then has to decide how to
construct the Boolean expression before submitting the search.  Typically, a
search engine will try an expression where all the words are present (connected
by and) first.  Then it will look for pages where some of the words are present
(connected by or).  If you want to narrow your search beyond this, you can
enter your own expression.
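
Here is one way the "all words first, then some words" approach could be
sketched in Python.  The index contents are invented, and adding the weights
together is just an assumption for the example.  Each search engine combines
weights in its own way.

```python
# Hypothetical index: word -> {URL: weight for that word on that page}.
INDEX = {
    "kodak": {"a.example/": 5.0, "b.example/": 2.0},
    "film": {"a.example/": 1.0, "c.example/": 4.0},
}

def search(query):
    """Pages with all of the words come first, then pages with some of them."""
    words = query.lower().split()
    postings = [INDEX.get(word, {}) for word in words]
    if not postings:
        return []
    all_words = set.intersection(*(set(p) for p in postings))   # connected by and
    any_word = set.union(*(set(p) for p in postings))           # connected by or

    def combined(url):
        # Assumed combination rule: add up the weights of the words found.
        return sum(p.get(url, 0.0) for p in postings)

    first = sorted(all_words, key=combined, reverse=True)
    rest = sorted(any_word - all_words, key=combined, reverse=True)
    return first + rest

results = search("kodak film")
```

Only a.example/ contains both words, so it is listed first.  The other two pages
follow in order of their weights.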


Submitting Web Pages to Search Engines

Now spiders are crawling web pages to build their database of words to be
searched.  What do you do if you have a small website that isn’t being linked to
by popular sites?  Do you just have to wait and hope the spiders find your
website?  Luckily, most search engines will allow you to submit your website to
be spidered.

Each search engine has its own submission procedure.  To submit your website,
just go to the search engine where you want to be listed.  Their submission
guidelines will be posted on the website.  If you are using a Search Engine
Marketing firm, they will usually be able to do this for you.


Blocking Robots

In most cases, you want spiders from the major search engines to find your web
pages.  But sometimes you do not.  There are a variety of reasons you may not
want some of the pages on your website to be crawled.  You may have pages on
your website that are still under construction or are simply intended as doorways
from specific links.  Also, robots look like web users coming to your web pages.  
They perform actions (e.g. following links) just like web users.  They also leave
a record in your web logs like regular users.  But robots are not web users.  
They are little automated computer programs.  What if your web pages are
running some kind of software that is triggered by the actions of users on the
page?  The actions of spiders on the page could throw a wrench in your works.  
Perhaps you just do not want your traffic logs being cluttered with robot traffic.  
Is there anything you can do about it?  Luckily, yes.

Robots usually identify themselves.  In most cases, the robots are benign.  They
are there to perform a legitimate service.  Companies have nothing to gain by
hiding them.  In fact, they want websites to know they are coming.  That makes
it easier for the websites to optimize their pages to be searched by the robots.  
This also helps if you do not want to be searched by the robots.

The first good thing about this is it lets you automatically filter out robot traffic
from your traffic reports.  They are easily identified as robots and can simply be
taken out of the calculations.  
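
Robots identify themselves through the User-Agent string they send with each
request.  Filtering them out can be sketched in Python like this.  The log
entries are made up, and the list of marker words is an assumption.  Real
traffic tools keep much longer lists of known robots.

```python
# Hypothetical log entries: (requested page, User-Agent string sent by the visitor).
LOG = [
    ("/index.html", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    ("/index.html", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    ("/prices.html", "ExampleShoppingBot/1.0"),
    ("/prices.html", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)"),
]

# Assumed marker words that suggest a visitor is a robot.
ROBOT_MARKERS = ("bot", "spider", "crawler")

def is_robot(user_agent):
    """Guess whether a visitor is a robot from the way it identifies itself."""
    agent = user_agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)

human_hits = [page for page, agent in LOG if not is_robot(agent)]
```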

The second good thing is you can instruct the robots to skip your web page
altogether.  There are two ways to do this.  First, you can instruct spiders to skip
a specific web page by including a special meta tag called the Robots tag.  This
simply tells the spider to skip that page.

A better way to block robots is to include a robots.txt file in the root directory
of your website.  This is a simple text file that lists all of the pages you do not
want spiders to crawl.  You can also instruct specific search engines not to crawl
pages on your website in the robots.txt file.  By putting this file at the root
directory of your website, robots can easily find it.  It is the first thing they look
for.  If the file is missing, well-behaved robots generally treat the whole site as
open to crawling.  It is still a good practice to have the file present, even if there
is nothing in it, so that robots asking for it do not fill your web logs with errors.
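
To see how a robot reads the file, here is a short sketch using the
urllib.robotparser module from Python's standard library.  The file contents,
the robot names and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot must skip one directory,
# and one specific robot (ExampleBot) must skip the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /under-construction/

User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

home_ok = parser.can_fetch("SomeSpider", "http://www.example.com/index.html")
draft_ok = parser.can_fetch("SomeSpider", "http://www.example.com/under-construction/new.html")
blocked_bot_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
```

A robot that follows the rules would crawl the home page, skip everything in the
under-construction directory, and, if it were ExampleBot, stay away entirely.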



==========================

Since 1994, Jay Neuman has been helping businesses as varied as Fortune 500
companies, startup Dot-Coms and nonprofit organizations overcome their
Internet Marketing and Database Marketing challenges.  Jay is currently Sole
Proprietor of the KnExT Consulting Group (www.knextconsulting.com).

He can be reached at jay.neuman@knextconsulting.com