By now we hope you have noticed a theme running throughout the Internet Income Course. When we discuss recruiting affiliates, we ask you to think from the perspective of the prospects you are targeting. When we discuss selling products, we ask you to think from the perspective of the potential consumer. When we discuss contacting Webmasters to place your ads on their sites, we ask you to think from the Webmasters' perspective. When we discuss spam, we ask you to think from the perspective of the e-mail recipients. Now, as we begin to discuss search engines, we are going to ask you to think from the perspective of the search engine operators.
Take a few minutes and pretend that you are starting your own search engine. Say it is the mid-1990s and you want to establish your site as one of the popular search engines on this fantastic new Internet. You are going to take your company public and retire as a zillionaire! What would be important to you? How would you make that happen? You would certainly want your database to include all of the important, valuable Websites on the Internet. You would also want a comprehensive list of the not-so-great sites. You would want your visitors to be able to find just what they want among those sites efficiently when they search your engine. You would want the most useful, relevant, high-quality sites to come up first in the search results, followed by the less useful and less relevant ones.
Say you start your index of Websites and for each entry you have a field for "site name," "site description," "site URL (address)," and "keywords." When visitors search from your search engine, they will input search words or phrases and then be led to sites pertaining to those words or phrases. Again, you want the most useful, valuable sites to come up first in the results. How would you accomplish this?
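To make the idea concrete, here is a minimal sketch of such an index in Python. The four fields come from the description above; the class name, the function name, and the sample entries are made up for illustration and do not describe how any real search engine stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class SiteEntry:
    """One index record holding the four fields named in the text."""
    name: str
    description: str
    url: str
    keywords: list[str] = field(default_factory=list)


def search(index: list[SiteEntry], query: str) -> list[SiteEntry]:
    """Return every entry whose keyword list contains the query (case-insensitive)."""
    wanted = query.strip().lower()
    return [entry for entry in index
            if wanted in (kw.lower() for kw in entry.keywords)]


# Hypothetical sample entries; the URLs are placeholders.
index = [
    SiteEntry("Ringed Planet Facts", "Notes on the planet Saturn",
              "http://example.com/planet", ["Saturn", "planet", "astronomy"]),
    SiteEntry("Saturn Owners Club", "Forum for Saturn car owners",
              "http://example.com/cars", ["Saturn", "automobile"]),
    SiteEntry("Easy Money Now", "A business opportunity",
              "http://example.com/bizopp", ["income", "business opportunity"]),
]

results = search(index, "saturn")
print([entry.name for entry in results])
```

Notice that this lookup returns matches in whatever order they happen to sit in the index. It does nothing to put the most valuable sites first, which is exactly the problem you have to solve as the operator.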
Let's say, for example, that one of the visitors to your search engine has to write a report on the planet Saturn for a class he is taking. He types in the keyword "Saturn." When your engine searches through its index of keywords, it will pull up sites that discuss the planet Saturn, but it will also pull up sites that deal with the automobile named Saturn, the Saturn Sega game system, the comic strip character named Sailor Saturn, perhaps a rock band named Saturn's Rings, and maybe even a porn star who goes by the nickname Saturn (provided you have not taken steps to filter out porn sites).
These diverse results are to be expected. What you do not want, however, is for a number of business opportunity sites, credit card sites, long-distance service sites, and porn sites (at least those that don't have stars named Saturn) to come up also. If those sites come up under a search for Saturn, then your search engine is not very efficient, and your visitors will become frustrated and quit using it. (If your visitor loses valuable time needed to write his report on the planet Saturn while plowing through all of these irrelevant sites, he will find another search engine to use next time. If this happens often, there goes your zillion dollars!) You also want a means to identify the most valuable sites and rank them to come up first in the results. So, what you realize is that you need a means to exclude irrelevant sites and to rank relevant sites according to their value. The better you do these things, the more people will like your search engine and the more successful it will be.
The first thing you have to figure out as a search engine operator is how to associate keywords with sites. You will not have time to examine all the sites on the Internet and write keywords for them. You will either have to create a program that can read sites and generate the keywords for them, or you will have to get the site publishers to do that themselves. There are at least two ways that you can get the sites' Webmasters to do this. One (employed by Yahoo!) is to make Webmasters submit keywords when they submit their sites to your engine. This does not work, however, for the "spider" engines, which go out and find the pages themselves. Since you want as many sites on your engine as possible, you do not want to wait for the Webmasters to submit them. You want to go out and find them. What you could do is support a standard whereby a meta tag in the HTML code of each Web page contains the keywords the Webmaster thinks are appropriate for that page. That is indeed what has happened. HTML supports a keywords meta tag for that very purpose.
When knowledgeable Webmasters build their Websites, they include the keywords meta tag on each page and insert the relevant keywords for that page. The tag looks like this:
<META NAME="keywords" CONTENT="keyword one, keyword two, keyword three">
Thus, when your search engine indexes sites, it automatically grabs the keywords from the keywords meta tag, and you are good to go. Or are you? What if the Webmasters cheat when coding in their keywords?
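As a rough illustration of the "grabbing" step, the sketch below uses Python's standard html.parser module to pull the comma-separated keywords out of a page's keywords meta tag. The class name, the function name, and the sample page are assumptions made for this example; a real spider would also have to cope with malformed HTML, duplicate tags, and much more.

```python
from html.parser import HTMLParser


class KeywordMetaParser(HTMLParser):
    """Collects the contents of any <meta name="keywords" ...> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "keywords":
            # Keywords are conventionally separated by commas.
            for kw in attr.get("content", "").split(","):
                if kw.strip():
                    self.keywords.append(kw.strip().lower())


def extract_keywords(html: str) -> list[str]:
    """Return the lowercased keywords declared in the page's meta tag."""
    parser = KeywordMetaParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical sample page.
page = """<html><head>
<title>Saturn Facts</title>
<meta name="keywords" content="Saturn, planet, rings">
</head><body>All about the ringed planet.</body></html>"""

print(extract_keywords(page))  # ['saturn', 'planet', 'rings']
```

The code simply trusts whatever the Webmaster typed into the tag, which is why the question of cheating comes up next.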
Why would Webmasters want to cheat, and how would they go about doing so? Let's switch gears away from our Saturn example for a moment to explore these questions. Although the statistics are now changing, over the last several years the Internet has been used mostly by young men. Thus, the "hot" search terms—the ones most frequently searched for on the search engines—have been things of interest to young males. Because of this, as you might expect, search terms relating to sex and nudity, rock music, famous female stars, and outlaw-type sites (called "warez sites") have been high on the list of popular search terms. Some Webmasters have kept themselves aware of the current popular search terms and used them in their keywords, even though they may not be relevant to their Website. They do this to increase the probability of their site showing up in a search and, therefore, to increase their traffic.
Currently, one of the most popular search terms is "Britney Spears." It may occur to a Webmaster of a business opportunity site that if he adds "Britney Spears" to his keywords, he will increase traffic to his page and perhaps get more signups. But, as a search engine operator, you would have a problem with this Webmaster. You want your search engine visitors to find the pages they are looking for. If some young man is looking for a page discussing Britney Spears and finds business opportunity sites, your search engine has not done a good job. The young man, being unhappy, will take his search engine business elsewhere (so to speak) and your search engine will lose popularity. Thus, as a search engine operator, you are at cross-purposes with these Webmasters who want to cheat with their keywords. Because you want that zillion-dollar retirement, you are dang sure going to figure out ways to deal with them.
Now let's return to our "Saturn" example and say that a space probe launched by the United States has just gone into orbit around Saturn and is returning images of the planet. NASA is publishing these images on the Internet as they come in. For a period of time before, during, and after these pictures come in, the word "Saturn" becomes a hot search term. People are going to search engines and trying to find sites where they can see these pictures of Saturn as they are downloaded from the space probe. A business opportunity Webmaster, being aware of this, decides to add "Saturn" as a keyword to his new Website. Adding this keyword will increase traffic to his site. But the keyword has no relevance to his actual site, as he does not provide the NASA pictures or any other information about Saturn. If your search engine does not catch and correct this attempt to cheat, it will be ineffective. People who are trying to find the real sites with the Saturn pictures will become frustrated using your search engine because all they are finding are business opportunity pages which have nothing to do with Saturn. They will quit using your search engine, and your goal of having one of the most popular search engines on the Internet will be defeated. Thus, as a search engine operator, you have to develop means to detect this keyword cheating.
The various search engines do exactly that in various ways. Search engine operators are constantly trying to improve their means of detecting such cheating. They closely guard the algorithms they use to do this as industrial secrets because if they become widely known, people will find ways around them.
When search engines detect people cheating, they can exclude the page from the search engine (and even any future submissions from the same person or company), give it a very low overall ranking, or give it a low or non-existent ranking with respect to the offending keywords.
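The three responses just described can be pictured with a small Python sketch. Everything in it is hypothetical: the per-keyword scores, the penalty names, and the scaling factor are invented for illustration, since the engines do not publish how they actually apply penalties.

```python
from enum import Enum


class Penalty(Enum):
    NONE = "none"
    EXCLUDE_PAGE = "exclude_page"        # drop the page from the index
    LOW_OVERALL = "low_overall"          # push every ranking down
    DEMOTE_KEYWORDS = "demote_keywords"  # zero out only the offending keywords


def apply_penalty(scores: dict[str, float], penalty: Penalty,
                  offending: frozenset[str] = frozenset(),
                  factor: float = 0.1) -> dict[str, float]:
    """Return the page's per-keyword ranking scores after a penalty is applied."""
    if penalty is Penalty.EXCLUDE_PAGE:
        return {}
    if penalty is Penalty.LOW_OVERALL:
        return {kw: score * factor for kw, score in scores.items()}
    if penalty is Penalty.DEMOTE_KEYWORDS:
        return {kw: (0.0 if kw in offending else score)
                for kw, score in scores.items()}
    return dict(scores)


# Hypothetical scores for a business opportunity page that added a hot term.
scores = {"business opportunity": 0.8, "britney spears": 0.9}
demoted = apply_penalty(scores, Penalty.DEMOTE_KEYWORDS,
                        offending=frozenset({"britney spears"}))
print(demoted)
```

With the last option, the page can still be found under "business opportunity," but it no longer shows up for the term it tried to borrow.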
Since we are never going to know with certainty how the search engines actually go about detecting keyword relevance and ranking sites, at least at any particular point in time, all we can do is imagine that we are the search engine operator and think how we would go about doing those things.
Thinking again like a search engine operator, how would you go about detecting keyword relevance? One thing you would soon realize is that you do not want to throw the baby out with the bathwater. That is, if you only check whether the keywords show up in the actual content of the site, and assume that cheating has occurred whenever they do not, you will exclude some very relevant, very valuable sites. This, too, would be bad for you as a search engine operator.
There are many legitimate situations where a keyword may not be repeated in the actual content of the site. Say a Webmaster has a regional site which provides news and current activities for a three-county area commonly known among people in the region as the "River Basin Area" or perhaps the "Wiregrass Area" or some similar term. Say the three counties are Washington County, Adams County, and Jefferson County. Imagine further that the Webmaster included the state name and the names of all three of these counties in her keywords, but never actually mentions them in the content of her Website. The Website just refers to "news and events for the River Basin Area." The county names are very relevant keywords, even though they do not appear in the content of the site. People looking for news and events for Washington County would want to see this page. As a search engine operator, you would want them to find this page because it has information they are looking for. It would have been better if the Webmaster had included a statement like "covering Washington County, Adams County, and Jefferson County news and events and more" in her content. But, since she did not, you will have a more effective search engine if you recognize that she is not cheating and that her keywords are relevant. Thus, you realize that you are going to have to come up with a pretty sophisticated procedure for determining keyword relevance and ranking the Websites in your search engine.
Since you cannot afford to pay someone to sit and personally look at all of the millions of pages submitted to your engine, you will have to develop some algorithm that will do this task as well as it can be done without human intervention. Clearly, it is not going to be extremely accurate. It would be far too difficult to write a program sophisticated enough to figure out all the different variations of relevant keywords. Most likely, you will have to settle for doing it on a statistical basis. Say, for example, you decide that if 90% of the keywords actually appear on the site, then that's close enough. The other 10% could be cheating, or it could simply be a legitimate oversight like the River Basin Area example above. You may have to settle for that margin of error. On top of that, you could look for certain keywords that suggest cheating and deal with them separately. You could try to develop algorithms which judge whether the keywords which do appear in the content are actually in context or just thrown in to fool the search engines. Whatever means is actually employed, it is a constant struggle between the aggressive Webmasters who would manipulate the search engines and the search engine operators who want to keep their search engines effective.
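The 90% rule of thumb can be sketched in a few lines of Python. This is only an illustration of the statistical idea, not a description of any real engine's algorithm; the function names, the word-matching approach, and the sample keywords are all assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_coverage(keywords: list[str], content: str) -> float:
    """Fraction of the declared keywords that appear as whole words in the content."""
    terms = [t for t in (normalize(kw) for kw in keywords) if t]
    if not terms:
        return 1.0  # nothing declared, so nothing to contradict
    page = f" {normalize(content)} "
    found = sum(1 for term in terms if f" {term} " in page)
    return found / len(terms)


def looks_relevant(keywords: list[str], content: str,
                   threshold: float = 0.9) -> bool:
    """Apply the 'close enough' rule: at least 90% of keywords must appear."""
    return keyword_coverage(keywords, content) >= threshold


# The River Basin Area page: honest keywords, but half never appear in the text.
river_keywords = ["news", "events", "river basin",
                  "washington county", "adams county", "jefferson county"]
river_content = "News and events for the River Basin Area."

print(keyword_coverage(river_keywords, river_content))  # 0.5
print(looks_relevant(river_keywords, river_content))    # False
```

Run on the River Basin Area example, this simple check flags an honest page as suspect, which shows why a bare threshold is not enough and why the operator is pushed toward more sophisticated methods.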