


=**Web Research 9 : Search Strategies for Search Engines**=


 * Aim:**

What are some advanced features of search engines?


 * Common Core State Standards:**


 * CCSS.ELA-Literacy.W.11-12.7** Conduct short as well as more sustained research projects to answer a question (including a self-generated question) or solve a problem; narrow or broaden the inquiry when appropriate; synthesize multiple sources on the subject, demonstrating understanding of the subject under investigation.


 * CCSS.ELA-Literacy.W.11-12.8** Gather relevant information from multiple authoritative print and digital sources, using advanced searches effectively; assess the strengths and limitations of each source in terms of the task, purpose, and audience; integrate information into the text selectively to maintain the flow of ideas, avoiding plagiarism and overreliance on any one source and following a standard format for citation.


 * Objectives:**

Students will learn the three components of a search engine:

a. a computer program called a spider or a robot that retrieves hyperlinks attached to documents
b. a database that indexes these documents
c. software that allows users to enter keywords in search forms to obtain ranked results


 * Vocabulary:**


 * default setting
 * field
 * field searching
 * full-text indexing
 * hidden Internet
 * high precision/high recall
 * high precision/low recall
 * implied Boolean operator
 * invisible Web
 * keyword
 * limiting by date
 * low precision/high recall
 * meta-tag
 * nested Boolean logic
 * phrase searching
 * proximity searching
 * relevance
 * relevancy ranking
 * results per page
 * robot
 * search expression
 * search form
 * spider
 * stemming
 * stop word
 * syntax
 * truncation
 * wildcard


 * Discussion:**


 * Search Engine Similarities and Differences**

All the major search engines are similar in that you enter keywords in a **search form**.

After clicking on **Search, Submit, Find**, or some other similar command button, the database returns a collection of hyperlinks, usually listed according to their **relevance** to the keyword(s) you typed in, from most relevant to least relevant.

Even though most of the major search engine databases attempt to index as much of the Web as possible, each one has a different way of determining which pages will be listed first.

Some databases list results by term relevancy, employing algorithms that measure how closely a Web document matches the search expression used.

Others list them by using link analysis, which takes into account the context of the Web pages in relation to the search expression, how many quality Web pages link to the pages, or how "popular" they are.

The major search engines differ in several ways:


 * Size of index
 * Search features supported (many search engines support the same features but require different syntax to initiate them)
 * How frequently the database is updated
 * Ranking algorithms
 * How deeply each Web site is indexed

It is important to know these differences because to do an exhaustive search of the WWW, you must be familiar with a few different search tools. No single search engine can be relied upon to satisfy every query.


 * How Search Engines Work**

In search engines, a computer program called a **spider** or **robot** gathers new documents from the WWW.

The program retrieves hyperlinks that are attached to these documents, loads them into a database, and indexes them using a formula that differs from database to database.

Then, when you consult the search engine, it searches the database looking for documents that contain the **keywords** you used in the **search expression**.

No search engine actually indexes the entire Web. There is information that is inaccessible to search engines, commonly referred to as the **invisible Web** or **hidden Internet**.

Much of this content can be located in special databases.

Although robots have many different ways of collecting information from Web pages, the major search engines all claim to index most of the text of each Web document in their databases.

This is called **full-text indexing**.

In some search engines, the robot skips over words that appear often, such as prepositions (in, on, by, to, since) and articles (a, an, the).

These common words are called **stop words**.
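
The indexing and lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the stop-word list is just the set of examples from the text, and the function names `build_index` and `search` and the sample pages are hypothetical.

```python
from collections import defaultdict

# Assumed stop-word list, taken from the examples in the text.
STOP_WORDS = {"in", "on", "by", "to", "since", "a", "an", "the"}

def build_index(documents):
    """Map each non-stop word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            word = word.strip(".,;:!?\"'()")
            if word and word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every non-stop keyword in the query."""
    keywords = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not keywords:
        return set()
    results = set(index.get(keywords[0], set()))
    for word in keywords[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical sample pages standing in for documents gathered by a spider.
documents = {
    "page1": "The effects of global warming on the oceans.",
    "page2": "A history of the printing press.",
    "page3": "Global warming and the printing industry.",
}
index = build_index(documents)
print(sorted(search(index, "global warming")))
print(sorted(search(index, "the printing")))
```

Because "the" is skipped during indexing, the second query behaves as if only "printing" had been typed.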


 * How a Spider Works - Searching the Internet for New Documents**

Spiders automatically do this gathering of documents at intervals that differ from service to service.

Keep in mind that some portions of a search engine's database may not have been updated in a few weeks.

People can also submit their Web pages to be included in the database.

This often results in a robot visiting the page and collecting information for the search engine's database.

Some robot programs are intuitive; they know which words are important to the meaning of the entire Web page, and some of them can find synonyms to the words and add them to the index.

Some full-text databases use robots that enable them to search on concepts as well as on the search query words.

Some Web page authors include **meta-tags** as part of the HTML code in their pages.

Meta-tags may contain keywords that describe the content and purpose of a Web page, but may not appear on the page.

They appear only in the HTML source file.

You can view the HTML source code by looking at the page source. Click on **View**, then select **Source**. Meta-tags allow Web pages that don't contain a lot of text to come up in a keyword search.

The two most important meta-tags are the //description// and //keywords// tags.

Some search engines will use the description section as the short summary that appears next to the URL in the results list.
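
To make the idea of meta-tags concrete, here is a small sketch that reads the //description// and //keywords// tags out of an HTML source file using Python's standard `html.parser` module. The sample page and the class name `MetaTagParser` are invented for illustration; real robots handle many more cases.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect the content of the description and keywords meta-tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("description", "keywords"):
            self.meta[name] = attributes.get("content") or ""

# Hypothetical page with almost no visible text, only an image.
SAMPLE_PAGE = """<html><head>
<title>Sun Conures</title>
<meta name="description" content="Care guide for Sun Conure parrots.">
<meta name="keywords" content="sun conure, parrot, life expectancy">
</head><body><img src="conure.jpg"></body></html>"""

parser = MetaTagParser()
parser.feed(SAMPLE_PAGE)
print(parser.meta["description"])
print(parser.meta["keywords"].split(", "))
```

The page body contains no words at all, yet a search engine that reads meta-tags could still retrieve it for a query such as //sun conure//.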

Becoming proficient in search techniques is crucial in the full-text environment.

The chance of retrieving irrelevant material is high when you can type in a word and conceivably retrieve thousands of Web pages that have that word in it.


 * Search Features Common To Most Search Engines**

It's important to understand the different search features before you begin using a search engine for research.

The reason for this is that each search engine has its own way of interpreting and manipulating search expressions. In addition, many search engines have **default settings** that you may need to override if you want to obtain the most precise results.

Because a search can bring up so many Web pages, it is very easy to have a lot of hits with few that are relevant to your query. This is called **low precision/high recall**.

You may be satisfied with having very precise search results with a small set returned. This is defined as **high precision/low recall**.

Ideally, using the search expression you enter, the search engine would retrieve all the relevant documents you need.

This would be described as **high precision/high recall**.
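
Precision and recall can also be expressed as numbers. The sketch below uses the standard definitions (precision is the share of retrieved pages that are relevant; recall is the share of relevant pages that were retrieved). The page labels are made up for the example.

```python
def precision_and_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant pages."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical: four pages on the Web are truly relevant to the query.
relevant = {"a", "b", "c", "d"}

# A broad search returns ten pages, including all four relevant ones.
broad = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}
# A narrow search returns only two pages, both relevant.
narrow = {"a", "b"}

print(precision_and_recall(broad, relevant))   # low precision, high recall
print(precision_and_recall(narrow, relevant))  # high precision, low recall
```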

Search engines support many search features, though not all engines support each one.

If they do support certain features, they may use different **syntax** in expressing the feature.

Before you use any of these search features, you need to check the search engines' help pages to see how the feature is expressed or if it is supported at all.


 * Implied Boolean Operators**, or pseudo-Boolean operators, are shortcuts to typing AND and NOT. In the search engines that support this feature, you type + (plus sign) before a word or phrase that must appear in the document and - (minus sign) before a word or phrase that must not appear in the document.
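
As a rough sketch of how + and - might be interpreted, the following Python handles single words only (no quoted phrases). The function names are hypothetical, and terms without a sign are ignored for matching here, which is a simplifying assumption.

```python
def parse_implied_boolean(expression):
    """Split a search expression into required (+), excluded (-), and other terms."""
    required, excluded, other = [], [], []
    for token in expression.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            other.append(token)
    return required, excluded, other

def matches(text, expression):
    """True if the text has every + word and none of the - words."""
    words = set(text.lower().split())
    required, excluded, _other = parse_implied_boolean(expression)
    if any(word not in words for word in required):
        return False
    if any(word in words for word in excluded):
        return False
    return True

print(matches("afghanistan policy under clinton", "+afghanistan -iraq"))
print(matches("afghanistan and iraq policy", "+afghanistan -iraq"))
```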


 * Phrase Searching**

A //phrase// is a string of words that must appear next to each other. //Global warming// is a phrase, as is //chronic fatigue syndrome//.

Use phrase-searching capability when the words you are searching for must appear next to each other and must appear in the order in which you type them.

Most search engines require double quotation marks to differentiate a phrase from words searched by themselves.

The two phrases mentioned above would be expressed like this: **"global warming"** and **"chronic fatigue syndrome"**.

In some search tools, a phrase is assumed when more than one word is typed together without a connector between them.
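
The difference between a phrase search and a plain keyword search can be illustrated with a short sketch. It assumes a very simple definition of a word (runs of letters and digits); the helper names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(text, phrase):
    """True if the words of the phrase appear next to each other, in order."""
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return False
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

print(contains_phrase("Global warming is accelerating.", '"global warming"'))
print(contains_phrase("Warming trends are global.", '"global warming"'))
```

Both sample sentences contain the words //global// and //warming//, but only the first contains them as a phrase.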


 * Proximity Searching**

//Proximity operators// are words such as //near// or //within//. For example, you are trying to find information on the effects of chlorofluorocarbons on global warming. You might want to retrieve results that have the word //chlorofluorocarbons// very close to the phrase //global warming//.

By placing the word NEAR or WITHIN between the two segments of the search expression, you would achieve more relevant results than if the words appeared in the same document but were perhaps pages apart.
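
A proximity check can be sketched as follows. How close counts as "near" differs between search engines, so the `max_distance` of 10 words used here is purely an assumption, and the sketch compares single words rather than phrases.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def near(text, first, second, max_distance=10):
    """True if the two words occur within max_distance words of each other."""
    words = tokenize(text)
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)

close_text = "Chlorofluorocarbons contribute to global warming."
# Hypothetical long document where the two words are far apart.
far_text = "chlorofluorocarbons " + "filler " * 50 + "warming"

print(near(close_text, "chlorofluorocarbons", "warming"))
print(near(far_text, "chlorofluorocarbons", "warming"))
```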


 * Truncation**

Truncation looks for multiple forms of a word. Some search engines refer to truncation as **stemming**. For example, to research postmodern art, you might want to retrieve all the records that had the root word //postmodern//, such as //postmodernist// and //postmodernism//.

Most search engines support truncation by allowing you to place an asterisk (*) at the end of the root word. For example, in this case you would type postmodern*.

Some search engines automatically truncate words. In those databases, you can type //postmodern// and be sure to retrieve all the endings. In these cases, truncation is a default setting of the search engines.
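
Truncation can be pictured as a "starts with" test against the words in the index. This is a minimal sketch with a made-up word list and a hypothetical function name.

```python
def truncation_search(vocabulary, pattern):
    """Return indexed words matching the pattern; a trailing * means 'any ending'."""
    if pattern.endswith("*"):
        root = pattern[:-1].lower()
        return sorted(w for w in vocabulary if w.lower().startswith(root))
    return sorted(w for w in vocabulary if w.lower() == pattern.lower())

# Hypothetical list of words found in a search engine's index.
vocabulary = ["postmodern", "postmodernism", "postmodernist", "modern", "postman"]

print(truncation_search(vocabulary, "postmodern*"))
print(truncation_search(vocabulary, "postmodern"))
```

In an engine where truncation is the default setting, the second call would behave like the first.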


 * Wildcards**

Using wildcards allows you to search for words that have most of the letters in common. For example, to search for both //woman// and //women//, instead of typing **woman OR women**, we place a wildcard character (most often an asterisk) to replace the fourth letter, like this: wom*n.

In addition to searching for both the American and British spellings of certain words, wildcards are also useful when searching for those words that are commonly misspelled. For example, take the word //genealogy//. By placing the wildcard character where the commonly mistaken letters are placed, like this: gen*logy, you can be sure to get documents with the word spelled correctly. Of course, you'll also get pages where the word is misspelled.
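
Python's standard `fnmatch` module offers a quick way to try out wildcard patterns. Note that in `fnmatch` the asterisk matches any number of characters, which may be broader than the wildcard in a given search engine; the word list is invented for the example.

```python
from fnmatch import fnmatchcase

def wildcard_search(vocabulary, pattern):
    """Return the words that match a pattern in which * stands for any characters."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]

# Hypothetical index words, including one common misspelling of genealogy.
vocabulary = ["woman", "women", "wombat", "genealogy", "geneology", "geology"]

print(wildcard_search(vocabulary, "wom*n"))
print(wildcard_search(vocabulary, "gen*logy"))
```

The second pattern retrieves both the correct spelling and the misspelled //geneology//, as the text warns.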


 * Field Searching**

Web pages can be broken down into many parts. These parts, or **fields**, include titles, URLs, text, summaries or annotations (if present).


 * Field searching** is the ability to limit your search to certain fields. This ability to search by field can increase the relevance of the retrieved records.

In Google, for example, you can search for Web pages that contain certain words in the title of the page by typing allintitle:obama afghanistan taliban. You can also limit your search to a specific domain, such as educational institutions (.edu), commercial sites (.com), and so forth. In addition, a search can be limited to a particular host (site), such as a company or institution Web site.


 * Language Searching**

The ability to limit results to a specific language can be useful. Several search engines support this feature, including Yahoo! and Google. Some search engines also provide a translation service.


 * Searching by File Format Type**

The ability to search for files of a particular file type can also be a useful feature. For example, let's say you need to create a presentation on a particular topic, such as the environment, and you want some ideas of how others have presented similar information. In Google, you would enter environment filetype:ppt and retrieve links to PowerPoint presentations on the topic of the environment. In Yahoo!, the search would look like this: environment originurlextension:ppt.


 * Link Searching**

This feature allows you to search for sites that link to a particular URL. In Google and Yahoo! you type link: before the URL that you are searching for. For example, if you want to see all the Web sites that link to Wikipedia's main page, you would enter link:en.wikipedia.org.


 * Limiting by Date**

Some search engines allow you to search the Web for pages that were added or modified between certain dates. In **limiting by date**, you can narrow your search to only the pages that were entered in the past month, the past year, etc.


 * Results Ranking**

Many search engines measure each Web page's relevance to your search query and arrange the search results from the most relevant to the least relevant.

This is called **relevancy ranking**.

Each search engine has its own algorithm for determining relevance, but it usually involves counting how many times the words in your query appear in the Web pages.

In some search engines, a document is considered more relevant if the words appear in certain fields, such as the title or summary field. In other search engines, relevance is determined by the number of times the keyword appears in a Web page divided by the total number of words in that page. This gives a percentage, and the page with the largest percentage appears first on the list.
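
The percentage method described above can be worked through in code. This sketch ranks three made-up pages for the keyword //tesla//; the function names and sample texts are hypothetical, and real ranking algorithms combine many more signals.

```python
import re

def keyword_percentage(text, keyword):
    """Occurrences of the keyword divided by total words, as a percentage."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)

def rank(pages, keyword):
    """Return page names ordered from the largest percentage to the smallest."""
    return sorted(pages,
                  key=lambda name: keyword_percentage(pages[name], keyword),
                  reverse=True)

# Hypothetical pages: 2 of 4 words, 1 of 5 words, and 0 of 5 words match.
pages = {
    "a": "tesla coil tesla lab",
    "b": "tesla biography and early life",
    "c": "history of the printing press",
}

for name in rank(pages, "tesla"):
    print(name, keyword_percentage(pages[name], "tesla"))
```

Page "a" comes first because half of its words are the keyword, even though it is the shortest page. This is one reason simple term counting can favor thin pages.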

Most search engines determine relevancy by how many Web pages link to a page or how many people have accessed particular pages in response to similar questions in the past.


 * Annotations or Summaries**

Some search engines include short descriptive paragraphs of each Web page they return to you.

These annotations, or summaries, can help you decide whether you should open a Web page, especially if there is no title for the Web page or if the title doesn't describe the page in detail.


 * Results Per Page**

In some search engines, the **results per page** option allows you to choose how many results you want listed per page. This can be a time saver because it sometimes takes a while to go from page to page as you look through results.


 * Meta-tag Support**

Some search engines acknowledge keywords that a Web page author has placed in the meta-tag field in the HTML source document. This means that a document may be retrieved by a keyword search, even though the search expression may not appear in the document.


 * In-Class Activity:**

In this activity, we are going to search for resources on a multifaceted topic. We want to find information on U.S. policy in Afghanistan during former President Bill Clinton's administration.

Following most of the steps of the basic search strategy, we need to examine the facets of our search, choose the appropriate keywords, and determine which search features apply. Then, we'll go to Google and read the search instructions. Let's see how this search engine handles this multifaceted topic.

We'll follow these steps:

1. Identify the important concepts of your search.
2. Choose the keywords that describe these concepts.
3. Determine whether there are synonyms, related terms, or other variations of the keywords that should be included.
4. Determine which search features may apply, including truncation, proximity operators, Boolean operators, and so forth.
5. Choose a search engine.
6. Read the search instructions on the search engine's home page. Look for sections entitled Help, Advanced Search, Frequently Asked Questions, and so forth.
7. Create a search expression using syntax that is appropriate for the search engine.
8. Evaluate the results. Were the results relevant to your query?
9. Modify your search if needed. Go back and revise your query accordingly.
10. Try the same search in a different search engine, following Steps 5 through 9 above.


 * Details**

1) Identify the important concepts of your search. (The most important concepts of this search are Bill Clinton, foreign policy, and Afghanistan).

2) Choose the keywords that describe these concepts. (The main terms or keywords include the following: //Bill Clinton, foreign policy//, and //Afghanistan//.)

3) Determine whether there are synonyms, related terms, or other variations of the keywords that should be included.


 * For Bill Clinton: none
 * For foreign policy: none
 * For Afghanistan: Taliban

4) Determine which search features may apply, including truncation, proximity operators, Boolean operators, and so forth. When developing a search expression, keep in mind that you place OR between synonyms and AND between the different concepts, or facets, of the search topic. If you write down all the synonyms you choose, it may help with the construction of the final search expression, which could look like this:

"bill clinton" AND "foreign policy" AND afghanistan OR taliban

5) Choose a search engine. Do this search in Google, Yahoo! or Exalead. Exalead is not a major search engine but it has search features I'd like to introduce to you. Google supports Boolean searching and phrase searching, and it is the largest search engine database on the Web, so you can start with Google (http://www.google.com).

6) Go to Google's Web Search Help page (https://support.google.com/websearch/?hl=en). (The information provided in the search help section tells us that it is necessary to use quotation marks around phrases. Because we want to search for synonyms, it will be necessary to use the OR connector. Google's help section tells us that it is possible to do this, but **OR** must be capitalized.)

7) Create a search expression using syntax that is appropriate for the search engine. Now that you've read the search help, it's time to formulate the search expression. It will help to write it out before you type it in the search form. Here is a possible way to express this search:


 * "bill clinton" "foreign policy" afghanistan OR taliban**

There is no need to capitalize the proper nouns in the search expression. Keep in mind that you can always modify your search later. Enter the search into Google's search form and click on search.

8) Evaluate the results. Are the results relevant to your query?

9) Modify your search if needed. Go back and revise your search accordingly. (The results seem fairly relevant to our topic. After opening a few Web sites, however, we find that many of the documents discuss Iraq. We are curious to see if we can find some resources that are primarily about Afghanistan. This is a good time to use the NOT feature. Google's help section tells us that NOT can be expressed by typing a - just before the term(s) that you don't want to appear in the results. We will go back and modify our search by adding the term Iraq with a - in front of the word.)

In the search form that appears at the top of the page, add -iraq to the end of the search expression. Make sure that there is no space between the hyphen and the word. If you wanted to limit your results to U.S. government resources, you could limit the search to only those documents ending in a .gov domain, by adding **inurl:.gov** to the end of your search expression.

10) Try the same search in a different search engine, following Steps 5 through 9 above.
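
The way the search expression was assembled in Steps 4 through 9 (quotation marks around phrases, OR between synonyms, a - before excluded terms) can be sketched in Python. The helper names are hypothetical. The sketch also encodes the expression into a Google results address using the `q` parameter; treat that address as an illustration rather than a documented interface.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def quote_phrase(term):
    """Put double quotation marks around terms of more than one word."""
    return f'"{term}"' if " " in term else term

def build_expression(facets, excluded=()):
    """Join synonyms with OR, separate facets with spaces, prefix exclusions with -."""
    parts = [" OR ".join(quote_phrase(term) for term in synonyms)
             for synonyms in facets]
    parts += [f"-{term}" for term in excluded]
    return " ".join(parts)

def search_url(expression, base="https://www.google.com/search"):
    """Encode the expression as the q parameter of a search address."""
    return f"{base}?{urlencode({'q': expression})}"

# Facets and synonyms from Step 3; the excluded term comes from Step 9.
facets = [["bill clinton"], ["foreign policy"], ["afghanistan", "taliban"]]
expression = build_expression(facets, excluded=["iraq"])
url = search_url(expression)

print(expression)
print(url)
# Decoding the address gives back the same expression.
print(parse_qs(urlsplit(url).query)["q"][0] == expression)
```

Writing the facets down as a list of synonym groups first, as Step 4 suggests, makes it easy to rebuild the expression when you modify the search.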


 * Summary:**

Search engines are information-retrieval systems that allow us to search the vast collection of resources on the Internet and the WWW.

A search engine consists of three components: a computer program called a spider or robot that retrieves hyperlinks attached to documents, a database that indexes these documents, and software that allows users to enter keywords in search forms to obtain ranked results.

Each search engine is unique and accesses its database differently.

Even though many search engine databases claim to cover as much of the Web as possible, the same search performed in more than one database never returns the exact same results.

If you want to do a thorough search, you should become familiar with a few of the different search engines.

To this end, it is important to understand the major search features, such as Boolean logic, phrase searching, truncation, and others before you get online.

It is also necessary to read each search engine's documentation often, since search engines are constantly changing their search and output features.

In this lesson, I introduced the basic search strategy, a 10-step procedure that can help you formulate search requests, submit them to search engines, and modify the results retrieved.

We have focused on the major search engines on the Web, but there are several hundred smaller search engines on the Web that search smaller databases.


 * Summary Activity:**


 * Homework:**

1. Using the Advanced Search mode in Google, http://www.google.com/advanced_search, and in Yahoo!, http://search.yahoo.com/web/advanced?ei=UTF-8, look for relevant resources on the following topics:

a. The life expectancy of a Sun Conure
b. Mary Kingsley's travels in Africa
c. Maria Mitchell's contributions to astronomy

Write down the titles of the first three Web pages retrieved by each search engine. Were any of these the same in the two search engines? Write down the search expression you used in each database.

2. Sometimes it is helpful to look for specific types of Web sites about a topic. Go to Google at http://www.google.com and look for Web pages about the inventor Nikola Tesla. Can you tell how many results are found? Now go to Google's Advanced Search page, http://www.google.com/advanced_search, and do the same search, limiting your results to domains that end with **.edu**. How many results do you find now? Change your search to look for results with the **.gov** domain which were updated in the last year. How many results do you find?

3. Find the most recent annual report and a mission statement for Pfizer. What would be the best strategy to use to find this information?

a. Go to Exalead at http://www.exalead.com. What search expression(s) did you use to find the annual report and mission statement? Give the URLs of the page(s) where they are found.

b. Try the same search in Google at http://www.google.com. Which search engine gave you more relevant results?

4. Look for information on how genetically altered corn is affecting Monarch butterflies.

a. First, write down your search strategy. What keywords will you use? What other words might be used instead of "genetically altered?" What search expression will you start with?

b. Try your search at Ask, at http://www.ask.com. How many results did you find? Go to the first three sites listed. How relevant are they to your search? Give the URLs of the sites you visited. Do you need to modify your search expression?

5. Virtual Humans have become a topic of interest. Besides being the stuff of speculative fiction, they are becoming the stuff of reality.

a. Go to Google at http://www.google.com and search for virtual humans. How many results do you find? Look at some of the first ten sites in your results list. What is a virtual human? Give the URL of the site where you found your answer.

b. Now search for pages that show Peter Plantec's contribution to the field of virtual humans. What was your search expression? How many results did you find? Who is Peter Plantec? Give the URL of the page where you found the answer.

6. Using the advanced search mode in Yahoo! (http://search.yahoo.com/web/advanced?ei=UTF-8), look for information on how mad cow disease (also known as Bovine Spongiform Encephalopathy) causes Creutzfeldt-Jakob disease in humans.

a. Write down your search expression and the total number of results. Do you need to modify your search expression?

b. Were your results relevant to your request? Write down three of the most relevant titles and their URLs.

7. Go to Yahoo! at http://www.yahoo.com, to find comparison studies of the drugs venlafaxine XR and fluoxetine. What search expression did you use? Go to the first three Web sites listed. What are the brand names of these drugs? Give the titles and URLs of the three sites you visited. Which was most relevant?

8. Just as the Web constantly changes, search engines do as well. Go to Google at http://www.google.com and do a search for comparisons and reviews of search engines. Scan through the search results and go to the most promising sites. Give the titles and URLs of the sites you visited. Which was the best? Why? You may want to put one of these sites in your list of favorites or bookmarks.

9. From cuneiform writing to the printing press, written communication kept changing and becoming more pervasive. By the 19th century, a new invention made a big difference. Try to search for the history of the ball point pen.

a. Who invented it? When did the invention take place?

b. Tell what search engine you used, what search expression you used, and give the URL of the site where you found your answer.


 * Source:**

Hartman, K. and Ackerman, E. (2010). //Searching and researching on the Internet and the World Wide Web//. Sherwood, OR: Franklin, Beedle & Associates.