Beat the Puzzle Master!
A web site dedicated to solving the NPR Sunday Puzzle

How to Write Your Own Spider/Robot

When you need to grab data from another website

Using .NET, there are a couple of different ways to grab the source of another web page:

  1. With a WebClient object
  2. With an HttpWebRequest object

The WebClient method is simpler but, like most simple methods, doesn't give you the options you sometimes need. For example, it won't work if you're trying to grab a page off Wikipedia: their policy bans anonymous robots, and a WebClient does not identify itself with a User-Agent by default.

The Easy Way

Here's how to use a WebClient to read a page and convert its contents to a string:
    using System.Net;   // WebClient
    using System.Text;  // Encoding

    private string SlurpPage(string url) {
        // Dispose the client when done to release its resources.
        using (WebClient myWebClient = new WebClient()) {
            // Download the raw bytes, then decode them as UTF-8.
            byte[] requestedHtml = myWebClient.DownloadData(url);
            return Encoding.UTF8.GetString(requestedHtml);
        }
    }

Use the HttpWebRequest object when you need to customize how you access a web page. If you set the UserAgent property, you can access sites like Wikipedia that ban anonymous robots. If you have a poor connection, you can specify a longer timeout period and even read the page in chunks. Reading in chunks will frequently succeed on a connection where a single large read fails.

Full-Featured Method: For Special Needs

Here's how to use an HttpWebRequest:
    using System.Net;   // HttpWebRequest, HttpWebResponse, WebRequest
    using System.IO;    // StreamReader

    HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(url);
    //Wikipedia bans anonymous robots:
    myReq.UserAgent = "PuzzleAgent";
    HttpWebResponse myResponse = (HttpWebResponse)myReq.GetResponse();
    //Slurp the page:
    StreamReader sr = new StreamReader(myResponse.GetResponseStream());
    string pageContents = sr.ReadToEnd();
    //Release the connection when done:
    sr.Close();
    myResponse.Close();
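
The longer timeout and the chunked read mentioned earlier might look something like the sketch below. The class and method names (ChunkedSlurper, ReadInChunks, SlurpPageSlowly), the 200000 ms timeout, and the 4096-byte chunk size are all assumptions chosen for illustration, and the code assumes the page is encoded as UTF-8.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ChunkedSlurper
{
    // Copies a stream into memory one small buffer at a time.
    public static byte[] ReadInChunks(Stream input, int chunkSize)
    {
        using (MemoryStream collected = new MemoryStream())
        {
            byte[] buffer = new byte[chunkSize];
            int bytesRead;
            // Read returns 0 only when the end of the stream is reached.
            while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                collected.Write(buffer, 0, bytesRead);
            }
            return collected.ToArray();
        }
    }

    // Hypothetical helper: fetches a page with a longer timeout, in chunks.
    public static string SlurpPageSlowly(string url)
    {
        HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(url);
        myReq.UserAgent = "PuzzleAgent";
        // Assumed value: twice the default of 100000 ms.
        myReq.Timeout = 200000;
        using (HttpWebResponse myResponse = (HttpWebResponse)myReq.GetResponse())
        using (Stream body = myResponse.GetResponseStream())
        {
            // Assumes the page is UTF-8.
            return Encoding.UTF8.GetString(ReadInChunks(body, 4096));
        }
    }
}
```

Keeping the chunk loop in its own method means it can be tried out on any Stream, not just a network response.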

Using this technique, you can get data from a large variety of sites. Wikipedia is a favorite, but I have also used it on a Department of Labor site (to get lists of job titles), the Census Bureau site (to get lists of popular names and place names), and many more.

Many times, you will find a page with links to other pages. For example, the DOL job titles page doesn't list any individual job title, but instead has links to other pages, such as the Service Occupations or the Machine Trades Occupations.

Of course, this is no difficulty to us at all; we merely find the links on the page and visit each in turn. My favorite technique is to write a regular expression that matches the links on the page and use its matches to visit all the pages referred to on the original page.
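
A minimal sketch of the regular-expression approach follows. The pattern is a rough guess at what typical anchor tags look like, not a full HTML parser, and the names LinkFinder and FindLinks are made up for this example. Relative links are resolved against the address of the page they were found on, fragments such as #top are dropped, and duplicates are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFinder
{
    // Captures the quoted href value of an <a> tag (assumed pattern; adjust per site).
    private static readonly Regex LinkPattern = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    public static List<string> FindLinks(string pageContents, string pageUrl)
    {
        Uri baseUri = new Uri(pageUrl);
        List<string> links = new List<string>();
        foreach (Match m in LinkPattern.Matches(pageContents))
        {
            string href = m.Groups[1].Value;
            // Skip links that only jump within the same page.
            if (href.StartsWith("#")) continue;

            Uri absolute;
            // Resolve relative links against the page they came from.
            if (!Uri.TryCreate(baseUri, href, out absolute)) continue;
            // Keep only web pages (drops mailto: and similar schemes).
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps) continue;

            // Drop any #fragment so the same page is not listed twice.
            string cleaned = absolute.GetLeftPart(UriPartial.Query);
            if (!links.Contains(cleaned)) links.Add(cleaned);
        }
        return links;
    }
}
```

Each address in the returned list can then be handed to SlurpPage in a foreach loop to visit the linked pages in turn.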

