How to Write Your Own Spider/Robot
When you need to grab data from another website
Using .NET, there are a couple of different ways to grab the source of another web page:
- With a WebClient object
- With an HttpWebRequest object
The WebClient method is simpler, but like all simple methods, doesn't give you the options you sometimes
need. For example, if you're trying to grab a page off Wikipedia with a plain WebClient, it is likely to
fail, because their policy bans anonymous robots, and a WebClient doesn't identify itself with a
User-Agent header unless you add one yourself.
The Easy Way
Here's how to use a WebClient to read a page and convert the contents to a string:
using System.Net;   // WebClient
using System.Text;  // Encoding

private string SlurpPage(string url) {
    // Download the raw bytes, then decode them as UTF-8.
    using (WebClient myWebClient = new WebClient()) {
        byte[] requestedHtml = myWebClient.DownloadData(url);
        return Encoding.UTF8.GetString(requestedHtml);
    }
}
Use the HttpWebRequest object when you need to customize how you access a web page. If you specify
a value for the UserAgent property, you can access sites like Wikipedia that ban anonymous robots. If
you have a crappy connection, you can specify a longer timeout period and even read the page in chunks,
which will frequently succeed on a poor connection.
Full-Featured Method: For Special Needs
How to use the HttpWebRequest method:
using System.Net;  // HttpWebRequest, HttpWebResponse
using System.IO;   // StreamReader

HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(url);
// Wikipedia bans anonymous robots:
myReq.UserAgent = "PuzzleAgent";
string pageContents;
using (HttpWebResponse myResponse = (HttpWebResponse)myReq.GetResponse())
using (StreamReader sr = new StreamReader(myResponse.GetResponseStream())) {
    // Slurp the page:
    pageContents = sr.ReadToEnd();
}
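For the poor-connection case mentioned earlier, here is a rough sketch of a longer timeout combined with reading in chunks. The class name ChunkedReader, the method names, the 4 KB chunk size, and the timeout values are all my own picks for illustration, and the code assumes the page is UTF-8. Treat it as a starting point rather than a finished routine.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ChunkedReader
{
    // Reads a stream a little at a time and returns the result as a string.
    // Assumption: the content is UTF-8 encoded.
    public static string ReadInChunks(Stream input, int chunkSize)
    {
        byte[] buffer = new byte[chunkSize];
        using (MemoryStream collected = new MemoryStream())
        {
            int bytesRead;
            // Read returns 0 only when the end of the stream is reached.
            while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                collected.Write(buffer, 0, bytesRead);
            }
            // Decode once at the end so a multi-byte character that
            // straddles two chunks is not corrupted.
            return Encoding.UTF8.GetString(collected.ToArray());
        }
    }

    public static string SlurpPageSlowly(string url)
    {
        HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(url);
        myReq.UserAgent = "PuzzleAgent";
        // Both values are in milliseconds and are illustrative only.
        myReq.Timeout = 300000;          // time allowed for the response to start
        myReq.ReadWriteTimeout = 600000; // time allowed for each read on the stream
        using (HttpWebResponse myResponse = (HttpWebResponse)myReq.GetResponse())
        using (Stream body = myResponse.GetResponseStream())
        {
            return ReadInChunks(body, 4096);
        }
    }
}
```

Calling ChunkedReader.SlurpPageSlowly(url) returns the page as a string, just like SlurpPage does. The bytes are collected first and decoded afterwards so that chunk boundaries never split a character.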
Using this technique, you can get data from a wide variety of sites. Wikipedia is a favorite, but
I have also used it on a Department of Labor site (to get lists of job titles), the Census Bureau
(to get lists of popular names and lists of place names), and many more.
Many times, you will find a page with links to other pages. For example, the DOL job titles page
doesn't list any individual job title, but instead has links to other pages, such as the Service
Occupations or the Machine Trades Occupations.
Of course, this is no difficulty to us at all; we merely find the links on the
page and visit each in turn. My favorite technique is to write a regular
expression that matches the links on the page and
use its matches to visit all the pages referred to on the original page.
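As a rough illustration of the regular-expression approach, here is one way it might look. The class name LinkFinder, the method FindLinks, and the pattern itself are my own choices for this sketch. The pattern only handles quoted href attributes inside <a> tags, so unusual markup will slip past it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFinder
{
    // Captures the quoted href value of an <a> tag (single or double quotes).
    private static readonly Regex LinkPattern = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the distinct http/https links found in pageContents,
    // resolved against the URL the page was downloaded from.
    public static List<string> FindLinks(string pageContents, string pageUrl)
    {
        Uri baseUri = new Uri(pageUrl);
        List<string> links = new List<string>();
        foreach (Match m in LinkPattern.Matches(pageContents))
        {
            Uri absolute;
            // Turn relative links such as "service.html" into full URLs.
            if (!Uri.TryCreate(baseUri, m.Groups[1].Value, out absolute))
            {
                continue;
            }
            // Skip mailto:, javascript: and other non-web links.
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps)
            {
                continue;
            }
            if (!links.Contains(absolute.AbsoluteUri))
            {
                links.Add(absolute.AbsoluteUri);
            }
        }
        return links;
    }
}
```

Each URL in the returned list can then be handed to SlurpPage (or to the HttpWebRequest code) in a foreach loop to visit the linked pages in turn. If you crawl more than one level deep, keep a set of pages already visited so you don't fetch the same page twice.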