Beat the Puzzle Master!
A web site dedicated to solving the NPR Sunday Puzzle

Using .NET Regular Expressions to Read Web Pages

Why Use Regular Expressions?

If you want to solve Will's puzzles, you are generally going to need some kind of list, such as city names, author names, animal names, etc. Where are you going to get those lists? Why, on the web, where else!

Problem: unless the list is short, you're going to have a hard time using it in your code. The data is mixed in with HTML, and you need to extract it to create a simple text list. You can either spend a few hours cutting and pasting from your browser into a text editor, or you can write some code to do that for you.

That's where regular expressions come in handy. Regular expressions are great for finding string patterns mixed in with a lot of HTML markup. You could write a hand-coded procedure that extracts just the data, but I predict your procedure will be huge. Or, you could write a short regular expression and do the same thing with a few lines of code. This article explains the basics of writing regular expressions for that purpose.

Regular expressions match patterns in your string variables. For example, you can use them to find all the matches in a large string, or to determine whether a candidate string conforms to your pattern. You could also write a regular expression to return all the links on a web page, or write a regular expression to verify that your user entered a valid phone number.

Sample regular expression that recognizes a phone number:

\d{3}-\d{4}
Explanation:
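  • \d     matches any decimal digit, e.g. 0-9 (in .NET, \d also matches digits from other Unicode scripts unless RegexOptions.ECMAScript is set)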
  • {3}  insists that there be exactly 3 of the preceding
  • -       literal character, i.e. a hyphen.
  • \d     digits again
  • {4}  exactly four of the preceding
Usage in C# code:
using System.Text.RegularExpressions;
Regex phoneRegex = new Regex(@"\d{3}-\d{4}");
if (phoneRegex.IsMatch("555-1212")) {
    Response.Write("Valid phone number");
}
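
One caveat worth knowing: IsMatch succeeds if the pattern occurs anywhere in the input, so the pattern above would also accept a string like "call 555-12345 now". The sketch below is one way to require that the whole string be a phone number, by anchoring the pattern with ^ and $. The class and method names (PhoneCheck, IsPhoneNumber) are made up for illustration, and it writes to the console rather than to an ASP.NET response.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneCheck
{
    // ^ and $ anchor the pattern to the start and end of the input.
    // (In .NET, $ still tolerates a single trailing newline.)
    private static readonly Regex Anchored = new Regex(@"^\d{3}-\d{4}$");

    // Hypothetical helper name, not part of .NET.
    public static bool IsPhoneNumber(string candidate)
    {
        return Anchored.IsMatch(candidate);
    }

    public static void Main()
    {
        Console.WriteLine(IsPhoneNumber("555-1212"));            // True
        Console.WriteLine(IsPhoneNumber("call 555-12345 now"));  // False
    }
}
```

Whether you want the anchored or the unanchored form depends on the job: anchored for validating a single field, unanchored for finding phone numbers inside a larger block of text.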

The best way to learn regular expressions is to look at the Cheat Sheet and practice writing some with The Regulator. It has a few bugs, but by using it you will find it far easier to write and test your regexes.

Tip 1: write your regexes on multiple lines with embedded comments. It will be easier to debug or maintain them. If you're using .NET, use RegexOptions.IgnorePatternWhitespace as an option to the Regex constructor.

Sample regex to match links, broken up into multiple lines with embedded comments:
<a                              #Literal - beginning of the anchor
\s                              #Whitespace
href="                          #Literal - beginning of the address
(                               #Parentheses start our match group
/wiki/List_of_people_from_      #literal text
[^"]*                           #any character except " 0 or more times
)                               #Close our group
"                               #The close quote (literal)

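If you want to keep the commented, multi-line form in your C# source, a verbatim string literal is one convenient way to do it. The sketch below is only an illustration under a few assumptions: the class name WikiLinkPattern and the sample anchor tag are made up, and RText stands in for the rText variable used in this article. The one thing to watch is that a literal double quote inside a verbatim string must be written as two double quotes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WikiLinkPattern
{
    // In a verbatim string (@"..."), a literal double quote is written as "".
    // Backslashes need no extra escaping, so \s stays \s.
    public const string RText = @"
        <a                          # Literal - beginning of the anchor
        \s                          # Whitespace
        href=""                     # Literal - beginning of the address
        (                           # Parentheses start our match group
        /wiki/List_of_people_from_  # Literal text
        [^""]*                      # Any character except a quote, 0 or more times
        )                           # Close our group
        ""                          # The close quote (literal)
        ";

    public static void Main()
    {
        // IgnorePatternWhitespace drops unescaped whitespace and # comments.
        Regex wikiLink = new Regex(RText, RegexOptions.IgnorePatternWhitespace);

        // Made-up sample input, not taken from a real page.
        Match m = wikiLink.Match("<a href=\"/wiki/List_of_people_from_Ohio\">Ohio</a>");
        Console.WriteLine(m.Groups[1].Value);   // prints /wiki/List_of_people_from_Ohio
    }
}
```
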
The following code shows how to grab all the matches in a string and loop through them. Note that the 'Groups' collection holds the text captured by our parentheses: Groups[0] is the entire match, and Groups[1] is the first parenthesized group, such as the group immediately above, which matches a URL on Wikipedia. It will contain the matched text you are looking for.

Look at the code below; assume that rText is a string variable containing the regex text above, and pageContents is another string variable containing the text from a Wikipedia page. Then the following code will loop through all the matches on the page and print out the target URL of each matching link.

using System.Text.RegularExpressions;
Regex wikiLink = new Regex(rText, RegexOptions.IgnorePatternWhitespace);
Match m = wikiLink.Match(pageContents);
while (m.Success) {
    Response.Write(m.Groups[1].Value);  // print the captured link target
    m = m.NextMatch();                  // fetch the next match
}
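
If you would rather collect the URLs than print them, the same loop can be written with Regex.Matches and foreach. This is a minimal sketch, not the only way to do it: the class name LinkExtractor, the helper ExtractLinks, and the sample markup are all made up for illustration, and the pattern is the link regex from this article written on a single line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // The article's link pattern, written on one line without comments.
    private static readonly Regex WikiLink =
        new Regex("<a\\shref=\"(/wiki/List_of_people_from_[^\"]*)\"");

    // Hypothetical helper: returns the captured URL of every matching link.
    public static List<string> ExtractLinks(string pageContents)
    {
        List<string> links = new List<string>();
        // Regex.Matches returns every non-overlapping match in the input.
        foreach (Match m in WikiLink.Matches(pageContents))
        {
            links.Add(m.Groups[1].Value);   // Groups[0] would be the whole match
        }
        return links;
    }

    public static void Main()
    {
        // Made-up sample markup standing in for a real Wikipedia page.
        string sample =
            "<li><a href=\"/wiki/List_of_people_from_Ohio\">Ohio</a></li>" +
            "<li><a href=\"/wiki/Main_Page\">Main page</a></li>" +
            "<li><a href=\"/wiki/List_of_people_from_Texas\">Texas</a></li>";

        foreach (string link in ExtractLinks(sample))
        {
            Console.WriteLine(link);
        }
    }
}
```

Run against the sample markup, this prints the Ohio and Texas links and skips the Main_Page link, because that link does not start with the literal text in the pattern.
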
Of course, we don't have to merely print the target URLs; we can also use them to fetch those pages and then process their contents somehow. (Take a look at how to write your own spider.)