Using .NET Regular Expressions to Read Web Pages
Why Use Regular Expressions?
If you want to solve Will's puzzles, you are generally going to need some kind of list, such
as city names, author names, animal names, etc. Where are you going to
get those lists? Why, on the web, where else!
Problem: unless the list is short, you're going to have a hard time using it in your code.
The data is mixed-in with HTML, and you need to extract the data to create a simple text list. You can either spend
a few hours cutting-and-pasting from your browser into a text editor, or else you can write some
code to do that for you.
That's where regular expressions come in handy. Regular expressions are great
for finding string patterns mixed-in with a lot of HTML markup. You
could write a procedure that extracts just the data, and I predict your procedure will be huge.
Or, you could write a short regular expression and do the same thing with a few lines of code.
This article explains the basics of writing regular expressions for that purpose.
Regular expressions match patterns in your string variables.
For example, you can use them to find all the matches in a large string, or to determine
whether a candidate string conforms to your pattern. You could also write a
regular expression to return all the links on a web page, or write a regular expression to
verify that your user entered a valid phone number.
Sample regular expression that recognizes a phone number:
\d{3}-\d{4}
Explanation:
- \d matches any digit, i.e. 0-9
- {3} insists that there be exactly 3 of the preceding
- - literal character, i.e. a hyphen.
- \d digits again
- {4} exactly four of the preceding
Usage in C# code:
using System.Text.RegularExpressions;
//IsMatch returns true if the pattern is found anywhere in the input
Regex phoneRegex = new Regex(@"\d{3}-\d{4}");
if (phoneRegex.IsMatch("555-1212")) {
    Response.Write("Valid phone number");
}
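One caveat about the snippet above: IsMatch succeeds when the pattern occurs anywhere in the input, so a longer string that merely contains a phone number also passes. The following is a minimal sketch, assuming you want the whole string to be the phone number, that anchors the pattern with ^ and $. The class and method names (PhoneCheck, IsPhoneNumber, ContainsPhoneNumber) are hypothetical and exist only for illustration, and the sketch writes to the console rather than to Response so it can run on its own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneCheck
{
    // ^ and $ anchor the pattern to the start and end of the input.
    // (In .NET, $ also tolerates a single trailing newline.)
    private static readonly Regex Anchored = new Regex(@"^\d{3}-\d{4}$");

    // The same pattern as in the article, without anchors.
    private static readonly Regex Unanchored = new Regex(@"\d{3}-\d{4}");

    // True only when the whole candidate string looks like ddd-dddd.
    public static bool IsPhoneNumber(string candidate)
    {
        return Anchored.IsMatch(candidate);
    }

    // True when ddd-dddd appears anywhere inside the text.
    public static bool ContainsPhoneNumber(string text)
    {
        return Unanchored.IsMatch(text);
    }

    public static void Main()
    {
        Console.WriteLine(IsPhoneNumber("555-1212"));
        Console.WriteLine(IsPhoneNumber("call 555-1212 now"));
        Console.WriteLine(ContainsPhoneNumber("call 555-1212 now"));
    }
}
```

The anchored version is the one to reach for when validating user input; the unanchored version is the one to reach for when searching a page for data.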
The best way to learn regular expressions is to look at the Cheat Sheet and practice writing some with The Regulator.
It has a few bugs, but by using it, you will find it far easier to write and test your regexes.
Tip 1: write your regexes on multiple lines with embedded comments. It will be
easier to debug or maintain them. If you're using .NET, use
RegexOptions.IgnorePatternWhitespace as an option to the Regex constructor.
Sample regex to match links, broken up into multiple lines with embedded comments:
<a                              #Literal - beginning of the anchor
\s                              #Whitespace
href="                          #Literal - beginning of the address
(                               #Opening parenthesis starts our match group
  /wiki/List_of_people_from_    #Literal text
  [^"]*                         #Any character except " 0 or more times
)                               #Close our group
"                               #The close quote (literal)
The following code shows how to grab all the matches in a string and loop through them.
Note that the 'Groups' collection holds the text captured by our parentheses: Groups[0] is
the entire match, and Groups[1] is the first parenthesized group, such as the group
immediately above which matches a URL on Wikipedia. It will contain the
matched text you are looking for.
Look at the code below; assume that rText is a string variable containing the regex text
above, and pageContents is another string variable containing the text from a Wikipedia
page. Then the following code will loop through all the matches on the page and print out the
target URL of each matching link.
using System.Text.RegularExpressions;
Regex wikiLink = new Regex(rText, RegexOptions.IgnorePatternWhitespace);
Match m = wikiLink.Match(pageContents);
while (m.Success) { //Success is false once there are no more matches
    Response.Write(m.Groups[1].Value); //print the link (group 1 is our parenthesized group)
    m = m.NextMatch(); //fetch the next match
}
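To try the loop outside of a web page, here is a self-contained sketch that puts the commented pattern and the matching code together. It rests on a few assumptions: the sample markup is made up to stand in for a downloaded Wikipedia page, the names WikiLinkDemo and ExtractLinks are hypothetical, and it uses Regex.Matches with foreach instead of the Match/NextMatch loop, which visits the same matches. Note that inside a C# verbatim string (@"...") every literal double quote has to be written as "".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WikiLinkDemo
{
    // The commented pattern from the article as a verbatim string.
    // Each "" below becomes a single " in the actual pattern.
    private const string RText = @"
        <a                            # Literal - beginning of the anchor
        \s                            # Whitespace
        href=""                       # Literal - beginning of the address
        (                             # Opening parenthesis starts our match group
          /wiki/List_of_people_from_  # Literal text
          [^""]*                      # Any character except a double quote, 0 or more times
        )                             # Close our group
        ""                            # The close quote (literal)
    ";

    // Returns the text captured by group 1 for every match in the page.
    public static List<string> ExtractLinks(string pageContents)
    {
        Regex wikiLink = new Regex(RText, RegexOptions.IgnorePatternWhitespace);
        List<string> links = new List<string>();
        foreach (Match m in wikiLink.Matches(pageContents))
        {
            // Groups[0] is the whole match; Groups[1] is our parenthesized group.
            links.Add(m.Groups[1].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Made-up sample markup, not real page contents.
        string pageContents =
            "<li><a href=\"/wiki/List_of_people_from_Ohio\">Ohio</a></li>" +
            "<li><a href=\"/wiki/Main_Page\">Main Page</a></li>" +
            "<li><a href=\"/wiki/List_of_people_from_Texas\">Texas</a></li>";

        foreach (string link in ExtractLinks(pageContents))
        {
            Console.WriteLine(link);
        }
    }
}
```

Collecting the captured URLs into a list, rather than printing them straight away, makes it easier to hand them to whatever processing step comes next.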
Of course, we don't need to
merely print the target URLs; we can also use them to fetch the page contents
and then process those pages somehow. (Take a look at how to write your own spider.)