From Merle Regal:
Kris Kristofferson's last name starts with his four-letter first name. Can you name a famous
American whose last name ends with his four-letter first name? Hint: The
last name has seven letters.
How to Solve the Puzzle With Code
This puzzle is quite easy if you can just get a list of
famous Americans! Unfortunately, I was not able to find such a list per se.
The closes thing is Wikipedia, which maintains a number of such lists. Like
all pages in Wikipedia, the names we want are closely mixed-in with
HTML markup, so extracting the names was quite the challenge.
This puzzle can be solved in two stages:
-
Get the list of names (hard!)
-
Looping through the list to find the person in question (easy!)
Once you've got the list, all you have to do is
-
Loop through the list, examining every entry
-
Ignore every entry with firstName.Length ≠ 4
-
Ignore all with lastName.Length ≠ 7
-
Spit-out any you find where the trailing characters of the last name match the
first name
Most languages have as string function something like 'EndsWith', so, if you're
programming with, for example, C#, the code inside your loop
would look like this:
if (fName.Length == 4 && lName.Length == 7) {
if (lName.EndsWith(fName)) {
Response.Write(fName + " " + lName);
}
}
This code fragment assumes that you have variables named 'fName' and 'lName'
representing the first name and last name, respectively.
Get the Names!
Getting the names is a lot harder. The basic idea is to 1) write your own spider or robot
program to read all the pages in question, 2) somehow extract just the names out of
all the page markup. The list I built has over 10,000 entries, so editing it manually
is out of the question, and if you've ever examined any page mark-up, you'll immediately
realize that the names make up only a small portion of a typical HTML page.
Try right-clicking on this page and choosing 'view page source' to get an idea of
the near-gibberish appearance of a typical HTML page. Or go to this page
Wikipedia Famous Americans Page
and right-click on their page! Either should quickly convince you to give up any
idea of creating the list manually.
So, getting our names list boils down to this:
-
Writing a program that can fetch the page content from Wikipedia, and
-
Figuring out how to extract just the names from the page markup.
Since writing a spider to read other web pages is so common while solving
puzzles, I dedicated a separate page to it:
How to Write a Spider
Wikipedia has a 'master page' of famous Americans, with links to other pages. For example,
it has links to all the states. For my purposes, it seemed likely that I could get an
exhaustive list by visiting every state page and grabbing the people's names there
I needed two regular expressions to get the list:
-
One to extract the links to the state lists
-
Another to extract the people's names off the state pages
Since regular expressions are used for almost every puzzle, I don't explain
them every time. Read my
introductory article
on regular expressions to learn how to use them to 'scrape a screen'.
The
Wikipedia page
in question has 50 links that look like this:
<a href="/wiki/List_of_people_from_Alabama"
title="List of people from Alabama">Alabama</a>
Here is the regular expression I used to extract these
links.
<a #Beginning of the anchor
\s #Space
href=" #Beginning of the address
(/wiki/List_of_people_from_[^"]*) #The URL we seek
" #The close quote
I'm sure you remember that # is the comment designator for regular expressions,
and that it is easy to turn text above into a regular expression using the
RegexOptions.IgnorePatternWhitespace option.
Once I had the links, it was fairly easy to use my spider/robot to visit all the
pages in turn and read their contents. Now it's a matter of matching the names on each
page with another regular expression and saving them to a file.
Here's a link to a
typical page
on Wikipedia listing famous Americans. If you look at the page source, you will quickly
observe that each person listed looks like this:
<li><a href="/wiki/Hank_Aaron" title="Hank Aaron">Hank Aaron</a>,
Once again, we use a regular expression to match the link (refer to my
regex article
on getting all the matches for a regular expression and using them in code). Basically,
we need a regular expression that has literal text for everything up to the title tag, and
we place it inside parentheses to make it a match group.
Here's the final regex, with embedded comments:
((<li>)|(<td>)) #Starts with either a td or else a list item
(<b>){0,1} #2 pages inexplicably have bold tags; 48 do not
<a #Beginning of a list item and link
\s+ #White space
href="/wiki/[A-Za-z_]+" #Url of the reference with variable letter
\s+ #White space
title="[a-zA-Z\s]+"> #The title
([a-zA-Z\s]+) #The text of the anchor, our group
</a> #close the tag
The regex became complicated because some pages do not follow the same pattern. As a result,
the regex has optional clauses for bold tags and table cell tags (<td>). In order to
make this markup optional in the regex, a the notation {0,1} is used to indicate the markup
will appear either 0 times, or else 1 time. We use parentheses around the markup that might
occur 1 or 0 times.
At this point, all we have to do is plug-in our regular expression, extract the matches,
and write them to a file. Refer to
my article on simple file IO
for a quick introduction to reading/writing to files in .NET.
Download the code! (C#)