Scraping a webpage that doesn't have any CSS selectors?

arithma · Dec 28, 2011

Are you trying to get a set of entries or a single entry?
I once found myself in the same situation, getting some dates out of a certain webpage, and made a service out of it. I made use of regular expressions. Try to narrow the pattern down, but not too much, so that it matches only the data you want.
In the case of the site you shared, I'd first take out the block that I'm interested in as a whole. Then I'd try to take out the data entries individually.

This could help you build your regular expression: http://gskinner.com/RegExr/
It uses ActionScript's regex flavor, but should do the trick.
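
A rough sketch of that two-step idea in PHP (the sample markup, the `events` id and the date format are made up for illustration, not taken from the actual site):

```php
<?php
// Hypothetical sample markup; the real page will differ.
$html = <<<HTML
<div id="header">Updated 01/01/2011</div>
<div id="events">
  <p>Concert, 28/12/2011</p>
  <p>Play, 05/01/2012</p>
</div>
<div id="footer">Copyright 2011</div>
HTML;

$dates = [];

// Step 1: cut out the block of interest as a whole.
// The "s" modifier lets "." match newlines; "(.*?)" stops at the first closing tag.
if (preg_match('~<div id="events">(.*?)</div>~s', $html, $block)) {
    // Step 2: pull the individual entries out of that block only.
    preg_match_all('~\b\d{2}/\d{2}/\d{4}\b~', $block[1], $matches);
    $dates = $matches[0];
}

print_r($dates);
```

Narrowing to the block first keeps the date in the header from being picked up by the second pattern.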

Ra8 · Dec 28, 2011

Edit: Sorry, I completely misread your question.

Ayman · Dec 28, 2011

@arithma, yes, that's what I was thinking of. I'll give that a try; it should work.

Joe · Dec 28, 2011

I'm not sure I understand what you're going for.

If you're trying to retrieve data from an HTML page, you could use regex, but you'll probably lose your mind (and your soul) trying to re-implement an HTML parser. Don't ask me, ask Atwood, he'll tell you the story of the programmer who went crazy trying to do it.

On the other hand, HTML is closely related to XML, and valid XHTML is XML. Try to parse it as XML. Depending on the language you use, you might even have an HTML parsing library available.

arithma · Dec 28, 2011

@rahmu, you're absolutely right. You can't easily parse HTML into a DOM structure using RegExp. But we're not talking about that :) We need to take out specific info, and that's absolutely doable. It's a hack that just works: a one-time, no-maintenance throwaway.
It goes without saying that it shouldn't end up in live production code that runs all the time.

Ayman · Dec 28, 2011

Nice article rahmu! :P

Actually, what I want to do is extract the events listed on that page, along with the details of each one, and store them in the database for use on another site of mine. I want an automated system that gets all those events for me and also checks that page for updates, so that when a new event is added it gets added to the DB. The program is expected to run once a day for the check.

I need to address another issue: if the markup on the page changes, the program should detect that something is wrong and alert me, so that I can adapt it to the changes.

All of this is because I have a project related to events and I really hate looking for and entering data myself for the rest of my life.
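
A minimal sketch of the "detect that something is wrong" check, assuming each scraped event ends up as an array with `title` and `date` keys (those field names, the function name and the minimum count are hypothetical):

```php
<?php
/**
 * Returns a list of problems found in the scraped events (empty if all looks fine).
 * The required fields and the minimum count are assumptions for illustration.
 */
function find_scrape_problems(array $events, int $minExpected = 1): array
{
    $problems = [];

    // Far fewer events than usual often means the selectors no longer match.
    if (count($events) < $minExpected) {
        $problems[] = sprintf('expected at least %d events, got %d', $minExpected, count($events));
    }

    // An event with an empty required field suggests the inner markup moved.
    foreach ($events as $i => $event) {
        foreach (['title', 'date'] as $field) {
            if (!isset($event[$field]) || trim($event[$field]) === '') {
                $problems[] = "event #$i is missing '$field'";
            }
        }
    }

    return $problems;
}

$problems = find_scrape_problems([
    ['title' => 'Concert', 'date' => '28/12/2011'],
    ['title' => '', 'date' => '05/01/2012'],
]);
print_r($problems);
```

The daily job could email or log a non-empty result and skip the database write for that run.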

Joe · Dec 28, 2011

arithma wrote: A one time, no maintenance throw away

AymanFarhat wrote: The program is expected to run on a daily basis once for the check

This is it. I completely agree with you arithma, but that's not what Ayman is looking for. If you need something reliable, use an HTML parsing lib, even if it (seemingly) takes more time at first.

No matter how clever your regex is, I'm willing to bet it will break at least once within the first week. (Challenge?)

Ayman · Dec 28, 2011

@rahmu I'm not going to take that challenge, as I am pretty sure the regex would never be perfect and, as you said, could easily break. Do you suggest any specific HTML parsing libraries? Preferably for PHP?

arithma · Dec 28, 2011

How do HTML parsers do anyway in the case of malformed HTML? Will it just tell ayman that the document is malformed? I am sure most XML parsers will throw loads of errors in your face.
Anyway, the amount of repetition in the HTML in this page is a challenge enough to probably make using regex a pain in the bottom.
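
As a small experiment on that question, here is a sketch using PHP's bundled DOM extension (the broken markup is invented for the test). Its HTML parser is lenient: it tries to recover and records the problems rather than refusing the document.

```php
<?php
// Deliberately broken markup: unclosed <td> and <tr>, unquoted attribute value.
$broken = '<table><tr><td width=8>Concert<td>28/12/2011</table>';

libxml_use_internal_errors(true);   // collect parse problems instead of emitting warnings
$doc = new DOMDocument();
$loaded = $doc->loadHTML($broken);  // the parser recovers rather than aborting
$errors = libxml_get_errors();      // whatever the parser complained about, if anything
libxml_clear_errors();

$cells = $doc->getElementsByTagName('td');

echo 'loaded: ', var_export($loaded, true), PHP_EOL;
echo 'cells found: ', $cells->length, PHP_EOL;
echo 'problems reported: ', count($errors), PHP_EOL;
```

A strict XML parser fed the same string would stop at the first well-formedness error, which is the behaviour described above.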

Ayman · Dec 28, 2011

arithma wrote: How do HTML parsers do anyway in the case of malformed HTML? Will it just tell ayman that the document is malformed? I am sure most XML parsers will throw loads of errors in your face.

Exactly, that's the main problem that I would most probably face.

That's why I am a bit confused what to do in such a case. Both options seem problematic.

Zef · Dec 28, 2011

There's a bookmarklet called SelectorGadget that can help you identify the correct elements in a way that would be really difficult to do in your head. It can generate a CSS selector or an XPath expression that you can use to extract the desired content.

I was able to reliably target all the significant info for each movie on that page, like the image, title, and description. I think that will work very well for you without having to use a separate parsing library.

There's a help button that teaches you how to use it. It's a bit confusing at first but makes sense once you read the help.

(Edited: I originally suggested using XPath, but on second thought it doesn't really matter whether you use XPath or CSS selectors; what matters is that you target the correct elements.)
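
For illustration, this is roughly how an XPath expression could be applied with PHP's `DOMXPath` (the markup, the `event` and `desc` class names and the expressions are hypothetical; substitute whatever the tool generates for the real page):

```php
<?php
// Hypothetical markup and class names, not the real page.
$html = '<html><body>
  <div class="event"><h3>Concert</h3><p class="desc">Live music</p></div>
  <div class="event"><h3>Play</h3><p class="desc">Drama night</p></div>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$events = [];
foreach ($xpath->query('//div[@class="event"]') as $node) {
    $events[] = [
        // Passing $node as the second argument makes the expression relative to this event.
        'title'       => trim($xpath->evaluate('string(h3)', $node)),
        'description' => trim($xpath->evaluate('string(p[@class="desc"])', $node)),
    ];
}

print_r($events);
```

`DOMXPath` takes XPath expressions rather than CSS selectors, so the XPath output of the tool is the one that plugs in directly here.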

arithma · Dec 28, 2011

As a first step, you can use the following regular expression to get to the data td:

<td background="images/leftevent.jpg" width=8 style="background-repeat:no-repeat;"></td>(.*)<td background="images/Rightevent.jpg" width=8 style="background-repeat:no-repeat;"></td>

The rest is still not so obvious with their idiotic table/td/table/td structure. But something's gotta give. Sadly, I'll have to see if I can crack it easily later or tomorrow morning.
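
A sketch of how that expression might be run in PHP. Two things to watch for: `.` does not match newlines unless the `s` modifier is set, and a greedy `(.*)` runs to the last right-hand marker on the page instead of the first. The page fragment below is invented for the test.

```php
<?php
$left  = '<td background="images/leftevent.jpg" width=8 style="background-repeat:no-repeat;"></td>';
$right = '<td background="images/Rightevent.jpg" width=8 style="background-repeat:no-repeat;"></td>';

// Hypothetical page fragment spanning several lines.
$html = $left . "\n<td>\n  <table><tr><td>Concert</td></tr></table>\n</td>\n" . $right;

// preg_quote() escapes regex metacharacters in the literal markers (the "." and so on).
// "s" lets "." cross newlines, and "(.*?)" stops at the first right-hand marker.
$pattern = '~' . preg_quote($left, '~') . '(.*?)' . preg_quote($right, '~') . '~s';

$inner = preg_match($pattern, $html, $m) ? trim($m[1]) : null;
echo $inner, PHP_EOL;
```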

Ayman · Dec 29, 2011

@Zef That's an excellent tool! Really nice. Just tried it and I am now extracting everything very easily. Thanks for sharing! :)

@Arithma Seems challenging with regex, we could still work on getting the data in regex just for the fun of it if you want :P

arithma · Dec 29, 2011

What language are you using, Ayman? Are you able to parse the document without errors? I thought the versatility of regex shines when documents are total crap in terms of validity, and seriously, what web pages are not?

BTW, I tried the tool and it's absolutely awesome.

Ayman · Dec 29, 2011

I am using PHP. The returned document is full of errors, so what I did is get the page markup as a string, clean and repair it using PHP Tidy, then create a new DOMDocument from it. Something like this:

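// get_url_contents() is not a PHP built-in; it is a helper that returns the page markup as a string
$html = get_url_contents('http://www.ticketingboxoffice.com');

// Repair the markup with Tidy before handing it to the DOM parser
$tidy = tidy_parse_string($html, array('clean' => 'yes', 'output-html' => 'yes'), 'utf8');
$tidy->cleanRepair();

// Load the repaired markup, passed explicitly as a string
$doc = new DOMDocument;
$doc->loadHTML(tidy_get_output($tidy));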