The Blueprint
Ever been faced with a problem where you need data from another site, but the site has no good API for extracting it? Recently I needed to grab dates off a band’s Myspace page and use that data offsite. I’d rather fall on a pointed stick than use Myspace’s API, so I started looking for a way to grab the data using cURL or something similar.
After 5 hours of RegExp hell I started looking for open source solutions and ran across the Simple HTML DOM class. It was like the heavens opened and shined their light right on my screen. This class basically fetches a URL and parses the HTML. You can then run a “find” command that searches the DOM and returns the elements matching your selector. Follow that up by grabbing the text out of each element and the data is yours to have!
The docs are very well written, and if the site you are parsing has well-formed, semantic HTML/CSS you can get your data in very few lines of code. The class even supports DOM traversal (e.g. first_child()) and filtering on element attributes.
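To make the traversal and attribute filtering a bit more concrete, here is a minimal sketch. It assumes simple_html_dom.php has been downloaded and sits next to the script. The markup, the "shows" table id, the class names and the venue names are all made up for illustration, and it uses str_get_html() to parse a string so nothing has to be fetched.

```php
<?php
// Assumption: simple_html_dom.php has been downloaded and sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Hypothetical markup, made up for this sketch.
$markup = '<table id="shows">'
        . '<tr><td class="date">May 1</td><td class="venue"><a href="/venues/1">Venue A</a></td></tr>'
        . '<tr><td class="date">May 8</td><td class="venue"><a href="/venues/2">Venue B</a></td></tr>'
        . '</table>';

// str_get_html() parses a string instead of fetching a URL.
$html = str_get_html($markup);

$shows = array();
// Attribute filter: only rows inside the table whose id is "shows".
foreach ($html->find('table[id=shows] tr') as $row) {
    $dateCell = $row->first_child();          // DOM traversal: first <td> in the row
    $link     = $row->find('td.venue a', 0);  // first matching link in the row
    $shows[] = array(
        'date'  => trim($dateCell->plaintext),
        'venue' => trim($link->plaintext),
        'url'   => $link->href,               // element attributes are read as properties
    );
}
print_r($shows);
```

Swapping str_get_html($markup) for file_get_html() with a URL should give the same kind of object to work with; the selectors are the part you will need to adapt to the real page.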
Here is their example for grabbing article data from Digg:
// load the library (this path assumes simple_html_dom.php sits next to your script)
require_once 'simple_html_dom.php';
// create HTML DOM
$html = file_get_html('http://digg.com/');
$ret = array();
// get news block
foreach ($html->find('div.news-summary') as $article) {
    $item = array();
    // get title
    $item['title'] = trim($article->find('h3', 0)->plaintext);
    // get details
    $item['details'] = trim($article->find('p', 0)->plaintext);
    // get digg count
    $item['diggs'] = trim($article->find('li a strong', 0)->plaintext);
    $ret[] = $item;
}
print_r($ret);
As you can see, grabbing the data is as easy as knowing how to traverse the DOM. Within a couple of hours I had the data I needed parsed and manipulated from within my own code. What a beautiful thing. The find() method does bring a few headaches when you are parsing a site such as Myspace that has very little semantic structure, so keep that in mind: working around it is what took the majority of my time. The script works as advertised and saved me hours and hours of coding. Hope you enjoy it as well.
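On pages with little semantic structure, one thing that can help is filtering on whatever attributes the markup does have and checking each lookup before using it, since find() with an index returns null when nothing matches. The sketch below is only an illustration under assumptions: the markup and the width values are invented, first_text() is a hypothetical helper and not part of the library, and simple_html_dom.php is assumed to sit next to the script.

```php
<?php
// Assumption: simple_html_dom.php has been downloaded and sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Hypothetical helper (not part of the library): text of the first match, or null.
function first_text($node, $selector)
{
    $match = $node->find($selector, 0);   // null when nothing matches
    return $match === null ? null : trim($match->plaintext);
}

// Hypothetical layout-table markup with no ids or classes to hook into.
$markup = '<table>'
        . '<tr><td colspan="2">&nbsp;</td></tr>'
        . '<tr><td width="150">May 1</td><td width="300">Venue A</td></tr>'
        . '<tr><td width="150">May 8</td></tr>'
        . '</table>';

$html = str_get_html($markup);

$rows = array();
foreach ($html->find('tr') as $row) {
    // With no classes available, filter on a presentational attribute instead.
    $date = first_text($row, 'td[width=150]');
    if ($date === null) {
        continue;                          // spacer row, nothing to collect
    }
    $rows[] = array(
        'date'  => $date,
        'venue' => first_text($row, 'td[width=300]'),  // may be null
    );
}
print_r($rows);
```

The spacer row is skipped and the row without a venue cell ends up with a null venue instead of tripping over a missing element.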
Great info, just what I needed thanks :D
That’s cool It’s possible :)
thank you for DOM code
I want to pull data from another website’s database,
so it will save me the time of creating our own database and updating it from time to time.
Can you help me?…
I’m fairly sure that doing that without consent would be fairly illegal.
Nice,Thank
Hello,
I used the code for grabbing article data from Digg, but I get an error: “Fatal error: Call to undefined function file_get_html() in..” What can I do?
Sir, I tried your code but got
the error “Fatal error: Call to undefined function file_get_html…”
I also tried to include
simple_html_dom.php
but I still get the same error…
Can you please help me?