Here is a new thread from SIMILE, a Semantic Web mailing list, which might interest those involved in extracting geographic names.
Donat
-----Original Message-----
From: general-bounces@simile.mit.edu [mailto:general-bounces@simile.mit.edu] On Behalf Of Chris Bizer
Sent: Saturday, December 02, 2006 10:36 AM
To: Richard Cyganiak; Richard Newman
Cc: semantic-web@w3.org; 'Damian Steer'; general@simile.mit.edu; 'Karl Dubost'
Subject: Wikipedia and Geonames. was: AW: ANN: RDF Book Mashup - Integrating Web 2.0 data sources like Amazon and Google into the Semantic Web
I wish that Wikipedia had a fully exportable database: http://en.wikipedia.org/wiki/Lists_of_films
For example, being able to export all the data about this movie as RDF; maybe it is a templating issue, at least for the box on the right: http://en.wikipedia.org/wiki/2046_%28film%29
Should be an easy job for a SIMILE-like screen scraper.
If you start scraping down from the Wikipedia film list, you should get a fair amount of data.
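To make that concrete, here is a minimal sketch in Python with BeautifulSoup of the kind of screen scraping meant here: it pulls the infobox (the box on the right) out of the 2046 article and prints it as attribute/value pairs. The "infobox" table class is an assumption about Wikipedia's HTML markup, and a real crawler would of course iterate over the film list rather than a single page:

import urllib.request
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

URL = "http://en.wikipedia.org/wiki/2046_%28film%29"

def scrape_infobox(url):
    # Fetch the article; Wikipedia expects a descriptive User-Agent.
    req = urllib.request.Request(url, headers={"User-Agent": "infobox-scraper-sketch"})
    html = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(html, "html.parser")
    # The box on the right is assumed to be a <table class="infobox">.
    box = soup.find("table", class_="infobox")
    data = {}
    if box is not None:
        for row in box.find_all("tr"):
            header, value = row.find("th"), row.find("td")
            if header and value:
                data[header.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
    return data

if __name__ == "__main__":
    for key, value in scrape_infobox(URL).items():
        print(key, ":", value)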
Some further ideas along these lines: what about scraping information about geographic places like countries and cities from Wikipedia and linking the data to Geonames (http://www.geonames.org/ontology/)?
Something like http://XXX/wikipedia/Embrun owl:sameAs http://sws.geonames.org/3020251/
The Wikipedia articles about countries and cities all follow relatively similar structures (for instance http://en.wikipedia.org/wiki/Berlin), so it should be easy to scrape them. They already contain links to other places, like the boroughs and localities in Berlin, which could easily be transformed into RDF links.
Many places have geo-coordinates which, together with the place name, allow scrapers to automatically create links to the corresponding localities in Geonames.
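A rough sketch of that matching step in Python, assuming the Geonames findNearbyPlaceNameJSON web service (the endpoint, the "demo" username and the approximate coordinates for Embrun are illustrative assumptions): given a scraped place name and its coordinates, keep the nearby Geonames feature whose name matches and emit the owl:sameAs link.

import json
import urllib.parse
import urllib.request

# Assumed Geonames web service; a real run needs a registered username.
NEARBY = "http://api.geonames.org/findNearbyPlaceNameJSON"

def geonames_uri(name, lat, lng, username="demo"):
    # Ask Geonames for place names near the scraped coordinates and keep
    # the one whose name matches the scraped article title.
    params = urllib.parse.urlencode(
        {"lat": lat, "lng": lng, "maxRows": 5, "username": username})
    with urllib.request.urlopen(NEARBY + "?" + params) as resp:
        hits = json.load(resp).get("geonames", [])
    for hit in hits:
        if hit.get("name", "").lower() == name.lower():
            return "http://sws.geonames.org/%d/" % hit["geonameId"]
    return None

def same_as(wikipedia_uri, geonames):
    # Emit the link as an N-Triples style owl:sameAs statement.
    return "<%s> <http://www.w3.org/2002/07/owl#sameAs> <%s> ." % (wikipedia_uri, geonames)

if __name__ == "__main__":
    target = geonames_uri("Embrun", 44.57, 6.50)  # approximate coordinates
    if target is not None:
        print(same_as("http://en.wikipedia.org/wiki/Embrun", target))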
Wikipedia content is under the GNU Free Documentation License, so there aren't the licensing problems we have with the Google and Amazon data.
As most articles follow the same structure, an approach to implementing such an information service could be to:
- Use a crawling/screen-scraping framework that fills a relational database with the information from Wikipedia (a rough sketch of this step follows below).
- Use D2R Server (http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/) to publish the database on the Web and to provide a SPARQL endpoint for querying.
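A rough sketch of the first step, assuming a minimal one-table SQLite schema for scraped places (the schema and column names are purely illustrative; D2R Server would then map this table to RDF and expose it via SPARQL):

import sqlite3

# Illustrative relational schema that a Wikipedia crawler could fill and
# that D2R Server could later map to RDF.
SCHEMA = """
CREATE TABLE IF NOT EXISTS places (
    wikipedia_uri TEXT PRIMARY KEY,
    name          TEXT,
    latitude      REAL,
    longitude     REAL,
    geonames_uri  TEXT
)
"""

def store_place(conn, wikipedia_uri, name, lat, lng, geonames_uri=None):
    # Insert or update one scraped place.
    conn.execute(
        "INSERT OR REPLACE INTO places VALUES (?, ?, ?, ?, ?)",
        (wikipedia_uri, name, lat, lng, geonames_uri),
    )

if __name__ == "__main__":
    conn = sqlite3.connect("wikipedia_places.db")
    conn.execute(SCHEMA)
    store_place(conn, "http://en.wikipedia.org/wiki/Berlin", "Berlin", 52.52, 13.41)
    conn.commit()
    conn.close()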
I once read about some pretty sophisticated screen-scraping frameworks that fill relational databases with data from websites, but I have forgotten the exact links. Does anybody know of any?
Cheers,
Chris
----- Original Message -----
From: "Richard Cyganiak" richard@cyganiak.de
To: "Richard Newman" r.newman@reading.ac.uk
Cc: "Chris Bizer" chris@bizer.de; "'Karl Dubost'" karl@w3.org; "'Damian Steer'" damian.steer@hp.com; semantic-web@w3.org
Sent: Friday, December 01, 2006 7:19 PM
Subject: Re: AW: ANN: RDF Book Mashup - Integrating Web 2.0 data sources like Amazon and Google into the Semantic Web
On 1 Dec 2006, at 18:27, Richard Newman wrote:
Systemone have Wikipedia dumped monthly into RDF:
http://labs.systemone.at/wikipedia3
A public SPARQL endpoint is on their roadmap, but it's only 47 million triples, so you should be able to load it in a few minutes on your machine and run queries locally.
Unfortunately this only represents the hyperlink structure and basic article metadata in RDF. It does no scraping of data from info boxes or article content. Might be interesting for analyzing Wikipedia's link structure or social dynamics, but not for content extraction.
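For that kind of link-structure analysis, a local query could look roughly like the sketch below (Python with rdflib); the file name and the link predicate URI are placeholders, since the actual vocabulary depends on the wikipedia3 dump.

from rdflib import Graph  # third-party: pip install rdflib

# Placeholder predicate for article-to-article links; the real URI depends
# on the vocabulary used by the wikipedia3 dump.
LINK_PREDICATE = "http://example.org/wikipedia/linksTo"

graph = Graph()
graph.parse("wikipedia3-sample.nt", format="nt")  # assumed N-Triples slice of the dump

# Ten most linked-to articles, by number of incoming links.
query = """
SELECT ?target (COUNT(?source) AS ?inlinks)
WHERE { ?source <%s> ?target . }
GROUP BY ?target
ORDER BY DESC(?inlinks)
LIMIT 10
""" % LINK_PREDICATE

for target, inlinks in graph.query(query):
    print(target, inlinks)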
Richard
-R
On 1 Dec 2006, at 4:30 AM, Chris Bizer wrote:
I wish that Wikipedia had a fully exportable database: http://en.wikipedia.org/wiki/Lists_of_films
For example, being able to export all the data about this movie as RDF; maybe it is a templating issue, at least for the box on the right: http://en.wikipedia.org/wiki/2046_%28film%29
Should be an easy job for a SIMILE-like screen scraper.
If you start scraping down from the Wikipedia film list, you should get a fair amount of data.
To all the Semantic Wiki guys: has anybody already done something like this? Are there SPARQL endpoints/repositories for Wikipedia-scraped data?