Fun with Yelp
I’ve been playing around with some data scraping, and Yelp has been my first test subject. Their DOM makes light (read: uninteresting) work of extracting most data. I did stumble across something pretty useful this evening though.
On every venue page, Yelp adds a property called json_biz to the window object. You can access it through your favorite JS console, which is handy if you’re trying to manipulate data in your browser. The real win, though, is that the JSON representation of this object is inlined in a <script> tag near the bottom of the page, so a scraper can read it straight from the page source.
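Here's a rough sketch of how you might pull that object out of raw page HTML. It assumes the inline script contains something like `json_biz = {...}` and that the literal is strict JSON; I haven't confirmed either, and `extractJsonBiz` is just a name I made up for illustration.

```javascript
// Hypothetical helper: extract the object literal that follows "json_biz" in raw HTML.
// Assumption: the literal is valid JSON (double-quoted keys and strings).
function extractJsonBiz(html) {
  const marker = html.indexOf("json_biz");
  if (marker === -1) return null;

  // The first "{" after the marker is taken as the start of the object.
  const start = html.indexOf("{", marker);
  if (start === -1) return null;

  let depth = 0;
  let inString = false;
  let escaped = false;

  // Walk forward counting braces, ignoring any that appear inside strings.
  for (let i = start; i < html.length; i++) {
    const ch = html[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return JSON.parse(html.slice(start, i + 1));
    }
  }
  return null; // never found the matching closing brace
}
```

Counting braces instead of using a regex means nested objects and stray braces inside strings don't cut the match short. If the markup changes, the function returns null (or JSON.parse throws), so wrap the call accordingly.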
One of the major caveats (or advantages) of using this data rather than the data presented in the DOM is that the categories are referenced by ID, not by name. With a little searching through Yelp’s JavaScript, I was able to find a JSON representation of their category structure, which relates these IDs to their corresponding names. I also (sloppily) wrote a function to parse this structure and return a name for a given ID.
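For illustration, an ID-to-name lookup along those lines might look like the sketch below. This is not my original function, and the node shape (`id`, `name`, `children`) is an assumption; the real category structure may well be laid out differently.

```javascript
// Hypothetical lookup: walk a nested category tree and return the name for an ID.
// Assumed node shape: { id, name, children: [...] } (children is optional).
function findCategoryName(nodes, id) {
  for (const node of nodes) {
    if (node.id === id) return node.name;
    // Recurse into child categories, if there are any.
    const found = findCategoryName(node.children || [], id);
    if (found !== null) return found;
  }
  return null; // no category with that ID
}

// Made-up sample data, only to show the call.
const sampleCategories = [
  { id: 1, name: "Restaurants", children: [{ id: 2, name: "Pizza" }] },
  { id: 3, name: "Nightlife", children: [] },
];
const pizzaName = findCategoryName(sampleCategories, 2);
```

If you're resolving lots of IDs, flattening the tree into a Map once would be faster than walking it on every call.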
Obviously, YMMV with this technique, and please respect Yelp’s data, along with everyone else’s for that matter.