Okay, as promised, here’s my quick’n'dirty Hijack Proof of Concept.
You almost certainly want to download the app to find out how it doesn’t really work well and then complain to me about it. Well, click that link. I’ve tested it on this site, forums.macnn.com, and the forums on the phpBB homepage. It only works on pages of posts, not listings of threads. There’s no way to post. If the site admin even considers changing the HTML, if they even think about it, this build will break.
In other words, it’s not in any way, shape, or form “useful”. You don’t really want to download it is what I’m saying. I know you will anyway.
You can add support for other sites by playing around in the Windows menu if you can understand my incredibly arcane string crap I elaborate on below.
—-
Codegeeks: Here’s what you want.
Basically, it loads your request into a hidden WebView so that it can get WebKit to generate a DOM for us. It hands that DOM off to something that creates a “tagSoup” out of it. tagSoup is sort of like a giant XPath of the entire document, but it shows sibling and parent relationships. The idea behind tagSoup was to give a really quick way of finding matches for a page that Hijack has never seen before, to identify what scraping scheme to use. More on tagSoup in a bit (hint, tagSoup sux).
Along the way, it looks to see if it can figure out through keywords what forum package the page belongs to. If it can, it’ll try to use those to figure out how to do the scraping.
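Here's a rough idea of what that keyword sniffing could look like, sketched in Python rather than the app's Cocoa. The keyword table, `PACKAGE_KEYWORDS`, and `guess_forum_package` are all made up for illustration; the real app may look at completely different strings.

```python
from typing import Optional

# Hypothetical keyword table: these strings are assumptions, not Hijack's real list.
PACKAGE_KEYWORDS = {
    "phpBB": ("powered by phpbb",),
    "vBulletin": ("powered by vbulletin",),
}


def guess_forum_package(page_text: str) -> Optional[str]:
    """Return the first forum package whose keyword appears in the page, else None."""
    lowered = page_text.lower()
    for package, keywords in PACKAGE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return package
    return None
```

If nothing matches, the function returns `None`, which would be the cue to fall back to matching against the DOM itself.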
Then, it tries to match a ScrapeScheme to the DOM that got loaded. ScrapeSchemes are stored in a Core Data database (binary format because the SQLite ones don’t support beginsWith, which I needed for some reason I have forgotten). This happens inside of ScrapeSchemeManager. It tries to do this as lazily as it can, but if lazy doesn’t work, it comes down to doing string matching against the tagSoup. More on tagSoup in a bit (hint, tagSoup sux).
If it can find a match, it sucks the data out of the DOM and builds a new generic html/css page out of it in DOMRenderer. This is then rendered in a visible webview by way of what appears to be a generic NSEnumerator but is actually 85000 lines of code pain, making you say “ooh” and “ahh” and “wow, this program totally fucking blows goats.”
A scrape consists of an (optional) prefix, an identifier or regular expression, and an (optional) suffix. The scrape is used to find something in the DOM that matches the scrape. The found item is then matched to a “significance”, which indicates what the matched thing actually is. Your avatar, for example.
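As a data structure, a scrape might look something like the Python sketch below. This is only an illustration of the description above, not the actual Core Data entity: the `Scrape` class and its field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scrape:
    """Hypothetical stand-in for one scrape; field names are assumed."""
    significance: str                 # what the matched thing is, e.g. "avatar"
    identifier: Optional[str] = None  # literal match string
    regex: Optional[str] = None       # or a regular expression instead
    prefix: str = ""                  # optional
    suffix: str = ""                  # optional

    def __post_init__(self) -> None:
        # A scrape has an identifier or a regex, not both and not neither.
        if (self.identifier is None) == (self.regex is None):
            raise ValueError("give exactly one of identifier or regex")
```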
The syntax, as I said, is similar to XPath, but not the same, as it’s a bit more fuzzy. A child is denoted by “{”, a sibling is denoted by “-”, and a parent is denoted by “}”. So, if you want to find the second “tr” of a table with 2 data cells (and no tags inside the data cells), you would use “table{tbody{tr{td-td}tr}}”.
ClassNames are prefixed with “$” (I’d love to have used a period, but it’s a legal part of a className according to the XML spec, which I find baffling). idNames are prefixed with “#”. So, if that second “tr” was actually <tr id="uniqueRow" class="firstClass secondClass">, the match string would be “table{tbody{tr{td-td}tr$firstClass secondClass#uniqueRow}}”.
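To make the notation concrete, here's a rough Python sketch that serializes a tiny hand-built tree into a match string. It is not Hijack's code (that works on a WebKit DOM); the `Node` class and `tag_soup` function are made up for illustration. The rule for when to emit “-” is inferred from the two examples above: siblings get a “-” between them unless the previous sibling just closed with “}”.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element."""
    tag: str
    class_name: str = ""   # raw class attribute, e.g. "firstClass secondClass"
    id_name: str = ""
    children: List["Node"] = field(default_factory=list)


def tag_soup(node: Node) -> str:
    """Serialize a node as tag, then $classes, then #id, then {children}."""
    out = node.tag
    if node.class_name:
        out += "$" + node.class_name
    if node.id_name:
        out += "#" + node.id_name
    if node.children:
        out += "{"
        previous = ""
        for child in node.children:
            piece = tag_soup(child)
            # Assumed rule: no "-" right after a sibling that ended in "}".
            if previous and not previous.endswith("}"):
                out += "-"
            out += piece
            previous = piece
        out += "}"
    return out


table = Node("table", children=[
    Node("tbody", children=[
        Node("tr", children=[Node("td"), Node("td")]),
        Node("tr", class_name="firstClass secondClass", id_name="uniqueRow"),
    ]),
])
print(tag_soup(table))
# table{tbody{tr{td-td}tr$firstClass secondClass#uniqueRow}}
```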
If the match portion has an innerHTML, that is what gets scraped; otherwise, the outerHTML gets scraped. Unless, of course, you specified a regex, in which case the whole regex match is scraped, unless the regex has at least one parenthesized group, in which case the first parenthesized match is scraped.
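Those rules boil down to a short decision chain. Here's a Python sketch of it; the function name `scraped_value` and its parameters are assumptions, and so is running the regex against the outerHTML, since the post doesn't say which text the regex sees.

```python
import re
from typing import Optional


def scraped_value(inner_html: str, outer_html: str,
                  pattern: Optional[str] = None) -> Optional[str]:
    """Pick what gets scraped from a matched element (illustrative only)."""
    if pattern is not None:
        # Assumption: the regex is applied to the element's outerHTML.
        match = re.search(pattern, outer_html)
        if match is None:
            return None
        # With at least one group, the first group wins; else the whole match.
        return match.group(1) if match.re.groups >= 1 else match.group(0)
    # No regex: prefer innerHTML, fall back to outerHTML when it is empty.
    return inner_html if inner_html else outer_html
```

So an empty element like an `img` gets scraped as its outerHTML, and a regex such as `src="([^"]+)"` would pull out just the URL.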
Fun, huh? :)
—-
But you can pretty much ignore everything I just said because it’s all waaaay too brittle to be useful.
The tagSoup was a nice first attempt, but it’s no good for actual matching, as I discovered when adding this forum to the mix. This forum uses multiple class tags for a given tagName, which might be in any arbitrary order, so naive string matching is useless. tagSoup needs to go away.
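The class-order problem is easy to show. In the Python sketch below (an illustration, not anything Hijack does today), a token like “tr$firstClass secondClass#uniqueRow” is split into tag, class set, and id, so the order of the classes stops mattering. `parse_token` and `tokens_match` are hypothetical names, and the parsing assumes “$” and “#” never appear inside a name.

```python
from typing import FrozenSet, Tuple


def parse_token(token: str) -> Tuple[str, FrozenSet[str], str]:
    """Split 'tag$class1 class2#id' into (tag, set of classes, id)."""
    rest, _, id_name = token.partition("#")
    tag, _, classes = rest.partition("$")
    return tag, frozenset(classes.split()), id_name


def tokens_match(a: str, b: str) -> bool:
    """Compare two tokens while ignoring the order of class names."""
    return parse_token(a) == parse_token(b)


stored = "tr$firstClass secondClass#uniqueRow"
seen = "tr$secondClass firstClass#uniqueRow"
print(stored == seen)              # False: naive string matching fails
print(tokens_match(stored, seen))  # True: class order no longer matters
```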
Here is a Core Data schema that makes more sense. I began implementing this and got bogged down in writing an editor for it. I’ve since discovered XQuery, which supports regular expressions, which are completely vital (you’ll know why when you look at nassssty phpBB). That’s probably a better route to take.
Reskinning this is easy, just change the html/css in the “HTML” folder.
Finally, this code is the result of under 20 hours of work, done in a rush because I was annoyed at people telling me Hijack wasn’t feasible. So the code is pretty gross. Apologies for that, c’est la vie, yo!
And enjoy!

Jason Harris
ShapeShifter/Chicken of the VNC
Jason Harris has been coding up spiffiness and silliness for about ten years, working on such diverse projects as a solid-state quantum computing simulator for electron waves in GaAs semiconductors and a Monte Carlo simulator for electron transport in nanostructure devices. He also wrote insane, down-to-the-metal microcontroller assembly language code for Octofungi, a robotic sculpture. In the Mac world, he's the primary author of ShapeShifter, Mighty Mouse, ThemePark, and heads the open-source Chicken of the VNC and Paranoid Android projects. He digs mountain biking, skateboarding, art, martinis, loud music, and creating oddly euphonious phrases. He never wears shoes if he can help it and can dance like a mofo!