Tuesday, June 18, 2019

"Hacking" Websites for all their assets

Recently I re-watched the movie The Social Network. YouTube had been recommending me clips of the film for a while, so I finally broke down and watched it. One of my favourite scenes is the hacking montage, where Mark Zuckerberg uses some "wget magic" to download all the images off a website. I've recently been working on a project that crossed into wget territory, so I'll cover some of what I learned here.

Disclaimer: Misusing wget can get your IP banned from sites. Always check the robots.txt file.

Before I go any further, it is essential to understand the robots.txt file. Robots.txt tells robots (wget and other scrapers) visiting a site where they are allowed to go. By default, wget reads this file and skips the files and folders it is told to stay away from. This rule-following can be turned off with the flag -e robots=off, but it is considered proper etiquette to leave it on; I can only imagine how annoying it would be, as a web admin, to have someone continuously spam your site with wget.
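To make that rule-following concrete, here is a minimal sketch of a robots.txt check: a simplified prefix match against Disallow rules, not the full robots exclusion standard, and the robots.txt content below is invented for illustration (a real one would be fetched with something like `wget -qO- https://example.com/robots.txt`).

```shell
# Made-up robots.txt content for illustration
robots='User-agent: *
Disallow: /admin/
Disallow: /private/'

# A path we would like to crawl
path="/admin/panel.html"

# Walk the Disallow rules; if any rule is a prefix of our path,
# a polite robot (like wget by default) will stay away.
allowed=yes
while IFS= read -r line; do
  case "$line" in
    "Disallow: "*)
      rule="${line#Disallow: }"
      case "$path" in
        "$rule"*) allowed=no ;;
      esac
      ;;
  esac
done <<EOF
$robots
EOF

echo "$path allowed: $allowed"
```

Running this prints `/admin/panel.html allowed: no`, which is exactly why wget would skip that path unless you override it with -e robots=off.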

The Mission

My goal with this task was to collect sprite-maps for a later image recognition project. The trouble with anything related to image recognition (or machine learning more broadly) is that you need an extensive data set to get started. To solve that, I found a website that crowdsources sprite-maps from retro games. I went through a few sites, but figured that if I wanted to avoid problems with duplicate images, I'd have to stick to one. In the end, I settled on Spriters Resource because they had a lot of pictures, and I wouldn't be breaking their robots.txt.

The Scrape

Our target organizes its images by the console they originated from. This is handy because it means we can bound our retrieval and avoid downloading data we don't need. Once we pick a console of interest, we can use wget's recursive features to crawl down that console's file tree.

The command:

wget -nd -p -r -np -w 1 -A png https://www.spriters-resource.com/base-folder/

wget: The utility we're going to use for scraping. See the GNU wget manual for the full documentation.

-nd: Flattens the file hierarchy; any nested files we retrieve are placed in the current folder.

-p: Download all files referenced on a page; without this, we would only get pages containing references to the images.

-r: Enable recursive downloading; this makes wget crawl down the file structure.

-np: Never ascend to the parent directory when retrieving recursively; this makes sure we don't follow links to places we don't want to go.

-w 1: Wait one second between retrievals; this slows us down considerably but makes sure we aren't hammering the server.

-A png: Accepted file extensions; makes sure we only save PNG images.

https://www.spriters-resource.com/base-folder/: Our base URL to start at.

In the command above, replace the "base-folder/" text with the folder for any console on the site, and wget will retrieve all of that console's sprite-maps. Happy Scraping!
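Once a crawl finishes, it's worth sanity-checking what -A png actually kept. The sketch below stands in a throwaway directory with made-up filenames for a real download folder and counts the PNGs:

```shell
# Sanity-check a finished crawl: count the PNGs that -A png kept.
# A throwaway directory with invented filenames stands in for the
# real download folder here.
dir=$(mktemp -d)
touch "$dir/mario.png" "$dir/luigi.png" "$dir/index.html"

# Only count .png files; in a real run, wget with -A png deletes the
# HTML pages it fetched once it has parsed them for links.
count=$(find "$dir" -name '*.png' | wc -l)
echo "png files found: $count"

rm -rf "$dir"
```

The same one-liner, pointed at the actual download directory, gives a quick read on whether the crawl pulled in roughly the number of sprite-maps you expected.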
