Monday, October 28, 2019

Anonymous Summary Statistics, or How to protect nerds from bullies

School is out for the semester and it's time for report cards! Each student receives their report card in the mail and opens it to see how their semester went. The report card contains a list of subjects, each with the student's mark along with some summary statistics (average and standard deviation). Given the summary statistics, a student can determine how well they've performed relative to their peers and whether they are an exceptional student.

Now let's say that there are some bullies in the class. These bullies got their report cards and weren't too happy with the results. They were far below the class average! The bullies want to know who to direct their anger towards. They know some nerd got great marks and they want to know who that nerd is!

If you are the principal of the school and need to create classes, how can you do so to protect the nerds from bullies?

Let's go over some cases.

Case 1 - Home Schooled

 As if being homeschooled wasn't punishment enough

This case sucks. If you are the nerd in a home school of two, then the bully is going to know who you are very quickly. In a case like this, the only way to protect the nerd would be to forego sharing any summary statistics.

Case 2 - Well Mannered School

Let's suppose the school is largely full of good kids. If the frequency of bullies amongst the kids is low, then we don't need to take many precautions for the nerd's anonymity. As long as the well-mannered kids outnumber the bullies, there is little risk to the nerd.

Case 3 - Rough School

 oof
If the school is a rough one, then there are likely to be a lot of bullies. With a high frequency of bullies, class sizes would need to be large in order to ensure that the nerd has some other good-natured students in their class.

In reality most schools would fall into either case 2 or 3. The only issue is that in real life people aren't just bullies or well mannered; they have good days and bad days. This means that in any class, there won't be a static number of bullies.

Knowing all of this how can we design classes and report cards in order to protect nerds?

Security through obscurity

In this design some key information is kept from the bullies. If a group of bullies didn't know how many other students were in the class, they would be unable to make use of any of the summary statistics. By obfuscating some key information, the identity of the nerd is kept safe. In practice this is a poor security measure: key information that is static is likely to be eventually determined by the bullies.

Security through large classes

For this design we make very large class sizes in the hopes that not everyone is a bully. If classes are large enough, then the likelihood of every student except the nerd being a bully is low. This is a decent security measure, but it is imperfect, as large class sizes come with trade-offs of their own.

Security through fuzzy metrics

For this design we don't give students the whole picture. Instead of giving the average on a report card, we could provide a fuzzy metric: something like an arrow showing whether you are above or below average. If the class's marks are normally distributed, then we expect several students to be above average, protecting the identity of the nerd. This is a great security measure, as it protects anonymity without too large a trade-off.
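As a minimal sketch of a fuzzy metric, a report-card generator could reduce the average to a single direction like this (fuzzy_report is a made-up name, and counting ties as "above" is my own arbitrary choice):

```python
def fuzzy_report(mark, class_avg):
    """Report only which side of the class average a student landed on,
    without revealing the average itself."""
    return 'above average' if mark >= class_avg else 'below average'

print(fuzzy_report(88, 75))  # above average
print(fuzzy_report(60, 75))  # below average
```

Several students will typically see "above average", so no single report card singles out the nerd.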

Requirements to determine the nerd's marks

For the bullies to determine the marks of the nerd, all of the following conditions must hold:
• All students except one must be bullies
• All bullies must work together and share report cards communally
• All bullies must be in the same class as the nerd
• Bullies must know the total number of students
• Bullies must know who in the class isn't a bully
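These conditions aren't arbitrary: when they all hold, the nerd's mark falls straight out of the arithmetic of the average. A small sketch with made-up numbers (recover_nerd_mark is a hypothetical name):

```python
def recover_nerd_mark(class_avg, class_size, bully_marks):
    """With every student but one colluding, the average leaks the
    remaining student's mark: total = avg * n, so the nerd's mark is
    the total minus the marks the bullies already share."""
    assert len(bully_marks) == class_size - 1
    return class_avg * class_size - sum(bully_marks)

# Hypothetical class of four: three bullies scored 50, 55 and 60,
# and the report card says the class average is 65
print(recover_nerd_mark(65, 4, [50, 55, 60]))  # 95
```

Remove any one of the conditions above (an unknown class size, one more non-bully) and the subtraction no longer pins down a single student.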

Breaking Character

The example above can be used as a way to conceptualize adversarial actors in data science. If summary statistics are to be shared between untrusting parties in the presence of bad actors, then how can risk be quantified?

To my knowledge there is no magic formula for quantifying the threat of anonymity being broken. Only strong domain knowledge and an understanding of the risk avenues can lead to the design of safe products.

Tuesday, June 18, 2019

"Hacking" Websites for all their assets

Recently I re-watched the movie The Social Network. YouTube had been recommending me clips of the movie for a while, so I finally broke down and watched it. One of my favourite scenes from the movie is the hacking montage where Mark Zuckerberg uses some "wget magic" to download all the images off a website. I've recently been working on a project that crossed into the wget domain, so I'll cover some of what I learned here.

Disclaimer: Misusing wget can get your IP banned from sites. Always check the robots.txt file.

Before I get any further, it is important to understand the robots.txt file. Robots.txt serves as a way to tell robots (wget and other scrapers) visiting the site where they are allowed to go. By default wget will read this file and ignore files and folders it's told to stay away from. This rule-following behaviour can be turned off by specifying the flag -e robots=off. It is considered good etiquette to leave it on, though, as I can only imagine how annoying it would be to be a web admin and have someone use wget to spam you continuously.
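If you'd rather check a path programmatically before pointing wget at it, Python's standard library ships a robots.txt parser. A minimal offline sketch (the robots.txt content and URLs here are made up):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt like one you might fetch from a target site
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch('*', 'https://example.com/sprites/page1'))   # True
print(rp.can_fetch('*', 'https://example.com/private/secret'))  # False
```

In practice you'd call rp.set_url(...) and rp.read() to fetch the live file instead of parsing a string.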

The Mission

My goal with this task was to get sprite-maps that could be used in a later image recognition project. The issue with anything related to image recognition (or, more broadly, ML) is that you need a large dataset to get started. To overcome this issue I found a website that crowdsources sprite-maps from retro games. I went through a few sites but figured that if I wanted to avoid issues with duplication I'd have to stick to one. In the end I settled on Spriters Resource because they had a lot of pictures and I wouldn't be breaking their robots.txt.

The Scrape

Our target organizes its images by the console they originated from. This is handy because it means we can bound our retrieval to make sure we aren't getting any data we don't require. Once we figure out our console of interest, we can use wget's recursive features to crawl down the file tree for that console.

The command:

wget -nd -p -r -np -w 1 -A png https://www.spriters-resource.com/base-folder/

wget: The utility we're going to use for scraping. See docs here

-nd: Flattens the file hierarchy; any nested files we retrieve are placed in the current folder

-p: Download all files referenced on a page; without this we would just have references to images

-r: Retrieve recursively, following links downward from the starting page

-np: Bound wget to the parent directory when retrieving recursively; this makes sure we don't follow links to places we don't want to go

-w 1: Wait one second between retrievals; this slows us down a lot but makes sure we aren't spamming

-A png: Accepted file extensions; this makes sure we only save the pictures.

https://www.spriters-resource.com/base-folder/: Our base URL to start at

In the command above, if we replace the "base-folder/" text with any console hosted by the website, we can retrieve all of our resources. Happy scraping!

Friday, May 24, 2019

Do these numbers scare you as well?

 The gif in question
Recently, while browsing math-related Wikipedia pages, I came across a very interesting gif. The image was on the Wikipedia page for Pascal's Triangle and at first glance it seemed completely out of place. I thought it must have been some sort of last-minute edit to sneak a spooky gif into an otherwise technical Wikipedia page.

My first thought was that the gif was some sort of creepy mutating dome of retro video game sprites. Each frame looked eerily familiar; I felt like I'd seen this all before.

Attempting to learn more about the image proved to be an unfruitful endeavor. Reading the rest of the article and scanning the profile of the submitter yielded nothing. The article's edits page contained some discussion from the editors in which they inquired how the gif got there but nothing else.

Not only did the image seem out of place but it came alongside a very technical description that took me a while to wrap my head around. I'll include the description below along with a few images to try and show what it's trying to describe.

Each frame represents a row in Pascal's triangle. Each column of pixels is a number in binary with the least significant bit at the bottom. Light pixels represent ones and the dark pixels are zeroes.

 Pascals Triangle
 Binary representation of row 5
 Frame 5 of the gif
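To make the encoding concrete, here is a tiny sketch in plain Python that renders row 5 of the triangle ([1, 4, 6, 4, 1]) as text, with '#' for ones, '.' for zeroes, and the least significant bit on the bottom line:

```python
row = [1, 4, 6, 4, 1]  # row 5 of Pascal's triangle
height = max(n.bit_length() for n in row)  # 3 bits needed for this row

# Build the frame top-down: most significant bit on the first line,
# least significant bit on the last
lines = [''.join('#' if (n >> bit) & 1 else '.' for n in row)
         for bit in range(height - 1, -1, -1)]

print('\n'.join(lines))
# .###.
# ..#..
# #...#
```

Each column reads bottom-to-top as a binary number: the middle column is 110 from the bottom up, i.e. 6.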

I hope the images above speak to you, but if they don't do anything for you, then the entirety of this article can be simplified to something along the lines of "math used to make interesting images that look similar to retro video game sprites".

Once I had an idea as to what I was looking at, I set out to recreate the image in Python. I'll go over some of the more important parts of the code here. A link to the complete repo, with instructions for getting it running yourself, can be found on my GitHub.

We start by generating Pascal's Triangle. I got a lot of help from Stack Overflow here to get started, but modified the code to produce binary numbers.

```
def make_triangle(n_rows):
    """Return pascals triangle in binary.

    Keyword arguments:
    n_rows -- the number of rows in the triangle
    """
    # Helper to add binary numbers represented as strings
    def bin_add(*args):
        return bin(sum(int(x, 2) for x in args))[2:]

    results = []
    for _ in range(n_rows):
        row = [bin(1)]  # Append a binary 1 to the start of the row
        if results:  # If there are existing results (row > 1)
            last_row = results[-1]
            # "For each result in the last row, add it to its neighbour":
            # zip pairs the previous row with a copy offset by one element,
            # and bin_add(*pair) unpacks each pair into bin_add
            row.extend([bin_add(*pair) for pair in zip(last_row, last_row[1:])])
            row.append(bin(1))  # Append a binary 1 to the end of the row
        results.append(row)
    return results
```

Once we have Pascal's triangle generated, we can take a row of it and use it to generate a frame. This was one of those problems where, in retrospect, I could have saved a lot of time if I'd spent more time planning. Instead I jumped right in and chose the wrong data types to represent some of the elements.

If I had to redo it all from scratch I'd probably use a NumPy array to represent the screen. I believe some of the functionality in NumPy would have made it easier to deal with the cases where the row of Pascal's triangle was either very short or very long (the "if" statements in gen_frame below). Regardless, implementing it all with just str and list makes me appreciate the SciPy libraries even more.
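For what it's worth, a rough sketch of that NumPy approach might look like this (frame_numpy is a hypothetical rewrite, not code from the repo; it takes plain ints rather than binary strings and handles both the too-narrow and too-wide cases):

```python
import numpy as np

def frame_numpy(row_ints, width, height):
    """Render one row of Pascal's triangle as a 2D array of 0's and 1's,
    one column per number, most significant bit at the top."""
    # Extract `height` bits per number, LSB first, then flip so MSB is on top
    cols = [[(n >> i) & 1 for i in range(height)][::-1] for n in row_ints]
    arr = np.array(cols, dtype=np.uint8).T  # shape (height, len(row_ints))
    pad_l = (width - arr.shape[1]) // 2
    if pad_l < 0:  # row wider than the frame: trim both edges
        return arr[:, -pad_l:-pad_l + width]
    pad_r = width - arr.shape[1] - pad_l
    # Centre the row horizontally by padding with zero columns
    return np.pad(arr, ((0, 0), (pad_l, pad_r)))
```

The padding and trimming that the str/list version does with carefully computed list slices collapses into one np.pad call and one slice.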

```
from math import ceil, floor
from struct import pack

from PIL import Image

# SIZE (the frame's width and height in pixels) is a module-level
# constant defined in the full repo


def gen_frame(row, filename, frame_dim, interpol):
    """Return an image for a given row of pascals triangle."""
    frame = []
    # For each element in the row (a binary string), unpack it into a list of single 0's and 1's
    for binary_str in row:
        binary_str = ''.join(str(i) for i in binary_str if i.isdigit())
        binary_list = list(binary_str.zfill(SIZE[0]))
        binary_list = [int(i) for i in binary_list]
        # If the binary_list is longer than the output dimensions, trim off the LSBs
        if len(binary_list) > SIZE[0]:
            binary_list = binary_list[:SIZE[0]]
        # Append the binary_list to the frame
        frame.extend(binary_list)
    # If the row doesn't fill the frame, fill the blank space with 0's
    l_append = [0]*(floor((SIZE[0] - len(row))/2))*SIZE[1]
    r_append = [0]*(ceil((SIZE[0] - len(row))/2))*SIZE[1]
    canvas = l_append + frame + r_append
    # If the row exceeds the size of the frame, trim off the edges
    if len(canvas) > (SIZE[0]*SIZE[1]):
        offset = (len(frame) - SIZE[0]*SIZE[1])/2
        # Make sure the offset doesn't cause screen tearing
        if offset % SIZE[0] != 0:
            offset += SIZE[0]/2
        canvas = canvas[int(offset):]
    # Set image interpolation behaviour based on user input
    interpol = Image.LANCZOS if interpol else Image.NEAREST
    # Pack the frame into bytes and generate an image from them
    data = pack('B'*len(canvas), *[pixel*255 for pixel in canvas])
    img = Image.frombuffer('L', SIZE, data)
    img = img.rotate(-90)
    img = img.resize(frame_dim, interpol)
    img.save(filename)
```

Running the code above for x rows yields x images in your working folder. The next function thankfully uses a context manager and makes a temp folder to work in. This makes it easier to keep track of the images we've generated and lets us stitch them together into a gif more easily.

```
import os
import shutil
from re import search

from imageio import get_writer, imread


def gen_gif(n_rows, frame_rate, frame_dim, interpol):
    """Generate a gif with n_rows frames and the given frame timing.

    Keyword arguments:
    n_rows -- the number of rows in pascals triangle
    frame_rate -- the frame rate to generate the gif with
    frame_dim -- the dimensions to resize each frame to
    interpol -- flag to interpolate the image when upscaling to frame_dim resolution
    """
    triangle = make_triangle(n_rows)  # Generate pascals triangle of n_rows

    # Make a fresh temp folder and send all outputs to it
    temp_path = os.path.join(os.getcwd(), r'temp')
    if os.path.exists(temp_path):
        shutil.rmtree(temp_path)
    os.makedirs(temp_path)
    os.chdir(temp_path)

    # For each row in pascals triangle, generate a frame based off it
    for idx, row in enumerate(triangle):
        gen_frame(row, "frame_{0}.png".format(idx), frame_dim, interpol)

    # Generate the output gif from the frames generated above
    with get_writer('pascals_triangle_{}.gif'.format(n_rows), mode='I',
                    duration=frame_rate) as writer:
        filenames = [file for file in os.listdir() if file.endswith('.png')]

        # Sort files by the number embedded in the filename (e.g. frame_6.png -> 6)
        def sortby(x):
            return int(search(r'\d+', x).group())
        filenames.sort(key=sortby)

        for filename in filenames:
            image = imread(filename)
            writer.append_data(image)
```

Another interesting issue I ran into was sorting the images in the temp folder. The images all follow the pattern "frame_x.png", where we want to sort by x but the png extension is required. If we scan the directory and use the list's sort method with default arguments, our elements are sorted as strings. This leaves us with results looking like:

```
['frame_0.png', 'frame_1.png', 'frame_10.png', 'frame_100.png',
 'frame_101.png', 'frame_102.png', 'frame_103.png', 'frame_104.png',
 'frame_105.png']
```

The sorting we see above obviously isn't what we want. To sort properly we can pass our own custom sortby function into the list's sort method; you can see it defined inside gen_gif in the snippet above. This function uses a regular expression to pull the first run of digits out of the filename, and the resulting int is what the sort method compares (e.g. frame_10.png --> 10).
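The same trick in isolation, with a few hypothetical filenames:

```python
import re

filenames = ['frame_0.png', 'frame_1.png', 'frame_10.png',
             'frame_100.png', 'frame_2.png']

# Sort by the first run of digits in each name rather than lexically
filenames.sort(key=lambda name: int(re.search(r'\d+', name).group()))

print(filenames)
# ['frame_0.png', 'frame_1.png', 'frame_2.png', 'frame_10.png', 'frame_100.png']
```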

Everything gets wrapped together by the imageio library. We throw our images and some metadata at its get_writer class and then save the result under our filename. Below is my best attempt to recreate the original gif; aside from the noticeable difference in colour, it's good enough for me.

 My best attempt at recreation
TL;DR - Made gifs using Python. Check out the full code on the GitHub repo