Splash Search — How to do Image Search on 250k Images in 100ms

erik
500px Engineering Blog
5 min read · Aug 23, 2016


Hey! My name is Erik. I’m a Systems Design Engineering student at the University of Waterloo and a co-op on 500px’s innovation team. One of the things I got to work on this semester was a colour search tool we called Splash.

Every quarter, 500px holds a 2 to 3 day hackathon for all employees. This July, David, a co-op on our web team, and I decided to build what would eventually become Splash search. We quickly realized we would need support from our platform team, so we also enlisted Madigan, another co-op, from that team to help us out. With our team ready, we started hacking.

Sam the dog, © Jacob Willemsma

(Splash was originally called “Sam Search” after our coworker’s adorable Husky-German Shepherd, who walked in while we were trying to pick a name. It turns out we can’t send out a press release like that, so the name had to go.)

The Idea

Splash lets you sketch a (terrible) drawing of what you want on a canvas and sends it to our backend, which searches through tens of thousands of photographs from our marketplace, finds the most visually similar images, sorts them by similarity, and returns them. For example, if you wanted a very dark portrait of a person in the center of the frame, you’d normally have to search for “dark portrait” and scroll until you found the framing you wanted. We wanted to let you search by painting the canvas black, adding a dash of colour in the center, restricting the search to people, and seeing what you get.

The Math

The math we use for finding similar images is incredibly simple: resize all images and the user’s reference sketch to small squares, reshape the image and sketch matrices into 1D vectors, and take a weighted Euclidean distance (sort of) between each image and the sketch.

For comparing a sketch to a single image, the process looks like this:

  • Resize both the sketch and the picture to 16 by 16.
  • For each channel of each pixel, we square the difference between the sketch’s value and the image’s value.
  • We then weight that value by the alpha of the sketch at that pixel, so if the user didn’t draw anything in part of the canvas, whatever is there doesn’t count.
  • We do this for every pixel in our 16 by 16 grid and add up all those values.

Now normally, to get a Euclidean distance you’d take the square root of that sum. But we’re not interested in the actual distance, only a relative measure against other images, and skipping the square root preserves the relative ordering of distances (the square root is monotonic).

We compute this distance between the sketch and every single image, sort the images by which are “closest” to the sketch, and return the top 100. That’s it!
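The steps above can be sketched in Go. This is a simplified illustration under my own assumptions (type names, the exact alpha weighting, and the demo data are all made up for the example), not the production code:

```go
package main

import (
	"fmt"
	"sort"
)

const dim = 16 * 16 // sketch and images resized to a 16x16 grid

// Pixel holds the RGB channels plus the sketch's alpha
// (alpha 0 = the user never touched that part of the canvas).
type Pixel struct {
	R, G, B, A float64
}

// distance returns the alpha-weighted squared Euclidean distance between
// a sketch and an image. No square root: it is monotonic, and we only
// need the relative ordering of images, not true distances.
func distance(sketch, img [dim]Pixel) float64 {
	var sum float64
	for i := 0; i < dim; i++ {
		dr := sketch[i].R - img[i].R
		dg := sketch[i].G - img[i].G
		db := sketch[i].B - img[i].B
		// Weight by the sketch's alpha so blank canvas areas don't count.
		sum += sketch[i].A * (dr*dr + dg*dg + db*db)
	}
	return sum
}

// search ranks every image by its distance to the sketch and returns
// the indices of the k closest ones.
func search(sketch [dim]Pixel, images [][dim]Pixel, k int) []int {
	idx := make([]int, len(images))
	dists := make([]float64, len(images))
	for i, img := range images {
		idx[i] = i
		dists[i] = distance(sketch, img)
	}
	sort.Slice(idx, func(a, b int) bool { return dists[idx[a]] < dists[idx[b]] })
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

func main() {
	// Tiny demo: one bright image, one dark image, and an all-black sketch.
	var bright, dark, sketch [dim]Pixel
	for i := range sketch {
		bright[i] = Pixel{R: 0.9, G: 0.9, B: 0.9}
		dark[i] = Pixel{R: 0.1, G: 0.1, B: 0.1}
		sketch[i] = Pixel{A: 1.0} // black sketch, fully painted
	}
	fmt.Println(search(sketch, [][dim]Pixel{bright, dark}, 1)) // → [1] (the dark image)
}
```

In production you would return the top 100 instead of 1, but the shape of the computation is the same: one weighted squared distance per image, then a sort.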

The Hack

We chose to use Go for this hack (I wanted Python but I was outvoted…). Like any hackathon project, there were a number of things wrong with it, and it was amazing it worked at all: we were initially using 70 by 70 images that had to be loaded from disk every time the service started, our threading model was a little broken, and we hit our own rate limiter before we finished pulling the data we needed, so we only had a partial dataset. Somehow we got it working and won the prize ($200 to spend on photography gear) for the most innovative hack! The team decided that we should pursue the project further and release Splash to our community.

The Product

This meant we had to fix all of the small things wrong with it, get the UI done by one of our designers, get it to start up quickly, and generally make it not a messy collection of supporting Python and SQL scripts that no one else understood how to use. After a couple of weeks of work, including an all-nighter by Madigan, it was finally ready: starting up in seconds instead of minutes, a codebase readable by humans, no longer a nightmare to deploy, and looking great. We released it to the world.

At this point, we had changed the image comparison to use 16 by 16 vectors instead of 70 by 70, fixed our threading, and made a bunch of micro-optimizations that took our search from 500ms over 15k images to about 70ms over 50k images, a per-image speedup of roughly 23x. After a bit of experimentation, we found that you can reduce the vector size to about 8 by 8 without any significant loss in result quality, but we kept it at 16: by that point the bottleneck was RAM, and shrinking the vectors did not speed anything up.
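The post doesn’t show the threading fix itself, but the usual Go pattern for this kind of workload is to fan the per-image distance computations out across cores. A minimal sketch, assuming a chunked worker split (every name here is illustrative, not the production code):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// result pairs an image index with its distance to the sketch.
type result struct {
	index int
	dist  float64
}

// parallelDistances splits n images across one goroutine per CPU and
// computes dist(i) for each image. dist is any per-image distance
// function; each worker writes to its own slice range, so no locking
// is needed.
func parallelDistances(n int, dist func(i int) float64) []result {
	workers := runtime.NumCPU()
	results := make([]result, n)
	chunk := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start, end := w*chunk, (w+1)*chunk
		if end > n {
			end = n
		}
		if start >= end {
			break
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				results[i] = result{index: i, dist: dist(i)}
			}
		}(start, end)
	}
	wg.Wait()
	return results
}

func main() {
	// Demo distance: (i - 42)^2, so "image" 42 is the closest match.
	res := parallelDistances(1000, func(i int) float64 {
		d := float64(i - 42)
		return d * d
	})
	best := res[0]
	for _, r := range res {
		if r.dist < best.dist {
			best = r
		}
	}
	fmt.Println(best.index) // → 42
}
```

Sorting the `results` slice afterwards gives the ranked list; the sort is cheap next to the distance pass, so parallelizing only the distances captures most of the win.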

The hack searched through only about 15k images; in production we can get through 50k images quickly, with 5 categories to choose from.

The Reception

Shortly after we released Splash, the articles started coming out; nothing we couldn’t handle. But then it was posted to Reddit. Within a couple of hours it had made it to the front page, peaking at the second link from the top. At that point we were serving about 7000 concurrent connections, and we had to ramp the number of servers up from 2 to 18 to survive the hug of death. We made it without anything going down, and I finally get to say I built a thing that made the front page of Reddit.

(Traffic graph: guess when we got linked on Reddit?)

Working on this hack and bringing it to production was one of the most rewarding things I got to do this semester. The support we got from both our team and our community was fantastic, and it was great seeing the product make a Splash… I’ll show myself out.

We’re hiring!

If you’re interested in joining a startup that has the team and the community to make this kind of hackathon project possible, apply here!
