For almost 10 years, the Research Corporation for Science Advancement (RCSA) has tapped IDEO for a little data science help. It's a good match, because RCSA shares our enthusiasm for helping people meet and collaborate. Every spring and fall, they organize conferences called Scialogs that bring together scientists from different disciplines to converse on important topics and inspire further collaboration through grants.
At these conferences, attendees are split up into group sessions: some big, some small, some on specific topics, and some designed to create new connections. Though grouping folks together might seem like a straightforward task, there is actually a dizzying number of ways to put people together. For one recent conference, the number was in the neighborhood of 14 duodecillion possible combinations. That’s 1.4 x 10^40, which is close to the number you would get if you were to multiply the number of grains of sand on Earth by the number of stars in the universe.
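To get a feel for how quickly that number grows, here is a minimal sketch that counts the ways to split a room into equal-sized groups. The group size and head counts are made up for illustration; the real conference had sessions of varying sizes, so this is not how the 14 duodecillion figure was derived.

```python
from math import factorial


def count_groupings(n_people: int, group_size: int) -> int:
    """Count the ways to split n_people into unlabeled groups of equal size."""
    if n_people % group_size != 0:
        raise ValueError("n_people must be a multiple of group_size")
    n_groups = n_people // group_size
    # Start from all n! orderings, then divide out the orderings inside each
    # group and the orderings of the groups themselves.
    return factorial(n_people) // (
        factorial(group_size) ** n_groups * factorial(n_groups)
    )


if __name__ == "__main__":
    # Hypothetical head counts, all split into groups of five.
    for n in (10, 20, 30, 40):
        print(n, "people:", f"{count_groupings(n, 5):.2e}", "possible groupings")
```

Even this simplified version blows up fast, which is why checking every option by brute force is off the table.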
Fortunately, for data scientists, sifting through haystacks to find needles is our bread and butter. We looked at this dataset with one goal: to create groups of individuals that foster conversations and catalyze new cross-disciplinary research. Easy, right?
We built an optimization algorithm that scores combinations of people based on diversity, trying to create well-mixed groups. To make sure we had truly excellent combinations, we needed to run the algorithm more than 1,000 times, after which a human had to examine the top-scoring combinations. A human ultimately judging the top few options is crucial to making sure we incorporate nuances that the algorithm does not get. But at more than four minutes per algorithm run, with runs executed one after another, it took an average of three days to get results. Three days! Three days between prototypes wasn’t going to cut it.
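As a rough illustration of the idea, here is a minimal sketch of "score many candidate groupings, then hand the best few to a human." It is not our production algorithm: the scoring rule (count distinct disciplines per group), the random search, and every name in it (`diversity_score`, `random_grouping`, `top_groupings`) are assumptions chosen to keep the example small.

```python
import random
from typing import Dict, List

Grouping = List[List[str]]


def diversity_score(groups: Grouping, discipline: Dict[str, str]) -> int:
    """Toy score: total number of distinct disciplines across all groups."""
    return sum(len({discipline[person] for person in group}) for group in groups)


def random_grouping(people: List[str], group_size: int, rng: random.Random) -> Grouping:
    """Shuffle the attendees and slice them into consecutive groups."""
    shuffled = list(people)
    rng.shuffle(shuffled)
    return [shuffled[i:i + group_size] for i in range(0, len(shuffled), group_size)]


def top_groupings(discipline: Dict[str, str], group_size: int,
                  n_runs: int, k: int = 3, seed: int = 0) -> List[Grouping]:
    """Generate n_runs random candidates and keep the k best for human review."""
    rng = random.Random(seed)
    people = sorted(discipline)
    candidates = [random_grouping(people, group_size, rng) for _ in range(n_runs)]
    candidates.sort(key=lambda g: diversity_score(g, discipline), reverse=True)
    return candidates[:k]


if __name__ == "__main__":
    # Hypothetical attendees: three each from three disciplines.
    attendees = {f"{field}_{i}": field
                 for field in ("physics", "biology", "chemistry")
                 for i in range(3)}
    for grouping in top_groupings(attendees, group_size=3, n_runs=1000):
        print(diversity_score(grouping, attendees), grouping)
```

Each candidate is independent of the others, which is what makes this kind of search a good fit for running many copies at once.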
To solve our time problem, we turned to serverless computing. While we typically use it to deploy microservices, we couldn’t resist using it to massively scale our computational ability on the cheap. After all, we love hacking and repurposing things for atypical use cases, and digital tools are no different.
In a way, putting these groups together is like making chili—there are many different ways to make it, and all are valid (and many are tasty!). But to have a truly great chili, it’s essential that the ingredients are balanced and cooked properly to produce a dish that is elevated beyond the sum of its parts.
No humans were harmed in the making of this metaphor.
Think of each set of groups produced by the algorithm as a chili recipe. In the spirit of experimentation, you start putting together recipes that vary randomly in terms of ingredient amounts and cooking procedure. You’ve written down 1,000 recipes, and you’re going to cook and score them to find your ideal chili.
After a day of chopping, sautéing, simmering, and tasting, you’ve only made 10 batches of chili. That's great, but in the grand scheme of things, that's nothing. So you think of ways you could cook and test your recipes more quickly. The most basic way is to make your cooking process (or code for your algorithm) more efficient: Chop faster and keep ingredients nearby. But this yields only a marginal gain. You would only crank out one or two more batches a day.
To really speed things up, you’re going to need some help. You could rent space in an industrial kitchen and hire chefs (servers on the cloud) who take a recipe, follow it exactly, and deliver the finished product to you. The downside: It takes a non-trivial amount of time and money to recruit and train chefs (much like setting up servers with the algorithm code).
But what if you had an army of 100 food trucks that are fully stocked, equipped, and happy to help you? That’s the equivalent of our serverless computing service. Serverless computing allows us to simply upload and then execute our algorithm code on cloud servers, taking away the hassle of setting up machines and allowing us to scale to hundreds or thousands of simultaneous runs effortlessly.
Food truck army: go forth and chili.
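In code, the food truck pattern looks roughly like the sketch below: one small function that performs a single run, plus a dispatcher that fires off many runs at once and collects the scores. This is a local simulation under stated assumptions. The `handler(event, context)` shape follows the common convention for serverless functions, a thread pool stands in for the cloud, and the "run" is a placeholder that just draws a random score. In a real deployment, each call would be an invocation of the uploaded function through the provider's SDK.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Optional


def handler(event: Dict[str, Any], context: Optional[Any] = None) -> Dict[str, Any]:
    """One food truck: take a recipe (a seed), cook it, report the score."""
    rng = random.Random(event["seed"])
    # Placeholder for a real optimization run (hypothetical).
    score = rng.random()
    return {"seed": event["seed"], "score": score}


def fan_out(n_runs: int, max_workers: int = 100) -> List[Dict[str, Any]]:
    """Dispatch n_runs independent runs in parallel and rank the results."""
    events = [{"seed": seed} for seed in range(n_runs)]
    # Threads stand in for simultaneous serverless invocations.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(handler, events))
    return sorted(results, key=lambda r: r["score"], reverse=True)


if __name__ == "__main__":
    ranked = fan_out(1000)
    print("top 3 runs for human review:", ranked[:3])
```

Giving each run its own seed keeps the runs independent and reproducible, so any promising result can be regenerated later.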
By using serverless tech, analyzing Scialog results goes from a process that runs in about three days to one that takes less than 10 minutes. With pricing schemes that only charge for the time the servers are being used, we can complete our thousand runs for less than $5.
This is great news, because it allows us to prototype our algorithms at scale much faster. What if, after looking at the results, we realize our scoring is missing a specific kind of input, like the balance of cat people vs. dog people? Previously that would necessitate another three-day wait for results. Now, we can tweak our algorithm, upload the new code, grab a cup of coffee (or chili), and come back to analyze the results.
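The time and cost figures above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below assumes runs execute in equal-length waves and that billing is proportional to memory multiplied by run time. The memory size and the per-GB-second rate are illustrative assumptions, not figures from this project, so check your provider's current pricing before relying on them.

```python
import math


def wall_clock_minutes(n_runs: int, minutes_per_run: float, concurrency: int) -> float:
    """Total elapsed time when runs execute in waves of `concurrency`."""
    waves = math.ceil(n_runs / concurrency)
    return waves * minutes_per_run


def pay_per_use_cost(n_runs: int, seconds_per_run: float,
                     memory_gb: float, usd_per_gb_second: float) -> float:
    """Cost when you are billed only for memory x time actually used."""
    return n_runs * seconds_per_run * memory_gb * usd_per_gb_second


if __name__ == "__main__":
    sequential = wall_clock_minutes(1000, 4, concurrency=1)
    parallel = wall_clock_minutes(1000, 4, concurrency=1000)
    print(f"one at a time: {sequential / 60 / 24:.1f} days")
    print(f"all at once:   {parallel:.0f} minutes, plus upload and startup overhead")

    # Assumed values: 1 GB of memory per run and a placeholder rate.
    cost = pay_per_use_cost(1000, 240, memory_gb=1.0, usd_per_gb_second=0.0000166667)
    print(f"estimated cost: ${cost:.2f}")
```

Under these assumptions the numbers land in the same range as our experience: roughly three days shrinks to minutes, for a few dollars.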
Serverless computing is worth adding to your toolbox—or recipe box.
We love the creative freedom that serverless computing enables: the ability to iterate quickly on prototypes and, especially, to get tons of computing power on demand, and cheaply.
If you’re interested in doing something similar, we’ve had great experiences using Zappa to deploy to AWS Lambda, but you can do the same kinds of things with other serverless services like Google Cloud Functions and Microsoft Azure Functions.
Now go forth and prototype at scale. And if you get hungry, you know what to do.