Taming Big Data - Interview with Udi Falkson of iknow.io

These days it's very easy to bump into term "big data". What is it like to develop a service around it? That is what we are about to find out as Udi Falkson of iknow.io will tell a bit about his story and some of the technology choices they have made.

Hi, @udi. Can you tell us something about yourself?

Hi. I'm Udi, the co-founder and head of product at iknow.io. Earlier in my career, I was one of the first engineers working on Yahoo! Answers and I'm also the creator of isitnormal.com.

Can you describe what big data is about? How does iknow.io relate to it?

The amount of raw, available data out there in the world is growing exponentially. Sites like data.gov and others are giving the public all kinds of great data to play with. However, no good tools have emerged to make all this data accessible and useful to the majority of people that really care about it. This is what we're changing with iknow.io.

Today, unless you're a computer programmer or a scientist, working with anything more complicated than what will fit into a simple spreadsheet is not possible. Most people are shut out of the game. We are taking these complex data-sets and making them useful for anyone to explore and analyze. We do this by sourcing, merging and cleaning the data and providing a new breed of intuitive data analysis and comparison tools for people to use.

We currently have detailed data about Movies and US Congress. NBA data is coming very soon, and many more verticals will be added in the future.

What kind of technical challenges have you encountered during its development?

What we are doing is extremely technically challenging. Our amazing back-end team has built systems from the ground-up to scalably handle extremely large datasets of drastically varying structures. To account for the different ways that data can be organized (for example, movie data and congressional data have very different structures), we have built our systems around a proprietary graph database. To complement our graph database, we make heavy use of existing technologies, like Redis, Mongodb, Sphinx Search and Postgres.

The biggest challenge we've encountered is that data is messy, really messy. This fact led us to build a suite of in-house data cleanup and matching tools that have enabled us to efficiently load, update and organize these complex data sets and provide them to our users in a seemingly simple format that hides the real complexity from them.

On the front-end, our data analysis tools are rather advanced and it's taken our very small front-end team (just me) quite a bit of work to get them working nicely. There are also a few particular UI touches that required quite a bit of effort to get working across all browsers, such as our data results table view that has persistent horizontal and vertical scrolling headers.

You mentioned you have chosen to use Backbone on the front-end side. The framework has received some criticism lately. New options, such as Angular and Ember.js have arisen. Why did you choose Backbone? Are you happy with your choice?

Most of our front-end code is standard server-side Python (Django). For our more complex and interactive data analysis tools, we opted to provide our users with a more responsive, client-side experience. We used Backbone.js for this, and the client-side UI communicates with our query engine via REST apis. I selected Backbone after a lot of research and I'm more or less happy with it. It's relatively light-weight and there's very little magic, but I have run into a few small bugs and the code can get messy if you're not careful. A Backbone program will be only as good as the programmer writing it, because it does very little work for you. What it does do is give you just enough of a structure to work with so you can keep things manageable and maintainable.

That said, I'm very interested in Angular and Ember and have read a lot about them. When I started building iknow.io nearly a year ago both of these other frameworks were not quite as mature and I didn't feel comfortable going with either one. If I were to start over again today, I think I'd give them both serious consideration and I'm leaning towards Angular for my next such project.

Javascript is being used more and more on the server, and it has become tempting to use it for the whole stack to be able to share code between the front-end and back-end. In our case, Python is so strong when it comes to number crunching (we make heavy use of Pandas), that it didn't make sense for us to do so, but I see this becoming more and more common in the future and I know a few people that are building sites this way today.

Given JSter is a JavaScript catalog, can you please list some of your favorite libraries?