The next adventure

I am extremely excited to announce that next week I will be joining Project Florida as their Head of Data. Project Florida is a NYC-based hardware/software startup working to harness an expansive breadth of data to improve health outcomes for individuals and help people better understand their health. I will be leading and growing a team of world-class data scientists and engineers to build parts of the core product. The team at Project Florida already comprises an incredible mass of talent from across the tech, product, and health care industries. It is a real privilege to have the opportunity to work with and learn from them.

Along with the obvious ego-massaging benefit of writing a blog post about my new job, I also wanted to take this opportunity to discuss how I arrived here. Over the last few years I have had the pleasure of sitting down with several men and women from many different industries -- at different points in their careers -- to discuss the data industry and how they might fit in. It has taken me the better part of a year to make my own decision; and insofar as I was able to give those folks good advice, and to the extent my own personal experience can be informative to others, I want to highlight why I am so excited about Project Florida.

Health data

A few weeks ago I was on the Data for Good panel at the Strata Conference in Santa Clara, CA. In my closing remarks I said something that was captured by Max Richman on Twitter:

This happens to be one of those rare instances where the benefit of hindsight does not make me regret something said flippantly on a panel. I deeply believe that in order to truly change the world we cannot simply "throw analytics at the problem." To that end, the medical and health industries are perhaps the most primed to be disrupted by data and analytics. To be successful, however, a deep respect for both the methodological and clinical contexts of the data is required.

It is incredibly exciting to be at an organization that is not only working within the current framework of health care and data to create new insight for people, but also pushing the envelope with respect to individuals' relationships with their own health. The challenges are technical, sociological, and political; but the potential for innovation that exists in this space comes along very rarely.

I feel lucky to have an opportunity to move into the health data space now.

Sensor data

The past decade of development in "big data" has -- in large part -- been built on top of the need to understand web log files. Somewhere between Web 1.0 and Web 2.0 people began to realize that there was a tremendous amount of underutilized value in these log files. This spawned an entire big data ecosystem, and a whole new set of hardware and software tools to support it. Arguably, data science as a discipline and profession was also an offspring of this movement.

We have built technology and algorithms to understand the Web, and we have done a great job. Innovation in this space, however, is now focused on further abstracting away the technical detail in order to deliver analysis further up the business ladder. That is to say, we have by and large solved the web log file big data problem, and are now trying to make it easier for everyone to participate. But, we have only just begun to even conceive of the scope of the sensor network big data problem.

I believe the next decade of exciting work for data scientists and engineers will be in creating an ecosystem around sensor data. By liberating the in-bound bytes from the Web, we have at once an entirely new class of questions that can be asked, and a new class of hardware and software problems that must be solved.

This is particularly relevant to those considering starting a career, or making a pivot, into the data industry.

Strength of team

In my time at IA Ventures I have learned an enormous amount about the dynamics of technology startups, and what can contribute to their success. While this will not surprise those with startup experience, it bears repeating that the strength of a startup's team is at least as important as the strength of its technology.

At Project Florida, the team that has been pulled together -- and continues to be assembled -- is one of the strongest I have ever seen. For those considering other opportunities, I would strongly recommend weighing the people you will be working with just as much as the problem you will be working on.

Of course, I would be remiss to conclude without giving a plug to make the Project Florida team even stronger. I will be looking to add folks to the data team immediately, and we are looking to grow the whole organization. If interested, please feel free to reach out to me directly about opportunities on the data team, or ping WeAre@projectfla.com for information about opportunities throughout the organization.

Here's to the next adventure!

WARNING: Do not feed the wildebeests

Back in May I gave a short talk to the Data Driven NYC meetup entitled, "WARNING: Do not feed the wildebeests."

The focus of the talk was to encourage those in a position to hire data analysts and data scientists to consider social scientists, rather than the "PhD in math, statistics, computer science, or equivalent" that has become the de facto requirement for data science job listings.

Stand up and take the pledge: I will not feed the wildebeests!

Methods for Collecting Large-scale Non-expert Text Coding

I finally got around to uploading a paper on the research I have been doing for the past year on using large-scale, non-expert coders to derive data from political text. Enjoy!

Methods for Collecting Large-scale Non-expert Text Coding

The task of coding text for discrete categories or quantifiable scales is a classic problem in political science. Traditionally, this task is executed by qualified "experts". While productive, this method is time consuming, resource intensive, and introduces bias. In the following paper I present the findings from a series of experiments developed to assess the viability of using crowd-sourcing platforms for political text coding, and how variations in the collection mechanism affect the quality of output. To do this, the labor pool available on Amazon's Mechanical Turk platform was asked to identify policy statements and positions from a text corpus of party manifestos. To evaluate the quality of the non-expert codings, this text corpus was also coded by multiple experts for comparison. The evidence from these experiments shows that crowd-sourcing is an effective alternative means of generating quantitative categorization from text. The presence of a filter on workers increases the quality of output, but variations on that filter have little effect. The primary weakness of the non-experts participating in these experiments is their systematic inability to identify texts that contain no policy statement.
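The core aggregation idea in the abstract -- several non-expert codings per text, compared against an expert "gold" coding -- can be sketched with a simple majority vote. Everything below is illustrative: the sentences, labels, and codings are invented, not the paper's data.

```python
# Hypothetical sketch: aggregate crowd codings by majority vote and
# measure agreement with an expert coding. All data here is made up.
from collections import Counter

def majority_code(codings):
    """Return the most common label among a list of non-expert codings."""
    return Counter(codings).most_common(1)[0][0]

# Each text unit receives several crowd labels (invented examples).
crowd = {
    "s1": ["economy", "economy", "social"],
    "s2": ["none", "social", "social"],
    "s3": ["economy", "economy", "economy"],
}
# The expert coding of the same units, used as the gold standard.
expert = {"s1": "economy", "s2": "social", "s3": "economy"}

# Fraction of units where the crowd majority matches the expert.
agreement = sum(majority_code(v) == expert[k]
                for k, v in crowd.items()) / len(crowd)
```

In practice the paper's experiments vary the collection mechanism (e.g. worker filters), but the comparison step reduces to something like this agreement score.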

Programming language trends

Ever since I started investigating the popularity of programming languages by looking at their usage on GitHub and StackOverflow, I have wanted to improve on the original scatter-plot.

A couple of months ago I experimented with a slope-graph, and did some basic cluster analysis to determine popularity tiers for the languages. I liked the tiers, but the visualization seemed to muddle things more than enhance them. I actually really like the simplicity of the original scatter-plot, but I wished I could get more information on language trends from the chart.

I have been toying with this for a few weeks, and am finally ready to add this project to The Lab. This new visualization is much closer in content to the original, but in this version you can explore the chart interactively. Along with showing more clearly where each language sits, it also shows individual trends among the languages. Of course, I have only been collecting data for a few weeks, so there are not many distinct trends yet.

The chart updates weekly, and if people are interested in the data for their own experimentation just ping me and I can get it to you. Also, I am always open to suggestions on how to make these kinds of visualizations clearer, so if you have ideas let me know.