I had the opportunity to spread the good word about data for good to the folks at Informatica World this year. Many thanks to the organizers for giving me a chance to speak about a topic very dear to me.
I am extremely excited to announce that next week I will be joining Project Florida as their Head of Data. Project Florida is a NYC-based hardware/software startup working to harness an expansive breadth of data to improve health outcomes for individuals and help people better understand their health. I will be leading and growing a team of world-class data scientists and engineers to build parts of the core product. The team at Project Florida already comprises an incredible mass of talent from across the tech, product, and health care industries. It is a real privilege to have the opportunity to work with and learn from them.
Along with the obvious ego-massaging benefit of writing a blog post about my new job, I also wanted to take this opportunity to discuss how I arrived here. Over the last few years I have had the pleasure of sitting down with several men and women from many different industries -- at different points in their careers -- to discuss the data industry and how they might fit in. It has taken me the better part of a year to make my own decision; and insofar as I was able to give those folks good advice, and to the extent my own personal experience can be informative to others, I want to highlight why I am so excited about Project Florida.
This happens to be one of those rare instances where the benefit of hindsight does not make me regret something said flippantly on a panel. I deeply believe that in order to truly change the world we cannot simply "throw analytics at the problem." To that end, the medical and health industries are perhaps the most primed to be disrupted by data and analytics. Success, however, requires a deep respect for both the methodological and clinical contexts of the data.
It is incredibly exciting to be at an organization that is both working within the current framework of health care and data to create new insight for people and pushing the envelope with respect to individuals' relationships with their own health. The challenges are technical, sociological, and political; but the potential for innovation that exists in this space comes along very rarely.
I feel lucky to have an opportunity to move into the health data space now.
The past decade of development in "big data" has -- in large part -- been built on top of the need to understand web log files. Somewhere between Web 1.0 and Web 2.0 people began to realize that there was a tremendous amount of underutilized value in these log files. This spawned an entire big data ecosystem, and a whole new set of hardware and software tools to support it. Arguably, data science as a discipline and profession was also an offspring of this movement.
We have built technology and algorithms to understand the Web, and we have done a great job. Innovation in this space, however, is now focused on further abstracting away the technical detail in order to deliver analysis further up the business ladder. That is to say, we have by and large solved the web log file big data problem, and are now trying to make it easier for everyone to participate. But we have only just begun to conceive of the scope of the sensor network big data problem.
I believe the next decade of exciting work for data scientists and engineers will be in creating an ecosystem around sensor data. By liberating the in-bound bytes from the Web, we have at once an entirely new class of questions that can be asked, and a new class of hardware and software problems that must be solved.
This is particularly relevant to those considering starting a career, or making a pivot, into the data industry.
Strength of team
In my time at IA Ventures I have learned an enormous amount about the dynamics of technology startups, and what can contribute to their success. While this will not surprise those with startup experience, it bears repeating that the strength of a startup's team is at least as important as the strength of its technology.
At Project Florida, the team that has been pulled together -- and continues to be assembled -- is one of the strongest I have ever seen. For those considering other opportunities, I would strongly recommend weighing the people you will be working with just as much as the problem you will be working on.
Of course, I would be remiss to conclude without giving a plug to make the Project Florida team even stronger. I will be looking to add folks to the data team immediately, and we are looking to grow the whole organization. If interested, please feel free to reach out to me directly about opportunities on the data team, or ping WeAre@projectfla.com for information about opportunities throughout the organization.
Here's to the next adventure!
Back in May I gave a short talk to the Data Driven NYC meetup entitled, "WARNING: Do not feed the wildebeests."
The focus of the talk was to encourage those in a position to hire data analysts and data scientists to consider social scientists, rather than the "PhD in math, statistics, computer science, or equivalent" that has become the de facto requirement in data science job listings.
Stand up and take the pledge: I will not feed the wildebeests!
I finally got around to uploading a paper on the research I have been doing for the past year on using large-scale, non-expert coders to derive data from political text. Enjoy!
Methods for Collecting Large-scale Non-expert Text Coding
The task of coding text for discrete categories or quantifiable scales is a classic problem in political science. Traditionally, this task is executed by qualified "experts". While productive, this method is time consuming, resource intensive, and introduces bias. In the following paper I present the findings from a series of experiments developed to assess the viability of using crowd-sourcing platforms for political text coding, and how variations in the collection mechanism affect the quality of output. To do this, the labor pool available on Amazon's Mechanical Turk platform was asked to identify policy statements and positions from a text corpus of party manifestos. To evaluate the quality of the non-expert codings, this text corpus was also coded by multiple experts for comparison. The evidence from these experiments shows that crowd-sourcing is an effective alternative means of generating quantitative categorizations from text. The presence of a filter on workers increases the quality of output, but variations on that filter have little effect. The primary weakness of the non-experts participating in these experiments is their systematic inability to identify texts that contain no policy statement.