The Intentional Data Scientist: Three Ways to Get Started
by Daniel Chudnov
Hello there! How nice to see you again. After a few years away, Dick Kaser graciously welcomed me back to CIL to share some of what I’ve learned about the burgeoning field of data science. This month, and again in an upcoming issue, I’ll offer some concrete steps you can take to dip a toe in and maybe dive right into this field.
Where have I been, and what do I know about data science? I recently completed a master’s degree in the business analytics program at the George Washington University School of Business, right across the street from where I worked as an IT director at George Washington University’s Gelman Library from 2011 until recently. In this program—a new one, similar to many across the country, and soon to enter its 4th year—we studied core subjects (including probability and statistics, forecasting, data mining, decision and risk modeling, and optimization). Electives included data analysis, data visualization, geospatial statistics, and social network analysis.
The goal of this program, and the training I completed, is simple: to make better decisions, based on data, despite uncertainty. Perhaps the potential value implied by that goal is self-evident to you, but perhaps not. Having spent nearly 2 decades in research libraries, though, I am confident many of you share my experience of having seen a lot of big organizational decisions made based on a mix of intuition, anecdotal experience, trendiness, and personal preferences—few, if any, big decisions are made based on strong statistical evidence.
The uncertainty part should speak for itself. Many of our institutions face a far less certain budget than 20 years ago when I first came up as a librarian. We don’t know how new technologies, our churning politics, and public perception will shift next. So if we’re going to make changes in the services we provide and collections we build, we should base decisions about them on the best evidence we can assemble. We often have data about the services we provide, but fail to make that data available, up-to-date, integrated with related sources, and usable for analysis. And few within our profession understand statistical methods for performing analyses reliably.
Realizing this, it seems clear that the biggest obstacle between me and a practical skill set for data-based decisions was a foundational grounding in the theory and practice of data science. Put simply, data science combines the traditional science of statistics with programming and domain expertise. Drew Conway’s Data Science Venn Diagram shows this combination succinctly.
Most of you reading this are like me, in that we share substantive expertise in our domains. Whether in one or several of the traditional library areas (such as technical services, public services, access services, collections, or the IT side of the house), we know our work pretty well, even to the point of having a good handle on the short-term and long-term trends shifting it over the years. I was primed to become a data scientist because I also had hacking skills—I’ve been working as a software developer throughout my career, so that covers two out of the three key areas. But if you haven’t noticed it yet, take another look at that diagram. Can you see my problem?
I embodied the danger zone. According to Conway, I knew enough to be dangerous. Although I had formal training in mathematics and statistics, much of that training occurred during the first Bush administration. Following tutorials, I could have run a linear regression in R, but I couldn’t have explained its statistical assumptions. Similarly, mimicking examples, I could have created a machine-learning model in Python, but I wouldn’t have had the faintest clue how one modeling technique compared with dozens of others. This knowledge gap meant that any attempt to make decisions based on my own statistical analysis would have a good chance of being weak, incomplete, or wrong.
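That gap is easy to illustrate. Fitting a least-squares line really is just a few lines of code; what tutorials rarely convey are the assumptions behind it (a linear relationship, independent errors, constant variance, roughly normal residuals). A minimal sketch in Python, with data invented purely for illustration:

```python
# Ordinary least squares by hand: slope = cov(x, y) / var(x).
# The data here is invented purely to show the mechanics.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form estimates for y = slope * x + intercept
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

print(slope, intercept)
# Running the fit says nothing about whether the residuals are
# independent, homoscedastic, or normal -- the assumptions that
# make any inference from the fit valid.
```

Anyone can run these lines; deciding whether the resulting line means anything is the statistics.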
Surely, we can’t all be data scientists, just as we can’t all be processing archivists, original catalogers, instructional designers, systems librarians, or resource sharing specialists. With that in mind, for those of you wanting to learn about data science—whether to understand just enough to be a more well-rounded professional, a better manager, and a more grounded decision maker, or to build practical skills as I’ve done, or something in between—I suggest three distinct paths.
Dipping a Toe In
If you just want to get a taste for what data science is all about, you’re in luck. With a few well-chosen readings, you can grow your vocabulary, dig into some basic concepts, and learn what to look for from colleagues and studies you might come across, whether they are works of formal scholarship, reportage, or internal work projects. Amy Affelt’s The Accidental Data Scientist (published by Information Today, Inc.) offers a broad overview geared toward librarians, with pointers to useful tools, anecdotes of important milestones in data-driven work in industry, and a friendly explanation of a lot of useful jargon. Its introduction is by Thomas H. Davenport, a scholar and best-selling author of several recent books on analytics, along with a number of influential pieces in Harvard Business Review (HBR). If you want to go a step further, any of his titles or HBR articles will give you a deeper look into the mindset of those applying data science in the workplace. Finally, the best and most practical guide to making data-driven decisions not requiring substantial training in statistics is Douglas Hubbard’s How to Measure Anything. I wish this one book were required reading in library and information science (LIS) and iSchool programs.
All of these titles are approachable and readable. Some contain a touch more business-speak than you might be used to, but don’t let that distract you—each is worth your time and attention.
Going for a Swim
If you want to incorporate active work with data into your current position or future opportunities, you’re going to need more than a few good books. Assess where you stand in relation to those three circles: domain knowledge, hacking skills, and mathematics and statistics. I’ll stick with my assertion that you probably have substantial expertise already, so how are your hacking skills? Did you get through algebra and calculus, and have you had some training in probability and statistics? If your answers are anything more than “I have no hacking skills” and “I never got past high school trigonometry,” then you’ve got a good chance at building some data chops.
The first bit of advice I’ll offer is to give yourself time. Unless you’re an experienced programmer trained in advanced mathematics, you’ll likely need at least a refresher. You might even require a little schooling and a year or two to build yourself up. Fortunately, there are accessible and affordable options at the ready.
Why will you need hacking skills? (If you don’t recognize the positive connotation of “hacking,” think of it constructively as “solving problems with code and whatever other tools are at the ready,” rather than anything illegal or wrong.) You’ll need hacking skills mostly to wrangle data. It’s axiomatic that “80% of data science is data wrangling.” Wrangling comprises extracting, transforming, mutating, folding, spindling, and otherwise mutilating data from one format into another, from one system into another, from multiple sources into one or vice versa, and countless variations on these themes.
All data is produced with some context, and it is rare to find that the questions you hope to answer and the decisions you need to make with data share that context 100%. Even when you get what looks like “a clean spreadsheet,” you’ll find impossible values, inconsistent value formats, strange formatting, and the like. For my money, the best tools to pick for data wrangling are the UNIX shell, Python or R, and SQL. If you already know at least one of these, you’re ahead of the game.
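A minimal sketch of what that cleanup looks like in practice, using Python’s standard csv module (the column names and values below are invented for illustration):

```python
import csv
import io

# A "clean-looking" spreadsheet often hides problems like these:
# stray whitespace, "n/a" placeholders, impossible negatives,
# and exact duplicate rows. (Values invented for illustration.)
raw = """branch,visits
Main,1204
East, 987
West,n/a
Main,1204
South,-50
"""

rows = []
seen = set()
for row in csv.DictReader(io.StringIO(raw)):
    visits = row["visits"].strip()
    if not visits.isdigit():        # drops "n/a" and impossible negatives
        continue
    key = (row["branch"].strip(), visits)
    if key in seen:                 # drops exact duplicates
        continue
    seen.add(key)
    rows.append({"branch": row["branch"].strip(), "visits": int(visits)})

print(rows)
```

Five input rows survive as two usable ones, which is about the ratio wranglers come to expect.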
If at least one of these is new to you, there’s a great set of learning resources available through Software Carpentry (software-carpentry.org/lessons), with extensive lessons for starting each of them from scratch. I’ve based lectures and assignments on these lessons for graduate students with no prior background in the UNIX shell or SQL, and the material was praised by everyone in class. That quality holds throughout the lessons, which are constantly being refined and improved.
If you want to go a level deeper with R, check out the Data Science Specialization on Coursera from Johns Hopkins University (coursera.org/specializations/jhu-data-science). I worked through several of these courses during my program (in which SAS, a great but expensive and proprietary product, tended to be favored over the open source R in some classes), and learned a great deal. I use R with RStudio (rstudio.com), a free software development environment that makes working with R easy. If you want to go deeper with Python, read Jacqueline Kazil and Katherine Jarmul’s Data Wrangling With Python, which will give you a full education in wrangling and beef up your Python skills along the way.
When working on data projects with Python, I use Anaconda (continuum.io/why-anaconda), an easy-to-use packaging system for a large number of Python software libraries for data. Among these libraries are my favorites, csvkit (csvkit.readthedocs.org/en/latest), a Swiss army knife for working with CSV files, and Jupyter Notebook (jupyter.org), a prose-plus-code system I use every day. I hope to write more about Jupyter in an upcoming article.
If you feel you need to refresh or pick up your math and statistics skills, the best place to start is Khan Academy (khanacademy.org). I relied on its brief, easy-to-follow lessons to fill gaps during school. If its style doesn’t match what you need, there are many other free and good massive open online course (MOOC) offerings through Coursera, edX, and Udacity, as well as fee-based courses from Statistics.com. And there’s always the old-fashioned approach: in-person classes at your local college or university. While not absolutely essential, a good handle on the basics of algebra, linear algebra, and calculus will help you. It is essential, however, to understand the basics of probability and statistics. You don’t have to know how to derive or prove formulas, but your ability to conduct and think critically about data projects will be severely hampered without a good grounding in the basics. If you can wrap your head around descriptive statistics, conditional probability, confidence intervals, p-values, and simple regressions, you’ll be off to a good start. If signing up for a formal class or even completing a MOOC is too big a barrier, go read Larry Gonick and Woollcott Smith’s The Cartoon Guide to Statistics. It’s a silly, fun read, and they do a good job of covering the basics.
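As a taste of those basics, descriptive statistics and a rough confidence interval need nothing beyond Python’s standard library. (The sample below is invented, and the z = 1.96 normal approximation is a simplification; a t-distribution would be more exact for a sample this small.)

```python
import statistics

# Invented sample: daily reference-desk questions over two weeks
sample = [12, 15, 9, 14, 11, 13, 16, 10, 12, 14, 13, 11, 15, 12]

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # sample standard deviation

# Rough 95% confidence interval for the mean, using the normal
# approximation (z = 1.96); a t-value would be more exact here.
margin = 1.96 * sd / n ** 0.5
low, high = mean - margin, mean + margin

print(f"mean = {mean:.2f}, 95% CI roughly ({low:.2f}, {high:.2f})")
```

Being able to say not just “the average is about 12.6” but “and here is how uncertain that estimate is” is exactly the habit of mind these basics buy you.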
Data and information visualization is a whole topic unto itself and one of my favorites. Rather than go on about tools and techniques, I’ll suggest you need at least one modern visualization tool in your toolkit, so pick one and get started. Susan Gardner Archambault’s “Telling Your Story: Using Dashboards and Infographics for Data Visualization” (April 2016, Computers in Libraries) offers tips for several good starting points.
If you’re committing this far, I’ll make two more concrete suggestions. Read Foster Provost and Tom Fawcett’s Data Science for Business, the single best overview I’ve read of the principles of applied data science and data mining in particular. It introduces a number of foundational topics with lucid prose and engaging examples. It should lead you deeper into data mining and machine learning, the fastest-growing and probably the most widely visible applications of data science. Finally, although you might not guess that listening to people talking about data is exciting, there are several great podcasts about data science. Data Stories (datastori.es), PolicyViz (policyviz.com/the-policyviz-podcast), and the O’Reilly Data Show (radar.oreilly.com/tag/oreilly-data-show-podcast) are three of my favorites. They cover a wide range of topics, and their respective guests include engaging storytellers and designers and world-class researchers and engineers.
The High Dive
The real plunge here is to start a program similar to the one I recently finished. Whether it’s data science, applied statistics, or business analytics, most of these master’s degree and certification programs have a similar curriculum. Many are based on North Carolina State University’s analytics program, which is highly regarded and one of the first of its kind. I’ll be honest, that second master’s burned me out, but now that I’m on the other side, it seems as if a whole new world is opening up. I can recall several projects that could have been much better if I’d known some of what I’ve learned recently, and the possibilities for innovative new work using these new skills seem to appear daily. I don’t know many colleagues who’ve chosen this path—there are just a few of us so far—so there’s still time to be a trailblazer.
What we know for sure is that the volume and variety of data in our work and the world around us won’t grow smaller anytime soon. If you have a real interest in understanding the possibilities data science offers, you can get started now on any of the three paths I’ve suggested. You might start down one path and change course, but I’d encourage you to engage in the introductory material listed and open yourself up to the possibility that there might be data work—and data-based decisions—in your future.