We data scientists love to create exciting data visualizations and insightful statistical models. However, before we get to that point, usually much effort goes into obtaining, scrubbing, and exploring (in other words: crunching) the required data.
The command line, although invented decades ago, is an amazing environment for performing such data science tasks. By combining small, yet powerful, command-line tools you can quickly explore your data and hack together prototypes. New tools such as parallel, jq, and csvkit allow you to use the command line for today's data challenges. Even if you're already comfortable processing data with, for example, R or Python, being able to also leverage the power of the command line can make you a more efficient data scientist.
This hands-on workshop is based on the book Data Science at the Command Line, written by instructor Jeroen Janssens. In one day, we'll cover the following topics through several real-world use cases:
- Setting up the Data Science Toolbox
- Essential tools and concepts of the Unix command line
- Obtaining data from logs, web APIs, databases, and spreadsheets
- Filters such as cut, grep, sed, and awk
- Scraping websites using curl, scrape, xml2json, and jq
- Parallelizing and distributing data-intensive pipelines using GNU Parallel
- Executing R one-liners and SQL queries directly on CSV data
- Turning existing code, such as Python or R, into reusable command-line tools
- Computing aggregate statistics
- Creating data visualizations
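To give a flavor of two of these topics (turning existing Python code into a reusable command-line tool, and computing aggregate statistics), here is a minimal sketch. It is not taken from the book or the workshop materials; the function names and the CSV output layout are assumptions made purely for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical filter: read one number per line from standard input
and write count, mean, and median as a two-line CSV to standard output."""
import statistics
import sys


def summarize(lines):
    """Return (count, mean, median) for the numeric lines, skipping blank ones."""
    values = [float(line) for line in lines if line.strip()]
    if not values:
        # No data: report a count of zero and leave the statistics empty.
        return 0, None, None
    return len(values), statistics.mean(values), statistics.median(values)


def main(stdin=None, stdout=None):
    """Behave like a Unix filter: stdin in, stdout out, so it composes with pipes."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    row = summarize(stdin)
    stdout.write("count,mean,median\n")
    stdout.write(",".join("" if v is None else str(v) for v in row) + "\n")


if __name__ == "__main__":
    main()
```

Saved under a name of your choosing (say, summarize.py), such a script could sit at the end of a pipeline, for example: seq 10 | python3 summarize.py. Because it only reads standard input and writes standard output, it combines with tools such as cut and grep like any other filter.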
This workshop is aimed at data scientists, data engineers, data journalists, and everyone else who has an affinity with data. We will make use of the Data Science Toolbox, which is a free, open-source virtual environment that contains all the necessary command-line tools. The Data Science Toolbox runs not only on Linux, but also on Mac OS X and Microsoft Windows, so everybody is able to follow along. Whether you're entirely new to the command line or already dreaming in shell scripts, by the end of this workshop you will have a solid understanding of how to integrate the command line into your data science workflow.
Jeroen Janssens is an assistant professor of data science at Tilburg University. As an independent consultant and trainer, Jeroen helps organizations make sense of their data. Previously, he was a data scientist at Elsevier in Amsterdam and the startups YPlan and Outbrain in New York City. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. He is the author of Data Science at the Command Line, published by O’Reilly Media. He blogs at jeroenjanssens.com and tweets as @jeroenhjanssens.