Crunch Data Engineering and Analytics Conference, Budapest, October 29-31, 2018

CRUNCH is a use-case-heavy conference for people interested in building the finest data-driven businesses. No matter the size of your venture or your job description, you will find exactly what you need at the two-track CRUNCH conference. A data engineering and a data analytics track will serve diverse business needs and levels of expertise.

If you are a Data Engineer, Data Scientist, Product Manager or simply interested in how to utilise data to develop your business, this conference is for you. No matter the size of your company or the volume of your data, come and learn from the biggest players in Big Data, get inspiration from their practices, their successes and their failures, and network with other professionals like you.

29
October
CONFERENCE DAY #1, MONDAY

The day will start at 9AM and the last talk will end around 6PM. After the sessions there will be a Crunch party at the conference venue.

30
October
CONFERENCE DAY #2, TUESDAY

The day will start at 9AM and the closing ceremony will end around 6PM.

31
October
WORKSHOP DAY

Our full-day workshops will be announced soon. You need to buy separate workshop tickets to attend them.


Location

Meet Budapest, a really awesome city

Here are a few reasons why you need to visit Budapest

MAGYAR VASÚTTÖRTÉNETI PARK

BUDAPEST, TATAI ÚT 95, 1142

The Magyar Vasúttörténeti Park (Hungarian Railway History Park) is Europe’s first interactive railway museum, located at a railway station and workshop of the Hungarian State Railways. There are over a hundred vintage trains, locomotives, cars and other types of railroad equipment on display, including a steam engine built in 1877, a railcar from the 1930s and a dining car built in 1912 for the famous Orient Express.

On the conference days there will be direct Crunch trains in the morning from Budapest-Nyugati Railway Terminal to the venue, and in the evening from the venue back to Budapest-Nyugati Railway Terminal, so we recommend finding a hotel near Nyugati station.


Speakers

David Leonhardt

Op-Ed columnist at The New York Times
Data Journalism at The New York Times
Intermediate
Business Analytics · Data Journalism

The use of data has changed the coverage of politics, climate, sport and many other issues at The New York Times. The organization's data visualizations have become central to its strategy of attracting a large base of paying subscribers in recent years. David Leonhardt, a writer and editor who has worked on many data projects in recent years, explains how they have improved the quality of Times journalism.

Bio

David Leonhardt is an Op-Ed columnist for The New York Times, and he writes a daily e-mail newsletter called Opinion Today. His prior assignment was leading a strategy group that helped Times leadership shape the future of the newsroom. David was previously the Washington bureau chief and the founding editor of The Upshot, a Times section covering politics and policy, often through data visualization. In 2011, he received the Pulitzer Prize for commentary. David serves on the board of the American Academy of Political and Social Science and on the jury of the Aspen Prize for Community College Excellence. He studied applied mathematics at Yale and is a third-generation native of New York.

Wes McKinney

Director of Ursa Labs, PMC for Apache Arrow
Apache Arrow: A Cross-language Development Platform for In-memory Data
Intermediate
Data Engineering · In-Memory Data Processing · Open Source

This talk discusses the Apache Arrow project and its uses for high-performance analytics and system interoperability.
Data processing systems have historically been full-stack systems featuring memory management, IO, file format adapters, a runtime memory format, an in-memory query engine, and front-end user interfaces. Many of these components are fully "bespoke" or "custom", in part due to a lack of open standards for many of the pieces.
Apache Arrow was created by a diverse group of open source data system developers to define open standards and community-maintained libraries for high performance in-memory data processing. Since the beginning of 2016, we have been building a cross-language development platform for data processing to help create systems that are faster, more scalable, and more interoperable.
I discuss the current development initiative and future roadmap as it relates to the data science and data engineering worlds.
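
For a concrete feel of what a shared in-memory columnar format buys you, here is a minimal sketch using the pyarrow and pandas packages; it is an illustration, not material from the talk.

```python
# Minimal sketch (assumes the pyarrow and pandas packages are installed);
# illustrates the shared columnar format the talk is about.
import pandas as pd
import pyarrow as pa

# A pandas DataFrame with a few columns
df = pd.DataFrame({"user": ["a", "b", "c"], "clicks": [10, 3, 7]})

# Convert to an Arrow Table: a language-independent, columnar in-memory layout
table = pa.Table.from_pandas(df)
print(table.schema)      # Arrow-level schema shared across languages

# Convert back to pandas; for many column types this avoids copying data
df_again = table.to_pandas()
print(df_again)
```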

Bio

Wes McKinney is an open source software developer focusing on data processing tools. He created the Python pandas project and has been a major contributor to many other OSS projects. He is a Member of the Apache Software Foundation and a project PMC member for Apache Arrow and Apache Parquet. He is the director of Ursa Labs, an innovation lab for open source data science tools powered by Apache Arrow.

Juliet Hougland

Data Platform Engineering Manager at Stitch Fix
Enabling Full Stack Data Scientists
Intermediate
Data Engineering · Infrastructure

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. Data Scientists are expected to build their systems end to end and maintain them in the long run. We rely on automation, documentation, and collaboration to enable data scientists to build and maintain production services. In this talk I will discuss the platform we have built and how we communicate about these tools with our data scientists.

Bio

Juliet Hougland leads a team that builds data science infrastructure at Stitch Fix. She is a data scientist and engineer with expertise in computational mathematics and years of hands-on machine learning and big data experience. She has built and deployed production ML models, advised Fortune 500 companies on infrastructure and worked on a variety of open source projects (Apache Spark, Scalding, and Kiji) at the intersection of big data and machine learning. She has worked at Cloudera as well as two tiny (now defunct) startups.

Thomas Dinsmore

Senior Director for DataRobot
The Path to Open Data Science
General
Data Science · Open Source · Culture

There’s a lot of talk about machine learning and AI. Every day, it seems, there is a new story about some remarkable thing powered by machine learning, such as diagnosing cancer, finding oil and gas, writing music, or detecting plant diseases. According to the Harvard Business Review, machine learning is already changing business, improving the quality of work, and making prediction cheaper.


Open data science is the critical capability that makes it possible for organizations to apply machine learning and artificial intelligence at scale. Working data scientists prefer to use open source software, such as Python, R, and Apache Spark, for many reasons. These include comprehensive functionality, flexibility, extensibility, transparency, and innovation.


However, many organizations have a large footprint of legacy analytics software. Executives in these organizations struggle to manage the growing cost of this software and to encourage users to adopt open source tooling.


Migration to open data science is challenging for several reasons. Existing users of legacy software often have strong personal preferences and resist switching. Programs written with legacy software must be rebuilt in new tools. Data may be siloed within the legacy platform. Complicating matters, commercial software vendors use community-building techniques to cultivate loyalty among end users.


Nevertheless, we see organizations successfully transition to a culture of open data science. This makes it possible for us to identify a series of transitional steps for organizations. These include understanding user needs; aligning software to needs; eliminating data silos; code migration; and training users on new tools.


We close the presentation with a discussion of keys to success in building an open data science culture. They include such things as executive leadership, cost transparency, and clear metrics of user adoption and success with open data science tools.

Bio

Thomas W. Dinsmore is a Senior Director for DataRobot, an AI startup based in Boston, Massachusetts, where he is responsible for competitor and market intelligence. Thomas’ previous experience includes service for Cloudera, The Boston Consulting Group, IBM Big Data, and SAS. Thomas has worked with data and machine learning for more than 30 years. He has led or contributed to projects for more than 500 clients around the world, including AT&T, Banco Santander, Citibank, CVS, Dell, J.C.Penney, Monsanto, Morgan Stanley, Office Depot, Sony, Staples, United Health Group, UBS, Vodafone, and Zurich Insurance Group. Apress published Thomas’ book, Disruptive Analytics, in 2016. Previously, he co-authored Modern Analytics Methodologies and Advanced Analytics Methodologies for FT Press and served as a reviewer for the Spark Cookbook. He posts observations about the machine learning business on his personal blog at thomaswdinsmore.com.

Matt Digan

Executive Director - Data Engineering, The New York Times
Moving to a House in the Clouds - Our New Data Infrastructure and Lessons We Learned
General
Data Engineering · Infrastructure · Cloud · Case Study

The New York Times data engineering team revitalized its data warehouse with cloud services. We built a new home for our customer data on Google Cloud Platform and moved out of our 15-year-old data warehouse. Our data engineers are now able to focus on data flow, data quality, and data governance instead of operations, and analysts and data scientists can enjoy less complicated interactions with large datasets. There were challenges with managing a long project with many system and process dependencies, changing the skills and composition of the team, and keeping the old system functioning well while building the new one. Establishing and staying within our budget during development and while in production was also a concern. Our story will be of interest to those who are considering a move to new technologies, languages, and skills, as well as others who are looking to gain flexibility and take advantage of innovations in cloud services.

Bio

Matt Digan leads the Data Engineering group at The New York Times. The group runs the infrastructure that collects and supplies fresh, clean, and trustworthy data to the entire company and enables other teams to do the same.

Tyler Akidau

Software engineer at Google
Foundations of streaming SQL or: How I learned to love stream and table theory
Intermediate
Data Engineering · Architecture · Stream Processing

What does it mean to execute robust streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing conceptually, or different? And how does all of this relate to the programmatic frameworks we’re all familiar with? This talk will address all of those questions in two parts, providing a survey of core points from chapters 6 and 8 in the recently published Streaming Systems book.
First, we’ll explore the relationship between the Beam Model for data processing (as described in The Dataflow Model paper and the Streaming 101 and Streaming 102 blog posts) and stream & table theory (as popularized by Martin Kleppmann and Jay Kreps, amongst others, but essentially originating out of the database world). It turns out that stream & table theory does an illuminating job of describing the low-level concepts that underlie the Beam Model.
Second, we’ll apply our clear understanding of that relationship towards explaining what is required to provide robust stream processing support in SQL. We’ll discuss concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, compare to other offerings such as Apache Kafka’s KSQL and Apache Spark’s Structured streaming, and talk about new ideas yet to come.
In the end, you can expect to have a much better understanding of the key concepts underpinning data processing, regardless of whether that data processing is batch or streaming, SQL or programmatic, as well as a concrete notion of what robust stream processing in SQL looks like.
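
As a rough illustration of the stream-to-table idea discussed above (not code from the talk), the following sketch uses the Apache Beam Python SDK to window a small, made-up stream of timestamped events and aggregate it per key.

```python
# Minimal sketch (assumes the apache-beam package is installed); it turns a
# stream of events into windowed, per-key aggregates ("tables") with the Beam model.
import apache_beam as beam
from apache_beam.transforms import window

events = [
    # (user, score, event-time in seconds since epoch) -- made-up data
    ("alice", 3, 0.0),
    ("bob", 5, 10.0),
    ("alice", 2, 70.0),   # falls into the next one-minute window
]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        # Attach event-time timestamps so windowing has something to group on
        | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
        # Group the stream into one-minute event-time windows
        | beam.WindowInto(window.FixedWindows(60))
        # Per-key, per-window aggregation: the "stream to table" step
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```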

Bio

Tyler Akidau is a software engineer at Google, where he is the technical lead for the Data Processing Languages & Systems group, responsible for Google's Apache Beam efforts, Google Cloud Dataflow, and internal data processing tools like Google Flume, MapReduce, and MillWheel. He's also a founding member of the Apache Beam PMC. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the Streaming Systems book from O'Reilly, the 2015 Dataflow Model paper, and the Streaming 101 and Streaming 102 articles. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Tim Berglund

Senior Director of Developer Experience at Confluent
Kafka as a Platform: the Ecosystem from the Ground Up
General
Data Engineering · Architecture · Data Pipeline · Kafka

Kafka has become a key data infrastructure technology, and we all have at least a vague sense that it is a messaging system, but what else is it? How can an overgrown message bus be getting this much buzz? Well, because Kafka is merely the center of a rich streaming data platform that invites detailed exploration.


In this talk, we’ll look at the entire open-source streaming platform provided by the Apache Kafka and Confluent Open Source projects. Starting with a lonely key-value pair, we’ll build up topics, partitioning, replication, and low-level Producer and Consumer APIs. We’ll group consumers into elastically scalable, fault-tolerant application clusters, then layer on more sophisticated stream processing APIs like Kafka Streams and KSQL. We’ll help teams collaborate around data formats with schema management. We’ll integrate with legacy systems without writing custom code. By the time we’re done, the open-source project we thought was Big Data’s answer to message queues will have become an enterprise-grade streaming platform.
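
To make the "lonely key-value pair" starting point concrete, here is a minimal sketch using the confluent-kafka Python client against a broker assumed to be running on localhost:9092; the topic name and records are made up, and this is not material from the talk.

```python
# Minimal sketch (assumes a Kafka broker on localhost:9092 and the
# confluent-kafka Python package); shows the basic producer/consumer APIs.
from confluent_kafka import Producer, Consumer

TOPIC = "page-views"  # hypothetical topic name

# Produce a few key-value records; keys determine which partition a record lands in
producer = Producer({"bootstrap.servers": "localhost:9092"})
for user, url in [("alice", "/home"), ("bob", "/pricing")]:
    producer.produce(TOPIC, key=user, value=url)
producer.flush()

# Consume them back; consumers sharing a group.id split the partitions between them
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-readers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
for _ in range(10):
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(msg.key(), msg.value())
consumer.close()
```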

Bio

Tim is a teacher, author, and technology leader with Confluent, where he serves as the Senior Director of Developer Experience. He can frequently be found speaking at conferences in the United States and all over the world. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to Distributed Systems, and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at http://timberglund.com, is the co-host of the http://devrelrad.io podcast, and lives in Littleton, CO, USA with the wife of his youth and their youngest child, the other two having mostly grown up.

Ananth Packkildurai

Senior data engineer at Slack
Operating data pipeline using Airflow @ Slack
Advanced
Data Engineering · Architecture · Case Study · Data Pipeline · Airflow

Slack is a communication and collaboration platform for teams. Our millions of users spend 10+ hours connected to the service on a typical working day. The Slack data engineering team's goal is simple: drive up the speed, efficiency, and reliability of making data-informed decisions for engineers, people managers, salespeople, and every Slack customer. Airflow is the core system in our data infrastructure for orchestrating our data pipeline. We use Airflow to schedule Hive/Tez, Spark, Flink and TensorFlow applications, and it helps us manage our stream processing, statistical analytics, machine learning, and deep learning pipelines. About six months ago, we started an on-call rotation for our data pipeline to adopt what we learned from the devops paradigm. We found several Airflow performance bottlenecks and operational inefficiencies hidden behind ad-hoc pipeline management. In this talk, I will speak about how we identified Airflow performance issues and fixed them. I will talk about our experience as we strive to resolve our on-call nightmares and make the data pipeline simpler and more pleasant to operate, and about the hacks we did to improve alerting and visibility of our data pipeline. Though the talk is tuned towards Airflow, the principles we applied for data pipeline visibility engineering are more generic and can be applied to any tool or data pipeline.

Bio

I work as a Senior Data Engineer at Slack, managing core data infrastructure like Airflow, Kafka, Flink, and Pinot. I love talking about all things ethical data management.

Abhishek Tiwari

Staff Software Engineer at LinkedIn, Apache Gobblin PPMC / Committer
Stream and Batch Data Integration at LinkedIn scale using Apache Gobblin
Intermediate
Data Engineering · Data Integration · Case Study · Open Source

This talk will discuss how Apache Gobblin powers stream and batch data integration at LinkedIn for use cases such as: ingestion of 300+ billion Kafka events daily, storage management of several petabytes of data on HDFS, and near-real-time processing of thousands of enterprise customer jobs.

Bio

Abhishek Tiwari is a Committer and PPMC member of Apache Gobblin (incubating). He is the Tech Lead for Data Integration Infrastructure at LinkedIn. Before joining LinkedIn, he had worked on building Amazon CloudSearch service at AWS, platform for Watson supercomputer at Nuance, Hadoop infrastructure at Yahoo, and web architecture for several million monthly users at AOL.

Jon Morra

Vice President of Data Science at Zefr
Clustering YouTube: A Top Down & Bottom up Approach
Intermediate
Data Science · Clustering · Case Study

At ZEFR we know that when an advertisement on YouTube is relevant to the content a user is watching it is a better experience for both the user and the advertiser. In order to facilitate this experience we discover billions of videos on YouTube and cluster them into concepts that advertisers and brands want to buy to align with their particular creatives. To serve our clients we use two different clustering strategies, a top down supervised learning approach and a bottom up unsupervised learning approach. The top down approach involves using human annotated data and a very fast and robust machine learning model deployment system that solves problems with model drift. Our clients are also interested in discovering topics on YouTube. To serve this need we use unsupervised clustering of videos to surface clusters that are relevant. This type of clustering allows ZEFR to highlight what users are currently interested in. We show how using Latent Dirichlet Allocation can help to solve this problem. Along the way we will show some of the tricks that produce an accurate unsupervised learning system. This talk will touch on some common machine learning engines including Keras, TensorFlow, and Vowpal Wabbit. We will also introduce our open source Scala DSL for model representation, Aloha. We show how Aloha solves a key problem in a typical data scientist's workflow, namely ensuring that feature functions make it from the data scientist's machine to production with zero changes.
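
As a toy illustration of the LDA-based, bottom-up clustering mentioned above (not ZEFR's actual pipeline), here is a minimal scikit-learn sketch on a few made-up "video metadata" documents.

```python
# Minimal sketch (assumes scikit-learn is installed); a toy example of
# LDA-style topic modelling, not ZEFR's production system.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in "video metadata" documents (made up for illustration)
docs = [
    "soccer goal highlights world cup",
    "guitar lesson chords for beginners",
    "world cup final penalty shootout",
    "learn guitar scales and chords",
]

counts = CountVectorizer().fit(docs)
X = counts.transform(docs)

# Two topics for this toy corpus
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words per topic
terms = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {top}")
```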

Bio

Jon Morra is the Vice President of Data Science at Zefr, a video ad-tech company. His team's main focus is on figuring out the best videos to deliver to Zefr's clients to optimize their advertising campaign objectives based on the content of the videos. In this role he leads a team of data scientists who are responsible for extracting information from both videos and our clients to create data-driven models. Prior to Zefr, Jon was the Director of Data Science at eHarmony where he helped increase both the breadth and depth of data science usage. Jon holds a B.S. from Johns Hopkins and a Ph.D. from UCLA, both in Biomedical Engineering.

Szilard Pafka

Chief Scientist at Epoch USA
Better than Deep Learning: Gradient Boosting Machines (GBM)
Advanced
Data Science · Machine Learning · Gradient Boosting Machines

With all the hype about deep learning and "AI", it is not well publicized that for the structured/tabular data widely encountered in business applications it is actually another machine learning algorithm, the gradient boosting machine (GBM), that most often achieves the highest accuracy in supervised learning tasks. In this talk we'll review some of the main GBM implementations available as R and Python packages (such as xgboost, h2o and lightgbm), discuss some of their main features and characteristics, and see how tuning GBMs and creating ensembles of the best models can achieve the best prediction accuracy for many business problems.
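
For readers who have not used a GBM package before, here is a minimal sketch of the kind of fit-and-evaluate workflow the talk compares, using xgboost's scikit-learn API on synthetic data; the parameters are illustrative, not tuned, and this is not the talk's benchmark code.

```python
# Minimal sketch (assumes the xgboost and scikit-learn packages are installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import xgboost as xgb

# Synthetic tabular data standing in for a typical business dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# A few of the usual tuning knobs: number of trees, depth, learning rate, subsampling
model = xgb.XGBClassifier(
    n_estimators=300, max_depth=6, learning_rate=0.1, subsample=0.8
)
model.fit(X_tr, y_tr)

print("test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```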

Bio

Szilard studied Physics in the 90s and obtained a PhD by using statistical methods to analyze the risk of financial portfolios. He worked in finance, then more than a decade ago moved to become the Chief Scientist of a tech company in Santa Monica, California doing everything data (analysis, modeling, data visualization, machine learning, data infrastructure etc). He is the founder/organizer of several meetups in the Los Angeles area (R, data science etc) and the data science community website datascience.la. He is the author of a well-known machine learning benchmark on github (1000+ stars), a frequent speaker at conferences (keynote/invited at KDD, R-finance, Crunch, eRum and contributed at useR!, PAW, EARL etc.), and he has developed and taught graduate data science and machine learning courses as a visiting professor at two universities (UCLA in California and CEU in Europe).

Ajay Gopal

Chief Data Scientist at Deserve, Inc
"Full-Stack" Data Science with R
Intermediate
Data Science · R · Full-Stack Data Science

In the past 5 years, there has been a rapid evolution of the ecosystem of R packages and services. This enables the crossover of R from the domain of statisticians to being an efficient functional programming language that can be used across the board for data engineering, analytics, reporting and data science. I'll illustrate how startups and medium-size companies can use R as a common language for

  1. engineering functions such as ETL and creation of data APIs,
  2. analytics through scalable real-time reporting dashboards and
  3. the prototyping and deployment of ML models.
Along the way, I'll specifically identify open-source tools that allow scalable stacks to be built in the cloud with minimal budgets. The efficiency gained enables small teams of R programmers & data scientists to provide diverse lateral intelligence across a company.

Bio

Ajay is a California resident, building his second FinTech Startup Data Science team as Chief Data Scientist at Deserve. Before that, he built the data science & digital marketing automation functions at CARD.com – another CA FinTech Startup. In both roles, he has built diverse teams, cloud data science infrastructures and R&D/Prod workflows with a "full stack" approach to scalable intelligence & IP generation. Ajay holds a PhD in physical chemistry and researched bio-informatics and graph theory as a post-doc before transitioning to the startup world.

Jeffrey Theobald

Staff Engineer at Zendesk
Machine Learning: The Journey to Production
Intermediate
Data Science · Machine Learning · Data Product · TensorFlow

Simply building a successful machine learning product is extremely challenging, and just as much effort is needed to turn that model into a customer-facing product. As we cover the journey of Zendesk’s article recommendation product, we’ll discuss design challenges and real-world problems you may encounter when building a machine learning product at scale. We’ll talk in detail about the evolution of the machine learning system, from individual models per customer (using Hadoop to aggregate the training data) to a universal deep learning model for all customers using TensorFlow, and outline some of the challenges we faced while building the infrastructure to serve TensorFlow models. We'll also explore the complexities of seamlessly upgrading to a new version of the model and detail the architecture that handles the constantly changing collection of articles that feed into the recommendation engine.

Bio

Jeffrey Theobald is a Staff Engineer at Zendesk, a customer support company that provides a myriad of solutions to help its customers improve their relationships with their end users. He has been working in data processing for around 9 years, across several companies, in several languages from Python and Ruby through bash to C++ and Java. He has used Hadoop since 2011 and has built analytics and batch processing systems as well as data preparation tools for machine learning. When not stressing about data correctness, he enjoys hiking and recently climbed Kilimanjaro. He believes that talks should be entertaining as well as informative and has tried to promote interesting and unusual talks about software engineering by organising the talk series Software Art Thou (www.softwareartthou.com).

Jacek Laskowski

Spark, Kafka, Kafka Streams Consultant, Developer and Technical Instructor
Deep Dive into Query Execution in Spark SQL 2.3
Advanced
Data Engineering · Spark SQL · Query Optimization

If you want to get even slightly better performance out of your structured queries (regardless of whether they are batch or streaming), you have to peek at the foundations of the Dataset API, starting with QueryExecution. That’s where any structured query ends up, and that's where my talk starts. The talk will show you what stages a structured query has to go through before execution in Spark SQL. I’ll be talking about the different phases of query execution and the logical and physical optimizations. I’ll show the different optimizations in Spark SQL 2.3 and how to write one yourself (in Scala).
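
As a quick way to see the stages the talk walks through, the snippet below (a sketch in PySpark rather than the Scala used in the talk) prints the parsed, analyzed and optimized logical plans plus the physical plan of a simple structured query.

```python
# Minimal sketch (assumes a local PySpark installation); shows where to look
# at the plans discussed in the talk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("plans").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "user_id")
query = (df.filter("user_id % 2 = 0")
           .groupBy((df.user_id % 10).alias("bucket"))
           .count())

# Prints the logical plans (parsed, analyzed, optimized) and the physical plan
query.explain(extended=True)

spark.stop()
```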

Bio

Jacek Laskowski is an independent consultant, software developer and technical instructor specializing in Apache Spark, Apache Kafka and Kafka Streams (with Scala, sbt, Kubernetes, DC/OS, Apache Mesos, and Hadoop YARN). He is best known by the gitbooks at https://jaceklaskowski.gitbooks.io about Apache Spark, Spark Structured Streaming, and Apache Kafka. Follow Jacek at https://twitter.com/jaceklaskowski.

Andrey Sharapov

Data Scientist and Data Engineer at Lidl
Building data products: from zero to hero!
General
Data Engineering · Data Product · Case Study

Modern organizations are overwhelmingly becoming data-driven in order to optimize internal processes and increase competitiveness. At Lidl we turn data into products and provide our internal customers with business insights at scale. Come and learn how we started from zero and turned into data heroes!

Bio

Andrey Sharapov is a data scientist and data engineer at Lidl. He is currently working on various projects related to machine learning and data product development. Previously, he spent 2 years at Xaxis, where he helped develop a campaign optimization tool for GroupM agencies, and then worked at TeamViewer, where he led data science initiatives and developed tools for customer analytics. Andrey is interested in “explainable AI” and is passionate about making machine learning accessible to the general public.

Wael Elrifai

Wael Elrifai

VP of Big Data, IOT & AI at Hitachi Vantara
AI & IOT for Good
General
Business Analytics · Case Study

In this session, Wael Elrifai shares his experience working in the IoT and AI space; covering complexities, pitfalls, and opportunities to explain why innovation isn’t just good for business — it’s a societal imperative. Key takeaways include:

  • Deeper understanding of what Big Data, IOT, and AI mean at a functional level, not just what brands the buzzwords refer to.
  • Detailed understanding of some use-cases, and why solving these is more complex than it seems.
  • Not just what it’s for, but who it is for, and how to think about the “business case” or social imperative around it.

Bio

Wael Elrifai is a thought leader, book author & public speaker in the AI & IOT space in addition to his role as VP of Big Data, IOT & AI at Hitachi Vantara. He has served corporate and government clients in North America, Europe, the Middle East, and East Asia across a number of industry verticals and has presented at conferences worldwide. With graduate degrees in both electrical engineering and economics, he’s a member of the Association for Computing Machinery, the Special Interest Group for Artificial Intelligence, the Royal Economic Society, and The Royal Institute of International Affairs.

Jonathon Morgan

Jonathon Morgan

Founder and CEO of New Knowledge
Machine Learning and Information Warfare
Intermediate
Business Analytics · Machine Learning · Knowledge Graph · Social Media

We've entered an age of information war. Hyper-partisan rhetoric, social media filter bubbles, and massive networks of fake social media accounts are being used to undermine elections, sow discord, and even inspire acts of violence. We can quantify this manipulation and its impact using new and novel approaches to natural language understanding and information semantics. We'll look at how knowledge graph embeddings can help humans quickly identify computational propaganda, and investigate how word vectors from models trained on partisan corpora can measure radicalization and polarization in political discourse.

Bio

Jonathon Morgan is the founder and CEO of New Knowledge, a technology company building AI for disinformation defense. He is also the founder of Data for Democracy, a policy, research, and volunteer collective with nearly 4,000 members that's bridging the gap between technology and society. Prior to founding New Knowledge, Jonathon published research about extremist groups manipulating social media with the Brookings Institution, The Atlantic, and the Washington Post, presented at NATO's Center of Excellence for Defense Against Terrorism, the United States Institute for Peace, and the African Union. He also served as an adviser to the US State Department, developing strategies for digital counter-terrorism. He regularly provides expert commentary about online disinformation for publications such as the New York Times, NBC, NPR, and Wired, and has published op-eds about information warfare and computational propaganda for CNN, The Guardian, and VICE.

Milene Darnis & Atul Gupte

Uber
Democratizing Data Science at Uber
General
Data Science · Data Culture

At Uber, we’re changing the way people think about transportation. As an integral part of the logistical fabric in 600+ cities across 65 countries around the world, we’re using technology to give people what they want, when they want it. So whether it’s a ride, a sandwich, or a package, our systems are tirelessly optimizing every part of the journey to ensure each experience is nothing short of magical.


To do this, teams across Uber depend on data to power every insight and inform every decision. People working with data at Uber bring various skills and proficiencies to the table: some advanced users know exactly what they’re looking for while others are learning to explore various techniques to wrangle and make sense of data. But they all have one thing in common: they want to make intelligent data-driven decisions.


In this talk, we’ll discuss how we think about building tools and services to help each of these users be more productive and work better together. Specifically, we’ll explore how we’re working with our most advanced users to democratize complex techniques for less-technical people. From exploratory and open-ended tools designed to allow rapid exploration of data and optimization of processes to specialized tools designed to allow deep-dives into a single, complex problem-space, our strategy is to provide our users with a catalog of products that they can rely on to make informed decisions. Data science is not just for data scientists anymore!

Bio - Milene Darnis

Milene Darnis is a Data Product Manager at Uber, focusing on building a world-class experimentation platform. From her role as a Product Manager and her previous experience at Uber as a Data Engineer modeling core datasets, she has developed a passion for linking data to concrete business problems. Previously, Milene was a Business Intelligence Engineer at a mobile gaming company. She holds a Master’s Degree in Engineering from Telecom ParisTech, France.

Bio - Atul Gupte

Atul Gupte is a Product Manager at Uber. He holds a BS in Computer Science from the University of Illinois at Urbana-Champaign. At Uber, he focuses on building an exploratory machine learning platform that helps teams make sense of the data they have, using advanced tooling, stable compute resources and foundational infrastructure to power Uber’s global ambitions. Previously, at Zynga, he built some of the world’s most loved social games and also helped develop the company’s mobile advertising platform.

Kishore Gopalakrishna

Senior Staff Software Engineer, Data infrastructure at LinkedIn
Scaling the wall of real-time analytics with Pinot
Intermediate
Data Engineering · Case Study · Scaling · Real-Time Analytics · Open Source

Most analytical use cases are for internal users within the company. While they require sub-second latency, the number of concurrent requests is low. However, LinkedIn has many site-facing applications, such as "who viewed my profile", that serve a large user base (500+ million) and demand low-latency response times at very large QPS. There is another class of applications, such as anomaly detection, that generates bursty workloads. Even though the underlying data and the query patterns are the same, we used different systems to power these varied use cases. This resulted in duplication of data and functionality, along with the operational overhead of maintaining many systems. In order to address these challenges, we built Pinot, a real-time distributed OLAP engine. Pinot is a single system used at LinkedIn to power 50+ site-facing applications along with dozens of internal applications. In this talk, I will discuss the details of Pinot and also provide a performance comparison with Druid.

Bio

Kishore Gopalakrishna is a Senior staff software engineer and the architect for LinkedIn’s analytics infra team. Kishore is passionate about solving hard problems in distributed systems. He has authored various distributed systems at LinkedIn such as Apache Helix, Espresso, Pinot, and ThirdEye. He is currently focused on enhancing Pinot, a real-time distributed OLAP engine and ThirdEye, a platform for anomaly detection and root cause analysis at LinkedIn.

Christoph Reininger

Head of Business Intelligence at Runtastic GmbH
From Data Science to Business Science - How data scientists @ Runtastic translate stakeholder needs
General
Business Analytics · Data Culture · Case Study

You got a working business model, scalable analytics infrastructure and highly skilled data scientists. But somehow you just don’t seem to be generating value from your data. Digital health and fitness company Runtastic shares their experience in translating business requirements into actionable data products that drive innovation.
After scaling a capable analytics infrastructure and building skilled data science / engineering teams, Runtastic’s challenge was to apply these capabilities to its fast-growing and ever-changing business. Business processes like user acquisition and customer relationship management had matured quickly and become more complex and sophisticated. Existing analytics and data products no longer fit the business requirements, and external solutions appeared both too expensive and too limited. The solution was that data scientists took a step back from the data they knew to take a hard look at the business and how it works.
The requirements engineering process that leads to a functional and valuable data product is a big challenge that involves a lot of different stakeholders and requires a wide variety of skills. In the past 24 months Runtastic tackled and revamped some of its most crucial business processes and discovered a lot of learnings along the way.

Bio

After receiving his master’s degree in Medical Informatics at the Medical University Vienna, Christoph started working for gespag, one of Austria’s biggest healthcare providers. After 4+ years of working as an IT architect and project manager Christoph joined Runtastic in 2013 to start their business intelligence initiative. In his position as Head of Business Intelligence he implemented and grew the data infrastructure and organization at Runtastic for the past 5 years. Working for a very innovative organization in the mobile health & fitness area, has given Christoph the opportunity to not only apply his knowledge in data management but to expand his experience regarding business processes and agile product management.

Nate Kupp

Director of Infrastructure and Data Science at Thumbtack
From humble beginnings: building the data stack at Thumbtack
Intermediate
Data Engineering · Architecture · Case Study · A/B Testing

As Paul Graham writes in “Do Things that Don’t Scale”, building marketplaces is incredibly hard. From decidedly non-scalable, humble beginnings, Thumbtack today helps millions of people complete their projects, generating more than $1B / year in business for our professionals.


When I arrived at Thumbtack four years ago, we had no data infrastructure, a non-functional A/B testing tool, and zero machine learning in production. Since that time, we’ve built out data infrastructure on Google Cloud Platform that serves tens of thousands of analytics queries and executes hundreds of Airflow-scheduled batch jobs per day. On top of that platform, we’ve also built internal A/B testing infrastructure that our team uses on a daily basis to drive product decisions, along with TensorFlow-based machine learning pipelines to drive our marketplace dynamics.


In this talk, I will share some of our key learnings on our path to scale, and how these systems evolved at Thumbtack to meet the needs of our product and engineering teams.

Bio

Nate Kupp is Director of Infrastructure and Data Science at Thumbtack, a company creating marketplaces for local services in San Francisco. Over his nearly 4 years at Thumbtack, he has worked on scaling both engineering teams and infrastructure to support Thumbtack's rapid growth. Before joining Thumbtack, Nate spent several years at Apple, where he built out data infrastructure for iOS and watchOS battery life analytics. His work at Apple enabled aggregating and analyzing hardware and software time series data from millions of iOS devices and drove optimizations and bug fixes to improve battery life for iOS and watchOS products. Prior to Apple, Nate completed his Ph.D. at Yale University in 2012; his research focused on machine learning applications in semiconductor manufacturing.

Chris Stucchio

Director of Data Science at Simpl
AI Ethics, Impossibility Theorems and Tradeoffs
General
Data Science · Ethics · Statistics

In analytical philosophy, armchair philosophers ask theoretical questions like "is it wrong to push one fat man onto train tracks in order to stop a trolley from smashing into 5 Italian grandmothers?" As AI ethics has become a concern, these problems have suddenly become practical and quantitative.


In this talk, I'll present some western liberal ethical principles that are important to algorithmic decision making. I'll provide a number of examples in lending, criminology and education which illustrate how it's impossible to simultaneously satisfy all these ethical principles. If time permits, I'll also discuss how these western principles are far from universal, and how most of the literature on the topic is relatively useless in non-western contexts (e.g., most literature focused on the US takes N=2, I don't even know how to count N in India).


As a content warning, this talk will bring up many uncomfortable tradeoffs (with respect to race in the US, caste in India, and gender in both).

Bio

Chris currently leads the data team at Simpl, India's leading pay later platform. In past lives he's been a quantum physicist, ad targeter and a quant trader. He tries to make all his important decisions with a Python notebook and encourages everyone to gamble on the things they strongly believe.

Chris Travers

Head of Databases at adjust.com
PostgreSQL at 20TB and Beyond
Advanced
Data Engineering · Architecture · Case Study · Open Source · PostgreSQL

At Adjust, we produce near-real-time analytics on over 400TB of high velocity data. This talk is a brief introduction to how we do it, and it serves as a showcase for what PostgreSQL is capable of doing in a big data environment. This talk will be of interest to people looking for information about how open source databases can be used at massive scales, approaches to federated data, and general open source success case studies.

Bio

Chris heads the database team at Adjust GmbH, which manages over 4PB of data in PostgreSQL. As part of his duties, he acts as database administrator, database engineer, software developer, and more. Chris has been an open source contributor for nearly two decades mostly in projects surrounding PostgreSQL. Over the years he has worked on everything from accounting software to high velocity and volume bioprospecting platforms and truly massive ad-tech analytics environments.

Mona Eldam & Yunchi Nam

Morgan Stanley
Data, data, everywhere, Nor any drop to drink: Morgan Stanley's journey through the ocean of data
Intermediate
Business Analytics · Case Study · Data Governance · Architecture · Data Quality

We live in a world surrounded by data. Despite its abundance, organizations can leverage only a fraction of their data effectively and turn it into insights that drive value for their clients. This talk will discuss how Morgan Stanley navigates the often challenging sea of data and the evolution of our data architecture, including metadata management, data lineage, data quality and data access. The speakers will highlight use cases related to data persistence and demonstrate data governance as required by many regulators, describing how a complex organization like Morgan Stanley enforces the usage of standard architecture, tools and frameworks. At the end of the session, the audience will gain understanding and insights into the data challenges global financial services firms face and a strategy that can be adapted to them.

Bio - Mona Eldam

Mona Eldam is a Managing Director at Morgan Stanley and Global Head of the Transactional Data Team, which manages the firm's critical Institutional Trade Capture databases among many others and is responsible for the OTC Derivatives Trades Internal Reporting. Mona joined the Firm in 1999 as an Associate in Ops Technology, and has since held a number of roles within various Technology divisions. Prior to working at Morgan Stanley, Mona was a consultant on state government financial systems, allowing her to live and work in several states including Hawaii and the U.S. Virgin Islands. Mona earned her Bachelor’s Degree in Computer Science from UCLA and her Master’s degree in Information Systems from NYU. She lives in New York and enjoys traveling and hiking.

Bio - Yunchi Nam

Yunchi Nam is a Managing Director of Morgan Stanley and Global Head of Data Engineering in Technology with responsibility for providing centralized engineering of Databases, Big Data Platforms, Client Reporting & Distribution Infrastructure, Data Middleware and Business Intelligence & Analytics Infrastructure. The Data Engineering group manages over 70,000 databases, 15 PB of data and distributes over 33M reports a month to Morgan Stanley’s external clients. Yunchi also leads Morgan Stanley’s Data Infrastructure Strategy program, developing a coherent data container and analytics strategy for the enterprise. Yunchi joined the Firm in 1999 as an Associate in Technology and has served in a variety of roles in application development and infrastructure engineering.


Workshops (31 Oct, Wed)

Mate Gulyas

Apache Spark™ Essentials (OFFICIAL DATABRICKS WORKSHOP)

CEO and Senior Instructor at Datapao

Apache Spark Essentials will help you get productive with the core capabilities of Spark, as well as provide an overview and examples for some of Spark’s more advanced features. This full-day course features hands-on technical exercises so that you can become comfortable applying Spark to your datasets. In this class, you will get hands-on experience with ETL, exploration, and analysis using real world data.


Overview

This 1-day course is for data engineers, analysts, architects, data scientists, software engineers, IT operations, and technical managers interested in a brief hands-on overview of Apache Spark.

The course provides an introduction to the Spark architecture, some of the core APIs for using Spark, SQL and other high-level data access tools, as well as Spark’s streaming capabilities and machine learning APIs. The class is a mixture of lecture and hands-on labs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.


Objectives
After taking this class, students will be able to:

  • Use a subset of the core Spark APIs to operate on data.
  • Articulate and implement simple use cases for Spark
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Create Structured Streaming jobs
  • Understand how a Machine Learning pipeline works
  • Understand the basics of Spark’s internals

Topics

  • Spark Overview
  • Introduction to Spark SQL and DataFrames, including:
    • Reading & Writing Data
    • The DataFrames/Datasets API
    • Spark SQL
    • Caching and caching storage levels
  • Overview of Spark internals
    • Cluster Architecture
    • How Spark schedules and executes jobs and tasks
    • Shuffling, shuffle files, and performance
    • The Catalyst query optimizer
  • Spark Structured Streaming
    • Sources and sinks
    • Structured Streaming APIs
    • Windowing & Aggregation
    • Checkpointing & Watermarking
    • Reliability and Fault Tolerance
  • Overview of Spark’s MLlib Pipeline API for Machine Learning
    • Transformer/Estimator/Pipeline API
    • Perform feature preprocessing
    • Evaluate and apply ML models
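
To give a flavour of the DataFrame and Spark SQL topics above, here is a minimal PySpark sketch; the labs themselves run in Databricks notebooks, and the input file and column names below are hypothetical.

```python
# Minimal sketch (assumes a local PySpark installation and a hypothetical
# events.csv file with "status" and "timestamp" columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("essentials").getOrCreate()

# Reading data: header row plus schema inference
events = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("events.csv"))          # hypothetical input file

# DataFrame API: filter, derive a column, aggregate
daily = (events
         .filter(F.col("status") == "ok")
         .withColumn("day", F.to_date("timestamp"))
         .groupBy("day")
         .agg(F.count("*").alias("events")))

# The same data through Spark SQL
events.createOrReplaceTempView("events")
spark.sql("SELECT status, COUNT(*) AS n FROM events GROUP BY status").show()

# Writing data back out
daily.write.mode("overwrite").parquet("daily_events")
spark.stop()
```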

Prerequisites:
This class doesn't require any Spark knowledge. Some experience in Python and some familiarity with big data or parallel processing concepts is helpful.

More details on this workshop:
https://databricks.com/training/instructor-led-training/courses/apache-spark-overview

Bio

CEO and Senior Instructor at Datapao, a Big Data and Cloud consultancy and training firm, focusing on industrial applications (aka Industry 4.0). Datapao helps Fortune 500 companies kick off and mature their data analytics infrastructure by giving them Apache Spark, Big Data and Data Analytics training and consultancy. Mate also serves as Senior Instructor in the Professional Services Team at Databricks, the company founded by the authors of Apache Spark. Previously he was Co-Founder and CTO of enbrite.ly, an award-winning Budapest based startup.
Mate has more than a decade of experience with Big Data architectures, data analytics pipelines, operating infrastructure and growing organisations by focusing on culture. Mate also teaches Big Data analytics at the Budapest University of Technology and Economics. He is a speaker and organiser of local and international conferences and meetups.

Gergely Daróczi

Practical Introduction to Data Science and Engineering with R

Gergely Daróczi, Passionate R developer

This is an introductory 1-day workshop on how to use the R programming language and software environment for the most common data engineering and data science tasks. After a brief overview of the R ecosystem and language syntax, we quickly get up to speed with hands-on examples on

  • reading data from local files (CSV, Excel) or simple web pages and APIs
  • manipulating datasets (filtering, summarizing, ordering data, creating new variables)
  • computing descriptive statistics
  • building basic models
  • visualizing data with `ggplot2`, the R implementation of the Grammar of Graphics
  • doing multivariate data analysis for dummies (eg anomaly detection with principal component analysis; dimension reduction with multidimensional-scaling to transform the distance matrix of European cities into a map)
  • introduction to decision trees, random forest and boosted trees with `h2o`

No prior R knowledge or programming skills required.

Bio

Gergely has been using R for more than 10 years in academia (teaching data analysis and statistics in the MA programs of PPCU, Corvinus and CEU) and in industry as well. He started his professional career at public opinion and market research companies, automating the survey analysis workflow, then founded and became the CTO of rapporter.net, a reporting SaaS based on R and Ruby on Rails, a role he quit to move to Los Angeles to standardize the data infrastructure of a fintech startup. Currently, he is the Senior Director of Data Operations at an adtech company in Venice, CA. Gergely is an active member of the R community (main organizer of the Hungarian R meetup, the first satRday conference and the eRum 2018 conference; speaker at international R conferences; developer and maintainer of CRAN packages).

Ananth Packkildurai

Introduction to data pipeline management with Airflow

Ananth Packkildurai, Senior data engineer at Slack

As the modern Data Warehouse increases in complexity, it is necessary to have a dependable, scalable, intuitive, and simple scheduling and management program to monitor the flow of data and watch how transformations are completed.

Apache Airflow, which helps manage the complexities of the Enterprise Data Warehouse, is being adopted by tech companies everywhere for its ease of management, scalability, and elegant design. Airflow is rapidly becoming the go-to technology for companies scaling out large data warehouses.

The Introduction to data pipeline management with Airflow training course is designed to familiarize participants with using Airflow to schedule and maintain numerous ETL processes running on a large-scale Enterprise Data Warehouse. The class covers, with hands-on exercises (a minimal DAG sketch follows below for a taste of the core concepts):

  • Introduction to Airflow framework and python
  • Introduction to Airflow core concepts (DAGs, tasks, operators, sensors)
  • Airflow UI
  • Airflow scheduler
  • Airflow operators & Sensors - deep dive
  • Hooks, connections, and variables
  • templating with Airflow
  • SLA, monitoring & alerting

Participants should have a technology background, basic programming skills in Python, and be open to sharing their thoughts and questions.

Participants need to bring their laptops. Further information about the technical environment will be communicated after registration.
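
As a taste of the DAG, task and operator concepts covered above, here is a minimal sketch of an Airflow 1.x DAG definition; the schedule, task logic and names are made up for illustration and are not the course material.

```python
# Minimal sketch (assumes an Airflow 1.x installation); defines a two-task DAG.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="daily_example",
    default_args=default_args,
    start_date=datetime(2018, 10, 1),
    schedule_interval="@daily",   # the scheduler creates one run per day
)

extract = BashOperator(
    task_id="extract",
    bash_command="echo extracting {{ ds }}",  # templated execution date
    dag=dag,
)

def transform(**context):
    # Placeholder transformation step
    print("transforming data for", context["ds"])

load = PythonOperator(
    task_id="transform_and_load",
    python_callable=transform,
    provide_context=True,
    dag=dag,
)

extract >> load   # dependency: extract runs before transform_and_load
```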

Bio

Ananth Packkildurai works as a Senior Data Engineer at Slack, managing core data infrastructure like Airflow, Kafka, Flink, and Pinot. He is passionate about all things related to ethical data management.

Tamas Srancsik

Beyond relational data

Tamas Srancsik, Data Analyst at Bitrise

SQL and relational databases are essential tools in an analyst's hands. Our goal is to demonstrate solutions for dealing with other forms of information: documents, images, API responses and graph databases, by building small media analysis applications in Python (a small sketch of the REST and JSON steps is shown after the list below).

  • Recap working with flat tables, relational databases
  • Limitations of the tabular form
  • Working with REST APIs: GET and POST HTTP requests
  • Parsing JSON
  • Storing data and querying document databases
  • Introduction to Neo4j graph database and Cypher query language
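
Here is a small sketch of the REST and JSON steps from the list above, using the requests package; the endpoint and field names are hypothetical placeholders, not part of the workshop materials.

```python
# Minimal sketch (assumes the requests package; endpoint and fields are hypothetical).
import json
import requests

# GET a JSON document from a (hypothetical) media API
resp = requests.get(
    "https://example.com/api/videos",          # placeholder endpoint
    params={"q": "budapest", "limit": 5},
    timeout=10,
)
resp.raise_for_status()
videos = resp.json()                           # parse the JSON response body

# Work with the nested, non-tabular structure directly
for video in videos.get("items", []):
    print(video.get("title"), video.get("stats", {}).get("views"))

# POST a derived document back to another (hypothetical) collection
summary = {"query": "budapest", "count": len(videos.get("items", []))}
requests.post("https://example.com/api/summaries", json=summary, timeout=10)

# Keep a local copy as a JSON document rather than forcing it into a flat table
with open("videos.json", "w") as f:
    json.dump(videos, f, indent=2)
```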

Audience

Beginner and intermediate level analysts

Bio

Tamás Srancsik is a Data Analyst at Bitrise. He is responsible for almost all aspects of data, from the integration of third-party services to business intelligence reports and deeper analysis for the Product and Growth Teams. This variety of teams comes with a variety of data sources and solutions to present. Prior to Bitrise, Tamás worked as a Statistician Manager at Nielsen and as a Customer and Product Analyst at Prezi.

Philipp Krenn

Elastic Stack Workshop: Search and Beyond

Philipp Krenn, Developer Advocate at Elastic

Elasticsearch is the most widely used full-text search engine, but it is also very common for logging, metrics, and analytics. This workshop shows you what the rage is all about:

  1. Overview of Elasticsearch and how it became the Elastic Stack.
  2. Full-text search deep dive:
    • How full-text search works in general and how it differs from databases.
    • How the score or quality of a search result is calculated.
    • How to handle languages, search for terms and phrases, run boolean queries, add suggestions, work with ngrams, and more with Elasticsearch.
  3. Going from search to logging, metrics, and analytics:

    • System metrics: Keep track of network traffic and system load.
    • Application logs: Collect structured logs in a central location from your systems and applications.
    • Uptime monitoring: Ping services and actively monitor their availability and response time.
    • Application metrics: Get the information from the applications such as nginx, MySQL, or your custom Java applications.
    • Request tracing: Trace requests through an application and show how long each call takes and where errors are happening.

And we will do all of that live, since it is so easy and much more interactive that way.
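
As a minimal taste of the full-text part, the sketch below indexes two documents and runs a relevance-ranked match query; it assumes an Elasticsearch 7.x node on localhost:9200 and the matching elasticsearch Python client, and it is not the workshop's own demo code.

```python
# Minimal sketch (assumes Elasticsearch 7.x on localhost:9200 and the
# "elasticsearch" Python client).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index a couple of documents into a "talks" index
es.index(index="talks", id=1, body={
    "title": "Foundations of streaming SQL",
    "abstract": "streams, tables and robust streaming queries",
})
es.index(index="talks", id=2, body={
    "title": "PostgreSQL at 20TB and Beyond",
    "abstract": "near-real-time analytics on high velocity data",
})
es.indices.refresh(index="talks")

# Full-text search: results come back ranked by relevance score
res = es.search(index="talks",
                body={"query": {"match": {"abstract": "streaming analytics"}}})
for hit in res["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```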

Bio

Philipp lives to demo interesting technology. Having worked as a web, infrastructure, and database engineer for more than ten years, Philipp is now working as a developer advocate at Elastic — the company behind the open source Elastic Stack consisting of Elasticsearch, Kibana, Beats, and Logstash. Based in Vienna, Austria, he is constantly traveling Europe and beyond to speak and discuss about open source software, search, databases, infrastructure, and security.


Tickets

Can’t plan ahead? No worries. Full or partial refunds (65%) are available according to your date of claim.
Check our Terms for our detailed refund policy.


Got questions? Contact us at hello@crunchconf.com

Crunch will be held together with Impact (a product management conference) and Amuse (a UX conference).
Your ticket allows you to attend all three tracks.

Sponsors


CRUNCH is a non-profit conference. We are looking for sponsors who help us make this conference happen.
Take a look at our sponsor packages and contact us at hello@crunchconf.com


Code of Conduct

All attendees, speakers, sponsors and volunteers at our conference are required to agree with the following code of conduct. Organisers will enforce this code throughout the event. We expect cooperation from all participants to help ensure a safe environment for everybody. If you need help, contact us at hello@crunchconf.com.

EXCELLENT WITH EACH OTHER

Our conference is dedicated to providing a harassment-free conference experience for everyone, regardless of gender, age, sexual orientation, disability, physical appearance, body size, race, or religion (or lack thereof). We do not tolerate harassment of conference participants in any form. Sexual language and imagery are not appropriate for any conference venue, including talks, workshops, parties, Twitter and other online media. Conference participants violating these rules may be sanctioned or expelled from the conference without a refund at the discretion of the conference organisers.

THE LESS QUICK VERSION

Harassment includes offensive verbal comments related to gender, age, color, national origin, genetic information, sexual orientation, disability, physical appearance, body size, race, religion, sexual images in public spaces, deliberate intimidation, stalking, following, harassing photography or recording, sustained disruption of talks or other events, inappropriate physical contact, and unwelcome sexual attention.

Participants asked to stop any harassing behavior are expected to comply immediately.

Sponsors are also subject to the anti-harassment policy. In particular, sponsors should not use sexualised images, activities, or other material. Booth staff (including volunteers) should not use sexualised clothing/uniforms/costumes, or otherwise create a sexualised environment.

If a participant engages in harassing behavior, the conference organisers may take any action they deem appropriate, including warning the offender or expulsion from the conference with no refund.

If you are being harassed, notice that someone else is being harassed, or have any other concerns, please contact a member of conference staff immediately. Conference staff can be identified as they'll be wearing branded t-shirts.

Conference staff will be happy to help participants contact hotel/venue security or local law enforcement, provide escorts, or otherwise assist those experiencing harassment to feel safe for the duration of the conference. We value your attendance.

We expect participants to follow these rules at conference and workshop venues and conference-related social events.

Contact

Crunch Conference is organized by

Ádám Boros
Event Organizer, Prezi
Attila Petróczi
Head of Data, liligo.com
Balázs Szakács
Business Intelligence Manager, IBM Budapest Lab
Dániel Molnár
Data Engineer at Shopify
Dorina Szabadi
Event Specialist at Prezi
Edina Németh
Travel Specialist and CSR coordinator, Prezi
Gergely Krasznai
Data Analyst, Prezi
Máté Gulyás
CEO, Datapao
Medea Baccifava
Head of conference management, Prezi
Tádé Arató
Conference Financial Coordinator, Prezi
Tamás Imre
Data Analytics Manager, Prezi
Tamás Németh
Data Engineer, Prezi
Zoltán Prekopcsák
VP Big Data, RapidMiner
Zoltán Tóth
Big Data and Hadoop expert, Datapao; Teacher, CEU Business School

Questions? Drop us a line at hello@crunchconf.com