Crunch Data Engineering and Analytics Conference Budapest October 29-31, 2018

CRUNCH is a use-case-heavy conference for people interested in building the finest data-driven businesses. No matter the size of your venture or your job description, you will find exactly what you need at the two-track CRUNCH conference. A data engineering track and a data analytics track will serve diverse business needs and levels of expertise.

If you are a Data Engineer, Data Scientist, Product Manager, or simply interested in how to use data to develop your business, this conference is for you. No matter the size of your company or the volume of your data, come and learn from the biggest players in Big Data, get inspiration from their practices, their successes and failures, and network with other professionals like you.


The day will start at 9AM and the last talk will end around 6PM. After the sessions there will be a Crunch party at the conference venue.


The day will start at 9AM and the closing ceremony will end around 6PM.


Our full-day workshops will be announced soon. You need to buy separate workshop tickets to attend them.


Meet Budapest, a really awesome city

Here are a few reasons why you should visit Budapest:



The Magyar Vasúttörténeti Park (Hungarian Railway History Park) is Europe’s first interactive railway museum, located at a railway station and workshop of the Hungarian State Railways. There are over a hundred vintage trains, locomotives, cars and other types of railroad equipment on display, including a steam engine built in 1877, a railcar from the 1930s and a dining car built in 1912 for the famous Orient Express.

On the conference days there will be direct Crunch trains in the morning from Budapest-Nyugati Railway Terminal to the venue, and in the evening from the venue back to Budapest-Nyugati Railway Terminal, so we recommend finding a hotel near Nyugati station.


Jon Morra

Vice President of Data Science at Zefr
Clustering YouTube: A Top Down & Bottom up Approach

At ZEFR we know that when an advertisement on YouTube is relevant to the content a user is watching it is a better experience for both the user and the advertiser. In order to facilitate this experience we discover billions of videos on YouTube and cluster them into concepts that advertisers and brands want to buy to align with their particular creatives. To serve our clients we use two different clustering strategies, a top down supervised learning approach and a bottom up unsupervised learning approach. The top down approach involves using human annotated data and a very fast and robust machine learning model deployment system that solves problems with model drift. Our clients are also interested in discovering topics on YouTube. To serve this need we use unsupervised clustering of videos to surface clusters that are relevant. This type of clustering allows ZEFR to highlight what users are currently interested in. We show how using Latent Dirichlet Allocation can help to solve this problem. Along the way we will show some of the tricks that produce an accurate unsupervised learning system. This talk will touch on some common machine learning engines including Keras, TensorFlow, and Vowpal Wabbit. We will also introduce our open source Scala DSL for model representation, Aloha. We show how Aloha solves a key problem in a typical data scientist's workflow, namely ensuring that feature functions make it from the data scientist's machine to production with zero changes.


Jon Morra is the Vice President of Data Science at Zefr, a video ad-tech company. His team's main focus is figuring out the best videos to deliver to Zefr's clients to optimize their advertising campaign objectives based on the content of the videos. In this role he leads a team of data scientists who are responsible for extracting information from both videos and clients to create data-driven models. Prior to Zefr, Jon was the Director of Data Science at eHarmony, where he helped increase both the breadth and depth of data science usage. Jon holds a B.S. from Johns Hopkins and a Ph.D. from UCLA, both in Biomedical Engineering.

Daniel Porter

Co-founding member of BlueLabs
Using Rapid Experiments and Uplift Modeling to Optimize Outreach at Scale

In the current environment, media consumption is fragmenting, cord cutters are an increasingly large segment of the population, and “digital” is no longer a ubiquitous, single medium. As such, large companies and other organizations looking to do outreach at scale to change individuals’ behavior have an overwhelming number of choices for how to deploy their outreach resources. In this talk, Daniel Porter, co-founder and Chief Analytics Officer of BlueLabs, will discuss how current tools that combine uplift models with state-of-the-art allocation algorithms make it possible for organizations ranging from Fortune 100 companies to Presidential campaigns to large government agencies to optimize these decisions at the individual level, ensuring delivery of the right message to the right person at the right time, through media channels where each individual is most likely to engage positively with the content.
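The core idea behind uplift modeling can be sketched with a minimal "two-model" estimator: model the outcome separately for the contacted (treated) and non-contacted (control) groups, and target wherever the difference is largest. The segments and data below are invented for illustration and are not BlueLabs' actual method or data.

```python
from collections import defaultdict

# Toy outreach records: (segment, was_contacted, converted).
# All names and numbers here are invented for illustration.
records = [
    ("young_urban", True, 1), ("young_urban", True, 1), ("young_urban", True, 0),
    ("young_urban", False, 0), ("young_urban", False, 1), ("young_urban", False, 0),
    ("rural", True, 0), ("rural", True, 1), ("rural", True, 0),
    ("rural", False, 1), ("rural", False, 0), ("rural", False, 1),
]

def conversion_rate(rows):
    return sum(r[2] for r in rows) / len(rows) if rows else 0.0

def uplift_by_segment(records):
    # "Two-model" uplift: estimate the conversion rate separately for the
    # treated (contacted) and control groups, then take the difference.
    by_seg = defaultdict(lambda: {"treated": [], "control": []})
    for seg, treated, converted in records:
        by_seg[seg]["treated" if treated else "control"].append((seg, treated, converted))
    return {
        seg: conversion_rate(groups["treated"]) - conversion_rate(groups["control"])
        for seg, groups in by_seg.items()
    }

uplift = uplift_by_segment(records)
# Deploy outreach where contact raises conversion the most.
best_segment = max(uplift, key=uplift.get)
```

Note that a segment can have *negative* uplift (contact hurts conversion), which is exactly why allocating outreach by raw conversion rate, rather than uplift, can waste or even backfire the budget.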


Dan is a co-founding member of BlueLabs and has led its data science team since the company’s inception. Over the past 5 years, he has overseen the team’s growth in applying data science to new industries, spearheaded the development of critical new tools and methodologies, and expanded the team’s technical capabilities to incorporate the world’s most cutting-edge data science innovations. Dan’s interest is in using data science as a tool to predict, and more importantly, influence behaviors. As Director of Statistical Modeling on the 2012 Obama Campaign, his team was the first in the history of Presidential politics to use persuasion modeling to determine the voters who were most likely to be persuaded by the campaign’s outreach. Since co-founding BlueLabs, Dan’s team has iterated on this work to influence behaviors and attitudes for applications ranging from perceptions of a Fortune 10 company, to buying products from big-box retail stores, to the uptake of key Federal Government services. Much of Dan’s team’s recent work has focused on how different individuals can influence each other’s attitudes and behaviors in asymmetric ways. Dan is passionate about understanding how these key drivers of influence are critical to organizations seeking to achieve their campaign, brand, or policy goals. Dan has an MA in Quantitative Methods from Columbia University, and a BA from Wesleyan University. He is an avid sports fan (always watching from a statistical perspective), and, sadly, enjoys optimizing his frequent flyer miles portfolio between vacations almost as much as vacation itself.

Ananth Packkildurai

Senior data engineer at Slack
Operating data pipeline using Airflow @ Slack

Slack is a communication and collaboration platform for teams. Our millions of users spend 10+ hours connected to the service on a typical working day. The Slack data engineering team's goal is simple: drive up the speed, efficiency, and reliability of making data-informed decisions — for engineers, for people managers, for salespeople, for every Slack customer. Airflow is the core system in our data infrastructure for orchestrating our data pipeline. We use Airflow to schedule Hive/Tez, Spark, Flink and TensorFlow applications, and it helps us manage our stream processing, statistical analytics, machine learning, and deep learning pipelines. About six months ago, we started an on-call rotation for our data pipeline, adopting what we had learned from the DevOps paradigm. We found several Airflow performance bottlenecks and operational inefficiencies hidden in ad-hoc pipeline management. In this talk, I will explain how we identified Airflow performance issues and fixed them. I will share our experience as we strove to resolve our on-call nightmares and make the data pipeline simpler and more pleasant to operate, along with the hacks we used to improve the alerting and visibility of our data pipeline. Though the talk leans toward Airflow, the principles we applied to data pipeline visibility engineering are more general and can be applied to any tool or data pipeline.
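For readers unfamiliar with Airflow, a pipeline is declared as a DAG of tasks with explicit dependencies, retries, and SLAs — the knobs that make on-call alerting possible. The sketch below is a generic configuration using the Airflow 1.x-era API (current at the time of this conference); the DAG, task names, and schedule are invented for illustration and are not Slack's actual pipeline.

```python
# A minimal Airflow DAG sketch (Airflow 1.x-era API). Task names,
# schedule, and SLA values are invented for illustration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                        # retry transient failures before paging on-call
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=2),           # surface tasks that run past their SLA
}

with DAG(
    dag_id="daily_metrics",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,                       # don't backfill every missed interval
) as dag:
    extract = BashOperator(task_id="extract_events", bash_command="echo extract")
    aggregate = BashOperator(task_id="aggregate_metrics", bash_command="echo aggregate")
    publish = BashOperator(task_id="publish_dashboard", bash_command="echo publish")

    extract >> aggregate >> publish      # upstream >> downstream dependencies
```

Because retries, SLAs, and dependencies live in the DAG definition rather than in ad-hoc scripts, the scheduler itself can drive alerting and visibility — the kind of operational discipline the talk describes.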


I work as a senior data engineer at Slack, managing core data infrastructure such as Airflow, Kafka, Flink, and Pinot. I love talking about all things ethical data management.

Szilard Pafka

Chief Scientist at Epoch USA
Better than Deep Learning: Gradient Boosting Machines (GBM)

With all the hype about deep learning and "AI", it is not well publicized that for the structured/tabular data widely encountered in business applications, it is actually another machine learning algorithm, the gradient boosting machine (GBM), that most often achieves the highest accuracy in supervised learning tasks. In this talk we'll review some of the main GBM implementations available as R and Python packages, such as xgboost, h2o and lightgbm; we'll discuss some of their main features and characteristics; and we'll see how tuning GBMs and creating ensembles of the best models can achieve the best prediction accuracy for many business problems.
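To make the boosting idea concrete, here is a from-scratch sketch of a GBM for least-squares regression using depth-one trees ("stumps"): each round fits a stump to the current residuals (the negative gradient of squared error) and adds a damped copy of it to the ensemble. This illustrates the principle only; it is not how xgboost, h2o, or lightgbm are actually implemented.

```python
# A tiny gradient boosting machine for least-squares regression,
# built from depth-one regression trees ("stumps").

def fit_stump(x, residuals):
    """Find the threshold on x that minimizes squared error of the residuals."""
    best = None
    for split in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda xi: lmean if xi <= split else rmean

def fit_gbm(x, y, n_trees=50, learning_rate=0.1):
    base = sum(y) / len(y)                # start from the mean prediction
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        # Residuals are the negative gradient of squared error.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        trees.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + learning_rate * sum(t(xi) for t in trees)

# Fit a step function: y jumps from 0 to 1 at x = 5.
x = list(range(10))
y = [0.0] * 5 + [1.0] * 5
model = fit_gbm(x, y)
```

Even this toy version shows the two main tuning levers the talk discusses: the number of trees and the learning rate, which trade off against each other (a smaller learning rate needs more trees but usually generalizes better).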


Szilard studied Physics in the 90s and obtained a PhD by using statistical methods to analyze the risk of financial portfolios. He worked in finance, then more than a decade ago became the Chief Scientist of a tech company in Santa Monica, California, doing everything data (analysis, modeling, data visualization, machine learning, data infrastructure etc.). He is the founder/organizer of several meetups in the Los Angeles area (R, data science etc.) and a data science community website. He is the author of a well-known machine learning benchmark on GitHub (1000+ stars), a frequent speaker at conferences (keynote/invited at KDD, R/Finance, Crunch, eRum, and contributed talks at useR!, PAW, EARL etc.), and he has developed and taught graduate data science and machine learning courses as a visiting professor at two universities (UCLA in California and CEU in Europe).

Thomas Dinsmore

Senior Director for DataRobot
The Path to Open Data Science


Thomas W. Dinsmore is a Senior Director for DataRobot, an AI startup based in Boston, Massachusetts, where he is responsible for competitor and market intelligence. Thomas’ previous experience includes service for Cloudera, The Boston Consulting Group, IBM Big Data, and SAS. Thomas has worked with data and machine learning for more than 30 years. He has led or contributed to projects for more than 500 clients around the world, including AT&T, Banco Santander, Citibank, CVS, Dell, J.C.Penney, Monsanto, Morgan Stanley, Office Depot, Sony, Staples, United Health Group, UBS, Vodafone, and Zurich Insurance Group. Apress published Thomas’ book, Disruptive Analytics, in 2016. Previously, he co-authored Modern Analytics Methodologies and Advanced Analytics Methodologies for FT Press and served as a reviewer for the Spark Cookbook. He posts observations about the machine learning business on his personal blog.

Ajay Gopal

Chief Data Scientist at Deserve, Inc
"Full-Stack" Data Science with R

In the past 5 years, the ecosystem of R packages and services has evolved rapidly. This has enabled R to cross over from the domain of statisticians to an efficient functional programming language that can be used across the board for data engineering, analytics, reporting and data science. I'll illustrate how startups and medium-size companies can use R as a common language for

  1. engineering functions such as ETL and creation of data APIs,
  2. analytics through scalable real-time reporting dashboards and
  3. the prototyping and deployment of ML models.
Along the way, I'll specifically identify open-source tools that allow scalable stacks to be built in the cloud on minimal budgets. The efficiency gained enables small teams of R programmers and data scientists to provide diverse lateral intelligence across a company.


Ajay is a California resident, building his second FinTech Startup Data Science team as Chief Data Scientist at Deserve. Before that, he built the data science & digital marketing automation functions at – another CA FinTech Startup. In both roles, he has built diverse teams, cloud data science infrastructures and R&D/Prod workflows with a "full stack" approach to scalable intelligence & IP generation. Ajay holds a PhD in physical chemistry and researched bio-informatics and graph theory as a post-doc before transitioning to the startup world.

More speakers will be announced soon

If you want to be a speaker at Crunch submit your proposal here:


Crunch will be held together with Impact (a product management conference) and Amuse (a UX conference).
Your ticket allows you to attend all three tracks.


Crunch Conference is organized by

Questions? Drop us a line at