Crunch Data Engineering and Analytics Conference, Budapest, October 18-20, 2017

CRUNCH is a use-case-heavy conference for people interested in building the finest data-driven businesses. No matter the size of your venture or your job description, you will find exactly what you need at the two-track CRUNCH conference. A data engineering track and a data analytics track will serve diverse business needs and levels of expertise.

If you are a Data Engineer, Data Scientist, Product Manager or simply interested in how to utilise data to develop your business, this conference is for you. No matter the size of your company or the volume of your data, come and learn from the biggest players in Big Data, get inspiration from their practices, their successes and their failures, and network with other professionals like you.


Our full-day workshops will be announced soon. You need to buy separate workshop tickets to attend them.


The day will start at 9AM and the last talk will end around 6PM. After the sessions there will be a Crunch party at the conference venue.


The day will start at 9AM and the closing ceremony will end around 6PM.


Charles Smith


Manager - Big Data Platform Architecture, Netflix
Working hard to build an easy data platform at Netflix

Here is a problem: You would like to buy the next great show for Netflix. The dream is that, given your data and a question, you can find the next House of Cards with a click of the mouse. But is that the reality? Why does it seem like data engineers and analysts spend so much time talking about memory requirements and stack traces? This talk will explore the past, present, and some of the future of the Netflix data platform, as well as how we are prioritizing work that will make it easier to focus on data problems rather than the complexities of the platform.


Charles Smith leads the Big Data Platform Architecture team at Netflix, whose mission is to make using data easy and efficient. He and his team are responsible for envisioning how the data platform allows data scientists to make Netflix's service even better.

Gyula Fóra


Data Warehouse Engineer, King
Real-time analytics at King

This talk gives a technical overview of the different tools and systems we are using at King to process and analyse over 30 billion events in real-time every day.
The core topic of this talk is RBEA (Rule-Based Event Aggregator), the scalable real-time analytics platform developed by King’s Streaming Platform team. RBEA is a streaming-as-a-service platform built on top of Apache Flink and Kafka which allows developers and data scientists to write analytics scripts in a high-level DSL and deploy them on the live event streams in a few clicks.
The distinguishing feature of this platform is that new analytics jobs are not deployed as independent Flink programs; instead, a fixed number of continuously running jobs serve as backends for the RBEA platform. By streaming both the events and the new scripts to the backends, scripts share both the incoming data and the state they may build up when analyzing user activity in the games. This design makes new deployments very lightweight and the whole architecture highly efficient without sacrificing expressivity.
We push the Apache Flink framework to its full potential in order to provide highly scalable stateful and windowed processing logic for the analytics applications. We will show how we have built a high-level DSL on the abstractions provided by Flink that is more approachable to developers without stream-processing experience, and how we use code generation to execute the programs efficiently at scale.
In addition to our streaming platform we will also introduce other tools that we have developed in order to make deployment and monitoring of real-time applications as simple as possible at scale.
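To make the core idea concrete, here is a deliberately tiny, framework-free sketch of the deployment model described above: one long-running backend receives both events and analytics scripts, so "deploying" a script just means registering it on the shared stream. All class, script, and event names are illustrative assumptions, not King's actual code.

```python
from collections import defaultdict

class ToyRBEABackend:
    """A single continuously running 'backend' job in miniature."""

    def __init__(self):
        self.scripts = {}               # script name -> callable(event, state)
        self.state = defaultdict(dict)  # per-script state, kept across events

    def deploy(self, name, script):
        # 'Deploying' a new analytics job is just registering a script;
        # no separate program is started.
        self.scripts[name] = script

    def on_event(self, event):
        # Every registered script sees every incoming event.
        for name, script in self.scripts.items():
            script(event, self.state[name])

backend = ToyRBEABackend()

def count_game_starts(event, state):
    if event["type"] == "game_start":
        state[event["user"]] = state.get(event["user"], 0) + 1

backend.deploy("game_starts", count_game_starts)
for e in [{"type": "game_start", "user": "u1"},
          {"type": "game_over", "user": "u1"},
          {"type": "game_start", "user": "u1"}]:
    backend.on_event(e)

print(backend.state["game_starts"]["u1"])  # -> 2
```

Because registering a script is cheap, this shape is what makes new "deployments" lightweight: the expensive resources (the stream and the running job) are shared.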


Gyula is a Data Warehouse Engineer in the Streaming Platform team at King, working hard on shaping the future of real-time data processing. This includes researching, developing and sharing awesome streaming technologies. Gyula grew up in Budapest, where he first started working on distributed stream processing and later became a core contributor to the Apache Flink project. Among his everyday fun and challenges you will find endless video game battles, super spicy foods and thinking about stupid bugs at night.
Gyula has been a speaker at numerous big data related conferences and meetups, talking about stream processing technologies and use-cases.

Shirshanka Das


Principal Staff Software Engineer, Linkedin
Taming the ever-evolving Compliance Beast: Lessons learned at LinkedIn

Just when you think you have your Kafka and Hadoop clusters set up and humming and you’re well on your path to democratizing data, you realize that you now have a very different set of challenges to solve. You want to provide unfettered access to data to your data scientists but at the same time you need to preserve the privacy of the data that your members have entrusted you with.

In this session, I outline the path LinkedIn has taken to protect member privacy on our scalable distributed data ecosystem built around Kafka and Hadoop. Like most companies, in our early days, our first priority was getting data flowing freely and reliably. Over the past few years we’ve made significant advances in data governance, going above and beyond expectations with regard to the commitments we’ve made to our members in how we handle their data.

Specifically, I’ll discuss how we’ve handled the Irish Data Protection Commissioner’s requirements for ensuring that our members’ data was purged from all our data systems, including Hadoop, within the required timeframe, and the kind of systems we had to build to solve this. I’ll discuss three foundational building blocks that we’ve focused on: a centralized metadata system, a standardized data movement platform and a unified data access layer. Some of these systems are open source and can be of use to companies in a similar situation to ours. I’ll also look to the future as the General Data Protection Regulation comes into effect in 2018, and outline our plans for addressing those requirements and the challenges that lie ahead.

Technology is just part of the solution.
In this talk I’ll also discuss the culture and process change we’ve seen happen at the company, and our learnings around sustainable process and governance.


Shirshanka is a Principal Staff Software Engineer and the architect for LinkedIn’s Data & Analytics team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team on simplifying the big data analytics space at LinkedIn through a multitude of mostly open-source projects: Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.

Justin Bozonier


Lead Data Scientist, Finance & Analytics, GrubHub
Science the shit out of your business

The mission of my data science team is to make a science out of our business at GrubHub. We work on understanding how every initiative our company undertakes affects our bottomline. I will discuss how we analyze every feature shipped to production, marketing programs, customer service, and more using a variety of statistical, machine learning, and decision theoretic tools and techniques. Most importantly, I will cover how we have learned to tune these tools, not with just abstract or theoretical scores, but by connecting model error with bottom line impact.
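As a rough illustration of what "connecting model error with bottom line impact" can look like (this is not GrubHub's actual methodology; the sample sizes, conversion counts, order volume, and $25 average order value below are invented), an experiment can be scored both statistically and in dollars:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf gives the two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, p_value

# Hypothetical experiment: 10,000 users per arm.
lift, p_value = two_proportion_z(conv_a=800, n_a=10_000, conv_b=880, n_b=10_000)

# Translate the statistical result into money: assumed order volume and
# average order value turn an abstract lift into bottom-line impact.
annual_orders = 5_000_000
avg_order_value = 25.0
revenue_impact = lift * annual_orders * avg_order_value
print(f"lift={lift:.3f}, p={p_value:.3f}, impact=${revenue_impact:,.0f}")
```

The point of the last three lines is the talk's theme: a model or test is tuned against what a decision is worth, not only against an abstract significance score.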


Justin Bozonier is the author of Test-Driven Machine Learning (published by Packt) and Lead Data Scientist in GrubHub's Financial Planning & Analytics group. As the founding data scientist of GrubHub's split-testing efforts, he leads the team that runs the company's experiment analysis platform, develops experiments and models to tune larger business operations, and mines experimental and operational data to look for new business opportunities and to assess the value of existing programs. He has spoken previously at PyData Seattle, Kellogg at Northwestern, PyData Chicago's monthly meetup, and more.
He lives in Lake Villa, IL (just outside the greater Chicago area) with his wife Savannah and soon, their first child. In his spare time he studies math, video game development, and enjoys running.

Sean Kross


Chief Technology Officer, Johns Hopkins Data Science Lab
Lessons from teaching data science to over a million people

My colleagues and I saw the demand for data scientists ballooning, and we decided to do something about it. In this talk I will explain how the Johns Hopkins Data Science Lab leveraged the latest statistical, computational, and open source methods in order to create over a million new data scientists. We'll talk about what happens as you take data newbies through their first serious programming experiences, rigorous mathematical training, and the creation of their first data products. We'll discuss the data we collected about how students handle these challenges and how you can use our insights to implement better data science training and understanding in your organization.


Sean Kross is a PhD student at the University of California San Diego, where he studies data science, human-computer interaction, and distributed education. Sean formerly worked in the Johns Hopkins Data Science Lab, where he and his colleagues developed The Data Science Specialization on Coursera. Sean is the author of Mastering Software Development in R, Developing Data Products, and The Unix Workbench. He blogs less often than he would like, and you can find him on Twitter @seankross.

Gio Fernandez-Kincade


Co-Founder @ Related Works, Formerly Staff Engineer @ Etsy
AI in Production

Read enough Hacker News and you will quickly become convinced that building AI products looks something like:

  1. Fire up TensorFlow
  2. Choose your favorite network architecture (or better yet, generate one!)
  3. Pipe in tons of data
  4. Profit

That couldn’t be farther from the truth. In this talk, we’ll figure out what it really takes to ship AI products in production.


Gio has been working with data, architecting systems, and leading teams of engineers for over a decade. He’s currently a co-founder at Related Works, which aims to build simple, intelligent products that help cultural institutions share their collections with the world. Previously he worked as a Staff Engineer at Etsy, where he led the Search Ranking and Search Experience teams. He focused on Search from the ground up: infrastructure, ranking and machine-learned relevance, diversity, fairness, query understanding, autosuggest, faceting, navigation, experimentation, etc. Prior to working at Etsy, Gio worked at CapitalIQ, where he designed, built, and maintained a multi-terabyte database, real-time processing system, and search engine for globally-sourced financial reports.

Cassandra Jacobs


Data Scientist, Stitch Fix
Imposing structure on unstructured text at Stitch Fix

At Stitch Fix, we have a wealth of text data related to each Fix we send out to clients. Fixes contain five apparel and non-apparel fashion items, ranging anywhere from blouses to leggings to shoes. Stylists are shown algorithmically scored pieces and ultimately use their own discretion to decide what to send to a client. After they’ve picked everything, stylists write notes detailing the items they selected, and once clients have received their Fix, they leave feedback on the pieces that we sent them. These notes and feedback can be leveraged to learn about our inventory, so we can explore which occasions an item is good for, or learn features that might be missing from an item’s description in our databases, which function like a knowledge base. Ultimately we can use this information to make recommendations to stylists about what to write about an item if they’re suffering from writer’s block, to automatically make suggestions about what the client might like given a request note they’ve written, or even to help stylists find better items similar to the ones they are considering sending.

Unfortunately, our text data is largely unstructured – stylists can talk about anything they send and in any order and clients don’t necessarily talk about the item’s prints or fabrics, or occasions that an item is good for. I will discuss a technique I have developed that builds upon a number of existing information extraction methods in natural language processing that allows us to impose structure on these notes and comments. This way we can find out how a stylist talks about an item even if we don’t know where it’s mentioned. The technique results in a network that defines words and items in a common space that we can use to make recommendations about how to talk about an item in a note, or for finding the right item in our inventory.
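As a rough, hypothetical sketch of the general direction (not the technique presented in the talk), even simple co-occurrence counting starts to place words and items in a common space; all item names and notes below are invented:

```python
from collections import Counter, defaultdict

# Invented stylist notes: (item_id, free-text note).
notes = [
    ("blouse_123", "great office blouse with a floral print"),
    ("blouse_123", "floral print works well for the office"),
    ("legging_456", "cozy legging for weekend errands"),
]

# Count which words co-occur with which items across all notes.
item_words = defaultdict(Counter)
for item, note in notes:
    for word in note.split():
        item_words[item][word] += 1

def suggest_words(item, k=3):
    """Suggest the most characteristic words for an item, e.g. for a
    stylist with writer's block."""
    return [w for w, _ in item_words[item].most_common(k)]

print(suggest_words("blouse_123"))
```

A production system would add the information-extraction machinery the talk describes (and weighting such as PMI rather than raw counts), but the word-item matrix is the shared backbone of both.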


Cassandra Jacobs is a data scientist at Stitch Fix. A lover of unstructured data, she works primarily on natural language processing systems for recommendation algorithms, helping expert stylists pick the right pieces to send to clients. After earning her BA in Linguistics at the University of Texas, she earned her PhD in Cognitive Psychology and an MS in Computer Science at the University of Illinois at Urbana-Champaign. In her spare time, she likes going on backpacking trips, reading literary science fiction, and learning foreign languages.

Thomas in’t Veld


Head of Data Science, Peak Labs
Event Driven Growth Hacking

Peak acquired more than 25 million users in two years by combining event analytics with marketing attribution and predictive modelling. In this talk, I will take you on a journey through what makes this tick, how we built it and why it is one of the best ways to grow a new business. Event analytics is the cool new thing used by everyone from Facebook to your second cousin's dog's start-up, but why are so many people doing it wrong? And what will be the next step?

Our mission here at Peak is to make lifelong progress enjoyable. We believe there’s always a little room for improvement, and we should strive to better ourselves bit by bit. That’s why we use a combination of neuroscience, technology and fun to get those little grey cells active and striding purposefully towards their full potential. Peak is the number one brain training app on mobile and, since it launched in 2014, has been downloaded more than 25 million times. It has been recognised by both Apple and Google as one of the best apps available, winning Best of 2014, Best of 2015 and Best of 2016 awards as well as Editors’ Choice on both the App Store and Play Store.


I am a theoretical physicist turned data scientist, and after building shiny data tools for Sky and The Guardian I joined Peak in 2015 to build a data science team. My continuing mission: making sure that every decision at Peak is made with as much data as possible.

Dirk Gorissen


Senior Engineer, Oxbotica
Beyond Ad-Click Prediction

We all know machine learning is great for helping you tag friends on Facebook, suggesting what brand of toothpaste will improve your smile, and picking the ad most likely to unlock your wallet. In this talk, however, I hope to demonstrate that there are some interesting applications you may not have thought of, such as detecting landmines from drone-mounted radar, finding orangutans in the Bornean jungle, or helping a car avoid pedestrians.


Dirk Gorissen has a background in Computer Science & AI and has worked in academic and commercial research labs across Europe and the US. His interests span machine learning, robotics, and computational engineering, as well as their application in the humanitarian and development areas. He has been a regular consultant for the World Bank in Tanzania and closely involved with a number of drone-related startups. He is currently a senior engineer at self-driving car company Oxbotica and, on the side, an active STEM Ambassador and organiser of the London Big-O Algorithms & Machine Learning meetups.

Maxime Beauchemin


Data Engineer, Airbnb
Advanced Data Engineering Patterns with Apache Airflow

Analysis automation and analytic services are the future of data engineering! Apache Airflow's DSL makes it natural to build complex DAGs of tasks dynamically, and Airbnb has been leveraging this feature in intricate ways, creating a wide array of services as dynamic workflows. In this talk, we'll explain the mechanics of dynamic pipeline generation using Apache Airflow, and present advanced use cases that have been developed at Airbnb.
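A framework-free sketch of the dynamic-generation pattern (deliberately not Airflow's API; the task and table names are invented) shows why defining pipelines in code pays off: a loop over configuration can create tasks and wire dependencies that would otherwise be written by hand.

```python
# Tables to ingest; in real life this could come from a config file or a
# metadata service.
config = ["events", "payments", "listings"]

dag = {}  # task name -> list of upstream task names

def add_task(name, upstream=()):
    dag[name] = list(upstream)
    return name

# One static "fan-in" task, plus extract/load pairs generated per table.
add_task("aggregate_all")
for table in config:
    extract = add_task(f"extract_{table}")
    load = add_task(f"load_{table}", upstream=[extract])
    dag["aggregate_all"].append(load)

# The same loop scales unchanged to hundreds of tables.
print(dag["aggregate_all"])  # -> ['load_events', 'load_payments', 'load_listings']
```

In Airflow itself the same loop would instantiate operators inside a DAG object instead of filling a dict, but the idea is identical: the pipeline is a program, so services can be generated as workflows.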


Maxime Beauchemin works at Airbnb on the Analytics & Experimentation Products team, developing open source products that reduce friction and help generate insight from data. He is the creator and a lead maintainer of Apache Airflow [incubating] (a workflow engine) and Superset (a data visualization platform), and is recognized as a thought leader in the data engineering field. Before Airbnb, Maxime worked at Facebook on computation frameworks powering engagement and growth analytics, on clickstream analytics at Yahoo!, and as a data warehouse architect at Ubisoft.

Melanie Warrick


Senior Developer Advocate, Google
Reinforcement Learning Overview

Reinforcement learning is a popular subfield in machine learning because of its success in beating humans at complex games like Go and Atari. The field’s value is in utilizing a reward system to develop models and find more optimal ways to solve complex, real-world problems. This approach allows software to adapt to its environment without full knowledge of what the results should look like. This talk will cover reinforcement learning fundamentals and examples to help you understand how it works.
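For a concrete taste before the talk, here is a minimal tabular Q-learning sketch of that reward-driven loop: a toy agent in a five-state corridor learns to walk toward the single rewarded state purely from trial and error. The environment and hyperparameters are illustrative assumptions, not from the talk.

```python
import random

N = 5                 # states 0..4; reaching state 4 yields the only reward
ACTIONS = (1, -1)     # step right / step left
alpha, gamma, eps = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

random.seed(0)
for _ in range(500):  # training episodes
    s = 0
    while s != N - 1:
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N - 1)
        r = 1.0 if s2 == N - 1 else 0.0  # reward only at the goal
        # Q-learning update: move toward reward + discounted best next value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The learned greedy policy walks right, toward the reward.
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N - 1)]
print(policy)
```

Note that the agent is never told the "correct" action; the reward signal alone shapes the value table, which is the point the abstract makes about adapting without full knowledge of the desired results.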


Melanie Warrick is a Senior Developer Advocate at Google. Previous experience includes work as a founding engineer on Deeplearning4J as well as implementing machine learning in production. Prior experience also covers business consulting and large enterprise technology implementations for a wide variety of companies. Over the last couple of years, she has spoken at many conferences about artificial intelligence, and her passions include working on machine learning problems at scale.

Dr. Martin Loetzsch


Chief Data Officer, Project A Ventures
ETL Patterns with Postgres

Some companies have to process data volumes that far exceed the capacity of “small” database clusters, and they definitely have a valid use case for one of the modern parallelizing / streaming / big data processing technologies. For all others, expressing transformations in plain SQL is just fine, and PostgreSQL is the perfect workhorse for that purpose.
In this talk, I will go through some of our best practices for building fast, robust, and tested data integration pipelines inside PostgreSQL. I will explain many of our technical patterns, for example for schema management or for splitting large computations by chunking and table partitioning. And I will show how to apply standard software engineering techniques to maintain agility, consistency, and correctness.
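As a self-contained illustration of the chunking pattern (using Python's built-in SQLite in place of PostgreSQL so it runs anywhere; table and column names are invented), a large aggregation can be split into small, restartable chunks whose combined result must match the full-table result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, float(i)) for i in range(10)])
conn.execute("CREATE TABLE order_totals (chunk INTEGER, total REAL)")

# Split one large aggregation into chunks (here: by id modulo N_CHUNKS),
# so each step stays small, restartable, and easy to test in isolation.
N_CHUNKS = 3
for chunk in range(N_CHUNKS):
    conn.execute(
        "INSERT INTO order_totals "
        "SELECT ?, SUM(amount) FROM orders WHERE id % ? = ?",
        (chunk, N_CHUNKS, chunk))

# Consistency check: the chunked result must equal the full-table result.
total = conn.execute("SELECT SUM(total) FROM order_totals").fetchone()[0]
print(total)  # -> 45.0
```

In PostgreSQL the same idea is typically expressed in plain SQL over partitioned tables; the consistency check at the end is an example of applying standard software-testing discipline to a pipeline.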


Martin Loetzsch works at Project A, a Berlin-based operational VC focusing on digital business models. As Chief Data Officer, he has helped many of Project A’s portfolio companies form teams that build data warehouses and other data-driven applications. Before joining Project A (with a short interlude at Rocket Internet), he worked in artificial intelligence labs in Paris and Brussels on computational linguistics and robotics. He received a PhD in computer science from the Humboldt University of Berlin.

Shrikanth Shankar


Director of Engineering, Data Analytics Infrastructure, Linkedin
Scaling Reporting and Analytics at LinkedIn

At LinkedIn, we have been working on the next generation of our reporting infrastructure. This talk will describe our journey to build a centralized platform that scales to hundreds of users and thousands of metrics and supports applications ranging from simple dashboarding to anomaly detection. We will discuss how a combination of technology and processes has allowed us to scale our user base while preserving trust in our metrics. We will also cover some of the exciting work we have been doing running metrics across a wide variety of platforms (from MapReduce to streaming systems like Samza).


Shrikanth Shankar is a Director of Engineering at LinkedIn where he leads multiple teams that work on infrastructure and platforms to support LinkedIn's analytic needs. Shrikanth has a long background in data and has worked at big companies and startups in a variety of technical and management roles.

András Németh


Chief Technology Officer, Lynx Analytics
Scalable Distributed Graph Algorithms on Apache Spark

Graph analysis is extremely important for getting insights out of the ever-increasing amounts of data available today. Be it connections in a social network, calls placed among subscribers of a mobile network, connections among computers and routers, webpages with links, or proteins reacting with each other, there are vast datasets that can best be modeled as graphs.

To make sense of these datasets we need to run various graph algorithms on them. To identify critical nodes we need PageRank, centrality, or the clustering coefficient; to find related groups of nodes we might want to find maximal cliques, communities, or a modular clustering; to decompose a graph into independent sets we need graph coloring; and so on.

The above list consists of fairly standard graph problems with well-understood algorithms to solve them. But very often these algorithms are unsuitable for trivial parallelization: they intrinsically require the full graph to be available in the memory of a single computer.

So how do we handle graphs too large for a single computer?

This talk is exactly about that. We at Lynx Analytics have built a big graph analysis engine on top of Apache Spark with a large library of graph algorithms readily available for users. In this talk we will dive into a few representative single-computer graph algorithms and show how to translate them into Spark's execution model. We will also get into some hands-on technical details: we will see how to optimally partition the data, and we will show some tricks for dealing with skewed graphs containing vertices of immense degree without running out of memory.


Andras is the CTO at Lynx. Andras and his team built innovative tools, based on Apache Spark, that help users build, run, and generate insights from big data graphs.

Andras joined Lynx in 2014 from Google, where he served as Lead Software Engineer in Zurich. At Google, Andras contributed to YouTube’s revolutionary personal ad-targeting system, a cross-media prediction engine, and semantic web analysis based on Google’s knowledge graph.

Prior to Google, Andras worked for Applied Logic Laboratory in Budapest. His primary efforts were on speech recognition, text classification and intelligent retrieval systems.

Andras holds two Master’s degrees, in Mathematics and Computer Science, from Eötvös Loránd University and the Budapest University of Technology and Economics respectively.

Sameer Farooqui


Freelancer / AI + Deep Learning
Separating hype from reality in Deep Learning

Deep Learning is all the rage these days, but where does the reality of what Deep Learning can do end and the media hype begin? In this talk, I will dispel common myths about Deep Learning and help you decide whether you should practically use Deep Learning in your software stack. I will begin with a technical overview of common neural network architectures like CNNs, RNNs, and GANs, and their common use cases like computer vision, language understanding, or unsupervised machine learning.

Then I'll separate the hype from reality around questions like:

  • When should you prefer traditional ML systems like scikit-learn or Spark ML instead of Deep Learning?
  • Do you no longer need to do careful feature extraction and standardization if using Deep Learning?
  • Do you really need terabytes of data when training neural networks or can you 'steal' pre-trained lower layers from public models by using transfer learning?
  • How do you decide which activation function (like ReLU, leaky ReLU, ELU, etc) or optimizer (like Momentum, AdaGrad, RMSProp, Adam, etc) to use in your neural network?
  • Should you randomly initialize the weights in your network or use more advanced strategies like Xavier or He initialization?
  • How easy is it to overfit/overtrain a neural network, and what are the common techniques to avoid overfitting (like L1/L2 regularization, dropout, and early stopping)?
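On the initialization question, a small numeric experiment (a sketch under assumed toy settings, not material from the talk) shows what Xavier initialization buys: it rescales random weights so the pre-activation variance stays near 1 instead of growing with the layer's fan-in.

```python
import math
import random

def layer_variance(init_std, fan_in, n_samples=5_000):
    """Empirical variance of a pre-activation sum w.x for unit-variance
    inputs, given weights drawn with standard deviation init_std."""
    random.seed(42)
    total = 0.0
    for _ in range(n_samples):
        s = sum(random.gauss(0, init_std) * random.gauss(0, 1)
                for _ in range(fan_in))
        total += s * s
    return total / n_samples

fan_in = 128
naive = layer_variance(1.0, fan_in)                       # std 1: variance ~ fan_in
xavier = layer_variance(math.sqrt(1.0 / fan_in), fan_in)  # Xavier: variance ~ 1

print(round(naive, 1), round(xavier, 3))
```

He initialization applies the same logic with a factor of 2/fan_in to compensate for ReLU zeroing half the activations; stacked over many layers, the naive scheme's variance blow-up is exactly what makes deep networks hard to train from a bad start.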

Sameer Farooqui is a freelancer who teaches corporate classes on big data and machine learning. Over the past 5 years, he has taught 150+ classes globally at conferences and for private clients on topics like NoSQL, Hadoop, Cassandra, HBase, Hive, Couchbase, and Spark. Sameer has been teaching Spark classes for three years and was the first full-time hire in Databricks’ training department, where he worked closely with the Apache Spark committers on designing curriculum. Previously, Sameer also worked as a Systems Architect at Hortonworks and an Emerging Data Platforms consultant at Accenture R&D. When not working on Spark projects, Sameer enjoys exploring ideas in AI + Deep Learning, especially Google’s new TensorFlow library.

Zareen Farooqui


Business Intelligence Analyst, Wayfair
Breaking into Data Analytics

Are you interested in starting a career in data analytics, but don't know where to begin? Attend a boot camp or teach yourself? Python or R? Last year, I quit my sales job to learn programming and break into the analytics field. In this talk, I will share the advice I learned and wish I had known then. I'll discuss common tools and technologies used in industry, how to continue developing tech skills after landing your first analytics job, and recommendations for managers to support direct reports with data science related ambitions.


Zareen is a Business Intelligence Analyst at Wayfair, where she focuses on marketing analytics. Previously, she interned at the Wikimedia Foundation and worked on projects to help understand how readers around the globe consume Wikipedia. She transitioned into data analytics after working as a sales engineer for 3 years and then taking time off to learn Python, SQL, and data visualization. Zareen holds a B.S. in Industrial and Systems Engineering from the University of Florida.


Nilan Peiris & Balázs Barna

Data driven growth. Using data science to power the hyper growth of a Mission Driven Startup

TransferWise is a new kind of financial services company that started in cross-border payments. It's now moving well over $1bn a month and has grown fast, doubling in size every 6 months to date.

The unusual thing about its growth is how it happens - with over 80% of new customers coming in through Word of Mouth. The team at TransferWise have orientated the entire organisation around the drivers of Word of Mouth growth.

The key to the explosive growth, from a product perspective, has been building a product that is 10x better than the alternatives on the parameters of speed, price and convenience. But to enable this to happen quickly, TransferWise has built autonomous KPI-driven teams around the drivers of Net Promoter Score that are incentivised to figure out how to drive step changes in the experience customers have.

The KPI-driven nature of the teams means that they use both quantitative methods (heavy analysis) and qualitative methods (customer research, interviews) to understand where they should focus to make changes in the experience for customers. Teams use data to inform where to focus, to understand the impact of their changes, and to build key parts of the product.

Nilan will talk about all of this and about the way TransferWise is structured organisationally to empower data-driven growth.

Bio - Nilan Peiris

Nilan Peiris is VP Growth at TransferWise, the international money transfer platform.

TransferWise is the low-cost and fair way of transferring money internationally. Using peer-to-peer technology and without any hidden fees, it makes sending money abroad up to eight times cheaper compared to using a bank. TransferWise customers send £1.3 billion every month using the platform, and it’s attracted $117m from investors such as the world’s largest VC firm Andreessen Horowitz, Sir Richard Branson, and Peter Thiel and Max Levchin, the co-founders of PayPal.

Prior to TransferWise Nilan was VP Growth at HouseTrip, in charge of scaling the company’s growth in the European market. He’s also worked as Chief Marketing Technology Officer at Holiday Extras, where he was responsible for all areas of technology, marketing and customer acquisition. Nilan also advises a number of early stage startups on growth and getting to traction.

Bio - Balázs Barna

Balazs works as a product engineer at TransferWise, where his current focus is building speed as a product. Prior to TransferWise he worked as a software engineer in investment banking and financial services. Balazs holds a Master’s degree in Computer Science from Pannon University and a BSc in Business Information Systems from Corvinus University of Budapest.

Meina Zhou


Senior Consultant, Data Scientist, Capco
Implementation of Digital Strategy and Data Science Solutions in Financial Institutions

With the rapid increase of data science solutions in the finance industry, more and more financial institutions are feeling the pressure to catch up with the trend and become leaders. However, transforming large amounts of financial data into insights and data-driven strategies is never an easy task for financial institutions. A lack of data science expertise and project transparency has impeded the process of data science innovation.

During this session, Meina Zhou will discuss how Capco helped different financial institutions implement large-scale data science solutions, and examine how those institutions benefited from them. She will discuss both the core technology components and the technical and business challenges involved throughout the implementation of data science solutions, and address the key factors in executing them successfully.


Meina is a data scientist and senior consultant at Capco. Prior to Capco, Meina Zhou worked as a data scientist at Bitly for two years. Her core expertise lies in the application of proven data science tools and techniques to conduct business analytics and predictive modeling. Meina has used her business acumen and data science skills to solve business problems, such as churn and upsell predictive modeling. She also has a background in big data analytics and enjoys data visualizations. Meina is a thought leader in the data science world and is an active conference speaker. She enjoys public speaking and sharing innovative data science ideas with other people. Meina received her Master of Science in Data Science from New York University and her Bachelor of Arts in Mathematics and Economics from Agnes Scott College.

George Shu


Manager, Financial Engineering, Uber
Engineering for Intelligent Financial Forecasting and Planning

In this talk we discuss how we enhance Uber’s financial planning process with a technology stack embracing artificial intelligence, big data and distributed applications. Starting with a brief introduction to the ride-sharing business model and the challenges of forecasting and planning at global scale, we will expand into a few engineering-specific topics, including the architecture of a model training and deployment platform, supporting multi-tenant scenario planning with extensible optimization algorithms and machine learning models, and building a feedback loop to analyze the quality of our AI with human insights. Toward the end we will briefly touch upon some of our exciting new research projects relevant to financial intelligence at Uber.


George Shu is an engineering manager in Uber’s business intelligence organization. He currently leads a finance engineering team building machine learning platforms and applications to empower Uber’s financial forecasting, planning, and optimization.
George joined Uber in 2013 as a senior engineer, and since then he has led and worked on various areas including the dispatch system, mobile protocols, and fraud engineering.
George obtained his Ph.D. in computer science from the Ohio State University in 2008, and his Bachelor’s degree from Peking University, China.

Lee Faus

Lee Faus

Sr. Solutions Architect, GitHub
GitHub Workflows for Collaboration with Data Science, Machine Learning and Development teams

Jeff Bezos has said that the company that has the most data and leverages it to its greatest strength will win the next evolution of application development. Cross-team collaboration with data science and machine learning teams is therefore becoming more important for companies seeking a competitive advantage. With the addition of support for Git Large File Storage (LFS) and Jupyter Notebooks, GitHub and the GitHub Workflow enable cross-team collaboration, allowing teams to bring data-intensive applications to market faster. We will review best practices for setting up a collaborative workspace where development teams can communicate effectively, and for establishing software communities around these data-intensive applications. Finally, we will look at examples of how open source and enterprise companies leverage existing development integrations to enhance their data-intensive workflows.


Lee has been in information technology for over 20 years. He has been a teacher, mentor, and consultant focused on delivering information technology value through enterprise application integration and application modernization. His experience spans verticals including transportation, healthcare, financial services, telecommunications, and insurance. He believes information technology should be an enabler that simplifies business processes by automating tasks and enhancing decision support. He is an advocate of open source, contributing to projects at Apache, Eclipse, and Fedora. At GitHub, he works with customers to streamline their application delivery toolchains and mentors companies on git, CI, and CD best practices.

Mikko Järvenpää

Mikko Järvenpää

CEO, Infogram
Why Data Visualization Works

Data visualization is a way to make numbers intuitively understandable. In this talk, we will look at the underpinnings of why and how we understand data visualizations, how creating and reading visualizations transforms data into information, and how different levels of data literacy should inform our use of visualizations. Building on this theoretical understanding, the talk will also give the audience actionable ways to maximise the impact of their communications.


Mikko Järvenpää is CEO for Infogram, the popular tool for creating interactive charts and infographics. Infogram has served over 1.5 billion views of data visualizations and is used by customers globally for marketing, media and internal communications content. Prior to Infogram, Mikko has been a PMM with Google, CMG with HackFwd, and a tour manager with a death metal band.

Ronnie Chen

Ronnie Chen

Data Engineer, Slack
Luck Driven Development: Building for Serendipity in Slack's Data Platform

How does a small team of data engineers scale a data platform and pipeline in a company undergoing rapid growth? While it is tempting to paint a picture of meticulous planning and brilliant foresight, some of the most interesting elements of Slack's data platform were never originally planned for. They were projects born out of a mix of necessity, opportunity, and a healthy dose of luck. This talk explores the design principles that allow us to optimize for luck and how we harness those opportunities to build tooling to iterate and scale.


Ronnie Chen designs, builds, and scales data infrastructure at Slack. Previously, she was a tech lead of data engineering at Braintree and a backend engineer at Paypal. She is a deep sea technical diver and was also the sous chef of a Michelin-starred restaurant in a previous life.

Mohammad Shahangian

Mohammad Shahangian

Head of Data Science, Pinterest
Analyzing Your First 200M Users

To stay competitive, companies must develop a data strategy that evolves with the needs of their business. Each organization will undoubtedly have a unique path that addresses its specific needs, making it challenging to know where and when to invest data and analytics resources. This talk describes the evolution of Pinterest's analytical capabilities by walking through concrete problems we faced at each stage of growth and the solutions we developed to address them. These capabilities are presented through a generalizable framework that can be applied to prioritizing a data strategy at any consumer product company.


Mohammad Shahangian is Head of Data Science at Pinterest. Formerly, Mohammad led Discovery Science at Pinterest, where his teams were responsible for making Pinterest’s billions of daily recommendations relevant. He was Pinterest’s first data scientist and initially led the development of the company’s core data infrastructure and analytics.

Zoltán Prekopcsák

Zoltán Prekopcsák

VP Engineering, Rapidminer
How to Ruin your Business with Data Science and Machine Learning

Everyone talks about how machine learning will transform business forever and generate massive outcomes. However, it’s surprisingly simple to draw completely wrong conclusions from statistical models, and “correlation does not imply causation” is just the tip of the iceberg.

The trend of the democratization of data science further increases the risk for applying models in a wrong way. In this talk, we will discuss capital mistakes as well as small errors that add up to completely ruin the potential positive impact of many data science projects. Attending this talk will hopefully help you to avoid many of those mistakes.


Zoltan leads all engineering teams as well as internal data science efforts at RapidMiner. Zoltan was the founder and CEO of Radoop, a big data analytics company that RapidMiner acquired in 2014. He has experience with data science projects in various industries, including telecommunications, financial services, e-commerce, neuroscience, and many more. Previously, he was a data scientist at Secret Sauce Partners, Inc., where he created a patented technology for predicting customer behavior. He has dozens of publications and is a regular speaker at international conferences.

Workshops (18 Oct, Wed)

Zoltan C. Toth

Apache® Spark™ Foundations (Databricks training)

Zoltan C. Toth, Senior Instructor & Consultant, Databricks

This hands-on 1-day course is for data engineers, analysts, and architects; software engineers; IT operations; and technical managers interested in a brief hands-on overview of Apache Spark.
The course covers core APIs for using Spark, basic internals of the framework, SQL and other high-level data access tools, as well as Spark’s streaming capabilities and machine learning APIs. Each topic includes slide and lecture content along with hands-on use of a Spark cluster through a web-based notebook environment.

After taking this class, you will be able to:

  • Experiment with use cases for Spark and Databricks, including extract-transform-load operations, data analytics, data visualization, batch analysis, machine learning, graph processing, and stream processing.
  • Identify Spark and Databricks capabilities appropriate to your business needs.
  • Communicate with team members and engineers using appropriate terminology.
  • Build data pipelines and query large data sets using Spark SQL and DataFrames.
  • Execute and modify extract-transform-load (ETL) jobs to process big data using the Spark API, DataFrames, and Resilient Distributed Datasets (RDD).
  • Analyze Spark jobs using the administration UIs and logs inside Databricks.
  • Find answers to common Spark and Databricks questions using the documentation and other resources.


  • Spark Overview
  • RDD Fundamentals
  • SparkSQL and DataFrames
  • Spark Job Execution
  • Intro to Spark Streaming
  • Machine Learning Basics

More details on this workshop:


Zoltan works as a Senior Spark Instructor and Consultant at Databricks, the company founded by the creators of Apache Spark. Earlier he managed the team that scaled a data infrastructure to crunch over 1 petabyte of data. Later he joined RapidMiner, a global leader in predictive analytics, where he kicked off the company's Apache Spark integration. Besides his work at Databricks, he designs and prototypes Big Data architectures and regularly gives Spark courses at conferences and for companies.

Sameer Farooqui

Deep Learning Fundamentals with TensorFlow and Keras

Sameer Farooqui, Freelancer / AI + Deep Learning

Abstract: Are you a software engineer who has been curious to get hands-on experience with Deep Learning? In this workshop, I'll introduce the fundamental concepts of Deep Learning and walk through code examples of common use cases. The class will be 60% lecture and 40% labs. The labs will run in Google Cloud, and all students should sign up for Google Cloud prior to class. Note that new users of Google Cloud receive a $300 USD credit valid for 12 months after sign-up, which is sufficient to run all of the class examples and code for free.

In the morning, the class will cover:

  • What is Deep Learning?
  • Math fundamentals of Neural Networks (matrices, derivatives, gradient descent)
  • Initialization, Activation, Loss and Optimization functions
  • Fundamentals of TensorFlow
  • Data preprocessing and feature engineering for different use cases
  • Overfitting and Underfitting
  • Introduction to the Keras API
  • Lab: TensorBoard UI
  • Lab: MNIST
  • Lab: Regression
  • Lab: Classification

In the afternoon, we will cover:

  • Common network architectures
  • Convolutional Neural Networks (CNNs) for computer vision
  • Recurrent Neural Networks (RNNs) for language understanding (including LSTMs)
  • Stealing pre-trained layers with Transfer Learning
  • Lab: Object Detection in Images
  • Lab: Text and Language Understanding

Sameer Farooqui is a freelancer who teaches corporate classes on big data and machine learning. Over the past 5 years, he has taught 150+ classes globally at conferences and for private clients on topics like NoSQL, Hadoop, Cassandra, HBase, Hive, Couchbase, and Spark. Sameer has been teaching Spark classes for three years and was the first full-time hire in Databricks’ training department, where he worked closely with the Apache Spark committers on designing curriculum. Previously, Sameer worked as a Systems Architect at Hortonworks and an Emerging Data Platforms consultant at Accenture R&D. When not working on Spark projects, Sameer enjoys exploring ideas in AI + Deep Learning, especially Google’s new TensorFlow library.

Gergely Daróczi

Practical Introduction to Data Science and Engineering with R

Gergely Daróczi, Passionate R developer

This is an introductory 1-day workshop on using the R programming language and software environment for the most common data engineering and data science tasks. After a brief overview of the R ecosystem and language syntax, we quickly get up to speed with hands-on examples of:

  • reading data from local files (CSV, Excel) or simple web pages and APIs
  • manipulating datasets (filtering, summarizing, ordering data, creating new variables)
  • computing descriptive statistics
  • building basic models
  • visualizing data with `ggplot2`, the R implementation of the Grammar of Graphics
  • doing multivariate data analysis for dummies (e.g. anomaly detection with principal component analysis; dimension reduction with multidimensional scaling to transform the distance matrix of European cities into a map)
  • introduction to decision trees, random forest and boosted trees with `h2o`

Gergely has been using R for more than 10 years in academia (teaching data analysis and statistics in MA programs at PPCU, Corvinus, and CEU) as well as in industry. He started his professional career at public opinion and market research companies, automating the survey analysis workflow, then founded, and became CTO of, a reporting SaaS based on R and Ruby on Rails; he left that role to move to Los Angeles to standardize the data infrastructure of a fintech startup. Currently, he is Senior Director of Data Operations at an adtech company in Venice, CA. Gergely is an active member of the R community (main organizer of the Hungarian R meetup and the first satRday conference; speaker at international R conferences; developer and maintainer of CRAN packages).

Ágoston Nagy

Visualizing data using machine learning (t-SNE) in JavaScript (in cooperation with Starschema)

Ágoston Nagy, HCI Researcher, Prezi

The workshop is a one-day, hands-on introduction to visualizing data using unsupervised machine learning. The t-SNE algorithm is a useful technique for finding previously unknown correlations and patterns within datasets. If you work with data visualization, data science, or data-driven design, this workshop is for you.

By the end of the day, you will be able to visualize patterns and correlations within your own or other publicly available datasets using JavaScript. You will learn how to fetch data from different APIs, display and interact with data on a 2D/3D canvas, using different inputs and animations.

What we cover:

  • Drawing (canvas 2D / 3D)
  • Generating Data
  • Loading Data using APIs, publicly available datasets
  • Visualizing Data
  • Feature extraction
  • Dimension Reduction (t-SNE)
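To show the shape of the feature-extraction and dimension-reduction steps above, here is a sketch in Python using scikit-learn's `TSNE` (the workshop itself works in JavaScript; this is just the same idea in a different toolkit, with made-up data):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated 10-dimensional clusters standing in for extracted features
features = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 10)),
    rng.normal(5.0, 0.1, size=(20, 10)),
])

# Reduce to 2 dimensions, suitable for plotting on a 2D canvas
embedding = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(features)
```

Each row of `embedding` is an (x, y) point you can draw; nearby points in the high-dimensional feature space tend to land near each other in the plot.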

Agoston works in interaction design, experimental media, and generative art using free and open-source tools. He designs dynamic systems and interfaces for networked installations and develops creative mobile applications. He is addicted to hacking and altering the functions of existing contexts and ordinary objects. He regularly gives workshops and courses on interaction design and creative coding using several open-source languages. He is a guest lecturer at Bergen University of Fine Arts (Norway) and Moholy-Nagy University of Art and Design (MOME, Hungary), and an HCI researcher at Prezi. As of 2016, he has been doing postdoctoral research in Realtime Interactions & Machine Learning at MOME. His works have been exhibited worldwide, including in China, India, Canada, Germany, Italy, Norway, Poland, the United States, Belgium, and Hungary. He is the co-founder of the experimental new media design group Binaura.

Csaba Kassai

Big Data on the Google Cloud – Apache Beam, Dataflow, Bigquery

Csaba Kassai, Software architect, Doctusoft

If you work with huge amounts of data, from either the analyst or the developer side, and you are always looking at what’s next in Big Data technologies, come join us and explore Apache Beam and the Google Big Data platform at Doctusoft’s full-day workshop.

  • Get introduced to the Apache Beam model
  • Build and execute batch and streaming pipelines using Google Cloud Dataflow and other Big Data services on the GCP platform, such as Cloud Pub/Sub and Google BigQuery
  • See how easy it is to run your pipelines on other runners like Apache Spark
  • Learn about real business use cases and project experiences.
  • Get the full picture of how these Big Data products differ from other well-known solutions, and learn which one to choose to suit your business needs or the technological requirements you work with.

Whether you come from a small start-up or a big multinational company, this workshop is useful for anyone who wants to learn first-hand how to work with Apache Beam and Google’s Big Data services.

Participants should have a technology background, basic programming skills in Java and be open to sharing their thoughts and questions.

Participants will need to bring their own laptops and have a Google account. Further information about the technical environment will be communicated after registration.


Csaba has been a software architect at Doctusoft ‒ the only Google Cloud Platform partner in Hungary ‒ for 6 years. He has participated in several Big Data projects, solving the problems of different retail, telecommunication, and start-up companies using Google and Hadoop technologies. He has also worked for one of the biggest banks in Hungary on Big-Data-focused projects such as optimizing the query time of the transaction history database with ElasticSearch. Csaba’s main professional interests are Google’s Big Data products and their related programming languages and database technologies.


Meet Budapest, a really awesome city

Here are a few reasons why you need to visit Budapest



The Magyar Vasúttörténeti Park (Hungarian Railway History Park) is Europe’s first interactive railway museum, located at a railway station and workshop of the Hungarian State Railways. There are over a hundred vintage trains, locomotives, cars and other types of railroad equipment on display, including a steam engine built in 1877, a railcar from the 1930s and a dining car built in 1912 for the famous Orient Express.

On the conference days there will be direct Crunch trains in the morning from Budapest-Nyugati Railway Terminal to the venue, and in the evening from the venue back to Budapest-Nyugati Railway Terminal, so we recommend finding a hotel near Nyugati station.


09:00 - 18:00 Workshops (only in case you have purchased a separate Workshop ticket)
Locations of workshops:
18:00 - 22:00 CrunchConf warmup meetup & party (RSVP required)







Academy partner

Media partners

CRUNCH is a non-profit conference. We are looking for sponsors to help us make this conference happen.
Take a look at our sponsor packages and contact us at


Crunch Conference is organized by

Ádám Boros
Ádám Boros
Marketing Intern, Prezi
Attila Balogi
Attila Balogi
Event manager, Prezi
Attila Petróczi
Attila Petróczi
R&D and Data Science Manager, Realeyes
Balázs Szakács
Balázs Szakács
Business Intelligence Manager, IBM Budapest Lab
Dániel Molnár
Dániel Molnár
Senior Data & Applied Scientist, Microsoft Deutschland GmbH / Wunderlist Team
Katalin Marosvölgyi
Katalin Marosvölgyi
Travel and accommodation manager, Prezi
Medea Baccifava
Medea Baccifava
Head of conference management, Prezi
Tamás Imre
Tamás Imre
Lead Analyst, Prezi
Tamás Németh
Tamás Németh
Data Engineer, Prezi
Zoé Rimay
Zoé Rimay
Software Developer, Morgan Stanley
Zoltán Prekopcsák
Zoltán Prekopcsák
VP Big Data, RapidMiner
Zoltán Tóth
Zoltán Tóth
Big Data and Hadoop expert, Datapao; Teacher, CEU Business School
Ryan McCabe
Ryan McCabe
Data Analyst, Prezi
Gergely Krasznai
Gergely Krasznai
Data Analyst, Prezi

Questions? Drop us a line at