About the conference
DataSphere is a conference devoted to data-centric systems and the technologies making them tick.
Whether it is data engineering or AI application challenges, they all fit right in.
From technical details to concrete business use cases, no fluff.
Agenda – overview
Day 1 - Sunday
9:00 to 17:00 - Workshops, Hackathons, Training
Day 2 - Monday
8:00 to 9:00 - Registration
9:00 to 10:45 - Keynote sessions
11:00 to 18:30 - Talks
from 18:45 - After Party
Day 3 - Tuesday
9:00 to 18:30 - Talks
19:00 to 21:00 - Open Meetups
Workshops, Hackathons, Training - More details soon
Open Meetups - More details soon
Armand Ruiz Gabernet
Lead Product Manager at IBM. Armand is a Product Manager of Advanced Analytics solutions and a technology enthusiast. A motivated self-starter who builds innovative new products, he has strong organizational skills and is able to navigate across different teams and varying personalities. He is motivated by great design, product simplicity and high-quality user experience.
Vladimir is an Artificial Intelligence enthusiast, a perfectionist at heart with a pragmatic mindset. He is a trainer at DataWorkshop.eu, where he explains how to use machine learning in real life, and he hosts a podcast about Artificial Intelligence, BiznesMysli.pl (in Polish). He is an architect at General Electric, participates in Kaggle competitions, and loves data and its challenges.
Do you know that the 4th industrial revolution is coming? Do you think that the talk of robots taking over jobs concerns only other professions, and that programmers will still be needed? That is partially true, but on the other hand, really big changes are coming over the next 5-10 years. If you want to prepare for them, I recommend finding out who programmer 2.0 will be.
It turns out that learning machine learning can be an interesting adventure (and doing a PhD is optional).
I will show you examples of how, with relatively little effort, you can do interesting and valuable things. The goal of the presentation is to inspire and to break down the perceived complexity of machine learning. If you can program, you can do ML too!
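In that spirit, here is a complete, if toy, machine learning algorithm in plain Python. This is my own illustration, not an example from the talk: a 1-nearest-neighbour classifier on invented data.

```python
# A minimal 1-nearest-neighbour classifier in plain Python: no libraries,
# no PhD required. The training data below is entirely made up.

def nearest_neighbour(train, query):
    """Return the label of the training point closest to `query`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, best_label = min(train, key=lambda pl: dist(pl[0], query))
    return best_label

# (feature vector, label) pairs forming two obvious clusters
train = [((1.0, 2.0), "left"), ((2.0, 1.0), "left"),
         ((8.0, 9.0), "right"), ((9.0, 8.0), "right")]

print(nearest_neighbour(train, (1.5, 1.5)))  # near the "left" cluster
print(nearest_neighbour(train, (8.5, 8.5)))  # near the "right" cluster
```

Ten lines of ordinary code, yet it is a real (if simple) ML technique, which is exactly the point the talk makes.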
William Benton leads a team of data scientists and engineers at Red Hat, where he has applied analytic techniques to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy intelligent applications in cloud-native environments, but he has also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.
Implementing Machine Learning Algorithms for Scale-Out Parallelism
Frameworks for elastic scale-out computation, like Apache Spark and Apache Flink, are important tools for putting machine intelligence into production applications. However, these frameworks do not always offer the same breadth or depth of algorithm coverage as specialized machine learning libraries that run on a single node, and the gulf between being a competent framework user and a seasoned library developer who can extend a framework can be quite daunting.
In this talk, we’ll walk through the process of developing a parallel implementation of a machine learning algorithm. We’ll start with the basics, by considering what makes algorithms difficult to parallelize and showing how we’d design a parallel implementation of an unsupervised learning technique. We’ll then introduce a simple parallel implementation of our technique on Apache Spark, and iteratively improve it to make it more efficient and more user-friendly. While some of the techniques we’ll introduce will be specific to the Spark implementation of our example, most of the material in this talk is broadly applicable to other distributed computing frameworks. We’ll conclude by briefly examining some techniques to complement scale-out performance by scaling our code up, taking advantage of specialized hardware to accelerate single-worker performance. You’ll leave this talk with everything you need to implement a new machine learning technique that takes advantage of parallelism and resources in the public cloud.
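As a rough sketch of the decomposition the talk describes (my own illustration, with plain Python lists standing in for Spark partitions and `functools.reduce` standing in for the framework's reduce), here is one synchronous step of parallel k-means expressed as a map over partitions plus an associative merge:

```python
# Sketch of the map/reduce shape of a scale-out k-means step.
# `map_partition` is what each worker would run on its slice of the data;
# `merge` is the associative combine the framework applies on the reduce side.
from functools import reduce

def closest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def map_partition(points, centroids):
    """Per-partition sufficient statistics: (sum vector, count) per centroid."""
    stats = {}
    for p in points:
        i = closest(p, centroids)
        s, n = stats.get(i, ((0.0,) * len(p), 0))
        stats[i] = (tuple(a + b for a, b in zip(s, p)), n + 1)
    return stats

def merge(a, b):
    """Combine two partial statistics; order-independent, so it parallelizes."""
    out = dict(a)
    for i, (s, n) in b.items():
        s0, n0 = out.get(i, ((0.0,) * len(s), 0))
        out[i] = (tuple(x + y for x, y in zip(s0, s)), n0 + n)
    return out

def kmeans_step(partitions, centroids):
    """One update: map over partitions, reduce, recompute the means."""
    stats = reduce(merge, (map_partition(p, centroids) for p in partitions))
    return [tuple(x / n for x in s) for i, (s, n) in sorted(stats.items())]

partitions = [[(0.0, 0.0), (1.0, 1.0)], [(9.0, 9.0), (10.0, 10.0)]]
print(kmeans_step(partitions, [(0.0, 0.0), (10.0, 10.0)]))
```

The key property is that `merge` is associative and commutative, which is what lets a framework like Spark compute the partial statistics on many workers and combine them in any order.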
Umit is a Data Scientist at IBM, extensively focusing on IBM Data Science Experience and IBM Watson Machine Learning to solve complex business problems. His research spans across many areas from statistical modeling of financial asset prices to using evolutionary algorithms to improve the performance of machine learning models. Before joining to IBM, he worked on various domains such as high-frequency trading, supply chain management and consulting. He likes to learn from others and also share his insights at universities, conferences and local meet-ups.
Recent advancements in NLP and deep learning: a quant’s perspective
There is a gold rush among hedge funds for text-mining algorithms that quantify textual data and generate trading signals. Harnessing the power of alternative data sources became crucial to finding novel ways of enhancing returns.

With the proliferation of new data sources, natural language data became one of the most important sources for capturing public sentiment and opinion about market events, which can then be used to predict financial markets.
The talk is split into 5 parts:
- Who is a quant and how do they use NLP?
- How has deep learning changed NLP?
- Let’s get dirty with word embeddings
- Performant deep learning layer for NLP: The Recurrent Layer
- Using all that to make money
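To make the word-embeddings bullet concrete, here is a toy illustration (my own, with invented 3-dimensional vectors; real embeddings such as word2vec or GloVe have hundreds of dimensions and are learned from text): words as dense vectors, with cosine similarity as the notion of relatedness.

```python
# Words as vectors: related words point in similar directions, so their
# cosine similarity is high. The vectors below are invented for the example.
import math

embeddings = {
    "stock":  (0.9, 0.8, 0.1),
    "share":  (0.85, 0.75, 0.2),
    "banana": (0.1, 0.0, 0.9),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(embeddings["stock"], embeddings["share"]))   # high: related words
print(cosine(embeddings["stock"], embeddings["banana"]))  # low: unrelated words
```

This geometric view of meaning is what lets NLP pipelines feed text into the recurrent layers mentioned above.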
Lukasz Cmielowski joined IBM in 2008 and in 2009 successfully defended his PhD dissertation in bioinformatics. Since then he has worked as a QA architect focused on bringing AI into old-school software quality activities; his domain was the automation of software failure prediction. In 2015 he joined a new team working on analytics solutions as an automation architect and data scientist. He is also a big fan of Norman Davies’ and Terry Pratchett’s book series.
From Spark MLlib model to learning system with Watson Machine Learning
A biomedical company that produces heart drugs has collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications.
Based on treatment records they would like to predict the best drug for a given patient. They also need to ensure that their prediction model is always up to date, providing the highest possible quality of predictions.
During this session I will demonstrate how a continuous learning system (part of Watson Machine Learning) can be used to achieve those goals.
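The continuous-learning pattern itself can be sketched in a few lines. This is emphatically not the Watson Machine Learning API, just an illustration of the underlying idea with an invented toy model and data: monitor prediction quality on fresh feedback and retrain when it degrades.

```python
# Continuous learning in miniature: evaluate the deployed model on each new
# batch of labelled feedback; if quality drops below a threshold, retrain on
# all data seen so far. Model and records are toy stand-ins.
from collections import Counter

def accuracy(model, records):
    return sum(model(x) == y for x, y in records) / len(records)

def train(records):
    """Toy 'model': always predict the most common drug in the records."""
    majority = Counter(y for _, y in records).most_common(1)[0][0]
    return lambda x: majority

history = [({"age": 30}, "drugA"), ({"age": 40}, "drugA"), ({"age": 50}, "drugB")]
model = train(history)

def on_new_feedback(batch, threshold=0.7):
    """Called as new treatment outcomes arrive; refreshes a degraded model."""
    global model
    history.extend(batch)
    if accuracy(model, batch) < threshold:
        model = train(history)   # redeploy a freshly trained model
        return "retrained"
    return "kept"

# New patients mostly responded to drugB, so the old model degrades.
print(on_new_feedback([({"age": 55}, "drugB"), ({"age": 60}, "drugB")]))
```

A production system would of course version models and validate before redeploying, but the feedback-evaluate-retrain loop is the core of keeping predictions up to date.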
Maciej has built large-scale data analytics and AI products in both research and industry, previously at DERI/INSIGHT Galway, currently as Chief Data Scientist at Altocloud. He is the founder of the Galway Data Meetup with over 250 members and received a number of awards for his work. He was shortlisted as one of the four finalists of the DatSci 2016 competition in the Data Scientist of the Year category.
Artificial Intelligence, creating value with data-driven products, decision support systems, recommender systems
Topic: Building Successful Machine Learning Products
With recent advancements in the AI ecosystem, the entry barriers for utilisation of Machine Learning techniques are lower than ever. The growing availability of tools and platforms together with decreasing cost of computation, allows smaller teams to build ML products faster and add value in a number of industries, from self-driving cars to personal assistants. This not only creates new opportunities, but also poses a number of challenges related to the design of products that we interact with on a daily basis.
In this talk I will share a number of experiences and examples of products using Machine Learning, focusing on the common gaps as well as the key steps for designing a successful ML product. I will describe how techniques such as human-centered design or design thinking play an important role in choosing the right problem to solve with Machine Learning and in shaping the user experience when the algorithms fail to deliver. The second part of the talk will focus on the engineering challenges, including data collection, model training and deployment at scale.
Software Engineer. The last 10 years spent solving data problems at companies like Google, Sun and Base. Today a happy coder acting as Director of Data Science at AirHelp.
Life after the model
The convolutional neural network is ready, the F1 score is calculated and the ROC curve is drawn. So you have a model.
This tale is about what happens next: when and how to cleanse the model, how to push it into production, and what defines quality for the model.
I will show it using one of the projects we deployed here at AirHelp.
I will also try to share a few tips & tricks that help me manage Machine Learning projects.
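One concrete piece of "what defines quality for the model" is making the metric itself part of the deployment gate. The sketch below is hypothetical (not AirHelp's actual pipeline): compute F1 from raw predictions and only promote a model that beats an agreed bar.

```python
# F1 from scratch, plus a deployment gate built on it. The threshold and the
# data are invented for illustration.

def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def ready_for_production(y_true, y_pred, threshold=0.8):
    """Promote the model only if it clears the agreed quality bar."""
    return f1_score(y_true, y_pred) >= threshold

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(f1_score(y_true, y_pred))              # tp=2, fp=1, fn=1 -> F1 = 2/3
print(ready_for_production(y_true, y_pred))  # 2/3 < 0.8 -> False
```

Turning "the model looks good" into an automated, thresholded check is one of the simplest ways to manage life after the model.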
Michał Kaczmarczyk (Tech Lead, Software Architect, Project Manager, PhD) leads a development team implementing a Spark-based, fully automated predictive modeling system in cooperation with NEC Laboratories America. Michał received his PhD from Warsaw University and has been exploring the field of distributed systems since 2005. He has worked for companies such as NEC Labs (Princeton, NJ), Microsoft (Redmond, WA) and 9LivesData (Warsaw, currently), working on core system components and publishing research papers at conferences such as FAST and SYSTOR. Since 2015 he has been devoted to Spark and charmed by Scala.
Marcin Kulka is a Senior Software Engineer at 9LivesData. In cooperation with machine learning researchers from NEC Labs America, he works on a Spark-based, fully automated predictive modelling system. He holds master’s degrees in both Computer Science and Mathematics from Warsaw University. His biggest areas of interest are big data, machine learning, distributed systems and algorithms. Marcin has almost 10 years of professional experience in software engineering, most of it spent working on HYDRAstor, a cutting-edge, distributed and highly scalable backup system. Privately he is a happy husband and father of two daughters.
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark
Building accurate machine learning models has been an art of data scientists: algorithm selection, hyperparameter tuning, feature selection and so on. Recently, attempts to break through this difficult art have started. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which packs 272 CPU cores, 2 TB of memory and 17 TB of SSD into a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from the reliability and stability standpoints.
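Stripped to the bone, the automated search the abstract describes looks like the loop below. This is my own hedged sketch, not the NEC/9LivesData system: enumerate candidate configurations, score each on validation data, keep the best; Spark's role is evaluating the candidates in parallel.

```python
# Model search in miniature: the "search space" here collapses to a family of
# one-feature threshold classifiers, but the loop is the same shape whether
# the candidates are thresholds or whole algorithm/hyperparameter combos.

def make_threshold_model(feature, cutoff):
    return lambda x: int(x[feature] > cutoff)

def validation_accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Toy labelled data: the label actually depends on the second feature.
data = [((0.1, 0.9), 1), ((0.2, 0.8), 1), ((0.9, 0.1), 0), ((0.8, 0.3), 0)]

# Candidate configurations: every (feature index, cutoff) pair.
candidates = [(f, c / 10) for f in (0, 1) for c in range(1, 10)]

# In a distributed system this max() is what gets parallelized.
best = max(candidates,
           key=lambda fc: validation_accuracy(make_threshold_model(*fc), data))
print(best, validation_accuracy(make_threshold_model(*best), data))
```

Because each candidate is scored independently, the evaluations are embarrassingly parallel, which is why the approach maps well onto a Spark cluster.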
Haskell developer in the Luna language team, changing the way people think about software development and data processing. Functional programming enthusiast, especially Haskell and Scala. Doctoral candidate at the Faculty of Computer Science of AGH University of Science and Technology, working on seamlessly integrating serverless architecture with visual and functional programming.
Luna – presentation
The talk is a presentation of Luna, a visual-textual programming language and environment for data processing. It showcases a novel paradigm for data processing and explains how strongly-typed, purely functional programming can be combined with a visual representation to help people create pipelines that are more intuitive, easier to comprehend and less error-prone. We demonstrate interactive examples and discuss the potential of such a paradigm to change the way data is processed across industries.
Main fields of interest: statistics, international academic cooperation, management. A professor at the University of California (USA), the Polish-American Higher School of Business – National-Louis University (Poland) and the Cracow University of Technology. A visiting professor in the USA, Mexico, France, Brazil, Ukraine, Kyrgyzstan and Sweden.
Management experience: Director of the Statistical Consulting Lab at the University of California (USA), owner of StatLab International Consulting (Poland), vice-rector for Research at WSB-NLU (Poland), a member of the Committee for monitoring MRPO, and director of research projects financed by NATO (three times) and the National Science Centre (Poland).
Professional distinctions: a triple laureate of NATO grants concerning the analysis and security of telecommunication signals (1995, 2000, 2008) and a laureate of the award for the best publication on statistical signal analysis, European Signal Processing Society, 2007.
Big Data and Data Analytics
The main goal of my presentation is to introduce the participants to the most novel concepts in computational statistics and their implications for a broader range of decisions based on Big Data. These days a data scientist has to work simultaneously on at least three different fronts. First, the data gathering plan, called the design. Designs become ever more useful as we are flooded with data: they help us focus on the aim of the study and on the reduction of complexity. Second, the software environment we choose to analyze our data. The competition here is quite strong, but the future will belong to open source solutions. The author of this presentation belongs to the club of R-philes, a worldwide community of data scientists willing to share their tools. Finally, the third front of the data scientist’s battle is the selection of an appropriate statistical algorithm. Here the Big Data revolution has dramatically changed the perspective: we now move much more audaciously to extremely high-dimensional data with new statistical tools. The concepts of the talk will be illustrated with examples from the author’s experience in signal processing, medical and financial data.
He is an entrepreneur, software engineer and machine learning practitioner. Currently he holds the position of CTO at Craftinity, a machine learning startup based in Krakow. He has been working on machine learning projects since graduating from university, first at Siemens, then at Craftinity. When he isn’t training neural networks, he reads a lot about history and economics, and plays football.
Explaining neural networks predictions
Recently, Deep Neural Networks have become superior in many machine learning tasks. However, they are more difficult to interpret than simpler models like Support Vector Machines or Decision Trees. One may say that neural nets are black boxes that produce predictions we cannot explain. Such a situation is not acceptable in industries like healthcare or law. In this talk, I will show known ways of understanding neural network predictions.
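One of the simplest such techniques, gradient-based saliency, can be illustrated in miniature. This sketch is mine, not from the talk: for a tiny logistic "network" with invented fixed weights, estimate how sensitive the output is to each input by finite differences; inputs with larger |gradient| influenced the prediction more.

```python
# Saliency by finite differences: perturb each input slightly and see how much
# the prediction moves. The weights are invented for the example; in a real
# deep net the gradient comes from backpropagation, not perturbation.
import math

WEIGHTS = [3.0, 0.1, -2.0]   # feature 0 pushes up, feature 2 pushes down

def predict(x):
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid output in (0, 1)

def saliency(x, eps=1e-6):
    """Numeric gradient of the prediction with respect to each input."""
    grads = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps
        grads.append((predict(bumped) - predict(x)) / eps)
    return grads

x = [1.0, 1.0, 1.0]
g = saliency(x)
ranked = sorted(range(len(x)), key=lambda i: -abs(g[i]))
print(ranked)   # feature indices ordered by influence: [0, 2, 1]
```

Ranking features by |gradient| gives a first answer to "why this prediction?", which is the question the talk explores for much larger networks.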
An algebraic topologist in the past, a JVM developer (Java, Scala, …) for more than 10 years, currently usually working as an architect/(lead) developer, though my roles vary from analysis to devops.
My main fields of interest are integration (Camel, OSGi), functional programming and stream processing systems (Akka, Kafka, Flink).
I also like to give talks at conferences – Confitura, JEEConf, VoxxedDays just to name a few.
Currently I’m the leader of TouK Nussknacker – a project which enables analysts to create streaming jobs with a friendly UI.
Stream processing in telco – case study based on Apache Flink & TouK Nussknacker
Stream processing has been one of the hypes of the last two years: Apache Flink, Spark Streaming and the like are conquering the world. We hear about quite a few interesting use cases, but most come from startups and technology companies – Netflix, Uber or Alibaba are good examples. I’d like to talk about a case which is a bit different.
Two years ago we helped to introduce Apache Flink in one of the largest mobile operators in Poland – at first to help with real time marketing. The data used included information from billing and signalling systems.
We wanted to enable analysts and semi-technical people to create and monitor processes, and that’s how Nussknacker – our open source GUI for Flink – was born. Today, many streaming jobs are created by analysts without the need for developer assistance. I’ll tell the story of this journey: what features of stream processing are important for the telco business, what barriers we see to Flink adoption in the enterprise, and what we consider to be its main selling points.
We have learnt that a common data model can be reused for different purposes – the most important one being real-time fraud detection.
Today we’re processing billions of events daily from more than a dozen sources, with more than 40 processes running in production.
I’ll also talk about our current architecture, where it seems applicable, and what our plans for the future are.
The target audience of this talk is developers, analysts and architects who are considering introducing stream processing in their organizations.
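To give a feel for the kind of job analysts assemble in a GUI like Nussknacker, here is the underlying stream operation in miniature (plain Python standing in for Flink, with invented events): a tumbling-window count per subscriber, a basic building block of real-time marketing and fraud rules.

```python
# Tumbling-window aggregation: bucket (timestamp, key) events into fixed
# non-overlapping windows and count occurrences per key in each window.
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group events into fixed windows of `window_seconds`, count per key."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        windows[ts // window_seconds][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(1, "alice"), (3, "alice"), (5, "bob"), (12, "alice"), (14, "bob")]
print(tumbling_window_counts(events, window_seconds=10))
# A fraud rule might alert when one key's count in a single window exceeds a limit.
```

In a real Flink job the same logic is expressed as `keyBy` plus a window operator running continuously over an unbounded stream, rather than over a finished list.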
Software Engineer in the Data Services section of the Beams Controls group at CERN. Since 2015 he has played a major role in the design and development of the next generation Accelerators Logging Service (NXCALS), which is based on state of the art Big Data technologies including Apache Spark and Kafka.
Nikolay is driven by the goal of making the lives of data scientists easier, providing native structured access to logged data and integrated tools for data analysis. He strives to deliver pragmatic solutions and is passionate about dealing with huge amounts of data and building distributed systems.
Next CERN Accelerator Logging Service Architecture
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store technical accelerator data needed by machine operators and data scientists at CERN. Gathered from thousands of devices across the whole accelerator complex, the data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
This presentation is a dive into the Hadoop/Spark based NXCALS architecture. Nikolay will speak about the service requirements, the design choices and present the Ingestion API as one of the main components of the system. He will also reveal the core abstraction behind the Meta-data provider and the Spark-based Extraction API where simple changes to the result schema improved the overall usability and performance of the system.
This talk may be of interest to any company or institute confronted with similar Big Data problems, as the system itself is not CERN-specific.
He has spent the last 15 years in the internet industry as an entrepreneur, advisor and board member of several companies. Founder and managing director of one of the biggest Polish software houses (grown from 2 to 200 employees); after an M&A process he exited by selling his shares, and he is now involved in creating the intelligent assistant Edward as a co-founder and CEO of 2040.io.
Co-founder of Krakow Artificial Intelligence Meetup Group, interested in modern user interfaces, and social aspects of artificial intelligence.
He was also a co-founder and board member of PROFEO, a community for professionals (a Polish LinkedIn competitor), founder of Techcamp, the biggest Polish technological barcamp meetings, and co-founder of the Ecommerce Directors’ Club.
What we’ve learned from creating Edward.ai
This will be a story about the creation of an AI-powered sales assistant. How did it all start, and what challenges have we faced during the last 18 months? How did we apply AI in our software, and what do our customers say about the usability of such a tool? And what are our plans for the near future, given the advancement of artificial intelligence?
Other sphere.it events
React.sphere.it is a conference focused on Reactive Programming and Reactive System Design. Now in its 2nd Edition, it’s a perfect opportunity to meet and share knowledge with experts in this field.
Scala.sphere.it is a unique event devoted to a topic important for every Scala Software Developer – Dev Tools.
Main venue - The Opera of Kraków
Day of practice - There will be several workshops, hackathons and training sessions on the 15th of April. More details soon
Code of Conduct
The following Code of Conduct is inspired by that from other prominent conferences such as ScalaDays or Scala eXchange.
DataSphere is dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, nationality, age or religion. We do not tolerate harassment of participants in any form. Please show respect for those around you. This applies to both in-person and online behavior.
All communication should be appropriate for a technical audience, including people of many different backgrounds. Sexual language, innuendo, and imagery are not appropriate for any conference venue, including talks.
If you are being harassed, notice that someone else is being harassed, or have any other concerns, please contact a member of staff immediately. If an individual engages in harassing behaviour, the DataSphere staff may take any action they deem appropriate, including warning the offender or expulsion from the event.
We expect all attendees to follow the above rules during our DataSphere Conference.