List of OSA Con 2024 sessions

Aerodynamic Data Models: Flying Fast at Scale with DuckDB

by Mike Driscoll
At Rill, we rely on DuckDB to power uniquely fast dashboards for exploring time-series metrics. To achieve this interactivity, Rill’s dashboards generate up to 100 parallel queries in response to each user interaction. In this lightning talk, we’ll share a series of optimization and data modeling techniques that have been pivotal in achieving sub-second response times using DuckDB. Our primary tactics include using parallel connections to process queries simultaneously and organizing data in chronological order to make min-max indexes more effective.
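To illustrate why chronological ordering helps min-max indexes, here is a minimal sketch in plain Python. It is a toy model, not DuckDB's or Rill's implementation: the block size, the synthetic timestamps, and the helper names `build_zone_map` and `blocks_to_scan` are all assumptions made for illustration.

```python
import random


def build_zone_map(values, block_size):
    """Return one (min, max) pair per fixed-size block of values."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]


def blocks_to_scan(zone_map, lo, hi):
    """Count blocks whose [min, max] range overlaps the filter range [lo, hi]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= lo and bmin <= hi)


# Hypothetical event timestamps: 10,000 rows stored in blocks of 100 rows.
timestamps = list(range(10_000))
shuffled = timestamps[:]
random.Random(0).shuffle(shuffled)  # same rows, stored in arbitrary order

sorted_map = build_zone_map(timestamps, 100)
shuffled_map = build_zone_map(shuffled, 100)

# A narrow time filter, comparable to WHERE ts BETWEEN 2500 AND 2699.
sorted_scan = blocks_to_scan(sorted_map, 2500, 2699)
shuffled_scan = blocks_to_scan(shuffled_map, 2500, 2699)

print(sorted_scan, "of", len(sorted_map), "blocks scanned when sorted")
print(shuffled_scan, "of", len(shuffled_map), "blocks scanned when shuffled")
```

When rows are stored in time order, each block covers a narrow time range, so a time filter can skip almost every block. When the same rows are stored in arbitrary order, nearly every block's range overlaps the filter and little can be skipped.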

AI Reality Checkpoint: The Good, the Bad, and the Overhyped

by Maxime Beauchemin
In the past 18 months, artificial intelligence has not just entered our workspaces – it has taken over. As we stand at the crossroads of innovation and automation, it’s time for a candid reflection on how AI has reshaped our professional lives, and to talk about where it’s been a game changer, where it’s falling short, and what’s about to shift dramatically in the short term. Since the release of ChatGPT in December 2022, I’ve developed a “first-reflex” to augment and accelerate nearly every task with AI.

Airflow 3 is coming!

by Jarek Potiuk
This session is about the new major release of Airflow, which we plan to ship early in 2025. It is the first major release of Airflow since 2021, when we released Airflow 2, and it is the result of 4 years of improvements we’ve implemented as minor releases, as well as a lot of listening to our users and to a changing industry. While Airflow remains the most important and strongest ETL/data orchestrator in use, with LLM/GenAI becoming a mainstream part of data orchestration and a wealth of workflows and tooling specialising in those, Airflow 3 is aiming to become the only true open-source, open-governance, enterprise-strength orchestration solution for all your batch processing workflow needs.

Anatomy of a real-time analytics dashboard

by Dunith Dhanushka
Typically, data visualization is the last mile in a data pipeline, as it presents insights in a way that is easily understood by users. When insights are fresh and relevant, humans can act upon them in time. However, implementing a visually appealing real-time dashboard is not as simple as it may seem. The challenges of collecting and processing data at scale, and of delivering the metrics to users on time, make things difficult.

Apache Doris: an alternative lakehouse solution for real-time analytics

by Mingyu Chen
The lakehouse is a big data solution that combines the advantages of a data warehouse and a data lake, helping users perform fast data analysis and efficient data management on the data lake. Apache Doris is an OLAP database for fast data analytics. It provides a self-managed table format for high-concurrency and low-latency queries, semi-structured data analytics, and complex ad-hoc queries, all using standard SQL. It can also query data from various lake formats such as Apache Hudi, Apache Iceberg, Apache Paimon, etc.

Bring streaming to AI: introducing Bytewax connectors

by Laura Gutierrez Funderburk
As AI and machine learning become more integral to business operations and decision-making, the need for real-time data processing has never been more critical. Whether you’re monitoring live streams from edge devices, responding to events as they happen, or managing data in distributed databases, the ability to process and act on data in real-time can be the difference between success and irrelevance. In this talk, we’ll showcase how connecting to real-time data sources can create more responsive and adaptive AI systems within the Python ecosystem, using Bytewax.

Build a Great Business on Open Source without Selling Your Soul

by Robert Hodges, Tatiana Krupenya & Peter Zaitsev
Open source software is fun to work on, but building a real business is tough. We’ve all read about commercial open source companies that switched to proprietary models or flat-out went out of business. Our panel of three CEOs have all built successful companies on open source without resorting to licensing rug-pulls or other fauxpen source tricks that disrespect users. We’ll discuss what worked for us, what didn’t, and how we balanced our belief in open source communities with making payroll every two weeks.

Building a Thriving DevRel Program for OSS Projects

by Anita Ihuman
Over the years, we have seen a good number of OSS projects with great features and services go unnoticed and often get overshadowed by their proprietary counterparts. This highlights a key point: having a brilliant open-source project alone isn’t enough. To truly grow an open source project, you need a vibrant developer community that uses, contributes to, and champions your project. In this session, we’ll look at how to build a successful open source project through strategic DevRel practices.

Building your AI Data Hub with PyAirbyte and Iceberg

by Michel Tricot
To provide great results, AI applications need access to great data. While Iceberg is quickly becoming the gold standard for cloud data storage, PyAirbyte makes it easy to reliably move data from anywhere to anywhere else, directly in Python. We’ll show you how to combine these two tools and build a scalable data hub for GenAI applications and analytics - getting started in minutes building out your own AI data hub.

Creating an open-source data platform for Swiss Government

by Michael Disteli & Micha Eichmann
Together with the local government of one of the biggest cantons in Switzerland, Bern, we created the fully open-source data platform HelloDATA (https://github.com/kanton-bern/hellodata-be). We leveraged established open-source data tools such as Superset, DBT, Airflow and JupyterHub to create an integrated, “one-stop-shop” data platform. Using HelloDATA we are driving government agencies of all types to better understand and utilize their data and generate value for themselves and their citizens. In this session we would like to give insights into:

Exploring Data Analysis in Time Series Databases

by Aliaksandr Valialkin
Kubernetes has changed everything. Not only the way we deploy our applications. But also how we monitor them, how we collect, store, visualize, and alert on time series data generated by monitoring systems. What are the challenges in modern monitoring? Why have new-generation time series databases like VictoriaMetrics and Prometheus emerged? Why is there no SQL support in these databases? Why are Grafana dashboards so fancy? Join us as we explore these and many other questions related to the specifics of time series data analysis.

Flink for a non-JVM user, an introduction to pyflink

by Diptiman Raichaudhuri

From Raw Data to Insights: Introduction to Data Preprocessing

by Odeajo Israel
Data preprocessing is a crucial step in the machine learning pipeline. It involves cleaning, transforming, and preparing raw data to make it suitable for building and training machine learning models. In this hands-on workshop, attendees will gain a deep understanding of the importance of data preprocessing and learn essential techniques for working with real-world data.

Getting data materialization right

by Gian Merlino
Materialization moves computation from query time to ingest time by creating specialized derived tables, or materialized views, that are simpler than the source tables and are geared towards supporting specific workloads. This is one of the most powerful and common techniques for speeding up OLAP workloads. You can implement materialization in various ways, including built-in “materialized view” or “projection” features in many databases, as well as with third-party stream processors and workflow orchestrators that sit outside the database.
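As a rough illustration of moving computation from query time to ingest time, the sketch below maintains a small rollup table alongside the raw table. It uses Python's built-in `sqlite3` module only to stay self-contained; the schema, the `daily_revenue` table, and the `ingest` helper are hypothetical and are not taken from the talk. The upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (ts TEXT, country TEXT, revenue REAL);
    -- Hypothetical derived table: one row per (day, country).
    CREATE TABLE daily_revenue (
        day TEXT, country TEXT, total REAL,
        PRIMARY KEY (day, country)
    );
""")


def ingest(ts, country, revenue):
    """Store the raw event and update the rollup in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?)", (ts, country, revenue)
        )
        conn.execute(
            """
            INSERT INTO daily_revenue VALUES (substr(?, 1, 10), ?, ?)
            ON CONFLICT (day, country)
            DO UPDATE SET total = total + excluded.total
            """,
            (ts, country, revenue),
        )


ingest("2024-11-19T10:00:00", "CH", 10.0)
ingest("2024-11-19T15:30:00", "CH", 5.0)
ingest("2024-11-20T09:00:00", "US", 7.5)

# Query time: read the small derived table instead of aggregating raw events.
rows = conn.execute(
    "SELECT day, country, total FROM daily_revenue ORDER BY day, country"
).fetchall()
print(rows)
```

The trade-off is extra work and storage on every write in exchange for cheaper reads, which pays off when the same aggregation is queried far more often than the data is ingested.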

How Open Source Marketing is Similar to Business Marketing

by Arjun Sharda
Open source is an amazing place for developers to contribute to exciting new projects and also sharpen their coding skills while at it. However, as an open source maintainer of about 30 projects myself, I’ve realized that open source marketing is very similar to business marketing. Both open source projects and companies aim to attract users. Just as 90% of businesses fail or become abandoned, the same can be said for open source projects.

Hydra Architecture: Orchestrating ML across clusters, regions, and clouds

by Donny Greenberg
Increasingly, ML teams must satisfy requirements for executing workflows across multiple Kubernetes clusters, regions, or clouds. Challenges include constrained GPU availability within a single cloud, cloud credits or commitments across multiple providers, or strict data residency requirements. Traditional methods for replicating workflows across environments are both resource-intensive and operationally cumbersome. In this talk, we propose a Hydra architecture, a novel approach where applications are deployed to a single cluster but can selectively take advantage of others.

Ingesting and analyzing millions of events per second in real-time using open source tools

by Javier Ramirez
Teams of all shapes and sizes benefit from near real-time analytics. In this session, I will present a project template that can serve as the foundation to build one such high performing system, powered by Apache Kafka, QuestDB, Grafana OSS, and Jupyter Notebook. The first step of a data pipeline is ingestion, and even though we could directly ingest into a fast database, I will use Apache Kafka to ingest data.

Leveraging Argo Events and Argo Workflows for Scalable Data Ingestion

by Siri Varma Vegiraju
As Argo Workflows and Argo Events continue to gain popularity for their powerful capabilities in event-driven automation and complex job orchestration, this presentation will delve into how we used this architecture to process millions of records daily. You will gain insights into the specific architecture that integrates Argo Events and Argo Workflows to achieve efficient data aggregation and ingestion. We will discuss the challenges encountered during this process and share the strategies we employed to overcome these issues.

Leveraging Data Streaming Platform for Analytics and GenAI

by Jun Rao
Apache Kafka is becoming the standard for integrating all information within an enterprise. This provides an opportunity for each enterprise to take action on what’s happening in its business in real time. One common use case is to take this data and ingest it into a data lake for analytics. I will show that integrating Kafka with Apache Iceberg can make analytics much easier. Another rising use case is GenAI.

Low latency Change Data Capture (CDC) to your data lake, using Apache Flink and Apache Paimon

by Ali Alemi & Subham Rakshit

Managing your repo with AI — What works, and why open-source will win

by Evan Rusackas
Maintaining an OSS repository is hard. Scaling contributors is nigh impossible. As AI platforms proliferate, let’s take a look at tools you should and shouldn’t leverage in automating your GitHub repo, Slack workspace, and more! We’ll also talk about why Open Source Software stands to benefit more from this revolution than private/proprietary codebases.

Modern LLMs with Graph DB - Exploring boundaries with text-2-cypher

by Diptiman Raichaudhuri
In this lightning talk, Diptiman will present the techniques used for text-2-cypher analytical query generation with the help of modern large language models like GPT-4o, Claude 3, etc. This session will dive deep into graph database querying and how LLMs help developers get analytical queries executed on popular graph databases like Neo4j and Amazon Neptune.

Observability for Large Language Models with OpenTelemetry

by Guangya Liu & Nir Gazit
Large Language Models (LLMs) mark a transformative advancement in artificial intelligence. These models are trained on vast datasets comprising text and code, enabling them to handle complex tasks such as text generation, language translation, and interactive querying. As LLMs continue to integrate into various applications ranging from chatbots and search engines to creative writing aids, the need to monitor and comprehend their behaviors intensifies. Observability plays a crucial role in this context.

Observable Framework: a new open-source static site generator to get data past the last mile

by Allison Horst
Data teams often have established workflows to access data from different sources, process, and analyze data, but can be stymied by the “last mile problem” in data: creating rich, fast, and fully customized apps and dashboards. Closed-source, GUI-based BI options pose challenges for data visualization developers, restricting them to pre-built data integrations, out-of-the-box chart components and layouts, and restricted publishing options. Observable Framework is a new open-source static site generator, command line tool, and local preview server.

Open Source Analytic Databases - Past, Present, and Future

by Robert Hodges
Rising floods of data and technology improvements ignited a Cambrian explosion of innovation in analytics that has lasted for decades. In this talk we’ll survey the arc of analytic databases and major trends influencing system architecture for tomorrow’s applications. Have we reached peak cloud data warehouse? Why is AI leading to disaggregation of analytic database systems? And what’s a real-time data lake? Most important, why are analytics just so darn fun to work on?

Open Source and the Data Lakehouse

by Alex Merced
The open data lakehouse offers those frustrated with the costs and complex pipelines of using traditional warehouses an alternative that offers performance with affordability and simpler pipelines. In this talk, we’ll be talking about technologies that are making the open data lakehouse possible. In this talk we will learn: What is a data lakehouse? What are the components of a data lakehouse? What is Apache Arrow? What is Apache Iceberg? What is Project Nessie?

Open Source Success: Learnings from 1 Billion Downloads

by Avi Press
In this talk, we share the results of an in-depth analysis of data gathered from over 1 billion open source package downloads across more than 2000 diverse projects on Scarf. Our findings offer valuable insights into user behaviors and interactions with open source software, making them essential for maintainers, founders, and executives in open source companies. During the presentation, we delve deep into our data, uncovering the best practices employed by successful open source projects.

pg_duckdb: Adding analytics to your application database

by Jordan Tigani
PostgreSQL is the fastest growing open source transactional database. DuckDB is the fastest growing open source analytical one. pg_duckdb is a new Postgres extension that brings the two together, and lets you run analytical queries on your Postgres instance, with the full performance of DuckDB. This talk will show how to use pg_duckdb to do analytics over your application data and lakehouse data, and to scale it to the cloud via MotherDuck.

Real-Time Games Analytics and Leaderboard with RisingWave, Kafka, and Superset (Preset)

by Fahad Shah
In this session, we will set up a real-time pipeline using Kafka, RisingWave, and Superset in Preset. We will ingest player-related data into a Kafka topic and configure RisingWave to consume this data, creating materialized views for real-time analysis. With RisingWave and Superset, we can generate real-time visual dashboards, set up alerts, and create reports, enabling us to monitor player performance, create real-time leaderboards, and analyze game trends in real-time.

Running ClickHouse in Production – Lessons Learned

by Noa Baron
When we chose ClickHouse as our main data lake for analytics at Cato Networks, we envisioned it as a silver bullet solution for our data needs, promising effortless data ingestion and ready-to-query dashboards. However, the journey from that initial setup to our current, sophisticated data platform has been filled with trials and tribulations, alongside valuable lessons. We first used ClickHouse as a black box magical persistence layer, simply feeding data points and querying ready-made GraphQL datasets.

SQL Window Functions - An Introduction

by Dave Stokes
Window Functions allow you to group rows in a table for in-depth investigation when you are analyzing data in a relational database. Structured Query Language is fantastic for retrieving data, but once you get that data, you need a way to classify it. Window Functions provide a way to obtain sales totals to the current date, group time spent by department, or calculate running totals over data point clusters. We will start with the basics by defining what a window can be and proceed into rankings, calculating quartiles, and how to include aggregate functions.
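For readers new to the topic, here is a small sketch of a running total and a ranking computed with window functions. It runs against an in-memory SQLite database through Python's built-in `sqlite3` module purely for self-containment (SQLite 3.25 or newer is required for window functions); the `sales` table and its rows are made up for illustration and are not from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dept TEXT, day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("books", "2024-11-01", 100),
        ("books", "2024-11-02", 50),
        ("books", "2024-11-03", 75),
        ("games", "2024-11-01", 200),
        ("games", "2024-11-02", 25),
    ],
)

# Each OVER clause defines a window: rows are partitioned by department,
# then ordered within the partition. Unlike GROUP BY, every input row
# is kept in the output.
query = """
    SELECT dept, day, amount,
           SUM(amount) OVER (PARTITION BY dept ORDER BY day) AS running_total,
           RANK() OVER (PARTITION BY dept ORDER BY amount DESC) AS rank_in_dept
    FROM sales
    ORDER BY dept, day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first window produces a per-department running total by day, and the second ranks each sale within its department by amount.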

Uncover insights in your complex data with graph visualization

by Andrew Madson
Finding hidden relationships is the key to unlocking insights. Traditional charts and graphs fall short when visualizing complex, interconnected data. This presentation will delve into the world of graph visualization, a powerful technique for unveiling hidden patterns, dependencies, and anomalies in your data. We will explore: Graph Fundamentals: An introduction to graph theory concepts, including nodes, edges, and different types of graphs, providing a foundation for understanding graph visualization. Open-Source Graph Visualization Tools: A showcase of popular open-source libraries and tools like NetworkX, PyVis, and Gephi, demonstrating their capabilities for creating interactive and informative graph visualizations.

Unified Data Management with ClickHouse and Postgres

by Shivji Kumar Jha & Tarun Annapareddy
Given the AI hype, organisations want to capture every data point, which very quickly results in capturing extensive data from analytics, user interactions, transactions, metrics, logs, and time series. Given different shapes, volumes, scale, consistency and availability requirements, this evolves into many data stores to explore, manage and nurture. However, relying on multiple specialized databases increases operational costs, burdens developers with cognitive overload, necessitates various dashboards, and eventually demands significant expertise spread across teams.

Vector Search in Modern Databases

by Peter Zaitsev
In this talk, we’ll explore the emergent landscape of vector search in databases, a paradigm shift in information retrieval. Vector search, traditionally the domain of specialized systems, is now being integrated into mainstream databases and search engines like Lucene, Elasticsearch, Solr, PostgreSQL, MySQL, MongoDB, and Manticore. This integration marks a significant evolution in handling complex data structures and search queries. Introduction to Vectors and Embeddings in Databases: Definition and significance of vectors and embeddings.

Zero-instrumentation observability based on eBPF

by Nikolay Sivko
Observability is a critical aspect of any infrastructure as it enables teams to promptly identify and address issues. Nevertheless, achieving system observability comes with its own set of challenges. It is a time- and resource-intensive process as it necessitates the incorporation of instrumentation into every application. In this talk, we will delve into the gathering of telemetry data, including metrics, logs, and traces, using eBPF. We will explore tracking various container activities, such as network calls and filesystem operations.