Aerodynamic Data Models: Flying Fast at Scale with DuckDB by Mike Driscoll
At Rill, we rely on DuckDB to power uniquely fast dashboards for exploring time-series metrics. To achieve this interactivity, Rill’s dashboards generate up to 100 parallel queries in response to each user interaction.
In this lightning talk, we’ll share a series of optimization and data modeling techniques that have been pivotal in achieving remarkably fast, sub-second response times using DuckDB.
Our primary tactics include employing parallel connections to facilitate simultaneous query processing and organizing data in chronological order to enhance the effectiveness of min-max indexes.
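A minimal sketch of both tactics in Python, assuming the duckdb package and an illustrative events table: cursors cloned from one connection run queries in parallel threads, and storing rows in timestamp order keeps DuckDB's per-row-group min/max (zone map) statistics selective for time-range filters.

```python
import threading
import duckdb

con = duckdb.connect("metrics.db")
# Chronological ordering keeps min-max (zone map) stats tight per row group.
con.execute("""
    CREATE OR REPLACE TABLE events AS
    SELECT range AS ts, range % 10 AS dim, random() AS value
    FROM range(1000000)
    ORDER BY ts
""")

def run(query: str) -> None:
    # Each thread gets its own cursor; cursors share the database file
    # but execute independently, enabling parallel queries.
    print(con.cursor().execute(query).fetchall())

queries = [
    f"SELECT dim, avg(value) FROM events "
    f"WHERE ts BETWEEN {lo} AND {lo + 1000} GROUP BY dim"
    for lo in range(0, 10_000, 1_000)
]
threads = [threading.Thread(target=run, args=(q,)) for q in queries]
for t in threads:
    t.start()
for t in threads:
    t.join()
```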
AI Reality Checkpoint: The Good, the Bad, and the Overhyped by Maxime Beauchemin
In the past 18 months, artificial intelligence has not just entered our workspaces – it has taken over. As we stand at the crossroads of innovation and automation, it’s time for a candid reflection on how AI has reshaped our professional lives, and to talk about where it’s been a game changer, where it’s falling short, and what’s about to shift dramatically in the short term.
Since the release of ChatGPT in December 2022, I’ve developed a “first-reflex” to augment and accelerate nearly every task with AI.
Airflow 3 is coming! by Jarek Potiuk
This session is about the new major release of Airflow, which we plan to release in early 2025. It is the first major release since Airflow 2 in 2021, and it is the result of four years of improvements shipped as minor releases, a lot of listening to our users, and a changing industry. While Airflow remains the most important and strongest ETL/data orchestrator in use, with LLM/GenAI becoming a mainstream part of data orchestration and a wealth of workflows and tooling specializing in them, Airflow 3 aims to become the only true open-source, open-governance, enterprise-grade orchestration solution for all your batch-processing workflow needs.
Anatomy of a real-time analytics dashboard by Dunith Dhanushka
Typically, data visualization is the last mile in a data pipeline, as it presents insights in a way that users can easily understand. When insights are fresh and relevant, people can act on them in time.
However, implementing a visually appealing real-time dashboard is not as simple as it might seem. The challenges of collecting and processing data at scale, and of delivering metrics to users on time, make things difficult.
Apache Doris: an alternative lakehouse solution for real-time analytics by Mingyu Chen
Lakehouse is a big data solution that combines the advantages of data warehouses and data lakes, helping users perform fast data analysis and efficient data management on the data lake.
Apache Doris is an OLAP database for fast data analytics. It provides a self-managed table format for high-concurrency, low-latency queries, semi-structured data analytics, and complex ad-hoc queries, all using standard SQL. It can also query data in various lake formats such as Apache Hudi, Apache Iceberg, and Apache Paimon.
Bring streaming to AI: introducing Bytewax connectors by Laura Gutierrez Funderburk
As AI and machine learning become more integral to business operations and decision-making, the need for real-time data processing has never been more critical. Whether you’re monitoring live streams from edge devices, responding to events as they happen, or managing data in distributed databases, the ability to process and act on data in real-time can be the difference between success and irrelevance.
In this talk, we’ll showcase how connecting to real-time data sources can create more responsive and adaptive AI systems within the Python ecosystem, using Bytewax.
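As a flavor of what this looks like, here is a minimal Bytewax dataflow sketch (assuming the bytewax package with its 0.18-style operator API; the TestingSource and the doubling step are illustrative stand-ins for a real connector and a real model or feature pipeline):

```python
import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource

flow = Dataflow("ai_stream")
# TestingSource stands in for a real-time connector (e.g., Kafka or edge events).
events = op.input("events", flow, TestingSource([1, 2, 3, 4, 5]))
# Replace this map with feature extraction or model inference.
scored = op.map("score", events, lambda x: x * 2)
op.output("results", scored, StdOutSink())
# Run with: python -m bytewax.run this_module:flow
```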
Build a Great Business on Open Source without Selling Your Soul by Robert Hodges, Tatiana Krupenya & Peter Zaitsev
Open source software is fun to work on, but building a real business on it is tough. We’ve all read about commercial open source companies that switched to proprietary models or went out of business outright. The three CEOs on our panel have all built successful companies on open source without resorting to licensing rug-pulls or other fauxpen source tricks that disrespect users. We’ll discuss what worked for us, what didn’t, and how we balanced our belief in open source communities with making payroll every two weeks.
Build your AI Data Hub with Airbyte and MotherDuck by AJ Steers
Great AI applications start with great data. While DuckDB and MotherDuck are rapidly gaining traction for open-source AI and data engineering, PyAirbyte provides seamless and reliable data movement—directly in Python. In this session, we’ll show you how to combine these powerful tools to build a scalable data hub for GenAI applications and analytics, getting started in just minutes. We’ll conclude by demonstrating how you can build your next GenAI app directly in the database, all on a foundation of great data.
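A minimal sketch of the pattern, assuming the airbyte (PyAirbyte) package and a MotherDuck account; the source name, config, and database name are illustrative:

```python
import airbyte as ab
from airbyte.caches import MotherDuckCache

# Illustrative source: PyAirbyte installs connectors on demand.
source = ab.get_source("source-faker", config={"count": 1_000},
                       install_if_missing=True)
source.check()
source.select_all_streams()

# Load directly into MotherDuck, where GenAI apps can query the data.
cache = MotherDuckCache(database="ai_hub", api_key="<motherduck-token>")
result = source.read(cache=cache)
print(result.processed_records)
```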
Building a Thriving DevRel Program for OSS Projects by Anita Ihuman
Over the years, we have seen a good number of OSS projects with great features and services go unnoticed and often get overshadowed by their proprietary counterparts. This highlights a key point: having a brilliant open-source project alone isn’t enough. To truly grow an open source project, you need a vibrant developer community that uses, contributes to, and champions your project.
In this session, we’ll look at how to build a successful open source project through strategic DevRel practices.
Composable Data Platforms and The Rise of Data Platform Engineering by Nick Schrock
Every company has a data platform: the systems and tools that produce data. These platforms are critical to every modern business and can incur massive cost and complexity. In this talk we argue that data platforms must be composable and flexible, and that this necessitates a new skill: data platform engineering. We will then discuss how an advanced orchestrator and control plane is the essential technology for engineering a composable data platform, and how it in turn manages complexity, avoids vendor lock-in, enables end-to-end ownership for practitioner teams, and contains the costs of your data platform.
Creating an open-source data platform for Swiss Government by Michael Disteli & Micha Eichmann
Together with the local government of Bern, one of the biggest cantons in Switzerland, we created HelloDATA, a fully open-source data platform: https://github.com/kanton-bern/hellodata-be
We leveraged established open-source data tools such as Superset, dbt, Airflow, and JupyterHub to create an integrated, “one-stop-shop” data platform. With HelloDATA we are helping government agencies of all types better understand and utilize their data, generating value for themselves and their citizens.
In this session we would like to give insights into:
Designing a Lakehouse for product engineers by Zhou Sun
Open Lakehouses are among the most transformative innovations in big data, celebrated widely within data communities. Yet, for most product engineers, the lakehouse is off the radar. In this talk, we introduce Mooncake Labs and our mission to bridge this gap—connecting applications seamlessly with lakehouse capabilities. We’ll also dive into our open-source project, pg_mooncake, which empowers developers to build and manage lakehouse tables directly from within PostgreSQL.
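A deliberately hypothetical sketch of what that developer experience might look like, driven from Python with psycopg; the USING columnstore syntax is an assumption based on pg_mooncake's columnstore tables, so check the project README for the current API:

```python
import psycopg  # assumes a Postgres instance with pg_mooncake installed

with psycopg.connect("postgresql://localhost/app") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_mooncake;")
        # Assumption: a lakehouse-backed table is created via a table
        # access method, then used like any other Postgres table.
        cur.execute("""
            CREATE TABLE page_views (
                user_id bigint,
                url text,
                viewed_at timestamptz
            ) USING columnstore;
        """)
        conn.commit()
```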
Exploring Data Analysis in Time Series Databases by Aliaksandr Valialkin
Kubernetes has changed everything: not only the way we deploy our applications, but also how we monitor them, and how we collect, store, visualize, and alert on the time series data generated by monitoring systems.
What are the challenges in modern monitoring? Why have new-generation time series databases like VictoriaMetrics and Prometheus emerged? Why is there no SQL support in these databases? Why are Grafana dashboards so fancy? Join us as we explore these and many other questions related to the specifics of time series data analysis.
Flink for a non-JVM user, an introduction to PyFlink by Diptiman Raichaudhuri
Apache Flink has steadily established itself as the leader in stream processing technologies. With thousands of users implementing simple to advanced streaming use cases, the future of the Flink community looks bright.
While Apache Flink runs on the JVM, it offers non-JVM users a well-defined Python API, PyFlink, which helps Python developers build sophisticated stream processing jobs. Today, most data engineers, data scientists, and data analysts prefer Python as their main programming language for building complex use cases.
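To give a taste of the API, here is a minimal PyFlink DataStream job (assuming the apache-flink package; the from_collection source and the doubling map are illustrative):

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# A bounded collection stands in for a real source such as Kafka.
ds = env.from_collection([1, 2, 3, 4, 5], type_info=Types.INT())
# Python lambdas need explicit output types so Flink can serialize results.
doubled = ds.map(lambda x: x * 2, output_type=Types.INT())
doubled.print()

env.execute("pyflink_double_job")
```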
From Raw Data to Insights: Introduction to Data Preprocessing by Odeajo Israel
Data preprocessing is a crucial step in the machine learning pipeline. It involves cleaning, transforming, and preparing raw data to make it suitable for building and training machine learning models. In this hands-on workshop, attendees will gain a deep understanding of the importance of data preprocessing and learn essential techniques for working with real-world data.
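A compact sketch of the kind of preprocessing covered, using pandas and scikit-learn (the toy DataFrame is illustrative): imputing missing values, scaling numeric columns, and one-hot encoding categoricals in a single reusable pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, None, 47, 33],
    "income": [40_000, 55_000, None, 72_000],
    "city": ["Lagos", "Abuja", "Lagos", None],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing values
    ("scale", StandardScaler()),                    # zero mean, unit variance
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

X = prep.fit_transform(df)  # now suitable for model training
print(X.shape)
```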
Getting data materialization right by Gian Merlino
Materialization moves computation from query time to ingest time by creating specialized derived tables, or materialized views, that are simpler than the source tables and are geared towards supporting specific workloads. This is one of the most powerful and common techniques for speeding up OLAP workloads. You can implement materialization in various ways, including built-in “materialized view” or “projection” features in many databases, as well as with third-party stream processors and workflow orchestrators that sit outside the database.
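A minimal illustration of the idea using DuckDB from Python (DuckDB is just a convenient stand-in; the pattern applies to any OLAP store): the derived table is computed once at ingest time, so dashboard queries hit the small pre-aggregated table instead of the raw events.

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE events AS
    SELECT range % 50 AS page_id, range % 24 AS hour, 1 AS views
    FROM range(1000000)
""")

# Ingest time: materialize a rollup geared to the dashboard workload.
con.execute("""
    CREATE TABLE page_views_hourly AS
    SELECT page_id, hour, sum(views) AS views
    FROM events
    GROUP BY page_id, hour
""")

# Query time: the dashboard reads the small rollup, not the raw events.
print(con.execute(
    "SELECT hour, sum(views) FROM page_views_hourly GROUP BY hour ORDER BY hour"
).fetchall())
```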
How Open Source Marketing is Similar to Business Marketing by Arjun Sharda
Open source is an amazing place for developers to contribute to exciting new projects and sharpen their coding skills while at it. However, as an open source maintainer of about 30 projects myself, I’ve realized that open source marketing is very similar to business marketing. Both open source projects and companies aim to attract users. And just as 90% of businesses fail, so too do many open source projects fail or become abandoned.
Hydra Architecture: Orchestrating ML across clusters, regions, and clouds by Donny Greenberg
Increasingly, ML teams must satisfy requirements for executing workflows across multiple Kubernetes clusters, regions, or clouds. Challenges include constrained GPU availability within a single cloud, cloud credits or commitments across multiple providers, or strict data residency requirements. Traditional methods for replicating workflows across environments are both resource-intensive and operationally cumbersome.
In this talk, we propose a Hydra architecture, a novel approach where applications are deployed to a single cluster but can selectively take advantage of others.
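Since no two environments expose the same APIs, here is a deliberately hypothetical Python sketch (every name is invented for illustration) of the selection step at the heart of the idea: route each task to the first configured cluster with capacity, so the home cluster can spill over to others when GPUs run out.

```python
# Hypothetical sketch; all names are invented for illustration.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    region: str
    free_gpus: int

def pick_cluster(clusters: list[Cluster], gpus_needed: int) -> Cluster:
    # List order encodes preference (residency, cost, commitments);
    # the head is the "home" cluster, the rest are burst targets.
    for cluster in clusters:
        if cluster.free_gpus >= gpus_needed:
            return cluster
    raise RuntimeError("no configured cluster has capacity")

clusters = [
    Cluster("home-us", "us-east-1", free_gpus=0),   # GPUs exhausted
    Cluster("burst-eu", "eu-west-1", free_gpus=8),  # spill over here
]
print(pick_cluster(clusters, gpus_needed=4).name)   # -> burst-eu
```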
Ingesting and analyzing millions of events per second in real-time using open source tools by Javier Ramirez
Teams of all shapes and sizes benefit from near real-time analytics. In this session, I will present a project template that can serve as the foundation for building one such high-performing system, powered by Apache Kafka, QuestDB, Grafana OSS, and Jupyter Notebook.
The first step of a data pipeline is ingestion, and even though we could directly ingest into a fast database, I will use Apache Kafka to ingest data.
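For instance, a minimal Python producer for the ingestion step might look like this (assuming the confluent-kafka package and a local broker; the topic name and payload are illustrative):

```python
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

for i in range(1_000):
    event = {"sensor_id": i % 10, "ts": time.time(), "value": i * 0.1}
    # Keying by sensor keeps each sensor's events ordered within a partition.
    producer.produce("sensor-events", key=str(event["sensor_id"]),
                     value=json.dumps(event))

producer.flush()  # block until all messages are delivered
```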
Leveraging Argo Events and Argo Workflows for Scalable Data Ingestion by Siri Varma Vegiraju
As Argo Workflows and Argo Events continue to gain popularity for their powerful capabilities in event-driven automation and complex job orchestration, this presentation will delve into how we used this architecture to process millions of records daily.
You will gain insights into the specific architecture that integrates Argo Events and Argo Workflows to achieve efficient data aggregation and ingestion. We will discuss the challenges encountered during this process and share the strategies we employed to overcome these issues.
Leveraging Data Streaming Platform for Analytics and GenAI by Jun Rao
Apache Kafka is becoming the standard for integrating all information within an enterprise. This gives each enterprise an opportunity to act on what’s happening in its business in real time. One common use case is to ingest this data into a data lake for analytics. I will show how integrating Kafka with Apache Iceberg can make analytics much easier. Another rising use case is GenAI.
Low latency Change Data Capture (CDC) to your data lake, using Apache Flink and Apache Paimon by Ali Alemi & Subham Rakshit
Change Data Capture (CDC) from source databases to a data lake is critical for analytics workloads. However, different CDC mechanisms exist, and each approach has trade-offs. Open table formats get you around record-level upserts and deletes, but data compaction, schema evolution, and Merge-on-Read latency remain big challenges. In this session, we will share how you can use Apache Paimon and Apache Flink to build a CDC pipeline that overcomes these challenges and syncs CDC data with low latency.
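As a rough sketch of the moving parts (assuming PyFlink with the Paimon and MySQL CDC connector jars on the classpath; hostnames, credentials, and table names are illustrative):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Paimon catalog backed by a warehouse path.
t_env.execute_sql("""
    CREATE CATALOG paimon WITH (
        'type' = 'paimon',
        'warehouse' = 'file:///tmp/paimon'
    )
""")

# CDC source over MySQL (via the flink-cdc connector).
t_env.execute_sql("""
    CREATE TEMPORARY TABLE orders_cdc (
        id BIGINT, amount DECIMAL(10, 2), PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'localhost', 'port' = '3306',
        'username' = 'cdc', 'password' = '...',
        'database-name' = 'shop', 'table-name' = 'orders'
    )
""")

# Continuously sync changes into a Paimon table.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS paimon.`default`.orders (
        id BIGINT, amount DECIMAL(10, 2), PRIMARY KEY (id) NOT ENFORCED
    )
""")
t_env.execute_sql("INSERT INTO paimon.`default`.orders SELECT * FROM orders_cdc")
```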
Managing your repo with AI — What works, and why open-source will win by Evan Rusackas
Maintaining an OSS repository is hard. Scaling contributors is nigh impossible. As AI platforms proliferate, let’s take a look at tools you should and shouldn’t leverage in automating your GitHub repo, Slack workspace, and more! We’ll also talk about why Open Source Software stands to benefit more from this revolution than private/proprietary codebases.
Modern LLMs with Graph DB - Exploring boundaries with text-2-cypher by Diptiman Raichaudhuri
In this lightning talk, Diptiman will present techniques for generating text-to-Cypher analytical queries with the help of modern large language models like GPT-4o and Claude 3. The session dives deep into graph database querying and how LLMs help developers run analytical queries on popular graph databases like Neo4j and Amazon Neptune.
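The core loop is simple enough to sketch (assuming the openai and neo4j Python packages; the prompt and schema are illustrative, and production systems add schema grounding and query validation on top):

```python
from neo4j import GraphDatabase
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Which 5 customers placed the most orders?"
schema = "(:Customer {name})-[:PLACED]->(:Order {id})"

# Step 1: ask the LLM to translate the question into Cypher.
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": f"Translate questions to Cypher for this schema: {schema}. "
                    "Reply with Cypher only."},
        {"role": "user", "content": question},
    ],
)
cypher = resp.choices[0].message.content.strip()

# Step 2: execute the generated query against Neo4j.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(cypher):
        print(record)
```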
Move Data Not Drama: Simplifying your Workflow with Meltano by Taylor Murphy
Meltano is a powerful open-source data movement tool that has revolutionized the way organizations handle their data pipelines at scale. Before the Analytics Development Lifecycle was even a thing, Meltano was working to bring software engineering best practices to data teams. This session will explore how Meltano addresses common data management challenges with a single, unified, and customizable platform.
Observability for Large Language Models with OpenTelemetry by Guangya Liu & Nir Gazit
Large Language Models (LLMs) mark a transformative advancement in artificial intelligence. These models are trained on vast datasets comprising text and code, enabling them to handle complex tasks such as text generation, language translation, and interactive querying.
As LLMs continue to integrate into various applications ranging from chatbots and search engines to creative writing aids, the need to monitor and comprehend their behaviors intensifies.
Observability plays a crucial role in this context.
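A minimal sketch with the OpenTelemetry Python SDK, wrapping an LLM call in a span and recording prompt and completion metadata as attributes (the attribute keys and the stubbed model call are illustrative; projects like OpenLLMetry standardize such conventions):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-app")

def call_llm(prompt: str) -> str:
    return "stub completion"  # stand-in for a real model call

with tracer.start_as_current_span("llm.completion") as span:
    prompt = "Summarize our Q3 incident report."
    span.set_attribute("llm.model", "gpt-4o")          # illustrative keys
    span.set_attribute("llm.prompt.length", len(prompt))
    answer = call_llm(prompt)
    span.set_attribute("llm.completion.length", len(answer))
```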
Observable Framework: a new open-source static site generator to get data past the last mile by Allison Horst
Data teams often have established workflows to access data from different sources and to process and analyze it, but they can be stymied by the “last mile problem” in data: creating rich, fast, and fully customized apps and dashboards. Closed-source, GUI-based BI options pose challenges for data visualization developers, limiting them to pre-built data integrations, out-of-the-box chart components and layouts, and restricted publishing options.
Observable Framework is a new open-source static site generator, command line tool, and local preview server.
Open Source and the Data Lakehouse by Alex Merced
The open data lakehouse gives those frustrated with the costs and complex pipelines of traditional warehouses an alternative that delivers performance with affordability and simpler pipelines. In this talk, we’ll discuss the technologies that are making the open data lakehouse possible.
In this talk we will learn:
- What is a data lakehouse
- What are the components of a data lakehouse
- What is Apache Arrow
- What is Apache Iceberg
- What is Project Nessie
Open Source State of the Union by Ali LeClerc, Alyssa Wright & Josep Prat
Making big investments in open source software? We thought so. 2024 has been a tumultuous year for open source projects with relicensing, un-relicensing, and other adventures. Our panel of in-the-trenches experts will opine on what’s going well for users, what are the current trainwrecks, and what’s plain fun to watch. Get tips to protect your existing apps and see new opportunities. Best of all, we’ll talk about how you can make open source work better for everyone.
Open Source Success: Learnings from 1 Billion Downloads by Avi Press
In this talk, we share the results of an in-depth analysis of data gathered from over 1 billion open source package downloads across more than 2000 diverse projects on Scarf. Our findings offer valuable insights into how users behave and interact with open source software, insights that are essential for maintainers, founders, and executives at open source companies.
During the presentation, we delve deep into our data, uncovering the best practices employed by successful open source projects.
pg_duckdb: Adding analytics to your application database by Boaz Leskes
PostgreSQL is the fastest-growing open source transactional database. DuckDB is the fastest-growing open source analytical one. pg_duckdb is a new Postgres extension that brings the two together and lets you run analytical queries on your Postgres instance with the full performance of DuckDB.
This talk will show how to use pg_duckdb to do analytics over your application data, lakehouse data, and to scale it to the cloud via MotherDuck.
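A brief sketch of what that can look like from Python (assuming a Postgres instance with the pg_duckdb extension installed; the force_execution setting follows the project's README and is worth verifying against the current docs, and the orders table is illustrative):

```python
import psycopg

with psycopg.connect("postgresql://localhost/app") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_duckdb;")
        # Route this session's queries through DuckDB's execution engine.
        cur.execute("SET duckdb.force_execution = true;")
        cur.execute("""
            SELECT customer_id, sum(amount)
            FROM orders              -- an ordinary Postgres table
            GROUP BY customer_id
            ORDER BY 2 DESC
            LIMIT 10;
        """)
        print(cur.fetchall())
```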
Presto Native Engine at Meta and IBM by Aditi Pandit & Amit Dutta
Presto 2.0 is a full rewrite of the Presto query execution engine (https://prestodb.io/). The goal is to bring a 3-4x improvement in Presto performance and scalability by moving from the old Java implementation to a modern C++ one. This move towards native execution aligns with industry initiatives like Databricks Photon and Apache DataFusion, among others. We are very excited to bring this technology to Presto to make it the best Open Data Lakehouse engine in the market.
Real-Time Games Analytics and Leaderboard with RisingWave, Kafka, and Superset (Preset) by Fahad Shah
In this session, we will set up a real-time pipeline using Kafka, RisingWave, and Superset in Preset. We will ingest player-related data into a Kafka topic and configure RisingWave to consume this data, creating materialized views for real-time analysis. With RisingWave and Superset, we can generate real-time visual dashboards, set up alerts, and create reports, enabling us to monitor player performance, build real-time leaderboards, and analyze game trends as they unfold.
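A condensed sketch of the RisingWave side (SQL is issued over the Postgres wire protocol, here via psycopg; the Kafka connector options follow RisingWave's documented syntax but the topic, schema, and connection details are illustrative):

```python
import psycopg

# RisingWave speaks the Postgres protocol (default port 4566).
with psycopg.connect("postgresql://root@localhost:4566/dev") as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        # Consume player events from a Kafka topic.
        cur.execute("""
            CREATE SOURCE IF NOT EXISTS player_events (
                player_id int, score int, event_ts timestamptz
            ) WITH (
                connector = 'kafka',
                topic = 'player-events',
                properties.bootstrap.server = 'localhost:9092'
            ) FORMAT PLAIN ENCODE JSON
        """)
        # Incrementally maintained leaderboard, ready for Superset to chart.
        cur.execute("""
            CREATE MATERIALIZED VIEW IF NOT EXISTS leaderboard AS
            SELECT player_id, sum(score) AS total_score
            FROM player_events
            GROUP BY player_id
        """)
```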
Replicating data between transactional databases and ClickHouse® by Kanthi Subramanian & Arnaud Adant
Transactional databases like MySQL, PostgreSQL, and MongoDB are usually not a great fit for real-time analytics, especially as data volume grows.
They are also less space-efficient than columnar databases and require regular purging or archiving. This talk presents a solution for synchronizing data in real time between MySQL and ClickHouse. The Altinity Sink Connector open source project (https://github.com/Altinity/clickhouse-sink-connector) is designed to replicate data and schema changes efficiently, with accuracy, operational simplicity, and performance in mind.
Restaurants or Food Trucks? Mobile Analytic Databases and the Real-Time Data Lake by Robert Hodges
Cloud data warehouses are the dominant life form for modern analytic systems. They work like restaurants where users visit to feed on data. Larger data sets, AI, and user decisions to keep information in their own data lakes are undermining the restaurant model. What we need now is food trucks that move anywhere users need them. The food truck metaphor helps us envision a powerful new analytic system: the real-time data lake.
Running ClickHouse® in Production – Lessons Learned by Noa Baron
When we chose ClickHouse as our main data lake for analytics at Cato Networks, we envisioned it as a silver bullet solution for our data needs, promising effortless data ingestion and ready-to-query dashboards.
However, the journey from that initial setup to our current, sophisticated data platform has been filled with trials and tribulations, alongside valuable lessons.
We first used ClickHouse as a black box magical persistence layer, simply feeding data points and querying ready-made GraphQL datasets.
SQL Window Functions - An Introduction by Dave Stokes
Window functions allow you to group rows in a table for in-depth investigation when you are analyzing data in a relational database. Structured Query Language is fantastic for retrieving data, but once you have that data, you need a way to classify it. Window functions provide a way to obtain sales totals to the current date, group time spent by department, or calculate running totals over clusters of data points. We will start with the basics by defining what a window can be, then proceed to rankings, calculating quartiles, and how to include aggregate functions.
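A small taste of the pattern, using Python's built-in sqlite3 module (which supports window functions since SQLite 3.25; the sales table is illustrative): a running total per department.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day TEXT, dept TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2024-01-01", "toys", 100), ("2024-01-02", "toys", 150),
    ("2024-01-01", "books", 80), ("2024-01-02", "books", 120),
])

# Running total per department: the window is "rows in my department,
# ordered by day, up to and including the current row".
rows = con.execute("""
    SELECT day, dept, amount,
           SUM(amount) OVER (
               PARTITION BY dept ORDER BY day
           ) AS running_total
    FROM sales
    ORDER BY dept, day
""").fetchall()
for row in rows:
    print(row)
```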
Uncover insights in your complex data with graph visualization by Andrew Madson
Finding hidden relationships is the key to unlocking insights. Traditional charts and graphs fall short when visualizing complex, interconnected data. This presentation will take you into the world of graph visualization, a powerful technique for unveiling hidden patterns, dependencies, and anomalies in your data.
We will explore:
- Graph Fundamentals: An introduction to graph theory concepts, including nodes, edges, and different types of graphs, providing a foundation for understanding graph visualization.
- Open-Source Graph Visualization Tools: A showcase of popular open-source libraries and tools like NetworkX, PyVis, and Gephi, demonstrating their capabilities for creating interactive and informative graph visualizations. A minimal example follows this list.
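To make the fundamentals concrete, here is a small NetworkX sketch (assuming the networkx and matplotlib packages; the toy social graph is illustrative): centrality surfaces the node bridging two clusters, and the drawing encodes degree as node size.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Build a small undirected graph: nodes are people, edges are interactions.
G = nx.Graph()
G.add_edges_from([
    ("ana", "bo"), ("ana", "cy"), ("bo", "cy"),
    ("cy", "dee"), ("dee", "eli"),
])

# Betweenness centrality reveals hidden structure: cy bridges the clusters.
print(nx.betweenness_centrality(G))

# Draw the graph; node size encodes degree.
nx.draw(G, with_labels=True,
        node_size=[300 * G.degree(n) for n in G.nodes])
plt.show()
```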
Unified Data Management with ClickHouse® and Postgres by Shivji Kumar Jha & Sachidananda Maharana
Given the AI hype, organisations want to capture every data point, which quickly results in collecting extensive data from analytics, user interactions, transactions, metrics, logs, and time series. Given the different shapes, volumes, scales, and consistency and availability requirements, this evolves into many data stores to explore, manage, and nurture. However, relying on multiple specialized databases increases operational costs, burdens developers with cognitive overload, necessitates various dashboards, and eventually demands significant expertise spread across teams.
Vector Search in Modern Databases by Peter Zaitsev
In this talk, we’ll explore the emergent landscape of vector search in databases, a paradigm shift in information retrieval. Vector search, traditionally the domain of specialized systems, is now being integrated into mainstream databases and search engines like Lucene, Elasticsearch, Solr, PostgreSQL, MySQL, MongoDB, and Manticore. This integration marks a significant evolution in handling complex data structures and search queries.
- Introduction to vectors and embeddings in databases: definition and significance of vectors and embeddings.
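At its core, vector search ranks rows by distance between embeddings. A minimal NumPy sketch of cosine-similarity retrieval (the random vectors are stand-ins for real model embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = ["intro to sql", "vector indexes", "sharding basics"]
doc_vecs = rng.normal(size=(len(docs), 8))  # stand-in embeddings
query_vec = rng.normal(size=8)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Brute-force scan; real databases use ANN indexes (HNSW, IVF) to scale.
scores = [cosine(query_vec, v) for v in doc_vecs]
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:+.3f}  {docs[i]}")
```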
Zero-instrumentation observability based on eBPF by Nikolay Sivko
Observability is a critical aspect of any infrastructure, as it enables teams to promptly identify and address issues. Nevertheless, achieving system observability comes with its own set of challenges: it is a time- and resource-intensive process, as it requires incorporating instrumentation into every application.
In this talk, we will delve into the gathering of telemetry data, including metrics, logs, and traces, using eBPF. We will explore tracking various container activities, such as network calls and filesystem operations.
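As a flavor of the approach, here is a classic minimal BCC example (assuming the bcc Python package and root privileges) that traces process execution kernel-side, with no changes to the applications being observed:

```python
from bcc import BPF

# eBPF program: fires on every execve syscall, in the kernel,
# without touching the applications themselves.
prog = r"""
int trace_execve(void *ctx) {
    bpf_trace_printk("execve observed\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_execve")
print("Tracing execve... Ctrl-C to stop")
b.trace_print()  # stream kernel trace output to stdout
```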