Cloud data warehouses are the dominant life form for modern analytic systems. They work like restaurants where users visit to feed on data. Larger data sets, AI, and user decisions to keep information in their own data lakes are undermining the restaurant model. What we need now is food trucks that move anywhere users need them. The food truck metaphor helps us envision a powerful new analytic system: the real-time data lake.
Apache Kafka is becoming the standard for integrating all information within an enterprise. This gives each enterprise an opportunity to act on what’s happening in its business in real time. One common use case is to ingest this data into a data lake for analytics. I will show how integrating Kafka with Apache Iceberg can make analytics much easier. Another rising use case is GenAI. I will show how we have extended Apache Flink to support real-time inference with GenAI.
In this session, we will set up a real-time pipeline using Kafka, RisingWave, and Superset in Preset. We will ingest player-related data into a Kafka topic and configure RisingWave to consume this data, creating materialized views for real-time analysis. With RisingWave and Superset, we can generate real-time visual dashboards, set up alerts, and create reports, enabling us to monitor player performance, build real-time leaderboards, and analyze game trends as they happen.
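As a minimal sketch of the RisingWave side of that pipeline (topic name, columns, and connection details are illustrative; RisingWave speaks the Postgres wire protocol, so psycopg2 works):

```python
import psycopg2

# RisingWave defaults: port 4566, user root, database dev.
conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
cur = conn.cursor()

# Consume the Kafka topic as a streaming source.
cur.execute("""
    CREATE SOURCE IF NOT EXISTS player_events (
        player_id INT,
        score INT,
        event_time TIMESTAMP
    ) WITH (
        connector = 'kafka',
        topic = 'player-events',
        properties.bootstrap.server = 'localhost:9092',
        scan.startup.mode = 'earliest'
    ) FORMAT PLAIN ENCODE JSON
""")

# The materialized view is maintained incrementally as events arrive;
# Superset can chart it like any Postgres table.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS leaderboard AS
    SELECT player_id, SUM(score) AS total_score
    FROM player_events
    GROUP BY player_id
""")
```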
Typically, data visualization is the last mile in a data pipeline, as it presents insights in a way that is easily understood by users. When insights are fresh and relevant, humans can act on them in time.
However, implementing a visually appealing real-time dashboard is not as simple as it might seem. The challenges of collecting and processing data at scale, and of delivering metrics to users on time, make things difficult.
In this talk, we deconstruct a real-time analytics dashboard into several layers: data collection, metrics computation, and insights serving. Then we take a real-world use case, an IoT dashboard, and build it from scratch while walking through each layer in detail, using open-source technology components for the implementation.
In the second half of the talk, we discuss the challenges in the process and find ways to mitigate them.
This talk is ideal for anyone interested in the practical application of real-time data processing and visualization. Attendees will gain a comprehensive understanding of each layer, its importance, and how the layers interact to create a seamless, real-time dashboard.
Given the AI hype, organisations want to capture every data point, which very quickly results in capturing extensive data from analytics, user interactions, transactions, metrics, logs, and time series. Given the different shapes, volumes, scale, consistency, and availability requirements, this evolves into many data stores to explore, manage, and nurture. However, relying on multiple specialized databases increases operational costs, burdens developers with cognitive overload, necessitates various dashboards, and eventually demands significant expertise spread across teams.
This talk proposes a unified data platform using just two databases: ClickHouse and Postgres. The Postgres protocol is quickly spreading to all kinds of distributed data systems, but the ecosystem outside OLTP is not yet mature. With ClickHouse and Postgres both able to query foreign data, we’ll demonstrate how ClickHouse can efficiently handle OLAP, metrics, logs, and transactional workloads. By consolidating these diverse workloads, organisations can achieve substantial cost savings, streamline resource utilisation, and simplify data management.
We will delve into strategies for integrating these functionalities, ensuring data isolation and smooth operations across teams. Real-world examples will highlight the effectiveness of ClickHouse and Postgres, showcasing how this unified approach enhances efficiency and reduces complexity.
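To make the cross-database idea concrete, here is a hedged sketch using ClickHouse’s postgresql() table function (hosts, credentials, and table names are illustrative):

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Query a Postgres table from inside ClickHouse via the postgresql()
# table function, joining it against a native ClickHouse events table.
result = client.query("""
    SELECT u.name, count() AS events
    FROM events AS e
    INNER JOIN postgresql('pg-host:5432', 'appdb', 'users', 'app', 'secret') AS u
        ON e.user_id = u.id
    GROUP BY u.name
    ORDER BY events DESC
    LIMIT 10
""")
for row in result.result_rows:
    print(row)
```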
At Rill, we rely on DuckDB to power uniquely fast dashboards for exploring time-series metrics. To achieve this interactivity, Rill’s dashboards generate up to 100 parallel queries in response to each user interaction.
In this lightning talk, we’ll share a series of optimization and data modeling techniques that have been pivotal in achieving remarkably fast, sub-second response times using DuckDB.
Our primary tactics include employing parallel connections to facilitate simultaneous query processing and organizing data in chronological order to enhance the effectiveness of min-max indexes. We also utilize enum types for more efficient handling of string column queries, along with configuration tuning. These approaches have collectively enabled us to enhance DuckDB’s capability to handle larger datasets (100+ GBs) with sub-second query responses.
We invite you to join us in this insightful session to discover how these optimizations can significantly improve your data processing and query performance in DuckDB.
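As a rough illustration of two of those tactics, enum-typed columns and chronological ordering (file paths and schema are illustrative):

```python
import duckdb

con = duckdb.connect("metrics.db")

# Tactic 1: enum-typed columns store low-cardinality strings as small
# integers, making filters and group-bys on them much cheaper.
con.execute("CREATE TYPE country AS ENUM ('US', 'DE', 'IN', 'BR')")
con.execute("""
    CREATE TABLE events (
        event_time TIMESTAMP,
        geo country,        -- VARCHAR values are cast to the enum on insert
        value DOUBLE
    )
""")

# Tactic 2: load data in chronological order so each row group's min-max
# (zonemap) metadata lets DuckDB skip row groups outside a time filter.
con.execute("""
    INSERT INTO events
    SELECT event_time, geo, value
    FROM read_parquet('raw/*.parquet')
    ORDER BY event_time
""")

# A time-bounded dashboard query now touches only the relevant row groups.
# (For parallel queries, open one connection or cursor per thread.)
print(con.execute("""
    SELECT date_trunc('minute', event_time) AS minute, count(*) AS n
    FROM events
    WHERE event_time >= now() - INTERVAL 1 HOUR
    GROUP BY minute ORDER BY minute
""").fetchall())
```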
Lakehouse is a big data solution that combines the advantages of data warehouses and data lakes, helping users perform fast data analysis and efficient data management on the data lake.
Apache Doris is an OLAP database for fast data analytics. It provides a self-managed table format for high-concurrency, low-latency queries, semi-structured data analytics, and complex ad-hoc queries, all using standard SQL. It can also query data in various lake formats such as Apache Hudi, Apache Iceberg, and Apache Paimon.
In this session, you will learn what Apache Doris is, what Doris can do for real-time analytics, and how to build a fast data analysis engine on a data lake. A sketch of the catalog setup follows the outline below.
- Introduction to Apache Doris
- Core features of Apache Doris
- Building a fast data analysis engine on a data lake
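As a rough sketch of Doris’s multi-catalog setup (connection details and catalog properties are illustrative; Doris’s frontend speaks the MySQL protocol, so a standard MySQL client works):

```python
# pip install pymysql; the Doris FE listens on the MySQL port (9030 by default).
import pymysql

conn = pymysql.connect(host="doris-fe", port=9030, user="root", password="")
cur = conn.cursor()

# Register an external Iceberg catalog (properties vary by catalog type;
# see the Doris multi-catalog docs for your environment).
cur.execute("""
    CREATE CATALOG IF NOT EXISTS iceberg_lake PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'rest',
        'uri' = 'http://rest-catalog:8181'
    )
""")

# Query lake tables with standard SQL, alongside Doris-native tables.
cur.execute("SELECT count(*) FROM iceberg_lake.sales.orders")
print(cur.fetchone())
```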
Finding hidden relationships is the key to unlocking insights. Traditional charts and graphs fall short when visualizing complex, interconnected data. This presentation will dive into the world of graph visualization, a powerful technique for unveiling hidden patterns, dependencies, and anomalies in your data.
We will explore:
Data preprocessing is a crucial step in the machine learning pipeline. It involves cleaning, transforming, and preparing raw data to make it suitable for building and training machine learning models. In this hands-on workshop, attendees will gain a deep understanding of the importance of data preprocessing and learn essential techniques for working with real-world data.
Presto 2.0 is a full rewrite of the Presto query execution engine (https://prestodb.io/). The goal is to bring a 3-4x improvement in Presto performance and scalability by moving from the old Java implementation to a modern C++ one. This move towards native execution aligns with industry initiatives like Databricks Photon and Apache DataFusion, among others. We are very excited to bring this technology to Presto to make it the best Open Data Lakehouse engine in the market.
Presto 2.0 has been in active development for about four years, and we now have production deployments at Meta and IBM. The project has a very active open-source community comprising engineers from Meta, Ahana/IBM, Uber, ByteDance, Pinterest, Intel, and Neuroblade, among others.
This session will give an overview of the project, its architecture, and our experiences launching it at Meta and benchmarking it with TPC-DS at IBM watsonx.data (https://www.ibm.com/blog/announcement/delivering-superior-price-performance-and-enhanced-data-management-for-ai-with-ibm-watsonx-data/).
Data teams often have established workflows to access, process, and analyze data from different sources, but can be stymied by the “last mile problem” in data: creating rich, fast, and fully customized apps and dashboards. Closed-source, GUI-based BI options pose challenges for data visualization developers, restricting them to pre-built data integrations, out-of-the-box chart components and layouts, and limited publishing options.
Observable Framework is a new open-source static site generator, command line tool, and local preview server. It’s files-based, so it integrates seamlessly into existing data workflows. Framework’s data loaders support back-end data processing in any programming language, bridging the gap between data teams and developers and improving app performance. And, when working in Framework, everything is created with code, which means developers can build fully customized, interactive graphics and pages without constraints.
In this talk we’ll share the scoop on Framework, highlighting features that can help developers get their data past the last mile, including:
PostgreSQL is the fastest-growing open source transactional database. DuckDB is the fastest-growing open source analytical one. pg_duckdb is a new Postgres extension that brings the two together and lets you run analytical queries on your Postgres instance with the full performance of DuckDB.
This talk will show how to use pg_duckdb to do analytics over your application data, lakehouse data, and to scale it to the cloud via MotherDuck.
The work is a joint effort from MotherDuck, Hydra, DuckDB Labs, Neon, and Microsoft that combines deep Postgres expertise from Hydra, Neon, and Microsoft with the DuckDB know-how of DuckDB Labs (the creators of DuckDB) and MotherDuck.
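A hedged sketch of what that looks like from a client (the setting name follows the pg_duckdb README at the time of writing; treat it, and the table and bucket names, as assumptions):

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
cur = conn.cursor()

# Route eligible queries through DuckDB's vectorized engine.
# (Setting name per the pg_duckdb README; treat as an assumption.)
cur.execute("SET duckdb.force_execution = true;")

# An analytical aggregation over ordinary Postgres heap tables...
cur.execute("""
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
print(cur.fetchall())

# ...or over lakehouse files, via DuckDB readers surfaced inside Postgres.
cur.execute("SELECT count(*) FROM read_parquet('s3://bucket/orders/*.parquet')")
print(cur.fetchone())
```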
Apache Flink has steadily established itself as the leader in stream processing technologies. With thousands of users implementing everything from simple to advanced streaming use cases, the future of the Flink community looks bright.
While Apache Flink runs on the JVM, it offers non-JVM users a well-defined Python API, PyFlink, which helps Python developers build sophisticated stream processing jobs. Today, many data engineers, data scientists, and data analysts prefer Python as their main programming language for building complex use cases.
In this session, I will explore Flink APIs wearing the non-JVM hat and deep dive into the PyFlink Table API and UDFs. PyFlink appeals to Python developers because complex stream processing techniques like windowing and event-time semantics can be written in simple Python DSLs.
I will also look at how the PyFlink Table API and Flink SQL can work hand in hand in developing streaming pipelines.
The session will also include a short demo showcasing how PyFlink ingests fast-moving data from Kafka and runs PyFlink Table API DSLs to process such streams.
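A minimal PyFlink sketch of the pieces discussed: a Kafka-backed table, a Python UDF, and an event-time tumbling window (topic and schema are illustrative, and the Kafka connector jar must be on the classpath):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment, DataTypes
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A Kafka-backed source table with an event-time watermark.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clicks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# A scalar Python UDF, registered for use from the Table API / SQL.
@udf(result_type=DataTypes.STRING())
def domain(url: str) -> str:
    return url.split("/")[2] if "://" in url else url

t_env.create_temporary_function("domain", domain)

# Event-time tumbling window via the windowing TVF.
t_env.execute_sql("""
    SELECT window_start, domain(url) AS site, count(*) AS clicks
    FROM TABLE(TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end, domain(url)
""").print()
```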
Change Data Capture (CDC) from source databases to a data lake is critical for analytics workloads. However, different CDC mechanisms exist, each with its own trade-offs. Open table formats get you around the issue of record-level upserts and deletes, but data compaction, schema evolution, and Merge-on-Read latency remain big challenges. In this session, we will share how you can use Apache Paimon and Apache Flink to build a CDC pipeline that overcomes these challenges and performs a low-latency sync of CDC data. We will also cover the partial-update merge engine and changelog tracking of streaming data. Finally, we will compare Apache Paimon with Apache Hudi and Apache Iceberg and provide prescriptive guidance on when to use one over the other.
Join this session to learn how Apache Paimon differs in its approach to solving the CDC problem.
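A hedged Flink SQL sketch of the Paimon features mentioned above, run through PyFlink (warehouse path and schema are illustrative):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalog backed by a warehouse path.
t_env.execute_sql("""
    CREATE CATALOG paimon WITH (
        'type' = 'paimon',
        'warehouse' = 's3://lake/warehouse'
    )
""")
t_env.execute_sql("USE CATALOG paimon")

# A primary-key table using the partial-update merge engine, so columns
# arriving from different streams merge into one row; the 'lookup'
# changelog producer lets downstream consumers track changes with low latency.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS orders_wide (
        order_id BIGINT,
        status STRING,
        amount DECIMAL(10, 2),
        shipped_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'merge-engine' = 'partial-update',
        'changelog-producer' = 'lookup'
    )
""")
```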
In the past 18 months, artificial intelligence has not just entered our workspaces – it has taken over. As we stand at the crossroads of innovation and automation, it’s time for a candid reflection on how AI has reshaped our professional lives, and to talk about where it’s been a game changer, where it’s falling short, and what’s about to shift dramatically in the short term.
Since the release of ChatGPT in December 2022, I’ve developed a “first-reflex” to augment and accelerate nearly every task with AI. As a founder and CEO, this spans a wide array of responsibilities: fundraising, internal communications, legal, operations, product marketing, finance, and beyond. In this keynote, I’ll cover diverse use cases across all areas of business, offering a comprehensive view of AI’s impact.
I’ve also been working closely with the data team at Preset, leveraging AI to assist and augment all aspects of data work. While I’ll explore a broad spectrum of tasks beyond data, I’ll delve deeper into the data-related aspects, as this facet of my work is most relevant to OSA CON attendees.
Join me as I sort through this new reality and try to forecast the future of AI in our work. It’s time for a radical checkpoint. Everything’s changing fast. In some areas, AI has been a slam dunk; in others, it’s been frustrating as hell. And once a few key challenges are tackled, we’re on the cusp of a tsunami of transformation.
Three major milestones are right around the corner: top-human-level reasoning, solid memory accumulation and recall, and proper executive skills. How is this going to affect all of us?
As Argo Workflows and Argo Events continue to gain popularity for their powerful capabilities in event-driven automation and complex job orchestration, this presentation will delve into how we used this architecture to process millions of records daily.
You will gain insights into the specific architecture that integrates Argo Events and Argo Workflows to achieve efficient data aggregation and ingestion. We will discuss the challenges encountered along the way and share the strategies we employed to overcome them. Attendees will also learn how we use techniques like “work avoidance” to ensure we don’t redo work that has already been done.
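The “work avoidance” idea, sketched generically in Python (in Argo itself this is typically expressed with step memoization or marker artifacts; this is an illustration of the concept, not the speakers’ production code):

```python
import hashlib
import json
import pathlib

def cache_key(task_name: str, inputs: dict) -> str:
    """Derive a deterministic key from the task's name and its inputs."""
    payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_with_avoidance(task_name, inputs, fn, marker_dir="/tmp/done-markers"):
    """Skip the work entirely if a marker for this exact input set exists.
    In Argo, the same idea is backed by a ConfigMap memoization cache or
    marker artifacts checked before each step."""
    marker = pathlib.Path(marker_dir) / cache_key(task_name, inputs)
    if marker.exists():
        print(f"skipping {task_name}: already done for these inputs")
        return
    fn(inputs)
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()

run_with_avoidance("aggregate-daily", {"date": "2024-10-01"},
                   lambda inp: print(f"aggregating {inp['date']}..."))
```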
This session is about the new major release of Airflow that we plan to ship early in 2025. It is the first major release of Airflow since 2021, when we released Airflow 2, and it is the result of four years of improvements implemented as minor releases, a lot of listening to our users, and a changing industry. While Airflow remains the most important and strongest ETL/data orchestrator in use, with LLM/GenAI becoming a mainstream part of data orchestration and a wealth of workflows and tooling specialising in them, Airflow 3 aims to become the only true open-source, open-governance, enterprise-grade orchestration solution for all your batch processing workflow needs. This talk will cover the basic principles and plans that will make Airflow even better suited for most of your data pipeline needs.
In this lightning talk, Diptiman will present techniques for text-2-cypher analytical query generation with the help of modern large language models like GPT-4o and Claude 3. The session will dive deep into graph database querying and how LLMs help developers execute analytical queries on popular graph databases like Neo4j and Amazon Neptune.
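A hedged sketch of the text-2-cypher pattern (the schema string, model choice, and credentials are illustrative; generated Cypher should be validated before running it against a real database):

```python
from openai import OpenAI          # assumes OPENAI_API_KEY is set
from neo4j import GraphDatabase

SCHEMA = ("Nodes: (Person {name}), (Movie {title, year}); "
          "Rels: (Person)-[:ACTED_IN]->(Movie)")

client = OpenAI()

def text_to_cypher(question: str) -> str:
    # Ground the model with the graph schema and ask for Cypher only.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Translate the question into a single Cypher query "
                        f"for this schema; return only Cypher:\n{SCHEMA}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
cypher = text_to_cypher("Which actors appeared in movies released after 2020?")
# In production, validate/sanitize the generated query before executing it.
with driver.session() as session:
    for record in session.run(cypher):
        print(record.data())
```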
Every company has a data platform: the systems and tools that produce data. These platforms are critical to every modern business and can incur massive cost and complexity. In this talk we argue that data platforms must be composable and flexible, and that this necessitates a new skill: data platform engineering. We will then discuss how an advanced orchestrator and control plane is the essential technology for engineering a composable data platform, and how it in turn manages complexity, avoids vendor lock-in, enables end-to-end ownership for practitioner teams, and contains the costs of your data platform.
Kubernetes has changed everything: not only the way we deploy our applications, but also how we monitor them and how we collect, store, visualize, and alert on the time series data generated by monitoring systems.
What are the challenges in modern monitoring? Why have new-generation time series databases like VictoriaMetrics and Prometheus emerged? Why is there no SQL support in these databases? Why are Grafana dashboards so fancy? Join us as we explore these questions and many others related to the specifics of time series data analysis.
Maintaining an OSS repository is hard. Scaling contributors is nigh impossible. As AI platforms proliferate, let’s take a look at tools you should and shouldn’t leverage in automating your GitHub repo, Slack workspace, and more! We’ll also talk about why Open Source Software stands to benefit more from this revolution than private/proprietary codebases.
Large Language Models (LLMs) mark a transformative advancement in artificial intelligence. These models are trained on vast datasets comprising text and code, enabling them to handle complex tasks such as text generation, language translation, and interactive querying.
As LLMs continue to integrate into various applications ranging from chatbots and search engines to creative writing aids, the need to monitor and comprehend their behaviors intensifies.
Observability plays a crucial role in this context. It involves the systematic collection and analysis of data to enhance LLM performance, identify and correct biases, troubleshoot issues, and ensure AI systems are both reliable and trustworthy.
In this discussion, we will explore the concept of LLM observability in depth, including the initial LLM semantic convention that has just been adopted by the OpenTelemetry community and how OpenTelemetry fits into the world of LLM observability. Additionally, we will share more details about how OpenLLMetry leverages OpenTelemetry to provide LLM observability for the whole AI stack, including vector DBs, LLMs, model orchestration platforms, and more.
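A minimal sketch of wiring OpenLLMetry in, assuming the traceloop-sdk package and an OpenAI-style client (exporter endpoint configuration is omitted; see the OpenLLMetry docs):

```python
# pip install traceloop-sdk openai
from traceloop.sdk import Traceloop
from openai import OpenAI

# One-line setup: OpenLLMetry auto-instruments supported LLM clients and
# vector DBs and exports standard OpenTelemetry traces to your backend.
Traceloop.init(app_name="llm-observability-demo")

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize OpenTelemetry in one line."}],
)
print(resp.choices[0].message.content)
# The call above is now a span carrying LLM semantic-convention attributes
# (model, token counts, etc.) alongside the rest of your OTel telemetry.
```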
As AI and machine learning become more integral to business operations and decision-making, the need for real-time data processing has never been more critical. Whether you’re monitoring live streams from edge devices, responding to events as they happen, or managing data in distributed databases, the ability to process and act on data in real-time can be the difference between success and irrelevance.
In this talk, we’ll showcase how connecting to real-time data sources can create more responsive and adaptive AI systems within the Python ecosystem, using Bytewax. We’ll explore practical scenarios where this capability enhances insight and efficiency, particularly in the context of modern AI tools like LLM applications and RAG systems.
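A small Bytewax sketch of the pattern, using a testing source in place of a live Kafka feed; the enrichment step stands in for embedding or RAG work:

```python
import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource, run_main

# A stand-in for a live stream (Kafka, websockets, etc. in production).
events = [{"user": "a", "text": "hello"}, {"user": "b", "text": "kafka is neat"}]

flow = Dataflow("realtime_enrich")
stream = op.input("inp", flow, TestingSource(events))

def embed_and_tag(event):
    # In a real RAG pipeline this is where you'd compute an embedding
    # and upsert it into a vector store for retrieval.
    event["tokens"] = len(event["text"].split())
    return event

enriched = op.map("enrich", stream, embed_and_tag)
op.inspect("out", enriched)  # print each processed item

run_main(flow)
```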
The open data lakehouse offers those frustrated with the costs and complex pipelines of traditional warehouses an alternative that delivers performance with affordability and simpler pipelines. In this talk, we’ll discuss the technologies that are making the open data lakehouse possible.
In this talk we will learn:
Increasingly, ML teams must satisfy requirements for executing workflows across multiple Kubernetes clusters, regions, or clouds. Challenges include constrained GPU availability within a single cloud, cloud credits or commitments across multiple providers, or strict data residency requirements. Traditional methods for replicating workflows across environments are both resource-intensive and operationally cumbersome.
In this talk, we propose a Hydra architecture, a novel approach where applications are deployed to a single cluster but can selectively take advantage of others. Management and orchestration happen in a single cluster, while execution flows flexibly through arbitrary remote compute. To accomplish this, we will articulate an open source approach that allows standard Python programs to be dispatched to any compute resource without repackaging multiple deployments for each compute locale or otherwise imposing restrictions. By unbundling orchestration and execution, ML platform teams can provide their fleet of compute to practitioners in a single abstraction layer.
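A deliberately hypothetical sketch of the dispatch idea, using local process pools as stand-ins for remote clusters; none of these names come from the project itself:

```python
from concurrent.futures import Executor, ProcessPoolExecutor

# Hypothetical registry of executors; in the architecture described above,
# each entry would proxy to a remote cluster rather than local processes.
CLUSTERS: dict[str, Executor] = {
    "local": ProcessPoolExecutor(),
    # "gpu-eu": RemoteClusterExecutor("gpu-eu"),  # hypothetical remote proxy
}

def dispatch(cluster: str, fn, *args):
    """Submit a plain Python callable to the named compute pool.
    Orchestration state stays in the home cluster; only execution moves."""
    return CLUSTERS[cluster].submit(fn, *args)

def train_step(shard: int) -> str:
    return f"trained shard {shard}"

if __name__ == "__main__":
    futures = [dispatch("local", train_step, i) for i in range(4)]
    print([f.result() for f in futures])
```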
In this talk, we’ll explore the emergent landscape of vector search in databases, a paradigm shift in information retrieval. Vector search, traditionally the domain of specialized systems, is now being integrated into mainstream databases and search engines like Lucene, Elasticsearch, Solr, PostgreSQL, MySQL, MongoDB, and Manticore. This integration marks a significant evolution in handling complex data structures and search queries.
- Definition and significance of vectors and embeddings.
- The historical context of vector search and its integration into databases.
- Strategies for embedding computation: in-database processing vs. external tools.
- Current capabilities of databases like MySQL (referring to PlanetScale’s initiative), PostgreSQL, etc., in embedding computation.
- The role of indexing in optimizing vector search.
- Different indexing strategies and their impact on performance and accuracy.
- Beyond speed: assessing the effectiveness of vector search.
- Metrics for evaluating the quality of search results.
Conclusion
The session will conclude with insights into future trends and the potential impact of vector search technologies on data retrieval, AI applications, and beyond.
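As a concept-level sketch: exact vector search is just a similarity scan over normalized embeddings, and index structures like HNSW or IVF approximate this at scale (data and dimensions below are toy values):

```python
import numpy as np

# Toy corpus embeddings; in practice these come from an embedding model,
# computed in-database or by an external tool as discussed above.
docs = ["cats purr", "dogs bark", "stocks fell today"]
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(len(docs), 8))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def search(query_vec: np.ndarray, k: int = 2):
    """Exact nearest-neighbor search by cosine similarity. ANN indexes
    trade a little accuracy for far better speed on large corpora."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q          # cosine similarity (vectors normalized)
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

print(search(rng.normal(size=8)))
```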
Open source software is fun to work on, but building a real business is tough. We’ve all read about commercial open source companies that switched to proprietary models or flat-out went out of business. Our panel of three CEOs have all built successful companies on open source without resorting to licensing rug-pulls or other fauxpen source tricks that disrespect users. We’ll discuss what worked for us, what didn’t, and how we balanced our belief in open source communities with making payroll every two weeks.
Open Lakehouses are among the most transformative innovations in big data, celebrated widely within data communities. Yet, for most product engineers, the lakehouse is off the radar. In this talk, we introduce Mooncake Labs and our mission to bridge this gap—connecting applications seamlessly with lakehouse capabilities. We’ll also dive into our open-source project, pg_mooncake, which empowers developers to build and manage lakehouse tables directly from within PostgreSQL.
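A sketch of what that could look like in practice, based on our reading of the pg_mooncake README; treat the exact DDL, extension name, and behavior as assumptions:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Per the pg_mooncake README (treat as an assumption): a columnstore table
# is written as lakehouse-format files, yet managed with plain Postgres
# DDL/DML from inside the database.
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_mooncake;")
cur.execute("""
    CREATE TABLE events (
        at timestamptz,
        user_id bigint,
        amount numeric
    ) USING columnstore;
""")
cur.execute("INSERT INTO events VALUES (now(), 42, 9.99);")
cur.execute("SELECT count(*) FROM events;")
print(cur.fetchone())
```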
Transactional databases like MySQL, PostgreSQL, and MongoDB are usually not a great fit for real-time analytics, especially as the data volume grows.
They are also less space-efficient than columnar databases and require regular purging or archival. This talk presents a solution to synchronize data in real time between MySQL and ClickHouse. The Altinity Sink Connector open source project (https://github.com/Altinity/clickhouse-sink-connector) is designed to efficiently replicate data and schema changes with accuracy, operational simplicity, and performance in mind. It also provides tools to checksum and efficiently dump and load terabytes of data. The Connector is now generally available.
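For flavor, a sketch of a typical replication target table in ClickHouse; the column conventions here are illustrative, and the connector’s actual schema mapping is documented in the project repo:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# A common pattern for MySQL CDC targets: ReplacingMergeTree keyed on the
# source primary key, deduplicated by a version column, with a soft-delete
# flag for rows removed upstream.
client.command("""
    CREATE TABLE IF NOT EXISTS orders_replica (
        id UInt64,
        status String,
        amount Decimal(10, 2),
        _version UInt64,
        is_deleted UInt8 DEFAULT 0
    )
    ENGINE = ReplacingMergeTree(_version)
    ORDER BY id
""")

# Query with FINAL (or dedupe in the SELECT) to read the latest row versions.
print(client.query("SELECT count() FROM orders_replica FINAL").result_rows)
```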
Over the years, we have seen a good number of OSS projects with great features and services go unnoticed and often get overshadowed by their proprietary counterparts. This highlights a key point: having a brilliant open-source project alone isn’t enough. To truly grow an open source project, you need a vibrant developer community that uses, contributes to, and champions your project.
In this session, we’ll look at how to build a successful open source project through strategic DevRel practices. Drawing from my experience, I will discuss the synergy between open source development and developer relations (DevRel). We will look at how to achieve a welcoming and user-friendly experience for developers. We will also explore ways to track and evaluate the effectiveness of your DevRel efforts within open source projects.
Together with the local government of Bern, one of the biggest cantons in Switzerland, we created the fully open-source data platform HelloDATA (https://github.com/kanton-bern/hellodata-be).
We leveraged established open-source data tools such as Superset, dbt, Airflow, and JupyterHub to create an integrated, “one-stop-shop” data platform. Using HelloDATA, we are driving government agencies of all types to better understand and utilize their data and generate value for themselves and their citizens.
In this session we would like to give insights into:
- The platform and its functionalities
- The process of developing, maintaining, and improving HelloDATA
- Our learnings and takeaways from creating such an open-source platform together with large-scale government agencies
Making big investments in open source software? We thought so. 2024 has been a tumultuous year for open source projects with relicensing, un-relicensing, and other adventures. Our panel of in-the-trenches experts will opine on what’s going well for users, what are the current trainwrecks, and what’s plain fun to watch. Get tips to protect your existing apps and see new opportunities. Best of all, we’ll talk about how you can make open source work better for everyone.
Open source is an amazing place for developers to contribute to exciting new projects and sharpen their coding skills while at it. However, as an open source maintainer of about 30 projects myself, I’ve realized that open source marketing is very similar to business marketing. Both open source projects and companies aim to attract users. Just as 90% of businesses fail or become abandoned, the same can be said of open source projects. In this lightning talk, I’ll discuss the similarities between open source marketing and business marketing.
Great AI applications start with great data. While DuckDB and MotherDuck are rapidly gaining traction for open-source AI and data engineering, PyAirbyte provides seamless and reliable data movement—directly in Python. In this session, we’ll show you how to combine these powerful tools to build a scalable data hub for GenAI applications and analytics, getting started in just minutes. We’ll conclude by demonstrating how you can build your next GenAI app directly in the database, all on a foundation of great data.
Whether you’re new to data or a seasoned professional, you’ll discover how to harness Airbyte’s hundreds of open source data connectors—or even build your own—for a solution that’s approachable for hobby projects and proofs of concept, yet robust enough for large-scale applications.
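A minimal PyAirbyte sketch using the bundled demo connector; swap in any of the hundreds of sources, and a MotherDuck cache for the cloud path (stream and config values are illustrative):

```python
# pip install airbyte
import airbyte as ab

# Pull from any Airbyte connector; source-faker generates demo data.
source = ab.get_source(
    "source-faker",
    config={"count": 1_000},
    install_if_missing=True,
)
source.check()
source.select_all_streams()

# Read into the default local DuckDB cache; a MotherDuck cache can be
# swapped in to land the same data in the cloud.
result = source.read()

for name, dataset in result.streams.items():
    print(name, len(dataset.to_pandas()))
```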
Observability is a critical aspect of any infrastructure as it enables teams to promptly identify and address issues. Nevertheless, achieving system observability comes with its own set of challenges. It is a time- and resource-intensive process as it necessitates the incorporation of instrumentation into every application.
In this talk, we will delve into the gathering of telemetry data, including metrics, logs, and traces, using eBPF. We will explore tracking various container activities, such as network calls and filesystem operations. Additionally, we will discuss the effective utilization of this telemetry data for troubleshooting.
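A small BCC-based sketch of the approach: an eBPF program counting openat() syscalls per process, read from Python (requires root privileges and kernel headers; the probe choice is illustrative):

```python
# Requires: bcc installed with kernel headers; run as root.
import time
from bcc import BPF

program = r"""
struct key_t { char comm[16]; };
BPF_HASH(counts, struct key_t, u64);

TRACEPOINT_PROBE(syscalls, sys_enter_openat) {
    struct key_t key = {};
    bpf_get_current_comm(&key.comm, sizeof(key.comm));
    counts.increment(key);
    return 0;
}
"""

b = BPF(text=program)
print("Counting openat() calls per process for 10s...")
time.sleep(10)
for key, count in b["counts"].items():
    print(key.comm.decode(errors="replace").rstrip("\x00"), count.value)
```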
Teams of all shapes and sizes benefit from near real-time analytics. In this session, I will present a project template that can serve as the foundation for one such high-performing system, powered by Apache Kafka, QuestDB, Grafana OSS, and Jupyter Notebook.
The first step of a data pipeline is ingestion, and even though we could directly ingest into a fast database, I will use Apache Kafka to ingest data. We will see how to use Python, JavaScript, and Go to send messages into Kafka.
Now, we need an analytics database, and for real-time data, a time-series database seems like a good match. I will demonstrate how to use QuestDB, an Apache 2.0 licensed project, to ingest and query data in milliseconds or faster.
Data analytics often require a graphical dashboard. For this purpose, I will use Grafana OSS, where we will create a couple of real-time charts updating several times per second.
And, of course, it’s 2024, so you might want to delve into some data science. No worries. I will demonstrate how Jupyter Notebook can be used to read from your database and perform interactive data exploration and time-series forecasting.
This will be a demo-driven presentation and all the code is open sourced at https://github.com/questdb/time-series-streaming-analytics-template. You can use it as a starting point for your streaming data projects.
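For a flavor of the ingestion step, a hedged Python producer sketch (topic and fields are illustrative; the template itself handles forwarding the topic into QuestDB):

```python
# pip install kafka-python
import json
import random
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit a reading several times per second.
for _ in range(100):
    producer.send("iot_readings", {
        "sensor_id": random.randint(1, 10),
        "temperature": round(random.uniform(18.0, 32.0), 2),
        "ts": time.time_ns(),
    })
    time.sleep(0.2)

producer.flush()
```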
When we chose ClickHouse as our main data lake for analytics at Cato Networks, we envisioned it as a silver bullet solution for our data needs, promising effortless data ingestion and ready-to-query dashboards.
However, the journey from that initial setup to our current, sophisticated data platform has been filled with trials and tribulations, alongside valuable lessons.
We first used ClickHouse as a black-box magical persistence layer, simply feeding it data points and querying ready-made GraphQL datasets. As our requirements grew more complex, our implementation evolved to meet these demands.
In this talk we’ll dive into the challenges and successes we encountered as a high-scale production user, such as making ClickHouse a GDPR-compatible store, discovering its limitations as an enrichment engine, and leveraging it as a robust alternative to KSQL for streaming data.
Additionally, we’ll explore the necessity and implications of migrating our schema three separate times (three’s a charm!). Join us to learn from our experiences what to do and not to do with ClickHouse in production.
Materialization moves computation from query time to ingest time by creating specialized derived tables, or materialized views, that are simpler than the source tables and are geared towards supporting specific workloads. This is one of the most powerful and common techniques for speeding up OLAP workloads. You can implement materialization in various ways, including built-in “materialized view” or “projection” features in many databases, as well as with third-party stream processors and workflow orchestrators that sit outside the database.
But materialization isn’t all smooth sailing. While it can boost performance, it also adds complexity and reduces flexibility in your data infrastructure. The cost implications are nuanced: typically, compute costs at query time go down, but storage costs may go up. Additionally, repopulating large materialized datasets can be expensive, and ensuring users see a consistent view of the data can be challenging.
In this talk, we’ll cover the various ways that you can manage materialization in a data system. We’ll discuss when to use materialization, the complexities that can arise, and how to handle them. We’ll also examine how materialization is implemented across various systems and weigh the trade-offs between performance, cost, and simplicity.
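A toy DuckDB sketch of the core trade-off described above: compute moves to ingest time, while staleness and repopulation cost move onto you (schema and data are illustrative):

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE events (ts TIMESTAMP, page TEXT, ms INTEGER);
    INSERT INTO events VALUES
        ('2024-10-01 10:00', '/home', 120),
        ('2024-10-01 10:01', '/home', 80),
        ('2024-10-01 10:02', '/docs', 200);
""")

# Materialize at ingest time: a derived table shaped for the dashboard
# workload, so query time becomes a cheap lookup.
con.execute("""
    CREATE TABLE page_stats AS
    SELECT page, count(*) AS hits, avg(ms) AS avg_ms
    FROM events GROUP BY page
""")

# The trade-off: new events require repopulating (or incrementally
# maintaining) page_stats, and readers may briefly see stale data.
print(con.execute("SELECT * FROM page_stats ORDER BY hits DESC").fetchall())
```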
In this talk, we share the results of an in-depth analysis of data gathered from over 1 billion open source package downloads across more than 2000 diverse projects on Scarf. Our findings offer valuable insights into user behaviors and interactions with open source software, making it essential for maintainers, founders, and executives in open source companies.
During the presentation, we delve deep into our data, uncovering the best practices employed by successful open source projects. We explore a wide array of topics, including various download formats, packaging systems, regional download trends, and user-favored documentation types. Additionally, we discuss the impact of community engagement and how maintainers can harness their user base to boost project adoption and drive business growth.
Attendees can expect to leave this talk equipped with actionable insights and best practices to optimize their open source projects and thrive in the competitive landscape of open source software.
Meltano is a powerful open-source data movement tool that has revolutionized the way organizations handle their data pipelines at scale. Before the Analytics Development Lifecycle was even a thing, Meltano was working to bring software engineering best practices to data teams. This session will explore how Meltano addresses common data management challenges with a single, unified, and customizable platform.
Window Functions allow you to group rows in a table for in-depth investigation when analyzing data in a relational database. Structured Query Language is fantastic for retrieving data, but once you have that data, you need a way to classify it. Window Functions provide a way to obtain sales totals to date, group time spent by department, or calculate running totals over clusters of data points. We will start with the basics by defining what a window can be, then proceed to rankings, calculating quartiles, and including aggregate functions. Consider this a mandatory session if you need to crunch numbers from MySQL or PostgreSQL in your job.
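For a runnable taste of the idea, here is a running-total example using SQLite’s standard window-function syntax; the same SQL works in MySQL 8+ and PostgreSQL, and the data is illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # window functions need SQLite >= 3.25
con.executescript("""
    CREATE TABLE sales (day TEXT, dept TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-01-01', 'toys', 100), ('2024-01-02', 'toys', 150),
        ('2024-01-01', 'books', 80), ('2024-01-02', 'books', 40);
""")

# A running total per department: the OVER clause defines the "window"
# of rows each output row can see.
for row in con.execute("""
    SELECT day, dept, amount,
           SUM(amount) OVER (PARTITION BY dept ORDER BY day) AS running_total
    FROM sales
    ORDER BY dept, day
"""):
    print(row)
```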