Cloud data warehouses are the dominant life form for modern analytic systems. They work like restaurants where users visit to feed on data. Larger data sets, AI, and user decisions to keep information in their own data lakes are undermining the restaurant model. What we need now is food trucks that move anywhere users need them. The food truck metaphor helps us envision a powerful new analytic system: the real-time data lake.
Apache Kafka is becoming the standard for integrating all information within an enterprise. This gives each enterprise an opportunity to act on what’s happening in its business in real time. One common use case is to ingest this data into a data lake for analytics. I will show how integrating Kafka with Apache Iceberg can make analytics much easier. Another rising use case is GenAI. I will show how we have extended Apache Flink to support real-time inference with GenAI.
In this session, we will set up a real-time pipeline using Kafka, RisingWave, and Superset in Preset. We will ingest player-related data into a Kafka topic and configure RisingWave to consume this data, creating materialized views for real-time analysis. With RisingWave and Superset, we can generate real-time visual dashboards, set up alerts, and create reports, enabling us to monitor player performance, build real-time leaderboards, and analyze game trends as they happen.
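As a minimal sketch of the RisingWave side of that pipeline (topic name, columns, and connection details are illustrative; RisingWave speaks the Postgres wire protocol, so psycopg2 works):

```python
import psycopg2

# RisingWave defaults: port 4566, user root, database dev.
conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
cur = conn.cursor()

# Consume the Kafka topic as a streaming source.
cur.execute("""
    CREATE SOURCE IF NOT EXISTS player_events (
        player_id INT,
        score INT,
        event_time TIMESTAMP
    ) WITH (
        connector = 'kafka',
        topic = 'player-events',
        properties.bootstrap.server = 'localhost:9092',
        scan.startup.mode = 'earliest'
    ) FORMAT PLAIN ENCODE JSON
""")

# The materialized view is maintained incrementally as events arrive;
# Superset can chart it like any Postgres table.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS leaderboard AS
    SELECT player_id, SUM(score) AS total_score
    FROM player_events
    GROUP BY player_id
""")
```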
Typically, data visualization is the last mile in a data pipeline, as it presents insights in a way that is easily understood by users. When insights are fresh and relevant, humans can act on them in time.
However, implementing a visually appealing real-time dashboard is not as simple as it might seem. The challenges of collecting and processing data at scale, and of delivering metrics to users on time, make things difficult.
In this talk, we deconstruct a real-time analytics dashboard into several layers: data collection, metrics computation, and insights serving. Then we take a real-world use case, an IoT dashboard, and build it from scratch while walking through each layer in detail, using open-source technology components for the implementation.
In the second half of the talk, we discuss the challenges in the process and find ways to mitigate them.
This talk is ideal for anyone interested in the practical application of real-time data processing and visualization. Attendees will gain a comprehensive understanding of each layer, its importance, and how the layers interact to create a seamless, real-time dashboard.
Given the AI hype, organisations want to capture every data point, which very quickly results in capturing extensive data from analytics, user interactions, transactions, metrics, logs, and time series. Given the different shapes, volumes, scale, consistency, and availability requirements, this evolves into many data stores to explore, manage, and nurture. However, relying on multiple specialized databases increases operational costs, burdens developers with cognitive overload, necessitates various dashboards, and eventually demands significant expertise spread across teams.
This talk proposes a unified data platform using just two databases: ClickHouse and Postgres. The Postgres protocol is quickly spreading to all kinds of distributed data systems, but the ecosystem outside OLTP is not yet mature. With ClickHouse and Postgres both able to query foreign data, we’ll demonstrate how ClickHouse can efficiently handle OLAP, metrics, logs, and transactional workloads. By consolidating these diverse workloads, organisations can achieve substantial cost savings, streamline resource utilisation, and simplify data management.
We will delve into strategies for integrating these functionalities, ensuring data isolation and smooth operations across teams. Real-world examples will highlight the effectiveness of ClickHouse and Postgres, showcasing how this unified approach enhances efficiency and reduces complexity.
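To make the cross-database idea concrete, here is a hedged sketch using ClickHouse’s postgresql() table function (hosts, credentials, and table names are illustrative):

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Query a Postgres table from inside ClickHouse via the postgresql()
# table function, joining it against a native ClickHouse events table.
result = client.query("""
    SELECT u.name, count() AS events
    FROM events AS e
    INNER JOIN postgresql('pg-host:5432', 'appdb', 'users', 'app', 'secret') AS u
        ON e.user_id = u.id
    GROUP BY u.name
    ORDER BY events DESC
    LIMIT 10
""")
for row in result.result_rows:
    print(row)
```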
At Rill, we rely on DuckDB to power uniquely fast dashboards for exploring time-series metrics. To achieve this interactivity, Rill’s dashboards generate up to 100 parallel queries in response to each user interaction.
In this lightning talk, we’ll share a series of optimization and data modeling techniques that have been pivotal in achieving remarkably fast, sub-second response times using DuckDB.
Our primary tactics include employing parallel connections to facilitate simultaneous query processing and organizing data in chronological order to enhance the effectiveness of min-max indexes. We also utilize enum types for more efficient handling of string column queries, along with configuration tuning. These approaches have collectively enabled us to enhance DuckDB’s capability to handle larger datasets (100+ GBs) with sub-second query responses.
We invite you to join us in this insightful session to discover how these optimizations can significantly improve your data processing and query performance in DuckDB.
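As a rough illustration of two of those tactics, enum-typed columns and chronological ordering (file paths and schema are illustrative):

```python
import duckdb

con = duckdb.connect("metrics.db")

# Tactic 1: enum-typed columns store low-cardinality strings as small
# integers, making filters and group-bys on them much cheaper.
con.execute("CREATE TYPE country AS ENUM ('US', 'DE', 'IN', 'BR')")
con.execute("""
    CREATE TABLE events (
        event_time TIMESTAMP,
        geo country,        -- VARCHAR values are cast to the enum on insert
        value DOUBLE
    )
""")

# Tactic 2: load data in chronological order so each row group's min-max
# (zonemap) metadata lets DuckDB skip row groups outside a time filter.
con.execute("""
    INSERT INTO events
    SELECT event_time, geo, value
    FROM read_parquet('raw/*.parquet')
    ORDER BY event_time
""")

# A time-bounded dashboard query now touches only the relevant row groups.
# (For parallel queries, open one connection or cursor per thread.)
print(con.execute("""
    SELECT date_trunc('minute', event_time) AS minute, count(*) AS n
    FROM events
    WHERE event_time >= now() - INTERVAL 1 HOUR
    GROUP BY minute ORDER BY minute
""").fetchall())
```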
Lakehouse is a big data solution that combines the advantages of data warehouses and data lakes, helping users perform fast data analysis and efficient data management on the data lake.
Apache Doris is an OLAP database for fast data analytics. It provides a self-managed table format for high-concurrency, low-latency queries, semi-structured data analytics, and complex ad-hoc queries, all using standard SQL. It can also query data in various lake formats such as Apache Hudi, Apache Iceberg, and Apache Paimon.
In this session, you will learn what Apache Doris is, what Doris can do for real-time analytics, and how to build a fast data analysis engine on a data lake. A sketch of the catalog setup follows the outline below.
- Introduction to Apache Doris
- Core features of Apache Doris
- Building a fast data analysis engine on a data lake
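As a rough sketch of Doris’s multi-catalog setup (connection details and catalog properties are illustrative; Doris’s frontend speaks the MySQL protocol, so a standard MySQL client works):

```python
# pip install pymysql; the Doris FE listens on the MySQL port (9030 by default).
import pymysql

conn = pymysql.connect(host="doris-fe", port=9030, user="root", password="")
cur = conn.cursor()

# Register an external Iceberg catalog (properties vary by catalog type;
# see the Doris multi-catalog docs for your environment).
cur.execute("""
    CREATE CATALOG IF NOT EXISTS iceberg_lake PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'rest',
        'uri' = 'http://rest-catalog:8181'
    )
""")

# Query lake tables with standard SQL, alongside Doris-native tables.
cur.execute("SELECT count(*) FROM iceberg_lake.sales.orders")
print(cur.fetchone())
```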
Finding hidden relationships is the key to unlocking insights. Traditional charts and graphs fall short when visualizing complex, interconnected data. This presentation will dive into the world of graph visualization, a powerful technique for unveiling hidden patterns, dependencies, and anomalies in your data.
We will explore:
Data preprocessing is a crucial step in the machine learning pipeline. It involves cleaning, transforming, and preparing raw data to make it suitable for building and training machine learning models. In this hands-on workshop, attendees will gain a deep understanding of the importance of data preprocessing and learn essential techniques for working with real-world data.
Presto 2.0 is a full rewrite of the Presto query execution engine (https://prestodb.io/). The goal is to bring a 3-4x improvement in Presto performance and scalability by moving from the old Java implementation to a modern C++ one. This move towards native execution aligns with industry initiatives like Databricks Photon and Apache DataFusion, among others. We are very excited to bring this technology to Presto to make it the best Open Data Lakehouse engine in the market.
Presto 2.0 has been in active development for about four years, and we now have production deployments at Meta and IBM. The project has a very active open-source community comprising engineers from Meta, Ahana/IBM, Uber, ByteDance, Pinterest, Intel, and Neuroblade, among others.
This session will give an overview of the project, its architecture, and our experiences launching it at Meta and benchmarking it with TPC-DS at IBM watsonx.data (https://www.ibm.com/blog/announcement/delivering-superior-price-performance-and-enhanced-data-management-for-ai-with-ibm-watsonx-data/).
Data teams often have established workflows to access, process, and analyze data from different sources, but can be stymied by the “last mile problem” in data: creating rich, fast, and fully customized apps and dashboards. Closed-source, GUI-based BI options pose challenges for data visualization developers, restricting them to pre-built data integrations, out-of-the-box chart components and layouts, and limited publishing options.
Observable Framework is a new open-source static site generator, command line tool, and local preview server. It’s files-based, so it integrates seamlessly into existing data workflows. Framework’s data loaders support back-end data processing in any programming language, bridging the gap between data teams and developers and improving app performance. And, when working in Framework, everything is created with code, which means developers can build fully customized, interactive graphics and pages without constraints.
In this talk we’ll share the scoop on Framework, highlighting features that can help developers get their data past the last mile, including:
PostgreSQL is the fastest-growing open source transactional database. DuckDB is the fastest-growing open source analytical one. pg_duckdb is a new Postgres extension that brings the two together and lets you run analytical queries on your Postgres instance with the full performance of DuckDB.
This talk will show how to use pg_duckdb to do analytics over your application data, lakehouse data, and to scale it to the cloud via MotherDuck.
The work is a joint effort from MotherDuck, Hydra, DuckDB Labs, Neon, and Microsoft that combines deep Postgres expertise from Hydra, Neon, and Microsoft with the DuckDB know-how of DuckDB Labs (the creators of DuckDB) and MotherDuck.
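A hedged sketch of what that looks like from a client (the setting name follows the pg_duckdb README at the time of writing; treat it, and the table and bucket names, as assumptions):

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
cur = conn.cursor()

# Route eligible queries through DuckDB's vectorized engine.
# (Setting name per the pg_duckdb README; treat as an assumption.)
cur.execute("SET duckdb.force_execution = true;")

# An analytical aggregation over ordinary Postgres heap tables...
cur.execute("""
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
print(cur.fetchall())

# ...or over lakehouse files, via DuckDB readers surfaced inside Postgres.
cur.execute("SELECT count(*) FROM read_parquet('s3://bucket/orders/*.parquet')")
print(cur.fetchone())
```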
Apache Flink has steadily established itself as the leader in stream processing technologies. With thousands of users implementing everything from simple to advanced streaming use cases, the future of the Flink community looks bright.
While Apache Flink runs on the JVM, it offers non-JVM users a well-defined Python API, PyFlink, which helps Python developers build sophisticated stream processing jobs. Today, many data engineers, data scientists, and data analysts prefer Python as their main programming language for building complex use cases.
In this session, I will explore Flink APIs wearing the non-JVM hat and deep dive into the PyFlink Table API and UDFs. PyFlink appeals to Python developers because complex stream processing techniques like windowing and event-time semantics can be written in simple Python DSLs.
I will also look at how the PyFlink Table API and Flink SQL can work hand in hand in developing streaming pipelines.
The session will also include a short demo showcasing how PyFlink ingests fast-moving data from Kafka and runs PyFlink Table API DSLs to process such streams.
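A minimal PyFlink sketch of the pieces discussed: a Kafka-backed table, a Python UDF, and an event-time tumbling window (topic and schema are illustrative, and the Kafka connector jar must be on the classpath):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment, DataTypes
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A Kafka-backed source table with an event-time watermark.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clicks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# A scalar Python UDF, registered for use from the Table API / SQL.
@udf(result_type=DataTypes.STRING())
def domain(url: str) -> str:
    return url.split("/")[2] if "://" in url else url

t_env.create_temporary_function("domain", domain)

# Event-time tumbling window via the windowing TVF.
t_env.execute_sql("""
    SELECT window_start, domain(url) AS site, count(*) AS clicks
    FROM TABLE(TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end, domain(url)
""").print()
```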
Change Data Capture (CDC) from source databases to a data lake is critical for analytics workloads. However, different CDC mechanisms exist, each with its own trade-offs. Open table formats get you around the issue of record-level upserts and deletes, but data compaction, schema evolution, and Merge-on-Read latency remain big challenges. In this session, we will share how you can use Apache Paimon and Apache Flink to build a CDC pipeline that overcomes these challenges and performs a low-latency sync of CDC data. We will also cover the partial-update merge engine and changelog tracking of streaming data. Finally, we will compare Apache Paimon with Apache Hudi and Apache Iceberg and provide prescriptive guidance on when to use one over the other.
Join this session to learn how Apache Paimon differs in its approach to solving the CDC problem.
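A hedged Flink SQL sketch of the Paimon features mentioned above, run through PyFlink (warehouse path and schema are illustrative):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalog backed by a warehouse path.
t_env.execute_sql("""
    CREATE CATALOG paimon WITH (
        'type' = 'paimon',
        'warehouse' = 's3://lake/warehouse'
    )
""")
t_env.execute_sql("USE CATALOG paimon")

# A primary-key table using the partial-update merge engine, so columns
# arriving from different streams merge into one row; the 'lookup'
# changelog producer lets downstream consumers track changes with low latency.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS orders_wide (
        order_id BIGINT,
        status STRING,
        amount DECIMAL(10, 2),
        shipped_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'merge-engine' = 'partial-update',
        'changelog-producer' = 'lookup'
    )
""")
```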
In the past 18 months, artificial intelligence has not just entered our workspaces – it has taken over. As we stand at the crossroads of innovation and automation, it’s time for a candid reflection on how AI has reshaped our professional lives, and to talk about where it’s been a game changer, where it’s falling short, and what’s about to shift dramatically in the short term.
Since the release of ChatGPT in December 2022, I’ve developed a “first-reflex” to augment and accelerate nearly every task with AI. As a founder and CEO, this spans a wide array of responsibilities: fundraising, internal communications, legal, operations, product marketing, finance, and beyond. In this keynote, I’ll cover diverse use cases across all areas of business, offering a comprehensive view of AI’s impact.
I’ve also been working closely with the data team at Preset, leveraging AI to assist and augment all aspects of data work. While I’ll explore a broad spectrum of tasks beyond data, I’ll delve deeper into the data-related aspects, as this facet of my work is most relevant to OSA CON attendees.
Join me as I sort through this new reality and try to forecast the future of AI in our work. It’s time for a radical checkpoint. Everything’s changing fast. In some areas, AI has been a slam dunk; in others, it’s been frustrating as hell. And once a few key challenges are tackled, we’re on the cusp of a tsunami of transformation.
Three major milestones are right around the corner: top-human-level reasoning, solid memory accumulation and recall, and proper executive skills. How is this going to affect all of us?
As Argo Workflows and Argo Events continue to gain popularity for their powerful capabilities in event-driven automation and complex job orchestration, this presentation will delve into how we used this architecture to process millions of records daily.
You will gain insights into the specific architecture that integrates Argo Events and Argo Workflows to achieve efficient data aggregation and ingestion. We will discuss the challenges encountered along the way and share the strategies we employed to overcome them. Attendees will also learn how we use techniques like “work avoidance” to ensure we don’t redo work that has already been done.
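The “work avoidance” idea, sketched generically in Python (in Argo itself this is typically expressed with step memoization or marker artifacts; this is an illustration of the concept, not the speakers’ production code):

```python
import hashlib
import json
import pathlib

def cache_key(task_name: str, inputs: dict) -> str:
    """Derive a deterministic key from the task's name and its inputs."""
    payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_with_avoidance(task_name, inputs, fn, marker_dir="/tmp/done-markers"):
    """Skip the work entirely if a marker for this exact input set exists.
    In Argo, the same idea is backed by a ConfigMap memoization cache or
    marker artifacts checked before each step."""
    marker = pathlib.Path(marker_dir) / cache_key(task_name, inputs)
    if marker.exists():
        print(f"skipping {task_name}: already done for these inputs")
        return
    fn(inputs)
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()

run_with_avoidance("aggregate-daily", {"date": "2024-10-01"},
                   lambda inp: print(f"aggregating {inp['date']}..."))
```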
This session is about the new major release of Airflow that we plan to ship early in 2025. It is the first major release of Airflow since 2021, when we released Airflow 2, and it is the result of four years of improvements implemented as minor releases, a lot of listening to our users, and a changing industry. While Airflow remains the most important and strongest ETL/data orchestrator in use, with LLM/GenAI becoming a mainstream part of data orchestration and a wealth of workflows and tooling specialising in them, Airflow 3 aims to become the only true open-source, open-governance, enterprise-grade orchestration solution for all your batch processing workflow needs. This talk will cover the basic principles and plans that will make Airflow even better suited for most of your data pipeline needs.
In this lightning talk, Diptiman will present techniques for text-2-cypher analytical query generation with the help of modern large language models like GPT-4o and Claude 3. The session will dive deep into graph database querying and how LLMs help developers execute analytical queries on popular graph databases like Neo4j and Amazon Neptune.
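A hedged sketch of the text-2-cypher pattern (the schema string, model choice, and credentials are illustrative; generated Cypher should be validated before running it against a real database):

```python
from openai import OpenAI          # assumes OPENAI_API_KEY is set
from neo4j import GraphDatabase

SCHEMA = ("Nodes: (Person {name}), (Movie {title, year}); "
          "Rels: (Person)-[:ACTED_IN]->(Movie)")

client = OpenAI()

def text_to_cypher(question: str) -> str:
    # Ground the model with the graph schema and ask for Cypher only.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Translate the question into a single Cypher query "
                        f"for this schema; return only Cypher:\n{SCHEMA}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
cypher = text_to_cypher("Which actors appeared in movies released after 2020?")
# In production, validate/sanitize the generated query before executing it.
with driver.session() as session:
    for record in session.run(cypher):
        print(record.data())
```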
Every company has a data platform: the systems and tools that produce data. These platforms are critical to every modern business and can incur massive cost and complexity. In this talk we argue that data platforms must be composable and flexible, and that this necessitates a new skill: data platform engineering. We will then discuss how an advanced orchestrator and control plane is the essential technology for engineering a composable data platform, and how it in turn manages complexity, avoids vendor lock-in, enables end-to-end ownership for practitioner teams, and contains the costs of your data platform.
Kubernetes has changed everything: not only the way we deploy our applications, but also how we monitor them and how we collect, store, visualize, and alert on the time series data generated by monitoring systems.
What are the challenges in modern monitoring? Why have new-generation time series databases like VictoriaMetrics and Prometheus emerged? Why is there no SQL support in these databases? Why are Grafana dashboards so fancy? Join us as we explore these questions and many others related to the specifics of time series data analysis.
Maintaining an OSS repository is hard. Scaling contributors is nigh impossible. As AI platforms proliferate, let’s take a look at tools you should and shouldn’t leverage in automating your GitHub repo, Slack workspace, and more! We’ll also talk about why Open Source Software stands to benefit more from this revolution than private/proprietary codebases.
Large Language Models (LLMs) mark a transformative advancement in artificial intelligence. These models are trained on vast datasets comprising text and code, enabling them to handle complex tasks such as text generation, language translation, and interactive querying.
As LLMs continue to integrate into various applications ranging from chatbots and search engines to creative writing aids, the need to monitor and comprehend their behaviors intensifies.
Observability plays a crucial role in this context. It involves the systematic collection and analysis of data to enhance LLM performance, identify and correct biases, troubleshoot issues, and ensure AI systems are both reliable and trustworthy.
In this discussion, we will explore the concept of LLM observability in depth, including the initial LLM semantic convention that has just been adopted by the OpenTelemetry community and how OpenTelemetry fits into the world of LLM observability. Additionally, we will share more details about how OpenLLMetry leverages OpenTelemetry to provide LLM observability for the whole AI stack, including vector DBs, LLMs, model orchestration platforms, and more.
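A minimal sketch of wiring OpenLLMetry in, assuming the traceloop-sdk package and an OpenAI-style client (exporter endpoint configuration is omitted; see the OpenLLMetry docs):

```python
# pip install traceloop-sdk openai
from traceloop.sdk import Traceloop
from openai import OpenAI

# One-line setup: OpenLLMetry auto-instruments supported LLM clients and
# vector DBs and exports standard OpenTelemetry traces to your backend.
Traceloop.init(app_name="llm-observability-demo")

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize OpenTelemetry in one line."}],
)
print(resp.choices[0].message.content)
# The call above is now a span carrying LLM semantic-convention attributes
# (model, token counts, etc.) alongside the rest of your OTel telemetry.
```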
As AI and machine learning become more integral to business operations and decision-making, the need for real-time data processing has never been more critical. Whether you’re monitoring live streams from edge devices, responding to events as they happen, or managing data in distributed databases, the ability to process and act on data in real-time can be the difference between success and irrelevance.
In this talk, we’ll showcase how connecting to real-time data sources can create more responsive and adaptive AI systems within the Python ecosystem, using Bytewax. We’ll explore practical scenarios where this capability enhances insight and efficiency, particularly in the context of modern AI tools like LLM applications and RAG systems.
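A small Bytewax sketch of the pattern, using a testing source in place of a live Kafka feed; the enrichment step stands in for embedding or RAG work:

```python
import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource, run_main

# A stand-in for a live stream (Kafka, websockets, etc. in production).
events = [{"user": "a", "text": "hello"}, {"user": "b", "text": "kafka is neat"}]

flow = Dataflow("realtime_enrich")
stream = op.input("inp", flow, TestingSource(events))

def embed_and_tag(event):
    # In a real RAG pipeline this is where you'd compute an embedding
    # and upsert it into a vector store for retrieval.
    event["tokens"] = len(event["text"].split())
    return event

enriched = op.map("enrich", stream, embed_and_tag)
op.inspect("out", enriched)  # print each processed item

run_main(flow)
```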
The open data lakehouse offers those frustrated with the costs and complex pipelines of traditional warehouses an alternative that delivers performance with affordability and simpler pipelines. In this talk, we’ll discuss the technologies that are making the open data lakehouse possible.
In this talk we will learn:
Increasingly, ML teams must satisfy requirements for executing workflows across multiple Kubernetes clusters, regions, or clouds. Challenges include constrained GPU availability within a single cloud, cloud credits or commitments across multiple providers, or strict data residency requirements. Traditional methods for replicating workflows across environments are both resource-intensive and operationally cumbersome.
In this talk, we propose a Hydra architecture, a novel approach where applications are deployed to a single cluster but can selectively take advantage of others. Management and orchestration happen in a single cluster, while execution flows flexibly through arbitrary remote compute. To accomplish this, we will articulate an open source approach that allows standard Python programs to be dispatched to any compute resource without repackaging multiple deployments for each compute locale or otherwise imposing restrictions. By unbundling orchestration and execution, ML platform teams can provide their fleet of compute to practitioners in a single abstraction layer.
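A deliberately hypothetical sketch of the dispatch idea, using local process pools as stand-ins for remote clusters; none of these names come from the project itself:

```python
from concurrent.futures import Executor, ProcessPoolExecutor

# Hypothetical registry of executors; in the architecture described above,
# each entry would proxy to a remote cluster rather than local processes.
CLUSTERS: dict[str, Executor] = {
    "local": ProcessPoolExecutor(),
    # "gpu-eu": RemoteClusterExecutor("gpu-eu"),  # hypothetical remote proxy
}

def dispatch(cluster: str, fn, *args):
    """Submit a plain Python callable to the named compute pool.
    Orchestration state stays in the home cluster; only execution moves."""
    return CLUSTERS[cluster].submit(fn, *args)

def train_step(shard: int) -> str:
    return f"trained shard {shard}"

if __name__ == "__main__":
    futures = [dispatch("local", train_step, i) for i in range(4)]
    print([f.result() for f in futures])
```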
In this talk, we’ll explore the emergent landscape of vector search in databases, a paradigm shift in information retrieval. Vector search, traditionally the domain of specialized systems, is now being integrated into mainstream databases and search engines like Lucene, Elasticsearch, Solr, PostgreSQL, MySQL, MongoDB, and Manticore. This integration marks a significant evolution in handling complex data structures and search queries.
- Definition and significance of vectors and embeddings.
- The historical context of vector search and its integration into databases.
- Strategies for embedding computation: in-database processing vs. external tools.
- Current capabilities of databases like MySQL (referring to PlanetScale’s initiative), PostgreSQL, etc., in embedding computation.
- The role of indexing in optimizing vector search.
- Different indexing strategies and their impact on performance and accuracy.
- Beyond speed: assessing the effectiveness of vector search.
- Metrics for evaluating the quality of search results.
Conclusion
The session will conclude with insights into future trends and the potential impact of vector search technologies on data retrieval, AI applications, and beyond.
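As a concept-level sketch: exact vector search is just a similarity scan over normalized embeddings, and index structures like HNSW or IVF approximate this at scale (data and dimensions below are toy values):

```python
import numpy as np

# Toy corpus embeddings; in practice these come from an embedding model,
# computed in-database or by an external tool as discussed above.
docs = ["cats purr", "dogs bark", "stocks fell today"]
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(len(docs), 8))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def search(query_vec: np.ndarray, k: int = 2):
    """Exact nearest-neighbor search by cosine similarity. ANN indexes
    trade a little accuracy for far better speed on large corpora."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q          # cosine similarity (vectors normalized)
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

print(search(rng.normal(size=8)))
```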
Open source software is fun to work on, but building a real business is tough. We’ve all read about commercial open source companies that switched to proprietary models or flat-out went out of business. Our panel of three CEOs have all built successful companies on open source without resorting to licensing rug-pulls or other fauxpen source tricks that disrespect users. We’ll discuss what worked for us, what didn’t, and how we balanced our belief in open source communities with making payroll every two weeks.
Open Lakehouses are among the most transformative innovations in big data, celebrated widely within data communities. Yet, for most product engineers, the lakehouse is off the radar. In this talk, we introduce Mooncake Labs and our mission to bridge this gap—connecting applications seamlessly with lakehouse capabilities. We’ll also dive into our open-source project, pg_mooncake, which empowers developers to build and manage lakehouse tables directly from within PostgreSQL.
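A sketch of what that could look like in practice, based on our reading of the pg_mooncake README; treat the exact DDL, extension name, and behavior as assumptions:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Per the pg_mooncake README (treat as an assumption): a columnstore table
# is written as lakehouse-format files, yet managed with plain Postgres
# DDL/DML from inside the database.
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_mooncake;")
cur.execute("""
    CREATE TABLE events (
        at timestamptz,
        user_id bigint,
        amount numeric
    ) USING columnstore;
""")
cur.execute("INSERT INTO events VALUES (now(), 42, 9.99);")
cur.execute("SELECT count(*) FROM events;")
print(cur.fetchone())
```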
Transactional databases like MySQL, PostgreSQL, and MongoDB are usually not a great fit for real-time analytics, especially as the data volume grows.
They are also less space-efficient than columnar databases and require regular purging or archival. This talk presents a solution to synchronize data in real time between MySQL and ClickHouse. The Altinity Sink Connector open source project (https://github.com/Altinity/clickhouse-sink-connector) is designed to efficiently replicate data and schema changes with accuracy, operational simplicity, and performance in mind. It also provides tools to checksum and efficiently dump and load terabytes of data. The Connector is now generally available.
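For flavor, a sketch of a typical replication target table in ClickHouse; the column conventions here are illustrative, and the connector’s actual schema mapping is documented in the project repo:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# A common pattern for MySQL CDC targets: ReplacingMergeTree keyed on the
# source primary key, deduplicated by a version column, with a soft-delete
# flag for rows removed upstream.
client.command("""
    CREATE TABLE IF NOT EXISTS orders_replica (
        id UInt64,
        status String,
        amount Decimal(10, 2),
        _version UInt64,
        is_deleted UInt8 DEFAULT 0
    )
    ENGINE = ReplacingMergeTree(_version)
    ORDER BY id
""")

# Query with FINAL (or dedupe in the SELECT) to read the latest row versions.
print(client.query("SELECT count() FROM orders_replica FINAL").result_rows)
```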
Over the years, we have seen a good number of OSS projects with great features and services go unnoticed and often get overshadowed by their proprietary counterparts. This highlights a key point: having a brilliant open-source project alone isn’t enough. To truly grow an open source project, you need a vibrant developer community that uses, contributes to, and champions your project.
In this session, we’ll look at how to build a successful open source project through strategic DevRel practices. Drawing from my experience, I will discuss the synergy between open source development and developer relations (DevRel). We will look at how to achieve a welcoming and user-friendly experience for developers. We will also explore ways to track and evaluate the effectiveness of your DevRel efforts within open source projects.
Together with the local government of Bern, one of the biggest cantons in Switzerland, we created the fully open-source data platform HelloDATA (https://github.com/kanton-bern/hellodata-be).
We leveraged established open-source data tools such as Superset, dbt, Airflow, and JupyterHub to create an integrated, “one-stop-shop” data platform. Using HelloDATA, we are driving government agencies of all types to better understand and utilize their data and generate value for themselves and their citizens.
In this session we would like to give insights into:
- The platform and its functionalities
- The process of developing, maintaining, and improving HelloDATA
- Our learnings and takeaways from creating such an open-source platform together with large-scale government agencies
Making big investments in open source software? We thought so. 2024 has been a tumultuous year for open source projects with relicensing, un-relicensing, and other adventures. Our panel of in-the-trenches experts will opine on what’s going well for users, what are the current trainwrecks, and what’s plain fun to watch. Get tips to protect your existing apps and see new opportunities. Best of all, we’ll talk about how you can make open source work better for everyone.
Open source is an amazing place for developers to contribute to exciting new projects and sharpen their coding skills while at it. However, as an open source maintainer of about 30 projects myself, I’ve realized that open source marketing is very similar to business marketing. Both open source projects and companies aim to attract users. Just as 90% of businesses fail or become abandoned, the same can be said of open source projects. In this lightning talk, I’ll discuss the similarities between open source marketing and business marketing.
Great AI applications start with great data. While DuckDB and MotherDuck are rapidly gaining traction for open-source AI and data engineering, PyAirbyte provides seamless and reliable data movement—directly in Python. In this session, we’ll show you how to combine these powerful tools to build a scalable data hub for GenAI applications and analytics, getting started in just minutes. We’ll conclude by demonstrating how you can build your next GenAI app directly in the database, all on a foundation of great data.
Whether you’re new to data or a seasoned professional, you’ll discover how to harness Airbyte’s hundreds of open source data connectors—or even build your own—for a solution that’s approachable for hobby projects and proofs of concept, yet robust enough for large-scale applications.
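A minimal PyAirbyte sketch using the bundled demo connector; swap in any of the hundreds of sources, and a MotherDuck cache for the cloud path (stream and config values are illustrative):

```python
# pip install airbyte
import airbyte as ab

# Pull from any Airbyte connector; source-faker generates demo data.
source = ab.get_source(
    "source-faker",
    config={"count": 1_000},
    install_if_missing=True,
)
source.check()
source.select_all_streams()

# Read into the default local DuckDB cache; a MotherDuck cache can be
# swapped in to land the same data in the cloud.
result = source.read()

for name, dataset in result.streams.items():
    print(name, len(dataset.to_pandas()))
```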
Observability is a critical aspect of any infrastructure as it enables teams to promptly identify and address issues. Nevertheless, achieving system observability comes with its own set of challenges. It is a time- and resource-intensive process as it necessitates the incorporation of instrumentation into every application.
In this talk, we will delve into the gathering of telemetry data, including metrics, logs, and traces, using eBPF. We will explore tracking various container activities, such as network calls and filesystem operations. Additionally, we will discuss the effective utilization of this telemetry data for troubleshooting.
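A small BCC-based sketch of the approach: an eBPF program counting openat() syscalls per process, read from Python (requires root privileges and kernel headers; the probe choice is illustrative):

```python
# Requires: bcc installed with kernel headers; run as root.
import time
from bcc import BPF

program = r"""
struct key_t { char comm[16]; };
BPF_HASH(counts, struct key_t, u64);

TRACEPOINT_PROBE(syscalls, sys_enter_openat) {
    struct key_t key = {};
    bpf_get_current_comm(&key.comm, sizeof(key.comm));
    counts.increment(key);
    return 0;
}
"""

b = BPF(text=program)
print("Counting openat() calls per process for 10s...")
time.sleep(10)
for key, count in b["counts"].items():
    print(key.comm.decode(errors="replace").rstrip("\x00"), count.value)
```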
Teams of all shapes and sizes benefit from near real-time analytics. In this session, I will present a project template that can serve as the foundation for one such high-performing system, powered by Apache Kafka, QuestDB, Grafana OSS, and Jupyter Notebook.
The first step of a data pipeline is ingestion, and even though we could directly ingest into a fast database, I will use Apache Kafka to ingest data. We will see how to use Python, JavaScript, and Go to send messages into Kafka.
Now, we need an analytics database, and for real-time data, a time-series database seems like a good match. I will demonstrate how to use QuestDB, an Apache 2.0 licensed project, to ingest and query data in milliseconds or faster.
Data analytics often require a graphical dashboard. For this purpose, I will use Grafana OSS, where we will create a couple of real-time charts updating several times per second.
And, of course, it’s 2024, so you might want to delve into some data science. No worries. I will demonstrate how Jupyter Notebook can be used to read from your database and perform interactive data exploration and time-series forecasting.
This will be a demo-driven presentation and all the code is open sourced at https://github.com/questdb/time-series-streaming-analytics-template. You can use it as a starting point for your streaming data projects.
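For a flavor of the ingestion step, a hedged Python producer sketch (topic and fields are illustrative; the template itself handles forwarding the topic into QuestDB):

```python
# pip install kafka-python
import json
import random
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit a reading several times per second.
for _ in range(100):
    producer.send("iot_readings", {
        "sensor_id": random.randint(1, 10),
        "temperature": round(random.uniform(18.0, 32.0), 2),
        "ts": time.time_ns(),
    })
    time.sleep(0.2)

producer.flush()
```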
When we chose ClickHouse as our main data lake for analytics at Cato Networks, we envisioned it as a silver bullet solution for our data needs, promising effortless data ingestion and ready-to-query dashboards.
However, the journey from that initial setup to our current, sophisticated data platform has been filled with trials and tribulations, alongside valuable lessons.
We first used ClickHouse as a black-box magical persistence layer, simply feeding it data points and querying ready-made GraphQL datasets. As our requirements grew more complex, our implementation evolved to meet these demands.
In this talk we’ll dive into the challenges and successes we encountered as a high-scale production user, such as making ClickHouse a GDPR-compatible store, discovering its limitations as an enrichment engine, and leveraging it as a robust alternative to KSQL for streaming data.
Additionally, we’ll explore the necessity and implications of migrating our schema three separate times (three’s a charm!). Join us to learn from our experiences what to do and not to do with ClickHouse in production.
Materialization moves computation from query time to ingest time by creating specialized derived tables, or materialized views, that are simpler than the source tables and are geared towards supporting specific workloads. This is one of the most powerful and common techniques for speeding up OLAP workloads. You can implement materialization in various ways, including built-in “materialized view” or “projection” features in many databases, as well as with third-party stream processors and workflow orchestrators that sit outside the database.
But materialization isn’t all smooth sailing. While it can boost performance, it also adds complexity and reduces flexibility in your data infrastructure. The cost implications are nuanced: typically, compute costs at query time go down, but storage costs may go up. Additionally, repopulating large materialized datasets can be expensive, and ensuring users see a consistent view of the data can be challenging.
In this talk, we’ll cover the various ways that you can manage materialization in a data system. We’ll discuss when to use materialization, the complexities that can arise, and how to handle them. We’ll also examine how materialization is implemented across various systems and weigh the trade-offs between performance, cost, and simplicity.
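A toy DuckDB sketch of the core trade-off described above: compute moves to ingest time, while staleness and repopulation cost move onto you (schema and data are illustrative):

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE events (ts TIMESTAMP, page TEXT, ms INTEGER);
    INSERT INTO events VALUES
        ('2024-10-01 10:00', '/home', 120),
        ('2024-10-01 10:01', '/home', 80),
        ('2024-10-01 10:02', '/docs', 200);
""")

# Materialize at ingest time: a derived table shaped for the dashboard
# workload, so query time becomes a cheap lookup.
con.execute("""
    CREATE TABLE page_stats AS
    SELECT page, count(*) AS hits, avg(ms) AS avg_ms
    FROM events GROUP BY page
""")

# The trade-off: new events require repopulating (or incrementally
# maintaining) page_stats, and readers may briefly see stale data.
print(con.execute("SELECT * FROM page_stats ORDER BY hits DESC").fetchall())
```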
In this talk, we share the results of an in-depth analysis of data gathered from over 1 billion open source package downloads across more than 2000 diverse projects on Scarf. Our findings offer valuable insights into user behaviors and interactions with open source software, making it essential for maintainers, founders, and executives in open source companies.
During the presentation, we delve deep into our data, uncovering the best practices employed by successful open source projects. We explore a wide array of topics, including various download formats, packaging systems, regional download trends, and user-favored documentation types. Additionally, we discuss the impact of community engagement and how maintainers can harness their user base to boost project adoption and drive business growth.
Attendees can expect to leave this talk equipped with actionable insights and best practices to optimize their open source projects and thrive in the competitive landscape of open source software.
Meltano is a powerful open-source data movement tool that has revolutionized the way organizations handle their data pipelines at scale. Before the Analytics Development Lifecycle was even a thing, Meltano was working to bring software engineering best practices to data teams. This session will explore how Meltano addresses common data management challenges with a single, unified, and customizable platform.
Window Functions allow you to group rows in a table for in-depth investigation when analyzing data in a relational database. Structured Query Language is fantastic for retrieving data, but once you have that data, you need a way to classify it. Window Functions provide a way to obtain sales totals to date, group time spent by department, or calculate running totals over clusters of data points. We will start with the basics by defining what a window can be, then proceed to rankings, calculating quartiles, and including aggregate functions. Consider this a mandatory session if you need to crunch numbers from MySQL or PostgreSQL in your job.
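For a runnable taste of the idea, here is a running-total example using SQLite’s standard window-function syntax; the same SQL works in MySQL 8+ and PostgreSQL, and the data is illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # window functions need SQLite >= 3.25
con.executescript("""
    CREATE TABLE sales (day TEXT, dept TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-01-01', 'toys', 100), ('2024-01-02', 'toys', 150),
        ('2024-01-01', 'books', 80), ('2024-01-02', 'books', 40);
""")

# A running total per department: the OVER clause defines the "window"
# of rows each output row can see.
for row in con.execute("""
    SELECT day, dept, amount,
           SUM(amount) OVER (PARTITION BY dept ORDER BY day) AS running_total
    FROM sales
    ORDER BY dept, day
"""):
    print(row)
```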