Streaming analytics refers to the process of analyzing real-time data that is generated continuously and rapidly from various sources, such as sensors, applications, social media, and other internet-connected devices. Streaming analytics platforms enable organizations to extract business value from data in motion, similar to how traditional analyti…
Distributed databases are necessary for storing and managing data across multiple nodes in a network. They provide scalability, fault tolerance, improved performance, and cost savings. By distributing data across nodes, they allow for efficient processing of large amounts of data and redundancy against failures. They can also be used to store data …
DataSet is a log analytics platform provided by SentinelOne that helps DevOps, IT engineering, and security teams get answers from their data across all time periods, both live streaming and historical. It’s powered by a unique architecture that uses a massively parallel query engine to provide actionable insights from the data available. John Har…
There are many types of early stage funding available from friends and family to seed to series A. Some firms invest across a wide set of technologies and seek only to provide capital. Others are in it for the long haul – they focus on specific areas of technology and develop both long term relationships and deep expertise over time. Today, we are …
The Presto/Trino project makes distributed querying easier across a variety of data sources. As the need for machine learning and other high volume data applications has increased, the need for support, tooling, and cloud infrastructure for Presto/Trino has increased with it. Starburst helps your teams run fast queries on any data source. With Star…
Building and managing data-intensive applications has traditionally been costly and complex, and has placed an operational burden on developers as their organizations scale. Today’s developers, data scientists, and data engineers need a streamlined, single cloud data platform for building applications, pipelines, and machine learning mo…
Data analytics technology and tools have seen significant improvements in the past decade. But it can still take weeks to prototype, build, and deploy new transformations, usually requiring considerable engineering resources. Plus, most data isn’t real-time. Instead, most of it is still batch-processed. Tinybird Analytics provides a…
Ian Coe (CEO) and Adam Kamor (Head of Engineering). Companies that gather data about their users have an ethical obligation and legal responsibility to protect the personally identifiable information in their dataset. Ideally, developers working on a software application wouldn’t need access to production data. Yet without high-quality example data, many te…
Couchbase is a distributed NoSQL cloud database. Since its creation, Couchbase has expanded into edge computing, application services, and most recently, a database-as-a-service called Capella. Couchbase started as an in-memory cache and needed to be rearchitected to be a persistent storage system. In this episode, we interview Ravi Mayuram, SVP …
Streaming data platforms like Kafka, Pulsar, and Kinesis are now common in mainstream enterprise architectures, providing low-latency real-time messaging for analytics and applications. However, stream processing – the act of filtering, transforming, or analyzing the data inside the messages – is still an exercise left to the receiving microservice…
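Stream processing boils down to applying filtering and transformation stages to messages as they flow by, rather than in the receiving microservice. A minimal sketch of that idea using Python generators (the message shape and threshold are hypothetical, not tied to any real broker):

```python
# Hypothetical sketch: stream processing as a chain of lazy generator stages.

def source(messages):
    """Yield raw messages as they 'arrive' from a broker."""
    for msg in messages:
        yield msg

def filter_stage(stream, min_value):
    """Drop messages below a threshold (the filtering step)."""
    return (m for m in stream if m["value"] >= min_value)

def transform_stage(stream):
    """Enrich each surviving message (the transformation step)."""
    return ({**m, "value_squared": m["value"] ** 2} for m in stream)

raw = [{"id": 1, "value": 3}, {"id": 2, "value": 10}, {"id": 3, "value": 7}]
processed = list(transform_stage(filter_stage(source(raw), min_value=5)))
print(processed)  # messages 2 and 3 survive, each enriched
```

Because every stage is lazy, nothing is buffered: each message is filtered and transformed the moment it arrives, which is the property stream processors exploit.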
Data-as-a-service is a company category type that is not as common as API-as-a-service, software-as-a-service, or platform-as-a-service. In order to vend data, a data-as-a-service provider needs to define how that data will be priced, stored, and delivered to users: streaming over an API or served via static files. Naqeeb Memon of Safegraph joins t…
Data labeling allows machine learning algorithms to find patterns among the data. There are a variety of data labeling platforms that enable humans to apply labels to this data and ready it for algorithms. Heartex is a data labeling platform with an open source core. Michael Malyuk joins the show to talk through the platform and modern usage of dat…
Real-time analytics are difficult to achieve because large amounts of data must be integrated into a data set as that data streams in. As the world moved from batch analytics powered by Hadoop into a norm of “real-time” analytics, a variety of open source systems emerged. One of these was Apache Pinot. StarTree is a company based on Apache Pinot th…
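The core difficulty described above is integrating new data into an aggregate as it streams in, instead of recomputing over the whole batch. A toy sketch of incremental aggregation (the class and values are illustrative, not Pinot's internals):

```python
# Illustrative sketch: maintain a running aggregate updated per event,
# so queries stay fresh without re-scanning historical data.

class RunningAvg:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def ingest(self, value):
        # Each incoming event updates the aggregate in O(1).
        self.count += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

agg = RunningAvg()
for latency_ms in (120, 80, 100):
    agg.ingest(latency_ms)
print(agg.mean)  # 100.0
```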
Data loss can occur when large data sources such as Slack or Google Drive get leaked. In order to detect and avoid leaks, a data asset graph can be built to understand the risks of a company environment. Polymer is a data loss prevention product that helps companies avoid problematic data leaks. Yasir Ali is the founder of Polymer and joins the sho…
Data integration infrastructure is not easy to build. Moving large amounts of data from one place to another has historically required developers to build ad hoc integration points to move data between SaaS services, data lakes, and data warehouses. Today, there are dedicated systems and services for moving these large batches of data. Airbyte buil…
Modern organizations eventually face data governance challenges. Keeping track of where data came from, which systems update it, and in what ways updates can be made are just some of the issues to be tackled. Large organizations face additional challenges like training, onboarding, and capturing the institutional knowledge that leaves with the departure…
The solution many turn to for capturing their streaming data is InfluxDB. In this episode, I interview Brian Gilmore, Director of Product Management at InfluxData, about how real time applications achieve success built on top of InfluxDB. When most people hear the phrase Internet of Things, it typically evokes an image of connected devices we insta…
Lior Gavish and James Densmore. Data infrastructure is a fast-moving sector of the software market. As the volume of data has increased, so too has the quality of tooling to support data management and data engineering. In today’s show, we have a guest from a data intensive company as well as a company that builds a popular data engineering product. Jam…
Running a database company requires expertise in both technical and managerial skills. There are deeply technical engineering questions around query paths, scalability, and distributed systems. And there are complex managerial questions around developer productivity and task allocation. Sam Lambert is the CEO of PlanetScale, which is building moder…
SingleStore is a multi-use, multi-model database designed for transactional and analytic workloads, as well as search and other domain specific applications. SingleStore is the evolution of the database company MemSQL, which sought to bring fast, in-memory SQL database technology to market. Jordan Tigani is Chief Product Officer of SingleStore and …
DuckDB is a relational database management system with no external dependencies, with a simple system for deployment and integration into build processes. It enables complex queries in SQL with a large function library, and provides transactional guarantees through multi-version concurrency control. Hannes Mühleisen works on DuckDB and joins the sh…
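The appeal of an embedded, in-process SQL engine with transactional guarantees can be sketched in a few lines. Since the duckdb package may not be installed, this hypothetical example uses Python's stdlib sqlite3 as a stand-in for the same usage pattern (DuckDB's Python API is similar in spirit, not identical):

```python
# Stand-in sketch of embedded, in-process SQL: no server, no external
# dependencies, with transactional writes. Table and data are made up.
import sqlite3

con = sqlite3.connect(":memory:")  # runs inside the host process
con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
with con:  # wraps the inserts in a single transaction
    con.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a", 1.5), ("a", 2.5), ("b", 4.0)],
    )
rows = con.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)  # [('a', 2.0), ('b', 4.0)]
```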
Customer data pipelines power the backend of many successful web platforms. In a customer data pipeline, data is collected from sources such as mobile apps and cloud SaaS tools, transformed and munged using data engineering, stored in data warehouses, and piped to analytics, advertising platforms, and data infrastructure. RudderStack is an open sou…
The data lake architecture has become broadly adopted in a relatively short period of time. In a nutshell, that means data in its raw format stored in cloud object storage. Modern software and data engineers have no shortage of options for accessing their data lake, but that list shrinks quickly if you care about features like transactions. Apache…
A data catalog provides an index into the data sets and schemas of a company. Data teams are growing in size, and more companies than ever have a data team, so the market for data catalogs is larger than ever. Mark is the CEO of Stemma and the co-creator of Amundsen, a data catalog that came out of Lyft. We have previously explored the basics of Amu…
Splunk is a monitoring and logging platform that has evolved over its 18 years of existence. In its modern focus on observability it is focused on open source and AIOps. Observability has evolved with the growth of Kubernetes, and Splunk’s work around OpenTelemetry has kept parity with the open source community of Kubernetes. Spiros Xanthos is the …
Barry McCardel (Co-Founder and CEO at Hex) and Caitlin Colgrove (Co-Founder and CTO at Hex). In contrast t…
When writing code, test driven development is a common accepted methodology to ensure the development of high quality software. Your organization’s data, on the other hand, is an entirely different challenge. Data can be missing due to human error, a failure with a 3rd party provider, a botched release, or dozens of other issues. When not missing, …
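The idea of applying test-driven discipline to data rather than code can be sketched as a set of expectations checked against each record. The rules and record shape here are hypothetical:

```python
# Minimal sketch: expectations ("tests") run against data instead of code.

def check_rows(rows):
    """Return (row_index, problem) pairs for every failed expectation."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("email") in (None, ""):
            problems.append((i, "missing email"))
        if not isinstance(row.get("age"), int) or not (0 <= row["age"] <= 130):
            problems.append((i, "age out of range"))
    return problems

rows = [
    {"email": "a@example.com", "age": 34},
    {"email": "", "age": 29},               # missing value (human error, bad release...)
    {"email": "b@example.com", "age": -5},  # present, but wrong
]
print(check_rows(rows))  # [(1, 'missing email'), (2, 'age out of range')]
```

Running checks like these on every load is what separates "the pipeline ran" from "the data is trustworthy."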
Database product companies typically have a few phases. First, the company will develop a technology with some kind of innovation such as speed, scalability, or durability. The company will offer support contracts around that technology for a period of time, before eventually building a managed, hosted offering. PlanetScale is a database company bu…
If you haven’t encountered a data quality problem, then you haven’t yet worked on a large enough project. Invariably, a gap exists between the state of raw data and what an analyst or machine learning engineer needs to solve their problem. Many organizations needing to automate data preparation workflows look to Trifacta as a solution. In this epis…
Relational databases have been a fixture of software applications for decades. They are highly tuned for performance and typically offer explicit guarantees like transactional consistency. More recently, there’s been a figurative Cambrian explosion of other-than-relational databases. Simple key value stores or counters were an early win in this spa…
The lifeblood of most companies is their sales departments. When you’re selling something other than a commodity, it’s typically necessary to carefully groom the onboarding experience for inbound future customers. Historically, companies approached this in a one-size-fits-all manner, giving all customers a common experience. In today’s data-driven …
Application observability is a fairly mature area. Engineering teams have a wide selection of tools they can choose to adopt and a significant amount of thought leadership and philosophy already exists giving guidance for managing your application. That application is going to persist data. As you scale up, your system is invariably going to experi…
Consumers are increasingly becoming aware of how detrimental it can be when companies mismanage data. This demand has fueled regulations, defined standards, and applied pressure to companies. Modern enterprises need to consider corporate risk management and regulatory compliance. In this interview, I speak with Terry O’Daniel, Director of Engineeri…
The internet is a layer cake of technologies and protocols. At a fundamental level, the internet runs on the TCP/IP protocol. It’s a packet-based system. When your browser requests a file from a web server, that server chops up the file into tiny pieces known as packets and puts them on the network labeled with your machine’s address as its destina…
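The chopping-and-labeling process described above can be sketched in a few lines. The field names here are hypothetical illustrations, not real TCP/IP headers:

```python
# Illustrative sketch of packetization: a payload is split into fixed-size
# chunks, each labeled with a destination and sequence number so the
# receiver can reassemble them in order.

MTU = 4  # tiny chunk size, for demonstration only

def packetize(data: bytes, dest: str):
    return [
        {"dest": dest, "seq": i, "payload": data[i:i + MTU]}
        for i in range(0, len(data), MTU)
    ]

def reassemble(packets):
    return b"".join(p["payload"] for p in sorted(packets, key=lambda p: p["seq"]))

packets = packetize(b"hello world", "192.0.2.1")
print(len(packets))         # 3 packets
print(reassemble(packets))  # b'hello world'
```

Because each packet carries its own sequence number, the network is free to deliver them out of order and the receiver can still reconstruct the original file.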
It does not matter if it runs on your machine. Your code must run in the production environment and it must do so performantly. For that, you need tooling to better understand your application’s behavior under different circumstances. In the earliest days of software development, all we had were logs, which are still around and incredibly useful. Y…
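Logs remain the baseline tool mentioned above. A small sketch using Python's stdlib logging module (the logger name, format, and business logic are illustrative):

```python
# Sketch of structured-ish logging with the stdlib logging module:
# each log line carries a timestamp, level, logger name, and key=value data.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("checkout")

def charge(amount_cents):
    log.info("charge attempted amount_cents=%d", amount_cents)
    if amount_cents <= 0:
        log.error("charge rejected amount_cents=%d", amount_cents)
        return False
    return True

charge(1999)  # emits an INFO line
charge(0)     # emits INFO, then ERROR
```

Even this much gives you what raw print statements don't: levels to filter on, timestamps to correlate with incidents, and a logger name to trace the event back to a subsystem.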
The manner in which users interact with technology has rapidly switched to mobile consumption. The devices almost all of us carry with us at all times open endless opportunities for developers to create location-based experiences. Foursquare became a household name when they introduced social check-ins. Today they’re a location data platform. Ankit …
Modern business applications are complex. It’s not enough to have raw logs or some basic telemetry. Today’s enterprise organizations require an application performance monitoring (APM) solution. Today’s applications are complex distributed systems whose performance depends on a wide variety of factors. Every single line of code can affect producti…
Infrastructure as Code is an approach to machine provisioning and setup in which a programmer describes the underlying services they need for their projects. However, this infrastructure code doesn’t compile a binary artifact like traditional source code. The successful completion of running the code signals that the servers and other components de…
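The declarative model described above, where running the code converges real infrastructure toward a described state rather than producing a binary, can be sketched as a diff between desired and current state. The resource names and plan format here are hypothetical, not any real tool's:

```python
# Hypothetical sketch of the reconcile loop behind Infrastructure as Code:
# compare desired state to current state and emit the actions needed.

desired = {"web": {"count": 3}, "db": {"count": 1}}
current = {"web": {"count": 1}, "cache": {"count": 2}}

def plan(desired, current):
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name, spec))
        elif current[name] != spec:
            actions.append(("update", name, spec))
    for name in current:
        if name not in desired:
            actions.append(("destroy", name, None))
    return actions

print(plan(desired, current))
# [('update', 'web', {'count': 3}), ('create', 'db', {'count': 1}),
#  ('destroy', 'cache', None)]
```

Successful completion of the plan, not a compiled artifact, is what signals that the described servers and components exist.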
The first industrial deployments of machine learning and artificial intelligence solutions were bespoke by definition and often had brittle operating characteristics. Almost no one builds custom databases, web servers, or email clients. Yet technology groups today often consider developing homegrown ML and data solutions in order to solve their uni…
A monorepo is a version control strategy in which all your code is contained in one potentially large but complete repository. The monorepo is in stark contrast to an alternative approach in which software teams independently manage microservices or deliver software as libraries to be imported in other projec…
Applications write data to persistent storage like a database. The most popular database query language is SQL which has many similar dialects. SQL is expressive and powerful for describing what data you want. What you do with that data requires a solution in the form of a data pipeline. Ideally, these analytical workflows can follow similar best p…
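The split described above, where SQL declares *what* data you want and a pipeline stage decides *what you do* with it, can be sketched end to end. Stdlib sqlite3 stands in for a real warehouse here, and the table and columns are hypothetical:

```python
# Sketch of a tiny extract-transform step: a declarative SQL query feeds
# an imperative downstream pipeline stage.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, total REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ada", 10.0), ("bob", 5.0), ("ada", 7.5)],
)

# Extract: SQL expresses the selection and aggregation declaratively.
rows = con.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# Transform: a pipeline stage downstream of the query.
report = {customer: round(total, 2) for customer, total in rows}
print(report)  # {'ada': 17.5, 'bob': 5.0}
```

Treating the transform as its own versioned, testable step is what lets analytical workflows follow the same best practices as application code.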
Tedious, repetitive tasks are better handled by machines. Unless these tasks truly require human intelligence, repetitive tasks are often good candidates for automation. Implementing process automation can be challenging and technical. Increasingly, engineers are seeking out tools and platforms to facilitate faster, more reliable automation. In thi…
A developer’s core deliverables are individual commits and the pull requests they aggregate into. While the number of lines of code written alone may not be very informative, in total, the code and metadata about the code found in tracking systems present a rich dataset with great promise for analysis and productivity optimization insights. LinearB…
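The kind of analysis hinted at above, deriving signals from commit and pull request metadata, can be sketched with stdlib tools. The records and metrics are illustrative inventions, not LinearB's actual model:

```python
# Hypothetical sketch: simple productivity signals mined from commit metadata.
from collections import Counter
from statistics import mean

commits = [
    {"author": "dana", "files_changed": 3, "pr": 101},
    {"author": "dana", "files_changed": 12, "pr": 101},
    {"author": "eli", "files_changed": 1, "pr": 102},
]

commits_per_author = Counter(c["author"] for c in commits)
avg_files = mean(c["files_changed"] for c in commits)
prs = {c["pr"] for c in commits}

print(commits_per_author)  # Counter({'dana': 2, 'eli': 1})
print(len(prs))            # 2 pull requests aggregate these 3 commits
```

Raw line counts mean little on their own, but aggregates like these, taken over a whole team and over time, are where the dataset starts yielding insight.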
Modern companies leverage dozens or even hundreds of software solutions to solve specific needs of the business. Organizations need to collect all these disparate data sources into a data warehouse in order to add value. The raw data typically needs transformation before it can be analyzed. In many cases, companies develop homegrown solutions, thus…
Instabase is a technology platform for building automation solutions. Users deploy it onto their own infrastructure and can leverage the tools offered by the platform to build complex workflows for handling tasks like income verification and claims processing. In this episode we interview Anant Bhardwaj, founder of Instabase. He describes Instabase…
Time series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time. This could be server metrics, application performance monitoring, network data, sensor data, events, clicks, trades in a market, and many other types of analytics data (influxdata.com). The platform InfluxData is designed for build…
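Downsampling, one of the operations listed above, means bucketing raw measurements into fixed time windows and aggregating each bucket. A minimal sketch with made-up values:

```python
# Illustrative sketch of time-series downsampling: group points into
# fixed 60-second windows, then average each window.
from collections import defaultdict
from statistics import mean

WINDOW = 60  # seconds

points = [  # (unix_timestamp, value)
    (0, 10.0), (30, 20.0), (65, 30.0), (90, 50.0), (130, 5.0),
]

buckets = defaultdict(list)
for ts, value in points:
    buckets[ts // WINDOW * WINDOW].append(value)

downsampled = {start: mean(vals) for start, vals in sorted(buckets.items())}
print(downsampled)  # {0: 15.0, 60: 40.0, 120: 5.0}
```

Trading raw resolution for compact aggregates like this is what keeps long-retention time-series storage affordable.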
Whether sending messages, shopping in an app, or watching videos, modern consumers expect information and responsiveness to be near-instant in their apps and devices. From a developer’s perspective, this means clean code and a fast database. Apache Druid is a database built to power real-time analytic workloads for event-driven data, like user-faci…