Calming Data Chaos to Empower Real-Time Decisions

December 7, 2023

Solving complex problems requires massive amounts of data drawn from many different sources, formats, and locations. Take emergency response, for example. Government officials and aid agencies must use information from dozens of disparate sources, systems, and databases to implement their emergency response plans, optimize resource allocation, and ensure accurate decision making at speed. Items like tax and parcel records, property deeds, demographic data, and satellite imagery are all used to implement and coordinate efforts, requiring intense workflows and tremendous computing and human power.

Founded in 2020, SeerAI is revolutionizing the way the world understands and interacts with data. We provide our customers with an analytics platform optimized for spatiotemporal data that collapses and streamlines all that disparate data into a unified workflow to answer complex questions. We leverage ML and AI to help data scientists and analysts work with massive, disparate, multi-domain data, so they can solve hard problems in hours or minutes instead of weeks or months.

Data Pipelining and Versioning

SeerAI scales analytics horizontally and exponentially, simplifying and accelerating problem-solving workflows. Our customers are public and private sector consultants and decision-makers who use analytics technology to forecast everything from supply chain bottlenecks to the environmental impact of climate change. We provide them with tools and access to the data sets they need to run complex analytics workflows. We also help organizations eliminate internal obstacles by connecting and integrating elements that were previously siloed.

From the beginning, we knew we needed a powerful data pipelining and versioning tool to accelerate and automate our workflows, build complex pipelines, and integrate disparate ML models and data sets. We researched the best-of-breed pipelining tools and weighed the costs and benefits of building an in-house solution instead.

Our core software platform was built from the ground up on Cloud Native Microservices. We needed a data pipelining solution compatible with our containerized workflows with the ability to trace the lineage of all our actions to simplify repeatability. We looked at pipelining tools like Apache NiFi and Apache Airflow, but none of them had data versioning. The only product that offered both pipelining and versioning functionalities was HPE Machine Learning Data Management Software (MLDMS), then known as Pachyderm.

Stabilizing and Simplifying Pipeline Management

To maximize efficiencies within our development team, we created a clear division of labor: our DevOps, engineering, and ML Ops teams built Continuous Integration (CI) pipelines in Kubernetes and orchestrated Docker containers, effectively automating all the labor-intensive elements. This freed time to focus on software development and accelerate the work of our engineers.

We further automated these assets with HPE MLDMS, which is a beast when it comes to configuring pipelines and versioning data. We built a framework of checks that took us from manually coding pipelines over hours—sometimes days—to building them in minutes with HPE MLDMS.

Since rolling out HPE MLDMS in early 2022, our pipelines have been very stable and there’s been a marked difference in our productivity. We’ve begun using HPE MLDMS’s improved dashboards to manage workflows, and the user interface has immensely simplified the process. You can do a lot with the HPE MLDMS command line, but when you’re running dozens of simultaneous pipelines, it is not easy to see quickly how everything is connected. The dashboard helps you to see everything at a glance. Some of our pipelines have six or seven steps, and the user interface makes it easy to support and improve analytics when a data scientist calls us for help.

Scaling Complex Customer Workflows Through Parallel Processing

HPE MLDMS allows us to accelerate complex workflows through horizontal scaling. For example, we created a workflow that identifies and processes pairs of satellite images collected over eight years. Identifying image pairs happens fairly fast, but processing each pair takes 30–60 minutes and requires massive amounts of storage and computing. HPE MLDMS leverages Kubernetes to process pairs in parallel, which is critical to delivering data on short timelines.
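
The fan-out pattern described above can be illustrated with a minimal local sketch. It is not the HPE MLDMS API and not SeerAI's actual pipeline: the `Image` record, the pairing rule (same tile, different years), and the `process_pair` placeholder are all hypothetical, and a thread pool stands in for the Kubernetes workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import NamedTuple


class Image(NamedTuple):
    tile: str  # hypothetical identifier for the ground area an image covers
    year: int  # hypothetical acquisition year


def find_pairs(images):
    """Cheap step: pair images of the same tile taken in different years."""
    pairs = []
    for a, b in combinations(sorted(images), 2):
        if a.tile == b.tile and a.year != b.year:
            pairs.append((a, b))
    return pairs


def process_pair(pair):
    """Placeholder for the expensive per-pair step; returns a label only."""
    earlier, later = pair
    return f"{earlier.tile}:{earlier.year}-{later.year}"


def run(images, workers=4):
    pairs = find_pairs(images)
    # Each pair is independent of the others, so the work can be fanned out.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_pair, pairs))


if __name__ == "__main__":
    catalog = [Image("A", 2015), Image("A", 2016), Image("B", 2015), Image("A", 2017)]
    print(run(catalog))
```

Because no pair depends on another, total wall-clock time shrinks roughly in proportion to the number of workers, which is why the slow per-pair step benefits most from horizontal scaling.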

We recently did a POC for an enterprise client who required the ability to generate data and process an image set on a very short timeline. Without HPE MLDMS, we would have spent that time (and more) writing new pipelines to parallelize the workflow and waiting for the output. Instead, we delivered the results on time, and the POC led to a long-term engagement with that company.

Our powerful technology can leverage hundreds of petabytes of authoritative data, including satellite imagery, tabular data sets, weather data, vector data, and USGS seismic data. We use HPE MLDMS to sort out dependencies and perform complex transformations to make data and data sets analysis-ready for our models. The belief that AI and ML are universal remedies is widespread, yet without the ability to provide massive amounts of data from multiple sources in a form that machines can use, even the most advanced algorithms will fail to be effective.

Building the World’s First Data Mesh Optimized for Spatiotemporal Data

SeerAI’s platform encompasses the world’s first and only data mesh technology optimized for spatiotemporal data. Expanding on the groundbreaking concept of data mesh, SeerAI’s technology goes further to create an Intelligent Mesh. SeerAI’s Intelligent Mesh acts as the linchpin, ensuring that data distributed across various domains is not only accessible but also usable by diverse systems.

We can access virtually any data set without moving it or performing ETL transformations, avoiding data replication, duplication, and the associated latency. This saves significant amounts of time, resources, and cost associated with moving, storing, and transforming data. Every data set becomes a node in a knowledge graph that encodes intrinsic relationships within and between data sets. Understanding how these data sets are connected allows users to solve problems and build repeatable solutions for further interaction and collaboration.
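
As a rough illustration of the knowledge-graph idea, the sketch below treats each data set as a node and each relationship as a labeled edge, then walks the graph to find everything connected to a starting data set. The `DatasetGraph` class, the data set names, and the relationship labels are hypothetical and are not drawn from SeerAI's platform.

```python
from collections import defaultdict, deque


class DatasetGraph:
    """Toy knowledge graph: nodes are data set names, edges are named relationships."""

    def __init__(self):
        self._edges = defaultdict(list)

    def relate(self, source, relation, target):
        # Store the link in both directions so it can be followed from either side.
        self._edges[source].append((relation, target))
        self._edges[target].append((relation, source))

    def reachable(self, start):
        """Return every data set connected to `start`, directly or indirectly."""
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for _, neighbor in self._edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        seen.discard(start)
        return sorted(seen)


if __name__ == "__main__":
    graph = DatasetGraph()
    graph.relate("parcel_records", "overlaps", "satellite_imagery")
    graph.relate("satellite_imagery", "same_period_as", "weather_data")
    graph.relate("tax_records", "describes", "parcel_records")
    graph.relate("seismic_data", "overlaps", "vector_data")
    print(graph.reachable("parcel_records"))
```

In this toy example, a query that starts from parcel records discovers the imagery, tax, and weather data sets through their links, while the unrelated seismic and vector data sets stay out of the result.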

These data sources reside in dozens of systems worldwide and are organized with different topologies, ontologies, and schemas. The data mesh creates a unified data layer, ensuring that every user, whether machine or human, has not just access to the data but also the profound knowledge embedded within it. The data might originate from the Google Earth Engine, AWS Public Cloud, Microsoft Planetary Computer, or the ArcGIS Living Atlas, but to our users, it’s one big usable data set. SeerAI brings these sources together with HPE MLDMS, giving our customers tools to perform complex analyses on these amalgamated assets.

Delivering Veracity at Velocity

ML and AI have revolutionized analytics at a time when data is growing exponentially. Imaging and remote technology deliver increasingly detailed data, and the Internet of Things captures more information and requires more storage than we ever imagined.

HPE MLDMS is helping SeerAI integrate these massive and disparate data sets at scale, reducing processing times. We are enabling organizations to build logical connections between disparate data sets, discover the unrealized potential of big data, and deliver veracity at velocity.

Malachi Keddington

CRO at SeerAI, Inc.