Powering Climate Risk Analysis


Business leaders are increasingly factoring the effects of climate change into their decisions, but there's a lot of uncertainty in the space. This uncertainty makes it hard to know where to steer the ship—and leaders need data to help them navigate the waters.

RiskThinking.AI uses complex data sets and large-scale data models to predict the effects of climate change on businesses all the way out to the year 2100 for a variety of risk factors. Our customers include small and large banks, corporations, government entities, and nonprofits. Our projections enable them to mitigate and future-proof their activities in response to changing climate conditions like rising sea levels, drought, and extreme weather events. 

As the Lead Software Developer at RiskThinking.AI, I teach our data scientists to think carefully and critically about the shape of their data—the structure, the data types, the hierarchies—to move seamlessly between pipelines and compute stages. But we also rely on technology to make it happen.

Datasets Are the Core of Our Business

I joined RiskThinking.AI in 2019 as a full-stack developer and helped launch our first product. Our VELO suite of products combines physical asset data, global climate data, and climate risk score calculations to produce customized data outputs that can be integrated into broader risk management analyses. These outputs span multiple scenario pathways, future time horizons, and combinations of risk factors.


Governments and financial institutions are increasingly requiring businesses to demonstrate their sustainability commitments—including adherence to the United Nations Sustainable Development Goals (SDGs)—to secure financing using the Task Force on Climate-related Financial Disclosures (TCFD) framework.

Datasets are the core of our business. In addition to acting as a liaison between our developers, data team, CTO, and Chief Science Officer (CSO), I work with our data scientists to store and orchestrate data efficiently. We create automations to ingest global climate data and build pipelines to automate climate risk projections, reports, and deliverables for our customers and to power our VELO platform.

Data Lineage and Versioning Are Crucial

As a startup, we had to learn ML modeling and how to generate and optimize data for our enterprise clients. They have strict data retention, security, and versioning requirements, and we had to build tools that met these criteria from the ground up. It took time and money—both of which were in short supply—to engineer our platform, but we pulled it off with the tools we had.


The bulk of our data resides in Google Cloud, which provides only a foundation for organizing and cleaning data. Building the rest of the infrastructure on top of it to meet our clients' enterprise-grade and regulatory requirements was a laborious process, and even then we lacked critical pieces needed to organize all of our data.

Data lineage is crucial to our climate risk calculations because our customers might want to drill down from the climate risk score projections all the way to our source data, which is sizeable and widespread, with differences in structure, format, quality, and organization. Versioning is equally important because customers may want to explore different risk factors, transition risk factors, and forecasting methodologies. Finally, we need to store all the data we use for customer deliverables in case we have to go back in time and see exactly what code, methodology, and data were used, allowing a customer to understand how their projections and risk change over time.
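The principle behind this requirement can be sketched generically: every deliverable records the exact code version and the content hashes of its input data, so any output can be traced back to its sources and reproduced. The following minimal Python illustration is a conceptual sketch only; the class and field names are hypothetical, not RiskThinking.AI's or HPE's actual schema:

```python
import hashlib
import json

def content_hash(data: bytes) -> str:
    """Content-address a blob so identical inputs always map to the same ID."""
    return hashlib.sha256(data).hexdigest()

class LineageRecord:
    """Links one deliverable to the exact code and data that produced it."""
    def __init__(self, code_version: str, input_blobs: dict) -> None:
        self.code_version = code_version
        # Store hashes, not the data itself: the raw blobs stay in object storage.
        self.input_hashes = {name: content_hash(blob)
                             for name, blob in input_blobs.items()}

    def manifest(self) -> str:
        """A reproducibility manifest stored alongside the deliverable."""
        return json.dumps({"code": self.code_version,
                           "inputs": self.input_hashes}, sort_keys=True)

# Re-running with the same code and data yields an identical manifest,
# so a past deliverable can be audited and reproduced exactly; any change
# to an input produces a different manifest.
a = LineageRecord("v1.4.2", {"sea_level.nc": b"example bytes"})
b = LineageRecord("v1.4.2", {"sea_level.nc": b"example bytes"})
assert a.manifest() == b.manifest()
```

In practice a data-versioning system maintains these relationships automatically across commits and pipeline runs, which is what made an off-the-shelf tool attractive to us.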

When looking for a solution to meet our requirements, I considered Data Version Control (DVC), SageMaker, and MLFlow. Unfortunately, each was too complex for our team to learn and manage with our limited time and resources, and none of them addressed all of our core requirements in a single package.

Software That Addressed Our Pain Points

I heard about HPE Machine Learning Data Management Software (MLDMS) (formerly Pachyderm) on a Y Combinator Hacker News thread about DVC alternatives. We’d previously tried to deploy various versioning tools, but HPE engineers helped us stand everything up, and we got off the ground very quickly. Our whole team bought into it because we had finally found an all-in-one solution that stores and versions our data, runs our data pipelines, and scales flexibly. We finally had a tool that addressed all our pain points.

HPE MLDMS partitions data and orchestrates pipelines, organizing data hierarchies and preparing ML models up front so we can adjust our data pipelines and scale compute resources to arrive at the desired outcomes. If a customer asks us to deliver a risk analysis tomorrow instead of next week, we can allocate more compute resources to the task and meet the deadline without modifying our inputs or data sources. For example, we can add thousands of CPU cores to a pipeline by changing a single line of configuration. Because our pipelines scale linearly, any increase in the compute resources we allocate to a task yields a proportional decrease in processing time, allowing us to predictably allocate resources for time-sensitive deliverables.
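As an illustration, a Pachyderm-style pipeline spec exposes worker count as a single field, `parallelism_spec`. The pipeline name, repo, and image below are hypothetical placeholders, not our actual configuration:

```json
{
  "pipeline": { "name": "risk-projection" },
  "input": { "pfs": { "repo": "climate-data", "glob": "/*" } },
  "transform": {
    "image": "example.org/risk/projector:latest",
    "cmd": ["python3", "/app/project.py"]
  },
  "parallelism_spec": { "constant": 64 }
}
```

Raising the `constant` value is the kind of one-line change that scales a pipeline out, because the platform splits the input data across the workers automatically.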

When data is verifiable from top to bottom, companies can pass along this level of transparency to customers.

This also fits into our value proposition. We want our data to be verifiable from top to bottom and fully transparent to auditors. HPE MLDMS delivers these features internally, allowing us to pass along this level of transparency to our customers.

Thinking About Data Differently

If a customer asks for specifics, like the impact of rising sea temperatures on their fishing fleets in 2050, we can go back to the source data and drill down to the models and specific data we used to generate our analysis. HPE MLDMS’s data lineage and versioning features allow us to pinpoint the information and analytic tools we use to assess the risks for individual assets, confirming how we arrived at our conclusions. 

We couldn’t do that out of the box with Google Cloud: we had to devise our own way to manage lineage in Google Cloud Storage, and then hand-code the pipelines and infrastructure to retrieve that data. We still use Google Cloud, but strictly to store our raw climate data, including 10 TB generated by our 500 HPE MLDMS pipelines. We have largely automated these pipelines, which is a big help for customers who ask for monthly reports on their sustainability initiatives and climate impacts.

Engineers should shape data and ML models to be scalable from the onset instead of coding extra resources after the fact.

Working with HPE Machine Learning Data Management Software allowed us to launch our next-generation VELO platform, and it's also transformed how I think about data. As a full-stack developer, I had basic assumptions about modeling data. Working with HPE engineers and our internal data scientists completely changed my perspective. I have used this newfound knowledge to improve how we store our climate data, build our pipelines, and plan deliverables for our clients.