Prioritizing Water Quality Sampling Using Machine Learning

Half of the world's population uses groundwater for drinking, and nearly as many utilize it to irrigate crops. This water stored in saturated zones under the earth's surface is essential to supply demands around the globe, making water quality critical.

Groundwater is typically a safe and reliable source of potable water. But as pumping exceeds recharge, water quality declines. That's because contaminants from pesticides and fertilizers, sewers and septic tanks, and hazardous materials and landfills may concentrate in the aquifers. Predicting the inorganic pollutants likely to be present in a groundwater supply is therefore vital.

Yet most state agencies and municipalities don't have the resources to conduct groundwater monitoring that would help prioritize where and what they should test. Narrowing that down is imperative because testing takes longer and costs more as the number of locations to sample and chemicals to detect increases.

A recent study from a team at Arizona State and North Carolina State Universities published in Environmental Science & Technology presents a machine learning (ML) framework that can pinpoint high-risk sites that could compromise groundwater quality. The framework helps reduce uncertainty when prioritizing sampling locations and chemical analyses—crucial information for water test laboratories.

ML Models Predict Pollutant Groupings

Geologic and environmental factors cause some pollutants, such as arsenic or lead, to occur naturally alongside specific other elements. Some of these elements pose a risk to human health. Others, like phosphorus, can be beneficial in agricultural contexts but pose environmental risks elsewhere.

The research team set out to determine how to predict the presence and concentrations of pollutants given the limited water quality data available for most groundwater supplies. They started with a dataset containing 20+ million data points—the result of monitoring more than 50 water quality parameters over 140 years.

The researchers used this huge dataset to train an ML model. The model predicted which elements would be present based on the available water quality data. Even if the team had data only on a handful of parameters in an area, the program could still predict which inorganic pollutants were likely to be in the water and how abundant they might be.

ML Framework Helps Locate Hotspots

Many datasets, including the one used in this study, which spans more than a century, are missing values. However, the researchers' work allowed them to replace missing data with estimated values. These imputed data revealed that the number of sampling locations potentially exceeding health-based limits was two to five times the number initially expected. In addition, the results identified samples where two to six chemicals may co-occur and surpass health-based levels. Linking imputed data to sampling locations can pinpoint hotspots with elevated chemical levels and guide additional field sampling and analysis.
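To make the idea of filling gaps and then checking limits concrete, here is a minimal sketch in Python. It is not the researchers' framework: it swaps in a simple nearest-neighbor fill, and the sample values, the thresholds, and the function names (`distance`, `impute`, `exceedances`) are all invented for illustration.

```python
import math

# Hypothetical samples; None marks a parameter that was never measured.
# All concentrations are invented and use one illustrative unit.
samples = [
    {"arsenic": 12.0, "lead": 3.0, "uranium": 35.0},
    {"arsenic": 2.0, "lead": 1.0, "uranium": 4.0},
    {"arsenic": 3.0, "lead": 2.0, "uranium": 6.0},
    {"arsenic": None, "lead": 4.0, "uranium": 29.0},
    {"arsenic": 14.0, "lead": None, "uranium": 33.0},
    {"arsenic": None, "lead": 1.0, "uranium": 5.0},
]

# Placeholder health-based thresholds, assumed for this sketch only.
limits = {"arsenic": 10.0, "lead": 15.0, "uranium": 30.0}


def distance(a, b):
    """Root-mean-square gap over the parameters both samples measured."""
    shared = [p for p in a if a[p] is not None and b[p] is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((a[p] - b[p]) ** 2 for p in shared) / len(shared))


def impute(rows, k=2):
    """Fill each gap with the mean of that parameter in the k most similar rows."""
    filled = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        for param, value in row.items():
            if value is not None:
                continue
            # Candidate donors are other rows where this parameter was measured.
            donors = sorted(
                (distance(row, other), other[param])
                for j, other in enumerate(rows)
                if j != i and other[param] is not None
            )[:k]
            if donors:
                new_row[param] = sum(v for _, v in donors) / len(donors)
        filled.append(new_row)
    return filled


def exceedances(row):
    """List the parameters whose known value is above its threshold."""
    return [p for p, v in row.items() if v is not None and v > limits[p]]


filled = impute(samples)
flagged_before = sum(1 for row in samples if exceedances(row))
flagged_after = sum(1 for row in filled if exceedances(row))
co_occurring = [i for i, row in enumerate(filled) if len(exceedances(row)) >= 2]

print("Samples over a limit, measured values only:", flagged_before)
print("Samples over a limit, after filling gaps:", flagged_after)
print("Samples with two or more chemicals over a limit:", co_occurring)
```

In this toy table, the fourth sample looks fine on its measured values, but its closest neighbors carry high arsenic, so the estimated value pushes it over the placeholder threshold. That is the kind of shift the study describes, at a vastly smaller scale.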

A key finding is that pollutants may exceed drinking water standards in more groundwater sources than previously documented. Although field data indicated that up to 80% of sampled locations were within safe limits, the ML framework predicts that only 15% to 55% may truly be risk-free.

That means agencies should prioritize many groundwater sites for additional testing. Identifying these potential hotspots lets state and local governments strategically allocate resources to high-risk areas, ensuring more targeted sampling and supporting more effective water treatment solutions.
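One simple way an agency or lab might turn predicted concentrations into a sampling order is sketched below. The scoring rule (count of chemicals over a limit, then the worst ratio to its limit) is an assumption made for this example, not something reported in the study, and the site names, values, and thresholds are hypothetical.

```python
# Placeholder thresholds and predicted concentrations, invented for illustration.
limits = {"arsenic": 10.0, "lead": 15.0, "uranium": 30.0}
predicted = {
    "site_A": {"arsenic": 13.0, "lead": 4.0, "uranium": 29.0},
    "site_B": {"arsenic": 2.5, "lead": 1.0, "uranium": 5.0},
    "site_C": {"arsenic": 14.0, "lead": 3.5, "uranium": 33.0},
    "site_D": {"arsenic": 9.0, "lead": 18.0, "uranium": 12.0},
}


def priority(values):
    """Score a site by how many chemicals exceed a limit, then by the worst ratio."""
    ratios = [values[p] / limits[p] for p in limits]
    chemicals_over = sum(1 for r in ratios if r > 1.0)
    return (chemicals_over, max(ratios))


# Highest-priority sites come first.
ranked = sorted(predicted, key=lambda site: priority(predicted[site]), reverse=True)

for site in ranked:
    chemicals_over, worst_ratio = priority(predicted[site])
    print(f"{site}: {chemicals_over} over limit, worst ratio {worst_ratio:.2f}")
```

A real prioritization would likely weigh other factors too, such as how many people a source serves or how uncertain each prediction is.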

Model Enhancements Offer Global Potential

As part of future work, the researchers plan to enhance the model. Expanding the training data will allow use across diverse U.S. regions. Integrating new data sources like environmental data layers can help address emerging contaminants. And conducting real-world testing could help ensure robust, targeted groundwater safety measures worldwide.

These improvements could make the groundwater supply safer worldwide and help labs working with agencies identify critical sampling locations and test parameters.


Diana Kightlinger
Journalist

Diana Kightlinger is an experienced journalist, copywriter, and blogger for science, technology, and medical organizations. She writes frequently for Fortune 500 corporate clients but also has a passion for explaining scientific research, raising awareness of issues, and targeting positive outcomes for people and communities. Diana holds master’s degrees in environmental science and journalism.