Clay cracks in central South Australia
Rian Dutch

Using Python to handle large geochemical datasets

The Geological Survey of South Australia (GSSA) holds a wealth of data in its geoscientific database SA Geodata. SA Geodata is the primary repository for geoscientific information in South Australia and contains open source data collected from a variety of sources.

The SA Geodata database contains over 10 GB of geochemical data. That’s a lot of chemistry. Explorers often request extracts of this data set, but then find it a challenge to handle all that data. Because of its size, programs like Excel won’t even open the file, and if the extract is small enough to open, explorers often find the format of the data a challenge. Generally, people like to use tidy data for analysis, where each row in a table represents all the data about a single sample. But the data exports are in a long format, where each row represents a single data point.

In this blog, I’ll show a simple method using Python with the Dask and Pandas libraries to query and extract the desired elements, and then quickly pivot them into the desired tidy structure. This example uses a copy of the SA Geodata geochemical dataset provided for the Unearthed ExploreSA: Gawler Challenge and is valid as at February 2020.

When there’s too much data

Dask is an open source library for parallel computing in Python. It can be used to parallelise computations across multiple machines, or across larger-than-memory data sets on a single machine. It’s also tightly integrated with the Python PyData stack, including Pandas and NumPy, sharing a very similar API and making working with large DataFrames easy.

From the Dask website: “A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster.”

Most operations in Dask are ‘lazy’, meaning they don’t actually evaluate anything until you explicitly ask for it. To return a computed result, you need to call the compute method, which evaluates the task graph and produces an output, converting a Dask DataFrame into a Pandas DataFrame. That means the computed result needs to be able to fit in memory.

Let’s work through an example to see how this all works. First we need to import the required libraries:

import dask.dataframe as dd
import pandas as pd
import numpy as np
import re

Loading the dataset

Because of the similar API between Dask and Pandas, loading the dataset is exactly the same as we would using Pandas, except we call Dask (dd) instead of Pandas (pd):

ddf = dd.read_csv(r'D:\Unearthed_SARIG_Data_Package\SARIG_Data_Package\sarig_rs_chem_exp.csv', 
                    dtype={'LITHO_CONF': 'object', 'STRAT_CONF': 'object'})

I found after some trial and error that I needed to force the data types for the ‘LITHO_CONF’ and ‘STRAT_CONF’ columns, because Dask incorrectly guessed them as integers. That’s why in the above code I have explicitly provided the data type (dtype) in the load call.

Next we can do some simple interrogation of the data like looking at the column names and using the head() method to view the top 5 rows of data:

ddf.columns
ddf.head()

[head() output, flattened in the original: the columns include ‘SITE_NO’, ‘EASTING_GDA2020’, ‘NORTHING_GDA2020’ and ‘ZONE_GDA2020’, and the first five rows all belong to the same rock outcrop / float sandstone sample (SITE_NO 589), with one analysis per row: Ti 0.15 %, Ag 500 ppb, Ba 1200 ppm, Co 2 ppm and Cr 1000 ppm, all by AES.]

Extracting the required data

For this exercise, we are going to create a tidy dataset (wide format data) which only contains the major elements for rock samples. In the table above we can see there is a column for ‘SAMPLE_SOURCE’. To see how many different sample types there are, and how many records for each, we can compute the value counts. As this evaluates the data, we need to call the compute method:

ddf['SAMPLE_SOURCE'].value_counts().compute()

Drill cuttings                                                  10956887
Sawn half drill core                                             7256976
Drill core                                                       6533477
Calcrete                                                         2154614
Soil                                                              989296
Drilled interval rock sample, type unspecified                    970508
Rock outcrop / float                                              537537
Sawn quarter drill core                                           237103
Drillhole log data - used in calculating values                   212746
Stream sediment                                                   164251
Auger sample from near surface                                    105514
Vegetation                                                         71018
Core sludge                                                        56217
A full face slice of core                                          52531
Gravel                                                             37740
Pulp / powdered rock of known origin, typically a lab return       16428
Magnetic lag fraction                                              10135
Duplicate: a split from an existing sample.                         8646
Sample specifically for bulk leach extractable gold analysis        7826
Aircore: consolidated sample from aircore drilling method           7488
Analytical standard check sample                                    6358
Mine rock sample                                                    6203
Channel (linear rock chip/outcrop sampling)                         4453
Rock outcrop sample                                                 4086
Lag sample, surface sample consisting of gravelly material.         3918
Costean (trench)                                                    3746
Filleted, shaved or ground core sample                              3478
Mine mullock / well spoil                                           3086
Loam                                                                2679
Mine stockpile                                                      2248
Lake floor sediment                                                 1799
Excrement of animals                                                 772
Mine tailings                                                        347
Rock float sample (not in situ)                                      335
Drillhole                                                            332
Smelter slag                                                         293
Rock subcrop sample (in situ, not attached to bedrock)                67
Single mineral                                                        40
Bulk (high volume) sample, diamond exploration                        19
Core Library archival rock, original source unknown                   12
Name: SAMPLE_SOURCE, dtype: int64

We can see above that there are a lot of samples that are not from rocks. We can create a new Dask DataFrame and only select samples from ‘rock sample’ types, excluding things like soils, lag and excrement of animals! Because this is a ‘lazy’ operation, it can be done on the total dataset and is quick:

rock_types = ['Drill cuttings', 'Sawn half drill core', 'Drill core',
              'Drilled interval rock sample, type unspecified',
              'Rock outcrop / float', 'Sawn quarter drill core',
              'Drillhole log data - used in calculating values',
              'A full face slice of core',
              'Pulp / powdered rock of known origin, typically a lab return',
              'Duplicate: a split from an existing sample.',
              'Channel (linear rock chip/outcrop sampling)',
              'Rock outcrop sample', 'Costean (trench)',
              'Filleted, shaved or ground core sample',
              'Rock float sample (not in situ)', 'Drillhole',
              'Rock subcrop sample (in situ, not attached to bedrock)']

ddf_rock_sample = ddf[ddf['SAMPLE_SOURCE'].isin(rock_types)]

Now that we just have rock samples, we’ve removed several million rows from our dataset, significantly reducing its size. Next we can reduce it further and grab only the samples of interest, in this case the ones containing major element data.

First we generate a list of elements we want to include. Then we want to select all of the unique SAMPLE_ANALYSIS_NO (the primary identifier for each sample analysis) that contain data for the required major elements:

elements = ['SiO2','Al2O3','TiO2','Fe2O3','MgO','CaO','Na2O','K2O','P2O5','LOI','FeO']
analyses = ddf_rock_sample[ddf_rock_sample.CHEM_CODE.isin(elements)].SAMPLE_ANALYSIS_NO.unique()

Now that we have identified each sample we want to keep, we can compute a Pandas DataFrame for those unique SAMPLE_ANALYSIS_NO values, keeping only the columns we need (such as the sample number, chem code, value and unit) and dropping unnecessary columns like lithology to further reduce the memory size of the dataset. Because this converts a larger-than-memory dataset into a more usably sized one, it may take a few moments to process:

ddf_major_chem = ddf_rock_sample[ddf_rock_sample.SAMPLE_ANALYSIS_NO.isin(list(analyses))][
    ['SAMPLE_NO', 'SAMPLE_ANALYSIS_NO', 'CHEM_CODE', 'VALUE', 'UNIT',
     'ANALYSIS_TYPE_DESC', 'CHEM_METHOD_CODE', 'OTHER_ANALYSIS_ID',
     'COLLECTED_DATE']].drop_duplicates()

df_major_chem = ddf_major_chem.compute()

Doing the selection this way, by sample number rather than just by the specific required elements, is potentially a redundant step. We could instead filter using the ‘elements’ list rather than the ‘analyses’ list in the code above, computing a Pandas DataFrame containing only the major element rows instead of all the analyses done on those samples. That would let us skip the line of code above and some of the code below. Here I’ve chosen to do it this way to demonstrate how to select samples based on whether they contain a type of data, in this case major elements.

So, because we’ve selected by sample, if we look at an example sample below we can see that the Pandas DataFrame contains the major elements we want, plus extras. We can also see that some analytes are duplicated, analysed by different methods:


Our next step is to create a new DataFrame with only the required major element rows, dropping the additional trace and rare earth element data for this example. We do this by using our elements list and selecting only rows that have those elements in the ‘CHEM_CODE’ column:

df_major_elements = df_major_chem[df_major_chem['CHEM_CODE'].isin(elements)]

We can check that we only have the major elements now by checking that same sample again:


As I mentioned above, some of the samples have duplicate analyses by different methods. To check on some of them we can look for duplicate CHEM_CODE and SAMPLE_ANALYSIS_NO fields.

dup_analysis_samples = df_major_chem[df_major_chem.duplicated(subset=['SAMPLE_ANALYSIS_NO','CHEM_CODE'],keep=False)].sort_values(['SAMPLE_ANALYSIS_NO','CHEM_CODE']) 

To handle these duplicates properly we would need to look at the ‘CHEM_METHOD_CODE’ and the labs and decide which to keep. For this example we’ll just create a new DataFrame and drop one of the duplicate analyses:

df_major_elements_tmp1 = (df_major_elements
    .sort_values(['SAMPLE_ANALYSIS_NO', 'CHEM_CODE', 'CHEM_METHOD_CODE'])
    .drop_duplicates(subset=['SAMPLE_ANALYSIS_NO', 'CHEM_CODE'], keep='last'))

Cleaning the data

Our subset of major element data now consists of just under 3 million individual analyses. But the data is not all numeric and contains a number of other symbols, such as ‘<’ and ‘-’. We need to find what non-numeric symbols there are and then remove them. We also need to note that the ‘<’ symbol indicates analyses below detection limit (BDL). In this case we will handle BDL values by replacing them with half of the reported detection limit.

First we can use regular expressions to find what non-numeric characters are in the element value column.

sym_find_list = [re.findall(r'\D', str(i)) for i in df_major_elements_tmp1['VALUE']]
unique_symbols = set([item for sublist in sym_find_list for item in sublist])
{'-', '.', '<', '>'}

We can then use Pandas string methods, like str.contains, to find and remove them. For example, dropping the rows containing ‘>’:

df_major_elements_tmp1.drop(df_major_elements_tmp1[df_major_elements_tmp1['VALUE'].str.contains('>',na=False, regex=False)].index, inplace=True)

The ‘-’ symbol is an interesting one, as it can potentially be a legitimate symbol for things like LOI. But looking at the samples that contain ‘-’, we can see that in some instances it is used to represent a range of values, like 0-10. To deal with this case (only 2 samples in our dataset) we can explicitly convert them to a different value, such as ‘<10’:

df_major_elements_tmp1.loc[df_major_elements_tmp1['VALUE'] == '<0-10','VALUE'] = '<10'

Now we can deal with the BDL values represented by ‘<’. There are a number of ways to deal with BDL values, but here we will just convert them to half the reported detection limit.

We do that by first making a flag column identifying which analyses are BDL. Then we can strip the ‘<’ character and finally divide just those rows that have the flag by 2:

# create a flag column to identify BDL values
df_major_elements_tmp1['BDL'] = 0
df_major_elements_tmp1.loc[df_major_elements_tmp1['VALUE'].str.contains('<', na=False, regex=False), 'BDL'] = 1
# remove '<' from the values column and convert to float
df_major_elements_tmp1['VALUE'] = df_major_elements_tmp1['VALUE'].astype(str).str.replace("<", "").astype(float)
# convert BDL values to half the reported detection limit
df_major_elements_tmp1.loc[df_major_elements_tmp1['BDL'] == 1, 'VALUE'] = df_major_elements_tmp1.loc[df_major_elements_tmp1['BDL'] == 1, 'VALUE'] / 2

From here it’s possible to do a number of further validation checks on the data, such as checking that the values fall within a realistic range. But for this example we’ll assume the data is now ready to go.
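As a hint of what such a range check might look like (toy data, and the 0-100 bounds are just an illustrative rule; a real check would treat analytes like LOI separately, since LOI can legitimately be slightly negative):

```python
import pandas as pd

# Toy cleaned data: oxide values in weight percent, with two
# deliberately unrealistic rows (140 wt% and a negative value)
df = pd.DataFrame({'CHEM_CODE': ['SiO2', 'SiO2', 'Al2O3', 'LOI'],
                   'VALUE': [62.1, 140.0, 14.2, -5.0]})

# Flag anything outside the physically plausible 0-100 wt% range
out_of_range = df[(df['VALUE'] < 0) | (df['VALUE'] > 100)]
```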

Pivoting the data

The last thing we need to do is convert the cleaned chemical data from long format to a tidy, wide format, and convert the data types from object to float. We can do this by pivoting the DataFrame. Pivoting creates a multilevel header; to fix this we simply rename the columns of the table, flattening it back to a single header row:

df_major_chem_wide = df_major_elements_tmp1.pivot(index='SAMPLE_ANALYSIS_NO', values=['VALUE'],columns='CHEM_CODE').sort_values('CHEM_CODE',axis=1)

df_major_chem_wide.columns = ['Al2O3','CaO','Fe2O3','FeO','K2O','LOI','MgO','Na2O','P2O5','SiO2','TiO2']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 321523 entries, 122 to 2476035
Data columns (total 20 columns):
(VALUE, Al2O3)    189183 non-null float64
(VALUE, CaO)      189758 non-null float64
(VALUE, Fe2O3)    61647 non-null float64
(VALUE, FeO)      4665 non-null float64
(VALUE, K2O)      186116 non-null float64
(VALUE, LOI)      156535 non-null float64
(VALUE, MgO)      190418 non-null float64
(VALUE, Na2O)     171249 non-null float64
(VALUE, P2O5)     211858 non-null float64
(VALUE, SiO2)     192035 non-null float64
(VALUE, TiO2)     151620 non-null float64
dtypes: float64(20)
memory usage: 61.5 MB
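An alternative to renaming the columns by hand is to keep just the CHEM_CODE level of the multilevel header. Here’s a miniature sketch with made-up data:

```python
import pandas as pd

# Miniature long-format chemistry table
long_df = pd.DataFrame({'SAMPLE_ANALYSIS_NO': [1, 1, 2, 2],
                        'CHEM_CODE': ['SiO2', 'Al2O3', 'SiO2', 'Al2O3'],
                        'VALUE': [62.1, 14.2, 70.3, 12.8]})

# pivot produces a two-level header: ('VALUE', 'Al2O3'), ('VALUE', 'SiO2')
wide = long_df.pivot(index='SAMPLE_ANALYSIS_NO', values=['VALUE'],
                     columns='CHEM_CODE')

# Flatten back to a single header row by keeping only the CHEM_CODE level
wide.columns = wide.columns.get_level_values('CHEM_CODE')
```

This avoids having to keep a hand-written column list in sync with the elements you selected.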

Finally we can export our new, customised, tidy format dataset using the Pandas to_csv method:


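The export call itself is a one-liner. The filename below is just an illustrative placeholder, and the toy DataFrame stands in for the pivoted table built above:

```python
import pandas as pd

# Toy stand-in for the pivoted df_major_chem_wide built above
df_major_chem_wide = pd.DataFrame(
    {'Al2O3': [14.2, 12.8], 'SiO2': [62.1, 70.3]},
    index=pd.Index([1, 2], name='SAMPLE_ANALYSIS_NO'))

# Write the tidy wide table out; the index (SAMPLE_ANALYSIS_NO) is
# kept as the first column by default
df_major_chem_wide.to_csv('sarig_major_elements_wide.csv')
```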

In this example we have taken a very large geochemical dataset, more than 10 GB worth of data, and turned it into a usable size of less than 65 MB. Using Dask we were able to deal with more data than my computer could handle in memory, filter it down to just the sample types and elements that we wanted (still over 300,000 samples, mind you), and then convert it into the usual tidy, wide table format most of us are used to dealing with.

I hope you find this example useful for your own efforts to deal with large geological datasets. With a little bit of code and the right Python libraries, anyone can move away from Excel and start using larger-than-memory datasets.

Please check out my other content, and get in touch if you have any questions or comments. And make sure you check out all the amazing work and freely available data from the Geological Survey of South Australia and SARIG. This example forms part of a larger exploration and transformation of this dataset for application in a lithology prediction machine learning model, which can be found on GitHub.