Using Python to handle large geochemical datasets
The Geological Survey of South Australia (GSSA) holds a wealth of data in its geoscientific database SA Geodata. SA Geodata is the primary repository for geoscientific information in South Australia and contains open source data collected from a variety of sources.
The SA Geodata database contains over 10 GB of geochemical data. That’s a lot of chemistry. Explorers often request extracts of this dataset, but then find it a challenge to handle all that data. Because of its sheer size, programs like Excel won’t even open the file, and when an extract is small enough to open, explorers often find the format of the data a challenge. Generally, people like to use tidy data for analysis, where each row in a table represents all the data about a single sample. But the data exports are in a long format, where each row represents a single data point.
In this blog, I’ll show a simple method using Python and the Dask and Pandas libraries to query and extract the desired elements, and then quickly pivot them into the desired tidy structure. This example uses a copy of the SA Geodata geochemical dataset provided for the Unearthed ExploreSA: Gawler Challenge and is valid as at February 2020.
When there’s too much data
Dask is an open source library for parallel computing in Python. It can be used to parallelise computations across multiple machines, or to work with larger-than-memory datasets on a single machine. It’s also tightly integrated with the Python PyData stack, including Pandas and NumPy, sharing a very similar API and making working with large DataFrames easy.
From the Dask website: “A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster.”
Most operations in Dask are ‘lazy’, which means they don’t actually evaluate anything until you explicitly ask for it. To return a computed result, you call the compute method, which converts a Dask DataFrame into an in-memory Pandas DataFrame, so the result needs to be able to fit in memory.
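For example, asking for a column mean builds a task graph rather than a result; nothing is read until compute is called (a minimal sketch, assuming a Dask DataFrame ddf with a numeric column 'x'):

# lazy: returns instantly with a Dask object representing the computation
mean_task = ddf['x'].mean()
# eager: executes the graph and returns a plain float that fits in memory
mean_value = mean_task.compute()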
Let’s work through an example to see how this all works. First we need to import the required libraries:
import dask.dataframe as dd
import pandas as pd
import numpy as np
import re
Loading the dataset
Because of the similar API between Dask and Pandas, loading the dataset is exactly the same as we would using Pandas, except we call Dask (dd) instead of Pandas (pd):
ddf = dd.read_csv(r'D:\Unearthed_SARIG_Data_Package\SARIG_Data_Package\sarig_rs_chem_exp.csv',
dtype={'LITHO_CONF': 'object', 'STRAT_CONF': 'object'})
I found after some trial and error that I needed to force the data types for the ‘LITHO_CONF’ and ‘STRAT_CONF’ columns, because Dask incorrectly guessed them as integers. That’s why in the above code I have explicitly provided the data type (dtype) in the load call.
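If you would rather not hunt down problem columns one at a time, another option (a sketch, not what I did here) is to load every column as a string and cast just the ones you need later:

# reading everything as strings side-steps Dask's dtype inference entirely
ddf_str = dd.read_csv(r'D:\Unearthed_SARIG_Data_Package\SARIG_Data_Package\sarig_rs_chem_exp.csv',
                      dtype=str)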
Next we can do some simple interrogation of the data, like looking at the column names and using the head() method to view the top 5 rows of data:
ddf.columns
Index(['SAMPLE_NO', 'SAMPLE_SOURCE_CODE', 'SAMPLE_SOURCE', 'ROCK_GROUP_CODE',
'ROCK_GROUP', 'LITHO_CODE', 'LITHO_CONF', 'LITHOLOGY_NAME',
'LITHO_MODIFIER', 'MAP_SYMBOL', 'STRAT_CONF', 'STRAT_NAME',
'COLLECTED_BY', 'COLLECTORS_NUMBER', 'COLLECTED_DATE',
'DRILLHOLE_NUMBER', 'DH_NAME', 'DH_DEPTH_FROM', 'DH_DEPTH_TO',
'SITE_NO', 'EASTING_GDA2020', 'NORTHING_GDA2020', 'ZONE_GDA2020',
'LONGITUDE_GDA2020', 'LATITUDE_GDA2020', 'LONGITUDE_GDA94',
'LATITUDE_GDA94', 'SAMPLE_ANALYSIS_NO', 'OTHER_ANALYSIS_ID',
'ANALYSIS_TYPE_DESC', 'LABORATORY', 'CHEM_CODE', 'VALUE', 'UNIT',
'CHEM_METHOD_CODE', 'CHEM_METHOD_DESC'],
dtype='object')
ddf.head(5)
SAMPLE_NO | SAMPLE_SOURCE_CODE | SAMPLE_SOURCE | ROCK_GROUP_CODE | ROCK_GROUP | LITHO_CODE | LITHO_CONF | LITHOLOGY_NAME | LITHO_MODIFIER | MAP_SYMBOL | STRAT_CONF | STRAT_NAME | COLLECTED_BY | COLLECTORS_NUMBER | COLLECTED_DATE | DRILLHOLE_NUMBER | DH_NAME | DH_DEPTH_FROM | DH_DEPTH_TO | SITE_NO | EASTING_GDA2020 | NORTHING_GDA2020 | ZONE_GDA2020 | LONGITUDE_GDA2020 | LATITUDE_GDA2020 | LONGITUDE_GDA94 | LATITUDE_GDA94 | SAMPLE_ANALYSIS_NO | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | LABORATORY | CHEM_CODE | VALUE | UNIT | CHEM_METHOD_CODE | CHEM_METHOD_DESC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Ti | 0.15 | % | AES | AES |
1 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Ag | 500 | ppb | AES | AES |
2 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Ba | 1200 | ppm | AES | AES |
3 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Co | 2 | ppm | AES | AES |
4 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Cr | 1000 | ppm | AES | AES |
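Before going further, it can also be worth checking how Dask has chunked the file. The npartitions attribute is free to inspect because it doesn’t evaluate anything:

# number of smaller Pandas DataFrames the CSV has been split into
print(ddf.npartitions)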
Extracting the required data
For this exercise, we are going to create a tidy dataset (wide format data) which only contains the major elements for rock samples. In the table above we can see there is a column for ‘SAMPLE_SOURCE’. To see how many different sample types there are, and how many records each has, we can compute the value counts. As this evaluates data, we need to call the compute method:
ddf['SAMPLE_SOURCE'].value_counts().compute()
Drill cuttings 10956887
Sawn half drill core 7256976
Drill core 6533477
Calcrete 2154614
Soil 989296
Drilled interval rock sample, type unspecified 970508
Rock outcrop / float 537537
Sawn quarter drill core 237103
Drillhole log data - used in calculating values 212746
Stream sediment 164251
Auger sample from near surface 105514
Vegetation 71018
Core sludge 56217
A full face slice of core 52531
Gravel 37740
Pulp / powdered rock of known origin, typically a lab return 16428
Magnetic lag fraction 10135
Duplicate: a split from an existing sample. 8646
Sample specifically for bulk leach extractable gold analysis 7826
Aircore: consolidated sample from aircore drilling method 7488
Analytical standard check sample 6358
Mine rock sample 6203
Channel (linear rock chip/outcrop sampling) 4453
Rock outcrop sample 4086
Lag sample, surface sample consisting of gravelly material. 3918
Costean (trench) 3746
Filleted, shaved or ground core sample 3478
Mine mullock / well spoil 3086
Loam 2679
Mine stockpile 2248
Lake floor sediment 1799
Excrement of animals 772
Mine tailings 347
Rock float sample (not in situ) 335
Drillhole 332
Smelter slag 293
Rock subcrop sample (in situ, not attached to bedrock) 67
Single mineral 40
Bulk (high volume) sample, diamond exploration 19
Core Library archival rock, original source unknown 12
Name: SAMPLE_SOURCE, dtype: int64
We can see above that there are a lot of samples that are not from rocks. We can create a new Dask DataFrame and only select samples from ‘rock sample’ types, excluding things like soils, lag and excrement of animals! Because this is a ‘lazy’ operation, it can be done on the total dataset and is quick:
rock_types = ['Drill cuttings', 'Sawn half drill core', 'Drill core',
              'Drilled interval rock sample, type unspecified',
              'Rock outcrop / float', 'Sawn quarter drill core',
              'Drillhole log data - used in calculating values',
              'A full face slice of core',
              'Pulp / powdered rock of known origin, typically a lab return',
              'Duplicate: a split from an existing sample.',
              'Channel (linear rock chip/outcrop sampling)',
              'Rock outcrop sample', 'Costean (trench)',
              'Filleted, shaved or ground core sample',
              'Rock float sample (not in situ)', 'Drillhole',
              'Rock subcrop sample (in situ, not attached to bedrock)']
ddf_rock_sample = ddf[ddf['SAMPLE_SOURCE'].isin(rock_types)]
Now we just have rock samples. Judging by the value counts above, that should have removed several million rows from our dataset, significantly reducing its size. Next we can reduce it further and only grab the samples of interest, here the ones containing major element data.
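If you want to confirm the reduction, you can count the rows, but note that len() on a Dask DataFrame is not free: it triggers a full pass over the data:

# counting rows forces Dask to actually read and filter the file
print(len(ddf_rock_sample))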
First we generate a list of the elements we want to include. Then we select all of the unique SAMPLE_ANALYSIS_NO values (the primary identifier for each sample analysis) that contain data for the required major elements:
elements = ['SiO2','Al2O3','TiO2','Fe2O3','MgO','CaO','Na2O','K2O','P2O5','LOI','FeO']
analyses = ddf_rock_sample[ddf_rock_sample.CHEM_CODE.isin(elements)].SAMPLE_ANALYSIS_NO.unique()
Now that we have identified the samples we want to keep, we can compute a Pandas DataFrame for those unique SAMPLE_ANALYSIS_NO values with only the columns we need, such as the sample number, chem code, value and unit (we can drop unnecessary columns like lithology to further reduce the memory size of the dataset). Because this converts a larger-than-memory dataset into a more usable size, it may take a few moments to process:
cols = ['SAMPLE_NO', 'SAMPLE_ANALYSIS_NO', 'CHEM_CODE', 'VALUE', 'UNIT',
        'ANALYSIS_TYPE_DESC', 'CHEM_METHOD_CODE', 'OTHER_ANALYSIS_ID',
        'ANALYSIS_TYPE_DESC', 'COLLECTED_DATE']
ddf_major_chem = ddf_rock_sample[ddf_rock_sample.SAMPLE_ANALYSIS_NO.isin(list(analyses))][cols].drop_duplicates()
df_major_chem = ddf_major_chem.compute()
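This compute can take a while on a big extract. If you’d like feedback while it runs, Dask’s local diagnostics include a progress bar (a small sketch wrapping the same call as above):

from dask.diagnostics import ProgressBar

# prints a live progress bar while the partitions are processed
with ProgressBar():
    df_major_chem = ddf_major_chem.compute()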
Selecting by sample number like this, rather than filtering directly on the required elements, is potentially a redundant step. We could have skipped building the ‘analyses’ list entirely and simply selected the rows whose CHEM_CODE is in the ‘elements’ list, which would also remove the need for some of the code below. I’ve done it this way to demonstrate how to select whole samples based on whether they contain a particular type of data, in this case major elements. The shortcut is sketched below.
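For completeness, the shortcut might look something like this (a sketch, reusing the cols list from above; it keeps only the major element rows and never builds the ‘analyses’ list):

# filter on CHEM_CODE directly, skipping the sample-number lookup
df_major_direct = ddf_rock_sample[ddf_rock_sample.CHEM_CODE.isin(elements)][cols].drop_duplicates().compute()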
So, because we’ve selected by sample, if we look at an example sample below we can see that the Pandas DataFrame contains the major elements we want plus extras. We can also see that there are duplicate analytes analysed by different methods:
df_major_chem[df_major_chem['SAMPLE_ANALYSIS_NO']==2466083]
SAMPLE_NO | SAMPLE_ANALYSIS_NO | CHEM_CODE | VALUE | UNIT | ANALYSIS_TYPE_DESC | CHEM_METHOD_CODE | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | COLLECTED_DATE | |
---|---|---|---|---|---|---|---|---|---|---|
147316 | 2958572 | 2466083 | Ga | 30 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147317 | 2958572 | 2466083 | La | 26.5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147318 | 2958572 | 2466083 | Mo | 0.5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147319 | 2958572 | 2466083 | Ni | 24 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147320 | 2958572 | 2466083 | P | 600 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147321 | 2958572 | 2466083 | P2O5 | 0.137 | % | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147322 | 2958572 | 2466083 | Pb | 4 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147323 | 2958572 | 2466083 | S | 0.01 | % | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147324 | 2958572 | 2466083 | Sb | 2.5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147325 | 2958572 | 2466083 | Sc | 8 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147326 | 2958572 | 2466083 | Sr | 35 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147327 | 2958572 | 2466083 | Te | 0.025 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147328 | 2958572 | 2466083 | Th | 20 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147329 | 2958572 | 2466083 | Ti | 0.19 | % | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147330 | 2958572 | 2466083 | Tl | 5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147331 | 2958572 | 2466083 | W | 5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147332 | 2958572 | 2466083 | Y | 47.4 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
17613 | 2958572 | 2466083 | Al2O3 | 12.9 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17614 | 2958572 | 2466083 | Au | 0.005 | ppm | GEOCHEMISTRY | Au-TL43 | NaN | GEOCHEMISTRY | 31/05/2012 |
17615 | 2958572 | 2466083 | CaO | 2.39 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17616 | 2958572 | 2466083 | K2O | 5.05 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17617 | 2958572 | 2466083 | MgO | 10.2 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17618 | 2958572 | 2466083 | Na2O | 0.217 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17619 | 2958572 | 2466083 | SiO2 | 53.8 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
31118 | 2958572 | 2466083 | Ag | 0.25 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31119 | 2958572 | 2466083 | Al | 7.05 | % | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31120 | 2958572 | 2466083 | As | 2.5 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31121 | 2958572 | 2466083 | Ba | 1240 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31122 | 2958572 | 2466083 | Be | 7.2 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31123 | 2958572 | 2466083 | Bi | 1 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31124 | 2958572 | 2466083 | Cd | 0.25 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31125 | 2958572 | 2466083 | Ce | 52.9 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31126 | 2958572 | 2466083 | Co | 28 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31127 | 2958572 | 2466083 | Cr | 21 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31128 | 2958572 | 2466083 | Cu | 15 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31129 | 2958572 | 2466083 | Fe | 6.27 | % | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31130 | 2958572 | 2466083 | U | 10.3 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31131 | 2958572 | 2466083 | U3O8 | 12 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
116463 | 2958572 | 2466083 | Ca | 1.76 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116464 | 2958572 | 2466083 | Fe | 6.43 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116465 | 2958572 | 2466083 | K | 4.24 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116466 | 2958572 | 2466083 | Mg | 6.51 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116467 | 2958572 | 2466083 | Mn | 1.01e+03 | ppm | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116468 | 2958572 | 2466083 | Na | 0.09 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116469 | 2958572 | 2466083 | V | 45 | ppm | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116470 | 2958572 | 2466083 | Zn | 55 | ppm | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
147444 | 2958572 | 2466083 | LOI | 5.69 | % | GEOCHEMISTRY | ME-GRA05 | NaN | GEOCHEMISTRY | 31/05/2012 |
Our next step is to create a new DataFrame with only the required major element rows, dropping the additional trace and rare earth element data for this example. We do this by using our elements list and selecting only rows that have those elements in the ‘CHEM_CODE’ column:
df_major_elements = df_major_chem[df_major_chem['CHEM_CODE'].isin(elements)]
We can check that we only have the major elements now by checking that same sample again:
df_major_elements[df_major_elements['SAMPLE_ANALYSIS_NO']==2466083]
SAMPLE_NO | SAMPLE_ANALYSIS_NO | CHEM_CODE | VALUE | UNIT | ANALYSIS_TYPE_DESC | CHEM_METHOD_CODE | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | COLLECTED_DATE | |
---|---|---|---|---|---|---|---|---|---|---|
147321 | 2958572 | 2466083 | P2O5 | 0.137 | % | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
17613 | 2958572 | 2466083 | Al2O3 | 12.9 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17615 | 2958572 | 2466083 | CaO | 2.39 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17616 | 2958572 | 2466083 | K2O | 5.05 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17617 | 2958572 | 2466083 | MgO | 10.2 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17618 | 2958572 | 2466083 | Na2O | 0.217 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17619 | 2958572 | 2466083 | SiO2 | 53.8 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
147444 | 2958572 | 2466083 | LOI | 5.69 | % | GEOCHEMISTRY | ME-GRA05 | NaN | GEOCHEMISTRY | 31/05/2012 |
As I mentioned above, some of the samples have duplicate analyses by different methods. To check on some of them, we can look for rows with duplicated SAMPLE_ANALYSIS_NO and CHEM_CODE combinations:
dup_analysis_samples = df_major_chem[
    df_major_chem.duplicated(subset=['SAMPLE_ANALYSIS_NO', 'CHEM_CODE'], keep=False)
].sort_values(['SAMPLE_ANALYSIS_NO', 'CHEM_CODE'])
dup_analysis_samples.head(6)
SAMPLE_NO | SAMPLE_ANALYSIS_NO | CHEM_CODE | VALUE | UNIT | ANALYSIS_TYPE_DESC | CHEM_METHOD_CODE | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | COLLECTED_DATE | |
---|---|---|---|---|---|---|---|---|---|---|
164713 | 969 | 776 | Al2O3 | 21.36 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164724 | 969 | 776 | Al2O3 | 24.87 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164719 | 969 | 776 | CaO | 12.16 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164730 | 969 | 776 | CaO | 12.86 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164715 | 969 | 776 | Fe2O3 | 2.38 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164726 | 969 | 776 | Fe2O3 | 0.57 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
To handle these duplicates properly we would need to look at the ‘CHEM_METHOD_CODE’ and the labs and decide which analysis to keep. For this example we’ll just create a new DataFrame and drop one of the duplicate analyses (a more careful approach is sketched after the code):
df_major_elements_tmp1 = df_major_elements.sort_values(
    ['SAMPLE_ANALYSIS_NO', 'CHEM_CODE', 'CHEM_METHOD_CODE']
).drop_duplicates(subset=['SAMPLE_ANALYSIS_NO', 'CHEM_CODE'], keep='last')
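If you did want to prefer particular methods rather than relying on keep='last', one approach (a sketch only; the preference order below is made up for illustration, using method codes seen in the tables above) is to rank the methods and keep the best-ranked row per sample and analyte:

# hypothetical preference order: lower rank wins, unlisted methods rank last
method_rank = {'XRF': 0, 'ME-XRF21n': 1, 'ME-ICP61': 2}
ranked = df_major_elements.copy()
ranked['rank'] = ranked['CHEM_METHOD_CODE'].map(method_rank).fillna(99)
df_major_preferred = (ranked
                      .sort_values(['SAMPLE_ANALYSIS_NO', 'CHEM_CODE', 'rank'])
                      .drop_duplicates(subset=['SAMPLE_ANALYSIS_NO', 'CHEM_CODE'], keep='first')
                      .drop(columns='rank'))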
Cleaning the data
Our subset of major element data now consists of just under 3 million individual analyses. But the data is not all numeric and contains a number of other symbols, such as ‘<’ and ‘-‘. We need to find which non-numeric symbols are present and then remove them. We also need to note that the ‘<’ symbol is the indicator for analyses below detection limit (BDL). In this case, to handle BDL values we will replace each one with half of the reported detection limit.
First we can use regular expressions to find which non-numeric characters are in the ‘VALUE’ column:
# collect every non-digit character appearing in the VALUE column
sym_find_list = [re.findall(r'\D', str(i)) for i in df_major_elements_tmp1['VALUE']]
# flatten the list of lists into a set of unique symbols
unique_symbols = set([item for sublist in sym_find_list for item in sublist])
unique_symbols
{'-', '.', '<', '>'}
We can then use Pandas string methods, like str.contains, to find the rows containing them and drop them:
df_major_elements_tmp1.drop(df_major_elements_tmp1[df_major_elements_tmp1['VALUE'].str.contains('>',na=False, regex=False)].index, inplace=True)
The ‘-‘ symbol is an interesting one, as it can appear in legitimate values such as negative LOI. But looking at the samples that contain ‘-‘, we can see that in some instances it is used to represent a range of values, like 0-10. To deal with this case (only 2 samples in our dataset) we can explicitly convert them to a different value, such as ‘<10’:
df_major_elements_tmp1[df_major_elements_tmp1['VALUE'].str.contains(r'-',na=False,regex=False)].tail()
SAMPLE_NO | SAMPLE_ANALYSIS_NO | CHEM_CODE | VALUE | UNIT | ANALYSIS_TYPE_DESC | CHEM_METHOD_CODE | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | COLLECTED_DATE | |
---|---|---|---|---|---|---|---|---|---|---|
14320 | 1703219 | 1692890 | LOI | -0.32 | % | GEOCHEMISTRY | LOI | 562.0/0910907 | GEOCHEMISTRY | 05/10/2009 |
14334 | 1703221 | 1692892 | LOI | -0.16 | % | GEOCHEMISTRY | LOI | 562.0/0910907 | GEOCHEMISTRY | 05/10/2009 |
16070 | 1703222 | 1692893 | LOI | -0.35 | % | GEOCHEMISTRY | LOI | 562.0/0910907 | GEOCHEMISTRY | 05/10/2009 |
95053 | 2457789 | 1988166 | LOI | <0-10 | % | GEOCHEMISTRY | XRF78S | WM111749 | GEOCHEMISTRY | 07/10/2008 |
59340 | 2458844 | 1989221 | LOI | <0-10 | % | GEOCHEMISTRY | XRF78S | WM111749 | GEOCHEMISTRY | 11/09/2008 |
df_major_elements_tmp1.loc[df_major_elements_tmp1['VALUE'] == '<0-10','VALUE'] = '<10'
Now we can deal with the BDL values represented by ‘<’. There are a number of ways to handle them, but here we will just convert each to half the reported detection limit.
We do that by first making a flag column identifying which analyses are BDL. Then we can strip the ‘<’ character and finally divide just those rows that have the flag by 2:
# create a flag column to identify BDL values
df_major_elements_tmp1['BDL'] = 0
df_major_elements_tmp1.loc[df_major_elements_tmp1['VALUE'].str.contains('<',na=False,regex=False), 'BDL'] = 1
# remove the '<' character from the VALUE column and cast to float
df_major_elements_tmp1['VALUE'] = df_major_elements_tmp1['VALUE'].astype(str).str.replace("<", "").astype(float)
# convert BDL values to half the reported detection limit
df_major_elements_tmp1.loc[df_major_elements_tmp1['BDL'] == 1,'VALUE'] = df_major_elements_tmp1.loc[df_major_elements_tmp1['BDL'] == 1,'VALUE'] /2
From here it’s possible to do a number of further validation checks on the data, such as checking that the values fall within a realistic range; one such check is sketched below. But for this example we’ll assume the data from here is ready to go.
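As an illustration, a quick bounds check might look like this (a sketch; LOI is excluded because, as we saw above, slightly negative LOI values are legitimate):

# oxide percentages should sit roughly between 0 and 100
is_oxide = df_major_elements_tmp1['CHEM_CODE'] != 'LOI'
out_of_range = df_major_elements_tmp1[is_oxide &
    ((df_major_elements_tmp1['VALUE'] < 0) | (df_major_elements_tmp1['VALUE'] > 100))]
print(f'{len(out_of_range)} suspect values to investigate')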
Pivoting the data
The last thing we need to do is convert the cleaned chemical data from long format to a tidy, wide format. We can do this by pivoting the DataFrame. Pivoting creates a multilevel column header; to fix this we simply rename the columns of the table, flattening it back to a single header row:
df_major_chem_wide = df_major_elements_tmp1.pivot(index='SAMPLE_ANALYSIS_NO', values=['VALUE'],columns='CHEM_CODE').sort_values('CHEM_CODE',axis=1)
df_major_chem_wide.columns = ['Al2O3','CaO','Fe2O3','FeO','K2O','LOI','MgO','Na2O','P2O5','SiO2','TiO2']
df_major_chem_wide.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 321523 entries, 122 to 2476035
Data columns (total 20 columns):
(VALUE, Al2O3) 189183 non-null float64
(VALUE, CaO) 189758 non-null float64
(VALUE, Fe2O3) 61647 non-null float64
(VALUE, FeO) 4665 non-null float64
(VALUE, K2O) 186116 non-null float64
(VALUE, LOI) 156535 non-null float64
(VALUE, MgO) 190418 non-null float64
(VALUE, Na2O) 171249 non-null float64
(VALUE, P2O5) 211858 non-null float64
(VALUE, SiO2) 192035 non-null float64
(VALUE, TiO2) 151620 non-null float64
dtypes: float64(20)
memory usage: 61.5 MB
df_major_chem_wide.head()
Al2O3 | CaO | Fe2O3 | FeO | K2O | LOI | MgO | Na2O | P2O5 | SiO2 | TiO2 | |
---|---|---|---|---|---|---|---|---|---|---|---|
SAMPLE_ANALYSIS_NO | |||||||||||
122 | 14.5 | 8.55 | 11.10 | NaN | 1.39 | 1.25 | 4.62 | 2.26 | 0.09 | 55.5 | 0.84 |
124 | 14.9 | 8.95 | 10.30 | NaN | 1.38 | 1.85 | 4.86 | 2.10 | 0.07 | 55.0 | 0.85 |
126 | 14.9 | 5.95 | 11.10 | NaN | 2.18 | 2.28 | 5.20 | 4.26 | 0.09 | 53.7 | 0.85 |
128 | 15.9 | 8.70 | 8.40 | NaN | 1.32 | 4.42 | 4.24 | 2.08 | 0.09 | 54.4 | 0.90 |
130 | 18.6 | 3.80 | 1.94 | NaN | 6.45 | 3.44 | 1.08 | 2.18 | 0.02 | 60.8 | 1.06 |
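One design note before exporting: hardcoding the flattened column names, as I did above, works because the columns were sorted by CHEM_CODE first, but it would silently mislabel columns if the element list ever changed. A less brittle alternative (a sketch, used instead of the rename above) is to drop the redundant ‘VALUE’ level from the multilevel header:

# alternative to the hardcoded rename: drop the top 'VALUE' level instead
df_major_chem_wide.columns = df_major_chem_wide.columns.droplevel(0)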
Finally we can export our new, customised, tidy format dataset using the Pandas to_csv method:
df_major_chem_wide.to_csv(r'D:\SARIG_major_els_v1.csv')
Summary
In this example we have taken a very large geochemical dataset, more than 10 GB worth of data, and turned it into a usable size of less than 65 MB. Using Dask we were able to deal with more data than my computer could handle in memory, filter it down to just the sample types and elements that we wanted (still over 300,000 samples, mind you), and then convert it into the usual tidy, wide table format most of us are used to dealing with.
I hope you find this example useful for your own efforts to deal with large geological datasets. With a little bit of code and the right Python libraries, anyone can move away from Excel and start using larger-than-memory datasets.
Please check out my other content at GeoDataAnalytics.net, and get in touch if you have any questions or comments. And make sure you check out all the amazing work and freely available data from the Geological Survey of South Australia and SARIG. This example forms part of a larger exploration and transformation of this dataset for application in a lithology prediction machine learning model and can be found on GitHub.