Using Python to handle large geochemical datasets
The Geological Survey of South Australia (GSSA) holds a wealth of data in its geoscientific database SA Geodata. SA Geodata is the primary repository for geoscientific information in South Australia and contains open source data collected from a variety of sources.
The SA Geodata database contains over 10 GB of geochemical data. That’s a lot of chemistry. Explorers often request extracts of this dataset, but then find it a challenge to handle all that data. Because of its sheer size, programs like Excel won’t even open the file, and when an extract is small enough to open, explorers often find the format of the data a challenge. Generally, people like to use tidy data for analysis, where each row in a table represents all the data about a single sample. But the data exports are in a long format, where each row represents a single data point.
In this blog, I’ll show a simple method using Python and the Dask and Pandas libraries to query and extract the desired elements, and then quickly pivot them into the desired tidy structure. This example uses a copy of the SA Geodata geochemical dataset provided for the Unearthed ExploreSA: Gawler Challenge and is valid as at February 2020.
When there’s too much data
Dask is an open source library for parallel computing in Python. It can be used to parallelise computations across multiple machines, or to work with larger-than-memory datasets on a single machine. It’s also tightly integrated with the Python PyData stack, including Pandas and NumPy, sharing a very similar API and making working with large DataFrames easy.
From the Dask website: “A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster.”
Most operations in Dask are ‘lazy’, which means they don’t actually evaluate anything until you explicitly ask for it. To return a computed result, you call the compute method, which converts a Dask DataFrame into an in-memory Pandas DataFrame, so the result needs to be able to fit in memory.
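For example, asking for a column mean builds a task graph rather than a result; nothing is read until compute is called (a minimal sketch, assuming a Dask DataFrame ddf with a numeric column 'x'):

# lazy: returns instantly with a Dask object representing the computation
mean_task = ddf['x'].mean()
# eager: executes the graph and returns a plain float that fits in memory
mean_value = mean_task.compute()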
Let’s work through an example to see how this all works. First we need to import the required libraries:
import dask.dataframe as dd
import pandas as pd
import numpy as np
import re
Loading the dataset
Because of the similar API between Dask and Pandas, loading the dataset is exactly the same as we would using Pandas, except we call Dask (dd) instead of Pandas (pd):
ddf = dd.read_csv(r'D:\Unearthed_SARIG_Data_Package\SARIG_Data_Package\sarig_rs_chem_exp.csv',
dtype={'LITHO_CONF': 'object', 'STRAT_CONF': 'object'})
I found after some trial and error that I needed to force the data types for the ‘LITHO_CONF’ and ‘STRAT_CONF’ columns, because Dask incorrectly guessed them as integers. That’s why in the above code I have explicitly provided the data type (dtype) in the load call.
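If you would rather not hunt down problem columns one at a time, another option (a sketch, not what I did here) is to load every column as a string and cast just the ones you need later:

# reading everything as strings side-steps Dask's dtype inference entirely
ddf_str = dd.read_csv(r'D:\Unearthed_SARIG_Data_Package\SARIG_Data_Package\sarig_rs_chem_exp.csv',
                      dtype=str)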
Next we can do some simple interrogation of the data, like looking at the column names and using the head() method to view the top 5 rows of data:
ddf.columns
Index(['SAMPLE_NO', 'SAMPLE_SOURCE_CODE', 'SAMPLE_SOURCE', 'ROCK_GROUP_CODE',
'ROCK_GROUP', 'LITHO_CODE', 'LITHO_CONF', 'LITHOLOGY_NAME',
'LITHO_MODIFIER', 'MAP_SYMBOL', 'STRAT_CONF', 'STRAT_NAME',
'COLLECTED_BY', 'COLLECTORS_NUMBER', 'COLLECTED_DATE',
'DRILLHOLE_NUMBER', 'DH_NAME', 'DH_DEPTH_FROM', 'DH_DEPTH_TO',
'SITE_NO', 'EASTING_GDA2020', 'NORTHING_GDA2020', 'ZONE_GDA2020',
'LONGITUDE_GDA2020', 'LATITUDE_GDA2020', 'LONGITUDE_GDA94',
'LATITUDE_GDA94', 'SAMPLE_ANALYSIS_NO', 'OTHER_ANALYSIS_ID',
'ANALYSIS_TYPE_DESC', 'LABORATORY', 'CHEM_CODE', 'VALUE', 'UNIT',
'CHEM_METHOD_CODE', 'CHEM_METHOD_DESC'],
dtype='object')
ddf.head(5)
SAMPLE_NO | SAMPLE_SOURCE_CODE | SAMPLE_SOURCE | ROCK_GROUP_CODE | ROCK_GROUP | LITHO_CODE | LITHO_CONF | LITHOLOGY_NAME | LITHO_MODIFIER | MAP_SYMBOL | STRAT_CONF | STRAT_NAME | COLLECTED_BY | COLLECTORS_NUMBER | COLLECTED_DATE | DRILLHOLE_NUMBER | DH_NAME | DH_DEPTH_FROM | DH_DEPTH_TO | SITE_NO | EASTING_GDA2020 | NORTHING_GDA2020 | ZONE_GDA2020 | LONGITUDE_GDA2020 | LATITUDE_GDA2020 | LONGITUDE_GDA94 | LATITUDE_GDA94 | SAMPLE_ANALYSIS_NO | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | LABORATORY | CHEM_CODE | VALUE | UNIT | CHEM_METHOD_CODE | CHEM_METHOD_DESC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Ti | 0.15 | % | AES | AES |
1 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Ag | 500 | ppb | AES | AES |
2 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Ba | 1200 | ppm | AES | AES |
3 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Co | 2 | ppm | AES | AES |
4 | 589 | RO | Rock outcrop / float | SSA | Sediment Siliciclastic Arenite | SDST | NaN | Sandstone | NaN | NaN | NaN | NaN | MIRAMS, R.C. | 600/1 | 13/12/1960 | NaN | NaN | NaN | NaN | 128096 | 513602.61 | 7.006e+06 | 52 | 129.137 | -27.070 | 129.137 | -27.070 | 158 | NaN | GEOCHEMISTRY | NaN | Cr | 1000 | ppm | AES | AES |
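Before going further, it can also be worth checking how Dask has chunked the file. The npartitions attribute is free to inspect because it doesn’t evaluate anything:

# number of smaller Pandas DataFrames the CSV has been split into
print(ddf.npartitions)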
Extracting the required data
For this exercise, we are going to create a tidy dataset (wide format data) which only contains the major elements for rock samples. In the table above we can see there is a column for ‘SAMPLE_SOURCE’. To see how many different sample types there are, and how many records each has, we can compute the value counts. As this evaluates data, we need to call the compute method:
ddf['SAMPLE_SOURCE'].value_counts().compute()
Drill cuttings 10956887
Sawn half drill core 7256976
Drill core 6533477
Calcrete 2154614
Soil 989296
Drilled interval rock sample, type unspecified 970508
Rock outcrop / float 537537
Sawn quarter drill core 237103
Drillhole log data - used in calculating values 212746
Stream sediment 164251
Auger sample from near surface 105514
Vegetation 71018
Core sludge 56217
A full face slice of core 52531
Gravel 37740
Pulp / powdered rock of known origin, typically a lab return 16428
Magnetic lag fraction 10135
Duplicate: a split from an existing sample. 8646
Sample specifically for bulk leach extractable gold analysis 7826
Aircore: consolidated sample from aircore drilling method 7488
Analytical standard check sample 6358
Mine rock sample 6203
Channel (linear rock chip/outcrop sampling) 4453
Rock outcrop sample 4086
Lag sample, surface sample consisting of gravelly material. 3918
Costean (trench) 3746
Filleted, shaved or ground core sample 3478
Mine mullock / well spoil 3086
Loam 2679
Mine stockpile 2248
Lake floor sediment 1799
Excrement of animals 772
Mine tailings 347
Rock float sample (not in situ) 335
Drillhole 332
Smelter slag 293
Rock subcrop sample (in situ, not attached to bedrock) 67
Single mineral 40
Bulk (high volume) sample, diamond exploration 19
Core Library archival rock, original source unknown 12
Name: SAMPLE_SOURCE, dtype: int64
We can see above that there are a lot of samples that are not from rocks. We can create a new Dask DataFrame and only select samples from ‘rock sample’ types, excluding things like soils, lag and excrement of animals! Because this is a ‘lazy’ operation, it can be done on the total dataset and is quick:
rock_types = ['Drill cuttings', 'Sawn half drill core', 'Drill core',
              'Drilled interval rock sample, type unspecified',
              'Rock outcrop / float', 'Sawn quarter drill core',
              'Drillhole log data - used in calculating values',
              'A full face slice of core',
              'Pulp / powdered rock of known origin, typically a lab return',
              'Duplicate: a split from an existing sample.',
              'Channel (linear rock chip/outcrop sampling)',
              'Rock outcrop sample', 'Costean (trench)',
              'Filleted, shaved or ground core sample',
              'Rock float sample (not in situ)', 'Drillhole',
              'Rock subcrop sample (in situ, not attached to bedrock)']
ddf_rock_sample = ddf[ddf['SAMPLE_SOURCE'].isin(rock_types)]
Now we just have rock samples. Judging by the value counts above, that should have removed several million rows from our dataset, significantly reducing its size. Next we can reduce it further and only grab the samples of interest, here the ones containing major element data.
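If you want to confirm the reduction, you can count the rows, but note that len() on a Dask DataFrame is not free: it triggers a full pass over the data:

# counting rows forces Dask to actually read and filter the file
print(len(ddf_rock_sample))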
First we generate a list of the elements we want to include. Then we select all of the unique SAMPLE_ANALYSIS_NO values (the primary identifier for each sample analysis) that contain data for the required major elements:
elements = ['SiO2','Al2O3','TiO2','Fe2O3','MgO','CaO','Na2O','K2O','P2O5','LOI','FeO']
analyses = ddf_rock_sample[ddf_rock_sample.CHEM_CODE.isin(elements)].SAMPLE_ANALYSIS_NO.unique()
Now that we have identified the samples we want to keep, we can compute a Pandas DataFrame for those unique SAMPLE_ANALYSIS_NO values with only the columns we need, such as the sample number, chem code, value and unit (we can drop unnecessary columns like lithology to further reduce the memory size of the dataset). Because this converts a larger-than-memory dataset into a more usable size, it may take a few moments to process:
cols = ['SAMPLE_NO', 'SAMPLE_ANALYSIS_NO', 'CHEM_CODE', 'VALUE', 'UNIT',
        'ANALYSIS_TYPE_DESC', 'CHEM_METHOD_CODE', 'OTHER_ANALYSIS_ID',
        'ANALYSIS_TYPE_DESC', 'COLLECTED_DATE']
ddf_major_chem = ddf_rock_sample[ddf_rock_sample.SAMPLE_ANALYSIS_NO.isin(list(analyses))][cols].drop_duplicates()
df_major_chem = ddf_major_chem.compute()
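This compute can take a while on a big extract. If you’d like feedback while it runs, Dask’s local diagnostics include a progress bar (a small sketch wrapping the same call as above):

from dask.diagnostics import ProgressBar

# prints a live progress bar while the partitions are processed
with ProgressBar():
    df_major_chem = ddf_major_chem.compute()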
Selecting by sample number like this, rather than filtering directly on the required elements, is potentially a redundant step. We could have skipped building the ‘analyses’ list entirely and simply selected the rows whose CHEM_CODE is in the ‘elements’ list, which would also remove the need for some of the code below. I’ve done it this way to demonstrate how to select whole samples based on whether they contain a particular type of data, in this case major elements. The shortcut is sketched below.
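For completeness, the shortcut might look something like this (a sketch, reusing the cols list from above; it keeps only the major element rows and never builds the ‘analyses’ list):

# filter on CHEM_CODE directly, skipping the sample-number lookup
df_major_direct = ddf_rock_sample[ddf_rock_sample.CHEM_CODE.isin(elements)][cols].drop_duplicates().compute()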
So, because we’ve selected by sample, if we look at an example sample below we can see that the Pandas DataFrame contains the major elements we want plus extras. We can also see that there are duplicate analytes analysed by different methods:
df_major_chem[df_major_chem['SAMPLE_ANALYSIS_NO']==2466083]
SAMPLE_NO | SAMPLE_ANALYSIS_NO | CHEM_CODE | VALUE | UNIT | ANALYSIS_TYPE_DESC | CHEM_METHOD_CODE | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | COLLECTED_DATE | |
---|---|---|---|---|---|---|---|---|---|---|
147316 | 2958572 | 2466083 | Ga | 30 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147317 | 2958572 | 2466083 | La | 26.5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147318 | 2958572 | 2466083 | Mo | 0.5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147319 | 2958572 | 2466083 | Ni | 24 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147320 | 2958572 | 2466083 | P | 600 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147321 | 2958572 | 2466083 | P2O5 | 0.137 | % | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147322 | 2958572 | 2466083 | Pb | 4 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147323 | 2958572 | 2466083 | S | 0.01 | % | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147324 | 2958572 | 2466083 | Sb | 2.5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147325 | 2958572 | 2466083 | Sc | 8 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147326 | 2958572 | 2466083 | Sr | 35 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147327 | 2958572 | 2466083 | Te | 0.025 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147328 | 2958572 | 2466083 | Th | 20 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147329 | 2958572 | 2466083 | Ti | 0.19 | % | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147330 | 2958572 | 2466083 | Tl | 5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147331 | 2958572 | 2466083 | W | 5 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
147332 | 2958572 | 2466083 | Y | 47.4 | ppm | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
17613 | 2958572 | 2466083 | Al2O3 | 12.9 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17614 | 2958572 | 2466083 | Au | 0.005 | ppm | GEOCHEMISTRY | Au-TL43 | NaN | GEOCHEMISTRY | 31/05/2012 |
17615 | 2958572 | 2466083 | CaO | 2.39 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17616 | 2958572 | 2466083 | K2O | 5.05 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17617 | 2958572 | 2466083 | MgO | 10.2 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17618 | 2958572 | 2466083 | Na2O | 0.217 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17619 | 2958572 | 2466083 | SiO2 | 53.8 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
31118 | 2958572 | 2466083 | Ag | 0.25 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31119 | 2958572 | 2466083 | Al | 7.05 | % | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31120 | 2958572 | 2466083 | As | 2.5 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31121 | 2958572 | 2466083 | Ba | 1240 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31122 | 2958572 | 2466083 | Be | 7.2 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31123 | 2958572 | 2466083 | Bi | 1 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31124 | 2958572 | 2466083 | Cd | 0.25 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31125 | 2958572 | 2466083 | Ce | 52.9 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31126 | 2958572 | 2466083 | Co | 28 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31127 | 2958572 | 2466083 | Cr | 21 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31128 | 2958572 | 2466083 | Cu | 15 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31129 | 2958572 | 2466083 | Fe | 6.27 | % | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31130 | 2958572 | 2466083 | U | 10.3 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
31131 | 2958572 | 2466083 | U3O8 | 12 | ppm | GEOCHEMISTRY | ME-XRF21n | NaN | GEOCHEMISTRY | 31/05/2012 |
116463 | 2958572 | 2466083 | Ca | 1.76 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116464 | 2958572 | 2466083 | Fe | 6.43 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116465 | 2958572 | 2466083 | K | 4.24 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116466 | 2958572 | 2466083 | Mg | 6.51 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116467 | 2958572 | 2466083 | Mn | 1.01e+03 | ppm | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116468 | 2958572 | 2466083 | Na | 0.09 | % | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116469 | 2958572 | 2466083 | V | 45 | ppm | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
116470 | 2958572 | 2466083 | Zn | 55 | ppm | GEOCHEMISTRY | ME-ICP61 | NaN | GEOCHEMISTRY | 31/05/2012 |
147444 | 2958572 | 2466083 | LOI | 5.69 | % | GEOCHEMISTRY | ME-GRA05 | NaN | GEOCHEMISTRY | 31/05/2012 |
Our next step is to create a new DataFrame with only the required major element rows, dropping the additional trace and rare earth element data for this example. We do this by using our elements list and selecting only rows that have those elements in the ‘CHEM_CODE’ column:
df_major_elements = df_major_chem[df_major_chem['CHEM_CODE'].isin(elements)]
We can check that we only have the major elements now by checking that same sample again:
df_major_elements[df_major_elements['SAMPLE_ANALYSIS_NO']==2466083]
SAMPLE_NO | SAMPLE_ANALYSIS_NO | CHEM_CODE | VALUE | UNIT | ANALYSIS_TYPE_DESC | CHEM_METHOD_CODE | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | COLLECTED_DATE | |
---|---|---|---|---|---|---|---|---|---|---|
147321 | 2958572 | 2466083 | P2O5 | 0.137 | % | GEOCHEMISTRY | ME-MS62s | NaN | GEOCHEMISTRY | 31/05/2012 |
17613 | 2958572 | 2466083 | Al2O3 | 12.9 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17615 | 2958572 | 2466083 | CaO | 2.39 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17616 | 2958572 | 2466083 | K2O | 5.05 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17617 | 2958572 | 2466083 | MgO | 10.2 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17618 | 2958572 | 2466083 | Na2O | 0.217 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
17619 | 2958572 | 2466083 | SiO2 | 53.8 | % | GEOCHEMISTRY | ME-MS41 | NaN | GEOCHEMISTRY | 31/05/2012 |
147444 | 2958572 | 2466083 | LOI | 5.69 | % | GEOCHEMISTRY | ME-GRA05 | NaN | GEOCHEMISTRY | 31/05/2012 |
As I mentioned above, some of the samples have duplicate analyses by different methods. To check on some of them, we can look for rows with duplicated SAMPLE_ANALYSIS_NO and CHEM_CODE combinations:
dup_analysis_samples = df_major_chem[
    df_major_chem.duplicated(subset=['SAMPLE_ANALYSIS_NO', 'CHEM_CODE'], keep=False)
].sort_values(['SAMPLE_ANALYSIS_NO', 'CHEM_CODE'])
dup_analysis_samples.head(6)
SAMPLE_NO | SAMPLE_ANALYSIS_NO | CHEM_CODE | VALUE | UNIT | ANALYSIS_TYPE_DESC | CHEM_METHOD_CODE | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | COLLECTED_DATE | |
---|---|---|---|---|---|---|---|---|---|---|
164713 | 969 | 776 | Al2O3 | 21.36 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164724 | 969 | 776 | Al2O3 | 24.87 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164719 | 969 | 776 | CaO | 12.16 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164730 | 969 | 776 | CaO | 12.86 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164715 | 969 | 776 | Fe2O3 | 2.38 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
164726 | 969 | 776 | Fe2O3 | 0.57 | % | GEOCHEMISTRY | XRF | NaN | GEOCHEMISTRY | 23/11/1984 |
To handle these duplicates properly we would need to look at the ‘CHEM_METHOD_CODE’ and the labs and decide which analysis to keep. For this example we’ll just create a new DataFrame and drop one of the duplicate analyses (a more careful approach is sketched after the code):
df_major_elements_tmp1 = df_major_elements.sort_values(
    ['SAMPLE_ANALYSIS_NO', 'CHEM_CODE', 'CHEM_METHOD_CODE']
).drop_duplicates(subset=['SAMPLE_ANALYSIS_NO', 'CHEM_CODE'], keep='last')
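If you did want to prefer particular methods rather than relying on keep='last', one approach (a sketch only; the preference order below is made up for illustration, using method codes seen in the tables above) is to rank the methods and keep the best-ranked row per sample and analyte:

# hypothetical preference order: lower rank wins, unlisted methods rank last
method_rank = {'XRF': 0, 'ME-XRF21n': 1, 'ME-ICP61': 2}
ranked = df_major_elements.copy()
ranked['rank'] = ranked['CHEM_METHOD_CODE'].map(method_rank).fillna(99)
df_major_preferred = (ranked
                      .sort_values(['SAMPLE_ANALYSIS_NO', 'CHEM_CODE', 'rank'])
                      .drop_duplicates(subset=['SAMPLE_ANALYSIS_NO', 'CHEM_CODE'], keep='first')
                      .drop(columns='rank'))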
Cleaning the data
Our subset of major element data now consists of just under 3 million individual analyses. But the data is not all numeric and contains a number of other symbols, such as ‘<’ and ‘-‘. We need to find which non-numeric symbols are present and then remove them. We also need to note that the ‘<’ symbol is the indicator for analyses below detection limit (BDL). In this case, to handle BDL values we will replace each one with half of the reported detection limit.
First we can use regular expressions to find which non-numeric characters are in the ‘VALUE’ column:
# collect every non-digit character appearing in the VALUE column
sym_find_list = [re.findall(r'\D', str(i)) for i in df_major_elements_tmp1['VALUE']]
# flatten the list of lists into a set of unique symbols
unique_symbols = set([item for sublist in sym_find_list for item in sublist])
unique_symbols
{'-', '.', '<', '>'}
We can then use Pandas string methods, like str.contains, to find the rows containing them and drop them:
df_major_elements_tmp1.drop(df_major_elements_tmp1[df_major_elements_tmp1['VALUE'].str.contains('>',na=False, regex=False)].index, inplace=True)
The ‘-‘ symbol is an interesting one, as it can appear in legitimate values such as negative LOI. But looking at the samples that contain ‘-‘, we can see that in some instances it is used to represent a range of values, like 0-10. To deal with this case (only 2 samples in our dataset) we can explicitly convert them to a different value, such as ‘<10’:
df_major_elements_tmp1[df_major_elements_tmp1['VALUE'].str.contains(r'-',na=False,regex=False)].tail()
SAMPLE_NO | SAMPLE_ANALYSIS_NO | CHEM_CODE | VALUE | UNIT | ANALYSIS_TYPE_DESC | CHEM_METHOD_CODE | OTHER_ANALYSIS_ID | ANALYSIS_TYPE_DESC | COLLECTED_DATE | |
---|---|---|---|---|---|---|---|---|---|---|
14320 | 1703219 | 1692890 | LOI | -0.32 | % | GEOCHEMISTRY | LOI | 562.0/0910907 | GEOCHEMISTRY | 05/10/2009 |
14334 | 1703221 | 1692892 | LOI | -0.16 | % | GEOCHEMISTRY | LOI | 562.0/0910907 | GEOCHEMISTRY | 05/10/2009 |
16070 | 1703222 | 1692893 | LOI | -0.35 | % | GEOCHEMISTRY | LOI | 562.0/0910907 | GEOCHEMISTRY | 05/10/2009 |
95053 | 2457789 | 1988166 | LOI | <0-10 | % | GEOCHEMISTRY | XRF78S | WM111749 | GEOCHEMISTRY | 07/10/2008 |
59340 | 2458844 | 1989221 | LOI | <0-10 | % | GEOCHEMISTRY | XRF78S | WM111749 | GEOCHEMISTRY | 11/09/2008 |
df_major_elements_tmp1.loc[df_major_elements_tmp1['VALUE'] == '<0-10','VALUE'] = '<10'
Now we can deal with the BDL values represented by ‘<’. There are a number of ways to handle them, but here we will just convert each to half the reported detection limit.
We do that by first making a flag column identifying which analyses are BDL. Then we can strip the ‘<’ character and finally divide just those rows that have the flag by 2:
# create a flag column to identify BDL values
df_major_elements_tmp1['BDL'] = 0
df_major_elements_tmp1.loc[df_major_elements_tmp1['VALUE'].str.contains('<',na=False,regex=False), 'BDL'] = 1
# remove the '<' character from the VALUE column and cast to float
df_major_elements_tmp1['VALUE'] = df_major_elements_tmp1['VALUE'].astype(str).str.replace("<", "").astype(float)
# convert BDL values to half the reported detection limit
df_major_elements_tmp1.loc[df_major_elements_tmp1['BDL'] == 1,'VALUE'] = df_major_elements_tmp1.loc[df_major_elements_tmp1['BDL'] == 1,'VALUE'] /2
From here it’s possible to do a number of further validation checks on the data, such as checking that the values fall within a realistic range; one such check is sketched below. But for this example we’ll assume the data from here is ready to go.
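As an illustration, a quick bounds check might look like this (a sketch; LOI is excluded because, as we saw above, slightly negative LOI values are legitimate):

# oxide percentages should sit roughly between 0 and 100
is_oxide = df_major_elements_tmp1['CHEM_CODE'] != 'LOI'
out_of_range = df_major_elements_tmp1[is_oxide &
    ((df_major_elements_tmp1['VALUE'] < 0) | (df_major_elements_tmp1['VALUE'] > 100))]
print(f'{len(out_of_range)} suspect values to investigate')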
Pivoting the data
The last thing we need to do is convert the cleaned chemical data from long format to a tidy, wide format. We can do this by pivoting the DataFrame. Pivoting creates a multilevel column header; to fix this we simply rename the columns of the table, flattening it back to a single header row:
df_major_chem_wide = df_major_elements_tmp1.pivot(index='SAMPLE_ANALYSIS_NO', values=['VALUE'],columns='CHEM_CODE').sort_values('CHEM_CODE',axis=1)
df_major_chem_wide.columns = ['Al2O3','CaO','Fe2O3','FeO','K2O','LOI','MgO','Na2O','P2O5','SiO2','TiO2']
df_major_chem_wide.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 321523 entries, 122 to 2476035
Data columns (total 20 columns):
(VALUE, Al2O3) 189183 non-null float64
(VALUE, CaO) 189758 non-null float64
(VALUE, Fe2O3) 61647 non-null float64
(VALUE, FeO) 4665 non-null float64
(VALUE, K2O) 186116 non-null float64
(VALUE, LOI) 156535 non-null float64
(VALUE, MgO) 190418 non-null float64
(VALUE, Na2O) 171249 non-null float64
(VALUE, P2O5) 211858 non-null float64
(VALUE, SiO2) 192035 non-null float64
(VALUE, TiO2) 151620 non-null float64
dtypes: float64(20)
memory usage: 61.5 MB
df_major_chem_wide.head()
Al2O3 | CaO | Fe2O3 | FeO | K2O | LOI | MgO | Na2O | P2O5 | SiO2 | TiO2 | |
---|---|---|---|---|---|---|---|---|---|---|---|
SAMPLE_ANALYSIS_NO | |||||||||||
122 | 14.5 | 8.55 | 11.10 | NaN | 1.39 | 1.25 | 4.62 | 2.26 | 0.09 | 55.5 | 0.84 |
124 | 14.9 | 8.95 | 10.30 | NaN | 1.38 | 1.85 | 4.86 | 2.10 | 0.07 | 55.0 | 0.85 |
126 | 14.9 | 5.95 | 11.10 | NaN | 2.18 | 2.28 | 5.20 | 4.26 | 0.09 | 53.7 | 0.85 |
128 | 15.9 | 8.70 | 8.40 | NaN | 1.32 | 4.42 | 4.24 | 2.08 | 0.09 | 54.4 | 0.90 |
130 | 18.6 | 3.80 | 1.94 | NaN | 6.45 | 3.44 | 1.08 | 2.18 | 0.02 | 60.8 | 1.06 |
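One design note before exporting: hardcoding the flattened column names, as I did above, works because the columns were sorted by CHEM_CODE first, but it would silently mislabel columns if the element list ever changed. A less brittle alternative (a sketch, used instead of the rename above) is to drop the redundant ‘VALUE’ level from the multilevel header:

# alternative to the hardcoded rename: drop the top 'VALUE' level instead
df_major_chem_wide.columns = df_major_chem_wide.columns.droplevel(0)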
Finally we can export our new, customised, tidy format dataset using the Pandas to_csv method:
df_major_chem_wide.to_csv(r'D:\SARIG_major_els_v1.csv')
Summary
In this example we have taken a very large geochemical dataset, more than 10 GB worth of data, and turned it into a usable size of less than 65 MB. Using Dask we were able to deal with more data than my computer could handle in memory, filter it down to just the sample types and elements that we wanted (still over 300,000 samples, mind you), and then convert it into the usual tidy, wide table format most of us are used to dealing with.
I hope you find this example useful for your own efforts to deal with large geological datasets. With a little bit of code and the right Python libraries, anyone can move away from Excel and start using larger-than-memory datasets.
Please check out my other content at GeoDataAnalytics.net, and get in touch if you have any questions or comments. And make sure you check out all the amazing work and freely available data from the Geological Survey of South Australia and SARIG. This example forms part of a larger exploration and transformation of this dataset for application in a lithology prediction machine learning model and can be found on GitHub.