What are explorers looking for in S.A.? Part 2
Spatially assessing the distribution of exploration topics using python.
In the previous article, I demonstrated an application of NLP topic modelling using Latent Dirichlet Allocation and the Gensim
library to identify themes in South Australian exploration report summaries. That model identified eight coherent topics in the report data set that closely align with my domain knowledge of the types of exploration undertaken in the State, and it demonstrates some of the potential of applying this type of analysis to unstructured geological reports.
As a geologist by trade, I tend to think about things spatially. Where (and when) do things occur, and how does that relate to other observables? To that end, I decided to try to represent the spatial distribution of those eight identified topics and see if they align with what I know about the geology and exploration potential of SA. In this second part of “What are explorers looking for in SA”, I will go through one process of creating a spatially enabled data set of exploration reports using the geopandas
library and then plotting them with the geoplot
and contextily
libraries.
Creating the tenement data set
In the previous article, we used the index data set of the South Australian exploration reports provided by the South Australian Department for Energy and Mining, freely available from the SARIG web portal. This csv file contains a number of metadata columns about each exploration report, including the envelope number (the document ID), the tenement number associated with the report, a broad subject and a short summary or abstract of the report. The important column for the previous topic modelling article was the abstract, a short text summary of the envelope written by a GSSA geologist. We then added to this data set by processing the abstracts for the LDA modelling and appending the results of the model. For this article, we will focus on the tenement number, which allows us to put spatial constraints on each exploration report.
First, let’s import the python libraries we’ll need and then open up and have a look at the final data set produced previously:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import contextily as ctx
import geoplot as gplt
import os
sns.set_style("darkgrid")
pd.set_option('max_columns', 30)
pd.set_option('display.max_colwidth', 200)
final_LDA_df = pd.read_csv(r'D:\Python ML\Envelope-key-words\Data\Processed\Envelope_modelled_topics_20201127.csv').drop(['Unnamed: 0'], axis=1)
final_LDA_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5894 entries, 0 to 5893
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 5894 non-null object
1 Category 5894 non-null object
2 Title 5894 non-null object
3 Publication Date 5630 non-null object
4 Format 5886 non-null object
5 Broad Subject 5892 non-null object
6 Notes 2753 non-null object
7 Tenement 5465 non-null object
8 normalised_Abstract 5894 non-null object
9 Abstract_toks 5894 non-null object
10 Abstract_len 5894 non-null int64
11 Dominant_Topic 5894 non-null float64
12 Percent_Contrib 5894 non-null float64
13 Topic_keywords 5894 non-null object
14 Topic_Name 5894 non-null object
15 Relevent_words 5894 non-null object
dtypes: float64(2), int64(1), object(13)
memory usage: 736.9+ KB
final_LDA_df.sample(1)
ID | Category | Title | Publication Date | Format | Broad Subject | Notes | Tenement | normalised_Abstract | Abstract_toks | Abstract_len | Dominant_Topic | Percent_Contrib | Topic_keywords | Topic_Name | Relevent_words | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
162 | ENV00407 | Company mineral exploration licence reports | Moonta-Wallaroo area. Progress reports to licence expiry/renewal, for the period 1/9/1963 to 31/8/1965. | 24-Sep-64 | Digital Hard Copy Hard Copy | Mineral exploration - SA; Drilling; Geophysics; Geochemistry | Continuation of tenure over same area formerly held as SML 42. Please refer to Env 6999 to see SML 59 six-monthly reports ... | SML00059 | report discuss ongoing geophysical geochemical diamond drilling investigation regional magnetic feature around northern end yorke peninsula targette hide supergene copper gold mineralisation moont... | ['report', 'discuss', 'ongoing', 'geophysical', 'geochemical', 'diamond', 'drilling', 'investigation', 'regional', 'magnetic', 'feature', 'around', 'northern', 'end', 'yorke', 'peninsula', 'target... | 61 | 2.0 | 0.4413 | gold' sample' mineralisation' hole' copper' area' prospect' drilling' sampling' rock | Gold, copper, base metal exploration | ['gold', 'soil', 'sample', 'rock_chip', 'copper', 'value', 'sampling', 'au', 'base_metal', 'mineralisation', 'anomalous', 'geochemical', 'return', 'assay', 'vein', 'prospect', 'ppm', 'zone', 'bedr... |
From the above we can see that there are 16 columns of data in this data set. For the present exercise we don’t need all this information, so let’s make a new DataFrame
which only contains the data we need for the spatial modelling, namely the ID, tenement number and the modelling results:
tenements_df = final_LDA_df[['ID', 'Tenement','Dominant_Topic','Topic_Name']].copy()  # .copy() avoids SettingWithCopyWarning when we add columns later
tenements_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5894 entries, 0 to 5893
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 5894 non-null object
1 Tenement 5465 non-null object
2 Dominant_Topic 5894 non-null float64
3 Topic_Name 5894 non-null object
dtypes: float64(1), object(3)
memory usage: 184.3+ KB
tenements_df.sample(3)
ID | Tenement | Dominant_Topic | Topic_Name | |
---|---|---|---|---|
1025 | ENV02325 | EL00085;EL00087;EL00121 | 1.0 | IOCG exploration |
3413 | ENV08917 | EL01858 | 2.0 | Gold, copper, base metal exploration |
2652 | ENV06685 | NaN | 4.0 | Mine operations and development |
There are a number of things to note about this new DataFrame
. Not every envelope that has an abstract has an associated tenement number: there are 5894 rows in this data set but only 5465 have tenement numbers. The reports without tenement numbers mostly relate to departmental or university publications that aren’t explicitly linked to a physical tenement.
The other thing to note is that some of the reports have multiple tenement numbers associated with them. This can happen for a number of reasons, such as renewals, the tenement changing hands, or the tenement type changing from, say, an exploration lease to a mining lease. Another reason may be that the report covers a number of separate tenements that are held by the same company and reported against as a group under an amalgamated expenditure agreement.
To deal with this I decided to try two different options and see how many of these tenement numbers I could match up to physical spatial boundaries. One was to use the entire list of tenement numbers, and the other was to use only the most recent (highest numbered) tenement to represent the location of the report. To get this we can first create a list of individual tenement numbers by splitting on the ‘;’ delimiter and then create a new ‘most_recent_tenement’ column containing only the last one:
tenements_df['Tenement'] = tenements_df['Tenement'].str.split(';')
tenements_df['most_recent_tenement'] = tenements_df.Tenement.str[-1]
tenements_df.tail(3)
ID | Tenement | Dominant_Topic | Topic_Name | most_recent_tenement | |
---|---|---|---|---|---|
5887 | ENV13185 | [EL02593, EL03221, EL04388, EL05542] | 6.0 | Heavy minerals, ?extractives and environmental | EL05542 |
5889 | ENV13201 | [EL02692, EL03346, EL04597, EL05773] | 2.0 | Gold, copper, base metal exploration | EL05773 |
5893 | ENV9267 | [OEL00020, OEL00021, PEL00005, PEL00006, PPL00014] | 4.0 | Mine operations and development | PPL00014 |
We can then make an assessment of just how many different tenement types and individual tenement numbers there are in our data set:
all_tenements_exploded = tenements_df.explode('Tenement')
all_tenements_exploded['tenement_type'] = all_tenements_exploded['Tenement'].str.extract(r'(^\D{1,3})')
# and just for the most recent tenements
tenements_df['tenement_type'] = tenements_df['most_recent_tenement'].str.extract(r'(^\D{1,3})')
all_tenements_exploded['tenement_type'].value_counts()
EL 6933
SML 837
PEL 746
MC 327
ML 310
GEL 274
OEL 227
PPL 177
EPP 146
ATP 20
PE 17
PEP 16
PM 14
RL 10
MSL 10
THR 7
MR 5
MPL 5
ELA 5
PRL 5
EML 4
AAL 3
T 2
SA 1
FR 1
OP 1
SIP 1
Name: tenement_type, dtype: int64
We can then create a list of tenement numbers to see how many we have, and to use later in selecting our spatial boundaries:
most_recent_tenem_list = tenements_df['most_recent_tenement'].dropna().to_list()
all_tenements_list = all_tenements_exploded['Tenement'].dropna().to_list()
print(len(most_recent_tenem_list))
print(len(all_tenements_list))
5465
10109
So there are 5465 most recent tenements (as expected, this is the same as the number of exploration reports that have an associated tenement number), and over 10,000 tenement numbers in total. We can see from the value counts above that this is dominated by the EL (Exploration Licence) type for mineral exploration with nearly 7000 tenements, followed by SML (Special Mining Lease), PEL (Petroleum Exploration Licence), MC (Mineral Claim), ML (Mining Licence), GEL (Geothermal Exploration Licence), OEL (Oil Exploration Licence), PPL (Petroleum Production Licence) and EPP (Exploration Permit for Petroleum), each with between 100 and 1000 tenements.
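As a quick sanity check, those tenement type counts can also be visualised directly. This is just a minimal sketch using the seaborn and matplotlib imports from above (not part of the original workflow); a log scale keeps the rarer types visible alongside the nearly 7000 ELs:
type_counts = all_tenements_exploded['tenement_type'].value_counts()
fig, ax = plt.subplots(figsize=(10, 5))
# one bar per tenement type, ordered from most to least common
sns.barplot(x=type_counts.index, y=type_counts.values, ax=ax, color='steelblue')
ax.set_yscale('log')
ax.set_xlabel('Tenement type')
ax.set_ylabel('Number of tenements (log scale)')
ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()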
Now that we have the tenement numbers associated with our exploration report topics, we need to get the spatial data sets so we can make some maps.
Downloading the tenement spatial data
I’ve used the word tenement a lot so far in this article, but what exactly is a tenement? When an exploration company wants to explore for resources, it is required by law to ‘peg’ a mineral claim. Before GPS and digital systems, this meant exactly that: they would physically peg out the boundaries of their mineral claim. Nowadays this is all done online by application via the SARIG web portal, where applicants can digitally draw the area onto a map, with the exact coordinates being captured. The image below is a snapshot of the current mineral exploration licence tenements from SARIG as at January 2021.
The spatial boundaries of each tenement (active and historic) are provided as polygons in a variety of formats, including an ESRI Geodatabase or SHP file, a MapInfo TAB file and a Google Earth KMZ file. The easiest way to download the tenement data is via the SARIG catalog.
Here I searched for each tenement type, focusing on the ones that had over 100 records. It’s important to note that there are generally two or three files for each tenement type: one with the current records, one with the aggregated historic records, and sometimes one with the current applications for new tenements. It’s also worth noting that some of the files include more than a single tenement type; for example, the ‘Historical Mining and Mineral Production Tenements’ file contains data for EMLs, MCs, MLs, MPLs, PMs, RLs and SMLs.
I downloaded and extracted these files in ‘gdb’ format, which is compatible with the geopandas library.
Creating a spatial dataframe
The next step in the process is to extract the required spatial data from the ‘gdb’ files so we can append it to our exploration report dataframe. This is where the geopandas
library comes in. geopandas is an open source project built on pandas and shapely that makes working with geospatial data in python straightforward. Because it’s built on pandas, the API is very similar and moving between the two is easy.
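If you haven’t come across geopandas before, the snippet below is a minimal, made-up illustration of that pandas-like workflow (the site names and coordinates are purely hypothetical): a plain DataFrame of coordinates becomes a GeoDataFrame once a geometry column of points is attached, and the usual pandas operations keep working.
# a toy DataFrame of two made-up localities
toy_df = pd.DataFrame({'name': ['site_a', 'site_b'], 'longitude': [138.6, 137.5], 'latitude': [-34.9, -32.5]})
# attach a geometry column of points and declare the CRS (GDA94)
toy_gdf = gpd.GeoDataFrame(toy_df, geometry=gpd.points_from_xy(toy_df.longitude, toy_df.latitude), crs='EPSG:4283')
# the usual pandas filtering still works on the GeoDataFrame
print(toy_gdf[toy_gdf.latitude < -33.0])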
First we need to create a geopandas
GeoDataFrame
for each of our ‘gdb’ files. To do this we can iterate over all the ‘gdb’ files in the folder and append each into a python dictionary, using the ‘gdb’ folder name as the key:
tenm_path = (r'D:\Python ML\Envelope-key-words\Data\Tenement_spatial')
tenm_dict = {}
# walk the folder and read each '.gdb' geodatabase into a GeoDataFrame
for root, dirs, files in os.walk(tenm_path):
    for dir in dirs:
        if dir.endswith(".gdb"):
            key = dir
            data = gpd.read_file(os.path.join(root, dir))
            tenm_dict[key] = data
tenm_dict.keys()
dict_keys(['geothermal_exploration_licences.gdb', 'geothermal_exploration_licence_applications.gdb', 'Mineral Claims.gdb', 'Mineral leases.gdb', 'mineral_opal_exploration_licences.gdb', 'Miscellaneous_Purposes_Leases.gdb', 'Non_active_geothermal_exploration_licences.gdb', 'non_active_mineral_exploration_licences.gdb', 'non_active_mineral_production_tenements.gdb', 'Non_active_petroleum_exploration_licences_1945_1967.gdb', 'Non_active_petroleum_exploration_licences.gdb', 'Non_active_special_mining_licences.gdb', 'Olympic_dam.gdb', 'petroleum_exploration_licences_permits.gdb', 'petroleum_exploration_licence_permit_applications.gdb', 'petroleum_production_licences.gdb', 'Private_mines.gdb', 'Retention_leases.gdb'])
Now that we have a dictionary of GeoDataFrames
, we need to take a look at them and see what data we have and how it is presented. The geometry column, with a data type of ‘geometry’, is where all the geopandas
magic lies. This column holds the spatial attributes, in this case polygons, for our tenement boundaries. For us to be able to join the spatial data onto our exploration topic DataFrame
we will need to use the tenement numbers as the join key.
tenm_dict['Retention_leases.gdb'].info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 OBJECTID 28 non-null int64
1 TENEMENT_TYPE 28 non-null object
2 Reports_and_related_records 28 non-null int64
3 TENEMENT_LABEL 28 non-null object
4 TENEMENT_STATUS 28 non-null object
5 TENEMENT_HOLDERS 28 non-null object
6 TENEMENT_OPERATORS 28 non-null object
7 LEGAL_AREA 28 non-null float64
8 AREA_UNIT 28 non-null object
9 LOCATION 18 non-null object
10 REGISTRATION_GRANT_DATE 28 non-null object
11 SURRENDER_DATE 0 non-null object
12 EXPIRY_DATE 28 non-null object
13 RENEWAL_RECEIVED_DATE 7 non-null object
14 TRANSFER_RECEIVED_DATE 2 non-null object
15 SURRENDER_RECEIVED_DATE 1 non-null object
16 COMMODITY_CATEGORIES 17 non-null object
17 COMMODITIES 17 non-null object
18 COURT_ACTION_PENDING 28 non-null object
19 MPL_PURPOSE 0 non-null object
20 OPERATION_NAME 28 non-null object
21 OPERATION_STATUS 28 non-null object
22 OPERATION_METHOD 28 non-null object
23 SHAPE_Length 28 non-null float64
24 SHAPE_Area 28 non-null float64
25 geometry 28 non-null geometry
dtypes: float64(3), geometry(1), int64(2), object(20)
memory usage: 5.8+ KB
tenm_dict['Retention_leases.gdb'].head(1)
OBJECTID | TENEMENT_TYPE | Reports_and_related_records | TENEMENT_LABEL | TENEMENT_STATUS | TENEMENT_HOLDERS | TENEMENT_OPERATORS | LEGAL_AREA | AREA_UNIT | LOCATION | REGISTRATION_GRANT_DATE | SURRENDER_DATE | EXPIRY_DATE | RENEWAL_RECEIVED_DATE | TRANSFER_RECEIVED_DATE | SURRENDER_RECEIVED_DATE | COMMODITY_CATEGORIES | COMMODITIES | COURT_ACTION_PENDING | MPL_PURPOSE | OPERATION_NAME | OPERATION_STATUS | OPERATION_METHOD | SHAPE_Length | SHAPE_Area | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | RL | 125 | RL 125 | Active | Iluka (Eucla Basin) Pty Ltd | Iluka (Eucla Basin) Pty Ltd | 2320.0 | Hectares | Approximately 105 km northwest of Ceduna | 2011-04-14T00:00:00 | None | 2021-04-13T00:00:00 | None | None | None | Gem & Semi-precious Stones;Industrial Minerals;Construction Materials;Metallic Minerals;Exploration | Sillimanite; Diamond; Sapphire; Calcrete; Staurolite; Xenotime; Rutile; Gold; Sand; Anatase; Feldspar; Kyanite; Monazite; Ilmenite; Ironstone; Granite; Garnet; Topaz; Zircon; Gravel (Natural); Leu... | No | None | Iluka (Eucla Basin) | Operating | No Method Defined | 0.188852 | 0.002203 | MULTIPOLYGON (((132.81911 -31.55883, 132.81886 -31.56965, 132.81860 -31.58047, 132.79754 -31.58011, 132.77648 -31.57975, 132.77673 -31.56893, 132.77699 -31.55811, 132.77722 -31.54857, 132.77746 -3... |
Unfortunately for us, the petroleum and minerals groups use slightly different column headers, and the special mining leases are different again. The tenement numbers here also don’t have the zero padding used in the index table, and they contain spaces. To fix this we can write a small function that removes the white space and adds padding so the formats match:
def tenement_label(x):
    # split a 'RL 125'-style label, zero-pad the number and rejoin, e.g. 'RL00125'
    a, b = x.split(' ')
    b = b.zfill(5)
    return a + b
We can then apply that function to either the ‘Tenement_No’ column for the petroleum tenements or the ‘TENEMENT_LABEL’ for the minerals ones and create a new normalised ‘tenement_id’ column in each GeoDataFrame
:
tenm_dict['geothermal_exploration_licences.gdb']['tenement_id'] = tenm_dict['geothermal_exploration_licences.gdb'].Tenement_No.apply(tenement_label)
tenm_dict['mineral_opal_exploration_licences.gdb']['tenement_id'] = tenm_dict['mineral_opal_exploration_licences.gdb'].TENEMENT_LABEL.apply(tenement_label)
tenm_dict['Non_active_special_mining_licences.gdb']['tenement_id'] = tenm_dict['Non_active_special_mining_licences.gdb'].Tenement_Type.str.cat(tenm_dict['Non_active_special_mining_licences.gdb'].Tenement_No.astype(str).str.zfill(5))
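Only three of the files are shown above for brevity. A sketch of how the same normalisation could be looped over the whole dictionary is given below; note that the mapping of file name to label column is only partly filled in and otherwise hypothetical, so in practice you would check each GeoDataFrame’s columns (as we did above) before relying on it:
# which column holds the 'EL 1234'-style label in each file (check each file's columns first)
label_columns = {
    'geothermal_exploration_licences.gdb': 'Tenement_No',
    'mineral_opal_exploration_licences.gdb': 'TENEMENT_LABEL',
    'Retention_leases.gdb': 'TENEMENT_LABEL',
    # ... and so on for the remaining files
}
for key, column in label_columns.items():
    tenm_dict[key]['tenement_id'] = tenm_dict[key][column].apply(tenement_label)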
We now have a ‘tenement_id’ column which is in the same format as the ‘Tenement’ column in our exploration topics DataFrame
:
for key in tenm_dict.keys():
    print(tenm_dict[key].tenement_id[0])
GEL00181
GELA00266
MC04322
ML06361
EL06107
MPL00122
GEL00233
EL05087
MPL00138
OEL00007
PEL00083
SML00029
SML00001
PEL00148
PELA00631
PPL00133
PM00181
RL00125
It is important with geospatial data that we check what the spatial reference is and make sure it is the same across all of our data sets, otherwise data can end up plotting in odd locations. This is simple with the geopandas
library by calling the crs
attribute, which returns an EPSG code:
for key in tenm_dict.keys():
    print(tenm_dict[key].crs)
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
epsg:4283
Each of our GeoDataFrames is in the same coordinate reference system, EPSG:4283, which is GDA94 (for a reference of all the EPSG codes have a look at spatialreference.org).
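They all match here, but if one file had come through in a different CRS, reprojecting it before any joins or plotting is straightforward with to_crs. The defensive check below is a sketch only (not needed for these files), and it assumes a geopandas version where the crs attribute is a pyproj CRS object, as the output above suggests:
# reproject anything that is not already in GDA94 (EPSG:4283)
for key in tenm_dict.keys():
    if tenm_dict[key].crs is not None and tenm_dict[key].crs.to_epsg() != 4283:
        tenm_dict[key] = tenm_dict[key].to_crs(epsg=4283)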
Now we can create a single new GeoDataFrame
which contains the normalised tenement_id column and the geometry column from all of the different tenement GeoDataFrames
:
combined_tenement_gdf = gpd.GeoDataFrame()
for key in tenm_dict.keys():
    combined_tenement_gdf = combined_tenement_gdf.append(tenm_dict[key][['tenement_id','geometry']], ignore_index=True)
combined_tenement_gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 13117 entries, 0 to 13116
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tenement_id 13117 non-null object
1 geometry 13117 non-null geometry
dtypes: geometry(1), object(1)
memory usage: 205.1+ KB
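As a side note, this append pattern follows the old pandas API, and DataFrame.append has been removed in more recent pandas (and hence geopandas) releases. On current versions the same combined GeoDataFrame can be built with pd.concat; the sketch below assumes the same tenm_dict and should give an equivalent result:
# build the combined GeoDataFrame in one go rather than appending in a loop
combined_tenement_gdf = gpd.GeoDataFrame(
    pd.concat([tenm_dict[key][['tenement_id', 'geometry']] for key in tenm_dict], ignore_index=True)
)
Either way, the result is the same combined set of tenement polygons.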
This gives us a GeoDataFrame
with over 13,000 tenements, but you’ll recall that we only had 5465 exploration reports with associated tenement numbers. We can filter this down based on the list of tenement numbers we created earlier:
all_tenement_gdf = combined_tenement_gdf[combined_tenement_gdf['tenement_id'].isin(all_tenements_list)]
all_tenement_gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 5792 entries, 0 to 13109
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tenement_id 5792 non-null object
1 geometry 5792 non-null geometry
dtypes: geometry(1), object(1)
memory usage: 135.8+ KB
We now have a new GeoDataFrame
which contains 5792 matching tenement numbers out of the more than 10,000 tenement numbers in our exploration topic DataFrame
. I also tried the above with our ‘most_recent_tenement’ list, but that only produced 3823 matches, suggesting that some reports are only represented spatially by their older tenement numbers. The remaining spatial tenements are simply not referenced by any report in our topic DataFrame.
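If you want to know exactly which report tenement numbers failed to find a matching polygon (most likely tenement types whose ‘gdb’ files we didn’t download), a quick set difference will tell you. This is just a sketch, reusing the lists and the tenement type regex from earlier:
# tenement numbers in the reports that have no matching polygon
matched = set(combined_tenement_gdf['tenement_id'])
unmatched = set(all_tenements_list) - matched
print(f"{len(unmatched)} report tenement numbers have no spatial match")
# which tenement types are missing most often?
print(pd.Series(sorted(unmatched)).str.extract(r'(^\D{1,3})')[0].value_counts().head())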
Next we can create our new geospatially enabled exploration topic dataframe by joining the tenement geometries onto our exploded topic DataFrame (all_tenements_exploded)
, creating a new GeoDataFrame
:
all_tenement_topics_gdf = all_tenement_gdf.merge(all_tenements_exploded, how = 'left', left_on='tenement_id', right_on='Tenement')
all_tenement_topics_gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 9596 entries, 0 to 9595
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tenement_id 9596 non-null object
1 geometry 9596 non-null geometry
2 ID 9596 non-null object
3 Tenement 9596 non-null object
4 Dominant_Topic 9596 non-null float64
5 Topic_Name 9596 non-null object
6 most_recent_tenement 9596 non-null object
7 tenement_type 9596 non-null object
dtypes: float64(1), geometry(1), object(6)
memory usage: 674.7+ KB
When we look at this new GeoDataFrame
, we see that the number of rows has increased to 9596. This is because we used the exploded version of our exploration topic DataFrame
, which generated rows with different tenement numbers but duplicated exploration report IDs. For example, if we look at one exploration report, ENV09797, we can see six different tenement numbers for the same report:
all_tenement_topics_gdf[all_tenement_topics_gdf.ID == 'ENV09797']
tenement_id | geometry | ID | Tenement | Dominant_Topic | Topic_Name | most_recent_tenement | tenement_type | |
---|---|---|---|---|---|---|---|---|
1 | MC03565 | MULTIPOLYGON (((140.93911 -32.25891, 140.93911 -32.25991, 140.92843 -32.26200, 140.92664 -32.26235, 140.92664 -32.25342, 140.93084 -32.25343, 140.93270 -32.25343, 140.93912 -32.25343, 140.93912 -3... | ENV09797 | MC03565 | 4.0 | Mine operations and development | ML05678 | MC |
2 | MC03566 | MULTIPOLYGON (((140.93084 -32.25343, 140.93085 -32.23744, 140.93913 -32.23744, 140.93913 -32.23987, 140.93912 -32.24548, 140.93912 -32.24554, 140.93912 -32.24991, 140.93912 -32.25276, 140.93912 -3... | ENV09797 | MC03566 | 4.0 | Mine operations and development | ML05678 | MC |
43 | ML05678 | MULTIPOLYGON (((140.93842 -32.24845, 140.93774 -32.25014, 140.93641 -32.25341, 140.93400 -32.25274, 140.93523 -32.25014, 140.93693 -32.24656, 140.93745 -32.24671, 140.93809 -32.24690, 140.93845 -3... | ENV09797 | ML05678 | 4.0 | Mine operations and development | ML05678 | ML |
821 | EL04590 | MULTIPOLYGON (((140.93463 -32.28184, 140.91797 -32.28184, 140.91797 -32.21517, 140.95129 -32.21517, 140.95130 -32.28184, 140.93463 -32.28184))) | ENV09797 | EL04590 | 4.0 | Mine operations and development | ML05678 | EL |
4606 | EL03382 | MULTIPOLYGON (((140.95130 -32.28184, 140.93463 -32.28184, 140.91797 -32.28184, 140.91796 -32.24850, 140.91796 -32.23184, 140.91796 -32.21517, 140.95129 -32.21517, 140.95130 -32.23184, 140.95130 -3... | ENV09797 | EL03382 | 4.0 | Mine operations and development | ML05678 | EL |
5477 | EL02704 | MULTIPOLYGON (((140.95130 -32.28184, 140.93463 -32.28184, 140.91797 -32.28184, 140.91796 -32.24850, 140.91796 -32.23184, 140.91796 -32.21517, 140.95129 -32.21517, 140.95130 -32.23184, 140.95130 -3... | ENV09797 | EL02704 | 4.0 | Mine operations and development | ML05678 | EL |
To deal with this we can simply drop the duplicate entries, delivering a final GeoDataFrame
with all our required data and 4827 unique entries. This is not quite the 5465 topic-modelled reports with tenement numbers we started with, because we didn’t download every single tenement type ‘gdb’ file, but it is a very good sample to give us an idea of the spatial distribution of the different topics.
all_tenement_topics_nodup_gdf = all_tenement_topics_gdf.drop_duplicates(subset='ID')
all_tenement_topics_nodup_gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 4827 entries, 0 to 9592
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tenement_id 4827 non-null object
1 geometry 4827 non-null geometry
2 ID 4827 non-null object
3 Tenement 4827 non-null object
4 Dominant_Topic 4827 non-null float64
5 Topic_Name 4827 non-null object
6 most_recent_tenement 4827 non-null object
7 tenement_type 4827 non-null object
dtypes: float64(1), geometry(1), object(6)
memory usage: 339.4+ KB
Representing the spatial distribution of the modelled topics
Currently we have a GeoDataFrame
containing the polygon geometries of the tenement areas. But polygons are not a particularly good way to show the regional distribution of different topics across the whole state; point data would be a better way to display it. Luckily, with geopandas
we can easily convert each polygon (or multipolygon) geometry to its centroid point to represent the location.
all_tenement_topics_nodup_gdf['geometry'] = all_tenement_topics_nodup_gdf['geometry'].centroid
all_tenement_topics_nodup_gdf.head(3)
tenement_id | geometry | ID | Tenement | Dominant_Topic | Topic_Name | most_recent_tenement | tenement_type | |
---|---|---|---|---|---|---|---|---|
0 | GEL00181 | POINT (139.74073 -31.48535) | ENV12457 | GEL00181 | 1.0 | IOCG exploration | GEL00210 | GEL |
1 | MC03565 | POINT (140.93255 -32.25731) | ENV09797 | MC03565 | 4.0 | Mine operations and development | ML05678 | MC |
3 | MC04476 | POINT (137.77280 -33.93578) | ENV00625 | MC04476 | 4.0 | Mine operations and development | SML00096 | MC |
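One caveat worth mentioning: the centroid is calculated in whatever CRS the geometry is in, and GDA94 is a geographic (degree-based) CRS, so recent geopandas versions will warn that the result may be slightly inaccurate. For tenement-scale polygons the difference is negligible, but a more rigorous alternative to the centroid call above is to project to a projected CRS first and convert the centroids back; the sketch below assumes GDA94 / Australian Albers (EPSG:3577) as the intermediate projection:
# compute centroids in an equal-area projected CRS, then convert back to GDA94
projected = all_tenement_topics_nodup_gdf.to_crs(epsg=3577)
all_tenement_topics_nodup_gdf['geometry'] = projected.centroid.to_crs(epsg=4283)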
Now we have a point dataset of topics that we can easily display on a map. To plot this up we can use the matplotlib
library to plot the point data over a basemap acquired using contextily
:
fig_1, ax_1 = plt.subplots(figsize=(8.27,11.69))
all_tenement_topics_nodup_gdf.plot(column='Topic_Name', ax=ax_1, alpha=0.7, legend=True, legend_kwds={'loc':(0.01,0.05)}, cmap='Dark2')
ctx.add_basemap(ax_1, crs=all_tenement_topics_nodup_gdf.crs.to_string(), source=ctx.providers.Esri.WorldStreetMap, aspect='auto')
fig_1.subplots_adjust(top=0.983)
fig_1.suptitle("Distrubution of all topics accross S.A.", fontsize=20, backgroundcolor='beige', alpha=0.7)
plt.xlabel("long")
plt.ylabel("lat")
plt.savefig(r"D:\Python ML\Envelope-key-words\Notebooks\Abstract LDA\Figures\Dist_all_topics.png", bbox_inches='tight')
plt.show()
And there you have it: the spatial distribution of each of our located exploration reports, colour coded by topic. But this presentation is not all that useful for seeing spatial trends in the data. We could plot each topic individually, but a better representation, I think, is to grid the data.
This is where the geoplot
library comes in. Geoplot is a high-level geospatial data visualisation library that makes plotting data on maps easy. There are a number of different plot types available in geoplot
but we’re after its kernel density estimation (KDE) plot, kdeplot
.
A kernel density estimation is a useful way to observe the shape of a distribution of data. It is a non-parametric estimate of the probability density function of our topic points. For our two-dimensional point data, each data point is essentially replaced by a Gaussian distribution centred on that point, and those distributions are summed to create the KDE surface. In the figures below, yellow represents areas with a high density of data points, while blue represents low density. You’ll see I also included the point data on each of the plots for reference.
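To make the idea concrete, here is a tiny self-contained illustration of a KDE using scipy (scipy is not used in the mapping code below; this sketch, with made-up points, is purely to show what kdeplot is doing under the hood with our longitude/latitude pairs):
import numpy as np
from scipy.stats import gaussian_kde
# ten made-up points clustered around two centres
points = np.array([[137.0, -30.5], [137.2, -30.4], [136.9, -30.7], [137.1, -30.6],
                   [140.0, -34.0], [140.1, -34.2], [139.9, -33.9], [140.2, -34.1],
                   [138.5, -32.0], [138.6, -32.1]])
# gaussian_kde expects an array of shape (n_dimensions, n_points)
kde = gaussian_kde(points.T)
# evaluate the summed Gaussian kernels on a coarse longitude/latitude grid
lon, lat = np.mgrid[136:141:50j, -35:-30:50j]
density = kde(np.vstack([lon.ravel(), lat.ravel()])).reshape(lon.shape)
print(density.max(), density.min())  # high near the two clusters, near zero elsewhere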
For ease of viewing, I created two separate images, each containing four maps. The code below creates only one of the figures, but the full code is available from the GitHub repository.
uranium = all_tenement_topics_nodup_gdf.query("Dominant_Topic == 0")
iocg = all_tenement_topics_nodup_gdf.query("Dominant_Topic == 1")
base = all_tenement_topics_nodup_gdf.query("Dominant_Topic == 2")
hm = all_tenement_topics_nodup_gdf.query("Dominant_Topic == 6")  # heavy minerals topic, plotted in the fourth panel below
fig3, ax3 = plt.subplots(2,2, figsize=(18.4,20.1), sharex=True, sharey=True, gridspec_kw={'wspace':0.1, 'hspace':0.09})
for r in range(2):
    for c in range(2):
        ax3[r,c].set_xlim(128.5,141.5)
        ax3[r,c].set_ylim(-38.456,-25.6)
gplt.kdeplot(iocg, ax=ax3[0,0], cmap='viridis', shade=True, alpha=0.5)
gplt.pointplot(iocg, ax=ax3[0,0], s=1, color= '#66A61E')
ax3[0,0].set_title("IOCG exploration", fontsize=20)
gplt.kdeplot(base, ax=ax3[0,1], cmap='viridis', shade=True, alpha=0.5)
gplt.pointplot(base, ax=ax3[0,1],s=1, color='#7570B3' )
ax3[0,1].set_title('Gold, copper, base metal exploration', fontsize=20)
gplt.kdeplot(uranium, ax=ax3[1,0], cmap='viridis', shade=True, alpha=0.5)
gplt.pointplot(uranium, ax=ax3[1,0],s=1, color='#666666')
ax3[1,0].set_title('Uranium and coal exploration', fontsize=20)
gplt.kdeplot(hm, ax=ax3[1,1], cmap='viridis', shade=True, alpha=0.5)
gplt.pointplot(hm, ax=ax3[1,1], s=1, color='#E7298A')
ax3[1,1].set_title('Heavy minerals, ?extractives\n and environmental', fontsize=20)
for r in range(2):
    for c in range(2):
        ctx.add_basemap(ax3[r,c], crs=all_tenement_topics_nodup_gdf.crs.to_string(), source=ctx.providers.Esri.WorldStreetMap, aspect='auto')
        ax3[r,c].axison=True
fig3.subplots_adjust(top=0.94, left=0.1)
fig3.suptitle("Kernel density distribution of main topics accross S.A.", fontsize=30)
plt.tight_layout()
plt.show()
What does it all mean
So what can we make of the above KDE spatial distribution plots of our modelled topics? To me this is a great way to see how different parts of the state have been the focus for different commodities. Knowing the geology of South Australia, these distributions also validate the modelled topics and the usefulness of this technique.
For example, we can see that ‘IOCG exploration’ is centred on the eastern Gawler Craton region, where we would expect it to be, with some IOCG exploration also focused in the Curnamona Province in the state’s east. ‘Gold, copper and base metal exploration’, on the other hand, is more focused on the northern Flinders Ranges and Curnamona regions, where there has been significant modern and historic mineralisation and exploration, with deposits like the Burra copper mine and the base metal exploration in the Curnamona looking for another Broken Hill Pb-Zn deposit. It also highlights the gold, copper-gold and silver focused exploration around the Tarcoola region and the upper Eyre Peninsula. Similarly, uranium exploration is focused on the Mount Painter/Lake Frome region, which hosts the Beverley uranium deposit, while oil and gas exploration is focused largely on the Cooper and Otway Basins in the state’s north-east and south-east respectively.
An interesting outcome for me is the distribution of both the ‘mine operations and development’ and the ‘heavy minerals, ?extractives and environmental’ topics. When I think about mining in South Australia, I’m generally drawn to the recent mining operations associated with the IOCG deposits like Olympic Dam, Prominent Hill and Carrapateena. But the distribution of mine-related topics is strongly focused around the northern Adelaide, upper Yorke Peninsula and Flinders Ranges regions. I think this reflects the ongoing exploration and discussion around the mining history, and the still very real exploration potential, of the copper triangle and the historic mining regions of Moonta, Wallaroo, Kapunda and Burra. Similarly for the heavy minerals and extractives topic: modern heavy minerals exploration is focused in the south-west of the state, which is highlighted by the topic distribution, but the dominant region is around the Mount Lofty Ranges, Adelaide and the lower Flinders Ranges. I think this reflects the extent of extractives exploration around the built-up and populated parts of the state (extractives are things like dimension stone for building, limestone for concrete, and quarries for road base and construction material) and gives me more confidence that the model is indeed picking up extractives as a strong part of that topic (and that I should take that ? out of the topic name!).
Summary
My aim and hope is that this article, and the previous one on topic modelling, have given you some insight into the hidden information contained in the large textual datasets held by the GSSA, and how we might use python and its ecosystem of data science and visualisation tools to easily access those insights. By applying the unsupervised LDA topic model to the exploration report abstracts, it becomes clear that the bulk of the exploration activity that has occurred in South Australia can be classified into one of eight themes. Then, by using some simple python data science and visualisation tools like geopandas
, geoplot
and contextily
we can quickly create spatial visualisations of this data to get a real sense of not just what explorers are looking for in South Australia, but where they are looking.
I hope you’ve enjoyed reading this. Please check out my other content at GeoDataAnalytics.net, and get in touch if you have any questions or comments. And make sure you check out all the amazing freely available data from the Geological Survey of South Australia, SARIG and the SARIG data catalog. All the code used to undertake this analysis is available from GitHub.