Data Ingestion

Instructions for ingesting data for training isoscape models and using them to generate isoscapes

Last Updated: 2023-09-27

Our codebase is evolving rapidly, so these steps may go out of date. The team is happy to respond to inquiries.

Background

Running the Data Ingestion notebook is generally the first stage in the sequence of steps required to produce a new isoscape. See Research Colabs for more information.

Running the Notebook

Find the Colab at: https://colab.research.google.com/github/tnc-br/ddf-isoscapes/blob/main/data_ingestion.ipynb

In general, the default options in data_ingestion.ipynb are what most users will want, and few settings need to be changed. However, advanced users should note the following:

  • Known Issue: Mismatched Projections Lead to Poor-Quality Isoscapes. The data ingestion pipeline currently uses GeoTIFFs with mismatched projections to instantiate our tabular dataset. Until this is fixed, generated isoscapes will not be of the quality that they could be.

  • "GOOGLE DRIVE" is currently the only working sample source, so users should strongly prefer it. The "TIMBERID" source is still in the testing stage. Once it is fully operational, it will automatically import trusted samples from TimberID as a dataset.

  • "ORG_NAME" for the "TIMBERID" source should correspond to the permissions of the user: Google users should select "google" and USP users should select "USP".

  • Pay attention to errors and warnings in the feature selection stage. They indicate that downstream models may not train with all of the expected features, which can result in low-quality isoscapes.

  • Grouping: If KEEP_GROUPING is checked, the exported rows will have unique values of the grouping columns selected above, so each group is combined into a single row. If it is unchecked (False), the exported rows will still contain the mean, variance, and other statistics for each group, but the original set of rows will be exported without combining any rows. For training the variational inference model, check the box (this is the default). For XGBoost, Kriging, and Linear Kriging, leave it unchecked (False).

  • Visualize the dataset if you'd like, then run the cells under "Export processed dataset." This will export the partitioned, preprocessed data to Google Drive, which can be used to train models and produce isoscapes with existing models.
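
To make the two KEEP_GROUPING behaviors described above concrete, here is a minimal sketch in plain Python. It is not the notebook's own code: the function add_group_stats, the sample rows, and the column names (Code, lat, long, d18O_cel) are hypothetical and chosen only for illustration, and only the mean and variance are computed.

```python
from collections import defaultdict
from statistics import mean, variance

# Hypothetical sample rows; the column names are illustrative and are not
# taken from the notebook's actual schema.
ROWS = [
    {"Code": "A", "lat": -3.0, "long": -60.0, "d18O_cel": 25.0},
    {"Code": "A", "lat": -3.0, "long": -60.0, "d18O_cel": 27.0},
    {"Code": "B", "lat": -4.0, "long": -61.0, "d18O_cel": 24.0},
    {"Code": "B", "lat": -4.0, "long": -61.0, "d18O_cel": 26.0},
    {"Code": "B", "lat": -4.0, "long": -61.0, "d18O_cel": 28.0},
]


def add_group_stats(rows, group_cols, value_col, keep_grouping):
    """Attach per-group mean and sample variance of value_col (hypothetical helper).

    keep_grouping=True  -> one row per unique combination of group_cols.
    keep_grouping=False -> every original row is kept, with the statistics
                           of its group appended as extra columns.
    """
    # Collect the measured values belonging to each group.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        groups[key].append(row[value_col])

    # Sample variance needs at least two points; this sketch assumes 0.0
    # for single-sample groups.
    stats = {
        key: {
            f"{value_col}_mean": mean(values),
            f"{value_col}_variance": variance(values) if len(values) > 1 else 0.0,
        }
        for key, values in groups.items()
    }

    if keep_grouping:
        # Combine each group into a single row.
        return [dict(zip(group_cols, key), **s) for key, s in stats.items()]

    # Keep the original rows and append their group's statistics.
    return [
        {**row, **stats[tuple(row[col] for col in group_cols)]} for row in rows
    ]


grouped = add_group_stats(ROWS, ["Code", "lat", "long"], "d18O_cel", keep_grouping=True)
ungrouped = add_group_stats(ROWS, ["Code", "lat", "long"], "d18O_cel", keep_grouping=False)
print(len(grouped), len(ungrouped))
```

With grouping kept, the five sample rows collapse to two, one per unique combination of the grouping columns. With grouping off, all five rows remain and each carries the statistics of its group.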
