Fast and easy Local Climate Zone mapping
Version 2.5.0 of the LCZ-Generator changed how the train-test split works.
Previously, a simpler approach was taken: the entire Training Area (TA) dataset was split randomly into 70% training data and 30% test data. This was done 25 times, each time reshuffling the dataset. For each split, a number drawn from a uniform distribution between 0 and 1 was assigned to each polygon, and only polygons with a number ≤ 0.7 were selected for training. Ideally, all LCZ classes would have a similar number of TAs, but this is not always the case. With a purely random split, classes with few TAs can end up over- or under-represented in the training or test set, which makes it necessary to stratify each split by LCZ class.
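The previous approach can be illustrated with a minimal sketch. This is not the LCZ-Generator's actual code: the function name random_split, the use of NumPy, and the example class counts are assumptions made purely for illustration.

```python
import numpy as np


def random_split(n_polygons, train_fraction=0.7, seed=None):
    """Return a boolean mask that is True for polygons assigned to training.

    Hypothetical helper mimicking the pre-2.5.0 idea described above.
    """
    rng = np.random.default_rng(seed)
    # One uniform draw in [0, 1) per polygon.
    draws = rng.uniform(0.0, 1.0, size=n_polygons)
    # Polygons whose draw is <= train_fraction go to training.
    return draws <= train_fraction


# Hypothetical, imbalanced TA set: two common classes and one rare class.
lcz = np.array([2] * 50 + [6] * 50 + [17] * 4)

for seed in range(3):
    is_train = random_split(len(lcz), seed=seed)
    rare_in_train = int(np.sum(is_train & (lcz == 17)))
    print(f"seed {seed}: {rare_in_train} of 4 rare-class polygons in training")
```

Because every polygon is drawn independently, the share of each class that lands in the training set is left to chance, and the overall training share is only approximately 70%.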
This improvement was implemented in version 2.5.0 using the train_test_split function from the scikit-learn package, with the split performed offline. During this process, one new column is added for each bootstrap that will be performed, named after the random seed used for that split. The result is a new file structure with 25 such columns, each marking test data as 0 and train data as 1.
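A stratified split of this kind might look roughly like the sketch below. It assumes the TAs are held in a pandas DataFrame with a class column named lcz. The helper add_split_columns, the seeds 0 to 24, and the example data are hypothetical and not taken from the LCZ-Generator source. Only train_test_split with its test_size, stratify and random_state parameters is the actual scikit-learn API.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def add_split_columns(tas, seeds, class_col="lcz", test_size=0.3):
    """Add one 0/1 column per seed (0 = test, 1 = train), stratified by class.

    Hypothetical helper; column names are the seeds converted to strings.
    """
    out = tas.copy()
    positions = np.arange(len(out))
    labels = out[class_col].to_numpy()
    for seed in seeds:
        # Stratifying keeps the class proportions similar in both parts.
        train_pos, _test_pos = train_test_split(
            positions, test_size=test_size, stratify=labels, random_state=seed
        )
        flag = np.zeros(len(out), dtype=int)  # default: test data (0)
        flag[train_pos] = 1                   # mark train data (1)
        out[str(seed)] = flag
    return out


# Hypothetical TA table with one rare class.
tas = pd.DataFrame({"lcz": [2] * 20 + [6] * 20 + [17] * 10})
seeds = range(25)  # assumed seeds, one per bootstrap
result = add_split_columns(tas, seeds)

# Share of each class marked as train data for the first seed.
print(result.groupby("lcz")["0"].mean())
```

With stratification, every class contributes close to 70% of its polygons to the training set in each of the 25 columns, regardless of how many TAs the class has.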
A detailed impact assessment of the change was carried out and can be found on GitHub.