When neccessary - for livelihoods data, the harmonisation is done automatically based on predetermined harmonisation list for crop/animal/unit names. However, when necessary, for instance, when there are new crop/animal/units which do not yet exist in the standard list, it requires Manual intervention to provide the parameters to the system so that it can harmonise these exceptions.
Modules of Data Cleaning
- AUTO: After the system pulled the primary data from KoboToolbox to the Analysis sheet, within the same createUpdateAnalysisSheet function, it schedules the following tasks in a sequence on the same datasheet to clean/harmonise the primary data.
- AUTO - STEP1 PULL AND HARMONISE NEW DATA: First task will identify any new primary data in the ORIGINALDATA sheet and copy them to the ALLDATA sheet, while:
The screenshot of ALLDATA sheet
- Calculating the dependency ratio under the DEPENDENCY column
The screenshot of DEPENDENCY column
- Checking and harmonising the names of crops and animals under the columns
- Checking unit conversion rate, and recalculating the KG and HA for crops 1 and 2 under columns
The screenshot of columns
Determining the status of data under the DATA_CLEANED column using the following categories:
- OK: Data are all harmonised and ready to be used for analysis and export to Open Data Platform
- UNIT: The local unit used is not known and cannot convert into KG/HA
- CROP NAMES: The name of crop(s) is unknown and cannot standardise
- ANIMAL NAMES: The name of animal(s) is unknown and cannot standardise
The screenshot of DATA_CLEANED column
This is necessary as the standard list of names and units may have been manually updated, which was not applied when the data was imported for the first time.
- The columns for dependency, KG HA converted for crops, and data status are added at the right end of the datasheet.
The screenshot of data status
- The columns for names of crops are in the survey data. The data of these columns are overwritten by the standard names if they are found in the pre-determined list (this list is provided in HAM sheet ). In this example, “Riz” was converted into Rice, paddy, and Manioc was converted into Casssava. The pre-converted names remain in the ORIGINALDATA.
The screenshot for name of crops
- The columns for names of animals are also in the survey data. The data of these columns are overwritten by the standard names if they are found in the pre-determined list (this list is provided in HAM sheet). In this example,**GOATS** are converted into Goats, and DONKEY is converted into Donkeys. The pre-converted names remain in the ORIGINALDATA.
The screenshot for name of animals
From 2018 Monitoring Template, the names of crops and animals will be standardised and thus the in the customised form. Therefore, the diversion from the standard names should not happen frequently.
- AUTO - STEP3 OUTLIERS: After all data on crop/animal names and local units are harmonised, the system will calculate **Top code 2% and Bottom code 2% **for selected variables, in order to control Outliers.
- Calculate Top and Bottom 2 % thresholds for each of the variables provided in OUTLIERV Sheet, for each country and baseline/endline, using the primary data from ORIGNALDATA Sheet.
- Based on the calculation above, any values above the Top 2% or below the Bottom 2% will be replaced by the top/bottom 2% values in the ALLDATA Sheet
This is necessary to recalculate this for the entire data repeatedly, because new data submitted may change the top/bottom 2% thresholds.
The screenshot of OUTLIERV sheet
- The variables to be top/bottom coded are provided in OUTLIERV sheet. In default, 6 variables for the KG/HA of crop production, and # of animals owned are registered.
- The result of calculation of the 2% top/bottom thresholds will be shown (every time the system recalculates) in OUTLIERR sheet, for each variable, each country, for each of baseline/endline data, under the Bottomcode and Topcode columnd. The system will use these thresholds to review the data in the ALLDATA sheet, and overwrite the values as necessary. For instance, the top 2% threshold for Crop1 KG in baseline survey for Cameroon is 4,000. Therefore, any values for this variable in Cameroon above 4,000 will be replaced by 4,000.
- For each datasheet, the % of the data cleaned and harmonised is shown under the CLEANED DATA % column. If the % is low, there is a problem of harmonisation, and manual intervention is necessary to update the harmonisation/conversion list