###################################################################################################### Livelihoods Module - Data Cleaning and Harmonisation of Crop/Animal/Local Units and TOP/BOTTOM CODING ###################################################################################################### .. note:: When a new data are pulled from KoboToolbox to the Analysis sheet , the data will be copied to another spreadsheet for data cleaning and harmonisation. .. figure:: images/process_15/process_15_1.png :width: 400px :align: center :alt: This is the image caption The screenshot of CREATEANALYSISSHEET .. note:: **When neccessary** - for livelihoods data, the harmonisation is done automatically based on predetermined harmonisation list for crop/animal/unit names. However, when necessary, for instance, when there are new crop/animal/units which do not yet exist in the standard list, it requires Manual intervention to provide the parameters to the system so that it can harmonise these exceptions. Modules of Data Cleaning ^^^^^^^^^^^^^^^^^^^^^^^^ * **AUTO**: After the system pulled the primary data from KoboToolbox to the Analysis sheet, within the same **createUpdateAnalysisSheet** function, it schedules the following tasks in a sequence on the same datasheet to clean/harmonise the primary data. * **AUTO - STEP1 PULL AND HARMONISE NEW DATA**: First task will identify any new primary data in the ORIGINALDATA sheet and copy them to the ALLDATA sheet, while: .. figure:: images/process_15/process_15_2.png :width: 400px :align: center :alt: This is the image caption The screenshot of ALLDATA sheet 1. Calculating the dependency ratio under the DEPENDENCY column .. figure:: images/process_15/process_15_3.png :width: 400px :align: center :alt: This is the image caption The screenshot of DEPENDENCY column 2. Checking and harmonising the names of crops and animals under the columns * "G_OUTPUT1/O1_Crop_Production1/O1_1stCrop" * "G_OUTPUT1/O1_Crop_Production2/O1_2ndCrop" * "G_OUTPUT1/O1_Animal_Production/Animal1" * "G_OUTPUT1/O1_Animal_Production/Animal2" 3. Checking unit conversion rate, and recalculating the KG and HA for crops 1 and 2 under columns * "Crop1KG" * "Crop2KG" * "Crop1HA" * "Crop2HA" .. figure:: images/process_15/process_15_4.png :width: 400px :align: center :alt: This is the image caption The screenshot of columns 4. Determining the status of data under the DATA_CLEANED column using the following categories: * *OK*: Data are all harmonised and ready to be used for analysis and export to Open Data Platform * *UNIT*: The local unit used is not known and cannot convert into KG/HA * *CROP NAMES*: The name of crop(s) is unknown and cannot standardise * *ANIMAL NAMES*: The name of animal(s) is unknown and cannot standardise .. figure:: images/process_15/process_15_5.png :width: 200px :align: center :alt: This is the image caption The screenshot of DATA_CLEANED column * **AUTO - STEP2 HARMONISE ALL DATA:** After the new data were imported and harmonised in the ALLDATA sheet, the system will do again: #. Checks unit conversion rate, and recalculate the KG and HA for crops 1 and 2 #. Harmonise names of crops and animals for all the data once again, including the previously existing data. .. note:: This is necessary as the standard list of names and units may have been manually updated, which was not applied when the data was imported for the first time. * The columns for dependency, KG HA converted for crops, and data status are added at the right end of the datasheet. .. figure:: images/process_15/process_15_step2_1.png :width: 600px :align: center :alt: This is the image caption The screenshot of data status * The columns for **names of crops** are in the survey data. The data of these columns are overwritten by the standard names if they are found in the pre-determined list (this list is provided in HAM sheet ). In this example, "Riz" was converted into **Rice, paddy**, and **Manioc** was converted into **Casssava**. The pre-converted names remain in the **ORIGINALDATA**. .. figure:: images/process_15/process_15_step2_2.png :width: 600px :align: center :alt: This is the image caption The screenshot for name of crops * The columns for **names of animals** are also in the survey data. The data of these columns are overwritten by the standard names if they are found in the pre-determined list (this list is provided in HAM sheet). In this example,**GOATS** are converted into **Goats**, and **DONKEY** is converted into **Donkeys**. The pre-converted names remain in the **ORIGINALDATA**. .. figure:: images/process_15/process_15_step2_3.png :width: 600px :align: center :alt: This is the image caption The screenshot for name of animals .. note:: From 2018 Monitoring Template, the names of crops and animals will be standardised and thus the in the customised form. Therefore, the diversion from the standard names should not happen frequently. * **AUTO - STEP3 OUTLIERS**: After all data on crop/animal names and local units are harmonised, the system will calculate **Top code 2% and Bottom code 2% **for selected variables, in order to control Outliers. 1. Calculate Top and Bottom 2 % thresholds for each of the variables provided in OUTLIERV Sheet, for each country and baseline/endline, using the primary data from ORIGNALDATA Sheet. 2. Based on the calculation above, any values above the Top 2% or below the Bottom 2% will be replaced by the top/bottom 2% values in the ALLDATA Sheet .. note:: This is necessary to recalculate this for the entire data repeatedly, because new data submitted may change the top/bottom 2% thresholds. .. figure:: images/process_15/process_15_6.png :width: 300px :align: center :alt: This is the image caption The screenshot of OUTLIERV sheet * The variables to be top/bottom coded are provided in OUTLIERV sheet. In default, 6 variables for the KG/HA of crop production, and # of animals owned are registered. .. figure:: images/process_15/process_15_step3_1.png :width: 300px :align: center :alt: This is the image caption * The result of calculation of the 2% top/bottom thresholds will be shown (every time the system recalculates) in OUTLIERR sheet, for each variable, each country, for each of baseline/endline data, under the Bottomcode and Topcode columnd. The system will use these thresholds to review the data in the ALLDATA sheet, and overwrite the values as necessary. For instance, the top 2% threshold for Crop1 KG in baseline survey for Cameroon is 4,000. Therefore, any values for this variable in Cameroon above 4,000 will be replaced by 4,000. .. figure:: images/process_15/process_15_step3_2.png :width: 300px :align: center :alt: This is the image caption * For each datasheet, the % of the data cleaned and harmonised is shown under the CLEANED DATA % column. If the % is low, there is a problem of harmonisation, and manual intervention is necessary to update the harmonisation/conversion list .. figure:: images/process_15/process_15_step3_3.png :width: 300px :align: center :alt: This is the image caption Diagram on Data Cleaning ^^^^^^^^^^^^^^^^^^^^^^^^^ .. figure:: images/process_15/process-15_diagram.png :width: 400px :align: center :alt: This is the image caption The block-diagram on Data Cleaning GAS Functions for Data Cleaning ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * :ref:`DATAANALYSIS component syncroLHsHarmonization() ` * :ref:`DATAANALYSIS component harmoniseData()` * :ref:`DATAANALYSIS component recalculateTopBottomThresholds()` #. :ref:`DATAANALYSIS component calculateOutlier()` #. :ref:`DATAANALYSIS component overwriteOutlier()` * :ref:`SAMPLEPROFILE component SAMPLEPROFILE()` * :ref:`CONTROL component calculateCleaningProgress()` * :ref:`updateOnCustomisation component updateOnCustomisation()`