Summary and findings¶
Following are some of the outcomes of the data reconciliation exercise:
1. Codes vs. Labels¶
The current dataset has codes assigned for each value of a categorical variable. For instance, the variable gd_floor which indicates the type of ground floor has the following codes associated with it:
- 1, which denotes “Mud”
- 2, which denotes “Brick/Stone”
- 3, which denotes “Timber”
- 4, which denotes “RC”
- 5, which denotes “Other”
An alternate way of storing this information would be in the form of labels, as opposed to codes. That is,
- mud, which denotes “Mud”
- brick_stone, which denotes “Brick/Stone”
- timber, which denotes “Timber”
- rc, which denotes “RC”
- other, which denotes “Other”
Storing them as numbers require less space, but it comes at the cost of the user needing a reference for understanding what the number means.
It has been decided that values for all variables will stored in the form of codes. However, when this information is shown in the portal, they will be replaced with their respective labels. These labels will be extracted from a variable-label mapping table separately stored in the database.
2. Pre/Post Variables¶
A number of variables in the Household table seek to capture information about the difference in living conditions of the households, before and after the earthquake. These include:
- Residence (respreq, resposq)
- Source of Water (h2o_pre, h2o_pos)
- Source of cooking fuel (fir_pre, fir_pos)
- Source of light (lit_pre, lit_pos)
- Type of toilet (toilet_pre, toilet_pos)
- Type of fixed assets owned (ast_pre, ast_pos)
Given that there may be more than one HHD per building, Careful attention needs to be paid when aggregating this information at a building level in later stages
3. Multi select questions¶
Some of the questions, because of their multiselect nature has more than one column associated with them):
- Superstructure type has 11 columns ranging from sup_str1 to sup_str11
- Type of geotechnical risk has 7 columns ranging from gersk_ls1 to gersk_ls3
- Type of secondary use has 10 columns ranging from secuse_ls1 to secuse_ls10
In the case of multiselect columns, additional data cleaning work would be required to make information more usable.
4. Damage Assessment Variables:¶
Information for damage assessment is spread across groups of variables. For example, for users to get complete information on building foundation damage, they will have to go through three variables viz. dm_fndtn1, dm_fndtn2, dm_fndtn3. Other variables that have a similar nature include:
- dm_roof1, dm_roof2, dm_roof3
- corn_sep1, corn_sep2, corn_sep3
- diag_cr1, diag_cr2, diag_cr3
- pl_fail1, pl_fail2, pl_fail3
- op_fail1, op_fail2, op_fail3
- op_fl_nl1, op_fl_nl2, op_fl_nl3
- dm_gabl1, dm_gabl2, dm_gabl3
- delam1, delam2, delam3
- col_fail1, col_fail2, col_fail3
- beam_fl1, beam_fl2, beam_fl3
- str_case1, str_case2, str_case3
- parapet1, parapet2, parapet3
- clad_glz1, clad_glz2, clad_glz3
- clad_glz1, clad_glz2, clad_glz3
Furthermore, information for “No damage” is contained as a categorical value within the first out of three variable, as illustrated by the picure below.
![]()
Variable names for damage assessment columns need to include severity related information, i.e. dm_fndtn_severe, dm_fndtn_moderate, dm_fndtn_insignfcant. In additions, information about ‘no damage should be captured in a separate variable.
5. Missing variable definitions¶
The data dictionary provided by CBS had left out definitions and range for the following 24 variables across the 8 tables. Fortunately, these definitions have been successfully extracted through variables labels available in theie respective SPSS(.sav) files.
Main table:
- rhouse_sa & rhouse_da: Number of residential house within same area and Number of residential house outside Enumeration Area
- ndam_c: No damage non-residential house number
- pdam_c: Partial damage non-residential house number
- cdam_c: Complete damage non-residential house number
Building table:
- delam1, delam2 and delam3: They represent damage assesment of delaminated structures
- fam_cn: Count of families in the building (or) Total family in the house
- hgt_pre & hgt_pos: Height of house in feet before and after earthquake
- pl_area: Plinth area in sq ft of house
- age: Age of house
- floor_pre & floor_pos: Number of floor before and after earthquake
Individual Table:
- age: Member’s age
Household Table:
- age: Age of household head
- hhd_size: Household size
- death_cn: Number of Death in the family within 12 months period
- loss_cn: Number of missing/handicapped/serious injured due to earthquake in the family
- edrop_cn: Number of students (level<=10) in the family who dropped school.
- pdrop_cn: Number of pregnent woman in the family who dropped regular checkup.
- vdrop_cn: Number of children who dropped vaccination due to earthquake.
- oc_ch_cn: Number of family member who changed/dropped occupation due to earthquake.
- respreqd: District code of usual residence of household head before earthquake
- resposqd: District code residence place of household after earthquake
House Other Place Table:
- haop_sn: Serial Number of House in other place
Death Table:
- age: Age of the dead person
Injured/Missing Table:
- age: Age of the person who is missing/injured
There are some common variable names that capture different information across different tables (like say, age and gender). To avoid confusion, all variable names need to be revisited to ensure they are more representative of the information that they hold