Summary and findings

The data reconciliation exercise produced the following findings:

1. Codes vs. Labels

The current dataset assigns a code to each value of a categorical variable. For instance, the variable gd_floor, which indicates the type of ground floor, has the following codes associated with it:

  1. 1, which denotes “Mud”
  2. 2, which denotes “Brick/Stone”
  3. 3, which denotes “Timber”
  4. 4, which denotes “RC”
  5. 5, which denotes “Other”

An alternate way of storing this information would be in the form of labels, as opposed to codes. That is,

  1. mud, which denotes “Mud”
  2. brick_stone, which denotes “Brick/Stone”
  3. timber, which denotes “Timber”
  4. rc, which denotes “RC”
  5. other, which denotes “Other”

Storing the values as numbers requires less space, but it comes at the cost of the user needing a reference to understand what each number means.

It has been decided that values for all variables will be stored in the form of codes. However, when this information is shown in the portal, the codes will be replaced with their respective labels. These labels will be extracted from a variable-label mapping table stored separately in the database.
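
As a rough illustration of this lookup, the sketch below replaces stored codes with labels at display time. The in-memory dictionary stands in for the variable-label mapping table; its structure and the function name label_for are assumptions made for the example, not the portal's actual schema.

```python
# Hypothetical stand-in for the variable-label mapping table in the database.
# Keys are (variable name, code); values are the display labels.
VARIABLE_LABELS = {
    ("gd_floor", 1): "Mud",
    ("gd_floor", 2): "Brick/Stone",
    ("gd_floor", 3): "Timber",
    ("gd_floor", 4): "RC",
    ("gd_floor", 5): "Other",
}


def label_for(variable, code):
    """Return the display label for a stored code, or the code as text if unmapped."""
    return VARIABLE_LABELS.get((variable, code), str(code))


# A record as stored (codes) and as it might be shown in the portal (labels).
stored = {"gd_floor": 2}
displayed = {name: label_for(name, code) for name, code in stored.items()}
print(displayed)  # {'gd_floor': 'Brick/Stone'}
```

Falling back to the raw code for unmapped values is one possible design choice; it keeps the display usable even when the mapping table is incomplete.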

2. Pre/Post Variables

A number of variables in the Household table seek to capture information about the difference in living conditions of the households, before and after the earthquake. These include:

  1. Residence (respreq, resposq)
  2. Source of Water (h2o_pre, h2o_pos)
  3. Source of cooking fuel (fir_pre, fir_pos)
  4. Source of light (lit_pre, lit_pos)
  5. Type of toilet (toilet_pre, toilet_pos)
  6. Type of fixed assets owned (ast_pre, ast_pos)

Given that there may be more than one household (HHD) per building, careful attention needs to be paid when aggregating this information at the building level in later stages.
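
One way to see why this aggregation needs care is sketched below: households are grouped by building, and buildings whose households report different values are flagged. The building_id key and the coded values are hypothetical; the real linking key between the Household and Building tables may differ.

```python
from collections import defaultdict

# Hypothetical household records; building_id and the codes are illustrative only.
households = [
    {"building_id": "B1", "h2o_pre": 1, "h2o_pos": 3},
    {"building_id": "B1", "h2o_pre": 1, "h2o_pos": 1},
    {"building_id": "B2", "h2o_pre": 2, "h2o_pos": 2},
]


def distinct_values_by_building(rows, variable):
    """Collect the sorted distinct values of one household variable per building."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["building_id"]].add(row[variable])
    return {building: sorted(values) for building, values in grouped.items()}


water_after = distinct_values_by_building(households, "h2o_pos")
# Buildings with more than one distinct value cannot be summarised by a single code.
conflicting = [b for b, values in water_after.items() if len(values) > 1]
print(water_after)
print(conflicting)
```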

3. Multiselect questions

Some of the questions, because of their multiselect nature, have more than one column associated with them:

  1. Superstructure type has 11 columns ranging from sup_str1 to sup_str11
  2. Type of geotechnical risk has 7 columns ranging from gersk_ls1 to gersk_ls7
  3. Type of secondary use has 10 columns ranging from secuse_ls1 to secuse_ls10

In the case of multiselect columns, additional data cleaning work would be required to make the information more usable.
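
A minimal sketch of one such cleaning step follows, assuming each multiselect column is a 0/1 flag for a single option. That encoding is an assumption and should be checked against the data dictionary. The helper name selected_options is hypothetical.

```python
def selected_options(record, prefix, count):
    """Return the option numbers whose flag column is set, e.g. sup_str3 -> 3."""
    return [i for i in range(1, count + 1) if record.get(f"{prefix}{i}") == 1]


# Hypothetical building record; columns missing from the dict count as unselected.
building = {"sup_str1": 0, "sup_str2": 1, "sup_str7": 1}
print(selected_options(building, "sup_str", 11))  # [2, 7]
```

Collapsing the columns into a single list of selected options makes it easier to join each option against the variable-label mapping table.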

4. Damage Assessment Variables

Information for damage assessment is spread across groups of variables. For example, to get complete information on building foundation damage, users have to go through three variables: dm_fndtn1, dm_fndtn2 and dm_fndtn3. Other variable groups of a similar nature include:

  1. dm_roof1, dm_roof2, dm_roof3
  2. corn_sep1, corn_sep2, corn_sep3
  3. diag_cr1, diag_cr2, diag_cr3
  4. pl_fail1, pl_fail2, pl_fail3
  5. op_fail1, op_fail2, op_fail3
  6. op_fl_nl1, op_fl_nl2, op_fl_nl3
  7. dm_gabl1, dm_gabl2, dm_gabl3
  8. delam1, delam2, delam3
  9. col_fail1, col_fail2, col_fail3
  10. beam_fl1, beam_fl2, beam_fl3
  11. str_case1, str_case2, str_case3
  12. parapet1, parapet2, parapet3
  13. clad_glz1, clad_glz2, clad_glz3

Furthermore, information for “No damage” is contained as a categorical value within the first of the three variables, as illustrated by the picture below.

_images/dm_fndtn1.png

Variable names for damage assessment columns need to include severity-related information, i.e. dm_fndtn_severe, dm_fndtn_moderate, dm_fndtn_insignfcant. In addition, information about ‘no damage’ should be captured in a separate variable.
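
The restructuring could look roughly like the sketch below. Two things are assumed and must be verified against the data: that suffixes 1, 2 and 3 correspond to severe, moderate and insignificant damage in that order, and the code that marks “No damage” (NO_DAMAGE_CODE is a placeholder value, not taken from the dataset). The function name restructure_damage is hypothetical.

```python
# Assumed mapping from column suffix to severity; verify against the data dictionary.
SEVERITY_SUFFIXES = {"1": "severe", "2": "moderate", "3": "insignfcant"}
# Placeholder only: the real code that marks "No damage" must come from the data.
NO_DAMAGE_CODE = 4


def restructure_damage(record, stem):
    """Rename stem1..stem3 to severity-based names and split out a no-damage flag."""
    result = {}
    no_damage = record.get(f"{stem}1") == NO_DAMAGE_CODE
    for suffix, severity in SEVERITY_SUFFIXES.items():
        value = record.get(f"{stem}{suffix}")
        # The no-damage code moves to its own variable, out of the first column.
        if suffix == "1" and no_damage:
            value = None
        result[f"{stem}_{severity}"] = value
    result[f"{stem}_no_damage"] = no_damage
    return result


undamaged = restructure_damage(
    {"dm_fndtn1": 4, "dm_fndtn2": None, "dm_fndtn3": None}, "dm_fndtn"
)
print(undamaged)
```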

5. Missing variable definitions

The data dictionary provided by CBS left out definitions and ranges for the following 24 variables across the 8 tables. Fortunately, these definitions have been successfully extracted from the variable labels available in their respective SPSS (.sav) files.

  1. Main table:

    • rhouse_sa & rhouse_da: Number of residential houses within the same area and number of residential houses outside the Enumeration Area, respectively
    • ndam_c: No damage non-residential house number
    • pdam_c: Partial damage non-residential house number
    • cdam_c: Complete damage non-residential house number
  2. Building table:

    • delam1, delam2 and delam3: Damage assessment of delaminated structures
    • fam_cn: Count of families in the building (or) Total family in the house
    • hgt_pre & hgt_pos: Height of house in feet before and after earthquake
    • pl_area: Plinth area in sq ft of house
    • age: Age of house
    • floor_pre & floor_pos: Number of floors before and after earthquake
  3. Individual Table:

    • age: Member’s age
  4. Household Table:

    • age: Age of household head
    • hhd_size: Household size
    • death_cn: Number of deaths in the family within a 12-month period
    • loss_cn: Number of missing/handicapped/seriously injured family members due to earthquake
    • edrop_cn: Number of students (level<=10) in the family who dropped out of school
    • pdrop_cn: Number of pregnant women in the family who dropped regular checkups
    • vdrop_cn: Number of children who dropped vaccination due to earthquake
    • oc_ch_cn: Number of family members who changed/dropped occupation due to earthquake
    • respreqd: District code of usual residence of household head before earthquake
    • resposqd: District code of the household's place of residence after earthquake
  5. House Other Place Table:

    • haop_sn: Serial Number of House in other place
  6. Death Table:

    • age: Age of the dead person
  7. Injured/Missing Table:

    • age: Age of the person who is missing/injured

Some common variable names, such as age and gender, capture different information in different tables. To avoid confusion, all variable names need to be revisited to ensure they better represent the information they hold.
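
As a starting point for that review, the sketch below finds names shared by more than one table and prefixes them with the table name. The table keys and the prefixing scheme are hypothetical, and only a few variables per table are shown; the final naming convention remains to be decided.

```python
from collections import Counter

# Abbreviated variable lists per table; the table keys are illustrative names.
TABLE_VARIABLES = {
    "building": ["age", "fam_cn", "pl_area"],
    "individual": ["age"],
    "household": ["age", "hhd_size"],
    "death": ["age"],
}


def qualify_shared_names(table_variables):
    """Prefix a variable with its table name when the name occurs in several tables."""
    counts = Counter(name for names in table_variables.values() for name in names)
    return {
        table: [f"{table}_{name}" if counts[name] > 1 else name for name in names]
        for table, names in table_variables.items()
    }


renamed = qualify_shared_names(TABLE_VARIABLES)
print(renamed["building"])  # ['building_age', 'fam_cn', 'pl_area']
```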