DataFrames in Survey

The internal structure of a survey design is built on DataFrames. In fact, data is the only required argument of the constructor, and it must be an AbstractDataFrame.

Data manipulation

The provided DataFrame is altered by the SurveyDesign constructor in order to add columns for frequency and probability weights, sample and population sizes and, if necessary, strata and cluster information.

Notice the change in apisrs:

julia> apisrs = load_data("apisrs")
200×40 DataFrame
 Row │ Column1  cds             stype    name             sname                ⋯
     │ Int64    Int64           String1  String15         String               ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    1039  15739081534155  H        McFarland High   McFarland High       ⋯
   2 │    1124  19642126066716  E        Stowers (Cecil   Stowers (Cecil B.) E
   3 │    2868  30664493030640  H        Brea-Olinda Hig  Brea-Olinda High
   4 │    1273  19644516012744  E        Alameda Element  Alameda Elementary
   5 │    4926  40688096043293  E        Sunnyside Eleme  Sunnyside Elementary ⋯
   6 │    2463  19734456014278  E        Los Molinos Ele  Los Molinos Elementa
   7 │    2031  19647336058200  M        Northridge Midd  Northridge Middle
   8 │    1736  19647336017271  E        Glassell Park E  Glassell Park Elemen
  ⋮  │    ⋮           ⋮            ⋮            ⋮                       ⋮      ⋱
 194 │    4880  39686766042782  E        Tyler Skills El  Tyler Skills Element ⋯
 195 │     993  15636851531987  H        Desert Junior/S  Desert Junior/Senior
 196 │     969  15635291534775  H        North High       North High
 197 │    1752  19647336017446  E        Hammel Street E  Hammel Street Elemen
 198 │    4480  37683386039143  E        Audubon Element  Audubon Elementary   ⋯
 199 │    4062  36678196036222  E        Edison Elementa  Edison Elementary
 200 │    2683  24657716025621  E        Franklin Elemen  Franklin Elementary
                                                 36 columns and 185 rows omitted
julia> names(apisrs)
40-element Vector{String}:
 "Column1"
 "cds"
 "stype"
 "name"
 "sname"
 "snum"
 "dname"
 "dnum"
 "cname"
 "cnum"
 ⋮
 "col.grad"
 "grad.sch"
 "avg.ed"
 "full"
 "emer"
 "enroll"
 "api.stu"
 "pw"
 "fpc"
julia> srs = SurveyDesign(apisrs; weights=:pw);
julia> apisrs
200×45 DataFrame
 Row │ Column1  cds             stype    name             sname                ⋯
     │ Int64    Int64           String1  String15         String               ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    1039  15739081534155  H        McFarland High   McFarland High       ⋯
   2 │    1124  19642126066716  E        Stowers (Cecil   Stowers (Cecil B.) E
   3 │    2868  30664493030640  H        Brea-Olinda Hig  Brea-Olinda High
   4 │    1273  19644516012744  E        Alameda Element  Alameda Elementary
   5 │    4926  40688096043293  E        Sunnyside Eleme  Sunnyside Elementary ⋯
   6 │    2463  19734456014278  E        Los Molinos Ele  Los Molinos Elementa
   7 │    2031  19647336058200  M        Northridge Midd  Northridge Middle
   8 │    1736  19647336017271  E        Glassell Park E  Glassell Park Elemen
  ⋮  │    ⋮           ⋮            ⋮            ⋮                       ⋮      ⋱
 194 │    4880  39686766042782  E        Tyler Skills El  Tyler Skills Element ⋯
 195 │     993  15636851531987  H        Desert Junior/S  Desert Junior/Senior
 196 │     969  15635291534775  H        North High       North High
 197 │    1752  19647336017446  E        Hammel Street E  Hammel Street Elemen
 198 │    4480  37683386039143  E        Audubon Element  Audubon Elementary   ⋯
 199 │    4062  36678196036222  E        Edison Elementa  Edison Elementary
 200 │    2683  24657716025621  E        Franklin Elemen  Franklin Elementary
                                                 41 columns and 185 rows omitted
julia> names(apisrs)
45-element Vector{String}:
 "Column1"
 "cds"
 "stype"
 "name"
 "sname"
 "snum"
 "dname"
 "dnum"
 "cname"
 "cnum"
 ⋮
 "enroll"
 "api.stu"
 "pw"
 "fpc"
 "false_strata"
 "false_cluster"
 "_sampsize"
 "_popsize"
 "_allprobs"

Five columns were added:

false_strata - only in the case of no stratification
This column is necessary because, when making a ReplicateDesign, the bootweights function calls groupby on the column that holds the stratification variable. Without strata there is no such column, so a placeholder is added to keep bootweights general.
false_cluster - only in the case of no clustering
The reasoning is the same as in the case of no stratification.
_sampsize - sample sizes
_popsize - population sizes
Both are constant within each stratum, so they follow the stratification variable:

julia> apistrat = load_data("apistrat");
julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw);
julia> apistrat[:, [:stype, :_sampsize, :_popsize]]
200×3 DataFrame
 Row │ stype    _sampsize  _popsize 
     │ String1  Int64      Float64  
─────┼──────────────────────────────
   1 │ E              100    4421.0
   2 │ E              100    4421.0
   3 │ E              100    4421.0
   4 │ E              100    4421.0
   5 │ E              100    4421.0
   6 │ E              100    4421.0
   7 │ E              100    4421.0
   8 │ E              100    4421.0
  ⋮  │    ⋮         ⋮         ⋮
 194 │ E              100    4421.0
 195 │ E              100    4421.0
 196 │ E              100    4421.0
 197 │ H               50     755.0
 198 │ M               50    1018.0
 199 │ E              100    4421.0
 200 │ H               50     755.0
                    185 rows omitted

_allprobs - probability weights
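
The grouping rationale behind false_strata and the per-stratum sizes listed above can be illustrated with DataFrames alone. The following is a minimal sketch, not Survey.jl's implementation: the helper add_sizes, the toy values, the constant placeholder value false, and the choice to compute the population size as the sum of the weights in each group are all assumptions made for illustration.

```julia
using DataFrames

# Hypothetical helper (not part of Survey.jl): attach per-group sample and
# population sizes. It always groups by a column, like a generic routine would.
function add_sizes(df::AbstractDataFrame, strata::Symbol)
    transform(groupby(df, strata),
              :pw => length => :_sampsize,  # number of rows in the group
              :pw => sum => :_popsize)      # assumption: sum of the group's weights
end

# Stratified toy data (values invented for illustration).
stratified = DataFrame(stype = ["E", "E", "E", "H", "H"],
                       pw    = [4.0, 4.0, 4.0, 2.5, 2.5])
with_strata = add_sizes(stratified, :stype)

# Unstratified toy data: a constant placeholder column puts every row in a
# single group, so the same helper works without a special case.
unstratified = DataFrame(pw = [3.0, 3.0, 3.0, 3.0])
unstratified.false_strata = fill(false, nrow(unstratified))
without_strata = add_sizes(unstratified, :false_strata)
```

In with_strata the two size columns repeat one value per stratum, while in without_strata every row carries the sizes of the whole sample. This is the sense in which a placeholder column lets grouped code stay general.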

No column was added for frequency weights because other functions read the column passed through the weights argument directly. If weights is not specified, a column called _weights is added instead.
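
Because the constructor modifies the DataFrame it receives, one way to keep the original table unchanged is to pass a copy. A minimal sketch, assuming the Survey and DataFrames packages are installed and that load_data and SurveyDesign behave as in the session above; the names original and working are illustrative.

```julia
using Survey, DataFrames

original = load_data("apisrs")

# copy allocates new column vectors, so columns that the constructor adds
# to `working` do not appear in `original`.
working = copy(original)
srs = SurveyDesign(working; weights=:pw)

# Compare the column counts of the untouched table and the modified copy.
ncol(original), ncol(working)
```

The trade-off is the memory for a second copy of the data, which is usually acceptable for a table of this size.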