DataFrames in Survey
The internal structure of a survey design is built upon DataFrames. In fact, the data argument is the only required argument for the constructor, and it must be an AbstractDataFrame.
Data manipulation
The SurveyDesign constructor modifies the provided DataFrame in place, adding columns for frequency and probability weights, sample and population sizes and, if necessary, strata and cluster information.
Notice the change in apisrs:
julia> apisrs = load_data("apisrs")
200×40 DataFrame
 Row │ Column1  cds             stype    name             sname                 ⋯
     │ Int64    Int64           String1  String15         String                ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    1039  15739081534155  H        McFarland High   McFarland High        ⋯
   2 │    1124  19642126066716  E        Stowers (Cecil   Stowers (Cecil B.) E
   3 │    2868  30664493030640  H        Brea-Olinda Hig  Brea-Olinda High
   4 │    1273  19644516012744  E        Alameda Element  Alameda Elementary
   5 │    4926  40688096043293  E        Sunnyside Eleme  Sunnyside Elementary  ⋯
   6 │    2463  19734456014278  E        Los Molinos Ele  Los Molinos Elementa
   7 │    2031  19647336058200  M        Northridge Midd  Northridge Middle
   8 │    1736  19647336017271  E        Glassell Park E  Glassell Park Elemen
  ⋮  │    ⋮           ⋮            ⋮            ⋮                  ⋮            ⋱
 194 │    4880  39686766042782  E        Tyler Skills El  Tyler Skills Element  ⋯
 195 │     993  15636851531987  H        Desert Junior/S  Desert Junior/Senior
 196 │     969  15635291534775  H        North High       North High
 197 │    1752  19647336017446  E        Hammel Street E  Hammel Street Elemen
 198 │    4480  37683386039143  E        Audubon Element  Audubon Elementary    ⋯
 199 │    4062  36678196036222  E        Edison Elementa  Edison Elementary
 200 │    2683  24657716025621  E        Franklin Elemen  Franklin Elementary
                                                 36 columns and 185 rows omitted
julia> names(apisrs)
40-element Vector{String}:
 "Column1"
 "cds"
 "stype"
 "name"
 "sname"
 "snum"
 "dname"
 "dnum"
 "cname"
 "cnum"
 ⋮
 "col.grad"
 "grad.sch"
 "avg.ed"
 "full"
 "emer"
 "enroll"
 "api.stu"
 "pw"
 "fpc"
julia> srs = SurveyDesign(apisrs; weights=:pw);
julia> apisrs
200×45 DataFrame
 Row │ Column1  cds             stype    name             sname                 ⋯
     │ Int64    Int64           String1  String15         String                ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    1039  15739081534155  H        McFarland High   McFarland High        ⋯
   2 │    1124  19642126066716  E        Stowers (Cecil   Stowers (Cecil B.) E
   3 │    2868  30664493030640  H        Brea-Olinda Hig  Brea-Olinda High
   4 │    1273  19644516012744  E        Alameda Element  Alameda Elementary
   5 │    4926  40688096043293  E        Sunnyside Eleme  Sunnyside Elementary  ⋯
   6 │    2463  19734456014278  E        Los Molinos Ele  Los Molinos Elementa
   7 │    2031  19647336058200  M        Northridge Midd  Northridge Middle
   8 │    1736  19647336017271  E        Glassell Park E  Glassell Park Elemen
  ⋮  │    ⋮           ⋮            ⋮            ⋮                  ⋮            ⋱
 194 │    4880  39686766042782  E        Tyler Skills El  Tyler Skills Element  ⋯
 195 │     993  15636851531987  H        Desert Junior/S  Desert Junior/Senior
 196 │     969  15635291534775  H        North High       North High
 197 │    1752  19647336017446  E        Hammel Street E  Hammel Street Elemen
 198 │    4480  37683386039143  E        Audubon Element  Audubon Elementary    ⋯
 199 │    4062  36678196036222  E        Edison Elementa  Edison Elementary
 200 │    2683  24657716025621  E        Franklin Elemen  Franklin Elementary
                                                 41 columns and 185 rows omitted
julia> names(apisrs)
45-element Vector{String}:
 "Column1"
 "cds"
 "stype"
 "name"
 "sname"
 "snum"
 "dname"
 "dnum"
 "cname"
 "cnum"
 ⋮
 "enroll"
 "api.stu"
 "pw"
 "fpc"
 "false_strata"
 "false_cluster"
 "_sampsize"
 "_popsize"
 "_allprobs"
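The added columns can also be listed programmatically by comparing the column names before and after construction. The following is a minimal sketch, assuming a fresh copy of the data loaded with load_data; the variable names before and added are illustrative only.

```julia
using Survey, DataFrames

# Load a fresh copy so that no design columns are present yet.
apisrs = load_data("apisrs")

# Snapshot the column names before the constructor runs.
before = names(apisrs)

# The constructor adds its bookkeeping columns to `apisrs` in place.
srs = SurveyDesign(apisrs; weights=:pw)

# Names present now that were absent before construction.
added = setdiff(names(apisrs), before)
```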
Five columns were added:
false_strata - only in the case of no stratification
  This column is necessary because, when making a ReplicateDesign, the bootweights function uses groupby with a column representing the stratification variable. If there are no strata, there is no such column, so it should be added in order to keep bootweights general.
false_cluster - only in the case of no clustering
  The reasoning is the same as in the case of no stratification.
_sampsize - sample sizes
_popsize - population sizes
  These match the stratification variable:
julia> apistrat = load_data("apistrat");
julia> dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw);
julia> apistrat[:, [:stype, :_sampsize, :_popsize]]
200×3 DataFrame
 Row │ stype    _sampsize  _popsize
     │ String1  Int64      Float64
─────┼──────────────────────────────
   1 │ E              100    4421.0
   2 │ E              100    4421.0
   3 │ E              100    4421.0
   4 │ E              100    4421.0
   5 │ E              100    4421.0
   6 │ E              100    4421.0
   7 │ E              100    4421.0
   8 │ E              100    4421.0
  ⋮  │    ⋮         ⋮         ⋮
 194 │ E              100    4421.0
 195 │ E              100    4421.0
 196 │ E              100    4421.0
 197 │ H               50     755.0
 198 │ M               50    1018.0
 199 │ E              100    4421.0
 200 │ H               50     755.0
                    185 rows omitted
_allprobs - probability weights
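The reasoning given for false_strata can be illustrated with plain DataFrames. The sketch below is not the bootweights implementation; the toy data, the stratum_sizes helper, and the placeholder value "FALSE_STRATA" are assumptions made for illustration. It shows that code written around groupby on a strata column handles an unstratified design unchanged once a constant placeholder column exists.

```julia
using DataFrames

# Toy data (hypothetical): four sampled units in two strata.
df = DataFrame(stype = ["E", "E", "E", "H"], y = [10.0, 12.0, 11.0, 20.0])

# Hypothetical helper: count sampled units per stratum via groupby.
stratum_sizes(data, strata) = combine(groupby(data, strata), nrow => :n)

# Stratified case: one row per stratum.
by_stratum = stratum_sizes(df, :stype)

# Unstratified case: a constant placeholder column puts every row in a
# single group, so the same helper works without a special case.
df.false_strata = fill("FALSE_STRATA", nrow(df))
unstratified = stratum_sizes(df, :false_strata)
```

Without the placeholder, the helper would need a separate branch for designs that have no strata column at all.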
No column was added for frequency weights because the column passed through the weights argument is used by other functions, hence there is no need to add a new column. If weights is not specified, then a column called _weights is added.
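Because the constructor writes into the DataFrame it receives, code that needs the original table untouched can pass a copy instead. This is a sketch of one possible workflow rather than a recommendation from the package; it relies only on copy from DataFrames, and the variable names are illustrative.

```julia
using Survey, DataFrames

apisrs = load_data("apisrs")
ncols_before = ncol(apisrs)

# `copy` duplicates the columns, so the constructor's additions land on the
# copy and `apisrs` keeps its original set of columns.
srs = SurveyDesign(copy(apisrs); weights=:pw)

# Compare the column count of the original table with the earlier snapshot.
unchanged = ncol(apisrs) == ncols_before
```

The trade-off is the extra memory for the duplicated table, which may matter for large surveys.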