Installation
The Survey.jl package is registered. Regular Pkg commands can be used for installing the package:
julia> using Pkg
julia> Pkg.add("Survey")
Alternatively, press ] at the julia> prompt to enter Pkg mode and run:
pkg> add Survey
Tutorial
This tutorial assumes basic knowledge of statistics and survey analysis.
To begin this tutorial, load the package in your workspace:
julia> using Survey
Now load a survey dataset that you want to study. In this tutorial we will be using the Academic Performance Index (API) datasets for Californian schools. The datasets contain information for all schools with at least 100 students and for various probability samples of the data.
julia> apisrs = load_data("apisrs")
200×40 DataFrame
 Row │ Column1  cds             stype    name             sname                 ⋯
     │ Int64    Int64           String1  String15         String                ⋯
─────┼───────────────────────────────────────────────────────────────────────────
   1 │    1039  15739081534155  H        McFarland High   McFarland High        ⋯
   2 │    1124  19642126066716  E        Stowers (Cecil   Stowers (Cecil B.) E
   3 │    2868  30664493030640  H        Brea-Olinda Hig  Brea-Olinda High
   4 │    1273  19644516012744  E        Alameda Element  Alameda Elementary
   5 │    4926  40688096043293  E        Sunnyside Eleme  Sunnyside Elementary  ⋯
   6 │    2463  19734456014278  E        Los Molinos Ele  Los Molinos Elementa
   7 │    2031  19647336058200  M        Northridge Midd  Northridge Middle
   8 │    1736  19647336017271  E        Glassell Park E  Glassell Park Elemen
  ⋮  │    ⋮           ⋮            ⋮            ⋮                  ⋮            ⋱
 194 │    4880  39686766042782  E        Tyler Skills El  Tyler Skills Element  ⋯
 195 │     993  15636851531987  H        Desert Junior/S  Desert Junior/Senior
 196 │     969  15635291534775  H        North High       North High
 197 │    1752  19647336017446  E        Hammel Street E  Hammel Street Elemen
 198 │    4480  37683386039143  E        Audubon Element  Audubon Elementary    ⋯
 199 │    4062  36678196036222  E        Edison Elementa  Edison Elementary
 200 │    2683  24657716025621  E        Franklin Elemen  Franklin Elementary
                                                36 columns and 185 rows omitted
apisrs is a simple random sample of the Academic Performance Index of Californian schools. The load_data function loads it as a DataFrame. You can look at the column names of apisrs to get an idea of what the dataset contains.
julia> names(apisrs)
40-element Vector{String}:
 "Column1"
 "cds"
 "stype"
 "name"
 "sname"
 "snum"
 "dname"
 "dnum"
 "cname"
 "cnum"
 ⋮
 "col.grad"
 "grad.sch"
 "avg.ed"
 "full"
 "emer"
 "enroll"
 "api.stu"
 "pw"
 "fpc"
Next, build a survey design from your DataFrame:
julia> srs = SurveyDesign(apisrs; weights=:pw)
SurveyDesign:
data: 200×45 DataFrame
strata: none
cluster: none
popsize: [6194.0, 6194.0, 6194.0 … 6194.0]
sampsize: [200, 200, 200 … 200]
weights: [30.97, 30.97, 30.97 … 30.97]
allprobs: [0.0323, 0.0323, 0.0323 … 0.0323]
This is a simple random sample design with weights given by the column :pw of apisrs. You can also create more complex designs, such as stratified or cluster sample designs; the Manual covers the complete capabilities of the package. This tutorial only shows basic usage, so we will stick with a simple random sample.
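As a rough sketch of what the more complex designs mentioned above can look like, the snippet below builds a stratified and a one-stage cluster design. It assumes the apistrat and apiclus1 datasets bundled with the package, and the strata and clusters keyword arguments of SurveyDesign; check the Manual for the options available in your installed version.

```julia
using Survey

# Stratified sample: schools are sampled separately within each school type.
# Assumes the bundled "apistrat" dataset and the strata keyword.
apistrat = load_data("apistrat")
dstrat = SurveyDesign(apistrat; strata = :stype, weights = :pw)

# One-stage cluster sample: whole school districts are sampled.
# Assumes the bundled "apiclus1" dataset and the clusters keyword.
apiclus1 = load_data("apiclus1")
dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw)
```

Either design can then be passed to the same estimation functions used in the rest of this tutorial.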
Now you can analyse your design using the functionality provided by the package. For example, you can compute the estimated mean or population total for a given variable. Suppose you want the mean Academic Performance Index for the year 1999. If you are only interested in the estimated mean, pass the variable and your design directly to the mean function:
julia> mean(:api99, srs)
1×1 DataFrame
 Row │ mean
     │ Float64
─────┼─────────
   1 │ 624.685
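To make the estimate above less of a black box, here is a minimal sketch of a weighted mean in plain Julia. It is not the package's code: the data are hypothetical toy values, and weighted_mean is a helper name introduced only for illustration, assuming the usual estimator that divides the weighted sum by the sum of the weights.

```julia
# Hypothetical toy data: four sampled schools and their sampling weights.
scores = [600.0, 650.0, 700.0, 550.0]
pw     = fill(30.0, 4)   # equal weights, as in a simple random sample

# Weighted mean: each observation counts in proportion to its weight.
weighted_mean(y, w) = sum(w .* y) / sum(w)

est = weighted_mean(scores, pw)
println(est)
```

With equal weights, as in a simple random sample, this reduces to the ordinary sample mean; unequal weights shift the estimate towards the more heavily weighted units.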
If you also want to know the standard error of the mean, you need to convert the SurveyDesign to a ReplicateDesign using bootstrapping:
julia> bsrs = bootweights(srs; replicates = 1000)
ReplicateDesign{BootstrapReplicates}:
data: 200×1045 DataFrame
strata: none
cluster: none
popsize: [6194.0, 6194.0, 6194.0 … 6194.0]
sampsize: [200, 200, 200 … 200]
weights: [30.97, 30.97, 30.97 … 30.97]
allprobs: [0.0323, 0.0323, 0.0323 … 0.0323]
type: bootstrap
replicates: 1000
julia> mean(:api99, bsrs)
1×2 DataFrame
 Row │ mean     SE
     │ Float64  Float64
─────┼──────────────────
   1 │ 624.685  9.84669
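The idea behind a replicate-based standard error can be sketched in plain Julia. The code below is an illustration under stated assumptions, not the package's implementation: it uses hypothetical toy data, handles only an unstratified, unclustered sample, and uses a rescaled bootstrap in which each replicate draws n-1 units with replacement and rescales the weights accordingly. The helper names replicate_weights and bootstrap_se are made up for this example.

```julia
using Random

# Hypothetical toy sample with equal weights (simple random sample).
y = [600.0, 650.0, 700.0, 550.0, 620.0, 680.0]
w = fill(30.0, length(y))

weighted_mean(y, w) = sum(w .* y) / sum(w)

# One rescaled-bootstrap replicate: draw n-1 units with replacement and
# scale each unit's weight by n/(n-1) times the number of times it was drawn.
function replicate_weights(rng, w)
    n = length(w)
    counts = zeros(Int, n)
    for i in rand(rng, 1:n, n - 1)
        counts[i] += 1
    end
    return w .* counts .* (n / (n - 1))
end

function bootstrap_se(rng, y, w; replicates = 1000)
    θ = weighted_mean(y, w)
    θs = [weighted_mean(y, replicate_weights(rng, w)) for _ in 1:replicates]
    # Spread of the replicate estimates around the full-sample estimate.
    return sqrt(sum((θs .- θ) .^ 2) / replicates)
end

se = bootstrap_se(MersenneTwister(42), y, w)
println(se)
```

Each replicate keeps the total weight unchanged, so the replicate estimates vary only because different units are emphasised; the standard error summarises that variation.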
You can also pass a vector of variables. For example, you can estimate the mean of both the 1999 API and the 2000 API to compare performance from one year to the next:
julia> mean([:api99, :api00], bsrs)
2×3 DataFrame
 Row │ names   mean     SE
     │ String  Float64  Float64
─────┼──────────────────────────
   1 │ api99   624.685  9.84669
   2 │ api00   656.585  9.5409
The ratio function is also appropriate for studying the relationship between the two APIs. It takes the two variables as a vector of symbols, numerator first:
julia> ratio([:api00, :api99], bsrs)
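The quantity a ratio estimator targets can be sketched in plain Julia as the weighted total of one variable divided by the weighted total of another. The data below are hypothetical toy values, and ratio_estimate is a helper name introduced for illustration; it is not part of the package.

```julia
# Hypothetical toy data: API scores for the same four schools in two years.
api99 = [600.0, 650.0, 700.0, 550.0]
api00 = [630.0, 660.0, 735.0, 600.0]
pw    = fill(30.0, 4)

# Ratio of estimated totals: weighted total of the numerator variable
# divided by the weighted total of the denominator variable.
ratio_estimate(num, den, w) = sum(w .* num) / sum(w .* den)

r = ratio_estimate(api00, api99, pw)
println(r)   # a value above 1 means the 2000 scores are higher in aggregate
```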
If you want a statistic estimated separately for each domain, pass the domain variable as the second argument to your function. Suppose you want the estimated total number of students enrolled in schools in each county:
julia> total(:enroll, :cname, bsrs)
38×3 DataFrame
 Row │ total      SE         cname
     │ Float64    Float64    String
─────┼──────────────────────────────────────
   1 │ 1.95823e5  74731.2    Kern
   2 │ 867129.0   1.36622e5  Los Angeles
   3 │ 1.68786e5  63858.0    Orange
   4 │ 6720.49    6790.49    San Luis Obispo
   5 │ 30319.6    18197.6    San Francisco
   6 │ 6503.7     6481.36    Modoc
   7 │ 134224.0   46808.0    Alameda
   8 │ 64479.5    39542.0    Solano
  ⋮  │ ⋮          ⋮          ⋮
  32 │ 32642.4    23541.9    Kings
  33 │ 36203.9    32062.1    Shasta
  34 │ 12171.2    12502.9    Yolo
  35 │ 12976.4    13241.6    Calaveras
  36 │ 39239.0    30181.9    Napa
  37 │ 6410.79    6986.29    Lake
  38 │ 15392.1    15202.2    Merced
                             23 rows omitted
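Conceptually, a domain total adds up weight times value over the sampled units that fall in each domain. The sketch below shows this in plain Julia with hypothetical toy data; domain_totals is a helper name made up for this example and is not how the package computes its results.

```julia
# Hypothetical toy data: enrollment, sampling weight, and county for five schools.
enroll = [400.0, 250.0, 600.0, 300.0, 500.0]
pw     = fill(30.0, 5)
county = ["Kern", "Orange", "Kern", "Orange", "Modoc"]

# Estimated total per domain: sum of weight * value over the sampled
# units that belong to that domain.
function domain_totals(y, w, domain)
    totals = Dict{eltype(domain), Float64}()
    for (yi, wi, d) in zip(y, w, domain)
        totals[d] = get(totals, d, 0.0) + wi * yi
    end
    return totals
end

totals = domain_totals(enroll, pw, county)
for (name, t) in totals
    println(name, " => ", t)
end
```

The domain totals add up to the overall estimated total, which is a quick sanity check when working with domain estimates.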
You can also explore the data graphically. For example, you can make a histogram to better see the distribution of enrolled students, assigning the result to a variable so that it can be saved later:
julia> h = hist(srs, :enroll)
The REPL doesn't show the plot. To see it, you need to save it locally.
julia> import AlgebraOfGraphics.save
julia> save("hist.png", h)