Title: | Survey Sampling Procedures |
---|---|
Description: | Sampling procedures from the book 'Stichproben - Methoden und praktische Umsetzung mit R' by Goeran Kauermann and Helmut Kuechenhoff (2010). |
Authors: | Juliane Manitz [aut,cre], Mark Hempelmann [ctb], Goeran Kauermann [ctb], Helmut Kuechenhoff [aut], Shuai Shao [ctb], Cornelia Oberhauser [ctb], Nina Westerheide [ctb], Manuel Wiesenfarth [ctb] |
Maintainer: | Juliane Manitz <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2.4 |
Built: | 2024-11-11 06:16:20 UTC |
Source: | https://github.com/jmanitz/samplingbook |
Sampling procedures from the book 'Stichproben - Methoden und praktische Umsetzung mit R' by Goeran Kauermann and Helmut Kuechenhoff (2010).
Package: | Samplingbook |
Type: | Package |
Version: | 1.2.1 |
Date: | 2016-07-05 |
License: | GPL(>=2) |
LazyLoad: | yes |
Index:
election | German Parliament Election Data |
htestimate | Horvitz-Thompson Estimator |
influenza | Population and Cases of Influenza in Administrative Districts of Germany |
mbes | Model Based Estimation |
money | Money Data Frame |
pop | Small Suppositious Sampling Example |
pps.sampling | Sampling with Probabilities Proportional to Size |
sample.size.mean | Sample Size Calculation for Mean Estimation |
sample.size.prop | Sample Size Calculation for Proportion Estimation |
Samplingbook-package | Survey Sampling Procedures |
Smean | Sampling Mean Estimation |
Sprop | Sampling Proportion Estimation |
stratamean | Stratified Sample Mean Estimation |
stratasamp | Sample Size Calculation for Stratified Sampling |
tax | Hypothetical Tax Refund Data Frame |
Author: Juliane Manitz <[email protected]>,
contributions by
Mark Hempelmann <[email protected]>,
Goeran Kauermann <[email protected]>,
Helmut Kuechenhoff <[email protected]>,
Shuai Shao <[email protected]>,
Cornelia Oberhauser <[email protected]>,
Nina Westerheide <[email protected]>,
Manuel Wiesenfarth <[email protected]>
Maintainer: Juliane Manitz <[email protected]>
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
Data frame with number of citizens eligible to vote and results of the elections in 2002 and 2005 for the German Bundestag, the first chamber of the German parliament.
data(election)
A data frame with 299 observations (corresponding to constituencies) on the following 13 variables.
state
factor, the 16 German federal states
eligible_02
number of citizens eligible to vote in 2002
SPD_02
a numeric vector, percentage for the Social Democrats SPD in 2002
UNION_02
a numeric vector, percentage for the conservative Christian Democrats CDU/CSU in 2002
GREEN_02
a numeric vector, percentage for the Greens in 2002
FDP_02
a numeric vector, percentage for the Liberal Party FDP in 2002
LEFT_02
a numeric vector, percentage for the Left Party PDS in 2002
eligible_05
number of citizens eligible to vote in 2005
SPD_05
a numeric vector, percentage for the Social Democrats SPD in 2005
UNION_05
a numeric vector, percentage for the conservative Christian Democrats CDU/CSU in 2005
GREEN_05
a numeric vector, percentage for the Greens in 2005
FDP_05
a numeric vector, percentage for the Liberal Party FDP in 2005
LEFT_05
a numeric vector, percentage for the Left Party in 2005
German Federal Elections
Half of the Members of the German Bundestag are elected directly from Germany's 299 constituencies, the other half via the parties' Land lists. Accordingly, each voter has two votes in the elections to the German Bundestag. The first vote, allowing voters to elect their local representatives to the Bundestag, decides which candidates are sent to Parliament from the constituencies. The second vote is cast for a party list. It is this second vote that determines the relative strengths of the parties represented in the Bundestag. At least 598 Members of the German Bundestag are elected in this way. In addition, there are certain circumstances in which some candidates win what are known as 'overhang mandates' when the seats are being distributed.
The data set provides the percentage of second votes for each party, which determines the number of seats each party gets in parliament. These percentages are calculated as the number of votes for a party divided by the number of valid votes.
The data is provided by the R package flexclust.
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
Homepage of the Bundestag: http://www.bundestag.de.
Friedrich Leisch. A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51 (2), 526-544, 2006.
data(election)
summary(election)
# 1) Draw a simple sample of size n=20
n <- 20
set.seed(67396)
index <- sample(1:nrow(election), size=n)
sample1 <- election[index,]
Smean(sample1$SPD_02, N=nrow(election))
# true mean
mean(election$SPD_02)
# 2) Estimate sample size to forecast proportion of SPD in election of 2005
sample.size.prop(e=0.01, P=mean(election$SPD_02), N=Inf)
# 3) Usage of previous knowledge by model based estimation
# draw sample of size n = 20
N <- nrow(election)
set.seed(67396)
sample <- election[sort(sample(1:N, size=20)),]
# secondary information SPD in 2002
X.mean <- mean(election$SPD_02)
# forecast proportion of SPD in election of 2005
mbes(SPD_05 ~ SPD_02, data=sample, aux=X.mean, N=N, method='all')
# true value
Y.mean <- mean(election$SPD_05)
Y.mean
# Use a second predictor variable
X.mean2 <- c(mean(election$SPD_02), mean(election$GREEN_02))
# forecast proportion of SPD in election of 2005 with two predictors
mbes(SPD_05 ~ SPD_02+GREEN_02, data=sample, aux=X.mean2, N=N, method='regr')
Calculates the Horvitz-Thompson estimate with different methods for variance estimation, such as Yates-Grundy, Hansen-Hurwitz, and Hajek.
htestimate(y, N, PI, pk, pik, method = 'yg')
y | vector of observations |
N | integer for population size |
PI | square matrix of second order inclusion probabilities for the sample elements |
pk | vector of first order inclusion probabilities of length n for the sample elements |
pik | an optional vector of first order inclusion probabilities of length N for all population elements; used for the Hajek method |
method | method to be used for variance estimation. Options are 'yg' (Yates-Grundy, default), 'ht' (Horvitz-Thompson), 'hh' (Hansen-Hurwitz), and 'ha' (Hajek) |
For the methods 'yg' and 'ht', the matrix PI of second order inclusion probabilities has to be provided; for 'hh' and 'ha', the vector pk of first order inclusion probabilities has to be specified. Additionally, for the Hajek method 'ha', pik can be specified; otherwise an approximate Hajek method is used.
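For orientation, the Horvitz-Thompson point estimate of the population mean has the standard form (1/N) * sum(y_k / pi_k), where pi_k are the first order inclusion probabilities of the sampled elements. A minimal sketch, reusing the influenza setup from the example below (the object names pps, smp, pk are local to this sketch):

data(influenza)
set.seed(108506)
pps <- pps.sampling(z = influenza$population, n = 20, method = 'midzuno')
smp <- influenza[pps$sample, ]         # sampled districts
pk  <- pps$pik[pps$sample]             # first order inclusion probabilities of the sample
N   <- nrow(influenza)
sum(smp$cases / pk) / N                # Horvitz-Thompson estimate of the mean, by hand
# should essentially agree with the 'mean' component of
# htestimate(smp$cases, N = N, pk = pk, method = 'hh')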
The function htestimate returns a list consisting of the components
call | is a list of call components |
mean | mean estimate |
se | standard error of the mean estimate |
Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
data(influenza)
summary(influenza)
# pps.sampling()
set.seed(108506)
pps <- pps.sampling(z=influenza$population, n=20, method='midzuno')
sample <- influenza[pps$sample,]
# htestimate()
N <- nrow(influenza)
# exact variance estimate
PI <- pps$PI
htestimate(sample$cases, N=N, PI=PI, method='yg')
htestimate(sample$cases, N=N, PI=PI, method='ht')
# approximate variance estimate
pk <- pps$pik[pps$sample]
htestimate(sample$cases, N=N, pk=pk, method='hh')
pik <- pps$pik
htestimate(sample$cases, N=N, pk=pk, pik=pik, method='ha')
# without pik just approximate calculation of Hajek method
htestimate(sample$cases, N=N, pk=pk, method='ha')
# calculate confidence interval based on normal distribution for number of cases
est.ht <- htestimate(sample$cases, N=N, PI=PI, method='ht')
est.ht$mean*N
lower <- est.ht$mean*N - qnorm(0.975)*N*est.ht$se
upper <- est.ht$mean*N + qnorm(0.975)*N*est.ht$se
c(lower, upper)
# true number of influenza cases
sum(influenza$cases)
The data frame influenza
provides cases of influenza and inhabitants for administrative districts of Germany in 2007.
data(influenza)
A data frame with 424 observations on the following 4 variables.
id
a numeric vector
district
a factor with levels LK Aachen, LK Ahrweiler, ..., SK Zweibruecken; names of the administrative districts in Germany
population
a numeric vector specifying the number of inhabitants in the specific administrative district
cases
a numeric vector specifying the number of influenza cases in the specific administrative district
The data refer to 2007. If you use the population numbers for later years, be aware of local government reorganizations, e.g. district mergers.
Database SurvStat of Robert Koch-Institute. Many thanks to Hermann Claus.
Database of Robert Koch-Institute http://www3.rki.de/SurvStat/
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
data(influenza)
summary(influenza)
# 1) Usage of pps.sampling
set.seed(108506)
pps <- pps.sampling(z=influenza$population, n=20, method='midzuno')
pps
sample <- influenza[pps$sample,]
sample
# 2) Usage of htestimate
set.seed(108506)
pps <- pps.sampling(z=influenza$population, n=20, method='midzuno')
sample <- influenza[pps$sample,]
# htestimate()
N <- nrow(influenza)
# exact variance estimate
PI <- pps$PI
htestimate(sample$cases, N=N, PI=PI, method='ht')
htestimate(sample$cases, N=N, PI=PI, method='yg')
# approximate variance estimate
pk <- pps$pik[pps$sample]
htestimate(sample$cases, N=N, pk=pk, method='hh')
pik <- pps$pik
htestimate(sample$cases, N=N, pk=pk, pik=pik, method='ha')
# without pik just approximate calculation of Hajek method
htestimate(sample$cases, N=N, pk=pk, method='ha')
# calculate confidence interval based on normal distribution for number of cases
est.ht <- htestimate(sample$cases, N=N, PI=PI, method='ht')
est.ht$mean*N
lower <- est.ht$mean*N - qnorm(0.975)*N*est.ht$se
upper <- est.ht$mean*N + qnorm(0.975)*N*est.ht$se
c(lower, upper)
# true number of influenza cases
sum(influenza$cases)
mbes is used for model based estimation of population means using auxiliary variables. Difference, ratio and regression estimates are available.
mbes(formula, data, aux, N = Inf, method = 'all', level = 0.95, ...)
formula | object of class formula, a symbolic description of the model |
data | data frame containing the variables in the model |
aux | known mean of the auxiliary variable, which provides the secondary information; for method='regr' with several auxiliary variables, a vector of their means |
N | positive integer for population size. Default is N=Inf, which means calculation without finite population correction |
method | estimation method. Options are 'simple', 'diff', 'ratio', 'regr', and 'all' (default) |
level | coverage probability for confidence intervals. Default is level=0.95 |
... | further options for the linear regression model |
The option method='simple' calculates the simple sample estimate without using the auxiliary variable. The option method='diff' calculates the difference estimate, method='ratio' the ratio estimate, and method='regr' the regression estimate, which is based on the selected model. The option method='all' calculates the simple and all model based estimates.
For the methods 'diff', 'ratio' and 'all', the formula has to be of the form y ~ x, with y the primary and x the secondary information.
For method 'regr', the formula is the symbolic description of the linear regression model. In this case, more than one auxiliary variable can be used. Then, aux has to be a vector of the same length as the number of auxiliary variables, in the order specified in the formula.
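For orientation, the estimators have the usual textbook form: the difference estimate is ybar + (Xbar - xbar), the ratio estimate is ybar/xbar * Xbar, and the regression estimate is ybar + b*(Xbar - xbar) with b the least-squares slope. A minimal sketch, assuming the money data used in the examples below, reproduces the point estimates by hand (the standard errors and confidence intervals reported by mbes are not recomputed here):

data(money)
mu.X <- mean(money$X)                  # known population mean of the auxiliary variable
x <- money$X[!is.na(money$y)]          # auxiliary values in the sample
y <- na.omit(money$y)                  # primary values in the sample
b <- unname(coef(lm(y ~ x))[2])        # least-squares slope
c(diff  = mean(y) + (mu.X - mean(x)),
  ratio = mean(y) / mean(x) * mu.X,
  regr  = mean(y) + b * (mu.X - mean(x)))
# the point estimates should agree with mbes(y ~ x, aux = mu.X, N = 13, method = 'all')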
The function mbes returns an object, which is a list consisting of the components
call | is a list of call components |
info | is a list of further information components |
simple | is a list of result components, if method='simple' or method='all' is chosen |
diff | is a list of result components, if method='diff' or method='all' is chosen |
ratio | is a list of result components, if method='ratio' or method='all' is chosen |
regr | is a list of result components, if method='regr' or method='all' is chosen |
Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
## 1) simple suppositious example
data(pop)
# Draw a random sample of size 3
set.seed(802016)
data <- pop[sample(1:5, size=3),]
names(data) <- c('id','x','y')
# difference estimator
mbes(formula=y~x, data=data, aux=15, N=5, method='diff', level=0.95)
# ratio estimator
mbes(formula=y~x, data=data, aux=15, N=5, method='ratio', level=0.95)
# regression estimator
mbes(formula=y~x, data=data, aux=15, N=5, method='regr', level=0.95)
## 2) Bundestag election
data(election)
# draw sample of size n = 20
N <- nrow(election)
set.seed(67396)
sample <- election[sort(sample(1:N, size=20)),]
# secondary information SPD in 2002
X.mean <- mean(election$SPD_02)
# forecast proportion of SPD in election of 2005
mbes(SPD_05 ~ SPD_02, data=sample, aux=X.mean, N=N, method='all')
# true value
Y.mean <- mean(election$SPD_05)
Y.mean
# Use a second predictor variable
X.mean2 <- c(mean(election$SPD_02), mean(election$GREEN_02))
# forecast proportion of SPD in election of 2005 with two predictors
mbes(SPD_05 ~ SPD_02+GREEN_02, data=sample, aux=X.mean2, N=N, method='regr')
## 3) money sample
data(money)
mu.X <- mean(money$X)
x <- money$X[which(!is.na(money$y))]
y <- na.omit(money$y)
# estimation
mbes(y~x, aux=mu.X, N=13, method='all')
## 4) model based two-phase sampling with mbes()
id <- 1:1000
x <- rep(c(1,0,1,0), times=c(10,90,70,830))
y <- rep(c(1,0,NA), times=c(15,85,900))
phase <- rep(c(2,1), times=c(100,900))
data <- data.frame(id, x, y, phase)
# mean of x from the first phase
mean.x <- mean(data$x)
mean.x
N1 <- length(data$x)
# estimation for y
est.y <- mbes(y~x, data=data, aux=mean.x, N=N1, method='ratio')
est.y
# correction of the standard error for the uncertainty in the first phase
v.y <- var(data$y, na.rm=TRUE)
se.y <- sqrt(est.y$ratio$se^2 + v.y/N1)
se.y
# corrected confidence interval
lower <- est.y$ratio$mean - qnorm(0.975)*se.y
upper <- est.y$ratio$mean + qnorm(0.975)*se.y
c(lower, upper)
The data set provides guesses and true values of the money in students' wallets.
data(money)
A data frame with 13 observations (corresponding to the students) on the following 3 variables.
id
a numeric vector of identification number
X
a numeric vector of secondary information, guesses of money in the wallet
y
a numeric vector of primary information, the counted money in the wallet; NA means the subject was not included in the sample
During a lecture, an experiment was carried out in which the students were asked to guess the current amount of money in their wallets. A simple random sample of these students was drawn, and they counted the money in their wallets exactly. Using this secondary information, model based estimation of the population mean is possible.
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
data(money)
print(money)
# Usage of mbes()
mu.X <- mean(money$X)
x <- money$X[which(!is.na(money$y))]
y <- na.omit(money$y)
# estimation
mbes(y~x, aux=mu.X, N=13, method='all')
pop is a suppositious data frame for a small population with 5 elements. It is used for simple illustration of survey sampling estimators.
data(pop)
A data frame with 5 observations on the following 3 variables.
id
a numeric vector of individual identification values
X
a numeric vector of first characteristic
Y
a numeric vector of second characteristic
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
data(pop)
print(pop)
## 1) Usage of Smean()
data(pop)
Y <- pop$Y
Y
# Draw a random sample of size 3 from pop
set.seed(93456)
y <- sample(x = Y, size = 3)
sort(y)
# Estimation with finite population correction
est <- Smean(y = y, N = length(pop$Y))
est
## 2) Usage of mbes()
data(pop)
# Draw a random sample of size 3
set.seed(802016)
data <- pop[sample(1:5, size=3),]
names(data) <- c('id','x','y')
# difference estimator
mbes(formula=y~x, data=data, aux=15, N=5, method='diff', level=0.95)
# ratio estimator
mbes(formula=y~x, data=data, aux=15, N=5, method='ratio', level=0.95)
# regression estimator
mbes(formula=y~x, data=data, aux=15, N=5, method='regr', level=0.95)
The function provides sampling techniques with inclusion probabilities proportional to the size of a quantity z.
pps.sampling(z, n, id = 1:N, method = 'sampford', return.PI = FALSE)
z | vector of quantities which determine the sampling probabilities in the population |
n | positive integer for sample size |
id | an optional vector of identification values for the population elements. Default is id=1:N |
method | the sampling method to be used. Options are 'sampford' (default), 'tille', 'midzuno', and 'madow' |
return.PI | logical. If TRUE, the matrix of second order inclusion probabilities for all population elements is returned |
The different methods vary in their run time. Therefore, method='sampford' is stopped if N > 200 or if n/N < 0.3, and method='tille' is stopped if N > 500. For large populations, use method='midzuno' or method='madow'.
The function pps.sampling returns a list consisting of the components
call | is a list of call components |
sample | the resulting sample |
pik | first order inclusion probabilities |
PI | second order inclusion probabilities for the sample elements |
PI.full | second order inclusion probabilities for all population elements |
Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
## 1) simple suppositious example
data <- data.frame(id = 1:7, z = c(1.8, 2, 3.2, 2.9, 1.5, 2.0, 2.2))
# Usage of pps.sampling for Sampford method
set.seed(178209)
pps.sample_sampford <- pps.sampling(z=data$z, n=2, method='sampford', return.PI=FALSE)
pps.sample_sampford
# sampling elements
id.sample <- pps.sample_sampford$sample
id.sample
# other methods
set.seed(178209)
pps.sample_tille <- pps.sampling(z=data$z, n=2, method='tille')
pps.sample_tille
set.seed(178209)
pps.sample_midzuno <- pps.sampling(z=data$z, n=2, method='midzuno')
pps.sample_midzuno
set.seed(178209)
pps.sample_madow <- pps.sampling(z=data$z, n=2, method='madow')
pps.sample_madow
## 2) influenza
data(influenza)
summary(influenza)
set.seed(108506)
pps <- pps.sampling(z=influenza$population, n=20, method='midzuno')
pps
sample <- influenza[pps$sample,]
sample
The function sample.size.mean
returns the sample size needed for mean estimation, either with or without consideration of the finite population correction.
sample.size.mean(e, S, N = Inf, level = 0.95)
e | positive number specifying the precision, which is half the width of the confidence interval |
S | standard deviation in the population |
N | positive integer for population size. Default is N=Inf, which means calculation without finite population correction |
level | coverage probability for confidence intervals. Default is level=0.95 |
The function sample.size.mean returns a list consisting of the components
call | is a list of call components |
n | estimate of the required sample size |
Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
# sample size for precision e=4
sample.size.mean(e=4, S=10, N=300)
# sample size for precision e=1
sample.size.mean(e=1, S=10, N=300)
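For reference, the result can be checked against the underlying normal-approximation formula: without finite population correction n0 = (z * S / e)^2, and with a population of size N the size is deflated to n0 / (1 + n0/N). A minimal sketch for the first example above (rounding up; assumed to match the package's formula):

e <- 4; S <- 10; N <- 300; level <- 0.95
z  <- qnorm(1 - (1 - level)/2)         # normal quantile
n0 <- (z * S / e)^2                    # sample size without finite population correction
n  <- n0 / (1 + n0/N)                  # with finite population correction
ceiling(c(n0 = n0, n = n))             # compare with sample.size.mean(e=4, S=10, N=300)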
The function sample.size.prop
returns the sample size needed for proportion estimation either with or without consideration of finite population correction.
sample.size.prop(e, P = 0.5, N = Inf, level = 0.95)
e | positive number specifying the precision, which is half the width of the confidence interval |
P | expected proportion of events, with domain between 0 and 1. Default is P=0.5 |
N | positive integer for population size. Default is N=Inf, which means calculation without finite population correction |
level | coverage probability for confidence intervals. Default is level=0.95 |
For a meaningful calculation, the precision e should be chosen smaller than 0.5, because the domain of P is between 0 and 1. Furthermore, the precision e should be smaller than P and (1-P), respectively.
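For the infinite-population case the underlying normal-approximation formula is n = z^2 * P * (1-P) / e^2. A minimal sketch (the finite population correction shown below is one common variant and an assumption here, not necessarily the exact form used internally):

e <- 0.01; P <- 0.5; level <- 0.95
z  <- qnorm(1 - (1 - level)/2)
n0 <- z^2 * P * (1 - P) / e^2          # without finite population correction
ceiling(n0)                            # compare with sample.size.prop(e=0.01, P=0.5, N=Inf)
N <- 300
ceiling(n0 / (1 + (n0 - 1)/N))         # assumed finite population correction variant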
The function sample.size.prop returns a list consisting of the components
call | is a list of call components |
n | estimate of the required sample size |
Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
## 1) examples with different precisions
# precision 1% for election forecast of SPD in 2005
sample.size.prop(e=0.01, P=0.5, N=Inf)
data(election)
sample.size.prop(e=0.01, P=mean(election$SPD_02), N=Inf)
# precision 5% for questionnaire
sample.size.prop(e=0.05, P=0.5, N=300)
sample.size.prop(e=0.05, P=0.5, N=Inf)
# precision 10%
sample.size.prop(e=0.1, P=0.5, N=300)
sample.size.prop(e=0.1, P=0.5, N=1000)
## 2) tables in the book
# table 2.2
P_vector <- c(0.2, 0.3, 0.4, 0.5)
N_vector <- c(10, 100, 1000, 10000)
results <- matrix(NA, ncol=4, nrow=4)
for (i in 1:length(P_vector)){
  for (j in 1:length(N_vector)){
    x <- try(sample.size.prop(e=0.1, P=P_vector[i], N=N_vector[j]))
    if (class(x)=='try-error') {results[i,j] <- NA} else {results[i,j] <- x$n}
  }
}
dimnames(results) <- list(paste('P=', P_vector, sep=''), paste('N=', N_vector, sep=''))
results
# table 2.3
P_vector <- c(0.5, 0.1)
e_vector <- c(0.1, 0.05, 0.03, 0.02, 0.01)
results <- matrix(NA, ncol=2, nrow=5)
for (i in 1:length(e_vector)){
  for (j in 1:length(P_vector)){
    x <- try(sample.size.prop(e=e_vector[i], P=P_vector[j], N=Inf))
    if (class(x)=='try-error') {results[i,j] <- NA} else {results[i,j] <- x$n}
  }
}
dimnames(results) <- list(paste('e=', e_vector, sep=''), paste('P=', P_vector, sep=''))
results
The function Smean
estimates the population mean out of simple samples either with or without consideration of finite population correction.
Smean(y, N = Inf, level = 0.95)
y | vector of sample data |
N | positive integer specifying the population size. Default is N=Inf, which means calculation without finite population correction |
level | coverage probability for confidence intervals. Default is level=0.95 |
The function Smean returns a list consisting of the components
call | is a list of call components |
mean | mean estimate |
se | standard error of the mean estimate |
ci | vector of confidence interval boundaries |
Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
data(pop)
Y <- pop$Y
Y
# Draw a random sample of size 3
set.seed(93456)
y <- sample(x = Y, size = 3)
sort(y)
# Estimation with finite population correction
est <- Smean(y = y, N = length(pop$Y))
est
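The output can be verified by hand: under simple random sampling without replacement, the standard error of the sample mean is sqrt((1 - n/N) * s^2 / n). A minimal sketch mirroring the example above (the normal quantile for the confidence interval is an assumption about the package's choice):

data(pop)
set.seed(93456)
y <- sample(x = pop$Y, size = 3)       # same sample as in the example above
n <- length(y); N <- length(pop$Y)
se <- sqrt((1 - n/N) * var(y) / n)     # standard error with finite population correction
c(mean = mean(y), se = se)
mean(y) + c(-1, 1) * qnorm(0.975) * se # 95% confidence interval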
The function Sprop estimates the proportion out of samples either with or without consideration of finite population correction. Different methods for calculating confidence intervals are used, for example based on the binomial distribution (Agresti-Coull or Clopper-Pearson) or on the hypergeometric distribution.
Sprop(y, m, n = length(y), N = Inf, level = 0.95)
y | vector of sample data containing values 0 and 1 |
m | an optional non-negative integer for the number of positive events |
n | an optional positive integer for the sample size. Default is n=length(y) |
N | positive integer for population size. Default is N=Inf, which means calculation without finite population correction |
level | coverage probability for confidence intervals. Default is level=0.95 |
Sprop can be called with a data vector y containing the observations (1 for event, 0 for failure). Alternatively, it can be called by specifying the number of events m and the number of trials n.
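A minimal sketch of the point estimate and a Wald-type standard error with finite population correction; the divisor (n - 1) and the exact interval constructions used by Sprop are assumptions here, so the numbers are for orientation only:

m <- 45; n <- 100; N <- 300
p  <- m / n                                    # proportion estimate
se <- sqrt((1 - n/N) * p * (1 - p) / (n - 1))  # assumed Wald-type standard error
c(p = p, se = se)
# compare with Sprop(m = 45, n = 100, N = 300)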
The function Sprop returns a list consisting of the components
call | is a list of call components |
p | proportion estimate |
se | standard error of the proportion estimate |
ci | is a list of confidence interval boundaries for the proportion |
nr | in case of a finite population of size N, confidence interval for the number of events in the population |
Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
Agresti, Alan/Coull, Brent A. (1998): Approximate Is Better than 'Exact' for Interval Estimation of Binomial Proportions. The American Statistician, Vol. 52, No. 2 , pp. 119-126.
# 1) Survey in a company on improving the office climate
Sprop(m=45, n=100, N=300)
Sprop(m=2, n=100, N=300)
# 2) German opinion poll for 03/07/09
# (http://www.wahlrecht.de/umfragen/politbarometer.htm)
# a) 302 of 1206 respondents who would elect SPD
# b) 133 of 1206 respondents who would elect the Greens
Sprop(m=302, n=1206, N=Inf)
Sprop(m=133, n=1206, N=Inf)
# 3) Rare disease of animals (sample size n=500 of N=10,000 animals, one infection)
# for a 95% one-sided confidence level use level=0.9
Sprop(m=1, n=500, N=10000, level=0.9)
# 4) call with data vector y
y <- c(0,0,1,0,1,0,0,0,1,1,0,0,1)
Sprop(y=y, N=200)
# is the same as
Sprop(m=5, n=13, N=200)
The function stratamean
estimates the population mean out of stratified samples either with or without consideration of finite population correction.
stratamean(y, h, Nh, wh, level = 0.95, eae = FALSE)
y | vector of the target variable |
h | vector of the stratifying variable |
Nh | vector of the stratum sizes, which has to be supplied in alphabetical or numerical order of the categories of h |
wh | vector of the stratum weights, which has to be supplied in alphabetical or numerical order of the categories of h |
level | coverage probability for confidence intervals. Default is level=0.95 |
eae | logical. If TRUE, the results for each and every stratum are displayed in addition to the overall estimate. Default is eae=FALSE |
If the absolute stratum sizes Nh
are given, the variances are calculated with finite population correction. Otherwise, if the stratum weights wh
are given, the variances are calculated without finite population correction.
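The estimator has the standard form: the stratified mean is sum_h W_h * ybar_h, and without finite population correction its variance is sum_h W_h^2 * s_h^2 / n_h (with the factor (1 - n_h/N_h) added when Nh is given). A minimal sketch, assuming the tax data used in the examples below:

data(tax)
y  <- tax$diff
h  <- as.vector(tax$Class)
nh <- as.vector(table(h))              # sample size per stratum
wh <- nh / sum(nh)                     # stratum weights as in the example below
ybar.h <- tapply(y, h, mean)           # stratum means
s2.h   <- tapply(y, h, var)            # stratum variances
c(mean = sum(wh * ybar.h),             # stratified mean estimate
  se   = sqrt(sum(wh^2 * s2.h / nh)))  # standard error without finite population correction
# compare with stratamean(y = tax$diff, h = as.vector(tax$Class), wh = wh)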
The function stratamean returns a list consisting of the components
call | is a list of call components |
mean | mean estimate for the population |
se | standard error of the mean estimate for the population |
ci | vector of confidence interval boundaries for the population mean |
Shuai Shao and Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
# random data
testy <- rnorm(100)
testh <- c(rep("male",40), rep("female",60))
stratamean(testy, testh, wh=c(0.5, 0.5))
stratamean(testy, testh, wh=c(0.5, 0.5), eae=TRUE)
# tax data
data(tax)
summary(tax)
nh <- as.vector(table(tax$Class))
wh <- nh/sum(nh)
stratamean(y=tax$diff, h=as.vector(tax$Class), wh=wh, eae=TRUE)
The function stratasamp
calculates the sample size for each stratum depending on type of allocation.
stratasamp(n, Nh, Sh = NULL, Ch = NULL, type = 'prop')
n | positive integer specifying the total sample size |
Nh | vector of population sizes of the strata |
Sh | vector of standard deviations in each stratum |
Ch | vector of costs for a sample in each stratum |
type | type of allocation. Default is type='prop' for proportional allocation; further options are 'opt' for optimal allocation and 'costopt' for cost-optimal allocation |
The function stratasamp returns a matrix listing the strata and the corresponding sample sizes, depending on the type of allocation.
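The standard allocation rules are n_h = n * N_h / sum(N_h) for proportional allocation and n_h = n * N_h * S_h / sum(N_h * S_h) for optimal (Neyman) allocation. A minimal sketch reproducing the example below by hand; rounding may differ slightly from the function's output:

n  <- 500
Nh <- c(5234, 2586, 649, 157)
Sh <- c(251, 1165, 8035, 24725)
round(n * Nh / sum(Nh))                # proportional allocation
round(n * Nh * Sh / sum(Nh * Sh))      # optimal (Neyman) allocation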
Shuai Shao and Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
stratamean, stratasize, sample.size.mean
# random proportional stratified sample
stratasamp(n=500, Nh=c(5234,2586,649,157))
stratasamp(n=500, Nh=c(5234,2586,649,157), Sh=c(251,1165,8035,24725), type='opt')
The function stratasize determines the total sample size for stratified sampling, depending on the type of allocation and a specified precision.
stratasize(e, Nh, Sh, level = 0.95, type = 'prop')
e | positive number specifying the sampling precision |
Nh | vector of population sizes in each stratum |
Sh | vector of standard deviations in each stratum |
level | coverage probability for confidence intervals. Default is level=0.95 |
type | type of allocation. Default is type='prop' for proportional allocation; the alternative is 'opt' for optimal allocation |
The function stratasize returns a list consisting of the components
call | is a list of call components |
n | determined total sample size |
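Ignoring the finite population correction, the standard formulas are n = z^2 * sum(W_h * S_h^2) / e^2 for proportional allocation and n = z^2 * (sum(W_h * S_h))^2 / e^2 for optimal allocation, with W_h = N_h / sum(N_h). A minimal sketch for the example below (stratasize additionally applies a finite population correction, so its result can be slightly smaller):

e  <- 0.1
Nh <- c(100000, 300000, 600000)
Sh <- c(1, 2, 3)
Wh <- Nh / sum(Nh)
z  <- qnorm(0.975)
ceiling(z^2 * sum(Wh * Sh^2) / e^2)    # proportional allocation, no finite population correction
ceiling(z^2 * sum(Wh * Sh)^2 / e^2)    # optimal allocation, no finite population correction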
Shuai Shao
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
# random proportional stratified sample
stratasize(e=0.1, Nh=c(100000,300000,600000), Sh=c(1,2,3))
# random optimal stratified sample
stratasize(e=0.1, Nh=c(100000,300000,600000), Sh=c(1,2,3), type="opt")
The function submean
estimates the population mean out of sub-samples (two-stage samples) either with or without consideration of finite population correction in both stages.
submean(y, PSU, N, M, Nl, m.weight, n.weight, method = 'simple', level = 0.95)
y | vector of the target variable |
PSU | vector of the grouping variable, which indicates the primary sampling unit for each sample element |
N | positive integer specifying the population size |
M | positive integer specifying the number of primary units in the population |
Nl | vector of the sizes of the sampled primary units, which has to be specified in alphabetical or numerical order of the categories of PSU |
m.weight | vector of primary sampling unit weights, which has to be specified in alphabetical or numerical order of the categories of PSU |
n.weight | vector of secondary sampling unit weights within each sampled primary unit, which has to be specified in alphabetical or numerical order of the categories of PSU |
method | estimation method. Default is 'simple'; the alternative is 'ratio' |
level | coverage probability for confidence intervals. Default is level=0.95 |
If the absolute sizes M
and Nl
are given, the variances are calculated with finite population correction. Otherwise, if the weights m.weight
and n.weight
are given, the variances are calculated without finite population correction.
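For orientation, the ratio-type point estimate in two-stage sampling is the size-weighted mean of the primary unit means, sum_l N_l * ybar_l / sum_l N_l. A minimal sketch reusing the small data of the example below (the standard error computed by submean is not reproduced here):

y   <- c(23, 33, 24, 25, 72, 74, 71, 37, 42)
psu <- as.factor(c(1, 1, 1, 1, 2, 2, 2, 3, 3))
Nl  <- c(100, 50, 75)                  # sizes of the sampled primary units
ybar.l <- tapply(y, psu, mean)         # means within the primary units
sum(Nl * ybar.l) / sum(Nl)             # ratio-type point estimate of the population mean
# compare with the 'mean' component of
# submean(y, PSU = psu, N = 700, M = 23, Nl = Nl, method = 'ratio')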
The function submean returns a list consisting of the components
call | is a list of call components |
mean | mean estimate for the population |
se | standard error of the mean estimate for the population |
ci | vector of confidence interval boundaries for the population mean |
Shuai Shao and Juliane Manitz
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
y <- c(23,33,24,25,72,74,71,37,42)
psu <- as.factor(c(1,1,1,1,2,2,2,3,3))
# with finite population correction
submean(y, PSU=psu, N=700, M=23, Nl=c(100,50,75), method='ratio')
# without finite population correction
submean(y, PSU=psu, N=700, m.weight=3/23, n.weight=c(4/100,3/50,2/75), method='ratio')
# Chinese wage data
data(wage)
summary(wage)
submean(wage$Wage, PSU=wage$Region, N=990, M=33, Nl=rep(30,14))
Simulated tax refund data frame including the estimated and actual refund value
data(tax)
A data frame with 9083 observations on the following 5 variables.
id
a numeric vector indicating the tax payer
estRefund
a numeric vector representing the estimated value of tax refund by the tax payer
actRefund
a numeric vector representing the actual tax refund calculated by the financial authority
diff
difference between estimated and actual tax refund
Class
a factor with levels 1, 2, 3, and 4, indicating the strata
Due to data protection this is a simulated data set reflecting the real data.
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
data(tax)
summary(tax)
# illustration of stratamean
nh <- as.vector(table(tax$Class))
wh <- nh/sum(nh)
stratamean(y=tax$diff, h=as.vector(tax$Class), wh=wh, eae=TRUE)
A data frame with hypothetical Chinese wages, differentiated by region and industrial sector.
data(wage)
A data frame with 231 observations on the following 3 variables.
Region
factor, Chinese regions with 14 levels.
Sector
factor, industrial sector with 30 levels.
Wage
a numeric vector, average wage in the region and sector measured in Chinese yuan.
The data set is hypothetical. Its structure imitates the data in the Chinese Statistical Yearbook. The values are simulated to correspond to the distribution of the real data, which are not publicly accessible.
Kauermann, Goeran/Kuechenhoff, Helmut (2010): Stichproben. Methoden und praktische Umsetzung mit R. Springer.
# Chinese wage data
data(wage)
summary(wage)
submean(wage$Wage, PSU=wage$Region, N=990, M=33, Nl=rep(30,14))