drawing a stratified sample in R - r

Designing my stratified sample
library(survey)
design <- svydesign(id=~1,strata=~Category, data=billa, fpc=~fpc)
So far so good, but how can I draw now a sample in the same way I was able for simple sampling?
set.seed(67359)
samplerows <- sort(sample(x=1:N, size=n.pre$n))

If you have a stratified design, then I believe you can sample randomly within each stratum. Here is a short algorithm to do proportional sampling in each stratum, using ddply:
library(plyr)
set.seed(1)
dat <- data.frame(
id = 1:100,
Category = sample(LETTERS[1:3], 100, replace=TRUE, prob=c(0.2, 0.3, 0.5))
)
sampleOne <- function(id, fraction=0.1){
sort(sample(id, round(length(id)*fraction)))
}
ddply(dat, .(Category), summarize, sampleID=sampleOne(id, fraction=0.2))
Category sampleID
1 A 21
2 A 29
3 A 72
4 B 13
5 B 20
6 B 42
7 B 58
8 B 82
9 B 100
10 C 1
11 C 11
12 C 14
13 C 33
14 C 38
15 C 40
16 C 63
17 C 64
18 C 71
19 C 92

Take a look at the sampling package on CRAN (pdf here), and the strata function in particular.
This is a good package to know if you're doing surveys; there are several vignettes available from its page on CRAN.
The task view on "Official Statistics" includes several topics that are closely related to these issues of survey design and sampling - browsing through it and the packages recommended may also introduce other tools that you can use in your work.

You can draw a stratified sample using dplyr. First we group by the column or columns in which we are interested in. In our example, 3 records of each Species.
library(dplyr)
set.seed(1)
iris %>%
group_by (Species) %>%
sample_n(., 3)
Output:
Source: local data frame [9 x 5]
Groups: Species
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 5.7 3.8 1.7 0.3 setosa
3 5.2 3.5 1.5 0.2 setosa
4 5.7 3.0 4.2 1.2 versicolor
5 5.2 2.7 3.9 1.4 versicolor
6 5.0 2.3 3.3 1.0 versicolor
7 6.5 3.0 5.2 2.0 virginica
8 6.4 2.8 5.6 2.2 virginica
9 7.4 2.8 6.1 1.9 virginica

here's a quick way to sample three records per distinct 'carb' value from the mtcars data frame without replacement
# choose how many records to sample per unique 'carb' value
records.per.carb.value <- 3
# draw the sample
your.sample <-
mtcars[
unlist(
tapply(
1:nrow( mtcars ) ,
mtcars$carb ,
sample ,
records.per.carb.value
)
) , ]
# print the results to the screen
your.sample
note that the survey package is mostly used for analyzing complex sample survey data, not creating it. #Iterator is right that you should check out the sampling package for more advanced ways to create complex sample survey data. :)

Related

R - Create random subsamples of size n for multiple sample groups

I have a large data set of samples that belong to different groups and differ in the area covered. The structure of the data set is simplified below. I now would like to create pooled samples (Subgroups) for each Group where the area covered by each Subgroup equates to a specified area (e.g. 20). Samples should be allocated randomly and without replacement to each Subgroup and the number of the Subgroup should be listed in a new column at the end of the data frame.
SampleID Group Area Subgroup
1 A 1.5 1
2 A 3.8 2
3 A 6 4
4 A 1.9 1
5 A 1.5 3
6 A 4.1 1
7 A 3.7 1
8 A 4.5 3
...
300 B 1.2 1
301 B 3.8 1
302 B 4.1 4
303 B 2.6 3
304 B 3.1 5
305 B 3.5 3
306 B 2.1 2
...
2000 S 2.7 5
...
I am currently using the ‘cumsum’ command to create the Subgroups, using the code below.
dat <- read.table("Pooling_Test.txt", header = TRUE, sep = "\t")
dat$CumArea <- cumsum(dat$Area)
dat$Diff_CumArea <- c(0, head(cumsum(dat$Area), -1))
dat$Sample_Int_1 <- "0"
dat$Sample_End <- "0"
current.sum <- 0
for (c in 1:nrow(dat)) {
current.sum <- current.sum + dat[c, "Area"]
dat[c, "Diff_CumArea"] <- current.sum
if (current.sum >= 20) {
dat[c, "Sample_Int_1"] <- "1"
dat[c, "Sample_End"] <- "End"
current.sum <- 0
dat$Sample_Int_2 <- cumsum(dat$Sample_Int_1)+1
dat$Sample_Final <- dat$Sample_Int_2
for (d in 1:nrow(dat)) {
if (dat$Sample_End[d] == 'End')
dat$Subgroup[d] <- dat$Sample_Int_2[d]-1
else 0 }
}}
write.csv(dat, file = 'Pooling_Test_Output.csv', row.names = FALSE)
The resultant data frame shows what I want (see below). However, there are a couple of steps I would like to improve. First, I have problems including a command for choosing samples randomly from each Group, so I currently randomise the order of samples before loading the data frame into R. Secondly, in the output table the Subgroups are numbered consecutively, but I would like to start the Subgroup numbering with 1 for each new Group. Has anybody any advice on how to achieve this?
SampleID Group CumArea Subgroups
1 A 1.5 1
77 A 4.6 1
6 A 9.3 1
43 A 16.4 1
17 A 19.5 1
67 A 2.1 2
4 A 4.3 2
32 A 8.9 2
...
300 B 4.5 10
257 B 6.8 10
397 B 10.6 10
344 B 14.5 10
367 B 16.7 10
303 B 20.1 10
306 B 1.5 11
...
A few functions in the dplyr package make this fairly straightforward. You can use slice to randomize the data, group_by to perform computations at the group level, and mutate to create new variables. If you chain the functions together with the %>% operator, I believe the solution would look something like this, assuming that you want groups that add up to 20.
install.packages("dplyr") #If you haven't used dplyr before
library(dplyr)
dat %>%
group_by(Group) %>%
slice(sample(1:n())) %>%
mutate(CumArea = cumsum(Area), SubGroup = ceiling(CumArea / 20))

Filter data frame based off factor - R

I have the following data frame (called cats, can be accessed using library(MASS)
Sex Bwt Hwt
1 F 2.0 7.0
2 F 2.0 7.4
3 F 2.0 9.5
4 F 2.1 7.2
5 F 2.1 7.3
6 F 2.1 7.6
7 F 2.1 8.1
8 F 2.1 8.2
9 F 2.1 8.3
10 F 2.1 8.5
I first create 3 factors:
x = cut(cats$Bwt, breaks=3)
Now I need to grab all the data which fits in the first factor, plot it in a boxplot. Then do the same for the other 2 factors.
I have tried:
new_data = subset(cats, cats$Bwt %in% x[1])
also
new_data = cats[which(cats$Bwt == x[1])]
I can't seem to filter this data based on the factor. How is this done?
The simple answer is that the variable you created is the one you should be iterating over when performing the comparison. So:
new_data <- cats[which(x == unique(x)[1]),]
Another alternative is not to subset at all but instead use the facet functionality from ggplot something like this
cats %>%
mutate(breaks = cut(Bwt, breaks=3)) %>%
ggplot() +
geom_boxplot(aes(x = Sex, y = Hwt)) +
facet_wrap(~breaks)

Calculation via factor results in a by-list - how to circumvent?

I have a data.frame as following:
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 NA
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
My goal is simple but also a bit difficult. Definitely it is doable to solve it in several ways:
I want to apply a function "func" to each row according to a factor, e.g. the factor "Lot". This is done via
m_dist_lot<- by(data.frame, data.frame$Lot,func)
This actually works but the result is a by-list:
data.frame$Lot: 7
354 355 363 367 378 419 426 427 428 431 460 477 836
3.5231249 9.4229589 1.4996504 7.2984485 7.6883170 1.2354754 1.8547674 3.1129814 4.4303001 1.9634573 3.7281868 3.6182559 6.4718306
data.frame$Lot: 8
1 2 11 15 17 18 19 20 21 22 24 25
2.1415352 4.6459868 1.3485551 38.8218984 3.9988686 2.2473563 6.7186047 2.6433790 0.5869746 0.5832567 4.5321623 1.8567318
The first row seems to be the row of the initial data.frame where the data is taken from. The second row are the calculated values.
My problem now is: How can I store these values properly into the origin data.frame according to the correct rows?
For example in case of one certain calculation/row of the data frame:
m_dist_lot<- by(data.frame, data.frame$Lot,func)
results for the second row of the data.frame in
data.frame$Lot: 8
2
4.6459868
I want to store the value 4.6459868 in data.frame$m_dist_lot according to the correct row "2":
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 4.6459868
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
but I don't know how. My best try actually is to use "unlist".
un<- unlist(m_dist_lot) results in
un[1]
6.354
3.523125
un[2]
6.355
9.422959
un[3]
(..)
But I still don't know how I can "separate" the information of "factor.row" and "calculcated" value in such a way that the information is stored correctly in the data frame.
At least when using un<- unlist(m_dist_lot, use.names = FALSE) the factors are not present:
un[1]
3.523125
un[2]
9.422959
un[3]
1.49965
(..)
But now I lack the information of how to assign these values properly into the data.frame.
Using un<- do.call(rbind, lapply(m_dist_lot, data.frame, stringsAsFactors=FALSE)) results in
(...)
7.922 0.94130936
7.976 4.89560441
8.1 2.14153516
8.2 4.64598677
8.11 1.34855514
(...)
Here I still lack a proper assignment of calculated values <> data.frame.
I'm sure there must be a doable way. Do you know a good method?
Without reproducible data or an example of what you want func to do, I am guessing a bit here. However, I think that dplyr is going to be the answer for you.
First, I am going to use the pipe (%>%) from dplyr (exported from magrittr) to pass the builtin iris data through a series of functions. If what you are trying to calculate requires the full data.frame (and not just a column or two), you could modify this approach to do what you want (just write your function to take a data.frame, add the column(s) of interest, then return the full data.frame).
Here, I first split the iris data by Species (this creates a list, with a separate data.frame for each species). Next, I use lapply to run the function head on each element of the list. This returns a list of data.frames that now each only have three rows. (You could replace head with your function of interest here, as long as it returns a full data.frame.) Finally, I stitch each element of the list back together with bind_rows.
topIris <-
iris %>%
split(.$Species) %>%
lapply(head, n = 3) %>%
bind_rows()
This returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 6.3 3.3 6.0 2.5 virginica
8 5.8 2.7 5.1 1.9 virginica
9 7.1 3.0 5.9 2.1 virginica
Which I am going to use to illustrate the approach that I think will actually address your underlying problem.
The group_by function from dplyr allows a similar approach, but without having to split the data.frame. When a data.frame is grouped, any functions applied to it are applied separately by group. Here is an example in action, which ranks the sepal lengths within each species. This is obviously not terribly useful directly, but you could write a custom function which took any number of columns as arguments (which are then passed in as vectors) and returned a vector of the same length (to create a new column or update an existing one). The select function at the end is only there to make it easier to see what I did
topIris %>%
group_by(Species) %>%
mutate(rank_Sepal_Length = rank(Sepal.Length)) %>%
select(Species, rank_Sepal_Length, Sepal.Length)
Returns:
Species rank_Sepal_Length Sepal.Length
<fctr> <dbl> <dbl>
1 setosa 3 5.1
2 setosa 2 4.9
3 setosa 1 4.7
4 versicolor 3 7.0
5 versicolor 1 6.4
6 versicolor 2 6.9
7 virginica 2 6.3
8 virginica 1 5.8
9 virginica 3 7.1
I got a workaround with the help of Force gsub to keep trailing zeros :
un<- do.call(rbind, lapply(list, data.frame, stringsAsFactors=FALSE))
un<- gsub(".*.","", un)
un<- regmatches(un, gregexpr("(?<=.).*", un, perl=TRUE))
rows<- data.frame(matrix(ncol = 1, nrow = lengths(un)))
colnames(rows)<- c("row_number")
rows["row_number"]<- sprintf("%s", rownames(un))
rows["row_number"]<- as.numeric(un[,1])
rows["row_number"]<- sub("^[^.]*[.]", "", format(rows[,1], width = max(nchar(rows[,1]))))

R: Creating an index vector

I need some help with R coding here.
The data set Glass consists of 214 rows of data in which each row corresponds to a glass sample. Each row consists of 10 columns. When viewed as a classification problem, column 10
(Type) specifies the class of each observation/instance. The remaining columns are attributes that might beused to infer column 10. Here is an example of the first row
RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
First, I casted column 10 so that it is interpreted by R as a factor instead of an integer value.
Now I need to create a vector with indices for all observations (must have values 1-214). This needs to be done to creating training data for Naive Bayes. I know how to create a vector with 214 values, but not one that has specific indices for observations from a data frame.
If it helps this is being done to set up training data for Naive Bayes, thanks
I'm not totally sure that I get what you're trying to do... So please forgive me if my solution isn't helpful. If your df's name is 'df', just use the dplyr package for reordering your columns and write
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that just is 1,2,3,4,5,6,7,8,9,10... and I'll call it 'index' ...I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0
Hope this will help
df$ind <- seq.int(nrow(df))

Cumulative summing between groups using dplyr

I have a tibble structured as follows:
day theta
1 1 2.1
2 1 2.1
3 2 3.2
4 2 3.2
5 5 9.5
6 5 9.5
7 5 9.5
Note that the tibble contains multiple rows for each day, and for each day the same value for theta is repeated an arbitrary number of times. (The tibble contains other arbitrary columns necessitating this repeating structure.)
I'd like to use dplyr to cumulatively sum values for theta across days such that, in the example above, 2.1 is added only a single time to 3.2, etc. The tibble would be mutated so as to append the new cumulative sum (c.theta) as follows:
day theta c.theta
1 1 2.1 2.1
2 1 2.1 2.1
3 2 3.2 5.3
4 2 3.2 5.3
5 5 9.5 14.8
6 5 9.5 14.8
7 5 9.5 14.8
...
My initial efforts to group_by day and then cumsum over theta resulted only in cumulative summing over the full set of data (e.g., 2.1 + 2.1 + 3.2 ...) which is undesirable. In my Stack Overflow searches, I can find many examples of cumulative summing within groups, but never between groups, as I describe above. Nudges in the right direction would be much appreciated.
Doing this in dplyr I came up with a very similar solution to PoGibas - use distinct to just get one row per day, find the sum and merge back in:
df = read.table(text="day theta
1 1 2.1
2 1 2.1
3 2 3.2
4 2 3.2
5 5 9.5
6 5 9.5
7 5 9.5", header = TRUE)
cumsums = df %>%
distinct(day, theta) %>%
mutate(ctheta = cumsum(theta))
df %>%
left_join(cumsums %>% select(day, ctheta), by = 'day')
Not a dplyr, but just an alternative data.table solution:
library(data.table)
# Original table is called d
setDT(d)
merge(d, unique(d)[, .(c.theta = cumsum(theta), day)])
day theta c.theta
1: 1 2.1 2.1
2: 1 2.1 2.1
3: 2 3.2 5.3
4: 2 3.2 5.3
5: 5 9.5 14.8
6: 5 9.5 14.8
7: 5 9.5 14.8
PS: If you want to preserve other columns you have to use unique(d[, .(day, theta)])
In base R you could use split<- and tapply to return the desired result.
# construct 0 vector to fill in
dat$temp <- 0
# fill in with cumulative sum for each day
split(dat$temp, dat$day) <- cumsum(tapply(dat$theta, dat$day, head, 1))
Here, tapply returns the first element of theta for each day which is is fed to cumsum. The elements of cumulative sum are fed to each day using split<-.
This returns
dat
day theta temp
1 1 2.1 2.1
2 1 2.1 2.1
3 2 3.2 5.3
4 2 3.2 5.3
5 5 9.5 14.8
6 5 9.5 14.8
7 5 9.5 14.8

Resources