Sampling small data frame from a big dataframe - r

I am trying to sample a data frame from a given data frame such that there are enough samples from each of the levels of a variable.
This can be achieved by separating the data frame by the levels and sampling from each of those.
I thought ddply (data-frame to data-frame) would do it for me.
Taking a minimal example:
set.seed(1)
data1 <- data.frame(a = sample(c('B0', 'B1', 'B2'), 100, replace = TRUE), b = rnorm(100), c = runif(100))
> summary(data1$a)
B0 B1 B2
30 32 38
When I try to perform the sampling with the following command:
data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))
I get the following error
Error in `[.data.frame`(x, .Internal(sample(length(x), size, replace, :
cannot take a sample larger than the population when 'replace = FALSE'
This error occurs because x inside the ddply function is not a vector but a data frame, so sample() draws from its columns (only 3 here) rather than its rows.
Does anyone have any idea on how to achieve this sampling?
I know one way is to not use ddply and just do (1) segregation, (2) sampling, and (3) collation in three steps. But I was wondering whether there is some way to do this directly with base or plyr functions.
Thank you for your help...
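To make the cause of the error concrete, here is a minimal sketch in base R. The data frame `d` is hypothetical illustration data, not the data from the question. It shows that `length()` of a data frame counts columns, which is why `sample(x, 20)` asks for more items than exist.

```r
# Hypothetical three-column data frame, mirroring the shape of data1
d <- data.frame(a = c("B0", "B1", "B2", "B0"), b = 1:4, c = 5:8)

length(d)   # number of columns (3), not rows
nrow(d)     # number of rows (4)

# sample() on a data frame draws columns, so asking for more than
# length(d) of them without replacement fails
res <- try(sample(d, 20, replace = FALSE), silent = TRUE)
inherits(res, "try-error")   # TRUE

# Sampling row indices instead works as long as size <= nrow(d)
d[sample(nrow(d), 2), ]
```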

I think what you want is to subset the data frame passed in x using sample:
ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])
But, of course, you still need to take care that the size of the sample for each piece (in this case 20) is no larger than the smallest subset of your data based on the levels of a.
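For comparison, the three steps mentioned in the question (segregate, sample, collate) can be sketched in base R without plyr. This is only an illustration: the level counts below are set by hand to match the summary shown in the question, and `n_per_group` is an assumed name for the per-level sample size.

```r
set.seed(1)
# Illustration data: level counts fixed to 30 / 32 / 38 as in the question
data1 <- data.frame(a = rep(c("B0", "B1", "B2"), times = c(30, 32, 38)),
                    b = rnorm(100), c = runif(100))

n_per_group <- 20                                # assumed target size per level
stopifnot(n_per_group <= min(table(data1$a)))    # fail early if a level is too small

pieces  <- split(data1, data1$a)                 # (1) segregate by level
sampled <- lapply(pieces, function(x)            # (2) sample rows within each level
  x[sample(nrow(x), n_per_group, replace = FALSE), ])
data2   <- do.call(rbind, sampled)               # (3) collate

table(data2$a)
```

The explicit size check makes the failure mode visible before any sampling happens, rather than surfacing as an error from inside `sample()`.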

It would seem that if you want to sample a category that has fewer than 20 rows, you'd need replace=TRUE.
This might do the trick:
ddply(data1,'a',function(x) x[sample.int(NROW(x),20,replace=TRUE),])
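A possible refinement, sketched here under assumptions, is to sample with replacement only for the levels that are too small, so larger levels never get duplicated rows. The helper `sample_rows` and the data frame `df` are hypothetical names introduced for this illustration.

```r
# Hypothetical helper: sample n rows, falling back to replacement
# only when the group has fewer than n rows
sample_rows <- function(x, n) {
  x[sample.int(nrow(x), n, replace = nrow(x) < n), , drop = FALSE]
}

# Illustration data with one deliberately small level ("B2" has 5 rows)
df <- data.frame(a = rep(c("B0", "B1", "B2"), times = c(30, 25, 5)),
                 b = seq_len(60))

set.seed(42)
out <- do.call(rbind, lapply(split(df, df$a), sample_rows, n = 20))
table(out$a)   # 20 rows per level
```

The same helper should also work as the function argument of ddply, for example `ddply(df, 'a', sample_rows, n = 20)`, since ddply forwards extra arguments to the function.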

Related

R chisq.test test on data frame

I am attempting to run a chi-square analysis on the data frame below (called "habitat.re"), but I'm having difficulty: I've gotten R to read the data, but it gives the wrong results. When I ask for $expected it returns 18 different columns when there should be 3 (one for each site).
All the tutorials I've been able to find have the data as a table, but I've not been able to convert it correctly myself.
The chisq.test function is intended to work with two variables, or columns in this case. If you want to compare all three of your columns, then I suspect you would want to compare 1-2, 2-3, and 1-3, e.g.
chisq.test(x=habitat.re$Gidgee, y=habitat.re$`Ian's Place`)
chisq.test(x=habitat.re$`Ian's Place`, y=habitat.re$`Saw Mulga`)
chisq.test(x=habitat.re$Gidgee, y=habitat.re$`Saw Mulga`)
Actually, just typing in the above should reveal much useful information directly to the R console, something like this:
data: habitat.re$Gidgee and y=habitat.re$`Ian's Place`
X-squared = 5.5569, df = 1, p-value = 0.01841
A sufficiently low p-value might indicate that the two columns are in fact dependent.
Pearson's Chi-Squared Test requires a data frame to be made into a matrix table containing only the variables you need as numerical values. N.B. my data frame is called "habitat.re"
habitat.df <- data.matrix(habitat.re, rownames.force = NA)  # convert to matrix table
habitat.df <- habitat.df[, -c(1, 2, 3)]                     # delete first 3 columns
rownames(habitat.df) <- habitat.re$COMMON.NAME              # pull names from original
chisq.test(habitat.df)                                      # do chi-square test
chisq.test(habitat.df)$expected                             # return expected values
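Since the original data are not shown, here is a small self-contained sketch of the same idea with made-up counts. The species names and numbers are hypothetical; only the three site names come from the thread. It illustrates that when chisq.test receives a numeric matrix of counts, `$expected` has the same shape as the input, one column per site.

```r
# Hypothetical counts: 4 species (rows) observed at 3 sites (columns)
counts <- matrix(c(12,  5,  9,
                    7, 14,  6,
                   10,  8, 11,
                    4,  9, 13),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("sp1", "sp2", "sp3", "sp4"),
                                 c("Gidgee", "Ian's Place", "Saw Mulga")))

res <- chisq.test(counts)
res$expected        # same shape as counts: one column per site
dim(res$expected)   # 4 rows, 3 columns
res$parameter       # degrees of freedom: (4 - 1) * (3 - 1) = 6
```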
[Images of the habitat.re and habitat.df data frames]

how to make groups of variables from a data frame in R?

Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to compute a specific statistic for each of these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations.
Thanks in advance.
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time (combn returns unordered combinations, not permutations).
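To cover every subset size rather than only m = 3, one option is to call combn for each m from 1 to 8 and flatten the results. This is a sketch in base R; the name `subsets` is introduced here for illustration.

```r
varnames <- paste0("V", 1:8)

# combn() for every subset size m = 1..8, flattened into one list
subsets <- unlist(
  lapply(seq_along(varnames),
         function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)   # 2^8 - 1 = 255
subsets[[9]]      # first two-variable subset: "V1" "V2"
```

Each element is a character vector of column names, so a subset of the data can be taken with something like `yourdataframe[, subsets[[k]], drop = FALSE]`.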
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)

nn <- 8L
# id column plus nn random variables V1..Vnn
dt <- setnames(as.data.table(cbind(1:100, matrix(rnorm(100 * nn), ncol = nn))),
               c("id", paste0("V", 1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x <- cbind(rep(c(0, 1), each = 128),
           rep(rep(c(0, 1), each = 64), 2),
           rep(rep(c(0, 1), each = 32), 4),
           rep(rep(c(0, 1), each = 16), 8),
           rep(rep(c(0, 1), each = 8), 16),
           rep(rep(c(0, 1), each = 4), 32),
           rep(rep(c(0, 1), each = 2), 64),
           rep(c(0, 1), 128)) *
  matrix(1:nn, nrow = 2^nn, ncol = nn, byrow = TRUE)  # each row is 1:nn
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl <- lapply(1:(2^nn), function(y) paste0("V", 1:nn)[x[y, ][x[y, ] != 0]])
#now subset the data.table for each subset
ans <- lapply(1:(2^nn), function(y) dt[, incl[[y]], with = FALSE])
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2 <- lapply(1:(2^nn), function(y) unlist(dt[, incl[[y]], with = FALSE]))
#exclude the first element, which is null (the empty subset)
means <- lapply(2:(2^nn), function(y) mean(ans2[[y]]))
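On the comment about a more easily generalized way to build the inclusion indices: one possibility, sketched here in base R, is to let expand.grid enumerate every TRUE/FALSE pattern for any `nn`. The names `grid` and `incl` are illustrative, and the row order differs from the hand-built matrix above (expand.grid varies the first variable fastest), although the set of subsets is the same.

```r
nn <- 8L

# One row per subset, one logical column per variable
grid <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))
colnames(grid) <- paste0("V", 1:nn)

# Column names selected by each row; row 1 is the empty subset
incl <- lapply(seq_len(nrow(grid)), function(i) colnames(grid)[grid[i, ]])

length(incl)   # 2^nn = 256, including the empty subset
```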

Getting expected value through regression model and attach to original dataframe in R

My question is very similar to this one here, but I still can't solve my problem and would like a little more help to make it clear. The original data frame "ddf" looks like:
CONC <- c(0.15,0.52,0.45,0.29,0.42,0.36,0.22,0.12,0.27,0.14)
SPP <- c(rep('A',3),rep('B',3),rep('C',4))
LENGTH <- c(390,254,380,434,478,367,267,333,444,411)
ddf <- as.data.frame(cbind(CONC,SPECIES,LENGTH))
the regression model is constructed based on Species:
model <- dlply(ddf,.(SPP), lm, formula = CONC ~ LENGTH)
The regression model works fine and returns an individual model for each species.
What I want is the expected value and the residual from each model (one model per species), added to my original dataset ddf as new columns, so the new dataset should look like:
SPP LENGTH CONC EXPECTED RESIDUAL
Firstly, I use the following code to get the expected value:
model_pre <- lapply(model,function(x)predict(x,data = ddf))
I suspect there might be some mistakes in the above code, but it actually runs! The result comes with two columns (predicted value and species). My first question is whether I can trust the result of the above code. (Does R fully understand what I am aiming to do, namely getting the expected value from each species' own model?)
Then I used the following code to attach those data to ddf:
ddf_new <- cbind(ddf, model_pre)
This code runs as well, but here is the problem. R appears to attach the model_pre result directly to the original data frame, and since model_pre is not sorted in the same order as ddf, the result is obviously wrong (judging by the species columns in the original data frame and in model_pre).
I used resid() with similar lapply and cbind code to get the residuals and attach them to ddf, and the same problem occurs.
So how can I attach those results correctly, matched by species? (Please let me know if anything I am trying to explain is unclear.)
Any help would be greatly appreciated!
There are several problems with your code. You refer to columns SPP and Conc., but columns by those names don't exist in your data frame.
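Your call passes data = ddf, but predict for lm objects takes newdata; an argument named data is silently ignored, so each model just returns the fitted values for its own subset. Had you written newdata = ddf, the predicted values would be made on the entire dataset, not just the subset corresponding to that model (this may be intended, but seems strange with the later usage).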
When you cbind a data frame to a list of data frames, does it really cbind the individual data frames?
Now to more helpful suggestions.
Why use dlply at all here? You could just fit a model with interactions that effectively fits a different regression line to each species:
fit <- lm(CONC ~ SPECIES * LENGTH, data= ddf)
fitted(fit)
predict(fit)
ddf$Pred <- fitted(fit)
ddf$Resid <- ddf$CONC - ddf$Pred
Or if there is some other reason to really use dlply and the problem is combining two data frames that have different ordering, then either use merge or reorder the data frames to match first (see functions like order, sort.list, and match).
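If separate per-species models are preferred, one way to keep the rows aligned is to split, fit, and then use unsplit to put the values back in the original row order. This is a base R sketch, not the dlply approach from the question. It assumes the data frame was meant to be built with data.frame() so that CONC and LENGTH stay numeric and the grouping column is called SPP; the column names EXPECTED and RESIDUAL follow the layout requested in the question.

```r
# Assumption: intended construction of the data frame from the question
CONC   <- c(0.15, 0.52, 0.45, 0.29, 0.42, 0.36, 0.22, 0.12, 0.27, 0.14)
SPP    <- c(rep("A", 3), rep("B", 3), rep("C", 4))
LENGTH <- c(390, 254, 380, 434, 478, 367, 267, 333, 444, 411)
ddf <- data.frame(SPP, LENGTH, CONC)

# Fit one model per species
fits <- lapply(split(ddf, ddf$SPP), function(d) lm(CONC ~ LENGTH, data = d))

# unsplit() puts each group's fitted values back in the original row order
ddf$EXPECTED <- unname(unsplit(lapply(fits, fitted), ddf$SPP))
ddf$RESIDUAL <- ddf$CONC - ddf$EXPECTED
ddf
```

The fitted values obtained this way should agree with those from the single interaction model, since a full species-by-length interaction fits a separate line to each species.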

Generating new variable values by subset

I have a data set, and I am trying to create a new variable with random values that are associated with a particular subset.
For example, given the data frame:
data(iris)
iris=iris
I want another variable that associates each value of iris$Species with a random number (between 0 and 1). This can be accomplished in a circuitous fashion by creating a data frame:
df=data.frame(unique(iris$Species),runif(length(unique(iris$Species))))
And merging it with the original data frame:
iris=merge(iris,df,by.x="Species",by.y="unique.iris.Species.")
This accomplishes what I want, but it is inelegant. Furthermore, if I wanted to replicate this process many times over different variables this process would be burdensome. What I would hope for is some quick indexing method that would hopefully look something like:
iris$Species.unif=runif(length(unique(iris$Species)))[iris$Species]
Given that indexing in R is typically very slick, I expect there is some way of doing this that I am not aware of.
Thank you in advance.
You may want to try using levels:
iris <- iris
iris$species_unif <- iris$Species
levels(iris$species_unif ) <- runif(length(levels(iris$Species)))
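Note that relabelling levels leaves the new column as a factor, so it would need as.numeric(as.character(...)) before being used as a number. A sketch of the direct indexing idea from the question is below: a factor used as an index is treated as its integer codes, so a vector with one value per level can be indexed by the factor itself. The name `u` is introduced here for illustration.

```r
data(iris)
set.seed(1)

# One random number per level, then index by the factor:
# a factor used as an index is treated as its integer codes
u <- runif(nlevels(iris$Species))
iris$Species.unif <- u[iris$Species]

# Each species maps to exactly one value
tapply(iris$Species.unif, iris$Species, unique)
```

Using nlevels rather than length(unique(...)) keeps the vector aligned with the factor codes even if some level happens to be absent from the data.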
