I am trying to sample a data frame from a given data frame such that there are enough samples from each of the levels of a variable.
This can be achieved by separating the data frame by the levels and sample from each of those .
I thought ddply (data-frame to data-frame) would do it for me.
Taking a minimal example:
set.seed(1)
data1 <-data.frame(a=sample(c('B0','B1','B2'),100,replace=TRUE),b=rnorm(100),c=runif(100))
> summary(data1$a)
B0 B1 B2
30 32 38
The following commands perform the sampling...
When I enter...
data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))
I get the following error
Error in [.data.frame(x, .Internal(sample(length(x), size, replace, :
cannot take a sample larger than the population when 'replace = FALSE'
This error is because x inside the ddply function is not a vector but a dataframe.
Does anyone have any idea on how to achieve this sampling?
I know one way is to not use ddply and just do (1) segregation, (2) sampling, and (3) collation in three steps. But I was wondering there must by some way ...with base or plyr functions...
Thank you for your help...
I think what you want is to subset the data frame passed in x using sample:
ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])
But, of course, you still need to take care that the size of the sample for each piece (in this case 20) is at least as big as the smallest subset of your data based on the levels of a.
It would seem that if you want to sample a category that has less than 20 rows, you'd need replace=TRUE...
This might do the trick:
ddply(data1,'a',function(x) x[sample.int(NROW(x),20,replace=TRUE),])
Related
Dear Friends I would appreciate if someone can help me in some question in R.
I have a data frame with 8 variables, lets say (v1,v2,...,v8).I would like to produce groups of datasets based on all possible combinations of these variables. that is, with a set of 8 variables I am able to produce 2^8-1=63 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
my goal is to produce specific statistic based on these groupings and then compare which subset produces a better statistic. my problem is how can I produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all permutations of V1-V8 taken 3 at a time.
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
I am very new to R and I am struggling to understand how to omit NA values in a specific way.
I have a large dataframe with several columns (up to 40) and rows (up to 200ish). I want to use data from one of the columns to do simple stats (wilcox.test, boxplot, etc): one column will have a continuous variable (V1), while the other has a binary variable (V2; 0 or 1), which divides 2 groups. I want to do this for the continuous variable using different V2 binary variables, which are unrelated. I organized this data in Excel, saved it as CSV and am using R Studio.
All these columns have interspersed NA values and when I use omit.na, it just takes off every single row where a NA value is present, which takes away an awful load of data. Is there any simple solution to do this? I have seen some answers to similar topics, but none seems quite exactly what I need to do.
Many thanks for any answer. Again, I am a baby-level newbie to R and may have overlooked something in other topics!
If I understand, you want to apply to function to a pair of column each time.
wilcox.test(V1,V2)
wilcox.test(V1,V3)...
Where Vi have no missing values. I would do something like this :
## use complete.cases to assert that you have no missing values
## for the selected pair
apply_clean <-
function(x,y){
ok <- complete.cases(x, y)
wilcox.test(x[ok],dat$V1[ok])
}
## apply this function to all columns after removing the continuous column
lapply(subset(dat,select=-V1),apply_clean,y=dat$V1)
You can manipulate the data.frame to omit based on any rules you like. For example:
dirty.frame <- data.frame(col1 = c(1,2,3,4,5,6,7,NA,9,10), col2 = c(10, 9, 8, 7,6,5,4,3,2,1))
cleaned.frame <- dirty.frame[!is.na(dirty.frame$col1),]
This code used is.na() to test if a row in a specific column is na. The ! means not, and will omit that row.
My question is very similar to this one here , but I still can't solve my problem and thus would like to get little bit more help to make it clear. The original dataframe "ddf" looks like:
CONC <- c(0.15,0.52,0.45,0.29,0.42,0.36,0.22,0.12,0.27,0.14)
SPP <- c(rep('A',3),rep('B',3),rep('C',4))
LENGTH <- c(390,254,380,434,478,367,267,333,444,411)
ddf <- as.data.frame(cbind(CONC,SPECIES,LENGTH))
the regression model is constructed based on Species:
model <- dlply(ddf,.(SPP), lm, formula = CONC ~ LENGTH)
the regression model works fine and returns individual models for each species.
What I am going to get is the residual and expected value of 'Length' variable in terms of each models (corresponding to different species) and I want those data could be added into my original dataset ddf as new columns. so the new dataset should looks like:
SPP LENGTH CONC EXPECTED RESIDUAL
Firstly, I use the following code to get the expected value:
model_pre <- lapply(model,function(x)predict(x,data = ddf))
I loom there might be some mistakes in the above code, but it actually works! The result comes with two columns ( predicated value and species). My first question is whether I could believe this result of above code? (Does R fully understand what I am aiming to do, getting expected value of "length" in terms of different model?)
Then i used the following code to attach those data to ddf:
ddf_new <- cbind(ddf, model_pre)
This code works fine as well. But the problem comes here. It seems like R just attach the model_pre result directly to the original dataframe, since the result of model_pre is not sorted the same as the original ddf and thus is obviously wrong(justifying by the species column in original dataframe and model_pre).
I was using resid() and similar lapply, cbind code to get residual and attach it to original ddf. Same problem comes.
Therefore, how can I attach those result correctly in terms of length by species? (please let me know if you confuse what I am trying to explain here)
Any help would be greatly appreciated!
There are several problems with your code, you refer to columns SPP and Conc., but columns by those names don't exist in your data frame.
Your predicted values are made on the entire dataset, not just the subset corresponding to that model (this may be intended, but seems strange with the later usage).
When you cbind a data frame to a list of data frames, does it really cbind the individual data frames?
Now to more helpful suggestions.
Why use dlply at all here? You could just fit a model with interactions that effectively fits a different regression line to each species:
fit <- lm(CONC ~ SPECIES * LENGTH, data= ddf)
fitted(fit)
predict(fit)
ddf$Pred <- fitted(fit)
ddf$Resid <- ddf$CONC - ddf$Pred
Or if there is some other reason to really use dlply and the problem is combining 2 data frame that have different ordering then either use merge or reorder the data frames to match first (see functions like ordor, sort.list, and match).
I have a data set, and I am trying to create a new variable with random values that are associated with a particular subset.
For example, given the data frame:
data(iris)
iris=iris
I want another variable that associates each value of iris$Species with a random number (between 0 and 1). This can be accomplished in a circuitous fashion by creating a data frame:
df=data.frame(unique(iris$Species),runif(length(unique(iris$Species))))
And merging it with the original data frame:
iris=merge(iris,df,by.x="Species",by.y="unique.iris.Species.")
This accomplishes what I want, but it is inelegant. Furthermore, if I wanted to replicate this process many times over different variables this process would be burdensome. What I would hope for is some quick indexing method that would hopefully look something like:
iris$Species.unif=runif(length(unique(iris$Species)))[iris$Species]
Given that indexing in R is typically very slick, I expect there is some way of doing this that I am not aware of.
Thank you in advance.
You may want to try by using levels:
iris <- iris
iris$species_unif <- iris$Species
levels(iris$species_unif ) <- runif(length(levels(iris$Species)))
I am trying to sample a data frame from a given data frame such that there are enough samples from each of the levels of a variable.
This can be achieved by separating the data frame by the levels and sample from each of those .
I thought ddply (data-frame to data-frame) would do it for me.
Taking a minimal example:
set.seed(1)
data1 <-data.frame(a=sample(c('B0','B1','B2'),100,replace=TRUE),b=rnorm(100),c=runif(100))
> summary(data1$a)
B0 B1 B2
30 32 38
The following commands perform the sampling...
When I enter...
data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))
I get the following error
Error in [.data.frame(x, .Internal(sample(length(x), size, replace, :
cannot take a sample larger than the population when 'replace = FALSE'
This error is because x inside the ddply function is not a vector but a dataframe.
Does anyone have any idea on how to achieve this sampling?
I know one way is to not use ddply and just do (1) segregation, (2) sampling, and (3) collation in three steps. But I was wondering there must by some way ...with base or plyr functions...
Thank you for your help...
I think what you want is to subset the data frame passed in x using sample:
ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])
But, of course, you still need to take care that the size of the sample for each piece (in this case 20) is at least as big as the smallest subset of your data based on the levels of a.
It would seem that if you want to sample a category that has less than 20 rows, you'd need replace=TRUE...
This might do the trick:
ddply(data1,'a',function(x) x[sample.int(NROW(x),20,replace=TRUE),])