I have a data frame with several variables, and I want to build subsets of it based on some conditions. I want to use a loop because there will be many subsets, and this would save a lot of time.
These are the conditions:
Variable A holds an ID for an area and variable B holds different species (1, 2, 3, etc.). I want to compute subsets from these columns. The name of every subset should be the ID of a point, and the content should be all individuals of a certain species at this point.
For a better understanding:
This would be the code for one subset, and I want to use a loop to build the rest:
A_2_NGF_Abies_alba <- subset(A_2_NGF, subset = Baumart %in% c("Abies alba"))
Is this possible in R?
Thanks
Does this help you?
Baumdaten <- data.frame(pointID=sample(c("A_2_SEF","A_2_LEF","A_3_LEF"), 10, TRUE), Baumart=sample(c("Abies alba", "Betula pendula", "Fagus sylvatica"), 10, TRUE))
split(Baumdaten, Baumdaten[, 1:2])
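To connect this back to the question, here is a minimal sketch of how the result of split() might be used. It assumes the toy data from the answer; the name `subsets` is introduced here for illustration, and because the data are random, any particular combination may be absent.

```r
# Toy data as in the answer above (values are random)
Baumdaten <- data.frame(
  pointID = sample(c("A_2_SEF", "A_2_LEF", "A_3_LEF"), 10, TRUE),
  Baumart = sample(c("Abies alba", "Betula pendula", "Fagus sylvatica"), 10, TRUE)
)
# drop = TRUE omits point/species combinations that do not occur
subsets <- split(Baumdaten, Baumdaten[, 1:2], drop = TRUE)
# Elements are named "<pointID>.<Baumart>"
names(subsets)
# Look up one subset by name; NULL if that combination did not occur
subsets[["A_2_SEF.Abies alba"]]
```

Keeping the subsets in one named list usually avoids creating many separate objects in a loop.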
Related
I have the following data set from Douglas Montgomery's book Introduction to Time Series Analysis & Forecasting:
I created a data frame called pharm from this spreadsheet. We only have two variables but they're repeated over several columns. I'd like to take all odd "Week" columns past the 2nd column and stack them under the 1st Week column in order. Conversely I'd like to do the same thing with the even "Sales, in thousands" columns. Here's what I've tried so far:
pharm2 <- data.frame(week=c(pharm$week, pharm[,3], pharm[,5], pharm[,7]), sales=c(pharm$sales, pharm[,4], pharm[,6], pharm[,8]))
This works because there aren't many columns, but I need a way to do this more efficiently because hard coding won't be practical with many columns. Does anyone know a more efficient way to do this?
If the columns alternate, subset with a recycled logical vector, unlist the result, and create a new data.frame:
out <- data.frame(week = unlist(pharm[c(TRUE, FALSE)]),
sales = unlist(pharm[c(FALSE, TRUE)]))
You can also use the seq function to generate the indices of the alternating columns:
pharm2 <- data.frame(week = unlist(pharm[seq(1, ncol(pharm), 2)]),
sales = unlist(pharm[seq(2, ncol(pharm), 2)]))
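As a small worked illustration of the recycled logical index, here is a sketch on a made-up stand-in for pharm. The values and the column names week.1 and sales.1 are assumptions, since the original spreadsheet is not shown.

```r
# Toy stand-in for pharm: week/sales pairs repeated across columns (made-up values)
pharm <- data.frame(week = 1:3, sales = c(10, 11, 12),
                    week.1 = 4:6, sales.1 = c(13, 14, 15))
# c(TRUE, FALSE) is recycled across the columns, selecting columns 1, 3, ...
out <- data.frame(week  = unlist(pharm[c(TRUE, FALSE)], use.names = FALSE),
                  sales = unlist(pharm[c(FALSE, TRUE)], use.names = FALSE))
out
```

Passing use.names = FALSE is optional; it only keeps unlist() from generating names such as week1, week2 for the stacked values.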
I am a beginner with R and am facing the following challenge, for which searching didn't turn up an answer.
I have a data frame that has a group assignment in the first column, and I want to create conditional random variables based on the group. E.g. everyone in group A should get a normally distributed random variable with mean 50 and stddev 10. The result of this random variable would then be added as an additional column.
Example:
group_assigned <- c("A","A","B","C","A","C")
dframe <- data.frame(group_assigned)
groups <-c("A","B","C")
group_mean <- c(50,40,30)
group_stddev <- c(10,5,5)
group_properties <- data.frame(groups,group_mean, group_stddev)
Can you guide me to a solution? Thank you for your help!
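One possible approach, as a sketch only: look up each row's group in group_properties with match() and pass the matched means and standard deviations to rnorm(), which is vectorised over both. The names `idx` and `value` are introduced here for illustration.

```r
group_assigned <- c("A", "A", "B", "C", "A", "C")
dframe <- data.frame(group_assigned)
group_properties <- data.frame(groups = c("A", "B", "C"),
                               group_mean = c(50, 40, 30),
                               group_stddev = c(10, 5, 5))
# Row of group_properties that corresponds to each row of dframe
idx <- match(dframe$group_assigned, group_properties$groups)
# rnorm() is vectorised over mean and sd, so each row gets its group's parameters
dframe$value <- rnorm(nrow(dframe),
                      mean = group_properties$group_mean[idx],
                      sd   = group_properties$group_stddev[idx])
dframe
```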
Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1, v2, ..., v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I can produce 2^8-1=255 subsets of variables, like {v1}, {v2}, ..., {v8}, {v1,v2}, ..., {v1,v2,v3}, ..., {v1,v2,...,v8}.
My goal is to compute a specific statistic for each of these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations.
Thanks in advance.
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time (order within a combination does not matter, so these are combinations rather than permutations).
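To cover every subset size rather than only m = 3, combn can be called once per size. The following is a sketch, not the answerer's code; `all_subsets` is a name introduced here.

```r
varnames <- paste0("V", 1:8)
# For each size m = 1..8, list all combinations of m variable names
all_subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)
# 2^8 - 1 = 255 non-empty subsets
length(all_subsets)
# The eight single-variable subsets come first, so element 9 is c("V1", "V2")
all_subsets[[9]]
```

Each element is a character vector of column names, so yourdataframe[, all_subsets[[i]], drop = FALSE] would give the corresponding subset of columns.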
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=FALSE]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=FALSE])})
#exclude the first element, which is the empty subset
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
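The hand-built indicator matrix could probably be generated for any nn with expand.grid. This is a sketch under that assumption; `vars` and `masks` are names introduced here, and the rows come out in a different order than in the code above (the first variable varies fastest).

```r
nn <- 8L
vars <- paste0("V", 1:nn)
# One row per subset; each column says whether that variable is included
masks <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))
# Column names selected by each row's logical mask
incl <- lapply(seq_len(nrow(masks)), function(i) vars[masks[i, ]])
# 2^nn subsets in total; the first one is empty
length(incl)
```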
I have a large number of treatment and control groups I need to provide a comparison of population proportions for. I'm looking for a way to loop through a data.frame providing the test against each of the categories.
Sample data:
test_data <- data.frame(
Category = c("A","A","B","B"),
Churn = c(56,46,83,58),
Other = c(180,555,144,86))
For example, compare category A (56/180 to 46/555) and so forth.
My initial solution:
by(test_data, test_data$Category,
function(x) prop.test(test_data$Churn, test_data$Other))
The problem: The solution outputs by category but provides a 4 sample test instead of a two sample test. I've found lots of solutions that iterate well through rows but not so much by a category. Output as a list is fine for now.
Really appreciate the help on this one!
Your by() call is incorrect: the function never uses the x value that is passed in. Because it refers to the original variable (test_data), every by() call tests the full data instead of the subset for one category. Try:
by(test_data, test_data$Category,
function(x) prop.test(x$Churn, x$Other))
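Since list output is acceptable, the same per-category idea can also be sketched with split() and lapply(), which makes it easy to pull out a single quantity such as the p-value. The names `tests` and `p_values` are introduced here for illustration.

```r
test_data <- data.frame(
  Category = c("A", "A", "B", "B"),
  Churn = c(56, 46, 83, 58),
  Other = c(180, 555, 144, 86))
# One two-sample test per category; x is the subset for that category
tests <- lapply(split(test_data, test_data$Category),
                function(x) prop.test(x$Churn, x$Other))
# Collect the p-values into a vector named by category
p_values <- sapply(tests, function(res) res$p.value)
p_values
```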
This question is about selecting a different number of columns on every row of a data frame. I have a data frame:
df = data.frame(
START=sample(1:2, 10, replace=TRUE), END=sample(2:4, 10, replace=TRUE),
X1=rnorm(10), X2=rnorm(10), X3=rnorm(10), X4=rnorm(10)
)
I would like to have a way without loops to select columns (START[i]:END[i])+2 on row i for all rows of my data frame.
Base R solution
lapply(split(df,1:nrow(df)),function(row) row[(row$START+2):(row$END+2)])
Or something similar as given in the comment above (I would store the output in a list)
library(plyr)
alply(df,1,function(row) row[(row$START+2):(row$END+2)])
Edit per request of OP:
To get a TRUE/FALSE index matrix, use the following base R solution:
idx_matrix=col(df)>=df$START+2&col(df)<=df$END+2
df[idx_matrix]
Note, however, that you lose some information here (compared to the list based solution).
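To make the difference concrete, here is a sketch on a small made-up data frame (the values are assumptions chosen so the selections are easy to follow; `per_row` and `flat` are names introduced here). The list keeps the column names for each row, while the matrix index returns one flat vector in column-major order.

```r
# Small deterministic example (made-up values)
df <- data.frame(START = c(1, 2), END = c(2, 4),
                 X1 = c(11, 21), X2 = c(12, 22), X3 = c(13, 23), X4 = c(14, 24))
# List-based version: one data frame per row, column names preserved
per_row <- lapply(split(df, seq_len(nrow(df))),
                  function(row) row[(row$START + 2):(row$END + 2)])
# Row 1 selects X1 and X2
names(per_row[[1]])
# Matrix-index version: a plain vector, filled column by column
idx_matrix <- col(df) >= df$START + 2 & col(df) <= df$END + 2
flat <- df[idx_matrix]
# 11 12 22 23 24
flat
```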