I'm working with a large data frame that is pulled from a data lake which I need to subset according to multiple different columns and run an analysis on. The basic subsettings come from an external Excel file which I read in and generate all possible combinations of. I want something to loop through each of these columns and subset my data accordingly.
A few of the subsettings follow a similar form to:
data_settings <- data.frame(country = rep(c('DE','RU','US','CA','BR'),6),
transport=rep(c('road','air','sea')),
category = rep(c('A','B')))
And my data lake extract has a form like:
df <- data.frame(country = rep(unique(data_settings$country),6),
transport = rep(unique(data_settings$transport),10),
category = rep(c('A','B'),15),
values = round(runif(30) * 10))
I need to subset the data according to each of the rows in my data_settings data frame, so I built a loop which constructs the formula according to what is in my data_settings data frame.
for(i in 1:nrow(data_settings)){
sub_string <- paste0(names(data_settings[1]), '==', data_settings[i,1])
for(j in 2:ncol(data_settings)){
col <- names(data_settings)[j]
val <- as.character(data_settings[i,j])
sub_string <- paste0(sub_string, ' & ', col," == ","'",val,"'")
}
df_sub <- subset(df, formula(sub_string))
}
This successfully builds my strings which I try to pass to formula or as.formula, but I receive an error at that point. I've tried a few different formulations without any success. In my actual case, there are thousands of combinations with different columns and values to filter against.
Thanks in advance for your help!
Try this:
merge(data_settings, df)
I worked with my previous approach a bit more today without using subset, filter, etc. and put this together which seems to do what I want well enough by filtering recursively according to the next item in the data_settings frame.
for(i in 1:nrow(data_settings)){
df_sub <- df
for(j in 1:ncol(data_settings)){
col <- names(data_settings)[j]
val <- as.character(data_settings[i,j])
df_col <- grep(col, names(df))
df_sub <- df_sub[df_sub[,df_col] == val,]
}
# Run further analysis here...
}
Related
I'm new in R and coding in general...
I have computed multiple anova analysis on multiple columns (16 in total).
For that purpose, the method "Purr" helped me :
anova_results_5sector <- purrr::map(df_anova_ch[,3:18], ~aov(.x ~ df_anova_ch$Own_5sector))
summary(anova_results_5sector[[1]])
So the dumbest way to retrieve output (p-value, etc) is the following method
summary(anova_results_5sector$Env_Pillar)
summary(anova_results_5sector$Gov_Pillar)
summary(anova_results_5sector$Soc_Pillar)
summary(anova_results_5sector$CSR_Strat)
summary(anova_results_5sector$Comm)
summary(anova_results_5sector$ESG_Comb)
summary(anova_results_5sector$ESG_Contro)
summary(anova_results_5sector$ESG_Score)
summary(anova_results_5sector$Env_Innov)
summary(anova_results_5sector$Human_Ri)
summary(anova_results_5sector$Management)
summary(anova_results_5sector$Prod_Resp)
I've tried to use a loop :
for(i in 1:length(anova_results_5sector)){
summary(anova_results_5sector$[i])
}
It didn't work, I dont know and did not find how to deal with $ in for loop
Here you have a look of the structure of the output vector
Structure of output
I have tried several times with others methods, more or less complicated. Often the examples found online are too simple and does not allow me to adapt to my data.
Any tips ?
Thank you and sorry for such an noobie question
Whenever I use a loop for an analysis I like to store the results in a data.frame, it allows to keep a good overview. Since you did not provide a reproducible example I used the iris dataset:
data("iris")
#make a data frame to store the results with as many columns and rows as you need
anova_results <- data.frame(matrix(ncol = 3, nrow = 3))
#one column per value you want to store and one row per anova you want to run
x <- c("number", "Mean_Sq", "p_value") #assign all values you want to store as column names
colnames(anova_results) <- x
anova_results$number <- 1:3 #assign numers for each annova you want to run, eg. 3
In the loop you can now extract the results of the anova that you are interested in, I use mean squares and p-value as an example, but you can of course add others. Don't forget to add a coulmn for other values you want to add.
for (i in 2:4){
my_anova <- aov(iris[[1]] ~ iris[[i]])
p <- summary(my_anova)[[1]][["Pr(>F)"]][1] #extract the p value
anova_results$p_value[anova_results$number == i-1] <- p
mean <- summary(my_anova)[[1]][["Mean Sq"]][1] #extract the mean quares
anova_results$Mean_Sq[anova_results$number == i-1] <- mean
}
View(anova_results)
I have gotten instructions to do an analysis in R with the vegan package (concerning DCA's).
The instructions on a single dataframe are pretty straightforward, but I would like to apply the analysis on a set of dataframes.
I know this can be done with a for-loop or lapply or sapply, but I have trouble dealing with the fact that each step of the analysis a new extension is added to the name of the dataframe.
An example below
Say I have a dataframe DF, then it goes as follows:
DF.t1 <- decostand(DF, "total")
DF.t2 <- decostand(DF.t1, "max")
DF.t2.dca <- decorana(DF.t2)
DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
names(DF.t2.dca)
summary(DF.t2.dca)
DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
DF.t2.dca.taxonscores <- DF.t2.dca$cproj[ ,1:2]
DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
What I want to achieve is to run several dataframes through this analysis without writing it all out separately.
Let's say I have a set of dataframes called "DF_1", "DF_2" & "DF_3" which I want to do this analysis on.
I probably need to put the dataframes in a list, and get all the steps in a for-loop or one of the apply methods.
But how do I approach the problem with the extensions added (.ra, .t1, .t2, .t2.dca, .t2.dca.DW etc.) to the dataframe names?
Edit: I need to retain the original dataframes after the analysis, in order to do follow-up analysis on them.
Unless you have a very limited amount of data frames, I would not advise to define ca. 8 new objects for each data frame in the global environment as this can become very messy.
One approach you might consider is creating a nested list where the first level is the data frame and the second level are the modified data frames.
# some example data sets
DF1 <- mtcars
DF2 <- mtcars*2
DF3 <- mtcars*3
all_dfs <- list(DF1 = DF1, DF2 = DF2, DF3 =DF3)
some_stuff <- function(df) {
DF.t1 <- decostand(df, "total")
DF.t2 <- decostand(DF.t1, "max")
DF.t2.dca <- decorana(DF.t2)
DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
names(DF.t2.dca)
summary(DF.t2.dca)
DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
DF.t2.dca.taxonscores <- DF.t2.dca$cproj[ ,1:2]
DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
return(list(DF.t1 = DF.t1, DF.t2 = DF.t2,
DF.t2.dca = DF.t2.dca,
DF.t2.dca.DW = DF.t2.dca.DW,
DF.t2.dca.taxonscores = DF.t2.dca.taxonscores,
DF.t2.dca.taxonscores = DF.t2.dca.taxonscores
))
}
nested_list <- lapply(all_dfs, some_stuff)
# To obtain any of the objects for a specific data.frame you could, for example, run
nested_list$DF1$DF.t2.dca.DW
I have two large data frames (500k rows) from two separate sources without a key. Instead of being able to merge using a key, I want to merge the two data frames by matching other columns. Such as age and amount. It is not a perfect match between the two data frames so some values will not match, and I will later simply remove these ones.
The data could look something like this.
So, in the example above I want to be able to create a table matching Key 1 and Key 2. In the picture above we see that XXX1 and YYY3 is a match. So from here I would like to create a data frame like:
[Key 1] [Key 2]
XXX1 YYY3
XXX2 N/A
XXX3 N/A
I know how to do this in Excel but due to the large amount of data, it simply crashes. I want to focus on R but for what it is worth, this is how I built it in Excel (where the idea is that we first do a VLOOKUP, and then uses INDEX as a VLOOKUP for getting the second match if the first one does not match both criteria):
=IF(P2=0;IFNA(VLOOKUP(L2;B:C;2;FALSE);VLOOKUP(L2;G:H;2;FALSE));IF(O2=Q2;INDEX($A$2:$A$378300;SMALL(IF($L2=$B$2:$B378300;ROW($B$2:$B$378300)-ROW($B$2)+1);2));0))
And this is the approach made in R:
for (i in 1:nrow(df)) {
for (j in 1:nrow(df)) {
if (df_1$pc_age[i] == df_2$pp_age[j] && (df_1$amount[i] %in% c(df_2$amount1[j], df_2$amount2[j], df_2$amount3[j]))) {
df_1$Key1[i] = df_2$Key2[j]
} else (df_1$Key1[i] = N/A)
}}
The problem is that this takes way, way to long. Is there a more effective way to map this data as good as possible?
Thanks!
Create dummy columns in both the data frames such as(I can show you for df1) :
for(i in 1:nrow(df1)){
df1$key1 <- paste0("X_",i)
}
Similarly for df2 from Y1....Yn and then join both data frames using "merge" on columns age and amount.
Concatenate Key1 and key2 in a new column in the merged data frame. You will directly get your desired data frame.
could the following code work for you?
# create random data
set.seed(123)
df1 <- data.frame(
key_1=as.factor(paste("xxx",1:100,sep="_")),
age = sample(1:100,100,replace=TRUE),
amount = sample(1:200,100))
df2 <- data.frame(
key_1=paste("yyy",1:500,sep="_"),
age = sample(1:100,500,replace=TRUE),
amount_1 = sample(1:200,500,replace=TRUE),
amount_2 = sample(1:200,500,replace=TRUE),
amount_3 = sample(1:200,500,replace=TRUE))
# ensure at least three fit rows
df2[10,2:3] <- df1[1,2:3]
df2[20,c(2,4)] <- df1[2,2:3]
df2[30,c(2,5)] <- df1[3,2:3]
# define comparrison with df2
comp2df2 <- function(x){
ageComp <- df2$age == as.numeric(x[2])
if(!any(ageComp)){
return(NaN)
}
amountComp <- apply(df2,1,function(a) as.numeric(x[3]) %in% as.numeric(a[3:5]))
if(!any(amountComp)){
return(NaN)
}
matchIdx <- ageComp & amountComp
if(sum(matchIdx) > 1){
warning("multible match detected first match is taken\n")
}
return(which(matchIdx)[1])
}
# run match
matchIdx <- apply(df1,1,comp2df2)
# merge
df_new <- cbind(df1[!is.na(matchIdx),],df2[matchIdx[!is.na(matchIdx)],])
didn't had time to test it on really big data, but this should be faster than your two for loops I guess....
To further speed up things you could delete the
if(sum(matchIdx) > 1){
warning("multible match detected first match is taken\n")
}
lines if you are not worried about a line matches several others.
I'm trying to store p values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but I feel as though my code is really long (and maybe even incorrect) that the same things that can be applied to other for loops can't be applied to mine.
The overview of what I'm trying to do is I'm trying to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop where I am selecting a range of years from a huge data set, and then I am applying a bunch of functions to those given years where I'm trying to narrow down the data for observed pairs and then I'm adding a column for relatedness and transferring those relatedness values for the pairs from another data set. I am then applying another for loop function within this in order to create a data frame with all possible paired birds in that given year and also adding and transferring a column of relatedness values for the pairs. From these two data frames of pairs and relatedness within each year, I want to apply the wilcox test to find the p value for each given year. I want to transfer over these p values into a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
`year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
unq_breeder_list_males <- unique(breeder_list_males)
unq_breeder_list_females <- unique(breeder_list_females)
all_poss_combo <-list()
for(i in unq_breeder_list_males){
print(i)
all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
all_poss_df<-unlist(all_poss_combo, use.names = F)
all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}`
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # example from running ?Wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value #
So with your dataset, perhaps putting this in the bottom of your for loop:
pvalue[j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
all_poss_df_geno$relatedness, alternative='greater')$p.value
I am trying to populate a data frame from within a for loop in R. The names of the columns are generated dynamically within the loop and the value of some of the loop variables is used as the values while populating the data frame. For instance the name of the current column could be some variable name as a string in the loop, and the column can take the value of the current iterator as its value in the data frame.
I tried to create an empty data frame outside the loop, like this
d = data.frame()
But I cant really do anything with it, the moment I try to populate it, I run into an error
d[1] = c(1,2)
Error in `[<-.data.frame`(`*tmp*`, 1, value = c(1, 2)) :
replacement has 2 rows, data has 0
What may be a good way to achieve what I am looking to do. Please let me know if I wasnt clear.
It is often preferable to avoid loops and use vectorized functions. If that is not possible there are two approaches:
Preallocate your data.frame. This is not recommended because indexing is slow for data.frames.
Use another data structure in the loop and transform into a data.frame afterwards. A list is very useful here.
Example to illustrate the general approach:
mylist <- list() #create an empty list
for (i in 1:5) {
vec <- numeric(5) #preallocate a numeric vector
for (j in 1:5) { #fill the vector
vec[j] <- i^j
}
mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix
In this example it is not necessary to use a list, you could preallocate a matrix. However, if you do not know how many iterations your loop will need, you should use a list.
Finally here is a vectorized alternative to the example loop:
outer(1:5,1:5,function(i,j) i^j)
As you see it's simpler and also more efficient.
You could do it like this:
iterations = 10
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
for(i in 1:iterations){
output[i,] <- runif(2)
}
output
and then turn it into a data.frame
output <- data.frame(output)
class(output)
what this does:
create a matrix with rows and columns according to the expected growth
insert 2 random numbers into the matrix
convert this into a dataframe after the loop has finished.
this works too.
df = NULL
for (k in 1:10)
{
x = 1
y = 2
z = 3
df = rbind(df, data.frame(x,y,z))
}
output will look like this
df #enter
x y z #col names
1 2 3
Thanks Notable1, works for me with the tidytextr
Create a dataframe with the name of files in one column and content in other.
diretorio <- "D:/base"
arquivos <- list.files(diretorio, pattern = "*.PDF")
quantidade <- length(arquivos)
#
df = NULL
for (k in 1:quantidade) {
nome = arquivos[k]
print(nome)
Sys.sleep(1)
dados = read_pdf(arquivos[k],ocr = T)
print(dados)
Sys.sleep(1)
df = rbind(df, data.frame(nome,dados))
Sys.sleep(1)
}
Encoding(df$text) <- "UTF-8"
I had a case in where I was needing to use a data frame within a for loop function. In this case, it was the "efficient", however, keep in mind that the database was small and the iterations in the loop were very simple. But maybe the code could be useful for some one with similar conditions.
The for loop purpose was to use the raster extract function along five locations (i.e. 5 Tokio, New York, Sau Paulo, Seul & Mexico city) and each location had their respective raster grids. I had a spatial point database with more than 1000 observations allocated within the 5 different locations and I was needing to extract information from 10 different raster grids (two grids per location). Also, for the subsequent analysis, I was not only needing the raster values but also the unique ID for each observations.
After preparing the spatial data, which included the following tasks:
Import points shapefile with the readOGR function (rgdap package)
Import raster files with the raster function (raster package)
Stack grids from the same location into one file, with the function stack (raster package)
Here the for loop code with the use of a data frame:
1. Add stacked rasters per location into a list
raslist <- list(LOC1,LOC2,LOC3,LOC4,LOC5)
2. Create an empty dataframe, this will be the output file
TB <- data.frame(VAR1=double(),VAR2=double(),ID=character())
3. Set up for loop function
L1 <- seq(1,5,1) # the location ID is a numeric variable with values from 1 to 5
for (i in 1:length(L1)) {
dat=subset(points,LOCATION==i) # select corresponding points for location [i]
t=data.frame(extract(raslist[[i]],dat),dat$ID) # run extract function with points & raster stack for location [i]
names(t)=c("VAR1","VAR2","ID")
TB=rbind(TB,t)
}
was looking for the same and the following may be useful as well.
a <- vector("list", 1)
for(i in 1:3){a[[i]] <- data.frame(x= rnorm(2), y= runif(2))}
a
rbind(a[[1]], a[[2]], a[[3]])