Looping function with left_join over multiple variables - r

I am working to loop a function that contains a left_join iteratively over a dataframe based on multiple variables in R. The function works when I run it line-by-line over the dataframe, but breaks down in the loop. I need to automate this process because I have to run it hundreds of times, but I am getting errors using foreach and mapply.
A portion of the original data set and the original function is this:
library(tidyverse)
ID <- c(22226820,22226820,22226814,22226814)
ID_US_1 <- c(22226830,22226818,22226816,22226832)
mydf <- data.frame(cbind(ID=as.character(ID),ID_US_1=as.character(ID_US_1)))
ID_key <- c(22226830,22226818,22226818,22226816,22226816,22226832,22226832,22226806,22226806,22226814,22226814,22226804)
ID_key_US <- c(0,22226806,22226814,22226804,22226802,22226840,22226842,22226798,22226796,22226816,22226832,22227684)
mykey <- data.frame(cbind(ID_key=as.character(ID_key),ID_key_US=as.character(ID_key_US)))
myfx <- function(iteration_prior,iteration){
# iteration_prior <- "1"
# iteration <- "2"
varnameprior <- paste0("ID_US","_",iteration_prior)
varname <- paste0("ID_US","_",iteration)
colnames(mykey) <- c(varnameprior,varname)
mydf <-mydf %>%
left_join(x=.,y=mykey,by=varnameprior)
mydf[,ncol(mydf)][is.na(mydf[,ncol(mydf)])] <- 0
mydf[,ncol(mydf)]<-as.character(mydf[,ncol(mydf)])
return(mydf)
}
prior <- c(1,2,3)
current <- c(2,3,4)
mylist <- data.frame(cbind(prior=prior,current=current))
mydf <- myfx(prior[1],current[1])
mydf <- myfx(prior[2],current[2])
This creates my desired output, which is iterative columns of data. ID_US_2 is calculated based on ID_US_1 using the mykey dataframe and ID_US_3 is calculated using ID_US_2 and mykey.
I need to carry out this operation hundreds of times, which means I need to automate the process. I have tried a foreach loop and get the error 'Join columns must be present in data'. I think this means the new output is not being added to the data frame correctly. I got the same error with mapply.
library(foreach)
foreach(i=prior,j=current) %do% {myfx(i,j)}
I also considered a nested for loop, but was hung up on the multiple variables (and foreach/mapply seem better suited).

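I think your only issue is that you haven't reassigned mydf inside the foreach call. myfx returns the updated data frame, but the loop discards it, so on the second iteration mydf still has no ID_US_2 column and the join fails. With the reassignment, you have: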
foreach(i=prior, j=current) %do% {mydf <- myfx(i,j)}
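As an alternative sketch, the same iteration can be written as a plain for loop in which the data frame is passed in and returned explicitly instead of being read from the global environment. The function name add_upstream and its arguments are hypothetical; the data are the toy values from the question, and dplyr is assumed to be installed.

```r
library(dplyr)

# Toy data from the question, with IDs stored as character
mydf <- data.frame(
  ID      = c("22226820", "22226820", "22226814", "22226814"),
  ID_US_1 = c("22226830", "22226818", "22226816", "22226832"),
  stringsAsFactors = FALSE
)
mykey <- data.frame(
  ID_key    = c("22226830", "22226818", "22226818", "22226816", "22226816", "22226832",
                "22226832", "22226806", "22226806", "22226814", "22226814", "22226804"),
  ID_key_US = c("0", "22226806", "22226814", "22226804", "22226802", "22226840",
                "22226842", "22226798", "22226796", "22226816", "22226832", "22227684"),
  stringsAsFactors = FALSE
)

# Hypothetical variant of myfx(): df and key are arguments, not globals
add_upstream <- function(df, key, iteration_prior, iteration) {
  varnameprior <- paste0("ID_US_", iteration_prior)
  varname      <- paste0("ID_US_", iteration)
  colnames(key) <- c(varnameprior, varname)
  df <- left_join(df, key, by = varnameprior)
  # Rows without a match get "0", as in the original function
  df[[varname]][is.na(df[[varname]])] <- "0"
  df
}

prior   <- c(1, 2, 3)
current <- c(2, 3, 4)

# Each pass feeds the previous result back in
result <- mydf
for (k in seq_along(prior)) {
  result <- add_upstream(result, mykey, prior[k], current[k])
}
names(result)
```

Because the function no longer depends on a global mydf, forgetting the reassignment shows up immediately as an unchanged result rather than as a join error one iteration later.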

Related

How do I put my results from a loop into a dataframe in r

I am trying to take my data frame that has a list of player id numbers and find their name, using this function. Right now my code will simply print separate tibbles of each result, but I want it to combine those results into a data frame. I tried using rbind, but it doesn't work.
for(x in dataframe$playerid){
print(baseballr::playername_lookup(x)) # prints each result but stores nothing
}
Use sapply, which is more concise than an explicit loop:
results <- data.frame(name = sapply(dataframe[,'playerid'], FUN = function(id) baseballr::playername_lookup(id)))
You can initialise a results data frame like this:
results <- data.frame()
You can then add the results in each iteration using rbind, combining the previous version with the new results. In the first iteration of the loop you add your first results to an empty data frame. Combined:
results <- data.frame()
for(x in dataframe$playerid){
results <- rbind(results, baseballr::playername_lookup(x))
}
The problem in your code was that you simply printed the results without saving them anywhere.
As mentioned in the comment below, the better way to do this, once your data set becomes very large, is to create a list and later combine it into a data.frame.
results <- list()
for(i in seq_len(nrow(dataframe))){
results[[i]] <- baseballr::playername_lookup(dataframe$playerid[i])
}
final_results <- do.call(rbind, results)
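The list-then-combine pattern can also be written with lapply. In this minimal sketch, lookup_name is a hypothetical stand-in for baseballr::playername_lookup so that the example runs without network access; it is assumed to return a one-row data frame per ID.

```r
# Hypothetical stand-in for baseballr::playername_lookup():
# returns a one-row data frame for a given player ID
lookup_name <- function(id) {
  data.frame(playerid = id, name = paste0("player_", id), stringsAsFactors = FALSE)
}

dataframe <- data.frame(playerid = c(101, 102, 103))

# Collect one small data frame per ID in a list, then bind them once at the end
results <- lapply(dataframe$playerid, lookup_name)
final_results <- do.call(rbind, results)
final_results
```

Binding once at the end avoids copying the growing data frame on every iteration, which is what makes the rbind-in-a-loop version slow for large inputs.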

Running a function that renames dataframes per intermediate step, for a list of dataframes

I have gotten instructions to do an analysis in R with the vegan package (concerning DCA's).
The instructions on a single dataframe are pretty straightforward, but I would like to apply the analysis on a set of dataframes.
I know this can be done with a for-loop or lapply or sapply, but I have trouble dealing with the fact that at each step of the analysis a new extension is added to the name of the dataframe.
An example below
Say I have a dataframe DF, then it goes as follows:
DF.t1 <- decostand(DF, "total")
DF.t2 <- decostand(DF.t1, "max")
DF.t2.dca <- decorana(DF.t2)
DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
names(DF.t2.dca)
summary(DF.t2.dca)
DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
DF.t2.dca.taxonscores <- DF.t2.dca$cproj[ ,1:2]
DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
What I want to achieve is to run several dataframes through this analysis without writing it all out separately.
Let's say I have a set of dataframes called "DF_1", "DF_2" & "DF_3" which I want to do this analysis on.
I probably need to put the dataframes in a list, and get all the steps in a for-loop or one of the apply methods.
But how do I approach the problem with the extensions added (.ra, .t1, .t2, .t2.dca, .t2.dca.DW etc.) to the dataframe names?
Edit: I need to retain the original dataframes after the analysis, in order to do follow-up analysis on them.
Unless you have a very limited number of data frames, I would not advise defining ca. 8 new objects for each data frame in the global environment, as this can become very messy.
One approach you might consider is creating a nested list where the first level is the data frame and the second level are the modified data frames.
library(vegan) # provides decostand(), decorana() and scores()
# some example data sets
DF1 <- mtcars
DF2 <- mtcars*2
DF3 <- mtcars*3
all_dfs <- list(DF1 = DF1, DF2 = DF2, DF3 =DF3)
some_stuff <- function(df) {
DF.t1 <- decostand(df, "total")
DF.t2 <- decostand(DF.t1, "max")
DF.t2.dca <- decorana(DF.t2)
DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
names(DF.t2.dca)
summary(DF.t2.dca)
DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
DF.t2.dca.taxonscores <- DF.t2.dca$cproj[ ,1:2]
DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
return(list(DF.t1 = DF.t1, DF.t2 = DF.t2,
DF.t2.dca = DF.t2.dca,
DF.t2.dca.DW = DF.t2.dca.DW,
DF.t2.dca.taxonscores = DF.t2.dca.taxonscores,
DF.t2.dca.samplescores = DF.t2.dca.samplescores
))
}
nested_list <- lapply(all_dfs, some_stuff)
# To obtain any of the objects for a specific data.frame you could, for example, run
nested_list$DF1$DF.t2.dca.DW
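To pull the same intermediate result out of every data frame at once, the nested list can be indexed with lapply. The sketch below replaces the vegan steps with a hypothetical two-step stand-in (row totals, then column maxima, roughly analogous to the "total" and "max" standardizations) so that it runs with base R only; the names t1 and t2 are illustrative.

```r
all_dfs <- list(DF1 = mtcars, DF2 = mtcars * 2)

# Hypothetical stand-in for the vegan pipeline, using base R only
some_stuff <- function(df) {
  m  <- as.matrix(df)
  t1 <- m / rowSums(m)                        # divide each row by its total
  t2 <- sweep(t1, 2, apply(t1, 2, max), "/")  # divide each column by its maximum
  list(t1 = t1, t2 = t2)
}

nested_list <- lapply(all_dfs, some_stuff)

# Extract the "t2" element for every data frame in one call
all_t2 <- lapply(nested_list, `[[`, "t2")
names(all_t2)
```

The original data frames in all_dfs are left untouched, so they remain available for follow-up analysis.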

How to create and add new columns to a dataframe in R within a loop?

I want to create a new dataframe and keep adding variables to it within a for loop in R. Here is pseudocode for what I want:
X <- 100
for(i in 1:X)
{
#do some processing and store it in a variable named "temp_var"
temp_var[,i] <- z
#make it as a dataframe and keep adding the new variables until the loop completes
}
After completion of the above loop the vector "temp_var" by itself should be a dataframe containing 100 variables.
I would avoid using for loops as much as possible in R. If you know your columns beforehand, lapply is what you want to use.
However, if a loop is required by your program, you can do something like this:
X <- 100
tmp_list <- list()
for (i in 1:X) {
tmp_list[[i]] <- your_input_column
}
data <- data.frame(tmp_list)
names(data) <- paste0("col", 1:X)
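For comparison, here is a minimal sketch of the lapply version mentioned above. The column contents (multiples of i) and the sizes X and n_rows are placeholders chosen only so the example runs.

```r
X <- 5        # number of columns (placeholder)
n_rows <- 4   # number of rows (placeholder)

# Build each column in turn; the values here are illustrative only
cols <- lapply(seq_len(X), function(i) seq_len(n_rows) * i)
names(cols) <- paste0("col", seq_len(X))

# A named list of equal-length vectors converts directly to a data frame
data <- as.data.frame(cols)
data
```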

Create data frame and change variable each iteration in for loop or apply

I'm a bit new to R and this site has been an amazing help to me in answering a lot of questions. However, I’ve come across a recent problem and have exhausted all options to find a solution on my own and am in need of some help.
I am trying to write a code where I create multiple data frames (or matrices) INSIDE the loop and loop it 5000 times. On each loop I would like the variable to change so I can retrieve the data for each loop at a later point.
Also, I would like to be able to repeat this method for other data frames and in creating these new data frames, it draws upon other data frames based on the iteration it is on.
I have tried to find a solution to this and it seems that it could be either the for loop or apply function, but I am not sure as to how I could execute it. As an example of what I would like to see:
for (i in 1:10) {
df.a[i] <- data.frame (…information...)
df.b[i] <- data.frame (...information...)
df.c[i] <- data.frame (new.col.A=df.a[i]$column1, new.col.B=df.b[i]$column2)
}
Then, after having run the loop, if I were to write df.c3 I would find the data frame created in the loop on the third iteration which has data from iteration 3 in df.a and df.b.
The ‘closest’ I have come to getting what I thought I needed was by doing this:
df.a = seq (1, 10, by=1)
df.b = seq (1,10, by=1)
df.c = seq (1,10, by=1)
for (i in 1:10) {
df.a[[i]] <- data.frame (...information)
...
}
But this typically results in an error of: "number of items to replace is not a multiple of replacement length".
So I'm not sure what else I could do and really hope someone is able to help out.
Create objects df.x as empty lists:
df.a <- list()
df.b <- list()
df.c <- list()
Then access (and write to) individual dataframes using double square brackets:
for (i in 1:10) {
df.a[[i]] <- data.frame(...)
df.b[[i]] <- data.frame(...)
df.c[[i]] <- data.frame(new.col.A=df.a[[i]]$column1, new.col.B=df.b[[i]]$column2)
}
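A runnable version of the same idea, with small placeholder data frames in place of the elided contents (the values and the loop length n are assumptions made for illustration):

```r
n <- 3

# One list per family of data frames, preallocated to the loop length
df.a <- vector("list", n)
df.b <- vector("list", n)
df.c <- vector("list", n)

for (i in seq_len(n)) {
  df.a[[i]] <- data.frame(column1 = i * 1:2)   # placeholder data
  df.b[[i]] <- data.frame(column2 = i + 1:2)   # placeholder data
  # Combine columns from the data frames created in this iteration
  df.c[[i]] <- data.frame(new.col.A = df.a[[i]]$column1,
                          new.col.B = df.b[[i]]$column2)
}

# The data frame from the third iteration
df.c[[3]]
```

Single brackets (df.c[3]) would return a list of length one, which is why double brackets are needed to get the data frame itself.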

Populating a data frame in R in a loop

I am trying to populate a data frame from within a for loop in R. The names of the columns are generated dynamically within the loop and the value of some of the loop variables is used as the values while populating the data frame. For instance the name of the current column could be some variable name as a string in the loop, and the column can take the value of the current iterator as its value in the data frame.
I tried to create an empty data frame outside the loop, like this
d = data.frame()
But I can't really do anything with it; the moment I try to populate it, I run into an error:
d[1] = c(1,2)
Error in `[<-.data.frame`(`*tmp*`, 1, value = c(1, 2)) :
replacement has 2 rows, data has 0
What would be a good way to achieve what I am looking to do? Please let me know if I wasn't clear.
It is often preferable to avoid loops and use vectorized functions. If that is not possible there are two approaches:
1. Preallocate your data.frame. This is not recommended because indexing is slow for data.frames.
2. Use another data structure in the loop and transform into a data.frame afterwards. A list is very useful here.
Example to illustrate the general approach:
mylist <- list() #create an empty list
for (i in 1:5) {
vec <- numeric(5) #preallocate a numeric vector
for (j in 1:5) { #fill the vector
vec[j] <- i^j
}
mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix
In this example it is not necessary to use a list, you could preallocate a matrix. However, if you do not know how many iterations your loop will need, you should use a list.
Finally here is a vectorized alternative to the example loop:
outer(1:5,1:5,function(i,j) i^j)
As you see it's simpler and also more efficient.
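For the case in the question, where the column names themselves are generated inside the loop, the list approach can be sketched as follows. The naming scheme var_<i> and the column length are assumptions made for illustration.

```r
cols <- list()   # collect columns here instead of writing into a data frame

for (i in 1:3) {
  col_name <- paste0("var_", i)   # column name built dynamically (hypothetical scheme)
  cols[[col_name]] <- rep(i, 2)   # the iterator supplies the column's values
}

# Convert once, after the loop has finished
d <- as.data.frame(cols)
d
```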
You could do it like this:
iterations = 10
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
for(i in 1:iterations){
output[i,] <- runif(2)
}
output
and then turn it into a data.frame
output <- data.frame(output)
class(output)
What this does:
1. Create a matrix with rows and columns according to the expected growth.
2. Insert 2 random numbers into each row of the matrix.
3. Convert this into a dataframe after the loop has finished.
This works too.
df = NULL
for (k in 1:10)
{
x = 1
y = 2
z = 3
df = rbind(df, data.frame(x,y,z))
}
The output will look like this, with the same row repeated once per iteration:
df #enter
x y z #col names
1 2 3
Thanks Notable1, works for me with the tidytextr
Create a dataframe with the name of files in one column and content in other.
diretorio <- "D:/base"
arquivos <- list.files(diretorio, pattern = "*.PDF")
quantidade <- length(arquivos)
#
df = NULL
for (k in 1:quantidade) {
nome = arquivos[k]
print(nome)
Sys.sleep(1)
dados = read_pdf(arquivos[k],ocr = T)
print(dados)
Sys.sleep(1)
df = rbind(df, data.frame(nome,dados))
Sys.sleep(1)
}
Encoding(df$text) <- "UTF-8"
I had a case where I needed to use a data frame within a for loop. In this case it was efficient enough; however, keep in mind that the database was small and the iterations in the loop were very simple. But maybe the code could be useful for someone with similar conditions.
The purpose of the for loop was to use the raster extract function across five locations (i.e. Tokyo, New York, São Paulo, Seoul and Mexico City), each of which had its own raster grids. I had a spatial point database with more than 1000 observations distributed across the 5 locations, and I needed to extract information from 10 different raster grids (two grids per location). For the subsequent analysis, I needed not only the raster values but also the unique ID of each observation.
Preparing the spatial data included the following tasks:
Import points shapefile with the readOGR function (rgdal package)
Import raster files with the raster function (raster package)
Stack grids from the same location into one file, with the function stack (raster package)
Here is the for loop code, which uses a data frame:
1. Add stacked rasters per location into a list
raslist <- list(LOC1,LOC2,LOC3,LOC4,LOC5)
2. Create an empty dataframe, this will be the output file
TB <- data.frame(VAR1=double(),VAR2=double(),ID=character())
3. Set up for loop function
L1 <- seq(1,5,1) # the location ID is a numeric variable with values from 1 to 5
for (i in 1:length(L1)) {
dat=subset(points,LOCATION==i) # select corresponding points for location [i]
t=data.frame(extract(raslist[[i]],dat),dat$ID) # run extract function with points & raster stack for location [i]
names(t)=c("VAR1","VAR2","ID")
TB=rbind(TB,t)
}
I was looking for the same, and the following may be useful as well.
a <- vector("list", 1)
for(i in 1:3){a[[i]] <- data.frame(x= rnorm(2), y= runif(2))}
a
rbind(a[[1]], a[[2]], a[[3]])
