Saving dataframes and variables' names within for loop - r

I am trying to use a for loop to save dataframes and variable names on the way.
I have a data frame called regionmap. One of its variables (Var3) can take thousands of different values, among which there are 15 of this form:
"RegionMap *" where * is one of the values of the vector regions:
regions <- c("A", "B"........"Z")
I need to run a loop which selects the rows in which each of these values appears, saves those rows as a new data frame, transforms the relative frequency into a dummy and then merges the new data frame with a bigger one aimed at collecting all of these.
The following code works. I just wanted to know whether it is possible to run it 15 times, substituting every "A" (both as strings to select and as names of data frames and variables) with the other elements of regions, as in a for loop. (A standard for loop does not work.)
A <- regionmap[grep("RegionMap A", regionmap$Var3), ]
A$Freq[A$Freq > 1] <- 1
A$Var3 <- NULL
colnames(A) <- c( "name", "date", "RegionMap A")
access_panel <- merge(access_panel, A,by=c("name", "date"))

You don't need to name the variables differently if you are merging everything together anyway - just the column names. Something like this should do the trick...
regions <- c("A", "B"........"Z")
for(x in regions){
  mapname <- paste("RegionMap",x,sep=" ") #this is all that needs to change each time
  A <- regionmap[grep(mapname, regionmap$Var3), ]
  A$Freq[A$Freq > 1] <- 1
  A$Var3 <- NULL
  colnames(A) <- c( "name", "date", mapname)
  if(x=="A") {
    access_panel <- A #first one has nothing to merge into
  } else {
    access_panel <- merge(access_panel, A ,by=c("name", "date"))
  }
}
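The same idea can be sketched without the special case for the first region by building every piece with lapply() and merging them with Reduce(). This is only a sketch: the toy regionmap below and the helper name make_dummy are assumptions for illustration, not part of the original data. It also passes fixed = TRUE to grep() so the pattern is matched literally.

```r
# Toy stand-in for the question's `regionmap` (hypothetical values).
regionmap <- data.frame(
  name = c("n1", "n2", "n1", "n2"),
  date = c("d1", "d1", "d1", "d1"),
  Var3 = c("RegionMap A", "RegionMap A", "RegionMap B", "RegionMap B"),
  Freq = c(3, 0, 1, 2)
)
regions <- c("A", "B")

# Hypothetical helper: builds the dummy data frame for one region.
make_dummy <- function(x) {
  mapname <- paste("RegionMap", x)
  # fixed = TRUE matches the literal string rather than a regular expression
  out <- regionmap[grep(mapname, regionmap$Var3, fixed = TRUE), ]
  out$Freq[out$Freq > 1] <- 1
  out$Var3 <- NULL
  colnames(out) <- c("name", "date", mapname)
  out
}
pieces <- lapply(regions, make_dummy)

# Merge the pieces pairwise on the shared key columns.
access_panel <- Reduce(function(a, b) merge(a, b, by = c("name", "date")), pieces)
print(access_panel)
```

Because Reduce() starts from the first piece, no if/else is needed inside the loop body.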

Related

Iterate over names to create new columns and get data with same names in R

I'm having some issues trying to do what I think is quite simple.
I have a list of names:
the_names <- c("X1", "X2")
I want to use these names as column names in a new data frame and use these names to pull data from another data frame.
This list of names is going to be of varying length depending on the sample. So, it will not always be of length 2 (X1, X2).
I'm trying to do something like this:
pair_meta <- data.frame()
for(i in the_names) {
# create a column using the name from the list
# reference name to get data from other data frames (bed_A)
pair_meta[[i]] <- bed_A[i]
}
Where the iterator, i, is X1 then X2 (then X3 etc. if the list of names is longer).
I am having trouble getting the column names of the data frame to match the list of input names, and I am having trouble using the same name to gather data from another file.
For more context, here is the bit of code I am working on:
pair <- data.frame(
chr1 = rep( bed_A[anchor,"chr" ] , length(tail_entry:head_entry) ),
start1 = rep( bed_A[anchor,"start"] , length(tail_entry:head_entry) ),
end1 = rep( bed_A[anchor,"end" ] , length(tail_entry:head_entry) ),
chr2 = bed_B[tail_entry:head_entry,"chr" ],
start2 = bed_B[tail_entry:head_entry,"start"],
end2 = bed_B[tail_entry:head_entry,"end" ]
)
In this "pair" data frame, there are 6 columns with hard-coded names (chr1, start1, end1, chr2, start2, end2). In this table, there will always be 6 columns with these names.
I am trying to create an additional table that stores metadata associated with this data. There may be any number of metadata columns. I have already collected a list of the metadata columns and named them X1 through Xn and stored them as "the_names".
So, what I am wondering, is how to create a table of metadata that matches the pair data. I want to do this:
for(i in the_names) {
pair_meta[[i]] <- rep( bed_A[anchor, i],length(tail_entry:head_entry))
}
So that the resulting pair_meta dataframe would have i number of columns, named using the list of names, and reference some other data in another data frame.
For example, how can I do this:
chr1 = rep( bed_A[anchor,"chr" ] , length(tail_entry:head_entry) )
without a hard-coded variable name, and instead with a flexible placeholder?
X1 = rep( bed_A[anchor,"X1" ] , length(tail_entry:head_entry) )
and
X2 = rep( bed_A[anchor,"X2" ] , length(tail_entry:head_entry) )
etc.
Is there a way to do this with a loop and an iterator? something like:
i = rep( bed_A[anchor,"i" ] , length(tail_entry:head_entry) )
etc.
I hope these edits clarify my issue here. I wanted to include a reproducible example so I made some simple data, but I think that confused everyone trying to help.
I've tried everything I've posted here, sprintf, and apply functions. Please help!
Hi there, I hope this helps.
the_names <- c("X1", "X2") # issue 1: each string in the vector must be quoted individually
bed_A <- data.frame(X1 = c("A", "B", "C", "D", "E"), X2 = rep("E", 5))
n_row <- nrow(bed_A)
pair_meta <- data.frame( # issue 2: row length mismatch; start with a column whose length matches bed_A
  id = 1:n_row # use = (not <-) to name a column inside data.frame()
)
for (i in the_names) {
  pair_meta <- cbind(pair_meta, bed_A[i]) # one way to add a column to a data frame
}
print(pair_meta)
# you could also select the columns directly and avoid making an unnecessary id column
pair_meta <- bed_A[the_names]
print(pair_meta)
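For the repeated-anchor case described in the question, one possible sketch builds a named list with lapply() and converts it to a data frame in one step, so the column names come straight from the_names. The values of bed_A, anchor, tail_entry and head_entry below are made up for illustration.

```r
# Hypothetical stand-ins for the question's objects.
bed_A <- data.frame(chr = c("chr1", "chr2"), X1 = c("a", "b"), X2 = c(10, 20))
the_names <- c("X1", "X2")
anchor <- 2
tail_entry <- 3
head_entry <- 5
n <- length(tail_entry:head_entry)

# One repeated column per name; setNames() makes lapply() return a named list.
cols <- lapply(setNames(the_names, the_names), function(nm) rep(bed_A[anchor, nm], n))
pair_meta <- as.data.frame(cols)
print(pair_meta)
```

Here nm is a character string, so bed_A[anchor, nm] looks up the column whose name is stored in nm, which is what the hard-coded "X1" and "X2" versions do by hand.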

How can lapply work with addressing columns as unknown variables?

So, I have a list of strings named control_for. I have a data frame sampleTable with some of the columns named as strings from control_for list. And I have a third object dge_obj (DGElist object) where I want to append those columns. What I wanted to do - use lapply to loop through control_for list, and for each string, find a column in sampleTable with the same name, and then add that column (as a factor) to a DGElist object. For example, for doing it manually with just one string, it looks like this, and it works:
group <- as.factor(sampleTable[,3])
dge_obj$samples$group <- group
And I tried something like this:
lapply(control_for, function(x) {
x <- as.factor(sampleTable[, x])
dge_obj$samples$x <- x
})
Which doesn't work. I guess the problem is that R can't recognize addressing columns like this. Can someone help?
Here are two base R ways of doing it. The data set is the example from help("DGEList"), plus a mock-up data.frame sampleTable.
Define a vector common_vars holding the names in control_for that are also columns of sampleTable. Then create the new columns.
library(edgeR)
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")
common_vars <- intersect(control_for, names(sampleTable))
1. for loop
for(x in common_vars){
  y <- sampleTable[[x]]
  dge_obj$samples[[x]] <- factor(y)
}
2. *apply loop. lapply keeps each column a factor; sapply would simplify the result to a character matrix.
tmp <- lapply(sampleTable[common_vars], factor)
dge_obj$samples[common_vars] <- tmp
This code can be rewritten as a one-liner.
Data
set.seed(2021)
y <- matrix(rnbinom(10000,mu=5,size=2),ncol=4)
dge_obj <- DGEList(counts=y, group=rep(1:2,each=2))
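The reason the attempt in the question fails can be seen on a plain data frame, without edgeR: the $ operator takes the name after it literally, while [[ ]] evaluates the variable. In addition, assignments made inside the function passed to lapply() only change a local copy. The sketch below uses a made-up samples data frame as a stand-in for dge_obj$samples.

```r
# Hypothetical stand-ins for dge_obj$samples and sampleTable.
samples <- data.frame(id = 1:3)
sampleTable <- data.frame(batch = c("u", "v", "u"))
x <- "batch"

samples$x <- factor(sampleTable[[x]])     # creates a column literally named "x"
samples[[x]] <- factor(sampleTable[[x]])  # creates a column named "batch"
print(names(samples))
```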

How to add many data frame columns efficiently in R

I need to add several thousand columns to a data frame. Currently, I have a list of 93 lists, where each of the embedded lists contains 4 data frames, each with 19 variables. I want to add each column of all those data frames to an outside file. My code looks like:
vars <- c('tmin_F','tavg_F','tmax_F','pp','etr_grass','etr_alfalfa','vpd','rhmin','rhmax','dtr_F','us','shum','pp_def_grass','pp_def_alfalfa','rw_tot','fdd28_F0','fdd32_F0','fdd35_F0',
'fdd356_F0','fdd36_F0','fdd38_F0','fdd39_F0','fdd392_F0','fdd40_F0','fdd41_F0','fdd44_F0','fdd45_F0','fdd464_F0','fdd48_F0','fdd50_F0','fdd52_F0','fdd536_F0','fdd55_F0',
'fdd57_F0','fdd59_F0','fdd60_F0','fdd65_F0','fdd70_F0','fdd72_F0','hdd40_F0','hdd45_F0','hdd50_F0','hdd55_F0','hdd57_F0','hdd60_F0','hdd65_F0','hdd45_F0',
'cdd45_F0','cdd50_F0','cdd55_F0','cdd57_F0','cdd60_F0','cdd65_F0','cdd70_F0','cdd72_F0',
'gdd32_F0','gdd35_F0','gdd356_F0','gdd38_F0','gdd39_F0','gdd392_F0','gdd40_F0','gdd41_F0','gdd44_F0','gdd45_F0',
'gdd464_F0','gdd48_F0','gdd50_F0','gdd52_F0','gdd536_F0','gdd55_F0','gdd57_F0','gdd59_F0','gdd60_F0','gdd65_F0','gdd70_F0','gdd72_F0',
'gddmod_32_59_F0','gddmod_32_788_F0','gddmod_356_788_F0','gddmod_392_86_F0','gddmod_41_86_F0','gddmod_464_86_F0','gddmod_48_86_F0','gddmod_50_86_F0','gddmod_536_95_F0',
'sdd77_F0','sdd86_F0','sdd95_F0','sdd97_F0','sdd99_F0','sdd104_F0','sdd113_F0')
windows <- c(15,15,15,29,29,29,15,15,15,15,29,29,29,29,15,rep(15,78))
perc_list <- c('obs','smoothed_obs','windowed_obs','smoothed_windowed_obs')
percs <- c('00','02','05','10','20','25','30','33','40','50','60','66','70','75','80','90','95','98','100')
vcols <- seq(1,19,1)
for (v in 1:93){
  for (pl in 1:4){
    for (p in 1:19){
      normals_1981_2010 <- normals_1981_2010 %>% mutate(!!paste0(vars[v],'_daily',perc_list[pl],'_perc',percs[p]) := percents[[v]][[pl]][,vcols[p]])
    }
  }
  print(v)
}
The code starts fast, but very quickly slows to a crawl as the outside data frame grows in size. I didn't realize this would be a problem. How do I add all these extra columns efficiently? Is there a better way to do this than by using mutate? I've tried add_column, but that does not work. Maybe it doesn't like the loop or something.
Your example is not reproducible as is (the object normals_1981_2010 doesn't exist but is called within the loop), so I am unsure I understood your question.
If I did though, this should work:
First, I am reproducing your dataset structure, except with 5 lists instead of 93, 3 nested tables in each instead of 4, and 3 columns per table instead of 19.
df_list <- vector("list", 5) # Create an empty list vector, then fill it in.
for(i in 1:5) {
  df_list[[i]] <- vector("list", 3)
  for(j in 1:3) {
    df_list[[i]][[j]] <- data.frame(a = 1:12,
                                    b = letters[1:12],
                                    c = month.abb[1:12])
    colnames(df_list[[i]][[j]]) <- paste0(colnames(df_list[[i]][[j]]), "_nest_", i, "subnest_", j)
  }
}
df_list # preview the structure.
Then, answering your question:
# Now, how to bind everything together:
library(dplyr) # provides bind_cols()
df_out <- vector("list", 5)
for(i in 1:5) {
  df_out[[i]] <- bind_cols(df_list[[i]])
}
# Final step
df_out <- bind_cols(df_out)
ncol(df_out) # Here I have 5*3*3 = 45 columns, but you will have 93*4*19 = 7068 columns
# [1] 45
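A base R variant of the same idea, sketched under the assumption that the nested list is laid out like the question's percents object: flatten one level, bind all columns in a single call, then generate the column names in one vectorised step. The small percents, vars, perc_list and percs below are shortened stand-ins, not the original data.

```r
# Small hypothetical stand-in: 2 outer entries, 2 data frames each, 3 columns each.
percents <- lapply(1:2, function(v) {
  lapply(1:2, function(pl) data.frame(p00 = 1:4, p50 = 5:8, p100 = 9:12))
})
vars <- c("tmin_F", "pp")
perc_list <- c("obs", "smoothed_obs")
percs <- c("00", "50", "100")

# Flatten one level so every data frame sits in a single list, then bind once.
flat <- unlist(percents, recursive = FALSE)
wide <- do.call(cbind, flat)

# expand.grid() varies its first argument fastest, which matches the column
# order after binding: p changes fastest, then pl, then v.
grid <- expand.grid(p = percs, pl = perc_list, v = vars, stringsAsFactors = FALSE)
colnames(wide) <- paste0(grid$v, "_daily", grid$pl, "_perc", grid$p)
print(colnames(wide))
```

The result could then be attached to the outside data frame in one step, for example with cbind(normals_1981_2010, wide), assuming both have the same number of rows. Binding once avoids copying the growing data frame on every iteration.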

Replace value from dataframe column with value from keyvalue lookup

I want to replace certain values in a data frame column with values from a lookup table. I have the values in a list, stuff.kv, and many values are stored in the list (but some may not be).
stuff.kv <- list()
stuff.kv[["one"]] <- "thing"
stuff.kv[["two"]] <- "another"
#etc
I have a dataframe, df, which has multiple columns (say 20), with assorted names. I want to replace the contents of the column named 'stuff' with values from 'lookup'.
I have tried building various apply methods, but nothing has worked.
I built a function which processes a list of items and returns the mutated list,
stuff.lookup <- function(x) {
for( n in 1:length(x) ) {
if( !is.null( stuff.kv[[x[n]]] ) ) x[n] <- stuff.kv[[x[n]]]
}
return( x )
}
unlist(lapply(df$stuff, stuff.lookup))
The apply syntax is bedeviling me.
Since you made such a nice lookup table, you can just use it to change the values. No loops or apply needed.
## Sample Data
set.seed(1234)
DF = data.frame(stuff = sample(c("one", "two"), 8, replace=TRUE))
## Make the change
DF$stuff = unlist(stuff.kv[DF$stuff])
DF
stuff
1 thing
2 another
3 another
4 another
5 another
6 another
7 thing
8 thing
Below is a more general solution building on @G5W's answer, which doesn't cover the case where your original data frame has values that don't exist in the lookup table (that case results in a length mismatch error):
library(dplyr)
stuff.kv <- list(one = "another", two = "thing")
df <- tibble( # tibble() replaces the deprecated data_frame()
  stuff = rep(c("one", "two", "three"), each = 3)
)
df <- df %>%
  mutate(stuff = paste(stuff.kv[stuff])) # note: values missing from stuff.kv become the string "NULL"
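If values without a lookup entry should stay unchanged, as the function in the question does, one possible base R sketch is to replace only the rows whose value appears among the lookup names. The three-row df here is a made-up example.

```r
stuff.kv <- list(one = "thing", two = "another")
df <- data.frame(stuff = c("one", "two", "three"), stringsAsFactors = FALSE)

lookup <- unlist(stuff.kv)          # named character vector
hit <- df$stuff %in% names(lookup)  # TRUE where a replacement exists
df$stuff[hit] <- lookup[df$stuff[hit]]
print(df)
```

Only the matched positions are assigned, so the lengths always agree and "three" is left as it was.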

Building forvalues loops in R

[Working with R 3.2.2]
I have three data frames with the same variables. I need to modify the value of some variables and change the name of the variables (rename the columns). Instead of doing this data frame by data frame, I would like to use a loop.
This is the code I want to run:
#Change the values of the variables
vlist <- c("var1", "var2", "var3")
dataframe0[,vlist] <- dataframe0[,vlist]/10
dataframe1[,vlist] <- dataframe1[,vlist]/10
dataframe2[,vlist] <- dataframe2[,vlist]/10
#Change the name of the variables
colnames(dataframe0)[colnames(dataframe0)=="var1"] <- "temp_min"
colnames(dataframe0)[colnames(dataframe0)=="var2"] <- "temp_max"
colnames(dataframe0)[colnames(dataframe0)=="var3"] <- "prep"
colnames(dataframe1)[colnames(dataframe1)=="var1"] <- "temp_min"
colnames(dataframe1)[colnames(dataframe1)=="var2"] <- "temp_max"
colnames(dataframe1)[colnames(dataframe1)=="var3"] <- "prep"
colnames(dataframe2)[colnames(dataframe2)=="var1"] <- "temp_min"
colnames(dataframe2)[colnames(dataframe2)=="var2"] <- "temp_max"
colnames(dataframe2)[colnames(dataframe2)=="var3"] <- "prep"
I know the logic to do it with programs like Stata, with a forvalues loop:
#Change the values of the variables
forvalues i=0/2 {
dataframe`i'[,vlist] <- dataframe`i'[,vlist]/10
#Change the name of the variables
colnames(dataframe`i')[colnames(dataframe`i')=="var1"] <- "temp_min"
colnames(dataframe`i')[colnames(dataframe`i')=="var2"] <- "temp_max"
colnames(dataframe`i')[colnames(dataframe`i')=="var3"] <- "prep"
}
But, I am not able to reproduce it in R. How should I proceed? Thanks in advance!
I would work with a list of data frames; you can still 'split' it afterwards if really needed:
df1 <- data.frame("id"=1:10,"var1"=11:20,"var2"=11:20,"var3"=11:20,"test"=1:10)
df2 <- df1
df3 <- df1
dflist <- list(df1,df2,df3)
for (i in seq_along(dflist)) {
  dflist[[i]]['test'] <- dflist[[i]]['test']/10
  colnames( dflist[[i]] )[ colnames(dflist[[i]]) %in% c('var1','var2','var3') ] <- c('temp_min','temp_max','prep')
  # optionally reassign df1-3 to their list value:
  # assign(paste0("df",i),dflist[[i]])
}
The advantage of using a list is that you can access the data frames more easily in a programmatic way.
I changed your code from three calls to one: colnames() returns a vector, so you can subset it and replace in one pass. This assumes var1 to var3 are always in the same order.
Addendum: if you want a single dataset at the end, you can use do.call(rbind,dflist) or, with the data.table package, rbindlist(dflist).
More details on working with lists of data.frames in Gregor's answer here
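A variant that does not depend on the columns being in the same order can be sketched with lapply() and a named vector mapping old names to new ones. The two small data frames and the name new_names are assumptions for illustration; the division by 10 is applied to var1 to var3 as in the question.

```r
# Hypothetical stand-ins for dataframe0 and dataframe1.
dflist <- list(
  df0 = data.frame(id = 1:2, var1 = c(10, 20), var2 = c(30, 40), var3 = c(50, 60)),
  df1 = data.frame(id = 1:2, var1 = c(15, 25), var2 = c(35, 45), var3 = c(55, 65))
)
# Old name -> new name.
new_names <- c(var1 = "temp_min", var2 = "temp_max", var3 = "prep")

dflist <- lapply(dflist, function(d) {
  vlist <- names(new_names)
  d[vlist] <- d[vlist] / 10
  # match() finds the position of each old name, so column order does not matter
  idx <- match(vlist, colnames(d))
  colnames(d)[idx] <- unname(new_names)
  d
})
print(dflist$df0)
```

lapply() returns a new list, so the result has to be assigned back to dflist as shown.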
