Sum a variable across dataframes by an ID variable - r

There are 3 data frames. The ID variable is in the 12th column of each data frame. I created a vector list_cc_q1 that contains all the unique IDs across all data frames (hence each entry in this vector appears in the 12th column of at least one data frame).
I wish to create a vector v1 that adds, for each ID, the values in the 7th column from each data frame which contains that ID (hence v1 would be of the same length as list_cc_q1). Here's the code I'm using:
f1 <- function(x,y){
ifelse(length(get(y)[which(get(y)[x,12]),7])>0, get(y)[which(get(y)[x,12]),7], 0)}
g1 <- function(x){sum(sapply(ls()[1:3], function(y){ f1(x,y)}))}
v1 <- sapply(list_cc_q1, function(z){ g1(z) })
This returns the following error:
Error in get(y)[x, 12] : incorrect number of dimensions
Called from: which(get(y)[x, 12])
I think I've overcomplicated the code, a simpler method will be immensely helpful.
But why doesn't this work?

Not sure I understand correctly, but how about:
library(data.table)
dt <- data.table(value = c(df1[[7]],df2[[7]],df3[[7]]), id = c(df1[[12]],df2[[12]],df3[[12]]))
dt[, .(sum = sum(value)), by = id]
This concatenates the 7th column of each of the three data.frames (df1, df2, df3) to a value column and the 12th column of each of the data.frames (df1, df2, df3) to an id column to form a data.table with two columns (value and id). It then sums the value column by the id column.
EDIT: Your code might not work because of the
ls()[1:3]
The ls() call is evaluated inside the function's environment, which does not contain your three data.frames, if I see this correctly. You can see this by comparing the following:
ls()[1:3]
# [1] "df1" "df2" "df3"
function_ls <- function(){cat(ls()[1:3])}
function_ls()
# NA NA NA
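For comparison, here is a minimal base-R sketch of the same stack-then-sum idea, without data.table. The toy data frames and the make_df helper are hypothetical stand-ins (only columns 7 and 12 carry meaning), so adapt them to your real data:

```r
# Hypothetical helper: builds a 12-column data frame where column 7 holds
# the values and column 12 holds the IDs (all other columns are filler).
make_df <- function(ids, values) {
  df <- as.data.frame(matrix(0, nrow = length(ids), ncol = 12))
  df[[7]] <- values
  df[[12]] <- ids
  df
}

df1 <- make_df(c("a", "b"), c(1, 2))
df2 <- make_df(c("b", "c"), c(10, 20))
df3 <- make_df(c("a", "c"), c(100, 200))

# Pass the data frames explicitly instead of looking them up with ls()/get()
dfs <- list(df1, df2, df3)

# Stack the ID and value columns of every data frame into one long table
stacked <- do.call(rbind, lapply(dfs, function(d) {
  data.frame(id = d[[12]], value = d[[7]], stringsAsFactors = FALSE)
}))

# Sum the values per ID, then order the sums like the vector of unique IDs
sums <- tapply(stacked$value, stacked$id, sum)
list_cc_q1 <- unique(stacked$id)
v1 <- as.vector(sums[list_cc_q1])
```

Passing the data frames in a list avoids depending on what ls() happens to return in a given environment.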

Related

How to replace several variables with several variables from another dataframe in R using a loop?

I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see, the variable names are the same in both data frames, but the numeric values differ slightly: the correct values are in df2 and need to end up in df1. I need to do this for a great many variables in a complex data set and wonder whether someone could suggest a more efficient way to code this (possibly without using column references).
Here is some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help but I will try and summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their orders are already matching, then you can achieve this really easily by the following subset notation:
df1[, c({column names df1})] <- df2[, c({column names df2}), drop = FALSE]
Let's say that df1 has columns a, b, and c and you want to replace b and c with two columns of df2, whose columns are x, y, z.
df1[, c("b","c")] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop argument on the right-hand side is just added protection to ensure that subsetting df2 returns a data.frame rather than a vector. It is left off the left-hand side because data.frame assignment does not accept a drop argument.
If you do NOT know the order is correct or one data frame may have a differing size than the other BUT there is a unique identifier between the two data.frames - then I would personally use a function that is designed for merging two data frames. Depending on your preference you can use merge from base or use *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))
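If the column names really are identical in both data frames and the rows are already aligned, a sketch along these lines avoids typing any column names at all. It reuses the question's example data; the extra keep column is a hypothetical addition to show that columns missing from df2 are left alone:

```r
# Example data from the question, plus a hypothetical column that only df1 has
df1 <- data.frame(var1 = 1:10, var2 = 1:10, keep = letters[1:10])
df2 <- data.frame(var1 = 11:20, var2 = 11:20)

# Columns present in both data frames
common <- intersect(names(df1), names(df2))

# Overwrite those columns in df1 with the versions from df2
# (assumes both data frames have the same number of rows in the same order)
df1[common] <- df2[common]
```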

How to delete a row and fill in a column in multiple data.frames at once?

I have over 10 data.frames with the same columns. The first column is filled with "!" and I want to change it to the data.frame's name. Each data frame has over 7,000 rows. The last row in each data.frame is empty, and I want to remove it. My goal is to merge the data.frames while preserving the origin of the data in the first column.
I have a list of data frame names in temp2; those are also the names I want to put into the first column of the corresponding data frame.
> str(temp2)
chr [1:13] "bone_marrow" "colon" "duodenum" "esophagus" "liver" "lymph_node" "rectum" ...
The first column, to be filled in with the data.frame name, is named "Tissue" (df$Tissue).
I tried to delete last row in each data frame with:
for (i in 1:length(temp2)) assign(temp2[i], temp2[i][-nrow(temp2[i],)])
and to fill the first column with data frame name with:
for (i in 1:length(temp2)) paste0(temp2[i], "$Tissue = ", temp2[i])
or
for (i in 1:length(temp2)) paste0(temp2[i], "$Tissue") <- temp2[i]
the first code (for deleting last rows) returns:
Error in nrow(temp2[i], ) : unused argument (alist())
the second code is silent but does nothing,
the last one returns:
Error in paste0(temp2[i], "$Tissue") <- temp2[i] :
could not find function "paste0<-"
The final goal is to merge all the data frames into one with rbind or merge:
for (i in 1:length(temp2)) allDF = rbind(temp2[i])
Let's say your data frames are df1, df2, df3.
First, you can combine them into a list:
dflist <- list(df1, df2, df3)
Then you can create new columns for each of them and strip the last row of each data frame (the head() function with argument -1 removes the last row):
dflist2 <- lapply(seq_along(dflist), function(i) {
  dflist[[i]]$Tissue <- temp2[i]
  head(dflist[[i]], -1)
})
And, finally, you can bind them into one big dataframe:
mydf <- do.call(rbind, dflist2)
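Since the question already has the data frame names in temp2, the list can probably be built from those names with mget() rather than typed out by hand. The sketch below assumes the data frames live in the environment where the code runs; the two tiny data frames are hypothetical stand-ins for the real tissue tables:

```r
# Hypothetical stand-ins: "!" in Tissue and an empty (NA) last row
bone_marrow <- data.frame(Tissue = "!", value = c(1, 2, NA))
colon <- data.frame(Tissue = "!", value = c(3, 4, NA))
temp2 <- c("bone_marrow", "colon")

# Fetch the data frames by name; the result is a list named after temp2
dflist <- mget(temp2)

# For each data frame: write its own name into Tissue, then drop the last row
dflist2 <- Map(function(d, nm) {
  d$Tissue <- nm
  head(d, -1)
}, dflist, names(dflist))

# Stack everything; unname() keeps the list names out of the row names
allDF <- do.call(rbind, unname(dflist2))
```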

how to search for column names in a dataframe

I have the following data frame df2 and a vector n. How can I create a new data frame containing only the columns of df2 whose names are in vector n?
df2 <- data.frame(x1=c(1,6,3),x2=c(4,3,1),x3=c(5,4,6),x4=c(7,6,7))
n<-c("x1","x4")
Any of these would work:
df2[n]
df2[, n] # see note below for caveat
subset(df2, select = n)
Note that with the second one, if n has length one (i.e. a single column), the result is a vector rather than a data frame. If you want it to always return a data frame, you would instead need:
df2[, n, drop = FALSE]
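A quick check of that caveat, using the question's df2 and a hypothetical single-name selector n1:

```r
df2 <- data.frame(x1 = c(1, 6, 3), x2 = c(4, 3, 1), x3 = c(5, 4, 6), x4 = c(7, 6, 7))
n1 <- "x1"  # hypothetical selector naming just one column

as_vector <- df2[, n1]               # single column is simplified to a vector
as_frame  <- df2[, n1, drop = FALSE] # stays a one-column data frame
as_list_style <- df2[n1]             # list-style indexing also stays a data frame

is.data.frame(as_vector)     # FALSE
is.data.frame(as_frame)      # TRUE
is.data.frame(as_list_style) # TRUE
```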
df3 <- subset(df2, select=c("x1", "x4"))
df3
hope it helps

Finding nearest number between two lists

I have a list of dataframes (df1) and another list of dataframes (df2) which hold values required to find the 'nearest value' in the first list.
df1<-list(d1=data.frame(y=1:10), d2=data.frame(y=3:20))
df2<-list(d3=data.frame(y=2),d4=data.frame(y=4))
Say I have this function:
df1[[1]]$y[which(abs(df1[[1]]$y-df2[[1]])== min(abs(df1[[1]]$y-df2[[1]])))]
This function works perfectly in finding the closest value of df2 value 1 in df1. What I can't achieve is getting to work with lapply as in something like:
lapply(df1, function(x){
f<-x$y[which(abs(x$y-df2) == min(abs(x$y - df2)))]
})
I would like to return a dataframe with all f values which show the nearest number for each item in df1.
Thanks,
M
I assume you're trying to compare the first data.frames in df1 and df2 to each other, and the second data.frames in df1 and df2 to each other. It would also be useful to use the which.min function (check out help(which.min)).
edit
In response to your comment, you could use mapply instead:
> mapply(function(x,z) x$y[which.min(abs(x$y - z$y))], df1, df2)
d1 d2
2 4
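Since the question asks for a data frame of results, the mapply output can be wrapped up roughly like this. The column names item, target and nearest are just illustrative choices, and the sketch assumes df1 and df2 are paired by position:

```r
df1 <- list(d1 = data.frame(y = 1:10), d2 = data.frame(y = 3:20))
df2 <- list(d3 = data.frame(y = 2), d4 = data.frame(y = 4))

# For each pair, pick the element of x$y closest to the single value in z$y.
# which.min() returns the first position if several values tie.
nearest <- mapply(function(x, z) x$y[which.min(abs(x$y - z$y))], df1, df2)

# Collect the results in one data frame (column names are illustrative)
result <- data.frame(
  item = names(df1),
  target = unname(sapply(df2, function(z) z$y)),
  nearest = unname(nearest),
  stringsAsFactors = FALSE
)
```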
The OP's real problem is unclear, but I would probably do...
library(data.table)
DT1 = rbindlist(unname(df1), idcol=TRUE)
DT2 = rbindlist(unname(df2), idcol=TRUE)
DT1[DT2, on=c(".id","y"), roll="nearest"]
# .id y
# 1: 1 2
# 2: 2 4

Rename columns in data frame in R using lookup data frame

I have a data frame with the default column names V1, V2, V3, V230, etc.
I have another data frame which has 2 columns, one containing V1, V2, V3 etc and the second column containing a character string.
I would like to rename the columns in the first data frame using the second data frame as a lookup table.
Note that the first data frame has fewer columns than are listed in the second "lookup" data frame.
Any ideas?
We can use match
colnames(firstdat) <- seconddat[,2][match(colnames(firstdat),
seconddat[,1], nomatch=0)]
Say the first data frame is x and second one is y:
colnames(x) <- merge(data.frame(colnames(x)), y, by.x ="colnames.x.", by.y= "Col1" )[,2]
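### Col1 is the name of column 1 of *y* (containing V1, V2 etc)
### Note: merge() sorts its result by the key by default, so this only lines up if colnames(x) is already in that sorted order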
Consider using data.table
library(data.table)
DT <- as.data.table(df)
temp.lookup <- lookup[lookup$oldnames %in% names(DT), ]
setnames(DT, old = temp.lookup$oldnames, new = temp.lookup$newnames)
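For a base-R version of the match approach that also leaves any column without a lookup entry untouched, something like the following may work. The toy data and the lookup column names old and new are hypothetical:

```r
# Hypothetical data: three default-named columns and a longer lookup table
firstdat <- data.frame(V1 = 1:2, V3 = 3:4, V230 = 5:6)
seconddat <- data.frame(
  old = c("V1", "V2", "V3", "V230"),
  new = c("id", "age", "height", "score"),
  stringsAsFactors = FALSE
)

# Position of each current column name in the lookup (NA if not listed)
idx <- match(colnames(firstdat), seconddat$old)

# Use the new name where one exists, otherwise keep the current name
colnames(firstdat) <- ifelse(is.na(idx), colnames(firstdat), seconddat$new[idx])
```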
