I have two data frames. The first is height.txt:
ID: 1 2 3 4 5
Height: 67 60 62 55 69
and the second is weight.txt:
ID: 1 2 4 5 6
Weight: 110 123 150 170 185
The goal is to merge these two data frames without using the merge() function in R, so that the output matches the image. How would I do this? This is for practice: I know merge() does the trick, but I am supposed to do this without it for a class.
Edit:
Data in a copy&paste format.
ID <- scan(text = "1 2 3 4 5")
Height <- scan(text = "67 60 62 55 69")
df1 <- data.frame(ID, Height)
ID <- scan(text = "1 2 4 5 6")
Weight <- scan(text = "110 123 150 170 185")
df2 <- data.frame(ID, Weight)
It's a simple repeated use of match.
Create a data.frame with all the elements of the common column, ID, with no repetitions.
Match the IDs of each of the data frames against the ID column of the result res.
Assign the other columns.
Remember to create each of the other columns before assigning values to them.
res <- data.frame(ID = unique(c(df1$ID, df2$ID)))
i <- match(df1$ID, res$ID)
j <- na.omit(match(res$ID, df1$ID))
res$Height <- NA
res$Height[i] <- df1$Height[j]
i <- match(df2$ID, res$ID)
j <- na.omit(match(res$ID, df2$ID))
res$Weight <- NA
res$Weight[i] <- df2$Weight[j]
res
# ID Height Weight
#1 1 67 110
#2 2 60 123
#3 3 62 NA
#4 4 55 150
#5 5 69 170
#6 6 NA 185
identical(res, merge(df1, df2, all = TRUE))
#[1] TRUE
Edit.
Answering a question in a comment about how general this solution is. From help("merge"):
Details
merge is a generic function whose principal method is for data frames:
the default method coerces its arguments to data frames and calls the
"data.frame" method.
The method merge.data.frame in R 3.6.2 is 158 lines of code long, so this solution is not general at all.
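For the curious, a rough way to check that line count from an R session (a quick sketch; the exact count depends on the R version):
# Roughly count the source lines of the data.frame method of merge()
length(deparse(getS3method("merge", "data.frame")))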
Edit 2.
A function generalizing the code above could be the following.
merge_by_one_col <- function(X, Y, col = "ID"){
  common <- unique(c(X[[col]], Y[[col]]))
  res <- data.frame(common)
  names(res) <- col
  i <- match(X[[col]], res[[col]])
  j <- na.omit(match(res[[col]], X[[col]]))
  for(new in setdiff(names(X), col)){
    res[[new]] <- NA
    res[[new]][i] <- X[[new]][j]
  }
  i <- match(Y[[col]], res[[col]])
  j <- na.omit(match(res[[col]], Y[[col]]))
  for(new in setdiff(names(Y), names(res))){
    res[[new]] <- NA
    res[[new]][i] <- Y[[new]][j]
  }
  res
}
merge_by_one_col(df1, df2)
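As a quick sanity check (assuming the df1 and df2 defined at the top), the generalized function can again be compared against merge(), just as was done for res above:
# Expected to be TRUE for this example
identical(merge_by_one_col(df1, df2), merge(df1, df2, all = TRUE))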
I used cbind after rbinding the missing IDs from each data frame and sorting by ID.
df1_ <- rbind(df1, data.frame(ID=setdiff(df2$ID, df1$ID), Height=NA))
df2_ <- rbind(df2, data.frame(ID=setdiff(df1$ID, df2$ID), Weight=NA))
cbind(df1_[order(df1_$ID),], df2_[order(df2_$ID), -1, drop=FALSE])
ID Height Weight
1 1 67 110
2 2 60 123
3 3 62 NA
4 4 55 150
5 5 69 170
6 6 NA 185
Edit:
Generalizing so that no column names are required (except the "by" column "ID")
n1 <- setdiff(df1$ID, df2$ID); n1
n2 <- setdiff(df2$ID, df1$ID); n2
# all-NA rows, one for each ID missing from df1 / df2
df1a <- df1[rep(nrow(df1)+1, length(n2)),]; df1a
df2a <- df2[rep(nrow(df2)+1, length(n1)),]; df2a
df1a$ID <- n2
df2a$ID <- n1
df1_ <- rbind(df1, df1a)
df2_ <- rbind(df2, df2a)
res <- cbind(df1_[order(df1_$ID),], df2_[order(df2_$ID), -1, drop=FALSE])
rownames(res) <- 1:nrow(res)
res
ID Height Weight
1 1 67 110
2 2 60 123
3 3 62 NA
4 4 55 150
5 5 69 170
6 6 NA 185
Edit 2: Using rbind.fill from the plyr package:
library(plyr)
df1_ <- rbind.fill(df1, data.frame(ID=setdiff(df2$ID, df1$ID)))
df2_ <- rbind.fill(df2, data.frame(ID=setdiff(df1$ID, df2$ID)))
res <- cbind(df1_[order(df1_$ID),], df2_[order(df2_$ID), -1, drop=FALSE])
identical(res, merge(df1, df2, all=TRUE))
# TRUE
I am an R amateur and am learning slowly. Here is the situation:
I have two data frames, each with several columns (4) and more than 10,000 rows, looking like this:
df1: df2:
Nº x y attr Nº x y attr
1 45 34 X 1 34 23 x
1 48 45 XX 4 123 45 x
1 41 23 X 4 99 69 xx
4 23 12 X 4 112 80 xx
4 28 16 X 5 78 80 x
5 78 80 XXX 5 69 74 xx
...
I would like to compare both data frames based on the x,y coordinates and delete from df1 all the rows whose coordinates also appear in df2 (i.e. the coordinates contained in both data sets get deleted from df1).
So in my example, the last row of df1 would be deleted because the same coordinates are in df2.
What I am doing is using a double for() loop, one over each data set, to compare all possible pairs of values one by one (something like the sketch below).
I know this is extremely inefficient and also takes a lot of time as the amount of data increases.
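For reference, a minimal sketch of the kind of double loop I mean (a hypothetical simplification, not my actual code):
# Slow reference approach: flag rows of df1 whose (x, y) pair occurs in df2
keep <- rep(TRUE, nrow(df1))
for (i in seq_len(nrow(df1))) {
  for (j in seq_len(nrow(df2))) {
    if (df1$x[i] == df2$x[j] && df1$y[i] == df2$y[j]) {
      keep[i] <- FALSE
      break
    }
  }
}
df1_clean <- df1[keep, ]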
What other ways are there to do this?
There are probably functions for this, but I don't know them well enough to use them and I run into problems.
Thank you very much!!
A library(data.table) method:
df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
# Nº x y attr
#1: 1 45 34 X
#2: 1 48 45 XX
#3: 1 41 23 X
#4: 4 23 12 X
#5: 4 28 16 X
Not the most elegant solution but gets the job done:
df2 = fread('Nº x y attr
1 34 23 x
4 123 45 x
4 99 69 xx
4 112 80 xx
5 78 80 x
5 69 74 xx')
df1 = fread('Nº x y attr
1 45 34 X
1 48 45 XX
1 41 23 X
4 23 12 X
4 28 16 X
5 78 80 XXX')
> df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),]
Nº x y attr
1: 1 45 34 X
2: 1 48 45 XX
3: 1 41 23 X
4: 4 23 12 X
5: 4 28 16 X
Explanation:
It's best to use vectorised functions rather than loops. !stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_") concatenates the x and y columns into a single string key and then flags the rows of df1 whose key does not appear in df2. This creates a logical vector of TRUE/FALSE values which we can then use to subset df1.
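As a side note, the same idea should work in base R with paste() if you prefer not to depend on stringr (an untested variant of the line above):
# Base R equivalent: build "x_y" keys and keep rows of df1 whose key is not in df2
df1[!paste(df1$x, df1$y, sep = "_") %in% paste(df2$x, df2$y, sep = "_"), ]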
EDIT:
I was curious to see whether my answer or @dww's was faster:
> library(microbenchmark)
>
> n=100000
>
> df1 = data.table(x = sample(n), y=sample(n))
> df2 = data.table(x = sample(n), y=sample(n))
>
>
>
> microbenchmark(
... df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),],
... df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
... )
Unit: milliseconds
expr
df1[!stringr::str_c(df1$x, df1$y, sep = "_") %in% stringr::str_c(df2$x, df2$y, sep = "_"), ]
df1[fsetdiff(df1[, .(x, y)], df2[, .(x, y)]), on = c("x", "y")]
min lq mean median uq max neval
168.40953 199.37183 219.30054 209.61414 222.08134 364.3458 100
41.07557 42.67679 52.34855 44.34379 59.27378 152.1283 100
It seems the data.table version from @dww is ~5x faster.
3 lines of code
#generate sample data
x1 <- sample(1:50,9001, T)
y1 <- sample(1:50,9001, T)
x2 <- sample(1:50,9001, T)
y2 <- sample(1:50,9001, T)
df1 <- data.frame(id =1:9001, x1,y1, stringsAsFactors = F)
df2 <- data.frame(id =1:9001, x2,y2, stringsAsFactors = F)
#add a match column to each dataframe
df1$match <- paste(df1$x1, df1$y1)
df2$match <- paste(df2$x2, df2$y2)
#overwrite df1 with the data of df1 that does not appear in df2
df1 <- df1[!df1$match %in% df2$match,]
I am currently working on a data frame with raw numeric data in columns. Every column contains data for one parameter (for example, gene expression data of gene xyz), while each row contains a subject. Some of the columns are normally distributed, while some are far from it.
I ran Shapiro tests using apply with margin 2 for different transformations and then picked suitable transformations by comparing shapiro.test()$p.value. I stored my picks as characters in a vector, giving me a vector of NA, log10, sqrt with the length of ncol(DataFrame).
I now wonder if it is possible to apply this vector to the data frame via an apply function or, if necessary, a for loop. How do I do this, or is there a better way? I guess I could loop if-else statements, but there has to be a more efficient way, because my code is already slow.
Thanks all!
Update: I tried the code below but it is giving me "Error in file(filename, "r") : invalid 'description' argument"
TransformedExampleDF <- apply(exampleDF, 2 , function(x) eval(parse(paste(transformationVector , "(" , x , ")" , sep = "" ))))
exampleDF <- as.data.frame(matrix(c(1,2,3,4,1,10,100,1000,0.1,0.2,0.3,0.4), ncol=3, nrow = 4))
transformationVector <- c(NA, "log10", NA)
So you could do something like this. In the example below, I've cooked up four random functions whose names I've then stored in the character vector my_func_list (note: the last function converts data to NA; that is intentional).
Then, I created another function func_to_df() that accepts the data.frame and the vector of function names (func_list) as inputs and applies (i.e., executes using get()) each function to the corresponding column of the data.frame. The output is returned (and, in this example, stored in the data.frame my_df1).
tl;dr: just look at what func_to_df() does. It might also be worthwhile looking into the purrr package (although it hasn't been used here).
#---------------------
#Example function 1
myaddtwo <- function(x){
  if(is.numeric(x)){
    x = x + 2
  } else{
    warning("Input must be numeric!")
  }
  return(x)
  #Constraints such as the one shown above
  #can be added elsewhere to prevent
  #inappropriate action
}
#Example function 2
mymulttwo <- function(x){
  return(x*2)
}
#Example function 3
mysqrt <- function(x){
  return(sqrt(x))
}
#Example function 4
myna <- function(x){
  return(NA)
}
#---------------------
#Dummy data
my_df <- data.frame(
  matrix(sample(1:100, 40, replace = TRUE),
         nrow = 10, ncol = 4),
  stringsAsFactors = FALSE)
#User somehow ascertains that
#the following order of functions
#is the right one to be applied to the data.frame
my_func_list <- c("myaddtwo", "mymulttwo", "mysqrt", "myna")
#---------------------
#A function which applies
#the functions from func_list
#to the columns of df
func_to_df <- function(df, func_list){
  for(i in 1:length(func_list)){
    df[, i] <- get(func_list[i])(df[, i])
    #Alternative to get()
    #df[, i] <- eval(as.name(func_list[i]))(df[, i])
  }
  return(df)
}
#---------------------
#Execution
my_df1 <- func_to_df(my_df, my_func_list)
#---------------------
#Output
my_df
# X1 X2 X3 X4
# 1 8 85 6 41
# 2 45 7 8 65
# 3 34 80 16 89
# 4 34 62 9 31
# 5 98 47 51 99
# 6 77 28 40 72
# 7 24 7 41 46
# 8 45 80 75 30
# 9 93 25 39 72
# 10 68 64 87 47
my_df1
# X1 X2 X3 X4
# 1 10 170 2.449490 NA
# 2 47 14 2.828427 NA
# 3 36 160 4.000000 NA
# 4 36 124 3.000000 NA
# 5 100 94 7.141428 NA
# 6 79 56 6.324555 NA
# 7 26 14 6.403124 NA
# 8 47 160 8.660254 NA
# 9 95 50 6.244998 NA
# 10 70 128 9.327379 NA
#---------------------
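To relate this back to the original transformationVector, where NA meant "apply no transformation", a minimal sketch along the same lines could look like the following (apply_transformations is a hypothetical helper, not part of the question or the answer above; the data mirror the question's exampleDF):
# Hypothetical helper: apply a vector of function names column by column,
# treating NA entries as "leave the column unchanged"
apply_transformations <- function(df, transformations) {
  as.data.frame(Map(function(col, fn) {
    if (is.na(fn)) col else match.fun(fn)(col)
  }, df, transformations))
}
exampleDF <- as.data.frame(matrix(c(1,2,3,4, 1,10,100,1000, 0.1,0.2,0.3,0.4),
                                  ncol = 3, nrow = 4))
transformationVector <- c(NA, "log10", NA)
apply_transformations(exampleDF, transformationVector)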
Let's say I have two .csv tables (I actually have hundreds):
Table 1
x mean_snowcover useless_data
1 80 6546156
2 50 6285465
3 60 2859525
Table 2
x mean_snowcover useless_data
1 91 87178
2 89 987189
3 88 879278927
I want a new table that looks like this:
Mean Snowcover
x Table_1 Table_2
1 80 91
2 50 89
3 60 88
This is my current code:
setwd("C:/Users/evan/Desktop/Finished Data SGMA/test")
master1=read.csv("New folder/AllSGMA.csv")
temp = list.files(pattern="*.csv$",recursive=FALSE)
###READ CSVS IN LOOP###
for(x in 1:length(temp)){
  mycsv = read.csv(temp[x])
  mean_snowcover = mycsv$mean_snowcover
  master2 = cbind(master1, mean_snowcover)
}
write.csv(master2,"Mean Snowcover.csv")
But the output is a blank table. I've looked at similar questions on stack overflow but I am unable to figure out what I need to change. I'm fairly new to R.
You can use Reduce and dplyr::left_join:
df1 <- read.table(text =
"x mean_snowcover useless_data
1 80 6546156
2 50 6285465
3 60 2859525", header = T)
df2 <- read.table(text =
"x mean_snowcover useless_data
1 91 87178
2 89 987189
3 88 879278927", header = T)
library(dplyr);
library(magrittr);
Reduce(function(x,y)
left_join(x, y, by = "x") %>% select(x, contains("snowcover")), list(df1, df2))
# x mean_snowcover.x mean_snowcover.y
# 1 1 80 91
# 2 2 50 89
# 3 3 60 88
This will work on any number of data.frames, as long as they share a common x column, and you put them all in a list, i.e.
lst <- list(df1, df2, df3, ....)
Reduce(function(x,y)
left_join(x, y, by = "x") %>% select(x, contains("snowcover")), lst)
I want to generate correlations; it may be basic, but I am not able to get it working. I need your help!
I am trying to generate correlations for user-specified variables (i.e. the variables on which the correlation needs to be computed are not fixed; they could differ between scenarios, so I take them as inputs and store them in the vector str_char).
For each of these variables I need to compute the correlation with the value variable, and the correlation should be computed within the groups of the type variable.
Below is the sample dummy data. My actual data has many more columns and rows.
library("plyr")
library("data.table")
set.seed(1200)
id <- 1:100
bills <- sample(1:20,100,replace = T)
nos <- sample(1:80,100,replace = T)
stru <- sample(c("A","B","C","D"),100,replace = T)
var1 <- sample(1:80,100,replace = T)
var2 <- sample(1:80,100,replace = T)
v1 <- sample(1:80,100,replace = T)
v2 <- sample(1:80,100,replace = T)
a1 <- sample(1:80,100,replace = T)
b1 <- sample(1:80,100,replace = T)
type <- sample(1:7,100,replace = T)
value <- sample(100:1000,100,replace = T)
df1 <- as.data.table(data.frame(id,bills,nos,stru,var1,var2,v1,v2,a1,b1,type,value))
#store the variables for which we need to generate correlations. This would change in different scenarios and would need to be updated accordingly.
str_char <- c("bills","nos","stru","var2","v1","b1")
len <- length(str_char)
#Since the variables are not fixed, use a for loop. To generate the correlation by group, use ddply.
corr <- data.frame()
for (i in 1:len){
  df1$var1 <- df1[, which(colnames(df1) == str_char[i])]
  var1 <- str_char[i]
  temp1 <- ddply(
    df1
    , .(type)
    , summarize
    , var1 = cor(value, var1, method = "spearman")
  )
  corr <- as.data.frame(cbind(corr, temp1))
}
This generates an empty data frame for corr. I am not sure where I am going wrong. I wanted to have type in the rows and each of these variables in the columns, with the cells holding the correlation values.
Once I have the data frame with correlation values, I want to identify the variables where the correlation is > 0.2 and store them in a vector.
Could you please help by pointing out where I am going wrong, or suggest a better way to meet this requirement?
Thank you !!
With data.table no "sophisticated trick" is required. It can be done by using the by parameter (instead of split()) and the .SDcols parameter to specify the columns to be used in the call to cor(). So, it's pretty much straightforward data.table syntax:
# without stru because it is factor not numeric
str_char <- c("bills", "nos", "var2", "v1", "b1")
df1[, lapply(.SD, function(x) cor(value, x, method = "spearman")),
    keyby = type, .SDcols = str_char]
type bills nos var2 v1 b1
1: 1 -0.58026951 0.16493506 -0.07664827 0.11627152 -0.05595326
2: 2 0.02646100 0.22246750 0.40308468 0.38943918 -0.10121018
3: 3 -0.11389551 0.36446564 -0.16438528 0.00000000 -0.04100238
4: 4 -0.45645233 -0.21585955 -0.19560440 0.28351648 -0.08580863
5: 5 -0.18596606 -0.23776224 -0.06304738 -0.03508794 0.39860140
6: 6 -0.72346726 -0.04175824 0.24862501 -0.30583077 -0.31718139
7: 7 -0.02649032 -0.08810594 0.48398529 0.30143033 0.50165047
# with stru after coercion of factor to numeric
str_char <- c("bills", "nos", "stru", "var2", "v1", "b1")
result <- df1[, lapply(.SD, function(x) cor(value, as.numeric(x), method = "spearman")),
              keyby = type, .SDcols = str_char]
result
type bills nos stru var2 v1 b1
1: 1 -0.58026951 0.16493506 0.08202645 -0.07664827 0.11627152 -0.05595326
2: 2 0.02646100 0.22246750 0.21968328 0.40308468 0.38943918 -0.10121018
3: 3 -0.11389551 0.36446564 -0.11769798 -0.16438528 0.00000000 -0.04100238
4: 4 -0.45645233 -0.21585955 -0.37551547 -0.19560440 0.28351648 -0.08580863
5: 5 -0.18596606 -0.23776224 0.39444627 -0.06304738 -0.03508794 0.39860140
6: 6 -0.72346726 -0.04175824 0.28585837 0.24862501 -0.30583077 -0.31718139
7: 7 -0.02649032 -0.08810594 -0.05718863 0.48398529 0.30143033 0.50165047
Note that keyby is used instead of by to have the result in the same order as in LAP's answer for comparison.
In addition, the OP has requested to append a new column to the result which contains the names of the top 3 variables with the highest cor() value > 0.2 for each type.
Finding the top 3 values can be most conveniently done after reshaping result from wide to long format:
# reshape from wide to long
melt(result, id.vars = "type")[
# select by value
value > 0.2][
# order by descending value and pick the first 3 (if any)
order(-value), toString(head(variable, 3L)), keyby = type]
type V1
1: 2 var2, v1, nos
2: 3 nos
3: 4 v1
4: 5 b1, stru
5: 6 stru, var2
6: 7 b1, var2, v1
Appending to result is done by an update on join:
result[
melt(result, id.vars = "type")[value > 0.2][
order(-value), toString(head(variable, 3L)), keyby = type],
on = "type", selected := V1][
# beautify result
is.na(selected), selected := ""][]
type bills nos stru var2 v1 b1 selected
1: 1 -0.58026951 0.16493506 0.08202645 -0.07664827 0.11627152 -0.05595326
2: 2 0.02646100 0.22246750 0.21968328 0.40308468 0.38943918 -0.10121018 var2, v1, nos
3: 3 -0.11389551 0.36446564 -0.11769798 -0.16438528 0.00000000 -0.04100238 nos
4: 4 -0.45645233 -0.21585955 -0.37551547 -0.19560440 0.28351648 -0.08580863 v1
5: 5 -0.18596606 -0.23776224 0.39444627 -0.06304738 -0.03508794 0.39860140 b1, stru
6: 6 -0.72346726 -0.04175824 0.28585837 0.24862501 -0.30583077 -0.31718139 stru, var2
7: 7 -0.02649032 -0.08810594 -0.05718863 0.48398529 0.30143033 0.50165047 b1, var2, v1
I've got a base R solution using split to generate a list of subsets, calculate the correlations and rbind them together in the way you want. I guess there will be a more sophisticated approach using data.table, but for now it'll do the trick.
Generate a data.frame from the data you provided:
df1 <- data.frame(id,bills,nos,stru,var1,var2,v1,v2,a1,b1,type,value)
> head(df1)
id bills nos stru var1 var2 v1 v2 a1 b1 type value
1 1 4 74 A 36 1 54 75 9 31 2 139
2 2 8 36 D 75 73 10 72 43 55 6 743
3 3 10 12 B 64 60 39 22 62 40 4 574
4 4 11 33 B 11 73 69 33 29 38 1 409
5 5 10 32 B 73 66 37 34 29 58 6 620
6 6 12 39 D 38 39 40 56 68 29 6 539
Create subsets using split:
subsets <- split(df1, df1$type)
Use a nested lapply solution to loop over the variable names in str_char:
corlist <- lapply(subsets, function(x) lapply(str_char, function(y) cor(x[,"value"], as.numeric(x[,y]), method = "spearman")))
Use a nested do.call to create the matrix of correlation coefficients:
cormatrix <- do.call(rbind, lapply(corlist, function(x) do.call(c, x)))
Assign names to the columns:
colnames(cormatrix) <- str_char
Output:
> cormatrix
bills nos var2 v1 b1
1 -0.58026951 0.16493506 -0.07664827 0.11627152 -0.05595326
2 0.02646100 0.22246750 0.40308468 0.38943918 -0.10121018
3 -0.11389551 0.36446564 -0.16438528 0.00000000 -0.04100238
4 -0.45645233 -0.21585955 -0.19560440 0.28351648 -0.08580863
5 -0.18596606 -0.23776224 -0.06304738 -0.03508794 0.39860140
6 -0.72346726 -0.04175824 0.24862501 -0.30583077 -0.31718139
7 -0.02649032 -0.08810594 0.48398529 0.30143033 0.50165047
To add the type and the names of up to three variables with correlation coefficient > 0.2 (sorted by value) to the cormatrix, use this:
maxvector <- apply(cormatrix, 1, function(x) sort(x[which(x > .2)], decreasing = T))
maxvector <- lapply(maxvector, function(x) names(x)[1:3])
maxvector <- lapply(maxvector, function(x) x[!is.na(x)])
maxvector <- lapply(maxvector, function(x) paste(x, collapse = ","))
cormatrix <- cbind(type = 1:7, cormatrix, maxvector)
Result:
> cormatrix
type bills nos stru var2 v1 b1 maxvector
1 1 -0.5802695 0.1649351 0.08202645 -0.07664827 0.1162715 -0.05595326 ""
2 2 0.026461 0.2224675 0.2196833 0.4030847 0.3894392 -0.1012102 "var2,v1,nos"
3 3 -0.1138955 0.3644656 -0.117698 -0.1643853 0 -0.04100238 "nos"
4 4 -0.4564523 -0.2158596 -0.3755155 -0.1956044 0.2835165 -0.08580863 "v1"
5 5 -0.1859661 -0.2377622 0.3944463 -0.06304738 -0.03508794 0.3986014 "b1,stru"
6 6 -0.7234673 -0.04175824 0.2858584 0.248625 -0.3058308 -0.3171814 "stru,var2"
7 7 -0.02649032 -0.08810594 -0.05718863 0.4839853 0.3014303 0.5016505 "b1,var2,v1"
Edit: I've also re-included stru by converting it with as.numeric (thanks @Uwe).
Here is a tidyverse attempt:
library(tidyverse)
df1 %>%
select(bills, nos, var2, v1, b1, type) %>% #select needed variables, one can also do: select(str_char, type), however `stru` is not numeric
group_by(type) %>% #group by type
do(correlation = as.data.frame(cor(.[1:5]))) %>% #correlation
unnest(correlation) %>% #convenient output
gather(key, value, bills:b1) %>% #for easier pairwise removal
filter(var != key) %>% #remove self correlation
arrange(type, var, key)
#output
# A tibble: 140 x 4
type var key value
<int> <fctr> <chr> <dbl>
1 1 b1 bills 0.01978168
2 1 b1 nos -0.40581082
3 1 b1 v1 -0.08507922
4 1 b1 var2 0.15430381
5 1 bills b1 0.01978168
6 1 bills nos 0.21208062
7 1 bills v1 -0.15127493
8 1 bills var2 -0.02983736
9 1 nos b1 -0.40581082
10 1 nos bills 0.21208062
# ... with 130 more rows
I'm trying to use lapply on a list of data frames, but I'm failing at passing the parameters correctly (I think).
List of data frames:
df1 <- data.frame(A = 1:10, B= 11:20)
df2 <- data.frame(A = 21:30, B = 31:40)
listDF <- list(df1, df2, df3) #multiple data frames with far fewer columns than the length of the vector todos
Vector with columns names:
todos <-c('col1','col2', ......'colN')
I'd like to change the column names using lapply:
lapply (listDF, function(x) { colnames(x)[2:length(x)] <-todos[1:length(x)-1] } )
but this doesn't change the names at all. Am I not passing the data frames themselves, but something else? I just want to change names, not to return the result to a new object.
Thanks in advance, p.
You can also use setNames if you want to replace all columns
df1 <- data.frame(A = 1:10, B= 11:20)
df2 <- data.frame(A = 21:30, B = 31:40)
listDF <- list(df1, df2)
new_col_name <- c("C", "D")
lapply(listDF, setNames, nm = new_col_name)
## [[1]]
## C D
## 1 1 11
## 2 2 12
## 3 3 13
## 4 4 14
## 5 5 15
## 6 6 16
## 7 7 17
## 8 8 18
## 9 9 19
## 10 10 20
## [[2]]
## C D
## 1 21 31
## 2 22 32
## 3 23 33
## 4 24 34
## 5 25 35
## 6 26 36
## 7 27 37
## 8 28 38
## 9 29 39
## 10 30 40
If you need to replace only a subset of column names, then you can use the solution of @Jogo:
lapply(listDF, function(df) {
  names(df)[-1] <- new_col_name[-ncol(df)]
  df
})
A last point: in R there is a difference between a:b - 1 and a:(b - 1).
1:10 - 1
## [1] 0 1 2 3 4 5 6 7 8 9
1:(10 - 1)
## [1] 1 2 3 4 5 6 7 8 9
EDIT
If you want to change the column names of the data.frames in the global environment from a list, you can use list2env, but I'm not sure it is the best way to achieve what you want. You also need to modify your list to be a named list; each name should match the name of the data.frame you want to replace.
listDF <- list(df1 = df1, df2 = df2)
new_col_name <- c("C", "D")
listDF <- lapply(listDF, function(df) {
  names(df)[-1] <- new_col_name[-ncol(df)]
  df
})
list2env(listDF, envir = .GlobalEnv)
str(df1)
## 'data.frame': 10 obs. of 2 variables:
## $ A: int 1 2 3 4 5 6 7 8 9 10
## $ C: int 11 12 13 14 15 16 17 18 19 20
try this:
lapply(listDF, function(x) {
  names(x)[-1] <- todos[-length(x)]
  x
})
You will get a new list with the changed data frames. If you want to manipulate listDF directly:
for (i in 1:length(listDF)) names(listDF[[i]])[-1] <- todos[-length(listDF[[i]])]
I was not able to get the code in these answers to work. I found some code on another forum which did work. This assigns the new column names into each data frame directly, whereas the other methods create copies of the data frames. For anyone else, here is the code.
# Create some dataframes
df1 <- data.frame(A = 1:10, B= 11:20)
df2 <- data.frame(A = 21:30, B = 31:40)
listDF <- c("df1", "df2") #Notice this is NOT a list
new_col_name <- c("C", "D") #What do you want the new columns to be named?
# Assign the new column names to each dataframe in "listDF"
for(df in listDF) {
  df.tmp <- get(df)
  names(df.tmp) <- new_col_name
  assign(df, df.tmp)
}
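A quick check that the assignment took effect in the global environment (using the df1 defined just above):
# df1 itself now carries the new names
names(df1)
# [1] "C" "D"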