I have 4 data frames, each with exactly the same number of rows and columns. The values in columns 1, 2 and 5 are the same in every data frame. From these 4 data frames I want to obtain a single data frame in which the third and fourth columns ('pred1' and 'pred2') are the sums of the corresponding values across the 4 data frames. Is it possible to do that? Here are my data frames:
df1 = read.csv(fname1, header=FALSE, col.names=c("c1", "c2", "pred1", "pred2", "c5"))
df2 = read.csv(fname2, header=FALSE, col.names=c("c1", "c2", "pred1", "pred2", "c5"))
df3 = read.csv(fname3, header=FALSE, col.names=c("c1", "c2", "pred1", "pred2", "c5"))
df4 = read.csv(fname4, header=FALSE, col.names=c("c1", "c2", "pred1", "pred2", "c5"))
How about
df5 <- df1
df5$pred1 <- df1$pred1 + df2$pred1 + df3$pred1 + df4$pred1
df5$pred2 <- df1$pred2 + df2$pred2 + df3$pred2 + df4$pred2
Based on Gregor's suggestions, you could also create a vector to store the columns to be added (in case there are a lot), and then add those together as with
cols = c("pred1", "pred2")
df5[, cols] = df1[, cols] + df2[, cols] + df3[, cols] + df4[, cols]
akrun also suggests an approach which I don't fully follow, but it seems like it would work well with arbitrarily many data frames as well (just expand 1:4 to 1:n, where n is the number of the last df).
# "[" selects both columns from each data frame; Reduce then sums them element-wise
Reduce("+", lapply(mget(paste0('df', 1:4)), "[", c("pred1", "pred2")))
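To see the Reduce idea end to end, here is a minimal sketch on made-up data. The toy values, the `make_df` helper and the `dfs` list are assumptions for illustration, not the asker's files. Keeping the data frames in a list means the code does not depend on them being named df1 to df4.

```r
# Toy stand-ins for the four data frames (values invented for illustration)
make_df <- function(k) {
  data.frame(c1 = 1:3, c2 = c("a", "b", "c"),
             pred1 = k * c(1, 2, 3), pred2 = k * c(10, 20, 30),
             c5 = c(TRUE, FALSE, TRUE))
}
dfs <- lapply(1:4, make_df)

cols <- c("pred1", "pred2")
result <- dfs[[1]]   # keeps c1, c2 and c5 from the first data frame
# Element-wise sum of the selected columns across all data frames
result[cols] <- Reduce(`+`, lapply(dfs, `[`, cols))
result
```

This only makes sense when the rows of all data frames line up one to one, as the question states.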
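If df1, df2, df3, and df4 have the same values in c1, c2 and c5, then merging them
just gives you a df5 with df1's values and df1's size,
so why don't you simply copy df1 and overwrite the two prediction columns with row sums taken across the four data frames:
df5 <- df1
df5$pred1 <- rowSums(cbind(df1$pred1, df2$pred1, df3$pred1, df4$pred1))
df5$pred2 <- rowSums(cbind(df1$pred2, df2$pred2, df3$pred2, df4$pred2))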
First, merge all the data frames, and then create the new columns pred1 and pred2:
df1 <- data.frame(c1= c(1,1,2,2,4),c2 = c(2,2,3,3,5),c5 = c(3,4,4,5,6))
df2 <- data.frame(c1= c(10,1,2,2,4),c2 = c(2,2,30,3,5),c5 = c(3,4,40,5,6))
df3 <- data.frame(c1= c(15,1,2,2,4),c2 = c(22,2,3,3,5),c5 = c(3,44,4,5,6))
df4 <- data.frame(c1= c(12,1,2,2,4),c2 = c(2,23,3,3,5),c5 = c(3,4,45,5,6))
tmp <- merge(df1,df2,by= c("c1","c2","c5"),all.x = TRUE,all.y=TRUE)
tmp <- merge(tmp,df3,by= c("c1","c2","c5"),all.x = TRUE,all.y=TRUE)
tmp <- merge(tmp,df4,by= c("c1","c2","c5"),all.x = TRUE,all.y=TRUE)
tmp$pred1 = rowSums(tmp[,1:3])
tmp:
tmp
c1 c2 c5 pred1
1 1 2 3 6
2 1 2 4 7
3 1 2 44 47
4 1 23 4 28
5 2 3 4 9
....
Related
I have two data frames. df1 has stock symbols and values; df2 has correlations for the same names, but arranged as rows. df1 has many more columns than df2, but every column in df2 exists in df1. I need to multiply the matching columns and store the newly created values in a new data frame. The new data frame will only have the stock symbol and then all the products df1*df2.
The data looks like this:
df1
A Company Symbol Earn.GR MF Effic MF
TRUE 1.320005832 -0.080712181
df2:
Variable Corr
1 Val MF 0.312140675
2 Earn.GR.withCorr MF 0.992410721
I have tried this code, but not getting the expected result:
Transpose df2:
df2 <- transpose (df2)
rownames(df2) <- colnames(df2)
Match and multiply columns
df3 <- df1[names(df1) %in% names(df2)] <- sapply(names(df1[names(df1) %in% names(df2)]),
function(x) df1[[x]] * df2[[x]])
Thanks in advance.
With base R, you could do something like this
df1 = as.data.frame(matrix(1:14,2,7))
df2 = as.data.frame(matrix(15:28,2,7))
names(df1)= letters[1:7]
names(df2)= c("a","d",letters[9:12],"b")
m = match(names(df1),names(df2))
newdf = setNames(df1[,which(!is.na(m))]*df2[,na.omit(m)],
paste0("mult_",names(df2[,na.omit(m)])))
> newdf
mult_a mult_b mult_d
1 15 81 119
2 32 112 144
Find the common columns using intersect, subset both data frames to those columns, and multiply:
common_cols <- intersect(names(df1), names(df2))
df3 <- df1[common_cols] * df2[common_cols]
df3
# a c
#1 2 144
#2 6 169
#3 12 196
#4 20 225
#5 30 256
data
df1 <- data.frame(a = 1:5, b = 11:15, c = 12:16)
df2 <- data.frame(a = 2:6, d = 11:15, c = 12:16, e = 1:5)
Update
I think you need to merge before multiplying:
df3 <- merge(df1[common_cols], df2[common_cols], by = "Company")
cbind(df3[1], df3[-1][c(TRUE, FALSE)] * df3[-1][c(FALSE, TRUE)])
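Since df2 in the question holds one correlation per row, another option is to turn it into a named lookup vector and scale the matching columns of df1. This is only a sketch under assumptions: the column names `Symbol`, `Val` and `Earn` and all values are invented, because the question's real column names are not recoverable from the post.

```r
# Toy data shaped like the question's: df1 is wide, df2 is long (assumed values)
df1 <- data.frame(Symbol = c("AAA", "BBB"),
                  Val = c(1.5, 2.0),
                  Earn = c(-0.5, 4.0),
                  Other = c(9, 9))
df2 <- data.frame(Variable = c("Val", "Earn"),
                  Corr = c(0.5, 2.0))

# Named lookup vector: variable name -> correlation
corr <- setNames(df2$Corr, df2$Variable)
common <- intersect(names(df1), names(corr))

# Multiply each matching column of df1 by its own correlation
scaled <- sweep(as.matrix(df1[common]), 2, corr[common], `*`)
result <- data.frame(Symbol = df1$Symbol, scaled)
result
```

Columns of df1 with no entry in df2 (here `Other`) are dropped, which matches the request to keep only the symbol and the products.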
I have 2 data frames D1 and D2.
dim(D1) = 6096 x 2
dim(D2) = 6100 x 5
I would like to connect D1 and D2 so that the two columns of D1 come first, followed by the columns of D2 on the right. 4 rows of the first two columns (from D1) would then hold NA, NULL or 0 values.
I'm looking for a function.
Many questions on Stack Overflow are about adding two data frames to each other, either by rows or by columns. I don't want to add their values.
Imagine dataframes df1 and df2 and the output is dfinal.
df1 = data.frame(rep(c(1,3,5,7), 2))
df2 = data.frame(rep(c(2,4,6,8), 2))
df1 = cbind(df1,df2)
colnames(df1) = c("C1","C2")
df3 = data.frame(rep(c("1","b","c"), 2))
df4 = data.frame(rep(c("d","3","f"), 2))
df5 = data.frame(rep(c("g","h","2"), 2))
df2 = cbind(df3,df4,df5)
colnames(df2) = c("C3","C4","C5")
df2 = rbind(df2, NA)
df2 = rbind(df2, NA)
dfinal = cbind(df1,df2)
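One way to wrap this in a function is to pad the shorter data frame with NA rows before calling cbind. The sketch below is an assumption-laden illustration: `pad_rows` is a hypothetical helper, not a base R function, and the small D1 and D2 are invented stand-ins for the real 6096 x 2 and 6100 x 5 data.

```r
# Hypothetical helper: extend a data frame with all-NA rows up to n rows
pad_rows <- function(d, n) {
  d[seq_len(n), , drop = FALSE]   # row indices past nrow(d) yield NA rows
}

D1 <- data.frame(C1 = 1:3, C2 = c(2.5, 3.5, 4.5))
D2 <- data.frame(C3 = letters[1:5], C4 = 5:1)

n <- max(nrow(D1), nrow(D2))
dfinal <- cbind(pad_rows(D1, n), pad_rows(D2, n))
rownames(dfinal) <- NULL          # drop the "NA", "NA.1" row names left by padding
dfinal
```

The NA rows always end up at the bottom of the shorter data frame; if they belong elsewhere, the rows would need to be aligned by a key with merge instead.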
I have a relatively large amount of data stored in a list of data frames with several columns.
For each element of the list I wish to check one column against a reference and if present extract the value held in another column of the same element and place in a new summary matrix.
e.g. with the following example code:
add1 = c("N1","N1","N1")
coords1 = c(1,2,3)
vals1 = c("a","b","c")
extra1 = c("x","y","x")
add2 = c("N2","N2","N2","N2")
coords2 = c(2,3,4,5)
vals2 = c("b","c","d","e")
extra2 = c("z","y","x","x")
add3 = c("N3","N3","N3")
coords3 = c(1,3,5)
vals3 = c("a","c","e")
extra3 = c("z","z","x")
df1 <- data.frame(add1, coords1, vals1, extra1)
df2 <- data.frame(add2, coords2, vals2, extra2)
df3 <- data.frame(add3, coords3, vals3, extra3)
list_all <- list(df1, df2, df3)
# column 2 holds the coordinates used as the reference
coordinate.extract <- unique(unlist(lapply(list_all, "[", 2)))
my_matrix <- matrix(0, ncol = length(list_all)
, nrow = (length(coordinate.extract)))
my_matrix_new <- cbind(as.character(coordinate.extract)
, my_matrix)
I would like to end up with:
my_matrix_new =
V1 V2 V3 V4
1  a     a
2  b  b
3  c  c  c
4     d
5     e  e
i.e. the 3rd column of each list element is chosen based on the value of the second column.
I hope this is clear.
Thanks,
Matt
I would use a data.frame as there are mixed classes. You may try merge with Reduce to get the expected output: select the 2nd and 3rd columns in each list element, change the name of the 2nd column so that it is the same across all the list elements, merge, and if needed replace the NA elements with ''.
lst1 <- lapply(list_all, function(x) {names(x)[2] <- 'V1';x[2:3] })
res <- Reduce(function(...) merge(..., by='V1', all=TRUE), lst1)
res[-1] <- lapply(res[-1], as.character)
res[is.na(res)] <- ''
res
#   V1 vals1 vals2 vals3
# 1  1     a           a
# 2  2     b     b
# 3  3     c     c     c
# 4  4           d
# 5  5           e     e
We can change the column names
names(res) <- paste0('V', seq_along(res))
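If a character matrix is really what is wanted, a match-based lookup is another option. This is a minimal sketch rather than the method above; the column names `coords` and `vals` are simplified stand-ins for coords1/vals1 and so on, chosen so that every list element can be addressed by the same names.

```r
# Simplified version of the question's list (same values, uniform column names)
list_all <- list(
  data.frame(coords = c(1, 2, 3),    vals = c("a", "b", "c")),
  data.frame(coords = c(2, 3, 4, 5), vals = c("b", "c", "d", "e")),
  data.frame(coords = c(1, 3, 5),    vals = c("a", "c", "e"))
)

# Reference coordinates: every coordinate seen in any element
coords <- sort(unique(unlist(lapply(list_all, `[[`, "coords"))))

# One column per list element: look up each reference coordinate,
# leaving "" where the element has no such coordinate
m <- sapply(list_all, function(d) {
  v <- as.character(d$vals)[match(coords, d$coords)]
  ifelse(is.na(v), "", v)
})
my_matrix_new <- cbind(as.character(coords), m)
my_matrix_new
```

Because everything is stored as character, the coordinates in the first column are strings; the data.frame route keeps them numeric.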
Situation
I have two data frames, df1 and df2, with the same column headings
x <- c(1,2,3)
y <- c(3,2,1)
z <- c(3,2,1)
names <- c("id","val1","val2")
df1 <- data.frame(x, y, z)
names(df1) <- names
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(3, 2, 1)
df2 <- data.frame(a, b, c)
names(df2) <- names
And am performing a merge
#library(dplyr) # not needed for merge
joined_df <- merge(x=df1, y=df2, c("id"),all=TRUE)
This gives me the columns in the joined_df as id, val1.x, val2.x, val1.y, val2.y
Question
Is there a way to co-locate the columns that had the same heading in the original data frames, to give the column order in the joined data frame as id, val1.x, val1.y, val2.x, val2.y?
Note that my actual data frame has 115 columns, so I'd like to stay clear of using joined_df <- joined_df[, c(1, 2, 4, 3, 5)] if possible.
Update/Edit: I would also like to maintain the original order of the column headings, so sorting alphabetically is not an option on my actual data (I realise it would work with the example I have given).
My desired output is
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
Update with solution for general case
The accepted answer solves my issue nicely.
I've adapted the code slightly here to use the original column names, without having to hard-code them in the rep function.
#specify columns used in merge
merge_cols <- c("id")
# identify duplicate columns and remove those used in the 'merge'
dup_cols <- names(df1)
dup_cols <- dup_cols [! dup_cols %in% merge_cols]
# replicate each duplicate column name and append an 'x' and 'y'
dup_cols <- rep(dup_cols, each=2)
var <- c("x", "y")
newnames <- paste(dup_cols, ".", var, sep = "")
#create new column names and sort the joined df by those names
newnames <- c(merge_cols, newnames)
joined_df <- joined_df[newnames]
How about something like this
numrep <- rep(1:2, each = 2)
numrep
var <- c("x", "y")
var
newnames <- paste("val", numrep, ".", var, sep = "")
newdf <- cbind(joined_df$id, joined_df[newnames])
names(newdf)[1] <- "id"
Which should give you the dataframe like this
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
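A variation that avoids building the new names by hand is to strip the suffixes and order the columns by where each base name sits in df1. This is a sketch under one assumption: no original column name already ends in ".x" or ".y", otherwise the suffix stripping would mangle it.

```r
df1 <- data.frame(id = 1:3, val1 = c(3, 2, 1), val2 = c(3, 2, 1))
df2 <- data.frame(id = 1:3, val1 = c(1, 2, 3), val2 = c(3, 2, 1))
joined_df <- merge(df1, df2, by = "id", all = TRUE)

# Strip the ".x"/".y" suffixes that merge() appended
base <- sub("\\.[xy]$", "", names(joined_df))
# Rank each column by the position of its base name in df1; ties keep
# their current order, so each ".x" column stays ahead of its ".y" partner
joined_df <- joined_df[order(match(base, names(df1)))]
names(joined_df)
```

The original column order of df1 is preserved because the ranking comes from names(df1) rather than from an alphabetical sort.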
I have two data frames.
set.seed(1234)
df <- data.frame(
id = factor(rep(1:24, each = 10)),
price = runif(20)*100,
quantity = sample(1:100,240, replace = T)
)
df2 <- data.frame(
id = factor(seq(1:24)),
eq.quantity = sample(1:100, 24, replace = T)
)
For each level of the factor id, I would like to find the value of df$quantity that is closest, in absolute terms, to df2$eq.quantity. I would like to do that for each id in df2 and bind the results into a new data frame called results.
I can do it like this for each individual id:
d.1 <- df2[df2$id == 1, 2]
df.1 <- subset(df, id == 1)
id.1 <- df.1[which.min(abs(df.1$quantity-d.1)),]
Which would give the solution:
id price quantity
1 66.60838 84
But I would really like a smarter solution that also gathers the results into a data frame. If I did it manually, it would look something like this:
results <- cbind(id.1, id.2, etc..., id.24)
I had some trouble giving this question a good name?
data.tables are smart!
Adding this to your current example...
library(data.table)
dt = data.table(df)
dt2 = data.table(df2)
setkey(dt, id)
setkey(dt2, id)
dt[dt2, dif:=abs(quantity - eq.quantity)]
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
result:
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
id price quantity
1: 1 66.6083758 84
2: 2 29.2315840 19
3: 3 62.3379442 63
4: 4 54.4974836 31
5: 5 66.6083758 6
6: 6 69.3591292 13
...
Merge the two datasets and use lapply to perform the function on each id.
df3 <- merge(df,df2,all.x=TRUE,by="id")
diffvar <- function(df){
df4 <- subset(df3, id == df)
df4[which.min(abs(df4$quantity-df4$eq.quantity)),]
}
resultslist <- lapply(levels(df3$id),function(df) diffvar(df))
Combine the resulting list elements into a data frame (rbind keeps the column names and types):
resultsdf <- do.call(rbind, resultslist)
Or, more simply:
library(plyr)
resultsdf <- ddply(df3, .(id), function(x)x[which.min(abs(x$quantity-x$eq.quantity)),])
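The same split-apply-combine idea also works without extra packages. Below is a minimal base R sketch on a smaller made-up data set (3 ids instead of 24, with quantities chosen by hand so the closest row is easy to check); it is an illustration, not a replacement for the answers above.

```r
set.seed(1)
df <- data.frame(id = factor(rep(1:3, each = 4)),
                 price = runif(12) * 100,
                 quantity = c(10, 20, 30, 40, 5, 15, 25, 35, 50, 60, 70, 80))
df2 <- data.frame(id = factor(1:3), eq.quantity = c(22, 4, 100))

# Attach eq.quantity to every row of its id
merged <- merge(df, df2, by = "id")

# For each id, keep the row whose quantity is closest to eq.quantity
closest <- lapply(split(merged, merged$id), function(g) {
  g[which.min(abs(g$quantity - g$eq.quantity)), ]
})
results <- do.call(rbind, closest)
results
```

which.min returns the first minimum, so if two rows are equally close only the first one is kept.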