Add missing columns from different data.frame filled with 0 [duplicate] - r

I have the following situation:
df1
a b c d
1 2 3 4
df2
a c
5 6
The result I want is to fill up the second data.frame with the columns it is missing relative to df1, and to fill those columns with zeros. So the result should be:
df3
a b c d
5 0 6 0
The data frames are quite big, which is why an automated way of doing this would be great.

We can use setdiff to find the columns of df1 that are not present in df2 and assign the value 0 to those columns.
df2[setdiff(names(df1), names(df2))] <- 0
# a c b d
#1 5 6 0 0
If we want to maintain the same order of columns as in df1 we can later do
df2[names(df1)]
# a b c d
#1 5 0 6 0
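Since the question asks for an automated way, the two steps above can be wrapped in a small function. This is only a sketch: the name add_missing_cols and its arguments are made up for illustration, and note that the final reordering also drops any column of the target that the template does not have.

```r
# Hypothetical helper: add to `target` every column that `template` has
# but `target` lacks, fill it with `fill`, and return the columns in
# `template` order.
add_missing_cols <- function(target, template, fill = 0) {
  missing_cols <- setdiff(names(template), names(target))
  if (length(missing_cols) > 0) {
    target[missing_cols] <- fill
  }
  # Reorder to match the template (columns not in the template are dropped).
  target[names(template)]
}

df1 <- data.frame(a = 1, b = 2, c = 3, d = 4)
df2 <- data.frame(a = 5, c = 6)
df3 <- add_missing_cols(df2, df1)
df3
```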

There's probably a more elegant solution, but I think this works for your situation.
If you're not too fussed about mixing your workflow up with dplyr and data.table syntax, you can use setdiff() to identify non-matching column names, and use data.table syntax to create those zero-value columns efficiently without using loops or apply() functions. Once you've made sure this works for all the possible situations, you can wrap it in a function and scale this across more datasets.
df1 <- data.frame(a = 1, b = 2, c = 3, d = 4)
df2 <- data.frame(a = 5, c = 6)
library(magrittr) # provides %>%
# Variables in df1 but not in df2
diff_vars <- dplyr::setdiff(names(df1), names(df2))
df2 %>%
  data.table::data.table() %>%
  .[, c(diff_vars) := 0] %>% # add all missing columns at once, filled with 0
  tibble::as_tibble() # Can choose to keep this in data.table

df1 <- data.frame(a = 1, b = 2, c = 3, d = 4)
df2 <- data.frame(a = 5, c = 6)
library(tidyverse)
right_join(df1, df2)
a b c d
1 5 NA 6 NA
You'll then have to change the NAs to 0.
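A minimal sketch of that last step. To keep it runnable without extra packages I have swapped in base R's merge() with all.y = TRUE as a stand-in for right_join() (that swap is my assumption, not part of the answer); the NA replacement line works the same way on the right_join() result. Be aware that it also overwrites any NA that was already present in df2.

```r
df1 <- data.frame(a = 1, b = 2, c = 3, d = 4)
df2 <- data.frame(a = 5, c = 6)

# Keep every row of df2; columns that exist only in df1 come back as NA.
res <- merge(df1, df2, all.y = TRUE)

# Overwrite every NA in the result with 0.
res[is.na(res)] <- 0

# merge() puts the key columns first, so restore df1's column order.
res <- res[names(df1)]
res
```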

Related

R how cbind two data.frames next to each other, filling unequal rows with NAs

How can I simply "paste" two data frames next to each other, filling unequal rows with NAs (e.g. because I want to make a "kable" or something similar)?
df1 <- data.frame(a = c(1,2,3),
b = c(3,4,5))
df2 <- data.frame(a = c(4,5),
b = c(5,6))
# The desired "merge"
a b a b
1 3 4 5
2 4 5 6
3 5 NA NA
Thanks to Ronak Shah, I found an easy answer in the answers to this post: How to cbind or rbind different lengths vectors without repeating the elements of the shorter vectors?
Without having to hack anything together, one can use cbind.na from the qpcR package:
df1 <- data.frame(a = c(1,2,3),
b = c(3,4,5))
df2 <- data.frame(a = c(4,5),
b = c(5,6))
comb <- qpcR:::cbind.na(df1, df2)
As this answer is 4 years old, I wonder if there are more "modern" solutions in popular packages like the tidyverse et al.
In base R you could do:
nr <- max(nrow(df1), nrow(df2))
cbind(df1[1:nr, ], df2[1:nr, ])
# a b a b
# 1 1 3 4 5
# 2 2 4 5 6
# 3 3 5 NA NA
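The same padding idea can be generalised to any number of data frames. The sketch below is only an illustration: cbind_pad is a made-up name, and it relies on the fact used above that indexing rows past the end of a data frame yields NA rows.

```r
# Hypothetical helper: pad each data frame with NA rows up to the length
# of the longest one, then bind them side by side.
cbind_pad <- function(...) {
  dfs <- list(...)
  nr <- max(vapply(dfs, nrow, integer(1)))
  padded <- lapply(dfs, function(d) {
    d <- d[seq_len(nr), , drop = FALSE] # rows past nrow(d) become NA
    rownames(d) <- NULL                 # drop the "NA" row names
    d
  })
  do.call(cbind, padded)
}

df1 <- data.frame(a = c(1, 2, 3),
                  b = c(3, 4, 5))
df2 <- data.frame(a = c(4, 5),
                  b = c(5, 6))
comb <- cbind_pad(df1, df2)
comb
```

The duplicated column names (a, b, a, b) are kept as they are, so select columns by position rather than by name afterwards.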

Take sum of rows for every 3 columns in a dataframe

I have searched high and low and tried multiple options to solve this, but did not get the desired output described below.
I have a data frame df3 whose column headers are dates and whose values are 0 or 1, as shown below:
df = data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) = c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 = data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 = cbind(df2,df)
Now I need df4, in which the sum of the first 3 columns forms one column. This should be repeated dynamically for each following group of 3 columns.
df4
Options I tried:
a) rbind.data.frame(apply(matrix(df3, nrow = n - 1), 1,sum))
b) col_list <- list(c("1/1/2018","1/2/2018","1/3/2018"), c("1/4/2018","1/5/2018","1/6/2018"))
lapply(col_list, function(x)sum(df3[,x])) %>% data.frame
One way would be to split df3 every 3 columns using split.default. To split the data we generate a sequence using rep, then for each dataframe we take rowSums and finally cbind the result together.
cbind(df3[1], sapply(split.default(df3[-1],
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))), rowSums))
# CUST_ID 1 2
#1 A 1 1
#2 B 2 0
#3 C 2 1
#4 D 1 1
#5 E 2 2
#6 F 2 2
FYI, the sequence generated from rep is
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))
#[1] 1 1 1 2 2 2
This makes it possible to split every 3 columns.
The results are different because OP used sample without set.seed.
If rep seems too long then we can generate the same sequence of columns using gl
gl(ncol(df3[-1])/3, 3)
#[1] 1 1 1 2 2 2
#Levels: 1 2
So the final code would be
cbind(df3[1], sapply(split.default(df3[-1], gl(ncol(df3[-1])/3, 3)), rowSums))
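If the number of value columns is not a multiple of 3, the gl() call above no longer lines up. One possible generalisation, sketched here with a made-up function name (sum_every_n) and made-up example data, builds the grouping index with ceiling() so that a shorter last group is still summed:

```r
# Hypothetical helper: sum every `n` consecutive value columns row-wise,
# leaving the first `id_cols` columns untouched.
sum_every_n <- function(data, n = 3, id_cols = 1) {
  values <- data[-seq_len(id_cols)]
  grp <- ceiling(seq_along(values) / n) # 1 1 1 2 2 2 3 ...
  sums <- sapply(split.default(values, grp), rowSums)
  cbind(data[seq_len(id_cols)], sums)
}

# Fixed example data with 7 value columns, so the last group has 1 column.
df3 <- data.frame(CUST_ID = c("A", "B"),
                  d1 = c(1, 0), d2 = c(1, 0), d3 = c(0, 1),
                  d4 = c(1, 1), d5 = c(0, 0), d6 = c(0, 1),
                  d7 = c(1, 1))
df4 <- sum_every_n(df3)
df4
```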
We can use seq to create an index, get the subset of columns as a list, Reduce by taking the sum, and create new columns
df4 <- df3[1]
df4[paste0('col', c('123', '456'))] <- lapply(seq(2, ncol(df3), by = 3),
function(i) Reduce(`+`, df3[i:min((i+2), ncol(df3))]))
df4
# CUST_ID col123 col456
#1 A 2 2
#2 B 3 3
#3 C 1 3
#4 D 2 3
#5 E 2 1
#6 F 0 1
data
set.seed(123)
df <- data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) <- c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 <- data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 <- cbind(df2, df)

conditional replace using match of column/row names from other data frame [duplicate]

I have two data frames:
id <- c("a", "b", "c")
a <- 0
b <- 0
c <- 0
df1 <- data.frame(id, a, b, c)
id a b c
1 a 0 0 0
2 b 0 0 0
3 c 0 0 0
num <- c("a", "c", "c")
partner <- c("b", "b", "a")
value <- c("10", "20", "30")
df2 <- data.frame(num, partner, value)
num partner value
1 a b 10
2 c b 20
3 c a 30
I'd like to replace zeroes in df1 with df2$value in every instance df1$id==df2$num & colnames(df1)==df2$partner. So the output should look like:
a <- c(0, 0, 30)
b <- c(10, 0, 20)
c <- c(0, 0, 0)
df.nice <- data.frame(id, a, b, c)
id a b c
1 a 0 10 0
2 b 0 0 0
3 c 30 20 0
I can replace individual cells with the following:
df1$b[df1$id=="a"] <- ifelse(df2$num=="a" & df2$partner=="b", df2$value, 0)
but I need to cycle through all possible df1 row/column combinations for a large data frame. I suspect this involves plyr and match together, but can't quite figure out how.
Update
Thanks to @MikeH., I've turned to using reshape. This seems to work:
df.nice <- melt(df2, id=c("num", "partner"))
df.nice <- dcast(df.nice, num ~ partner, value.var="value")
to produce this:
num a b
1 a <NA> 10
2 c 30 20
I do need all possible row/column combinations, however, with the missing ones represented as zero. Is there a way to ask reshape to obtain rows and columns from another data frame (e.g., df1), or should I bind those after reshaping?
If you want a replace (rather than a reshape) I think a simple base R solution would be to do:
idxs <- t(mapply(cbind, match(df2$num, df1$id), match(df2$partner, names(df1))))
df1[idxs] <- df2$value
df1
id a b c
1 a 0 10 0
2 b 0 0 0
3 c 30 20 0
Note that I build the row/column combination lookups to replace using the t(mapply(...)). When you select like df1[idxs] this converts to matrix (to select specific row/column combinations) and then converts back to data.frame.
I had to read in your data using stringsAsFactors = FALSE so the values would register properly (rather than as the factors' underlying integer codes).
Data:
df2 <- data.frame(num, partner, value, stringsAsFactors = F)
df1 <- data.frame(id, a, b, c, stringsAsFactors = F)
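As a possible alternative to building positional indexes with match(), the lookup can also be done by name. This is just a sketch of a different technique, not the method above: it converts the value columns to a matrix with row and column names, indexes it with a two-column character matrix, and converts df2$value to numeric on the way in.

```r
id <- c("a", "b", "c")
df1 <- data.frame(id, a = 0, b = 0, c = 0, stringsAsFactors = FALSE)
df2 <- data.frame(num = c("a", "c", "c"),
                  partner = c("b", "b", "a"),
                  value = c("10", "20", "30"),
                  stringsAsFactors = FALSE)

# Numeric matrix of the value columns, with the ids as row names.
m <- as.matrix(df1[-1])
rownames(m) <- as.character(df1$id)

# A two-column character matrix indexes by (row name, column name).
m[cbind(df2$num, df2$partner)] <- as.numeric(df2$value)

# Rebuild the data frame with the id column in front.
df.nice <- data.frame(id = df1$id, m, row.names = NULL,
                      stringsAsFactors = FALSE)
df.nice
```

This assumes every df2$num appears in df1$id and every df2$partner is a column name of df1; an unmatched name makes the assignment fail with a subscript error.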

How to replace values in multiple columns in a data.frame with values from a vector in R?

I would like to replace the values in the last three columns in my data.frame with the three values in a vector.
Example of data.frame
df
A B C D
5 3 8 9
Vector
1 2 3
what I would like the data.frame to look like.
df
A B C D
5 1 2 3
Currently I am doing:
df$B <- Vector[1]
df$C <- Vector[2]
df$D <- Vector[3]
I would like to not replace the values one by one. I would like to do it all at once.
Any help will be appreciated. Please let me know if any further information is needed.
We can subset the last three columns of the dataset with tail, replicate the 'Vector' to make the lengths similar and assign the values to those columns
df[,tail(names(df),3)] <- Vector[col(df[,tail(names(df),3)])]
df
# A B C D
#1 5 1 2 3
NOTE: I replicated the 'Vector' assuming that there will be more rows in the 'df' in the original dataset.
Try this:
df[-1] <- 1:3
giving:
> df
A B C D
1 5 1 2 3
Alternately, we could do it non-destructively like this:
replace(df, -1, 1:3)
Note: The input df in reproducible form is:
df <- data.frame(A = 5, B =3, C = 8, D = 9)
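With more than one row, df[-1] <- 1:3 fills the cells column by column and so no longer puts one value in each column. A sketch of one way to handle that case, on made-up two-row data, is to assign a list with one element per column, so that each value is recycled down its own column:

```r
# Made-up two-row version of the example data.
df <- data.frame(A = c(5, 7), B = c(3, 4), C = c(8, 1), D = c(9, 2))
Vector <- c(1, 2, 3)

last3 <- tail(names(df), 3)
# One list element per column; each value is recycled down its column.
df[last3] <- as.list(Vector)
df
```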

Convert the values in a column into row names in an existing data frame

I would like to convert the values in a column of an existing data frame into row names. Is it possible to do this without exporting the data frame and then reimporting it with a row.names = call?
For example I would like to convert:
> samp
names Var.1 Var.2 Var.3
1 A 1 5 0
2 B 2 4 1
3 C 3 3 2
4 D 4 2 3
5 E 5 1 4
Into:
> samp.with.rownames
Var.1 Var.2 Var.3
A 1 5 0
B 2 4 1
C 3 3 2
D 4 2 3
E 5 1 4
This should do:
samp2 <- samp[,-1]
rownames(samp2) <- samp[,1]
So in short: no, there is no alternative to reassigning.
Edit: Correcting myself, one can also do it in place: assign rowname attributes, then remove column:
R> df<-data.frame(a=letters[1:10], b=1:10, c=LETTERS[1:10])
R> rownames(df) <- df[,1]
R> df[,1] <- NULL
R> df
b c
a 1 A
b 2 B
c 3 C
d 4 D
e 5 E
f 6 F
g 7 G
h 8 H
i 9 I
j 10 J
R>
As of 2016 you can also use the tidyverse.
library(tidyverse)
samp %>% remove_rownames %>% column_to_rownames(var="names")
In one line:
> samp.with.rownames <- data.frame(samp[,-1], row.names=samp[,1])
It looks like the one-liner got even simpler along the way (currently using R 3.5.3):
# generate original data.frame
df <- data.frame(a = letters[1:10], b = 1:10, c = LETTERS[1:10])
# use first column for row names
df <- data.frame(df, row.names = 1)
The column used for row names is removed automatically.
With a one-row dataframe
Beware that if the dataframe has a single row, the behaviour might be confusing. As the documentation mentions:
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
This means that, if you use the same command as above, it might look like it did nothing (when it actually named the first row "1", which won't look any different in the viewer).
In that case, you will have to stick to the more verbose:
df <- data.frame(a = "a", b = 1)
df <- data.frame(df, row.names = df[,1])
... but the column won't be removed. Also remember that, if you remove a column to end up with a single-column dataframe, R will simplify it to an atomic vector. In that case, you will want to use the extra drop argument:
df <- data.frame(df[,-1, drop = FALSE], row.names = df[,1])
You can execute this in 2 simple statements:
row.names(samp) <- samp$names
samp[1] <- NULL
