Stacking datasets - r

I have two datasets that need to be stacked on top of each other. Think of them as two subsets of one dataset. The issue is that they have completely different variables other than "record_id" and one more variable:
ds_app <- data.frame(record_id, a, b, c, d, e, f)
ds_vo <- data.frame(record_id, g, h, i, j, k, l , m, n, o, p, q)
Is there an easy way to stack these without having to create dummy variables, i.e. variables filled with NA values?
Thanks so much!

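You may need merge with all = TRUE, which joins the two data frames on their shared columns and fills the non-matching cells with NA: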
merge(df1,df2, all = TRUE)
Example
> df1 <- data.frame(id = 1:3, a = 1:3, b = 4:6)
> df2 <- data.frame(id = 1:5, g = 1:5, h = 6:10, i = 11:15)
> df1
id a b
1 1 1 4
2 2 2 5
3 3 3 6
> df2
id g h i
1 1 1 6 11
2 2 2 7 12
3 3 3 8 13
4 4 4 9 14
5 5 5 10 15
> merge(df1,df2, all = TRUE)
id a b g h i
1 1 1 4 1 6 11
2 2 2 5 2 7 12
3 3 3 6 3 8 13
4 4 NA NA 4 9 14
5 5 NA NA 5 10 15

Another option with full_join
library(dplyr)
full_join(df1, df2)
data
df1 <- data.frame(id = 1:3, a = 1:3, b = 4:6)
df2 <- data.frame(id = 1:5, g = 1:5, h = 6:10, i = 11:15)
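If the goal is literally to stack the rows rather than to join on the id, dplyr's bind_rows may be closer to what the question asks for: it matches columns by name and fills the columns missing from either data frame with NA. A minimal sketch, assuming made-up contents for ds_app and ds_vo (the real columns are not shown in the question):

```r
library(dplyr)

# Hypothetical stand-ins for the two subsets; only record_id is shared
ds_app <- data.frame(record_id = 1:2, a = c(10, 20))
ds_vo  <- data.frame(record_id = 3:4, g = c("x", "y"))

# Stack by row; columns absent from one input are filled with NA
stacked <- bind_rows(ds_app, ds_vo)
stacked
```

Unlike merge, this keeps one output row per input row, so records that share a record_id are not combined.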

Related

Aggregate and remove duplicates for some rows

I have a dataset like
df <- data.frame(id = c("a","a","b","b","c","d","e","f"),
val = c(1,2,3,4,5,6,7,8),
extracol = c("x",NA,"y","z","t","v","u","p"))
id val extracol
1 a 1 x
2 a 2 <NA>
3 b 3 y
4 b 4 z
5 c 5 t
6 d 6 v
7 e 7 u
8 f 8 p
and I want to sum (and aggregate) the values according to the column id but only for "a". So I want to get something like:
id val extracol
1 a 3 x
2 b 3 y
3 b 4 z
4 c 5 t
5 d 6 v
6 e 7 u
7 f 8 p
I really don't care if I get "x" or NA in the extracol. Any suggestion?
This would work:
library(dplyr)
df <- data.frame(id = c("a","a","b","b","c","d","e","f"),
val = c(1,2,3,4,5,6,7,8),
extracol = c("x",NA,"y","z","t","v","u","p"))
# keep only a
a = df%>% filter(id == "a")
# aggregate a
a_agg= a %>% group_by(id) %>% summarise(val = sum(val), extracol = first(extracol))
# drop a
df = df %>% filter(id != "a")
# append a
df = rbind(df, a_agg)
df
id val extracol
1 b 3 y
2 b 4 z
3 c 5 t
4 d 6 v
5 e 7 u
6 f 8 p
7 a 3 x
A base R option
with(
df,
rbind(
data.frame(
id = "a",
val = sum(val[id == "a"]),
extracol = na.omit(extracol[id == "a"])
),
df[id != "a", ]
)
)
gives
id val extracol
1 a 3 x
3 b 3 y
4 b 4 z
5 c 5 t
6 d 6 v
7 e 7 u
8 f 8 p
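If the original row order should be preserved without binding pieces back together, another possible approach is to overwrite the first "a" row with the sum and drop the remaining "a" rows. This is only a sketch under the assumption that a single id ("a") needs collapsing; the names is_a and first_a are introduced here for illustration:

```r
df <- data.frame(id = c("a","a","b","b","c","d","e","f"),
                 val = c(1,2,3,4,5,6,7,8),
                 extracol = c("x",NA,"y","z","t","v","u","p"))

is_a    <- df$id == "a"      # rows to collapse
first_a <- which(is_a)[1]    # position of the row that is kept

res <- df
res$val[first_a] <- sum(df$val[is_a])                  # store the total in the kept row
res <- res[!is_a | seq_len(nrow(res)) == first_a, ]    # drop the other "a" rows
res
```

The kept row retains whatever extracol value the first "a" row had, which the question says is acceptable.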

Replace column conditional on matching in another column

I would like to replace the values in one column based on matches in another. I'm trying to use the match function but get NA values for the rows without a match.
a <- data.frame( x = c(1,2,3,4,5))
b <- data.frame( y = c(3,4),
z = c("A","B"))
a$x <- b$z[match(a$x, b$y)]
I get:
> a
x
1 <NA>
2 <NA>
3 A
4 B
5 <NA>
I would like :
> a
x
1 1
2 2
3 A
4 B
5 5
First, rename the numeric column of b so that you can merge the two data frames:
library(dplyr)
b <- b %>% rename(x = y)
Then, merge them, turn variables into character and replace the values of column x with those of z if not NA.
a <- merge(a, b, by = "x", all.x = TRUE) %>%
mutate_all(as.character) %>%
mutate(x = ifelse(is.na(z), x, z))
Result:
x z
1 1 <NA>
2 2 <NA>
3 A A
4 B B
5 5 <NA>
Without renaming, I would propose the following, which gives the same result as broti's answer:
tmp.merge<- merge(a,b,by.x = "x", by.y="y", all = TRUE)
for (elm in as.numeric(row.names(tmp.merge[which(!is.na(tmp.merge$z)),]))){
tmp.merge[elm,'x'] <- as.character(tmp.merge[elm,'z'])
}
tmp.merge
result :
> tmp.merge
x z
1 1 <NA>
2 2 <NA>
3 A A
4 B B
5 5 <NA>
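The following works, but b$z must be a character vector rather than a factor, so set stringsAsFactors = FALSE when defining data frame b (this is already the default from R 4.0.0 onward):
library(dplyr)
a <- data.frame( x = c(1,2,3,4,10,13,12,11))
b <- data.frame( y = c(10,12,13),
z = c("A","B","C"), stringsAsFactors = FALSE)
# use the matched label where x appears in b$y, otherwise keep x
a %>% mutate(x = ifelse(x %in% b$y, b$z[match(x, b$y)], x))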
Output
x
1 1
2 2
3 3
4 4
5 A
6 C
7 B
8 11
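The original attempt with match can also be kept almost as is: the NA values appear because match returns NA for rows without a counterpart, so those rows only need a fallback to the existing value. A minimal base R sketch using the data from the question (idx is a name introduced here):

```r
a <- data.frame(x = c(1, 2, 3, 4, 5))
b <- data.frame(y = c(3, 4),
                z = c("A", "B"),
                stringsAsFactors = FALSE)

idx <- match(a$x, b$y)   # NA where a$x has no match in b$y

# keep the original value (as text) when there is no match, else take b$z
a$x <- ifelse(is.na(idx), as.character(a$x), b$z[idx])
a
```

The column becomes character because it has to hold both numbers and letters.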

Checking the presence of values in multiple datasets

I have a number of tables and all the "a" columns of the tables must have identical values for the analysis I am conducting. The actual tables are very big so I will use simplified (mock) data frames.
Let's say I have the following data:
A <- data.frame(a = c(3,4,5,6,7,8), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
B <- data.frame(a = c(2,3,4,5,6,7), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
C <- data.frame(a = c(1,2,3,4,5,6), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
D <- data.frame(a = c(4,5,6,7,8,9), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
Now, the data frames have differing values in column "a". My goal is to delete every row whose "a" value does not appear in all of the other tables.
In order to have identical values in column "a" for all tables A, B and C, I could use the following operations:
A <- A[A$a %in% B$a,]
B <- B[B$a %in% A$a,]
C <- C[C$a %in% B$a,]
B <- B[B$a %in% C$a,]
A <- A[A$a %in% C$a,]
This is already getting very tedious, as you can see. What if I throw table D or other data frames into the mix? It becomes almost impossible to proceed, as each table contains at least one unique value.
One dplyr option could be:
library(dplyr)
bind_rows(list(A, B, C, D), .id = "ID") %>%
# count the datasets numerically; max(ID) would compare the IDs as strings
mutate(n_datasets = n_distinct(ID)) %>%
group_by(a) %>%
filter(n_distinct(ID) == n_datasets)
ID a b c n_datasets
<chr> <dbl> <dbl> <dbl> <int>
1 1 4 5 6 4
2 1 5 6 7 4
3 1 6 7 8 4
4 2 4 6 7 4
5 2 5 7 8 4
6 2 6 8 9 4
7 3 4 7 8 4
8 3 5 8 9 4
9 3 6 9 10 4
10 4 4 4 5 4
11 4 5 5 6 4
12 4 6 6 7 4
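If the tables should stay separate instead of being bound into one, a base R alternative is to compute the values of "a" common to all tables once and then filter each table against that set. A sketch using the mock data from the question (the names tables, common and filtered are introduced here):

```r
A <- data.frame(a = c(3,4,5,6,7,8), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
B <- data.frame(a = c(2,3,4,5,6,7), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
C <- data.frame(a = c(1,2,3,4,5,6), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
D <- data.frame(a = c(4,5,6,7,8,9), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))

tables <- list(A = A, B = B, C = C, D = D)

# values of "a" present in every table
common <- Reduce(intersect, lapply(tables, function(d) d$a))

# keep only the rows whose "a" is in the common set, table by table
filtered <- lapply(tables, function(d) d[d$a %in% common, ])
filtered$A
```

Adding another data frame then only means adding it to the list.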

Add together 2 dataframes in R without losing columns

I have 2 dataframes in R (df1, df2), with df1 as
A C D
1 1 1
2 2 2
df2 as
A B C
1 1 1
2 2 2
How can I merge these 2 dataframes to produce the following output?
A B C D
2 1 2 1
4 2 4 2
Columns are sorted and column values are added. Both DFs have same number of rows. Thank you in advance.
Code to create DF:
df1 <- data.frame("A" = 1:2, "C" = 1:2, "D" = 1:2)
df2 <- data.frame("A" = 1:2, "B" = 1:2, "C" = 1:2)
nm1 = names(df1)
nm2 = names(df2)
nm = intersect(nm1, nm2)
if (length(nm) == 0){ # if no column names in common
cbind(df1, df2)
} else { # if column names in common
cbind(df1[!nm1 %in% nm2], # columns only in df1
df1[nm] + df2[nm], # add columns common to both
df2[!nm2 %in% nm1]) # columns only in df2
}
# D A C B
#1 1 2 2 1
#2 2 4 4 2
You can try:
library(tidyverse)
list(df2, df1) %>%
map(rownames_to_column) %>%
bind_rows %>%
group_by(rowname) %>%
summarise_all(sum, na.rm = TRUE)
# A tibble: 2 x 5
rowname A B C D
<chr> <int> <int> <int> <int>
1 1 2 1 2 1
2 2 4 2 4 2
By using left_join() from dplyr you won't lose any columns (note that this joins on c and does not add the values together):
library(tidyverse)
dat1 <- tibble(a = 1:10,
b = 1:10,
c = 1:10)
dat2 <- tibble(c = 1:10,
d = 1:10,
e = 1:10)
left_join(dat1, dat2, by = "c")
#> # A tibble: 10 x 5
#> a b c d e
#> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 1
#> 2 2 2 2 2 2
#> 3 3 3 3 3 3
#> 4 4 4 4 4 4
#> 5 5 5 5 5 5
#> 6 6 6 6 6 6
#> 7 7 7 7 7 7
#> 8 8 8 8 8 8
#> 9 9 9 9 9 9
#> 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
allnames <- sort(unique(c(names(df1), names(df2))))
df3 <- data.frame(matrix(0, nrow = nrow(df1), ncol = length(allnames)))
names(df3) <- allnames
df3[,allnames %in% names(df1)] <- df3[,allnames %in% names(df1)] + df1
df3[,allnames %in% names(df2)] <- df3[,allnames %in% names(df2)] + df2
df3
A B C D
1 2 1 2 1
2 4 2 4 2
Here is a fun base R method with Reduce.
Reduce(cbind,
list(Reduce("+", list(df1[intersect(names(df1), names(df2))],
df2[intersect(names(df1), names(df2))])), # sum results
df1[setdiff(names(df1), names(df2))], # in df1, not df2
df2[setdiff(names(df2), names(df1))])) # in df2, not df1
This returns
A C D B
1 2 2 1 1
2 4 4 2 2
This assumes that both df1 and df2 have columns that are not present in the other. If this is not true, you'd have to adjust the list.
Note also that you could replace Reduce with do.call in both places and you'd get the same result.
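The approaches above can also be wrapped in a small helper that selects columns by name, so it does not depend on the column order of either input or on both inputs having unique columns. This is only a sketch: add_frames is a hypothetical name, and it assumes both data frames have the same number of rows and only numeric columns.

```r
df1 <- data.frame("A" = 1:2, "C" = 1:2, "D" = 1:2)
df2 <- data.frame("A" = 1:2, "B" = 1:2, "C" = 1:2)

# Hypothetical helper: sum two data frames over the sorted union of their columns
add_frames <- function(x, y) {
  cols <- sort(union(names(x), names(y)))
  # start from a zero-filled frame holding every column
  out <- as.data.frame(matrix(0, nrow = nrow(x), ncol = length(cols),
                              dimnames = list(NULL, cols)))
  out[names(x)] <- out[names(x)] + x   # add x's columns, matched by name
  out[names(y)] <- out[names(y)] + y   # add y's columns, matched by name
  out
}

res <- add_frames(df1, df2)
res
```

Columns present in only one input are carried over unchanged, since they are added to zero.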

How to merge and sum two data frames

Here is my issue:
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]
df2
x y z
C 1 2 3
D 2 3 4
E 3 4 5
F 4 5 6
G 5 6 7
what I wanted is:
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
where duplicated rows were added up by same variable.
A solution with base R:
# create a new variable from the rownames
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
# bind the two dataframes together by row and aggregate
res <- aggregate(cbind(x,y,z) ~ rn, rbind(df1,df2), sum)
# or (thx to @alistaire for reminding me):
res <- aggregate(. ~ rn, rbind(df1,df2), sum)
# assign the rownames again
rownames(res) <- res$rn
# get rid of the 'rn' column
res <- res[, -1]
which gives:
> res
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
With dplyr,
library(dplyr)
# add rownames as a column in each data.frame and bind rows
# (rownames_to_column() replaces the deprecated add_rownames())
bind_rows(df1 %>% tibble::rownames_to_column(),
df2 %>% tibble::rownames_to_column()) %>%
# evaluate following calls for each value in the rowname column
group_by(rowname) %>%
# add all non-grouping variables
summarise_all(sum)
## # A tibble: 7 x 4
## rowname x y z
## <chr> <int> <int> <int>
## 1 A 1 2 3
## 2 B 2 3 4
## 3 C 4 6 8
## 4 D 6 8 10
## 5 E 8 10 12
## 6 F 4 5 6
## 7 G 5 6 7
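You could also vectorize the operation by turning the data frames into matrices. Note that this adds values by position and ignores row names, so it only fits when both data frames contain the same rows in the same order: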
result_df <- as.data.frame(as.matrix(df1) + as.matrix(df2))
This might need some tweaking to get the rownames logic working on a longer example:
dfr <-rbind(df1,df2)
do.call(rbind, lapply( split(dfr, sapply(rownames(dfr),substr,1,1)), colSums))
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
If the rownames could all be assumed to be alpha characters a gsub solution should be easy.
An alternative is to melt the data and cast it. First we store the row names in a new last column of both data frames (thanks to @Jaap):
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
Then we melt the data, keeping rn as the id variable (melt and dcast are taken here from the reshape2 package):
library(reshape2)
melt(list(df1, df2), id.vars = "rn")
Then we use dcast together with mget, which retrieves multiple objects by name at once:
mydf<- dcast(melt(mget(ls(pattern = "df\\d+")), id.vars = "rn"),
rn ~ variable, value.var = "value", fun.aggregate = sum)
rownames(mydf) <- mydf$rn
# get rid of the 'rn' column
mydf <- mydf[, -1]
> mydf
# x y z
#A 1 2 3
#B 2 3 4
#C 4 6 8
#D 6 8 10
#E 8 10 12
#F 4 5 6
#G 5 6 7
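Base R also has rowsum, which sums the rows of a numeric matrix or data frame within groups and so may avoid the helper column entirely. A sketch using the data from the question, with the row names of both data frames passed as the grouping vector:

```r
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]

# rbind() alters the duplicated row names, so pass the originals as the groups
res <- rowsum(rbind(df1, df2),
              group = c(rownames(df1), rownames(df2)))
res
```

The result keeps the group labels as row names, sorted by label.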

Resources