Merge/join two long data frames in R

I have two data frames, a and b, which I would like to combine:
a <- data.frame(g=c("1","2","2","3","3","3","4","4","4","4"),h=c("1","1","2","1","2","3","1","2","3","4"))
b <- data.frame(g=c("1","2","3","3","3","4","4","4","4","4"),i=c("1","2","3","2","1","2","3","4","5","6"))
g is a grouping variable, and h and i are the columns I want to merge/join:
> a
g h
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
7 4 1
8 4 2
9 4 3
10 4 4
> b
g i
1 1 1
2 2 2
3 3 3
4 3 2
5 3 1
6 4 2
7 4 3
8 4 4
9 4 5
10 4 6
a and b should be merged within each level of the grouping variable g. Identical values of h and i should be placed on the same row (independent of the order in which they appear in h/i), and values without a match should each appear once, paired with NA (not in all possible combinations).
The final data frame would look like:
g h i
1 1 1 1
2 2 1 <NA>
3 2 2 2
4 3 1 1
5 3 2 2
6 3 3 3
7 4 1 <NA>
8 4 2 2
9 4 3 3
10 4 4 4
11 4 <NA> 5
12 4 <NA> 6
I need that df to perform a correlation analysis.

This sounds like a merge on h == i that still retains both h and i. Create a new variable x to join on, and keep the unmatched rows from both sides (all=TRUE). With a large hat-tip to @Moody_Mudskipper:
merge(transform(a,x=h), transform(b,x=i), all=TRUE)
# g x h i
#1 1 1 1 1
#2 2 1 1 <NA>
#3 2 2 2 2
#4 3 1 1 1
#5 3 2 2 2
#6 3 3 3 3
#7 4 1 1 <NA>
#8 4 2 2 2
#9 4 3 3 3
#10 4 4 4 4
#11 4 5 <NA> 5
#12 4 6 <NA> 6

We can also do this with dplyr:
library(dplyr)
a %>%
  mutate(x = h) %>%
  full_join(mutate(b, x = i), by = c("g", "x")) %>%
  select(-x)
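As a minimal base R sketch of the same idea, the join keys can be named explicitly and the helper column dropped afterwards. The names res and x are only illustrative; the data are the a and b from the question.

```r
a <- data.frame(g = c("1","2","2","3","3","3","4","4","4","4"),
                h = c("1","1","2","1","2","3","1","2","3","4"))
b <- data.frame(g = c("1","2","3","3","3","4","4","4","4","4"),
                i = c("1","2","3","2","1","2","3","4","5","6"))

# Copy h and i into a shared helper key x, then do a full outer join on g and x
res <- merge(transform(a, x = h), transform(b, x = i),
             by = c("g", "x"), all = TRUE)

# The helper key is no longer needed once the rows are aligned
res$x <- NULL
res
```

Naming the keys in by makes the join independent of whatever other column names the two data frames happen to share.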

Related

Creating two columns of cumulative sum based on the categories of one column

I like to create two columns with cumulative frequency of "A" and "B" in the assignment columns.
df = data.frame(id = 1:10, assignment= c("B","A","B","B","B","A","B","B","A","B"))
id assignment
1 1 B
2 2 A
3 3 B
4 4 B
5 5 B
6 6 A
7 7 B
8 8 B
9 9 A
10 10 B
The resulting table would have this format
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7
How can I generalize the code to more than two categories (say "A", "B", "C")?
Thanks
Use lapply over unique values in assignment to create new columns.
vals <- sort(unique(df$assignment))
df[vals] <- lapply(vals, function(x) cumsum(df$assignment == x))
df
# id assignment A B
#1 1 B 0 1
#2 2 A 1 1
#3 3 B 1 2
#4 4 B 1 3
#5 5 B 1 4
#6 6 A 2 4
#7 7 B 2 5
#8 8 B 2 6
#9 9 A 3 6
#10 10 B 3 7
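Because the lapply approach loops over whatever values are present, it should carry over to more than two categories unchanged. A small sketch with a made-up three-category data frame (df3 is hypothetical, not from the question):

```r
# Hypothetical example data with three categories
df3 <- data.frame(id = 1:6,
                  assignment = c("B", "A", "C", "B", "C", "A"),
                  stringsAsFactors = FALSE)

# One cumulative-count column per distinct category
vals <- sort(unique(df3$assignment))
df3[vals] <- lapply(vals, function(x) cumsum(df3$assignment == x))
df3
```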
We can use model.matrix with colCumsums
library(matrixStats)
cbind(df, colCumsums(model.matrix(~ assignment - 1, df[-1])))
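If matrixStats is not available, a rough base R equivalent is to apply cumsum to each column of the indicator matrix. This is a sketch rather than a drop-in replacement; the names mm and res are illustrative, and the sub() call only strips the "assignment" prefix that model.matrix adds to the column names.

```r
df <- data.frame(id = 1:10,
                 assignment = c("B","A","B","B","B","A","B","B","A","B"))

# One 0/1 indicator column per category, without an intercept
mm <- model.matrix(~ assignment - 1, df)
colnames(mm) <- sub("^assignment", "", colnames(mm))

# Cumulative sum down each indicator column
res <- cbind(df, apply(mm, 2, cumsum))
res
```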
A base R option
transform(
df,
A = cumsum(assignment == "A"),
B = cumsum(assignment == "B")
)
gives
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7

Merge 2 rows with duplicated pair of values into a single row

I have the data frame below, in which two pairs of rows share the same values in columns A and B: rows 3 and 4 (A = 2, B = 3) and rows 7 and 8 (A = 4, B = 6).
master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))
A B C D
1 1 1 5 1
2 1 2 2 2
3 2 3 5 5
4 2 3 7 3
5 3 4 7 7
6 3 5 5 5
7 4 6 7 9
8 4 6 9 6
9 5 7 7 7
10 5 8 8 0
I would like to merge these rows into one by adding the pipe | operator between values of C and D. The 2nd and 3rd line for example would be like:
A B C D
2 3 2|5 2|5
I think the combined pairs in your example are off by a row. Assuming that's the case, this is what you're looking for: we group by the columns whose duplicates we want to collapse, and then use summarize_all with paste0 to combine the values with a separator.
library(tidyverse)
master %>% group_by(A, B) %>% summarize_all(~ paste0(., collapse = "|"))
A B C D
<dbl> <dbl> <chr> <chr>
1 1 1 5 1
2 1 2 2 2
3 2 3 5|7 5|3
4 3 4 7 7
5 3 5 5 5
6 4 6 7|9 9|6
7 5 7 7 7
8 5 8 8 0
We can do this in base R with aggregate
aggregate(.~ A + B, master, FUN = paste, collapse= '|')
# A B C D
#1 1 1 5 1
#2 1 2 2 2
#3 2 3 5|7 5|3
#4 3 4 7 7
#5 3 5 5 5
#6 4 6 7|9 9|6
#7 5 7 7 7
#8 5 8 8 0
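One thing worth checking after collapsing is that C and D are no longer numeric: paste turns them into character strings. A short sketch using the master data from the question (the name res is illustrative):

```r
master <- data.frame(A = c(1,1,2,2,3,3,4,4,5,5),
                     B = c(1,2,3,3,4,5,6,6,7,8),
                     C = c(5,2,5,7,7,5,7,9,7,8),
                     D = c(1,2,5,3,7,5,9,6,7,0))

# Collapse rows that share the same (A, B) pair, joining values with "|"
res <- aggregate(. ~ A + B, master, FUN = paste, collapse = "|")

# C and D are now character columns, even for rows that were not merged
str(res)
```

If the numeric values are needed again later, they would have to be split back out, for example with strsplit.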

numbering duplicated rows in dplyr [duplicate]

I ran into an issue with numbering duplicated rows in a data.frame and could not find a similar post.
Let's say we have data like this:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
> df
gr x
1 1 a
2 1 a
3 2 b
4 2 b
5 3 c
6 3 c
7 4 a
8 4 a
9 5 c
10 5 c
11 6 d
12 6 d
13 7 a
14 7 a
I want to add a new column called x_dupl that numbers the occurrences of each x value: the first block in which a value appears gets 1, the second block 2, the third 3, and so on.
thanks in advance!
The expected output
> df
gr x x_dupl
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
Your example data (plus rows where gr = 7 as in your output), and named df1, not df:
df1 <- data.frame(gr = gl(7,2),
x = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
library(dplyr)
df1 %>%
group_by(x) %>%
mutate(x_dupl = dense_rank(gr)) %>%
ungroup()
# A tibble: 14 x 3
gr x x_dupl
<fctr> <fctr> <int>
1 1 a 1
2 1 a 1
3 2 b 1
4 2 b 1
5 3 c 1
6 3 c 1
7 4 a 2
8 4 a 2
9 5 c 2
10 5 c 2
11 6 d 1
12 6 d 1
13 7 a 3
14 7 a 3
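The same ranking can be sketched without dplyr by using ave with match. This assumes, as in the example data, that gr identifies each block; within every value of x, each gr is replaced by its position among the distinct gr values seen for that x.

```r
df1 <- data.frame(gr = gl(7, 2),
                  x = c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))

# Within each x, number the distinct gr values in order of first appearance
df1$x_dupl <- ave(as.integer(df1$gr), df1$x,
                  FUN = function(g) match(g, unique(g)))
df1
```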
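A base R solution:
df <- data.frame(gr=gl(7,2),x=c("a","a","b","b","c","c","a","a","c","c","d","d","a","a"))
# as.character() makes this work whether x is a factor or a character column
x <- rle(as.character(df$x))
# number each run by how many runs of the same value have occurred so far
x$values <- ave(seq_along(x$values), x$values, FUN = seq_along)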
df$x_dupl <- inverse.rle(x)
# gr x x_dupl
# 1 1 a 1
# 2 1 a 1
# 3 2 b 1
# 4 2 b 1
# 5 3 c 1
# 6 3 c 1
# 7 4 a 2
# 8 4 a 2
# 9 5 c 2
# 10 5 c 2
# 11 6 d 1
# 12 6 d 1
# 13 7 a 3
# 14 7 a 3

R Subset matching contiguous blocks

I have a dataframe.
dat <- data.frame(k=c("A","A","B","B","B","A","A","A"),
a=c(4,2,4,7,5,8,3,2),b=c(2,5,3,5,8,4,5,8),
stringsAsFactors = FALSE)
k a b
1 A 4 2
2 A 2 5
3 B 4 3
4 B 7 5
5 B 5 8
6 A 8 4
7 A 3 5
8 A 2 8
I would like to subset contiguous blocks based on variable k. This would be a standard approach.
#using rle rather than levels
kval <- rle(dat$k)$values
for(i in 1:length(kval))
{
subdf <- subset(dat,dat$k==kval[i])
print(subdf)
#do something with subdf
}
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
k a b
3 B 4 3
4 B 7 5
5 B 5 8
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
So the subsetting above obviously does not work the way I intended. Any elegant way to get these results?
k a b
1 A 4 2
2 A 2 5
k a b
1 B 4 3
2 B 7 5
3 B 5 8
k a b
1 A 8 4
2 A 3 5
3 A 2 8
We can use rleid from data.table to create a grouping variable
library(data.table)
setDT(dat)[, grp := rleid(k)]
dat
# k a b grp
#1: A 4 2 1
#2: A 2 5 1
#3: B 4 3 2
#4: B 7 5 2
#5: B 5 8 2
#6: A 8 4 3
#7: A 3 5 3
#8: A 2 8 3
We can group by 'grp' and do all the operations within the 'grp' using standard data.table methods.
Here is a base R option to create 'grp'
dat$grp <- with(dat, cumsum(c(TRUE, k[-1]!= k[-length(k)])))
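Once a run id exists, the contiguous blocks asked for in the question can be obtained with split. A minimal base R sketch (grp and blocks are illustrative names):

```r
dat <- data.frame(k = c("A","A","B","B","B","A","A","A"),
                  a = c(4,2,4,7,5,8,3,2),
                  b = c(2,5,3,5,8,4,5,8),
                  stringsAsFactors = FALSE)

# A new group starts wherever k differs from the previous row
grp <- with(dat, cumsum(c(TRUE, k[-1] != k[-length(k)])))

# One data frame per contiguous block
blocks <- split(dat, grp)

for (blk in blocks) {
  rownames(blk) <- NULL  # restart row numbering within each block
  print(blk)
  # do something with blk
}
```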

From table to data.frame

I have a table that looks like:
dat = data.frame(expand.grid(x = 1:10, y = 1:10),
z = sample(LETTERS[1:3], size = 100, replace = TRUE))
tabl <- with(dat, table(z, y))
tabl
y
z 1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2
Now how do I transform it into a data.frame that looks like
1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2
Here are a couple of options.
The reason as.data.frame(tabl) doesn't work is that it dispatches to the S3 method as.data.frame.table() which does something useful but different from what you want.
as.data.frame.matrix(tabl)
# 1 2 3 4 5 6 7 8 9 10
# A 5 4 3 1 1 3 3 2 6 2
# B 1 4 3 4 5 3 4 4 3 3
# C 4 2 4 5 4 4 3 4 1 5
## This will also work
as.data.frame(unclass(tabl))
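To see the difference between the two methods, here is a small sketch on a fixed, made-up table (so the counts are reproducible, unlike the sample() call above). The names long and wide are illustrative.

```r
# Small deterministic example instead of random sampling
dat <- data.frame(y = rep(1:2, each = 3),
                  z = c("A", "B", "B", "A", "C", "C"))
tabl <- with(dat, table(z, y))

# as.data.frame() dispatches to as.data.frame.table():
# one row per cell, with columns z, y and Freq
long <- as.data.frame(tabl)

# as.data.frame.matrix() keeps the cross-tabulated layout:
# one row per z, one column per y
wide <- as.data.frame.matrix(tabl)

long
wide
```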
