I have to join two data frames. The join itself works properly; my only problem is this:
Sometimes two rows in one dataset match a single row in the other dataset. After the join, these rows are displayed one below the other, but I need them in a single row.
Here is a small example:
df1 <- data.frame(A = c(1:5),
B = c(12, 13, 14, 15, 16),
C = c("a", "b", "c", "d", "e"))
df2 <- data.frame(A = c(1,1,2,2,3),
E = c(112, 145, 546, 674, 287),
J = c("t", "e", "v", "p", "s"))
library(dplyr)
merged <- inner_join(df1, df2, by = "A")
A | B | C | E | J |
---------------------
1 | 12| a | 112 | t
1 | 12| a | 145 | e
2 | 13| b | 546 | v
2 | 13| b | 674 | p
3 | 14| c | 287 | s
I need it this way:
A | B | C | E.x | J.x | E.y | J.y
--------------------------------
1 | 12| a | 112 | t | 145 | e
2 | 13| b | 546 | v | 674 | p
All rows that were joined to the same key should end up together in one row of the new dataset.
thank you for your help
derlu
**update:**
My original data frames have
354 items with 1535 observations
246 items with 203 observations
They are merged by the first column, id_merge. Sometimes an id_merge value matches twice, sometimes three times.
One approach is to split up the dataframe by occurrence of each key, and then do another join. First, we create a key (n) and split up the dataframe:
library(dplyr)
library(purrr)
merged <- merged %>%
  group_by(A, B, C) %>%
  mutate(n = row_number())
# A B C E J n
# <dbl> <dbl> <fct> <dbl> <fct> <int>
# 1 1 12 a 112 t 1
# 2 1 12 a 145 e 2
# 3 2 13 b 546 v 1
# 4 2 13 b 674 p 2
# 5 3 14 c 287 s 1
merged_list <- split(merged, merged$n) %>%
map(select, -n)
# $`1`
# # A tibble: 3 x 5
# # Groups: A [3]
# A B C E J
# <dbl> <dbl> <fct> <dbl> <fct>
# 1 1 12 a 112 t
# 2 2 13 b 546 v
# 3 3 14 c 287 s
# $`2`
# # A tibble: 2 x 5
# # Groups: A [2]
# A B C E J
# <dbl> <dbl> <fct> <dbl> <fct>
# 1 1 12 a 145 e
# 2 2 13 b 674 p
Then use Reduce to join the dataframes:
Reduce(function(x, y) left_join(x, y, by = c("A", "B", "C")),
merged_list)
# A B C E.x J.x E.y J.y
# <dbl> <dbl> <fct> <dbl> <fct> <dbl> <fct>
# 1 1 12 a 112 t 145 e
# 2 2 13 b 546 v 674 p
# 3 3 14 c 287 s NA NA
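For keys that match more than twice (the update in the question mentions up to three matches), the same number-then-join idea can be sketched in base R. This is only a minimal sketch under assumptions: the toy data, the extra value 150, and the names pieces and wide are made up for illustration, and only one value column (E) is carried along.

```r
# Toy data (hypothetical): key 1 matches three times, key 2 twice, key 3 once
merged <- data.frame(A = c(1, 1, 1, 2, 2, 3),
                     E = c(112, 145, 150, 546, 674, 287))

# Number the rows within each key: 1, 2, 3, 1, 2, 1
merged$n <- ave(seq_along(merged$A), merged$A, FUN = seq_along)

# One piece per occurrence number, with E renamed to E_1, E_2, ...
pieces <- lapply(split(merged, merged$n), function(d) {
  out <- d[, c("A", "E")]
  names(out)[2] <- paste0("E_", d$n[1])
  out
})

# Successive left joins on the key; missing occurrences become NA
wide <- Reduce(function(x, y) merge(x, y, by = "A", all.x = TRUE), pieces)
wide
```

Renaming the value column before joining avoids relying on the automatic .x/.y suffixes, which stop being informative once there are more than two pieces.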
It's not fashionable these days, but base R's reshape actually shines here, as it deals with unequal group sizes and labelling all in one go. You need to add a time variable first, though, to label the rows within each group:
merged <- merge(df1, cbind(df2, time=ave(df2$A, df2$A, FUN=seq_along)), by="A")
reshape(merged, idvar=names(df1), direction="wide", sep="_")
# A B C E_1 J_1 E_2 J_2
#1 1 12 a 112 t 145 e
#3 2 13 b 546 v 674 p
#5 3 14 c 287 s NA <NA>
It also fits into a dplyr pipeline reasonably well:
df2 %>%
group_by(A) %>%
mutate(time=row_number()) %>%
inner_join(df1,.,by="A") %>%
reshape(idvar=names(df1), direction="wide", sep="_")
With data.table, you can split each group of (E, J) rows that share the same (A, B, C) values by row, and cbind the resulting list together. Here .I is the row number, .N is the number of rows in the group, and .SD is the subset of data for the group.
This only works if the number of repeated rows is the same in all groups (apart from groups with a single row, which are dropped).
library(data.table)
setDT(merged)
merged[, if(.N > 1) do.call(cbind, split(.SD, .I))
, by = .(A, B, C)]
# A B C 1.E 1.J 2.E 2.J
# 1: 1 12 a 112 t 145 e
# 2: 2 13 b 546 v 674 p
You can do it like this with plain loops in base R; the same code is applicable to larger datasets as well:
df1 <- data.frame(A = c(1:5),
B = c(12, 13, 14, 15, 16),
C = c("a", "b", "c", "d", "e"))
df2 <- data.frame(A = c(1,1,2,2,3),
E = c(112, 145, 546, 674, 287),
J = c("t", "e", "v", "p", "s"))
df3 <- merge(df1, df2, by=c("A"))
df4 <- data.frame()
for (j in 1:(nrow(df3)-1))
  for (k in (j+1):nrow(df3))
    for (i in 2:ncol(df3))
      if(df3[j,1] == df3[k,1] && all(df3[j,i:ncol(df3)]!=df3[k,i:ncol(df3)])){
        if(nrow(df4) == 0)
          df4 <- data.frame(c(df3[j,],df3[k,i:ncol(df3)]))
        else
          df4 <- rbind(df4,c(df3[j,],df3[k,i:ncol(df3)]))
        break
      }
I have a dataset with groups ("A", "B", "C", and "A & B") at two time points ("Before" and "After"). I only want to include "A & B" if any of the sample sizes for A or B at either time point falls below 10 people. Otherwise, I want to drop the "A & B" group. How do I tell R to drop this group only when no sample size for A or B falls below 10?
Here are two sample datasets: one where it should filter out group "A & B" and one where it should retain it:
library(dplyr)
#This should not filter out anything
should_not_drop_group <- tibble(group = rep(c("A", "B", "C", "A & B"), 2),
time = c(rep(c("Before"), 4), rep(c("After"), 4)),
sample_size = c(5, 100, 132, 105, 250, 50, 224, 300))
#This dataset should drop group A&B
should_drop_group <- tibble(group = rep(c("A", "B", "C", "A & B"), 2),
time = c(rep(c("Before"), 4), rep(c("After"), 4)),
sample_size = c(500, 100, 132, 600, 250, 50, 224, 300))
And here's what I tried, to no avail:
library(dplyr)
should_drop_group %>%
filter_if(~any(sample_size[group %in% c("A", "B")] < 10), group != "A & B" )
One way to write the condition in filter: take the groups whose sample_size is less than 10, check whether "A" or "B" is among them, and negate (!) that check. Combine it with & with a second expression that tests whether group is "A & B", then negate (!) the whole expression so that those rows are filtered out.
library(dplyr)
should_not_drop_group %>%
filter(!(!any(c("A", "B") %in% group[sample_size < 10]) & group == "A & B"))
# or can be written as
#filter(!(!any(group %in% c("A", "B") & sample_size < 10) & group == "A & B"))
-output
# A tibble: 8 × 3
group time sample_size
<chr> <chr> <dbl>
1 A Before 5
2 B Before 100
3 C Before 132
4 A & B Before 105
5 A After 250
6 B After 50
7 C After 224
8 A & B After 300
and for the second case:
should_drop_group %>%
filter(!(!any(c("A", "B") %in% group[sample_size < 10]) & group == "A & B"))
# A tibble: 6 × 3
group time sample_size
<chr> <chr> <dbl>
1 A Before 500
2 B Before 100
3 C Before 132
4 A After 250
5 B After 50
6 C After 224
To reuse this on several datasets, wrap the condition in a function:
> f1 <- function(x, sample_size)
!(!any(c("A", "B") %in% x[sample_size < 10]) & x == "A & B")
> should_not_drop_group %>%
filter(if_any(group, f1, sample_size = sample_size))
# A tibble: 8 × 3
group time sample_size
<chr> <chr> <dbl>
1 A Before 5
2 B Before 100
3 C Before 132
4 A & B Before 105
5 A After 250
6 B After 50
7 C After 224
8 A & B After 300
> should_drop_group %>%
filter(if_any(group, f1, sample_size = sample_size))
# A tibble: 6 × 3
group time sample_size
<chr> <chr> <dbl>
1 A Before 500
2 B Before 100
3 C Before 132
4 A After 250
5 B After 50
6 C After 224
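The double negation above can be simplified with De Morgan's law: a row is kept if some A or B sample size is below the threshold, or if the row is not in the "A & B" group. Below is a minimal base R sketch of that reading; keep_combined and its threshold argument are hypothetical names, and the data are the two sample datasets from the question rebuilt as plain data frames.

```r
# Sample data from the question, rebuilt as base data frames
should_not_drop_group <- data.frame(
  group = rep(c("A", "B", "C", "A & B"), 2),
  time = c(rep("Before", 4), rep("After", 4)),
  sample_size = c(5, 100, 132, 105, 250, 50, 224, 300),
  stringsAsFactors = FALSE
)
should_drop_group <- should_not_drop_group
should_drop_group$sample_size <- c(500, 100, 132, 600, 250, 50, 224, 300)

# keep_combined() is a hypothetical helper, not part of any package
keep_combined <- function(d, threshold = 10) {
  # TRUE if any A or B row has a sample size below the threshold
  small <- any(d$sample_size[d$group %in% c("A", "B")] < threshold)
  # Keep everything when `small` is TRUE; otherwise drop the "A & B" rows
  d[small | d$group != "A & B", , drop = FALSE]
}

kept_all <- keep_combined(should_not_drop_group)
dropped <- keep_combined(should_drop_group)
```

Here kept_all should retain all eight rows, while dropped should lose the two "A & B" rows.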
Here is a solution with an ifelse statement and a helper column x. The helper flags the "A & B" rows when no sample size for A or B is below 10, and the flagged rows are then filtered out:
library(dplyr)
df %>%
#df1 %>%
  mutate(x = ifelse(!any(sample_size[group %in% c("A", "B")] < 10) & group == "A & B", 1, 0)) %>%
  filter(x != 1) %>%
  select(-x)
for df (should_drop_group):
group time sample_size
<chr> <chr> <dbl>
1 A Before 500
2 B Before 100
3 C Before 132
4 A After 250
5 B After 50
6 C After 224
for df1 (should_not_drop_group):
group time sample_size
<chr> <chr> <dbl>
1 A Before 5
2 B Before 100
3 C Before 132
4 A & B Before 105
5 A After 250
6 B After 50
7 C After 224
8 A & B After 300
I have a dictionary table like this:
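| ID | Position | Region |
|----|----------|--------|
| a | 1-50 | D1a |
| a | 80-100 | D2a |
| a | 250-300 | D3a |
| b | 50-100 | D1b |
| b | 150-180 | D2b |
| c | 1-20 | D1c |
| c | 50-80 | D2c |
| c | 100-200 | D3c |
| c | 250-300 | D4c |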
And a target table like this:
| ID | Position |
|----|----------|
| a | 28 |
| a | 85 |
| a | 320 |
| b | 55 |
| b | 100 |
| c | 18 |
| c | 45 |
| c | 180 |
| c | 270 |
The logic is to check, for a given ID, whether the numeric Position in the target table falls within a Position range in the dictionary table, and to output the matching Region.
I first thought that the R package hash would work, but I later found that hash keys must be unique, which is not the case for my data.
How can I match on ID first, and then use between() or another function to map my targets to their respective regions?
Maybe something like this, using a tidyverse approach (tidyr::separate).
This is a quick answer, though, and I'm not sure I have understood the relationships in your data the way you intend.
library(tidyverse)
one <- tibble::tribble(
~ID, ~Position, ~Region,
"a", "1-50", "D1a",
"a", "80-100", "D2a",
"a", "250-300", "D3a",
"b", "50-100", "D1b",
"b", "150-180", "D2b",
"c", "1-20", "D1c",
"c", "50-80", "D2c",
"c", "100-200", "D3c",
"c", "250-300", "D4c"
)
two <- tibble::tribble(
~ID, ~Position,
"a", 28L,
"a", 85L,
"a", 320L,
"b", 55L,
"b", 100L,
"c", 18L,
"c", 45L,
"c", 180L,
"c", 270L
)
one_ <- one %>%
tidyr::separate(Position, c('p_min', 'p_max'), sep = "-") %>%
mutate_at(vars(starts_with('p_')), as.integer)
two %>%
mutate(rn = row_number()) %>%
left_join(one_) %>%
mutate(in_between = (Position >= p_min & Position <= p_max)) %>%
filter(in_between) %>%
distinct(rn, .keep_all = TRUE)
Joining, by = "ID"
# A tibble: 7 × 7
ID Position rn p_min p_max Region in_between
<chr> <int> <int> <int> <int> <chr> <lgl>
1 a 28 1 1 50 D1a TRUE
2 a 85 2 80 100 D2a TRUE
3 b 55 4 50 100 D1b TRUE
4 b 100 5 50 100 D1b TRUE
5 c 18 6 1 20 D1c TRUE
6 c 180 8 100 200 D3c TRUE
7 c 270 9 250 300 D4c TRUE
Is this your expected result?
library(data.table)
dt1 <- fread("
ID Position Region
a 1-50 D1a
a 80-100 D2a
a 250-300 D3a
b 50-100 D1b
b 150-180 D2b
c 1-20 D1c
c 50-80 D2c
c 100-200 D3c
c 250-300 D4c
")
dt2 <- fread("
ID Position
a 28
a 85
a 320
b 55
b 100
c 18
c 45
c 180
c 270
")
#dt1[,c("Position_left","Position_right") := tstrsplit(Position,"-")]
#dt1[, dt2$Position %between% list(Position_left,Position_right)]
# [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
dt1[,.( ID,
Position = dt2$Position,
Region = fifelse(dt2$Position %between% tstrsplit(Position,"-"),Region,NA)
)]
ID Position Region
<char> <int> <char>
1: a 28 D1a
2: a 85 D2a
3: a 320 <NA>
4: b 55 D1b
5: b 100 <NA>
6: c 18 D1c
7: c 45 <NA>
8: c 180 D3c
9: c 270 D4c
You may use package fuzzyjoin to perform the join on range after splitting the Position column into two columns.
Using data from @Guillaume:
library(dplyr)
library(fuzzyjoin)
library(tidyr)
one %>%
separate(Position, c('min', 'max'), sep = '-', convert = TRUE) %>%
fuzzy_right_join(two, by = c('ID', 'min' = 'Position', 'max' = 'Position'),
match_fun = c(`==`, `<=`, `>=`)) %>%
select(ID = ID.y, Position, min, max, Region)
# ID Position min max Region
# <chr> <int> <int> <int> <chr>
#1 a 28 1 50 D1a
#2 a 85 80 100 D2a
#3 a 320 NA NA NA
#4 b 55 50 100 D1b
#5 b 100 50 100 D1b
#6 c 18 1 20 D1c
#7 c 45 NA NA NA
#8 c 180 100 200 D3c
#9 c 270 250 300 D4c
Depending on your expected output, you may use fuzzy_inner_join, fuzzy_left_join, or fuzzy_full_join instead.
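If adding a package is not an option, the same match-on-ID-then-filter-by-range logic can be sketched in base R. This is a rough sketch under assumptions: the names dict, target, bounds, cand, hits and result are illustrative, the ranges within one ID are assumed not to overlap, and both bounds are treated as inclusive.

```r
dict <- data.frame(
  ID = c("a", "a", "a", "b", "b", "c", "c", "c", "c"),
  Position = c("1-50", "80-100", "250-300", "50-100", "150-180",
               "1-20", "50-80", "100-200", "250-300"),
  Region = c("D1a", "D2a", "D3a", "D1b", "D2b", "D1c", "D2c", "D3c", "D4c"),
  stringsAsFactors = FALSE
)
target <- data.frame(
  ID = c("a", "a", "a", "b", "b", "c", "c", "c", "c"),
  Position = c(28, 85, 320, 55, 100, 18, 45, 180, 270),
  stringsAsFactors = FALSE
)

# Split "min-max" into two numeric columns
bounds <- do.call(rbind, strsplit(dict$Position, "-", fixed = TRUE))
dict$p_min <- as.numeric(bounds[, 1])
dict$p_max <- as.numeric(bounds[, 2])

# Pair every target with all dictionary rows of the same ID,
# then keep only the pairs where Position lies inside the range
target$row <- seq_len(nrow(target))
cand <- merge(target, dict[, c("ID", "p_min", "p_max", "Region")], by = "ID")
hits <- cand[cand$Position >= cand$p_min & cand$Position <= cand$p_max,
             c("row", "Region")]

# Left join back so that targets without a matching range get NA
result <- merge(target, hits, by = "row", all.x = TRUE)
result <- result[order(result$row), c("ID", "Position", "Region")]
result
```

The intermediate table cand grows with the number of ranges per ID, so for large dictionaries a dedicated range join is likely the better choice.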
There are two data frames, one with an index and another without. I want to make a new data frame that combines the indices of the first with the rows and columns of the second, so that every row of the second table is repeated once for each index.
df_A <- data.frame("index" = c("id1","id2","id3")
, variable_a = c(1,2,3)
, variable_b = c("x","f","d"))
df_B <- data.frame(variable_x = c("4124","414","123")
, variable_y = c(12,22,13)
, variable_z = c("q","w","d"))
The result should be:
df_C <- data.frame("index" = c("id1","id1","id1","id2","id2","id2","id3","id3","id3")
, variable_x = c("4124","414","123","4124","414","123","4124","414","123")
, variable_y = c(12,22,13,12,22,13,12,22,13)
, variable_z = c("q","w","d","q","w","d","q","w","d"))
This is a cross join (Cartesian product) and could be solved via
merge(df_B, df_A$index)
Which yields
> merge(df_B, df_A$index)
variable_x variable_y variable_z y
1 4124 12 q id1
2 414 22 w id1
3 123 13 d id1
4 4124 12 q id2
5 414 22 w id2
6 123 13 d id2
7 4124 12 q id3
8 414 22 w id3
9 123 13 d id3
You could correct the order of the columns like this:
merge(df_B, df_A$index)[,c(4, 1, 2, 3)]
A cross join can be done in dplyr as well, if you prefer that (dplyr 1.1.0 and later also provide cross_join() for this):
dplyr::full_join(df_B, df_A, by = character())
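As a variation on the merge call above, passing a one-column data frame instead of a bare vector keeps the column name index, and by = NULL makes the Cartesian product explicit. This is a minimal sketch using the question's data; the final ordering step is only there to reproduce the row order of the desired df_C.

```r
df_A <- data.frame(index = c("id1", "id2", "id3"),
                   variable_a = c(1, 2, 3),
                   variable_b = c("x", "f", "d"),
                   stringsAsFactors = FALSE)
df_B <- data.frame(variable_x = c("4124", "414", "123"),
                   variable_y = c(12, 22, 13),
                   variable_z = c("q", "w", "d"),
                   stringsAsFactors = FALSE)

# by = NULL asks merge() for the Cartesian product of the two inputs;
# df_A["index"] stays a data frame, so the column keeps its name
df_C <- merge(df_A["index"], df_B, by = NULL)

# Group the rows by index; ties keep their original (df_B) order
df_C <- df_C[order(df_C$index), ]
rownames(df_C) <- NULL
df_C
```

With the index column first in the call, no column reordering is needed afterwards.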
Another option is to use tidyr::crossing
tidyr::crossing(df_A, df_B)
#----------
# A tibble: 9 x 6
index variable_a variable_b variable_x variable_y variable_z
<chr> <dbl> <chr> <chr> <dbl> <chr>
1 id1 1 x 123 13 d
2 id1 1 x 4124 12 q
3 id1 1 x 414 22 w
4 id2 2 f 123 13 d
5 id2 2 f 4124 12 q
6 id2 2 f 414 22 w
7 id3 3 d 123 13 d
8 id3 3 d 4124 12 q
9 id3 3 d 414 22 w
The following function, which uses dplyr, should help. Pass the data frame with the index as the first argument and the data frame without an index as the second. It should return the requested data frame.
merge_lines_with_index <- function(index_table, data_table){
df <- data.frame(matrix(ncol = ncol(data_table) + 1))
x <- names(data_table) %>% unlist()
colnames(df) <- c("index", x)
for (item in index_table %>% select(1) %>% unlist()) {
new_data <- data_table %>%
mutate("index" = item)
df <- df %>% rbind(new_data)
}
return(df[-1,])
}
This might be a simple question, but I couldn't seem to find an obvious solution.
I have two data frames, df1 with 64 rows, and df2 with 662,343 rows. I want to join df1 to df2 so that every row in df1 is paired with every row in df2, giving 42,389,952 rows. df1 and df2 might look like this respectively:
df1:
| Cancer | ID |
|---------------------|------------------|
| Sarcoma | 3435 |
| Leukemia | 4465 |
df2:
| Gene |
|------|
| TP53 |
new data frame:
| Cancer | ID | Gene |
|---------------------|------------------|------|
| Sarcoma | 3435 | TP53 |
| Leukemia | 4465 | TP53 |
Thanks in advance for any help!
You may use full_join without any matching column by passing by = character() as the by argument. Demo:
df <- data.frame(X = c(1, 2))
df2 <- data.frame(A = letters[1:3],
B = LETTERS[24:26])
df
#> X
#> 1 1
#> 2 2
df2
#> A B
#> 1 a X
#> 2 b Y
#> 3 c Z
dplyr::full_join(df2, df, by = character())
#> A B X
#> 1 a X 1
#> 2 a X 2
#> 3 b Y 1
#> 4 b Y 2
#> 5 c Z 1
#> 6 c Z 2
Created on 2021-06-26 by the reprex package (v2.0.0)
I think you are looking for a Cartesian product rather than a left join:
library(tidyr)
expand_grid(df1,df2)
# A tibble: 2 x 3
Cancer ID Gene
<chr> <dbl> <chr>
1 Sarcoma 3435 TP53
2 Leukemia 4465 TP53
We may use merge
merge(df2, df, all = TRUE)
-output
A B X
1 a X 1
2 b Y 1
3 c Z 1
4 a X 2
5 b Y 2
6 c Z 2
data
df <- data.frame(X = c(1, 2))
df2 <- data.frame(A = letters[1:3],
B = LETTERS[24:26])
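To see what such a Cartesian product does under the hood, it can also be built by repeating row indices directly. The sketch below is illustrative only: the second gene (BRCA1) is a made-up value added so that df2 has more than one row, and the names i, j and crossed are assumptions.

```r
df1 <- data.frame(Cancer = c("Sarcoma", "Leukemia"),
                  ID = c(3435, 4465),
                  stringsAsFactors = FALSE)
# BRCA1 is a hypothetical second row, added only for illustration
df2 <- data.frame(Gene = c("TP53", "BRCA1"), stringsAsFactors = FALSE)

# Row indices for every (df1 row, df2 row) pair:
# i cycles through df1 once per df2 row, j repeats each df2 row
i <- rep(seq_len(nrow(df1)), times = nrow(df2))
j <- rep(seq_len(nrow(df2)), each = nrow(df1))

crossed <- cbind(df1[i, , drop = FALSE], df2[j, , drop = FALSE])
rownames(crossed) <- NULL

# The result has nrow(df1) * nrow(df2) rows
nrow(crossed)
```

The same arithmetic explains the size in the question: 64 rows times 662,343 rows gives 42,389,952 rows.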
I have this data.frame:
df <- data.frame(id=c('A','A','B','B','B','C'), amount=c(45,66,99,34,71,22))
id | amount
-----------
A | 45
A | 66
B | 99
B | 34
B | 71
C | 22
which I need to expand so that each by group in the data.frame is of equal length (filling it out with zeroes), like so:
id | amount
-----------
A | 45
A | 66
A | 0 <- added
B | 99
B | 34
B | 71
C | 22
C | 0 <- added
C | 0 <- added
What is the most efficient way of doing this?
NOTE
Benchmarking some of the solutions provided against my actual 1-million-row data.frame, I got:
plyr | data.table | unstack
-----------------------------------
Elapsed: 139.87s | 0.09s | 2.00s
One way using data.table
df <- structure(list(V1 = structure(c(1L, 1L, 2L, 2L, 2L, 3L),
.Label = c("A ", "B ", "C "), class = "factor"),
V2 = c(45, 66, 99, 34, 71, 22)),
.Names = c("V1", "V2"),
class = "data.frame", row.names = c(NA, -6L))
require(data.table)
dt <- data.table(df, key="V1")
# get maximum index
idx <- max(dt[, .N, by=V1]$N)
# get final result
dt[, list(V2 = c(V2, rep(0, idx-length(V2)))), by=V1]
# V1 V2
# 1: A 45
# 2: A 66
# 3: A 0
# 4: B 99
# 5: B 34
# 6: B 71
# 7: C 22
# 8: C 0
# 9: C 0
I'm sure there is a base R solution, but here is one that uses ddply in the plyr package
library(plyr)
##N: How many values should be in each group
N = 3
ddply(df, "id", summarize,
amount = c(amount, rep(0, N-length(amount))))
gives:
id amount
1 A 45
2 A 66
3 A 0
4 B 99
5 B 34
6 B 71
7 C 22
8 C 0
9 C 0
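The group size N is hard-coded to 3 above. A minimal base R sketch that derives it from the data and builds the padding rows in one step could look like this; the names missing, pad and padded are illustrative.

```r
df <- data.frame(id = c("A", "A", "B", "B", "B", "C"),
                 amount = c(45, 66, 99, 34, 71, 22),
                 stringsAsFactors = FALSE)

# Target length: the size of the largest group
N <- max(table(df$id))

# Number of rows each group is short of N (A: 1, B: 0, C: 2)
missing <- N - table(df$id)

# One zero-amount row per missing slot
pad <- data.frame(id = rep(names(missing), times = as.integer(missing)),
                  amount = rep(0, sum(missing)),
                  stringsAsFactors = FALSE)

# Append the padding and regroup; ties keep their original order,
# so the real rows stay ahead of the added zeros within each id
padded <- rbind(df, pad)
padded <- padded[order(padded$id), ]
rownames(padded) <- NULL
padded
```

Because all padding rows are created in a single vectorised call, this avoids growing the data frame group by group.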
Here's another way in base R using unstack and stack.
# ensure character id col
df <- transform(df, id=as.character(id))
# break into a list by id
u <- unstack(df, amount ~ id)
# get max length
max.len <- max(sapply(u, length))
# pad the short ones with 0s
filled <- lapply(u, function(x) c(x, numeric(max.len - length(x))))
# recombine into data.frame
stack(filled)
# values ind
# 1 45 A
# 2 66 A
# 3 0 A
# 4 99 B
# 5 34 B
# 6 71 B
# 7 22 C
# 8 0 C
# 9 0 C
How about this?
out <- by(df, INDICES = df$id, FUN = function(x, N) {
x <- droplevels(x)
lng <- nrow(x)
dif <- N - lng
if (dif == 0) return(x)
make.list <- lapply(1:dif, FUN = function(y) data.frame(id = unique(x$id), amount = 0))
rbind(x, do.call("rbind", make.list))
}, N = max(table(df$id))) # N could also be an integer
do.call("rbind", out)
id amount
A.1 A 45
A.2 A 66
A.3 A 0
B.3 B 99
B.4 B 34
B.5 B 71
C.6 C 22
C.2 C 0
C.3 C 0
Here is a dplyr option:
library(dplyr)
# Get maximum number of rows for all groups
N = max(count(df,id)$n)
df %>%
group_by(id) %>%
summarise(amount = c(amount, rep(0, N-length(amount))), .groups = "drop")
Output
id amount
<chr> <dbl>
1 A 45
2 A 66
3 A 0
4 B 99
5 B 34
6 B 71
7 C 22
8 C 0
9 C 0