I need help joining two data frames on a key that contains duplicates.
I want the joined value to appear only once per key (on its first occurrence), and I can't get that with dplyr::left_join.
Example:
ds1 <- data.frame(
id = c(1,1,1,2,2),
V2 = c(5,6,7,5,8)
)
ds2<-data.frame(
id=c(1,2),
Value=c(56,98)
)
library(dplyr)
ds3 <- left_join(ds1, ds2, by = "id")
In this case I have:
# id V2 Value
1 1 5 56
2 1 6 56
3 1 7 56
4 2 5 98
5 2 8 98
But I need:
# id V2 Value
1 1 5 56
2 1 6
3 1 7
4 2 5 98
5 2 8
Keep your code and just add this:
ds3$Value[duplicated(ds3[c("Value","id")])] <- NA
# id V2 Value
# 1 1 5 56
# 2 1 6 NA
# 3 1 7 NA
# 4 2 5 98
# 5 2 8 NA
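For reference, the same idea can be sketched in base R without dplyr. This is only an illustration, and it assumes ds2 has at most one row per id: match() does the lookup and duplicated() blanks the repeats.

```r
ds1 <- data.frame(id = c(1, 1, 1, 2, 2), V2 = c(5, 6, 7, 5, 8))
ds2 <- data.frame(id = c(1, 2), Value = c(56, 98))

# Look up Value for every row of ds1 (behaves like a left join on id,
# assuming ds2 has at most one row per id).
ds3 <- ds1
ds3$Value <- ds2$Value[match(ds1$id, ds2$id)]

# Keep the value only on the first row of each id.
ds3$Value[duplicated(ds3$id)] <- NA
ds3
```

Checking duplicated() on id alone is enough here, because every row with the same id receives the same Value from the lookup.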
Here is another idea using slice, left_join, and then full_join.
ds3 <- ds1 %>%
group_by(id) %>%
slice(1) %>%
left_join(ds2, by = "id") %>%
full_join(ds1, by = c("id", "V2")) %>%
ungroup() %>%
arrange(id, V2)
ds3
# # A tibble: 5 x 3
# id V2 Value
# <dbl> <dbl> <dbl>
# 1 1. 5. 56.
# 2 1. 6. NA
# 3 1. 7. NA
# 4 2. 5. 98.
# 5 2. 8. NA
I have two columns (v5 & v6) in a matrix where both columns have entries between 0 and 5 as
head(matrix)
v1 v2 ... v5 v6
[1,] 0 5
[2,] 1 3
[3,] 2 1
[4,] 4 1
[5,] 2 2
I want to construct a new 6x6 matrix that contains the number of occurrences of each pair of values in the two columns, as
new_matrix
0 1 2 3 4 5
0 2326 2882 2587 734 341 0
1 50 17 103 14 0 6
2 ......
3 .......
4 ......
5 .......
I mean that I want to know how many pairs (0,0) , (0,1), ..., (0,5),... (5,5) are in both columns?
I used library(plyr) as
freq <- ddply(matrix, .(matrix$v5, matrix$v6), nrow)
names(freq) <- c("v5", "v6", "Freq")
But this will not give the needed result!
With tidyverse, you can arrive at this answer using the usual group_by operations.
Sample data
I'm creating column names to make it easier to convert to tibble.
set.seed(123)
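# Caution: the second sample() call is not wrapped in c() with the first,
# so it is matched positionally to byrow and is not used as data.
M <- matrix(sample(0:5, 100, TRUE),
sample(0:5, 100, TRUE),
ncol = 2,
nrow = 100,
dimnames = list(NULL, c("colA", "colB")))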
Solution
library("tidyverse")
as_tibble(M) %>%
arrange(colA, colB) %>%
group_by(colA, colB) %>%
summarise(num_pairs = n(), .groups = "drop") %>%
pivot_wider(names_from = colB, values_from = num_pairs) %>%
remove_rownames()
Preview
# A tibble: 6 x 7
colA `0` `1` `2` `4` `5` `3`
<int> <int> <int> <int> <int> <int> <int>
1 0 4 4 4 2 4 NA
2 1 2 2 4 6 2 NA
3 2 6 4 NA 2 6 NA
4 3 2 NA NA 4 6 2
5 4 NA 2 6 NA 2 4
6 5 6 2 4 4 2 2
Comments
You have asked:
I mean that I want to know how many pairs (0,0) , (0,1), ...,
(0,5),... (5,5) are in both columns?
This answer gives you that. The question is how important it is for you to have the results stored as a matrix. You can convert the result to a matrix by calling as.matrix on it. I would likely stop after summarise(num_pairs = n(), .groups = "drop"), as that already gives very usable results that are easy to subset, join, and so forth.
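If the long-format counts do need to end up as a matrix, one possible route in base R is xtabs(). The sketch below uses a small made-up data frame (the values are hypothetical, only the column names follow the answer) and fills pairs that never occur with 0 rather than NA.

```r
# Hypothetical long-format counts, shaped like the output of summarise().
long <- data.frame(
  colA = c(0, 0, 1, 2),
  colB = c(0, 1, 1, 0),
  num_pairs = c(4, 2, 3, 1)
)

# Cross-tabulate: rows are colA values, columns are colB values.
wide <- xtabs(num_pairs ~ colA + colB, data = long)

# Drop the table class to get a plain numeric matrix with dimnames.
wide_matrix <- matrix(wide, nrow = nrow(wide), dimnames = dimnames(wide))
wide_matrix
```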
We can also use table
table(as.data.frame(M))
Output:
# colB
#colA 0 1 2 3 4 5
# 0 4 4 4 0 2 4
# 1 2 2 4 0 6 2
# 2 6 4 0 0 2 6
# 3 2 0 0 2 4 6
# 4 0 2 6 4 0 2
# 5 6 2 4 2 4 2
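One caveat with table(): a value that never occurs in a column gets no row or column in the result. A minimal sketch, assuming the possible values are known to be 0 to 5 (the matrix m below is made-up data, not the asker's), is to convert each column to a factor with explicit levels so the result is always 6x6.

```r
set.seed(1)
# Made-up data in which v5 never takes the value 5.
m <- cbind(v5 = sample(0:4, 50, TRUE), v6 = sample(0:5, 50, TRUE))

# Explicit levels force a row and a column for every value from 0 to 5.
counts <- table(v5 = factor(m[, "v5"], levels = 0:5),
                v6 = factor(m[, "v6"], levels = 0:5))
dim(counts)
```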
I have a dataset (dt) like this in R:
n id val
1 1&&2 10
2 3 20
3 4&&5 30
And what I want to get is
n id val
1 1 10
2 2 10
3 3 20
4 4 30
5 5 30
I know that to split ids I need to do something like this:
id_split <- strsplit(dt$id,"&&")
But how do I create new rows with the same val for ids which were initially together in a row?
You may cbind the splits to get a column which you cbind again to the val (recycling).
res <- do.call(rbind, Map(data.frame, id=lapply(strsplit(dat$id, "&&"), cbind),
val=dat$val))
res <- cbind(n=1:nrow(res), res)
res
# n id val
# 1 1 1 10
# 2 2 2 10
# 3 3 3 20
# 4 4 4 30
# 5 5 5 30
You can use the lengths of the split ids to expand your rows, then set n to a sequence along the rows of the expanded data frame, i.e.
l1 <- strsplit(as.character(df$id), '&&')
res_df <- transform(df[rep(seq_len(nrow(df)), lengths(l1)),],
id = unlist(l1),
n = seq_along(unlist(l1)))
which gives,
n id val
1 1 1 10
1.1 2 2 10
2 3 3 20
3 4 4 30
3.1 5 5 30
You can remove the rownames with rownames(res_df) <- NULL
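The same expand-by-lengths idea can be wrapped in a small reusable function. The name split_rows and its arguments are hypothetical, introduced here only for illustration; the sketch assumes the separator is a fixed string rather than a regular expression.

```r
# Hypothetical helper: one output row per piece of df[[col]] split on sep.
split_rows <- function(df, col, sep) {
  parts <- strsplit(as.character(df[[col]]), sep, fixed = TRUE)
  # Repeat each original row once per piece, then overwrite the split column.
  out <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
  out[[col]] <- unlist(parts)
  rownames(out) <- NULL
  out
}

dt <- data.frame(n = 1:3, id = c("1&&2", "3", "4&&5"),
                 val = c(10, 20, 30), stringsAsFactors = FALSE)
res <- split_rows(dt, "id", "&&")
res$n <- seq_len(nrow(res))  # renumber the rows as in the desired output
res
```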
A data.table solution.
library(data.table)
DT <- fread('n id val
1 1&&2 10
2 3 20
3 4&&5 30')
DT[,.(id=unlist(strsplit(id,split ="&&"))),by=.(n,val)][,n:=.I][]
#> n val id
#> 1: 1 10 1
#> 2: 2 10 2
#> 3: 3 20 3
#> 4: 4 30 4
#> 5: 5 30 5
Created on 2020-05-08 by the reprex package (v0.3.0)
Note:
A more robust solution is by = 1:nrow(DT), but you then need to handle your other columns yourself.
If anyone is looking for a tidyverse solution:
dt %>%
separate(id, into = paste0("id", 1:2),sep = "&&") %>%
pivot_longer(cols = c(id1,id2), names_to = "id_name", values_to = "id") %>%
drop_na(id) %>%
select(n, id, val)
output as
# A tibble: 5 x 3
n id val
<dbl> <chr> <dbl>
1 1 1 10
2 1 2 10
3 2 3 20
4 3 4 30
5 3 5 30
Edit:
As suggested by @sotos, and completely missed by me, there is a one-line solution:
dt %>% separate_rows(id, sep = "&&")
which gives the same output:
# A tibble: 5 x 3
n id val
<dbl> <chr> <dbl>
1 1 1 10
2 1 2 10
3 2 3 20
4 3 4 30
5 3 5 30
tstrsplit from data.table, applied by group, can do the job:
library(data.table)
df <- setDT(df)[,.('id' = tstrsplit(id, "&&")), by = c('n','val')]
df[,'n' := seq(.N)]
df
n val id
1: 1 10 1
2: 2 10 2
3: 3 20 3
4: 4 30 4
5: 5 30 5
I want to add a new row after each id. I found a solution on Stack Overflow (Inserting a new row to data frame for each group id),
but there is one thing I want to change and I don't know how. I want the new row to cover all variables without having to write each one out, as the Stack Overflow example does. The numbers in the new row don't matter; I will change them later. If possible, the new row should have "base" in trt. The code needs to work for many ids and variables, because the data I'm working with has a lot of both. Many thanks to anyone who can help!
The example code:
set.seed(1)
id <- rep(1:3,each=4)
trt <- rep(c("A","OA", "B", "OB"),3)
pointA <- sample(1:10,12, replace=TRUE)
pointB<- sample(1:10,12, replace=TRUE)
pointC<- sample(1:10,12, replace=TRUE)
df <- data.frame(id,trt,pointA, pointB,pointC)
df
id trt pointA pointB pointC
1 1 A 3 7 3
2 1 OA 4 4 4
3 1 B 6 8 1
4 1 OB 10 5 4
5 2 A 3 8 9
6 2 OA 9 10 4
7 2 B 10 4 5
8 2 OB 7 8 6
9 3 A 7 10 5
10 3 OA 1 3 2
11 3 B 3 7 9
12 3 OB 2 2 7
I want it to look like:
df <- rbind(df[1:4,], df1, df[5:8,], df2, df[9:12,],df3)
> df
id trt pointA pointB pointC
1 1 A 3 7 3
2 1 OA 4 4 4
3 1 B 6 8 1
4 1 OB 10 5 4
5 1 base
51 2 A 3 8 9
6 2 OA 9 10 4
7 2 B 10 4 5
8 2 OB 7 8 6
13 2 base
9 3 A 7 10 5
10 3 OA 1 3 2
11 3 B 3 7 9
12 3 OB 2 2 7
14 3 base
I'm trying this code:
df %>%
group_by(id) %>%
summarise(week = "base") %>%
mutate_all() %>% # want to mutate all variables
bind_rows(df, .) %>%
arrange(id)
You can call bind_rows directly; it fills all other columns with NA by default.
library(dplyr)
df %>% group_by(id) %>% summarise(trt = 'base') %>% bind_rows(df) %>% arrange(id)
# id trt pointA pointB pointC
# <int> <chr> <int> <int> <int>
# 1 1 base NA NA NA
# 2 1 A 3 7 3
# 3 1 OA 4 4 4
# 4 1 B 6 8 1
# 5 1 OB 10 5 4
# 6 2 base NA NA NA
# 7 2 A 3 8 9
# 8 2 OA 9 10 4
# 9 2 B 10 4 5
#10 2 OB 7 8 6
#11 3 base NA NA NA
#12 3 A 7 10 5
#13 3 OA 1 3 2
#14 3 B 3 7 9
#15 3 OB 2 2 7
If you want empty strings instead of NA, you can give a range of columns to mutate_at and replace the NA values with an empty string. Note that this converts those columns to character.
df %>%
group_by(id) %>%
summarise(trt = 'base') %>%
bind_rows(df) %>%
mutate_at(vars(pointA:pointC), ~replace(., is.na(.) , '')) %>%
arrange(id)
library(dplyr)
library(purrr)
df %>% mutate_if(is.factor, as.character) %>%
group_split(id) %>%
map_dfr(~bind_rows(.x, data.frame(id=.x$id[1], trt="base", stringsAsFactors = FALSE)))
#Note that group_modify is Experimental
df %>% mutate_if(is.factor, as.character) %>%
group_by(id) %>%
group_modify(~bind_rows(.x, data.frame(trt="base", stringsAsFactors = FALSE)))
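For comparison, here is a rough base R sketch of the same append-per-group idea that places the "base" row after each id and does not require naming the other columns. The helper add_base and the small data frame are made up for illustration; every column other than id and trt is left as NA in the new row.

```r
# Small made-up data frame with the same structure as the question's.
df <- data.frame(id = rep(1:3, each = 2),
                 trt = rep(c("A", "B"), 3),
                 pointA = 1:6,
                 stringsAsFactors = FALSE)

# Hypothetical helper: append one "base" row to a single group's rows.
add_base <- function(g) {
  extra <- g[1, , drop = FALSE]
  extra[1, ] <- NA        # blank every column, whatever it is called
  extra$id <- g$id[1]
  extra$trt <- "base"
  rbind(g, extra)
}

out <- do.call(rbind, lapply(split(df, df$id), add_base))
rownames(out) <- NULL
out
```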
I have a data.frame with a grouping variable, and some NAs in the value column.
df = data.frame(group=c(1,1,2,2,2,2,2,3,3), value1=1:9, value2=c(NA,4,9,6,2,NA,NA,1,NA))
I can use zoo::na.trim to remove NA at the end of a column: this will remove the last line of the data.frame:
library(zoo)
library(dplyr)
df %>% na.trim(sides="right")
Now I want to remove the trailing NAs by group; how can I achieve this using dplyr?
Expected output for value2 column: c(NA, 4,9,6,2,1)
You could write a little helper function that checks for trailing NAs of a vector and then use group_by and filter.
f <- function(x) { rev(cumsum(!is.na(rev(x)))) != 0 }
library(dplyr)
df %>%
group_by(group) %>%
filter(f(value2))
# A tibble: 6 x 3
# Groups: group [3]
group value1 value2
<dbl> <int> <dbl>
1 1 1 NA
2 1 2 4
3 2 3 9
4 2 4 6
5 2 5 2
6 3 8 1
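To see why the helper works, it may help to run it on a single vector first: reversing the vector, taking the running count of non-NA values, and reversing back gives 0 exactly at the trailing NAs. The sketch below also shows one possible way to apply it per group in base R with ave(), which is an assumption on my part rather than part of the answer above.

```r
f <- function(x) { rev(cumsum(!is.na(rev(x)))) != 0 }

# Running count of non-NA values seen from the right; 0 marks trailing NAs.
x <- c(9, 6, 2, NA, NA)
rev(cumsum(!is.na(rev(x))))  # 3 2 1 0 0
f(x)                         # TRUE TRUE TRUE FALSE FALSE

df <- data.frame(group = c(1, 1, 2, 2, 2, 2, 2, 3, 3), value1 = 1:9,
                 value2 = c(NA, 4, 9, 6, 2, NA, NA, 1, NA))

# ave() applies f within each group; it returns 0/1, so convert to logical.
keep <- as.logical(ave(df$value2, df$group, FUN = f))
trimmed <- df[keep, ]
trimmed
```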
edit
If we need to remove both leading and trailing NAs, we need to extend that function a bit.
f1 <- function(x) { cumsum(!is.na(x)) != 0 & rev(cumsum(!is.na(rev(x)))) != 0 }
Given df1
df1 = data.frame(group=c(1,1,2,2,2,2,2,3,3), value1=1:9, value2=c(NA,4,9,NA,2,NA,NA,1,NA))
df1
# group value1 value2
#1 1 1 NA
#2 1 2 4
#3 2 3 9
#4 2 4 NA
#5 2 5 2
#6 2 6 NA
#7 2 7 NA
#8 3 8 1
#9 3 9 NA
We get this result
df1 %>%
group_by(group) %>%
filter(f1(value2))
# A tibble: 5 x 3
# Groups: group [3]
group value1 value2
<dbl> <int> <dbl>
1 1 2 4
2 2 3 9
3 2 4 NA
4 2 5 2
5 3 8 1
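A quick check on a bare vector (made-up values) illustrates that f1 drops NAs only at the two ends and keeps an NA that sits between observed values:

```r
f1 <- function(x) { cumsum(!is.na(x)) != 0 & rev(cumsum(!is.na(rev(x)))) != 0 }

x <- c(NA, 9, NA, 2, NA)
mask <- f1(x)   # FALSE TRUE TRUE TRUE FALSE
x[mask]         # 9 NA 2
```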
Using lapply, loop through group:
do.call("rbind", lapply(split(df, df$group), na.trim, sides = "right"))
# group value1 value2
# 1.1 1 1 NA
# 1.2 1 2 4
# 2.3 2 3 9
# 2.4 2 4 6
# 2.5 2 5 2
# 3 3 8 1
Or using by, as mentioned by @Henrik:
do.call("rbind", by(df, df$group, na.trim, sides = "right"))
I have a dataframe that looks like the following:
x y z
1 2 3
1 2 3
1 2 3
2 3
1 2 3
1 3
I would like to ask if there is a command in R that will allow to obtain the following dataframe (by shifting and aligning similar values)
x y z
1 2 3
1 2 3
1 2 3
NA 2 3
1 2 3
1 NA 3
An alternative solution: capture the pattern of your dataset from the rows that have no NAs, then reshape the data using that pattern.
df = read.table(text = "
x y z
1 2 3
1 2 3
1 2 3
2 3 NA
1 2 3
1 3 NA
", header= T)
library(tidyverse)
# get the column names of your dataset
names = names(df)
# get unique values after omitting rows with NAs
value = unlist(unique(na.omit(df)))
# create a dataset with names and values
# (this is the pattern you want to follow)
df3 = data.frame(names, value)
df %>%
mutate(id = row_number()) %>% # flag the row number
gather(v,value,-id) %>% # reshape
na.omit() %>% # remove rows with NAs
left_join(df3, by="value") %>% # join info about your pattern
select(-v) %>% # remove that column
spread(names, value) %>% # reshape
select(-id) # remove row number
# x y z
# 1 1 2 3
# 2 1 2 3
# 3 1 2 3
# 4 NA 2 3
# 5 1 2 3
# 6 1 NA 3
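The pattern idea can also be sketched in base R. This version assumes, as in the example data, that each column holds a single distinct value, so the first complete row can serve as the reference pattern; it is an illustration, not a general alignment method.

```r
df <- data.frame(x = c(1, 1, 1, 2, 1, 1),
                 y = c(2, 2, 2, 3, 2, 3),
                 z = c(3, 3, 3, NA, 3, NA))

# Reference pattern: the first row without NAs, as a named vector.
pattern <- unlist(df[complete.cases(df), ][1, ])

# For each row, keep a pattern value only if it occurs somewhere in that row.
aligned <- t(apply(df, 1, function(row) ifelse(pattern %in% row, pattern, NA)))
aligned <- as.data.frame(aligned)
names(aligned) <- names(df)
aligned
```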
library(tidyverse)
df %>%
pmap_dfr(~{ x <- list(...)
if(any(is.na(x))) intersect(x, df[1,]) # match with first row's values to assign names
else x})
Output:
# # A tibble: 6 x 3
# x y z
# <int> <int> <int>
# 1 1 2 3
# 2 1 2 3
# 3 1 2 3
# 4 NA 2 3
# 5 1 2 3
# 6 1 NA 3
reading your data:
library(data.table) # provides fread and setDF
library(magrittr) # provides %>% and %<>%
df <- fread("x y z
1 2 3
1 2 3
1 2 3
2 3 NA
1 2 3
1 3 NA") %>% setDF
code:
library(magrittr)
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
pattern <- sapply(df,getmode)
df[!complete.cases(df),] %<>% apply(1,function(x){tmp<-pattern;tmp[!(tmp%in%x)] <- NA;return(tmp)}) %>% t %>% data.frame
result:
> df
x y z
1 1 2 3
2 1 2 3
3 1 2 3
4 NA 2 3
5 1 2 3
6 1 NA 3