After seeing this post with a nice answer by @akrun, I wanted to play with dplyr. Here are the sample data from that post and from akrun's answer.
df = data.frame(
id1 = c(1,1,2,2,2,3,3,3,3),
id2 = c(1,2,1,2,3,1,2,3,4),
X1 = letters[1:9],
X2 = LETTERS[1:9],
stringsAsFactors = FALSE
)
df2 <- data.frame(
id1 = rep(c(1:3), each = 4),
id2 = rep(c(1:4), times = 3),
stringsAsFactors = FALSE
)
If I replicate akrun's answer, merge() works perfectly here.
df %>%
do(merge(., df2, by = c("id1","id2"), all = TRUE))
id1 id2 X1 X2
1 1 1 a A
2 1 2 b B
3 1 3 <NA> <NA>
4 1 4 <NA> <NA>
5 2 1 c C
6 2 2 d D
7 2 3 e E
8 2 4 <NA> <NA>
9 3 1 f F
10 3 2 g G
11 3 3 h H
12 3 4 i I
Then I thought left_join(x, y) would do, since it keeps all rows of x and adds the matching rows of y. From the examples in the dplyr tutorial PDF from useR! 2014, I expected an identical result, but that was not the case.
> df %>%
+ left_join(df2, .)
Joining by: c("id1", "id2")
id1 id2 X1 X2
1 1 1 a A
2 1 2 b B
3 1 3 <NA> <NA>
4 1 4 <NA> <NA>
5 2 1 <NA> <NA>
6 2 2 <NA> <NA>
7 2 3 <NA> <NA>
8 2 4 <NA> <NA>
9 3 1 <NA> <NA>
10 3 2 <NA> <NA>
11 3 3 <NA> <NA>
12 3 4 <NA> <NA>
The first three rows indicate that dplyr was doing the right job. But once it encountered an NA, it generated NAs till the end. Is this a bug, or did I do something wrong? Thank you for your time.
There are currently a few bugs with dplyr and the _join functions:
https://github.com/hadley/dplyr/issues/542
https://github.com/hadley/dplyr/issues/455
https://github.com/hadley/dplyr/issues/450
It looks like they are being fixed. In the meantime, if you make sure the join columns have the same type in both data frames (they don't in your example: id1 and id2 are numeric in df but integer in df2, which you can tell by using str()), then it should work:
df = data.frame(
id1 = c(1,1,2,2,2,3,3,3,3),
id2 = c(1,2,1,2,3,1,2,3,4),
X1 = letters[1:9],
X2 = LETTERS[1:9],
stringsAsFactors = FALSE
)
df2 <- data.frame(
id1 = as.numeric(rep(c(1:3), each = 4)),
id2 = as.numeric(rep(c(1:4), times = 3)),
stringsAsFactors = FALSE
)
left_join(df2, df)
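The type mismatch mentioned above can be checked directly before joining. The following is a minimal sketch in base R; the name key_types is only an illustrative helper, and the data frames are cut down to the two key columns.

```r
# Key columns only, built the same way as in the question
df  <- data.frame(id1 = c(1,1,2,2,2,3,3,3,3),
                  id2 = c(1,2,1,2,3,1,2,3,4))
df2 <- data.frame(id1 = rep(1:3, each = 4),
                  id2 = rep(1:4, times = 3))

# c(1, 2, ...) creates doubles ("numeric"), while 1:3 creates integers
key_types <- rbind(df = sapply(df, class), df2 = sapply(df2, class))
print(key_types)

# Convert the integer keys so both data frames agree on the type
df2[] <- lapply(df2, as.numeric)
same_types <- identical(sapply(df, class), sapply(df2, class))
```

After the conversion, same_types is TRUE, which is the condition the workaround relies on.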
I have a dataframe with two columns, "id" and "detail" (df_current below). I need to group the dataframe by id and spread it so that the columns become "Interface1", "Interface2", etc., and each interface column holds the values that immediately follow each occurrence of that interface name. Essentially the "!" works as a separator, but it is not needed in the output.
The desired output is shown below as: "df_needed_from_current".
I have tried multiple approaches (group_by, spread, reshape, dcast etc.), but can't get it to work. Any help would be greatly appreciated!
Sample Current Dataframe (code to create under):
id  detail
1   !
1   Interface1
1   a
1   b
1   !
1   Interface2
1   a
1   b
2   !
2   Interface1
2   a
2   b
2   c
2   !
2   Interface2
2   a
3   !
3   Interface1
3   a
3   b
3   c
3   d
df_current <- data.frame(
id = c("1","1","1","1","1","1","1","1","2",
"2","2","2","2","2","2","2","3","3",
"3","3","3","3","4","4","4","4","4",
"4","4","4","4","4","4","4","4","4",
"5","5","5","5","5","5","5","5","5",
"5","5","5","5"),
detail = c("!", "Interface1","a","b","!",
"Interface2","a","b","!","Interface1",
"a","b","c","!","Interface2","a",
"!", "Interface1","a","b","c","d",
"!", "Interface1","a","b","!",
"Interface2","a","b","c","!","Interface3",
"a","b","c","!","Interface1","a","b","!",
"Interface2","a","b","c","!","Interface3",
"a","b"))
Dataframe Needed (code to create under):
ID  Interface1  Interface2  Interface3
1   a           a           NA
1   b           b           NA
2   a           a           NA
2   b           NA          NA
2   c           NA          NA
3   a           NA          NA
3   b           NA          NA
3   c           NA          NA
3   d           NA          NA
df_needed_from_current <- data.frame(
id = c("1","1","2","2","2","3","3","3","3","4","4","4","5","5","5"),
Interface1 = c("a","b","a","b","c","a","b","c","d","a","b",NA,"a","b",NA),
Interface2 = c("a","b","a",NA,NA,NA,NA,NA,NA,"a","b","c","a","b","c"),
Interface3 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,"a","b","c","a","b",NA)
)
We first remove the rows where 'detail' is "!". We then create a new column, 'interface', holding only the 'detail' values that contain 'Interface', and use fill() from tidyr to carry each of those values down over the following NAs within each id. Next we drop the rows where 'detail' equals 'interface' (the interface header rows themselves), create a row sequence id with rowid() from data.table, and reshape to wide format with pivot_wider():
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df_current %>%
filter(detail != "!") %>%
mutate(interface = case_when(str_detect(detail, 'Interface') ~ detail)) %>%
group_by(id) %>%
fill(interface) %>%
ungroup %>%
filter(detail != interface) %>%
mutate(rn = rowid(id, interface)) %>%
pivot_wider(names_from = interface, values_from = detail) %>%
select(-rn)
# A tibble: 15 x 4
# id Interface1 Interface2 Interface3
# <chr> <chr> <chr> <chr>
# 1 1 a a <NA>
# 2 1 b b <NA>
# 3 2 a a <NA>
# 4 2 b <NA> <NA>
# 5 2 c <NA> <NA>
# 6 3 a <NA> <NA>
# 7 3 b <NA> <NA>
# 8 3 c <NA> <NA>
# 9 3 d <NA> <NA>
#10 4 a a a
#11 4 b b b
#12 4 <NA> c c
#13 5 a a a
#14 5 b b b
#15 5 <NA> c <NA>
I want to add extra columns depending on values of code which are defined in VAR
DF <- data.frame(id = c(1:5), code = c("A","B","C","D","E"), sub = c("A1","B1","C1","D1","E1"))
id code sub
1 1 A A1
2 2 B B1
3 3 C C1
4 4 D D1
5 5 E E1
VAR <- c("A","B")
The result should look like this:
id code sub AB ABsub
1 1 A A1 A A1
2 2 B B1 B B1
3 3 C C1 <NA> <NA>
4 4 D D1 <NA> <NA>
5 5 E E1 <NA> <NA>
Or using dplyr:
library(dplyr)
DF<-data.frame(id=c(1:5),code=c("A","B","C","D","E"),sub=c("A1","B1","C1","D1","E1"), stringsAsFactors = FALSE)
VAR<-c("A","B")
DF <- DF %>%
mutate(AB = ifelse(code %in% VAR, code, NA_character_)) %>%
mutate(ABsub = ifelse(code == AB, sub, NA_character_))
with:
> DF
id code sub AB ABsub
1 1 A A1 A A1
2 2 B B1 B B1
3 3 C C1 <NA> <NA>
4 4 D D1 <NA> <NA>
5 5 E E1 <NA> <NA>
This also works if VAR equals c("A", "B", "C"), though it is not clear whether that is what you are after.
A simple base R option using merge + subset
merge(DF,subset(DF,code %in% VAR),by = "id",all = TRUE)
such that
> merge(DF,subset(DF,code %in% VAR),by = "id",all = TRUE)
id code.x sub.x code.y sub.y
1 1 A A1 A A1
2 2 B B1 B B1
3 3 C C1 <NA> <NA>
4 4 D D1 <NA> <NA>
5 5 E E1 <NA> <NA>
A dplyr solution with across():
library(dplyr)
DF %>%
mutate(across(-id, ~ replace(.x, !(code %in% VAR), NA), .names = "AB{col}"))
# id code sub ABcode ABsub
# 1 1 A A1 A A1
# 2 2 B B1 B B1
# 3 3 C C1 <NA> <NA>
# 4 4 D D1 <NA> <NA>
# 5 5 E E1 <NA> <NA>
or with left_join():
DF %>%
filter(code %in% VAR) %>%
left_join(DF, ., by = "id", suffix = c("", "AB"))
# id code sub codeAB subAB
# 1 1 A A1 A A1
# 2 2 B B1 B B1
# 3 3 C C1 <NA> <NA>
# 4 4 D D1 <NA> <NA>
# 5 5 E E1 <NA> <NA>
Note: If you have multiple columns in your real data, you don't need to type
mutate(Col1 = ifelse(...), Col2 = ifelse(...), etc.)
one by one.
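Here's a solution in base R:
keep <- DF$code %in% VAR  # TRUE for rows whose code is listed in VAR
AB <- ifelse(keep, as.character(DF$code), NA)
ABsub <- ifelse(keep, as.character(DF$sub), NA)
cbind(DF, AB, ABsub)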
I am trying to fill in blank cells with the values of the rows above. This is similar to the na.locf function, but I have a pattern that needs to be matched. I don't necessarily know how many rows there are between new values (i.e. between a,b and c,d).
I have used the na.locf and searched around for a solution to no avail.
df <- data.frame(col1 = c("a","b", NA, NA, NA, NA, "c", "d", NA, NA))
df
# col1
# 1 a
# 2 b
# 3 <NA>
# 4 <NA>
# 5 <NA>
# 6 <NA>
# 7 c
# 8 d
# 9 <NA>
# 10 <NA>
Solution I would like:
df
col1
a
b
a
b
a
b
c
d
c
d
ave(df$col1,
with(rle(!is.na(df$col1)), rep(cumsum(values), lengths)),
FUN = function(x){
rep(x[!is.na(x)], length.out = length(x))
})
# [1] a b a b a b c d c d
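To see why the ave() call above works, it can help to look at the grouping vector on its own. This is a small sketch using the same data as a plain character vector; the names r and grp are only illustrative.

```r
col1 <- c("a", "b", NA, NA, NA, NA, "c", "d", NA, NA)

# Runs of non-NA (TRUE) and NA (FALSE) values
r <- rle(!is.na(col1))

# cumsum() increases only at the start of each non-NA run, so every
# pattern and the NAs that follow it share one group number
grp <- rep(cumsum(r$values), r$lengths)
grp
# [1] 1 1 1 1 1 1 2 2 2 2

# Within each group, recycle the non-NA pattern to the group's length
filled <- ave(col1, grp, FUN = function(x) {
  rep(x[!is.na(x)], length.out = length(x))
})
```

Each group starts with its pattern (a,b or c,d), so recycling that pattern over the group length fills the trailing NAs.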
Here's a way with dplyr. You can drop the group column if needed.
df %>%
group_by(group = cumsum(is.na(lag(col1)) & !is.na(col1))) %>%
mutate(
col1 = rep(col1[!is.na(col1)], length.out = n())
) %>%
ungroup()
# A tibble: 10 x 2
col1 group
<chr> <int>
1 a 1
2 b 1
3 a 1
4 b 1
5 a 1
6 b 1
7 c 2
8 d 2
9 c 2
10 d 2
I have 4 datasets on which I would like to perform a full_join. For brevity, I use two datasets, df1 and df2, here.
df1 <- data.frame(ID = c(1, 3, 4, 5), V1 = LETTERS[11:14], V2 = letters[17:20])
df2 <- data.frame(ID = c(1, 10, 4, 9, 13), X5 = paste0(LETTERS[14:17], 1:5), X16 = paste0(letters[17:20], 1:5, 6:10), X23 = 56:60)
I would like to know if a record appears in one dataset but not the other and vice versa. I included a column (an indicator) in each dataset before performing the join.
df1 <- df1 %>% mutate(in_df1 = 1) # 1 if record is inside df1
df2 <- df2 %>% mutate(in_df2 = 1) # 1 if record is inside df2
Then I perform a full join and replace the NAs in the in_df1 and in_df2 columns with 0.
df <- full_join(df1, df2, by = "ID") %>%
mutate_at(vars(in_df1, in_df2), funs(coalesce(., 0))) %>%
select(ID, V1, V2, X5, X16, X23, in_df1, in_df2)
This works as follows:
# df
# ID V1 V2 X5 X16 X23 in_df1 in_df2
# 1 1 K q N1 q16 56 1 1
# 2 3 L r <NA> <NA> NA 1 0
# 3 4 M s P3 s38 58 1 1
# 4 5 N t <NA> <NA> NA 1 0
# 5 10 <NA> <NA> O2 r27 57 0 1
# 6 9 <NA> <NA> Q4 t49 59 0 1
# 7 13 <NA> <NA> N5 q510 60 0 1
However, I would like to know of a nicer way to do this.
I don't know if it's nicer, but at least this is a way to do it in base R.
df <- merge(df1, df2, all = TRUE)
df$InDf1 <- ifelse(is.na(match(df$ID, df1$ID)),0,1)
df$InDf2 <- ifelse(is.na(match(df$ID, df2$ID)),0,1)
> df
ID V1 V2 X5 X16 X23 InDf1 InDf2
1 1 K q N1 q16 56 1 1
2 3 L r <NA> <NA> NA 1 0
3 4 M s P3 s38 58 1 1
4 5 N t <NA> <NA> NA 1 0
5 9 <NA> <NA> Q4 t49 59 0 1
6 10 <NA> <NA> O2 r27 57 0 1
7 13 <NA> <NA> N5 q510 60 0 1
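Since the question mentions four datasets, the same idea can be written once and looped over a named list. This is only a sketch under some assumptions: the list name sets and the result name out are illustrative, df2 is trimmed to one value column for brevity, and the data frames are assumed to share no column names other than ID.

```r
df1 <- data.frame(ID = c(1, 3, 4, 5), V1 = LETTERS[11:14], V2 = letters[17:20])
df2 <- data.frame(ID = c(1, 10, 4, 9, 13), X23 = 56:60)

# Named list of inputs; further data frames could be added here
sets <- list(df1 = df1, df2 = df2)

# Full (outer) join of all inputs on ID
out <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), sets)

# One 0/1 indicator per input: does this ID occur in that data frame?
for (nm in names(sets)) {
  out[[paste0("in_", nm)]] <- as.integer(out$ID %in% sets[[nm]]$ID)
}
out
```

Using %in% against the original ID columns avoids adding an indicator column to each dataset before the join.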
My data frame looks like this
gen<-c("A","B","C")
prob<-c("0.95","0.82","0.78")
mw<-c("10","20","50")
df<-data.frame(gen,prob,mw)
gen prob mw
1 A 0.95 10
2 B 0.82 20
3 C 0.78 50
Now I want to have all the possible outcomes of (A,B,C), (A,B),(A,C),(B,C),(A),(B),(C),(NONE) with the probabilities for example (A,B,C)=0.95*0.82*0.78= 0.60762.
trials <- data.frame(matrix(nrow=0,ncol=length(gen)))
for(i in 1:length(gen)){
trials.tmp <- t(combn(gen,i))
trials <- rbind(trials,cbind(trials.tmp, matrix(nrow=nrow(trials.tmp),
ncol=length(gen)-i ) ))
}
trials
V1 V2 V3
1 A <NA> <NA>
2 B <NA> <NA>
3 C <NA> <NA>
4 A B <NA>
5 A C <NA>
6 B C <NA>
7 A B C
But I'm still missing the combination (NA,NA,NA). How can I make a new data frame with all outcomes and probabilities?
You can try combn as mentioned by zx8754, for example:
a=0.95
b=0.82
c=0.78
x <- c(a,b,c)
df <- rbind(t(combn(x, 3)), cbind(t(combn(x, 2)), NA), cbind(t(combn(x, 1)), NA, NA))
apply(df, 1, function(x) prod(x[!is.na(x)]))
[1] 0.60762 0.77900 0.74100 0.63960 0.95000 0.82000 0.78000
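The combn() approach above leaves out the empty (NONE) case. One possible way to include it is to enumerate every subset with expand.grid(). This is a sketch, assuming the convention from the question that an outcome's value is the product of the probabilities of the generators it contains, so NONE gets the empty product 1; the names inc and result are illustrative.

```r
gen  <- c("A", "B", "C")
prob <- c(0.95, 0.82, 0.78)

# One row per subset; TRUE means the generator is part of the outcome
inc <- expand.grid(rep(list(c(TRUE, FALSE)), length(gen)))
names(inc) <- gen

# Label each subset, using "NONE" for the empty one
outcome <- apply(inc, 1, function(r) {
  s <- gen[r]
  if (length(s) > 0) paste(s, collapse = ",") else "NONE"
})

# Product of the included probabilities; prod(numeric(0)) is 1
p <- apply(inc, 1, function(r) prod(prob[r]))

result <- data.frame(outcome = outcome, prob = p)
result
```

This gives 2^3 = 8 rows, starting with A,B,C (0.60762) and ending with NONE.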