I want to add extra columns whose values depend on whether code is one of the values defined in VAR.
DF <- data.frame(id = c(1:5), code = c("A","B","C","D","E"), sub = c("A1","B1","C1","D1","E1"))
id code sub
1 1 A A1
2 2 B B1
3 3 C C1
4 4 D D1
5 5 E E1
VAR <- c("A","B")
The result should look like this:
id code sub AB ABsub
1 1 A A1 A A1
2 2 B B1 B B1
3 3 C C1 <NA> <NA>
4 4 D D1 <NA> <NA>
5 5 E E1 <NA> <NA>
Or using dplyr:
library(dplyr)
DF<-data.frame(id=c(1:5),code=c("A","B","C","D","E"),sub=c("A1","B1","C1","D1","E1"), stringsAsFactors = FALSE)
VAR<-c("A","B")
DF <- DF %>%
mutate(AB = ifelse(code %in% {{VAR}}, code, NA_character_)) %>%
mutate(ABsub = ifelse(code == AB, sub, NA_character_))
with:
> DF
id code sub AB ABsub
1 1 A A1 A A1
2 2 B B1 B B1
3 3 C C1 <NA> <NA>
4 4 D D1 <NA> <NA>
5 5 E E1 <NA> <NA>
This also works if VAR equals c("A", "B", "C"), although it is not clear whether that is what you are after.
A simple base R option using merge + subset
merge(DF,subset(DF,code %in% VAR),by = "id",all = TRUE)
such that
> merge(DF,subset(DF,code %in% VAR),by = "id",all = TRUE)
id code.x sub.x code.y sub.y
1 1 A A1 A A1
2 2 B B1 B B1
3 3 C C1 <NA> <NA>
4 4 D D1 <NA> <NA>
5 5 E E1 <NA> <NA>
A dplyr solution with across():
library(dplyr)
DF %>%
mutate(across(-id, ~ replace(.x, !(code %in% VAR), NA), .names = "AB{col}"))
# id code sub ABcode ABsub
# 1 1 A A1 A A1
# 2 2 B B1 B B1
# 3 3 C C1 <NA> <NA>
# 4 4 D D1 <NA> <NA>
# 5 5 E E1 <NA> <NA>
or with left_join():
DF %>%
filter(code %in% VAR) %>%
left_join(DF, ., by = "id", suffix = c("", "AB"))
# id code sub codeAB subAB
# 1 1 A A1 A A1
# 2 2 B B1 B B1
# 3 3 C C1 <NA> <NA>
# 4 4 D D1 <NA> <NA>
# 5 5 E E1 <NA> <NA>
Note: If you have multiple columns in your real data, you don't need to type
mutate(Col1 = ifelse(...), Col2 = ifelse(...), etc.)
one by one.
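For comparison, the same idea of handling several columns at once can be sketched in base R. This is only a minimal sketch, assuming DF and VAR as defined in the question and character (not factor) columns; the names cols and keep are illustrative and not part of the original answer.

```r
DF <- data.frame(id = 1:5,
                 code = c("A", "B", "C", "D", "E"),
                 sub = c("A1", "B1", "C1", "D1", "E1"),
                 stringsAsFactors = FALSE)
VAR <- c("A", "B")

cols <- c("code", "sub")   # columns to copy (illustrative name)
keep <- DF$code %in% VAR   # TRUE for rows whose code is listed in VAR

# copy each selected column, setting rows that are not in VAR to NA
DF[paste0("AB", cols)] <- lapply(DF[cols], function(x) replace(x, !keep, NA))
DF
```

Adding another column name to cols would produce one more "AB"-prefixed column without any further code.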
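Here's a base R solution:
# assumes code and sub are character columns (stringsAsFactors = FALSE)
AB <- ifelse(DF$code %in% VAR, DF$code, NA)
ABsub <- ifelse(DF$code %in% VAR, DF$sub, NA)
cbind(DF, AB, ABsub)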
I have the following dataset:
Letter ID Number
A A1 1
A A2 2
A A3 3
B B1 1
B B2 2
B B3 3
B B4 4
My aim is first to create all possible combinations of IDs within the same "Letter" group. For example, for the letter A, there would be only three combinations: A1-A2, A2-A3, and A1-A3. The same IDs ordered differently don't count as a new combination, so for example A1-A2 is the same as A2-A1.
Then, within those combinations, I want to add up the numbers from the "Number" column associated with those IDs. So for the combination A1-A2, which are associated with 1 and 2 in the "Number" column, this would result in the number 1+2=3.
Finally, I want to place the ID combinations, added numbers and original Letter in a new data frame. Something like this:
Letter Combination Add.Number
A A1-A2 3
A A2-A3 5
A A1-A3 4
B B1-B2 3
B B2-B3 5
B B3-B4 7
B B1-B3 4
B B2-B4 6
B B1-B4 5
How can I do this in R, ideally using the package dplyr?
library(dplyr)
letter <- c("A","A","A","B","B","B","B")
df <-
data.frame(letter) %>%
group_by(letter) %>%
mutate(
number = row_number(),
id = paste0(letter,number)
)
df %>%
full_join(df,by = "letter") %>%
filter(number.x < number.y) %>%
mutate(
combination = paste0(id.x,"-",id.y),
add_number = number.x + number.y) %>%
select(letter,combination,add_number)
# A tibble: 9 x 3
# Groups: letter [2]
letter combination add_number
<chr> <chr> <int>
1 A A1-A2 3
2 A A1-A3 4
3 A A2-A3 5
4 B B1-B2 3
5 B B1-B3 4
6 B B1-B4 5
7 B B2-B3 5
8 B B2-B4 6
9 B B3-B4 7
In base R, using combn:
df <- data.frame(
Letter = c("A","A","A","B","B","B","B"),
Id = c("A1","A2","A3","B1","B2","B3","B4"),
Number = c(1,2,3,1,2,3,4))
# combinations
l<-lapply(split(df$Id, df$Letter) ,function(x)
setNames(data.frame(t(combn(x,2))), c("L1","L2")))
n<-lapply(split(df$Number, df$Letter) ,function(x)
setNames(data.frame(t(combn(x,2))), c("N1","N2")))
# rbind all
result <- do.call(rbind, mapply(cbind, Letter=names(l), l, n, SIMPLIFY = F))
result$combination <- paste(result$L1, result$L2, sep="-")
result$sum = result$N1 + result$N2
result
#> Letter L1 L2 N1 N2 combination sum
#> A.1 A A1 A2 1 2 A1-A2 3
#> A.2 A A1 A3 1 3 A1-A3 4
#> A.3 A A2 A3 2 3 A2-A3 5
#> B.1 B B1 B2 1 2 B1-B2 3
#> B.2 B B1 B3 1 3 B1-B3 4
#> B.3 B B1 B4 1 4 B1-B4 5
#> B.4 B B2 B3 2 3 B2-B3 5
#> B.5 B B2 B4 2 4 B2-B4 6
#> B.6 B B3 B4 3 4 B3-B4 7
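A variation on the combn idea is to combine row positions rather than values, so that the IDs and the numbers are paired in a single pass. The following is a sketch under that assumption, not the answerer's code; pair_group is a hypothetical helper name, and it assumes every Letter group has at least two rows.

```r
df <- data.frame(
  Letter = c("A", "A", "A", "B", "B", "B", "B"),
  Id     = c("A1", "A2", "A3", "B1", "B2", "B3", "B4"),
  Number = c(1, 2, 3, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# build all unordered pairs for one Letter group (hypothetical helper);
# assumes the group has at least 2 rows
pair_group <- function(g) {
  idx <- combn(seq_len(nrow(g)), 2)   # 2 x choose(n, 2) matrix of row positions
  data.frame(
    Letter      = g$Letter[1],
    Combination = paste(g$Id[idx[1, ]], g$Id[idx[2, ]], sep = "-"),
    Add.Number  = g$Number[idx[1, ]] + g$Number[idx[2, ]],
    stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(split(df, df$Letter), pair_group))
rownames(result) <- NULL
result
```

Because each pair of positions appears only once in the combn output, A1-A2 and A2-A1 are never both produced.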
I have a dataframe like this, where the values are separated by comma.
# Events
# A,B,C
# C,D
# B,A
# D,B,A,E
# A,E,B
I would like to have the next data frame
# Event1 Event2 Event3 Event4 Event5
# A B C NA NA
# NA NA C D NA
# A B NA NA NA
# A B NA D E
# A B NA NA E
I have tried cSplit, but it does not give me the desired data frame. Is this possible?
NOTE: The values do not appear in the same position in the input as in the Event columns of the second data frame.
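1) Here is a base R solution. Split each row on the commas, giving the list s, and create cols, which contains the sorted unique values. Then iterate over s, keeping each value of cols that is present in the row and NA otherwise, and convert the result to a data frame.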
Note that this does not hard code the column names and continues to work even if some column names are substrings of other column names.
s <- strsplit(DF$Events, ",")
cols <- unique(sort(unlist(s)))
data.frame(Event = t(sapply(s, function(x) ifelse(cols %in% x, cols, NA))))
giving:
Event.1 Event.2 Event.3 Event.4 Event.5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
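To see what the ifelse() step does for a single row, here is a small worked example. It assumes the fourth row of the input ("D,B,A,E") and the five sorted unique values; row4 is just an illustrative name.

```r
cols <- c("A", "B", "C", "D", "E")   # sorted unique values across all rows
x <- c("D", "B", "A", "E")           # the fourth row after strsplit

# keep each value of cols that occurs in x, NA otherwise
row4 <- ifelse(cols %in% x, cols, NA)
row4
# "A" "B" NA "D" "E"
```

The position of each value is determined by cols, not by the order within the row, which is why "D" ends up in the fourth column.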
2) This base R solution uses strsplit as above and then names the components, since stack requires a named list, before invoking stack. We then expand the result into wide form using tapply, convert it to a data frame and fix up the names.
s <- strsplit(DF$Events, ",")
names(s) <- seq_along(s)
stk <- stack(s)
mat <- t(tapply(stk$values, stk, c))
colnames(mat) <- NULL
data.frame(Event = mat)
giving:
Event.1 Event.2 Event.3 Event.4 Event.5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
This could also be represented as an R 4.2+ pipeline:
DF |>
with(setNames(Events, seq_along(Events))) |>
strsplit(",") |>
stack() |>
with(tapply(values, data.frame(ind, values), c)) |>
`colnames<-`(NULL) |>
data.frame(Event = _)
Note
The input in reproducible form:
Lines <- "Events
A,B,C
C,D
B,A
D,B,A,E
A,E,B"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE)
Another approach using tidyverse:
library(dplyr)
library(purrr)
library(stringr)
Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")
letters <- Events %>% str_split(",") %>% unlist() %>% unique()
df <- data.frame(Events)
df %>%
map2_dfc(.y = letters, ~ ifelse(str_detect(.x, .y), .y, NA)) %>%
set_names(nm = paste0("Events", 1:length(letters)))
#> # A tibble: 5 × 5
#> Events1 Events2 Events3 Events4 Events5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 A B C <NA> <NA>
#> 2 <NA> <NA> C D <NA>
#> 3 A B <NA> <NA> <NA>
#> 4 A B <NA> D E
#> 5 A B <NA> <NA> E
Created on 2022-07-11 by the reprex package (v2.0.1)
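This tidyverse solution is probably the most economical in terms of the amount of code. Note that it fills the columns in the order in which the values appear in each string, so, as the output shows, a given letter does not always land in the same column: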
library(tidyverse)
data.frame(Events) %>%
# split the strings by the comma:
mutate(Events = str_split(Events, ",")) %>%
# unnest the split values wider into columns:
unnest_wider(Events, names_sep = "")
# A tibble: 5 × 4
Events1 Events2 Events3 Events4
<chr> <chr> <chr> <chr>
1 A B C NA
2 C D NA NA
3 B A NA NA
4 D B A E
5 A E B NA
Data:
Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")
We can try the following base R code
> d <- t(table(stack(setNames(strsplit(df$Events, ","), 1:nrow(df)))))
> as.data.frame.matrix(`dim<-`(colnames(d)[ifelse(d > 0, d * col(d), NA)], dim(d)))
V1 V2 V3 V4 V5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
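The one-liner above is easier to follow if the intermediate table d is inspected first. The sketch below assumes the Events vector from the Data line rather than a df$Events column; d is then a 5 x 5 presence/absence table with one row per input row and one column per letter.

```r
Events <- c("A,B,C", "C,D", "B,A", "D,B,A,E", "A,E,B")

# name each split row by its position so that stack() can record the origin
s <- setNames(strsplit(Events, ","), seq_along(Events))

# rows are the input rows, columns are the letters, cells are counts (0 or 1 here)
d <- t(table(stack(s)))
d
```

The second line of the original answer then replaces every non-zero cell with its column name and every zero cell with NA.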
DF<-data.frame(id=c(1,1,1,2,2,2),rank=c("1","2","3","1","2","3"),code=c("A","B","B","B","B","A"))
DF
id rank code
1 A1 1 A
2 A1 2 B
3 A1 3 B
4 B2 1 B
5 B2 2 B
6 B2 3 A
Desired output:
id rank code type1 type2 type3
1 A1 1 A aa MIX MIX
2 A1 2 B NA MIX MIX
3 A1 3 B NA NA MIX
4 B2 1 B bb bb MIX
5 B2 2 B NA bb MIX
6 B2 3 A NA NA MIX
All is grouped by id
type1 gets code where rank = 1.
type2 gets code where rank = 1-2. If code is different in rank 1 and 2, then MIX
type3 gets code where rank = 1-3. etc. etc.
Anyone? :)
If the column 'code' is a factor, convert it to character with as.character, or use type.convert to convert the columns automatically. Then, grouped by 'id', build the conditions with case_when to create the columns 'type1', 'type2' and 'type3':
library(dplyr)
DF %>%
  type.convert(as.is = TRUE) %>%
  group_by(id) %>%
  mutate(type1 = case_when(rank == 1 ~ strrep(tolower(code), 2)),
         type2 = case_when(
           rank %in% 1:2 & all(c(1, 2) %in% rank) &
             n_distinct(code[rank %in% 1:2]) == 1 ~ strrep(tolower(code), 2),
           rank %in% 1:2 & all(c(1, 2) %in% rank) &
             n_distinct(code[rank %in% 1:2]) > 1 ~ "MIX"),
         type3 = case_when(
           rank %in% 1:3 & all(c(1, 2, 3) %in% rank) &
             n_distinct(code[rank %in% 1:3]) == 1 ~ strrep(tolower(code), 2),
           rank %in% 1:3 & all(c(1, 2, 3) %in% rank) &
             n_distinct(code[rank %in% 1:3]) > 1 ~ "MIX")) %>%
  ungroup()
-output
# A tibble: 7 × 6
id rank code type1 type2 type3
<int> <int> <chr> <chr> <chr> <chr>
1 1 1 A aa MIX MIX
2 1 2 B <NA> MIX MIX
3 1 3 B <NA> <NA> MIX
4 2 1 B bb bb MIX
5 2 2 B <NA> bb MIX
6 2 3 A <NA> <NA> MIX
7 3 1 A aa <NA> <NA>
data
DF <- data.frame(id=c(1,1,1,2,2,2,3),
rank=c("1","2","3","1","2","3","1"),
code=c("A","B","B","B","B","A","A"))
With a slight modification to my answer from your previous question:
maxtype=3
do.call(
rbind,
by(DF,list(DF$id),function(x){
y=list()
for (i in 1:maxtype) {
tmp=rep(NA,nrow(x))
idx=as.numeric(x$rank)<=i
if (length(unique(x$code[idx]))==1) {
tmp[idx]=paste0(rep(tolower(x$code[1]),2),collapse="")
} else {
tmp[idx]="MIX"
}
y[[paste0("type",i)]]=tmp
}
cbind(x,y)
})
)
id rank code type1 type2 type3
1.1 1 1 A aa MIX MIX
1.2 1 2 B <NA> MIX MIX
1.3 1 3 B <NA> <NA> MIX
2.4 2 1 B bb bb MIX
2.5 2 2 B <NA> bb MIX
2.6 2 3 A <NA> <NA> MIX
Also note that your id column is different in DF and your output.
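The rule described in the question (look at all rows up to rank k within an id, and report the doubled lower-case code if they agree, "MIX" otherwise) can also be sketched as a small helper applied per group. This is a hedged sketch, not either answerer's code: label_upto is a hypothetical name, the numeric id values from the question's data.frame() call are assumed, and unlike the case_when answer it does not require every rank up to k to be present.

```r
DF <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 rank = c("1", "2", "3", "1", "2", "3"),
                 code = c("A", "B", "B", "B", "B", "A"),
                 stringsAsFactors = FALSE)

# label for one id group and one cutoff k (hypothetical helper)
label_upto <- function(code, rank, k) {
  in_scope <- as.numeric(rank) <= k          # rows with rank 1..k
  codes <- unique(code[in_scope])
  label <- if (length(codes) == 1) strrep(tolower(codes), 2) else "MIX"
  ifelse(in_scope, label, NA_character_)     # NA for rows beyond rank k
}

for (k in 1:3) {
  per_id <- lapply(split(DF, DF$id), function(g) label_upto(g$code, g$rank, k))
  DF[[paste0("type", k)]] <- unsplit(per_id, DF$id)
}
DF
```

Changing 1:3 in the loop to another range adds or removes type columns without touching the helper.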
Can we combine the rows of multiple data frames that have different columns? An example is below.
> asd1 <- data.frame(a = c("a","b"), b = c("fd", "fg"))
> asd1
a b
1 a fd
2 b fg
> asd2 <- data.frame(a = c("a","b"), e = c("fd", "fg"), c = c("gfd","asd"))
> asd2
a e c
1 a fd gfd
2 b fg asd
Newdf <- rbind(asd1, asd2)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Right now there is an error because the columns differ.
Expected output
newdf
data a b e c
asd1 a fd NA NA
asd1 b fg NA NA
asd2 a NA fd gfd
asd2 b NA fg asd
Is the above output possible?
I would suggest bind_rows() from dplyr:
library(dplyr)
#Data 1
asd1 <- data.frame(a = c("a","b"), b = c("fd", "fg"))
#Data 2
asd2 <- data.frame(a = c("a","b"), e = c("fd", "fg"), c = c("gfd","asd"))
#Bind
df <- bind_rows(asd1,asd2)
Output:
a b e c
1 a fd <NA> <NA>
2 b fg <NA> <NA>
3 a <NA> fd gfd
4 b <NA> fg asd
library(dplyr)
bind_rows(asd1, asd2, .id = "data")
# data a b e c
# 1 1 a fd <NA> <NA>
# 2 1 b fg <NA> <NA>
# 3 2 a <NA> fd gfd
# 4 2 b <NA> fg asd
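If dplyr is not available, the same result can be approximated in base R by adding the missing columns as NA before calling rbind. This is a minimal sketch under that assumption; fill_missing and all_cols are hypothetical names, and the data column holding the source name is added by hand.

```r
asd1 <- data.frame(a = c("a", "b"), b = c("fd", "fg"),
                   stringsAsFactors = FALSE)
asd2 <- data.frame(a = c("a", "b"), e = c("fd", "fg"), c = c("gfd", "asd"),
                   stringsAsFactors = FALSE)

all_cols <- union(names(asd1), names(asd2))   # "a" "b" "e" "c"

# add every column that d lacks as NA, then put the columns in a common order
fill_missing <- function(d, cols) {
  for (nm in setdiff(cols, names(d))) d[[nm]] <- NA
  d[cols]
}

newdf <- rbind(fill_missing(asd1, all_cols), fill_missing(asd2, all_cols))

# record which data frame each row came from
newdf <- cbind(data = rep(c("asd1", "asd2"), c(nrow(asd1), nrow(asd2))), newdf)
newdf
```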
After seeing this post with a nice answer by @akrun, I wanted to play with dplyr. Here are the sample data from the post and akrun.
df = data.frame(
id1 = c(1,1,2,2,2,3,3,3,3),
id2 = c(1,2,1,2,3,1,2,3,4),
X1 = letters[1:9],
X2 = LETTERS[1:9],
stringsAsFactors = FALSE
)
df2 <- data.frame(
id1 = rep(c(1:3), each = 4),
id2 = rep(c(1:4), times = 3),
stringsAsFactors = FALSE
)
If I replicate akrun's answer, merge() perfectly works here.
df %>%
do(merge(., df2, by = c("id1","id2"), all = TRUE))
id1 id2 X1 X2
1 1 1 a A
2 1 2 b B
3 1 3 <NA> <NA>
4 1 4 <NA> <NA>
5 2 1 c C
6 2 2 d D
7 2 3 e E
8 2 4 <NA> <NA>
9 3 1 f F
10 3 2 g G
11 3 3 h H
12 3 4 i I
Then, I thought left_join(x,y) would do. left_join(x,y) includes all of x, and matching rows of y. From the examples in the dplyr tutorial pdf from UseR!2014, I expected an identical result. But, that was not the case.
> df %>%
+ left_join(df2, .)
Joining by: c("id1", "id2")
id1 id2 X1 X2
1 1 1 a A
2 1 2 b B
3 1 3 <NA> <NA>
4 1 4 <NA> <NA>
5 2 1 <NA> <NA>
6 2 2 <NA> <NA>
7 2 3 <NA> <NA>
8 2 4 <NA> <NA>
9 3 1 <NA> <NA>
10 3 2 <NA> <NA>
11 3 3 <NA> <NA>
12 3 4 <NA> <NA>
The first three rows indicate that dplyr was doing the right job. But, once it encountered NA, it generated NAs till the end. Is this a bug or did I do something wrong? Thank you for taking your time.
There are currently a few bugs with dplyr and the _join functions:
https://github.com/hadley/dplyr/issues/542
https://github.com/hadley/dplyr/issues/455
https://github.com/hadley/dplyr/issues/450
It looks like they are being fixed. In the meantime, if you make sure the join variables have the same type (they don't in your example, as you can tell by using str()), then it should work:
df = data.frame(
id1 = c(1,1,2,2,2,3,3,3,3),
id2 = c(1,2,1,2,3,1,2,3,4),
X1 = letters[1:9],
X2 = LETTERS[1:9],
stringsAsFactors = FALSE
)
df2 <- data.frame(
id1 = as.numeric(rep(c(1:3), each = 4)),
id2 = as.numeric(rep(c(1:4), times = 3)),
stringsAsFactors = FALSE
)
left_join(df2, df)
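The type mismatch mentioned above can be checked directly in base R. The sketch below assumes only the key columns from the question's data; key, types_df and types_df2 are illustrative names. It shows that c(1, 1, 2, ...) creates numeric (double) columns while rep(1:3, ...) creates integer columns, and then coerces the keys of df2 so both sides agree.

```r
df <- data.frame(id1 = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
                 id2 = c(1, 2, 1, 2, 3, 1, 2, 3, 4),
                 stringsAsFactors = FALSE)
df2 <- data.frame(id1 = rep(1:3, each = 4),
                  id2 = rep(1:4, times = 3),
                  stringsAsFactors = FALSE)

key <- c("id1", "id2")
types_df  <- sapply(df[key], class)    # "numeric" for both columns
types_df2 <- sapply(df2[key], class)   # "integer" for both columns

# coerce the keys in df2 so that both data frames use the same type
df2[key] <- lapply(df2[key], as.numeric)
```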