I have the following kind of data frame (this is a simplified example):
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df
id bank
1 1 a
2 1 b
3 1 c
4 2 b
5 3 b
6 3 c
7 4 a
8 4 c
In this data frame you can see that some ids have multiple banks, e.g. for id==1, bank=c(a,b,c).
For each bank, I would like to extract which other banks its ids also appear in (the overlap), together with the count.
So, for example, bank a has two persons (unique ids): 1 and 4. For these persons, I want to know which other banks they have:
For person 1: bank b and c
For person 4: bank c
The total number of other banks is 3, of which b = 1 and c = 2.
So I want to create as output a sort of overlap table, as below:
bank overlap amount
a b 1
a c 2
b a 1
b c 2
c a 2
c b 2
It took me a while to get a result, so I am posting it. It is not as elegant as Ronak Shah's answer, but it gives the same result.
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df$bank <- as.character(df$bank)
library(data.table)  # rbindlist()
library(dplyr)       # %>%, group_by(), summarise()
resultlist <- list()
dflist <- split(df, df$id)
for(i in seq_along(dflist)) {
if(nrow(dflist[[i]]) < 2) {
resultlist[[i]] <- data.frame(matrix(nrow = 0, ncol = 2))
} else {
resultlist[[i]] <- as.data.frame(t(combn(dflist[[i]]$bank, 2)))
}
}
result <- setNames(data.table(rbindlist(resultlist)), c("bank", "overlap"))
result %>%
group_by(bank, overlap) %>%
summarise(amount = n())
bank overlap amount
<fct> <fct> <int>
1 a b 1
2 a c 2
3 b c 2
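Note that this result lists each pair only once (a-b but not b-a), whereas the table in the question lists both directions. Here is a minimal base R sketch that mirrors the pairs to get the six-row layout. It assumes the three-row result above has been copied into a data frame; the names one_way, mirrored and both are made up for illustration.

```r
# Hypothetical copy of the three-row result shown above
one_way <- data.frame(bank    = c("a", "a", "b"),
                      overlap = c("b", "c", "c"),
                      amount  = c(1L, 2L, 2L),
                      stringsAsFactors = FALSE)

# Swap the two bank columns to get the reverse direction of every pair
mirrored <- data.frame(bank    = one_way$overlap,
                       overlap = one_way$bank,
                       amount  = one_way$amount,
                       stringsAsFactors = FALSE)

# Stack both directions and sort by bank, then by overlap
both <- rbind(one_way, mirrored)
both <- both[order(both$bank, both$overlap), ]
rownames(both) <- NULL
both
```

This only works because the overlap count is symmetric: the number of ids shared by a and b is the same whichever bank you start from.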
We may use data.table:
df = data.frame(id = c("1", "1", "1", "2", "3", "3", "4", "4"),
bank = c("a", "b", "c", "b", "b", "c", "a", "c"))
library(data.table)
setDT(df)[, .(bank = rep(bank, (.N-1L):0L),
overlap = bank[(sequence((.N-1L):1L) + rep(1:(.N-1L), (.N-1L):1))]),
by=id][,
.N, by=.(bank, overlap)]
#> bank overlap N
#> 1: a b 1
#> 2: a c 2
#> 3: b c 2
#> 4: <NA> b 1
Created on 2019-07-01 by the reprex package (v0.3.0)
Please note that id==2 has only bank b, which does not overlap with any other bank; that is where the <NA> row comes from. If you don't want it in the final result, just apply na.omit() to the output.
One option is a full_join of the data frame with itself:
library(dplyr)
full_join(df, df, by = "id") %>%
filter(bank.x != bank.y) %>%
dplyr::count(bank.x, bank.y) %>%
select(bank = bank.x, overlap = bank.y, amount = n)
# A tibble: 6 x 3
# bank overlap amount
# <fct> <fct> <int>
#1 a b 1
#2 a c 2
#3 b a 1
#4 b c 2
#5 c a 2
#6 c b 2
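If you would rather avoid the dplyr dependency, the same self-join idea can be sketched in base R with merge() and table(). This is only a sketch of the idea above, not a drop-in replacement; the names pairs and overlap are made up for illustration.

```r
df <- data.frame(id   = c("1", "1", "1", "2", "3", "3", "4", "4"),
                 bank = c("a", "b", "c", "b", "b", "c", "a", "c"),
                 stringsAsFactors = FALSE)

# Self-join on id: every bank of an id is paired with every bank of the same id
pairs <- merge(df, df, by = "id")            # columns: id, bank.x, bank.y

# Drop the pairs of a bank with itself
pairs <- pairs[pairs$bank.x != pairs$bank.y, ]

# Count each (bank, overlap) combination
overlap <- as.data.frame(table(bank = pairs$bank.x, overlap = pairs$bank.y),
                         responseName = "amount", stringsAsFactors = FALSE)

# table() also reports combinations that never occur; remove them and sort
overlap <- overlap[overlap$amount > 0, ]
overlap <- overlap[order(overlap$bank, overlap$overlap), ]
rownames(overlap) <- NULL
overlap
```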
Do you need to cover both banks in both directions? a -> b is the same as b -> a in this case. We can use combn to create all combinations of the unique banks taken 2 at a time and, for each combination, count the ids that the two banks have in common.
as.data.frame(t(combn(unique(df$bank), 2, function(x)
c(x, with(df, length(intersect(id[bank == x[1]], id[bank == x[2]])))))))
# V1 V2 V3
#1 a b 1
#2 a c 2
#3 b c 2
data
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank, stringsAsFactors = FALSE)
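A different way to get the same pairwise counts, sketched here under the assumption that each id/bank combination occurs at most once in the data, is to build an id-by-bank incidence table and take its cross product. The names incidence and co are made up for illustration.

```r
df <- data.frame(id   = c("1", "1", "1", "2", "3", "3", "4", "4"),
                 bank = c("a", "b", "c", "b", "b", "c", "a", "c"),
                 stringsAsFactors = FALSE)

# id x bank incidence matrix: 1 if the id appears with that bank, 0 otherwise
# (unclass() turns the table into a plain integer matrix)
incidence <- unclass(table(df$id, df$bank))

# bank x bank matrix: entry [i, j] is the number of ids found in both banks
co <- crossprod(incidence)
co
```

The off-diagonal entries are the overlap counts (in both directions at once), and the diagonal holds the number of ids per bank.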
After years of using the advice you have given to other users, here is my own, so far unsolvable, issue...
I have a dataset with thousands of rows and hundreds of columns, in which some rows share the same value in one column. Here is a subset of my dataset:
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("x1", "x2", "x3", "x2", "x3")
mat <- cbind(ID, Dose, Value)
What I want is to assign a unique code to the rows that have the same value in the "Value" column, like this:
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("153254", "258634", "896411", "258634", "896411")
Code <- c("1", "2", "3", "2", "3")
mat <- cbind(ID, Dose, Value, Code)
Does anyone have an idea that could help me a little?
Thanks !
We may use match here
library(dplyr)
mat %>%
mutate(Code = match(Value, unique(Value)))
-output
ID Dose Value Code
1 A 1 153254 1
2 B 5 258634 2
3 C 3 896411 3
4 D 4 258634 2
5 E 5 896411 3
data
mat <- data.frame(ID, Dose, Value)
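If mat has to stay the character matrix that cbind() builds in the question, the same match() idea can be applied to the matrix column directly. A minimal sketch, using the numeric-looking Value strings from the desired output; mat2 is a made-up name.

```r
ID    <- c("A", "B", "C", "D", "E")
Dose  <- c("1", "5", "3", "4", "5")
Value <- c("153254", "258634", "896411", "258634", "896411")
mat   <- cbind(ID, Dose, Value)   # character matrix, as in the question

# Position of each Value among the distinct values, in order of first appearance
code <- match(mat[, "Value"], unique(mat[, "Value"]))

# cbind() onto a character matrix stores the codes as strings
mat2 <- cbind(mat, Code = code)
mat2
```

Because a matrix holds a single type, Code ends up as character here; converting mat to a data.frame first keeps it as an integer column.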
You should consider using a data.frame:
mat <- data.frame(ID, Dose, Value)
Using dplyr you could create the desired output:
library(dplyr)
mat %>%
group_by(Value) %>%
mutate(Code = cur_group_id()) %>%
ungroup()
This returns
# A tibble: 5 x 4
ID Dose Value Code
<chr> <chr> <chr> <int>
1 A 1 153254 1
2 B 5 258634 2
3 C 3 896411 3
4 D 4 258634 2
5 E 5 896411 3
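One thing to keep in mind: as far as I know, group_by() orders its groups by the sorted key, so cur_group_id() numbers the groups in sorted order of Value, while match(Value, unique(Value)) numbers them in order of first appearance. The two agree in the example only because the values already appear in sorted order. A small base R sketch of the difference, with a made-up Value vector and as.integer(factor()) standing in for the sorted numbering:

```r
# Hypothetical values that do NOT appear in sorted order
Value <- c("x2", "x1", "x2", "x3")

# Numbered by first appearance
by_appearance <- match(Value, unique(Value))

# Numbered by sorted key (factor levels are sorted by default)
by_sorted_key <- as.integer(factor(Value))

data.frame(Value, by_appearance, by_sorted_key)
```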
I have a dataframe as below, with all the values corresponding to an 'other' type, belonging to specific IDs:
df <- data.frame(ID = c("1", "1", "1", "2", "2", "3"), type = c("oth", "oth", "oth", "oth", "oth", "oth"), value = c("A", "B", "B", "C", "D", "D"))
ID type value
1 oth A
1 oth B
1 oth B
2 oth C
2 oth D
3 oth D
I would like to change the types of the rows with values A, B, C to be 1, 2, 3 respectively (D stays as 'oth'). If it is changed, I would like to keep the 'oth' row but have the value as NA.
The above df would result in:
df2 <- data.frame(ID = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "3"), type = c("1", "oth", "2", "oth", "2", "oth", "3", "oth", "oth", "oth"), value = c("A", NA, "B", NA, "B", NA, "C", NA, "D", "D"))
ID type value
1 1 A
1 oth <NA>
1 2 B
1 oth <NA>
1 2 B
1 oth <NA>
2 3 C
2 oth <NA>
2 oth D
3 oth D
Note that any rows that match A,B,C will create a new row with correct type, but change the original one to value = NA. If possible, a dplyr solution would be preferred.
Any help would be appreciated, thanks!
You can create a vector of the values to change (values). Filter the rows with those values and set their value column to NA. Separately, use match to change the type to 1 for 'A', 2 for 'B' and 3 for 'C'. Then bind the two data frames together.
library(dplyr)
values <- c('A', 'B', 'C')
df %>%
filter(value %in% values) %>%
mutate(value = NA) %>%
bind_rows(df %>%
mutate(type = match(value, values),
type = replace(type, is.na(type), 'oth'))) %>%
arrange(ID, type)
# ID type value
#1 1 1 A
#2 1 2 B
#3 1 2 B
#4 1 oth <NA>
#5 1 oth <NA>
#6 1 oth <NA>
#7 2 3 C
#8 2 oth <NA>
#9 2 oth D
#10 3 oth D
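The output above puts the recoded rows before the 'oth' rows within each ID. If the interleaved order from the question matters (each recoded row immediately followed by its blanked 'oth' row), here is a base R sketch of one way to get it. The helper names hit, recoded, blanked, pos and is_blank are made up for illustration.

```r
df <- data.frame(ID    = c("1", "1", "1", "2", "2", "3"),
                 type  = c("oth", "oth", "oth", "oth", "oth", "oth"),
                 value = c("A", "B", "B", "C", "D", "D"),
                 stringsAsFactors = FALSE)
values <- c("A", "B", "C")
hit <- df$value %in% values

# New rows: same value, type recoded to its position in `values`
recoded <- df[hit, ]
recoded$type <- as.character(match(recoded$value, values))

# Original rows: keep type 'oth', blank out the value where it was recoded
blanked <- df
blanked$value[hit] <- NA

# Sort by original row position, placing each recoded row before its twin
pos      <- c(which(hit), seq_len(nrow(df)))
is_blank <- c(rep(0, sum(hit)), rep(1, nrow(df)))
out <- rbind(recoded, blanked)[order(pos, is_blank), ]
rownames(out) <- NULL
out
```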
You may try it this way:
library(dplyr)
rbind(df,
df%>%
filter(value %in% c("A", "B", "C")) %>%
mutate(type = case_when(value == "A" ~ 1,
value == "B" ~ 2,
value == "C" ~ 3),
value = NA)) %>%
arrange(ID)
ID type value
1 1 oth A
2 1 oth B
3 1 oth B
4 1 1 <NA>
5 1 2 <NA>
6 1 2 <NA>
7 2 oth C
8 2 oth D
9 2 3 <NA>
10 3 oth D
This would be my approach. I didn't use dplyr, but the row order seemed important:
my_df <- data.frame(ID = c("1", "1", "1", "2", "2", "3"), type = c("oth", "oth", "oth", "oth", "oth", "oth"), value = c("A", "B", "B", "C", "D", "D"))
# Copy the rows to recode; this also works (giving zero rows) when nothing matches
my_temp <- my_df[my_df$value %in% c("A", "B", "C"), ]
my_var <- which(my_temp$value == "A")
if (length(my_var)) {
my_temp[my_var, "type"] <- 1
}
my_var <- which(my_temp$value == "B")
if (length(my_var)) {
my_temp[my_var, "type"] <- 2
}
my_var <- which(my_temp$value == "C")
if (length(my_var)) {
my_temp[my_var, "type"] <- 3
}
my_df <- rbind(my_temp, my_df)
my_df <- my_df[order(my_df$ID, my_df$value),]
my_var <- which(my_df$type == "oth" & my_df$value %in% c("A", "B", "C"))
if (length(my_var)) {
my_df[my_var, "value"] <- NA
}
Here is potentially another dplyr option.
First, create a vector vec with the specific categories to match and to obtain numeric values for.
Then, you can create groups based on whether each row's value is contained in vec: every row whose value is in vec starts a new group. This allows you to insert rows by combining rows with rbind.
Within each group, the first row is duplicated with its type converted to a number. The original rows follow, with value set to NA if it is in vec and kept as it is otherwise.
This seems to work with your example data. Let me know if this meets your needs.
library(dplyr)
vec <- c("A", "B", "C")
df %>%
group_by(grp = cumsum(value %in% vec)) %>%
do(rbind(
mutate(head(., 1), type = match(value, vec)),
mutate(., value = ifelse(value %in% vec, NA, value)))) %>%
ungroup() %>%
select(-grp)
Output
ID type value
<chr> <chr> <chr>
1 1 1 A
2 1 oth NA
3 1 2 B
4 1 oth NA
5 1 2 B
6 1 oth NA
7 2 3 C
8 2 oth NA
9 2 oth D
10 3 oth D
I am trying to merge two data frames (df1 and df2) based on two keys (KEY1 and KEY2). However, in df1, KEY1 is not unique. I want to merge df1 and df2 only where KEY1 is unique. I generated a count variable that counts the number of occurrences of KEY1, so I want to merge df1 and df2 only if count equals 1.
Here is an example data frame:
df1 <- data.frame(KEY1 = c("a", "a", "b", "c", "d"),
                  count = c("2", "2", "1", "1", "1"))
df2 <- data.frame(KEY2 = c("a", "b", "c", "d", "e"),
                  value = c("85", "25", "581", "12", "4"))
My question is: how to perform the merge only if count equals 1?
df1 <- if(count==1,merge(df1, df2, by.x=KEY1, by.y=KEY2, all.x=TRUE), ?)
My goal is to get this:
df1 <- data.frame(KEY1 = c("a", "a", "b", "c", "d"),
                  count = c("2", "2", "1", "1", "1"),
                  value = c(NA, NA, "25", "581", "12"))
You can perform a join and change the values to NA if count is not 1.
library(dplyr)
inner_join(df1, df2, by = c('KEY1' = 'KEY2')) %>%
mutate(value = replace(value, count != 1, NA))
# KEY1 count value
#1 a 2 <NA>
#2 a 2 <NA>
#3 b 1 25
#4 c 1 581
#5 d 1 12
Similarly, in base R -
merge(df1, df2, by.x = 'KEY1', by.y = 'KEY2') |>
transform(value = replace(value, count != 1, NA))
data
df1 <- data.frame(KEY1 = c("a", "a", "b", "c", "d"),
count = c("2", "2", "1", "1", "1"))
df2 <- data.frame(KEY2 = c("a", "b", "c", "d", "e"),
value = c("85", "25", "581", "12", "4"))
If you insist on using base R, what you are looking for is the incomparables argument of merge. Key values included in it are not matched.
tab <- table(df1$KEY1)
tab
merge(df1, df2, by.x="KEY1", by.y="KEY2", all.x=TRUE,
incomparables = names(tab)[tab>1])
The output is:
KEY1 count value
1 a 2 <NA>
2 a 2 <NA>
3 b 1 25
4 c 1 581
5 d 1 12
You could use:
library(dplyr)
df1 %>%
mutate(
value = if_else(count == "1" & KEY1 %in% df2$KEY2,
tibble::deframe(df2)[KEY1],
NA_character_)
)
which returns
KEY1 count value
1 a 2 <NA>
2 a 2 <NA>
3 b 1 25
4 c 1 581
5 d 1 12
Or the same as base R:
transform(
df1,
value = ifelse(df1$count == 1,
`names<-`(df2$value, df2$KEY2)[df1$KEY1],
NA_character_)
)
Using data.table
library(data.table)
setDT(df1)[df2, value := NA^(count != 1) * value, on = .(KEY1 = KEY2)]
-output
> df1
KEY1 count value
1: a 2 NA
2: a 2 NA
3: b 1 25
4: c 1 581
5: d 1 12
NOTE: In the example data the numeric columns are created as character. Assuming they are actually of class numeric, do a join on the KEY columns and assign value to 'df1', converting it to NA wherever the 'count' column is not 1.
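The NA^(count != 1) factor relies on NA^0 being 1 and NA^1 being NA in R, so multiplying by it keeps value where count is 1 and turns it into NA everywhere else. A minimal base R sketch of the same trick, assuming (as the note says) that count and value are numeric; looked_up and mask are made-up names.

```r
# Same example data, but with numeric count and value columns (an assumption)
df1 <- data.frame(KEY1  = c("a", "a", "b", "c", "d"),
                  count = c(2, 2, 1, 1, 1))
df2 <- data.frame(KEY2  = c("a", "b", "c", "d", "e"),
                  value = c(85, 25, 581, 12, 4))

# Look up each KEY1 in df2 (NA if the key is missing from df2)
looked_up <- df2$value[match(df1$KEY1, df2$KEY2)]

# FALSE -> NA^0 = 1 (keep the value), TRUE -> NA^1 = NA (blank it)
mask <- NA^(df1$count != 1)

df1$value <- mask * looked_up
df1
```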
I have a data frame that contains around 700 cases with 1800 examinations. Some cases underwent several different modalities. I want to leave only one examination result based on the specific condition of the modality.
Here is a dummy data frame:
df <- data.frame (ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0))
There are five cases with 10 exams. [c1] is the exam modality (condition), and the results are x1 and x2.
I want to leave only one row based on the following condition:
C > B > A
I want to leave the row with C first; if not, leave the row with B; If C and B are absent, leave the row with A.
Desired output:
output <- data.frame (ID = c("1", "2", "3", "4", "5"),
c1 = c("C", "C", "A", "B", "C"),
x1 = c(5, 1, 3, 2, 5),
x2 = c(7, 1, 2, 7, 0))
You can arrange the data in the required order and, for each ID, select its first row.
library(dplyr)
req_order <- c('C', 'B', 'A')
df %>%
arrange(ID, match(c1, req_order)) %>%
distinct(ID, .keep_all = TRUE)
# ID c1 x1 x2
# <chr> <chr> <dbl> <dbl>
#1 1 C 5 7
#2 2 C 1 1
#3 3 A 3 2
#4 4 B 2 7
#5 5 C 5 0
In base R, this can be written as :
df1 <- df[order(df$ID, match(df$c1, req_order)), ]
df1[!duplicated(df1$ID), ]
Here is one approach:
df.srt <- df[order(df$c1, decreasing=TRUE), ]
df.spl <- split(df.srt, df.srt$ID)
first <- lapply(df.spl, head, n=1)
result <- do.call(rbind, first)
result
# ID c1 x1 x2
# 1 1 C 5 7
# 2 2 C 1 1
# 3 3 A 3 2
# 4 4 B 2 7
# 5 5 C 5 0
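Sorting c1 in decreasing alphabetical order works here only because C > B > A happens to be reverse alphabetical. For an arbitrary preference order, here is a base R sketch that ranks each modality explicitly and keeps the best-ranked row per ID. The names priority, rank_c1, keep and result are made up for illustration.

```r
df <- data.frame(ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
                 c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
                 x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
                 x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0),
                 stringsAsFactors = FALSE)

# Most preferred modality first; change this vector for a different priority
priority <- c("C", "B", "A")

# Rank of each row's modality (1 = most preferred)
rank_c1 <- match(df$c1, priority)

# Keep the rows whose rank equals the best (smallest) rank within their ID
keep <- rank_c1 == ave(rank_c1, df$ID, FUN = min)
result <- df[keep, ]
rownames(result) <- NULL
result
```

If an ID has the same modality more than once, this keeps all of those rows, so an extra !duplicated(result$ID) step would be needed in that case.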
I'd be very grateful if you could help me with the following; after a few tests I still haven't been able to get the right outcome.
I've got this data:
dd_1 <- data.frame(ID = c("1","2", "3", "4", "5"),
Class_a = c("a",NA, "a", NA, NA),
Class_b = c(NA, "b", "b", "b", "b"))
And I'd like to produce a new column 'CLASS':
dd_2 <- data.frame(ID = c("1","2", "3", "4", "5"),
Class_a = c("a",NA, "a", NA, NA),
Class_b = c(NA, "b", "b", "b", "b"),
CLASS = c("a", "b", "a-b", "b", "b"))
Thanks a lot!
Here it is:
tmp <- paste(dd_1$Class_a, dd_1$Class_b, sep='-')
tmp <- gsub('NA-|-NA', '', tmp)
(dd_2 <- cbind(dd_1, tmp))
First we concatenate (join as strings) the two columns. paste treats NAs as ordinary strings, i.e. "NA", so we get a-NA, NA-b, or a-b. Then we replace NA- or -NA with an empty string.
Which results in:
## ID Class_a Class_b tmp
## 1 1 a <NA> a
## 2 2 <NA> b b
## 3 3 a b a-b
## 4 4 <NA> b b
## 5 5 <NA> b b
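One caveat with the unanchored pattern: it also removes "NA-" or "-NA" in the middle of a real class label. A small sketch with made-up data (the label "DNA" is hypothetical) showing the problem, and a variant anchored to the start and end of the string:

```r
# Hypothetical data where a real label happens to end in "NA"
dd <- data.frame(Class_a = c("DNA", NA, "a"),
                 Class_b = c("b", "b", NA),
                 stringsAsFactors = FALSE)

pasted <- paste(dd$Class_a, dd$Class_b, sep = "-")   # "DNA-b" "NA-b" "a-NA"

# Unanchored: also strips the "NA-" inside "DNA-b"
naive <- gsub("NA-|-NA", "", pasted)

# Anchored: only strips a leading "NA-" or a trailing "-NA"
safe <- gsub("^NA-|-NA$", "", pasted)

data.frame(pasted, naive, safe)
```

Even the anchored version cannot tell a missing value from a label that is literally "NA", so the ifelse() or na.omit() approaches are safer if such labels can occur.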
Another option:
dd_1$CLASS <- with(dd_1, ifelse(is.na(Class_a), as.character(Class_b),
ifelse(is.na(Class_b), as.character(Class_a),
paste(Class_a, Class_b, sep="-"))))
This way you would check if any of the classes is NA and return the other, or, if none is NA, return both separated by "-".
Here's a short solution with apply:
dd_2 <- cbind(dd_1, CLASS = apply(dd_1[2:3], 1,
function(x) paste(na.omit(x), collapse = "-")))
The result
ID Class_a Class_b CLASS
1 1 a <NA> a
2 2 <NA> b b
3 3 a b a-b
4 4 <NA> b b
5 5 <NA> b b