Indicate which corresponding columns have a TRUE indicator - r

I have the following dataset:
df<-data.frame(
identifer=c(1,2,3,4),
DF=c("Tablet","Powder","Suspension","System"),
DF_source1=c("Capsule","Powder,Metered","Tablet",NA),
DF_source2=c(NA,NA,"Tablet",NA),
DF_source3=c("Tablet, Extended Release","Liquid","Tablet",NA),
Route_source1=c("Oral","INHALATION","Oral",NA),
Route_source2=c(NA,"TOPICAL","Oral",NA),
Route_source3=c("Oral","IRRIGATION","oral",NA))
I want to know which DF_source matches DF, and additionally which associated Route I should take.
I want the output to look like this:
df_out<-data.frame(
identifer=c(1,2,3,4),
DF=c("Tablet","Powder","Suspension","System"),
DF_match=c("Tablet, Extended Release","Powder,Metered;Powder",NA,NA),
Route_match=c("Oral","INHALATION;TOPICAL",NA,NA),
DF_match_count=c(1,2,0,0),
DF_route_count=c(1,2,0,0))
I tried this, but I'm not sure how to pull the values for DF_match and Route_match:
df%>%mutate_at(vars(matches("(DF_source)")),
list(string_detect = ~str_detect(tolower(DF),tolower(str_replace_all(.,"/|,(\\s)?|(?<!,)\\s","|")))))
Any help would be appreciated, thanks!

I'm not entirely sure this is what you have in mind, but hope this might help.
Your end result appears not to match your example data (e.g. TOPICAL is missing).
This might be easier in a tidier form with pivot_longer.
Edit: If columns are factors, convert to character for str_detect in filter.
library(tidyverse)
library(stringr)
df %>%
mutate_if(is.factor, as.character) %>%
pivot_longer(cols = -c(identifer, DF), names_to = c(".value", "number"), names_pattern = "(\\w+)(\\d+)") %>%
filter(str_detect(DF_source, DF)) %>%
group_by(identifer) %>%
summarise(DF_match = paste(DF_source, collapse = ';'),
Route_match = paste(Route_source, collapse = ';'),
match_count = n()) %>%
right_join(df[,c("identifer", "DF")], by = "identifer") %>%
select(c(identifer, DF, DF_match, Route_match, match_count))
Output
# A tibble: 4 x 5
  identifer DF         DF_match                 Route_match        match_count
      <dbl> <chr>      <chr>                    <chr>                    <int>
1         1 Tablet     Tablet, Extended Release Oral                         1
2         2 Powder     Powder,Metered;Powder    INHALATION;TOPICAL           2
3         3 Suspension NA                       NA                          NA
4         4 System     NA                       NA                          NA
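The question asks for a count of 0 rather than NA when nothing matches, and its own attempt lowercases both sides before comparing. A minimal sketch of one way to get that, assuming the data exactly as posted in the question (where DF_source2 is NA for identifer 2, so that row has a single match) and assuming DF can safely be used as a regular expression; the column name hit is hypothetical:

```r
library(dplyr)
library(tidyr)
library(stringr)

# Data as posted in the question
df <- data.frame(
  identifer = c(1, 2, 3, 4),
  DF = c("Tablet", "Powder", "Suspension", "System"),
  DF_source1 = c("Capsule", "Powder,Metered", "Tablet", NA),
  DF_source2 = c(NA, NA, "Tablet", NA),
  DF_source3 = c("Tablet, Extended Release", "Liquid", "Tablet", NA),
  Route_source1 = c("Oral", "INHALATION", "Oral", NA),
  Route_source2 = c(NA, "TOPICAL", "Oral", NA),
  Route_source3 = c("Oral", "IRRIGATION", "oral", NA),
  stringsAsFactors = FALSE
)

out <- df %>%
  pivot_longer(cols = -c(identifer, DF),
               names_to = c(".value", "number"),
               names_pattern = "(\\w+)(\\d+)") %>%
  # Flag matches instead of filtering, so rows without any match survive.
  # NA sources give NA from str_detect(); coalesce() turns that into FALSE.
  mutate(hit = coalesce(str_detect(tolower(DF_source), tolower(DF)), FALSE)) %>%
  group_by(identifer, DF) %>%
  summarise(
    DF_match    = if (any(hit)) paste(DF_source[hit], collapse = ";") else NA_character_,
    Route_match = if (any(hit)) paste(Route_source[hit], collapse = ";") else NA_character_,
    match_count = sum(hit),
    .groups = "drop"
  )
```

Because no rows are filtered out, the right_join back to df is not needed here, and match_count is 0 instead of NA for identifers without a match.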

Related

Group strings that have the same words but in a different order

I have an example concatenated text field (please see sample data below) that is created from two or three different fields, however there is no guarantee that the order of the words will be the same. I would like to create a new dataset where fields with the same words, regardless of order, are collapsed. However, since I do not know in advance what words will be concatenated together, the code will have to recognize that all words in both strings match.
Code for example data:
var1<-c("BLUE|RED","RED|BLUE","WHITE|BLACK|ORANGE","BLACK|WHITE|ORANGE")
freq<-c(1,1,1,1)
have<-as.data.frame(cbind(var1,freq))
Have:
var1 freq
BLUE|RED 1
RED|BLUE 1
WHITE|BLACK|ORANGE 1
BLACK|WHITE|ORANGE 1
How can I collapse the data into what I want below?
color freq
BLUE|RED 2
WHITE|BLACK|ORANGE 2
data.frame(table(sapply(strsplit(have$var1, '\\|'),
function(x)paste(sort(x), collapse = '|'))))
Var1 Freq
1 BLACK|ORANGE|WHITE 2
2 BLUE|RED 2
In the world of piping (the native pipe |> and the \(x) lambda shorthand require R >= 4.1):
have$var1 |>
strsplit('\\|')|>
sapply(\(x)paste0(sort(x), collapse = "|"))|>
table()|>
data.frame()
Here is a tidyverse approach:
library(dplyr)
library(tidyr)
have %>%
group_by(id=row_number()) %>%
separate_rows(var1) %>%
arrange(var1, .by_group = TRUE) %>%
mutate(var1 = paste(var1, collapse = "|")) %>%
slice(1) %>%
ungroup() %>%
count(var1, name = "freq")
var1 freq
<chr> <int>
1 BLACK|ORANGE|WHITE 2
2 BLUE|RED 2
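The answers above label each group with the sorted words (BLACK|ORANGE|WHITE), while the desired output keeps the first spelling seen (WHITE|BLACK|ORANGE) and sums freq. A minimal base R sketch of that variant, assuming freq is numeric (built with data.frame rather than cbind) and that the first occurrence should supply the label; key, first_row and want are hypothetical names:

```r
var1 <- c("BLUE|RED", "RED|BLUE", "WHITE|BLACK|ORANGE", "BLACK|WHITE|ORANGE")
freq <- c(1, 1, 1, 1)
have <- data.frame(var1, freq, stringsAsFactors = FALSE)

# Order-independent key: split on "|", sort the words, glue them back together
key <- vapply(strsplit(have$var1, "|", fixed = TRUE),
              function(x) paste(sort(x), collapse = "|"),
              character(1))

# Keep the first spelling of each key as the label and sum freq per key
first_row <- !duplicated(key)
want <- data.frame(
  color = have$var1[first_row],
  freq  = as.numeric(tapply(have$freq, factor(key, levels = key[first_row]), sum)),
  stringsAsFactors = FALSE
)
```

Setting the factor levels to key[first_row] keeps the sums in the same order as the labels.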

R: unique column values, combine rows of second column

From a data frame I need a list of all unique values of one column. For a possible later check, we need to keep the information from a second column, combined into a single value for simplicity.
Sample data
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df
id source
1 1 x
2 3 y
3 1 z
The desired outcome is
df2
id source
1 1 x,z
2 3 y
It should be pretty easy, still I cannot find the proper function / grammar?
E.g. something like
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ","))
or
df %>%
distinct(id) %>%
summarise(vlist = paste0(source, collapse = ","))
What am I missing? Thanks for any advice!
You can use aggregate from stats to combine per group.
aggregate(source ~ id, df, paste, collapse = ",")
# id source
#1 1 x,z
#2 3 y
Using your code here is a solution:
library(dplyr)
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ",")) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 2
id vlist
<dbl> <chr>
1 1 x,z
2 3 y
Your second approach doesn't work because you call distinct before you aggregate the data. Also, you need to use .keep_all = TRUE to also keep the other column.
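Your first approach already returns one row per id, because summarise collapses each group; the trailing distinct in the code above is only a safeguard and can be dropped.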
aggregate(source ~ id, df, toString)
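The two aggregate calls in this thread differ only in the separator: toString(x) is equivalent to paste(x, collapse = ", "), so it adds a space after each comma. A small sketch on the sample data, with by_paste and by_tostring as hypothetical names:

```r
df <- data.frame(id = c(1, 3, 1), source = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# Explicit separator: no space after the comma
by_paste <- aggregate(source ~ id, df, paste, collapse = ",")

# toString() uses ", " as its separator
by_tostring <- aggregate(source ~ id, df, toString)
```

Which one fits depends on whether the combined value will be split again later; a separator without the extra space is slightly easier to split.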

Fairly new to R, can anyone tell me the difference between these queries?

penguins %>% group_by(island, species) %>% drop_na() %>%
summarise(meaxbill = max(penguins$bill_length_mm))
penguins %>% group_by(island, species) %>% drop_na() %>%
summarise(meaxbill = max(bill_length_mm))
I'll word it a little more strongly: when using the pipe operator %>% and the dplyr package, you should not use the dataframe name with the column names ($-indexing); while it works sometimes, if anything in the pipeline removes, adds, or reorders the rows, then your subsequent calculations will be wrong. It isn't that you don't need to assign the dataframe name, it's that if you do use it then you are likely corrupting your data. The first code is broken, do not trust it. (Whether it is truly corrupted or not may be contextual; I don't know if it corrupts it here.)
Let me demonstrate. If we want to know the max bill length (mm) of all of the penguins, by sex, we should do something like this:
library(dplyr)
data("penguins", package = "palmerpenguins")
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(maxbill = max(bill_length_mm))
# # A tibble: 2 x 2
# sex maxbill
# <fct> <dbl>
# 1 female 58
# 2 male 59.6
If for some reason we instead use penguins$bill_length_mm, then we'll see this:
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(maxbill = max(penguins$bill_length_mm))
# # A tibble: 2 x 2
# sex maxbill
# <fct> <dbl>
# 1 female NA
# 2 male NA
which will likely encourage us to add na.rm=TRUE to the max call, and we'll get a seemingly valid-ish number:
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(maxbill = max(penguins$bill_length_mm, na.rm = TRUE))
# # A tibble: 2 x 2
# sex maxbill
# <fct> <dbl>
# 1 female 59.6
# 2 male 59.6
but the problem is that max(.) is being passed all of penguins$bill_length_mm, not just the values within each group.
In this case, the use of penguins$ is not a syntax error, it is a logical error, and there is no way for dplyr or anything else in R to know that what you are doing is not what you really need. It works, because max(.) sees a vector and it returns a single number; then summarize(.) sees a single number and assigns it to a new variable.
And in this case, our results are corrupted.
The only time it may be valid to use penguins$ in this is if we truly need to bring in a number or object from outside of the current "view" of the data. Realize that the data that summarize(.) sees is not the data that started in the pipe: it has been filtered (by drop_na()), it might be changed (if we mutated some columns into it) or reordered (if we arrange the data).
But if we need to find out the percentage of the max bill length with respect to the max of the original data, we might do this:
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(
maxbill = max(bill_length_mm),
maxbill_ratio = max(bill_length_mm) / max(penguins$bill_length_mm, na.rm = TRUE)
)
# # A tibble: 2 x 3
# sex maxbill maxbill_ratio
# <fct> <dbl> <dbl>
# 1 female 58 0.973
# 2 male 59.6 1
(Recall that we needed to add na.rm=TRUE in that call because the original data contains NA values in bill_length_mm ... and the data we see in that last max has not been filtered/cleaned by the drop_na() call.)
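The same effect can be reproduced without the palmerpenguins data. This is a minimal sketch on a made-up data frame (toy, grp and x are hypothetical names), showing that a bare column name is evaluated per group while toy$x always refers to the full original column:

```r
library(dplyr)

# Made-up data: two groups with clearly different maxima
toy <- data.frame(grp = c("a", "a", "b", "b"), x = c(1, 2, 10, 20))

res <- toy %>%
  group_by(grp) %>%
  summarise(
    right      = max(x),        # per-group values: 2 and 20
    also_right = max(.data$x),  # the .data pronoun also respects grouping
    wrong      = max(toy$x)     # the whole original column: 20 for both groups
  )
```

If an explicit prefix is wanted for readability, .data$x is the variant that still refers to the current group rather than to the data frame outside the pipe.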

How to transpose character data for unique IDs

I'm trying to perform a sum function to count the number of interactions for unique IDs.
So I have something like this:
Client ID
JOE12_EMI
ABC12_CANC
ABC12_EMI
ABC12_RENE
and so on...
It also has a column next to it that counts how many times each unique ID repeats.
Frequency
1
2
2
1
Is there a way that I can have all the activity types (EMI, TELI, PFL) summed for each ID and then placed into new columns?
I've tried to transpose the data by separating the actual ID from the activity type, but this doesn't return the sums. I'm not sure whether that's the best way, or whether I should transpose the data to wide format and then apply another sum function, and I am unsure how to go about it. Thank you for any help.
separate(frequency, id, c("id", "act_code") )
nd <- melt(frequency, id=(c("id")))
Try this:
library(dplyr)
data=data.frame(Client_ID= c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"),
frequency= c(1,2,2,1))
client_and_id <- as.data.frame(do.call(rbind, strsplit(as.character(data$Client_ID), "_")))
names(client_and_id) <- c("client", "id")
data <- cbind(data, client_and_id)
data_sum <- data %>% group_by(id) %>% mutate(sum_freq = sum(frequency))
The output
> data_sum
# A tibble: 4 x 5
# Groups:   id [3]
  Client_ID  frequency client id    sum_freq
  <fct>          <dbl> <fct>  <fct>    <dbl>
1 JOE12_EMI          1 JOE12  EMI          3
2 ABC12_CANC         2 ABC12  CANC         2
3 ABC12_EMI          2 ABC12  EMI          3
4 ABC12_RENE         1 ABC12  RENE         1
You can also display the output by ID:
distinct(data_sum %>% dplyr::select(id, sum_freq))
# A tibble: 3 x 2
# Groups: id [3]
id sum_freq
<fct> <dbl>
1 EMI 3
2 CANC 2
3 RENE 1
You're on the right track; I think the only thing you need is a group_by. Something like this:
library(dplyr)
library(tidyr)
df = data.frame(ClientID = c("JOE12_EMI",
"ABC12_CANC",
"ABC12_EMI",
"ABC12_RENE"))
df %>%
separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
group_by(id) %>%
mutate(frequency = n()) %>%
ungroup() %>%
group_by(id, act_code) %>%
mutate(act_frequency = n()) %>%
ungroup() %>%
spread(act_code, act_frequency)
(This does the sum by user and the pivot by activity type separately; it's possible to calculate the sum by user after pivoting, but this way is easier for me to read.)
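If the existing frequency column should be summed per client and activity type, and the activity types should become columns, a base R cross-tabulation is one possible route. This is a sketch under the assumption that every Client_ID has exactly one underscore separating client and activity code; parts, wide and wide_df are hypothetical names:

```r
data <- data.frame(
  Client_ID = c("JOE12_EMI", "ABC12_CANC", "ABC12_EMI", "ABC12_RENE"),
  frequency = c(1, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Split "CLIENT_ACTIVITY" into two columns (assumes exactly one "_" per value)
parts <- do.call(rbind, strsplit(data$Client_ID, "_", fixed = TRUE))
data$id <- parts[, 1]
data$act_code <- parts[, 2]

# Sum frequency for each client/activity pair; missing combinations become 0
wide <- xtabs(frequency ~ id + act_code, data = data)

# Plain data frame with one row per client and one column per activity type
wide_df <- as.data.frame.matrix(wide)
```

rowSums(wide) then gives the total per client across all activity types.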

Parsing a Hierarchy in a String Value

I am trying to create an edge list from a single character vector. My list to be processed is over 93k elements long, but as an example I will provide a small excerpt.
The character strings are part of the ICD10 code hierarchy, and the parent-child relationships exist within the string. That means that a single string, "A0101", would have a parent of "A010".
It would look like this:
A00
A000
A001
A009
A01
A010
A0100
A0101
A02
A03
etc.
My vector does not contain any other data except the strings, but I basically need to convert
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")
into an edge list formatted as follows...
# (A00, A000)
# (A00, A001)
# (A00, A009)
# (A01, A010)
# (A010, A0100)
# (A010, A0101)
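Read literally, the rule described above says that a code's parent is the same code with its last character removed, provided that shorter code is itself present in the vector. A minimal base R sketch under exactly that assumption (it does not cover chapters or sub-chapters, and parent, has_parent and edges are hypothetical names):

```r
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")

# Candidate parent: the code minus its final character
parent <- substr(dat, 1, nchar(dat) - 1)

# Keep only pairs whose candidate parent actually occurs in the data
has_parent <- parent %in% dat
edges <- data.frame(parent = parent[has_parent],
                    child  = dat[has_parent],
                    stringsAsFactors = FALSE)
```

Both substr and %in% are vectorised, so this is a single pass over the vector and should scale to the 93k codes mentioned in the question, although that has not been measured here.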
I am fairly certain there are more efficient ways to accomplish this, but this code takes the ICD10 CM data from the icd.data package, uses the children detection from the icd package, and then makes extensive use of the tidyverse to return an edge list. I had to get a bit creative to connect the "top" of the hierarchies, since the data does not include the chapters and sub-chapters of ICD10 as individual 2 or 1 digit codes.
Basically sub-chapters become 2 digit codes, chapters become 1 digit codes, and then there is a root node to connect everything at the top.
library(icd.data)
icd10 <- icd10cm2016
library(icd)
code_children <- lapply(icd10$code, children)
code_vec <- sapply(code_children, paste, collapse = ",")
code_df <- as.data.frame(code_vec, stringsAsFactors = FALSE)
library(dplyr);library(stringr);library(tidyr)
code_df_new <- code_df %>%
mutate(parent = sapply(strsplit(code_vec,","), "[", 1)) %>%
separate(code_vec,
paste("code", 1:max(str_count(code_df$code_vec, ",")), sep ="."),
",",extra = "merge")
library(reshape2)
edgelist <- melt(code_df_new, id = "parent") %>%
filter(!is.na(value)) %>%
select(parent, child = value) %>%
arrange(parent)
edgelist <- subset(edgelist, edgelist$parent != edgelist$child)
edgelist <- subset(edgelist, nchar(edgelist$child) == nchar(edgelist$parent) + 1)
subchaps <- icd10 %>% select(three_digit, sub_chapter, chapter) %>%
mutate(two_digit = substr(three_digit, 1, 2)) %>%
select(parent = two_digit, child = three_digit) %>%
distinct()
chaps <- icd10 %>% select(three_digit, sub_chapter, chapter) %>%
mutate(
two_digit = substr(three_digit, 1, 2),
one_digit = substr(three_digit, 1, 1)) %>%
select(parent = one_digit, child = two_digit) %>%
distinct()
root <- icd10 %>% select(three_digit) %>%
mutate(parent = "root", child = substr(three_digit, 1, 1)) %>%
select(parent, child) %>%
distinct()
edgelist_final <- edgelist %>%
bind_rows(list(chaps, subchaps, root)) %>%
arrange(parent)
If anybody has any tips or methods to improve the efficiency of this code I am all ears. (eyes?)
On the assumption that the length of the node names in ICD10 fully defines the order (with shorter ones being parents), here's an approach that connects each node with its immediate parent, if available.
While I think the logic is legible here, I'd be curious to see what a more streamlined solution would look like.
# Some longer fake data to prove that it works acceptably
# with 93k rows (took a few seconds). These are just
# numbers of different lengths, converted to characters, but they
# should suffice if the assumption about length = order is correct.
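# Load packages first: the pipe is used right away, and str_length() comes from stringr
library(dplyr); library(tidyr); library(stringr)
set.seed(42)
fake <- runif(93000, 0, 500) %>%
magrittr::raise_to_power(3) %>%
as.integer() %>%
as.character()
# Step 1 - prep
# tibble(value = ...) replaces the deprecated as_data_frame()
fake_2 <- tibble(value = fake) %>%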
mutate(row = row_number()) %>%
# Step 2 - widen by level and fill in all parent nodes
mutate(level = str_length(value)) %>%
spread(level, value) %>%
fill(everything()) %>%
# Step 3 - Get two highest non-NA nodes
gather(level, code, -row) %>%
arrange(row, level) %>%
filter(!is.na(code)) %>%
group_by(row) %>%
top_n(2, wt = level) %>%
# Step 4 - Spread once more to get pairs
mutate(pos = row_number()) %>%
ungroup() %>%
select(-level) %>%
spread(pos, code)
Output on OP data
# A tibble: 9 x 3
row `1` `2`
<int> <chr> <chr>
1 1 A00 NA
2 2 A00 A000
3 3 A00 A001
4 4 A00 A009
5 5 A01 A009
6 6 A01 A010
7 7 A010 A0100
8 8 A010 A0101
9 9 A010 A0101
Output on 93k fake data
> head(fake, 10)
[1] "55174190" "50801321" "46771275" "6480673"
[5] "20447474" "879955" "4365410" "11434009"
[9] "5002257" "9200296"
> head(fake_2, 10)
# A tibble: 10 x 3
row `1` `2`
<int> <chr> <chr>
1 1 55174190 NA
2 2 50801321 NA
3 3 46771275 NA
4 4 6480673 46771275
5 5 6480673 20447474
6 6 6480673 20447474
7 7 4365410 20447474
8 8 4365410 11434009
9 9 5002257 11434009
10 10 9200296 11434009
