I have a mortality dataframe with a character vector (rac) that contains a varying number of strings per row. These strings flag contributing causes of death. Sometimes the strings have an extra whitespace between them (see id = 4, 5, 8). Some codes have exactly 3 characters and others have 4. What I am trying to do is go through the data row by row and create a new column that flags whether a particular cause of death appears in rac or not. Here are the data.
tdf <- structure(list(id = 1:10, rac = c("I250", "K922 R628",
"C259 T149 X599", "K729 C80 J80 N288", "X72 S019", "C189",
"C259 A419 K746 N390", "C349 C787 C793 C795 F179 I10 J449",
"C349 J449 R628", "F03 N189 R628")), row.names = c(NA, -10L),
class = "data.frame")
Take id = 8, where I can easily create a flag called cause_C that notes when C793 or C795 is seen, with something like this snippet.
library(dplyr)
library(stringr)
causex <- "\\bC793|\\bC795"
tdf %>%
mutate(
cause_C = case_when(
str_detect(rac, causex) ~ 1,
TRUE ~ 0)
) -> tdf
It seems to work, but I would also like to sweep in instances where the code has only 3 characters, say C79; when this happens, cause_C should = 1. Matching on the 3-character prefix is also a more efficient way to create the flags, because then I don't have to spell out all possible versions of the code (C793, C794, C79, and so on), and I have some 16 likely causes of death to go through and flag. But if I try the following, cause_C ends up as 0 for every row, including id = 8.
tdf %>%
mutate(
cause_C = case_when(
str_sub(rac, 1, 3) == "C79" ~ 1,
TRUE ~ 0)
) -> tdf
There is something I am missing with the ifelse()/case_when() solution, and if anyone spots my mistake and the fix, I would be very appreciative! Base R, data.table, and dplyr solutions are all welcome, because I would be happy to see speed comparisons too, given that the dataframe is chewing up more than 1.5 gigs.
Thank you!
Ani
If you want to use data.table, would you consider splitting up the rows by diagnostic code, then using grepl to match against your vector of desired diagnoses?
library(data.table)
causex <- c("C793", "C795")
search_causex <- paste(causex, collapse = "|")
setDT(tdf, key = "rac")
tdf[, list(rac = unlist(strsplit(rac, " "))), by = id][
, result := grepl(search_causex, rac)][
result == TRUE]
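If you want to search by fewer characters, you could anchor the prefix at the start of each split code, so that C79 itself matches as well as C793, C795, and so on (a pattern such as "C79(.+)" would require at least one character after C79 and so miss the bare 3-character code):
search_causex <- "^C79"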
A tidyverse similar approach could be:
library(tidyverse)
tdf %>%
separate_rows(rac, sep = " ") %>%
filter(grepl(search_causex, rac) == TRUE)
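For the flag columns asked for in the question, splitting is not strictly required. The str_sub() attempt fails because str_sub(rac, 1, 3) looks only at the first three characters of the whole string ("C34" for id = 8), not at each code. Below is a minimal base-R sketch, assuming the codes are separated by whitespace. The cause_prefixes lookup and its flag names are hypothetical and only there for illustration.

```r
tdf <- data.frame(
  id = 1:10,
  rac = c("I250", "K922 R628", "C259 T149 X599", "K729 C80 J80 N288",
          "X72 S019", "C189", "C259 A419 K746 N390",
          "C349 C787 C793 C795 F179 I10 J449", "C349 J449 R628",
          "F03 N189 R628"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: flag column name -> one or more code prefixes
cause_prefixes <- list(cause_C79 = "C79", cause_J44 = "J44", cause_R62 = "R62")

for (flag in names(cause_prefixes)) {
  # \\b anchors the prefix at the start of a code, so "C79" matches
  # "C79", "C793", "C795", ... anywhere in the string
  pattern <- paste0("\\b(", paste(cause_prefixes[[flag]], collapse = "|"), ")")
  tdf[[flag]] <- as.integer(grepl(pattern, tdf$rac, perl = TRUE))
}

tdf[, c("id", names(cause_prefixes))]
```

Because grepl() is vectorised over rac, each flag costs one pass over the column, which may matter for a dataframe of this size.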
Related
I am trying to remove duplicate rows using a count variable (basic) and/or a coder variable (preferred). The goal is to take a database and then remove duplicates based on a count variable "recode" which is > 1 some of the time.
UPDATE: For clarification, the desired output is the df with just one entry per "id" variable below.
Here's my attempt at a MRE. I wrote a for-loop to try to remove ALL BUT one of the rows that have the same id variable based on "recode" > 1. This seems suboptimal based on some other threads I've looked at (though I've not seen a solution I understand). This MRE seems to actually work, but the method doesn't work with my real data.frame. I think it's because it just implements the 'last' deletion on the intact df, rather than all the deletions for each id.
x <- structure(list(A = c("5.1", "5.1", "6.2", "7.3", "5.1", "6.2", "6.2", "5.1"),
B = c(4, 4, 3, 1, 4, 3, 3, 4),
C = c("A","B","C","A","B","C","D" ,"E")),
.Names = c("id", "count", "coder"),
class = "data.frame",
row.names = c(NA, -8L))
for (i in c(1:length(x$id))) {
count <- x$count[i]
if (count > 1) {
indices <- which(x$id == x$id[i])
indices_sub <- indices[-1]
# print(indices)
# print(indices_sub)
x2 = x[-indices_sub,]
}
}
For context, my data is from a survey that is filled out by five RAs ("coder"). The five RAs code the same show sometimes. So many shows have 2-5 entries. I need to eliminate duplicates. A first solution is just keep the first one (which is what I'm starting...and struggling...to do here). But that's arbitrary and bad. A slightly-better solution step would be to retain the entry based on a coder hierarchy, let's call: A, B, C, then D. If A, then keep A; if not, then if B, keep B; if not, then if C, keep C; if not, then keep D.
The expected output is still a bit fuzzy to me, but I think you're looking for something akin to a SQL window function that identifies a "first" record per id according to some ordering criteria.
library(dplyr)
x %>%
group_by(id) %>%
arrange(id, count, coder) %>%
mutate(row_num = row_number()) %>%
filter(row_num == 1) %>%
# to remove helper column
select(-row_num)
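The same "first record per id" idea can be sketched in base R with an explicit coder hierarchy instead of relying on alphabetical order. This is only a sketch: coder_rank is a hypothetical name, and putting E last is an assumption, since the question only ranks A to D.

```r
x <- data.frame(
  id    = c("5.1", "5.1", "6.2", "7.3", "5.1", "6.2", "6.2", "5.1"),
  count = c(4, 4, 3, 1, 4, 3, 3, 4),
  coder = c("A", "B", "C", "A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Hypothetical preference order; earlier coders win
coder_rank <- c("A", "B", "C", "D", "E")

# Sort by id, then by position of the coder in the hierarchy
x_sorted <- x[order(x$id, match(x$coder, coder_rank)), ]

# Keep only the first (most preferred) row for each id
x_dedup <- x_sorted[!duplicated(x_sorted$id), ]
x_dedup
```

Unlike the loop in the question, this does all deletions in one step rather than overwriting x2 from the intact x on every iteration.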
I have the following dataframe with a string column and I want to extract T,N,M,G,L status (and so on..) for each observation into separate new columns including their respective prefix and suffix. I have tried the grep() and strsplit function but the resulting columns have differing number of rows due to NA values and it doesn't seem to work. I'm not an expert in coding and I'd really appreciate your support for a working script. Thanks in advance.
df <- data.frame(input=c("cT1b;cN1a;cM0;G3",
"pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
"cT3;cN0;M0"))
The expected output should look like
df <- data.frame(input=c("cT1b;cN1a;cM0;G3",
"pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
"cT3;cN0;M0" ),
T_output=c("cT1b","pT1a","cT3"),
G_output=c("G3","G1",NA),
L_output=c(NA,"L0",NA))
grep is typically for finding (true/false) strings or occasionally returning whole strings that contain a substring (value=TRUE), but not for extracting substrings from a whole string. For that, one might look into sub/gsub or gregexpr or stringr::str_extract/str_extract_all for extracting substrings. However, I think that's not the best (well, certainly not the only) approach.
Try this:
library(dplyr)
dat %>%
select(input) %>%
mutate(
bind_rows(lapply(
strsplit(input, ";"),
function(S) as.data.frame(lapply(setNames(nm = c("T", "G", "L")),
function(z) paste0(grep(pattern = z, x = S, value = TRUE), collapse = ";"))))),
across(one_of(c("T","G","L")), ~ ifelse(nzchar(.), ., .[NA]))
)
# input T G L
# 1 cT1b;cN1a;cM0;G3 cT1b G3 <NA>
# 2 pT1a;pN0;cM0;G1;L0;V0;Pn0;R0 pT1a G1 L0
# 3 cT3;cN0;M0 cT3 <NA> <NA>
Note: it is obviously doing nothing with the M or N substrings, which might be intentional or whatever. If you want them too, you can do setNames(nm=c("T","G","L","N")) (and again the second time within one_of) to get another upper-letter column.
Data
dat <- structure(list(input = c("cT1b;cN1a;cM0;G3", "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0", "cT3;cN0;M0")), class = "data.frame", row.names = c(NA, -3L))
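A base-R variant of the split-then-pick idea is sketched below. It assumes each status is an optional single lowercase prefix, then the uppercase letter, then a digit; that pattern is an assumption about the data, and extract_status is a hypothetical helper name. Returning NA when nothing matches keeps every output column the same length as the input, which avoids the differing-row-count problem described in the question.

```r
dat <- data.frame(
  input = c("cT1b;cN1a;cM0;G3", "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0", "cT3;cN0;M0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first ";"-separated piece matching the given letter
extract_status <- function(x, letter) {
  parts <- strsplit(x, ";", fixed = TRUE)
  pattern <- paste0("^[a-z]?", letter, "[0-9]")
  vapply(parts, function(p) {
    hit <- grep(pattern, p, value = TRUE)
    if (length(hit) == 0) NA_character_ else hit[1]
  }, character(1))
}

# One column per requested letter
for (letter in c("T", "N", "M", "G", "L")) {
  dat[[paste0(letter, "_output")]] <- extract_status(dat$input, letter)
}
dat
```

Because the pattern is case-sensitive and anchored, "Pn0" is not picked up as an N status.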
First off - newbie with R so bear with me. I'm trying to recode string values as numeric. My problem is I have two different string patterns present in my values: "M" and "B" for 'million' and 'billion', respectively.
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"))
I've successfully knocked off the dollar sign and now have:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
fundsR = c("1.76M", "2B", "57M", "9.87B")
)
How can I recode these as numeric while retaining their respective monetary values? I've tried using various if statements, for loops, with or without str_detect, pipe operators, case_when, mutate, etc. to isolate values with "M" and values with "B", convert them to numeric, and multiply to come up with the corresponding numeric value, all in a new column. This seemingly simple task turned out not to be as simple as I imagined, and I'd attribute that to being a novice. At this point I'd like to start from scratch and see if anyone has any fresh ideas. My RStudio is a MESS.
Something like this would be nice:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
fundsR = c("1.76M", "2B", "57M", "9.87B"),
fundsFinal = c(1760000, 2000000000, 57000000, 9870000000)
)
I'd really appreciate your input.
You could create a helper function f, and then apply it to the funds column:
library(dplyr)
library(stringr)
f <- function(x) {
curr = c("M"=1e6, "B" = 1e9)
val = str_remove(x,"\\$")
as.numeric(str_remove_all(val,"B|M"))*curr[str_extract(val, "B|M")]
}
df %>% mutate(fundsFinal = f(funds))
Output:
funds fundsFinal
1 $1.76M 1.76e+06
2 $2B 2.00e+09
3 $57M 5.70e+07
4 $9.87B 9.87e+09
Input:
df = structure(list(funds = c("$1.76M", "$2B", "$57M", "$9.87B")), class = "data.frame", row.names = c(NA,
-4L))
This works but I'm sure better solutions exist. Assuming funds is a character vector:
library(tidyverse)
options(scipen = 999)
df <- data.frame(funds = c('$1.76M', '$2B', '$57M', '$9.87B'))
df = df %>%
mutate( fundsFinal = ifelse(str_sub(funds,nchar(funds),-1) =='M',
as.numeric(substr(funds, 2, nchar(funds) - 1))*10^6,
as.numeric(substr(funds, 2, nchar(funds) - 1))*10^9))
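One difference between the two answers is worth noting: the ifelse() version treats every value that does not end in "M" as billions, while a lookup vector yields NA for an unrecognised suffix. A base-R sketch of the lookup approach follows; parse_funds is a hypothetical name, and it assumes the suffix is always the last character.

```r
# Hypothetical helper: "$1.76M" -> 1760000, unknown suffix -> NA
parse_funds <- function(x) {
  multipliers <- c(M = 1e6, B = 1e9)
  suffix <- substr(x, nchar(x), nchar(x))        # last character
  number <- as.numeric(gsub("[^0-9.]", "", x))   # keep digits and decimal point
  unname(number * multipliers[suffix])
}

df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
                 stringsAsFactors = FALSE)
df$fundsFinal <- parse_funds(df$funds)
df
```

Getting NA for something like "$3K" makes unexpected inputs visible instead of silently scaling them by 10^9.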
I am trying to merge two datasets on key values and string patterns.
Basically, I would like a function to count the number of sub-string matching occurrences, conditional on other key variables matching across the two datasets.
Across two datasets, base and merge, when the variables k1 and k2 match and merge$str2 is included in base$str, I would like a function to count one match. Specifically, I would like whole words to match whole words (e.g. "BB" is not a match for "BBB", as if the first were written as "\bBB\b"). I have multi-million-row datasets to merge, so efficiency is paramount.
I have a possible solution, but I am not sure it is the best across very large datasets.
base <- data.frame(k1=rep(1,4),
k2=c(rep(1,3),2),
str=c("AA BBB","BB CCC","CCC","CCC"))
merge <- data.frame(k1=rep(1,2),
k2=rep(1,2),
str2=c("BB","CCC"))
library(stringr)
library(dplyr)
library(tidyr)
base %>%
left_join(merge) %>%
mutate(match=stringi::stri_detect_regex(str, paste0("\\b",str2,"\\b")),
match=replace_na(ifelse(match==T,1,0),0)) %>%
group_by(k1,k2,str) %>%
summarise(matches=sum(match))
desired_result <- data.frame(k1=rep(1,4),
k2=c(rep(1,3),2),
str=c("AA BBB","BB CCC","CCC","CCC"),
match=c(0,2,1,0))
Whenever efficiency is an issue, data.table is worth considering.
Maybe this could help you:
library(data.table)
m <- merge(as.data.table(base),
as.data.table(merge),
by = c("k1", "k2"),
all = TRUE,
allow.cartesian=TRUE)
m[, i := grepl(paste0("\\b", str2, "\\b"), str), by = 1:nrow(m)]
m[, .(match = sum(i, na.rm = TRUE)), by = c("k1","k2", "str")]
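If every str2 value is a single whole word, the regex can be avoided altogether: split str into words and join on exact equality, which also avoids a row-by-row grepl(). The base-R sketch below rests on that assumption and on words being separated by whitespace. The question's merge data frame is renamed lookup here to keep it distinct from the merge() function.

```r
base <- data.frame(k1 = rep(1, 4), k2 = c(rep(1, 3), 2),
                   str = c("AA BBB", "BB CCC", "CCC", "CCC"),
                   stringsAsFactors = FALSE)
lookup <- data.frame(k1 = rep(1, 2), k2 = rep(1, 2),
                     str2 = c("BB", "CCC"), stringsAsFactors = FALSE)

# One row per (base row, distinct word)
words <- lapply(strsplit(base$str, "[[:space:]]+"), unique)
n_words <- lengths(words)
long <- data.frame(row_id = rep(seq_len(nrow(base)), n_words),
                   k1 = rep(base$k1, n_words),
                   k2 = rep(base$k2, n_words),
                   str2 = unlist(words),
                   stringsAsFactors = FALSE)

# Exact-equality join on the keys plus the word itself
hits <- merge(long, unique(lookup), by = c("k1", "k2", "str2"))

# Count hits per original base row (rows without hits get 0)
base$match <- tabulate(hits$row_id, nbins = nrow(base))
base
```

The same reshaping should carry over to data.table or dplyr joins if base R's merge() proves too slow at scale.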
I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i, ]
Thank you in advance for your help!
I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
mutate(ones = 1)
joined = combinations_from_2_letters %>%
inner_join(combinations_from_4_letters, by = "ones") %>%
# comparison goes here
mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
group_by(comb2) %>%
summarise(freq = sum(within))
You'll probably need to modify this so that it matches your exact column names (comb2 and comb4 above are placeholders) and your comparison condition.
Key ideas:
- add a filler column so we have a complete cross-join
- mutate a new indicator column for whether the two-letter pair is within the four-letter combination
- sum the indicators by two-letter pair
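An alternative that avoids the cross-join is to expand each four-letter combination into its six pairs and tabulate pair by group. The sketch below uses base R's combn() in place of gtools::combinations(), on the assumption that both produce the same lexicographic ordering. The names combos4, pair_matrix, and freq_table are illustrative.

```r
v1 <- LETTERS[1:10]

# All 210 four-letter combinations, binned into 15 groups of 14
combos4 <- as.data.frame(t(combn(v1, 4)), stringsAsFactors = FALSE)
combos4$group <- rep(1:15, each = 14)

# For each combination, build its 6 two-letter pairs (e.g. "AB", "AC", ...);
# the result is a 6 x 210 character matrix, one column per combination
pair_matrix <- apply(combos4[, 1:4], 1,
                     function(row) combn(row, 2, FUN = paste, collapse = ""))

# Long format: one row per (combination, pair), carrying the group label
pair_long <- data.frame(group = rep(combos4$group, each = nrow(pair_matrix)),
                        pair = as.vector(pair_matrix),
                        stringsAsFactors = FALSE)

# Frequency table: rows are pairs, columns are groups
freq_table <- table(pair_long$pair, pair_long$group)
freq_table["AB", ]
```

Each pair label is in alphabetical order because the letters within every combination are already sorted.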