Recode monetary string values into a new numeric variable in R

First off, I'm a newbie with R, so bear with me. I'm trying to recode string values as numeric. My problem is that I have two different string patterns present in my values: "M" and "B" for 'million' and 'billion', respectively.
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"))
I've successfully knocked off the dollar sign and now have:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
                 fundsR = c("1.76M", "2B", "57M", "9.87B"))
How can I recode these as numeric while retaining their respective monetary values? I've tried using various if statements, for loops, with or without str_detect, pipe operators, case_when, mutate, etc. to isolate values with "M" and values with "B", convert them to numeric, and multiply to arrive at the corresponding numeric value, all in a new column. This seemingly simple task turned out to be less simple than I imagined, which I'd attribute to being a novice. At this point I'd like to start from scratch and see if anyone has any fresh ideas. My RStudio is a MESS.
Something like this would be nice:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
                 fundsR = c("1.76M", "2B", "57M", "9.87B"),
                 fundsFinal = c(1760000, 2000000000, 57000000, 9870000000))
I'd really appreciate your input.

You could create a helper function f, and then apply it to the funds column:
library(dplyr)
library(stringr)
f <- function(x) {
  curr <- c("M" = 1e6, "B" = 1e9)
  val <- str_remove(x, "\\$")
  as.numeric(str_remove_all(val, "B|M")) * curr[str_extract(val, "B|M")]
}
df %>% mutate(fundsFinal = f(funds))
Output:
   funds fundsFinal
1 $1.76M   1.76e+06
2    $2B   2.00e+09
3   $57M   5.70e+07
4 $9.87B   9.87e+09
Input:
df = structure(list(funds = c("$1.76M", "$2B", "$57M", "$9.87B")), class = "data.frame", row.names = c(NA,
-4L))

This works but I'm sure better solutions exist. Assuming funds is a character vector:
library(tidyverse)
options(scipen = 999)
df <- data.frame(funds = c('$1.76M', '$2B', '$57M', '$9.87B'))
df <- df %>%
  mutate(fundsFinal = ifelse(str_sub(funds, nchar(funds), -1) == 'M',
                             as.numeric(substr(funds, 2, nchar(funds) - 1)) * 10^6,
                             as.numeric(substr(funds, 2, nchar(funds) - 1)) * 10^9))
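For what it's worth, a base-R version of the same idea is also possible (a sketch, assuming funds is a character vector like "$1.76M"): strip the dollar sign and the suffix, convert to numeric, and scale with a named lookup vector.
# base-R sketch (not from either answer above); df repeats the example data
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"))
mult <- c(M = 1e6, B = 1e9)
suffix <- substr(df$funds, nchar(df$funds), nchar(df$funds))  # last character: "M" or "B"
df$fundsFinal <- as.numeric(gsub("[$MB]", "", df$funds)) * unname(mult[suffix])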

Related

Can I use lapply to check for outliers in comparison to values from all listed tibbles?

My data is imported into R as a list of 60 tibbles, each with 13 columns and 8 rows. I want to detect outliers, defined as values more than 2*sd from the mean, by comparing each value in column "2" to the mean of all values of column "2" in the same row across the tibbles.
I know that I am on the wrong path with these lines, as I am not comparing the individual values:
lapply(list, function(x){
  if (x$"2" > (mean(x$"2")) + (2*sd(x$"2")) || x$"2" < (mean(x$"2")) - (2*sd(x$"2"))) {}
})
I was also hoping to replace all values identified as outliers with the corresponding mean calculated from the 60 values in the same position, while keeping everything else unchanged, but I am quite unsure how to do that.
Thank you!
You haven't added example data, so I've made a quick and simple example to demonstrate my answer. I think the logic would be much more straightforward if you first combine the list of tibbles into a single tibble. That lets you do everything you want in a single dplyr pipe, ultimately identifying outliers by 1s in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1, 20, 1), 150),
                  colB = seq(0.1, 2.1, 0.1),
                  id = 1:21)

tibble2 <- tibble(colA = c(seq(101, 120, 1), -150),
                  colB = seq(21, 41, 1),
                  id = 1:21)

# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one.
# The 'id' column is essentially shorthand for an index.

tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')

res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
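As a possible follow-up (a sketch, not part of the answer above), once the outliers are flagged you can replace them with the corresponding mean and, if needed, split the combined tibble back into a list:
cleaned <- res %>%
  ungroup() %>%
  mutate(colA = ifelse(outlier == 1, meanA, colA)) %>%  # swap flagged values for the group mean
  select(tbl, id, colA, colB) %>%
  group_split(tbl)                                      # back to a list of tibbles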

Extract specific portion of a string and paste to a new column in R

I have the following dataframe with a string column, and I want to extract the T, N, M, G, L statuses (and so on) for each observation into separate new columns, including their respective prefix and suffix. I have tried the grep() and strsplit() functions, but the resulting columns have differing numbers of rows due to NA values, and it doesn't seem to work. I'm not an expert in coding and I'd really appreciate your support for a working script. Thanks in advance.
df <- data.frame(input = c("cT1b;cN1a;cM0;G3",
                           "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                           "cT3;cN0;M0"))
The expected output should look like
df <- data.frame(input = c("cT1b;cN1a;cM0;G3",
                           "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                           "cT3;cN0;M0"),
                 T_output = c("cT1b", "pT1a", "cT3"),
                 G_output = c("G3", "G1", NA),
                 L_output = c(NA, "L0", NA))
grep is typically for finding whether strings match (TRUE/FALSE via grepl, indices via grep) or occasionally for returning the whole strings that contain a substring (value = TRUE), but not for extracting substrings from a whole string. For that, one might look into sub/gsub, gregexpr, or stringr::str_extract/str_extract_all. However, I think that's not the best (well, certainly not the only) approach.
Try this:
library(dplyr)
dat %>%
  select(input) %>%
  mutate(
    bind_rows(lapply(
      strsplit(input, ";"),
      function(S) as.data.frame(lapply(setNames(nm = c("T", "G", "L")),
                                       function(z) paste0(grep(pattern = z, x = S, value = TRUE),
                                                          collapse = ";"))))),
    across(one_of(c("T", "G", "L")), ~ ifelse(nzchar(.), ., .[NA]))
  )
# input T G L
# 1 cT1b;cN1a;cM0;G3 cT1b G3 <NA>
# 2 pT1a;pN0;cM0;G1;L0;V0;Pn0;R0 pT1a G1 L0
# 3 cT3;cN0;M0 cT3 <NA> <NA>
Note: this does nothing with the M or N substrings, which may be intentional. If you want those too, use setNames(nm = c("T", "G", "L", "N")) (and the same vector again inside one_of) to get an additional column for each letter.
Data
dat <- structure(list(input = c("cT1b;cN1a;cM0;G3", "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0", "cT3;cN0;M0")), class = "data.frame", row.names = c(NA, -3L))
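If only a handful of statuses are needed, another option is to pull each one straight out of the whole string with stringr::str_extract (a sketch; the regular expressions are assumptions about the coding scheme shown in the question):
library(dplyr)
library(stringr)

dat %>%
  mutate(T_output = str_extract(input, "[cp]T[0-9][a-z]?"),  # e.g. cT1b, pT1a
         G_output = str_extract(input, "G[0-9]"),
         L_output = str_extract(input, "L[0-9]"))            # NA when the status is absent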

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- as.data.frame(combinations(n = 10, r = 4, v = v1),
                                             stringsAsFactors = FALSE)
combinations_from_4_letters$group <- rep(1:15, each = 14)

combinations_from_2_letters <- as.data.frame(combinations(n = 10, r = 2, v = v1),
                                             stringsAsFactors = FALSE)
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions or permutations. The combinations are binned into groups from 1-15. I want to find out how often each pair of the 10 letters (saved in dataframe 'combinations_from_2_letters') is found in each group (basically a frequency table). I started writing a complicated loop over both dataframes, but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i, ]
Thank you in advance for your help!
I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters <- combinations_from_4_letters %>%
  mutate(ones = 1)

combinations_from_2_letters <- combinations_from_2_letters %>%
  mutate(ones = 1)

joined <- combinations_from_2_letters %>%
  inner_join(combinations_from_4_letters, by = "ones") %>%
  # comparison goes here
  mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
  group_by(comb2) %>%
  summarise(freq = sum(within))
You'll probably need to modify this to match your exact column names and comparison condition; a concrete sketch follows the key ideas below.
Key ideas:
adding a filler column so we have a complete cross-join
mutate a new indicator column for whether the two-letter pair is within the four-letter combination
sum the indicators for each two-letter pair
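For concreteness, here is one way the sketch might look with the actual column names produced by gtools::combinations() (V1..V4 for the four-letter table, V1..V2 for the pairs); the renamed p1/p2 columns and the pair label are introduced only for readability and are assumptions, not the asker's data:
library(dplyr)

pairs <- combinations_from_2_letters %>%
  rename(p1 = V1, p2 = V2) %>%
  mutate(ones = 1, pair = paste(p1, p2))

quads <- combinations_from_4_letters %>%
  mutate(ones = 1)

pair_counts <- pairs %>%
  inner_join(quads, by = "ones") %>%              # complete cross-join
  mutate(within = (p1 == V1 | p1 == V2 | p1 == V3 | p1 == V4) &
                  (p2 == V1 | p2 == V2 | p2 == V3 | p2 == V4)) %>%
  group_by(pair, group) %>%                       # frequency of each pair per group
  summarise(freq = sum(within), .groups = "drop")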

ifelse/case_when with seemingly tricky strings in a character vector

I have a mortality dataframe with a character vector (rac) that contains varying strings per row. These strings flag contributing causes of death. Sometimes these strings have extra whitespace between them (see id = 4, 5, 8). Sometimes they have exactly 3 characters and at other times they have 4 characters. What I am trying to do is sweep through by row and create a new column that flags whether a particular cause of death is seen in rac or not. Here are the data.
tdf <- structure(list(id = 1:10, rac = c("I250", "K922 R628",
"C259 T149 X599", "K729 C80 J80 N288", "X72 S019", "C189",
"C259 A419 K746 N390", "C349 C787 C793 C795 F179 I10 J449",
"C349 J449 R628", "F03 N189 R628")), row.names = c(NA, -10L),
class = "data.frame")
Take id = 8, where I can easily create a flag called cause_C that notes when C793 or C795 is seen, with something like this snippet:
causex <- c("\\bC793|\\bC795")
tdf %>%
  mutate(
    cause_C = case_when(
      str_detect(rac, causex) ~ 1,
      TRUE ~ 0)
  ) -> tdf
It seems to work, but I would also like to catch instances where the code only shows 3 characters, say C79, and when this happens cause_C should equal 1. That would also be a more efficient way to create the flags, because I wouldn't have to spell out every possible version of the code (C793, C794, C79, and so on), and because I have some 16 likely causes of death to go through and flag. But if I try the following, id = 8 ends up as all 0s.
tdf %>%
  mutate(
    cause_C = case_when(
      str_sub(rac, 1, 3) == "C79" ~ 1,
      TRUE ~ 0)
  ) -> tdf
There is something I am missing with the ifelse()/case_when() solution, and if anyone spots my mistake and the fix, I would be very appreciative! And oh, base R, data.table, dplyr, all solutions are welcome, because I would be happy to see speed comparisons too, given that the dataframe is chewing up more than 1.5 GB.
Thank you!
Ani
If you want to use data.table, would you consider splitting up the rows by diagnostic code and then using grepl to match against your vector of desired diagnoses?
library(data.table)

causex <- c("C793", "C795")
search_causex <- paste(causex, collapse = "|")

setDT(tdf, key = "rac")

tdf[, list(rac = unlist(strsplit(rac, " "))), by = id][
  , result := grepl(search_causex, rac)][
    result == TRUE]
If you want to search by fewer characters, you could use this as the search pattern:
search_causex <- "C79(.+)"
A similar tidyverse approach could be:
library(tidyverse)
tdf %>%
  separate_rows(rac, sep = " ") %>%
  filter(grepl(search_causex, rac) == TRUE)
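If you'd rather keep the original one-row-per-id shape and just add the flag the question asks for, a stringr sketch could look like this (the "\\bC79" prefix pattern is an assumption about the coding; it catches C79 as well as C793, C795, and so on):
library(dplyr)
library(stringr)

tdf %>%
  mutate(cause_C = as.integer(str_detect(rac, "\\bC79")))  # 1 if any code starts with C79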

R: Convert frequency to percentage with only a selected number of columns

I would like to convert a dataframe of frequencies into a dataframe of row-wise percentages using dplyr.
My data set has the particularity of also containing other variables, and I just want to calculate the percentages for a set of columns defined by a vector of names. Plus, I want to use the dplyr library.
sim_dat <- function() abs(floor(rnorm(26)*10))
df <- data.frame(a = letters, b = sim_dat(), c = sim_dat(), d = sim_dat(),
                 z = LETTERS)
names_to_transform <- names(df)[2:4]
df2 <- df %>%
  mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
  mutate_each(function(x) x / sum_freq_codpos, names_to_transform)
# does not work
Any idea on how to do it? I have tried with mutate_at and mutate_each but I can't get it to work.
You're almost there:
df2 <- df %>%
  mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
  mutate_at(names_to_transform, funs(. / sum_freq_codpos))
The dot . roughly translates to "the object I am manipulating here", which in this call means "the focal variable in names_to_transform".
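On current dplyr (1.0 or later), where mutate_at() is superseded and funs() is deprecated, the same idea can be written with across() (a sketch, assuming the same df and names_to_transform as above):
library(dplyr)

df2 <- df %>%
  mutate(sum_freq_codpos = rowSums(across(all_of(names_to_transform)))) %>%
  mutate(across(all_of(names_to_transform), ~ .x / sum_freq_codpos))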
