Tidyverse: if_else + str_length + str_pad to mutate 1 column

I have found quite a few threads on each part of the code snippet I am trying to write, but none that combine them in the way I am trying to.
I have a dataframe of customer information.
1 column is a customer ID (CID), the 2nd column is the customer specific identifier (CSI)
That means a single customer ID can represent many specific customers from a bigger pool, and the CSI tells me which specific customer from that pool I am looking at.
Data would look like this:
data.frame("CID"=c("1","2","3","4","1","2","3","4"),
"Customer_Pool"=c("Art_Supplies", "Automotive_Supplies", "Office_Supplies", "School_Supplies",
"Art_Supplies", "Automotive_Supplies", "Office_Supplies", "School_Supplies"),
"CSI"=c("01","01","01","01","02","02","02","02"),
"Customer_name"=c("Janet","Jane", "Jill", "Jenna", "Joe", "Jim", "Jack", "Jimmy"))
I am trying to combine the CID and CSI numbers. The problem is that I need every CID to be two digits (01 instead of 1, for example) to match the CIDs from 10-99.
Here is what I have been trying:
DF <- DF %>% mutate(CID = if_else(str_length(CID = 1),
str_pad(CID, width = 2, side = "left), CID))
The error I am getting says: error in str_length(CID = 1): unused argument (CID = 1)
How would I correct this?

You have some syntax issues here. Try
DF <- DF %>% mutate(CID = if_else(str_length(CID) == 1,
str_pad(CID, width = 2, side = "left", pad="0"), CID))
When you call str_length(CID = 1), it looks like you are passing a parameter named "CID" to str_length, which it knows nothing about. Rather, you want to take the string length of CID and then compare that to 1 with == to test for equality (not =, which is for parameter names and assignments).
But really the if_else isn't necessary here. If everything has to be 2 digits, then just do
DF <- DF %>% mutate(CID = str_pad(CID, width = 2, side = "left", pad="0"))
str_pad will only pad when needed.
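To see that str_pad leaves values that are already wide enough untouched, here is a minimal sketch. The vector cid is made up for illustration, and only the stringr package is assumed to be installed.

```r
library(stringr)

# Hypothetical IDs: some one character long, some already two.
cid <- c("1", "2", "10", "99")

# Values shorter than `width` get zeros on the left;
# values that are already two characters long come back unchanged.
padded <- str_pad(cid, width = 2, side = "left", pad = "0")
print(padded)

# The comparison the if_else version needs: `==` tests equality,
# while `=` inside a call would name an argument.
is_single <- str_length(cid) == 1
print(is_single)
```

Because the padding is a no-op for the two-character IDs, the if_else guard adds nothing here.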

Base R solution:
df$p_key <- with(df, paste(ifelse(nchar(CID) == 1, paste0("0", CID), CID), CSI, sep = "-"))
Tidyverse using Mr Flick's clean solution:
library(tidyverse)
df %>%
  mutate(p_key = str_c(str_pad(CID, width = 2, side = "left", pad = "0"), CSI, sep = "-"))

Related

How to summarize rows of a data frame into one while removing Duplicates in R?

I have a data frame with two or more rows and several columns (ID, Location, Task, Skill, ...). I want to summarize these rows into a single row (data frame) in which the entries of each column are joined together, but only if they differ: if two rows have the same ID, the final row should show that ID only once ("ID1"), but if the IDs differ, both should be shown ("ID1, ID2"). Some numerical values should be added (+) together instead.
df = data.frame("ID" = c("PA1", "PA1"),
"Occupation" = c("PO - react to DCS, initiate corrective measures, react to changes",
"PO - data based operations"),
"Field" = c("PA", "PA"), "Work" = c(0.5, 0.1), "Skill1" = c("CRO", "CRO"),
"Skill2" = c(0, "PPto"), "ds" = c(5, 5))
print(df)
and the output should look like this
df_final = data.frame("ID" = c("PA1"), "Occupation" = c("PO - react to DCS, initiate corrective measures, react to changes, data based operations"), "Field" = c("PA"), "Work" = c(0.6), "Skill1" = c("CRO"), "Skill2" = c("PPto"), "ds" = c(5))
print(df_final)
Thank you!
Let's ignore Skill2 for now:
How close is the following code to what you want to do?
df2 %>%
group_by(ID)%>%
summarise(work = sum(Work),
skill1 = unique(Skill1),
ds = unique(ds),
occupation = paste0(Occupation, collapse = " "),
field = unique(Field))
You can also mutate(occupation = str_replace_all(occupation, "PO - ", "")) to get rid of the repeated "PO - " prefixes.
You're going to run into problems if variables like Skill1/Skill2/ds are not unique within each ID, i.e. if they have cardinality > 1:
df2 %>%
group_by(ID)%>%
summarise(work = sum(Work),
skill1 = unique(Skill1),
skill2 = unique(Skill2),
ds = unique(ds),
occupation = paste0(Occupation, collapse = " "),
field = unique(Field))
If it's a simple data-entry issue, you could do a bit of wrangling to filter for only Skill2 entries with letters contained, and then join this frame back to your original frame.
You could also use the paste0(..., collapse = ) trick, but then you'll end up with both entries pasted into Skill2 (the placeholder as well as "PPto"), which I'm pretty sure you don't want.
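One way to get the "join only if different" behaviour is to collapse the unique, non-missing values of each text column. The following is only a sketch: the helper collapse_unique is a hypothetical name, the Occupation strings are shortened, the Skill2 placeholder is assumed to be NA rather than 0, and across() requires dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(
  ID = c("PA1", "PA1"),
  Occupation = c("PO - react to DCS", "PO - data based operations"),
  Field = c("PA", "PA"),
  Work = c(0.5, 0.1),
  Skill1 = c("CRO", "CRO"),
  Skill2 = c(NA, "PPto"),   # assumption: missing skills are stored as NA
  ds = c(5, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop NAs, keep each distinct value once, join with ", ".
collapse_unique <- function(x) paste(unique(x[!is.na(x)]), collapse = ", ")

df_final <- df %>%
  summarise(
    across(c(ID, Occupation, Field, Skill1, Skill2), collapse_unique),
    Work = sum(Work),   # numeric column that should be added up
    ds = unique(ds)     # assumes ds is constant across the rows
  )

print(df_final)
```

If ds can differ between rows, unique(ds) returns more than one value and the result is no longer a single row, so that column would need its own rule.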

ifelse/case_when with seemingly tricky strings in a character vector

I have a mortality dataframe with a character vector (rac) that contains varying strings per row. These strings flag contributing causes of death. Sometimes these strings have an extra whitespace between them (see id = 4, 5, 8). Sometimes they have exactly 3 characters and at other times they have 4 characters. What I am trying to do is sweep through by row and create a new column that flags whether a particular cause of death is seen in rac or not. Here are the data.
tdf <- structure(list(id = 1:10, rac = c("I250", "K922 R628",
"C259 T149 X599", "K729 C80 J80 N288", "X72 S019", "C189",
"C259 A419 K746 N390", "C349 C787 C793 C795 F179 I10 J449",
"C349 J449 R628", "F03 N189 R628")), row.names = c(NA, -10L),
class = "data.frame")
Take id = 8, where I can easily create a flag called cause_c that notes when C793 or C795 are seen with something like this snippet.
causex <- c("\\bC793|\\bC795")
tdf %>%
mutate(
cause_C = case_when(
str_detect(rac, causex) ~ 1,
TRUE ~ 0)
) -> tdf
It seems to work, but I would like to be able to sweep in instances where the code has only 3 characters, say C79, and when this happens, cause_C should = 1. This is also a more efficient way to create the flags because then I don't have to spell out all possible versions of the code (C793, C794, C79, and so on), and because I have multiple causes to go through and flag some 16 likely causes of death. But if I try the following, id = 8 ends up as 0.
tdf %>%
mutate(
cause_C = case_when(
str_sub(rac, 1, 3) == "C79" ~ 1,
TRUE ~ 0)
) -> tdf
There is something I am missing with the ifelse()/case_when() solution, and if anyone spots my mistake and the fix, I would be very appreciative! And oh, base R, data.table, dplyr, all solutions are welcome because I would be happy to see the speed comparisons too, given that the dataframe is chewing up more than 1.5 gigs.
Thank you!
Ani
If you want to use data.table, would you consider splitting up the rows by diagnostic code, then use grepl to match to your vector of desired diagnoses?
library(data.table)
causex <- c("C793", "C795")
search_causex <- paste(causex, collapse = "|")
setDT(tdf, key = "rac")
tdf[, list(rac = unlist(strsplit(rac, " "))), by = id][
, result := grepl(search_causex, rac)][
result == TRUE]
If you want to search by fewer characters, you could use this as the search pattern (it matches C79 on its own as well as any longer code that starts with C79):
search_causex <- "^C79"
A tidyverse similar approach could be:
library(tidyverse)
tdf %>%
separate_rows(rac, sep = " ") %>%
filter(grepl(search_causex, rac) == TRUE)
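As for why the str_sub attempt in the question returns 0 for id = 8: str_sub(rac, 1, 3) only looks at the first three characters of the whole string, which is "C34" for that row. A minimal base R sketch that keeps one row per id and searches the whole string instead is shown below; the pattern "\\bC79" is an assumption about what should count as a match (any code beginning with C79).

```r
tdf <- data.frame(
  id = 1:10,
  rac = c("I250", "K922 R628", "C259 T149 X599", "K729 C80 J80 N288",
          "X72 S019", "C189", "C259 A419 K746 N390",
          "C349 C787 C793 C795 F179 I10 J449", "C349 J449 R628",
          "F03 N189 R628"),
  stringsAsFactors = FALSE
)

# \\b marks a word boundary, so the match must begin at the start of a code.
# This catches "C79", "C793", "C795", ... anywhere in the string.
tdf$cause_C <- as.integer(grepl("\\bC79", tdf$rac))

print(tdf[tdf$cause_C == 1, ])
```

The same pattern can be passed to str_detect inside case_when if you prefer to stay in dplyr.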

How can I name a value by calling a character value?

I wish to give names to the values in a vector. I know how to do that, but in this case I have many names and many values, both within vectors within lists, and typing them by hand would be hopelessly tedious.
This method:
> values <- c('jessica' = 1, 'jones' = 2)
> values
jessica   jones 
      1       2 
obviously works. However, this method:
> names <- c('jessica', 'jones')
> values <- c(names[1] = 1, names[2] = 2)
Error: unexpected '=' in "values <- c(names[1] ="
Well... I cannot understand why R refuses to read these as pure characters to assign them as names.
I realize I can create values and names separately and then assign names as names(values) but again, my actual case is far more complex. But really I would just like to know why this particular issue occurs.
EDIT I: The ACTUAL data I have is a list of vectors, each is a different combination of amounts of ingredients, and then a giant vector of ingredient names. I cannot just set the name vector as names, because the individual names need to be placed by hand.
EDIT II: Example of my data structure.
ingredients <- c('ing1', 'ing2', 'ing3', 'ing4') # this vector is much longer in reality
amounts <- list(c('ing1' = 1, 'ing2' = 2, 'ing4' = 3),
c('ing2' = 2, 'ing3' = 3),
c('ing1' = 12, 'ing2' = 4, 'ing3' = 3),
c('ing1' = 1, 'ing2' = 1, 'ing3' = 2, 'ing4' = 5))
# this list too is much longer
I could type each numeric value's name individually as presented, but there are many more, and so I tried instead to input the likes of:
c(ingredients[1] = 1, ingredients[2] = 2, ingredients[4] = 3)
But this throws an error:
Error: unexpected '=' in "amounts <- list(c(ingredients[1] ="
We can use setNames
setNames(1:2, names)
Another option is deframe if we have a two column dataset
library(tibble)
tibble(names, val = 1:2) %>%
deframe
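As for why the original attempt fails: inside a call such as c(...), whatever sits to the left of = has to be a plain name or a quoted string. R's parser does not evaluate an expression like names[1] there, hence the "unexpected '='" error. A sketch of how setNames could be applied to the ingredients example, picking the names by index (only the first two list elements from the question are shown):

```r
ingredients <- c("ing1", "ing2", "ing3", "ing4")

# setNames(values, names) evaluates both arguments normally,
# so the names can come from any expression, including indexing.
amounts <- list(
  setNames(c(1, 2, 3), ingredients[c(1, 2, 4)]),
  setNames(c(2, 3), ingredients[c(2, 3)])
)

print(amounts)
```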

Applying mutate to multiple columns and rows in dplyr

A pretty simple question but has me dumbfounded.
I have a table and am trying to round each column to 2 decimal places using mutate_all (or another dplyr function). I know this can be done with certain apply functions, but I like the dplyr/tidyverse framework.
DF = data.frame(A = seq(from = 1, to = 2, by = 0.0255),
B = seq(from = 3, to = 4, by = 0.0255))
Rounded.DF = DF%>%
mutate_all(funs(round(digits = 2)))
This does not work however and just gives me a 2 in every column. Thoughts?
You need a "dot" in the round function. The dot is a placeholder for where mutate_all should place each column that you are trying to manipulate.
Rounded.DF = DF%>%
mutate_all(funs(round(., digits = 2)))
To make it more intuitive you can write the exact same thing as a custom function and then reference that function inside the mutate_all:
round_2_dgts <- function(x) {round(x, digits = 2)}
Rounded.DF = DF%>%
mutate_all(funs(round_2_dgts))
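Note that funs() has been deprecated since dplyr 0.8.0, and mutate_all() is superseded by across() as of dplyr 1.0.0. Assuming a dplyr version of 1.0.0 or later, the same rounding could be sketched like this:

```r
library(dplyr)

DF <- data.frame(A = seq(from = 1, to = 2, by = 0.0255),
                 B = seq(from = 3, to = 4, by = 0.0255))

# across(everything(), ...) applies the lambda to every column;
# .x stands for the column currently being processed.
Rounded.DF <- DF %>%
  mutate(across(everything(), ~ round(.x, digits = 2)))

print(head(Rounded.DF))
```

Here .x plays the same role as the dot in the funs() version above.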

r aggregate and collapse several cells into one

I have a data frame:
x <- data.frame(id = 1:18,
super = c(rep("A", 12), rep("B", 6)),
category = c(rep("one", 6), rep("two", 6), rep("three", 6)),
root = sort(rep(letters[1:6], 3)),
coldefs = letters[1:18], stringsAsFactors = F)
x
I am creating a new column by concatenating 3 columns:
myvars <- c("super", "category", "root")
library(tidyverse)
x <- x %>% unite(col = concat, myvars, sep = "_", remove = F)
x
Now, for each unique value of column 'concat' the values of column 'super' are the same, the values of column 'category' are the same, and the values of column "root" are the same. However, for each unique value of column 'concat' the values of column 'id' are different. The same is true for column 'coldefs'.
I would like to collapse (aggregate) x so that it has only as many rows as there are unique values in column 'concat' (i.e., 6 rows). In each row, I want one value from column 'super', one value from column 'category', one value from column 'root'; and then 3 values of column 'id' (concatenated like this: 1;2;3) and 3 values of column 'coldefs' (concatenated like this: a;b;c).
What's the best way of doing it?
I am trying the following, but it's not working:
x %>% group_by(concat) %>% summarize(id = paste(id, collapse = ";"),
super = unique(super), category = unique(category), root = unique(root),
coldefs = paste(coldefs, collapse = ";"))
I am clearly doing something wrong.
Thanks a lot for your help!
I must say this is a bit (or completely) crazy! I tried my code (the one at the bottom) piece by piece and it worked. I merged it all together, and it worked. I don't understand why I was getting an error before. Here is the code that works (at least now):
x %>% group_by(concat) %>% summarize(id = paste(id, collapse = ";"), super = unique(super),
category = unique(category), root = unique(root),
coldefs = paste(coldefs, collapse = ";"))
