Syntax of mutate combined with other expressions - r

I'm struggling to figure out the correct syntax of mutate combined with other functions. Here I'm trying to remove the text "incubated: " from a column called "days.incubated2".
Any ideas?
df%<%
mutate(str_remove(days.incubated2, "[incubated: ]"))

The correct syntax would be:
library(dplyr)
library(stringr)
df <- df %>% mutate(days.incubated2 = str_remove(days.incubated2, "incubated: "))
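You had an incorrect pipe operator: %<% instead of %>%.
You also need to name the column where the result is stored, here days.incubated2 (mutate(days.incubated2 = ...)), and assign the result back to df. Note that the pattern "[incubated: ]" is a character class that matches any single one of those characters, so drop the square brackets to match the literal text.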

We can use sub from base R
df$days.incubated2 <- sub("incubated: ", "", df$days.incubated2)
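To see why the brackets in the original pattern matter, here is a minimal base R sketch on made-up values (the vector x is hypothetical, not the asker's data). Inside square brackets the pattern is a character class, so sub removes only the first single character that belongs to the set; without the brackets the literal text is matched.

```r
# Hypothetical sample values standing in for df$days.incubated2
x <- c("incubated: 5", "incubated: 12")

# Character class: removes only the first character found in the set
class_result <- sub("[incubated: ]", "", x)

# Literal text: removes the whole prefix
literal_result <- sub("incubated: ", "", x)

print(class_result)
print(literal_result)
```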

Related

using tidyr separate function to split by \ backslash

I would like to split text in a column by '\' using the separate function in tidyr. Given this example data...
library(tidyr)
df1 <- structure(list(Parent.objectId = 1:2, Attachment.path = c("photos_attachments\\photos_image-20220602-192146.jpg",
"photos_attachments\\photos_image-20220602-191635.jpg")), row.names = 1:2, class = "data.frame")
And I've tried multiple variations of this...
df2 <- df1 %>%
separate(Attachment.path,c("a","b","c"),sep="\\",remove=FALSE,extra="drop",fill="right")
Which doesn't result in an error, but it doesn't split the string into two columns, likely because I'm not using the correct regular expression for the single backslash.
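We may need to escape the backslash. sep is interpreted as a regular expression, where a literal backslash is written as \\, and each of those backslashes must be escaped again inside an R string, giving "\\\\":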
library(tidyr)
separate(df1, Attachment.path,c("a","b","c"),
sep= "\\\\", remove=FALSE, extra="drop", fill="right")
According to ?separate
sep - ... The default value is a regular expression that matches any sequence of non-alphanumeric values.
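The two levels of escaping can be checked in base R. This is a small sketch using one of the example paths; strsplit stands in for separate here, which is an assumption made only to keep the example free of package dependencies.

```r
# One backslash in the actual string, written as "\\" in R source
path <- "photos_attachments\\photos_image-20220602-192146.jpg"
n_backslash <- nchar("\\")   # the string "\\" holds a single character

# As a regular expression, a literal backslash must itself be escaped,
# so the R string becomes "\\\\"
parts_regex <- strsplit(path, "\\\\")[[1]]

# With fixed = TRUE the pattern is taken literally, so "\\" is enough
parts_fixed <- strsplit(path, "\\", fixed = TRUE)[[1]]

print(parts_regex)
```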
If you are splitting on \ to get the folder and file names, try these two functions instead. Note that they rely on R treating \ as a path separator, which it does on Windows:
#get filenames
basename(df1$Attachment.path)
# [1] "photos_image-20220602-192146.jpg" "photos_image-20220602-191635.jpg"
#get foldernames
basename(dirname(df1$Attachment.path))
# [1] "photos_attachments" "photos_attachments"
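On systems where \ is not a path separator, basename and dirname would not split these strings. A hedged alternative that does not depend on the platform is to strip everything up to, or from, the last backslash with sub. The names file_part and folder_part are made up for this sketch, and it assumes a single folder level as in the example data.

```r
path <- c("photos_attachments\\photos_image-20220602-192146.jpg",
          "photos_attachments\\photos_image-20220602-191635.jpg")

# Drop everything up to and including the last backslash
file_part <- sub("^.*\\\\", "", path)

# Drop the last backslash and everything after it
folder_part <- sub("\\\\[^\\\\]*$", "", path)

print(file_part)
print(folder_part)
```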

Insert value in a column if value in another column contains a certain word/letter

I am essentially trying to write some code to detect whether the values in a column contain "%". If so, set df$unit to "%"; if not, do nothing.
I tried the code below, but it returns % for all rows, even those whose values don't contain %.
How should I fix it?
if(stringr::str_detect(df$variable, "%")) {
df$unit <- "%"
}
A tidyverse approach
library(dplyr)
library(stringr)
df %>%
mutate(unit = if_else(str_detect(variable,"%"),"%",unit))
Try the below:
library(stringr)
df[str_detect(df$variable, "%"), 'unit'] <- "%"
This doesn't need any extra libraries.
In base R, you can use replace with grepl.
df <- transform(df,unit = replace(unit, grepl('%', variable, fixed = TRUE), '%'))
Or
df$unit[grepl('%', df$variable, fixed = TRUE)] <- '%'
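The original if() fails because if() expects a single TRUE or FALSE, while str_detect returns one value per row; older versions of R used only the first element (with a warning), and R 4.2.0 and later raise an error. The sketch below uses made-up data (the values in df are assumptions) to show the logical vector and the row-wise assignment.

```r
# Made-up example data; stringsAsFactors = FALSE keeps unit as character
df <- data.frame(variable = c("10%", "5 kg", "50%"),
                 unit = c(NA, "kg", NA),
                 stringsAsFactors = FALSE)

# One TRUE/FALSE per row
has_pct <- grepl("%", df$variable, fixed = TRUE)

# Assign only where the condition holds; other rows keep their unit
df$unit[has_pct] <- "%"

print(df)
```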

Cleaning strings in sparklyr using regex

I'm trying to clean strings in a table in sparklyr using regexp_replace. I need to remove both multiple spaces between words and specific whole words.
Establish Spark Connection
pharms <- spark_read_parquet(sc, 'pharms', 's3/path/to/pharms', infer_schema = TRUE, memory = FALSE)
Vector to clean
The df vector I want to clean looks like this, but it is within a table in the sparklyr connection:
drug_strings <- c("tablomiacin sodium tab mg", "nsaid caps mg")
The desired output once the regex processes the data would be something like this:
Desired Outcomes
[1] "tablomiacin sodium", "nsaid"
Attempts
I've tried various combinations used in regex such as:
pharms_cln <- pharms %>%
distinct(drug_strings)%>%
mutate(new_strings=regexp_replace(drug_strings, "\\b(caps|mg|tab)\\b", ""))
pharms_cln <- pharms %>%
distinct(drug_strings)%>%
mutate(new_strings=regexp_replace(drug_strings, "\\s+", ""))
But they all either replace all letters or substrings rather than just the individual word, or print an error related to Hive. Similarly, my attempts to remove blank spaces just seem to remove the letter 's'.
If the rule for the sought replacement is "keep anything preceding caps|mg|tab", then this may work:
Data:
drug_strings <- c("tablomiacin sodium tab mg", "nsaid caps mg")
Solution:
trimws(gsub("\\b(tab|mg|caps)\\b", "", drug_strings))
[1] "tablomiacin sodium" "nsaid"
If for some reason you need to use str_extract, you can do this:
str_extract(gsub("\\s{2,}", " ", drug_strings), "\\b\\w+\\b(\\s\\b\\w+\\b)*(?=\\s\\b(tab|mg|caps)\\b)")
This first reduces all multiple white space characters to just one such char, and then does the extraction.
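Both requirements from the question, removing whole words and collapsing repeated spaces, can be combined in one base R helper. This is a sketch on a local character vector, not on a Spark table; the function name clean_strings and the third test string are made up for illustration.

```r
clean_strings <- function(x, words) {
  # Build a whole-word pattern from the word list, e.g. \b(tab|mg|caps)\b
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)  # remove the whole words
  x <- gsub("\\s+", " ", x)               # collapse runs of whitespace
  trimws(x)                               # drop leading/trailing whitespace
}

drug_strings <- c("tablomiacin sodium tab mg", "nsaid caps mg",
                  "mg aspirin  tab plus")
cleaned <- clean_strings(drug_strings, c("tab", "mg", "caps"))
print(cleaned)
```

Because of the word boundaries, the "tab" at the start of "tablomiacin" is left alone.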
Someone who knows regex could certainly streamline this code, but the following works using the str_remove function from the stringr package.
library(dplyr)
library(stringr)
drug_strings <- c("tablomiacin tab mg", "nsaid caps mg")
drug_strings <- data.frame(drug_strings)
drug_strings <- drug_strings %>%
  # remove the first whole-word match of caps, mg or tab
  mutate(new_strings = str_remove(drug_strings, "\\b(caps|mg|tab)\\b")) %>%
  # remove the first run of whitespace
  mutate(new_strings = str_remove(new_strings, "\\s+")) %>%
  # remove the first remaining "mg"
  mutate(new_strings = str_remove(new_strings, "mg"))

str_extract_all: return all patterns found in string concatenated as vector

I want to extract everything but a pattern and return the result concatenated into a single string.
I tried to combine str_extract_all together with sapply and cat
x = c("a_1","a_20","a_40","a_30","a_28")
data <- tibble(age = x)
# extracting just the first pattern is easy
data %>%
mutate(age_new = str_extract(age,"[^a_]"))
# combining str_extract_all and sapply doesnt work
data %>%
mutate(age_new = sapply(str_extract_all(x,"[^a_]"),function(x) cat(x,sep="")))
class(str_extract_all(x,"[^a_]"))
sapply(str_extract_all(x,"[^a_]"),function(x) cat(x,sep=""))
Returns NULL instead of concatenated patterns
cat prints to the console and returns NULL, so use paste instead. With the tidyverse, we can also use map_chr and str_c (stringr's counterpart to paste):
library(tidyverse)
data %>%
mutate(age_new = map_chr(str_extract_all(x, "[^a_]+"), ~ str_c(.x, collapse="")))
Using the OP's code:
data %>%
mutate(age_new = sapply(str_extract_all(x,"[^a_]"),
function(x) paste(x,collapse="")))
If the intention is to get the numbers
library(readr)
data %>%
mutate(age_new = parse_number(x))
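The reason the original attempt returns NULL is that cat writes to the console and returns NULL, while paste returns the combined string. Here is a base R sketch of the same idea, using regmatches and gregexpr in place of str_extract_all (a substitution made here only to avoid package dependencies):

```r
x <- c("a_1", "a_20", "a_40", "a_30", "a_28")

# cat prints its arguments and returns NULL
cat_value <- cat("1", "\n")

# Extract every run of characters other than "a" and "_"
pieces <- regmatches(x, gregexpr("[^a_]+", x))

# paste with collapse returns one string per element
age_new <- vapply(pieces, paste, character(1), collapse = "")

print(age_new)
```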
Here is a non tidyverse solution, just using stringr.
apply(str_extract_all(column,regex_command,simplify = TRUE),1,paste,collapse="")
Setting simplify = TRUE makes str_extract_all return a matrix, and apply then iterates over its rows. I got the idea from https://stackoverflow.com/a/4213674/8427463
Example: extract every 'r' in rownames(mtcars) and concatenate the matches for each row name
library(stringr)
apply(str_extract_all(rownames(mtcars),"r",simplify = TRUE),1,paste,collapse="")

Customizing make.names function in R?

I am automating an R script for which I have to use the make.names function. The default behavior of make.names is fine with me, but when my table name contains a "-", I want the resulting name to be different.
For example, the current behavior:
> make.names("iris-ir")
[1] "iris.ir"
But I want it to behave differently only when "-" is present in the table name:
> make.names("iris-ir")
[1] "iris_ir"
How can I achieve this? EDIT: using only builtin packages.
Use the following function:
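library(dplyr)
make_names <- function(name) {
  name <- as.character(name)
  # contains() returns the positions of matching elements (empty if none)
  if (length(contains("-", vars = name)) > 0) {
    sub("-", "_", name)
  } else {
    name
  }
}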
This should do what you want.
Sorry, I forgot to mention that the contains function is in the dplyr package.
Without dplyr
make_names <- function(name) {
  name <- as.character(name)
  if (grepl("-", name, fixed = TRUE)) {
    sub("-", "_", name)
  } else {
    name
  }
}
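Note that the functions above only swap the hyphen and do not call make.names themselves. If the goal is to keep the default make.names behavior for everything except hyphens, one possible sketch with built-in functions only is to replace the hyphens first and then pass the result to make.names. The name make_names_hyphen is made up for this example.

```r
make_names_hyphen <- function(name) {
  # Replace every "-" with "_" before make.names sees it
  name <- gsub("-", "_", as.character(name), fixed = TRUE)
  # Underscores are valid in names, so make.names leaves them alone
  make.names(name)
}

single <- make_names_hyphen("iris-ir")
several <- make_names_hyphen(c("a-b-c", "my table", "1x"))
print(single)
print(several)
```

Using gsub rather than sub means names with more than one hyphen are handled as well.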
