For any entries of the column "district" that match regex("[:alpha:]{2}AL"), I would like to replace the "AL" with "01".
For example:
df <- tibble(district = c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01"))
I tried:
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
str_replace(district,"AL","01")))
and
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
paste(str_sub(district, start = 1, end = 2),"01",sep = "")))
but there is a vectorization problem.
Is this ok?
str_replace_all(string=df$district,
pattern="(\\w{2})AL",
replacement="\\101")
I replaced [:alpha:] in the regex with \\w, which matches a word character: https://www.regular-expressions.info/shorthand.html
The \\1 in the replacement refers to the first captured group, here (\\w{2}), so the first two characters are kept and "01" is appended in place of "AL".
Alternatively, you can swap replace for ifelse:
ifelse( str_detect(df$district, regex("[:alpha:]{2}AL")),
str_replace(df$district,"AL","01"),df$district)
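To make the vectorization problem concrete, here is a minimal base R sketch (no stringr needed). replace() expects one replacement value per TRUE in the index, so the values have to be subset with the same index. The names is_at_large, via_replace and via_sub are made up for illustration, and the anchors ^ and $ are an added assumption that the codes are always exactly four characters long.

```r
district <- c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01")

# TRUE only for two letters followed by "AL" (anchored at both ends)
is_at_large <- grepl("^[[:alpha:]]{2}AL$", district)

# replace() wants as many values as there are TRUEs, so subset the values too
via_replace <- replace(district, is_at_large,
                       sub("AL$", "01", district[is_at_large]))

# Same result in one step: keep the captured two letters, append "01"
via_sub <- sub("^([[:alpha:]]{2})AL$", "\\101", district)

print(via_sub)
```

Because of the anchors, "AL02" and "AL01" are left alone: the "AL" there is the state, not the suffix.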
I would like to split text in a column by '\' using the separate function in tidyr. Given this example data...
library(tidyr)
df1 <- structure(list(Parent.objectId = 1:2, Attachment.path = c("photos_attachments\\photos_image-20220602-192146.jpg",
"photos_attachments\\photos_image-20220602-191635.jpg")), row.names = 1:2, class = "data.frame")
And I've tried multiple variations of this...
df2 <- df1 %>%
separate(Attachment.path,c("a","b","c"),sep="\\",remove=FALSE,extra="drop",fill="right")
Which doesn't result in an error, but it doesn't split the string into two columns, likely because I'm not using the correct regular expression for the single backslash.
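The backslash needs to be escaped twice, once for the R string and once for the regular expression, so a single literal backslash is written as "\\\\":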
library(tidyr)
separate(df1, Attachment.path,c("a","b","c"),
sep= "\\\\", remove=FALSE, extra="drop", fill="right")
According to ?separate
sep - ... The default value is a regular expression that matches any sequence of non-alphanumeric values.
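The escaping rule can be checked in base R without tidyr. This is only a sketch using strsplit() on one of the example paths; the variable names are made up for illustration.

```r
path <- "photos_attachments\\photos_image-20220602-192146.jpg"

# In R source, "\\" is one backslash character
n_backslash_chars <- nchar("\\")

# As a regex, that single backslash must itself be escaped: "\\\\"
parts_regex <- strsplit(path, "\\\\")[[1]]

# With fixed = TRUE the pattern is not a regex, so "\\" is enough
parts_fixed <- strsplit(path, "\\", fixed = TRUE)[[1]]

print(parts_regex)
```

Both calls give the folder name and the file name as two separate elements.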
If the point of splitting on \ is to get the folder and file names, try these 2 functions instead (the output below assumes Windows, where \ is treated as a path separator):
#get filenames
basename(df1$Attachment.path)
# [1] "photos_image-20220602-192146.jpg" "photos_image-20220602-191635.jpg"
#get foldernames
basename(dirname(df1$Attachment.path))
# [1] "photos_attachments" "photos_attachments"
I came across this Stack Overflow Q&A: https://stackoverflow.com/a/55479243/11799491
and I want to know how to select all rows that do not match the string detect from the accepted answer. I tried using a ! in front of str_detect and it did not work.
Dataframe %>% filter_at(.vars = vars(names, Jobs),
.vars_predicate = any_vars(!str_detect(. , paste0("^(", paste(Filter_list, collapse = "|"), ")"))))
Thank you in advance for your help!
Since dplyr 1.0.4, we can use if_any within filter
library(dplyr)
library(stringr)
Dataframe %>%
filter(!if_any(c(names, Jobs),
~ str_detect(., str_c("^(", str_c(Filter_list, collapse="|"), ")"))))
# names Jobs
#1 Mark Nojob
The "Nojob" row is kept because the pattern requires the string to start (^) with "Jo", and "Nojob" does not (the case is also different).
In the older version, we can negate (!) with all_vars
Dataframe %>%
filter_at(.vars = vars(names, Jobs),
.vars_predicate = all_vars(!str_detect(. , paste0("^(", paste(Filter_list, collapse = "|"), ")"))))
# names Jobs
#1 Mark Nojob
The reason any_vars with ! didn't work is that it keeps a row if any column fails to match the string. So if one column in a row lacks the match while the other has it, the row is still returned. With all_vars and the negation, a row is returned only when none of the columns specified in vars match.
With the older functions, we cannot put the negation (!) in front of any_vars, whereas we can with if_any: if_any returns a logical vector that is passed directly to filter, while any_vars is only evaluated indirectly by filter_at.
NOTE: The function wrapper that corresponds to all_vars is if_all in the current version
data
Dataframe <- data.frame("names" = c('John','Jill','Joe','Mark'), "Jobs" = c('Mailman','Jockey','Jobhunter',"Nojob"))
Filter_list <- c('Jo')
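The difference between the two negations can be seen in base R with a plain logical matrix. This is only a sketch of the logic, not of how dplyr implements it; hits, none_match and any_nonmatch are names made up for illustration.

```r
Dataframe <- data.frame(names = c("John", "Jill", "Joe", "Mark"),
                        Jobs = c("Mailman", "Jockey", "Jobhunter", "Nojob"),
                        stringsAsFactors = FALSE)

# One logical column per variable: does the value start with "Jo"?
hits <- sapply(Dataframe[c("names", "Jobs")], grepl, pattern = "^(Jo)")

# "no column matches", i.e. !any(match), which equals all(!match)
none_match <- rowSums(hits) == 0

# "some column does not match", i.e. any(!match): a weaker condition
any_nonmatch <- rowSums(!hits) > 0

print(Dataframe$names[none_match])
print(Dataframe$names[any_nonmatch])
```

Only "Mark" satisfies the first condition, while the second also keeps "John" and "Jill", because each of them has one column that does not start with "Jo".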
I have a data frame with a string column "disease". I want to filter the rows with a partial match on "trauma" or "Trauma". So far I have done the following using dplyr and stringr:
trauma_set <- df %>% filter(str_detect(disease, "trauma|Trauma"))
But the result also includes "Nontraumatic" and "nontraumatic". How can I filter only "trauma, Trauma, traumatic or Traumatic" without including nontrauma or Nontrauma? Also, is there a way I can define the string to detect without having to specify both uppercase and lowercase version of the string (as in both trauma and Trauma)?
To require a word boundary, put \\b at the start of the pattern. To match regardless of case, wrap the pattern in regex() and set ignore_case = TRUE
library(dplyr)
library(stringr)
out <- df %>%
filter(str_detect(disease, regex("\\btrauma", ignore_case = TRUE)))
sum(str_detect(out$disease, regex("^Non", ignore_case = TRUE)))
#[1] 0
data
set.seed(24)
df <- data.frame(disease = sample(c("Nontraumatic", "Trauma",
"Traumatic", "nontraumatic", "traumatic", "trauma"), 50 ,
replace = TRUE), value = rnorm (50))
You were very close to a correct solution. If the word always comes at the beginning of the string, you just need to add the "start of string" anchor ^, as follows:
trauma_set <- df %>% filter(str_detect(disease, "^trauma|^Trauma"))
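The two approaches differ when the word appears later in the string. A small base R sketch of the difference follows; the value "Head trauma" is a hypothetical entry added for illustration and is not in the original data.

```r
# Hypothetical sample values; "Head trauma" is added to show a mid-string match
disease <- c("Nontraumatic", "Trauma", "traumatic", "Head trauma", "nontraumatic")

# Word boundary: "trauma" must begin a word, anywhere in the string
boundary_hit <- grepl("\\btrauma", disease, ignore.case = TRUE, perl = TRUE)

# Anchor: "trauma" must begin the whole string
anchor_hit <- grepl("^trauma", disease, ignore.case = TRUE)

print(disease[boundary_hit])
print(disease[anchor_hit])
```

Both exclude the "non" variants, but only the word-boundary version keeps "Head trauma".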
Hello I have a column in a data.frame, it has many rows, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"))
I want to make a new column "Species_new" where the "*" is moved to the end of the character string, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"),
"Species_new" = c("Briza minor*", "Briza minor*", "Wattle"))
Is there a way to do this using gsub? The manual example would take far too long as I have approximately 50,000 rows.
Thanks in advance
One option is to capture the * as a group and in the replacement reverse the backreferences
df$Species_new <- sub("^([*])(.*)$", "\\2\\1", df$Species)
df$Species_new
#[1] "Briza minor*" "Briza minor*" "Wattle"
NOTE: * is a metacharacter meaning 0 or more, so we can either escape (\\*) or place it in brackets ([]) to evaluate the raw character i.e. literal evaluation
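As a quick check of the note above, the escaped and bracketed forms behave the same, and fixed = TRUE is a third way to get a literal match when no capture group is needed. This is a sketch on the example values only; the variable names are made up for illustration.

```r
species <- c("*Briza minor", "*Briza minor", "Wattle")

# Escaped asterisk: capture the rest, then append the asterisk
escaped <- sub("^\\*(.*)$", "\\1*", species)

# Asterisk inside a bracket expression: same effect
bracketed <- sub("^[*](.*)$", "\\1*", species)

# fixed = TRUE treats the pattern as plain text (this only strips the asterisk)
stripped <- sub("*", "", species, fixed = TRUE)

print(escaped)
```

Strings without a leading asterisk, such as "Wattle", do not match and are returned unchanged.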
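Thanks so much for the quick response. I also found a workaround:
# Strip the asterisk
df$Species_new = sub("[*]","",df$Species, perl=TRUE)
# Species values that changed, i.e. those that had an asterisk
differences = setdiff(df$Species,df$Species_new)
# %in% rather than ==, so this still works when more than one species differs
tochange = subset(df,df$Species %in% differences)
toleave = subset(df,!df$Species %in% differences)
tochange$Species_new = paste(tochange$Species_new, "*", sep = "")
# Note: this puts the changed rows first, so the original row order is lost
df = rbind(tochange,toleave)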
I have a data frame that looks something like this:
df <- ("test1/a/x/w/e/a/adfsadfsfads
test2/w/s/f/x/a/saffakwfkwlwe
test3/a/e/c/o/a/saljsfadswwoe")
The structure is always testX/0/0/0/0/a/randomstuff, where 0 stands for a random letter. Now I want to change the letter "a" after the 4 random letters to a "z" in every row.
I tried a regex, but it didn't work because when I choose "/a/" as the pattern and "/z/" as the replacement, it would also replace the two "a"s at the beginning of test1 and test3.
So what I need is a function that replaces only the last pattern that is observed in each line. Is there anything that can do this?
I believe this is what you are looking for:
data <- c(
"test1/a/x/w/e/a/adfsadfsfads",
"test2/w/s/f/x/a/saffakwfkwlwe",
"test3/a/e/c/o/a/saljsfadswwoe"
)
gsub("a/([a-z]+)$", "z/\\1", data)
# [1] "test1/a/x/w/e/z/adfsadfsfads" "test2/w/s/f/x/z/saffakwfkwlwe"
# [3] "test3/a/e/c/o/z/saljsfadswwoe"
And if you don't like regex you could use strsplit().
library(magrittr)
data %>%
strsplit("/") %>%
lapply(function(x) {x[6] <- "z"; x}) %>%
sapply(paste, collapse = "/")
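For the more general question of replacing only the last occurrence of a pattern in each line, one option is to put a greedy (.*) in front of it. This is a sketch, assuming the pattern "/a/" occurs at least once per line; last_replaced is a name made up for illustration.

```r
data <- c(
  "test1/a/x/w/e/a/adfsadfsfads",
  "test2/w/s/f/x/a/saffakwfkwlwe",
  "test3/a/e/c/o/a/saljsfadswwoe"
)

# The greedy (.*) consumes as much as possible, so "/a/" matches its last occurrence
last_replaced <- sub("(.*)/a/", "\\1/z/", data)

print(last_replaced)
```

Unlike the position-based strsplit() approach, this does not depend on the "a" always being the sixth field.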