How can I show which special character was matched in each row of a single-column data frame?
Sample dataframe:
a <- data.frame(name=c("foo","bar'","ip_sum","four","%23","2_planet!","#abc!!"))
Determining if the string has a special character:
library(dplyr)
a$name_cleansed <- gsub("([-./&,])|[[:punct:]]","\\1",a$name) #\\1 puts back the exceptions we define (-, ., /, & and ,)
a <- a %>% mutate(has_special_char=if_else(name==name_cleansed,FALSE,TRUE))
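Run on the sample data, this flags every name containing a symbol other than the excluded ones. A minimal sketch (dplyr is assumed to be loaded; `name != name_cleansed` is equivalent to the `if_else` above):

```r
library(dplyr)

a <- data.frame(name = c("foo","bar'","ip_sum","four","%23","2_planet!","#abc!!"))
# Strip punctuation except the exceptions -, ., /, &, and ,
a$name_cleansed <- gsub("([-./&,])|[[:punct:]]", "\\1", a$name)
a <- a %>% mutate(has_special_char = name != name_cleansed)
a
#        name name_cleansed has_special_char
# 1       foo           foo            FALSE
# 2      bar'           bar             TRUE
# 3    ip_sum         ipsum             TRUE
# 4      four          four            FALSE
# 5       %23            23             TRUE
# 6 2_planet!       2planet             TRUE
# 7    #abc!!           abc             TRUE
```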
You can use str_extract if you want only the first special character.
library(stringr)
str_extract(a$name,'[[:punct:]]')
#[1] NA "'" "_" NA "%" "_" "#"
If we need all of the special characters we can use str_extract_all.
sapply(str_extract_all(a$name,'[[:punct:]]'), function(x) toString(unique(x)))
#[1] "" "'" "_" "" "%" "_, !" "#, !"
To exclude certain symbols, we can use
exclude_symbol <- c('-', '.', '/', '&', ',')
sapply(str_extract_all(a$name,'[[:punct:]]'), function(x)
toString(setdiff(unique(x), exclude_symbol)))
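None of the excluded symbols occur in the sample names, so to see the exclusion in action, here is a sketch with a hypothetical extra name "a-b.c" (the extra name is an assumption, not part of the original data):

```r
library(stringr)

# "a-b.c" is a made-up name that contains only excluded symbols
name <- c("a-b.c", "%23", "#abc!!")
exclude_symbol <- c('-', '.', '/', '&', ',')
sapply(str_extract_all(name, '[[:punct:]]'),
       function(x) toString(setdiff(unique(x), exclude_symbol)))
# [1] ""     "%"    "#, !"
```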
We can use grepl here for a base R option:
a$has_special_char <- grepl("(?![-./&,])[[:punct:]]", a$name, perl=TRUE)
a$special_char <- ifelse(a$has_special_char, sub("^.*([[:punct:]]).*$", "\\1", a$name), NA)
a
name has_special_char special_char
1 foo FALSE <NA>
2 bar' TRUE '
3 ip_sum TRUE _
4 four FALSE <NA>
5 %23 TRUE %
6 2_planet! TRUE !
7 #abc!! TRUE !
Data:
a <- data.frame(name=c("foo","bar'","ip_sum","four","%23","2_planet!","#abc!!"))
The above logic returns the last symbol character in each name, if present, otherwise returning NA (the greedy `^.*` in the sub pattern consumes up to the final punctuation character). It reuses the has_special_char column to determine whether a symbol occurs in the name at all.
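If the first non-excluded symbol is wanted instead, regexpr with the same lookahead pattern gives it directly (a sketch):

```r
a <- data.frame(name = c("foo","bar'","ip_sum","four","%23","2_planet!","#abc!!"))
# Locate the first punctuation character that is not in the exclusion list
m <- regexpr("(?![-./&,])[[:punct:]]", a$name, perl = TRUE)
first_symbol <- rep(NA_character_, nrow(a))
first_symbol[m > -1] <- regmatches(a$name, m)
first_symbol
# [1] NA  "'" "_" NA  "%" "_" "#"
```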
Edit:
If you want a column which shows all special characters, then use:
a$all_special_char <- ifelse(a$has_special_char, gsub("[^[:punct:]]+", "", a$name), NA)
Base R regex solution using a negated character class (the "^" not-operator inside the brackets) to remove everything that is not punctuation, with a first alternative that also removes the excluded symbols:
gsub("([-./&,])|[^[:punct:]]", "", a$name)
Also if you want a data.frame returned:
within(a, {
special_char <- gsub("([-./&,])|[^[:punct:]]", "", name);
has_special_char <- special_char != ""})
If you only want unique special characters per name, as in @Ronak Shah's answer:
within(a, {
special_char <- sapply(gsub("([-./&,])|[^[:punct:]]", "", name),
function(x){toString(unique(unlist(strsplit(x, ""))))});
has_special_char <- special_char != ""
})
Related
I have a string:
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
and I want to extract separately:
RDS16
Asthma
What I've tried so far is:
extract <- str_extract(string,'~."(.+)')
but I am only able to get:
~ \"Asthma\",
If you have an answer, can you also kindly explain the regex behind it? I'm having a hard time converting string patterns to regex.
If you just need to extract the sections surrounded by ", you can use something like the following. The pattern ".*?" matches a first ", then .*?, meaning as few characters as possible, before finally matching another ". This gets you the strings including the surrounding double quotes; you then just have to remove the quotes to clean up.
Note that str_extract_all is used to return all matches, and that it returns a list of character vectors so we need to index into the list before removing the double quotes.
library(stringr)
library(magrittr) # provides %>%
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
str_extract_all(string, '".*?"') %>%
`[[`(1) %>%
str_remove_all('"')
#> [1] "RDS16" "Asthma"
Created on 2021-06-21 by the reprex package (v1.0.0)
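Since there is a single input string here, unlist() works just as well as indexing into the list (a sketch):

```r
library(stringr)

string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
# Flatten the one-element list, then strip the quotes
str_remove_all(unlist(str_extract_all(string, '".*?"')), '"')
# [1] "RDS16"  "Asthma"
```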
Base R solutions:
# Solution 1:
# Extract the quoted substrings:
# dirtyStrings => list of strings
dirtyStrings <- regmatches(string, gregexpr('".*?"', string))
# Iterate over the list and "clean" - unquote - each
# element, store as a vector: result => character vector
result <- c(vapply(dirtyStrings,
                   function(x) gsub('"', '', x),
                   character(lengths(dirtyStrings))))
# Solution 2:
# Same as above, less generic -- assumes all strings
# will follow the same pattern: result => character vector
result <- unlist(lapply(strsplit(gsub(".*\\=\\=", "", string), "~"),
                        function(x) gsub("\\W+", "", x)))
You can capture the two values in two separate columns.
In stringr use str_match -
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
stringr::str_match(string, '"(\\w+)" ~ "(\\w+)"')[, -1, drop = FALSE]
# [,1] [,2]
#[1,] "RDS16" "Asthma"
Or in base R use strcapture
strcapture('"(\\w+)" ~ "(\\w+)"', string,
proto = list(col1 = character(), col2 = character()))
# col1 col2
#1 RDS16 Asthma
I have a character vector that looks like below, and I want to delete the elements that don't have any value after '_'.
How do I do that in R?
[985] "Pclo_" "P2yr13_ S329" "Basp1_ S131"
[988] "Stk39_ S405" "Srrm2_ S351" "Grin2b_ S930"
[991] "Matr3_ S604" "Map1b_ S1781" "Crmp1_"
[994] "Elmo1_" "Pcdhgc5_" "Sp4_"
[997] "Pbrm1_" "Pphln1_" "Gnl1_ S33"
[1000] "Kiaa1456_"
We can use grep
grep("_$", v1, invert = TRUE, value = TRUE)
Or endsWith
v1[!endsWith(v1, "_")]
We can use substring to get the last character of each element and keep those where it is not "_".
x <- c("Pclo_","P2yr13_ S329","Basp1_ S131")
x[substring(x, nchar(x)) != '_']
#[1] "P2yr13_ S329" "Basp1_ S131"
The last character can also be extracted with a regex, using sub:
x[sub('.*(.)$', '\\1', x) != '_']
I would like to extract the second-to-last segment after the '/' symbols. For example,
url<- c('https://example.com/names/ani/digitalcod-org','https://example.com/names/bmc/ambulancecod.org' )
df<- data.frame (url)
I want to extract the second-to-last word between the '/' separators, i.e. the words 'ani' and 'bmc'
so, I tried this
library(stringr)
df$name<- word(df$url,-2)
I need output which as follows:
name
ani
bmc
You can use word but you need to specify the separator,
library(stringr)
word(url, -2, sep = '/')
#[1] "ani" "bmc"
Try this:
as.data.frame(sapply(str_extract_all(df$url,"\\w{2,}(?=\\/)"),"["))[3,]
# V1 V2
#3 ani bmc
as.data.frame(sapply(str_extract_all(df$url,"\\w{2,}(?=\\/)"),"["))[2:3,]
# V1 V2
#2 names names
#3 ani bmc
Use gsub with
.*?([^/]+)/[^/]+$
In R:
urls <- c('https://example.com/names/ani/digitalcod-org','https://example.com/names/bmc/ambulancecod.org' )
gsub(".*?([^/]+)/[^/]+$", "\\1", urls)
This yields
[1] "ani" "bmc"
Here, .*? lazily consumes the start of the URL, ([^/]+) captures the second-to-last segment (one or more non-slash characters), and /[^/]+$ matches the final slash plus last segment; the whole match is replaced by the captured group. See a demo on regex101.com.
Here is a solution using strsplit
words <- strsplit(url, '/')
L <- lengths(words)
vapply(seq_along(words), function (k) words[[k]][L[k]-1], character(1))
# [1] "ani" "bmc"
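The same idea can be written without the separate lengths vector by indexing each split vector from its own end (a sketch):

```r
url <- c('https://example.com/names/ani/digitalcod-org',
         'https://example.com/names/bmc/ambulancecod.org')
# Split on '/' and take the second-to-last piece of each element
vapply(strsplit(url, '/'), function(w) w[length(w) - 1L], character(1))
# [1] "ani" "bmc"
```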
A non-regex approach using basename
basename(mapply(sub, pattern = basename(url), replacement = "", x = url, fixed = TRUE))
#[1] "ani" "bmc"
basename(url) "removes all of the path up to and including the last path separator (if any)" and returns
[1] "digitalcod-org" "ambulancecod.org"
mapply then strips this last segment from every element of url (replacing it with ""), and a second basename call returns what is now the last segment.
I have a column of strings that I would like to remove everything after the last '.' like so:
ENST00000338167.9
ABCDE.42927.6
ENST00000265393.10
ABCDE.43577.3
ENST00000370826.3
I would like to remove the '.' and everything after it, but for the 'ENST' entries only
eg:
ENST00000338167
ABCDE.42927.6
ENST00000265393
ABCDE.43577.3
ENST00000370826
I can do
function(x) sub("\\.[^.]*$", "", x)
if I try
function(x) sub("ENST*\\.[^.]*$", "", x)
this isn't quite working and I don't fully understand the regex commands.
We can use a combination of ifelse, grepl and sub. We first check whether the string starts with "ENST" and, if it does, remove everything after the "." using sub.
ifelse(grepl("^ENST", x), sub("\\..*", "", x), x)
#[1] "ENST00000338167" "ABCDE.42927.6" "ENST00000265393" "ABCDE.43577.3"
#[5] "ENST00000370826"
data
x <- c("ENST00000338167.9","ABCDE.42927.6","ENST00000265393.10",
"ABCDE.43577.3","ENST00000370826.3")
We can use a capture group inside a single gsub call
gsub("(^ENST\\d+)\\.\\d+", "\\1", df[, 1])
#[1] "ENST00000338167" "ABCDE.42927.6" "ENST00000265393" "ABCDE.43577.3"
#[5] "ENST00000370826"
Sample data
df <- read.table(text =
"ENST00000338167.9
ABCDE.42927.6
ENST00000265393.10
ABCDE.43577.3
ENST00000370826.3", header = F)
We can use data.table to specify the logical condition in i while updating the column in j
library(data.table)
setDT(df)[grepl("^ENST", Col1), Col1 := sub("\\.[^.]+$", "", Col1)]
df
# Col1
#1: ENST00000338167
#2: ABCDE.42927.6
#3: ENST00000265393
#4: ABCDE.43577.3
#5: ENST00000370826
data
df <- structure(list(Col1 = c("ENST00000338167.9", "ABCDE.42927.6",
"ENST00000265393.10", "ABCDE.43577.3", "ENST00000370826.3")), row.names = c(NA,
-5L), class = "data.frame")
We can use startsWith and sub combination:
Data:
df=read.table(text="ENST00000338167.9
ABCDE.42927.6
ENST00000265393.10
ABCDE.43577.3
ENST00000370826.3",header=F)
# if string starts with ENST then remove everything after . (dot) in the
# string else print the string as it is.
ifelse(startsWith(as.character(df[,1]),"ENST"),sub("\\..*", "", df$V1),
as.character(df[,1]))
Output:
[1] "ENST00000338167" "ABCDE.42927.6" "ENST00000265393" "ABCDE.43577.3" "ENST00000370826"
I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it's not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
Please note: I have to use gsub because the pattern has two alternatives, ^.*Work and \\.csv$; sub would only replace the first match, leaving the .csv suffix behind.
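To see the difference between the two (a sketch):

```r
x <- "abcdWorkstart.csv"
sub("^.*Work|\\.csv$", "", x)   # "start.csv" - only the first alternative is replaced
gsub("^.*Work|\\.csv$", "", x)  # "start"     - both alternatives are replaced
```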
For [\\s\\S] or [\\d\\D] style patterns (which do not work with the default engine of [g]?sub), see:
https://regex101.com/r/wFgkgG/1
They do work with akrun's approach (perl = TRUE):
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = T))
str1 <-
'12
.2
12'
gsub("[^.]", "m", str1, perl = TRUE)  # "mmm.mmmm"   - [^.] also matches \n
gsub(".", "m", str1, perl = TRUE)     # "mm\nmm\nmm" - in PCRE, . does not match \n
gsub(".", "m", str1, perl = FALSE)    # "mmmmmmmm"   - in TRE, . matches \n too
. also matches \n when using the default (TRE) engine in R.
Here is an option using regmatches/regexpr from base R. A regex lookaround matches all characters after the string 'Work' that are not a .; extract them with regmatches
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')