R: strsplit based on two conditions, keeping the delimiter

I am trying to split strings based on different criteria: some should be split before "traction" and some before "ramasse". I looked up the regex syntax rules for grepl but didn't really understand them.
A data frame called export has a column ref, which holds string values ending in either "traction" or "ramasse".
>export$ref
ref
[1] "62133130_074_traction"
[2] "62156438_074_ramasse"
[3] "62153874_070_ramasse"
[4] "62138861_074_traction"
I want to split each string value in the ref column into two parts:
ref R&T
[1] "62133130_074_" "traction"
[2] "62156438_074_" "ramasse"
[3] "62153874_070_" "ramasse"
[4] "62138861_074_" "traction"
What I tried (none of these worked):
strsplit(export$ref, c("traction", "ramasse"))
strsplit(export$ref, "\\_(?<=\\btraction)|\\_(?<=\\bramasse)", perl = TRUE)
strsplit(export$ref, "(?=['traction''ramasse'])", perl = TRUE)
Any help would be appreciated!

Here's a different approach:
strsplit(x, "_(?=[^_]+$)", perl = TRUE)
[[1]]
[1] "62133130_074" "traction"
[[2]]
[1] "62156438_074" "ramasse"
[[3]]
[1] "62153874_070" "ramasse"
[[4]]
[1] "62138861_074" "traction"
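This splits the vector at the last underscore: the lookahead (?=[^_]+$) only accepts an underscore ("_") that is followed, up to the end of the string, by one or more characters that are not underscores. Note that the underscore itself is consumed by the split, so it does not appear in either piece.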
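If the trailing underscore has to stay attached to the first piece, as in the desired output of the question, one possible sketch is to capture both pieces with regexec() instead of splitting. The column names "ref" and "type" are illustrative assumptions, not names from the question.

```r
# Sample data copied from the question
ref <- c("62133130_074_traction", "62156438_074_ramasse",
         "62153874_070_ramasse", "62138861_074_traction")

# Group 1: everything up to and including the last underscore
# Group 2: the literal suffix "traction" or "ramasse"
m <- regmatches(ref, regexec("^(.*_)(traction|ramasse)$", ref))

# Drop the full match (first element) and stack the two groups into a matrix
parts <- do.call(rbind, lapply(m, `[`, -1))
colnames(parts) <- c("ref", "type")  # hypothetical column names
parts
```

Because nothing is consumed as a delimiter here, the underscore remains part of the first column.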

Here is another option using stringr::str_split:
library(stringr)
str_split(ref, pattern = "_(?=[A-Za-z]+)", simplify = TRUE)
# [,1] [,2]
#[1,] "62133130_074" "traction"
#[2,] "62156438_074" "ramasse"
#[3,] "62153874_070" "ramasse"
#[4,] "62138861_074" "traction"
Sample data
ref <- c(
"62133130_074_traction",
"62156438_074_ramasse",
"62153874_070_ramasse",
"62138861_074_traction")

Related

How can I extract unit preceded by number with str_extract?

I think str_extract can do this, but I can't figure out how. My data contains Chinese characters, so there are no spaces between words. I simulated the data in English as:
> dd<-c("wwe12hours,fgg23days","ffgg12334hours,23days","ffff1days")
> target <- c("hours","days","hours","days")
> target
[1] "hours" "days" "hours" "days"
How can I achieve the target?
my real case is:
> dd <- c("腹痛发热12小时,再发2天","腹痛132324月,再发1天","发热4天")
> target <- c("小时","月","天")
> target
[1] "小时" "月" "天"
It seems you are looking for a regex to capture the units. Since your input is a vector of length three, it is preferable to return another vector of length three. From your English example it is not clear how you arrive at a target of 4 units; I assume you meant 5 (one per number), if not 3 (one per element).
Here is how you could tackle it. This should work for any language:
English:
gsub("\\p{L}*+\\d+", "", dd, perl = TRUE)
[1] "hours,days" "hours,days" "days"
Chinese (with ddc holding the Chinese vector from the question):
gsub("\\p{L}*+\\d+", "", ddc, perl = TRUE)
[1] "小时,天" "月,天" "天"
regmatches(ddc,gregexpr("(?<=\\d)\\p{L}+",ddc,perl = TRUE))
[[1]]
[1] "小时" "天"
[[2]]
[1] "月" "天"
[[3]]
[1] "天"
or if you want to use other packages:
using str_extract_all:
library(stringr)
str_extract_all(ddc,"(?<=\\d)\\p{L}+")
You could use str_match_all :
library(stringr)
unlist(sapply(str_match_all(dd, '\\d+(\\w+)'), function(x) x[, 2]))
#[1] "hours" "days" "hours" "days" "days"
This captures the first word that comes after a number.
where
str_match_all(dd, '\\d+(\\w+)') #returns
#[[1]]
# [,1] [,2]
#[1,] "12hours" "hours"
#[2,] "23days" "days"
#[[2]]
# [,1] [,2]
#[1,] "12334hours" "hours"
#[2,] "23days" "days"
#[[3]]
# [,1] [,2]
#[1,] "1days" "days"
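As mentioned by @Onyambu, we can use a lookbehind regex to avoid using sapply to subset the capture group. (Use [A-Za-z] rather than [A-z], because the range A-z also matches the characters [ \ ] ^ _ and the backtick.)
unlist(str_extract_all(dd,"(?<=\\d)[A-Za-z]+"))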
Base R solution:
cleaned_dd <- gsub(
  "[[:punct:]].*", "",
  unlist(lapply(strsplit(gsub("[[:digit:]]", " ", dd), "\\s+"), '[', -1))
)
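For comparison, here is a minimal base R sketch of the lookbehind idea applied to the English sample data. It keeps one list element per input string, so it is easy to see why the flattened result has five units rather than four. The variable name units is an assumption made for illustration.

```r
# English sample data copied from the question
dd <- c("wwe12hours,fgg23days", "ffgg12334hours,23days", "ffff1days")

# Match a run of letters that is immediately preceded by a digit
units <- regmatches(dd, gregexpr("(?<=[0-9])[[:alpha:]]+", dd, perl = TRUE))

lengths(units)  # number of units found in each input string
unlist(units)   # all units flattened into one character vector
```

Keeping the list form preserves the link between each unit and the string it came from, which is lost once unlist() is applied.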

How to replace space with "_" after last slash in a string with R

I have a list of strings, and for each string, I need to replace all spaces after the last slash with an "_". Here's a minimum reproducible example.
my_list <- list("abc/as 345/as df.pdf", "adf3344/aer4 ffsd.doc", "abc/3455/dfr.xls", "abc/3455/dfr serf_dff.xls", "abc/34 5 5/dfr 345 dsdf 334.pdf")
After doing the replacement, the result should be:
list("abc/as 345/as_df.pdf", "adf3344/aer4_ffsd.doc", "abc/3455/dfr.xls", "abc/3455/dfr_serf_dff.xls", "abc/34 5 5/dfr_345_dsdf_334.pdf")
I thought of matching the text after the last slash using regex, and then replace " " for "_", but didn't find a way to implement it.
It would be something like this:
gsub(pattern, "_", my_list),
in which pattern would be a regex that would be saying: match every space after the last slash (there is at least one slash in every element of the list).
You may use negative lookahead:
gsub(" (?!.*/.*)", "_", unlist(my_list), perl = TRUE)
# [1] "abc/as 345/as_df.pdf" "adf3344/aer4_ffsd.doc"
# [3] "abc/3455/dfr.xls" "abc/3455/dfr_serf_dff.xls"
# [5] "abc/34 5 5/dfr_345_dsdf_334.pdf"
Here we match and replace only those spaces that have no slash anywhere ahead of them, i.e. the spaces after the last slash.
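As a sketch of the same idea, the lookahead can also be written as (?![^/]*/), which reads "not followed by any run of non-slash characters and then a slash". This is an alternative spelling rather than the pattern from the answer; the expected vector below is copied from the question.

```r
# Sample data copied from the question
my_list <- list("abc/as 345/as df.pdf", "adf3344/aer4 ffsd.doc",
                "abc/3455/dfr.xls", "abc/3455/dfr serf_dff.xls",
                "abc/34 5 5/dfr 345 dsdf 334.pdf")

# Replace a space only when no slash occurs anywhere after it
fixed <- gsub(" (?![^/]*/)", "_", unlist(my_list), perl = TRUE)

# Desired result as given in the question
expected <- c("abc/as 345/as_df.pdf", "adf3344/aer4_ffsd.doc",
              "abc/3455/dfr.xls", "abc/3455/dfr_serf_dff.xls",
              "abc/34 5 5/dfr_345_dsdf_334.pdf")
identical(fixed, expected)
```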
You can use dirname, basename and file.path :
as.list(file.path(
dirname(unlist(my_list)),
gsub(" ", "_", basename(unlist(my_list)))
))
# [[1]]
# [1] "abc/as 345/as_df.pdf"
#
# [[2]]
# [1] "adf3344/aer4_ffsd.doc"
#
# [[3]]
# [1] "abc/3455/dfr.xls"
#
# [[4]]
# [1] "abc/3455/dfr_serf_dff.xls"
#
# [[5]]
# [1] "abc/34 5 5/dfr_345_dsdf_334.pdf"
or a bit more efficient and compact :
as.list(file.path(
dirname(. <- unlist(my_list)),
gsub(" ", "_", basename(.))
))
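The dirname/basename approach can also be wrapped in a small helper, which avoids the inline assignment to `.`. This is only a sketch: the function name underscore_basename is hypothetical, and it assumes every path contains at least one slash, as stated in the question.

```r
# Hypothetical helper: replace spaces with "_" in the file-name part only
underscore_basename <- function(paths) {
  file.path(dirname(paths), gsub(" ", "_", basename(paths), fixed = TRUE))
}

# Sample data copied from the question
my_list <- list("abc/as 345/as df.pdf", "adf3344/aer4 ffsd.doc",
                "abc/3455/dfr.xls", "abc/3455/dfr serf_dff.xls",
                "abc/34 5 5/dfr 345 dsdf 334.pdf")

renamed <- as.list(underscore_basename(unlist(my_list)))
renamed[[1]]
```

For a path without any slash, dirname() returns ".", so the helper would prepend "./" to the result; that case is outside the assumption above.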
Here's a thought. First, split by slash:
l2 <- strsplit(unlist(my_list), "/")
l2
# [[1]]
# [1] "abc" "as 345" "as df.pdf"
# [[2]]
# [1] "adf3344" "aer4 ffsd.doc"
# [[3]]
# [1] "abc" "3455" "dfr.xls"
# [[4]]
# [1] "abc" "3455" "dfr serf_dff.xls"
# [[5]]
# [1] "abc" "34 5 5" "dfr 345 dsdf 334.pdf"
Now we do a gsub on just the last element of each split-string, recombining with slashes:
mapply(function(a,i) paste(c(a[-i], gsub(" ", "_", a[i])), collapse="/"),
l2, lengths(l2), SIMPLIFY=FALSE)
# [[1]]
# [1] "abc/as 345/as_df.pdf"
# [[2]]
# [1] "adf3344/aer4_ffsd.doc"
# [[3]]
# [1] "abc/3455/dfr.xls"
# [[4]]
# [1] "abc/3455/dfr_serf_dff.xls"
# [[5]]
# [1] "abc/34 5 5/dfr_345_dsdf_334.pdf"
Here's a solution that uses the gsubfn package.
You use the regex (/[^/]+)$ to find the content following the last slash and you edit that content with a function that converts spaces to underscores.
library(gsubfn)
change_space_to_underscore <- function(x) gsub(x = x, pattern = "[[:space:]]+", replacement = "_")
gsubfn(x = my_list,
pattern = "(/[^/]+)$",
replacement = change_space_to_underscore)

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
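To make the difference concrete, the sketch below runs both patterns on the sample data side by side. The variable names consumed and kept are assumptions chosen for illustration.

```r
# Sample data copied from the question
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# The digit is part of the delimiter, so it is removed from the output
consumed <- strsplit(sample.text, split = "\\d\\s{1}")

# The lookbehind only checks for a digit; the delimiter is just the space
kept <- strsplit(sample.text, split = "(?<=\\d) ", perl = TRUE)

consumed[[1]]
kept[[1]]
```

A lookbehind is a zero-width assertion: it constrains where the match may occur without becoming part of the matched text that strsplit() throws away.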
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"

How to extract multiple substrings in a string using stringr regex

I have this string:
mystring <- "HMSC-bm_in_ALL_CELLTYPES.distal"
What I want to do is to extract the substring as defined
in this bracketing
[HMSC-bm]_in_ALL_CELLTYPES.[distal]
So in the end it will yield a vector with two values: HMSC-bm and distal. How can I do it? I tried this but failed:
> stringr::str_extract(base,"\\([\\w-]+\\)_in_ALL_CELLTYPES\\.\\([\\w+]\\)")
[1] NA
I'd use str_match:
library(stringr)
mymatch <- str_match(mystring, "^(.*?)_.*?\\.(.*?)$")
mymatch
[,1] [,2] [,3]
[1,] "HMSC-bm_in_ALL_CELLTYPES.distal" "HMSC-bm" "distal"
mymatch[, 2]
[1] "HMSC-bm"
mymatch[, 3]
[1] "distal"
We can split the string on the literal text _in_ALL_CELLTYPES. (fixed = TRUE stops the dot from being treated as a regex wildcard):
strsplit(mystring, split = "_in_ALL_CELLTYPES.", fixed = TRUE)[[1]]
[1] "HMSC-bm" "distal"
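A base R sketch of the capture-group idea, assuming the middle part of the string is always the literal text _in_ALL_CELLTYPES. and that both outer parts are non-empty:

```r
mystring <- "HMSC-bm_in_ALL_CELLTYPES.distal"

# Group 1: text before the literal middle part; group 2: text after it
m <- regmatches(mystring,
                regexec("^(.+)_in_ALL_CELLTYPES\\.(.+)$", mystring))

# Drop the full match and keep only the two captured substrings
pieces <- m[[1]][-1]
pieces
```

Unlike splitting, this returns an empty result when the string does not follow the expected layout, which makes malformed inputs easier to spot.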

Disappearing row after string split

I have a column of coordinates that I am splitting with strsplit() and removing unwanted characters from with gsub(). Note that there are 3034 rows.
> head(bike_parking$Geom)
[1] "(37.7606289177, -122.410647009)" "(37.752476948, -122.410625009)"
[3] "(37.7871729481, -122.402401009)" "(37.7776039475, -122.422764009)"
[5] "(37.7658325695, -122.46649784)" "(37.7693399479, -122.432820008)"
> length(bike_parking$Geom)
[1] 3034
> sum(is.na(bike_parking$Geom))
[1] 0
For some reason, after I run
dat <- data.frame(do.call(rbind, strsplit(as.vector(gsub("[()]", "", bike_parking$Geom)), split = ",")))
I am left with 3033. How did that happen and what steps do I take to figure out what went wrong?
> head(dat)
X1 X2
1 37.7606289177 -122.410647009
2 37.752476948 -122.410625009
3 37.7871729481 -122.402401009
4 37.7776039475 -122.422764009
5 37.7658325695 -122.46649784
6 37.7693399479 -122.432820008
> nrow(dat)
[1] 3033
It seems like your strings do not have the same structure everywhere. You will somehow have to know which structure they all have in common to split them properly. From the comments below the question, I derive that some strings may not contain a comma to split the coordinates. You can remove all commas and split the strings at the empty space instead. I'll post a solution in base R and a solution with the stringr-package.
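To find which rows are malformed, one option is to count the pieces each string produces before calling rbind(). The sketch below uses made-up data (the vector geom is hypothetical, not taken from bike_parking). It also shows one possible way a row can vanish, which is an assumption and not confirmed for the real data: an empty string splits into a zero-length vector, and rbind() ignores zero-length arguments.

```r
# Hypothetical data: one well-formed entry, one empty string, one without a comma
geom <- c("(37.76, -122.41)", "", "(37.75 -122.41)")

pieces <- strsplit(gsub("[()]", "", geom), split = ",")

# Number of pieces per input string; anything other than 2 is suspicious
n_pieces <- lengths(pieces)
bad_rows <- which(n_pieces != 2)
bad_rows

# rbind() silently skips the zero-length element, so one row goes missing
nrow(do.call(rbind, pieces))
```

Running which(lengths(...) != 2) on the real column would point to the offending row numbers, which can then be inspected directly.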
Option 1: Base R:
We can remove the parentheses and commas from your strings by using gsub(). Then we can split the strings at the space using strsplit(). The result will be:
splitted <- strsplit(gsub("[(),]", "", bike_parking$Geom), " ")
# [[1]]
# [1] "37.7606289177" "-122.410647009"
# [[2]]
# [1] "37.752476948" "-122.410625009"
# [[3]]
# [1] "37.7871729481" "-122.402401009"
# [[4]]
# [1] "37.7776039475" "-122.422764009"
# [[5]]
# [1] "37.7658325695" "-122.46649784"
# [[6]]
# [1] "37.7693399479" "-122.432820008"
We have to reorganise these results a bit to end up with two columns. Note that sapply() returns a character matrix here, so wrap the result in as.data.frame() if you need a data.frame:
sapply(1:2, function(x) sapply(splitted, `[[`, x))
# [,1] [,2]
# [1,] "37.7606289177" "-122.410647009"
# [2,] "37.752476948" "-122.410625009"
# [3,] "37.7871729481" "-122.402401009"
# [4,] "37.7776039475" "-122.422764009"
# [5,] "37.7658325695" "-122.46649784"
# [6,] "37.7693399479" "-122.432820008"
Option 2: stringr: This package contains a function str_split() (not strsplit()!) that allows you to skip the last step of the base R solution, because with simplify = TRUE it returns a character matrix directly instead of a list of vectors:
library(stringr)
str_split(gsub("[(),]", "", bike_parking$Geom), " ", simplify = TRUE)
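Both options return character values. If numeric coordinates are needed, a minimal base R sketch could look like the following; the column names lat and lon are assumptions, and the two sample strings are taken from the head() output in the question.

```r
# Two sample values copied from the question
geom <- c("(37.7606289177, -122.410647009)", "(37.752476948, -122.410625009)")

# Strip parentheses and commas, split on the space, and stack into a matrix
m <- do.call(rbind, strsplit(gsub("[(),]", "", geom), " ", fixed = TRUE))

# Convert to numeric columns; lat/lon are hypothetical names
coords <- data.frame(lat = as.numeric(m[, 1]), lon = as.numeric(m[, 2]))
str(coords)
```

as.numeric() returns NA with a warning for any piece that is not a valid number, which is another way to surface malformed rows.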

Resources