I think str_extract can do this, but I can't figure out how. My data contains Chinese characters, so there is no whitespace between words. I simulated the data in English as:
> dd<-c("wwe12hours,fgg23days","ffgg12334hours,23days","ffff1days")
> target <- c("hours","days","hours","days")
> target
[1] "hours" "days" "hours" "days"
How can I achieve the target?
My real case is:
> dd <- c("腹痛发热12小时,再发2天","腹痛132324月,再发1天","发热4天")
> target <- c("小时","月","天")
> target
[1] "小时" "月" "天"
It seems you are looking for a regex that captures the units. Since the input is a vector of length three, it would be natural to return a result of length three as well. From your English example it is not clear how you arrive at a target of 4 units; the input contains 5 units (or 3, if you want one result per string).
Here is how you could tackle it. This should work for any language, since \p{L} matches any Unicode letter:
English:
gsub("\\p{L}*+\\d+", "", dd, perl = TRUE)
[1] "hours,days" "hours,days" "days"
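Chinese (with the Chinese vector from the question stored as ddc):
gsub("\\p{L}*+\\d+", "", ddc, perl = TRUE)
[1] "小时,天" "月,天" "天"
To get each unit as a separate element, extract the letters that directly follow a digit:
regmatches(ddc,gregexpr("(?<=\\d)\\p{L}+",ddc,perl = TRUE))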
[[1]]
[1] "小时" "天"
[[2]]
[1] "月" "天"
[[3]]
[1] "天"
Or, if you want to use another package, with str_extract_all from stringr:
library(stringr)
str_extract_all(ddc,"(?<=\\d)\\p{L}+")
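If one flat vector of units is wanted rather than a list, the list can be flattened with unlist(). The following is a minimal base R sketch on the simulated English data, reusing the lookbehind pattern from above; the variable names units_list, units and per_string are only illustrative.

```r
# Simulated data from the question
dd <- c("wwe12hours,fgg23days", "ffgg12334hours,23days", "ffff1days")

# Letters (\p{L}+) that directly follow a digit; perl = TRUE enables the lookbehind
units_list <- regmatches(dd, gregexpr("(?<=\\d)\\p{L}+", dd, perl = TRUE))

# Flatten to one vector and record how many units each input string contributed
units <- unlist(units_list)
per_string <- lengths(units_list)
```

Here units holds five elements in total, and per_string shows that the three input strings contribute two, two and one unit respectively.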
You could use str_match_all:
library(stringr)
unlist(sapply(str_match_all(dd, '\\d+(\\w+)'), function(x) x[, 2]))
#[1] "hours" "days" "hours" "days" "days"
This captures the word that follows each number. The intermediate result looks like this:
str_match_all(dd, '\\d+(\\w+)') #returns
#[[1]]
# [,1] [,2]
#[1,] "12hours" "hours"
#[2,] "23days" "days"
#[[2]]
# [,1] [,2]
#[1,] "12334hours" "hours"
#[2,] "23days" "days"
#[[3]]
# [,1] [,2]
#[1,] "1days" "days"
As mentioned by @Onyambu, we can use a lookbehind regex to avoid using sapply to subset the capture group. Note that this pattern matches ASCII letters only:
unlist(str_extract_all(dd,"(?<=\\d)[A-Za-z]+"))
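One caveat when restricting the match to ASCII letters: the range [A-z] is not equivalent to [A-Za-z]. Under PCRE, ranges follow code-point order, and a few punctuation characters sit between "Z" and "a". A small sketch to check this (the names wide and strict are just for illustration):

```r
# Code-point ranges under PCRE (perl = TRUE):
# "A-z" spans 0x41..0x7A, which also covers [ \ ] ^ _ ` between "Z" and "a"
probe <- c("_", "^", "a")

wide   <- grepl("[A-z]", probe, perl = TRUE)     # matches the punctuation too
strict <- grepl("[A-Za-z]", probe, perl = TRUE)  # letters only
```

Neither class matches Chinese characters, so for the real data \\p{L} is the safer choice.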
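Base R solution:
# Replace digits with spaces, split on whitespace, drop the leading text of
# each string, then strip everything from the first punctuation mark onward
cleaned_dd <- gsub(
  "[[:punct:]].*", "",
  unlist(lapply(strsplit(gsub("[[:digit:]]", " ", dd), "\\s+"), `[`, -1))
)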
I am trying to split sentences based on different criteria. I want to split some sentences after "traction" and some after "ramasse". I looked up the regex rules for grepl but didn't really understand them.
A data frame called export has a column ref, whose string values end with either "traction" or "ramasse".
>export$ref
ref
[1] "62133130_074_traction"
[2] "62156438_074_ramasse"
[3] "62153874_070_ramasse"
[4] "62138861_074_traction"
I want to split the string values in the ref column into two parts:
ref R&T
[1] "62133130_074_" "traction"
[2] "62156438_074_" "ramasse"
[3] "62153874_070_" "ramasse"
[4] "62138861_074_" "traction"
What I tried (none of these worked):
strsplit(export$ref, c("traction", "ramasse"))
strsplit(export$ref, "\\_(?<=\\btraction)|\\_(?<=\\bramasse)", perl = TRUE)
strsplit(export$ref, "(?=['traction''ramasse'])", perl = TRUE)
Any help would be appreciated!
Here's a different approach, where x is the character vector to split (export$ref in the question):
strsplit(x, "_(?=[^_]+$)", perl = TRUE)
[[1]]
[1] "62133130_074" "traction"
[[2]]
[1] "62156438_074" "ramasse"
[[3]]
[1] "62153874_070" "ramasse"
[[4]]
[1] "62138861_074" "traction"
This means: split each string at an underscore ("_") that is followed only by non-underscore characters up to the end of the string, i.e. at the last underscore.
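The desired output in the question keeps the trailing underscore on the first part, which strsplit() consumes. A possible alternative is to build the two parts with sub() instead of splitting. This is only a sketch; the column names ref and type are assumptions, not taken from the question.

```r
ref <- c("62133130_074_traction", "62156438_074_ramasse",
         "62153874_070_ramasse", "62138861_074_traction")

# Drop the trailing run of non-underscore characters, keeping the last "_"
prefix <- sub("[^_]+$", "", ref)

# Drop everything up to and including the last "_" (".*" is greedy)
suffix <- sub(".*_", "", ref)

# Column names here are illustrative only
out <- data.frame(ref = prefix, type = suffix, stringsAsFactors = FALSE)
```

Because both patterns are anchored on the last underscore, this does not depend on the suffix being exactly "traction" or "ramasse".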
Here is another option using stringr::str_split:
library(stringr)
str_split(ref, pattern = "_(?=[A-Za-z]+)", simplify = TRUE)
# [,1] [,2]
#[1,] "62133130_074" "traction"
#[2,] "62156438_074" "ramasse"
#[3,] "62153874_070" "ramasse"
#[4,] "62138861_074" "traction"
Sample data
ref <- c(
"62133130_074_traction",
"62156438_074_ramasse",
"62153874_070_ramasse",
"62138861_074_traction")
I have this data:
names <- c("Baker, Chet", "Jarret, Keith", "Miles Davis")
I want to manipulate it so that the first name comes first, so I split it:
names <- strsplit(names, ", ")
[[1]]
[1] "Baker" "Chet"
[[2]]
[1] "Jarret" "Keith"
[[3]]
[1] "Miles Davis"
The problem is that when I put them back together, the name "Miles Davis" comes out wrong, because it is already a full name and yields only one element, so the matrix recycles "Baker" to fill the gap.
matrix(unlist(names), ncol=2, byrow = TRUE)
[,1] [,2]
[1,] "Baker" "Chet"
[2,] "Jarret" "Keith"
[3,] "Miles Davis" "Baker"
What should I do to create a new data frame that looks like this:
"Chet Baker"
"Keith Jarret"
"Miles Davis"
Here's the reference: http://rfunction.com/archives/1499
You can easily adapt the pattern used in the regular expression so that it matches either a comma followed by 0+ spaces or 1+ spaces:
names <- strsplit(names, ",\\s*|\\s+")
matrix(unlist(names), ncol=2, byrow = TRUE)
# [,1] [,2]
#[1,] "Baker" "Chet"
#[2,] "Jarret" "Keith"
#[3,] "Miles" "Davis"
Since the desired result is different from what was initially described, here's a different approach:
names <- strsplit(names, ",\\s*")
data.frame(name = sapply(names, function(x) paste(rev(x), collapse = " ")))
# name
#1 Chet Baker
#2 Keith Jarret
#3 Miles Davis
Another option, using capture groups in a regular expression to swap everything before the comma with everything after the comma and replace the comma with a space.
names <- c("Baker, Chet", "Jarret, Keith", "Miles Davis")
sub("([^,]+),\\s*([^,]+)$", "\\2 \\1", names)
#[1] "Chet Baker" "Keith Jarret" "Miles Davis"
Another regex solution:
gsub("(\\w+), (\\w+)", "\\2 \\1", names)
# [1] "Chet Baker" "Keith Jarret" "Miles Davis"
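The two regex solutions above behave differently once a name contains characters outside \w, such as a hyphen. The sketch below compares them; "Montgomery-Smith, Wes" is a made-up entry used only to show the difference, and the result names with_w and with_neg are illustrative.

```r
# "Montgomery-Smith, Wes" is a hypothetical entry, not from the question
names <- c("Baker, Chet", "Montgomery-Smith, Wes", "Miles Davis")

# \w+ stops at the hyphen, so only "Smith, Wes" is swapped
with_w <- gsub("(\\w+), (\\w+)", "\\2 \\1", names, perl = TRUE)

# [^,]+ takes everything on each side of the comma
with_neg <- sub("([^,]+),\\s*([^,]+)$", "\\2 \\1", names)
```

With the \w+ pattern the hyphenated surname is split in the wrong place, so the negated class [^,]+ is probably the more robust choice for real name lists.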
I have this string:
mystring <- "HMSC-bm_in_ALL_CELLTYPES.distal"
What I want to do is extract the substrings marked by the brackets here:
[HMSC-bm]_in_ALL_CELLTYPES.[distal]
So in the end it will yield a vector with two values: HMSC-bm and distal. How can I do it? I tried this but failed:
> stringr::str_extract(base,"\\([\\w-]+\\)_in_ALL_CELLTYPES\\.\\([\\w+]\\)")
[1] NA
I'd use str_match:
library(stringr)
mymatch <- str_match(mystring, "^(.*?)_.*?\\.(.*?)$")
mymatch
[,1] [,2] [,3]
[1,] "HMSC-bm_in_ALL_CELLTYPES.distal" "HMSC-bm" "distal"
mymatch[, 2]
[1] "HMSC-bm"
mymatch[, 3]
[1] "distal"
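We can split the string on the literal text "_in_ALL_CELLTYPES."; fixed = TRUE keeps the dot from being treated as a regex wildcard:
strsplit(mystring, split = "_in_ALL_CELLTYPES.", fixed = TRUE)[[1]]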
[1] "HMSC-bm" "distal"
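The capture-group idea can also be written in base R with regexec() and regmatches(), without loading stringr. This is a sketch under the assumption that the first part never contains an underscore and the last part never contains a dot; the pattern is a variation on the one above, not the same expression.

```r
mystring <- "HMSC-bm_in_ALL_CELLTYPES.distal"

# Group 1: text before the first "_"; group 2: text after the last "."
pattern <- "^([^_]+)_.*\\.([^.]+)$"
m <- regmatches(mystring, regexec(pattern, mystring))[[1]]

# m[1] is the full match; the capture groups follow
parts <- m[-1]
```

parts is then a character vector with the two requested substrings.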
I have a column of coordinates that I am splitting with strsplit() and removing unwanted characters from with gsub(). Note that there are 3034 rows.
> head(bike_parking$Geom)
[1] "(37.7606289177, -122.410647009)" "(37.752476948, -122.410625009)"
[3] "(37.7871729481, -122.402401009)" "(37.7776039475, -122.422764009)"
[5] "(37.7658325695, -122.46649784)" "(37.7693399479, -122.432820008)"
> length(bike_parking$Geom)
[1] 3034
> sum(is.na(bike_parking$Geom))
[1] 0
For some reason, after I run
dat <- data.frame(do.call(rbind, strsplit(as.vector(gsub("[()]", "", bike_parking$Geom)), split = ",")))
I am left with 3033. How did that happen and what steps do I take to figure out what went wrong?
> head(dat)
X1 X2
1 37.7606289177 -122.410647009
2 37.752476948 -122.410625009
3 37.7871729481 -122.402401009
4 37.7776039475 -122.422764009
5 37.7658325695 -122.46649784
6 37.7693399479 -122.432820008
> nrow(dat)
[1] 3033
It seems your strings do not all have the same structure. To split them reliably you need to know which structure they have in common. From the comments below the question, I gather that some strings may not contain a comma between the coordinates. You can remove all commas and split the strings at the space instead. I'll post a solution in base R and one with the stringr package.
Option 1: Base R:
We can remove the parentheses and commas from your strings by using gsub(). Then we can split the strings at the space using strsplit(). The result will be:
splitted <- strsplit(gsub("[(),]", "", bike_parking$Geom), " ")
# [[1]]
# [1] "37.7606289177" "-122.410647009"
# [[2]]
# [1] "37.752476948" "-122.410625009"
# [[3]]
# [1] "37.7871729481" "-122.402401009"
# [[4]]
# [1] "37.7776039475" "-122.422764009"
# [[5]]
# [1] "37.7658325695" "-122.46649784"
# [[6]]
# [1] "37.7693399479" "-122.432820008"
We have to reorganise these results a bit, so you end up with a character matrix with two columns (wrap it in data.frame() if you need a data frame):
sapply(1:2, function(x) sapply(splitted, `[[`, x))
# [,1] [,2]
# [1,] "37.7606289177" "-122.410647009"
# [2,] "37.752476948" "-122.410625009"
# [3,] "37.7871729481" "-122.402401009"
# [4,] "37.7776039475" "-122.422764009"
# [5,] "37.7658325695" "-122.46649784"
# [6,] "37.7693399479" "-122.432820008"
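The matrix above still holds character strings. A possible follow-up step, sketched here on the first three coordinates from the question, is to bind the pieces and convert them to numeric columns. The column names lat and lon are assumptions; the question does not name them.

```r
# First three coordinates from the question
geom <- c("(37.7606289177, -122.410647009)",
          "(37.752476948, -122.410625009)",
          "(37.7871729481, -122.402401009)")

# Strip parentheses and commas, then split at the space
splitted <- strsplit(gsub("[(),]", "", geom), " ")

# One row per input string, one column per coordinate
coords <- do.call(rbind, splitted)

# lat/lon are assumed column names
dat <- data.frame(lat = as.numeric(coords[, 1]),
                  lon = as.numeric(coords[, 2]))
```

This assumes every string yields exactly two pieces; rows that do not should be inspected before binding.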
Option 2: stringr: This package contains a function str_split() (not strsplit()!) that lets you skip the last step of the base R solution, because with simplify = TRUE it returns a character matrix directly instead of a list of vectors:
library(stringr)
str_split(gsub("[(),]", "", bike_parking$Geom), " ", simplify=TRUE)
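As for how to find out what went wrong, one way is to count the pieces each row produces under the original comma split and look at the rows that do not give exactly two. A minimal sketch; the second entry is a hypothetical malformed row invented for illustration, since the real offending row is not shown in the question.

```r
geom <- c("(37.7606289177, -122.410647009)",
          "(37.752476948 -122.410625009)",   # hypothetical malformed row: no comma
          "(37.7871729481, -122.402401009)")

# Same split as in the question: remove parentheses, split on the comma
pieces <- strsplit(gsub("[()]", "", geom), ",")

# Number of pieces per row; well-formed rows give exactly 2
n_pieces <- lengths(pieces)

# Indices and contents of the rows that break the pattern
bad <- which(n_pieces != 2)
bad_rows <- geom[bad]
```

Running the same check on bike_parking$Geom should point to the rows whose structure differs from the rest.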
Let's say I have a string:
fgjh=621729_&ioij_fgjh7=twenty-_-One-_-Forty
I want to extract the following substrings from this string:
1. "621729"
2. "twenty"
3. "One"
4. "Forty"
Basically, I want to extract everything after the "fgjh=" and "fgjh7=" substrings.
I've found that this formula works in excel:
=TRIM(RIGHT(SUBSTITUTE(A1,"fgjh=",REPT(" ",LEN(A1))),LEN(A1)))
But the Excel file is too large, and I need to perform the same operation in R.
How would I deal with leading and trailing characters? Let's say the string was "lmnop_82137_hhgia=77789_pasdk_ikuk_fgjh=621729_&ioij_fgjh7=twenty--One--Forty_dsaoij_882390=lkuk" and I need to extract the data after "fgjh=", i.e. 621729, and everything after "fgjh7=" to get only "twenty", "one" and "forty".
You could use the stringr package and its str_match function, for example, to parse out the interesting bits with a regular expression:
> library(stringr)
> s <- "fgjh=621729_&ioij_fgjh7=twenty--One--Forty"
> str_match(s, "^fgjh=([0-9]+)_&ioij_fgjh7=(.+)--(.+)--(.+)$")
[,1] [,2] [,3] [,4] [,5]
[1,] "fgjh=621729_&ioij_fgjh7=twenty--One--Forty" "621729" "twenty" "One" "Forty"
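Another option, with the input stored in string: extract everything after each "=" up to the next underscore, then split on "--":
library(stringr)
string <- "fgjh=621729_&ioij_fgjh7=twenty--One--Forty"
unlist(strsplit(str_extract_all(string,'(?<=\\=)([^_]+)')[[1]],'--'))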
[1] "621729" "twenty" "One" "Forty"
Using sub with a regular expression is more flexible than splitting by position:
> sub(".*=(.*)_&.*", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "621729"
> sub(".*=(.*)--.*--.*", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "twenty"
> sub(".*--(.*)--.*", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "One"
> sub(".*--(.*)$", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "Forty"
In one line:
strsplit(sub(".*=(.*)_&.*=(.*)--(.*)--(.*)", "\\1\\|\\2\\|\\3\\|\\4",
"fgjh=621729_&ioij_fgjh7=twenty--One--Forty" ), split="\\|")[[1]]
[1] "621729" "twenty" "One" "Forty"
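The patterns above match the short string exactly and would need adjusting for the longer string with leading and trailing fields. One possible base R approach is to anchor on the two keys with a lookbehind and read up to the next underscore. This is a sketch only: it assumes values never contain "_" and that "--" is the separator inside the "fgjh7=" value.

```r
s <- paste0("lmnop_82137_hhgia=77789_pasdk_ikuk_fgjh=621729_&ioij_",
            "fgjh7=twenty--One--Forty_dsaoij_882390=lkuk")

# Text right after "fgjh=" or "fgjh7=", up to (not including) the next "_"
vals <- regmatches(s, gregexpr("(?<=fgjh=|fgjh7=)[^_]+", s, perl = TRUE))[[1]]

# Split the multi-part value on the literal "--" separator
out <- unlist(strsplit(vals, "--", fixed = TRUE))
```

Other key=value pairs in the string, such as "hhgia=77789", are ignored because the lookbehind requires the exact key text directly before the value.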