Split names and create matrix in R

I have this data:
names <- c("Baker, Chet", "Jarret, Keith", "Miles Davis")
I want to manipulate it so the first name comes first, so I split it:
names <- strsplit(names, ", ")
[[1]]
[1] "Baker" "Chet"
[[2]]
[1] "Jarret" "Keith"
[[3]]
[1] "Miles Davis"
The problem is that, when I want to put them back together, the name "Miles Davis" comes out wrong, because it is already the full name.
matrix(unlist(names), ncol=2, byrow = TRUE)
[,1] [,2]
[1,] "Baker" "Chet"
[2,] "Jarret" "Keith"
[3,] "Miles Davis" "Baker"
What should I do to create a new data frame that looks like this:
"Chet Baker"
"Keith Jarret"
"Miles Davis"
Here's the reference: http://rfunction.com/archives/1499

You can easily adapt the pattern used in the regular expression so that it matches either a comma followed by 0+ spaces or 1+ spaces:
names <- strsplit(names, ",\\s*|\\s+")
matrix(unlist(names), ncol=2, byrow = TRUE)
# [,1] [,2]
#[1,] "Baker" "Chet"
#[2,] "Jarret" "Keith"
#[3,] "Miles" "Davis"
Since the desired result is different from what was initially described, here's a different approach:
names <- strsplit(names, ",\\s*")
data.frame(name = sapply(names, function(x) paste(rev(x), collapse = " ")))
# name
#1 Chet Baker
#2 Keith Jarret
#3 Miles Davis
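If you also want the two-column matrix from the original attempt, a hedged option is to pad the shorter splits with NA before binding them into rows (a minimal sketch; it assumes an NA in the second column is acceptable for single-name entries):
parts <- strsplit(c("Baker, Chet", "Jarret, Keith", "Miles Davis"), ",\\s*")
# pad each split to length 2 (missing pieces become NA), then bind the rows
t(sapply(parts, function(x) { length(x) <- 2; x }))
#     [,1]          [,2]
#[1,] "Baker"       "Chet"
#[2,] "Jarret"      "Keith"
#[3,] "Miles Davis" NA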
Another option uses capture groups in a regular expression to swap everything before the comma with everything after it, replacing the comma with a space.
names <- c("Baker, Chet", "Jarret, Keith", "Miles Davis")
sub("([^,]+),\\s*([^,]+)$", "\\2 \\1", names)
#[1] "Chet Baker" "Keith Jarret" "Miles Davis"

Another regex solution (names without a comma, such as "Miles Davis", don't match the pattern and are left unchanged):
gsub("(\\w+), (\\w+)", "\\2 \\1", names)
# [1] "Chet Baker" "Keith Jarret" "Miles Davis"

Related

How can I extract the unit preceded by a number with str_extract?

I think str_extract can do this, but I can't figure out how. My data contains Chinese characters, so there is no white space between them. I simulated the data in English as:
> dd<-c("wwe12hours,fgg23days","ffgg12334hours,23days","ffff1days")
> target <- c("hours","days","hours","days")
> target
[1] "hours" "days" "hours" "days"
How can I achieve the target?
My real case is:
> dd <- c("腹痛发热12小时,再发2天","腹痛132324月,再发1天","发热4天")
> target <- c("小时","月","天")
> target
[1] "小时" "月" "天"
It seems you are looking for a regex to capture the units. Since you have a vector of length three, we would prefer to return another vector of length three. From your example (the English one) it is not clear how you obtain a target of 4 units; I suspect you meant to have 5, if not 3.
Here is how you could tackle it. This can generally be used for any language:
English:
gsub("\\p{L}*+\\d+", "", dd, perl = TRUE)
[1] "hours,days" "hours,days" "days"
Chinese:
gsub("\\p{L}*+\\d+", "", dd, perl = TRUE)
[1] "小时,天" "月,天" "天"
To get the units as separate elements instead, you can extract them with a lookbehind (ddc here is the Chinese vector from the question):
regmatches(ddc, gregexpr("(?<=\\d)\\p{L}+", ddc, perl = TRUE))
[[1]]
[1] "小时" "天"
[[2]]
[1] "月" "天"
[[3]]
[1] "天"
Or, if you want to use other packages, with str_extract_all:
library(stringr)
str_extract_all(ddc,"(?<=\\d)\\p{L}+")
You could use str_match_all:
library(stringr)
unlist(sapply(str_match_all(dd, '\\d+(\\w+)'), function(x) x[, 2]))
#[1] "hours" "days" "hours" "days" "days"
This captures the first word that comes after a number.
where
str_match_all(dd, '\\d+(\\w+)') #returns
#[[1]]
# [,1] [,2]
#[1,] "12hours" "hours"
#[2,] "23days" "days"
#[[2]]
# [,1] [,2]
#[1,] "12334hours" "hours"
#[2,] "23days" "days"
#[[3]]
# [,1] [,2]
#[1,] "1days" "days"
As mentioned by @Onyambu, we can use a lookbehind regex to avoid using sapply to subset the capture group.
unlist(str_extract_all(dd,"(?<=\\d)[A-z]+"))
Base R solution:
# replace digits with spaces, split on whitespace, drop the leading letters
# before the first number, then strip everything from the first punctuation mark
cleaned_dd <- gsub("[[:punct:]].*", "",
                   unlist(lapply(strsplit(
                     gsub("[[:digit:]]", " ", dd), "\\s+"
                   ), '[', -1)))

Extracting all words and clusters of letters in a string and then making each word a separate piece of data using gsub() in R

Say we have:
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")
Expected output:
"Here" "we" "have" "words" "Here" "we" "have" "avwerfaf"
I would like to use gsub(), but other methods are definitely accepted. Thanks guys!
You can use strsplit:
result <- unlist(strsplit(stringTest, " |\\d"))
result[result != ""]
#> [1] "Here" "we" "have" "words" "Here" "we"
#> [7] "have" "avwerfaf"
or if you prefer a one-liner:
unlist(lapply(strsplit(stringTest, "\\W|\\d"), function(x) x[x != ""]))
library(tidyverse)
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")
gsub(" \\d", replacement = "", stringTest) %>%
str_split(pattern = " ") %>%
unlist()
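Note that the pattern " \\d" only removes a single digit, so a multi-digit number such as 42 would leave part of itself behind; a hedged tweak is to use " \\d+" instead:
gsub(" \\d+", replacement = "", stringTest) %>%
  str_split(pattern = " ") %>%
  unlist()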
This falls under the "another approach" category. What you appear to be doing is tokenizing by words, dropping numbers.
library(tokenizers)
unlist(tokenize_words(stringTest, lowercase = FALSE, strip_numeric = TRUE))
Which gives:
[1] "Here" "we" "have" "words" "Here" "we" "have" "avwerfaf"
If you are operating on a data frame, something like this could be useful.
library(dplyr)
library(tibble)   # for rowid_to_column()
library(tidytext)
df <- tibble(description = stringTest)
df2 <- df %>%
  rowid_to_column() %>%
  unnest_tokens(word, description, to_lower = FALSE, strip_numeric = TRUE)
Which returns a new tibble:
> df2
# A tibble: 8 x 2
rowid word
<int> <chr>
1 1 Here
2 1 we
3 1 have
4 1 words
5 2 Here
6 2 we
7 2 have
8 2 avwerfaf

String splitting with a stop character in R

My data is as follows:
"Louis Hamilton"
"Tiger Wolf"
"Sachin Tendulkar"
"Lebron James"
"Michael Shoemaker"
"Hollywood – Career as an Actor"
I need to extract all the characters until a space or a dash (-) is reached.
I need to extract no more than 10 characters.
My desired output is:
"Louis"
"Tiger"
"Sachin"
"Lebron"
"Michael"
"Hollywood"
I tried using the function below, but it didn't work:
Sportstars<-function(charvec)
{min.length < 10, (x, hyph.pattern = Null)}
Can anyone help, please?
We can use sub
sub("^([^- ]+).*", "\\1", v1)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Or another version with the length condition as well
grep("^.{1,10}$", sub("\\s+.*", "", v1), value = TRUE)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Or with word from stringr
library(stringr)
word(v1, 1)
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
Also, if we need to implement the last condition as well
sapply(strsplit(v1, "[– -]"), function(x) {
  x1 <- setdiff(x, "")
  x1[1][nchar(x1[1]) < 10]
})
#[1] "Louis" "Tiger" "Sachin" "Lebron" "Michael" "Hollywood"
data
v1 <- c( "Louis Hamilton", "Tiger Wolf", "Sachin Tendulkar",
"Lebron James", "Michael Shoemaker", "Hollywood – Career as an Actor")

How to replace spaces with "_" after the last slash in a string with R

I have a list of strings, and for each string, I need to replace all spaces after the last slash with an "_". Here's a minimum reproducible example.
my_list <- list("abc/as 345/as df.pdf", "adf3344/aer4 ffsd.doc", "abc/3455/dfr.xls", "abc/3455/dfr serf_dff.xls", "abc/34 5 5/dfr 345 dsdf 334.pdf")
After doing the replacement, the result should be:
list("abc/as 345/as_df.pdf", "adf3344/aer4_ffsd.doc", "abc/3455/dfr.xls", "abc/3455/dfr_serf_dff.xls", "abc/34 5 5/dfr_345_dsdf_334.pdf")
I thought of matching the text after the last slash with a regex and then replacing " " with "_", but I didn't find a way to implement it.
It would be something like this:
gsub(pattern, "_", my_list),
in which pattern would be a regex saying: match every space after the last slash (there is at least one slash in every element of the list).
You may use a negative lookahead:
gsub(" (?!.*/.*)", "_", unlist(my_list), perl = TRUE)
# [1] "abc/as 345/as_df.pdf" "adf3344/aer4_ffsd.doc"
# [3] "abc/3455/dfr.xls" "abc/3455/dfr_serf_dff.xls"
# [5] "abc/34 5 5/dfr_345_dsdf_334.pdf"
Here we match and replace every space that has no slash anywhere after it.
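Since my_list is a list and the desired result is also a list, the output can be wrapped back with as.list() (a minimal sketch):
as.list(gsub(" (?!.*/.*)", "_", unlist(my_list), perl = TRUE))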
You can use dirname, basename and file.path:
as.list(file.path(
  dirname(unlist(my_list)),
  gsub(" ", "_", basename(unlist(my_list)))
))
# [[1]]
# [1] "abc/as 345/as_df.pdf"
#
# [[2]]
# [1] "adf3344/aer4_ffsd.doc"
#
# [[3]]
# [1] "abc/3455/dfr.xls"
#
# [[4]]
# [1] "abc/3455/dfr_serf_dff.xls"
#
# [[5]]
# [1] "abc/34 5 5/dfr_345_dsdf_334.pdf"
Or a bit more efficient and compact:
as.list(file.path(
  dirname(. <- unlist(my_list)),
  gsub(" ", "_", basename(.))
))
Here's a thought. First, split by slash:
l2 <- strsplit(unlist(my_list), "/")
l2
# [[1]]
# [1] "abc" "as 345" "as df.pdf"
# [[2]]
# [1] "adf3344" "aer4 ffsd.doc"
# [[3]]
# [1] "abc" "3455" "dfr.xls"
# [[4]]
# [1] "abc" "3455" "dfr serf_dff.xls"
# [[5]]
# [1] "abc" "34 5 5" "dfr 345 dsdf 334.pdf"
Now we do a gsub on just the last element of each split-string, recombining with slashes:
mapply(function(a, i) paste(c(a[-i], gsub(" ", "_", a[i])), collapse = "/"),
       l2, lengths(l2), SIMPLIFY = FALSE)
# [[1]]
# [1] "abc/as 345/as_df.pdf"
# [[2]]
# [1] "adf3344/aer4_ffsd.doc"
# [[3]]
# [1] "abc/3455/dfr.xls"
# [[4]]
# [1] "abc/3455/dfr_serf_dff.xls"
# [[5]]
# [1] "abc/34 5 5/dfr_345_dsdf_334.pdf"
Here's a solution that uses the gsubfn package.
You use the regex (/[^/]+)$ to find the content following the last slash and you edit that content with a function that converts spaces to underscores.
library(gsubfn)
change_space_to_underscore <- function(x) gsub(x = x, pattern = "[[:space:]]+", replacement = "_")
gsubfn(x = my_list,
       pattern = "(/[^/]+)$",
       replacement = change_space_to_underscore)

Extract multiple parts of a string with R

I have two strings:
data = "Product Number: #76 in c (See Top 10 products in this department)"
data1 = "Product Number: #321,222 in Thin Base Pizzas (See Top 10 products in this department)"
Using str_match() in R, what would be the regex for the following results?
str_match(data, regex)
[,1] [,2] [,3]
[1,] "#76 in Fruit Juices " "76" "Fruit Juices "
str_match(data1, regex)
[,1] [,2] [,3]
[1,] "#321,222 in Thin Base Pizzas " "321,222" "Thin Base Pizzas "
You can use this regex to extract the information you need:
#([0-9,]+) in ([A-z ]+)
You can see it in action here: https://regex101.com/r/IM0wHV/1
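Plugged into str_match() as the question asks, it would look like the sketch below (shown on data1; note that [A-z] also matches a few punctuation characters that fall between Z and a in ASCII, so [A-Za-z ] is the safer class):
library(stringr)
str_match(data1, "#([0-9,]+) in ([A-Za-z ]+)")
#     [,1]                            [,2]      [,3]
#[1,] "#321,222 in Thin Base Pizzas " "321,222" "Thin Base Pizzas "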
Given your first comment, I think this will generalize to give you the product number.
sub(" .*", "", sub(".*#", "", data))
"76"
And this second one will give you whatever is between the "in" and the "(".
sub(" \\(.*", "", sub(".*[0-9]+ in ", "", data))
"Fruit Juices"
Not an ideal solution, but it's a working example you can take forward from here.
