I have a vector of strings in R that represent NCAA basketball players' names.
The names are displayed as below:
(Justin Ahrens\justin-ahrens-1
Kyle Ahrens\kyle-ahrens-1
...
Zavier Simpson\zavier-simpson-1)
How can I trim these names so I get "Justin Ahrens" and "Zavier Simpson" rather than what I have now?
I am assuming that different rows in your example correspond to different elements in the vector of strings. Then you can apply the following:
string_vector <- c("(Justin Ahrens\\justin-ahrens-1",
"Kyle Ahrens\\kyle-ahrens-1",
"Zavier Simpson\\zavier-simpson-1)")
gsub("\\(|\\)|\\\\.+$", "", string_vector)
Here, you are getting rid of any parentheses (\\(|\\)) and of the backslash together with everything that comes after it (\\\\.+$).
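To make the two branches of that pattern easier to follow, here is a minimal sketch that applies them as separate steps in base R. The names no_parens and player_names are only illustrative, and the input is assumed to be the same three-element vector as above.

```r
string_vector <- c("(Justin Ahrens\\justin-ahrens-1",
                   "Kyle Ahrens\\kyle-ahrens-1",
                   "Zavier Simpson\\zavier-simpson-1)")

# Step 1: remove literal parentheses; fixed = TRUE means no regex escaping is needed
no_parens <- gsub("(", "", string_vector, fixed = TRUE)
no_parens <- gsub(")", "", no_parens, fixed = TRUE)

# Step 2: drop the first backslash and everything after it
player_names <- sub("\\\\.*$", "", no_parens)
player_names
# [1] "Justin Ahrens"  "Kyle Ahrens"    "Zavier Simpson"
```

The single gsub() call does the same work in one pass; splitting it up is only useful for checking each piece.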
It depends on how your names are stored in the data, so I will offer a couple of possible solutions using stringr.
library(stringr)
## Case 1: names are vectorised
players1<- c("Justin Ahrens\\justin-ahrens-1",
"Kyle Ahrens\\kyle-ahrens-1",
"Zavier Simpson\\zavier-simpson-1")
# Use str_extract
players1 %>% str_extract("[^\\\\]+")
## Case 2: one long character string containing multiple names and newlines
players2 <- ("(Justin Ahrens\\justin-ahrens-1\nKyle Ahrens\\kyle-ahrens-1\nZavier Simpson\\zavier-simpson-1)")
# Split strings on new lines (alternatively "1" if no newline characters), then map str_extract
players2 %>% str_split("\n") %>%
purrr::map(~str_extract(.x,"[^(\\\\]+")) %>%
unlist
I have a pattern that starts with "pr" and may be followed by more "r"s, e.g., pr, prr, pr...r. I would like to split a string into its non-pattern pieces and ALL of its pattern pieces, without deleting the pattern. strsplit() does the split but deletes every pr...r match, whereas stringr::str_extract_all() extracts the matched pieces but drops the non-matching ones.
Is there a way to keep all of the pieces while singling out the ones that match the pattern?
x<-c("zprzzzprrrrrzpzr")
"z" "pr" "zzz" "prrrrr" "zpzr" # desired output; keep original character order
This is a bit hacky, but you can do one replacement that wraps the values you want in some separator character and then split on that separator. For example:
unlist(strsplit(gsub("(pr+)","~\\1~", x), "~"))
# [1] "z" "pr" "zzz" "prrrrr" "zpzr"
which will work fine if you don't have "~" in your string.
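If the string might contain "~", one alternative is to avoid a separator altogether. The following is a sketch in base R, not part of the answer above: regmatches() returns the matches, and with invert = TRUE it returns the pieces between them, so the two can be interleaved. The names hits, gaps and out are illustrative.

```r
x <- "zprzzzprrrrrzpzr"

m <- gregexpr("pr+", x)
hits <- regmatches(x, m)[[1]]                 # the matched pieces
gaps <- regmatches(x, m, invert = TRUE)[[1]]  # the pieces between matches

# gaps always has one more element than hits, so interleave gap, hit, gap, ...
out <- character(length(hits) + length(gaps))
out[seq(1, by = 2, length.out = length(gaps))] <- gaps
out[seq(2, by = 2, length.out = length(hits))] <- hits

# Drop empty pieces, which appear when a match sits at the start or end of x
out <- out[nzchar(out)]
out
# [1] "z"      "pr"     "zzz"    "prrrrr" "zpzr"
```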
Here is a way using stringr. I would hope there is a way to make this a bit more concise. The steps are:
1. Locate the pattern with str_locate_all()
2. Add one to all the end positions, so that we have split locations
3. Add the start and end positions to the vector to split correctly
4. Use the vectorized str_sub() to extract them all
library(stringr)
x <- c("zprzzzprrrrrzpzr")
locs <- str_locate_all(x, "(pr+)")[[1]]
locs[,2] <- locs[,2] + 1
locs_all <- sort(c(1, locs, nchar(x) + 1))
# Each piece runs from one split location to just before the next one
str_sub(x, head(locs_all, -1), tail(locs_all, -1) - 1)
# [1] "z" "pr" "zzz" "prrrrr" "zpzr"
Hello there: I currently have a list of file names (100s) which are separated by multiple "/" at certain points. I would like to find the last "/" in each name and replace it with "/Old". A quick example of what I have tried:
I have managed to do it for a single file name in the list but can't seem to apply it to the whole list.
Test<- "Cars/sedan/Camry"
To find the last "/" in the name, I tried the following:
library(stringr)  # provides the str_sub<- replacement function
Last <- tail(gregexpr("/", Test)[[1]], n = 1)
str_sub(Test, Last, Last) <- "/Old"
Which gives me
Test
[1] "Cars/sedan/OldCamry"
Which is exactly what I need, but I am having trouble applying tail and gregexpr to my whole list of names so that it handles them all at the same time.
Thanks for any help!
Apologies for my poor formatting; I am still adjusting.
If your file names are in a character vector you can use str_replace() from the stringr package for this:
items <- c(
"Cars/sedan/Camry",
"Cars/sedan/XJ8",
"Cars/SUV/Cayenne"
)
stringr::str_replace(items, pattern = "([^/]+$)", replacement = "Old\\1")
[1] "Cars/sedan/OldCamry" "Cars/sedan/OldXJ8" "Cars/SUV/OldCayenne"
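For comparison, here is a sketch of two base R equivalents, assuming the same items vector. The first uses sub() with a capture group; the second vectorises the tail()/gregexpr() idea from the question. The names renamed, last and renamed2 are illustrative.

```r
items <- c("Cars/sedan/Camry", "Cars/sedan/XJ8", "Cars/SUV/Cayenne")

# Option 1: match the last "/" plus the slash-free tail, then put the tail back after "/Old"
renamed <- sub("/([^/]*)$", "/Old\\1", items)

# Option 2: find the position of the last "/" in every element, then rebuild each string
last <- vapply(gregexpr("/", items, fixed = TRUE),
               function(pos) tail(pos, 1), integer(1))
renamed2 <- paste0(substr(items, 1, last), "Old",
                   substr(items, last + 1, nchar(items)))

renamed
# [1] "Cars/sedan/OldCamry"  "Cars/sedan/OldXJ8"    "Cars/SUV/OldCayenne"
```

Note that the two options differ for a name with no "/" at all: sub() leaves it unchanged, while option 2 would prefix it with "Old".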
Here is a stringi function as an alternative.
If your data frame is "df" and your text is in a column named "text":
library(dplyr)    # for %>% and mutate()
library(stringi)
df %>%
  mutate(new_text = stri_replace_last_fixed(text, "/", "/Old"))
How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the whole string (the data in its initial form) when it matches, and NA for the shorter strings. Thoughts?
Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021
I have no experience with R but I do know regular expressions. When I look at the pattern you're specifying "([[:digit:]]+)". I assume [[:digit:]] stands for [0-9], so you're capturing one group of digits.
It seems to me you're missing a + to make it capture multiple groups of digits.
I'd think you'd need to use "([[:digit:]]+)+".
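A quick check of that suggestion, sketched in base R on two of the sample values: repeating the group does not seem to help, because a regex match is always one contiguous stretch of the string, and the "." or "-" ends it. Collecting every run of digits with gregexpr() and pasting the runs together does give the combined number. The names ids, first_run, runs and combined are illustrative.

```r
ids <- c("1010.1-1", "1030-1")

# A repeated group still matches only the first contiguous run of digits
first_run <- regmatches(ids, regexpr("([[:digit:]]+)+", ids, perl = TRUE))
first_run
# [1] "1010" "1030"

# gregexpr() finds every run of digits; paste the runs back together per element
runs <- regmatches(ids, gregexpr("[[:digit:]]+", ids))
combined <- as.numeric(vapply(runs, paste, character(1), collapse = ""))
combined
# [1] 101011  10301
```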
I have a table like the one below, and I would like to extract the number that follows the last underscore:
description desired_output
desc_lvl1_id_1 1
desc_lvl1_id_2 2
The solution that I have come up with is split into two parts: first get the underscore and the number that I want, then take out the underscore: gsub("_", "", str_extract(description, "_[0-9]")). I'm hoping, though, that this can be done in one step.
We can use a positive lookbehind, (?<=_), as the pattern in str_extract so that it matches only the digits that directly follow a _.
library(stringr)
df1$desired_output <- as.numeric(str_extract(df1$description, '(?<=_)\\d+'))
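If stringr is not available, a base R sketch of the same idea is to capture the digits after the last underscore with sub(). The data frame below is a hypothetical reconstruction of the two example rows, and the pattern assumes every description ends in an underscore followed by digits.

```r
df1 <- data.frame(description = c("desc_lvl1_id_1", "desc_lvl1_id_2"),
                  stringsAsFactors = FALSE)

# The greedy ".*_" runs up to the last underscore; the group keeps the trailing digits
df1$desired_output <- as.numeric(sub("^.*_([0-9]+)$", "\\1", df1$description))
df1$desired_output
# [1] 1 2
```

A description that does not fit the pattern is returned unchanged by sub(), so as.numeric() would turn it into NA with a warning.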