Extract text with gsub - r

I am setting up an automated data analysis procedure and, more or less at the end of the procedure, I would like to extract automatically the name of the file that has been analysed. I have a data frame with a column containing names, with the following style:
Baseline/Cell_Line_2_KB_1813_B_Baseline
Dose 0001/Cell_Line_3_KB1720_1_0001
Dose 0010/Cell_Line_1_KB1810 mat_0010
I would like to extract just the characters in bold: "KB_1813_B", "KB1720_1" and "KB1810 mat" in a separate column.
I used gsub with the following command:
df$column.with.names <- gsub(".*KB|_.*", "KB", df$column.with.new.names)
I could easily remove the first part of the problem, but I am stuck trying to remove the second part. Is there some command in gsub to remove everything, starting from the end of the name, until you encounter a special character ( "_" in my case)?
Thank you :)

We can use str_extract
library(stringr)
str_extract(df$column.with.new.names, "KB_*\\d+[_ ]*[^_]*")
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
Or the same pattern can be captured as a group with sub
sub(".*(KB_*\\d+[_ ]*[^_]*).*", "\\1", df$column.with.new.names)
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
data
df <- data.frame(column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010"), stringsAsFactors = FALSE)

The way to do this is using regex groups:
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010")
gsub("^.+Cell_Line_._(.+)_.+$", "\\1", x)
[1] "KB_1813_B" "KB1720_1" "KB1810 mat"

Related

How to randomly reshuffle letters in words

I am trying to make a word scrambler in R. So i have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them to one word in R Tool. Does anyone know how can I solve this?
This is what I have done
enter image description here
Once you've split the words, you can use sample() to rescramble the letters, and then paste0() with collapse="", to concatenate back into a 'word'
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"
Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlistto get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")
You can use the sample function for this.
here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# split: Split will return a list with one char vector in it.
# We only want to interact with the vector not the list, so we extract the first
# (and only) element with "[[1]]"
jumble <- strsplit(yourword,"")[[1]]
jumble <- sample(jumble, # sample random element from jumble
size = length(jumble), # as many times as the length of jumble
# ergo all Letters
replace = FALSE # do not sample an element multiple times
)
restored <- paste0(jumble,
collapse = "" # bas
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps the understanding of what R is actually doing here.

Removing leading question marks from first two words of data frame entries in R

I have a large data frame in R with column "NameFull" holding a text string made up of two words (binomial scientific name), followed by author name(s) and initials. Both have been corrupted (presumably UTF translation issues). This means that in the binomials any leading "x" (indicating hybrids) has been replaced with "?". Unfortunately any non-standard characters in the author names have also been replaced with "?" so I cannot just replace all "?" with x.
I simply want to replace and leading "?" in the first two words with "x" (I will then have to manually compose a list of corrected author names to replace the corrupted ones, unless anyone has a bright idea on that!).
Example chunk of df:
df.corrupt <- data.frame(Bing = 1:6, FullName = c("?Anthematricaria dominii Rohlena", "?Anthemimatricaria inolens P.Fourn.", "?Anthemimatricaria maleolens P.Fourn.", "Achillea ?albinea Bjel?i? & K.Mal?", "Achillea carpatica B?ocki ex Dubovik", "Floscaldasia azorelloides Sklen ? & H.Rob."), Bang = 1:6)
I've tried to shoehorn it into regex but can't get close. Any help appreciated!
On my understanding, you want to replace ?only if it occurs in word-initial position in either the first or the second word; if that's correct this should work:
Data: (I've changed a few chars)
df.corrupt <- data.frame(Bing = 1:6,
FullName = c("?Anthematricaria dominii ?Rohlena",
"?Anthemimatricaria inolens P.Fourn.",
"?Anthemimatricaria maleolens ?P.Fourn.",
"Achillea ?albinea Bjel?i? & K.Mal?",
"Achillea carpatica B?ocki ex Dubovik",
"Floscaldasia azorelloides Sklen ? & H.Rob."), Bang = 1:6)
Solution:
library(stringr)
str_replace_all(df.corrupt$FullName, "^\\?|(?<=^(\\?)?\\b\\w{1,100}\\b\\s)\\?", "x")
[1] "xAnthematricaria dominii ?Rohlena" "xAnthemimatricaria inolens P.Fourn."
[3] "xAnthemimatricaria maleolens ?P.Fourn." "Achillea xalbinea Bjel?i? & K.Mal?"
[5] "Achillea carpatica B?ocki ex Dubovik" "Floscaldasia azorelloides Sklen ? & H.Rob."
This stringr solution puts x where ?occurs right at the start of the string (^) or (|) using positive lookbehind (i.e., a non-consuming capturing group) where it follows a whitespace char (\\s), which in turn follows a word boundary (\\b) following up to 100 \\w chars following a word boundary, following finally an optional ?
We can check for the ? that succeeds a space or at the start of the string, replace with 'x'
trimws(gsub("(^|\\s)\\?", " x", df.corrupt$FullName))

R: How can I Split and Trim Data Successfully

I've successfully split the data and removed the "," with the following code:
s = MSA_data$area_title
str_split(s, pattern = ",")
Result
[1] "Albany" " GA"
I need to trim this data, removing white space, however this places the comma back into the data which was initially removed.
"Albany, GA"
How can I successfully split and trim the data so that the result is:
[1] "Albany" "GA"
Thank you
An alternative is to use trimws function to trim the whitespace at the beginning and end of the string.
Result <- trimws(Result)
We just need to use zero or more spaces (\\s*) (the question OP asked) and this can be done in a single step
strsplit(MSA_data$area_title, pattern = ",\\s*")
If we are using the stringr, then make use of the str_trim
library(stringr)
str_trim(str_split("Albany, GA", ",")[[1]])
#[1] "Albany" "GA"

extract string before dash in R

I have a column for name, the format is mixed with AAA and AAA-D. I want to extract the name before dash (if it has dash) or keep the non-dashed name.
the list is
Name
W1-D1
Empty
W2-D1
what I want to extract are
Name
W1
Empty
W2
I found several syntaxes like v1<-gsub("^(.*?)-.*", "\\1",v) but this does not work on my list, I got this “c(\"W1" in v1 . Did I use this syntax wrong?
You can as well use stringr
library(stringr)
v2<-str_extract(v, "[^-]+")
The following regex will do it.
sub("(^[^-]+)-.*", "\\1", Name)
#[1] "W1" "Empty" "W2"
Data.
Name <- scan(what = character(), text ="
W1-D1
Empty
W2-D1
")

extracting names and numbers using regex

I think I might have some issues with understanding the regular expressions in R.
I need to extract phone numbers and names from a sample vector and create a data-frame with corresponding columns for names and numbers using stringr package functionality.
The following is my sample vector.
phones <- c("Ann 077-789663", "Johnathan 99656565",
"Maria2 099-65-6569 office")
The code that I came up with to extract those is as follows
numbers <- str_remove_all(phones, pattern = "[^0-9]")
numbers <- str_remove_all(numbers, pattern = "[a-zA-Z]")
numbers <- trimws(numbers)
names <- str_remove_all(phones, pattern = "[A-Za-z]+", simplify = T)
phones_data <- data.frame("Name" = names, "Phone" = numbers)
It doesn't work, as it takes the digit in the name and joins with the phone number. (not optimal code as well)
I would appreciate some help in explaining the simplest way for accomplishing this task.
Not a regex expert, however with stringr package we can extract a number pattern with optional "-" in it and replace the "-" with empty string to extract numbers without any "-". For names, we extract the first word at the beginning of the string.
library(stringr)
data.frame(Name = str_extract(phones, "^[A-Za-z]+"),
Number = gsub("-","",str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+")))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
If you want to stick completely with stringr we can use str_replace_all instead of gsub
data.frame(Name = str_extract(phones, "[A-Za-z]+"),
Number=str_replace_all(str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+"), "-",""))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
I think Ronak's answer is good for the name part, I don't really have a good alternative to offer there.
For numbers, I would go with "numbers and hyphens, with a word boundary at either end", i.e.
numbers = str_extract(phones, "\\b[-0-9]+\\b") %>%
str_remove_all("-")
# Can also specify that you need at least 5 numbers/hyphens
# in a row to match
numbers2 = str_extract(phones, "\\b[-0-9]{5,}\\b") %>%
str_remove_all("-")
That way, you're not locked into a fixed format for the number of hyphens that appear in the number (my suggested regex allows for any number).
If you (like me) prefer to use base-R and want to keep the regex as simple as possible you could do something like this:
phone_split <- lapply(
strsplit(phones, " "),
function(x) {
name_part <- grepl("[^-0-9]", x)
c(
name = paste(x[name_part], collapse = " "),
phone = x[!name_part]
)
}
)
phone_split
[[1]]
name phone
"Ann" "077-789663"
[[2]]
name phone
"Johnathan" "99656565"
[[3]]
name phone
"Maria2 office" "099-65-6569"
do.call(rbind, phone_split)
name phone
[1,] "Ann" "077-789663"
[2,] "Johnathan" "99656565"
[3,] "Maria2 office" "099-65-6569"

Resources