Extracting data from corrupted string - r

Hi, I have a data frame with a column that holds email addresses. Unfortunately, something went wrong and several of the email IDs have a numeric prefix separated by an underscore. These are the two patterns I have noticed.
Is there a way to extract the data after the underscore when processing from the left? Can some logic be built so that the script is smart enough to check whether there is one underscore or two? I can do this in Excel using the find() and right() functions, but I was wondering how to accomplish this in R.
For example:
product$email
83837_83838_abcd#gmail.com
83837_abcd#gmail.com
output
abcd#gmail.com
abcd#gmail.com

We can use sub
sub('.*_', '', str1)
#[1] "abcd#gmail.com" "abcd#gmail.com"
Or
library(stringr)
str_extract(str1, '[^_]+$')
data
str1 <- c('83837_83838_abcd#gmail.com', '83837_abcd#gmail.com')
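If an address could itself contain an underscore, stripping up to the last underscore would cut into it. The following is a minimal sketch, not part of the original answer, that removes only leading groups of digits followed by an underscore, so it handles one prefix or two. The data frame contents and the email_clean column name are assumptions made for illustration.

```r
# Hypothetical sample data; the third address contains an underscore of its own
product <- data.frame(
  email = c("83837_83838_abcd#gmail.com",
            "83837_abcd#gmail.com",
            "83837_ab_cd#gmail.com"),
  stringsAsFactors = FALSE
)

# "^([0-9]+_)+" matches one or more "digits + underscore" groups,
# but only at the start of the string
product$email_clean <- sub("^([0-9]+_)+", "", product$email)

product$email_clean
```

Anchoring the pattern with "^" keeps underscores inside the address intact, which sub('.*_', '', x) would not.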

Related

I want to extract the first string before "\" to create a new variable in a data.frame

System: Windows 10, R 3.6.2
I import the data from an EXCEL file into a data.frame. One variable has values like this:
What I want is to extract the data before the first "\", and create a new variable.
I tried split, str.split, str_extract, and gsub, and none of them worked. I think the main problem is the separator character, but I still don't know how to work around it. I would really appreciate it if anyone could help me with this.
Since you want to extract the first four characters of the string, which come before the "\" sign, one solution is to load the stringr library and extract the substring.
library(stringr)
str_sub(string, 1, 4)
Hope it helps!
You could use sub and remove everything after first backslash.
sub("\\\\.*", "", df$account)
Another option is to capture everything before first backslash.
sub("(.*?)\\\\.*", "\\1", df$account)
Regarding why you need four backslashes, read "How to escape backslashes in R string".
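The escaping can be seen in a small, self-contained sketch. The account values below are invented for illustration, since the original sample values are not shown, and prefix_regex and prefix_fixed are hypothetical column names. The second approach uses fixed = TRUE, which skips the regex layer and so needs only two backslashes in the source.

```r
# Hypothetical values; "\\" in R source is ONE literal backslash in the string
df <- data.frame(account = c("1234\\abc", "5678\\xyz\\q"),
                 stringsAsFactors = FALSE)

nchar("\\")  # the string holds a single character

# Regex version: the R string "\\\\" becomes the regex \\ ,
# which matches one literal backslash
df$prefix_regex <- sub("\\\\.*", "", df$account)

# Fixed-string version: split on a literal backslash, keep the first piece
df$prefix_fixed <- vapply(strsplit(df$account, "\\", fixed = TRUE),
                          function(parts) parts[1],
                          character(1))

df[, c("prefix_regex", "prefix_fixed")]
```

Unlike str_sub(string, 1, 4), neither version assumes that the part before the backslash is always four characters long.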

Removing part of strings within a column

I have a column within a data frame with a series of identifiers in, a letter and 8 numbers, i.e. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?
I know if there was just one thing I wanted to remove, like the B15, I could do;
sub("B15", "", df$col)
But I'm not sure how to remove a set number of characters/numbers (or even all subsequent characters after B15).
Thanks in advance :)
Welcome to SO! This is a case for regex. You can use base R as I show here, or look into the stringr package for handy tools that are easier to understand. You can also look up regex rules to help define what you want to match. For what you ask, you can use the following code example:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function that replaces every match of a pattern in one or more input strings. To test the regex I created the testStrings vector with a few different examples.
Breaking down the regex, "B15" is the literal text you're looking for. The "." means any character and the "{2}" says how many of those characters to match after "B15" (here exactly two). You can change it as you need. If you want to remove everything after "B15" as well, replace the pattern with "B15.*"; the "*" means zero or more of the preceding character, so ".*" matches everything to the end of the string.
edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern as so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has info on the different regex constructs you can use to be more specific.
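Applied to identifiers shaped like the ones in the question, a minimal sketch might look like this. The ids vector is made up for illustration; in practice it would be the data frame column (df$col in the question).

```r
# Hypothetical identifiers: a letter followed by 8 digits, plus one
# value where "B15" is not at the start
ids <- c("B15006788", "B16001234", "B15999999", "XB150000")

# "^B15.*" matches the whole string when it starts with B15,
# so the entire cell is replaced by ""
cleaned <- sub("^B15.*", "", ids)

# The same result without regex, using a prefix test
cleaned_alt <- ifelse(startsWith(ids, "B15"), "", ids)

cleaned
```

Both versions leave "B16001234" and "XB150000" untouched, because only a "B15" at the very start of the string counts.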

Match everything up until first instance of a colon

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex looks fine to me, but I don't think you should use grep. You are also missing perl=TRUE, which is why you are getting the error: lookaheads such as (?=:) are only supported by the Perl-compatible engine.
I would recommend using :
stringr::str_extract( time, "\\d+?(?=:)")
grep works a little differently from how it is being used here. It is good for filtering a vector down to the elements that match a pattern, but it can't pluck a substring out of a string.
If you want to use Base R you can also go for sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
You can also split the string using strsplit and take the first piece, like below:
strsplit(time, ":")[[1]][1]
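For completeness, the original pattern can also be used as-is in base R by pairing regexpr with regmatches, which together extract the matched substring instead of filtering elements. This is a sketch of one possible approach rather than the answerer's own code; hour_lookahead and hour_sub are hypothetical variable names.

```r
time <- "12:05:41"

# regexpr() locates the first match; perl = TRUE enables the lookahead
m <- regexpr(".+?(?=:)", time, perl = TRUE)

# regmatches() pulls out the matched text
hour_lookahead <- regmatches(time, m)

# A simpler alternative: delete the first colon and everything after it
hour_sub <- sub(":.*", "", time)

hour_lookahead
hour_sub
```

Both calls return the text before the first colon. The sub version needs no lookahead and therefore no perl = TRUE.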

Repeating a regex pattern for date parsing

I have the following string
"31032017"
and I want to use regular expressions in R to get
"31.03.2017"
What is the best function to do it?
And a general question: how can I repeat the matched part, as in sed in bash? There, we use \1 to repeat the first matched part.
You need to put the single parts in round brackets like this:
sub("([0-9]{2})([0-9]{2})([0-9]{4})", "\\1.\\2.\\3", "31032017")
You can then use \\1 to access the part matched by the first group, \\2 for the second and so on.
Note that if your string is a date, there are better ways to parse / reformat it than directly using regex.
date_vector = c("31032017","28052017","04052022")
as.character(format(as.Date(date_vector, format = "%d%m%Y"), format = "%d.%m.%Y"))
#[1] "31.03.2017" "28.05.2017" "04.05.2022"
If you want to work/do math with dates, omit as.character.

Replacing all occurrences of a pattern in a string

I am used to running R with numbers and matrices, but when it comes to strings and characters I am lost. I want to analyze some data where the time is read into R as follows:
>my.time.char[1]
[1] "\"2011-10-05 15:55:00\""
I want to end up with a string containing only:
"2011-10-05 15:55:00"
Using the function sub() (which I barely understand...), I got the following result:
> sub("(\")","",my.time.char[1])
[1] "2011-10-05 15:55:00\""
This is closer to the format I am looking for, but I still need to get rid of the last two characters (\").
The second line from ?sub explains:
sub and gsub perform replacement of the first and all matches respectively.
which should tell you to use gsub instead.
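A minimal sketch of that suggestion, using the value from the question. The tz = "UTC" argument in the last step is an assumption, since the question does not say which time zone the data is in.

```r
# The string as read in: it contains two literal double-quote characters
my.time.char <- "\"2011-10-05 15:55:00\""

# sub() removes only the first quote
first_only <- sub("\"", "", my.time.char)

# gsub() removes every quote
clean <- gsub("\"", "", my.time.char)

# Optional next step: parse the cleaned string as a date-time
# (the time zone is assumed here)
parsed <- as.POSIXct(clean, tz = "UTC")

clean
```

The printed "\"" in the original output is a single quote character shown with its escape, so there is one character to remove at each end, not two.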
