Regex to remove a tail after a date - r

I've looked around, but I'm having trouble using a regex with the sub() function to remove the trailing "/Tues" part from a date variable.
All values in a$date look like this:
01/01/2017/Sun
01/03/2017/Tues
etc.
And I'm trying to do this:
sub(a$date,"*7/\\*","")
I'm sure I've just messed up the regex.
How do I escape the first two / and tell it only to delete the third one and everything after?

We can use
sub("\\/[^0-9]+$","", a$date)
#[1] "01/01/2017" "01/03/2017"
Or with substr
substr(a$date, 1, 10)
#[1] "01/01/2017" "01/03/2017"
data
a <- data.frame(date = c("01/01/2017/Sun", "01/03/2017/Tues"))
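To address the escaping question directly: a forward slash has no special meaning in an R regex, so it does not need a backslash at all. As a minimal sketch (assuming every value has the two-digit/two-digit/four-digit layout shown in the question), you can capture the date and drop the third slash and everything after it:

```r
# Sample data from the question; stringsAsFactors = FALSE keeps the column as character
a <- data.frame(date = c("01/01/2017/Sun", "01/03/2017/Tues"),
                stringsAsFactors = FALSE)

# Group 1 captures NN/NN/NNNN; the third slash and the rest are matched but not kept
date_only <- sub("^(\\d{2}/\\d{2}/\\d{4})/.*$", "\\1", a$date)
print(date_only)
```

Because the pattern is anchored with ^ and $, a value that does not follow the expected layout is returned unchanged rather than partially trimmed.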

Another solution is simply to truncate your string:
library("stringr")
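# truncate date after 10 characters; ellipsis = "" is needed because
# str_trunc() otherwise replaces the end of the string with "..."
a$date <- str_trunc(a$date, 10, ellipsis = "")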

Related

String Manipulation in R data frames

I just learnt R and was trying to clean the Amount_USD column of a table with string manipulation, using the code given below. I could not find out why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([0-9,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is negative lookahead to make sure that the next character is not 0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
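The same lookahead pattern can also be tried in base R, as a rough sketch that assumes the vec vector defined above. Here the lookahead is what stops the match from starting at the trailing 0 of the "\xa0" prefix, so this relies on the amount itself not beginning with 0:

```r
# Strings contain literal backslashes, as in the question
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# perl = TRUE is required for lookahead; regexpr() finds the first match per element
m <- regexpr("(?!0)[\\d,]+$", vec, perl = TRUE)
amounts <- regmatches(vec, m)
print(amounts)
```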
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one is two unneeded backslashes in the string you compare against. This also causes a counting error in your first str_sub(): the prefix is 8 characters long, not 10, so you should compare the first 8 characters. Finally, the substring you keep should start at the character right after the prefix (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD, 1, 8) == "\\xc2\\xa0",
                               str_sub(csv_file$Amount_USD, 9, -1),
                               csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
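If the doubled-up backslashes are hard to read, one option is fixed = TRUE, which makes sub() and gsub() treat the pattern as a literal string rather than a regex. A small self-contained sketch follows; the amount_raw vector is a stand-in for the real column, which I am assuming holds the literal text shown in the question:

```r
# Stand-in for csv_file2$Amount_USD (assumed to contain literal backslashes)
amount_raw <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# fixed = TRUE: the pattern is plain text, so only R's own string escaping applies
no_prefix <- sub("\\xc2\\xa0", "", amount_raw, fixed = TRUE)

# Drop the thousands separators, then convert to numeric
amount_num <- as.numeric(gsub(",", "", no_prefix, fixed = TRUE))
print(no_prefix)
print(amount_num)
```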

How to match any character existing between a pattern and a semicolon

I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
You could also try str_extract() from the stringr package.
If each element of your vector x holds one record, you can do:
# match a run of digits that is preceded by "sample_id="
str_extract(x, "(?<=\\bsample_id=)\\d+")
This extracts the first match from each element. If several records are collected in a single string, it becomes a tad more difficult, because the extraction has to continue after the first match. In that case use str_extract_all():
str_extract_all(x, "(?<=sample_id=)\\d+")
This extracts all of the numbers you're looking for and returns a list with one character vector of matches per input string. From there you can manipulate the list as you see fit.
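For comparison, the lookbehind idea can be sketched in base R without extra packages, assuming the same x vector as in the answer above. Note that regmatches() silently drops elements with no match, so the result can be shorter than the input if some strings lack a sample_id:

```r
x <- c("sample_id=10221108;gender=male",
       "tissue_id=23;sample_id=321108;gender=male",
       "treatment=no;tissue_id=98;sample_id=22")

# Lookbehind needs perl = TRUE; [^;]+ takes everything up to the next semicolon
m <- regexpr("(?<=\\bsample_id=)[^;]+", x, perl = TRUE)
sample_ids <- regmatches(x, m)
print(sample_ids)
```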

Using regular expression, how can I add elements after I find a match in r?

I have a column of digit strings that are either 10 or 11 characters long. This is an example of some of the values in the column:
column=c("5699420001","00409226602")
How can I place a hyphen after the first four digits (in the strings with 10 characters) and after the first five digits (in the strings with 11 characters), as well as after the second four digits for both lengths? Output is provided below. I wanted to use stringr for this.
column_standard=c("5699-4200-01","00409-2266-02")
Here's a solution using capture groups with stringr's str_replace() function:
library(stringr)
column <- c("5699420001","00409226602")
column_standard <- sapply(column, function(x) {
  ifelse(nchar(x) == 11,
         stringr::str_replace(x, "^([0-9]{5})([0-9]{4})(.*)", "\\1-\\2-\\3"),
         stringr::str_replace(x, "^([0-9]{4})([0-9]{4})(.*)", "\\1-\\2-\\3"))
})
column_standard
# 5699420001 00409226602
# "5699-4200-01" "00409-2266-02"
The code should be fairly self-explanatory. I can provide a detailed explanation upon request.
Try using this as your expression:
\b(\d{4,5})(\d{4})(\d{2}\b)
It sets up three capture groups that you can later use in your replacement to easily add hyphens between them.
Then you just replace with:
\1-\2-\3
Thanks to @Dunois for pointing out how it would look in code:
column_standard <- sapply(column, function(x) stringr::str_replace(x, "^(\\d{4,5})(\\d{4})(\\d{2})", "\\1-\\2-\\3"))
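Since the replacement functions are already vectorised, the sapply() wrapper is optional. A minimal base R sketch of the same three-group idea, assuming every value is exactly 10 or 11 digits (the $ anchor is my addition, so that other lengths are left untouched):

```r
column <- c("5699420001", "00409226602")

# Group 1: first 4 or 5 digits; group 2: next 4; group 3: last 2
column_standard <- sub("^([0-9]{4,5})([0-9]{4})([0-9]{2})$", "\\1-\\2-\\3", column)
print(column_standard)
```

With both anchors in place, the first group can only take 4 digits for a 10-character string and 5 digits for an 11-character string, so no length check is needed.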

regex single digit

I have a question which I think is solved by regex use in R.
I have a set of dates (as chr) which I would like in a different format (as chr).
I have tried to fool around with the below examples where the first (new_dates) gives the right format for months 1-9 and wrong for 10-12 and (new_dates2) gives the right format for 10-12 but nothing for 1-9.
I see that the code in the first case matches a single digit twice for 10-12, but don't really know how to tell it to match only single digit.
The final vector of correct dates shows the result I would like.
dates <- c("1/2016", "2/2016", "3/2016", "4/2016", "5/2016", "6/2016", "7/2016", "8/2016", "9/2016", "10/2016", "11/2016", "12/2016", "1/2017")
new_dates <- sub("(\\d)[:/:](\\d{4})","\\2M0\\1", dates)
new_dates2 <- sub("(\\d{2})[:/:](\\d{4})","\\2M\\1", dates)
correctdates <- c("2016M01", "2016M02", "2016M03", "2016M04", "2016M05", "2016M06", "2016M07", "2016M08", "2016M09", "2016M10", "2016M11", "2016M12", "2017M01")
Here's a base R method that will return the desired format:
format(as.Date(paste0("1/",dates), "%d/%m/%Y"), "%YM%m")
[1] "2016M01" "2016M02" "2016M03" "2016M04" "2016M05" "2016M06" "2016M07" "2016M08" "2016M09"
[10] "2016M10" "2016M11" "2016M12" "2017M01"
The idea is to first convert to a Date object and then use the format function to create the desired character representation. I pasted on 1/ so that a day is present in each element.
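As @a p o m said, it might be better to look for another solution if you are manipulating dates, but if you want to stick with regular expressions you can try this pattern:
([02-9]|1[0-2]?)[:\/](\d{4})
Note that a single replacement string cannot add the leading zero for one-digit months only, so pad those months first and then reorder the parts:
padded <- sub("^(\\d)/", "0\\1/", dates)
new_dates <- sub("^(\\d{2})/(\\d{4})$", "\\2M\\1", padded)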
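If you would rather avoid both Date conversion and regex, another possible sketch is to split on the slash and let sprintf() do the zero padding. The shortened dates vector below is only for illustration, and the intermediate names (parts, month, year) are my own:

```r
dates <- c("1/2016", "9/2016", "10/2016", "12/2016", "1/2017")

# Split "month/year" into its two pieces
parts <- strsplit(dates, "/", fixed = TRUE)
month <- as.integer(vapply(parts, `[`, character(1), 1))
year  <- vapply(parts, `[`, character(1), 2)

# %02d pads the month to two digits with a leading zero
new_dates <- sprintf("%sM%02d", year, month)
print(new_dates)
```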

How do I remove a specific sign like a comma partially from a data set

I have a data set like this:
Quest_main=c("quest2,","quest5,","quest4,","quest12,","quest4,","quest5,quest7")
And I would like to remove the trailing comma from, for example, "quest2," so that it becomes "quest2", but leave "quest5,quest7" untouched. I think I have to use substr or ifelse, but I am not sure. The final result should be this when I print Quest_main:
"quest2" "quest5" "quest4" "quest12" "quest4" "quest5,quest7"
Thanks!
All you need is
gsub(",$","",Quest_main)
The $ signifies the end of a string: for full explanation, see the (long and complicated) ?regexp, or a more general introduction to regular expressions, or search for the tags [r] [regex] on Stack Overflow.
If you insist on doing it with substr() and ifelse(), you can:
nc <- nchar(Quest_main)
lastchar <- substr(Quest_main,nc,nc)
ifelse(lastchar==",",substr(Quest_main,1,nc-1),
Quest_main)
With substring and ifelse:
ifelse(substring(Quest_main,nchar(Quest_main))==',',substring(Quest_main,1,nchar(Quest_main)-1),Quest_main)
Here's an alternative approach (just for general knowledge) using negative lookahead
gsub("(,)(?!\\w)", "", Quest_main, perl = TRUE)
## [1] "quest2" "quest5" "quest4" "quest12" "quest4" "quest5,quest7"
This approach is more general in case you want to delete commas not only from end of the word, but specify other conditions too
A more general solution is stringi's stri_trim_right, which also works in cases where Ben's or Jealie's solutions fail, for example when there are several commas at the end of the string that you want to get rid of:
Quest_main <- c("quest2,,,," ,"quest5,quest7,,,,")
Quest_main
#[1] "quest2,,,," "quest5,quest7,,,,"
library(stringi)
stri_trim_right(Quest_main, pattern = "[^,]")
#[1] "quest2" "quest5,quest7"
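For completeness, the repeated-comma case can also be handled in base R by adding a + quantifier to the end-anchored pattern. This is just a sketch on a made-up mix of the inputs used above:

```r
Quest_main <- c("quest2,", "quest12,", "quest5,quest7", "quest2,,,,", "quest5,quest7,,,,")

# ",+$" matches one or more commas sitting at the very end of the string
trimmed <- sub(",+$", "", Quest_main)
print(trimmed)
```

Commas in the middle of a string are never touched, because the match has to reach the end of the string.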
