Regex for extracting string from csv before numbers - r

I'm very new to the regex world and would like to know how to extract strings using regex from a bunch of file names I've imported to R. My files follow the general format of:
testing1_010000.csv
check3_012000.csv
testing_checking_045880.csv
test_check2_350000.csv
And I'd like to extract everything before the 6 numbers.csv part, including the "_" to get something like:
testing1_
check3_
testing_checking_
test_check2_
If it helps, the pattern I essentially want to remove will always be 6 numbers immediately followed by .csv.
Any help would be great, thank you!

There are a few ways you could go about this. For example, match anything before a string of six digits followed by ".csv". For this one you would want the first capturing group.
/(.*)\d{6}\.csv/
https://regex101.com/r/MPH6mE/1/
Or match everything up to the last underscore character. For this one you would want the whole match.
.*_
https://regex101.com/r/4GFPIA/1
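In R, the two patterns above might be applied roughly like this. This is a minimal base R sketch, not the only way to do it; the vector name files is an assumption, and the dot before csv is escaped so that it only matches a literal period.

```r
files <- c("testing1_010000.csv", "check3_012000.csv",
           "testing_checking_045880.csv", "test_check2_350000.csv")

# Option 1: keep only the first capturing group
# (everything before the six digits and ".csv").
prefix_group <- sub("^(.*)\\d{6}\\.csv$", "\\1", files)

# Option 2: take the whole match of "everything up to the last underscore".
prefix_match <- regmatches(files, regexpr(".*_", files))

print(prefix_group)
print(prefix_match)
```

Note that backslashes are doubled inside R string literals, which is why \d becomes "\\d" here.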

Files = c("testing1_010000.csv", "check3_012000.csv",
"testing_checking_045880.csv", "test_check2_350000.csv")
sub("(.*_)[[:digit:]]{6}.*", "\\1", Files)
[1] "testing1_" "check3_" "testing_checking_"
[4] "test_check2_"

We can use stringr::str_match(). It will also work for a number of digits other than six.
library(tidyverse)
files <- c("testing1_010000.csv", "check3_012000.csv", "testing_checking_045880.csv", "test_check2_350000.csv")
str_match(files, '(.*_)\\d+\\.csv$')[, 2]
#> [1] "testing1_" "check3_" "testing_checking_"
#> [4] "test_check2_"
The regex can be interpreted as:
"capture everything before and including an underscore, that is then followed by one or more digits .csv as an ending"
Created on 2021-12-03 by the reprex package (v2.0.1)
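To illustrate the "one or more digits" behaviour without loading any packages, here is a small base R sketch using the same pattern. The file names are made up for the example, and non-matching names are set to NA to mirror what str_match() returns for them.

```r
# Hypothetical file names with varying numbers of trailing digits.
files <- c("testing1_01.csv", "check3_0120001.csv", "no_digits_.csv")

# Same pattern as above: capture up to the underscore, then one or more digits.
pattern <- "^(.*_)\\d+\\.csv$"
matched <- grepl(pattern, files)

# Only rewrite names that fit the pattern; mark the others as NA.
prefix <- ifelse(matched, sub(pattern, "\\1", files), NA_character_)
print(prefix)
```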

Using nchar:
Files = c("testing1_010000.csv", "check3_012000.csv",
"testing_checking_045880.csv", "test_check2_350000.csv")
substr(Files, 1, nchar(Files)-10)
OR
library(stringr)
str_remove(Files, "\\d{6}\\.csv$")
[1] "testing1_" "check3_" "testing_checking_"
[4] "test_check2_"

Related

Extract a part of a changeable string

I have a simple but yet complicated question (at least for me)!
I would like to extract a part of a string like in this example:
From this string:
name <- "C:/Users/admin/Desktop/test/plots/"
To this:
name <- "test/plots/"
The twist is that the names change, so it's not always "test/plots/"; it could be "abc/ccc/" or "m.project/plots/" and so on.
I imagine I would find the last two "/" in the string and cut out the text around them, but I have no idea how to do it!
Thank you for your help and time!
Without regex
Use str_split to split your path by /. Reverse the resulting pieces, take the first three (the trailing empty string plus the last two folder names), reverse them back, and paste them together with / using the collapse argument.
library(stringr)
name <- "C:/Users/admin/Desktop/m.project/plots/"
paste0(rev(rev(str_split(name, "/", simplify = TRUE))[1:3]), collapse = "/")
[1] "m.project/plots/"
With regex
Since your path could contain letters, numbers, or symbols, [^/]+/[^/]+/$ might be better: each [^/]+ matches any run of characters that are not /.
library(stringr)
str_extract(name, "[^/]+/[^/]+/$")
[1] "m.project/plots/"
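For anyone who prefers not to load stringr, the same pattern can probably be used with base R's regexpr() and regmatches(). A minimal sketch, assuming every path ends with a slash and has at least two folder levels:

```r
name <- c("C:/Users/admin/Desktop/m.project/plots/",
          "C:/Users/admin/Desktop/test/plots/")

# regexpr() locates the first match of the pattern in each element;
# regmatches() then pulls out the matched text.
last_two <- regmatches(name, regexpr("[^/]+/[^/]+/$", name))
print(last_two)
```

Be aware that regmatches() silently drops elements with no match, so the result can be shorter than the input if some paths do not fit the pattern.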
With {stringr}, assuming the path comprises folders with lower case letters and dots only. You could adjust the alternatives in the square brackets as required; for example, if directory names include a mix of upper and lower case letters, use [.A-Za-z] (not [.A-z], which also matches a few punctuation characters such as _ and ^).
Check a regex reference for options:
name <- c("C:/Users/admin/Desktop/m.project/plots/",
"C:/Users/admin/Desktop/test/plots/")
library(stringr)
str_extract(name, "[.a-z]+/[.a-z]+/$")
#> [1] "m.project/plots/" "test/plots/"
Created on 2022-03-22 by the reprex package (v2.0.1)

Extract dates in a complex string

I have a problem extracting dates from file names. In my example I have the file.name object:
file.name<- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif","RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
"VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")
I need to extract into a new object the specific dates 20190518, 20210107 and 20181018 from inside the file names. I can't use substr because the area names have different lengths (AZAMBUJAI002A, RINCAODOSSOARES051B and VILAPALMA33K), and I can't simply remove the letters either (because of the numeric area IDs 002, 051 and 33). The dates at the end, before ".tif" and separated by "_", are not useful information.
My desired output is:
mydates
[1] 2019-05-18
[2] 2021-01-07
[3] 2018-10-18
Is there any solution to the problem described? Thanks!!
Solution using base R functions. Works as long as the format is always "yyyymmdd" and the relevant string appears before the first underscore:
file.name<- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif",
"RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
"VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")
Using gsub twice: first (in the inner call) to get rid of everything after the first underscore, and then to extract the sequence of eight digits ([0-9]{8}):
dates <- gsub(".*([0-9]{8}).*", "\\1", gsub("^([^_]*)_.*", "\\1", file.name))
Finally, use as.Date to convert the strings to an R Date object (it can be re-cast to a string using format):
dates_as_actual_date <- as.Date(dates, format = "%Y%m%d")
dates_as_actual_date is an R Date object and looks like this:
[1] "2019-05-18" "2021-01-07" "2018-10-18"
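The same idea can be sketched in a single extraction step. This assumes that the first run of eight consecutive digits in each name is always the wanted date, which holds for the three example names but is not guaranteed in general; the variable names first_date and mydates are only illustrative.

```r
file.name <- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif",
               "RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
               "VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")

# regexpr() returns the first (leftmost) match only, which here is the
# eight-digit date that directly follows the area name.
first_date <- regmatches(file.name, regexpr("[0-9]{8}", file.name))

# Convert the yyyymmdd strings to Date objects.
mydates <- as.Date(first_date, format = "%Y%m%d")
print(mydates)
```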
Here is a way to extract using regex, assuming every year starts with 20:
library(stringr)
library(lubridate)
date_string <- str_extract(file.name,
"20\\d{2}[01]\\d[0-3]\\d")
date_string
#> [1] "20190518" "20210107" "20181018"
ymd(date_string)
#> [1] "2019-05-18" "2021-01-07" "2018-10-18"
Created on 2021-05-19 by the reprex package (v2.0.0)
library(lubridate)
ymd(gsub("(^.*_)(20[0-9]{2}_)([0-9]{2}_)([0-9]{2}_)(.*$)",
"\\2\\3\\4",
file.name))
ymd is a lubridate function that identifies YYYY-MM-DD dates, almost irrespective of the separator used.
gsub replaces the part of each string matched by the regex with the replacement. The regex inside:
(^.*_) is the first capture group. Takes anything from the beginning to an underscore.
(20[0-9]{2}_) is the second capture group. It takes a string that starts with 20 and is followed by any two digits and an underscore.
([0-9]{2}_) is the third and fourth capture groups. It takes two digits followed by an underscore.
(.*$) is the last (5th) capture group. Takes anything to the end of the string.
"\\2\\3\\4" returns the second, third and fourth capture groups.
EDIT:
The explanation of the code is still OK, but to retrieve the dates that come right after the area names, the code needed is this:
ymd(gsub("(^.*[A-Z])(20[0-9]{2})([0-9]{2})([0-9]{2})(.*$)",
"\\2\\3\\4",
file.name))
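To see what each capture group holds, the groups can be inspected one by one. The sketch below uses base R's regexec() on the first file name with the pattern from the EDIT; perl = TRUE is an assumption made here for predictable greedy matching, and the variable names are illustrative.

```r
x <- "AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif"
pattern <- "(^.*[A-Z])(20[0-9]{2})([0-9]{2})([0-9]{2})(.*$)"

# regexec() records the full match and every capture group.
groups <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# Element 1 is the whole match, so elements 3 to 5 are groups 2, 3 and 4
# (year, month, day).
date_parts <- groups[3:5]
print(date_parts)
print(as.Date(paste(date_parts, collapse = ""), format = "%Y%m%d"))
```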

How to delete all data after the third colon in strings/ substrings in R?

So I have a series of about 200,000 data points that look like this: DATA:abc:de123fg:12ghk8d and DATA:ghi:56kdv:128485hg. The only identifying data that I need to look at is before the third colon. I want to remove everything after the third colon so I can aggregate unique identifiers from the rest of the substring.
So far, I have attempted to use str_remove_all and gsub to remove everything after the third colon. The problem with this is that sometimes the data points are grouped together in the same string like this:
DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d
So str_remove_all is just removing the end of the last substring and it ends up looking like this:
DATA:ghi:56kdv:128485hg|DATA:abc:de123fg
Does anyone know how I can accomplish this task? Thanks in advance.
Here's an option in base R with regmatches and regexpr. As written, the pattern keeps the text up to the second colon of the first record in each string:
str <- c("DATA:abc:de123fg:12ghk8d", "DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d")
regmatches(str, regexpr("[^:]*:[^:]*", str))
#> [1] "DATA:abc" "DATA:ghi"
And the corresponding solution in stringr, if you prefer:
library(stringr)
str_extract(str, "[^:]*:[^:]*")
#> [1] "DATA:abc" "DATA:ghi"
Created on 2019-12-03 by the reprex package (v0.3.0)
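If the goal is to keep everything before the third colon in every "|"-separated record, one possible variation is a gsub() with a capture group. This is a sketch under the assumption that records are separated by "|" and that fields never contain ":" or "|"; the names trimmed and ids are illustrative.

```r
str <- c("DATA:abc:de123fg:12ghk8d",
         "DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d")

# Group 1 is the first three colon-separated fields of a record; the rest of
# the record (the third colon up to the next "|" or the end) is dropped.
trimmed <- gsub("((?:[^:|]*:){2}[^:|]*):[^|]*", "\\1", str, perl = TRUE)
print(trimmed)

# Split grouped records apart and keep each identifier once.
ids <- unique(unlist(strsplit(trimmed, "|", fixed = TRUE)))
print(ids)
```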

How can I target specific data frames based on pattern matching using Regex?

I have several data frames and I want to merge specific ones into a list for easier management.
As there are lots of data frames, I am using lapply() to perform the merge quickly.
My pattern matching (regex) element looks like this:
ls(pattern = "jan[0-9]")
which returns:
[1] "jan0000" "jan0000_0059" "jan0100" "jan0100_0159" "jan0200" "jan0200_0259" "jan0300" "jan0300_0359" "jan0400" "jan0400_0459" "jan0500"
[12] "jan0500_0559" "jan0600" "jan0600_0659" "jan0700" "jan0700_0759" "jan0800" "jan0800_0859" "jan0900" "jan0900_0959" "jan1000" "jan1000_1059"
[23] "jan1100" "jan1100_1159" "jan1200" "jan1200_1259" "jan1300" "jan1300_1359" "jan1400" "jan1400_1459" "jan1500" "jan1500_1559" "jan1600"
[34] "jan1600_1659" "jan1700" "jan1700_1759" "jan1800" "jan1800_1859" "jan1900" "jan1900_1959" "jan2000" "jan2000_2059" "jan2100" "jan2100_2159"
[45] "jan2200" "jan2200_2259" "jan2300" "jan2300_2359"
However, the problem is that I only want to extract the data frames whose names are 12 characters in length.
I have tried numerous things like searching for an exact length (the ones I am interested in are all the same length):
ls(pattern = "jan[0-9]{12}")
but it returns:
character(0)
Another way I think would work would be to search for anything that begins with jan, followed by four numbers and then an underscore. The problem is that I can't seem to get the regex expression to return any results.
What is the best way to achieve this?
It seems you may use
ls(pattern = "^jan[0-9]{4}_")
Details
^ - start of string
jan - a literal substring
[0-9]{4} - any four ASCII digits
_ - an underscore.
If you append [0-9]{4}$ you will restrict the pattern further to require 4 digits and end of string to the right of the underscore.
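Putting the pattern to work, the matching objects could be gathered into a named list with mget(). The sketch below is only illustrative: it creates a few hypothetical stand-in data frames in a separate environment (env) rather than in the global workspace, and uses the stricter pattern with the trailing [0-9]{4}$.

```r
# Hypothetical stand-ins for the real data frames, kept in their own environment.
env <- new.env()
for (nm in c("jan0000", "jan0000_0059", "jan0100", "jan0100_0159")) {
  assign(nm, data.frame(id = nm), envir = env)
}

# Keep only names of the form janNNNN_NNNN (12 characters in total).
wanted <- ls(envir = env, pattern = "^jan[0-9]{4}_[0-9]{4}$")

# mget() returns a named list with one element per matching object.
frames <- mget(wanted, envir = env)
print(names(frames))
```

With objects in the global workspace, the envir arguments could be replaced by the global environment.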

replacing the nth character in a string only if it is a particular character in R

I am importing a series of surveys as .csv files and combining them into one data set. The problem is that for one of the seven files some of the variables are importing slightly differently. The data set is huge and I would like to find a way to write a function to run over the dataset that is giving me trouble.
In some of the variables there is an underscore when there should be a dot. Not all variables are of the same format, but the ones that are incorrect are, in that the underscore is always the 6th element of the column name.
I want R to look at the 6th element and, if it is an underscore, replace it with a dot. Here is a made-up example:
col_names <- c("s1.help_needed",
"s1.Q2_im_stuck",
"s1.Q2.im_stuck",
"s1.Q3.regex",
"s1.Q3_regex",
"s2.Q1.is_confusing",
"s2.Q2.answer_please",
"s2.Q2_answer_please",
"s2.someone_knows_the answer",
"s3.appreciate_the_help")
I assume there is a regex answer to this but I am struggling to find one. Perhaps there is also a tidyr answer?
As @thelatemail pointed out, none of your data actually has underscores in the fifth position, but some have it in the sixth position (where others have a dot). A base R approach would be to use gsub():
result <- gsub("^(.{5})_", "\\1.", col_names)
> result
[1] "s1.help_needed" "s1.Q2.im_stuck"
[3] "s1.Q2.im_stuck" "s1.Q3.regex"
[5] "s1.Q3.regex" "s2.Q1.is_confusing"
[7] "s2.Q2.answer_please" "s2.Q2.answer_please"
[9] "s2.someone_knows_the answer" "s3.appreciate_the_help"
Here is an explanation of the regex:
^ from the start of the string
(.{5}) match AND capture any five characters
_ followed by an underscore
The quantity in parentheses is called a capture group and can be used in the replacement via \\1. So the regex is saying replace the first six characters with the five characters we captured but use a dot as the sixth character.
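The same idea can be generalised to any position. The helper below, replace_nth_underscore, is a hypothetical name introduced only for this sketch; it builds the pattern with sprintf() and assumes the character to replace is always an underscore.

```r
# Hypothetical helper: replace an underscore at position n with a dot.
replace_nth_underscore <- function(x, n) {
  # Build "^(.{n-1})_": capture the first n - 1 characters, then require "_".
  pattern <- sprintf("^(.{%d})_", n - 1)
  sub(pattern, "\\1.", x)
}

col_names <- c("s1.Q2_im_stuck", "s1.Q2.im_stuck", "s1.help_needed")
fixed <- replace_nth_underscore(col_names, 6)
print(fixed)
```

Names without an underscore at position n are returned unchanged, because sub() leaves non-matching strings as they are.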
You can use a capture group holding the first five characters of any sort, followed by an underscore, and replace the match with those five characters followed by a dot. Since all the examples have the underscore in the 6th position, I'm guessing you were not counting the original dots:
> col_names
[1] "s1.help_needed" "s1.Q2_im_stuck"
[3] "s1.Q2.im_stuck" "s1.Q3.regex"
[5] "s1.Q3_regex" "s2.Q1.is_confusing"
[7] "s2.Q2.answer_please" "s2.Q2_answer_please"
[9] "s2.someone_knows_the answer" "s3.appreciate_the_help"
> sub("^(.....)_", "\\1.", col_names)
[1] "s1.help_needed" "s1.Q2.im_stuck"
[3] "s1.Q2.im_stuck" "s1.Q3.regex"
[5] "s1.Q3.regex" "s2.Q1.is_confusing"
[7] "s2.Q2.answer_please" "s2.Q2.answer_please"
[9] "s2.someone_knows_the answer" "s3.appreciate_the_help"
Since the replacement argument is not a regex, the dot there is literal and does not need the doubled-backslash escape ("\\.") that it would need in an R regex pattern argument. The backreference \\1 still needs its doubled backslash because of R's string escaping.

Resources