Extract timestamp in file name - r

I have a vector containing file names:
Filenames = c("blabla_bli_20140524_002532_000.wav",
"20201025_231205.wav",
"ble_20190612_220013_012.wav",
"X-20150312_190225_Blablu.wav",
"0000125.wav")
Of course the real vector is longer. I need to extract the timestamp from the name so as to obtain the following result in date format:
Result = c("2014-05-24 00:25:32.000",
"2020-10-25 23:12:05.000",
"2019-06-12 22:00:13.012",
"2015-03-12 19:02:25.000",
NA)
Note that sometimes characters are present before or after the timestamp, and that some file names display the milliseconds while some do not (in this case they are assumed to be 000 milliseconds). I also need to obtain something like "NA" if the file name does not have the timestamp in it.
I found what looks like a good solution here, but it is in Python and I don't know how to translate it into R. I tried this, which is not working:
str_extract(Filenames, regex("_((\d+)_(\d+))"))
Error: '\d' is an unrecognized escape in character string starting ""_((\d"

You can use gsub to extract your datetime part by pattern, then rely (mostly) on the parsing of datetimes by POSIX standards. The only catch is that in POSIX, milliseconds must be expressed as ss.mmm (fractional seconds), so you need to replace the _ with a .
timestamps <- gsub(".*?(([0-9]{8}_[0-9]{6})(_([0-9]{3}))?).*?$", "\\2.\\4", Filenames)
timestamps
[1] "20140524_002532.000" "20201025_231205." "20190612_220013.012"
[4] "20150312_190225." "0000125.wav"
We have captured the datetime section (\\2), added a dot, then the (optional) millisecond section (\\4), without the underscore. Note that the mismatched filename remains unchanged - that's ok.
Now we specify the datetime format using the POSIX specification to first parse the strings, then print them in a different format:
times <- as.POSIXct(timestamps, format="%Y%m%d_%H%M%OS")
format(times, "%Y-%m-%d %H:%M:%OS3")
[1] "2014-05-24 00:25:32.000" "2020-10-25 23:12:05.000" "2019-06-12 22:00:13.012"
[4] "2015-03-12 19:02:25.000" NA
Note that the string that was not a timestamp just got turned into NA, so you can easily get rid of it here. Also, if you want to see the milliseconds printed out, you need to use the %OS (fractional seconds) format, plus the number of digits (0-6) you want printed - in this case, %OS3.
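A string-only variant of the same idea can be sketched in base R, assuming the timestamp always has the shape YYYYMMDD_HHMMSS with an optional _mmm suffix. The variable names (hit, stamp, core, ms, result) are illustrative and not part of the answer above. The millisecond digits are copied through as text rather than parsed as fractional seconds, and names without a timestamp are set to NA explicitly.

```r
Filenames <- c("blabla_bli_20140524_002532_000.wav",
               "20201025_231205.wav",
               "ble_20190612_220013_012.wav",
               "X-20150312_190225_Blablu.wav",
               "0000125.wav")

# Locate the first "YYYYMMDD_HHMMSS" with an optional "_mmm" part
hit <- regexpr("[0-9]{8}_[0-9]{6}(_[0-9]{3})?", Filenames)

# regmatches() returns only the matched strings, so fill them in by position
stamp <- rep(NA_character_, length(Filenames))
stamp[hit > 0] <- regmatches(Filenames, hit)

# Rearrange the date and time digits into "YYYY-MM-DD HH:MM:SS"
core <- sub("^([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})([0-9]{2}).*$",
            "\\1-\\2-\\3 \\4:\\5:\\6", stamp)

# A 19-character stamp carries milliseconds; otherwise assume "000"
ms <- ifelse(nchar(stamp) == 19, substr(stamp, 17, 19), "000")

# Keep NA where no timestamp was found
result <- ifelse(is.na(stamp), NA_character_, paste0(core, ".", ms))
result
```

The result here is a character vector; if date-time objects are needed, it can still be passed to as.POSIXct afterwards.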

You can use
library(stringr)
rx <- "(\\d{4})(\\d{2})(\\d{2})_(\\d{2})(\\d{2})(\\d{2})(?:_(\\d+))?"
Filenames = c("blabla_bli_20140524_002532_000.wav", "20201025_231205.wav", "ble_20190612_220013_012.wav", "X-20150312_190225_Blablu.wav", "0000125.wav")
m <- str_match(Filenames, rx)
result <- ifelse(is.na(m[,8]),
str_c(m[,2], "-", m[,3], "-", m[,4], " ", m[,5], ":", m[,6], ":", m[,7], ".000"),
str_c(m[,2], "-", m[,3], "-", m[,4], " ", m[,5], ":", m[,6], ":", m[,7], ".", m[,8]))
result
See the R demo. Output:
> result
[1] "2014-05-24 00:25:32.000"
[2] "2020-10-25 23:12:05.000"
[3] "2019-06-12 22:00:13.012"
[4] "2015-03-12 19:02:25.000"
[5] NA
See the regex demo. It captures parts of the datetime string into separate groups. If the last group ((\d+) in the (?:_(\d+))? optional non-capturing group matching milliseconds) is matched, we add it preceded with a dot, else, we add .000.
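On the error quoted in the question: inside an R string literal a backslash must itself be escaped, so the regex token \d has to be written as "\\d". A minimal base R sketch (variable names are illustrative) showing that the doubled backslash ends up as a single backslash in the pattern the regex engine receives:

```r
# "\\d" in R source code is the two-character regex token \d
pattern <- "_(\\d+)_(\\d+)"
cat(pattern, "\n")  # prints the pattern as the regex engine sees it

x <- "ble_20190612_220013_012.wav"
hit <- regexpr(pattern, x, perl = TRUE)
matched <- regmatches(x, hit)
matched
```

This only gets the original attempt past the parser; it still returns the leading underscore and ignores the optional milliseconds, which is what the grouped pattern above takes care of.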

Related

Removing leading question marks from first two words of data frame entries in R

I have a large data frame in R with column "NameFull" holding a text string made up of two words (binomial scientific name), followed by author name(s) and initials. Both have been corrupted (presumably UTF translation issues). This means that in the binomials any leading "x" (indicating hybrids) has been replaced with "?". Unfortunately any non-standard characters in the author names have also been replaced with "?" so I cannot just replace all "?" with x.
I simply want to replace any leading "?" in the first two words with "x" (I will then have to manually compose a list of corrected author names to replace the corrupted ones, unless anyone has a bright idea on that!).
Example chunk of df:
df.corrupt <- data.frame(Bing = 1:6, FullName = c("?Anthematricaria dominii Rohlena", "?Anthemimatricaria inolens P.Fourn.", "?Anthemimatricaria maleolens P.Fourn.", "Achillea ?albinea Bjel?i? & K.Mal?", "Achillea carpatica B?ocki ex Dubovik", "Floscaldasia azorelloides Sklen ? & H.Rob."), Bang = 1:6)
I've tried to shoehorn it into regex but can't get close. Any help appreciated!
As I understand it, you want to replace ? only if it occurs in word-initial position in either the first or the second word; if that's correct this should work:
Data: (I've changed a few chars)
df.corrupt <- data.frame(Bing = 1:6,
FullName = c("?Anthematricaria dominii ?Rohlena",
"?Anthemimatricaria inolens P.Fourn.",
"?Anthemimatricaria maleolens ?P.Fourn.",
"Achillea ?albinea Bjel?i? & K.Mal?",
"Achillea carpatica B?ocki ex Dubovik",
"Floscaldasia azorelloides Sklen ? & H.Rob."), Bang = 1:6)
Solution:
library(stringr)
str_replace_all(df.corrupt$FullName, "^\\?|(?<=^(\\?)?\\b\\w{1,100}\\b\\s)\\?", "x")
[1] "xAnthematricaria dominii ?Rohlena" "xAnthemimatricaria inolens P.Fourn."
[3] "xAnthemimatricaria maleolens ?P.Fourn." "Achillea xalbinea Bjel?i? & K.Mal?"
[5] "Achillea carpatica B?ocki ex Dubovik" "Floscaldasia azorelloides Sklen ? & H.Rob."
This stringr solution puts x where ? occurs right at the start of the string (^) or (|), using a positive lookbehind (i.e., a non-consuming assertion, not a capturing group), where it follows a whitespace char (\\s), which in turn follows a word boundary (\\b) following up to 100 \\w chars following a word boundary, following finally an optional ? at the start of the string.
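We can check for a ? that follows a space or sits at the start of the string and replace it with 'x'. Note that this pattern is not limited to the first two words, so a ? after a space in the author part is replaced as well: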
trimws(gsub("(^|\\s)\\?", " x", df.corrupt$FullName))
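A base R alternative that stays within the first two words can be sketched with two anchored sub calls. This is only a sketch under the assumption that the two words of the binomial are separated by a single whitespace character; the variable name fixed is illustrative.

```r
FullName <- c("?Anthematricaria dominii ?Rohlena",
              "?Anthemimatricaria inolens P.Fourn.",
              "?Anthemimatricaria maleolens ?P.Fourn.",
              "Achillea ?albinea Bjel?i? & K.Mal?",
              "Achillea carpatica B?ocki ex Dubovik",
              "Floscaldasia azorelloides Sklen ? & H.Rob.")

# Step 1: a "?" at the very start of the string (first word)
fixed <- sub("^\\?", "x", FullName)

# Step 2: a "?" directly after the first word and one whitespace (second word)
fixed <- sub("^(\\S+\\s)\\?", "\\1x", fixed, perl = TRUE)
fixed
```

Because both patterns are anchored with ^ and sub replaces at most one match, any ? further along in the author names is left untouched.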

Regular expressions, stringr - have the regex, can't get it to work in R

I have a data frame with a field called "full.path.name"
This contains things like
s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx
01 GROUP is a pattern of variable size in the whole string.
I would like to add a new field onto the data frame called "short.path"
and it would contain things like
s:///01 GROUP
s:///02 GROUP LONGER NAME
I've managed to extract the last four characters of the file using stringr, I think I should use stringr again.
This gives me the file extension
sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))
I went to https://www.regextester.com/
and got this
s:///*.[^/]*
as the regex to use
so I tried it below
sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))
What I thought I would get is a new field on my data frame containing
01 GROUP etc
I get NA
When I try this
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")
Gives me S
Where am I going wrong?
When I use: https://regexr.com/
I get
\d* [A-Z]* [A-Z]*[^/]
How do I put that into
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])
And make things work?
EDIT:
There are two solutions here.
The reason the solutions didn't work at first was because
sfiles$Full.path.name
was >255 in some cases.
What I did:
To make g_t_m's regex work
library(tidyverse)
# read the file
sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = FALSE)
# add a field to calculate path length and filter out the long paths
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles <- sfiles %>% filter(file_path_length <= 255)
# then use str_replace to take out the full path name and leave only the top
# folder names
sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1"))
levels(sfiles$file_path_short)
[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7"
I think it was the full.path.name field that was causing problems.
To make Wiktor's answer work I did this:
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
You may use a mere
sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")
If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:
"(?<=^s:///)[^/]+"
See the regex demo
Details
^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
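The same extraction can be sketched in base R with regexpr and regmatches. The second and third example paths are made up for illustration, and ignore.case = TRUE is an assumption added because the levels printed in the question's edit start with an upper-case S.

```r
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "S:///02 GROUP LONGER NAME/some file.txt",  # hypothetical path
           "no drive prefix")                          # hypothetical non-match

# Match "s:///" at the start plus everything up to the next "/"
hit <- regexpr("^s:///[^/]+", paths, ignore.case = TRUE)

# regmatches() drops non-matches, so fill the results in by position
short <- rep(NA_character_, length(paths))
short[hit > 0] <- regmatches(paths, hit)
short
```

Paths that do not start with the prefix come back as NA, which mirrors what str_extract does for non-matching strings.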
Firstly, I would amend your regex to extract the file extension, since file extensions are not always 4 characters long:
library(stringr)
df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
"s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"), stringsAsFactors = F)
df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")
df$file_type
[1] "docx" "pdf"
Then, the following code should give you your short name:
df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
df
                                               full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx      docx   s:///01 GROUP
2  s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf       pdf   s:///01 GROUP
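For the file extension, a possible alternative to a hand-written regex is tools::file_ext, which ships with R. The sketch below uses the two paths from this answer plus a made-up third path without an extension.

```r
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf",
           "s:///01 GROUP/README")  # hypothetical path without an extension

# file_ext() returns the alphanumeric part after the last dot, or "" if none
ext <- tools::file_ext(paths)
ext
```

One difference worth knowing: when there is no extension, file_ext gives an empty string, whereas a str_replace call with no match returns its input unchanged.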

Is this an encoding issue?

I downloaded a text file that contains basically two columns—a date column and a contents column.
The date column was initially in the format: mm/dd/yy h:mm:ss am/pm. For example, one such date would be 10/16/2018 8:10:10 PM
I wanted to isolate the plain date. I split the text column using strsplit(), so now I have a vector with dates in the common mm/dd/yy format. I want to convert this using the as.Date(x, format = '%m/%d/%y') command.
I notice, however, that I get a good chunk of my character vector coming out as NA. I compared the NA values to the values surrounding it. Here is what I see:
normal_vector[1:3]
[1] "10/12/17" "‎10/12/17" "10/12/17"
The middle one (normal_vector[2]) is the problem one.
as.Date(normal_vector[1:3], format = "%m/%d/%y")
[1] "2017-10-12" NA "2017-10-12"
Could this be an encoding issue? I tried using as.Date(iconv(normal_vector[1:3], to = "UTF-8"), format = "%m/%d/%y") but it does not appear to help. Furthermore, if I inspect the encoding of the character vector as it is, I get the following:
Encoding(normal_vector[1:3])
[1] "unknown" "UTF-8" "unknown"
Again, I just want to convert all three of these elements into a normal date object in R. They appear identical, and the encoding would have me think that a "UTF-8" character would be easily handled by an as.Date() function. What are some possible reasons that it refuses to be converted to a date?
Thanks!
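There are indeed some strange characters (displayed as three 'dots') at the start of your second string.
Look at the hex: the bytes e2 80 8e are the UTF-8 encoding of U+200E (left-to-right mark), an invisible character.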
fread from the data.table-package can read these text just fine...
data.table::fread("./temp.csv", header = FALSE)
# V1 V2 V3
# 1: 10/12/17 ‎10/12/17 10/12/17
after reading, you can cleanse your data using some regex-magic...
dt <- data.table::fread("./temp.csv", header = FALSE)
# V1 V2 V3
# 1: 10/12/17 ‎10/12/17 10/12/17
# strip all characters that are NOT 0-9, a-z, A-Z or '/'
cleaned <- as.character( gsub( "[^0-9a-zA-Z/]", "", as.matrix( dt ) ) )
as.Date( cleaned, format = "%m/%d/%y" )
# [1] "2017-10-12" "2017-10-12" "2017-10-12"
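The problem can also be reproduced and fixed without a file. The sketch below builds the vector by hand (an assumption, since the original data is not available) with a U+200E character in front of the second element, confirms its presence, and removes only that character before parsing.

```r
# Second element starts with U+200E (left-to-right mark), as in the question
x <- c("10/12/17", "\u200e10/12/17", "10/12/17")

nchar(x)                         # the odd one is one character longer
first_cp <- utf8ToInt(x[2])[1]   # code point of its first character

# Remove just the invisible mark, leaving everything else untouched
clean <- gsub("\u200e", "", x, fixed = TRUE)

dates <- as.Date(clean, format = "%m/%d/%y")
dates
```

Removing a single known character is more conservative than a whitelist regex, though the whitelist is handier when several different invisible characters may be present.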

Retrieving a specific part of a string in R

I have the following vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, which provides a more powerful regex engine (ICU) than base R's default one; in particular it supports look-behind, which base R only offers with perl = TRUE.
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any word character (letters, digits and underscore), so you could use "(?<==)\\w+$" (the \ must be escaped so you end up with \\w).
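The same look-behind pattern can also be run in base R by switching to the PCRE engine with perl = TRUE. A minimal sketch, with ids as an illustrative variable name:

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# Look-behind for "=" requires perl = TRUE in base R
ids <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
ids
```

Note that regmatches silently drops elements without a match, so the result can be shorter than the input if some strings contain no equal sign.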

How to import data in the right monthly order using list.files

I have 100 years of monthly data where each month is a file and the file name ends with the year and month of the data.
e.g. "cru_ts_3_10.1901.2009.pet_1901_1.asc" is the file for year 1901, month 1 (January).
The problem is when I list my files the order of the files changes, the months 10, 11 and 12 come after 1:
files <- list.files(pattern=".asc")
head(files)
[1] "cru_ts_3_10.1901.2009.pet_1901_1.asc" "cru_ts_3_10.1901.2009.pet_1901_10.asc" "cru_ts_3_10.1901.2009.pet_1901_11.asc"
[4] "cru_ts_3_10.1901.2009.pet_1901_12.asc" "cru_ts_3_10.1901.2009.pet_1901_2.asc" "cru_ts_3_10.1901.2009.pet_1901_3.asc"
I can see why that happens, but how can I import my data in the right monthly order?
Another regex based solution. It works by extracting the year and month from a filename to construct a real date, and then uses the sort order to print the file list.
pat <- "^.*pet_([0-9]{1,})_([0-9]{1,}).asc$"
ord_files <- as.Date(gsub(pat, sprintf("%s-%s-01", "\\1", "\\2"), files))
files[order(ord_files)]
EXPLANATION
We use a regular expression to match the year and month in the file name: \\1 refers to the captured year and \\2 to the captured month. The call sprintf("%s-%s-01", "\\1", "\\2") simply builds the replacement string "\\1-\\2-01", and gsub then fills in the captured year and month, giving e.g. "1901-1-01". as.Date converts that string into a date, so that the files can be ordered chronologically.
files <- c("cru_ts_3_10.1901.2009.pet_1901_1.asc",
"cru_ts_3_10.1901.2009.pet_1901_10.asc",
"cru_ts_3_10.1901.2009.pet_1901_11.asc",
"cru_ts_3_10.1901.2009.pet_1901_12.asc",
"cru_ts_3_10.1901.2009.pet_1901_2.asc",
"cru_ts_3_10.1901.2009.pet_1901_3.asc",
"cru_ts_3_10.1901.2009.pet_1902_1.asc",
"cru_ts_3_10.1901.2009.pet_1902_10.asc",
"cru_ts_3_10.1901.2009.pet_1902_11.asc")
This splits the names on underscores and selects the last part (e.g. "1.asc"), then removes the ".asc" using sub. It converts what is left into a number and uses sprintf on the number to get a two-digit string. Finally, it pastes the year and month together, turns the result into a number, and orders based on that.
files[order(sapply(strsplit(files, "_"), function(x) {
m <- sprintf("%02d", as.numeric(sub(".asc", "", x[length(x)], fixed = TRUE))) # turns "1.asc" into "01"
as.numeric(paste(x[length(x) - 1], m, sep=""))
}))]
Returns:
[1] "cru_ts_3_10.1901.2009.pet_1901_1.asc"
[2] "cru_ts_3_10.1901.2009.pet_1901_2.asc"
[3] "cru_ts_3_10.1901.2009.pet_1901_3.asc"
[4] "cru_ts_3_10.1901.2009.pet_1901_10.asc"
[5] "cru_ts_3_10.1901.2009.pet_1901_11.asc"
[6] "cru_ts_3_10.1901.2009.pet_1901_12.asc"
[7] "cru_ts_3_10.1901.2009.pet_1902_1.asc"
[8] "cru_ts_3_10.1901.2009.pet_1902_10.asc"
[9] "cru_ts_3_10.1901.2009.pet_1902_11.asc"
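A compact variant of the same approach, sketched in base R, pulls the year and month out as integers and passes both to order. It assumes every file name ends in _<year>_<month>.asc; the names pat, year, month and sorted are illustrative.

```r
files <- c("cru_ts_3_10.1901.2009.pet_1901_1.asc",
           "cru_ts_3_10.1901.2009.pet_1901_10.asc",
           "cru_ts_3_10.1901.2009.pet_1901_11.asc",
           "cru_ts_3_10.1901.2009.pet_1901_12.asc",
           "cru_ts_3_10.1901.2009.pet_1901_2.asc",
           "cru_ts_3_10.1901.2009.pet_1901_3.asc",
           "cru_ts_3_10.1901.2009.pet_1902_1.asc",
           "cru_ts_3_10.1901.2009.pet_1902_10.asc",
           "cru_ts_3_10.1901.2009.pet_1902_11.asc")

# Capture the trailing "_<year>_<month>.asc"
pat <- "^.*_([0-9]+)_([0-9]+)\\.asc$"
year  <- as.integer(sub(pat, "\\1", files))
month <- as.integer(sub(pat, "\\2", files))

# order() sorts by year first and breaks ties by month
sorted <- files[order(year, month)]
sorted
```

Sorting on integers avoids the zero-padding step, since 2 < 10 numerically even though "10" sorts before "2" as text.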
Look at the mixedsort function in the gtools package.
