Is this an encoding issue?

I downloaded a text file that contains basically two columns—a date column and a contents column.
The date column was initially in the format: mm/dd/yy h:mm:ss am/pm. For example, one such date would be 10/16/2018 8:10:10 PM
I wanted to isolate the plain date. I split the text column using strsplit(), so now I have a vector with dates in the common mm/dd/yy format. I want to convert it using as.Date(x, format = '%m/%d/%y').
I notice, however, that I get a good chunk of my character vector coming out as NA. I compared the NA values to the values surrounding it. Here is what I see:
normal_vector[1:3]
[1] "10/12/17" "‎10/12/17" "10/12/17"
The middle one (normal_vector[2]) is the problem one.
as.Date(normal_vector[1:3], format = "%m/%d/%y")
[1] "2017-10-12" NA "2017-10-12"
Could this be an encoding issue? I tried as.Date(iconv(normal_vector[1:3], to = "UTF-8"), format = "%m/%d/%y"), but it does not appear to help. Furthermore, if I inspect the encoding of the character vector as it is, I get the following:
Encoding(normal_vector[1:3])
[1] "unknown" "UTF-8" "unknown"
Again, I just want to convert all three of these elements into a normal date object in R. They appear identical, and the encoding would have me think that a "UTF-8" character would be easily handled by an as.Date() function. What are some possible reasons that it refuses to be converted to a date?
Thanks!

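There are indeed some strange characters (three 'dots') at the start of your second string.
Look at the hex: e2 80 8e. Those three bytes are the UTF-8 encoding of U+200E (LEFT-TO-RIGHT MARK), an invisible formatting character, which is why the strings look identical when printed.
fread from the data.table package can read this text just fine: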
data.table::fread("./temp.csv", header = FALSE)
# V1 V2 V3
# 1: 10/12/17 â€Ž10/12/17 10/12/17
after reading, you can cleanse your data using some regex-magic...
dt <- data.table::fread("./temp.csv", header = FALSE)
# V1 V2 V3
# 1: 10/12/17 â€Ž10/12/17 10/12/17
# strip every character that is NOT 0-9, a-z, A-Z or '/'
cleaned <- as.character( gsub( "[^0-9a-zA-Z/]", "", as.matrix( dt ) ) )
as.Date( cleaned, format = "%m/%d/%y" )
# [1] "2017-10-12" "2017-10-12" "2017-10-12"
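If you would rather confirm what the hidden character is before stripping it, here is a minimal base R sketch. The vector x is a made-up stand-in for the question's data, and it assumes the stray character is U+200E (which is what the bytes e2 80 8e decode to); it is not the asker's actual file.

```r
# Hypothetical stand-in for the question's vector: the second element
# starts with an invisible U+200E (LEFT-TO-RIGHT MARK).
x <- c("10/12/17", "\u200e10/12/17", "10/12/17")

# The raw bytes reveal what printing hides: e2 80 8e sits before "10/12/17".
lead_bytes <- as.character(charToRaw(x[2])[1:3])

# nchar() also gives it away: 9 characters instead of 8.
n_chars <- nchar(x)

# Keep only digits and slashes, then parse as usual.
cleaned <- gsub("[^0-9/]", "", x)
dates <- as.Date(cleaned, format = "%m/%d/%y")
format(dates)
```

Restricting the whitelist to digits and slashes is a choice for this particular column; if the field can contain letters (for example AM/PM), widen the character class accordingly.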

Related

Extract timestamp in file name

I have a vector containing file names:
Filenames = c("blabla_bli_20140524_002532_000.wav",
"20201025_231205.wav",
"ble_20190612_220013_012.wav",
"X-20150312_190225_Blablu.wav",
"0000125.wav")
Of course the real vector is longer. I need to extract the timestamp from each name to obtain the following result in date format:
Result = c("2014-05-24 00:25:32.000",
"2020-10-25 23:12:05.000",
"2019-06-12 22:00:13.012",
"2015-03-12 19:02:25.000",
NA)
Note that sometimes characters are present before or after the timestamp, and that some file names display the milliseconds while some do not (in this case they are assumed to be 000 milliseconds). I also need to obtain something like "NA" if the file name does not have the timestamp in it.
I found what looks like a good solution here, but it is in Python and I don't know how to translate it into R. I tried this, which does not work:
str_extract(Filenames, regex("_((\d+)_(\d+))"))
Error: '\d' is an unrecognized escape in character string starting ""_((\d"

You can use gsub to extract your datetime part by pattern, then rely (mostly) on the parsing of datetimes by POSIX standards. The only catch is that in POSIX, milliseconds must be expressed as ss.mmm (fractional seconds), so you need to replace the _ with a .
timestamps <- gsub(".*?(([0-9]{8}_[0-9]{6})(_([0-9]{3}))?).*?$", "\\2.\\4", Filenames)
timestamps
[1] "20140524_002532.000" "20201025_231205." "20190612_220013.012"
[4] "20150312_190225." "0000125.wav"
We have captured the datetime section (\\2), added a dot, then the (optional) millisecond section (\\4), without the underscore. Note that the mismatched filename remains unchanged - that's ok.
Now we specify the datetime format using the POSIX specification to first parse the strings, then print them in a different format:
times <- as.POSIXct(timestamps, format="%Y%m%d_%H%M%OS")
format(times, "%Y-%m-%d %H:%M:%OS3")
[1] "2014-05-24 00:25:32.000" "2020-10-25 23:12:05.000" "2019-06-12 22:00:13.012"
[4] "2015-03-12 19:02:25.000" NA
Note that the string that was not a timestamp just got turned into NA, so you can easily get rid of it here. Also, if you want to see the milliseconds printed out, you need to use the %OS (fractional seconds) format, plus the number of digits (0-6) you want printed - in this case, %OS3.

You can use
library(stringr)
rx <- "(\\d{4})(\\d{2})(\\d{2})_(\\d{2})(\\d{2})(\\d{2})(?:_(\\d+))?"
Filenames = c("blabla_bli_20140524_002532_000.wav", "20201025_231205.wav", "ble_20190612_220013_012.wav", "X-20150312_190225_Blablu.wav", "0000125.wav")
m <- str_match(Filenames, rx)
result <- ifelse(is.na(m[,8]),
str_c(m[,2], "-", m[,3], "-", m[,4], " ", m[,5], ":", m[,6], ":", m[,7], ".000"),
str_c(m[,2], "-", m[,3], "-", m[,4], " ", m[,5], ":", m[,6], ":", m[,7], ".", m[,8]))
result
See the R demo. Output:
> result
[1] "2014-05-24 00:25:32.000"
[2] "2020-10-25 23:12:05.000"
[3] "2019-06-12 22:00:13.012"
[4] "2015-03-12 19:02:25.000"
[5] NA
See the regex demo. It captures parts of the datetime string into separate groups. If the last group ((\d+) in the (?:_(\d+))? optional non-capturing group matching milliseconds) is matched, we add it preceded with a dot, else, we add .000.
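For what it's worth, the error in the question comes from the single backslash: inside an R string literal "\d" has to be written "\\d" (or you can use [0-9] and avoid the escape altogether). Below is a rough base R sketch of the same idea without stringr. The pattern and variable names are my own, and it assumes the millisecond part, when present, is exactly three digits.

```r
Filenames <- c("blabla_bli_20140524_002532_000.wav",
               "20201025_231205.wav",
               "ble_20190612_220013_012.wav",
               "X-20150312_190225_Blablu.wav",
               "0000125.wav")

# Groups 1-6: year, month, day, hour, minute, second.
# Group 7: optional milliseconds (left empty when absent).
rx <- paste0("^.*?([0-9]{4})([0-9]{2})([0-9]{2})",
             "_([0-9]{2})([0-9]{2})([0-9]{2})",
             "(?:_([0-9]{3}))?.*$")

has_ts <- grepl(rx, Filenames, perl = TRUE)

# Rebuild the timestamp; names without a match pass through unchanged.
result <- sub(rx, "\\1-\\2-\\3 \\4:\\5:\\6.\\7", Filenames, perl = TRUE)

# No milliseconds captured leaves a trailing dot: pad it with 000.
result <- sub("\\.$", ".000", result)

# File names without a timestamp become NA.
result[!has_ts] <- NA
result
```

This returns character strings; pass them to as.POSIXct() if you need a real date-time object.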

data frame with mixed date format

I would like to convert all the mixed date formats into one format, for example d-m-y.
Here is the data frame:
x <- data.frame("Name" = c("A","B","C","D","E"), "Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
I have tried the code below, but it gives NAs:
newdateformat <- as.Date(x$Birthdate,
format = "%m%d%y", origin = "2020-6-25")
newdateformat
Then I tried using parse_date_time, but it also gives NAs, which means it failed to parse:
require(lubridate)
parse_date_time(my_data$Birthdate, orders = c("ymd", "mdy"))
[1] NA NA "2001-09-12 UTC" NA
[5] "2005-02-18 UTC"
I also couldn't find out what format the first date in the data frame, "36085.0", is in.
I did find this code, but I still couldn't understand what the number means or what "origin" means:
dates <- c(30829, 38540)
betterDates <- as.Date(dates,
origin = "1899-12-30")
P.S. I'm quite new to R, so I'd appreciate a simple explanation. Thank you!

You should parse each format separately. For each format, select the relevant rows with a regular expression and transform only those rows, then move on to the next format. I'll give the answer with data.table instead of data.frame because I've forgotten how to use data.frame.
library(lubridate)
library(data.table)
x = data.table("Name" = c("A","B","C","D","E"),
"Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
# or use setDT(x) to convert an existing data.frame to a data.table
# handle dates like "2001-sep-12" and "2020-6-25"
# this regex matches strings beginning with four numbers and then a dash
x[grepl('^[0-9]{4}-',Birthdate),Birthdate1:=ymd(Birthdate)]
# handle dates like "36085.0": days since 1904 (or 1900)
# see https://learn.microsoft.com/en-us/office/troubleshoot/excel/1900-and-1904-date-system
# this regex matches strings that only have numeric characters and .
x[grepl('^[0-9\\.]+$',Birthdate),Birthdate1:=as.Date(as.numeric(Birthdate),origin='1904-01-01')]
# assume the rest are like "Feb-18-2005" and "05/27/84" and handle those
x[is.na(Birthdate1),Birthdate1:=mdy(Birthdate)]
# result
> x
Name Birthdate Birthdate1
1: A 36085.0 2002-10-18
2: B 2001-sep-12 2001-09-12
3: C Feb-18-2005 2005-02-18
4: D 05/27/84 1984-05-27
5: E 2020-6-25 2020-06-25
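Since the question also asks what the number and "origin" mean: a value like 36085 is a count of days, and origin is the date that counts as day zero. Here is a small base R sketch showing how the choice of origin shifts the result. Which date system the spreadsheet actually used is an assumption you have to check yourself; 1899-12-30 is the origin commonly used for Excel's 1900 date system, while the answer above assumed the 1904 system.

```r
# "36085.0" arrives as text, so convert it to a number first.
serial <- as.numeric("36085.0")

# Day zero is the origin; the serial number is added to it as days.
d_1900 <- as.Date(serial, origin = "1899-12-30")  # Excel 1900 date system
d_1904 <- as.Date(serial, origin = "1904-01-01")  # Excel 1904 date system

format(d_1900)              # "1998-10-17"
format(d_1904)              # "2002-10-18"

# Once it is a Date, any output layout is just a format() call,
# e.g. the d-m-y layout requested in the question.
format(d_1900, "%d-%m-%Y")  # "17-10-1998"
```

The two systems differ by 1462 days, so picking the wrong origin moves every date by roughly four years.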

convert Single digit day in R

I have dates in the format Apr42016, Aug12017, Apr112018. I am trying to convert them to Y/m/d using R. I have tried the code below, but when the day has a single digit it returns NA. Could anyone help me, please?
strptime(data$date, "%b%e%Y")
as.Date (data$date, format="%b%d%Y")
as.POSIXct(data$date, format="%b%e%Y")
Thank you

You can modify the strings with sub (and add a 0 if necessary) before using as.Date:
myvec <- c("Apr42016", "Aug12017", "Apr112018") # the data
myvec2 <- sub("(?<=[^0-9])(?=[0-9]{5}$)", "0", myvec, perl = TRUE) # pad only when exactly 5 digits follow the month
# [1] "Apr042016" "Aug012017" "Apr112018"
as.Date(myvec2, format = "%b%d%Y")
# [1] "2016-04-04" "2017-08-01" "2018-04-11"

If you can break up the numbers before as.Date, it will make things much easier. (Borrowing Sven's look-behind.)
sub("(?<=\\D)(\\d+)(\\d{4})$", "-\\1-\\2",
c("Apr42016", "Aug12017", "Apr112018"), perl=TRUE)
# [1] "Apr-4-2016" "Aug-1-2017" "Apr-11-2018"
From here, the format should be rather straight-forward:
as.Date(sub("(?<=\\D)(\\d+)(\\d{4})$", "-\\1-\\2", c("Apr42016", "Aug12017", "Apr112018"), perl = TRUE),
format="%b-%d-%Y")
# [1] "2016-04-04" "2017-08-01" "2018-04-11"
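One caveat with both answers: %b matches month abbreviations in the current locale, so they can fail on a non-English system. As a possible alternative, here is a sketch that splits the string with a regex and looks the month up in base R's month.abb constant instead. The function name parse_compact_date is made up for illustration, and it assumes capitalised English abbreviations such as "Apr" and a four-digit year.

```r
# Hypothetical helper: parse strings like "Apr42016" or "Apr112018".
parse_compact_date <- function(x) {
  rx <- "^([A-Za-z]{3})([0-9]{1,2})([0-9]{4})$"
  x[!grepl(rx, x)] <- NA                     # anything else becomes NA

  mon  <- match(sub(rx, "\\1", x), month.abb)  # "Apr" -> 4, locale-independent
  day  <- as.integer(sub(rx, "\\2", x))        # 1 or 2 digits
  year <- as.integer(sub(rx, "\\3", x))        # last 4 digits

  # ISOdate() builds a date-time at noon GMT; as.Date() drops the time.
  as.Date(ISOdate(year, mon, day))
}

parse_compact_date(c("Apr42016", "Aug12017", "Apr112018"))
```

Because the year is anchored to the last four digits, the day is whatever is left over, so no zero padding is needed.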

R use of '\' in a string

I am experimenting with R and came across an issue I don't fully understand.
dates = c("03-19-76", "04/19/76", as.character("04\19\76"), "05.19.76", "060766")
dates
[1] "03-19-76" "04/19/76" "04\0019>" "05.19.76" "060766"
Why is the third date interpreted, and what sort of interpretation is taking place? I also got this output when I left out the as.character function.
Thanks

Echoing the comments, make sure to escape backslashes in strings.
dates = c("03-19-76", "04/19/76", "04\\19\\76", "05.19.76", "060766")
> dates
[1] "03-19-76" "04/19/76" "04\\19\\76" "05.19.76" "060766"
Now that you've got the dates stored, there are actually a lot of built-in functions you can use with dates. Dates even have their own object type! To get one, use as.Date. Since you're using nonstandard date formats, you have to tell R how you've formatted them.
> as.Date(dates[1], "%m-%d-%y")
[1] "1976-03-19"
> as.Date(dates[2], "%m/%d/%y")
[1] "1976-04-19"
> as.Date("20\\10\\1999", "%d\\%m\\%Y")
[1] "1999-10-20"
a <- as.Date(dates[1], "%m-%d-%y")
b <- as.Date(dates[2], "%m/%d/%y")
> b - a
Time difference of 31 days
d <- as.numeric(b-a)
> d
[1] 31
> a + d^2
[1] "1978-11-05"
Note that since you're using 2-digit years, you use %y. If you used 4-digit years, you'd use %Y. If you forget, you'll get oddities like this:
> as.Date("03/14/2001", "%m/%d/%y")
[1] "2020-03-14"
> as.Date("03/14/10", "%m/%d/%Y")
[1] "0010-03-14"
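As for what interpretation is taking place in the original string: a backslash followed by digits in an R string literal is read as an octal character code. The sketch below is only an illustration of that (the variable names are mine). It also shows the raw-string syntax, which assumes R 4.0.0 or later.

```r
# "\1" is octal 1 (a control character); the 9 is not an octal digit,
# so it stays a literal "9". "\76" is octal 76 = decimal 62 = ">".
s <- "04\19\76"
codes <- utf8ToInt(s)   # 48 52 1 57 62, i.e. "0" "4" \001 "9" ">"
n <- nchar(s)           # 5 characters, not 8

# Doubling the backslash keeps it as a literal character.
escaped <- "04\\19\\76"

# Raw strings (R >= 4.0.0) avoid the doubling altogether.
raw_lit <- r"(04\19\76)"
same <- identical(escaped, raw_lit)

# The escaped version parses as a date once the format says so.
d <- as.Date(escaped, format = "%m\\%d\\%y")
format(d)
```

Note that printing escaped shows "04\\19\\76" because print() re-escapes the backslashes; cat(escaped) shows the actual 8 characters.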

How can you insert a colon every two characters?

I have a column of time values, except that they are in character format and do not have the colons to separate H, M, S. The column looks similar to the following:
Time
024201
054722
213024
205022
205024
125440
I want to convert all the values in the column to look like actual time values in the format H:M:S. The values are already in HMS format, so it is simply a matter of inserting colons, but that is proving more difficult than I thought. I found a package that adds commas every three digits from the right to make Strings look like currency values, but nothing for time (without also adding a date value, which I do not want to do). Any help would be appreciated.

Since the data is time related, you should consider storing it in a POSIX format:
> df <- data.frame(Time=c("024201", "054722", "213024", "205022", "205024", "125440"))
> df$Time <- as.POSIXct(df$Time, format="%H%M%S")
> df
Time
1 2014-01-05 02:42:01
2 2014-01-05 05:47:22
3 2014-01-05 21:30:24
4 2014-01-05 20:50:22
5 2014-01-05 20:50:24
6 2014-01-05 12:54:40
To output just the times:
> format(df, "%H:%M:%S")
Time
1 02:42:01
2 05:47:22
3 21:30:24
4 20:50:22
5 20:50:24
6 12:54:40

A regular expression with lookaround works for this:
gsub('(..)(?=.)', '\\1:', x$Time, perl=TRUE)
The (?=.) means a character (matched by .) must follow, but is not considered part of the match (and is not captured).
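To see why the lookahead matters, here is a small runnable comparison. The data frame x is rebuilt from the values in the question, and the column names are only for illustration.

```r
x <- data.frame(Time = c("024201", "054722", "213024",
                         "205022", "205024", "125440"),
                stringsAsFactors = FALSE)

# Without the lookahead, every pair gets a colon, including the last one.
no_lookahead <- gsub("(..)", "\\1:", x$Time)

# With (?=.), a pair only matches when another character follows it,
# so the final pair is left alone.
with_lookahead <- gsub("(..)(?=.)", "\\1:", x$Time, perl = TRUE)

no_lookahead[1]    # "02:42:01:"
with_lookahead[1]  # "02:42:01"
```

The perl = TRUE argument is needed because lookaround is not supported by R's default regex engine.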

Here is a regex solution:
x <- readLines(n=6)
024201
054722
213024
205022
205024
125440
gsub("(\\d\\d)(\\d\\d)(\\d\\d)", "\\1:\\2:\\3", x)
## [1] "02:42:01" "05:47:22" "21:30:24"
## [4] "20:50:22" "20:50:24" "12:54:40 "
Here each (\\d\\d) says we're looking for 2 digits. The parentheses break the string into 3 chunks. Then \\1: says take chunk 1 and place a colon after it, and likewise for \\2 and \\3.

Or via date/times classes:
time <- c("024201", "054722", "213024", "205022", "205024", "125440")
time <- as.POSIXct(paste("1970-01-01", time), format="%Y-%m-%d %H%M%S")
(time <- format(time, "%H:%M:%S"))
# [1] "02:42:01" "05:47:22" "21:30:24" "20:50:22" "20:50:24" "12:54:40"

This gives a chron "times" class vector:
> library(chron)
> times(gsub("(..)(..)(..)", "\\1:\\2:\\3", DF$Time))
[1] 02:42:01 05:47:22 21:30:24 20:50:22 20:50:24 12:54:40
The "times" class can display times without having to display the date and supports various methods on the times.
On the other hand, if only a character string is wanted then only the gsub part is needed.
