I have been given a csv with a column called month as a char variable with the first three letters of the month. E.g.:
"Jan", "Feb","Mar",..."Dec"
Is there any way to convert this to a numeric representation of the month, 1 to 12, or even a type that is in a date format?
Use match and the predefined vector month.abb:
tst <- c("Jan","Mar","Dec")
match(tst,month.abb)
[1] 1 3 12
You can use the built-in vector month.abb to check against when converting to a number, e.g.:
mm <- c("Jan","Dec","jan","Mar","Apr")
sapply(mm,function(x) grep(paste("(?i)",x,sep=""),month.abb))
Jan Dec jan Mar Apr
1 12 1 3 4
The grep construct takes care of differences in capitalization. If that's not needed,
match(mm,month.abb)
works just as well.
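A vectorised alternative for the case-insensitive lookup, as a minimal sketch: lowercase both sides and call match once. Unlike the sapply/grep construct, an unknown string simply yields NA. The vector mm below, including the deliberately invalid "xyz", is made up for illustration.

```r
# Illustrative input: mixed capitalisation plus one string that is not a month
mm <- c("Jan", "Dec", "jan", "MAR", "Apr", "xyz")

# Lowercase both the input and month.abb, then look up the position
month_num <- match(tolower(mm), tolower(month.abb))
month_num
```

Strings that are not month abbreviations come back as NA, which is easy to check for afterwards with is.na().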
If you also have a day and a year column, you can use any of the conversion functions with the appropriate format codes (see also ?strftime), e.g.:
mm <- c("Jan","Dec","jan","Mar","Apr")
year <- c(1998,1998,1999,1999,1999)
day <- c(4,10,3,16,25)
dates <- paste(year,mm,day,sep="-")
strptime(dates,format="%Y-%b-%d")
[1] "1998-01-04" "1998-12-10" "1999-01-03" "1999-03-16" "1999-04-25"
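Note that %b matches month abbreviations in the current locale, so the strptime call above can return NA in a non-English session. A possible locale-independent sketch, assuming the same three vectors, is to look the month up in month.abb first and build the date from numbers with ISOdate:

```r
mm   <- c("Jan", "Dec", "jan", "Mar", "Apr")
year <- c(1998, 1998, 1999, 1999, 1999)
day  <- c(4, 10, 3, 16, 25)

# Month number via lookup in the built-in English abbreviations
month_num <- match(tolower(mm), tolower(month.abb))

# ISOdate builds a POSIXct value (noon, GMT); as.Date drops the time part
dates <- as.Date(ISOdate(year, month_num, day))
format(dates)
```

The result is a Date vector, so format(dates, "%m") or further date arithmetic works on it directly.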
Just adding to the existing answers and the comment in the question:
readr::parse_date("20/DEZEMBRO/18", "%d/%B/%y", locale = readr::locale("pt"))
This returns the date "2018-12-20". locale("pt") selects Portuguese, which is used in Brazil; you can use "es" for Spanish, "fr" for French, etc.
A couple of options using:
vec <- c("Jan","Dec","Jan","Apr")
are
> Months <- 1:12
> names(Months) <- month.abb
> unname(Months[vec])
[1] 1 12 1 4
and/or
> match(vec, month.abb)
[1] 1 12 1 4
I have df1:
ID Time
1 16:00:00
2 14:30:00
3 9:23:00
4 10:00:00
5 23:59:00
and would like to change the current 'character' column 'Time' into an 'integer' as below:
ID Time
1 1600
2 1430
3 923
4 1000
5 2359
We could replace the :'s, make numeric, divide by 100, and convert to integer like this:
df1$Time = as.integer(as.numeric(gsub(':', '', df1$Time))/100)
You want to use as.POSIXct().
Functions to manipulate objects of classes "POSIXlt" and "POSIXct" representing calendar dates and times.
R Documents as.POSIXct()
So in the case of row 1: format(as.POSIXct("16:00:00", format = "%H:%M:%S"), "%H%M")
Then use as.integer on the result if you need it to truly be an int.
Converts a character matrix to a numeric matrix.
R Docs as.Numeric()
df1 <- data.frame(Time = "16:00:00")
df1[, "Time"] <- as.numeric(paste0(substr(df1[, "Time"], 1, 2), substr(df1[, "Time"], 4, 5)))
print(df1)
# Time
# 1 1600
There are many ways to process this, but here's one example:
library(dplyr)
df1 <- mutate(df1, Time = gsub(":", "", Time)) # remove the colons
df1 <- mutate(df1, Time = as.numeric(Time)/100) # coerce to numeric type, divide by 100
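Fixed-position approaches such as substr(..., 1, 2) assume a two-digit hour, which does not hold for "9:23:00" in the question's data. A minimal base R sketch that splits on the colon instead and yields a true integer column (the data frame is re-created from the question):

```r
df1 <- data.frame(ID = 1:5,
                  Time = c("16:00:00", "14:30:00", "9:23:00", "10:00:00", "23:59:00"),
                  stringsAsFactors = FALSE)

# Split each time into its hour, minute and second pieces
parts   <- strsplit(df1$Time, ":", fixed = TRUE)
hours   <- as.integer(sapply(parts, `[`, 1))
minutes <- as.integer(sapply(parts, `[`, 2))

# Combine as HHMM; integer arithmetic keeps the column integer
df1$Time <- hours * 100L + minutes
df1
```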
I have tried to extract it, but the methods I found seem to work only for the YYYY-MM format. My dates are in YYYYMM form, and I am trying to get just the month.
Ultimately, I would like it to look like this:
ID Date Month
1 200402 2
2 200603 3
3 200707 7
I am doing this in hopes of plotting monthly mean values.
You can simply do it using:
library(stringr)
str_sub(df$Date,-2,-1)
Or, if you are using pandas in Python rather than R:
df['Date'].str[-2:]
Hope this helps!
Assuming your Date column is numeric, you could just use the modulus:
df$Month <- df$Date %% 100
df
ID Date Month
1 1 200402 2
2 2 200603 3
3 3 200707 7
Data:
df <- data.frame(ID=c(1,2,3), Date=c(200402, 200603, 200707))
To make the above work when Date is character, just cast it to numeric first.
You can extract last two characters of Date Column.
sub('.*(..)$', '\\1', df$Date)
#Or without capture groups, as suggested by @Tim Biegeleisen
#sub("^.*(?=..$)", "", df$Date, perl = TRUE)
#[1] "02" "03" "07"
However, ideally you should avoid parsing information from date-time using regex. Convert it to date and then extract the month.
format(as.Date(paste(df$Date, '01'), "%Y%m%d"), '%m')
#Or with zoo::yearmon
#format(zoo::as.yearmon(as.character(df$Date), "%Y%m"), '%m')
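Since the stated goal is monthly mean values, here is a minimal sketch of how the extracted month could feed into that. The Value column and its numbers are made up for illustration; the question does not show the measured variable.

```r
df <- data.frame(ID    = 1:4,
                 Date  = c(200402, 200603, 200707, 200503),
                 Value = c(10, 20, 30, 40))  # hypothetical measurements

# Month is the last two digits of YYYYMM
df$Month <- as.integer(df$Date %% 100)

# Mean of Value within each month, pooled across years
monthly_mean <- aggregate(Value ~ Month, data = df, FUN = mean)
monthly_mean
```

If the means should be kept separate per year, df$Date %/% 100 gives the year and can be added to the formula as a second grouping variable.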
When entering behavior data in a different system, I wrote the subjects in a form such as 3-2 (to mean rank 3 to rank 2). I exported these to Excel, which took these entries as dates (so 2-Mar for this example).
I now have thousands of entries in this format. I have added two columns ("Actor" and "Recipient") and would like to fill in the rank numbers for these, based on what is in the "Subject" column.
A couple of lines of what I'm hoping my R output will give me:
Subject Actor Recipient
2-Mar 3 2
5-Jun 6 5
6-Feb 2 6
etc.
So I already have the "Subject" columns and need help figuring out code to fill in the "Actor" and "Recipient" columns. Rank numbers only go up to 6.
I've tried a couple of things but just keep getting error messages... Any help with this would be GREATLY appreciated!
Here you can use tstrsplit() after converting to date format
# Recreate your data
x <- data.frame("Subject" = c("2-Mar", "5-Jun", "6-Feb"))
# Change the format of your Subject column
x[, "Subject"] <- format(as.POSIXct(x[, "Subject"], format = "%d-%b"), "%m %d")
# Split into the two strings
library(data.table) # to get tstrsplit() function
x[, c("Actor", "Recipient")] <- tstrsplit(x[, "Subject"], " ")
# Convert to numeric
x[, "Actor"] <- as.numeric(x[, "Actor"])
x[, "Recipient"] <- as.numeric(x[, "Recipient"])
This returns
> x
Subject Actor Recipient
1 03 02 3 2
2 06 05 6 5
3 02 06 2 6
And if you want Subject in its original format
# Return Subject to original format
x[, "Subject"] <- format(as.POSIXct(x[, "Subject"], format = "%m %d"), "%d-%b")
Giving
> x
Subject Actor Recipient
1 02-Mar 3 2
2 05-Jun 6 5
3 06-Feb 2 6
Explained:
Your vector/variable "Subject" was imported as a character-type atomic vector (atomic vectors are one-dimensional structures of one or more elements, all of the same type). The solution was to convert that to something R would interpret as a date using the as.POSIXct(..., format = "...") function, where format tells R how the string is laid out (see ?strptime for the codes). I then wrapped that in the format() function, telling it to output the numeric month followed by the day. That string was then split into two columns using the tstrsplit() function, but R treats those as character-type data, so I converted them to double-type data using the as.numeric() function.
You could convert Subject to date and extract month and day from it.
temp <- as.Date(df$Subject, "%d-%b")
df$Actor <- as.integer(format(temp, "%m"))
df$Recipient <- as.integer(format(temp, "%d"))
df
# Subject Actor Recipient
#1 2-Mar 3 2
#2 5-Jun 6 5
#3 6-Feb 2 6
This can also be done using lubridate functions.
library(lubridate)
df$Actor <- month(temp)
df$Recipient <- day(temp)
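Because the entries were never real dates, another option is to skip date parsing altogether. The sketch below splits on the hyphen and looks the month abbreviation up in month.abb, which avoids the locale dependence of %b; it assumes Excel exported English abbreviations, as in the example rows.

```r
df <- data.frame(Subject = c("2-Mar", "5-Jun", "6-Feb"), stringsAsFactors = FALSE)

# "2-Mar" -> c("2", "Mar")
parts <- strsplit(df$Subject, "-", fixed = TRUE)

# The month abbreviation encodes the first rank, the day number the second
df$Actor     <- match(sapply(parts, `[`, 2), month.abb)
df$Recipient <- as.integer(sapply(parts, `[`, 1))
df
```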
I want to format several columns in datatable/dataframe using lubridate and column indexing.
Suppose that there is a very large data set which has several unformatted date columns. The question is how can I identify those columns (most likely through indexing) and then format them at the same time in one script using lubridate.
library(data.table)
library (lubridate)
> dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
> dt
date1 var1 date2
1 14.01.2009 2.919293 09.01.2009
2 9/2/2005 2.390123 23/8/2005
3 24/1/2010 0.878209 17.01.2000
4 28.01.2014 2.224461 04.01.2005
dt <- setDT(dt)
I tried these :
> dmy(dt$date1,dt$date2) # this does not generate two columns
[1] "2009-01-14" "2005-02-09" "2010-01-24" "2014-01-28" "2009-01-09" "2005-08-23"
[7] "2000-01-17" "2005-01-04"
> as.data.frame(dmy(dt$date1,dt$date2))
dmy(dt$date1, dt$date2) # this does not generate two columns either
1 2009-01-14
2 2005-02-09
3 2010-01-24
4 2014-01-28
5 2009-01-09
6 2005-08-23
7 2000-01-17
8 2005-01-04
dmy(dt[,.SD, .SD =c(1,3)])
[1] NA NA
> sapply(dmy(dt$date1,dt$date2),dmy)
[1] NA NA NA NA NA NA NA NA
Warning messages:
1: All formats failed to parse. No formats found.
Any help is highly appreciated.
How about:
dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
for(i in c(1,3)){
dt[,i] <- dmy(dt[,i])
}
Here's a data.table way. Suppose you have k columns named dateX:
k = 2
date_cols = paste0('date', 1:k)
for (col in date_cols) {
set(dt, j=col, value=dmy(dt[[col]]))
}
You can avoid the loop, but apparently the loop may be faster; see this answer
dt[,(date_cols) := lapply(.SD, dmy), .SDcols=date_cols]
EDIT
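If you have arbitrary column names, assuming the data looks as in the OP, you can detect the date columns from their values:
# inspect the column values (not the column names) for date-like patterns
looks_like = function(pat) names(dt)[sapply(dt, function(x) !is.numeric(x) && any(grepl(pat, x)))]
date_cols = looks_like("^\\d{4}(\\.|/)")
date_cols = c(date_cols, looks_like("(\\.|/)\\d{4}"))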
You can add regular expressions if there are more delimiters than . or /, and you can combine this into a single grep but this is clearer to me.
Far from perfect, this is a solution which should be more general:
The only assumption here is that the date columns contain digits separated by either ., / or -. If there are other separators, they can be added. But if you have another variable which looks similar but is not a date, this won't work well.
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[,j]))) dt[,j] <- dmy(dt[,j])
This loops through the columns and checks if a date could be present using regular expressions. If so, it will convert it to a date and overwrite the column.
Using data.table:
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[[j]]))) set(dt,j = j, value = dmy(dt[[j]]))
You could also replace all with any with the idea that if you have any match in the column, you could assume all of the values in that column are dates which can be read by dmy.
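The same detect-then-convert idea can be sketched in base R without lubridate, assuming every date is day-month-year with a four-digit year. The separators are normalised to a single character before as.Date is called; the var1 values are fixed numbers here rather than rnorm draws so the example is reproducible.

```r
dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),
                 var1  = c(2.9, 2.4, 0.9, 2.2),
                 date2 = c("09.01.2009", "23/8/2005", "17.01.2000", "04.01.2005"),
                 stringsAsFactors = FALSE)

# d.m.yyyy, d/m/yyyy or d-m-yyyy, anchored so that decimals such as 2.9 do not match
date_pattern <- "^\\d{1,2}[./-]\\d{1,2}[./-]\\d{4}$"

for (j in seq_along(dt)) {
  x <- dt[[j]]
  # Only touch character columns where every value looks like a date
  if (is.character(x) && all(grepl(date_pattern, x))) {
    dt[[j]] <- as.Date(gsub("[./]", "-", x), format = "%d-%m-%Y")
  }
}
str(dt)
```

Anchoring the pattern and requiring is.character() is one way to reduce the risk, mentioned above, of converting a column that merely resembles a date.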
I have a csv file, June_8th with 2 columns, a time stamp, V1 (hour:minute:second)(01:55:41) and an ID number, V2 (Not really important at this stage). I want to separate the data in 24 different sections based on the hour of the time stamp. And then find the count of how many time stamps were in each hour.
My code I've attempted is:
Time_2am = subset(June_8th, V1 >= 02:00:00 & V1 < 03:00:00)
I keep getting warning message stating:
1: In 2:0:0 : numerical expression has 3 elements: only the first one
used
2: In Ops.factor(V1, 2:0:0) : '>=' not meaningful for factors
3: In 3:0:0 : numerical expression has 4 elements: only the first one
used
4: In Ops.factor(V1, 3:0:0) : '<' not meaningful for factors
A couple things:
02:00:00 doesn't stand for a timestamp. R parses it as (2:0):0, where 2:0 is the vector c(2, 1, 0). The second : then receives a three-element vector as its left argument, which doesn't make sense, so R uses only the first value, 2, together with the second argument, 0, and again produces c(2, 1, 0). That is the source of the "only the first one used" warnings.
Your timestamps seem to have the type factor (https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html). Factors can't be compared with the usual comparison operators, and their levels might not correspond to the order of the actual timestamps.
What you can do is cast the timestamp to a string and compare it with another string, e.g., use as.character(V1) > '02:00:00'. This works because the timestamps are zero-padded.
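A small sketch of both points, using a made-up factor of three timestamps: the unquoted expression is just an integer sequence, while comparing zero-padded strings selects the 2 a.m. rows.

```r
# Unquoted, 2:0 is an integer sequence, not a time
seq_result <- 2:0
seq_result

# Timestamps read in as a factor (illustrative values)
times <- factor(c("01:55:41", "02:35:51", "03:09:34"))

# Convert to character, then compare against quoted, zero-padded strings
chr    <- as.character(times)
in_2am <- chr >= "02:00:00" & chr < "03:00:00"
in_2am
```

The string comparison is only reliable when every timestamp has the same zero-padded HH:MM:SS layout.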
If you want to separate your data in 24 sections based on the hour of the timestamp, you could extract the hours with:
library(lubridate)
hour(hms(June_8th$V1))
which gives:
> hour(hms(June_8th$V1))
[1] 1 2 3
If you want to subset your data, then you can do:
June_8th[hour(hms(June_8th$V1)) == 2,]
which gives:
V1 V2
2 02:35:51 2
In base R you can achieve the same result with:
# create an hour variable
> format(strptime(paste('2016-06-08', June_8th$V1), format = '%Y-%m-%d %H:%M:%S'), '%H')
[1] "01" "02" "03"
# subset the data to select only the time between 02:00:00 and 03:00:00
> June_8th[format(strptime(paste('2016-06-08', June_8th$V1), format = '%Y-%m-%d %H:%M:%S'), '%H') == '02',]
V1 V2
2 02:35:51 2
Used data:
June_8th <- data.frame(V1 = c('01:55:41','02:35:51','03:09:34'), V2 = 1:3)
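For the original goal of counting timestamps in each of the 24 hours, one possible base R sketch is to turn the hour into a factor with levels 0 to 23, so that empty hours are reported as zero. The extra 02:40:00 row is added here purely for illustration.

```r
June_8th <- data.frame(V1 = c("01:55:41", "02:35:51", "02:40:00", "03:09:34"),
                       V2 = 1:4)

# Hour = first two characters of the zero-padded timestamp
hours    <- as.integer(substr(as.character(June_8th$V1), 1, 2))
hour_fac <- factor(hours, levels = 0:23)

# Count of timestamps per hour, including hours with no rows
counts <- table(hour_fac)
counts

# The 24 sections themselves, as a list of data frames named "0" to "23"
by_hour <- split(June_8th, hour_fac)
```

by_hour[["2"]] then holds the rows between 02:00:00 and 03:00:00.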