I have some data of the form:
date,time,val1,val2
20090503,0:05:12,107.25,1
20090503,0:05:17,108.25,20
20090503,0:07:45,110.25,5
20090503,0:07:56,106.25,5
that comes from a csv file. I am relatively new to R, so I tried
data <-read.csv("sample.csv", header = TRUE, sep = ",")
and also tried using POSIXlt, as well as POSIXct, in the colClasses argument, but I can't seem to create one column or 'variable' out of my date and time data. I want to do this so that I can then choose arbitrary time frames over which to calculate running statistics such as max, min, and mean (and then boxplots, etc.).
I also thought that I might convert it to a time series and get around it that way,
dataTS <-ts(data)
but have not yet been able to use start, end, and frequency to my advantage. Thanks for your help.
You can't do this while reading the data into R with the colClasses argument, because the date and time span two "columns" in the CSV file. Instead, load the data and then combine the date and time columns into a single POSIXlt variable:
dat <- read.csv(textConnection("date,time,val1,val2
20090503,0:05:12,107.25,1
20090503,0:05:17,108.25,20
20090503,0:07:45,110.25,5
20090503,0:07:56,106.25,5"))
dat <- within(dat, Datetime <- as.POSIXlt(paste(date, time),
format = "%Y%m%d %H:%M:%S"))
[I presume the date is year, month, day. If not, use "%Y%d%m %H:%M:%S".]
Which gives:
> head(dat)
date time val1 val2 Datetime
1 20090503 0:05:12 107.25 1 2009-05-03 00:05:12
2 20090503 0:05:17 108.25 20 2009-05-03 00:05:17
3 20090503 0:07:45 110.25 5 2009-05-03 00:07:45
4 20090503 0:07:56 106.25 5 2009-05-03 00:07:56
> str(dat)
'data.frame': 4 obs. of 5 variables:
$ date : int 20090503 20090503 20090503 20090503
$ time : Factor w/ 4 levels "0:05:12","0:05:17",..: 1 2 3 4
$ val1 : num 107 108 110 106
$ val2 : int 1 20 5 5
$ Datetime: POSIXlt, format: "2009-05-03 00:05:12" "2009-05-03 00:05:17" ...
You can now delete date and time if you wish:
> dat <- dat[, -(1:2)]
> head(dat)
val1 val2 Datetime
1 107.25 1 2009-05-03 00:05:12
2 108.25 20 2009-05-03 00:05:17
3 110.25 5 2009-05-03 00:07:45
4 106.25 5 2009-05-03 00:07:56
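With a single date-time column in place, statistics over arbitrary time frames can be computed by binning the timestamps. The following is only a sketch, not part of the original answer: it assumes POSIXct rather than POSIXlt (POSIXct tends to be easier to keep in a data frame), a UTC time zone, and an arbitrary 2-minute window.

```r
# Rebuild the sample data from the question.
dat <- read.csv(text = "date,time,val1,val2
20090503,0:05:12,107.25,1
20090503,0:05:17,108.25,20
20090503,0:07:45,110.25,5
20090503,0:07:56,106.25,5")

# POSIXct stores one number per row, which suits data frame columns.
# tz = "UTC" is an assumption; use the zone the data were recorded in.
dat$Datetime <- as.POSIXct(paste(dat$date, dat$time),
                           format = "%Y%m%d %H:%M:%S", tz = "UTC")

# Assign each row to a 2-minute window (the width is an arbitrary choice).
dat$window <- cut(dat$Datetime, breaks = "2 min")

# Min, mean and max of val1 within each window.
stats <- do.call(rbind, lapply(split(dat$val1, dat$window), function(x) {
  c(min = min(x), mean = mean(x), max = max(x))
}))
print(stats)
```

Changing the breaks argument (for example "1 hour" or "1 day") changes the time frame without touching the rest of the code.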
loop for list element with datetime in r
I have a df named mistake. I split the mistake df by ID, so now I have over 300 different objects in the list.
library(dplyr)
df <- split.data.frame(mistake, mistake$ID)
Every list object has two different datetime stamps. First, I need the number of minutes between these two datetime stamps. Then I duplicate the rows of the object according to the variable stay (which is also the difftime between the start and end time). Then I overwrite the test variable with the per-minute increments given by n_minutes.
library(lubridate)
start_date <- df[[1]]$datetime
end_date <- df[[1]]$gehtzeit
n_minutes <- interval(start_date,end_date)/minutes(1)
see <- start_date + minutes(0:n_minutes)#the diff time in minutes I need
df[[1]]$test<- Sys.time()#a new variable
df[[1]] <- data.frame(df[[1]][rep(seq_len(dim(df[[1]])[1]),df[[1]]$stay+1),1:17, drop= F], row.names=NULL)
df[[1]]$test <- format(start_date + minutes(0:n_minutes), format = "%d.%m.%Y %H:%M:%S")
I want to do this with every object in the list, and then 'rbind' or 'unsplit' my list. I know I need a loop, but I don't know how to do this with the list elements.
Any help would be great!
Here is a small df example:
mistake
Baureihe Verbund Fahrzeug Code Codetext Subsystem Kommt.Zeit
71 411 ICE1166 93805411866-7 1A50 Querfederdruck 1 ungleich Sollwert Neigetechnik 29.07.2018 23:00:07
72 411 ICE1166 93805411866-7 1A50 Querfederdruck 1 ungleich Sollwert Neigetechnik 04.08.2018 11:16:41
Geht.Zeit Anstehdauer Jahr Monat KW Tag Wartung.geht datetime gehtzeit
71 29.07.2018 23:02:56 00 Std 02 Min 49 Sek 2018 7 KW30 29 0 2018-07-29 23:00:00 2018-07-29 23:02:00
72 04.08.2018 11:19:20 00 Std 02 Min 39 Sek 2018 8 KW31 4 0 2018-08-04 11:16:00 2018-08-04 11:19:00
bleiben ID
71 2 secs 2018-07-29 23:00:00 2018-07-29 23:02:00 1A50
72 3 secs 2018-08-04 11:16:00 2018-08-04 11:19:00 1A50
And here is the structure:
str(mistake)
'data.frame': 2 obs. of 18 variables:
$ Baureihe : int 411 411
$ Verbund : Factor w/ 1 level "ICE1166": 1 1
$ Fahrzeug : Factor w/ 7 levels "93805411066-4",..: 7 7
$ Code : Factor w/ 6 levels "1A07","1A0E",..: 3 3
$ Codetext : Factor w/ 6 levels "ITD Karte gestört",..: 5 5
$ Subsystem : Factor w/ 1 level "Neigetechnik": 1 1
$ Kommt.Zeit : Factor w/ 70 levels "02.08.2018 00:07:23",..: 68 6
$ Geht.Zeit : Factor w/ 68 levels "01.08.2018 01:30:25",..: 68 8
$ Anstehdauer : Factor w/ 46 levels "00 Std 00 Min 01 Sek ",..: 12 4
$ Jahr : int 2018 2018
$ Monat : int 7 8
$ KW : Factor w/ 5 levels "KW27","KW28",..: 4 5
$ Tag : int 29 4
$ Wartung.geht: int 0 0
$ datetime : POSIXlt, format: "2018-07-29 23:00:00" "2018-08-04 11:16:00"
$ gehtzeit : POSIXlt, format: "2018-07-29 23:02:00" "2018-08-04 11:19:00"
$ bleiben :Class 'difftime' atomic [1:2] 2 3
.. ..- attr(*, "units")= chr "secs"
$ ID : chr "2018-07-29 23:00:00 2018-07-29 23:02:00 1A50" "2018-08-04 11:16:00 2018-08-04 11:19:00 1A50"
Consider building a generalized user-defined function that receives a data frame as input parameter. Then, call the function with by. Like split, by also subsets a data frame by one or more factor(s) such as ID but, unlike split, by can then pass subsets into a function. To row bind all together, run do.call at end.
Below removes the redundant df$test <- Sys.time() which is overwritten later and uses see object inside format() call at end to avoid re-calculation and repetition.
calc_datetime <- function(df) {
# INITIAL CALCS
start_date <- df$datetime
end_date <- df$gehtzeit
n_minutes <- interval(start_date, end_date)/minutes(1)
see <- start_date + minutes(0:n_minutes) # the diff time in minutes I need
# BUILD OUTPUT DF
df <- data.frame(df[rep(seq_len(dim(df)[1]), df$stay+1), 1:17, drop= F], row.names=NULL)
df$test <- format(see, format = "%d.%m.%Y %H:%M:%S")
return(df)
}
# BUILD LIST OF SUBSETTED DFs
df_list <- by(mistake, mistake$ID, calc_datetime)
# APPEND ALL RESULT DFs TO SINGLE FINAL DF
final_df <- do.call(rbind, df_list)
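The split, apply, and row-bind pattern can also be tried on a tiny data set using base R only. This is a hedged sketch rather than the code above: the two-row data frame, the IDs "A" and "B", and the helper name expand_minutes are made up for illustration, and it assumes each ID holds exactly one row with POSIXct start and end times.

```r
# Hypothetical miniature of the 'mistake' data: one row per ID.
mistake <- data.frame(
  ID = c("A", "B"),
  datetime = as.POSIXct(c("2018-07-29 23:00:00", "2018-08-04 11:16:00"), tz = "UTC"),
  gehtzeit = as.POSIXct(c("2018-07-29 23:02:00", "2018-08-04 11:19:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Repeat the single row once per minute between start and end (inclusive)
# and store the formatted per-minute timestamp in 'test'.
expand_minutes <- function(d) {
  stamps <- seq(d$datetime, d$gehtzeit, by = "1 min")
  out <- d[rep(1, length(stamps)), , drop = FALSE]
  out$test <- format(stamps, "%d.%m.%Y %H:%M:%S")
  out
}

# Apply the helper to every ID and stack the results.
pieces <- lapply(split(mistake, mistake$ID), expand_minutes)
final_df <- do.call(rbind, pieces)
rownames(final_df) <- NULL
print(final_df[, c("ID", "test")])
```

Using seq() with by = "1 min" avoids the lubridate dependency, at the cost of assuming the start and end times fall on whole minutes.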
Along the same lines as Parfait's answer, and using the same user defined function calc_datetime, but I would use map_dfr from the purrr package:
library(purrr)
df_list <- split(mistake, mistake$ID)
final_df <- map_dfr(df_list, calc_datetime) # apply the function to each subset and row-bind the results
If you update the question with data I can use, I can give a working demonstration.
The following is my ex.csv data as read into R.
Date pr pa
1 2015-01-01 6497985 4833118
2 2015-02-01 88289 4305786
3 2015-03-01 0 1149480
4 2015-04-01 0 16706470
5 2015-05-01 0 7025197
6 2015-06-01 0 6752085
Also, here is the raw data:
Date,pr,pa
2015/1/1,6497985,4833118
2015/2/1,88289,4305786
2015/3/1,0,1149480
2015/4/1,0,16706470
2015/5/1,0,7025197
2015/6/1,0,6752085
How can I use the R package dygraphs with this data?
> str(ex)
'data.frame': 6 obs. of 3 variables:
$ Date: Factor w/ 6 levels "2015/1/1","2015/2/1",..: 1 2 3 4 5 6
$ pr : int 6497985 88289 0 0 0 0
$ pa : int 4833118 4305786 1149480 16706470 7025197 6752085
> dygraph(ex)
Error in dygraph(ex) : Unsupported type passed to argument 'data'.
Please help me. I would appreciate it a lot.
Here are the steps to get it done: First, you need to convert your strings to a Date that is understandable for R. Then convert your data to an xts time series (required by dygraphs). Then plot it with dygraphs.
library(dygraphs)
library(xts)
data <- read.csv("test.csv")
data$Date <- as.Date(data$Date, format = "%Y/%m/%d") # convert to Date
time_series <- xts(data[, c("pr", "pa")], order.by = data$Date) # numeric columns only, indexed by date
dygraph(time_series) # now plot
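Before handing the data to xts, it can help to confirm that the dates parsed and that only numeric columns remain. This is a small base R check, sketched under the assumption that the file matches the raw data shown in the question; the object name values is made up for illustration.

```r
# Raw data from the question, read inline instead of from a file.
ex <- read.csv(text = "Date,pr,pa
2015/1/1,6497985,4833118
2015/2/1,88289,4305786
2015/3/1,0,1149480
2015/4/1,0,16706470
2015/5/1,0,7025197
2015/6/1,0,6752085")

# as.character() guards against Date having been read as a factor.
ex$Date <- as.Date(as.character(ex$Date), format = "%Y/%m/%d")

# Keep only the measurement columns; a matrix that included Date
# would be coerced to character.
values <- as.matrix(ex[, c("pr", "pa")])

stopifnot(!anyNA(ex$Date), is.numeric(values))
print(ex)
```

If both checks pass, values and ex$Date are in the shape that xts(values, order.by = ex$Date) expects.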
I have a dataset called marathon, and I have tried to use lubridate and chron to convert the character values of marathon$Official.Time into time values that I can work with. I would like the times to be shown in minutes (meaning that 2 hours is shown as 120 minutes).
'data.frame': 5616 obs. of 11 variables:
$ Overall.Position : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender.Position : int 1 2 3 4 5 6 7 8 9 10 ...
$ Category.Position: int 1 1 2 2 3 4 3 4 5 5 ...
$ Category : chr "MMS" "MMI" "MMI" "MMS" ...
$ Race.No : int 21080 14 2 21077 18 21 21078 21090 21084 12 ...
$ Country : chr "Kenya" "Kenya" "Ethiopia" "Kenya" ...
$ Official.Time : chr "2:12:12" "2:12:14" "2:12:20" "2:12:29" ...
I tried with:
library(lubridate)
times(marathon$Official.Time)
Or
library(chron)
chron(times=marathon$Official.Time)
as.difftime(marathon$Official.Time, units = "mins")
But I only get NA
You were almost there with difftime (which requires two times and gives you the difference). Instead, use as.difftime (which requires one "difference", i.e. the marathon time) and specify the format as hours:minutes:seconds.
> as.difftime("2:12:12", format="%H:%M:%S", units="mins")
Time difference of 132.2 mins
> as.numeric(as.difftime("2:12:12", format="%H:%M:%S", units="mins"))
[1] 132.2
No extra packages needed.
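The same call is vectorised, so it can be applied to the whole column at once. A minimal sketch, assuming a small stand-in for the marathon data frame and that every time is below 24 hours (the %H field only covers 0 to 23, so longer durations would come back as NA):

```r
# Stand-in for the marathon data; only the time column matters here.
marathon <- data.frame(
  Official.Time = c("2:12:12", "2:12:14", "2:12:20", "2:12:29"),
  stringsAsFactors = FALSE
)

# Parse each "H:M:S" string as a duration and express it in minutes.
marathon$minutes <- as.numeric(
  as.difftime(marathon$Official.Time, format = "%H:%M:%S", units = "mins")
)
print(marathon)
```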
NOTE: @mathematical.coffee's solution is considerably better than these.
Pretty straightforward to kick it out manually:
library(stringi)
library(purrr)
df <- data.frame(Official.Time=c("2:12:12","2:12:14","2:12:20","2:12:29"),
stringsAsFactors=FALSE)
map(df$Official.Time, function(x) {
stri_split_fixed(x, ":")[[1]] %>%
as.numeric() %>%
`*`(c(60, 1, 1/60)) %>%
sum()
}) -> df$minutes
df
## Official.Time minutes
## 1 2:12:12 132.2
## 2 2:12:14 132.2333
## 3 2:12:20 132.3333
## 4 2:12:29 132.4833
You can also do it with just base R operations and w/o "piping":
df$minutes <- sapply(df$Official.Time, function(x) {
x <- strsplit(x, ":", TRUE)[[1]]
x <- as.numeric(x)
x <- x * (c(60, 1, 1/60))
sum(x)
}, USE.NAMES=FALSE)
If "stuck" with base R then I'd probably actually do:
vapply(df$Official.Time, function(x) {
x <- strsplit(x, ":", TRUE)[[1]]
x <- as.numeric(x)
x <- x * (c(60, 1, 1/60))
sum(x)
}, double(1), USE.NAMES=FALSE)
to ensure type safety.
But, chron can also be used:
library(chron)
60 * 24 * as.numeric(times(df$Official.Time))
NOTE that lubridate has no times() function.
I have a data set with the following variables:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken (288 intervals per day)
The main data set:
> head(activityData, 3)
steps date interval
1 1.7169811 2012-10-01 0
2 0.3396226 2012-10-01 5
3 0.1320755 2012-10-01 10
> str(activityData)
'data.frame': 17568 obs. of 3 variables:
$ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: num 0 5 10 15 20 25 30 35 40 45 ...
The data set has a range of two months.
I had to divide it into weekdays and weekend days. I did it with the following commands:
> dataAs.xtsWeekday <- dataAs.xts[.indexwday(dataAs.xts) %in% 1:5]
> dataAs.xtsWeekend <- dataAs.xts[.indexwday(dataAs.xts) %in% c(0, 6)]
After doing this I had to make some calculations, which failed, so I decided to export the files and read them in again.
After I imported the data again, I made the calculations I wanted and tried to merge the 2 data sets, but did not succeed.
First data set:
> head(weekdays, 3)
X steps date interval daytype
1 1 37.3826 2012-10-01 0 weekday
2 2 37.3826 2012-10-01 5 weekday
3 3 37.3826 2012-10-01 10 weekday
> str(weekdays)
'data.frame': 12960 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 37.4 37.4 37.4 37.4 37.4 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekday" "weekday" "weekday" "weekday" ...
Second data set:
> head(weekend, 3)
X steps date interval daytype
1 1 0 2012-10-06 0 weekend
2 2 0 2012-10-06 5 weekend
3 3 0 2012-10-06 10 weekend
> str(weekend)
'data.frame': 4608 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 0 0 0 0 0 0 0 0 0 0 ...
$ date : chr "2012-10-06" "2012-10-06" "2012-10-06" "2012-10-06" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekend" "weekend" "weekend" "weekend" ...
Now I would like to merge the 2 data sets (weekdays, weekends) by date, but the problem is that I don't have any common dates or anything else common.
The final data set should have 4 columns and 17568 observations.
The columns should be:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
daytype: weekends days or normal weekdays.
I tried with:
merge
join(plyr)
union
Everywhere I looked all the data sets had a common ID or a common column in both data sets, not like in my case.
I also looked here, but I did not understand much, and at many other posts, but they had nothing in common with my data set.
The other option I thought about was to add a column called "ID" to the original data set and redo everything I have done so far, which I'll have to do if I don't find a way around this problem.
I would like some advice on how to proceed or what to try next.
Since you mentioned that your final data set should have 17568 (=4608+12960) observations/rows, I assume you want to stack the two data.frames on top of each other (and possibly order them by date afterwards). This is done with rbind().
finaldata <- rbind(weekdays, weekend)
If you want to remove column X:
finaldata$X <- NULL
To convert your date column to actual dates:
finaldata$date <- as.Date(finaldata$date, format="%Y-%m-%d")
To order the whole data by date:
finaldata <- finaldata[order(finaldata$date),]
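As an alternative that avoids splitting and re-stacking altogether, the daytype column could be added to the original data frame directly. This is a sketch of that idea and not the approach described above; the four-row activityData below is made up for illustration, and format(..., "%u") is used because it returns the ISO weekday number (1 = Monday, 7 = Sunday) regardless of locale.

```r
# Hypothetical miniature of activityData covering weekdays and a weekend.
activityData <- data.frame(
  steps = c(1.7, 0.3, 0, 0),
  date = c("2012-10-01", "2012-10-05", "2012-10-06", "2012-10-07"),
  interval = c(0, 5, 0, 5),
  stringsAsFactors = FALSE
)

# ISO weekday number as a string: "1" (Monday) through "7" (Sunday).
day_number <- format(as.Date(activityData$date), "%u")

# Saturday ("6") and Sunday ("7") are weekend days; everything else is a weekday.
activityData$daytype <- ifelse(day_number %in% c("6", "7"), "weekend", "weekday")
print(activityData)
```

Any per-daytype calculation can then be done on the single data frame, for example with aggregate(steps ~ daytype + interval, data = activityData, FUN = mean).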
I have a .csv file that I have loaded into R using the following basic command:
lace <- read.csv("lace for R.csv")
It pulls in my data just fine. Here is the str of the data:
str(lace)
'data.frame': 2054 obs. of 20 variables:
$ Admission.Day : Factor w/ 872 levels "1/1/2013","1/10/2011",..: 231 238 238 50 59 64 64 64 67 67 ...
$ Year : int 2010 2010 2010 2011 2011 2011 2011 2011 2011 2011 ...
$ Month : int 12 12 12 1 1 1 1 1 1 1 ...
$ Day : int 28 30 30 3 4 6 6 6 7 7 ...
$ DayOfWeekNumber : int 3 5 5 2 3 5 5 5 6 6 ...
$ Day.of.Week : Factor w/ 7 levels "Friday","Monday",..: 6 5 5 2 6 5 5 5 1 1 ...
What I am trying to do is create three (3) different histograms and then plot them all together on one. I want to create a histogram for each year, where the x axis or labels will be the days of the week starting with Sunday and ending on Saturday.
Firstly, how would I go about creating a histogram out of factors, which is how the days of the week are stored?
Secondly how do I create a histogram for the days of the week for a given year?
I have tried using the following post here but cannot get it working. I use the Admission.Day as the variable and get an error message:
dat <- as.Date(lace$Admission.Day)
Error in charToDate(x) : character string is not in a standard unambiguous format
Thank you,
Expanding on the comment above: the problem seems to be with importing the dates, rather than with making the histogram. Assuming there is an Excel workbook "lace for R.xlsx", with a sheet "lace":
## Not tested...
library(XLConnect)
myData <- "lace for R.xlsx" # NOTE: need path also...
wb <- loadWorkbook(myData)
lace <- readWorksheet(wb, sheet="lace")
lace$Admission.Day <- as.Date(lace$Admission.Day)
should provide dates that work with all the R date functions. Also, the lubridate package provides a number of functions that are more intuitive to use than format(...).
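If the CSV file is kept instead, the dates could also be rebuilt from the Year, Month, and Day columns that are already in the data, which sidesteps the ambiguous Admission.Day strings. A sketch under stated assumptions: the three rows below are made up to resemble the question's data, and the English day names are hard-coded so that the Sunday to Saturday order does not depend on the locale.

```r
# Hypothetical rows shaped like the question's 'lace' data.
lace <- data.frame(
  Admission.Day = c("12/28/2010", "1/3/2011", "1/10/2011"),
  Year = c(2010L, 2011L, 2011L),
  Month = c(12L, 1L, 1L),
  Day = c(28L, 3L, 10L),
  stringsAsFactors = FALSE
)

# Build unambiguous dates from the numeric components.
lace$date <- as.Date(ISOdate(lace$Year, lace$Month, lace$Day))

# "%w" gives the weekday as 0 (Sunday) to 6 (Saturday); map it to names
# and fix the factor levels so plots run Sunday to Saturday.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
lace$dow <- factor(day_names[as.integer(format(lace$date, "%w")) + 1],
                   levels = day_names)

# Counts per year and weekday; one row of this table per yearly histogram.
counts <- table(lace$Year, lace$dow)
print(counts)
```

A base R plot of one year could then be drawn with barplot(counts["2011", ]).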
Then, as an example:
library(lubridate) # for year(...) and wday(...)
library(ggplot2)
# random dates around Jun 1, across 5 years...
set.seed(123)
lace <- data.frame(date=as.Date(rnorm(1000,sd=50)+365*(0:4),origin="2008/6/1"))
lace$year <- factor(year(lace$date))
lace$dow <- wday(lace$date, label=T)
# This creates the histograms...
ggplot(lace) +
geom_bar(aes(x=dow, fill=year)) + # counts per weekday, fill color by year (geom_histogram needs a continuous x in current ggplot2)
facet_grid(~year) + # facet by year
theme(axis.text.x=element_text(angle=90)) # to rotate weekday names...
This produces one panel per year, each showing the count of dates falling on each day of the week.