I have a data frame containing a date column and a unique ID. I simply want to extract the first observation of each day.
I tried to use the dplyr package, the aggregate function, and date functions, but I'm still a beginner in R. I also looked for an answer on this forum without success. Thanks in advance for your help!
Here is the situation:
df <- as.data.frame(c(2013-01-12 07:30:00, 2013-01-12 12:40:00, 2013-01-16 06:50:00, 2013-01-16 15:10:00, 2013-01-14 11:20:00, 2013-01-14 08:15:00),
c(A,B,E,F,C,D))
The outcome should be:
2013-01-12 07:30:00 A
2013-01-14 08:15:00 D
2013-01-16 06:50:00 E
Try the code below. Note that I have edited your example data.
library(dplyr)
df <- data.frame(date = as.POSIXct(c("2013-01-12 07:30:00",
"2013-01-12 12:40:00",
"2013-01-16 06:50:00",
"2013-01-16 15:10:00",
"2013-01-14 11:20:00",
"2013-01-14 08:15:00")),
id = letters[1:6])
df %>%
group_by(as.Date(date)) %>%
filter(date == min(date))
The result should look like this:
# A tibble: 3 x 3
# Groups: as.Date(date) [3]
date id `as.Date(date)`
<dttm> <fct> <date>
1 2013-01-12 07:30:00 a 2013-01-12
2 2013-01-16 06:50:00 c 2013-01-16
3 2013-01-14 08:15:00 f 2013-01-14
Here is an approach using aggregate from the stats package. I have also edited your dataset definition:
df <- data.frame(times=strptime(c('2013-01-12 07:30:00', '2013-01-12 12:40:00',
'2013-01-16 06:50:00', '2013-01-16 15:10:00',
'2013-01-14 11:20:00', '2013-01-14 08:15:00'),
format = "%Y-%m-%d %H:%M:%S"),
id=c('A','B','E','F','C','D'))
df$day <- as.Date(df$times, format='%Y-%m-%d') #create a day column
aggregate(times ~ day, data = df, FUN='min')
# day times
# 1 2013-01-12 2013-01-12 07:30:00
# 2 2013-01-14 2013-01-14 08:15:00
# 3 2013-01-16 2013-01-16 06:50:00
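Note that the aggregate() call above returns only the day and the earliest time, not the id. Here is a minimal base R sketch that keeps the whole row. It assumes the same sample data and an explicit UTC time zone (my assumption, so that day boundaries do not depend on the machine's settings). The names ordered and first_per_day are introduced here for illustration only.

```r
# Sample data from the question; tz = "UTC" is an assumption made here
df <- data.frame(
  times = as.POSIXct(c("2013-01-12 07:30:00", "2013-01-12 12:40:00",
                       "2013-01-16 06:50:00", "2013-01-16 15:10:00",
                       "2013-01-14 11:20:00", "2013-01-14 08:15:00"),
                     tz = "UTC"),
  id = c("A", "B", "E", "F", "C", "D"),
  stringsAsFactors = FALSE
)
df$day <- as.Date(df$times, tz = "UTC")

# Sort by time, then keep the first row seen for each day
ordered <- df[order(df$times), ]
first_per_day <- ordered[!duplicated(ordered$day), ]
print(first_per_day[, c("times", "id")])
```

Because the rows are sorted before duplicated() is applied, the first row kept for each day is also the earliest one.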
I know a lot of questions have been asked on the same subject, but I have not found an answer to this particular question, despite trying to adapt other code to my problem.
My data frame "v1" has more than 300 thousand rows with the variable "Date" in the following format:
Date
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:45:00
First, I want to know whether all the "Date" values are spaced at 5-minute intervals. If not, I would like to find where the intervals differ.
Second, I intend to create a new column showing the interval between consecutive time stamps, for example "time_int" containing "00:05:00", "00:05:00", ...
Any help will be appreciated. Thank you in advance.
Here is an option to calculate the difference using lag. If you'd like, you could create another column showing hours with units = "hours".
library(tidyverse)
library(lubridate)
df <- data.frame(date = ymd_hms(c("2015-07-27 17:35:00",
"2015-07-27 17:40:00", "2015-07-27 17:49:00", "2015-07-27 19:49:00")))
df %>%
mutate(diff = date - lag(date),
diff_minutes = as.numeric(diff, units = "mins"),
time_int = format(.POSIXct(diff_minutes*60, "UTC"), "%H:%M:%S")) %>%
select(date, diff_minutes, time_int) %>%
# Filter the data for a range of minutes
filter(diff_minutes >= 5 & diff_minutes < 10)
# OUTPUT:
#> date diff_minutes time_int
#> 1 2015-07-27 17:40:00 5 00:05:00
#> 2 2015-07-27 17:49:00 9 00:09:00
Created on 2021-03-09 by the reprex package (v0.3.0)
Original Data
date
<S3: POSIXct>
2015-07-27 17:35:00
2015-07-27 17:40:00
2015-07-27 17:49:00
2015-07-27 19:49:00
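The same check can also be sketched in base R with diff(), without lag(). This assumes the four sample timestamps above and a UTC time zone (my assumption). The names gap_mins, all_regular and irregular_rows are made up for this example.

```r
# Sample timestamps from the answer above; tz = "UTC" is an assumption
dates <- as.POSIXct(c("2015-07-27 17:35:00", "2015-07-27 17:40:00",
                      "2015-07-27 17:49:00", "2015-07-27 19:49:00"),
                    tz = "UTC")

# Gap to the previous row in minutes; the first row has no predecessor
gap_mins <- c(NA, as.numeric(diff(dates), units = "mins"))

# TRUE only if every gap is exactly 5 minutes
all_regular <- all(gap_mins[-1] == 5)

# Row numbers whose gap differs from 5 minutes (which() skips the NA)
irregular_rows <- which(gap_mins != 5)

print(all_regular)
print(data.frame(dates, gap_mins)[irregular_rows, ])
```

Here all_regular answers the first part of the question (are all intervals 5 minutes?) and irregular_rows shows where they are not.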
You can use rollapplyr to find the time difference between two consecutive rows. You can then use which to find the rows where the time difference is not 5 minutes.
dt=read.table(text=text, header=TRUE)
library(lubridate)
library(dplyr)
library(zoo)
dt=mutate(dt, Date=ymd_hms(Date)) %>%
mutate(Dif=rollapplyr(Date, 2, function(x) {
return(difftime(x[2], x[1]))
}, fill=NA))
dt
Date Dif
1 2015-07-27 17:35:00 NA
2 2015-07-27 17:40:00 5
3 2015-07-27 17:45:00 5
4 2015-07-27 17:49:00 4
dt[which(dt$Dif != as.difftime(5, units="mins")),]
Date Dif
4 2015-07-27 17:49:00 4
Lastly, to format the times in your desired format:
dt %>% mutate(DifString=format(.POSIXct(Dif*60, tz="GMT"), "%H:%M:%S"))
Date Dif DifString
1 2015-07-27 17:35:00 NA <NA>
2 2015-07-27 17:40:00 5 00:05:00
3 2015-07-27 17:45:00 5 00:05:00
4 2015-07-27 17:49:00 4 00:04:00
Data
text="Date
'2015-07-27 17:35:00'
'2015-07-27 17:40:00'
'2015-07-27 17:45:00'
'2015-07-27 17:49:00'"
dt=read.table(text=text, header=TRUE)
I have two columns.
One is a date:
2011-04-13
2013-07-29
2010-11-23
The other is a time (the hour):
3
22
15
I want to make a new column that contains the date and time, like this:
2011-04-13 3:00:00
2013-07-29 22:00:00
2010-11-23 15:00:00
I managed to combine them as a string, but when I convert them to a datetime I get only the date; the time disappears.
Any idea how to get the date and time in one column?
My script:
data <- read.csv("d:\\__r\\hour.csv")
data$date <- as.POSIXct(paste(data$dteday , paste(data$hr, ":00:00", sep=""), sep=" "))
As an example, you can use the ymd_hm function from lubridate:
a <- c("2014-09-08", "2014-09-08", "2014-09-08")
b <- c(3, 4, 5)
library(lubridate)
library(tidyverse)
tibble(a, b) %>%
mutate(time = paste0(a, " ", b, "-0"),
time = ymd_hm(time))
The output would be:
# A tibble: 3 x 3
a b time
<chr> <dbl> <dttm>
1 2014-09-08 3 2014-09-08 03:00:00
2 2014-09-08 4 2014-09-08 04:00:00
3 2014-09-08 5 2014-09-08 05:00:00
I found that this fixed the problem:
data$date <- as.POSIXct(strptime(paste(data$dteday , paste(data$hr, ":00:00", sep=""), sep=" "), "%Y-%m-%d %H:%M:%S"))
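Another way to do it, as a rough sketch, is to skip the string pasting and add the hours to the date as seconds. This assumes the hour column is numeric and uses the sample values from the question. The UTC time zone and the variable names are my own assumptions.

```r
# Sample values from the question; tz = "UTC" is an assumption made here
dteday <- c("2011-04-13", "2013-07-29", "2010-11-23")
hr <- c(3, 22, 15)

# Midnight of each date, plus the hour expressed in seconds
datetime <- as.POSIXct(dteday, tz = "UTC") + hr * 3600

print(format(datetime, "%Y-%m-%d %H:%M:%S"))
```

Adding a number to a POSIXct value shifts it by that many seconds, so hr * 3600 moves each date forward by the given hours.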
I have a data frame with many days of data, and I want to obtain the max and min per day, but I'm getting back the same data frame I started with, still showing one value per hour. The original data frame looks like this:
date temperature
1: 2006-04-17 00:00:00 12.67833
2: 2006-04-17 01:00:00 12.14133
3: 2006-04-17 02:00:00 10.36833
4: 2006-04-17 03:00:00 10.78600
5: 2006-04-17 04:00:00 10.76967
6: 2006-04-17 05:00:00 10.92467
And I'm getting this:
date Max Min
1: 2006-04-17 00:00:00 12.67833 12.67833
2: 2006-04-17 01:00:00 12.14133 12.14133
3: 2006-04-17 02:00:00 10.36833 10.36833
4: 2006-04-17 03:00:00 10.78600 10.78600
5: 2006-04-17 04:00:00 10.76967 10.76967
6: 2006-04-17 05:00:00 10.92467 10.92467
I'm using the following code:
library(lubridate)
datatemp<- read.csv("04_2006.csv", header = T)
datatemp$date_time<-parse_date_time(datatemp$date_time,orders = "mdy HMS")
temp_aveg<-aggregate(list(temperature = datatemp$temp),
list(date = cut(datatemp$date_time, "1 hour")),
mean)
library(data.table)
Tmaxmin<-setDT(temp_aveg)[, list(Max=max(temperature), Min=min(temperature)), by=list(date)]
I don't know what I'm missing.
You are grouping on both the date and the time rather than just the date.
A solution using lubridate and dplyr.
library(lubridate)
library(dplyr)
datatemp$date <- date(datatemp$date_time)
datatemp <- na.omit(datatemp)
output <- datatemp %>%
group_by(date) %>%
summarise(max_val = max(temperature),
min_val = min(temperature))
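The same idea (group on the calendar day, not on the hourly timestamp) can be sketched in base R. This uses only the six rows shown in the question, which all fall on one day, so the result collapses to a single row. The column names date_time and temperature, the UTC time zone, and the helper names are assumptions for this example.

```r
# The six hourly rows shown in the question; tz = "UTC" is an assumption
datatemp <- data.frame(
  date_time = as.POSIXct(c("2006-04-17 00:00:00", "2006-04-17 01:00:00",
                           "2006-04-17 02:00:00", "2006-04-17 03:00:00",
                           "2006-04-17 04:00:00", "2006-04-17 05:00:00"),
                         tz = "UTC"),
  temperature = c(12.67833, 12.14133, 10.36833, 10.78600, 10.76967, 10.92467)
)

# Drop the time of day so that all rows of one day share the same key
datatemp$day <- as.Date(datatemp$date_time, tz = "UTC")

# One max and one min per day, then joined on the day
daily_max <- aggregate(temperature ~ day, data = datatemp, FUN = max)
daily_min <- aggregate(temperature ~ day, data = datatemp, FUN = min)
daily <- merge(daily_max, daily_min, by = "day", suffixes = c("_max", "_min"))
print(daily)
```

In the original attempt every hourly timestamp is its own group, so each group has one value and max equals min.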
I have a data.frame looking like this:
date1 date2
2015-09-17 03:07:00 2015-09-17 11:53:00
2015-09-17 08:00:00 2015-09-18 11:48:59
2015-09-18 15:58:00 2015-09-22 12:14:00
2015-09-22 12:14:00 2015-09-24 13:58:21
I'd like to combine these two into one column, something like:
dates
2015-09-17 03:07:00
2015-09-17 11:53:00
2015-09-17 08:00:00
2015-09-18 11:48:59
2015-09-18 15:58:00
2015-09-22 12:14:00
2015-09-22 12:14:00
2015-09-24 13:58:21
Please note that dates (like the last but one and the last but two) can be equal. Now I'd like to add a column 'value'. For every date that has its origin in date1, the value should be 1; if its origin is in date2, it should be 2.
Adding a new column is obvious. Merging works fine. I've used:
df <- as.data.frame(df$date1)
df <- data.frame(date1 = c(df$date1, test$date2 ))
That works perfectly fine for the merging of the columns, but how to get the correct value for df$value?
The result should be:
dates value
2015-09-17 03:07:00 1
2015-09-17 11:53:00 2
2015-09-17 08:00:00 1
2015-09-18 11:48:59 2
2015-09-18 15:58:00 1
2015-09-22 12:14:00 1
2015-09-22 12:14:00 2
2015-09-24 13:58:21 2
I tried to mock your problem.
If you are not concerned about time complexity, this is the simplest solution that I can suggest.
a = c(1,3,5)
b = c(2,4,6)
df = data.frame(a, b)
d1 = c()
d2 = c()
for(counter in 1:length(df$a))
{
d1 = c(d1,df$a[counter],df$b[counter])
d2 = c(d2,1,2)
}
df = data.frame(d1, d2)
print(df)
Input:
a b
1 2
3 4
5 6
Output:
d1 d2
1 1
2 2
3 1
4 2
5 1
6 2
Can't you just do something like this?
dates1 <- data.frame(dates = c("2015-09-17 03:07:00",
"2015-09-17 08:00:00",
"2015-09-18 15:58:00",
"2015-09-22 12:14:00"), value = 1)
dates2 <- data.frame(dates = c("2015-09-17 11:53:00",
"2015-09-18 11:48:59",
"2015-09-22 12:14:00",
"2015-09-24 13:58:21"), value = 2)
# row-bind the two data.frames
df <- rbind(dates1, dates2)
# if "dates" is in a string format, convert to timestamp
df$dates <- strptime(df$dates, format = "%Y-%m-%d %H:%M:%S")
# order by "dates"
df[order(df$dates),]
# result:
dates value
1 2015-09-17 03:07:00 1
2 2015-09-17 08:00:00 1
5 2015-09-17 11:53:00 2
6 2015-09-18 11:48:59 2
3 2015-09-18 15:58:00 1
4 2015-09-22 12:14:00 1
7 2015-09-22 12:14:00 2
8 2015-09-24 13:58:21 2
There might be a more clever solution, but I'd just separate each column into its own data frame, add a value column, and then rbind() into a single dates data frame.
# one data frame per source column, each with its own value
df1 <- data.frame(dates = df$date1, value = 1)
df2 <- data.frame(dates = df$date2, value = 2)
dates <- rbind(df1, df2)
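If the interleaved order from the question matters (each row's date1 followed by its date2), one possible sketch is to remember the source row and sort on it. This assumes the sample data from the question and a UTC time zone. The helper column source_row and the names stacked and interleaved are hypothetical.

```r
# Sample data from the question; tz = "UTC" is an assumption made here
df <- data.frame(
  date1 = as.POSIXct(c("2015-09-17 03:07:00", "2015-09-17 08:00:00",
                       "2015-09-18 15:58:00", "2015-09-22 12:14:00"),
                     tz = "UTC"),
  date2 = as.POSIXct(c("2015-09-17 11:53:00", "2015-09-18 11:48:59",
                       "2015-09-22 12:14:00", "2015-09-24 13:58:21"),
                     tz = "UTC")
)
n <- nrow(df)

# Stack both columns, recording which column and which row each value came from
stacked <- data.frame(
  dates = c(df$date1, df$date2),
  value = rep(c(1, 2), each = n),
  source_row = rep(seq_len(n), times = 2)
)

# Sort by original row, then by source column, to interleave date1 and date2
interleaved <- stacked[order(stacked$source_row, stacked$value), c("dates", "value")]
print(interleaved)
```

Sorting on the source row instead of the timestamp keeps equal dates in a well-defined order.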
I have hourly values for precipitation that I'd like to sum up over the day.
My data (Nd_hourly) looks like this:
Datum Uhrzeit Nd
1 2013-05-01 01:00:00 0.0
2 2013-05-01 02:00:00 0.1
3 2013-05-01 03:00:00 0.0
4 2013-05-01 04:00:00 0.3
(date, time, precipitation)
and I'd like to have an output of Datum - Nd.
I calculated the min and max temperature with the package plyr and the function ddply, using
t_maxmin=ddply(t_air,.(Datum),summarize,Datum=Datum[which.max(T_Luft)],max.value=max(T_Luft),min.value=min(T_Luft))
I then tried to do something similar for the precipitation and tried
Nd_daily=ddply(Nd_hourly,.(Datum),summarize,Datum=Datum, sum(Nd_hourly))
but get the error message
Error: only defined on a data frame with all numeric variables
I assume something may be wrong with my data input? I imported data from Excel 2010 via a .txt file.
Still very new to R and programming in general, so I would really appreciate some help :)
Is this what you want?
library(plyr)
ddply(.data = df, .variables = .(Datum), summarize,
sum_precip = sum(Nd))
# Datum sum_precip
# 1 2013-05-01 0.4
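As for the error itself: sum(Nd_hourly) hands the whole data frame to sum(), and that only works when every column is numeric. Summing just the Nd column per day avoids it. Here is a small base R sketch using the four rows from the question. Storing Datum as a Date and Uhrzeit as character is my assumption about how the data was imported.

```r
# The four rows shown in the question (column types are assumed)
Nd_hourly <- data.frame(
  Datum = as.Date(rep("2013-05-01", 4)),
  Uhrzeit = c("01:00:00", "02:00:00", "03:00:00", "04:00:00"),
  Nd = c(0.0, 0.1, 0.0, 0.3)
)

# Sum only the numeric Nd column within each day
Nd_daily <- aggregate(Nd ~ Datum, data = Nd_hourly, FUN = sum)
print(Nd_daily)
```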
I think @Henrik has identified your problem, but here's an alternative approach, using data.table:
# Create some fake datetime data
datetime <- seq(ISOdate(2000,1,1), ISOdate(2000,1,10), "hours")
# A data.frame with columns for date, time, and random precipitation data.
DF <- data.frame(date=format(datetime, "%Y-%m-%d"),
time=format(datetime, "%H:%M:%S"),
precip=runif(length(datetime)))
head(DF)
# date time precip
# 1 2000-01-01 12:00:00 0.9294353
# 2 2000-01-01 13:00:00 0.5082905
# 3 2000-01-01 14:00:00 0.5222088
# 4 2000-01-01 15:00:00 0.1841305
# 5 2000-01-01 16:00:00 0.9121000
# 6 2000-01-01 17:00:00 0.2434706
library(data.table)
DT <- as.data.table(DF) # convert to a data.table
DT[, list(precip=sum(precip)), by=date]
# date precip
# 1: 2000-01-01 7.563350
# 2: 2000-01-02 10.147659
# 3: 2000-01-03 10.936760
# 4: 2000-01-04 13.925727
# 5: 2000-01-05 11.415149
# 6: 2000-01-06 10.966494
# 7: 2000-01-07 12.751461
# 8: 2000-01-08 15.218148
# 9: 2000-01-09 12.213046
# 10: 2000-01-10 6.219439
There's a great introductory text on data.tables here.
Given your particular data structure, the following should do the trick.
library(data.table)
DT <- data.table(Nd_hourly)
DT[, list(Nd_daily=sum(Nd)), by=Datum]