I'm working with a dataframe that has the columns 'timestamps' and 'amount'. The data can be produced like this:
sample_size <- 40
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-03 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
Now I'd like to plot the sum of the amount entries for some timeframe (like every hour, 30 min, 20 min). The final plot would look like a histogram of the timestamps but should not just count how many timestamps fell into the timeframe, but what amount fell into the timeframe.
How can I approach this? I could create an extra vector with the amount of each timeframe, but don't know how to proceed.
Also I'd like to add a feature to reduce by hour, such that just one day is plotted (notice the range between start_date and end_date is two days) and in each timeframe (let's say every hour) the amount of data located in this hour is plotted. In this case the data
2020-01-01 13:03:00 5
2020-01-02 13:21:00 10
2020-01-02 13:38:00 1
2020-01-01 13:14:00 3
would produce a bar of height sum(5, 10, 1, 3) = 19 in the timeframe 13:00-14:00. How can I implement the plotting to easily switch between these two modes (plot days/plot just one day and reduce)?
EDIT: Following the advice of @Gregor Thomas I added a grouping column like this:
df$time_group <- lubridate::floor_date(df$timestamps, unit="20 minutes")
Now I'm wondering how to ignore the dates and thus reduce by 20 minute frame (independent of date).
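One possible sketch for switching between the two modes, assuming lubridate is available: floor each timestamp to the chosen bin, optionally keep only the time of day, then sum the amounts per bin. The helper name sum_by_bin and its arguments are made up for illustration, and the example data are the four rows from the question.

```r
library(lubridate)

# Hypothetical helper: sums `amount` per time bin.
# unit   : bin width understood by floor_date(), e.g. "1 hour", "30 minutes", "20 minutes"
# reduce : FALSE keeps the date in each bin, TRUE folds all days onto a single day
sum_by_bin <- function(df, unit = "1 hour", reduce = FALSE) {
  bin <- floor_date(df$timestamps, unit = unit)
  if (reduce) {
    # drop the date and keep only the time of day of the bin, e.g. "13:00"
    bin <- format(bin, "%H:%M")
  }
  binned <- data.frame(bin = bin, amount = df$amount)
  aggregate(amount ~ bin, data = binned, FUN = sum)
}

example <- data.frame(
  timestamps = as.POSIXct(c("2020-01-01 13:03:00", "2020-01-02 13:21:00",
                            "2020-01-02 13:38:00", "2020-01-01 13:14:00"),
                          tz = "UTC"),
  amount = c(5, 10, 1, 3)
)

sum_by_bin(example, unit = "1 hour", reduce = TRUE)   # one bin per time of day
sum_by_bin(example, unit = "1 hour", reduce = FALSE)  # one bin per date and hour
```

The resulting data frame can then be handed to a bar plot (for instance ggplot2's geom_col() with bin on the x-axis and amount on the y-axis), so the same plotting code serves both modes.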
Context: I have a survey dataset with daily observations in a 6-7 day period each month for about a year. Observations include party choice and trust in government (Likert-scale).
Problem: the N is too small for observations each day, so I need to group the daily observations from each period. How?
I've tried the following (using lubridate), but that supposes each period of observations begins at the start of the week.
df <- df %>%
group_by(date_week = floor_date(date_variable, "week"))
Unfortunately, this is a mess as it takes all observations from Monday-Sunday and groups them together (starting Monday), but some survey periods cross weeks, e.g. Thursday-Wednesday, and thus R creates two periods of observations.
I need to solve this problem and then visualize (I'm using ggplot). So the new date-variable needs to be in date style, and it would be perfect, if it could visualize from the median day in each period.
Example of data
Date Party N Trust-in-gov-average
"2021-10-02" A 25 1.5
"2021-10-02" B 10 2.5
"2021-10-02" C 15 3.8
"2021-10-03" A 12 1.2
"2021-10-03" B 53 3.2
"2021-10-03" C 23 2.8
"2021-10-04" A 58 1.6
"2021-10-04" B 33 2.6
"2021-10-04" C 44 3.0
After many sleepless nights (in part thanks to New Year's Eve) I finally found a solution to my problem. It's all about combining lubridate and dplyr.
First convert the variable to date-format.
df$date_string <- ymd(df$date_string)
Then use mutate and the %within% operator to extract the periods. Label each period with the date you want to represent it, e.g. the first day of observation.
df <- df %>%
mutate(waves = case_when(date_string %within% interval(ymd("2020-09-13"), ymd("2020-09-19")) ~ "2020-09-13",
date_string %within% interval(ymd("2020-09-20"), ymd("2020-10-03")) ~ "2020-09-20",
date_string %within% interval(ymd("2020-10-11"), ymd("2020-10-17")) ~ "2020-10-11",
date_string %within% interval(ymd("2020-10-25"), ymd("2020-10-31")) ~ "2020-10-25"))
Finally, convert the new variable back to a date variable using ymd() again:
df$waves <- ymd(df$waves)
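If typing every interval by hand gets tedious, a possible alternative is to detect the survey periods from the gaps between observation days and label each period with its median day. This is only a sketch: the helper name label_waves and the max_gap threshold are assumptions, and it presumes that the pause between two survey periods is always longer than max_gap days.

```r
# Hypothetical helper: assigns each date to a wave and returns the median day of that wave.
# d       : vector of class Date
# max_gap : largest gap (in days) still considered to be inside one survey period
label_waves <- function(d, max_gap = 3) {
  days <- sort(unique(d))
  # a new wave starts whenever the gap to the previous observation day exceeds max_gap
  wave <- cumsum(c(TRUE, as.numeric(diff(days)) > max_gap))
  # median observation day per wave (rounded down when it falls between two days)
  mid <- vapply(split(as.numeric(days), wave),
                function(x) floor(median(x)), numeric(1))
  wave_of_d <- wave[match(d, days)]
  as.Date(unname(mid[as.character(wave_of_d)]), origin = "1970-01-01")
}

dates <- as.Date(c("2021-10-02", "2021-10-02", "2021-10-03", "2021-10-04",
                   "2021-11-10", "2021-11-12"))
waves <- label_waves(dates)
```

Because the result is a Date vector, it can be used directly as a grouping variable and on a ggplot date axis.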
Say I have the following data:
Date Month Year Miles Activity
3/1/2014 3 2014 72 Walking
3/1/2014 3 2014 85 Running
3/2/2014 3 2014 42 Running
4/1/2014 4 2014 65 Biking
1/1/2015 1 2015 21 Walking
1/2/2015 1 2015 32 Running
I want to make graphs that display the sum of each month's date for miles, grouped and colored by year. I know that I can make a separate data frame with the sum of the miles per month per activity, but the issue is in displaying. Here in Excel is basically what I want--the sums displayed chronologically and colored by activity.
I know ggplot2 has a scale_x_date command, but I run into issues on "both sides" of the problem--if I use the Date column as my X variable, they're not summed. But if I sum my data how I want it in a separate data frame (i.e., where every activity for every month has just one row), I can't use both Month and Year as my x-axis--at least, not in any way that I can get scale_x_date to understand.
(And, I know, if Excel is graphing it correctly why not just use Excel--unfortunately, my data is so large that Excel was running very slowly and it's not feasible to keep using it.) Any ideas?
The code below worked fine for me with the small dataset. If you convert your data.frame to a data.table you can sum the data up to the mile per activity and month level with just a couple of preprocessing steps. I've left some comments in the code to give you an idea of what's going on, but it should be pretty self-explanatory.
# Assuming your dataframe looks like this
df <- data.frame(Date = c('3/1/2014','3/1/2014','4/2/2014','5/1/2014','5/1/2014','6/1/2014','6/1/2014'), Miles = c(72,14,131,534,123,43,56), Activity = c('Walking','Walking','Biking','Running','Running','Running', 'Biking'))
# Load lubridate, data.table and ggplot2
library(lubridate)
library(data.table)
library(ggplot2)
# Convert dataframe to a data.table
setDT(df)
df[, Date := as.Date(Date, format = '%m/%d/%Y')] # Convert data to a column of Class Date -- check with class(df[, Date]) if you are unsure
df[, Date := floor_date(Date, unit = 'month')] # Reduce all dates to the first day of the month for summing later on
# Create ggplot object using data.tables functionality to sum the miles
ggplot(df[, sum(Miles), by = .(Date, Activity)], aes(x = Date, y = V1, colour = factor(Activity))) + # Data.table creates the column V1 which is the sum of miles
geom_line() +
scale_x_date(date_labels = '%b-%y') # %b is used to display the first 3 letters of the month
I have two data frames: rainfall data collected daily and nitrate concentrations of water samples collected irregularly, approximately once a month. I would like to create a vector of values for each nitrate concentration that is the sum of the previous 5 days' rainfall. Basically, I need to match the nitrate date with the rain date, sum the previous 5 days' rainfall, then print the sum with the nitrate data.
I think I need to either make a function, a for loop, or use tapply to do this, but I don't know how. I'm not an expert at any of those, though I've used them in simple cases. I've searched for similar posts, but none get at this exactly. This one deals with summing by factor groups. This one deals with summing each possible pair of rows. This one deals with summing by aggregate.
Here are 2 example data frames:
# rainfall df
mm<- c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0)
date<- c(1:15)
rain <- data.frame(cbind(mm, date))
# b/c sums of rainfall depend on correct chronological order, make sure the data are in order by date.
rain <- rain[order(rain$date), ]
# nitrate df
nconc <- c(15, 12, 14, 20, 8.5) # nitrate concentration
ndate<- c(6,8,11,13,14)
nitrate <- data.frame(cbind(nconc, ndate))
I would like to have a way of finding the matching rainfall date for each nitrate measurement, such as:
j <- match(nitrate$ndate[i], rain$date)
(Note: Will match work with as.Date dates?) And then sum the preceding 5 days' rainfall (not including the measurement date), such as:
sum(rain$mm[(j-5):(j-1)])
And prints the sum in a new column in nitrate
print(nitrate$mm_sum[i])
To make sure it's clear what result I'm looking for, here's how to do the calculation 'by hand'. The first nitrate concentration was collected on day 6, so the sum of rainfall on days 1-5 is 5mm.
Many thanks in advance.
You were more or less there!
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$ndate)) {
day = nitrate$ndate[i]
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
}
Step by step explanation:
Initialize empty result column:
nitrate$prev_five_rainfall = NA
For each line in the nitrate df: (i = 1,2,3,4,5)
for (i in 1:length(nitrate$ndate)) {
Grab the day we want final result for:
day = nitrate$ndate[i]
Take the rainfall sum of the five preceding days (day-5 through day-1) and put it in the results column:
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
Close the for loop :)
}
Disclaimer: This answer is basic in that:
It will break if nitrate's ndate < 6
It will be incorrect if some dates are missing in the rain dataframe
It will be slow on larger data
As you get more experience with R, you might use data manipulation packages like dplyr or data.table for these types of manipulations.
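As a rough illustration of how the second limitation could be avoided without extra packages, the window can be selected by comparing dates instead of indexing rows by position. The helper name prev_rain_sum and its n_days argument are made up for this sketch; the data are the example frames from the question.

```r
rain <- data.frame(date = 1:15,
                   mm = c(0, 0, 0, 0, 5,  0, 0, 2, 0, 0,  10, 0, 0, 0, 0))
nitrate <- data.frame(ndate = c(6, 8, 11, 13, 14),
                      nconc = c(15, 12, 14, 20, 8.5))

# Hypothetical helper: rainfall summed over the n_days days before `day`
# (the measurement day itself is excluded). Rows are selected by comparing
# dates, so missing or unordered rain dates do not shift the window.
prev_rain_sum <- function(day, rain, n_days = 5) {
  in_window <- rain$date >= day - n_days & rain$date <= day - 1
  sum(rain$mm[in_window])
}

nitrate$prev_five_rainfall <- vapply(nitrate$ndate, prev_rain_sum,
                                     numeric(1), rain = rain)
nitrate
```

For the first measurement (day 6) this gives 5, matching the hand calculation in the question. The same comparison also works when both date columns are of class Date, since subtracting a number from a Date shifts it by that many days.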
@nelsonauner's answer does all the heavy lifting. But one thing to note: in my actual data my dates are not numerical like they are in the example above; they are dates listed as MM/DD/YYYY, converted with the appropriate as.Date(nitrate$date, "%m/%d/%Y").
I found that the for loop above gave me all zeros for nitrate$prev_five_rainfall and I suspected it was a problem with the dates.
So I changed my dates in both data sets to numerical using the difference in number of days between a common start date and the recorded date, so that the for loop would look for a matching number of days in each data frame rather than a date. First, make a column of the start date using rep_len() and format it:
nitrate$startdate <- rep_len("01/01/1980", nrow(nitrate))
nitrate$startdate <- as.Date(nitrate$startdate, "%m/%d/%Y")
Then, calculate the difference using difftime():
nitrate$diffdays <- as.numeric(difftime(nitrate$date, nitrate$startdate, units="days"))
Do the same for the rain data frame. Finally, the for loop looks like this:
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$diffdays)) {
day = nitrate$diffdays[i]
nitrate$prev_five_rainfall[i] = sum(rain$mm[rain$diffdays >= day-5 & rain$diffdays <= day-1]) # previous 5 days, matched on day number
}
I have the following problem:
Suppose we have:
Idx ID StartTime EndTime
1: 1 2014-01-01 02:20:00 2014-01-01 03:42:00
2: 1 2014-01-01 14:51:00 2014-01-01 16:44:00
note: Idx is not given, but I'm simply adding it to the table view.
Now we see that person with ID=1 is using the computer from 2:20 to 3:42. Now what I would like to do is to convert this interval into a set of variables representing hour and weekday and the duration in those periods.
Idx ID Monday-0:00 Monday-1:00 ... Wednesday-2:00 Wednesday-3:00
1:  1                          ... 40             42
For the second row we would have
Idx ID Monday-0:00 Monday-1:00 ... Wednesday-14:00 Wednesday-15:00 Wednesday-16:00
2:  1                          ... 9               60              44
Now the problem is of course that it can span over multiple hours as you can see from the second row.
I would like to do this per row and I was wondering if this is possible without too much computational effort and using data.table?
PS: it is also possible that the interval spans over the day.
library(data.table)
library(lubridate)
#produce sample data (tz='UTC' makes ymd() return POSIXct, so adding a number adds seconds)
origin<-ymd('1970-01-01',tz='UTC')
DT<-data.table(idx=1:100,ID=rep(1:20,5), StartTime=ymd('2014-01-01',tz='UTC')+runif(100,60*60,60*60*365))
DT[,EndTime:=StartTime+runif(.N,60,60*60*8)] #one random duration per row
#make fake start and end dates with same day of week and time but all within a single calendar week
DT[,fakestart:=origin+as.numeric(difftime(StartTime,origin,units="days"))%%7*60*60*24]
DT[,fakeend:=origin+as.numeric(difftime(EndTime,origin,units="days"))%%7*60*60*24]
setkey(DT,fakestart,fakeend)
#check that weekdays line up
nrow(DT[weekdays(EndTime)==weekdays(fakeend)])
nrow(DT[weekdays(StartTime)==weekdays(fakestart)])
#both are 100 so we're good.
#check that fakeend > fakestart
DT[fakeend<fakestart]
#uh-oh some ends are earlier than starts, let's add 7 days to those ends
DT[fakeend<fakestart,fakeend:=fakeend+days(7)]
#make data.table with all possible labels
DTin<-data.table(start=seq(from=origin,to=DT[,floor_date(max(fakeend),"hour")],by="hour"))
DTin[,end:=start+hours(1)]
DTin[,label:=paste0(format(start,format="%A-%H:00"),' ',format(end,format="%A-%H:00"))]
#set key and use the foverlaps feature of data.table which merges by interval
setkey(DT,fakestart,fakeend)
setkey(DTin,start,end)
DTout<-foverlaps(DT,DTin,type="any")
#compute duration (in minutes) in each interval
DTout[,dur:=60-pmax(0,as.numeric(difftime(fakestart,start,units="mins")))-pmax(0,as.numeric(difftime(end,fakeend,units="mins")))]
#cast all the rows up to columns for final result
castout<-dcast(DTout,idx+ID~label,value.var="dur",fill=0)
I have a big data.table that contains the following cols:
timestamp, value, house
The value is a cumulative value of eg energy of that one house. So here is a small sample:
time value house
2014-10-27 11:40:00 100 2
2014-10-27 15:40:00 150 2
2014-10-27 19:40:30 160 2
2014-10-28 00:00:01 170 2
2014-10-28 20:20:20 180 2
2014-10-27 11:40:00 200 3
2014-10-27 15:40:00 300 3
2014-10-27 19:40:30 400 3
2014-10-28 00:00:01 500 3
2014-10-28 20:20:20 600 3
I want to get 3 bar charts: one with the average per house usage per hour of a day, one with the average per house usage per day of a week, and the average per house usage per month of a year.
To get the value of one hour of one day, I guess I should do something like
max(data$value) - min(data$value)
, but that per time interval of an hour and also per house. I know cut(data$time, breaks="hour") splits it up in intervals, but of course does not take the difference of the maximum and minimum value and also doesn't consider the house it is from. On top of that I would also need the average of course.
How can I do this?
First, I'd split the time variable into hours, days and months. A convenient and quick way is to use regular expressions with str_extract() from the stringr package, for example (rl is presumably the timestamps as a character vector):
library(stringr)
hour <- str_extract(rl, ' [[:digit:]]{2}')
hour <- substring(hour, 2)
day <- str_extract(rl, '-[[:digit:]]{2} ')
day <- substring(day, 2, 3)
Then we need to cope with value being in cumulated form, reverse cumsum with diff (both from base R):
value <- diff(value)
The aggregated data for one of the barplots can be created with data.table syntax
data[ , .(avg = mean(value)), by=.(house, day)]
or by using aggregate() from base R, which looks more readable
aggregate(value ~ house + day, data = data, FUN = mean)
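One caveat with a plain diff(value) is that it returns one element fewer than its input and would also take a difference across the boundary between two houses. A possible way around that, sketched here with data.table on the sample from the question, is to difference within each house. The column names usage and hr are made up for illustration, and each increment is simply attributed to the hour of the later reading, which is a simplification when readings are several hours apart.

```r
library(data.table)

data <- data.table(
  time = as.POSIXct(rep(c("2014-10-27 11:40:00", "2014-10-27 15:40:00",
                          "2014-10-27 19:40:30", "2014-10-28 00:00:01",
                          "2014-10-28 20:20:20"), times = 2), tz = "UTC"),
  value = c(100, 150, 160, 170, 180, 200, 300, 400, 500, 600),
  house = rep(c(2, 3), each = 5)
)

# order readings within each house, then undo the cumulation per house;
# the first reading of every house has no predecessor and stays NA
setorder(data, house, time)
data[, usage := value - shift(value), by = house]

# hour of day of the later reading (assumption: usage is booked to that hour)
data[, hr := hour(time)]

# average usage per house and hour of day, ignoring the NA first readings
avg_by_hour <- data[!is.na(usage), .(avg = mean(usage)), by = .(house, hr)]
avg_by_hour
```

Swapping hour(time) for wday(time) or month(time), both also provided by data.table, would give the per-weekday and per-month versions in the same way.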