Using R to subset overlapping daily sensor data - r

I have a data set (3.2 million rows) in R which consists of pairs of time (milliseconds) and volts. The sensor that gathers the data only runs during the day so the time is actually the milliseconds since start-up that day.
For example, if the sensor runs 12 hours per day, then the maximum possible time value for one day is 43,200,000 ms (12h * 60m * 60s * 1000ms).
The data is continually added to a single file, which means there are many overlapping time values:
X: [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5...] // example if range was 1-5 for one day
Y: [voltage readings at each point in time...]
I would like to separate each "run" into its own data frame so that I can clearly see individual days. Currently, when I plot the entire data set it is incredibly muddy because all of the days are overlaid in a single plot. Thanks for any help.

If your data.frame df has columns X and Y, you can use diff to find every time X goes down (meaning a new day, it sounds like):
df$Day = cumsum(c(1, diff(df$X) < 0))
Day1 = df[df$Day==1,]
plot(Day1$X, Day1$Y)
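To get one data frame per day without filtering repeatedly, the same Day column can feed split(). A minimal sketch on toy data, assuming the columns are named X and Y as in the question:

```r
# Toy data: two "days" whose X restarts from 1
df <- data.frame(X = c(1, 2, 3, 1, 2, 3),
                 Y = c(0.10, 0.12, 0.11, 0.20, 0.22, 0.21))

# A new day starts wherever X decreases
df$Day <- cumsum(c(1, diff(df$X) < 0))

# One data frame per day, in a named list
days <- split(df, df$Day)

length(days)        # 2
nrow(days[["1"]])   # 3
```

Each element of the list can then be plotted separately (e.g. plot(days[["1"]]$X, days[["1"]]$Y)).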

Related

Average and conversion order of operation

I have some data that I need to analyse.
The data consists of a series of floating-point numbers representing durations in milliseconds.
From each duration I need to calculate the frequency of the event (occurrences per second), which I compute as
occurrences per second = (1000 / time in milliseconds)
Now I need to find the average occurrence of that event per second,
but I am not sure which order of operations is accurate.
Should I average the durations first and then calculate the average occurrence as
average occurrence = (1000 / average time)
or should I calculate the frequency for each duration and average the results?
The two results differ slightly, so I am not sure which one is the correct approach.
Example:
Say we are measuring the frame rate of a device,
and each frame takes x milliseconds to draw.
From that we can say
frames per second = (1000/x)
Now if my data has 1000 durations,
either I can average them to get an average frame duration, and then frames per second = (1000/average duration),
or
we calculate 1000 frames-per-second values first,
frames per second = (1000/duration)
and average those 1000 fps values?
Which one is correct?
Any suggestions?
You should choose the first method: Calculate the average duration of each frame and then let avg_fps = 1000 / avg_duration_in_milliseconds, or, probably easier: avg_fps = number_of_frames / total_duration_in_seconds. This gives the same thing.
Example:
Say you have 3 frames in one second of durations 200ms, 300ms and 500ms. Since you have 3 frames in 1 second, the avg_fps is 3.
The average duration is 333.33ms, which gives the right result (1000/333.33 = 3). But if you calculate the individual fps of each frame you get (1000/200 = 5), (1000/300 = 3.33) and (1000/500 = 2). The average of 5, 3.33 and 2 is 3.44 - the shortest frames contribute the highest fps values and skew the average upward. So choose the first method instead.
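The two orders of operation are just the harmonic and arithmetic means of the per-frame fps values; only the first equals frames divided by total time. A quick check in R using the durations from the example above (variable names are illustrative):

```r
dur_ms <- c(200, 300, 500)   # three frames, 1000 ms total

# Method 1: average the durations, then convert
fps_from_avg_dur <- 1000 / mean(dur_ms)                    # 3 frames per second

# Equivalent: total frames / total time in seconds
fps_from_totals <- length(dur_ms) / (sum(dur_ms) / 1000)   # also 3

# Method 2: convert each frame, then average -- overestimates
fps_avg_of_each <- mean(1000 / dur_ms)                     # about 3.44
```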

Excel - How do I match a minute in time series with specified minute to return max value from another column?

I am trying to extract the max values of CO2ppm (column E) that were logged every second over 1 hour (column D), for a total of 60 minutes (~3300 rows). Column D holds the time (in HH:MM:SS) and column E the CO2ppm (numeric). I have 50 CO2 samples, each of which corresponds to a minute (e.g. I collected sample #1 at minute 20:54, held in F2), but the data is logged every second within that minute, and I want the highest CO2ppm in that minute.
=MAX(IF(D:D=A2,E:E)) works to return the max CO2ppm when I use the date (A2) as the target value for the entire day of sampling, but it does not work when I try to match my target minute (F2, 20:54) against column D (HH:MM:SS). I tried converting the column to text with =TEXT(D:D,"H:M") so that the target value would match the minute values, excluding the seconds, but with no luck.
How can I match my minute (F2) with the range of rows that have that minute (20:54:xx, column D) to find the max value in column E?
Example data:
Thank you!
An easy way to do this would be to add a helper column with the timestamp stripped of the second component.
However in case that is not an option, you could use a formula like the following, which strips out the seconds from the timestamps in column D:
=MAX(IF((D2:D5-SECOND(D2:D5)/86400)=F2,E2:E5))
Depending on your version of Excel, you may have to confirm the formula with Ctrl+Shift+Enter.

Calculating Idle time for Uber service

I have an Uber dataset containing the variables pickup point, request time, drop time, and a date variable without month and year.
I need code to calculate idle time and create a new idle-time variable. The calculation is as follows:
if the pickup points of consecutive rows are the same but the dates differ, the value should be NA; otherwise it is the difference between the drop time of the first row and the pickup (request) time of the second row. I have done this in Excel and need to do it in R.
Attached is the screenshot of data in excel
Try something like this, if this is what you are looking for
df$idle <- NA  # first row has no previous trip to compare against
for (i in 2:nrow(df)) {
  if (df$Pickup.point[i] == df$Pickup.point[i - 1] &&
      df$Date[i] == df$Date[i - 1]) {
    df$idle[i] <- df$Req[i] - df$Drop[i - 1]
  }
}
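The same calculation can also be vectorized, which matters on larger data. A sketch under the same column assumptions (Pickup.point, Date, and numeric Req/Drop times; the example data here is made up):

```r
# Hypothetical example data with numeric times (e.g. hours)
df <- data.frame(Pickup.point = c("A", "A", "B", "B"),
                 Date         = c(1, 1, 1, 2),
                 Req          = c(9.0, 10.5, 11.0, 8.0),
                 Drop         = c(9.5, 11.0, 11.5, 8.5))

n <- nrow(df)
same_point <- df$Pickup.point[-1] == df$Pickup.point[-n]
same_date  <- df$Date[-1]        == df$Date[-n]

# NA for the first row and wherever the pickup point or date changes
df$idle <- c(NA, ifelse(same_point & same_date,
                        df$Req[-1] - df$Drop[-n], NA))
```

This gives the same result as the loop: idle is NA for rows 1, 3 and 4, and 1 (hour) for row 2.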

How do I make periods out of times in R?

I have 10 million+ data points which look like:
Identifier Times Data
6597104 2015-05-01 04:08:05 0.15512575543732
In order to study these I want to add a Period (1, 2,...) column so the oldest row with the 6597104 identifier is period 1 and the second oldest is period 2 etc. However the times come irregularly so I can't just make it a time series object.
Does anyone know how to do this? Thanks in advance
Let's call your data frame data
First sort it using
data <- data[order(data$Times), ]  # ascending, so the oldest rows come first
Then add a new column called Period
for (i in 1:nrow(data)) {
  data$Period[i] <- i
}
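Note that the question asks for period 1, 2, ... within each Identifier, not across the whole table. After sorting by time, ave() with seq_along can number the rows per group without a loop. A sketch on made-up data:

```r
# Hypothetical data: two identifiers with interleaved, irregular times
data <- data.frame(
  Identifier = c(6597104, 111, 6597104, 111),
  Times = as.POSIXct(c("2015-05-01 04:08:05", "2015-05-01 05:00:00",
                       "2015-05-01 03:00:00", "2015-05-01 06:00:00")))

# Oldest first
data <- data[order(data$Times), ]

# Number rows 1, 2, ... separately within each Identifier
data$Period <- ave(seq_len(nrow(data)), data$Identifier, FUN = seq_along)
```

The oldest row of each identifier gets Period 1, the next oldest Period 2, and so on.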

R: subsetting timestamped dataframe periodically

I have a csv file that contains many thousands of timestamped data points. The file includes the following columns: Date, Tag, East, North & DistFromMean. The following is a sample of the data in the file:
The data is recorded approximately every 15 minutes for 12 tags over a month. What I want to do is select subsets of the data, starting from the first date entry, e.g. every 3 hours; but because the tags transmit at slightly different rates, I need minimum and maximum start and end times.
I have found a related previous question but don't understand the answer well enough to implement it.
The solution could first ask for the tag number, then for the period required, perhaps in minutes from the start time (i.e. every 3 hrs or 180 minutes), and then the minimum and maximum of the time range, both of which would be constant for whatever period was used. The minimum and maximum would probably need to be plus and minus 6 minutes from the period selected.
As the code below shows, I've managed to read in the file, change the Date format to POSIXlt and extract data within a specific time frame but the bit I'm stuck on is extracting the data every nth minute and within a range.
TestData<- read.csv ("TestData.csv", header=TRUE, as.is=TRUE)
TestData$Date <- strptime(TestData$Date, "%d/%m/%Y %H:%M")
TestData[TestData$Date >= as.POSIXlt("2014-02-26 7:10:00") & TestData$Date < as.POSIXlt("2014-02-26 7:18:00"),]
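One way to finish this: build the sequence of target times every n minutes from the first entry, then keep rows that fall within plus or minus 6 minutes of any target. A sketch, with made-up data standing in for TestData (the column name Date follows the question; period_min, tol_min and the sample values are illustrative):

```r
# Hypothetical timestamps roughly every 15 minutes, with slight drift
TestData <- data.frame(
  Date = as.POSIXct("2014-02-26 07:00:00") + c(0, 905, 1790, 10800, 10930),
  DistFromMean = c(1.2, 0.8, 1.5, 0.9, 1.1))

period_min <- 180   # sample every 3 hours
tol_min    <- 6     # +/- 6 minutes around each target time

# Target times: every period_min minutes from the first entry
targets <- seq(from = min(TestData$Date), to = max(TestData$Date),
               by = period_min * 60)

# TRUE where a row is within the tolerance of some target time
near_target <- sapply(TestData$Date, function(t)
  any(abs(as.numeric(difftime(t, targets, units = "mins"))) <= tol_min))

subset_3h <- TestData[near_target, ]
```

A per-tag version could wrap this in a split() by the Tag column.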
