I have a data frame with three columns: subid, test, and day. For each subject, I want to identify which tests happened within a time frame of x days and calculate the max change in test value. Please see the example below. For each subject and a given test, I want to identify which tests happened within 3 days. Looking at the "Day" column, the value Day = 1 won't belong to any group, because the subsequent test was done 6 days later. The values Day = 10, 7, 8, 9 should be identified as one group, and the max change among them should be calculated. Similarly, Day = 12, 11, 10, 9 should be identified as another group, and the max change among them should be calculated. How can I do this in R? Thank you in advance.
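One possible approach, as a minimal sketch in base R. It assumes a numeric column holding the test result, here called `value` (a hypothetical name, since the example data is not shown), and it reads "within 3 days" as the window from `day - 3` up to `day`, which reproduces the groups described above (10, 9, 8, 7 and 12, 11, 10, 9). "Max change" is assumed to mean the largest value minus the smallest value inside the window. The function name `max_change_within` and the sample data are made up for illustration.

```r
max_change_within <- function(df, window = 3) {
  df$n_in_window <- NA_integer_
  df$max_change <- NA_real_
  # Row indices for every subject/test combination that occurs in the data
  groups <- split(seq_len(nrow(df)), list(df$subid, df$test), drop = TRUE)
  for (idx in groups) {
    day <- df$day[idx]
    val <- df$value[idx]
    for (k in seq_along(idx)) {
      # Tests of the same subject/test taken on this day or up to `window` days earlier
      in_win <- day >= day[k] - window & day <= day[k]
      df$n_in_window[idx[k]] <- sum(in_win)
      # A "group" needs at least two tests; otherwise max_change stays NA
      if (sum(in_win) > 1) {
        df$max_change[idx[k]] <- max(val[in_win]) - min(val[in_win])
      }
    }
  }
  df
}

# Made-up data for illustration only (column `value` is an assumption)
tests <- data.frame(
  subid = 1,
  test  = "A",
  day   = c(1, 7, 8, 9, 10, 11, 12),
  value = c(5, 3, 8, 6, 10, 4, 7)
)
res <- max_change_within(tests, window = 3)
res
```

The inner loop compares every test with every other test of the same subject and test type, so this is only meant for moderately sized groups.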
I have about a million data points of two columns: time and quantity.
Times are in 24-hour format (11:23:08 AM), and many of them are repeated many times (about 10-100 duplicates for each second). I am coding in R and I want a third column that sums all the quantities over the last 5 seconds. The values for each specific second would obviously be duplicated.
This seems like it should be easy, but the only approach I know is the one I would use in other programs: two nested "for loops" that search for the rows whose time falls within the last 5 seconds. However, this would be very time-consuming in R. I need a different technique.
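A loop-free way to do this could look like the sketch below. It assumes the times have already been converted to numeric seconds (for strings like the one above, something like `as.POSIXct(x, format = "%I:%M:%S %p")` might work, depending on the actual format and locale), and it assumes "the last 5 seconds" means the current second plus the four seconds before it. The function name `rolling_sum_secs` and the sample data are made up.

```r
rolling_sum_secs <- function(secs, quantity, window = 5) {
  u <- sort(unique(secs))                            # distinct seconds, ascending
  idx <- match(secs, u)                              # position of each row's second in u
  per_sec <- as.numeric(tapply(quantity, idx, sum))  # total quantity per distinct second
  cs <- cumsum(per_sec)
  # Number of distinct seconds at or before (u - window); those lie outside the window
  lo <- findInterval(u - window, u)
  win_sum <- cs - c(0, cs)[lo + 1]
  win_sum[idx]                                       # map window sums back to the rows
}

# Made-up data: seconds since midnight (40988 corresponds to 11:23:08)
secs     <- c(40988, 40988, 40989, 40992, 40993, 40994)
quantity <- c(1, 2, 3, 4, 5, 6)
out <- rolling_sum_secs(secs, quantity, window = 5)
out
```

Because the quantities are first summed per distinct second, the work after that step scales with the number of distinct seconds rather than with the million raw rows.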
I'm trying to add a new column to my data table that contains the average of some of the following rows. How many rows are selected for the average, however, depends on the time stamps of the rows.
Here is some test data:
library(data.table)
library(zoo)  # provides rollapply
DT <- data.table(Weekstart = c(1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9), Art = c("a", "b", "a", "b", "a", "a", "a", "b", "b", "a", "b", "a", "b", "a", "b", "a"), Demand = 1:16)
I want to add a column with the mean of all demands that occurred in the weeks ("Weekstart") up to three weeks before the respective week (grouped by Art, excluding the current week).
With rollapply from the zoo library, it works like this:
setorder(DT,-Weekstart)
DT[,RollMean:=rollapply(Demand,width=list(1:3),partial=TRUE,FUN=mean,align="left",fill=NA),.(Art)]
The problem, however, is that some data is missing. In the example, the data for Art b lacks week 4: there is no Demand in week 4. Since I want the average of the three prior weeks, not the three prior rows, the average is wrong. Instead, the result for Art b in week 6 should look like this:
DT[Art=="b"&Weekstart==6,RollMean:=6]
(6 instead of 14/3, because only Week 5 and Week 3 count: (8+4)/2)
Here is what I tried so far:
It would be possible to loop through the minima of the weeks of the following rows in order to create a vector that defines, for each row, how wide the 'width' should be (the new column 'rollwidth'):
i <- 3
DT[, rollwidth := Weekstart - rollapply(Weekstart, width = list(1:3), partial = TRUE, FUN = min, align = "left", fill = 1), .(Art)]
while (max(DT[, Weekstart - rollapply(Weekstart, width = list(1:i), partial = TRUE, FUN = min, align = "left", fill = NA), .(Art)][, V1], na.rm = TRUE) > 3) {
  i <- i - 1
  DT[rollwidth > 3, rollwidth := i]
}
But that seems very unprofessional (excuse my poor skills). And, unfortunately, rollapply with width and rollwidth doesn't work as intended (it produces warnings because 'rollwidth' is taken to be all the rollwidths in the table):
DT[,RollMean2:=rollapply(Demand,width=list(1:rollwidth),partial=TRUE,FUN=mean,align="left",fill=NA),.(Art)]
What does work is
DT[,RollMean3:=rollapply(Demand,width=rollwidth,partial=TRUE,FUN=mean,align="left",fill=NA),.(Art)]
but then again, the average includes the current week (not what I want).
Does anybody know how to apply a criterion (i.e. the difference in weeks shall be <= 3) instead of a number of rows to the width argument?
Any suggestions are appreciated!
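One way to base the window on week numbers rather than row counts is to skip rollapply and compare the weeks directly. The following is only a sketch under the assumptions stated in the question (weeks w-3 to w-1, current week excluded, NA when no prior week has data). The helper name `mean_prior_weeks` is made up, and a plain data.frame copy of the test data is used so the example runs without extra packages.

```r
# For each week w, average the demands of rows whose week lies in [w - window, w - 1]
mean_prior_weeks <- function(week, demand, window = 3) {
  vapply(week, function(w) {
    prior <- week < w & week >= w - window
    if (any(prior)) mean(demand[prior]) else NA_real_
  }, numeric(1))
}

# Same test data as above, as a data.frame
DF <- data.frame(
  Weekstart = c(1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9),
  Art = c("a", "b", "a", "b", "a", "a", "a", "b", "b", "a", "b", "a", "b", "a", "b", "a"),
  Demand = 1:16
)

# Apply the helper separately to the rows of each Art
DF$RollMean <- NA_real_
for (idx in split(seq_len(nrow(DF)), DF$Art)) {
  DF$RollMean[idx] <- mean_prior_weeks(DF$Weekstart[idx], DF$Demand[idx])
}
DF[DF$Art == "b", ]
```

With data.table, the same helper could presumably be called per group as `DT[, RollMean := mean_prior_weeks(Weekstart, Demand), by = Art]`. The comparison is quadratic within each group, which is fine for small tables but may be slow for very long series.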
I've spent a lot of time searching for a solution, but without success. For that reason I decided to post my question here, hoping that somebody can help me.
I want to find out which variables are influencing the travel distance of two animals (same species).
The response variable is distance moved (in meters). In total I have 66 tracking sessions for both animals.
The independent variables are: temperature, rainfall, offspring (yes = 1, no = 0), observation period (in minutes) and activity.
I looked at the animals (one day - one animal) every 15 minutes and noted the state of activity (active = 1 or inactive = 0). For that reason my data table consists of around 1800 points and the same number of activity records.
Then I created a table with following columns:
Animal, Tracking-Session, rainfall, offspring, observation period, active, inactive, distance
The two columns active and inactive contain the sum of active (inactive) records per tracking session.
For example in tracking-session 1 the animal A was 30 times active and 11 inactive and moved 6000 meters during that tracking session.
I thought I could do my analysis with this table using the command cbind() to make one column for activity out of the two columns with "inactive" and "active". But this does not work, I get:
Error in lme4::lFormula(formula = distance~ (1 | animal) + activity + offspring + ...
rank of X = 12 < ncol(X) = 13
I want to include the second animal as a random factor to get an output valid for the whole "population" (which only consists of two animals in this case).
How can I fit a linear mixed model to this data? Or, as a first question: what does my data table have to look like for such an analysis?
I started by running a linear mixed model with my original data table of 1800 rows, but the outcome was not convincing, and I don't know whether this table was built correctly for the task. I have only 60 tracking sessions, and therefore only 60 resulting travel distances, but 1800 records of activity (one every 15 minutes, active or inactive). I don't know how to handle this situation. The only way I found to get around the problem was to copy the travel distance (which is the result of all points observed per day) and assign it to every single point of that tracking session.
The same goes for rainfall and temperature: because these conditions were only measured once a day, I had to copy the value to every single point taken on the same day.
Is this correct, or rather, can R handle such tables (like in the picture)? Or is it better to create a table with one row for each day (as I described above)?
If the second table (the one with one row per tracking session) is the better choice, how does it have to be transformed so that R can use it?
Hopefully you can follow my explanations (I tried to explain everything in as much detail as possible) and someone can help me!
Thanks in advance!
Iris
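As a rough sketch of how the point-level table could be collapsed into one row per tracking session: the column names follow the list in the question, but the data below is invented apart from session 1 (30 active and 11 inactive records, 6000 m), and `prop_active` is a hypothetical name for a single activity variable.

```r
# Invented point-level data: one row per 15-minute observation
obs <- data.frame(
  animal    = rep(c("A", "B"), times = c(41, 4)),
  session   = rep(c(1, 2), times = c(41, 4)),
  rainfall  = rep(c(0, 2.5), times = c(41, 4)),
  offspring = rep(c(1, 0), times = c(41, 4)),
  distance  = rep(c(6000, 1500), times = c(41, 4)),
  active    = c(rep(c(1, 0), times = c(30, 11)), 1, 0, 0, 0)
)

# Counts of active records and of all records per session
n_active <- tapply(obs$active, obs$session, sum)
n_obs    <- tapply(obs$active, obs$session, length)

# One row per session: the per-day values are identical within a session
sessions <- unique(obs[, c("animal", "session", "rainfall", "offspring", "distance")])
key <- as.character(sessions$session)
sessions$active      <- as.numeric(n_active[key])
sessions$inactive    <- as.numeric(n_obs[key]) - sessions$active
sessions$prop_active <- sessions$active / (sessions$active + sessions$inactive)
sessions
```

A table like `sessions` has one travel distance per row, so nothing needs to be copied across observations. The single column `prop_active` could then stand in for activity in a model formula, for example `lme4::lmer(distance ~ prop_active + rainfall + offspring + (1 | animal), data = sessions)`. Whether a random effect with only two animals is meaningful is a separate question that this sketch does not answer.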
I have 34 subsets with a bunch of variables, and I am making a new dataframe with summary information about each variable for the subsets.
- Example: A10, T2 and V2 are all subsets with ~10 variables and 14 observations where one variable is population.
I want my new dataframe to have a column which says how many times per subset variable 2 hit zero.
I've looked at a bunch of different count functions but they all seem to make separate tables and count the occurrences of all variables. I'm not interested in how many times each unique value shows up because most of the values are unique, I just want to know how many times population hit zero for each subset of 14 observations.
I realize this is probably a simple thing to do but I'm not very good at creating my own solutions from other R code yet. Thanks for the help.
I've done something similar with a different dataset where I counted how many times 'NA' occurred in a vector where all the other values were numerical. For that I used:
na.tmin<- c(sum(is.na(s1997$TMIN)), sum(is.na(s1998$TMIN)), sum(is.na(s1999$TMIN))...
Which created a column (na.tmin) that had the number of times each subset recorded NA instead of a number. I'd like to just count the number of times the value 0 occurred, but is.0 is of course not a function because 0 is numerical. Is there a function that will just count the number of times a specific value shows up? If there isn't, should I use the count-occurrences-of-unique-values function?
Perhaps:
sum( abs( s1997$TMIN ) < 0.00000001 )
It's safer to use a tolerance value unless you are sure that your value is an integer. See R FAQ 7.31.
sum( abs( pi - (355/113+seq(-0.001, 0.001, length=1000 ) ) )< 0.00001 )
[1] 10
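To get one count per subset without typing out each `sum(...)` by hand, the subsets could be collected in a named list and the same tolerance check applied to each. This is only a sketch: the column name `population`, the helper `count_zeros`, and the three tiny data frames are assumptions standing in for the real A10, T2 and V2.

```r
# Count values that are zero within a small tolerance, ignoring NAs
count_zeros <- function(x, tol = 1e-8) {
  sum(abs(x) < tol, na.rm = TRUE)
}

# Invented stand-ins for the real subsets
subsets <- list(
  A10 = data.frame(population = c(12, 0, 5, 0, 0)),
  T2  = data.frame(population = c(3, 4, 8, 1, 2)),
  V2  = data.frame(population = c(0, 7, NA, 9, 6))
)

# One zero count per subset, collected into a summary data frame
n_zero <- vapply(subsets, function(d) count_zeros(d$population), integer(1))
summary_df <- data.frame(subset = names(subsets), n_zero = n_zero, row.names = NULL)
summary_df
```

If the population values are known to be exact integers, `sum(d$population == 0, na.rm = TRUE)` would give the same counts without the tolerance.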