Grouping data based on difference in days - r

I have a data frame with three columns: subid, test, and day. For each subject, I want to identify which tests happened within a time frame of x days and calculate the maximum change in test value. Please see the example below. For each subject and a given test, I want to identify which tests happened within 3 days of each other. Looking at the "Day" column, the value Day = 1 does not belong to any group, because the next test was done 6 days later. The values Day = 7, 8, 9, 10 should be identified as one group, and the max change among them should be calculated. Similarly, Day = 9, 10, 11, 12 should be identified as another group, with its own max change. How can I do this in R? Thank you in advance.
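One possible reading of this, as a minimal base R sketch: every test day opens a window that covers that day and the following x days, and a window counts as a group only if it holds at least two tests. The example data, the helper name `max_change_windows`, and the definition of "max change" as largest minus smallest test value in the window are all assumptions, since the question does not pin them down.

```r
# Hypothetical example data; the column names follow the question.
df <- data.frame(
  subid = c(1, 1, 1, 1, 1, 1, 1),
  day   = c(1, 7, 8, 9, 10, 11, 12),
  test  = c(5, 6, 9, 4, 7, 10, 3)
)

# For one subject: open a window at each test day and keep the windows
# that contain at least two tests. "Max change" is taken here to mean
# max(test) - min(test) inside the window (an assumption).
max_change_windows <- function(d, x = 3) {
  d <- d[order(d$day), ]
  res <- lapply(seq_len(nrow(d)), function(i) {
    in_win <- d$day >= d$day[i] & d$day <= d$day[i] + x
    if (sum(in_win) < 2) return(NULL)  # a lone test forms no group
    data.frame(
      subid      = d$subid[i],
      start_day  = d$day[i],
      end_day    = max(d$day[in_win]),
      n_tests    = sum(in_win),
      max_change = max(d$test[in_win]) - min(d$test[in_win])
    )
  })
  do.call(rbind, res)
}

# Apply per subject and stack the results.
result <- do.call(rbind, lapply(split(df, df$subid), max_change_windows, x = 3))
```

With this data, Day = 1 produces no row, the window starting at Day = 7 covers days 7 to 10, and the window starting at Day = 9 covers days 9 to 12. Windows starting at days 8, 10 and 11 are reported as well; if only non-nested windows are wanted, they would need to be filtered afterwards.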

Related

Creating a variable with observation number based on date and id number for participants with multiple observations

I have a database with 100 participants, identified by the variable 'id'. The database also includes a 'StartDate' variable indicating the date and time each participant took the survey, in the format 'dd.mm.yyyy hh:mm:ss'. One observation per day.
I want to create a new variable called 'observation' that numbers the observations of each participant according to the date variable. If two consecutive observations are 2 days apart, the number increases by 2.
For example, if a participant has observations on 11.10.2023 23:08:13, 12.10.2023 22:01:12, 13.10.2023 20:14:17, 14.10.2023 10:30:18, 14.10.2023 19:45:18, the 'observation' variable should take the values 1, 2, 3, 4, and 5 respectively.
If a participant skipped a day, the 'observation' number should jump. For example, if a participant has observations on 11.10.2023 23:08:13, 13.10.2023 20:14:17, 14.10.2023 19:30:18, 16.10.2023 19:45:18, the 'observation' variable should take the values 1, 3, 4, and 6 respectively.
How can I write SPSS syntax for this?
In order to calculate the observation variable you need to change the date-time variable into a date only.
If it is originally in text format, do this:
COMPUTE YourDate=number(YourOldDate, DATE8).
FORMATS YourDate (DATE9).
If it is originally in number or date-time format, do this:
compute YourDate=YourOldDate.
alter type YourDate (DATE8).
FORMATS YourDate (DATE9).
Now we can calculate the observation by comparing the date in each line to the date in the previous one. The max function is there so that when the difference in days since the previous observation is zero, we still add 1.
sort cases by ID YourDate.
compute observation=1.
if $casenum>1 and ID=lag(ID)
observation=lag(observation)+max(datediff(YourDate,lag(YourDate),"days"),1).
exe.
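For comparison, the same rule can be sketched in base R. This is not SPSS syntax, only an illustration of the logic: the first observation per id gets 1, and each later one adds the day difference to the previous value, with a minimum step of 1. The data frame `obs` and its contents are made up for the example.

```r
# Hypothetical data: two participants, timestamps stored as text.
obs <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  StartDate = c(
    "11.10.2023 23:08:13", "13.10.2023 20:14:17",
    "14.10.2023 19:30:18", "16.10.2023 19:45:18",
    "11.10.2023 23:08:13", "12.10.2023 22:01:12",
    "13.10.2023 20:14:17", "14.10.2023 10:30:18",
    "14.10.2023 19:45:18"
  ),
  stringsAsFactors = FALSE
)

# Keep only the date part; the trailing time is ignored by the format.
obs$day <- as.Date(obs$StartDate, format = "%d.%m.%Y")
obs <- obs[order(obs$id, obs$day), ]

# Within each id: start at 1, then add max(day difference, 1) per row.
obs$observation <- ave(
  as.numeric(obs$day), obs$id,
  FUN = function(d) cumsum(c(1, pmax(diff(d), 1)))
)
```

The first participant reproduces the skipped-day example (1, 3, 4, 6), and the second one the example with two observations on the same day (1, 2, 3, 4, 5).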

Count total values after CountDistinct

I created a table in which I want to see all the resources that were used on one day, for different missions. A resource may execute more than one mission per day, which is why I used a CountDistinct expression to show only the number of unique resources used in one day across all missions.
As a next step, I want to see the average number of unique resources for a selected time period.
Unfortunately, I am not able to use a Count or Sum expression on top of the CountDistinct expression.
If I apply a sum function, it gives me the total number of unique values across the whole time period, but I want the sum of the resources used per day.
For example, I have 3 resources: on day 1 I use resource A for 5 missions, and on day 2 I use resources A and B for 6 missions. That makes 11 missions over 2 days, and 3 resources (A + A + B).
So I want to add up 82+92+100+90+91+92. How do I get the sum of these values?
Any suggestions on how to fix this, please?
Many thanks!
Found the solution: I created 2 extra datasets to pull the unique values per day.
I added a Lookup function in one of the two tablixes to compare the values on the same dates (the dates appear in both datasets), which gives the unique values per day. Afterwards I summed the values and divided by the number of days to get the average number of unique values per day.
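The difference between "distinct over the whole period" and "distinct per day, then summed" can be illustrated outside the reporting tool. The following is a small base R sketch using the numbers from the example above; it is not report expression syntax, and the data frame `missions` is made up.

```r
# Hypothetical mission log: day 1 has 5 missions by A,
# day 2 has 6 missions shared between A and B.
missions <- data.frame(
  day      = c(rep(1, 5), rep(2, 6)),
  resource = c(rep("A", 5), "A", "B", "A", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# Step 1: distinct resources within each day.
unique_per_day <- tapply(
  missions$resource, missions$day,
  function(r) length(unique(r))
)

# Step 2: aggregate the per-day counts.
total_resource_days <- sum(unique_per_day)   # A + A + B
avg_per_day         <- mean(unique_per_day)

# For contrast: distinct resources over the whole period.
overall_distinct <- length(unique(missions$resource))
```

Here `total_resource_days` is 3 and `avg_per_day` is 1.5, while `overall_distinct` is only 2, which is the value a distinct count over the whole period reports.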

How do I create a different data set from existing data set with only certain variables and values that I need?

I have a data set with the age of bird chicks from day 2 to day 10 (2, 4, 6, 8, 10), and a mass measurement for each chick on days 2, 4, 6, 8 and 10. But not all chicks survive until day 10. How do I extract, in R, a data set from the overall one that contains only those individuals that have a mass value for each of those days? I would also like to sort them by Mass and Tarsus, and to get the data set of those that have values for both variables on those days.
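A minimal base R sketch of one way to do this, assuming the data is in long format with one row per chick and measurement day. The column names `chick_id`, `age`, `mass` and `tarsus` and all values are hypothetical, since the question does not show the data.

```r
# Hypothetical long-format data: c1 is complete, c2 has a missing
# mass on day 10, c3 has no rows after day 6.
chicks <- data.frame(
  chick_id = c(rep("c1", 5), rep("c2", 5), rep("c3", 3)),
  age      = c(2, 4, 6, 8, 10, 2, 4, 6, 8, 10, 2, 4, 6),
  mass     = c(5, 9, 14, 18, 21, 4, 8, 12, 15, NA, 6, 10, 13),
  tarsus   = c(8, 10, 12, 14, 15, 7, 9, 11, 13, 14, 8, 10, 12),
  stringsAsFactors = FALSE
)

required_days <- c(2, 4, 6, 8, 10)

# Rows that have both measurements on one of the required days.
ok <- !is.na(chicks$mass) & !is.na(chicks$tarsus) &
  chicks$age %in% required_days

# Count the distinct required days covered by each chick.
n_days <- tapply(
  chicks$age[ok], chicks$chick_id[ok],
  function(a) length(unique(a))
)
keep_ids <- names(n_days)[which(n_days == length(required_days))]

# Keep only complete chicks, then sort by mass and tarsus.
complete <- chicks[chicks$chick_id %in% keep_ids, ]
complete <- complete[order(complete$mass, complete$tarsus), ]
```

To require only mass on every day, drop the `!is.na(chicks$tarsus)` condition from `ok`.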

Moving average with dynamic window

I'm trying to add a new column to my data table that contains the average of some of the following rows. How many rows to be selected for the average however depends on the time stamp of the rows.
Here is some test data:
library(data.table)
DT<-data.table(Weekstart=c(1,2,2,3,3,4,5,5,6,6,7,7,8,8,9,9),Art=c("a","b","a","b","a","a","a","b","b","a","b","a","b","a","b","a"),Demand=c(1:16))
I want to add a column with the mean of all demands that occurred in the weeks ("Weekstart") up to three weeks before the respective week (grouped by Art, excluding the actual week).
With rollapply from zoo-library, it works like this:
setorder(DT,-Weekstart)
DT[,RollMean:=rollapply(Demand,width=list(1:3),partial=TRUE,FUN=mean,align="left",fill=NA),.(Art)]
The problem, however, is that some data is missing. In the example, Art b has no row for week 4, because there is no Demand in week 4. As I want the average of the three prior weeks, not the three prior rows, the average is wrong. Instead, the result for Art b for week 6 should look like this:
DT[Art=="b"&Weekstart==6,RollMean:=6]
(6 instead of 14/3, because only Week 5 and Week 3 count: (8+4)/2)
Here is what I tried so far:
It would be possible to loop through the minima of the week of the following rows in order to create a vector that defines, for each row, how wide the 'width' should be (the new column 'rollwidth'):
i<-3
DT[,rollwidth:=Weekstart-rollapply(Weekstart,width=list(1:3),partial=TRUE,FUN=min,align="left",fill=1),.(Art)]
while (max(DT[,Weekstart-rollapply(Weekstart,width=list(1:i),partial=TRUE,FUN=min,align="left",fill=NA),.(Art)][,V1],na.rm=TRUE)>3) {
i<-i-1
DT[rollwidth>3,rollwidth:=i]
}
But that seems very unprofessional (excuse my poor skills). And, unfortunately, the rollapply with width and rollwidth doesn't work as intended (it produces warnings, as 'rollwidth' is taken to be all the rollwidths in the table):
DT[,RollMean2:=rollapply(Demand,width=list(1:rollwidth),partial=TRUE,FUN=mean,align="left",fill=NA),.(Art)]
What does work is
DT[,RollMean3:=rollapply(Demand,width=rollwidth,partial=TRUE,FUN=mean,align="left",fill=NA),.(Art)]
but then again, the average includes the actual week (not what I want).
Does anybody know how to apply a criterion (i.e. the difference in weeks shall be <= 3) to the argument width, instead of a number of rows?
Any suggestions are appreciated!
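One way to apply a week-based criterion is to skip row windows entirely and select the prior rows by their Weekstart value. The following is a minimal base R sketch on a plain data frame rather than a data.table, and it swaps rollapply for an explicit per-row lookup; it is quadratic in the number of rows, so it is meant as an illustration and not as a tuned solution.

```r
# Same test data as in the question, as a plain data frame.
DT <- data.frame(
  Weekstart = c(1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9),
  Art = c("a", "b", "a", "b", "a", "a", "a", "b",
          "b", "a", "b", "a", "b", "a", "b", "a"),
  Demand = 1:16,
  stringsAsFactors = FALSE
)

# For each row: mean Demand of the same Art in the weeks
# [Weekstart - 3, Weekstart - 1]; NA if there are no such rows.
DT$RollMean <- mapply(function(w, a) {
  prior <- DT$Demand[DT$Art == a &
                     DT$Weekstart >= w - 3 &
                     DT$Weekstart < w]
  if (length(prior) == 0) NA_real_ else mean(prior)
}, DT$Weekstart, DT$Art)
```

For Art b in week 6 this gives 6, because only weeks 3 and 5 fall inside the window ((4 + 8) / 2), and the missing week 4 is simply not counted.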

Sub setting data frame by group means

I would like to subset a data frame by group means: I want to keep all rows whose value is greater than the mean of their group. The code I have tried is:
data<-read.csv("TreeData.csv")
library(plyr)
#Calculating the group means
MDBH<-ddply(data, .(PLTPA),summarise, MDBH=mean(DBH))
MDBH
dataDHT<-subset(data,DBH>MDBH)
#The subset data is incorrect: it excluded some values greater than the mean
#and included some values less than the mean.
dataDHT
The data set I created for this problem is at:
https://www.dropbox.com/s/ejnjhg4ogk2g4rw/TreeData.csv?dl=0
Thank you in advance for the help.
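The comparison `DBH>MDBH` goes wrong because MDBH is a summary data frame with one row per group, so it is not aligned row by row with the original data. A minimal base R sketch of one alternative uses `ave()`, which returns the group mean repeated for every row. The data below is made up, since the linked file is not reproduced here; only the column names PLTPA and DBH come from the question.

```r
# Hypothetical stand-in for TreeData.csv.
trees <- data.frame(
  PLTPA = c("p1", "p1", "p1", "p2", "p2", "p2"),
  DBH   = c(10, 20, 30, 5, 6, 10),
  stringsAsFactors = FALSE
)

# ave() defaults to the mean and returns one value per row,
# namely the mean DBH of that row's PLTPA group.
group_mean <- ave(trees$DBH, trees$PLTPA)

# Keep the rows whose DBH exceeds their own group's mean.
above <- trees[trees$DBH > group_mean, ]
```

Here the group means are 20 for p1 and 7 for p2, so `above` keeps the rows with DBH 30 and 10.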
