Creating a variable with observation number based on date and id number for participants with multiple observations - datetime

I have a database with 100 participants, identified by the variable 'id'. The database also includes a 'StartDate' variable indicating the date and time each participant took the survey, in the format 'dd.mm.yyyy hh:mm:ss'. There is generally one observation per day.
I want to create a new variable called 'observation' that numbers each participant's observations according to the date variable. If there is a 2-day gap, the number skips by 2.
For example, if a participant has observations on 11.10.2023 23:08:13, 12.10.2023 22:01:12, 13.10.2023 20:14:17, 14.10.2023 10:30:18 and 14.10.2023 19:45:18, the 'observation' variable should take the values 1, 2, 3, 4, and 5 respectively (two observations on the same day are numbered consecutively).
If a participant skipped a day, the 'observation' number should jump. For example, if a participant has observations on 11.10.2023 23:08:13, 13.10.2023 20:14:17, 14.10.2023 19:30:18, 16.10.2023 19:45:18, the 'observation' variable should take the values 1, 3, 4, and 6 respectively.
How can I write this in SPSS syntax?

In order to calculate the observation variable you first need to change the date-time variable into a date only; otherwise DATEDIFF in "days" counts complete 24-hour periods rather than calendar days, which can give the wrong gap.
If it is originally in text format, do this (assuming the strings match the question's 'dd.mm.yyyy hh:mm:ss' pattern, which the DATETIME20 format can read):
COMPUTE YourDate=XDATE.DATE(NUMBER(YourOldDate, DATETIME20)).
FORMATS YourDate (DATE11).
If it is originally in numeric date-time format, strip the time portion with XDATE.DATE, which returns the date part of a date-time value:
COMPUTE YourDate=XDATE.DATE(YourOldDate).
FORMATS YourDate (DATE11).
Now we can calculate the observation number by comparing the date on each line to the previous one. The MAX function ensures that when the difference from the previous observation is zero days (two observations on the same date), we still add 1.
SORT CASES BY ID YourDate.
COMPUTE observation=1.
IF ($casenum > 1 AND ID = lag(ID))
    observation = lag(observation) + MAX(datediff(YourDate, lag(YourDate), "days"), 1).
EXECUTE.
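As a sanity check against the second example: the calendar-day gaps between 11.10., 13.10., 14.10. and 16.10. are 2, 1 and 2, so the running total is 1, 1+2=3, 3+1=4 and 4+2=6, exactly the required 1, 3, 4, 6.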

Related

Grouping data based on difference in days

I have a data frame that has 3 columns: subid, test, day. For each subject, I want to identify which tests happened within a time frame of x days and calculate the max change in test value. Please see the example below. For each subject and a given test, I want to identify which tests happened within 3 days. Looking at the "Day" column, the value 1 forms no group, as the subsequent test was done 6 days later. The values Day = 10, 7, 8, 9 should be identified as one group and the max change among them calculated; similarly, Day = 12, 11, 10, 9 should be identified as another group and the max change among those calculated. How can I do this using R? Thank you in advance.
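A hedged sketch of one reading of this (the data frame below is invented to match the description, and the 3-day window is taken from the question): treat each Day value as the start of a window of x days, collect the same subject's test values inside that window, and take max minus min.
library(dplyr)

x <- 3  # window width in days, as in the question

# Invented example data in the shape the question describes
dat <- data.frame(
  subid = rep(1, 7),
  day   = c(1, 7, 8, 9, 10, 11, 12),
  test  = c(10, 14, 12, 9, 15, 11, 13)
)

dat %>%
  group_by(subid) %>%
  arrange(day, .by_group = TRUE) %>%
  mutate(max_change = sapply(day, function(d) {
    vals <- test[day >= d & day <= d + x]   # tests within x days of day d
    if (length(vals) > 1) max(vals) - min(vals) else NA_real_
  })) %>%
  ungroup()
For Day = 1 this yields NA, since no other test falls within 3 days, matching the question's expectation.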

Convert column with dates (as strings) into date type with only the year

I have a dataset (call it df) that has several columns. One of those columns is date, which contains strings of the form "d-MON-yy" or "dd-MON-yy", depending on whether the day number is less than 10 (e.g. 9-Jan-04, 15-Oct-98), or NA.
I am trying to change this to date-type values, but I only need the year. Specifically, all the dates whose yy digits are less than 20 are from this century, and all the dates whose yy digits are greater than or equal to 20 are from the 1900s. I want to have the four digits of the year in the end.
Since I am only interested in the year, I don't mind a solution that returns numeric values.
In the end, I'd also like to filter out the rows that have NA in the date variable only.
I am pretty new to R, and I have tried to make it work with several answers I found here to no avail.
Thank you.
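Since only the year is needed, one possibility is to skip date parsing entirely and work on the "yy" suffix; a minimal sketch, assuming the column is called date and the strings really end in a two-digit year:
df <- data.frame(date = c("9-Jan-04", "15-Oct-98", NA))  # toy stand-in for the real data

yy <- as.numeric(sub(".*-", "", df$date))          # two-digit year after the last dash
df$year <- ifelse(yy < 20, 2000 + yy, 1900 + yy)   # century cutoff from the question

df <- df[!is.na(df$date), ]   # drop rows where date (only) is NA
This returns numeric years (2004, 1998), which the question says is acceptable.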

Identify categorical variables when importing dataset in R

I'm importing a large dataset into R and am curious whether there's a way to quickly go through the columns and identify whether each column holds categorical, numeric, date, etc. values. When I use str(df) or class(df), the columns mostly come back mislabeled.
For example, some columns are labeled as numeric, but there are only 10 unique values in the column (ranging from 1-10), indicating that it should really be a factor. There are other columns that only have 11 unique values representing a rating, from 0-5 in 0.5 increments. Another column has country codes (172 values), which range from 1-230.
Is there a way to quickly identify whether a column should be a factor without going through each column to understand the nature of the variable? (There are many columns in the dataset.)
Thanks!
At the moment, I've been using variations of the following code to catch the first two cases:
df[, 51] <- as.numeric(df[, 51])      # convert the column to numeric
len <- length(unique(df[, 51]))       # number of unique values
rng <- max(df[, 51]) - min(df[, 51])  # difference between min and max
ord <- rng / (len - 1)                # implied increment if equally spaced
# subtract the second-largest unique value from the largest to find the
# actual increment (only uses the last two values)
vals <- sort(unique(df[, 51]))
step <- vals[len] - vals[len - 1]
ord == step   # check if the last increment equals the implied increment
However, this approach assumes that each of the variables is equally spaced (for example, in 0.5 increments) and only tests the space between the last two values. It wouldn't catch a column containing c(1, 2, 3.5, 4.5, 5, 6), which has 6 unique values but uneven spacing in the middle (not that this is common in my dataset).
It is not obvious how many distinct values would indicate a factor vs. a numeric variable, but you can examine all variables to see what is in your data with
table(sapply(df, function(x) length(unique(x))))
and if you decide that the boundary between factor and numeric is k, you can identify the factors with
which(sapply(df, function(x) length(unique(x)) < k))
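A possible follow-up once a cutoff k is chosen (15 below is only a placeholder): flag the low-cardinality columns and convert them to factors in one step.
k <- 15   # placeholder cutoff; pick it after inspecting the table above
is_fac <- sapply(df, function(x) length(unique(x)) < k)
df[is_fac] <- lapply(df[is_fac], factor)
str(df)   # the flagged columns now show up as factors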

Moving average with dynamic window

I'm trying to add a new column to my data table that contains the average of some of the following rows. How many rows are selected for the average, however, depends on the time stamps of the rows.
Here is some test data:
library(data.table)

DT <- data.table(
  Weekstart = c(1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9),
  Art = c("a", "b", "a", "b", "a", "a", "a", "b", "b", "a", "b", "a", "b", "a", "b", "a"),
  Demand = 1:16
)
I want to add a column with the mean of all demands that occurred in the weeks ("Weekstart") up to three weeks before the respective week (grouped by Art, excluding the current week).
With rollapply from the zoo library, it works like this:
library(zoo)

setorder(DT, -Weekstart)
DT[, RollMean := rollapply(Demand, width = list(1:3), partial = TRUE, FUN = mean,
                           align = "left", fill = NA), .(Art)]
The problem, however, is that some data are missing. In the example, the data for Art "b" lack week no. 4: there is no Demand in week 4. As I want the average of the three prior weeks, not the three prior rows, the average is wrong. Instead, the result for Art "b" in week 6 should look like this:
DT[Art == "b" & Weekstart == 6, RollMean := 6]
(6 instead of 14/3, because only week 5 and week 3 count: (8 + 4) / 2)
Here is what I tried so far:
It would be possible to loop over the minima of the weeks of the following rows in order to create a vector that defines, for each row, how wide 'width' should be (the new column 'rollwidth'):
i <- 3
DT[, rollwidth := Weekstart - rollapply(Weekstart, width = list(1:3), partial = TRUE,
                                        FUN = min, align = "left", fill = 1), .(Art)]
while (max(DT[, Weekstart - rollapply(Weekstart, width = list(1:i), partial = TRUE,
                                      FUN = min, align = "left", fill = NA), .(Art)][, V1],
           na.rm = TRUE) > 3) {
  i <- i - 1
  DT[rollwidth > 3, rollwidth := i]
}
But that seems very unprofessional (excuse my poor skills). And, unfortunately, rollapply with width = list(1:rollwidth) doesn't work as intended (it produces warnings, as 'rollwidth' is treated as the entire column rather than a per-row value):
DT[, RollMean2 := rollapply(Demand, width = list(1:rollwidth), partial = TRUE, FUN = mean,
                            align = "left", fill = NA), .(Art)]
What does work is
DT[, RollMean3 := rollapply(Demand, width = rollwidth, partial = TRUE, FUN = mean,
                            align = "left", fill = NA), .(Art)]
but then again, the average includes the current week (not what I want).
Does anybody know how to apply a criterion (i.e. the difference in weeks must be <= 3) to the width argument, instead of a number of rows?
Any suggestions are appreciated!
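One hedged possibility that avoids rollapply entirely: a data.table non-equi self-join, in which every row looks up all rows of the same Art whose Weekstart lies in the half-open window [Weekstart - 3, Weekstart). Because the window is keyed on week values rather than row counts, missing weeks simply drop out of the average. (RollMeanWin is a new column name, chosen to avoid the RollMean columns above.)
# Helper columns describing each row's window of the three preceding weeks
DT[, `:=`(win_lo = Weekstart - 3, win_hi = Weekstart)]

# For every row of DT (by = .EACHI), average the Demand of same-Art rows whose
# Weekstart falls inside that row's window; rows with no earlier week in range get NA
DT[, RollMeanWin := DT[DT, on = .(Art, Weekstart >= win_lo, Weekstart < win_hi),
                       mean(Demand), by = .EACHI]$V1]
DT[, c("win_lo", "win_hi") := NULL]

DT[Art == "b" & Weekstart == 6]   # RollMeanWin = (8 + 4) / 2 = 6, as required
The strict < win_hi bound is what excludes the current week from its own average.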

Count of columns with filters

I have a dataframe with multiple columns and I want to apply different functions on each column.
(An example of my dataset was attached as a screenshot.)
I want to calculate the count of the column pq110a for each country mentioned in the qcountry2 column (me = Mexico, br = Brazil, ar = Argentina). The problem I face here is that I have to filter on these columns; for example, for the patients I want:
Count of pq110 when the values are 1 and 2 (for some patients)
Count of pq110 when the value is 3 (for other patients)
Similarly when the value is 6.
For all patients, I want the total count of pq110.
(The output I am expecting was attached as a screenshot.)
Similarly, I want this output for each country.
Please suggest how I can do this for the other columns as well, country-wise.
Thanks !!
I guess what you want to do is count the rows of 'pq110' that take each value, within the different levels of 'qcountry2'. So I'll use 'tapply' to divide the data into subsets and then 'table' to count the rows for each distinct value:
tapply(my_data[, "pq110"], INDEX = as.factor(my_data[, "qcountry2"]), FUN = table)
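To add the value filters from the question, a small invented illustration (using the pq110a spelling from the question's first sentence; adjust the names to the real data):
my_data <- data.frame(
  qcountry2 = c("me", "me", "br", "br", "ar", "ar"),
  pq110a    = c(1, 3, 2, 6, 1, 3)
)

table(my_data$qcountry2[my_data$pq110a %in% c(1, 2)])  # count where value is 1 or 2
table(my_data$qcountry2[my_data$pq110a == 3])          # count where value is 3
table(my_data$qcountry2[my_data$pq110a == 6])          # count where value is 6
table(my_data$qcountry2)                               # total count per country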
