So, I have a time series with a number of observations per ID. I measure a 1/0 variable at each time point. I want to know how many times a person switches from 1 to 0 or 0 to 1.
There are missing data at random in each ID.
id=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
indicator=c(1,0,0,0,1,1,NA,1,1,0,1,0,NA,1,1,0)
timepoint=c(2003,2004,2005,2006)
td = data.frame(id,timepoint,indicator)
I need code to return the number of switches per person in these data.
To count the number of switches from 0 to 1 or from 1 to 0, all you need to do is shift your vector by one position and compare the two shifted versions. Then apply this function by id, assuming your data are already sorted in chronological order.
library(data.table)
setDT(td)  # td was built with data.frame(); convert it so the [, j, keyby] syntax works
count.switch=function(x) sum(tail(x,-1)!= head(x,-1),na.rm=TRUE)
td[,count.switch(indicator),keyby=.(id)]
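With this method you can also specify which switch you would like to count. For example, to count only the switches from 1 to 0 (head(x,-1) is the earlier value, tail(x,-1) the later one):
count.switch.10=function(x) sum(head(x,-1) == 1 & tail(x,-1) == 0, na.rm=TRUE)
You could also count both directions by combining the two conditions: (head(x,-1) == 0 & tail(x,-1) == 1) | (head(x,-1) == 1 & tail(x,-1) == 0)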
Another trick is to count how often the shifted vectors add up to 1 element-wise. They only do so when one element is 0 and the other is 1.
count.switch=function(x) sum(tail(x,-1) + head(x,-1) == 1,na.rm=TRUE)
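For anyone who wants to check the result without data.table, here is a minimal base R sketch of the same shift-and-compare idea on the sample data from the question. The second function, count.switch.skipna, is not from the answer above; it is a hypothetical variant that assumes you want a missing year to be skipped rather than to break the comparison.

```r
# Sample data from the question
id        <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
indicator <- c(1,0,0,0,1,1,NA,1,1,0,1,0,NA,1,1,0)

# Compare each value with the one before it; pairs containing an NA are ignored
count.switch <- function(x) sum(tail(x, -1) != head(x, -1), na.rm = TRUE)

# Hypothetical variant: drop the NAs first, so that 1, NA, 0 counts as one switch
count.switch.skipna <- function(x) {
  x <- x[!is.na(x)]
  sum(diff(x) != 0)
}

# Apply per id (assumes rows are already in chronological order within each id)
switches <- tapply(indicator, id, count.switch)
switches
# id 1: 1, id 2: 0, id 3: 3, id 4: 1

count.switch(c(1, NA, 0))         # 0
count.switch.skipna(c(1, NA, 0))  # 1
```

Which NA behaviour is right depends on whether you are willing to treat the observations on either side of a gap as consecutive.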
Related
I have a panel with 50 subjects and one observation per subject per year, 2007-2020. I need to construct a multiple-event survival model, where events can happen more than once per subject.
In terms of variables, I have a dummy for the event, a variable indicating the year, a time_overall variable that shows the total time that has passed (ranges from 1-14 and increases irrespective of the event, each increase by one indicates a year passing), and a time_conditional variable that shows the time between the last two events.
I want to create start and stop time variables based on time overall, the event, and the time conditional. I wrote
for (row in c(1:nrow(df))) {
if (df$time_overall[row] == 1 & df$event[row] == 1) {
df$time[row] <- 1
df$time_start[row] <- 0
df$time_stop[row] <- 1
}
if (df$time_overall[row] > 1 & df$event == 1) {
df$time_start[row] <-
df$time_overall[row]-df$time_conditional[row]
df$time_stop[row] <- df$time_overall[row]
df$time[row] <- df$time_stop[row]-df$time_start[row]
}
}
but most of the time_start values change to zero, leading to hundreds of observations being dropped when I construct the model.
Any idea why this might be, or an alternative approach that can create start and stop time variables based on the information I've presented?
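One thing that stands out in the loop: the second condition tests df$event == 1 against the whole column instead of df$event[row], so it does not check the event of the current row. A vectorized version avoids the loop altogether. The following is only a sketch on made-up data (the toy df and its values are hypothetical), and it assumes that time_conditional holds the time since the previous event on every event row.

```r
# Hypothetical toy panel for one subject; values are made up for illustration
df <- data.frame(
  time_overall     = 1:6,
  event            = c(1, 0, 0, 1, 0, 1),
  time_conditional = c(1, NA, NA, 3, NA, 2)
)

is_event <- df$event == 1

# Start/stop are only filled on event rows; every other row stays NA
df$time_stop  <- ifelse(is_event, df$time_overall, NA)
df$time_start <- ifelse(is_event & df$time_overall == 1, 0,
                        ifelse(is_event, df$time_overall - df$time_conditional, NA))
df$time <- df$time_stop - df$time_start

df
```

Rows without an event are left as NA here on purpose, so it is easy to see which rows were never assigned a start time before the model drops them.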
I have a dataset called college, and one of the columns is 'accepted'. There are two values for this column - 1 (which means student was accepted) and 0 (which means student was not accepted). I want to find the accepted student percentage.
I did this...
table(college$accepted)
which gave me the frequency of 1 and 0. (1 = 44,224 and 0 = 75,166). I then manually added those two values together (119,390) and divided the 44,224/119,390. This is fine and gets me the value I was looking for. But I would really like to know how I could do this with R code, since I'm sure there is a way to do it that I just haven't thought of.
Thanks!
Perhaps you can use prop.table like below
prop.table(table(college$accepted))["1"]
If it's a simple 0/1 column then you only need to take the column mean.
mean_accepted <- mean(college$accepted)
You could first sum the column, and then divide by the total number of entries in the column:
sum(college$accepted)/length(college$accepted)
To make the code more explicit and describe your intent better, I suggest using a condition to identify the cases that meet your criteria for inclusion. For example:
college$accepted == 1
Then take the average of the logical vector to compute the proportion (between 0 and 1), multiply by 100 to make it a percentage.
100 * mean(college$accepted == 1, na.rm = TRUE)
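To see that the suggestions above agree, here is a small sketch that rebuilds a 0/1 vector from the counts reported in the question (44,224 ones and 75,166 zeros). The vector name accepted stands in for college$accepted, and the sketch assumes the column has no missing values.

```r
# Rebuild a 0/1 vector with the counts reported in the question
accepted <- rep(c(1, 0), times = c(44224, 75166))

p_table <- unname(prop.table(table(accepted))["1"])  # share of the "1" category
p_mean  <- mean(accepted)                            # mean of a 0/1 column
p_ratio <- sum(accepted) / length(accepted)          # ones divided by row count
p_cond  <- mean(accepted == 1, na.rm = TRUE)         # mean of a logical vector

pct <- 100 * p_cond
round(pct, 2)
# 37.04
```

If the real column contains NA values, mean() and sum() need na.rm = TRUE, and length() would still count the missing rows, so the approaches can diverge there.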
I'm quite new to R, unfortunately I wasn't able to find help in other related questions so far.
I have this dataframe called selection, including column 'RUN' and column 'TRNO'.
It originally had 9 columns. I added the column 'RUN' which contains a count that increases by 1 whenever the value in the column 'DAP' is 0, using this code:
# Insert column RUN in "selection" dataframe
library(dplyr)
selection$RUN <- cumsum(selection$DAP == 0)
That worked perfectly. Now I would like to do a similar operation for the column 'TRNO'. It also needs to contain a count that this time only increases when the column 'RUN' arrives at multiples of 80 (i.e. from RUN == 1-80 --> count =1; RUN == 81-160 --> count =2,...)
I tried several codes, amongst others this one:
# Insert column TRNO in "selection" dataframe
i = 0
repeat{
i = i+80
selection$TRNO <- cumsum(selection$RUN == i)
break
}
Instead of increasing the count at every multiple of 80, it returns "0" when RUN values are between 1-80, increases to 92 when RUN values are at 80, and then stagnates at 92 for all the higher values in RUN.
Try this:
selection$TRNO <- ceiling(selection$RUN/80)
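As for why the loop in the question misbehaves: it hits break after the first pass, so it only ever evaluates cumsum(selection$RUN == 80), which counts the rows where RUN equals 80. Dividing by the block size and rounding up needs no loop. A quick check on a few hand-picked RUN values (the vector below is illustrative, not the asker's data):

```r
# A few RUN values around the block boundaries
RUN <- c(0, 1, 80, 81, 160, 161, 240)

# Round up to the next whole block of 80
TRNO <- ceiling(RUN / 80)
TRNO
# 0 1 1 2 2 3 3

# Equivalent form using integer division
TRNO_alt <- (RUN - 1) %/% 80 + 1
```

Note that rows where RUN is still 0 (anything before the first DAP == 0) get a count of 0 with either form.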
I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check if out of 4 neighboring observations in a given column in my data frame, that at least 3 out of 4 of them contain the same value). I have tried first creating another indicator variable indicating if the condition is satisfied or not (1 or 0 = yes or no). Then, I tried setting up a series of ifelse() statements within a loop to try to assign the proper categorization of the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the dataframe after running the loop, only the observation itself (not its neighbors) where the condition is satisfied is receiving the value, rather than all neighboring observations also receiving the value. Here is my code:
#sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL
for(i in 1:nrow(sample_dat)){
sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i]==1 &
((sample_dat$initial_ind[i-2]==1 |
sample_dat$initial_ind[i-1]==1) &
(sample_dat$initial_ind[i+2]==1 |
sample_dat$initial_ind[i+1]==1)),
"trending",
"non-trending"
)
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3 of the 4 observations in that group have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this, but I need to ensure that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, or 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function to set intervals in your data. The question then becomes about creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks if the condition is true for at least 3/5 of the observation plus its four closest neighbors. In this case just adding the 1s up and checking if they're above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks if the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument on rollapply:
trend_test_2 = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
fill = NA, align = "left")
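The rollapply calls above label one observation per window. If the goal is instead to label every member of a qualifying window (observations 7-10 in the question), a plain base R loop over window start positions is one way to do it. This is a different technique from the rollapply answer, and flag_windows, width and k are hypothetical names chosen for this sketch.

```r
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))

# Mark every observation that belongs to at least one window of `width`
# consecutive rows containing at least `k` ones
flag_windows <- function(x, width = 4, k = 3) {
  n <- length(x)
  flagged <- rep(FALSE, n)
  if (n < width) return(flagged)
  for (s in seq_len(n - width + 1)) {
    idx <- s:(s + width - 1)
    if (sum(x[idx] == 1, na.rm = TRUE) >= k) flagged[idx] <- TRUE
  }
  flagged
}

flagged <- flag_windows(sample_dat$initial_ind, width = 4, k = 3)
sample_dat$violate <- ifelse(flagged, "trending", "non-trending")
which(flagged)
# 7 8 9 10
```

Changing width and k covers the 5/6 or 2/5 cases mentioned in the question.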
I have a dataset in which the Rating column is an integer column with values ranging from 1 to 10.
I would like to convert that column into a simple boolean positive/negative categorical column, so that a value less than 6 is a negative rating, and a value greater than or equal to 6 is a positive rating.
I'm not sure how to do that.
Azure Machine Learning offers at least 3 options to do that. Each one below writes the result to a new RatingB column (0 = negative, 1 = positive):
Apply SQL Transformation: select *, case when rating < 6 then 0 else 1 end RatingB from t1
Execute Python Script: dataframe1['RatingB'] = (dataframe1['rating'] >= 6).astype(int)
Execute R Script: dataset1$RatingB <- ifelse(dataset1$rating < 6, 0, 1)
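As a plain R illustration of the recoding itself, outside of any Azure module, here is a minimal sketch. The data frame contents, the RatingB column and the RatingLabel column are hypothetical and only serve the example.

```r
# Hypothetical ratings on the 1-10 scale described in the question
dataset1 <- data.frame(rating = c(1, 5, 6, 10))

# 0 = negative (below 6), 1 = positive (6 or above)
dataset1$RatingB <- ifelse(dataset1$rating < 6, 0, 1)

# Optional: a labelled categorical version of the same split
dataset1$RatingLabel <- factor(dataset1$RatingB,
                               levels = c(0, 1),
                               labels = c("negative", "positive"))
dataset1
```

Writing to a new column keeps the original 1-10 values available, whereas overwriting rating in place would discard them.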