I have a time series and panel data data frame with a specific ID in the first column, and a weekly status for employment: Unemployed (1), employed (0).
I have 261 variables (the weeks every year) and 1.000.000 observations.
I would like to count the maximum number of times '1' occurs consecutively for every row in R.
I have looked a bit at rowSums and rle(), but I am not as far as I know interested in the sum of the row, as it is very important the values are consecutive.
You can see an example of the structure of my data set here - just imagine more rows and columns
We can write a little helper function to return the maximum number of times a certain value is consecutively repeated in a vector, with a nice default value of 1 for this use case
most_consecutive_val = function(x, val = 1) {
with(rle(x), max(lengths[values == val]))
}
Then we can apply this function to the rows of your data frame, dropping the first column (and any other columns that shouldn't be included):
apply(your_data_frame[-1], MARGIN = 1, most_consecutive_val)
If you share some easily imported sample data, I'll be happy to help debug in case there are issues. dput is an easy way to share a copy/pasteable subset of data, for example dput(your_data[1:5, 1:10]) would be a great way to share the first 5 rows and 10 columns of your data.
If you want to avoid warnings and -Inf results in the case where there are no 1s, use Ryan's suggestion from the comments:
most_consecutive_val = function(x, val = 1) {
with(rle(x), if(all(values != val)) 0 else max(lengths[values == val]))
}
Related
I have a data frame with an ID column that includes duplicates. There is a column called type that takes the values "S" or "N." There are two additional date columns - admission date and discharge date. My question is a bit similar to comparing two data frames and isolating rows based on certain date differences, but not quite. If needed, I could separate my data into two data frames, but I'm wondering if I can accomplish what I want without the extra steps.
Here is a small example of what the data for two patients looks like in R:
example <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
What I want to do is compare within patients, entries that take the value "N" and entries that take the value "S" in the type variable. Based on the discharge date for entries with the value "S," I would like to find entries with the value "N" that have an admission date within 5 days of the former's discharge date (the discharge date with value "S" should be before the admission date with value "N").
So in the example data frame, the only two entries that should be retained are rows 2 and 3 and not rows 5 and 6 since the difference between admission date and discharge date is greater than 5.
Does anyone have any suggestions of how I can filter this data? Any help is greatly appreciated.
This was an interesting challenge. One reason for this is because iterating over rows is less intuitive than iterating over columns (see this question for lots of suggestions: For each row in an R dataframe).
Now I know vectorized solutions are preferred over for loops, but one of the challenges with this problem was that instead of just performing functions on each row, we're comparing the iterated rows to other rows and deleting some rows as we go along. I expect there's a better solution out there and I hope someone posts a better solution to help me learn.
One minor note before I begin, "example" isn't a great name for an object because it's also a function in base R. Additionally, the solution is much easier if we're only dealing with alternating rows of "S" and "N" - that is if many S's precede an N then only the lowest S might be within 5 days of N. Nonetheless it was worth the effort to attack the more challenging case.
Ultimately I ended up solving this as a 2-stage problem, each solved with a for loop. First, I took out all the S rows which weren't within 5 days of the corresponding N rows. Then I took out those N rows which didn't have any appropriate S companions. All of this is implemented in base R.
So to begin:
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
example_df$admission_date<-as.numeric(as.Date(example_df$admission_date))
example_df$discharge_date<-as.numeric(as.Date(example_df$discharge_date))
The first thing I did was to take the date columns (which were characters) and convert them to numeric based on date. Originally I was doing mathematical operations with date objects, but this became complicated with the subsetting operations I ended up using.
Here's the first for loop:
del_vec <- vector("integer")
for( i in 1:nrow(example_df)) {
if (example_df[i,"type"]== "S") {
next
}
if (example_df[i,"type"] == "N") {
add_on <- which
(
example_df["type"] == "S" &
example_df["ID"]==example_df[i,"ID"] &
example_df["discharge_date"] < (example_df[i,"admission_date"] - 5)
)
}
del_vec<- append(del_vec,add_on)
}
example_df_new <- example_df[-c(del_vec),]
rownames(example_df_new) <- 1:nrow(example_df_new)
example_df_new
What I did here is start by creating a vector which will contain the row numbers that we delete. To get rid of the inappropriate S rows we need to actually work on the N rows, so I have the loop skip the S rows. Then when the loop encounters an N row, we find the rows which meet the following conditions:
have type S
have the same ID as the N row in question
have a discharge date which is more than 5 days from the admission date for the N row in question
Using which()captures the row numbers that meet these criteria. Now I add these rows to the empty vector and remove them from the original df. I also rename the rows of the new df to get the following output for example_df_new
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
3 52 16241 16251 N
So we've preserved the 2 rows you wanted to keep, but now we have this bottom row that we want to get rid of. I do this in the second loop which iterates over the rows in the new reduced df:
del_vec2 <- vector()
for(i in 1:nrow(example_df_new)) {
if (example_df_new[i,"type"]=="S") {
next
}
if (example_df_new[i,"type"] == "N") {
add_on_two <- which(example_df_new["type"] == "S" & example_df_new["ID"] == example_df_new[i,"ID"])
}
if(length(add_on_two !=0)) {
next
} else {
del_vec2 <- append(del_vec2,i)
}
}
example_df_3<-example_df_new[-c(del_vec2),]
example_df_3
Again, we tell the loop to skip the S rows — whichever ones made the first cut should stay in. Now when the loop encounters an N row we ask the loop to look for rows that meet the following criteria:
is type S
has the same ID as the N row in question
Again I use which() to save the positions of these rows. If these criteria are met then we skip ahead - we want to keep all the N's that have an appropriate S companion. If not then we add the row number of (i) - that is the row number for the N in question to our vector of rows that we want to delete.
We then delete those rows and end up with the desired output:
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
At this point you can change the date columns back to a date format.
Again, while this may be the first, I expect it's not the best solution. I hope to see an improved solution, but the problem is more tricky than it appears at first.
After attempting to filter within the same data frame, I decided to separate the data into two tables: one containing only data of type "S" and the other containing only data of type "N." Then, I did a full join while matching on the ID column. While this creates a greater number of rows than before, I was then able to compare the two date of interest. The resulting data frame contains only one row - the entry of a patient with an admission date with type "N" within 5 days of a discharge date with type "S."
The code in R is as follows:
library(dplyr)
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
N_only <- example_df %>%
filter(type == "N")
S_only <- example_df %>%
filter(type == "S")
example_df_merged <- merge(N_only, S_only, by = "ID")
example_df_merged$admission_date.x <- as.Date(as.character(example_df_merged$admission_date.x), format="%Y-%m-%d")
example_df_merged$discharge_date.y <- as.Date(as.character(example_df_merged$discharge_date.y), format="%Y-%m-%d")
example_df_merged$dateDiff <- example_df_merged$discharge_date.y - example_df_merged$admission_date.x
example_df_final <- example_df_merged %>%
filter(dateDiff <= 5 & dateDiff >= 0)
For clearer variable names, I would have changed the variables ending in ".x" and ".y," but that is not necessary.
I want to find a series of consecutive rows in a dataset where a condition is met the most often.
I have two columns that I can use for this; Either one with ones and zeros that alternate based on the presence or absence of a condition or a column which increments for the duration across which the desirable condition is present. I envision that I will need to use subset(),filter(), and/or rle() in order to make this happen but am at a loss as to how to get it to work.
In the example, I want to find 6 sequential rows that maximize the instances in which happens occurs.
Given the input:
library(data.frame)
df<-data.frame(time=c(1:10),happens=c(1,1,0,0,1,1,1,0,1,1),count=c(1,2,0,0,1,2,3,0,1,2))
I would like to see as the output the rows 5 through 10, inclusive, as the data subset output, using either the happens or count columns since this sequence of rows would yield the highest output of happens occurrences on 6 consecutive rows.
library(zoo)
which.max( rollapply( df$happens, 6, sum) )
#[1] 5
The fifth window of 6 rows apparently holds the maximum sum of df$happens
So the answer is row 5:10
Just wondering if there is a way to expand rows which have multiple observations, into rows of unique observations using R? I have data in an excel spreadsheet with the variable headings: Lease, Line, Bay, Date, Predators, Food.Index, DD, MM, YY.
On some dates, there have been multiple predators (from 1 to 4) recorded in the same row. Other days just have 0. On a day where there has been 4 predators recorded, I would like to somehow transform the data to show four unique observations (instead of one row with 4 recorded under "Predators").
I have 1669 rows of data and multiple rows need to be expanded
Example of Data set
Many thanks for your help in advance.
enter image description here
Assuming you have your data in a data.frame, df, one possible solution would be
df.expanded <- df[rep(row.names(df), df$Predators), ]
EDIT: If you also want to keep the rows with 0 predators, you can use pmax to always return at least one:
df.expanded <- df[rep(row.names(df), pmax(df$Predators, 1)),]
Here the pmax(df$Predators, 1) will return the elementwise maximum of df$Predators and 1 so that it returns a new vector where each element is at least 1 but takes the value of df$Predators if that number is greater than 1.
I'm importing a large dataset in R and curious if there's a way to quickly go through the columns and identify whether the column has categorical values, numeric, date, etc. When I use str(df) or class(df), the columns mostly come back mislabeled.
For example, some columns are labeled as numeric, but there are only 10 unique values in the column (ranging from 1-10), indicating that it should really be a factor. There are other columns that only have 11 unique values representing a rating, from 0-5 in 0.5 increments. Another column has country codes (172 values), which range from 1-230.
Is there a way to quickly identify if a column should be a factor without going through each of the columns to understand the nature of variable? (there are many columns in the dataset)
Thanks!
At the moment, I've been using variations of the following code to catch the first two cases:
as.numeric(df[,51]) #convert the column to numeric
len = length(unique(df[,51])) #find number of unique values
diff = max(df[,51]) - min(df[,51]) #calculate difference between min and max
ord = (len - 1) / diff # calculate the increment if equally spaced
#subtract the max value from second to max value to find the actual increment (only uses last two values)
step = sort(unique(df[,51]),partial=len)[len] -
sort(unique(df[,51]),partial=len-1)[len-1]
ord == step #check if the last increment equals the implied increment
However, this approach assumes that each of the variables are equally spaced (for example, incremented 0.5) and only tests the space between the last two values. This wouldn't catch a column that contains c(1,2,3.5,4.5,5,6) which has 6 unique values, but uneven spacing in the middle (not that this is common in my dataset).
It is not obvious how many distinct values would indicate a factor vs a numeric variable, but you can examine all variables to see what is in your data with
table(sapply(df, function(x) { length(unique(x))} ))
and if you decide that the boundary between factor and numeric is k you can identify the factors with
which(sapply(df, function(x) {length(unique(x)) < k}))
I´m analyzing absenteeism data for schools and am seeking a bit of help. For every day, I have 360 rows (classrooms) containing that day´s (column 1) number of absent (column 2) and non-absent students (column 3).
Some days (holidays) only have, say, 20 classes reporting, because the other 340 classes did not have class. I want to ELIMINATE those rows from my dataset. In other words, I want to eliminate
classes and want to eliminate from my dataset all entries in which the total number of entries for the date are less than a certain number. In other words, I want to eliminate all rows with date x if the total number of rows containg date x are less than 200.
Here´s what I´ve got so far:
for (i in c(min(df$date):max(df$date))){
b <- df[df$date == i,]
z <- as.vector(ifelse(nrow(b[which(b$date==i),]) <200, "FALSE", "TRUE"))
print(z)
df$newcolumn <- z
}
This prints z, which goes day-by-day telling me if that day meets my conditions, but I cannot figure out a way to incorporate z back into the dataframe´s 10000s of rows. Instead, my df$newcolumn is simply populated by all TRUEs.
Any help would be greatly appreciated.
Hard to do for real without a reproducible example, but doesn't something like df[ ! df$date %in% z, ] work?
%in% will return a logical vector of whether each element exists in the other vector, ! negates so it returns TRUE if it's >200, and the [ rowselector,] selects rows from the data.frame.