In R: help using rle() function in dataframe

I am trying to find the number of consecutive runs of '1' values from a dataframe of over 1M obs. of 11 binary variables. I have looked at a number of similar questions on here, but none deal with lengthy dataframes like mine.
I can find the consecutive runs of '1's individually row-by-row, but I'm looking for a solution that can deal with my entire dataframe a bit more elegantly.
Simple example data:
test <- data.frame(v1=c(1,0,1),v2=c(1,1,1),v3=c(0,1,1),v4=c(1,1,0),v5=c(1,1,1))
test
vtest <- unlist(test[1,]) # as.vector() on a data frame row leaves a list; rle() needs an atomic vector
vtest
r <- rle(vtest)
r$lengths[r$values == 1]
row1_max <- max(r$lengths[r$values == 1])
row1_max
What's the best way for me to find the max consecutive runs of '1' for each row of my dataframe without having to find each one individually by row?
My real dataset also contains an ID# variable that identifies each record uniquely, and I ultimately want to know the max consecutive runs by ID#, so any additional help there would be much appreciated.
Thanks in advance!

You can use apply to apply a function to each row of your data frame:
apply(test, 1, function(x) {
  r <- rle(x)                          # run-length encode the row
  max(r$lengths[as.logical(r$values)]) # longest run where the value is 1
})
This returns the maximum number of consecutive 1s per row:
[1] 2 4 3
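For the by-ID# part of the question, a minimal sketch, assuming the real data has an ID column (called ID here, hypothetically) alongside the 11 binary columns; since each ID labels one row, it is enough to attach the row-wise maxima to the IDs:
test$ID <- c("a", "b", "c")               # stand-in IDs
binary_cols <- setdiff(names(test), "ID") # drop the ID column before rle()
max_runs <- apply(test[binary_cols], 1, function(x) {
  r <- rle(x)
  max(r$lengths[r$values == 1])           # a row with no 1s would give -Inf with a warning
})
data.frame(ID = test$ID, max_run = max_runs)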

I would use a combination of the apply family
library(dplyr)
apply(test, 1, rle) %>% lapply(function(x) x$lengths[x$values == 1]) %>% vapply(max, numeric(1))
[1] 2 4 3

I'm assuming your df is tidy and that the binaries are in columns
set.seed(1)
event <- sample(1:3,365*3,replace=TRUE) # proxy for one of your columns
runs <- rle(event)
sum(runs$lengths >= 6 & runs$values == 1)
[1] 2
I'm currently working on finding the row numbers where the sequences of 6 or longer begin.
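One possible sketch for that last step: cumsum() of the run lengths gives each run's end position, so its start is the end minus the length plus one.
ends <- cumsum(runs$lengths)      # position where each run ends
starts <- ends - runs$lengths + 1 # so each run starts here
starts[runs$lengths >= 6 & runs$values == 1]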

Related

Comparing dates in different columns to isolate certain within-group entries in R

I have a data frame with an ID column that includes duplicates. There is a column called type that takes the values "S" or "N." There are two additional date columns - admission date and discharge date. My question is a bit similar to comparing two data frames and isolating rows based on certain date differences, but not quite. If needed, I could separate my data into two data frames, but I'm wondering if I can accomplish what I want without the extra steps.
Here is a small example of what the data for two patients looks like in R:
example <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
What I want to do is compare, within patients, entries that take the value "N" and entries that take the value "S" in the type variable. Based on the discharge date for entries with the value "S," I would like to find entries with the value "N" that have an admission date within 5 days after that discharge date (the "S" discharge date should come before the "N" admission date).
So in the example data frame, the only two entries that should be retained are rows 2 and 3, and not rows 5 and 6, since there the difference between the "N" admission date and the "S" discharge date is greater than 5 days.
Does anyone have any suggestions of how I can filter this data? Any help is greatly appreciated.
This was an interesting challenge. One reason for this is because iterating over rows is less intuitive than iterating over columns (see this question for lots of suggestions: For each row in an R dataframe).
Now I know vectorized solutions are preferred over for loops, but one of the challenges with this problem was that instead of just performing functions on each row, we're comparing the iterated rows to other rows and deleting some rows as we go along. I expect there's a better solution out there and I hope someone posts a better solution to help me learn.
One minor note before I begin: "example" isn't a great name for an object because it's also a function in base R. Additionally, the solution is much easier if we're only dealing with alternating rows of "S" and "N" - that is, if several S's precede an N, only the last of those S's might be within 5 days of the N. Nonetheless it was worth the effort to attack the more challenging case.
Ultimately I ended up solving this as a 2-stage problem, each solved with a for loop. First, I took out all the S rows which weren't within 5 days of the corresponding N rows. Then I took out those N rows which didn't have any appropriate S companions. All of this is implemented in base R.
So to begin:
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
example_df$admission_date <- as.numeric(as.Date(example_df$admission_date))
example_df$discharge_date <- as.numeric(as.Date(example_df$discharge_date))
The first thing I did was to take the date columns (which were characters) and convert them to numeric based on date. Originally I was doing mathematical operations with date objects, but this became complicated with the subsetting operations I ended up using.
Here's the first for loop:
del_vec <- vector("integer")
for (i in 1:nrow(example_df)) {
  if (example_df[i, "type"] == "S") {
    next
  }
  if (example_df[i, "type"] == "N") {
    add_on <- which(
      example_df["type"] == "S" &
      example_df["ID"] == example_df[i, "ID"] &
      example_df["discharge_date"] < (example_df[i, "admission_date"] - 5)
    )
    del_vec <- append(del_vec, add_on)
  }
}
example_df_new <- example_df[-c(del_vec),]
rownames(example_df_new) <- 1:nrow(example_df_new)
example_df_new
What I did here is start by creating a vector which will contain the row numbers that we delete. To get rid of the inappropriate S rows we need to actually work on the N rows, so I have the loop skip the S rows. Then when the loop encounters an N row, we find the rows which meet the following conditions:
have type S
have the same ID as the N row in question
have a discharge date more than 5 days before the admission date of the N row in question
Using which() captures the row numbers that meet these criteria. Now I add these rows to the initially empty vector and remove them from the original df. I also rename the rows of the new df to get the following output for example_df_new
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
3 52 16241 16251 N
So we've preserved the 2 rows you wanted to keep, but now we have this bottom row that we want to get rid of. I do this in the second loop which iterates over the rows in the new reduced df:
del_vec2 <- vector("integer")
for (i in 1:nrow(example_df_new)) {
  if (example_df_new[i, "type"] == "S") {
    next
  }
  if (example_df_new[i, "type"] == "N") {
    add_on_two <- which(
      example_df_new["type"] == "S" &
      example_df_new["ID"] == example_df_new[i, "ID"]
    )
    if (length(add_on_two) != 0) {
      next
    } else {
      del_vec2 <- append(del_vec2, i)
    }
  }
}
example_df_3<-example_df_new[-c(del_vec2),]
example_df_3
Again, we tell the loop to skip the S rows; whichever ones made the first cut should stay in. Now when the loop encounters an N row, we ask it to look for rows that meet the following criteria:
have type S
have the same ID as the N row in question
Again I use which() to save the positions of these rows. If such rows exist, we skip ahead: we want to keep every N that has an appropriate S companion. If not, we add i, the row number of the N in question, to our vector of rows to delete.
We then delete those rows and end up with the desired output:
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
At this point you can change the date columns back to a date format.
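A minimal sketch of that conversion (as.Date() needs the origin that the numeric day counts are measured from):
# convert the numeric day counts back to Date objects
example_df_3$admission_date <- as.Date(example_df_3$admission_date, origin = "1970-01-01")
example_df_3$discharge_date <- as.Date(example_df_3$discharge_date, origin = "1970-01-01")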
Again, while this may be the first solution, I expect it's not the best one. I hope to see an improved solution, but the problem is trickier than it appears at first.
After attempting to filter within the same data frame, I decided to separate the data into two tables: one containing only rows of type "S" and the other containing only rows of type "N." Then I joined the two tables, matching on the ID column. While this creates a greater number of rows than before, I was then able to compare the two dates of interest. The resulting data frame contains only one row: the entry of a patient with a type "N" admission date within 5 days of a type "S" discharge date.
The code in R is as follows:
library(dplyr)
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
N_only <- example_df %>%
filter(type == "N")
S_only <- example_df %>%
filter(type == "S")
example_df_merged <- merge(N_only, S_only, by = "ID")
example_df_merged$admission_date.x <- as.Date(as.character(example_df_merged$admission_date.x), format="%Y-%m-%d")
example_df_merged$discharge_date.y <- as.Date(as.character(example_df_merged$discharge_date.y), format="%Y-%m-%d")
example_df_merged$dateDiff <- example_df_merged$admission_date.x - example_df_merged$discharge_date.y # days from the "S" discharge to the "N" admission
example_df_final <- example_df_merged %>%
filter(dateDiff <= 5 & dateDiff >= 0)
For clearer variable names, I would have renamed the columns ending in ".x" and ".y," but that is not necessary.
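As a possible refinement (assuming dplyr is already loaded, as above), inner_join() lets you choose those suffixes directly; the "_N" and "_S" labels here are hypothetical:
example_df_merged <- inner_join(N_only, S_only, by = "ID", suffix = c("_N", "_S"))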

Extracting minor allele counts in each row in R

Trying to extract the minor allele counts from a set of three columns. The counts are just the number of times each allele is seen in each row. I need to extract the lowest number without reporting 0. Some rows have a 0 in one of the columns, which is not wanted in the final minor count. When two columns tie for the lowest value, the minor count should be that shared value.
I have tried having multiple lines of if (true) statements but this is cumbersome and does not solve the issues fully because of the combination of different scenarios.
set.seed(100)
df <- data.frame(sample(0:100, 50), sample(0:100, 50), sample(0:100, 50))
names(df) <- c("nAA", "nAa", "naa")
# Expected minor count output for the first rows:
# row 1 -> 31, row 2 -> 19, row 3 -> 4
I expect a fourth column with the minor count for each row.
You can use apply: x[x > 0] selects the counts larger than 0, min() takes the minimum of those, and which() gets the column where that minimum sits:
apply(df, 1, function(x) min(x[x > 0]))             # will give you the minimum
apply(df, 1, function(x) which(x == min(x[x > 0]))) # will give you the column of the minimum
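To get the fourth column the question asks for, the row-wise result can be assigned directly; the column name minor_count is hypothetical:
df$minor_count <- apply(df, 1, function(x) min(x[x > 0]))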
You can do it with this code. The function pmin gives you the parallel minimum of a set of vectors (in this case, the 3 variables in your data frame); wrapping each column in na_if() turns the zeros into NA so they are never reported as the minimum.
library(dplyr)
mutate(df, min = pmin(na_if(nAA, 0), na_if(nAa, 0), na_if(naa, 0), na.rm = TRUE))

Given large data.table, use binary search to find the correct row based on the first two columns and then add 1 to third column

I have a dataframe with 3 columns. First two columns are IDs (ID1 and ID2) referring to the same item and the third column is a count of how many times items with these two IDs appear. The dataframe has many rows so I want to use binary search to first find the appropriate row where both IDs match and then add 1 to the cell under the count column in that row.
I have used the which() function to find the index of the correct row and then using the index added 1 to the count column.
For example:
index <- which(DF$ID1 == x & DF$ID2 == y)
DF$Count[index] <- DF$Count[index] + 1
While this works, the which function is very inefficient: because I have to do this within a for loop more than a trillion times, it takes a lot of time. Also, there is only one row in the data frame with each ID combination, so while which goes through all the rows, a function that stops once it finds the correct row should suffice. I have looked into using data.table and setkey for this purpose but do not know how to implement that for my purpose. Thank you in advance.
Indeed you can use data.table and setkeyv (setkey(DF, ID1, ID2) would work too; setkeyv takes the key columns as a character vector, which is convenient when the names are held programmatically)
library(data.table)
DF <- data.frame(ID1 = sample(1:100, 100000, replace = TRUE),
                 ID2 = sample(1:100, 100000, replace = TRUE),
                 Count = 0L)
# convert DF to a data.table
DF <- as.data.table(DF)
# set both ID1 and ID2 as the key, in that order
setkeyv(DF, c("ID1", "ID2"))
# random x and y values
x <- 10
y <- 18
# keyed lookup (a binary search) for ID1 = x and ID2 = y; add 1 to Count by reference
DF[.(x, y), Count := Count + 1]
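For the loop the question describes, a hedged sketch: either repeat the keyed update per (ID1, ID2) pair, or, faster, tally the pairs first and apply a single keyed join-update (the pairs table below is hypothetical):
# hypothetical stream of (ID1, ID2) pairs to count
pairs <- data.table(ID1 = c(10, 10, 25), ID2 = c(18, 18, 40))
# aggregate the pairs, then add the tallies in one join-update
updates <- pairs[, .N, by = .(ID1, ID2)]
DF[updates, Count := Count + N]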

r - difficulty apply()ing a filter() to a data.frame

Good Morning,
I am having some difficulty finding a solution to this:
My data frame has several thousand entities (columns) ranked (values) over time (rows):
Year Entity1 Entity2 ... EntityN
2001 1302 36 ... 1
2002 2 576 ... 1101
I am trying to find a way to output a new data frame with only the top 3 ranked Entities for a year, if possible, with the Entity names as values and the ranks as column names.
I have been playing with something like this, but to no avail:
library(dplyr)
newdf <- apply(mydata, 1, function(x) filter(x, values > 3))
If anyone has any insight, it would be very welcome!
Adding my comment as an answer:
dplyr::filter takes a data.frame as input, but apply passes each row on as a plain vector. Assuming rank 1 is the best rank, try (the [-1] drops the Year column so it isn't treated as a rank):
apply(mydata[-1], 1, function(x) names(x)[x <= 3])
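To also get the output shape described in the question (Entity names as values, ranks as column names), a possible sketch, assuming each of the ranks 1 to 3 appears exactly once per year:
# order each row's ranks ascending and keep the names of the three best
top3 <- t(apply(mydata[-1], 1, function(x) names(sort(x))[1:3]))
colnames(top3) <- c("rank1", "rank2", "rank3") # hypothetical column names
data.frame(Year = mydata$Year, top3)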

efficiently match values and average column where TRUE

Having trouble efficiently matching values and taking the average of a column where those values match in R. Essentially I have a chess table that I have pulled data out of and want to get, for each player, the average rating of the opponents they were paired against.
If I have a dataframe:
number <- c(1:10) #number assigned to each player
rating <- c(1000,1200,1210,980,1000,1001,1100,1300,1100,1250) #rating of the player
df <- data.frame(number= number, rating = rating)
p1_games <- c(1,2,3,4,5) # player 1 played against players 2,3,4,5
What I essentially want to do is check whether the values in p1_games match a number in the table and, where they match, average the corresponding values in the rating column.
I just want to return one value and so I've had trouble trying to make ifelse() work:
avg_rate <- ifelse(p1_games %in% df$number, sum(df$rating)/length(p1_games)) #not working
I would like to avoid looping if possible, but if there's no other efficient way that's fine. I just can't figure out what's up here. Ideally I'd like to apply this logic over many p*_games vectors.
If p1_games is in df$number, sum each corresponding rating and divide by the number of ratings. So the output for p1_games in this case would be 1078. I feel like this is really simple but can't quite make this work.
%in% is great at this kind of thing
> mean(df[df$number %in% p1_games, "rating"])
[1] 1078
An alternate answer using data.table, which may be of use with larger data sets (although since p1_games isn't a column, I'm not sure):
> setDT(df)
> df[number %in% p1_games, mean(rating)]
[1] 1078
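For the "many p*_games vectors" part, one sketch is to collect the vectors in a named list and sapply over it (the p2 entry below is hypothetical):
games <- list(p1 = p1_games, p2 = c(1, 3, 6, 9))
sapply(games, function(g) mean(df$rating[df$number %in% g]))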
