I want to find the series of consecutive rows in a dataset where a condition is met most often.
I have two columns I can use for this: one with ones and zeros that alternate based on the presence or absence of the condition, and one that increments for the duration across which the desirable condition is present. I envision that I will need subset(), filter(), and/or rle() to make this happen, but I am at a loss as to how to get it to work.
In the example, I want to find the 6 sequential rows that maximize the number of rows in which happens occurs.
Given the input:
df <- data.frame(time = 1:10,
                 happens = c(1, 1, 0, 0, 1, 1, 1, 0, 1, 1),
                 count   = c(1, 2, 0, 0, 1, 2, 3, 0, 1, 2))
I would like the output to be rows 5 through 10, inclusive, since this sequence of rows yields the highest number of happens occurrences across 6 consecutive rows. Either the happens or count column could be used to find it.
library(zoo)
which.max(rollapply(df$happens, 6, sum))
#[1] 5
The fifth window of 6 rows holds the maximum sum of df$happens, so the answer is rows 5:10.
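To pull out the winning rows themselves, the index returned by which.max can be reused (a small follow-up sketch; the window width of 6 comes from the question):
start <- which.max(rollapply(df$happens, 6, sum))  # first row of the best window
df[start:(start + 5), ]                            # rows 5 through 10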
I have a time-series/panel-data data frame with a specific ID in the first column and a weekly employment status in each remaining column: unemployed (1) or employed (0).
I have 261 variables (one per week across the years) and 1,000,000 observations.
For every row, I would like to count the maximum number of times '1' occurs consecutively in R.
I have looked a bit at rowSums and rle(), but I am not interested in the sum of the row, as it is very important that the values are consecutive.
(An example of the structure of my data set was attached as a screenshot; just imagine more rows and columns.)
We can write a little helper function that returns the maximum number of times a given value is consecutively repeated in a vector, with a default value of 1 for this use case:
most_consecutive_val <- function(x, val = 1) {
  # rle() encodes runs of repeats; keep the run lengths for val and take the longest
  with(rle(x), max(lengths[values == val]))
}
Then we can apply this function to the rows of your data frame, dropping the first column (and any other columns that shouldn't be included):
apply(your_data_frame[-1], MARGIN = 1, most_consecutive_val)
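For example, on a small made-up data frame (the id and w1..w6 columns are hypothetical):
toy <- data.frame(id = 1:3,
                  w1 = c(1, 0, 1), w2 = c(1, 1, 0), w3 = c(0, 1, 1),
                  w4 = c(1, 1, 0), w5 = c(1, 1, 1), w6 = c(1, 0, 0))
apply(toy[-1], MARGIN = 1, most_consecutive_val)
#[1] 3 4 1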
If you share some easily imported sample data, I'll be happy to help debug in case there are issues. dput is an easy way to share a copy/pasteable subset of data; for example, dput(your_data[1:5, 1:10]) would share the first 5 rows and 10 columns of your data.
If you want to avoid warnings and -Inf results in the case where there are no 1s, use Ryan's suggestion from the comments:
most_consecutive_val <- function(x, val = 1) {
  # if val never occurs, max() would see an empty vector; return 0 instead
  with(rle(x), if (all(values != val)) 0 else max(lengths[values == val]))
}
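With this version, a row that never hits 1 returns 0 rather than a warning and -Inf:
most_consecutive_val(c(0, 0, 0))
#[1] 0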
Just wondering if there is a way, using R, to expand rows that hold multiple observations into rows of unique observations? I have data in an Excel spreadsheet with the variable headings Lease, Line, Bay, Date, Predators, Food.Index, DD, MM, YY.
On some dates, multiple predators (from 1 to 4) were recorded in the same row; other days just have 0. Where a day has 4 predators recorded, I would like to transform the data to show four unique observations (instead of one row with 4 under "Predators").
I have 1669 rows of data, and multiple rows need to be expanded.
Example of the data set (attached as an image in the original post).
Many thanks for your help in advance.
Assuming you have your data in a data.frame, df, one possible solution would be
df.expanded <- df[rep(row.names(df), df$Predators), ]
EDIT: If you also want to keep the rows with 0 predators, you can use pmax to always return at least one:
df.expanded <- df[rep(row.names(df), pmax(df$Predators, 1)), ]
Here pmax(df$Predators, 1) returns the elementwise maximum of df$Predators and 1: a vector in which each element is at least 1 but takes the value of df$Predators whenever that number is greater.
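A quick made-up example (only Lease and Predators shown for brevity):
toy <- data.frame(Lease = c("A", "B", "C"), Predators = c(0, 1, 4))
toy[rep(row.names(toy), pmax(toy$Predators, 1)), ]
#    Lease Predators
#1       A         0
#2       B         1
#3       C         4
#3.1     C         4
#3.2     C         4
#3.3     C         4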
I am a new R user.
I have a data frame consisting of 50 columns and 300 rows. The first column holds the ID, while the 2nd through last columns hold standard deviations (sd) of traits; the pooled sd for each column is given in the last row. For each column, I want to remove all values more than ten times the pooled sd, and I want to do this in one run. So far, the script below is what I have come up with for testing whether a value is greater than the pooled sd. However, even the ID column (character) is being processed (resulting in all FALSE). If I use raw_sd_summary[-1], I have no way of knowing which ID on which trait meets the criterion I'm looking for.
logic_sd <- lapply(raw_sd_summary, function(x) x>tail(x,1) )
logic_sd_df <- as.data.frame(logic_sd)
What shall I do? And how can I extract all the values flagged TRUE (i.e. more than ten times the pooled sd), along with their corresponding IDs?
Your code already runs over the columns (lapply iterates a data.frame column by column), but it is missing the factor of ten, and lapply returns a list rather than keeping the tabular shape. Change it to
logic_sd <- apply(raw_sd_summary, 2, function(x) x > 10 * tail(x, 1))
This will give you a logical matrix flagging the values that are more than 10 times the last row (the pooled sd). You can recover the IDs by replacing the first column:
logic_sd[,1] <- raw_sd_summary[,1]
You can remove or replace the unwanted values in the original table directly with
raw_sd_summary[-300, -1][logic_sd[-300, -1]] <- NA  # or some other replacement value
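Putting it together on made-up data (the names and values are hypothetical; the ID column is dropped before the comparison so the character IDs don't coerce everything to character):
raw_sd_summary <- data.frame(ID = c("a", "b", "c", "d", "pooled"),
                             t1 = c(0.2, 5.0, 0.1, 0.3, 0.4),
                             t2 = c(0.5, 0.2, 9.0, 0.1, 0.6))
logic_sd <- apply(raw_sd_summary[-1], 2, function(x) x > 10 * tail(x, 1))
raw_sd_summary[-5, -1][logic_sd[-5, ]] <- NA  # 5.0 and 9.0 exceed 10 * pooled sd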
I would like to create a new column in my data frame that assigns a categorical value to each observation based on a condition.
In detail, I have a column that contains a timestamp for every observation, and the rows are ordered ascending by that timestamp.
Now, I'd like to calculate the difference between each pair of consecutive timestamps, and whenever it exceeds a certain threshold the factor should increase by 1 (see the desired output).
Desired output (shown as a screenshot in the original post).
I tried solving it with a for loop, but that takes a lot of time because the dataset is huge.
After searching for a bit I found this approach and tried to adapt it: R - How can I check if a value in a row is different from the value in the previous row?
ind <- with(df, c(TRUE, timestamp[-1L] > (timestamp[-length(timestamp)]-7200)))
However, I cannot make it work for my dataset.
Thanks for your help!
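For what it's worth, a minimal vectorized sketch (assuming timestamp is numeric seconds and the 7200 from the attempt above is the threshold): start a new group whenever the gap to the previous row exceeds the threshold, then take the cumulative sum.
df$group <- cumsum(c(TRUE, diff(df$timestamp) > 7200))  # group id grows by 1 at each large gap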
I have 34 subsets with a bunch of variables, and I am making a new data frame with summary information about each variable for the subsets.
- Example: A10, T2 and V2 are all subsets with ~10 variables and 14 observations, where one variable is population.
I want my new data frame to have a column which says how many times per subset variable 2 (population) hit zero.
I've looked at a bunch of different count functions, but they all seem to make separate tables and count the occurrences of every variable. I'm not interested in how many times each unique value shows up, since most of the values are unique; I just want to know how many times population hit zero in each subset of 14 observations.
I realize this is probably a simple thing to do but I'm not very good at creating my own solutions from other R code yet. Thanks for the help.
I've done something similar with a different dataset, where I counted how many times NA occurred in a vector whose other values were numerical. For that I used:
na.tmin <- c(sum(is.na(s1997$TMIN)), sum(is.na(s1998$TMIN)), sum(is.na(s1999$TMIN))...
This created a column (na.tmin) holding the number of times each subset recorded NA instead of a number. I'd now like to count the number of times the value 0 occurs, but is.0 is of course not a function, because 0 is numerical. Is there a function that will count the number of times a specific value shows up? If not, should I use the function that counts occurrences of unique values?
Perhaps:
sum(abs(s1997$TMIN) < 0.00000001)
It's safer to use a tolerance value unless you are sure that your values are integers. See R FAQ 7.31.
sum(abs(pi - (355/113 + seq(-0.001, 0.001, length = 1000))) < 0.00001)
#[1] 10
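Applied to the setting above (s1997$TMIN is the column from the question; na.rm = TRUE guards against the NAs mentioned earlier):
sum(abs(s1997$TMIN) < 0.00000001, na.rm = TRUE)  # counts values that are numerically zero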