Delete following observations when goal has been reached - r

Given the dataframe:
df = data.frame(
ID = c(1,1,1,1,2,3,3),
Start = c(0,8,150,200,6,7,60),
Stop = c(5,60,170,210,NA,45,80))
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
3 1 150 170 1
4 1 200 210 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
For each ID, I would like to keep all rows until Start[i+1] - Stop[i] >= 28, and then delete the following observations of that ID.
In this example, the output should be:
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
I ended up setting the NAs to a value that is easy to identify later and using the following code:
df$Stop[is.na(df$Stop)] = 10000
df$diff <- df$Start-c(0,df$Stop[1:length(df$Stop)-1])
space <- with(df, unique(ID[diff<28]))
df2 <- subset(df, (ID %in% space & diff < 28) | !ID %in% space)

Using data.table...
library(data.table)
setDT(df)
df[,{
w = which( shift(Start,type="lead") - Stop >= 28 )
if (length(w)) .SD[seq(w[1])] else .SD
}, by=ID]
# ID Start Stop
# 1: 1 0 5
# 2: 1 8 60
# 3: 2 6 NA
# 4: 3 7 45
# 5: 3 60 80
.SD is the Subset of Data associated with each by=ID group.
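For comparison, the same per-ID truncation can be sketched in base R. This is only an illustration, assuming the rows are already ordered within each ID; keep_until_gap is a hypothetical helper name, not a function from any package.

```r
df <- data.frame(
  ID    = c(1, 1, 1, 1, 2, 3, 3),
  Start = c(0, 8, 150, 200, 6, 7, 60),
  Stop  = c(5, 60, 170, 210, NA, 45, 80)
)

# Hypothetical helper: keep the rows of one ID up to and including the
# first row i for which Start[i + 1] - Stop[i] >= gap.
keep_until_gap <- function(d, gap = 28) {
  n <- nrow(d)
  if (n < 2) return(d)                          # nothing to compare
  w <- which(d$Start[-1] - d$Stop[-n] >= gap)   # which() ignores NA results
  if (length(w)) d[seq_len(w[1]), ] else d
}

res <- do.call(rbind, lapply(split(df, df$ID), keep_until_gap))
rownames(res) <- NULL
res
```

Like the data.table version, a group with a single row (ID 2 here) is returned unchanged, so the NA in Stop needs no special placeholder value.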

Create a diff column.
df$diff <- df$Start - c(0, head(df$Stop, -1))  # previous row's Stop; 0 for the first row
Subset on the basis of this column
df[df$diff<28,]
PS: I have converted 'NA' to 0. You would have to handle that anyway.

p <- which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28)
df <- df[p,]
This assumes you want to keep the entries where the next entry's Start is higher than the given entry's Stop by 28 or more.
The result is:
> p
[1] 2 3
> df[p,]
ID Start Stop
2 1 8 60
3 1 150 170
Start in row 3 (i + 1 = 3) is higher than Stop in row 2 (i = 2) by 90.
Or, if by until you mean the reverse condition, then
df <- df[which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] < 28),]
Inclusion of NA in your data frame got me thinking. You have to be very careful how you word your condition. If you want to keep all the cases where difference between next start and stop is less than 28, then the above statement will do.
However, if you want to keep all cases EXCEPT when the difference is 28 or more, then you should use:
p <- which((df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28))
rp <- which((!is.element(1:nrow(df),p)))
df <- df[rp,]
As it will include the unknown difference.
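The two wordings differ only in how which() treats NA. A minimal sketch, using the consecutive Start/Stop differences of the example data as a plain vector (the name d is just for illustration):

```r
# Start[i + 1] - Stop[i] for the seven example rows
d <- c(3, 90, 30, -204, NA, 15)

# "keep where the difference is < 28": the NA comparison is dropped
keep_lt <- which(d < 28)

# "keep everything except where the difference is >= 28": the NA stays
keep_not_ge <- setdiff(seq_along(d), which(d >= 28))

keep_lt       # 1 4 6
keep_not_ge   # 1 4 5 6
```

Position 5, where the difference is unknown, only survives under the second wording.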

Related

Mapping 2 unrelated data frames in R

I need to use data from a dataframe A to fill a column in my dataframe B.
Here is a subset of dataframe A:
> dfA <- data.frame(Family=c('A','A','A','B','B'), Count=c(1,2,3,1,2), Start=c(0,10,35,0,5), End=c(10,35,50,5,25))
> dfA
Family Count Start End
1 A 1 0 10
2 A 2 10 35
3 A 3 35 50
4 B 1 0 5
5 B 2 5 25
and a subset of dataframe B
> dfB <- data.frame(Family=c('A','A','A','B','B'), Start=c(1,4,36,2,10), End=c(3,6,40,4,24), BelongToCount=c(NA,NA,NA,NA,NA))
> dfB
Family Start End BelongToCount
1 A 1 3 NA
2 A 4 6 NA
3 A 36 40 NA
4 B 2 4 NA
5 B 10 24 NA
What I want to do is to fill in the BelongToCount column in B according to the data from dataframe A, which would end up with dataframe B filled as:
Family Start End BelongToCount
A 1 3 1
A 4 6 1
A 36 40 3
B 2 4 1
B 10 24 2
I need to do this for each family (so grouping by family), and the condition to fill the BelongToCount column is that if B$Start >= A$Start && B$End <= A$End.
I can't seem to find a clean (and fast) way to do this in R.
Right now, I am doing as follows:
split_A <- split(dfA, dfA$Family)
split_A_FamilyA <- split_A[["A"]]
split_B <- split(dfB, dfB$Family)
split_B_FamilyA <- split_B[["A"]]
for (i in seq_len(nrow(split_B_FamilyA))) {
  row <- split_B_FamilyA[i, ]
  start <- row$Start  # the columns are named Start/End, not dStart/dEnd
  end <- row$End
  for (j in seq_len(nrow(split_A_FamilyA))) {
    row_base <- split_A_FamilyA[j, ]
    start_base <- row_base$Start
    end_base <- row_base$End
    if ((start >= start_base) && (end <= end_base)) {
      # assign into row i of the family-A subset of B
      split_B_FamilyA[i, "BelongToCount"] <- row_base$Count
      break
    }
  }
}
I admit this is a very bad way of handling the problem (and it is awfully slow). I usually use dplyr when it comes to applying operations on specific groups, but I can't find a way to do such a thing using it. Joining the tables does not make a lot of sense either because the number of rows don't match.
Can someone point me to any relevant R function or an efficient way of solving this problem in R?
You can do this with a non-equi join in data.table:
library(data.table)
setDT(dfB)
setDT(dfA)
set(dfB, j='BelongToCount', value = as.numeric(dfB$BelongToCount))
dfB[dfA, BelongToCount := Count, on = .(Family, Start >= Start, End <= End)]
# Family Start End BelongToCount
# 1: A 1 3 1
# 2: A 4 6 1
# 3: A 36 40 3
# 4: B 2 4 1
# 5: B 10 24 2
In case a row in dfB is contained in multiple rows of dfA:
dfA2 <- rbind(dfA, dfA)
dfA2[dfB, .(BelongToCount = sum(Count)),
on = .(Family, Start <= Start, End >= End), by = .EACHI]
# Family Start End BelongToCount
# 1: A 1 3 2
# 2: A 4 6 2
# 3: A 36 40 6
# 4: B 2 4 2
# 5: B 10 24 4
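If data.table is not available, the same containment test can be sketched in base R by joining on Family only and filtering afterwards. This is an illustration rather than a replacement for the non-equi join: it builds every A/B pair per family first, so it will use much more memory on large tables. The helper column row and the names m and hit are assumptions made for this sketch.

```r
dfA <- data.frame(Family = c("A", "A", "A", "B", "B"),
                  Count  = c(1, 2, 3, 1, 2),
                  Start  = c(0, 10, 35, 0, 5),
                  End    = c(10, 35, 50, 5, 25))
dfB <- data.frame(Family = c("A", "A", "A", "B", "B"),
                  Start  = c(1, 4, 36, 2, 10),
                  End    = c(3, 6, 40, 4, 24))

dfB$row <- seq_len(nrow(dfB))   # remember the original row of dfB

# all A/B combinations within a family
m <- merge(dfB, dfA, by = "Family", suffixes = c(".B", ".A"))

# keep pairs where the B interval lies inside the A interval
hit <- m[m$Start.B >= m$Start.A & m$End.B <= m$End.A, ]

# match() takes the first containing interval for each B row
dfB$BelongToCount <- hit$Count[match(dfB$row, hit$row)]
dfB$row <- NULL
dfB
```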

Filling (NA values) in the column based on its previous records and another column (with interval) in R

I want to fill the action column based on its own records and the time column. An NA in the action column should be filled based on the previous action record and a time interval. Let's say we set the time interval to 10, which means that if action is A and time is 1, all NAs in action should be A until time==11 (1+10).
Please note that if action or ID change, this process should be reset. For example (in row 3) we have B with time==11, I want to fill the next NAs with B until time==21, but we have C in time==16, so we continue filling NA with C until time==26.
df<-read.table(text="
id action time
1 A 1
1 NA 4
1 NA 9
1 B 11
1 NA 12
1 C 16
1 NA 19
1 NA 30
1 A 31
1 NA 32
2 NA 1
2 A 2
2 NA 6",header=T,stringsAsFactors = F)
Desired Result:
id action time
1 A 1
1 A 4
1 A 9
1 B 11
1 B 12
1 C 16
1 C 19
1 NA 30
1 A 31
1 A 32
2 NA 1
2 A 2
2 A 6
We can extract the non-NA rows to use as a reference for filling in values, then iterate through the data set and conditionally replace values based on if they meet the requirements of id and the time interval.
library(dplyr)
# Use row numbers as an index (unique Id)
df$idx <- 1:nrow(df)
# Find the non-NA rows to use as a reference for imputation
idx <- df %>%
  filter(!is.na(action))
The temporary data set idx is used as the reference and the column idx is our unique identifier. Let's first look at the logic for finding and filling in the missing values without worrying about the time interval, so that it's easier to read and understand:
# Ignoring the 'interval' limitation, we'd fill them in like this:
for(r in 1:nrow(df)){
if(is.na(df$action[r])){
df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
}
}
If you're running this example code make sure you re-create df and idx before proceeding, since it would be modified by that last example code block.
The time interval requires us to do a logical test on the value of time and also another test to avoid trying to conduct the time comparison on NA values:
# Accounting for the max interval:
interval <- 10
for(r in 1:nrow(df)){
if(is.na(df$action[r])){
if(!is.na(dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]))){
if(dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]) + interval >= df$time[r])
df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
}
}
}
df
This gives us:
id action time idx
1 1 A 1 1
2 1 A 4 2
3 1 A 9 3
4 1 B 11 4
5 1 B 12 5
6 1 C 16 6
7 1 C 19 7
8 1 <NA> 30 8
9 1 A 31 9
10 1 A 32 10
11 2 <NA> 1 11
12 2 A 2 12
13 2 A 6 13
which matches your desired output.
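The row-by-row loop can also be written without dplyr by tracking, for each row, the position of the most recent non-NA action. The following is a sketch under the assumption that rows are ordered by time within each id; fill_within and the column name filled are hypothetical names chosen for this example.

```r
df <- read.table(text = "
id action time
1 A 1
1 NA 4
1 NA 9
1 B 11
1 NA 12
1 C 16
1 NA 19
1 NA 30
1 A 31
1 NA 32
2 NA 1
2 A 2
2 NA 6", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: fill NA with the last observed action if that
# observation is at most `interval` time units old.
fill_within <- function(action, time, interval = 10) {
  # position of the most recent non-NA action (0 if none seen yet)
  last_idx <- cummax(ifelse(is.na(action), 0L, seq_along(action)))
  ref <- pmax(last_idx, 1L)                      # safe index for subsetting
  ok <- last_idx > 0 & (time - time[ref]) <= interval
  ifelse(ok, action[ref], NA)
}

df$filled <- NA_character_
for (g in unique(df$id)) {
  rows <- df$id == g
  df$filled[rows] <- fill_within(df$action[rows], df$time[rows])
}
df
```

Because a new non-NA action becomes the reference as soon as it appears, the clock restarts at B (time 11) and again at C (time 16), as the question requires.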

Subset specific row and last row from data frame

I have a data frame which contains data relating to a score of different events. There can be a number of scoring events for one game. What I would like to do, is to subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID regardless of whether the score goes above 5 or below -5 during the event( this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5, I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
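To see why cumsum(rle(...)$lengths) picks the last row of each ID, here is a small sketch on the ID column alone. It assumes, as in the example data, that all rows of an ID are contiguous; the vector ids simply mirrors the ID column above.

```r
ids <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3)

runs <- rle(ids)
runs$lengths                       # 6 7 3: number of rows per run of equal IDs
last_rows <- cumsum(runs$lengths)
last_rows                          # 6 13 16: position where each run ends
```

If the same ID could appear in two separate runs, each run would contribute its own "last row", so sort by ID first in that case.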
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11

How to use apply function once for each unique factor value

I'm trying out some commands on the built-in R dataset ChickWeight. The data looks as follows.
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 171 18 1 1
11 199 20 1 1
12 205 21 1 1
13 40 0 2 1
14 49 2 2 1
15 58 4 2 1
Now what I would like to do is to simply output the difference between the chicken-weight for the "Chick" column for time 0 and 21 (last time value). I.e the weight the chick has put on.
I've been trying tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1]). But this of course applies the value to all rows.
How do I make it so that it applies only once for each unique Chick-value?
If we need a single value per each 'factor' column (assuming that 'Chick', and 'Diet' are the factor columns)
library(data.table)
setDT(df1)[, list(Diff= abs(weight[Time==21]-weight[Time==0])) ,.(Chick, Diet)]
and if we need to create a column
setDT(df1)[, Diff:= abs(weight[Time==21]-weight[Time==0]) ,.(Chick, Diet)]
I noticed that in the example Time == 21 is not present for Chick 2. In that case we may want to return the single available value instead:
setDT(df1)[, {tmp <- Time %in% c(0,21)
list(Diff= if(sum(tmp)>1) abs(diff(weight[tmp])) else weight[tmp]) } ,
by = .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 40
If we are taking the difference of 'weight' based on the max and min 'Time' for each group
setDT(df1)[, list(Diff=weight[which.max(Time)]-
weight[which.min(Time)]), .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 18
Also, if the 'Time' is ordered
setDT(df1)[, list(Diff= abs(diff(weight[c(1L,.N)]))), by =.(Chick, Diet)]
Using by from base R
by(df1[1:2], df1[3:4], FUN= function(x) with(x,
abs(weight[which.max(Time)]-weight[which.min(Time)])))
#Chick: 1
#Diet: 1
#[1] 163
#------------------------------------------------------------
#Chick: 2
#Diet: 1
#[1] 18
Here's a solution using dplyr:
library(dplyr)
ChickWeight %>%
group_by(Chick = as.numeric(as.character(Chick))) %>%
summarise(weight_gain = last(weight) - first(weight), final_time = last(Time))
(first and last as suggested by @ulfelder.)
Note that ChickWeight$Chick is an ordered factor so without coercing it into numeric the final order looks odd.
Using base R:
ChickWeight$Chick <- as.numeric(as.character(ChickWeight$Chick))
tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1])
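A variant that does not rely on the rows being sorted by Time is sketched below. It uses only base R and the built-in ChickWeight data; the name gain is an assumption for this example, and chicks whose measurements stop before day 21 simply get the gain up to their last recorded time.

```r
# one data frame per chick
per_chick <- split(ChickWeight, ChickWeight$Chick, drop = TRUE)

# weight at the latest time minus weight at the earliest time
gain <- sapply(per_chick, function(d) {
  d$weight[which.max(d$Time)] - d$weight[which.min(d$Time)]
})

gain[["1"]]   # gain for chick 1
```

The result is a named vector with exactly one value per chick, indexed by the chick's label.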

Conditional calculation of means of different columns in data.table with R

A previous question (linked below the data) discussed how to calculate the means and medians of vector t for each value of vector y (from 1 to 4) where x=1 and z=1, using the aggregate function in R.
x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86
Multiple aggregation in R with 4 parameters
But how can I calculate (mean(y)+mean(z))/(mean(z)-mean(t)) for each value (from 1 to 5) of vector x? The calculation should skip values that are 0 or NA in any vector. For example, in vector y the 3rd value is 0, so the 3rd number in every vector (y,z,t) should not be used. And in the result the third row (for x=3) should be NA.
Here is the code for calculating the means of y, z and t; it still needs the formula for (mean(y)+mean(z))/(mean(z)-mean(t)):
data <- data.table(dataframe)
bar <- data[,.N,by=x]
foo <- data[ ,list(mean.y =mean(y, na.rm = T),
mean.z=mean(z, na.rm = T),
mean.t=mean(t,na.rm = T)),
by=x]
In this code for calculating means all rows are used, but for calculating (mean(y)+mean(z))/(mean(z)-mean(t)), any row where y or z or t equal to zero or NA should not be used.
Update:
Oh, this can be further simplified, as data.table doesn't subset NA by default (especially with such cases in mind, similar to base::subset). So, you just have to do:
dt[y != 0 & z != 0 & t != 0,
list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
FWIW, here's how I'd do it in data.table:
dt[(y | NA) & (z | NA) & (t | NA),
list(ans=(mean(y)+mean(z))/(mean(z)-mean(t))), by=x]
# x ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278
Let's break it down with the general syntax: dt[i, j, by]:
In i, we filter out for your conditions using a nice little hack TRUE | NA = TRUE and FALSE | NA = NA and NA | NA = NA (you can test these out in your R session).
Since you say you need only the non-zero non-NA values, it's just a matter of |ing each column with NA - which'll return TRUE only for your condition. That settles the subset by condition part.
Then for each group in by, we aggregate according to your function, in j, to get the result.
HTH
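The three-valued logic behind the | NA trick can be checked directly in an R session. A minimal sketch (the vector y is made up for illustration):

```r
TRUE | NA    # TRUE
FALSE | NA   # NA
NA | NA      # NA

# non-zero numbers coerce to TRUE, zero to FALSE
y <- c(1, 0, NA, 3)
mask <- y | NA
mask         # TRUE NA NA TRUE
```

Only the non-zero, non-NA entries come out as TRUE. data.table treats NA in i as "do not select", which is why the expression works as a filter there.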
Here's one solution:
# create your sample data frame
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)
library('dplyr')
dfmeans <- df %>%
filter(!is.na(y) & !is.na(z) & !is.na(t)) %>% # remove rows with NAs
filter(y != 0 & z != 0 & t != 0) %>% # remove rows with zeroes
group_by(x) %>%
summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))
I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
# x xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278
And if you just want the values from xmeans use dfmeans$xmeans.
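One possibly simpler way to drop the rows with NAs and zeroes, sketched in base R: complete.cases() handles the NAs, and combining it with the zero checks through & turns any remaining NA comparison into FALSE. The names ok and ans are assumptions for this example.

```r
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)

# FALSE & NA is FALSE, so rows with NA never pass the filter
ok <- complete.cases(df) & df$y != 0 & df$z != 0 & df$t != 0

# one value per x, computed on the surviving rows only
ans <- sapply(split(df[ok, ], df$x[ok]), function(d) {
  (mean(d$y) + mean(d$z)) / (mean(d$z) - mean(d$t))
})
ans
```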

Resources