How do I select rows in a data frame before and after a condition is met? - r

I have been searching the web for a few days now and can't find a solution to my (probably easy to solve) problem.
I have huge data frames with 4 variables and over a million observations each. Now I want to select 100 rows before, all rows while, and 1000 rows after a specific condition is met, and fill the rest with NAs. I tried it with a for loop and if/ifelse, but it hasn't worked so far. I think it shouldn't be a big thing, but at the moment I just can't get the hang of it.
I create the data using:
foo<-data.frame(t = 1:15, a = sample(1:15), b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1), c = sample(1:15))
My data looks like this:
ID t a b c
1 1 4 1 7
2 2 7 1 10
3 3 10 1 6
4 4 2 1 4
5 5 13 1 9
6 6 15 4 3
7 7 8 4 15
8 8 3 4 1
9 9 9 4 2
10 10 14 1 8
11 11 5 1 11
12 12 11 1 13
13 13 12 1 5
14 14 6 1 14
15 15 1 1 12
What I want is to pick the value of a for (in this example) the 2 rows before, all rows during, and the 3 rows after the stretch where b > 1, and fill the rest with NAs. [Because this is just an example, you can imagine that after these 15 rows there are more rows in which b changes from 1 to 4 several times. I did not post them to avoid cluttering the question with unnecessary data.]
So I want to get something like:
ID t a b c d
1 1 4 1 7 NA
2 2 7 1 10 NA
3 3 10 1 6 NA
4 4 2 1 4 2
5 5 13 1 9 13
6 6 15 4 3 15
7 7 8 4 15 8
8 8 3 4 1 3
9 9 9 4 2 9
10 10 14 1 8 14
11 11 5 1 11 5
12 12 11 1 13 11
13 13 12 1 5 NA
14 14 6 1 14 NA
15 15 1 1 12 NA
I'm thankful for any help.
Thank you.
Best regards,
Chris

Here is the same approach as missuse's, but with data.table:
library(data.table)
foo<-data.frame(t = 1:11, a = sample(1:11), b = c(1,1,1,4,4,4,4,1,1,1,1), c = sample(1:11))
DT <- setDT(foo)
DT[ unique(c(DT[,.I[b>1] ],DT[,.I[b>1]+3 ],DT[,.I[b>1]-2 ])), d := a]
t a b c d
1: 1 10 1 2 NA
2: 2 6 1 10 6
3: 3 5 1 7 5
4: 4 11 4 4 11
5: 5 4 4 9 4
6: 6 8 4 5 8
7: 7 2 4 8 2
8: 8 3 1 3 3
9: 9 7 1 6 7
10: 10 9 1 1 9
11: 11 1 1 11 NA
Here
unique(c(DT[,.I[b>1] ],DT[,.I[b>1]+3 ],DT[,.I[b>1]-2 ]))
gives you the desired indices: the unique row indices where your condition holds, together with the same indices + 3 and - 2.
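Note that the expression above only adds the rows exactly 2 before and exactly 3 after each hit, so it covers the whole window only when the runs of b > 1 are long enough (as they are in this example). A minimal base R sketch of a more general version, assuming a hypothetical helper named window_index and a made-up data frame with a single hit row:

```r
# Hypothetical helper (not from the answers above): all row indices that lie
# within `before` rows before or `after` rows after a row where `hit` is TRUE,
# clamped to the valid range 1..length(hit).
window_index <- function(hit, before = 2, after = 3) {
  pos <- which(hit)
  # every offset from -before to +after, added to every hit position
  idx <- as.vector(outer(pos, -before:after, `+`))
  sort(unique(idx[idx >= 1 & idx <= length(hit)]))
}

# Made-up data: only row 5 satisfies b > 1
foo <- data.frame(a = 11:20, b = c(1, 1, 1, 1, 4, 1, 1, 1, 1, 1))
ind <- window_index(foo$b > 1)

foo$d <- NA
foo$d[ind] <- foo$a[ind]
foo$d
```

With a single hit in row 5, rows 3 to 8 are kept, including the in-between rows 4, 6 and 7.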

Here is an attempt.
Get indexes that satisfy the condition b > 1
z <- which(foo$b > 1)
get indexes for (z - 2) : (z + 3)
ind <- unique(unlist(lapply(z, function(x){
g <- max(x - 2, 1) # clamp to the first row if x - 2 is below 1
h <- min(x + 3, nrow(foo)) # clamp to the last row if x + 3 runs past the end
g : h
})))
create d column filled with NA
foo$d <- NA
replace elements with appropriate indexes with foo$a
foo$d[ind] <- foo$a[ind]
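For the data sizes mentioned in the question (over a million rows, windows of 100 and 1000 rows), building every window index explicitly may get expensive. A possible alternative, sketched here under the assumption that a logical mask is enough, marks where each window opens and closes and takes a cumulative sum. The helper name window_mask and the values in column a are made up for illustration:

```r
# Hypothetical helper: TRUE for every row that lies within `before` rows
# before or `after` rows after a row where `hit` is TRUE.
window_mask <- function(hit, before, after) {
  n <- length(hit)
  pos <- which(hit)
  starts <- pmax(pos - before, 1)      # first row of each window
  stops  <- pmin(pos + after, n) + 1   # one past the last row of each window
  # +1 where a window opens, -1 just after it closes
  delta <- tabulate(starts, nbins = n + 1) - tabulate(stops, nbins = n + 1)
  # a row is inside at least one window while the running count is positive
  cumsum(delta)[seq_len(n)] > 0
}

foo <- data.frame(a = 101:115,
                  b = c(1, 1, 1, 1, 1, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1))
keep <- window_mask(foo$b > 1, before = 2, after = 3)
foo$d <- ifelse(keep, foo$a, NA)
foo
```

The work is linear in the number of rows regardless of the window sizes, so before = 100 and after = 1000 should behave the same way.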

library(dplyr)
library(purrr)
# example dataset
foo<-data.frame(t = 1:15,
a = sample(1:15),
b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1),
c = sample(1:15))
# function to get indices of interest
# for a given index x go 2 positions back and 3 forward
# keep only positive indices
GetIDsBeforeAfter = function(x) {
v = (x-2) : (x+3)
v[v > 0]
}
foo %>% # from your dataset
filter(b > 1) %>% # keep rows where b > 1
pull(t) %>% # get the positions
map(GetIDsBeforeAfter) %>% # for each position apply the function
unlist() %>% # unlist all sets indices
unique() -> ids_to_remain # keep unique ones and save them in a vector
foo$d = foo$c # copy column c as d
foo$d[-ids_to_remain] = NA # put NA to all positions not in our vector
foo
# t a b c d
# 1 1 5 1 8 NA
# 2 2 6 1 14 NA
# 3 3 4 1 10 NA
# 4 4 1 1 7 7
# 5 5 10 1 5 5
# 6 6 8 4 9 9
# 7 7 9 4 15 15
# 8 8 3 4 6 6
# 9 9 7 4 2 2
# 10 10 12 1 3 3
# 11 11 11 1 1 1
# 12 12 15 1 4 4
# 13 13 14 1 11 NA
# 14 14 13 1 13 NA
# 15 15 2 1 12 NA

Related

Replace row value in a data frame group by the smallest value in that group

I have the following data set:
time <- c(0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3,4,5)
value <- c(10,8,6,5,3,2,12,10,6,5,4,2,20,15,16,9,2,2)
group <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
data <- data.frame(time, value, group)
I want to create a new column called data$diff that is equal to data$value minus the value of data$value when data$time == 0 within each group.
I am beginning with the following code
for(i in 1:nrow(data)){
for(n in 1:max(data$group)){
if(data$group[i] == n) {
data$diff[i] <- ???????
}
}
}
But cannot figure out what to put in place of the question marks. The desired output would be this table: https://i.stack.imgur.com/1bAKj.png
Any thoughts are appreciated.
Since in your example data$time == 0 is always the first element of the group, you can use this data.table approach.
library(data.table)
setDT(data)
data[, diff := value[1] - value, by = group]
In case that data$time == 0 is not the first element in each group you can use this:
data[, diff := value[time==0] - value, by = group]
Output:
> data
time value group diff
1: 0 10 1 0
2: 1 8 1 2
3: 2 6 1 4
4: 3 5 1 5
5: 4 3 1 7
6: 5 2 1 8
7: 0 12 2 0
8: 1 10 2 2
9: 2 6 2 6
10: 3 5 2 7
11: 4 4 2 8
12: 5 2 2 10
13: 0 20 3 0
14: 1 15 3 5
15: 2 16 3 4
16: 3 9 3 11
17: 4 2 3 18
18: 5 2 3 18
Here is a base R approach.
within(data, diff <- ave(
seq_along(value), group,
FUN = \(i) value[i][time[i] == 0] - value[i]
))
Output
time value group diff
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
Here is a short way to do it with dplyr.
library(dplyr)
data %>%
group_by(group) %>%
mutate(diff = value[which(time == 0)] - value)
Which gives
# Groups: group [3]
time value group diff
<dbl> <dbl> <dbl> <dbl>
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
library(dplyr)
vals2use <- data %>%
group_by(group) %>%
filter(time==0) %>%
select(c(2,3)) %>%
rename(value4diff=value)
dataNew <- merge(data, vals2use, all=T)
dataNew$diff <- dataNew$value4diff-dataNew$value
dataNew <- dataNew[,c(1,2,3,5)]
dataNew
group time value diff
1 1 0 10 0
2 1 1 8 2
3 1 2 6 4
4 1 3 5 5
5 1 4 3 7
6 1 5 2 8
7 2 0 12 0
8 2 1 10 2
9 2 2 6 6
10 2 3 5 7
11 2 4 4 8
12 2 5 2 10
13 3 0 20 0
14 3 1 15 5
15 3 2 16 4
16 3 3 9 11
17 3 4 2 18
18 3 5 2 18
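The merge-based lookup above can also be sketched in base R with match(), which avoids the intermediate join. This assumes each group has exactly one row with time == 0; the name baseline is made up for illustration:

```r
time  <- rep(0:5, times = 3)
value <- c(10, 8, 6, 5, 3, 2, 12, 10, 6, 5, 4, 2, 20, 15, 16, 9, 2, 2)
group <- rep(1:3, each = 6)
data  <- data.frame(time, value, group)

# One row per group: the value observed at time 0
baseline <- data[data$time == 0, c("group", "value")]

# Look up each row's group baseline, then subtract the row's own value
data$diff <- baseline$value[match(data$group, baseline$group)] - data$value
data
```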

Summing up consecutive values in groups [duplicate]

This question already has answers here:
Calculate cumulative sum (cumsum) by group
(5 answers)
Closed 2 years ago.
I'd like to sum up consecutive values in one column by group. Without a long explanation, I have a df like this:
set.seed(1)
gr <- c(rep('A',3),rep('B',2),rep('C',5),rep('D',3))
vals <- floor(runif(length(gr), min=0, max=10))
idx <- c(seq(1:3),seq(1:2),seq(1:5),seq(1:3))
df <- data.frame(gr,vals,idx)
gr vals idx
1 A 2 1
2 A 3 2
3 A 5 3
4 B 9 1
5 B 2 2
6 C 8 1
7 C 9 2
8 C 6 3
9 C 6 4
10 C 0 5
11 D 2 1
12 D 1 2
13 D 6 3
And I'm looking for this one:
gr vals idx
1 A 2 1
2 A 5 2
3 A 10 3
4 B 9 1
5 B 11 2
6 C 8 1
7 C 17 2
8 C 23 3
9 C 29 4
10 C 29 5
11 D 2 1
12 D 3 2
13 D 9 3
For example, in group C we have 8+9=17 (the first and second elements of the group), and the second value is replaced by the sum. Then 17+6=23 (the sum of the previously summed elements and the third element), the third element is replaced by the new result, and so on.
I looked for a solution here, but what I found isn't what I need.
Ok, I think I got it
library(dplyr)
df %>%
group_by(gr) %>%
mutate(nvals = cumsum(vals))
gr vals idx nvals
1 A 2 1 2
2 A 3 2 5
3 A 5 3 10
4 B 9 1 9
5 B 2 2 11
6 C 8 1 8
7 C 9 2 17
8 C 6 3 23
9 C 6 4 29
10 C 0 5 29
11 D 2 1 2
12 D 1 2 3
13 D 6 3 9
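The same grouped running sum can be sketched in base R with ave(), without loading dplyr. The values below are typed in from the table in the question rather than regenerated with set.seed():

```r
gr   <- c(rep("A", 3), rep("B", 2), rep("C", 5), rep("D", 3))
vals <- c(2, 3, 5, 9, 2, 8, 9, 6, 6, 0, 2, 1, 6)
df   <- data.frame(gr, vals)

# ave() applies cumsum separately within each level of gr
# and returns the results in the original row order
df$nvals <- ave(df$vals, df$gr, FUN = cumsum)
df
```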

Is there any way to replace a missing value based on another columns' value to match the column name

I have a dataset:
a day day.1.time day.2.time day.3.time day.4.time day.5.time
1 NA 2 4 5 7 10 4
2 NA 5 4 1 1 6 NA
3 NA 3 7 9 6 7 4
4 NA 3 6 8 8 4 5
5 NA 3 5 2 4 5 6
6 NA 3 87 3 2 1 78
7 NA 1 NA 7 5 9 54
8 NA 5 6 6 3 2 3
9 NA 2 5 10 9 8 3
10 NA 3 9 4 10 3 3
I am trying to use the day column value to match with the day.x.time column to replace the missing value in column a. For instance, in the first row, the first value in the day column is 2, then we should use day.2.time value 5 to replace the first value in column a.
If the day.x.time value is missing, we should use -1 day or +1 day to replace the missing in column a. For instance, in the second row, the day column shows 5, so we should use the value in day.5.time column, but it's also a missing value. In this case, we should use the value in day.4.time column to replace the missing value in column a.
You can use the following to generate the sample data:
dat = data.frame(a = rep(NA,10), day = c(2,5,3,3,3,3,1,5,2,3), day.1.time = c(4,4,7,6,5,87,NA,6,5,9), day.2.time = sample(10), day.3.time = sample(10), day.4.time = sample(10), day.5.time = c(4,NA,4,5,6,78,54,3,3,3))
I have tried grep(paste0("^day.", dat$day, ".time$"), names(dat)) to match the column, but my code isn't matching in every row, so any help would be appreciated!
Here is one way to do this.
The first part, matching the day column with the corresponding day.x.time column, is easy. We can do this using matrix subsetting.
cols <- grep('day\\.\\d+\\.time', names(dat))
dat$a <- dat[cols][cbind(1:nrow(dat), dat$day)]
dat
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
#1 3 2 4 3 3 3 4
#2 NA 5 4 4 10 2 NA
#3 1 3 7 8 1 8 4
#4 4 3 6 6 4 5 5
#5 6 3 5 10 6 7 6
#6 8 3 87 5 8 9 78
#7 NA 1 NA 1 7 10 54
#8 3 5 6 7 9 1 3
#9 2 2 5 2 5 6 3
#10 2 3 9 9 2 4 3
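As a small illustration of the matrix subsetting used above: indexing with a two-column matrix of (row, column) pairs picks one cell per row. The matrix m and the vector pick are made up for this sketch:

```r
m <- matrix(1:12, nrow = 3)   # columns are 1:3, 4:6, 7:9, 10:12
pick <- c(2, 4, 1)            # column to take from rows 1, 2, 3

# Each row of the index matrix is one (row, column) pair
cells <- m[cbind(seq_len(nrow(m)), pick)]
cells
```

Here cells holds m[1, 2], m[2, 4] and m[3, 1], that is 4, 11 and 3.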
To fill values where the day.x.time column is NA, we can select the closest non-NA value in that row.
inds <- which(is.na(dat$a))
dat$a[inds] <- mapply(function(x, y)
na.omit(unlist(dat[x, cols[order(abs(y- seq_along(cols)))]])[1:4])[1],
inds, dat$day[inds])
dat
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
#1 3 2 4 3 3 3 4
#2 2 5 4 4 10 2 NA
#3 1 3 7 8 1 8 4
#4 4 3 6 6 4 5 5
#5 6 3 5 10 6 7 6
#6 8 3 87 5 8 9 78
#7 1 1 NA 1 7 10 54
#8 3 5 6 7 9 1 3
#9 2 2 5 2 5 6 3
#10 2 3 9 9 2 4 3
Using sapply to loop over the rows and subset by day[i] + 2 column.
res <- transform(dat, a=sapply(1:nrow(dat), function(i) dat[i, dat$day[i] + 2]))
res
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
# 1 5 2 4 5 7 10 4
# 2 NA 5 4 1 1 6 NA
# 3 6 3 7 9 6 7 4
# 4 8 3 6 8 8 4 5
# 5 4 3 5 2 4 5 6
# 6 2 3 87 3 2 1 78
# 7 NA 1 NA 7 5 9 54
# 8 3 5 6 6 3 2 3
# 9 10 2 5 10 9 8 3
# 10 10 3 9 4 10 3 3
Edit
The +/-1 day rule would need a decision rule: what to choose if the value for day is NA, but neither day - 1 nor day + 1 is NA and the two do not have the same value.
Here is a solution that goes backwards from day and takes the first non-NA value. If it is day one, as is the case in row 7, we get NA.
res <- transform(dat, a=sapply(1:nrow(dat), function(i) {
days <- dat[i, -(1:2)]
day.value <- days[dat$day[i]]
if (is.na(day.value)) {
day.value <- tail(na.omit(unlist(days[1:dat$day[i]])), 1)
if (length(day.value) == 0) day.value <- NA
}
return(day.value)
}))
res
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
# 1 10 2 4 10 1 2 4
# 2 10 5 4 1 3 10 NA
# 3 2 3 7 7 2 7 4
# 4 6 3 6 2 6 6 5
# 5 10 3 5 9 10 5 6
# 6 8 3 87 6 8 4 78
# 7 NA 1 NA 3 7 1 54
# 8 3 5 6 4 4 9 3
# 9 8 2 5 8 5 8 3
# 10 9 3 9 5 9 3 3
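One possible decision rule, sketched here as an assumption rather than as what the question requires, is to take the nearest non-NA day and prefer the earlier day when two days are equally close. The data below is typed in from the table in the question:

```r
dat <- data.frame(
  a = NA,
  day = c(2, 5, 3, 3, 3, 3, 1, 5, 2, 3),
  day.1.time = c(4, 4, 7, 6, 5, 87, NA, 6, 5, 9),
  day.2.time = c(5, 1, 9, 8, 2, 3, 7, 6, 10, 4),
  day.3.time = c(7, 1, 6, 8, 4, 2, 5, 3, 9, 10),
  day.4.time = c(10, 6, 7, 4, 5, 1, 9, 2, 8, 3),
  day.5.time = c(4, NA, 4, 5, 6, 78, 54, 3, 3, 3)
)

times <- as.matrix(dat[grep("^day\\.\\d+\\.time$", names(dat))])

# Direct lookup: row i, column day[i]
a <- times[cbind(seq_len(nrow(times)), dat$day)]

# Fallback for missing cells: nearest non-NA day in the same row.
# which.min() returns the first minimum, so ties go to the earlier day.
for (i in which(is.na(a))) {
  ok <- which(!is.na(times[i, ]))
  if (length(ok) > 0) {
    nearest <- ok[which.min(abs(ok - dat$day[i]))]
    a[i] <- times[i, nearest]
  }
}
dat$a <- a
dat
```

Row 2 (day 5 missing) falls back to day.4.time, and row 7 (day 1 missing) falls back to day.2.time.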

Fill Missing Values

data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,3))
library(dplyr);library(tidyverse)
data$timeWANTattempt=data$timeHAVE
data <- data %>%
group_by(student) %>%
fill(timeWANTattempt)+3
I have 'timeHAVE' and I want to replace missing times with the previous time +3. I show my dplyr attempt but it does not work. I seek a data.table solution. Thank you.
You can try:
data %>%
group_by(student) %>%
mutate(n_na = cumsum(is.na(timeHAVE))) %>%
mutate(timeHAVE = ifelse(is.na(timeHAVE), timeHAVE[n_na == 0 & lead(n_na) == 1] + 3*n_na, timeHAVE))
student timeHAVE timeWANT n_na
<dbl> <dbl> <dbl> <int>
1 1 1 1 0
2 1 4 4 0
3 1 7 7 0
4 1 10 10 0
5 2 2 2 0
6 2 5 5 0
7 2 8 8 1
8 2 11 11 1
9 3 6 6 0
10 3 9 9 1
11 3 12 12 2
12 3 15 15 3
13 4 3 3 0
I included the little helper n_na, which keeps a running count of the NAs within each student. The second mutate then multiplies that count by three and adds the result to the last non-NA element before the NAs.
Here's an approach using 'locf' filling
library(data.table)
setDT(data)
data[ , by = student, timeWANT := {
# carry previous observations forward whenever missing
locf_fill = nafill(timeHAVE, 'locf')
# every next NA, the amount shifted goes up by another 3
na_shift = cumsum(idx <- is.na(timeHAVE))
# add the shift, but only where the original data was missing
locf_fill[idx] = locf_fill[idx] + 3*na_shift[idx]
# return the full vector
locf_fill
}]
Be warned that this won't work if a given student can have more than one non-consecutive run of NA values in timeHAVE.
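If several separate runs of NA per student are possible, one way around that limitation is to fill sequentially, so that each NA is taken from the already filled value before it. This is a base R sketch rather than the data.table solution asked for, and fill_plus3 is a hypothetical helper name:

```r
# Hypothetical helper: replace each NA by the previous (possibly already
# filled) value plus `step`. A leading NA has no predecessor and stays NA.
fill_plus3 <- function(x, step = 3) {
  out <- x
  for (i in seq_along(out)[-1]) {
    if (is.na(out[i])) out[i] <- out[i - 1] + step
  }
  out
}

data <- data.frame(student  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4),
                   timeHAVE = c(1, 4, 7, 10, 2, 5, NA, 11, 6, NA, NA, NA, 3),
                   timeWANT = c(1, 4, 7, 10, 2, 5, 8, 11, 6, 9, 12, 15, 3))

# Apply the fill within each student
data$filled <- ave(data$timeHAVE, data$student, FUN = fill_plus3)
data

# Two separate NA runs in one vector are handled as well
fill_plus3(c(1, NA, 7, NA))
```

The explicit loop is slower than the vectorised versions above, which may matter for long per-student series.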
Another data.table option without grouping:
setDT(data)[, w := fifelse(is.na(timeHAVE) & student==shift(student),
nafill(timeHAVE, "locf") + 3L * rowid(rleid(timeHAVE)),
timeHAVE)]
output:
student timeHAVE timeWANT w
1: 1 1 1 1
2: 1 4 4 4
3: 1 7 7 7
4: 1 10 10 10
5: 2 2 2 2
6: 2 5 5 5
7: 2 NA 8 8
8: 2 11 11 11
9: 3 6 6 6
10: 3 NA 9 9
11: 3 NA 12 12
12: 3 NA 15 15
13: 4 NA NA NA
14: 4 3 3 3
data with student=4 having NA for the first timeHAVE:
data = data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,NA,3))

How to replace the NA values after merge two data.frame? [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 7 years ago.
I have two data.frame as the following:
> a <- data.frame(x=c(1,2,3,4,5,6,7,8), y=c(1,3,5,7,9,11,13,15))
> a
x y
1 1 1
2 2 3
3 3 5
4 4 7
5 5 9
6 6 11
7 7 13
8 8 15
> b <- data.frame(x=c(1,5,7), z=c(2, 4, 6))
> b
x z
1 1 2
2 5 4
3 7 6
Then I use "join" for two data.frames:
> c <- join(a, b, by="x", type="left")
> c
x y z
1 1 1 2
2 2 3 NA
3 3 5 NA
4 4 7 NA
5 5 9 4
6 6 11 NA
7 7 13 6
8 8 15 NA
My requirement is to replace the NAs in the z column with the last non-NA value before the current position. I want a result like this:
> c
x y z
1 1 1 2
2 2 3 2
3 3 5 2
4 4 7 2
5 5 9 4
6 6 11 4
7 7 13 6
8 8 15 6
This time (if your data is not too large) a loop is an elegant option:
for(i in which(is.na(c$z))){
c$z[i] = c$z[i-1]
}
gives:
> c
x y z
1 1 1 2
2 2 3 2
3 3 5 2
4 4 7 2
5 5 9 4
6 6 11 4
7 7 13 6
8 8 15 6
data:
library(plyr)
a <- data.frame(x=c(1,2,3,4,5,6,7,8), y=c(1,3,5,7,9,11,13,15))
b <- data.frame(x=c(1,5,7), z=c(2, 4, 6))
c <- join(a, b, by="x", type="left")
You might also want to check na.locf in the zoo package.
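Note that the loop above assumes the first value of z is not NA, since it reads the element at position i-1. A vectorised base R sketch that needs no extra package and leaves leading NAs in place, with locf as a hypothetical helper name:

```r
# Hypothetical helper: last observation carried forward.
locf <- function(x) {
  # position of the most recent non-NA value at or before each element
  idx <- cummax(seq_along(x) * !is.na(x))
  # 0 means no earlier non-NA value exists; index with NA to keep it missing
  idx[idx == 0] <- NA
  x[idx]
}

z <- c(2, NA, NA, NA, 4, NA, 6, NA)
locf(z)

# A leading NA stays NA
locf(c(NA, 1, NA))
```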
