Sum over a range with start and end point variables in R

I have a dataframe with the following variables:
start_point end_point variable_X
          1         5     0.3757
          2         7     0.4546
          3         7     0.1245
          4         8     0.3455
          5        11     0.2399
          6        12     0.0434
          7        15     0.4323
        ...       ...        ...
I would like to add a fourth column that sums variable_X from the start point to the end point defined in the first two columns. For example, the entry in the first row would be the sum of rows 1 through 5 (inclusive): 0.3757 + 0.4546 + 0.1245 + 0.3455 + 0.2399 = 1.5402; the entry in the second row would be the sum of rows 2 through 7 (inclusive): 0.4546 + 0.1245 + 0.3455 + 0.2399 + 0.0434 + 0.4323 = 1.6402, and so forth.
I'm new to R, any help would be greatly appreciated.

There are probably slicker ways to do this, but here's a quick version:
df$sumX <- apply(df, 1, function(x) sum(df$variable_X[x[1]:x[2]]))
df
  start_point end_point variable_X   sumX
1           1         5     0.3757 1.5402
2           2         7     0.4546 1.6402
3           3         7     0.1245 1.1856
4           4         8     0.3455     NA
5           5        11     0.2399     NA
6           6        12     0.0434     NA
7           7        15     0.4323     NA
The last few rows are NA here because I don't have rows 8 through 15 of your data.
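For larger data, a vectorized alternative is to precompute running totals with cumsum, so each row's sum becomes a difference of two cumulative sums instead of a per-row loop. This is a sketch on a small, complete version of the example data (out-of-range end points again yield NA):

```r
# Sketch: running totals via cumsum(); each range sum is cs[end] - cs[start - 1].
# Small complete example (same values as the question's first seven rows).
df <- data.frame(start_point = 1:7,
                 end_point   = c(5, 7, 7, 8, 11, 12, 15),
                 variable_X  = c(0.3757, 0.4546, 0.1245, 0.3455,
                                 0.2399, 0.0434, 0.4323))
cs <- cumsum(df$variable_X)
# c(0, cs)[start] is the cumulative sum just before the start row
df$sumX <- cs[df$end_point] - c(0, cs)[df$start_point]
```

Indexing cs past the last row returns NA, matching the apply() result above.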

A solution with dplyr, using another reproducible example to address NAs in end_point (as in the OP's comment), handled with ifelse:
# Reproducible example
mydf <- data.frame(start_point = 1:9,
                   end_point = c(5, NA, 7, 8, 11, 12, 7, 15, NA),
                   variable_X = c(1, 5, 2, 3, 5, 4, 2, 1, 2))

library(dplyr)
mydf %>%
  rowwise() %>%
  mutate(sumX = ifelse(is.na(end_point), NA,
                       sum(mydf$variable_X[start_point:end_point])))
#   start_point end_point variable_X  sumX
#         <int>     <dbl>      <dbl> <dbl>
# 1           1         5          1    16
# 2           2        NA          5    NA
# 3           3         7          2    16
# 4           4         8          3    15
# 5           5        11          5    NA
# 6           6        12          4    NA
# 7           7         7          2     2
# 8           8        15          1    NA
# 9           9        NA          2    NA
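A base-R sketch of the same NA-aware logic is possible with mapply, re-creating mydf so the snippet runs on its own:

```r
# Re-create the reproducible example from above
mydf <- data.frame(start_point = 1:9,
                   end_point = c(5, NA, 7, 8, 11, 12, 7, 15, NA),
                   variable_X = c(1, 5, 2, 3, 5, 4, 2, 1, 2))
# mapply() walks the start/end pairs; NA end points short-circuit to NA
mydf$sumX <- mapply(function(s, e) if (is.na(e)) NA_real_
                                   else sum(mydf$variable_X[s:e]),
                    mydf$start_point, mydf$end_point)
```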

Related

Create lagged variables for consecutive time points only using R

I have an unbalanced panel (with unequally spaced measurement points) and would like to create a lagged variable of x by group (Variable: id) but only for consecutive time points. My data looks like this:
# simple example with an unbalanced panel
base <- data.frame(id = rep(1:2, each = 7),
                   time = c(1, 2, 3, 4, 7, 8, 10, 3, 4, 6, 9, 10, 11, 14),
                   x = rnorm(14, mean = 3, sd = 1))
I already tried this code using dplyr:
base_lag <- base %>%               # Add lagged column
  group_by(id) %>%
  dplyr::mutate(lag1_x = dplyr::lag(x, n = 1, default = NA)) %>%
  as.data.frame()
base_lag                           # Print updated data
However, this way I get a lagged value even when the time points are not consecutive.
My final data set should look like this:
   id time        x   lag1_x
1   1    1 3.437416       NA
2   1    2 2.300553 3.437416
3   1    3 2.374212 2.300553
4   1    4 4.374009 2.374212
5   1    7 1.177433       NA
6   1    8 1.543353 1.177433
7   1   10 3.222358       NA
8   2    3 3.763765       NA
9   2    4 3.881182 3.763765
10  2    6 4.754420       NA
11  2    9 4.518227       NA
12  2   10 2.512486 4.518227
13  2   11 3.129230 2.512486
14  2   14 2.152509       NA
Does anyone here have a tip for me on how to create this lagged variable? Many thanks in advance!
You could use ifelse, testing whether diff(time) is equal to 1. If so, write the lag. If not, write an NA.
base %>%
  group_by(id) %>%
  mutate(lag1_x = ifelse(c(0, diff(time)) == 1,
                         lag(x, n = 1, default = NA), NA)) %>%
  as.data.frame()
#>    id time        x   lag1_x
#> 1   1    1 1.852343       NA
#> 2   1    2 2.710538 1.852343
#> 3   1    3 2.700785 2.710538
#> 4   1    4 2.588489 2.700785
#> 5   1    7 3.252223       NA
#> 6   1    8 2.108079 3.252223
#> 7   1   10 3.435683       NA
#> 8   2    3 1.762462       NA
#> 9   2    4 2.775732 1.762462
#> 10  2    6 3.377396       NA
#> 11  2    9 3.133336       NA
#> 12  2   10 3.804190 3.133336
#> 13  2   11 2.942893 3.804190
#> 14  2   14 3.503608       NA
Another option is to create a grouping based on the difference in time:
library(dplyr)
base %>%
  group_by(id, grp = cumsum(c(TRUE, diff(time) != 1))) %>%
  mutate(lag1_x = lag(x)) %>%
  ungroup() %>%
  select(-grp)
Output:
# A tibble: 14 × 4
      id  time     x lag1_x
   <int> <dbl> <dbl>  <dbl>
 1     1     1  3.81  NA
 2     1     2  2.79   3.81
 3     1     3  3.04   2.79
 4     1     4  1.76   3.04
 5     1     7  1.72  NA
 6     1     8  2.68   1.72
 7     1    10  3.31  NA
 8     2     3  2.92  NA
 9     2     4  2.02   2.92
10     2     6  1.71  NA
11     2     9  2.56  NA
12     2    10  1.62   2.56
13     2    11  3.30   1.62
14     2    14  3.69  NA
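If a dependency-free version is useful, the same idea can be sketched in base R with ave(): lag within each id, then blank out rows whose time point does not immediately follow the previous one (column name lag1_x as in the question):

```r
# Reproducible example from the question
base <- data.frame(id = rep(1:2, each = 7),
                   time = c(1, 2, 3, 4, 7, 8, 10, 3, 4, 6, 9, 10, 11, 14),
                   x = rnorm(14, mean = 3, sd = 1))
# Lag x within each id
base$lag1_x <- ave(base$x, base$id, FUN = function(v) c(NA, head(v, -1)))
# Keep the lag only where the gap to the previous time point is exactly 1
consecutive <- ave(base$time, base$id, FUN = function(t) c(0, diff(t))) == 1
base$lag1_x[!consecutive] <- NA
```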

Create new column and carry forward value from previous group to next

I am trying to carry a value forward from the previous group into the next group. I tried to solve it using rleid but could not get the desired result.
df <- data.frame(signal = c(1, 1, 5, 5, 5, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 7, 7, 8, 9, 9, 9, 10),
                 desired_outcome = c(NA, NA, 1, 1, 1, 5, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 7, 8, 8, 8, 9))
# outcome column has the expected result -
   signal desired_outcome
1       1              NA
2       1              NA
3       5               1
4       5               1
5       5               1
6       2               5
7       3               2
8       3               2
9       3               2
10      4               3
11      4               3
12      5               4
13      5               4
14      5               4
15      5               4
16      6               5
17      7               6
18      7               6
19      8               7
20      9               8
21      9               8
22      9               8
23     10               9
rle gives the lengths and values of runs where the same value occurs. Then: drop the last value, shift the remaining values one position over, add an NA at the beginning to account for the dropped value, and repeat each value as given by lengths (i.e. the lengths of the runs of identical values in the original vector).
with(rle(df$signal), rep(c(NA, head(values, -1)), lengths))
# [1] NA NA 1 1 1 5 2 2 2 3 3 4 4 4 4 5 6 6 7 8 8 8 9
Another way could be to first lag signal, then use rleid to create groups, and use mutate to broadcast the first value of each group to all rows of that group.
library(dplyr)
df %>%
  mutate(out = lag(signal)) %>%
  group_by(group = data.table::rleid(signal)) %>%
  mutate(out = first(out)) %>%
  ungroup() %>%
  select(-group)
# A tibble: 23 x 2
#    signal   out
#     <dbl> <dbl>
#  1      1    NA
#  2      1    NA
#  3      5     1
#  4      5     1
#  5      5     1
#  6      2     5
#  7      3     2
#  8      3     2
#  9      3     2
# 10      4     3
# … with 13 more rows
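For completeness, here is a base-R sketch of the same run logic that avoids both rle() and data.table, building run ids with cumsum() and then indexing each run's predecessor value:

```r
df <- data.frame(signal = c(1, 1, 5, 5, 5, 2, 3, 3, 3, 4, 4, 5, 5,
                            5, 5, 6, 7, 7, 8, 9, 9, 9, 10))
runs <- cumsum(c(TRUE, diff(df$signal) != 0))  # run id for every row
first_vals <- df$signal[!duplicated(runs)]     # first value of each run
df$outcome <- c(NA, first_vals)[runs]          # previous run's value
```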

Left join two tables on LHS rows that meet a condition, leave others as NA

Consider the following scenario:
test <- data.frame(Id1 = c(1, 2, 3, 4, 5, 10, 11),
                   Id2 = c(3, 4, 10, 11, 12, 15, 9),
                   Type = c(1, 1, 1, 2, 2, 2, 1))
test
#>   Id1 Id2 Type
#> 1   1   3    1
#> 2   2   4    1
#> 3   3  10    1
#> 4   4  11    2
#> 5   5  12    2
#> 6  10  15    2
#> 7  11   9    1
I want to join test on itself by Id2 = Id1 only when Type has a certain value, e.g. Type == 1 in such a way that I get the following result:
#>   Id1 Id2 Type.x Id2.y Type.y
#> 1   1   3      1    10      1   # matches row 3
#> 2   2   4      1    11      2   # matches row 4
#> 3   3  10      1    15      2   # matches row 6
#> 4   4  11      2    NA     NA   # matches row 7 but Type != 1
#> 5   5  12      2    NA     NA   # Type != 1
#> 6  10  15      2    NA     NA   # Type != 1
#> 7  11   9      1    NA     NA   # Type == 1 but no matches
Since in this case, test represents a hierarchy, a join of this type would allow me to "expand" the hierarchy so that each row eventually terminated in an Id2 that was not equal to any value of Id1.
How can one achieve such a join?
The tidyverse is a great collection of packages for data manipulation. Here, we can do as follows:
library(tidyverse)
joined <- test %>% left_join(test %>% filter(Type==1), by = c("Id1" = "Id2"))
joined
UPDATE:
library(tidyverse)
joined <- test %>%
  filter(Type == 1) %>%
  left_join(test, by = c("Id2" = "Id1")) %>%
  bind_rows(test %>% filter(Type == 2) %>% rename(Type.x = Type))
joined
Id1 Id2 Type.x Id2.y Type.y
  1   3      1    10      1
  2   4      1    11      2
  3  10      1    15      2
 11   9      1    NA     NA
  4  11      2    NA     NA
  5  12      2    NA     NA
 10  15      2    NA     NA
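The Type == 1 half of the updated join can also be sketched in base R with merge(); the suffixes argument should reproduce the Id2.y/Type.y column names. Only the matched portion is shown here, and the Type == 2 rows would still need to be appended (as bind_rows does in the dplyr version):

```r
test <- data.frame(Id1 = c(1, 2, 3, 4, 5, 10, 11),
                   Id2 = c(3, 4, 10, 11, 12, 15, 9),
                   Type = c(1, 1, 1, 2, 2, 2, 1))
# Join only the Type == 1 rows of test back onto test by Id2 = Id1;
# unmatched Id2 values get NA in the .y columns
lhs1 <- test[test$Type == 1, ]
m <- merge(lhs1, test, by.x = "Id2", by.y = "Id1",
           all.x = TRUE, suffixes = c(".x", ".y"))
```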

eliminating categories with a certain number of non-NA values in R

I have a data frame df which looks like this
> g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
> m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
> df <- data.frame(g, m)
where g is the category (1 to 6) and m are values in that category.
I've managed to find the number of non-NA values per category with:
aggregate(m ~ g, data=df, function(x) {sum(!is.na(x))}, na.action = NULL)
g m
1 1 1
2 2 3
3 3 2
4 4 1
5 5 2
6 6 3
and would now like to eliminate the rows (categories) where the number of non-NA values is 1, keeping only those where it is 2 or above.
the desired outcome would be
   g  m
5  2  3
6  2 NA
7  2  2
8  2  1
9  3  3
10 3 NA
11 3  3
12 3 NA
17 5 NA
18 5  2
19 5  1
20 5 NA
21 6  7
22 6  3
23 6 NA
24 6  1
g=1 and g=4 are eliminated because, as shown, there is only 1 non-NA value in each of those categories.
any suggestions :)?
If you want base R, then I suggest you use your aggregation:
df2 <- aggregate(m ~ g, data=df, function(x) {sum(!is.na(x))}, na.action = NULL)
df[ ! df$g %in% df2$g[df2$m < 2], ]
#     g  m
# 5   2  3
# 6   2 NA
# 7   2  2
# 8   2  1
# 9   3  3
# 10  3 NA
# 11  3  3
# 12  3 NA
# 17  5 NA
# 18  5  2
# 19  5  1
# 20  5 NA
# 21  6  7
# 22  6  3
# 23  6 NA
# 24  6  1
If you want to use dplyr, perhaps
library(dplyr)
group_by(df, g) %>%
  filter(sum(!is.na(m)) > 1) %>%
  ungroup()
# # A tibble: 16 × 2
#        g     m
#    <dbl> <dbl>
#  1     2     3
#  2     2    NA
#  3     2     2
#  4     2     1
#  5     3     3
#  6     3    NA
#  7     3     3
#  8     3    NA
#  9     5    NA
# 10     5     2
# 11     5     1
# 12     5    NA
# 13     6     7
# 14     6     3
# 15     6    NA
# 16     6     1
One can also try a dplyr-based solution; group_by on g helps to get the desired count. Note that this returns the per-category counts rather than the filtered rows:
library(dplyr)
df %>%
  group_by(g) %>%
  filter(!is.na(m)) %>%
  filter(n() >= 2) %>%
  summarise(count = n())
#Result
# # A tibble: 6 x 2
#       g count
#   <dbl> <int>
# 1  2.00     3
# 2  3.00     2
# 3  5.00     2
# 4  6.00     3
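A compact base-R alternative sketch uses ave() to compute each row's per-group count of non-NA values in place, so the filter becomes a single subset:

```r
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)
# ave() broadcasts the per-group count of non-NA values to every row
kept <- df[ave(!is.na(df$m), df$g, FUN = sum) >= 2, ]
```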

how to make a vector of x from 1 to max value

In a dataset like this one, what code should I use if I want to make a vector of
x <- 1:max(day) per ID
? So x will be
1:7 for B1
1:11 for B2
1:22 for B3
I tried
library(doBy)
max_day <- summaryBy(day ~ ID, df, FUN = max)  # to extract the maximum day per ID
df <- merge(df, max_day)                       # to create another column with the maximum day
max_day <- unique(df[, c("ID", "day.max")])    # to have one value (max) per ID
## & finally the vector
x <- 1:(max_day$day.max)
I got this message
Warning message:
In 1:(max_day$day.max) :
numerical expression has 11134 elements: only the first used
Any suggestions?
tapply(df$day, df$ID, function(x) 1:max(x))
I don't know how your output should look, but you can try this:
my_data <- data.frame(ID = c(rep("B1", 3), rep("B2", 4), rep("B3", 3)),
                      day = sample(1:20, 10, replace = TRUE))
tmp <- aggregate(my_data$day, by = list(my_data$ID), FUN = max)
sapply(1:nrow(tmp), function(y) return(1:tmp$x[y]))
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# [[2]]
# [1] 1 2 3 4 5 6 7 8 9 10 11
# [[3]]
# [1] 1 2 3 4 5 6 7 8 9 10 11
We can use sapply to loop over the unique elements of ID and generate a sequence from 1 to the maximum of the day column for that ID:
sapply(unique(df$ID), function(x) seq(1, max(df[df$ID == x, "day"])))
#[[1]]
#[1] 1 2 3 4 5 6 7
#[[2]]
#[1] 1 2 3 4 5 6 7 8 9 10 11
#[[3]]
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
If we want everything as one vector, we can use unlist:
unlist(sapply(unique(df$ID), function(x) seq(1, max(df[df$ID == x, "day"]))))
#[1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Yet another option, using Hadley Wickham's purrr package, as part of the tidyverse.
d <- data.frame(id = rep(c("B1", "B2", "B3"), c(3, 4, 5)),
                v = c(1:3, 1:4, 1:5),
                day = c(1, 3, 7, 1, 5, 9, 11, 3, 5, 11, 20, 22),
                number = c(15, 20, 30, 25, 26, 28, 35, 10, 12, 14, 16, 18))
library(purrr)
d %>%
  split(.$id) %>%
  map(~ 1:max(.$day))
# $B1
# [1] 1 2 3 4 5 6 7
# $B2
# [1] 1 2 3 4 5 6 7 8 9 10 11
# $B3
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
df <- data.frame(ID = c(rep("B1", 3), rep("B2", 4), rep("B3", 5)),
                 V = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5),
                 day = c(1, 3, 7, 1, 5, 9, 11, 3, 5, 11, 20, 22),
                 number = c(15, 20, 30, 25, 26, 28, 35, 10, 12, 14, 16, 18))

x <- list()
n <- 1
for (i in unique(df$ID)) {
  max_day <- max(df$day[df$ID == i])
  x[[n]] <- 1:max_day
  n <- n + 1
}
x
[[1]]
[1] 1 2 3 4 5 6 7
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10 11
[[3]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
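The loop above can also be condensed into a base-R one-liner with split() and lapply(); this sketch re-creates the same df and returns the same list, keyed by ID:

```r
df <- data.frame(ID = c(rep("B1", 3), rep("B2", 4), rep("B3", 5)),
                 day = c(1, 3, 7, 1, 5, 9, 11, 3, 5, 11, 20, 22))
# split() groups day by ID; seq_len() then builds 1:max for each group
x <- lapply(split(df$day, df$ID), function(d) seq_len(max(d)))
```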
