Get the nth lagged value in grouped data in R

I have a data frame similar to this
mydf = data_frame(letter = rep(c("a", "b", "c"), each = 5),
                  var1 = sample(1:25, 15, replace = TRUE))
# A tibble: 15 x 2
letter var1
<chr> <int>
1 a 16
2 a 9
3 a 5
4 a 14
5 a 6
6 b 13
7 b 9
8 b 20
9 b 18
10 b 4
11 c 18
12 c 11
13 c 9
14 c 1
15 c 12
I know I can get the value from the immediately preceding row with dplyr::lag(). However, I am trying to find a similar solution that returns the third value before each observation, within each group. The expected result should look like this:
# A tibble: 15 x 3
# Groups: letter [3]
letter var1 var2
<chr> <int> <dbl>
1 a 16 NA
2 a 9 NA
3 a 5 16
4 a 14 9
5 a 6 5
6 b 13 NA
7 b 9 NA
8 b 20 13
Thanks in advance
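A minimal sketch of one approach (an editor's assumption, not an answer from the thread): dplyr::lag() takes an n argument, so the shift can be done inside a grouped mutate. Note that the expected output above has two leading NAs per group, which corresponds to n = 2; use n = 3 for the value three rows back.
library(dplyr)
mydf %>%
  group_by(letter) %>%
  mutate(var2 = lag(var1, n = 2))  # value from two rows earlier, within each letter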

Related

Assign value to data based on more than two conditions and on other data

I have a data frame that looks like this
> df
name time count
1 A 10 9
2 A 12 17
3 A 24 19
4 A 3 15
5 A 29 11
6 B 31 14
7 B 7 7
8 B 30 18
9 C 29 13
10 C 12 12
11 C 3 16
12 C 4 6
and for each name group (A, B, C) I would need to assign a category following the rules below:
if time <= 10 then category = 1
if 10 < time <= 20 then category = 2
if 20 < time <= 30 then category = 3
if time > 30 then category = 4
to have a data frame that looks like this:
> df_final
name time count category
1 A 10 9 1
2 A 12 17 2
3 A 24 19 3
4 A 3 15 1
5 A 29 11 3
6 B 31 14 4
7 B 7 7 1
8 B 30 18 3
9 C 29 13 3
10 C 12 12 2
11 C 3 16 1
12 C 4 6 1
after that I would need to sum the values in count based on their category. The ultimate data frame should look like this:
> df_ultimate
name count category
1 A 24 1
2 A 17 2
3 A 30 3
4 A NA 4
5 B 7 1
6 B NA 2
7 B 18 3
8 B 14 4
9 C 22 1
10 C 12 2
11 C 13 3
12 C NA 4
I have tried to play around with summarise and group_by but without much success.
Thanks for your help
With cut + complete:
library(dplyr)
library(tidyr)
df %>%
group_by(name, category = cut(time, breaks = c(-Inf, 10, 20, 30, Inf), labels = 1:4)) %>%
summarise(count = sum(count)) %>%
complete(category)
# # Groups: name [3]
# name category count
# 1 A 1 24
# 2 A 2 17
# 3 A 3 30
# 4 A 4 NA
# 5 B 1 7
# 6 B 2 NA
# 7 B 3 18
# 8 B 4 14
# 9 C 1 22
# 10 C 2 12
# 11 C 3 13
# 12 C 4 NA
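If the intermediate df_final with an explicit category column is also wanted, here is a hedged sketch using case_when (an editor's addition, not part of the original answer; it assumes the same breaks as the cut() call above):
library(dplyr)
df_final <- df %>%
  mutate(category = case_when(
    time <= 10 ~ 1,
    time <= 20 ~ 2,
    time <= 30 ~ 3,
    TRUE       ~ 4
  ))
The cut() version and this case_when() version should assign the same categories; cut() is simply more compact when the rule is a set of ordered breaks.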

R, How to generate additional observations denoted by numbered sequence

I'm currently a bit stuck, since I'm a bit unsure of how to even formulate my problem.
What I have is a dataframe of observations with a few variables.
Let's say my initial dataset is:
test <- data.frame(var1 = c("a", "b"), var2 = c(15, 12))
What I want to end up with is something like:
test2 <- data.frame(var1_p = c("a","a","a","a","a","b","b","b","b","b"),
                    var2 = c(15,15,15,15,15,12,12,12,12,12),
                    var3 = c(1,2,3,4,5,1,2,3,4,5))
However, the initial observation count and the fact that I need the numbering to run from 0 to 9 make it rather tedious to do by hand.
Does anybody have a nice alternative solution?
Thank you.
What I tried so far was:
a)
testdata$C <- 0
testdata <- for (i in testdata$Combined_Number) {add_row(testdata,C=seq(0,9))}
which results in the dataset to be empty.
b)
testdata$C <- with(testdata, ave(Combined_Number,flur, FUN = seq(0,9)))
which gives the following error code:
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'FUN' of mode 'function' was not found
Perhaps crossing helps
library(tidyr)
crossing(df, var3 = 0:9)
-output
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9
With dplyr this is one approach
library(dplyr)
df %>%
group_by(var1) %>%
summarize(var2, var3 = 0:9, .groups="drop")
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9
Data
df <- structure(list(var1 = c("a", "b"), var2 = c(15, 12)), class = "data.frame", row.names = c(NA,
-2L))
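For reference, a base R sketch of the same expansion (an editor's assumption, not from the original answers): merge() with no common columns returns the Cartesian product of the two data frames.
test2 <- merge(test, data.frame(var3 = 0:9))     # cross join: every row of test paired with 0:9
test2 <- test2[order(test2$var1, test2$var3), ]  # reorder to match the grouped output above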

calculate difference between rows, but keep the raw value by group

I have a dataframe with cumulative values by groups that I need to recalculate back to raw values. The lag function works pretty well here, but instead of the first number of each group I get back either NA or the difference across two groups.
How can I get the first number in each group instead of the NA values or the between-group differences?
My dummy data:
# make example
df <- data.frame(id = rep(1:3, each = 5),
hour = rep(1:5, 3),
value = sample(1:15))
First I calculate cumulative values, then convert them back to raw values, i.e. value should equal valBack. The suggestion mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) just replaces the first (NA) value with the correct value, but it does not work for the first number of each group:
df %>%
group_by(id) %>%
dplyr::mutate(cumsum = cumsum(value)) %>%
mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) # skip the first value in a lag vector
Which results:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10 # this works
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 -32 # here the new group starts; the number should be 12, instead it is -32
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 -45 # here it should be 2 instead of -45
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
I want a safe calculation that makes valBack equal to value. (Of course, in real data I don't have the value column, just the cumsum column.)
Try:
library(dplyr)
df %>%
group_by(id) %>%
mutate(
cumsum = cumsum(value),
valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])
)
Giving:
# A tibble: 15 x 5
# Groups: id [3]
id hour value cumsum valBack
<int> <int> <int> <int> <int>
1 1 1 10 10 10
2 1 2 13 23 13
3 1 3 8 31 8
4 1 4 4 35 4
5 1 5 9 44 9
6 2 1 12 12 12
7 2 2 14 26 14
8 2 3 5 31 5
9 2 4 15 46 15
10 2 5 1 47 1
11 3 1 2 2 2
12 3 2 3 5 3
13 3 3 6 11 6
14 3 4 11 22 11
15 3 5 7 29 7
While the accepted answer works, it is more complicated than it needs to be. If you look at the lag function, you will see that it has more arguments:
dplyr::lag(x, n = 1L, default = NA, order_by = NULL, ...)
Here we can use default and set it to 0 to get the desired output:
library(dplyr)
df %>%
group_by(id) %>%
mutate(cumsum = cumsum(value),
rawdata = cumsum - lag(cumsum, default = 0))
#> # A tibble: 15 x 5
#> # Groups: id [3]
#> id hour value cumsum rawdata
#> <int> <int> <int> <int> <dbl>
#> 1 1 1 2 2 2
#> 2 1 2 1 3 1
#> 3 1 3 13 16 13
#> 4 1 4 15 31 15
#> 5 1 5 10 41 10
#> 6 2 1 3 3 3
#> 7 2 2 8 11 8
#> 8 2 3 4 15 4
#> 9 2 4 12 27 12
#> 10 2 5 11 38 11
#> 11 3 1 14 14 14
#> 12 3 2 6 20 6
#> 13 3 3 5 25 5
#> 14 3 4 7 32 7
#> 15 3 5 9 41 9
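A base R sketch of the same recovery step (an editor's assumption, not from the thread), using ave() with diff() so the first value of each group is kept as-is; only the second line is needed when the data already contain just the cumsum column.
df$cumsum  <- ave(df$value, df$id, FUN = cumsum)                          # per-group cumulative sums
df$valBack <- ave(df$cumsum, df$id, FUN = function(x) c(x[1], diff(x)))   # per-group first differences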

Partition groups of data by group

I have the following dataset:
df<- as.data.frame(c(rep("a", times = 9), rep("b", times = 18), rep("c", times = 27)))
colnames(df)<-"Location"
Year<-c(rep(1:3,times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
df$Year<-Year
library(dplyr)
df <- df %>%
  mutate(Predictor = seq_along(Location)) %>%
  ungroup()
print(df)
Location Year Predictor
a 1 1
a 2 2
a 3 3
a 1 4
a 2 5
a 3 6
a 1 7
a 2 8
a 3 9
b 1 10
b 2 11
b 3 12
b 4 13
b 5 14
... 40 more rows
I want to split the above dataframe into training and test sets. For the test set, I want to randomly sample a third of the number of years in each Location, while keeping the years together. So if year "1" is selected for location "a", I want all three "1's" in the test set and so on. My test set should look something like this:
Location Year Predictor
a 1 1
a 1 4
a 1 7
b 3 12
b 3 18
b 3 24
b 5 14
b 5 20
b 5 26
c 3 30
c 3 39
c 3 48
c 6 33
c 6 42
c 6 51
c 7 34
c 7 43
c 7 52
I found a similar question here, but this procedure would sample the same year and the same number of years from every location (and YEAR is numeric, not a factor). I want a different random sample of years from each location and a proportional number of samples.
Would like to do this in dplyr if possible
You can first create a distinct set of year/location combinations, then sample some of them for each location and use that in a semi_join on the original data. This could be done as:
df %>%
distinct(Location, Year) %>%
group_by(Location) %>%
sample_frac(.3) %>%
semi_join(df, .)
# Location Year Predictor
# 1 a 3 3
# 2 a 3 6
# 3 a 3 9
# 4 b 4 13
# 5 b 4 19
# 6 b 4 25
# 7 b 5 14
# 8 b 5 20
# 9 b 5 26
# 10 c 8 35
# 11 c 8 44
# 12 c 8 53
# 13 c 1 28
# 14 c 1 37
# 15 c 1 46
# 16 c 2 29
# 17 c 2 38
# 18 c 2 47
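In current dplyr, sample_frac() is superseded; a hedged equivalent of the same approach using slice_sample() (assumes dplyr >= 1.0.0):
library(dplyr)
df %>%
  distinct(Location, Year) %>%
  group_by(Location) %>%
  slice_sample(prop = 1/3) %>%                  # sample a third of the distinct years per location
  semi_join(df, ., by = c("Location", "Year"))  # keep all rows of df for the sampled location/year pairs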

R - calculating the average value of a dataframe column from the top row to bottom row

The title may not be that clear, since it was difficult to summarize the problem in a few words, although I don't think the problem is that difficult to solve. To explain the problem, let me share a dataframe for reference:
head(df, n = 10)
team score
1 A 10
2 A 4
3 A 10
4 A 16
5 A 20
6 B 5
7 B 11
8 B 8
9 B 16
10 B 5
I'd like to add a third column that calculates the running average score for each team, updating as I go down the rows and resetting at a new team. For example, the output column I am hoping for would look like this:
head(df, n = 10)
team score avg_score
1 A 10 10
2 A 4 7
3 A 10 8
4 A 16 10
5 A 20 12
6 B 5 5
7 B 11 8
8 B 8 8
9 B 16 10
10 B 5 9
# row1: 10 = 10
# row2: 7 = (10 + 4)/2
# row3: 8 = (10 + 4 + 10)/3
# ...
with the pattern following, and the calculation restarting for a new team.
Thanks,
library("data.table")
setDT(df)[, `:=` (avg_score = cumsum(score)/1:.N), by = team]
or, more readably, as per the comment by @snoram:
setDT(df)[, avg_score := cumsum(score)/(1:.N), by = team]
# team score avg_score
# 1: A 10 10
# 2: A 4 7
# 3: A 10 8
# 4: A 16 10
# 5: A 20 12
# 6: B 5 5
# 7: B 11 8
# 8: B 8 8
# 9: B 16 10
# 10: B 5 9
Here's a base R solution
df$avg_score <- unlist(tapply(df$score, df$team, function(x) cumsum(x)/seq_along(x)))
> df
team score avg_score
1 A 10 10
2 A 4 7
3 A 10 8
4 A 16 10
5 A 20 12
6 B 5 5
7 B 11 8
8 B 8 8
9 B 16 10
10 B 5 9
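Note that the unlist(tapply(...)) assignment assumes the rows are already ordered by team (in factor-level order). A base R variant with ave(), which returns results in the original row order, avoids that assumption (an editor's sketch, not from the thread):
df$avg_score <- ave(df$score, df$team, FUN = function(x) cumsum(x) / seq_along(x))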
We can use cummean from dplyr (as @aosmith also noted in a comment - assuming they are not posting it as a solution)
library(dplyr)
df %>%
group_by(team) %>%
mutate(avg_score = cummean(score))
# team score avg_score
# <chr> <int> <dbl>
#1 A 10 10
#2 A 4 7
#3 A 10 8
#4 A 16 10
#5 A 20 12
#6 B 5 5
#7 B 11 8
#8 B 8 8
#9 B 16 10
#10 B 5 9
