Trying to label sequentially within groups of a dataframe in R

I have a subset of my dataframe:
df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804,
2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204))
where 2017010104 corresponds to the date 01/01/2017 and so on. I am trying to label these different quad_ids sequentially with reference to the year. So for example I am trying to get the output:
df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804,
2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204),
Snum = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1, 2, 3, 4))
where you can see for retailer_id = 2, store_id = 167, the weeks for the year 2017 are labeled 1-14 and then when the week begins with 2018 it starts counting sequentially from 1 again until it will reach a week that starts with 2019 within this grouping.
I tried:
DT <- data.table(df)
DT[, Snum := seq_len(.N), by = list(retailer_id, store_id)]
However, this does not label sequentially by year; it labels sequentially by store_id. Is there a way to fix this? (This example only shows two retailers and two stores, whereas my actual dataframe has hundreds of different retailers and stores.)

Here's a solution using tidyverse
df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804,
2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204))
library(tidyverse)
getYear = function(x) {
x %>%
str_extract("^\\d{4}") %>%
as.integer() %>%
return()
}
tmp = df %>%
mutate(year = getYear(quad_id)) %>%
group_by(year, retailer_id, store_id) %>%
mutate(Snum = 1:n())
> tmp
# A tibble: 20 x 5
# Groups: year, retailer_id, store_id [3]
retailer_id store_id quad_id year Snum
<dbl> <dbl> <dbl> <int> <int>
1 1 166 2017010104 2017 1
2 1 166 2017012904 2017 2
3 2 167 2017010104 2017 1
4 2 167 2017012904 2017 2
5 2 167 2017022604 2017 3
6 2 167 2017032604 2017 4
7 2 167 2017042304 2017 5
8 2 167 2017052104 2017 6
9 2 167 2017061804 2017 7
10 2 167 2017071604 2017 8
11 2 167 2017081304 2017 9
12 2 167 2017091004 2017 10
13 2 167 2017100804 2017 11
14 2 167 2017110504 2017 12
15 2 167 2017120304 2017 13
16 2 167 2017123104 2017 14
17 2 167 2018012804 2018 1
18 2 167 2018022504 2018 2
19 2 167 2018032504 2018 3
20 2 167 2018042204 2018 4
Note that the numbering follows row order within each group, so if your data isn't already sorted by retailer_id, store_id and quad_id, sort it first.
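For comparison, the same grouping can be sketched in base R with ave(). This is only an illustration: it assumes quad_id is always a 10-digit number whose first four digits are the year, so the year is taken by integer division rather than a regex, and it sorts the rows first so the counter follows date order.

```r
df <- data.frame(
  retailer_id = c(1, 1, rep(2, 18)),
  store_id    = c(166, 166, rep(167, 18)),
  quad_id     = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604,
                  2017032604, 2017042304, 2017052104, 2017061804, 2017071604,
                  2017081304, 2017091004, 2017100804, 2017110504, 2017120304,
                  2017123104, 2018012804, 2018022504, 2018032504, 2018042204)
)

# Year = leading four digits of the 10-digit quad_id (assumption about the format).
df$year <- df$quad_id %/% 1e6

# Sort so the counter follows date order inside each retailer/store/year group.
df <- df[order(df$retailer_id, df$store_id, df$quad_id), ]

# ave() applies seq_along() separately within each combination of the grouping columns.
df$Snum <- ave(seq_len(nrow(df)), df$retailer_id, df$store_id, df$year,
               FUN = seq_along)
```

With the sample data this restarts the count at 1 for store 167 when the year changes from 2017 to 2018.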

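We could use str_match from the stringr package with the regex '^[[:digit:]]{4}' to match the first four digits. Note that this groups by year only, so the counter runs across retailers and stores (rows 1 to 16 in the output below); add retailer_id and store_id to group_by() if the count should restart per store: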
library(dplyr)
library(stringr)
df %>%
group_by(Snum = str_match(quad_id, '^[[:digit:]]{4}')) %>%
mutate(Snum = row_number())
output:
retailer_id store_id quad_id Snum
<dbl> <dbl> <dbl> <int>
1 1 166 2017010104 1
2 1 166 2017012904 2
3 2 167 2017010104 3
4 2 167 2017012904 4
5 2 167 2017022604 5
6 2 167 2017032604 6
7 2 167 2017042304 7
8 2 167 2017052104 8
9 2 167 2017061804 9
10 2 167 2017071604 10
11 2 167 2017081304 11
12 2 167 2017091004 12
13 2 167 2017100804 13
14 2 167 2017110504 14
15 2 167 2017120304 15
16 2 167 2017123104 16
17 2 167 2018012804 1
18 2 167 2018022504 2
19 2 167 2018032504 3
20 2 167 2018042204 4

Related

Mix "color_bar" and "style" in formattable package

I'm using the formattable package and I want to customize my table, but I can't get it to look the way I want.
Here is my table
structure(list(PJ = c(4, 4, 4, 4, 4, 4), V = c(4, 2, 2, 2, 1,
1), E = c(0, 0, 0, 0, 0, 0), D = c(0, 2, 2, 2, 3, 3), GF = c(182,
91, 92, 185, 126, 119), GC = c(84, 143, 144, 115, 141, 168),
Dif = c(98, -52, -52, 70, -15, -49), Pts = c(12, 6, 6, 6,
3, 3)), class = "data.frame", row.names = c("Player1", "Player2",
"Player3", "Player4", "Player5", "Player6"))
It looks like this:
PJ V E D GF GC Dif Pts
Player1 4 4 0 0 182 84 98 12
Player2 4 2 0 2 91 143 -52 6
Player3 4 2 0 2 92 144 -52 6
Player4 4 2 0 2 185 115 70 6
Player5 4 1 0 3 126 141 -15 3
Player6 4 1 0 3 119 168 -49 3
If I want the column GF in bold, I use
formattable(TAB.df, list(
GF = formatter("span",style = style("font.weight"="bold"))
))
If I want a color_bar I run this code:
formattable(TAB.df, list(
GF = color_bar("lightgreen")
))
Nevertheless, I don't know how to combine them and get the "color_bar" with "bold" numbers.

Matching values in different datasets by groups in R

I have the following two datasets:
df1 <- data.frame(
"group" = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
"numbers" = c(55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60))
df2 <- data.frame(
"group" = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
"P1" = c(55, NA, 60, 55, 75, 75, 55, 55, 60),
"P2" = c(55, 75, 55, 60, NA, 75, 55, NA, 60),
"P3" = c(75, 55, 60, 75, NA, 75, 60, 55, 60))
In df1 each group has the same three numbers (in reality there are around 500 numbers).
I want to check whether the values in the column "numbers" in df1 are contained in the columns P1, P2, and P3 of df2. I am stuck on two problems: (1) the values in the numbers column of df1 can occur in different groups in df2 (groups are defined by the group column in df1 and df2), and (2) the datasets have different lengths. Is there a way to merge both datasets to get the following dataset:
df3 <- data.frame(
"group" = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
"numbers" = c(55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60),
"P1new" = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1),
"P2new" = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1),
"P3new" = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1))
where P1new (P2new and P3new respectively) contain the value 1 if df2$P1 contains the value in df1$numbers within the correct group (as I said numbers can reoccur in different groups). For example, P3 has the value 75 in group 1 but not in group 5. So in group 1 P3new would have a 1 and in group 5 P3new would have a 0.
This question is similar to Find matching values in different datasets by groups in R
but I could not adapt the code according to my objectives. So, I would really appreciate any help.
Interesting question. Here's a way with dplyr functions:
library(dplyr)
df2 %>%
group_by(group) %>%
summarise(across(P1:P3, ~ list(unique(na.omit(.x))))) %>%
inner_join(df1, .) %>%
rowwise() %>%
mutate(across(P1:P3, ~ +(numbers %in% .x)))
group numbers P1 P2 P3
<dbl> <dbl> <int> <int> <int>
1 1 55 1 1 1
2 1 75 0 1 1
3 1 60 0 0 0
4 2 55 1 1 0
5 2 75 1 0 1
6 2 60 1 1 1
7 3 55 1 1 0
8 3 75 1 1 1
9 3 60 0 0 1
10 4 55 1 0 1
11 4 75 0 0 0
12 4 60 0 0 0
13 5 55 0 0 0
14 5 75 0 0 0
15 5 60 1 1 1
Another possible solution:
library(tidyverse)
map_dfc(names(df2[-1]),
~ df1 %>%
group_by(group) %>%
mutate(!!.x := +(numbers %in% df2[df2$group == cur_group()$group, .x])) %>% # compare the group value, not its index
ungroup %>%
select(all_of(.x))) %>%
bind_cols(df1, .)
#> group numbers P1 P2 P3
#> 1 1 55 1 1 1
#> 2 1 75 0 1 1
#> 3 1 60 0 0 0
#> 4 2 55 1 1 0
#> 5 2 75 1 0 1
#> 6 2 60 1 1 1
#> 7 3 55 1 1 0
#> 8 3 75 1 1 1
#> 9 3 60 0 0 1
#> 10 4 55 1 0 1
#> 11 4 75 0 0 0
#> 12 4 60 0 0 0
#> 13 5 55 0 0 0
#> 14 5 75 0 0 0
#> 15 5 60 1 1 1
Or, without purrr, another possibility:
library(dplyr)
df1 %>%
inner_join(df2) %>%
group_by(group) %>%
mutate(across(starts_with("P"), ~ +(numbers %in% .x))) %>%
ungroup %>%
distinct
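The same membership check can also be sketched in base R without any packages. The idea, offered as an illustration rather than a replacement for the answers above, is to build a "group number" key for each row of df1 and test it against the corresponding keys built from each P column of df2. The key helper names (key1, df3) are made up for this sketch.

```r
df1 <- data.frame(
  group   = rep(1:5, each = 3),
  numbers = rep(c(55, 75, 60), times = 5)
)
df2 <- data.frame(
  group = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
  P1 = c(55, NA, 60, 55, 75, 75, 55, 55, 60),
  P2 = c(55, 75, 55, 60, NA, 75, 55, NA, 60),
  P3 = c(75, 55, 60, 75, NA, 75, 60, 55, 60)
)

# One key per df1 row, e.g. "1 55"; an NA in df2 becomes "1 NA" and never matches.
key1 <- paste(df1$group, df1$numbers)

df3 <- df1
for (col in c("P1", "P2", "P3")) {
  # 1 if this (group, number) pair occurs in that column of df2, else 0
  df3[[col]] <- as.integer(key1 %in% paste(df2$group, df2[[col]]))
}
```

Because the comparison is done on group-qualified keys, a number that appears in P3 for group 1 but not for group 5 is flagged only in group 1.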

Find maximum in a group, subset by a subset from a different dataframe, to select other value's

I have two data.frames: df1 holds the raw data, and df2 has information on where to look in df1.
df1 has groups, defined by "id". Within each group, a subset is defined by df2$value_a1 and df2$value_a2, which give the range of value_a to look at. In that subset I want to find the row with the maximum value_b, and select value_a and value_c from it.
code for df1 and df2
df1 <- data.frame("id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88), "value_b" = c(100, 101, 100, 95, 90, 104, 88, 84, 75, 110, 105, 106, 104, 95, 109, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97), "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0), "value_d" = c(1:27))
df2 <- data.frame("id" = c(1, 2, 3), "value_a1" = c(21, 33, 16), "value_a2" = c(69, 75, 60))
This is df1
id value_a value_b value_c value_d
1 1 0 100 0 1
2 1 10 101 -1 2
3 1 21 100 -2 3
4 1 30 95 -2 4
5 1 43 90 -2 5
6 1 53 104 -2 6
7 1 69 88 -1 7
8 1 81 84 -1 8
9 1 93 75 0 9
10 2 5 110 0 10
11 2 16 105 0 11
12 2 27 106 0 12
13 2 33 104 1 13
14 2 45 95 1 14
15 2 61 109 2 15
16 2 75 96 2 16
17 2 90 89 1 17
18 3 2 104 -1 18
19 3 11 104 0 19
20 3 16 104 0 20
21 3 24 103 1 21
22 3 31 106 1 22
23 3 40 103 2 23
24 3 47 101 2 24
25 3 60 99 1 25
26 3 75 98 1 26
27 3 88 97 0 27
This is df2
id value_a1 value_a2
1 1 21 69
2 2 33 75
3 3 16 60
My result would be df3, which would look like this
id value_a value_c
1 1 53 -2
2 2 61 2
3 3 31 1
I wrote this code to show my line of thinking.
df3 <- df1 %>%
group_by(id) %>%
filter(value_a >= df2$value_a1 & value_a <= df2$value_a2) %>%
filter(value_a == max(value_a)) %>%
pull(value_b)
This, however, generates a vector with three entries:
[1] 88 95 99
These are not the maximum values of value_b...
Perhaps by() would work, but this gets stuck on using a function on two different df's.
It feels like I'm almost there, but still far away...
You can try this. I hope this helps.
library(dplyr)
df1 %>%
left_join(df2) %>%
mutate(val = ifelse(value_a > value_a1 & value_a < value_a2, value_b, NA)) %>%
group_by(id) %>%
summarise(val = max(val, na.rm = TRUE))
# A tibble: 3 x 2
id val
<dbl> <dbl>
1 1 104
2 2 109
3 3 106
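The result above gives the maximum value_b per id, while the question asks for value_a and value_c from that row. A minimal base R sketch of that last step, assuming the range is inclusive (as in the asker's own filter) and that ties in value_b are acceptable as multiple rows:

```r
df1 <- data.frame(
  id = rep(c(1, 2, 3), times = c(9, 8, 10)),
  value_a = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90,
              2, 11, 16, 24, 31, 40, 47, 60, 75, 88),
  value_b = c(100, 101, 100, 95, 90, 104, 88, 84, 75, 110, 105, 106, 104, 95,
              109, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97),
  value_c = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1,
              -1, 0, 0, 1, 1, 2, 2, 1, 1, 0),
  value_d = 1:27
)
df2 <- data.frame(id = c(1, 2, 3), value_a1 = c(21, 33, 16), value_a2 = c(69, 75, 60))

# Attach each id's range to its rows, then keep only rows inside that range.
m <- merge(df1, df2, by = "id")
in_range <- m[m$value_a >= m$value_a1 & m$value_a <= m$value_a2, ]

# Within each id, keep the row(s) whose value_b equals the group maximum.
is_max <- in_range$value_b == ave(in_range$value_b, in_range$id, FUN = max)
df3 <- in_range[is_max, c("id", "value_a", "value_c")]
df3 <- df3[order(df3$id), ]
```

On the sample data this selects value_a 53, 61 and 31 with value_c -2, 2 and 1, which matches the df3 the question asks for.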
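Why don't you try a merge?
With data.table syntax (both inputs are converted to data.tables first, since the filter-and-group call below does not work on a plain data.frame):
library(data.table)
df3 <- merge(as.data.table(df1), as.data.table(df2), by = "id", all.x = TRUE)
max_values <- df3[value_a > value_a1 & value_a < value_a2, max(value_b), by = "id"]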
max_values
# id V1
# 1: 1 104
# 2: 2 109
# 3: 3 106
I would do this using the data.table package, since that is what I'm used to:
library(data.table)
dt.1 <- data.table("id" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), "value_a" = c(0, 10, 21, 30, 43, 53, 69, 81, 93, 5, 16, 27, 33, 45, 61, 75, 90, 2, 11, 16, 24, 31, 40, 47, 60, 75, 88), "value_b" = c(100, 101, 100, 95, 90, 104, 88, 84, 75, 110, 105, 106, 104, 95, 109, 96, 89, 104, 104, 104, 103, 106, 103, 101, 99, 98, 97), "value_c" = c(0, -1, -2, -2, -2, -2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 2, 1, -1, 0, 0, 1, 1, 2, 2, 1, 1, 0), "value_d" = c(1:27))
dt.2 <- data.table("id" = c(1, 2, 3), "value_a1" = c(21, 33, 16), "value_a2" = c(69, 75, 60))
dt.3 <- dt.1[id %in% dt.2[,id],max(value_b), by="id"]
setnames(dt.3, "V1", "max_value_b")
dt.3
There are several ways to get the row where value_b is at its maximum. Here's one where I only modified a line from the previous code:
dt.1[id %in% dt.2[,id],.SD[which.max(value_b), .(value_a, value_b, value_c, value_d)], by="id"]
.SD is the sub-table for each group defined by by, so for each id which.max() picks the row with the largest value_b, and .() (an alias for list) lists the columns to return from that row.
Perhaps a more readable approach is to first select the desired rows. .I gives row numbers in the full table, which is what the second line needs; a bare which.max(value_b) would return positions within each group:
max.b.rows <- dt.1[id %in% dt.2[,id], .I[which.max(value_b)], by="id"][,V1]
dt.3 <- dt.1[max.b.rows,]
BTW, the id %in% dt.2[,id] part is only there to make sure you select maxima for those ids in table 2. Note that these snippets take the maximum over each whole id group and do not apply the value_a1/value_a2 range.

Averaging dataframe based on current row-value and preceeding rows

I have a simple data set with the following form
df<- data.frame(c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),
c(80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90, 80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90),
c(1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60, 25, 75, 20, 40, 5, 5, 2, 4, 6, 5, 2, 1, 2, 3, 4, 6, 2, 7, 2, 4))
colnames(df)<-c("car_number", "year", "marker", "val")
What I am trying to do is quite simple, actually: per car_number, I want to find the average of the values associated with a marker value and the preceding 3 marker values. So for the example data above the output I want is
car=10, year=80 1: 50
car=10, year=80 2: 40
car=10, year=80 3: 45
car=10, year=80 4: 37.5
car=10, year=90 1: 31.25
car=10, year=90 2: 36.25
car=10, year=90 3: 35
car=10, year=90 4: 38.75
car=20, year=80 1: 5
car=20, year=80 2: 4
car=20, year=80 3: 4.5
car=20, year=80 4: 3.75
car=20, year=90 1: 3.125
car=20, year=90 2: 3.625
car=20, year=90 3: 3.375
car=20, year=90 4: 3.750
Note that for simplicity of the example the markers above come in pairs of two. That is not the case with the real data, so I am thinking a general solution will contain some sort of group_by (?)
Any efficient solution is welcome!
Here is a second example data set and output:
df<- data.frame(c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),
c(80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90, 80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90),
c(1, 2, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 3, 4, 1, 1, 1, 2, 3, 3, 4, 4, 4, 1, 2, 2, 3, 3, 3, 4),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60, 25, 75, 20, 40, 5, 5, 2, 4, 6, 5, 2, 1, 2, 3, 4, 6, 2, 7, 2, 4))
colnames(df)<-c("car_number", "year", "marker", "val")
And the output is (based on the rules above)
car=10, year=80 1: Mean[{25}] = 25
car=10, year=80 2: Mean[{40, 20, 75, 25}] = 40
car=10, year=80 3: Mean[{50, 60, 40, 20, 75, 25}] = 45
car=10, year=80 4: Mean[{10, 20, 50, 60, 40, 20, 75, 25}] = 37.5
car=10, year=90 1: Mean[{30, 20, 10, 20, 50, 60, 40, 20, 75}] = 36.11
car=10, year=90 2: Mean[{60, 40, 30, 20, 10, 20, 50, 60}] = 36.25
car=10, year=90 3: Mean[{20, 75, 25, 60, 40, 30, 20, 10, 20}] = 33.33
car=10, year=90 4: Mean[{40, 20, 75, 25, 60, 40, 30, 20}] = 38.75
car=20, year=80 1: Mean[{2, 5, 5}] = 4
car=20, year=80 2: Mean[{4, 2, 5, 5}] = 4
car=20, year=80 3: Mean[{5, 6, 4, 2, 5, 5}] = 4.5
car=20, year=80 4: Mean[{2, 1, 2, 5, 6, 4, 2, 5, 5}] = 3.55
car=20, year=90 1: Mean[{3, 2, 1, 2, 5, 6, 4}] = 3.29
car=20, year=90 2: Mean[{6, 4, 3, 2, 1, 2, 5, 6}] = 3.625
car=20, year=90 3: Mean[{2, 7, 2, 6, 4, 3, 2, 1, 2}] = 3.22
car=20, year=90 4: Mean[{4, 2, 7, 2, 6, 4, 3}] = 4
A first group_by computes the mean by car_number, year, marker, and retains its weight (number of rows).
A second group_by by car_number allows us to retrieve lagging means and weights to compute the desired mean:
library(purrr)
library(dplyr)
df %>%
arrange(car_number, year, marker) %>%
group_by(car_number, year, marker) %>%
summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>%
group_by(car_number) %>%
mutate(mean_2 = pmap_dbl(
list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
weight, lag(weight), lag(weight, 2), lag(weight, 3)),
~ weighted.mean(c(..1, ..2, ..3, ..4),
c(..5, ..6, ..7, ..8),
na.rm = TRUE)
)) %>%
ungroup()
Result:
# # A tibble: 16 × 6
# car_number year marker mean_1 weight mean_2
# <dbl> <dbl> <dbl> <dbl> <int> <dbl>
# 1 10 80 1 50.0 2 50.000
# 2 10 80 2 30.0 2 40.000
# 3 10 80 3 55.0 2 45.000
# 4 10 80 4 15.0 2 37.500
# 5 10 90 1 25.0 2 31.250
# 6 10 90 2 50.0 2 36.250
# 7 10 90 3 50.0 2 35.000
# 8 10 90 4 30.0 2 38.750
# 9 20 80 1 5.0 2 5.000
# 10 20 80 2 3.0 2 4.000
# 11 20 80 3 5.5 2 4.500
# 12 20 80 4 1.5 2 3.750
# 13 20 90 1 2.5 2 3.125
# 14 20 90 2 5.0 2 3.625
# 15 20 90 3 4.5 2 3.375
# 16 20 90 4 3.0 2 3.750
Edit: Alternative syntax for purrr versions prior to 0.2.2.9000:
df %>%
arrange(car_number, year, marker) %>%
group_by(car_number, year, marker) %>%
summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>%
group_by(car_number) %>%
mutate(mean_2 = pmap_dbl(
list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
weight, lag(weight), lag(weight, 2), lag(weight, 3)),
function(a, b, c, d, e, f, g, h)
weighted.mean(c(a, b, c, d),
c(e, f, g, h),
na.rm = TRUE)
)) %>%
ungroup()
Just throwing a base R solution in the mix. We can make a custom function using Reduce with accumulate = TRUE and tail(x, 4) to ensure that only the current value and the 3 preceding ones are included. All this comes after we average the data set by car_number, year, marker, i.e.
f1 <- function(x){
sapply(Reduce(c, x, accumulate = TRUE), function(i) mean(tail(i,4)))
}
dd <- aggregate(val ~ car_number+year+marker, df, mean)
dd <- dd[order(dd$car_number, dd$year, dd$marker),]
dd$new_avg <- with(dd, ave(val, car_number, FUN = f1))
dd
# car_number year marker val new_avg
#1 10 80 1 50.0 50.000
#5 10 80 2 30.0 40.000
#9 10 80 3 55.0 45.000
#13 10 80 4 15.0 37.500
#3 10 90 1 25.0 31.250
#7 10 90 2 50.0 36.250
#11 10 90 3 50.0 35.000
#15 10 90 4 30.0 38.750
#2 20 80 1 5.0 5.000
#6 20 80 2 3.0 4.000
#10 20 80 3 5.5 4.500
#14 20 80 4 1.5 3.750
#4 20 90 1 2.5 3.125
#8 20 90 2 5.0 3.625
#12 20 90 3 4.5 3.375
#16 20 90 4 3.0 3.750
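The version above averages the per-marker means, which matches the first example because every marker there has exactly two rows. When markers have different numbers of rows, as in the second example, a weighted variant is needed. The following base R sketch keeps a sum and a row count per marker and takes a rolling total over the current and three preceding markers. For brevity it uses only the car_number 10 rows of the second example, and roll4 is a helper name invented for this illustration.

```r
df <- data.frame(
  car_number = rep(10, 16),
  year   = rep(c(80, 90), each = 8),
  marker = c(1, 2, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 3, 4),
  val    = c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60, 25, 75, 20, 40)
)

# Sum and number of rows per car/year/marker (both calls return the same row order).
s <- aggregate(val ~ car_number + year + marker, df, sum)
w <- aggregate(val ~ car_number + year + marker, df, length)
agg <- data.frame(s[c("car_number", "year", "marker")],
                  s = s$val, w = as.numeric(w$val))
agg <- agg[order(agg$car_number, agg$year, agg$marker), ]

# Rolling total over the current and up to three preceding entries:
# cumulative sum minus the cumulative sum lagged by four (zero-padded).
roll4 <- function(x) {
  cs <- cumsum(x)
  cs - c(rep(0, 4), cs)[seq_along(cs)]
}

# Apply the rolling totals within each car, then divide sums by counts.
agg$result <- ave(agg$s, agg$car_number, FUN = roll4) /
  ave(agg$w, agg$car_number, FUN = roll4)
```

For these rows the result column is 25, 40, 45, 37.5, 36.11, 36.25, 33.33 and 38.75, in line with the car=10 values listed in the question.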
Here is a method with data.table that modifies Frank's suggestion in David Arenburg's answer here.
# aggregate data by car_number, year, and marker
dfNew <- setDT(df)[, .(val=mean(val)), by=.(car_number, year, marker)]
# calculate average of current and previous three values
dfNew[, val := rowMeans(dfNew[,shift(val, 0:3), by=car_number][, -1], na.rm=TRUE)]
The first line is a standard aggregation call. The second line makes some changes to the rowMeans method in the linked answer. rowMeans is fed a data.table of the shifted values, where the shift occurs by car_number (thanks to sotos for catching this), which is chained to a statement that drops the first column (using -1), which is the car_number column returned in the first part of the chain.
this returns
car_number year marker val
1: 10 80 1 50.000
2: 10 80 2 40.000
3: 10 80 3 45.000
4: 10 80 4 37.500
5: 10 90 1 31.250
6: 10 90 2 36.250
7: 10 90 3 35.000
8: 10 90 4 38.750
9: 20 80 1 5.000
10: 20 80 2 4.000
11: 20 80 3 4.500
12: 20 80 4 3.750
13: 20 90 1 3.125
14: 20 90 2 3.625
15: 20 90 3 3.375
16: 20 90 4 3.750
You can do it this way:
df %>%
group_by(car_number, year, marker) %>%
summarise(s = sum(val), w = n()) %>% # sum and number of values
group_by(car_number) %>%
mutate(S = cumsum(s) - cumsum(lag(s, 4, default=0))) %>% # sum of last four s
mutate(W = cumsum(w) - cumsum(lag(w, 4, default=0))) %>% # same for the weights
mutate(result = S/W)
Output of your second example:
# Source: local data frame [16 x 8]
# Groups: car_number [2]
#
# car_number year marker s w S W result
# <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl>
# 1 10 80 1 25 1 25 1 25.000000
# 2 10 80 2 135 3 160 4 40.000000
# 3 10 80 3 110 2 270 6 45.000000
# 4 10 80 4 30 2 300 8 37.500000
# 5 10 90 1 50 2 325 9 36.111111
# 6 10 90 2 100 2 290 8 36.250000
# 7 10 90 3 120 3 300 9 33.333333
# 8 10 90 4 40 1 310 8 38.750000
# 9 20 80 1 12 3 12 3 4.000000
# 10 20 80 2 4 1 16 4 4.000000
# 11 20 80 3 11 2 27 6 4.500000
# 12 20 80 4 5 3 32 9 3.555556
# 13 20 90 1 3 1 23 7 3.285714
# 14 20 90 2 10 2 29 8 3.625000
# 15 20 90 3 11 3 29 9 3.222222
# 16 20 90 4 4 1 28 7 4.000000
Edit:
It's probably more efficient with package RcppRoll, you can try that: S = roll_sum(c(0, 0, 0, s), 4) (and same for W).
Taking df as your input, you can use dplyr and zoo and, grouping only by car_number, try:
library(dplyr)
library(zoo)
df %>%
group_by(car_number, year, marker) %>%
summarise(mm = mean(val)) %>%
group_by(car_number) %>%
mutate(rM=rollapply(mm, if_else(row_number() < 4, marker, 4), FUN=mean, align="right"))%>%
select(year, rM)
which gives:
Source: local data frame [16 x 3]
Groups: car_number [2]
car_number year rM
<dbl> <dbl> <dbl>
1 10 80 50.000
2 10 80 40.000
3 10 80 45.000
4 10 80 37.500
5 10 90 31.250
6 10 90 36.250
7 10 90 35.000
8 10 90 38.750
9 20 80 5.000
10 20 80 4.000
11 20 80 4.500
12 20 80 3.750
13 20 90 3.125
14 20 90 3.625
15 20 90 3.375
16 20 90 3.750

2 columns into list and sort in R

Let's say we have two lists
x <- c(1, 3, 4, 2, 6, 5)
y <- c(12, 14, 15, 61, 71, 21)
I want to combine them into a list with 2 columns, x and y, keeping the values in the same order.
x <- c(1, 3, 4, 2, 6, 5)
y <- c(12, 14, 15, 61, 71, 21)
Once I have that list, I want to sort it on y so the final list looks like
x <- c(1, 3, 4, 5, 2, 6)
y <- c(12, 14, 15, 21, 61, 71)
I am really new to R.
I tried list(x,y) but it seems to make
list(1, 3, 4, 2, 6, 5, 12, 14, 15, 61, 71, 21)
so I was wondering if someone could help me.
You need to put them in a data.frame first and then use order:
x <- c(1, 3, 4, 2, 6, 5)
y <- c(-12, 14, 15, 61, 71, 21)
DF <- data.frame(x, y)
> DF[order(DF$y),]
x y
1 1 -12
2 3 14
3 4 15
6 5 21
4 2 61
5 6 71
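To see why this works, it may help to look at what order() returns on its own. A small sketch using the vectors from the question (idx is just a name chosen here): order(y) gives the positions of y from smallest to largest, and indexing both vectors with those positions keeps the x and y values paired.

```r
x <- c(1, 3, 4, 2, 6, 5)
y <- c(12, 14, 15, 61, 71, 21)

# Positions of y in increasing order: the 6th element (21) comes before the 4th (61).
idx <- order(y)

# Apply the same permutation to both vectors so the pairs stay aligned.
x_sorted <- x[idx]
y_sorted <- y[idx]
```

Here idx is 1 2 3 6 4 5, so x_sorted is 1 3 4 5 2 6 and y_sorted is 12 14 15 21 61 71.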
Keeping it as a list, using lapply:
x <- c(1, 3, 4, 2,6,5)
y <- c(12, 14,15,61,71,21)
l <- list(x = x, y = y)
## thelatemail
lapply(l, `[`, order(l$y))
# $x
# [1] 1 3 4 5 2 6
#
# $y
# [1] 12 14 15 21 61 71
A more explicit version of the short one given by @thelatemail above, though it doesn't preserve the names:
lapply(seq_along(l), function(x) l[[x]][order(l$y)])
# [[1]]
# [1] 1 3 4 5 2 6
#
# [[2]]
# [1] 12 14 15 21 61 71
or rapply:
rapply(l, function(x) x[order(l$y)], how = 'list')
# $x
# [1] 1 3 4 5 2 6
#
# $y
# [1] 12 14 15 21 61 71
