Custom normalization by group in R - r

I have a dataframe that looks like this:
group1<-c(rep(1,12))
group2<-c(rep('Low',6), rep('High',6))
var <-c(1:6,1:6)
var1 <-c(2:13)
var2 <-c(20:31)
df1<-data.frame(group1,group2,var,var1,var2)
group1<-c(rep(2,12))
group2<-c(rep('Low',6), rep('High',6))
var <-c(1:6,1:6)
var1 <-c(2:13)
var2 <-c(20:31)
df2<-data.frame(group1,group2,var,var1,var2)
df<-rbind(df1,df2)
group1 group2 var var1 var2
1 1 Low 1 2 20
2 1 Low 2 3 21
3 1 Low 3 4 22
4 1 Low 4 5 23
5 1 Low 5 6 24
6 1 Low 6 7 25
7 1 High 1 8 26
8 1 High 2 9 27
9 1 High 3 10 28
10 1 High 4 11 29
11 1 High 5 12 30
12 1 High 6 13 31
13 2 Low 1 2 20
14 2 Low 2 3 21
15 2 Low 3 4 22
16 2 Low 4 5 23
17 2 Low 5 6 24
18 2 Low 6 7 25
19 2 High 1 8 26
20 2 High 2 9 27
21 2 High 3 10 28
22 2 High 4 11 29
23 2 High 5 12 30
24 2 High 6 13 31
I want to normalize my columns in the following way. For each combination of group1 and group2, I want to divide the var1 and var2 columns by their first element within that group. This lets me construct a common scale/index across the columns of interest. For example, for the combination group1=1 and group2=Low, the elements of var1 should be transformed into 2/2, 3/2, 4/2, 5/2, 6/2, 7/2 respectively; for the combination group1=1 and group2=High they should be 8/8, 9/8, 10/8, 11/8, 12/8, 13/8, and so on.
I want to apply the same transformation to both var1 and var2. The expected output should look like this:
group1 group2 var var1 var2 var1_tra var2_tra
1 1 Low 1 2 20 1.000 1.000000
2 1 Low 2 3 21 1.500 1.050000
3 1 Low 3 4 22 2.000 1.100000
4 1 Low 4 5 23 2.500 1.150000
5 1 Low 5 6 24 3.000 1.200000
6 1 Low 6 7 25 3.500 1.250000
7 1 High 1 8 26 1.000 1.000000
8 1 High 2 9 27 1.125 1.038462
9 1 High 3 10 28 1.250 1.076923
10 1 High 4 11 29 1.375 1.115385
11 1 High 5 12 30 1.500 1.153846
12 1 High 6 13 31 1.625 1.192308
13 2 Low 1 2 20 1.000 1.000000
14 2 Low 2 3 21 1.500 1.050000
15 2 Low 3 4 22 2.000 1.100000
16 2 Low 4 5 23 2.500 1.150000
17 2 Low 5 6 24 3.000 1.200000
18 2 Low 6 7 25 3.500 1.250000
19 2 High 1 8 26 1.000 1.000000
20 2 High 2 9 27 1.125 1.038462
21 2 High 3 10 28 1.250 1.076923
22 2 High 4 11 29 1.375 1.115385
23 2 High 5 12 30 1.500 1.153846
24 2 High 6 13 31 1.625 1.192308
NOTE: The numbers could be anything (usually positive real numbers), and because my dataframe is really big, I cannot know in advance which element I will need to divide by.

After grouping by 'group1' and 'group2', use mutate_at to divide each selected column by its first value within the group:
library(dplyr)
df %>%
group_by(group1, group2) %>%
mutate_at(vars(var1, var2), list(tra = ~ ./first(.)))
# A tibble: 24 x 7
# Groups: group1, group2 [4]
# group1 group2 var var1 var2 var1_tra var2_tra
# <dbl> <fct> <int> <int> <int> <dbl> <dbl>
# 1 1 Low 1 2 20 1 1
# 2 1 Low 2 3 21 1.5 1.05
# 3 1 Low 3 4 22 2 1.1
# 4 1 Low 4 5 23 2.5 1.15
# 5 1 Low 5 6 24 3 1.2
# 6 1 Low 6 7 25 3.5 1.25
# 7 1 High 1 8 26 1 1
# 8 1 High 2 9 27 1.12 1.04
# 9 1 High 3 10 28 1.25 1.08
#10 1 High 4 11 29 1.38 1.12
# … with 14 more rows
Or using data.table
nm1 <- c("var1", "var2")
nm2 <- paste0(nm1, "_tra")
library(data.table)
setDT(df)[, (nm2) := lapply(.SD, function(x) x/first(x)),
by = .(group1, group2), .SDcols = nm1]
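For comparison, the same per-group division can be sketched in base R with ave(), which applies a function within each combination of the grouping vectors. This is only an illustrative alternative to the two approaches above; the helper name by_first is made up for the example.

```r
# Rebuild the example data from the question
df <- data.frame(
  group1 = rep(c(1, 2), each = 12),
  group2 = rep(rep(c("Low", "High"), each = 6), times = 2),
  var    = rep(1:6, times = 4),
  var1   = rep(2:13, times = 2),
  var2   = rep(20:31, times = 2)
)

# Hypothetical helper: divide a vector by its first element
by_first <- function(x) x / x[1]

# ave() splits each column by group1 x group2 and applies by_first per group
for (v in c("var1", "var2")) {
  df[[paste0(v, "_tra")]] <- ave(as.numeric(df[[v]]), df$group1, df$group2,
                                 FUN = by_first)
}

head(df, 12)
```

Because the divisor is taken from inside each group, nothing about the values has to be known in advance.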

You can also use sqldf, as follows:
library(sqldf)
result <- sqldf('select df.*, (df.var1 + 0.0) / scale.s_var1 as var1_tra, (df.var2 + 0.0) / scale.s_var2 as var2_tra
from df join
(select group1, group2, min(var1) as s_var1, min(var2) as s_var2
from df
group by group1, group2) as scale
on df.group1 = scale.group1 AND df.group2 = scale.group2
')
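In the above code, we first find the minimum value of var1 and var2 within each group using the following query (note that this matches dividing by the first element only when the first element of each group is also its smallest, as in the example data):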
select group1, group2, min(var1) as s_var1, min(var2) as s_var2
from df
group by group1, group2
We then use that as a nested query and join it with the original data frame df on equality of group1 and group2.
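Dividing by the group minimum and dividing by the group's first element agree only when each group starts at its smallest value. A tiny sketch with a hypothetical vector (not taken from the question's data) shows how the two can differ:

```r
# Hypothetical group whose first element is not its minimum
x <- c(4, 2, 8)

by_first <- x / x[1]    # index relative to the first element
by_min   <- x / min(x)  # index relative to the smallest element

print(by_first)  # 1.0 0.5 2.0
print(by_min)    # 2 1 4
```

If the first element is what matters, the dplyr or data.table versions are the safer choice for unsorted data.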

Related

filter() rows from dataframe with condition on previous and next row, keeping NA values

I have a dataframe like this:
AA<-c(1,2,4,5,6,7,10,11,12,13,14,15)
BB<-c(32,21,21,NA,27,31,31,12,28,NA,48,7)
df<- data.frame(AA,BB)
I want to remove rows where the BB value is equal to both the previous and the next row, so that only the first and last occurrences of each run of values in column BB are kept. I also want to keep NA rows. I arrived at this code, which is close to what I want:
lighten_df <- df %>% filter(BB!=lag(BB) | BB!=lead(BB) | is.na(BB) )
which gives me:
> lighten_df
AA BB
1 1 32
2 2 21
3 5 NA
4 6 27
5 7 31
6 10 31
7 11 12
8 12 28
9 13 NA
10 14 48
11 15 7
My problem is that I would like to keep both the first and the last 21 value in column BB. This is the result I expect:
AA BB
1 1 32
2 2 21
3 4 21
4 5 NA
5 6 27
6 7 31
7 10 31
8 11 12
9 12 28
10 13 NA
11 14 48
12 15 7
Any idea?
I would suggest a different approach: define a grouping variable and keep the first and last rows within each group:
df %>%
group_by(grp = data.table::rleid(BB)) %>%
slice(unique(c(1, n())))
# # A tibble: 12 × 3
# # Groups: grp [10]
# AA BB grp
# <dbl> <dbl> <int>
# 1 1 32 1
# 2 2 21 2
# 3 4 21 2
# 4 5 NA 3
# 5 6 27 4
# 6 7 31 5
# 7 10 31 5
# 8 11 12 6
# 9 12 28 7
# 10 13 NA 8
# 11 14 48 9
# 12 15 7 10
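If data.table is not available, the run-id idea can be approximated in base R with rle(). The sketch below is an assumption-laden alternative, not the answer's method: the helper keep_run_ends is a made-up name, it assumes a numeric column, and it maps NA to the string "NA" so that consecutive NAs count as one run (plain rle() would treat every NA as its own run).

```r
AA <- c(1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15)
BB <- c(32, 21, 21, NA, 27, 31, 31, 12, 28, NA, 48, 7)
df <- data.frame(AA, BB)

# Hypothetical helper: TRUE for the first and last element of every run
keep_run_ends <- function(x) {
  key <- ifelse(is.na(x), "NA", as.character(x))  # make NA comparable
  r   <- rle(key)
  grp <- rep(seq_along(r$lengths), r$lengths)     # run id per element
  !duplicated(grp) | !duplicated(grp, fromLast = TRUE)
}

lighten_df <- df[keep_run_ends(df$BB), ]
lighten_df

# A longer run, to show that interior elements are dropped
keep_run_ends(c(5, 5, 5, NA, 5))
```

In the question's data no run is longer than two, so every row is kept; the second call shows the middle of a run of three being dropped.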

R: How to split a row in a dataframe into a number of rows, conditional on a value in a cell?

I have a data.frame which looks like the following:
id <- c("a","a","a","a","b","b","b","b")
age_from <- c(0,2,3,7,0,1,2,6)
age_to <- c(2,3,7,10,1,2,6,10)
y <- c(100,150,100,250,300,200,100,150)
df <- data.frame(id,age_from,age_to,y)
df$years <- df$age_to - df$age_from
Which gives a df that looks like:
id age_from age_to y years
1 a 0 2 100 2
2 a 2 3 150 1
3 a 3 7 100 4
4 a 7 10 250 3
5 b 0 1 300 1
6 b 1 2 200 1
7 b 2 6 100 4
8 b 6 10 150 4
Instead of having an unequal number of years per row, I would like to have 20 rows, 10 for each id, with each row accounting for one year. This would also involve averaging the y column across the number of years listed in the years column.
I believe this may have to be done using a loop 1:n with n equal to a value in the years column, although I am not sure how to start.
You can use rep to repeat each row by its number of years.
x <- df[rep(seq_len(nrow(df)), df$years),]
x
# id age_from age_to y years
#1 a 0 2 50.00000 2
#1.1 a 0 2 50.00000 2
#2 a 2 3 150.00000 1
#3 a 3 7 25.00000 4
#3.1 a 3 7 25.00000 4
#3.2 a 3 7 25.00000 4
#3.3 a 3 7 25.00000 4
#4 a 7 10 83.33333 3
#4.1 a 7 10 83.33333 3
#4.2 a 7 10 83.33333 3
#5 b 0 1 300.00000 1
#6 b 1 2 200.00000 1
#7 b 2 6 25.00000 4
#7.1 b 2 6 25.00000 4
#7.2 b 2 6 25.00000 4
#7.3 b 2 6 25.00000 4
#8 b 6 10 37.50000 4
#8.1 b 6 10 37.50000 4
#8.2 b 6 10 37.50000 4
#8.3 b 6 10 37.50000 4
If by averaging the y column across the number of years you mean dividing y by the number of years (the output above already reflects this step):
x$y <- x$y / x$years
In case age_from should go from 0 to 9 and age_to from 1 to 10 for each id:
x$age_from <- x$age_from + ave(x$age_from, x$id, x$age_from, FUN=seq_along) - 1
#x$age_from <- ave(x$age_from, x$id, FUN=seq_along) - 1 #Alternative
x$age_to <- x$age_from + 1
Here is a solution with tidyr and dplyr.
First, we complete age_from from 0 to 9 for each existing id.
This leaves several NAs in age_to, y and years, so we fill them downward, carrying each value into the NA rows that immediately follow it.
Now we can divide y by years (I assumed this is what you meant by averaging, so that the sum of y stays the same).
At that point, we only need to recalculate age_to accordingly.
Remember to ungroup at the end!
library(tidyr)
library(dplyr)
df %>%
complete(id, age_from = 0:9) %>%
group_by(id) %>%
fill(y, years, age_to) %>%
mutate(y = y/years) %>%
mutate(age_to = age_from + 1) %>%
ungroup()
# A tibble: 20 x 5
id age_from age_to y years
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 0 1 50 2
2 a 1 2 50 2
3 a 2 3 150 1
4 a 3 4 25 4
5 a 4 5 25 4
6 a 5 6 25 4
7 a 6 7 25 4
8 a 7 8 83.3 3
9 a 8 9 83.3 3
10 a 9 10 83.3 3
11 b 0 1 300 1
12 b 1 2 200 1
13 b 2 3 25 4
14 b 3 4 25 4
15 b 4 5 25 4
16 b 5 6 25 4
17 b 6 7 37.5 4
18 b 7 8 37.5 4
19 b 8 9 37.5 4
20 b 9 10 37.5 4
A tidyverse solution.
library(tidyverse)
df %>%
mutate(age_to = age_from + 1) %>%
group_by(id) %>%
complete(nesting(age_from = 0:9, age_to = 1:10)) %>%
fill(y, years) %>%
mutate(y = y / years)
# A tibble: 20 x 5
# Groups: id [2]
id age_from age_to y years
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 0 1 50 2
2 a 1 2 50 2
3 a 2 3 150 1
4 a 3 4 25 4
5 a 4 5 25 4
6 a 5 6 25 4
7 a 6 7 25 4
8 a 7 8 83.3 3
9 a 8 9 83.3 3
10 a 9 10 83.3 3
11 b 0 1 300 1
12 b 1 2 200 1
13 b 2 3 25 4
14 b 3 4 25 4
15 b 4 5 25 4
16 b 5 6 25 4
17 b 6 7 37.5 4
18 b 7 8 37.5 4
19 b 8 9 37.5 4
20 b 9 10 37.5 4

SMA for multiple items in the same column

I'm trying to create SMA formula for multiple items in the same column. Here's an example of the data I'm working with.
Person Time Value
<chr> <dbl> <dbl>
1 A 1 14
2 A 2 13
3 A 3 17
4 A 4 9
5 A 5 20
6 A 6 5
7 B 1 17
8 B 2 11
9 B 3 18
10 B 4 10
11 B 5 10
12 B 6 20
13 C 1 5
14 C 2 5
15 C 3 11
16 C 4 12
17 C 5 12
18 C 6 9
What I'd like to be able to do is to create another column with the SMA formula for each person (A,B,C, etc.). In this case let's say SMA2. While it works for Person A, I can't get the formula to restart at Person B. Rather Person B's first SMA2 value has Person A's values with it.
Right now I've used this which does give me the SMA I want, just not restarted at each new person:
DataSet$SMA2<-SMA(DataSet$Value, 2)
Any help would be appreciated.
DataSet <- DataSet %>%
group_by(Person) %>%
mutate(sma2 = TTR::SMA(Value,2))
Still came up with this:
# A tibble: 18 x 4
# Groups: Person [3]
Person Time Value sma2
<chr> <dbl> <dbl> <dbl>
1 A 1 14 NA
2 A 2 13 13.5
3 A 3 17 15
4 A 4 9 13
5 A 5 20 14.5
6 A 6 5 12.5
7 B 1 17 11
8 B 2 11 14
9 B 3 18 14.5
10 B 4 10 14
11 B 5 10 10
12 B 6 20 15
13 C 1 5 12.5
14 C 2 5 5
15 C 3 11 8
16 C 4 12 11.5
17 C 5 12 12
18 C 6 9 10.5
Using dplyr, group_by Person and then use mutate. This restarts the calculation for each person.
DataSet <- DataSet %>%
group_by(Person) %>%
mutate(sma2 = TTR::SMA(Value, 2))
# A tibble: 18 x 4
# Groups: Person [3]
Person Time Value sma2
<chr> <int> <int> <dbl>
1 A 1 14 NA
2 A 2 13 13.5
3 A 3 17 15
4 A 4 9 13
5 A 5 20 14.5
6 A 6 5 12.5
7 B 1 17 NA
8 B 2 11 14
9 B 3 18 14.5
10 B 4 10 14
11 B 5 10 10
12 B 6 20 15
13 C 1 5 NA
14 C 2 5 5
15 C 3 11 8
16 C 4 12 11.5
17 C 5 12 12
18 C 6 9 10.5
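As a cross-check that does not depend on dplyr or TTR, the same per-person two-point average can be sketched in base R. This assumes SMA(Value, 2) means the plain mean of the current and previous value; trailing_mean is a hypothetical helper name.

```r
DataSet <- data.frame(
  Person = rep(c("A", "B", "C"), each = 6),
  Time   = rep(1:6, times = 3),
  Value  = c(14, 13, 17, 9, 20, 5,
             17, 11, 18, 10, 10, 20,
             5, 5, 11, 12, 12, 9)
)

# Hypothetical helper: trailing mean over n values, NA until n values exist
trailing_mean <- function(v, n = 2) {
  as.numeric(stats::filter(v, rep(1 / n, n), sides = 1))
}

# ave() applies the helper separately within each Person
DataSet$sma2 <- ave(DataSet$Value, DataSet$Person, FUN = trailing_mean)

# First two rows of each person: the average restarts with NA
DataSet[c(1, 2, 7, 8, 13, 14), ]
```

The first row of every person is NA here, which is the behaviour the grouped mutate is expected to give.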

Rolling sum in dplyr

set.seed(123)
df <- data.frame(x = sample(1:10, 20, replace = T), id = rep(1:2, each = 10))
For each id, I want to create a column which has the sum of previous 5 x values.
df %>% group_by(id) %>% mutate(roll.sum = c(x[1:4], zoo::rollapply(x, 5, sum)))
# Groups: id [2]
x id roll.sum
<int> <int> <int>
3 1 3
8 1 8
5 1 5
9 1 9
10 1 10
1 1 36
6 1 39
9 1 40
6 1 41
5 1 37
10 2 10
5 2 5
7 2 7
6 2 6
2 2 2
9 2 39
3 2 32
1 2 28
4 2 25
10 2 29
The 6th row should be 35 (3 + 8 + 5 + 9 + 10), the 7th row should be 33 (8 + 5 + 9 + 10 + 1) and so on.
However, the above function is also including the row itself for calculation. How can I fix it?
library(zoo)
df %>% group_by(id) %>%
mutate(Sum_prev = rollapply(x, list(-(1:5)), sum, fill=NA, align = "right", partial=F))
# you can use rollapply(x, list((1:5)), sum, fill=NA, align = "left", partial=F)
# to sum the next 5 elements, skipping the current one
x id Sum_prev
1 3 1 NA
2 8 1 NA
3 5 1 NA
4 9 1 NA
5 10 1 NA
6 1 1 35
7 6 1 33
8 9 1 31
9 6 1 35
10 5 1 32
11 10 2 NA
12 5 2 NA
13 7 2 NA
14 6 2 NA
15 2 2 NA
16 9 2 30
17 3 2 29
18 1 2 27
19 4 2 21
20 10 2 19
There is the rollify function in the tibbletime package that you could use. You can read about it in this vignette: Rolling calculations in tibbletime.
library(tibbletime)
library(dplyr)
rollig_sum <- rollify(.f = sum, window = 5)
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) #added lag() here
# A tibble: 20 x 3
# Groups: id [2]
# x id roll.sum
# <int> <int> <int>
# 1 3 1 NA
# 2 8 1 NA
# 3 5 1 NA
# 4 9 1 NA
# 5 10 1 NA
# 6 1 1 35
# 7 6 1 33
# 8 9 1 31
# 9 6 1 35
#10 5 1 32
#11 10 2 NA
#12 5 2 NA
#13 7 2 NA
#14 6 2 NA
#15 2 2 NA
#16 9 2 30
#17 3 2 29
#18 1 2 27
#19 4 2 21
#20 10 2 19
If you want the NAs to be some other value, you can use, for example, if_else
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) %>%
mutate(roll.sum = if_else(is.na(roll.sum), x, roll.sum))
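The "sum of the previous 5 values, excluding the current one" can also be sketched without extra packages by differencing a cumulative sum. The helper prev_sum is a made-up name, and the x values are typed in from the output shown above rather than regenerated with sample(), since sample() can return different draws across R versions.

```r
df <- data.frame(
  x  = c(3, 8, 5, 9, 10, 1, 6, 9, 6, 5,
         10, 5, 7, 6, 2, 9, 3, 1, 4, 10),
  id = rep(1:2, each = 10)
)

# Hypothetical helper: sum of the k values strictly before each position
prev_sum <- function(x, k = 5) {
  n   <- length(x)
  cs  <- c(0, cumsum(x))          # cs[j + 1] is the sum of x[1..j]
  out <- rep(NA_real_, n)
  if (n > k) {
    i <- (k + 1):n
    out[i] <- cs[i] - cs[i - k]   # sum of x[(i - k)..(i - 1)]
  }
  out
}

# Apply per id so the window never crosses a group boundary
df$Sum_prev <- ave(df$x, df$id, FUN = prev_sum)
df
```

Rows 6 and 7 come out as 35 and 33, matching the values the question asks for.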

R data frame rank by groups (group by rank) with package dplyr

I have a data frame 'test' that looks like this:
session_id seller_feedback_score
1 1 282470
2 1 275258
3 1 275258
4 1 275258
5 1 37831
6 1 282470
7 1 26
8 1 138351
9 1 321350
10 1 841
11 1 138351
12 1 17263
13 1 282470
14 1 396900
15 1 282470
16 1 282470
17 1 321350
18 1 321350
19 1 321350
20 1 0
21 1 1596
22 7 282505
23 7 275283
24 7 275283
25 7 275283
26 7 37834
27 7 282505
28 7 26
29 7 138359
30 7 321360
and code (using package dplyr) that should rank 'seller_feedback_score' within each group of session_id:
test <- test %>% group_by(session_id) %>%
mutate(seller_feedback_score_rank = dense_rank(-seller_feedback_score))
However, what really happens is that R ranks the entire data frame together, ignoring the groups (session_id values):
session_id seller_feedback_score seller_feedback_score_rank_2
1 1 282470 5
2 1 275258 7
3 1 275258 7
4 1 275258 7
5 1 37831 11
6 1 282470 5
7 1 26 15
8 1 138351 9
9 1 321350 3
10 1 841 14
11 1 138351 9
12 1 17263 12
13 1 282470 5
14 1 396900 1
15 1 282470 5
16 1 282470 5
17 1 321350 3
18 1 321350 3
19 1 321350 3
20 1 0 16
21 1 1596 13
22 7 282505 4
23 7 275283 6
24 7 275283 6
25 7 275283 6
26 7 37834 10
27 7 282505 4
28 7 26 15
29 7 138359 8
30 7 321360 2
I checked this by counting the unique 'seller_feedback_score_rank' values, and not surprisingly the count equals the highest rank value. I'd appreciate it if someone could reproduce this and help. Thanks.
Link to my original question: R group by and aggregate - return relative rank within groups using plyr
Had a similar issue; my solution was to sort by group and the relevant ranked variable(s), and then use row_number() after group_by.
# Sample dataset
df <- data.frame(group=rep(c("GROUP 1", "GROUP 2"),10),
value=as.integer(rnorm(20, mean=1000, sd=500)))
library(dplyr)
print.data.frame(df[1:10,])
group value
1 GROUP 1 1273
2 GROUP 2 1261
3 GROUP 1 1189
4 GROUP 2 1390
5 GROUP 1 1942
6 GROUP 2 1111
7 GROUP 1 530
8 GROUP 2 893
9 GROUP 1 997
10 GROUP 2 237
sorted <- df %>%
arrange(group, -value) %>%
group_by(group) %>%
mutate(rank=row_number())
print.data.frame(sorted)
group value rank
1 GROUP 1 1942 1
2 GROUP 1 1368 2
3 GROUP 1 1273 3
4 GROUP 1 1249 4
5 GROUP 1 1189 5
6 GROUP 1 997 6
7 GROUP 1 562 7
8 GROUP 1 535 8
9 GROUP 1 530 9
10 GROUP 1 1 10
11 GROUP 2 1472 1
12 GROUP 2 1390 2
13 GROUP 2 1281 3
14 GROUP 2 1261 4
15 GROUP 2 1111 5
16 GROUP 2 893 6
17 GROUP 2 774 7
18 GROUP 2 669 8
19 GROUP 2 631 9
20 GROUP 2 237 10
Found an answer in: Add a "rank" column to a data frame
data.selected <- transform(data.selected,
seller_feedback_score_rank = ave(seller_feedback_score, session_id,
FUN = function(x) rank(-x, ties.method = "first")))
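The ave() call above breaks ties by order of appearance, so tied scores get different ranks. If dense ranks per group are wanted (ties share a rank, no gaps, as dense_rank() would give), a base R sketch could look like the following. The small data frame and the helper name dense_desc are made up for illustration.

```r
# Small hypothetical sample in the shape of the question's data
test <- data.frame(
  session_id            = c(1, 1, 1, 1, 7, 7, 7),
  seller_feedback_score = c(282470, 275258, 275258, 321350,
                            282505, 26, 321360)
)

# Hypothetical helper: dense rank in descending order (highest score gets 1)
dense_desc <- function(x) match(x, sort(unique(x), decreasing = TRUE))

# ave() applies the helper separately within each session_id
test$seller_feedback_score_rank <- ave(test$seller_feedback_score,
                                       test$session_id,
                                       FUN = dense_desc)
test
```

Each session restarts at rank 1, and the two tied scores in session 1 share rank 3.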
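One way you can do this is:
library(dplyr)
dataset <- dataset %>% arrange(ID, DateTime, Index)
# TRUE where a row has the same ID as the previous row
dataset$Rank <- c(0, dataset$ID)[-(nrow(dataset) + 1)] == dataset$ID
# cumulative count of repeats within each ID
dataset <- dataset %>% group_by(ID) %>% mutate(Rank = cumsum(Rank))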
Had the same issue!
