How to get the lowest and highest values for interval class - r

My data looks like this:
df<-read.table(text = "temp
12
15
12
6
9
11
15
14
14
16
14
14
11
12
13
14
10
12
12
14
9
13
12
15
11
11
12
12
10
11",header=TRUE)
I want the full range of temp, from the lowest to the highest value, so that I can calculate the cumulative proportion.
I have tried the following code:
library(purrr)
library(dplyr)
map(names(df), ~ df %>%
count(!!rlang::sym(.x)) %>%
mutate(cum = cumsum(temp)/sum(temp)))
As you can see, this gives the temps 6, 9, 10, 11, 12, 13, 14, 15 and 16, but 7 and 8 are missing.
I want to have the following output:
temp n cum
6 x x
7 0 x
8 0 x
9 x x
10 x x
11 x x
12 x x
13 x x
14 x x
15 x x
16 x x

We can use complete to add the missing values of temp with n = 0, then fill to carry the previous cum value forward.
library(dplyr)
library(tidyr)
df %>%
count(temp) %>%
mutate(cum=cumsum(n)/sum(n)) %>%
complete(temp = seq(min(temp), max(temp)), fill = list(n = 0)) %>%
fill(cum)
# A tibble: 11 x 3
# temp n cum
# <int> <dbl> <dbl>
# 1 6 1 0.0333
# 2 7 0 0.0333
# 3 8 0 0.0333
# 4 9 2 0.1
# 5 10 2 0.167
# 6 11 5 0.333
# 7 12 8 0.6
# 8 13 2 0.667
# 9 14 6 0.867
#10 15 3 0.967
#11 16 1 1

In base R you could use table to get the frequencies (df2), match them into a new data.frame built from the temperature range, set the NA counts to zero, and calculate the cumsum.
df2 <- data.frame(table(df$temp))
rg <- range(df$temp)
res <- within(data.frame(temp=rg[1]:rg[2]), {
n <- df2[match(temp, df2$Var1), "Freq"]
n[is.na(n)] <- 0
cum=cumsum(n/sum(n))
})[c(1, 3, 2)]
res
# temp n cum
# 1 6 1 0.03333333
# 2 7 0 0.03333333
# 3 8 0 0.03333333
# 4 9 2 0.10000000
# 5 10 2 0.16666667
# 6 11 5 0.33333333
# 7 12 8 0.60000000
# 8 13 2 0.66666667
# 9 14 6 0.86666667
# 10 15 3 0.96666667
# 11 16 1 1.00000000
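Another base R variant, offered as a sketch rather than a tested drop-in for the data above: convert temp to a factor whose levels span the full range, so that table reports a zero count for the missing values directly. This assumes temp holds whole numbers; the toy vector below is made up for illustration.

```r
# Toy vector with a gap at 7 and 8 (hypothetical data, not the question's df)
temp <- c(6, 9, 9, 10)

# Levels cover every integer from the lowest to the highest value
lv <- seq(min(temp), max(temp))

# table() counts every level, including those that never occur
n <- as.vector(table(factor(temp, levels = lv)))

res <- data.frame(temp = lv, n = n, cum = cumsum(n) / sum(n))
res
```

Because the zero counts are present before cumsum is taken, no separate fill step is needed.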

Related

Add rows using name columns

Is it possible to create two rows using the name of the columns?
I need to separate the DX columns from the SX columns into separate rows, and keep track of the side (DX or SX) in a new column. Some columns are common to both, in this case X, and Num is the key.
a = read.table(text="
Num X STDX ABDX XBDX STSX ABSX XBSX
12 3 9 5 3 11 3 7
13 35 24 1 7 18 2 8
14 35 24 1 7 18 2 8
15 10 1 5 16 -10 5 3 ",h=T)
b= read.table(text="Num X ST AB XB DX/SX
12 3 9 5 3 DX
12 3 11 3 7 SX
13 35 24 1 7 DX
13 35 18 2 8 SX
14 35 24 1 7 DX
14 35 18 2 8 SX
15 10 1 5 16 DX
15 10 -10 5 3 SX",h=T)
My idea was to split the data and then join it again, but that is cumbersome.
I have tried this code:
c <- sapply(c("DX", "SX",""),
function(x) a[endsWith(names(a),x)],
simplify = FALSE)
But the problem is X and Num, because I would like to have them in the same data frame as the DX and SX columns.
There are more elegant and compact approaches for sure, but here's an example of how you could achieve this by simple renaming and row binding:
a = read.table(text="
Num X STDX ABDX XBDX STSX ABSX XBSX
12 3 9 5 3 11 3 7
13 35 24 1 7 18 2 8
14 35 24 1 7 18 2 8
15 10 1 5 16 -10 5 3 ",h=T)
library(dplyr)
library(stringr)
dx <- a %>%
select(1,2,ends_with("DX")) %>%
rename_with(~ str_remove(.x, "DX$"), .cols = -c(1:2)) %>%
mutate(`DX/SX` = "DX" )
dx
#> Num X ST AB XB DX/SX
#> 1 12 3 9 5 3 DX
#> 2 13 35 24 1 7 DX
#> 3 14 35 24 1 7 DX
#> 4 15 10 1 5 16 DX
sx <- a %>%
select(1,2,ends_with("SX")) %>%
rename_with(~ str_remove(.x, "SX$"), .cols = -c(1:2)) %>%
mutate(`DX/SX` = "SX" )
sx
#> Num X ST AB XB DX/SX
#> 1 12 3 11 3 7 SX
#> 2 13 35 18 2 8 SX
#> 3 14 35 18 2 8 SX
#> 4 15 10 -10 5 3 SX
bind_rows(dx,sx) %>%
arrange(Num)
#> Num X ST AB XB DX/SX
#> 1 12 3 9 5 3 DX
#> 2 12 3 11 3 7 SX
#> 3 13 35 24 1 7 DX
#> 4 13 35 18 2 8 SX
#> 5 14 35 24 1 7 DX
#> 6 14 35 18 2 8 SX
#> 7 15 10 1 5 16 DX
#> 8 15 10 -10 5 3 SX
Created on 2022-10-12 with reprex v2.0.2
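One of the more compact approaches hinted at above might be tidyr::pivot_longer with the special ".value" name. This is a sketch under the assumption that every column other than Num and X ends in DX or SX; the column name side is made up here, and only the first two rows of the question's data are used.

```r
library(tidyr)

a <- read.table(text = "
Num X STDX ABDX XBDX STSX ABSX XBSX
12 3 9 5 3 11 3 7
13 35 24 1 7 18 2 8", header = TRUE)

long <- pivot_longer(
  a,
  cols = -c(Num, X),                 # Num and X are shared, so they stay as-is
  names_to = c(".value", "side"),    # prefix becomes a column, suffix goes into `side`
  names_pattern = "(.*)(DX|SX)$"     # split e.g. "STDX" into "ST" and "DX"
)
long
```

Each input row yields one DX row and one SX row, so no separate bind or arrange step is needed.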

Mean of a column only for observations meeting a condition

How can I add a column with the mean of z for each group y, using only the rows where x < 10? In any other case (a group with no rows where x < 10) the mean column can take the value of z.
df <- data.frame(y = c(LETTERS[1:5], LETTERS[1:5],LETTERS[3:7]), x = 1:15, z = c(4:9,1:4,2:6))
y x z
1 A 1 4
2 B 2 5
3 C 3 6
4 D 4 7
5 E 5 8
6 A 6 9
7 B 7 1
8 C 8 2
9 D 9 3
10 E 10 4
11 C 11 2
12 D 12 3
13 E 13 4
14 F 14 5
15 G 15 6
I am trying something like
df %>% group_by(y) %>%
mutate(gr.mean = mean(z))
But this computes the mean over all values of x.
We can subset the 'z' with a logical condition on 'x':
library(dplyr)
df %>%
group_by(y) %>%
mutate(gr.mean = if(all(x >=10)) z else mean(z[x < 10])) %>%
ungroup
Output
# A tibble: 15 × 4
y x z gr.mean
<chr> <int> <int> <dbl>
1 A 1 4 6.5
2 B 2 5 3
3 C 3 6 4
4 D 4 7 5
5 E 5 8 8
6 A 6 9 6.5
7 B 7 1 3
8 C 8 2 4
9 D 9 3 5
10 E 10 4 8
11 C 11 2 4
12 D 12 3 5
13 E 13 4 8
14 F 14 5 5
15 G 15 6 6
Or without if/else
df %>%
group_by(y) %>%
mutate(gr.mean = coalesce(mean(z[x < 10]), z))
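The coalesce version works because mean() of an empty selection is NaN, so groups without any x < 10 fall back to z. The same fallback logic can be sketched in base R; the three-row data frame and the names keep, grp_mean and no_mean are made up for illustration.

```r
# Toy data: group A has one row with x < 10, group B has none
df <- data.frame(y = c("A", "A", "B"), x = c(1, 12, 14), z = c(4, 9, 5))

# mean() over an empty selection is NaN, which is what triggers the fallback
is.nan(mean(numeric(0)))

# Per-group mean of z over the rows with x < 10 only
keep <- df$x < 10
grp_mean <- vapply(split(df$z[keep], as.character(df$y[keep])), mean, numeric(1))

# Look up each row's group mean; groups with no qualifying rows come back as NA
df$gr.mean <- unname(grp_mean[as.character(df$y)])

# Fall back to z where no group mean exists
no_mean <- is.na(df$gr.mean)
df$gr.mean[no_mean] <- df$z[no_mean]
df
```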

Rolling sum in dplyr

set.seed(123)
df <- data.frame(x = sample(1:10, 20, replace = T), id = rep(1:2, each = 10))
For each id, I want to create a column which has the sum of previous 5 x values.
df %>% group_by(id) %>% mutate(roll.sum = c(x[1:4], zoo::rollapply(x, 5, sum)))
# Groups: id [2]
x id roll.sum
<int> <int> <int>
3 1 3
8 1 8
5 1 5
9 1 9
10 1 10
1 1 36
6 1 39
9 1 40
6 1 41
5 1 37
10 2 10
5 2 5
7 2 7
6 2 6
2 2 2
9 2 39
3 2 32
1 2 28
4 2 25
10 2 29
The 6th row should be 35 (3 + 8 + 5 + 9 + 10), the 7th row should be 33 (8 + 5 + 9 + 10 + 1) and so on.
However, the above function is also including the row itself for calculation. How can I fix it?
library(zoo)
df %>% group_by(id) %>%
mutate(Sum_prev = rollapply(x, list(-(1:5)), sum, fill=NA, align = "right", partial=F))
# you can use rollapply(x, list((1:5)), sum, fill=NA, align = "left", partial=F)
# to sum the next 5 elements, skipping the current one
x id Sum_prev
1 3 1 NA
2 8 1 NA
3 5 1 NA
4 9 1 NA
5 10 1 NA
6 1 1 35
7 6 1 33
8 9 1 31
9 6 1 35
10 5 1 32
11 10 2 NA
12 5 2 NA
13 7 2 NA
14 6 2 NA
15 2 2 NA
16 9 2 30
17 3 2 29
18 1 2 27
19 4 2 21
20 10 2 19
There is the rollify function in the tibbletime package that you could use. You can read about it in this vignette: Rolling calculations in tibbletime.
library(tibbletime)
library(dplyr)
rollig_sum <- rollify(.f = sum, window = 5)
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) #added lag() here
# A tibble: 20 x 3
# Groups: id [2]
# x id roll.sum
# <int> <int> <int>
# 1 3 1 NA
# 2 8 1 NA
# 3 5 1 NA
# 4 9 1 NA
# 5 10 1 NA
# 6 1 1 35
# 7 6 1 33
# 8 9 1 31
# 9 6 1 35
#10 5 1 32
#11 10 2 NA
#12 5 2 NA
#13 7 2 NA
#14 6 2 NA
#15 2 2 NA
#16 9 2 30
#17 3 2 29
#18 1 2 27
#19 4 2 21
#20 10 2 19
If you want the NAs to be some other value, you can use, for example, if_else:
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) %>%
mutate(roll.sum = if_else(is.na(roll.sum), x, roll.sum))
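If you would rather avoid an extra package, the sum of the previous 5 values can also be sketched with cumsum differences in base R. The helper name sum_prev is made up for this example, and the input is the id == 1 block of x shown above.

```r
# Sum of the k values preceding each position; NA where fewer than k exist.
# sum_prev is a hypothetical helper name.
sum_prev <- function(x, k = 5) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n > k) {
    cs <- cumsum(x)
    # sum(x[(i-k):(i-1)]) equals cs[i-1] - cs[i-k-1], with cs[0] taken as 0
    out[(k + 1):n] <- cs[k:(n - 1)] - c(0, cs)[1:(n - k)]
  }
  out
}

x <- c(3, 8, 5, 9, 10, 1, 6, 9, 6, 5)  # the id == 1 values from the question
sum_prev(x)
# NA NA NA NA NA 35 33 31 35 32
```

To apply it per id, something like ave(df$x, df$id, FUN = sum_prev) should do, since ave keeps one value per row.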

Using purrr::map2 when one variable is not part of the function

If I had a function like this:
foo <- function(var) {
if(length(var) > 5) stop("can't be greater than 5")
data.frame(var = var)
}
Where this worked:
df <- 1:20
foo(var = df[1:5])
But this didn't:
foo(var = df)
The desired output is:
var
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
If I know that I can only run this function in chunk of 5 rows, what would be the best approach if I wanted to evaluate all 20 rows? Can I use purrr::map() for this? Assume that the 5 row constraint is rigid.
Thanks in advance.
We split df into chunks of 5, use purrr::map_dfr to apply foo to each chunk, and bind the results together by rows.
library(tidyverse)
foo <- function(var) {
if(length(var) > 5) stop("can't be greater than 5")
data.frame(var = var)
}
df <- 1:20
df_split <- split(df, (seq(length(df))-1) %/% 5)
df_split
map_dfr(df_split, ~ foo(.x))
var
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
You can use dplyr::group_by or tapply:
data.frame(df) %>%
mutate(grp = (row_number()-1) %/% 5) %>%
group_by(grp) %>%
mutate(var = foo(df)$var) %>%
ungroup %>%
select(var)
# # A tibble: 20 x 1
# var
# <int>
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 6
# 7 7
# 8 8
# 9 9
# 10 10
# 11 11
# 12 12
# 13 13
# 14 14
# 15 15
# 16 16
# 17 17
# 18 18
# 19 19
# 20 20
data.frame(var=unlist(tapply(df,(df-1) %/% 5,foo)))
# var
# 01 1
# 02 2
# 03 3
# 04 4
# 05 5
# 11 6
# 12 7
# 13 8
# 14 9
# 15 10
# 21 11
# 22 12
# 23 13
# 24 14
# 25 15
# 31 16
# 32 17
# 33 18
# 34 19
# 35 20
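All of the answers above rely on the same chunk index, (position - 1) %/% 5. A minimal base R sketch of that idea follows, using a made-up 12-element input so that the last chunk is shorter than 5; the names chunk_id and res are assumptions for illustration.

```r
foo <- function(var) {
  if (length(var) > 5) stop("can't be greater than 5")
  data.frame(var = var)
}

x <- 1:12  # hypothetical input whose length is not a multiple of 5

# Integer division of the zero-based position gives the chunk number
chunk_id <- (seq_along(x) - 1) %/% 5
chunk_id
# 0 0 0 0 0 1 1 1 1 1 2 2

# Apply foo to each chunk, then stack the pieces by rows
res <- do.call(rbind, lapply(split(x, chunk_id), foo))
rownames(res) <- NULL
res
```

Using the position rather than the value, as in seq_along(x), keeps the chunks at most 5 long even when the data are not the integers 1 to n.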

Sum of group but keep the same value for each row in r

I have a data frame and want to create new variables holding the sum for each combination of ID and Group. A normal grouped sum reduces the number of rows, but in my case I need to keep every row and repeat the group sum on each of them.
ID <- c(rep(1,3), rep(3, 5), rep(4,4))
Group <-c(1,1,2,1,1,1,2,2,1,1,1,2)
x <- c(1:12)
y<- c(12:23)
df <- data.frame(ID,Group,x,y)
ID Group x y
1 1 1 1 12
2 1 1 2 13
3 1 2 3 14
4 3 1 4 15
5 3 1 5 16
6 3 1 6 17
7 3 2 7 18
8 3 2 8 19
9 4 1 9 20
10 4 1 10 21
11 4 1 11 22
12 4 2 12 23
The output should have 2 more variables, "sumx" and "sumy", grouped by (ID, Group):
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
Any idea?
As short as:
df$sumx <- with(df,ave(x,ID,Group,FUN = sum))
df$sumy <- with(df,ave(y,ID,Group,FUN = sum))
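ave fits here because it returns a vector as long as its input, with the group result repeated on every row of that group. A small sketch on made-up vectors (not the question's df):

```r
# Toy vectors; every ID/Group combination occurs at least once
ID    <- c(1, 1, 1, 3, 3, 3)
Group <- c(1, 1, 2, 1, 1, 2)
x     <- 1:6

# One sum per (ID, Group) combination, repeated on each row of that combination
sumx <- ave(x, ID, Group, FUN = sum)
sumx
# 3 3 3 9 9 6
```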
We can use dplyr
library(dplyr)
# mutate_each() and funs() are deprecated; across() (dplyr >= 1.0.0) is the current equivalent
df %>%
group_by(ID, Group) %>%
mutate(across(c(x, y), sum, .names = "sum{.col}")) %>%
ungroup()
If there are only two columns to sum, then
df %>%
group_by(ID, Group) %>%
mutate(sumx = sum(x), sumy = sum(y))
For a single column you can use the code below; add further columns to mutate() as needed. Note that cumsum() produces a running total within each group, so use sum() instead if every row should carry the group total:
library(dplyr)
data13 <- data12 %>%
group_by(Category) %>%
mutate(cum_Cat_GMR = cumsum(GrossMarginRs))
