How can I make a moving sum from a cell in R?

I have a dataframe looking like this:
date        P     >60?
03-31-2020  6.8   0
03-30-2020  5.0   0
03-29-2020  0.0   0
03-28-2020  0.0   0
03-27-2020  2.0   0
03-26-2020  0.0   0
03-25-2020  71.0  1
03-24-2020  2.0   0
03-23-2020  0.0   0
03-22-2020  23.8  0
03-21-2020  0.0   0
03-20-2020  23.8  0
Code to reproduce the dataframe:
df1 <- data.frame(date = c("03-31-2020", "03-30-2020", "03-29-2020", "03-28-2020", "03-27-2020", "03-26-2020",
"03-25-2020", "03-24-2020", "03-23-2020", "03-22-2020", "03-21-2020", "03-20-2020"),
P = c(6.8, 5.0, 0.0, 0.0, 2.0, 0.0, 71.0, 2.0, 0.0, 23.8, 0.0, 23.8),
Sup60 = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0))
I want to sum the P values for the N days before a day where P > 60.
For example, the first barrier (a value greater than 60) is P = 71 on 03-25-2020. From that day I want to sum the 5 P values before it:
2.0 + 0.0 + 23.8 + 0.0 + 23.8 = 49.6
It is a kind of moving sum, because the concept is similar to a moving average.
Instead of the average of the last 5 values, I want the sum of the 5 values preceding a value greater than 60.
How can I do this?

Hi. First we can work out how to calculate a running sum, then apply an if_else to that column. As a general rule, split a complex problem into smaller, solvable problems.
library(tidyverse)
df_example <- tibble::tribble(
~date, ~P, ~`>60?`,
"03-31-2020", 6.8, 0L,
"03-30-2020", 5, 0L,
"03-29-2020", 0, 0L,
"03-28-2020", 0, 0L,
"03-27-2020", 2, 0L,
"03-26-2020", 0, 0L,
"03-25-2020", 71, 1L,
"03-24-2020", 2, 0L,
"03-23-2020", 0, 0L,
"03-22-2020", 23.8, 0L,
"03-21-2020", 0, 0L,
"03-20-2020", 23.8, 0L
)
# let's start by doing a simple running sum
# (sorting the character dates works here only because they all share one month and year)
jjj <- df_example |>
  arrange(date)
jjj |>
  mutate(running_sum = slider::slide_dbl(.x = P, .f = ~ sum(.x), .before = 5, .after = -1)) |>
  mutate(chosen_sum = if_else(P > 60, running_sum, NA_real_))
#> # A tibble: 12 x 5
#> date P `>60?` running_sum chosen_sum
#> <chr> <dbl> <int> <dbl> <dbl>
#> 1 03-20-2020 23.8 0 0 NA
#> 2 03-21-2020 0 0 23.8 NA
#> 3 03-22-2020 23.8 0 23.8 NA
#> 4 03-23-2020 0 0 47.6 NA
#> 5 03-24-2020 2 0 47.6 NA
#> 6 03-25-2020 71 1 49.6 49.6
#> 7 03-26-2020 0 0 96.8 NA
#> 8 03-27-2020 2 0 96.8 NA
#> 9 03-28-2020 0 0 75 NA
#> 10 03-29-2020 0 0 75 NA
#> 11 03-30-2020 5 0 73 NA
#> 12 03-31-2020 6.8 0 7 NA
Created on 2021-10-20 by the reprex package (v2.0.1)
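For readers without the slider package, the same idea can be sketched in base R. This is only an illustration under a few assumptions: the helper name sum_previous is made up for this example, the dates are parsed with as.Date so that sorting is chronological rather than alphabetical, and the window is the n rows strictly before the current one.

```r
# Data from the question (newest row first, as posted).
df1 <- data.frame(
  date = c("03-31-2020", "03-30-2020", "03-29-2020", "03-28-2020",
           "03-27-2020", "03-26-2020", "03-25-2020", "03-24-2020",
           "03-23-2020", "03-22-2020", "03-21-2020", "03-20-2020"),
  P = c(6.8, 5.0, 0.0, 0.0, 2.0, 0.0, 71.0, 2.0, 0.0, 23.8, 0.0, 23.8)
)

# Parse the mm-dd-yyyy strings and sort from oldest to newest.
df1$date <- as.Date(df1$date, format = "%m-%d-%Y")
df1 <- df1[order(df1$date), ]

# Hypothetical helper: sum of the n values strictly before each position.
sum_previous <- function(x, n) {
  vapply(seq_along(x), function(i) {
    idx <- tail(seq_len(i - 1), n)  # up to n positions before i
    sum(x[idx])                     # an empty window sums to 0
  }, numeric(1))
}

df1$running_sum <- sum_previous(df1$P, 5)
df1$chosen_sum  <- ifelse(df1$P > 60, df1$running_sum, NA_real_)
df1[df1$P > 60, ]
```

Only the row with P = 71 gets a chosen_sum; every other row is NA, as in the slider output above.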

Related

Issues with accent when using the "separate" function from tidyverse

I am using the separate function from tidyverse to split the first column of this tibble :
# A tibble: 6,951 x 9
Row.names Number_of_analysis~ DL_Minimum DL_Mean DL_Maximum Number_of_measur~ Measure_Minimum Measure_Mean Measure_Maximum
<I<chr>> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2011.FACILITY.PONT-À-CELLES 52 0.6 1.81 16 0 0 0 0
2 2011.FACILITY.PONT-À-CELLES 52 0.07 0.177 1.3 0 0 0 0
3 2011.FACILITY.CHARLEROI 52 0.07 0.212 1.9 0 0 0 0
4 2011.FACILITY.CHARLEROI 52 0.08 0.209 2 0 0 0 0
Merge_splitnames <- Merge %>%
separate(Row.names,sep = "\\.",into = c("Year", "Catchment", "Locality"), extra = "drop")
While everything seems correct, the output is a tibble without the first 2 rows (the ones whose name contains an accented French character):
# A tibble: 6,951 x 9
Year Catchment Locality Number_of_analysis~ DL_Minimum DL_Mean DL_Maximum Number_of_measur~ Measure_Minimum Measure_Mean Measure_Maximum
<I<chr>> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
3 2011 FACILITY CHARLEROI 52 0.07 0.212 1.9 0 0 0 0
4 2011 FACILITY CHARLEROI 52 0.08 0.209 2 0 0 0 0
Any idea how to deal with this issue? I wish to keep the real French name (with the accent). This surprises me, as I have never had any issue with the other tidyverse functions.
NB: this is a simple, reproducible example; my real tibble is about 100 times bigger.
separate is retaining the accent for me:
library(tidyverse)
tribble(
~names,
"2011.FACILITY.PONT-À-CELLES",
"2011.FACILITY.PONT-À-CELLES",
"2011.FACILITY.CHARLEROI",
"2011.FACILITY.CHARLEROI"
) %>%
separate(names, sep = "\\.", into = c("Year", "Catchment", "Locality"))
#> # A tibble: 4 × 3
#> Year Catchment Locality
#> <chr> <chr> <chr>
#> 1 2011 FACILITY PONT-À-CELLES
#> 2 2011 FACILITY PONT-À-CELLES
#> 3 2011 FACILITY CHARLEROI
#> 4 2011 FACILITY CHARLEROI
Created on 2022-05-06 by the reprex package (v2.0.1)
Assuming DF is as shown reproducibly in the Note at the end, use extra = "merge" in separate. (You may need to change your locale, although I did not need to. Some things to try are shown in "How to change the locale of R?" or "Using weekdays with any locale under Windows".)
library(tidyr)
DF %>%
separate(Row.names, c("Year", "Catchment", "Locality"), extra = "merge")
giving:
Year Catchment Locality Number_of_analysis~ DL_Minimum DL_Mean
1 2011 FACILITY PONT-À-CELLES 52 0.60 1.810
2 2011 FACILITY PONT-À-CELLES 52 0.07 0.177
3 2011 FACILITY CHARLEROI 52 0.07 0.212
4 2011 FACILITY CHARLEROI 52 0.08 0.209
DL_Maximum Number_of_measur~ Measure_Minimum Measure_Mean Measure_Maximum
1 16.0 0 0 0 0
2 1.3 0 0 0 0
3 1.9 0 0 0 0
4 2.0 0 0 0 0
Note
DF <-
structure(list(Row.names = c("2011.FACILITY.PONT-À-CELLES", "2011.FACILITY.PONT-À-CELLES",
"2011.FACILITY.CHARLEROI", "2011.FACILITY.CHARLEROI"), `Number_of_analysis~` = c(52L,
52L, 52L, 52L), DL_Minimum = c(0.6, 0.07, 0.07, 0.08), DL_Mean = c(1.81,
0.177, 0.212, 0.209), DL_Maximum = c(16, 1.3, 1.9, 2), `Number_of_measur~` = c(0L,
0L, 0L, 0L), Measure_Minimum = c(0L, 0L, 0L, 0L), Measure_Mean = c(0L,
0L, 0L, 0L), Measure_Maximum = c(0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
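To check whether the accent survives outside separate, the same split can be sketched with base R's strsplit. This is an illustration only: the accented character is written as the escape \u00C0 so the snippet does not depend on the file encoding, and fixed = TRUE treats the dot as a literal character rather than a regular expression.

```r
x <- c("2011.FACILITY.PONT-\u00C0-CELLES", "2011.FACILITY.CHARLEROI")

# Split on a literal dot; each element yields exactly three pieces here.
parts <- do.call(rbind, strsplit(x, ".", fixed = TRUE))
colnames(parts) <- c("Year", "Catchment", "Locality")

out <- as.data.frame(parts, stringsAsFactors = FALSE)
out
```

If the accented rows come through intact here but not in the real data, the encoding of the input is a more likely culprit than the splitting step.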

Determining the percentage of values in each column for each cluster

I need to determine the percentage of values in each column for each cluster with condition. Reproducible example is below. I have a table like this:
> tab
GI RT TR VR Cluster_number
1 1000086986 0.5814 0.5814 0.628 1
10 1000728257 0.5814 0.5814 0.628 1
13 1000074769 0.7879 0.7879 0.443 2
14 1000498642 0.7879 0.7879 0.443 2
22 1000074765 0.7941 0.3600 0.533 3
26 1000597385 0.7941 0.3600 0.533 3
31 1000502373 0.5000 0.5000 0.607 4
32 1000532631 0.6875 0.7059 0.607 4
33 1000597694 0.5000 0.5000 0.607 4
34 1000598724 0.5000 0.5000 0.607 4
And I need a table like this:
> tab1
Cluster_number RT_cond TR_cond VR_cond
1 1 0 0 100
2 2 100 100 0
3 3 100 0 0
4 4 25 25 100
Where the values in each column indicate the percentage of GI in the corresponding cluster for which RT >= 0.6, TR >= 0.6 and VR >= 0.6, respectively. For example, in the first cluster all RT < 0.6, so the value 0 is written in the first row of the final table; in the fourth cluster, one of the four TR values is >= 0.6, so the corresponding value in the final table is 25. How can I do this?
You can group_by Cluster_number and use across to calculate percentage :
library(dplyr)
df %>%
  group_by(Cluster_number) %>%
  summarise(across(RT:VR, ~mean(. >= 0.6) * 100, .names = '{col}_cond'))
# In older versions of dplyr use summarise_at
# summarise_at(vars(RT:VR), ~mean(. >= 0.6) * 100)
# Cluster_number RT_cond TR_cond VR_cond
# <int> <dbl> <dbl> <dbl>
#1 1 0 0 100
#2 2 100 100 0
#3 3 100 0 0
#4 4 25 25 100
In base R, we can use aggregate :
aggregate(cbind(RT, TR, VR)~Cluster_number, df, function(x) mean(x >= 0.6) * 100)
data
df <- structure(list(GI = c(1000086986L, 1000728257L, 1000074769L,
1000498642L, 1000074765L, 1000597385L, 1000502373L, 1000532631L,
1000597694L, 1000598724L), RT = c(0.5814, 0.5814, 0.7879, 0.7879,
0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5), TR = c(0.5814, 0.5814,
0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5), VR = c(0.628,
0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607
), Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L)),
class = "data.frame", row.names = c("1", "10", "13", "14", "22",
"26", "31", "32", "33", "34"))
With the dplyr package you can use a group_by statement followed by summarise, and then rename the columns of interest with the new rename_with function:
library(dplyr)
tab %>%
  group_by(Cluster_number) %>%
  summarise(across(c(RT, TR, VR), ~mean(. >= 0.6) * 100)) %>%
  rename_with(~paste0(., "_cond"), c(RT, TR, VR))
# A tibble: 4 x 4
# Cluster_number RT_cond TR_cond VR_cond
# <int> <dbl> <dbl> <dbl>
# 1 1 0 0 100
# 2 2 100 100 0
# 3 3 100 0 0
# 4 4 25 25 100
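Both answers rely on the fact that the mean of a logical vector is the share of TRUE values. A minimal base R sketch of that step, using only the TR values of clusters 1 and 4 from the question (the object names flag and pct are just for this example):

```r
tab <- data.frame(
  Cluster_number = c(1, 1, 4, 4, 4, 4),
  TR             = c(0.5814, 0.5814, 0.5, 0.7059, 0.5, 0.5)
)

# TRUE where the condition holds, FALSE otherwise.
flag <- tab$TR >= 0.6

# Mean of TRUE/FALSE per cluster is the proportion of TRUE; times 100 gives percent.
pct <- tapply(flag, tab$Cluster_number, mean) * 100
pct
```

Cluster 1 has no TR value at or above 0.6, and cluster 4 has one out of four, which matches the 0 and 25 in the expected table.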

Merging rows with same value with conditions for keeping multiple dummies

Creating a subset example of the DF (the code for a part of the actual one is at the end)
ANO_CENSO PK_COD_TURMA PK_COD_ENTIDADE MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
2011 27 12 1 0 0 1
2011 41 12 1 1 0 0
2011 18 13 0 0 0 1
2011 16 14 1 1 0 1
I want to merge the rows with the same value of PK_COD_ENTIDADE into a single row, keeping the value "1" for any dummy that is 1 in at least one of those rows. I don't care about the differing values in PK_COD_TURMA; it doesn't matter which one (27 or 41) stays in the final DF.
My DF has several variables like PK_COD_TURMA whose final value I don't care about; the important ones are PK_COD_ENTIDADE and the dummies with value "1".
It would look like this at the end:
ANO_CENSO PK_COD_TURMA PK_COD_ENTIDADE MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
2011 27 12 1 1 0 1
2011 18 13 0 0 0 1
2011 16 14 1 1 0 1
Notice that the first observation with PK_COD_ENTIDADE = 12 has the value "1" in 2 dummies, and the second observation with PK_COD_ENTIDADE = 12 has a "1" in another dummy. In the end they are merged into a single observation for that PK_COD_ENTIDADE, keeping every dummy that was "1" (a dummy that is 1 in several observations does not sum to 2, because it is a dummy).
I have no idea how to do this, I searched for some solutions with dplyr but couldn't apply anything close to working...
Here is the structure of the df with all variables:
dftest2 <- structure(list(ANO_CENSO = c(2011, 2011, 2011, 2011), PK_COD_TURMA = c(27,
41, 18, 16), NU_DURACAO_TURMA = c(250, 255, 255,
255), FK_COD_ETAPA_ENSINO = c(41, 19, 19, 19), PK_COD_ENTIDADE = c(12,
12, 13, 14), FK_COD_ESTADO = c(11, 11, 11,
11), SIGLA = c("RO", "RO", "RO", "RO"), FK_COD_MUNICIPIO = c(1100023,
1100023, 1100023, 1100023), ID_LOCALIZACAO = c(1, 1, 1, 1), ID_DEPENDENCIA_ADM = c(2,
2, 2, 2), MAIS_ENSINO_FUND = c(1, 1, 0, 1), MAIS_ENSINO_MED = c(0,
1, 0, 1), ENSINO_INTEG_FUND = c(0L, 0L, 0L, 0L), ENSINO_INTEG_MED = c(1L,
0L, 1L, 1L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
The sample data you give for dftest2 does not match the data you present at the beginning of your post.
In response to your question, an option is to use aggregate
aggregate(
  . ~ PK_COD_ENTIDADE,
  data = transform(dftest2, SIGLA = as.factor(SIGLA)),
  FUN = max)
# PK_COD_ENTIDADE ANO_CENSO PK_COD_TURMA NU_DURACAO_TURMA FK_COD_ETAPA_ENSINO
#1 12 2011 41 255 41
#2 13 2011 18 255 19
#3 14 2011 16 255 19
# FK_COD_ESTADO SIGLA FK_COD_MUNICIPIO ID_LOCALIZACAO ID_DEPENDENCIA_ADM
#1 11 1 1100023 1 2
#2 11 1 1100023 1 2
#3 11 1 1100023 1 2
# MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
#1 1 1 0 1
#2 0 0 0 1
#3 1 1 0 1
Explanation: We first convert the character column SIGLA to a factor; then we aggregate data in all columns (except PK_COD_ENTIDADE) by PK_COD_ENTIDADE, and return the max value (which should be consistent with your problem statement).
You can do something similar using dplyr's group_by and summarise_all:
library(dplyr)
dftest2 %>%
  group_by(PK_COD_ENTIDADE) %>%
  summarise_all(~ifelse(is.character(.x), last(.x), max(.x))) %>%
  ungroup()
# A tibble: 3 x 14
PK_COD_ENTIDADE ANO_CENSO PK_COD_TURMA NU_DURACAO_TURMA FK_COD_ETAPA_EN…
<dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2011 41 255 41
2 13 2011 18 255 19
3 14 2011 16 255 19
# … with 9 more variables: FK_COD_ESTADO <dbl>, SIGLA <chr>,
# FK_COD_MUNICIPIO <dbl>, ID_LOCALIZACAO <dbl>, ID_DEPENDENCIA_ADM <dbl>,
# MAIS_ENSINO_FUND <dbl>, MAIS_ENSINO_MED <dbl>, ENSINO_INTEG_FUND <int>,
# ENSINO_INTEG_MED <int>
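Why max works for the dummies: for 0/1 columns the group maximum is 1 as soon as any row in the group has a 1, which is the "keep the 1" rule from the question. A reduced sketch on a subset of the columns (the data frame name d and the choice of columns are just for this example):

```r
d <- data.frame(
  PK_COD_ENTIDADE  = c(12, 12, 13, 14),
  MAIS_ENSINO_FUND = c(1, 1, 0, 1),
  MAIS_ENSINO_MED  = c(0, 1, 0, 1),
  ENSINO_INTEG_MED = c(1, 0, 1, 1)
)

# One row per PK_COD_ENTIDADE; each dummy becomes 1 if any row in the group had a 1.
merged <- aggregate(. ~ PK_COD_ENTIDADE, data = d, FUN = max)
merged
```

For PK_COD_ENTIDADE = 12 the two source rows (1, 0, 1) and (1, 1, 0) collapse to (1, 1, 1), and no dummy ever exceeds 1.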

How to assign a value in a data frame based on multiple conditions of another data frame

I have two data frames, one consisting of numerical values called 'esame':
media id_poll fin
1 5.330000e+00 360 1
2 6.833333e-02 361 0
3 0.000000e+00 362 0
4 NA 363 0
5 8.200000e-01 364 0
6 3.416667e-01 365 0
7 0.000000e+00 366 0
8 0.000000e+00 367 0
9 0.000000e+00 368 0
10 NA 369 0
11 6.150000e-01 370 0
12 0.000000e+00 371 0
13 0.000000e+00 372 0
14 NA 373 0
15 0.000000e+00 374 0
16 0.000000e+00 375 0
17 0.000000e+00 376 0
18 1.298333e+00 377 0
And the second one consisting of numerical ranges which I would like to use to check in which range the 'media' field of the first data.frame is.
If it's in the first range I would like to assign "1" to the field "fin" of the first data.frame, if it's in the second I would like to assign "2" and so on.
So here it is the second data.frame with some of the conditions I'll need:
Range1 Range2 Range3 Range4 ID
0.5 9.9 29.9 >30 360
0.5 15.9 49.9 >50 361
0 4.9 24.9 >25 362
First of all I suppose I won't need to declare Range4 as it's already an information included in Range3. I removed the initial value of all numerical ranges as I need just a single number to check against (or so I think). The same row for ID 360 could be written as:
Range1 Range2 Range3 Range4 ID
0.5 0.6-9.9 10-29.9 >30 360
So my guess is to do something like this:
esame$fin<-ifelse (esame$media<0.6 & datofinale$id_poll=="360", "1", "0")
I could substitute the "0" value with another 'ifelse' statement and go on manually.
Is there any faster way to do that? (The list containing all the conditions is actually much larger than the example.)
Thank you for any advice.
Not too nice, but this should work:
require(dplyr)
inner_join(esame, df1, by = c("id_poll" = "ID")) %>%
  rowwise() %>%
  mutate(fin = findInterval(media, c(-Inf, Range1, Range2, Range3), left.open = TRUE))
Reproducible data
esame <- structure(list(media = c(5.33, 0.06833333, 0, NA, 0.82, 0.3416667,
0, 0, 0, NA, 0.615, 0, 0, NA, 0, 0, 0, 1.298333), id_poll = 360:377,
fin = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L)), .Names = c("media", "id_poll", "fin"
), row.names = c(NA, -18L), class = "data.frame")
df1 <- structure(list(Range1 = c(0.5, 0.5, 0), Range2 = c(9.9, 15.9,
4.9), Range3 = c(29.9, 49.9, 24.9), Range4 = c(">30", ">50",
">25"), ID = 360:362), .Names = c("Range1", "Range2", "Range3",
"Range4", "ID"), row.names = c(NA, -3L), class = "data.frame")
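To see what the findInterval call does for a single row, here is a small sketch using the bounds for ID 360 (0.5, 9.9, 29.9). The test values in media are made up for illustration. With left.open = TRUE the intervals are (-Inf, 0.5], (0.5, 9.9], (9.9, 29.9] and everything above 29.9, numbered 1 to 4.

```r
bounds <- c(-Inf, 0.5, 9.9, 29.9)      # Range1..Range3 for ID 360, preceded by -Inf
media  <- c(0.3, 0.5, 5.33, 10, 35)    # illustrative values, not from the question

# Index of the interval each value falls into; a value equal to a bound
# belongs to the interval that ends at that bound.
fin <- findInterval(media, bounds, left.open = TRUE)
fin
```

An NA in media stays NA in the result, so rows without a measurement are not assigned a class.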
dplyr solution
Using case_when
library(dplyr)
df2 <- left_join(esame, df1, by = c("id_poll" = "ID")) %>%
  mutate(fin = case_when(media > Range3 ~ 4,
                         media > Range2 ~ 3,
                         media > Range1 ~ 2,
                         media <= Range1 ~ 1,
                         is.na(Range1) ~ 0)) # else case
Output
media ID fin Range1 Range2 Range3 Range4
1 5.33000000 360 2 0.5 9.9 29.9 >30
2 0.06833333 361 1 0.5 15.9 49.9 >50
3 0.00000000 362 1 0.0 4.9 24.9 >25
4 NA 363 0 NA NA NA <NA>
5 0.82000000 364 0 NA NA NA <NA>
We can consider each row in the range data.frame as a vector and ask whether the current media value is greater than the value in this vector.
For simplicity, I'm assuming that every value in the first data.frame has a counterpart in the second, and that both are ordered the same way.
for (i in seq_len(nrow(esame))) {
  # TRUE where media is greater than the range bound, FALSE where it is within it; you want the first FALSE
  greater.than <- esame[i, 1] > range[i, 1:3]
  # the number of TRUEs + 1 is the position of the first FALSE (also works when there is no TRUE)
  esame$fin[i] <- sum(greater.than) + 1
}
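The loop above can also be written without an explicit loop. The following is a sketch under the same assumptions (one row of bounds per media value, in the same order); the matrix name bounds is introduced here for illustration and holds Range1 to Range3 for IDs 360 to 362.

```r
media  <- c(5.33, 0.06833333, 0)
bounds <- cbind(Range1 = c(0.5, 0.5, 0),
                Range2 = c(9.9, 15.9, 4.9),
                Range3 = c(29.9, 49.9, 24.9))

# media is recycled down each column, so element [i, j] compares media[i]
# with bounds[i, j]; counting the exceeded bounds per row and adding 1
# gives the range number.
fin <- rowSums(media > bounds) + 1
fin
```

This relies on media having exactly as many elements as bounds has rows; with mismatched lengths the recycling would silently compare the wrong pairs.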
Here dat is the first data frame and tad the second. This uses nested ifelse(), puts 0 where the value is NA, and assumes that the first range runs from 0 to the given value. Please show an example of the expected result so that it can be checked.
dat$fin <- sapply(1:nrow(dat), function(x) ifelse(dat[x,1] >= tad[x,1] & !is.na(dat[x,1]), 1, ifelse(dat[x,1] >= tad[x,2] & !is.na(dat[x,1]), 2, ifelse(dat[x,1] >= tad[x,3] & !is.na(dat[x,1]), 3, 0))))
>dat
media id_poll fin
1 5.33000000 360 1
2 0.06833333 361 0
3 0.00000000 362 1

Dividing grouped data by group means r

I have data split up into two categories:
z= Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
I'd like to divide each value of Tracer by the group mean depending on which group it belongs to (e.g. All values of Tracer belonging to time=0 and treatment=S are divided by their mean).
The procedure would be something like this:
1: Find the category means:
aggmeanz <- aggregate(z$Tracer, list(time = z$time, treatment = z$treatment), FUN = mean)
2: Divide z$Tracer by the correct aggmeanz value
structure(list(Tracer = c(15L, 20L, 25L, 4L, 55L, 16L, 15L, 20L
), time = c(0L, 0L, 0L, 0L, 15L, 15L, 15L, 15L), treatment = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("S", "X"), class = "factor")), .Names = c("Tracer",
"time", "treatment"), class = "data.frame", row.names = c(NA,
-8L))
Alternatively, here is a dplyr solution:
library(dplyr)
group_by(z, time, treatment) %>%
  mutate(pmean = Tracer / mean(Tracer))
Output:
Tracer time treatment pmean
(int) (int) (fctr) (dbl)
1 15 0 S 0.8571429
2 20 0 S 1.1428571
3 25 0 X 1.7241379
4 4 0 X 0.2758621
5 55 15 S 1.5492958
6 16 15 S 0.4507042
7 15 15 X 0.8571429
8 20 15 X 1.1428571
Data:
z <- read.table(text="Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X",head=TRUE)
Is it OK to use non-base tools? With data.table installed and loaded:
library(data.table)
z <- data.table(z)
z[, scaledTracer := Tracer/mean(Tracer), by = c("time","treatment")]
This computes the mean for each unique combination of time and treatment (which appear to be groups of 2 rows in your data) and scales the Tracer values in each group by the appropriate mean.
It's not the prettiest but:
groupmeans = aggregate(z$Tracer, by = list(z$time, z$treatment), FUN = mean)
Group.1 Group.2 x
0 S 17.5
15 S 35.5
0 X 14.5
15 X 17.5
names(groupmeans) = c("time", "treatment", "groupmean")
z = merge(z, groupmeans, by = c("time", "treatment"))
z$tracer_div = z$Tracer/z$groupmean
time treatment groupmean Tracer tracer_div
0 S 17.5 15 0.8571429
0 S 17.5 20 1.1428571
0 X 14.5 25 1.7241379
0 X 14.5 4 0.2758621
15 S 35.5 55 1.5492958
15 S 35.5 16 0.4507042
15 X 17.5 15 0.8571429
15 X 17.5 20 1.1428571
You could reassign z$Tracer to the final step if you didn't want to create a whole new column. It can be nice to keep every step though in case you want to use it in another calculation or plot later.
A base R solution:
do.call(c, lapply(split(z[1], z[, -1]), FUN = function(x) x[[1]]/mean(x[[1]])))
# 0.S1 0.S2 15.S1 15.S2 0.X1 0.X2 15.X1 15.X2
#0.8571429 1.1428571 1.5492958 0.4507042 1.7241379 0.2758621 0.8571429 1.1428571
Split into time x treatment groups first, then divide each group by its mean. Finally, glue the pieces back together with c.
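Another base R option, sketched here as an alternative to the split/lapply approach, is ave, which returns the group mean aligned with the original rows, so the result keeps the row order of z. The column name scaled is an assumption for this example.

```r
z <- data.frame(
  Tracer    = c(15, 20, 25, 4, 55, 16, 15, 20),
  time      = c(0, 0, 0, 0, 15, 15, 15, 15),
  treatment = c("S", "S", "X", "X", "S", "S", "X", "X")
)

# ave() computes the mean (its default FUN) within each time x treatment group
# and repeats it for every row of that group.
z$scaled <- z$Tracer / ave(z$Tracer, z$time, z$treatment)
z
```

Because every group is divided by its own mean, the scaled values average to 1 within each group.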
