I am pretty new to R, but comfortable with Spotfire. I found this difficult to do in Spotfire itself, and someone pointed me to the TERR tools in Spotfire that can run R code.
I am trying to find the p10, p50 and p90 of multi-category time series data. An example of the data looks like this:
Category Time Rate
1 0
1 0.0104
1 0.1354 0.002
1 0.2604 0.139
1 0.3854 0.280
1 0.5104 0.299
1 0.6354 0.313
1 0.7604 0.403
1 0.8854 0.429
1 1.0104 0.408
1 1.1354 0.415
1 1.2604 0.482
1 1.3854 0.484
2 0
2 0.0104
2 0.1354
2 0.2604
2 0.3854 0.064
2 0.5104 0.166
2 0.6354 0.148
2 0.7604 0.141
2 0.8854 0.254
2 1.0104 0.286
2 1.1354 0.292
2 1.2604 0.296
2 1.3854 0.310
2 1.5104 0.304
2 1.6354 0.303
2 1.7604 0.301
2 1.8854 0.300
2 2.0104 0.319
2 2.1354 0.330
2 2.2604 0.330
2 2.3854 0.331
2 2.5104 0.332
2 2.6354 0.334
2 2.7604 0.330
2 2.8854 0.326
2 3.0104 0.325
3 0
3 0.0104
3 0.1354
3 0.2604 0.010
3 0.3854 0.021
3 0.5104 0.021
3 0.6354 0.021
3 0.7604 0.023
3 0.8854 0.026
3 1.0104 0.028
3 1.1354 0.029
3 1.2604 0.027
3 1.3854 0.033
3 1.5104 0.035
3 1.6354 0.034
In the end, I want to calculate additional columns with the p10, p50 and p90 values, as in the attached picture. p10 and p90 are the dashed lines and p50 is the solid red line.
Thanks
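As a rough sketch of the calculation itself (an assumption on my part: that the data frame is called df with columns Category, Time and Rate, and that p10/p50/p90 are taken across categories at each Time value), it could look like this in R with dplyr:
library(dplyr)
# Percentiles across categories at each time point; missing rates are ignored
percentiles <- df %>%
  group_by(Time) %>%
  summarise(p10 = quantile(Rate, 0.10, na.rm = TRUE),
            p50 = quantile(Rate, 0.50, na.rm = TRUE),
            p90 = quantile(Rate, 0.90, na.rm = TRUE))
# Join back onto the original data to get p10/p50/p90 as extra columns
df_with_percentiles <- left_join(df, percentiles, by = "Time")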
I am quite new to R, so I need some help working out this problem. I have a data frame of daily rainfall values for different regions (AEZ).
The output needs to be another table whose entries are (individual rainfall - column average) / column standard deviation.
For example, in the table below, for `01-Jan` and AEZ 3 the calculation is (0.0402 - Average(`01-Jan`)) / SD(`01-Jan`). This calculation needs to be run for each AEZ, and the output will then be another table with the results of these calculations.
AEZ `01-Jan` `02-Jan` `03-Jan` `04-Jan` `05-Jan` `06-Jan` `07-Jan`
1 3 0.0402 0.0044 0.0998 0.142 0.0061 0.0267 0.0351
2 12 0.0143 0.0027 0.0027 0.0029 0.0317 0.0012 0.0012
3 48 0 0 0.0026 0.0015 0.0019 0 0
4 77 0 0 0.0059 0.0124 0.0048 0.0009 0
5 160 0.0261 0.0173 0.057 0.0221 0.0892 0 0.0003
6 162 0.167 0.0037 0.0041 0.0683 0.102 0.199 0.0308
7 178 0.0062 0.0033 0.0808 0.101 0.0033 0.0023 0.0315
This will standardise (center and scale) the original data frame:
df[,-1] <- scale(df[,-1], center = TRUE, scale = TRUE)
To scale a copy do:
foo <- df
foo[,-1] <- scale(foo[,-1], center = TRUE, scale = TRUE)
We could use dplyr:
library(dplyr)
data %>%
mutate(across(-AEZ, ~ (.x - mean(.x)) / sd(.x)))
which returns
# A tibble: 7 x 8
AEZ `\`01-Jan\`` `\`02-Jan\`` `\`03-Jan\`` `\`04-Jan\`` `\`05-Jan\`` `\`06-Jan\`` `\`07-Jan\``
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0.0663 -0.0145 1.51 1.67 -0.647 -0.0835 1.22
2 12 -0.369 -0.302 -0.793 -0.857 -0.0563 -0.429 -0.751
3 48 -0.610 -0.759 -0.795 -0.882 -0.744 -0.445 -0.821
4 77 -0.610 -0.759 -0.717 -0.684 -0.677 -0.433 -0.821
5 160 -0.171 2.17 0.495 -0.508 1.27 -0.445 -0.804
6 162 2.20 -0.133 -0.760 0.332 1.56 2.25 0.969
7 178 -0.505 -0.201 1.06 0.926 -0.711 -0.414 1.01
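As a quick sanity check (not part of the original answers), the first standardised value can be reproduced by hand from the `01-Jan` column:
# Recompute the first standardised value of the 01-Jan column manually
jan01 <- c(0.0402, 0.0143, 0, 0, 0.0261, 0.167, 0.0062)
(0.0402 - mean(jan01)) / sd(jan01)   # ~ 0.0663, matching the first entry above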
I am working with some baseball data and am trying to figure out how to write functions that operate on all columns in a list of similarly structured data frames. For instance, I have on-base percentage (OBP) for a team for each game played. I want to run shift(cumsum(teamname$OBP)) for each data frame in the list, to get the cumulative sum of OBP for the season, shifted so that each row contains the previous day's cumulative sum. Is there a way to do this with an apply function, so that it runs the same operation over all the data frames?
Below are some of the variables that I'm looking to apply this over:
> head(teams$SEA[27:34])
# A tibble: 6 x 8
RA OBP SLG wOBA wOPS RC PE BBSOr
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7 0.375 0.452 0.326 0.777 6.20 0.623 0.6
2 4 0.314 0.442 0.298 0.740 7.29 0.610 0.556
3 4 0.419 0.861 0.478 1.34 13.8 0.9 0.444
4 7 0.3 0.559 0.350 0.908 6.48 0.424 0.4
5 5 0.368 0.441 0.354 0.795 6.74 0.590 0.4
6 8 0.422 0.568 0.418 0.985 10.1 0.610 0.636
My list looks like this.
> summary(teams)
Length Class Mode
SEA 34 spec_tbl_df list
PIT 34 spec_tbl_df list
ARI 34 spec_tbl_df list
COL 34 spec_tbl_df list
SLN 34 spec_tbl_df list
Sample Data
I gave the Pirates a lower OBP to show the function runs over the data frames separately, and also because they seemed the most likely candidate of the five non-SEA teams to have such a poor OBP :)
On future questions, please add sample data that others can easily paste into R. dput() is a useful function for doing so.
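For example, a call like the following (hypothetical, reusing the object from the question) prints the object as R code that anyone can paste to recreate it:
# dput() prints an object as R code that others can copy and paste
dput(head(teams$SEA[27:34]))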
library(data.table)
SEA <- read.table(text = " RA OBP SLG wOBA wOPS RC PE BBSOr
7 0.375 0.452 0.326 0.777 6.20 0.623 0.6
4 0.314 0.442 0.298 0.740 7.29 0.610 0.556
4 0.419 0.861 0.478 1.34 13.8 0.9 0.444
7 0.3 0.559 0.350 0.908 6.48 0.424 0.4
5 0.368 0.441 0.354 0.795 6.74 0.590 0.4
8 0.422 0.568 0.418 0.985 10.1 0.610 0.636",
header = TRUE)
PIT <- SEA
PIT$OBP <- PIT$OBP/2
ARI <- SEA
COL <- SEA
SLN <- SEA
teams <- list(SEA = SEA,
PIT = PIT,
ARI = ARI,
COL = COL,
SLN = SLN)
Solution
lapply(teams,
function(x){
x$OBP_cumsum <- shift(cumsum(x$OBP))
x
})
$SEA
RA OBP SLG wOBA wOPS RC PE BBSOr OBP_cumsum
1 7 0.375 0.452 0.326 0.777 6.20 0.623 0.600 NA
2 4 0.314 0.442 0.298 0.740 7.29 0.610 0.556 0.375
3 4 0.419 0.861 0.478 1.340 13.80 0.900 0.444 0.689
4 7 0.300 0.559 0.350 0.908 6.48 0.424 0.400 1.108
5 5 0.368 0.441 0.354 0.795 6.74 0.590 0.400 1.408
6 8 0.422 0.568 0.418 0.985 10.10 0.610 0.636 1.776
$PIT
RA OBP SLG wOBA wOPS RC PE BBSOr OBP_cumsum
1 7 0.1875 0.452 0.326 0.777 6.20 0.623 0.600 NA
2 4 0.1570 0.442 0.298 0.740 7.29 0.610 0.556 0.1875
3 4 0.2095 0.861 0.478 1.340 13.80 0.900 0.444 0.3445
4 7 0.1500 0.559 0.350 0.908 6.48 0.424 0.400 0.5540
5 5 0.1840 0.441 0.354 0.795 6.74 0.590 0.400 0.7040
6 8 0.2110 0.568 0.418 0.985 10.10 0.610 0.636 0.8880
$ARI
RA OBP SLG wOBA wOPS RC PE BBSOr OBP_cumsum
1 7 0.375 0.452 0.326 0.777 6.20 0.623 0.600 NA
2 4 0.314 0.442 0.298 0.740 7.29 0.610 0.556 0.375
3 4 0.419 0.861 0.478 1.340 13.80 0.900 0.444 0.689
4 7 0.300 0.559 0.350 0.908 6.48 0.424 0.400 1.108
5 5 0.368 0.441 0.354 0.795 6.74 0.590 0.400 1.408
6 8 0.422 0.568 0.418 0.985 10.10 0.610 0.636 1.776
$COL
RA OBP SLG wOBA wOPS RC PE BBSOr OBP_cumsum
1 7 0.375 0.452 0.326 0.777 6.20 0.623 0.600 NA
2 4 0.314 0.442 0.298 0.740 7.29 0.610 0.556 0.375
3 4 0.419 0.861 0.478 1.340 13.80 0.900 0.444 0.689
4 7 0.300 0.559 0.350 0.908 6.48 0.424 0.400 1.108
5 5 0.368 0.441 0.354 0.795 6.74 0.590 0.400 1.408
6 8 0.422 0.568 0.418 0.985 10.10 0.610 0.636 1.776
$SLN
RA OBP SLG wOBA wOPS RC PE BBSOr OBP_cumsum
1 7 0.375 0.452 0.326 0.777 6.20 0.623 0.600 NA
2 4 0.314 0.442 0.298 0.740 7.29 0.610 0.556 0.375
3 4 0.419 0.861 0.478 1.340 13.80 0.900 0.444 0.689
4 7 0.300 0.559 0.350 0.908 6.48 0.424 0.400 1.108
5 5 0.368 0.441 0.354 0.795 6.74 0.590 0.400 1.408
6 8 0.422 0.568 0.418 0.985 10.10 0.610 0.636 1.776
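A small usage note (not part of the original answer): shift() here is data.table::shift(), which lags by one row and fills with NA by default, hence the leading NA in OBP_cumsum. lapply() does not modify teams itself, so assign the result back if you want to keep the new column:
# Keep the new column by assigning the result back to the list
teams <- lapply(teams,
                function(x){
                  x$OBP_cumsum <- shift(cumsum(x$OBP))  # previous game's cumulative OBP
                  x
                })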
I tried to solve the problem with questions here on SO but I could not find a satisfying answer. My data frame has the structure
X = data_frame(
treat = c(rep(1,4), rep(2,4), rep(3,4), rep(4,4)),
id = seq(1:16),
x = rnorm(16),
y = rnorm(16),
z = rnorm(16)
)
Looks like
# A tibble: 16 x 5
treat id x y z
<int> <int> <dbl> <dbl> <dbl>
1 1 1 -0.0724 1.26 0.317
2 1 2 -0.486 -0.628 0.392
3 1 3 -0.406 -0.706 1.18
4 1 4 -1.35 -1.27 2.36
5 2 5 -0.0751 -0.0394 0.568
6 2 6 0.243 0.873 0.132
7 2 7 0.138 0.611 -0.700
8 2 8 -0.732 1.02 -0.811
9 3 9 -0.0278 1.78 0.568
10 3 10 0.526 1.18 1.03
11 3 11 1.43 0.0937 -0.0825
12 3 12 -0.299 -0.117 0.367
13 4 13 1.05 2.04 0.678
14 4 14 -1.93 0.201 0.250
15 4 15 0.624 1.09 0.852
16 4 16 0.502 0.119 -0.843
Every fourth value within each treat group is a control, and I now want to calculate the difference in x, y and z between the treatments and their control. For example, for the first treatment row I would like to calculate
-0.0724 - (-1.35) #x
1.26 - (-1.27) #y
0.317 - 2.36 #z
For the second treatment row, accordingly,
-0.486 - (-1.35) #x
-0.628 - (-1.27) #y
0.392 - 2.36 #z
... and so on.
I would like to use a dplyr / tidyverse solution but I have no idea how to do that in a "smooth" way. I found a solution already by using joins but this seems rather tedious compared to the "smooth" solution dplyr usually offers.
With dplyr, we can group_by treat, use mutate_at to select the specific columns (x:z), and subtract the 4th value from each value using the nth function.
library(dplyr)
X %>%
group_by(treat) %>%
mutate_at(vars(x:z), funs(. - nth(., 4)))
#treat id x y z
# <dbl> <int> <dbl> <dbl> <dbl>
# 1 1 1 -0.631 0.971 0.206
# 2 1 2 -0.301 -1.49 0.189
# 3 1 3 1.49 1.17 0.133
# 4 1 4 0 0 0
# 5 2 5 1.39 -0.339 0.934
# 6 2 6 2.98 0.511 0.319
# 7 2 7 1.73 -0.297 0.0745
# 8 2 8 0 0 0
# 9 3 9 -1.05 -0.778 -2.86
#10 3 10 -0.805 -1.84 -2.38
#11 3 11 0.864 0.684 -3.43
#12 3 12 0 0 0
#13 4 13 -1.39 -0.843 1.67
#14 4 14 -1.68 1.55 -0.656
#15 4 15 -2.34 0.722 0.0638
#16 4 16 0 0 0
This can also be written as
X %>%
group_by(treat) %>%
mutate_at(vars(x:z), funs(. - .[4]))
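In more recent dplyr versions, where funs() and mutate_at() are superseded, the same result can be written with across() (a sketch assuming the same X as above):
library(dplyr)
X %>%
  group_by(treat) %>%
  mutate(across(x:z, ~ .x - .x[4])) %>%  # subtract the 4th (control) row within each group
  ungroup()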
data
set.seed(123)
X = data_frame(
treat = c(rep(1,4), rep(2,4), rep(3,4), rep(4,4)),
id = seq(1:16),
x = rnorm(16),
y = rnorm(16),
z = rnorm(16)
)
My Question:
How do I calculate the average (mean) per sample (A, B, C) per day (day 3 and day 5 separately), and then add a line of best fit through the means from one day to the next?
I want to add this to a dot plot (ggplot2 geom_point); example data is below, and the R script I used follows the data.
Data below:
Day Sample Measurement
3 A 0.648
3 A 0.661
3 A 0.65
3 A 0.594
3 A 0.548
3 A 0.653
3 A 0.648
3 A 0.672
3 A 0.661
3 A 0.66
3 A 0.647
3 A 0.629
3 A 0.691
3 A 0.534
3 A 0.567
3 A 0.634
3 A 0.579
3 B 0.689
3 B 0.598
3 B 0.658
3 B 0.662
3 B 0.599
3 B 0.678
3 B 0.65
3 B 0.617
3 B 0.673
3 B 0.67
3 B 0.666
3 B 0.595
3 B 0.604
3 B 0.59
3 B 0.569
3 B 0.614
3 C 0.624
3 C 0.623
3 C 0.606
3 C 0.66
3 C 0.623
3 C 0.669
3 C 0.642
3 C 0.658
3 C 0.645
3 C 0.653
3 C 0.501
3 C 0.552
3 C 0.663
3 C 0.589
3 C 0.602
5 A 0.811
5 A 0.822
5 A 0.811
5 A 0.824
5 A 0.773
5 A 0.823
5 A 0.815
5 A 0.819
5 A 0.754
5 A 0.81
5 A 0.796
5 A 0.818
5 A 0.797
5 A 0.811
5 A 0.812
5 A 0.817
5 A 0.821
5 B 0.827
5 B 0.798
5 B 0.819
5 B 0.81
5 B 0.826
5 B 0.821
5 B 0.805
5 B 0.821
5 B 0.825
5 B 0.821
5 B 0.816
5 B 0.814
5 B 0.823
5 B 0.81
5 B 0.823
5 B 0.762
5 B 0.825
5 B 0.821
5 B 0.825
5 B 0.812
R Code for ggplot:
p2 <- ggplot(data=data1, aes(x=Day, y=Fv.Fm..XE..Mean)) +
geom_point(aes(colour= Sample),
position = position_jitterdodge(dodge.width=0.75 , jitter.width=0.250)) +
# geom_line(aes(colour=Sample),
# position = position_jitterdodge(dodge.width=0.75)) +
scale_x_discrete(labels=c(3, 5, 7, 10, 14)) +
scale_y_continuous(limits=c(0.3 , 1.0))
p2
ggsave("p2.jpg")
First calculate the mean for each Sample and Day
library(tidyverse)
library(ggpmisc)
data1 <- read.table(text = txt, header = TRUE)
mean_data1 <- data1 %>%
group_by(Day, Sample) %>%
summarise(Mean = mean(Measurement, na.rm = TRUE))
mean_data1
#> # A tibble: 5 x 3
#> # Groups: Day [?]
#> Day Sample Mean
#> <int> <fct> <dbl>
#> 1 3 A 0.628
#> 2 3 B 0.633
#> 3 3 C 0.621
#> 4 5 A 0.808
#> 5 5 B 0.815
Then plot all the Measurement values, faceted by Sample with facet_grid. Linear fits are added using geom_smooth. The stat_poly_eq function from the ggpmisc package is used to display the equation and R². Finally, we plot the Mean values.
p2 <- ggplot(data = data1, aes(x = Day, y = Measurement)) +
geom_point(aes(colour= Sample),
alpha = 0.7,
position = position_jitterdodge(dodge.width=0.75,
jitter.width=0.250)) +
scale_x_continuous(breaks=c(3, 5)) +
scale_y_continuous(limits=c(0.3 , 1.0))
formula <- y ~ x
p2 +
facet_grid(~ Sample) +
geom_smooth(method = "lm", formula = formula, se = FALSE) +
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~")),
label.x.npc = "left", label.y.npc = "top",
formula = formula, parse = TRUE, size = 4) +
geom_point(data = mean_data1, aes(Day, Mean, color = "Mean"),
size = 3) +
theme_bw() +
theme(aspect.ratio = 1)
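Note that in newer ggplot2/ggpmisc versions the ..eq.label.. dot-dot notation is deprecated in favour of after_stat(); an otherwise unchanged drop-in replacement for the stat_poly_eq() layer above would be:
# Same equation / R^2 label written with after_stat() instead of ..eq.label..
stat_poly_eq(aes(label = paste(after_stat(eq.label), after_stat(rr.label),
                               sep = "~~~~")),
             label.x.npc = "left", label.y.npc = "top",
             formula = formula, parse = TRUE, size = 4)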
Data used:
txt <- "Day Sample Measurement
3 A 0.648
3 A 0.661
3 A 0.65
3 A 0.594
3 A 0.548
3 A 0.653
3 A 0.648
3 A 0.672
3 A 0.661
3 A 0.66
3 A 0.647
3 A 0.629
3 A 0.691
3 A 0.534
3 A 0.567
3 A 0.634
3 A 0.579
3 B 0.689
3 B 0.598
3 B 0.658
3 B 0.662
3 B 0.599
3 B 0.678
3 B 0.65
3 B 0.617
3 B 0.673
3 B 0.67
3 B 0.666
3 B 0.595
3 B 0.604
3 B 0.59
3 B 0.569
3 B 0.614
3 C 0.624
3 C 0.623
3 C 0.606
3 C 0.66
3 C 0.623
3 C 0.669
3 C 0.642
3 C 0.658
3 C 0.645
3 C 0.653
3 C 0.501
3 C 0.552
3 C 0.663
3 C 0.589
3 C 0.602
5 A 0.811
5 A 0.822
5 A 0.811
5 A 0.824
5 A 0.773
5 A 0.823
5 A 0.815
5 A 0.819
5 A 0.754
5 A 0.81
5 A 0.796
5 A 0.818
5 A 0.797
5 A 0.811
5 A 0.812
5 A 0.817
5 A 0.821
5 B 0.827
5 B 0.798
5 B 0.819
5 B 0.81
5 B 0.826
5 B 0.821
5 B 0.805
5 B 0.821
5 B 0.825
5 B 0.821
5 B 0.816
5 B 0.814
5 B 0.823
5 B 0.81
5 B 0.823
5 B 0.762
5 B 0.825
5 B 0.821
5 B 0.825
5 B 0.812"
Created on 2018-03-17 by the reprex package (v0.2.0).
Let's create a data frame with fake data:
data_ex <- data.frame(y = runif(5,0,1), a1 = runif(5,0,1), b2 = runif(5,0,1),
c3 = runif(5,0,1), d4 = runif(5,0,1))
> data_ex
y a1 b2 c3 d4
1 0.162 0.221 0.483 0.989 0.558
2 0.445 0.854 0.732 0.723 0.259
3 0.884 0.041 0.893 0.985 0.947
4 0.944 0.718 0.338 0.238 0.592
5 0.094 0.867 0.026 0.334 0.314
The model's formula is as follows:
forml <- as.formula("y ~ a1 + b2 + a1:c3:d4 + a1:c3 + a1:b2 + a1:b2:c3")
> forml
y ~ a1 + b2 + a1:c3:d4 + a1:c3 + a1:b2 + a1:b2:c3
The resulting model.matrix is:
> as.matrix(model.matrix(forml, data_ex))
(Intercept) a1 b2 a1:c3 a1:b2 a1:c3:d4 a1:b2:c3
1 1 0.221 0.483 0.218 0.107 0.122 0.105
2 1 0.854 0.732 0.617 0.625 0.160 0.452
3 1 0.041 0.893 0.040 0.036 0.038 0.036
4 1 0.718 0.338 0.171 0.243 0.101 0.058
5 1 0.867 0.026 0.290 0.022 0.091 0.008
As you can see, the columns are reordered from the lowest interaction order to the highest.
I'm looking for a method that forces the model.matrix function to follow the EXACT order of the terms in the formula.
The resulting matrix should be like the following:
> Correct_matrix
(Intercept) a1 b2 a1:c3:d4 a1:c3 a1:b2 a1:b2:c3
1 1 0.221 0.483 0.122 0.218 0.107 0.105
2 1 0.854 0.732 0.160 0.617 0.625 0.452
3 1 0.041 0.893 0.038 0.040 0.036 0.036
4 1 0.718 0.338 0.101 0.171 0.243 0.058
5 1 0.867 0.026 0.091 0.290 0.022 0.008
You can create the terms object and keep the order of the terms with keep.order = TRUE. The resulting object can then be used with model.matrix.
model.matrix(terms(forml, keep.order = TRUE), data_ex)
The result:
(Intercept) a1 b2 a1:c3:d4 a1:c3 a1:b2 a1:b2:c3
1 1 0.4604044 0.10968326 0.198301034 0.3015807 0.05049866 0.03307836
2 1 0.4795555 0.61339588 0.018934135 0.2205621 0.29415737 0.13529189
3 1 0.7560366 0.67036486 0.001418541 0.4465991 0.50682035 0.29938436
4 1 0.4490247 0.69179890 0.135388984 0.1376586 0.31063480 0.09523209
5 1 0.7198557 0.08595737 0.131564438 0.2918157 0.06187690 0.02508371
attr(,"assign")
[1] 0 1 2 3 4 5 6
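As a quick check (not part of the original answer), the column names of the resulting matrix now follow the formula's term order exactly:
# Column order matches the order in which the terms are written in the formula
mm <- model.matrix(terms(forml, keep.order = TRUE), data_ex)
colnames(mm)
#> [1] "(Intercept)" "a1"        "b2"        "a1:c3:d4"  "a1:c3"     "a1:b2"     "a1:b2:c3"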