Pass argument from user provided function to aggregate (stats) - r

I am looking to create a function that aggregates sale data by many different variables. I am running into a snag with aggregate(by =). Here is my function thus far:
func <- function(x, x2, statfunc) {
PT <- c(1,5,3,5,4,8,3,1,5,6,1,5,5,6,1,2,3,1,5,1)
SH <- c(7,7,3,1,1,1,1,4,4,6,6,7,7,1,1,1,3,2,1,3)
SaleRatio <- c(0.85, 0.92, 0.89, 0.88, 0.86, 1.08, 1.15, 1.03, 0.95, 1.01, 1.36, 0.96, 1.03, 0.95, 0.90, 1.01, 0.96, 0.95, 0.81, 1.29)
study <- data.frame(PT, SH, SaleRatio)
study <- select(study, x2, SaleRatio)
study <- aggregate(study,
by = list(x),
FUN = statfunc)
print(study)
}
When I attempt to run my function with:
func(x = "study$PT", x2 = "PT", statfunc = median)
I get the error:
Error in aggregate.data.frame(study, by = list(x), FUN = statfunc) :
arguments must have same length
I am expecting this:
Group.1 PT SaleRatio
1 1 1 0.990
2 2 2 1.010
3 3 3 0.960
4 4 4 0.860
5 5 5 0.935
6 6 6 0.980
7 8 8 1.080
The results above come from exactly the same code, only with the arguments entered manually instead of being passed through the function.
This user provided function will eventually be applied with many different variables and aggregate functions, and on a much larger data set.
Can someone assist?

We can do this with the tidyverse by passing the column names as strings:
library(dplyr)
func <- function(x, x2, statfunc) {
PT <- c(1,5,3,5,4,8,3,1,5,6,1,5,5,6,1,2,3,1,5,1)
SH <- c(7,7,3,1,1,1,1,4,4,6,6,7,7,1,1,1,3,2,1,3)
SaleRatio <- c(0.85, 0.92, 0.89, 0.88, 0.86, 1.08, 1.15, 1.03, 0.95,
1.01, 1.36, 0.96, 1.03, 0.95, 0.90, 1.01, 0.96, 0.95, 0.81, 1.29)
study <- data.frame(PT, SH, SaleRatio)
study %>%
select(all_of(x2), SaleRatio) %>% # all_of() selects columns named in a character vector
group_by_at(x) %>%
summarise_all(statfunc)
}
func("PT", "PT", median)
# A tibble: 7 x 2
# PT SaleRatio
# <dbl> <dbl>
#1 1 0.99
#2 2 1.01
#3 3 0.96
#4 4 0.86
#5 5 0.935
#6 6 0.98
#7 8 1.08
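For a base R route that stays with aggregate(): the error in the question arises because "study$PT" is a length-one character string rather than the 20-element column, so by = list(x) does not match the number of rows. A minimal sketch, assuming the grouping and value columns are passed by name; the helper name agg_by and its argument names are hypothetical:

```r
# Hypothetical helper: group `data` by the column(s) named in `group_col`
# and apply `statfunc` to the column(s) named in `value_cols`.
agg_by <- function(data, group_col, value_cols, statfunc) {
  # data[group_col] is a one-column data frame (a named list), so
  # aggregate() receives a grouping vector of the right length and keeps its name
  aggregate(data[value_cols], by = data[group_col], FUN = statfunc)
}

PT <- c(1,5,3,5,4,8,3,1,5,6,1,5,5,6,1,2,3,1,5,1)
SaleRatio <- c(0.85, 0.92, 0.89, 0.88, 0.86, 1.08, 1.15, 1.03, 0.95, 1.01,
               1.36, 0.96, 1.03, 0.95, 0.90, 1.01, 0.96, 0.95, 0.81, 1.29)
study <- data.frame(PT, SaleRatio)

res <- agg_by(study, "PT", "SaleRatio", median)
print(res)
```

Because the column is looked up by name inside the function, the same call should work for other grouping variables and other summary functions such as mean.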

Related

R - transpose dataframe with multiple id columns and multiple variables [duplicate]

I am trying to use pivot_longer. However, I am not sure how to use names_sep or names_pattern to solve this.
dat <- tribble(
~group, ~BP, ~HS, ~BB, ~lowerBP, ~upperBP, ~lowerHS, ~upperHS, ~lowerBB, ~upperBB,
"1", 0.51, 0.15, 0.05, 0.16, 0.18, 0.5, 0.52, 0.14, 0.16,
"2.1", 0.67, 0.09, 0.06, 0.09, 0.11, 0.66, 0.68, 0.08, 0.1,
"2.2", 0.36, 0.13, 0.07, 0.12, 0.15, 0.34, 0.38, 0.12, 0.14,
"2.3", 0.09, 0.17, 0.09, 0.13, 0.16, 0.08, 0.11, 0.15, 0.18,
"2.4", 0.68, 0.12, 0.07, 0.12, 0.14, 0.66, 0.69, 0.11, 0.13,
"3", 0.53, 0.15, 0.06, 0.14, 0.16, 0.52, 0.53, 0.15, 0.16)
Desired output (First row from wide data)
group names values lower upper
1 BP 0.51 0.16 0.18
1 HS 0.15 0.5 0.52
1 BB 0.05 0.14 0.16
Here is a solution that follows a method similar to the one @Fnguyen used, but with the newer pivot_longer and pivot_wider functions:
library(dplyr)
library(tidyr)
longer<-pivot_longer(dat, cols=-1, names_pattern = "(.*)(..)$", names_to = c("limit", "name")) %>%
mutate(limit=ifelse(limit=="", "value", limit))
answer <-pivot_wider(longer, id_cols = c(group, name), names_from = limit, values_from = value, names_repair = "check_unique")
Most of the selecting, separating, mutating and renaming is taking place within the pivot function calls.
Update:
The regular expression "(.*)(..)$" means:
( ) ( ) Look for two parts (capture groups),
(.*) the first part has zero or more characters,
(..) the second part has exactly 2 characters, anchored at the end of the string by "$".
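To see how the pattern splits the column names, it can be tried on a few of them with base R's regexec() and regmatches(). This is only an illustrative sketch; the vector nms is made up from column names in the question:

```r
nms <- c("BP", "lowerBP", "upperHS")

# regexec() reports the full match followed by each capture group
parts <- do.call(rbind, regmatches(nms, regexec("(.*)(..)$", nms)))
colnames(parts) <- c("match", "limit", "name")
print(parts)
```

The first group is empty for a plain name such as "BP", which is why the mutate() step relabels "" as "value".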
A data.table version (I am not yet sure how to retain the original names so that you don't need to substitute them afterwards; see https://github.com/Rdatatable/data.table/issues/2551):
library(data.table)
df <- data.table(dat)
v <- c("BP","HS","BB")
setnames(df, v, paste0("x",v) )
g <- melt(df, id.vars = "group",
measure.vars = patterns(values = "x" ,
lower = "lower",
upper = "upper"),
variable.name = "names")
g[names==1, names := "BP" ]
g[names==2, names := "HS" ]
g[names==3, names := "BB" ]
group names values lower upper
1: 1 BP 0.51 0.16 0.18
2: 2.1 BP 0.67 0.09 0.11
3: 2.2 BP 0.36 0.12 0.15
4: 2.3 BP 0.09 0.13 0.16
5: 2.4 BP 0.68 0.12 0.14
6: 3 BP 0.53 0.14 0.16
7: 1 HS 0.15 0.50 0.52
8: 2.1 HS 0.09 0.66 0.68
9: 2.2 HS 0.13 0.34 0.38
10: 2.3 HS 0.17 0.08 0.11
11: 2.4 HS 0.12 0.66 0.69
12: 3 HS 0.15 0.52 0.53
13: 1 BB 0.05 0.14 0.16
14: 2.1 BB 0.06 0.08 0.10
15: 2.2 BB 0.07 0.12 0.14
16: 2.3 BB 0.09 0.15 0.18
17: 2.4 BB 0.07 0.11 0.13
18: 3 BB 0.06 0.15 0.16
Based on your example data, this solution using dplyr and tidyr works for me:
library(dplyr)
library(tidyr) # gather(), separate(), spread()
library(tibble) # rowid_to_column()
dat %>%
gather(key, values,-group) %>%
mutate(names = gsub("lower","",gsub("upper","",key))) %>%
separate(key, into = c("key1","key2") ,"[[:upper:]]", perl=T) %>%
mutate(key1 = case_when(key1 == "" ~ "values", TRUE ~ key1)) %>%
select(group,names,key1,values) %>%
rowid_to_column() %>%
spread(key1,values) %>%
select(-rowid) %>%
group_by(group,names) %>%
summarise_all(mean,na.rm = TRUE)
I'd like to add an alternative tidyverse solution drawing from the answer provided by @Dave2e.
Like Dave2e's solution, it's a two-step procedure (first rename, then reshape). Instead of reshaping the data twice, I add the prefix "values" to the columns named "BP", "HS", and "BB" using rename_with. This is necessary for getting the column names right when using the .value sentinel in the names_to argument of pivot_longer.
library(dplyr)
library(tidyr)
dat %>%
rename_with(~sub("^(BP|HS|BB)$", "values\\1", .)) %>% # add prefix values
pivot_longer(cols = -1, # the data comes from the pipe, so dat is not passed again
names_pattern = "(.*)(BP|HS|BB)$",
names_to = c(".value", "names"))

Merging monthly level data with quarterly data?

I have 2 data sets - one is quarterly which I need to match to monthly data. So the values from the quarterly data will be repeated thrice in the final data set. I have created a one quarter sample below but this would need to be repeated for many quarters.
month <- c(1/20, 2/20, 3/20)
rating <- c(0.5,0.6,0.65)
df1 <- cbind(month,rating)
quarter <- c("q1/20")
amount <- c(100)
df2 <- cbind(quarter,amount)
My final data set should have the following structure
month <- c(1/20, 2/20, 3/20)
rating <- c(0.5,0.6,0.65)
quarter <- c("q1/20", "q1/20", "q1/20")
amount <- c(100,100,100)
df3 <- cbind(month, rating, quarter, amount)
In the full quarterly data set (df1), some observations are also monthly so it would maybe be a case of matching the monthly observations by month and quarterly observations by quarter?
Thanks in anticipation.
Assuming you have this data.
head(m.dat)
# month rating
# 1 1/18 0.91
# 2 2/18 0.94
# 3 3/18 0.29
# 4 4/18 0.83
# 5 5/18 0.64
# 6 6/18 0.52
head(q.dat)
# quarter amount
# 1 q1/18 1
# 2 q2/18 21
# 3 q3/18 91
# 4 q4/18 61
# 5 q1/19 38
# 6 q2/19 44
You could match month information to quarters using an assignment matrix qm.
qm <- matrix(c(1:12, paste0("q", rep(1:4, each=3))), 12, 2)
m.dat$quarter <- paste0(qm[match(gsub("(^\\d*).*", "\\1", m.dat$month), qm[, 1]), 2], # look up each row's month in qm
"/",
sapply(strsplit(m.dat$month, "/"), `[`, 2))
This enables you to use merge.
res <- merge(m.dat, q.dat, all=TRUE)
head(res)
# quarter month rating amount
# 1 q1/18 1/18 0.91 1
# 2 q1/18 2/18 0.94 1
# 3 q1/18 3/18 0.29 1
# 4 q1/19 1/19 0.93 38
# 5 q1/19 2/19 0.26 38
# 6 q1/19 3/19 0.46 38
Toy data
m.dat <- structure(list(month = c("1/18", "2/18", "3/18", "4/18", "5/18",
"6/18", "7/18", "8/18", "9/18", "10/18", "11/18", "12/18", "1/19",
"2/19", "3/19", "4/19", "5/19", "6/19", "7/19", "8/19", "9/19",
"10/19", "11/19", "12/19", "1/20", "2/20", "3/20", "4/20", "5/20",
"6/20", "7/20", "8/20", "9/20", "10/20", "11/20", "12/20"), rating = c(0.91,
0.94, 0.29, 0.83, 0.64, 0.52, 0.74, 0.13, 0.66, 0.71, 0.46, 0.72,
0.93, 0.26, 0.46, 0.94, 0.98, 0.12, 0.47, 0.56, 0.9, 0.14, 0.99,
0.95, 0.08, 0.51, 0.39, 0.91, 0.45, 0.84, 0.74, 0.81, 0.39, 0.69,
0, 0.83)), class = "data.frame", row.names = c(NA, -36L))
q.dat <- structure(list(quarter = c("q1/18", "q2/18", "q3/18", "q4/18",
"q1/19", "q2/19", "q3/19", "q4/19", "q1/20", "q2/20", "q3/20",
"q4/20"), amount = c(1, 21, 91, 61, 38, 44, 4, 97, 43, 96, 89,
64)), class = "data.frame", row.names = c(NA, -12L))
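The month-to-quarter lookup can also be done with integer arithmetic instead of an assignment matrix. The following is a hedged sketch, assuming months are strings of the form "m/yy" as in the toy data; the helper month_to_quarter and the small monthly and quarterly data frames (including the second quarter's values) are made up for illustration:

```r
# Hypothetical helper: "4/20" -> "q2/20"
month_to_quarter <- function(month) {
  parts <- strsplit(month, "/", fixed = TRUE)
  m  <- as.integer(vapply(parts, `[`, character(1), 1))  # month number
  yy <- vapply(parts, `[`, character(1), 2)              # two-digit year
  paste0("q", (m - 1L) %/% 3L + 1L, "/", yy)
}

monthly   <- data.frame(month = c("1/20", "2/20", "3/20", "4/20"),
                        rating = c(0.5, 0.6, 0.65, 0.7))
quarterly <- data.frame(quarter = c("q1/20", "q2/20"), amount = c(100, 200))

monthly$quarter <- month_to_quarter(monthly$month)
res <- merge(monthly, quarterly, by = "quarter", all.x = TRUE)
print(res)
```

Using all.x = TRUE keeps every monthly row even when no quarterly amount is available for it.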
Assuming that df1 and df2 are the data frames shown in the Note at the end, create a yq column of class yearqtr in each and merge on that:
library(zoo)
df1 <- transform(df1, yq = as.yearqtr(month, "%m/%y"))
df2 <- transform(df2, yq = as.yearqtr(quarter, "q%q/%y"))
merge(df1, df2, by = "yq", all = TRUE)
giving:
yq month rating quarter amount
1 2020 Q1 1/20 0.50 q1/20 100
2 2020 Q1 2/20 0.60 q1/20 100
3 2020 Q1 3/20 0.65 q1/20 100
We could also consider converting the month column into a yearmon class column using as.yearmon.
Note
df1 <- data.frame(month = c("1/20", "2/20", "3/20"), rating = c(0.5,0.6,0.65))
df2 <- data.frame(quarter = "q1/20", amount = 100)

If rowSums greater than one, divide by sum

I'd like to divide by the sum of rows if the rowSums() is greater than one. I haven't thought of a way to do this without a for() loop, so I'm looking for a solution without a loop.
Sample Data
dat <- structure(list(x1 = c(0.18, 0, 0.11, 0.24, 0.33), x2 = c(0.34,
0.14, 0.36, 0.35, 0.21), x3 = c(0.1, 0.36, 0.12, 0.07, 0.18),
x4 = c(0.08, 0.35, 0.06, 0.1, 0.09), x5 = c(0.26, 0.13, 0.22,
0.31, 0.22)), .Names = c("x1", "x2", "x3", "x4", "x5"), row.names = c(NA,
5L), class = "data.frame")
> rowSums(dat)
1 2 3 4 5
0.96 0.98 0.87 1.07 1.03
What I've tried
This works, but I wonder if there is a better way to do it:
a <- which(rowSums(dat) > 1)
dat[a, ] <- dat[a, ] / rowSums(dat[a, ])
> rowSums(dat)
1 2 3 4 5
0.96 0.98 0.87 1.00 1.00
This gives the same value as the expression near the end of the question:
dat / pmax(rowSums(dat), 1)
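To make the recycling explicit: dividing a data frame by a vector with one element per row applies element i to row i, and pmax(rowSums(dat), 1) leaves a divisor of 1 for every row whose sum does not exceed one. A small check using the sample data from the question (the name scaled is only for illustration):

```r
dat <- data.frame(x1 = c(0.18, 0, 0.11, 0.24, 0.33),
                  x2 = c(0.34, 0.14, 0.36, 0.35, 0.21),
                  x3 = c(0.1, 0.36, 0.12, 0.07, 0.18),
                  x4 = c(0.08, 0.35, 0.06, 0.1, 0.09),
                  x5 = c(0.26, 0.13, 0.22, 0.31, 0.22))

# Divisor is the row sum where it exceeds 1, and 1 otherwise
scaled <- dat / pmax(rowSums(dat), 1)
print(rowSums(scaled))
```

Rows 1 to 3 are left untouched, while rows 4 and 5 are rescaled so that they sum to one.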
This is inferior to G. Grothendieck's answer, but you can also use ifelse.
rs <- rowSums(dat)
dat / ifelse(rs > 1, rs, 1) # divide only the rows whose sum exceeds 1
x1 x2 x3 x4 x5
1 0.1800000 0.3400000 0.10000000 0.08000000 0.2600000
2 0.0000000 0.1400000 0.36000000 0.35000000 0.1300000
3 0.1100000 0.3600000 0.12000000 0.06000000 0.2200000
4 0.2242991 0.3271028 0.06542056 0.09345794 0.2897196
5 0.3203883 0.2038835 0.17475728 0.08737864 0.2135922

Scatterplot with categorical x-axis (and uncertainties boxes) in R

I have made some calculations on data measured on several systems of photovoltaic panels. I have 11 different photovoltaic systems, and for each of them I have 3 different numerical values.
My results are in a matrix that has 11 rows (each of them corresponding to one of the photovoltaic systems), and 3 columns (containing the 3 numerical quantities computed for each system).
Here is a minimal reproducible matrix :
monthly_LR monthly_CSD monthly_HW
solon 0.398 0.417 0.48
sanyo 0.489 0.479 0.59
atersa NA NA NA
sunpower 0.129 NA 0.19
schott_efg 0.387 0.486 0.47
BP 0.235 0.161 0.22
solarworld 1.153 1.245 1.25
schott_main 0.531 0.628 0.62
wurth 2.889 2.886 2.85
first 1.631 1.651 1.64
mhi 0.974 0.888 1.02
and the corresponding dput output so you can reproduce it :
structure(c(0.398, 0.489, NA, 0.129, 0.387, 0.235, 1.153, 0.531,
2.889, 1.631, 0.974, 0.417, 0.479, NA, NA, 0.486, 0.161, 1.245,
0.628, 2.886, 1.651, 0.888, 0.48, 0.59, NA, 0.19, 0.47, 0.22,
1.25, 0.62, 2.85, 1.64, 1.02), .Dim = c(11L, 3L), .Dimnames = list(
c("solon", "sanyo", "atersa", "sunpower", "schott_efg", "BP",
"solarworld", "schott_main", "wurth", "first", "mhi"), c("monthly_LR",
"monthly_CSD", "monthly_HW")))
I also have another matrix which contains the uncertainties associated with each value of the first matrix :
monthly_LR_uncertainty monthly_CSD_uncertainty monthly_HW_uncertainty
solon 0.14 0.09 0.07
sanyo 0.13 0.06 0.07
atersa NA 0.13 NA
sunpower 0.18 0.18 0.20
schott_efg 0.14 0.07 0.06
BP 0.14 0.14 0.15
solarworld 0.16 0.04 0.03
schott_main 0.15 0.08 0.07
wurth 0.12 0.10 0.11
first 0.08 0.09 0.10
mhi 0.08 0.07 0.08
and the corresponding dput output so you can reproduce it :
structure(c(0.14, 0.13, NA, 0.18, 0.14, 0.14, 0.16, 0.15, 0.12,
0.08, 0.08, 0.09, 0.06, 0.13, 0.18, 0.07, 0.14, 0.04, 0.08, 0.1,
0.09, 0.07, 0.07, 0.07, NA, 0.2, 0.06, 0.15, 0.03, 0.07, 0.11,
0.1, 0.08), .Dim = c(11L, 3L), .Dimnames = list(c("solon", "sanyo",
"atersa", "sunpower", "schott_efg", "BP", "solarworld", "schott_main",
"wurth", "first", "mhi"), c("monthly_LR_uncertainty", "monthly_CSD_uncertainty",
"monthly_HW_uncertainty")))
Now, here is the type of scatterplot I would like to obtain (I almost got what I wanted with boxplots, but now I'd prefer a scatterplot) :
I would like the x-axis to be categorical, as it is when I make a boxplot (i.e. one category for each of the 11 rows).
And above each category on the x-axis, I would like to have 3 points corresponding to the 3 values in the corresponding row of the first matrix, with boxes indicating the uncertainty on the results.
The image below (a graph from an article written by a researcher who used to work in my lab but has since left) shows exactly what I want to obtain. The 11 categories on the x-axis correspond to my 11 rows. The three different points for each category (blue, red, green) correspond to the 3 values for each category in the first matrix. And the box associated with each point corresponds to the uncertainty (given in the second matrix).
Let's say a is the table with means and b is the table with uncertainties:
# x positions, one per system (row)
x = 1:nrow(a)
# horizontal offset for data of same group
offset = 0.2
# draw empty plot; the limits leave room for the offset points and the error bars
plot(NULL, xlim=c(0.5, nrow(a)+0.5), ylim=c(0, max(a+0.5*b, na.rm=TRUE)), xaxt='n', ylab='performance', xlab='')
# add error bars (arrows with angle=90)
arrows(x0=x, x1=x, y0 = a[,1]-0.5*b[,1], y1 = a[,1]+0.5*b[,1], angle=90, code=3, length=0.02)
arrows(x0=x-offset, x1=x-offset, y0 = a[,2]-0.5*b[,2], y1 = a[,2]+0.5*b[,2], angle=90, code=3, col=2, length=0.02)
arrows(x0=x+offset, x1=x+offset, y0 = a[,3]-0.5*b[,3], y1 = a[,3]+0.5*b[,3], angle=90, code=3, col=4, length=0.02)
# add points
points(x, a[,1], pch=1, col=1)
points(x-offset, a[,2], pch=2, col=2)
points(x+offset, a[,3], pch=3, col=4)
# axis labels
axis(1, at = 1:nrow(a), labels = rownames(a), las=3)
# add legend
legend(x='topleft', legend=colnames(a), col=c(1,2,4), pch=c(1,2,3), inset=0.02)
Also have a look at this answer for grouped boxplots.
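If boxes rather than error bars are wanted, as in the question, rect() can draw them in the same base graphics setup. A minimal sketch, assuming (as the error bars above do) that each uncertainty is the full height of the box centred on the value; the helper box_coords, the box half-width, and the two-by-two excerpt of the matrices are chosen only for illustration:

```r
# Hypothetical helper: rectangle corners for value +/- half the uncertainty
box_coords <- function(x, value, uncertainty, half_width = 0.08) {
  data.frame(xleft = x - half_width, xright = x + half_width,
             ybottom = value - uncertainty / 2, ytop = value + uncertainty / 2)
}

# First two systems and first two series from the question's matrices
a <- matrix(c(0.398, 0.489, 0.417, 0.479), nrow = 2,
            dimnames = list(c("solon", "sanyo"), c("monthly_LR", "monthly_CSD")))
b <- matrix(c(0.14, 0.13, 0.09, 0.06), nrow = 2)

x <- seq_len(nrow(a))
offsets <- c(-0.15, 0.15)  # one horizontal offset per series

plot(NULL, xlim = c(0.5, nrow(a) + 0.5),
     ylim = range(a - b / 2, a + b / 2, na.rm = TRUE),
     xaxt = "n", xlab = "", ylab = "performance")
for (j in seq_len(ncol(a))) {
  bx <- box_coords(x + offsets[j], a[, j], b[, j])
  rect(bx$xleft, bx$ybottom, bx$xright, bx$ytop, border = j)  # uncertainty box
  points(x + offsets[j], a[, j], pch = j, col = j)            # the value itself
}
axis(1, at = x, labels = rownames(a))
legend("topleft", legend = colnames(a),
       col = seq_len(ncol(a)), pch = seq_len(ncol(a)))
```

For the full 11 by 3 matrices, offsets would need one entry per column, for example c(-0.2, 0, 0.2).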
