Group by on XDF file?

Say I have a huge source XDF file generated with RevoScaleR. I want to create a new target XDF by grouping the source entries on columns A, B, C and compute the sum, min, max, avg, std deviation on column D.
Let's assume the target data is too big to fit into memory too. How should I proceed? I could not find much information about group by operations in the documentation.

If you want to create a new xdf file, I suggest the "RevoPemaR" library, which is included in ML Server. It would be nice if you added a reproducible example, but the answer could look something like this:
library(RevoPemaR)
# build a by-group PEMA object and compute several statistics of D
# for each (A, B, C) group, writing the result to a new xdf file
byGroupPemaObj <- PemaByGroup()
groupVals <- pemaCompute(
    pemaObj = byGroupPemaObj,
    data = "input.xdf",
    outData = "output.xdf",
    groupByVar = c("A", "B", "C"),
    computeVars = c("D"),
    fnList = list(
        sum  = list(FUN = sum,  x = NULL, na.rm = TRUE),
        min  = list(FUN = min,  x = NULL, na.rm = TRUE),
        max  = list(FUN = max,  x = NULL, na.rm = TRUE),
        mean = list(FUN = mean, x = NULL, na.rm = TRUE),
        sd   = list(FUN = sd,   x = NULL, na.rm = TRUE)
    )
)
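To sanity-check the result without pulling the whole file into memory, something like rxDataStep can read back just the first rows (a sketch; rxGetInfo is another option):
res <- rxDataStep("output.xdf", numRows = 10)   # read only the first 10 rows into a data frame
head(res)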
But you also have another option, rxSummary. For each grouping variable (F() treats the column as a factor on the fly):
rxSummary(D ~ F(A),
          data = "input.xdf",
          byGroupOutFile = "out.xdf",
          summaryStats = c("Mean", "StdDev", "Min", "Max", "Sum"))
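To group on the combination of A, B and C in a single call, the RevoScaleR formula interface should accept an interaction of factors; a minimal sketch, assuming F(A):F(B):F(C) produces one group per combination:
rxSummary(D ~ F(A):F(B):F(C),
          data = "input.xdf",
          byGroupOutFile = "out.xdf",   # per-group stats written to xdf, not held in memory
          summaryStats = c("Mean", "StdDev", "Min", "Max", "Sum"))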

The dplyrXdf package lets you carry out dplyr operations like this on Xdf files.
library(dplyrXdf)
src <- RxXdfData("src.xdf")
dest <- src %>%
    group_by(A, B, C) %>%
    summarise(sum = sum(D), min = min(D), max = max(D), mean = mean(D), sd = sd(D))
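Note that dplyrXdf writes its output to a temporary Xdf file by default; as far as I recall the package provides a persist verb to copy the result to a permanent location (the file name here is an assumption):
dest <- dest %>%
    persist("dest.xdf")   # save the temporary Xdf to a permanent file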

Related

Store multiple outputs in one table

I am trying to store different outputs in one table so I can perform further analysis on them. Below is my code, which I need to run four times (once for each company's stock). How can I store the values from all four companies in one table?
tapply(Ford_R_ER, as.integer(gl(length(Ford_R_ER), 12, length(Ford_R_ER))), FUN = mean, na.rm = TRUE)
tapply(GE_R_ER, as.integer(gl(length(GE_R_ER), 12, length(GE_R_ER))), FUN = mean, na.rm = TRUE)
tapply(MICROSOFT_R_ER, as.integer(gl(length(MICROSOFT_R_ER), 12, length(MICROSOFT_R_ER))), FUN = mean, na.rm = TRUE)
tapply(ORACLE_R_ER, as.integer(gl(length(ORACLE_R_ER), 12, length(ORACLE_R_ER))), FUN = mean, na.rm = TRUE)
If there are multiple columns, use summarise with across: create a data.frame/tibble from the vectors (assuming they are of the same length), create the grouping column with gl, and summarise across the numeric columns to get the mean by group.
library(dplyr)
dat %>%
    group_by(grp = as.integer(gl(n(), 12, n()))) %>%
    summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Or using aggregate from base R
aggregate(. ~ grp, data = transform(dat,
    grp = as.integer(gl(nrow(dat), 12, nrow(dat)))),
    mean, na.rm = TRUE, na.action = NULL)
In case we have different lengths for the vectors, create a function and reuse it
f1 <- function(vec, n = 12) {
    tapply(vec, as.integer(gl(length(vec), n, length(vec))),
           FUN = mean, na.rm = TRUE)
}
and then run the function either on a single vector or a list of vectors
f1(Ford_R_ER)
lapply(list(Ford_R_ER = Ford_R_ER, GE_R_ER = GE_R_ER,
            MICROSOFT_R_ER = MICROSOFT_R_ER, ORACLE_R_ER = ORACLE_R_ER), f1)
data
dat <- data.frame(Ford_R_ER, GE_R_ER, MICROSOFT_R_ER, ORACLE_R_ER)
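To get everything into one table, as asked: assuming the four vectors have equal length, the list returned by the lapply() call above can be bound column-wise:
res <- lapply(list(Ford_R_ER = Ford_R_ER, GE_R_ER = GE_R_ER,
                   MICROSOFT_R_ER = MICROSOFT_R_ER, ORACLE_R_ER = ORACLE_R_ER), f1)
out <- do.call(cbind, res)   # one column per company, one row per 12-observation group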

chi square over multiple groups and variables

I have a huge dataset with several groups (factors with between 2 to 6 levels), and dichotomous variables (0, 1).
example data
DF <- data.frame(
    group1 = sample(x = c("A", "B", "C", "D"), size = 100, replace = TRUE),
    group2 = sample(x = c("red", "blue", "green"), size = 100, replace = TRUE),
    group3 = sample(x = c("tiny", "small", "big", "huge"), size = 100, replace = TRUE),
    var1 = sample(x = 0:1, size = 100, replace = TRUE),
    var2 = sample(x = 0:1, size = 100, replace = TRUE),
    var3 = sample(x = 0:1, size = 100, replace = TRUE),
    var4 = sample(x = 0:1, size = 100, replace = TRUE),
    var5 = sample(x = 0:1, size = 100, replace = TRUE))
I want to do a chi square for every group, across all the variables.
library(tidyverse)
library(rstatix)
chisq_test(DF$group1, DF$var1)
chisq_test(DF$group1, DF$var2)
chisq_test(DF$group1, DF$var3)
...
etc
I managed to make it work with two nested for loops, but I'm sure there is a better solution:
groups <- c("group1","group2","group3")
vars <- c("var1","var2","var3","var4","var5")
results <- data.frame()
for (i in groups) {
    for (j in vars) {
        test <- chisq_test(DF[, i], DF[, j])
        test <- mutate(test, group = i, var = j)
        results <- rbind(results, test)
    }
}
results
I think I need some kind of apply function, but I can't figure it out
Here is one way to do it with apply. I am sure there is an even more elegant way with dplyr. (Note that I extract the p.value of each test here, but you can extract something else, or the whole test result, if you prefer.)
res <- apply(DF[, 1:3], 2, function(x) {
    # groups are columns 1:3, variables var1..var5 are columns 4:8
    apply(DF[, 4:8], 2, function(y) chisq.test(x, y)$p.value)
})
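If a long table is preferred over a matrix, the result can be reshaped with base R (a small sketch; res has the variables as rows and the groups as columns):
res_df <- as.data.frame(as.table(res))   # melt the p-value matrix into long format
names(res_df) <- c("var", "group", "p.value")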
Here's a quick and easy dplyr solution, that involves transforming the data into long format keyed by group and var, then running the chi-sq test on each combination of group and var.
DF %>%
    pivot_longer(starts_with("group"), names_to = "group", values_to = "group_val") %>%
    pivot_longer(starts_with("var"), names_to = "var", values_to = "var_val") %>%
    group_by(group, var) %>%
    summarise(chisq_test(group_val, var_val)) %>%
    ungroup()
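One caveat: returning a data frame from summarise() is deprecated in recent dplyr releases; with dplyr >= 1.1.0 the same pipeline can use reframe() instead (otherwise identical):
DF %>%
    pivot_longer(starts_with("group"), names_to = "group", values_to = "group_val") %>%
    pivot_longer(starts_with("var"), names_to = "var", values_to = "var_val") %>%
    group_by(group, var) %>%
    reframe(chisq_test(group_val, var_val))   # reframe() returns an ungrouped result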

summarise data for multiple variables of a data.frame in r?

I am trying to compute the upper and lower quartiles of two variables in my data.frame over the time period of interest. The code below gave me only a single value for the upper and lower quartiles.
library(tidyverse)
library(lubridate)

set.seed(50)
FakeData <- data.frame(seq(as.Date("2001-01-01"), to = as.Date("2003-12-31"), by = "day"),
                       A = runif(1095, 0, 10),
                       D = runif(1095, 5, 15))
colnames(FakeData) <- c("Date", "A", "D")
statistics <- FakeData %>%
    gather(-Date, key = "Variable", value = "Value") %>%
    mutate(Year = year(Date), Month = month(Date)) %>%
    filter(between(Month, 3, 5)) %>%
    mutate(NewDate = ymd(paste("2020", Month, day(Date), sep = "-"))) %>%
    group_by(Variable, NewDate) %>%
    summarise(Upper = quantile(Value, 0.75, na.rm = TRUE),
              Lower = quantile(Value, 0.25, na.rm = TRUE))
I would want an output like below (Final_Output is what I am interested in):
Output1 <- data.frame(seq(as.Date("2000-03-01"), to = as.Date("2000-05-31"), by = "day"),
                      Upper = runif(92, 0, 10), lower = runif(92, 5, 15), Variable = rep("A", 92))
colnames(Output1)[1] <- "Date"
Output2 <- data.frame(seq(as.Date("2000-03-01"), to = as.Date("2000-05-31"), by = "day"),
                      Upper = runif(92, 2, 10), lower = runif(92, 5, 15), Variable = rep("D", 92))
colnames(Output2)[1] <- "Date"
Final_Output <- bind_rows(Output1, Output2)
I can propose a data.table solution; in fact there are several ways to do this.
The final step (computing the quartiles of Value by group) could be translated into the following (if you want two columns, as in your example):
library(data.table)
setDT(statistics)
statistics[, .(p25 = quantile(Value, probs = 0.25),
               p75 = quantile(Value, probs = 0.75)),
           by = c("Variable", "NewDate")]
If you prefer long-formatted output:
statistics[, .(quantile = c("p25", "p75"),
               Value = quantile(Value, probs = c(0.25, 0.75))),
           by = c("Variable", "NewDate")]
All steps together
It's probably better, if you choose data.table, to do all the steps with data.table verbs. I will assume your data has a structure similar to the data frame you generated and arranged, i.e.
statistics <- FakeData %>%
    gather(-Date, key = "Variable", value = "Value")
In that case, the mutate and filter steps would become:
setDT(statistics)
statistics[, `:=`(Year = year(Date), Month = month(Date))]
statistics <- statistics[Month %between% c(3, 5)]
statistics[, NewDate := ymd(paste("2020", Month, day(Date), sep = "-"))]
And choose the final step you prefer, e.g.
statistics[, .(p25 = quantile(Value, probs = 0.25),
               p75 = quantile(Value, probs = 0.75)),
           by = c("Variable", "NewDate")]

Normalising data with dplyr mutate() brings inconsistencies

I'm trying to reproduce the framework from this blog post http://www.luishusier.com/2017/09/28/balance/ with the following code, but I get inconsistent results:
library(tidyverse)
library(magrittr)
ids <- c("1617", "1516", "1415", "1314", "1213", "1112", "1011", "0910", "0809", "0708", "0607", "0506")
data <- ids %>%
map(function(i) {read_csv(paste0("http://www.football-data.co.uk/mmz4281/", i ,"/F1.csv")) %>%
select(Date:AST) %>%
mutate(season = i)})
data <- bind_rows(data)
data <- data[complete.cases(data[ , 1:3]), ]
tmp1 <- data %>%
    select(season, HomeTeam, FTHG:FTR, HS:AST) %>%
    rename(BP = FTHG,
           BC = FTAG,
           TP = HS,
           TC = AS,
           TCP = HST,
           TCC = AST,
           team = HomeTeam) %>%
    mutate(Pts = ifelse(FTR == "H", 3, ifelse(FTR == "A", 0, 1)),
           Terrain = "Domicile")
tmp2 <- data %>%
    select(season, AwayTeam, FTHG:FTR, HS:AST) %>%
    rename(BP = FTAG,
           BC = FTHG,
           TP = AS,
           TC = HS,
           TCP = AST,
           TCC = HST,
           team = AwayTeam) %>%
    mutate(Pts = ifelse(FTR == "A", 3, ifelse(FTR == "H", 0, 1)),
           Terrain = "Extérieur")
tmp3 <- bind_rows(tmp1, tmp2)
l1_0517 <- tmp3 %>%
    group_by(season, team) %>%
    summarise(j = n(),
              pts = sum(Pts),
              diff_but = sum(BP) - sum(BC),
              diff_t_ca = sum(TCP, na.rm = TRUE) - sum(TCC, na.rm = TRUE),
              diff_t = sum(TP, na.rm = TRUE) - sum(TC, na.rm = TRUE),
              but_p = sum(BP),
              but_c = sum(BC),
              tir_ca_p = sum(TCP, na.rm = TRUE),
              tir_ca_c = sum(TCC, na.rm = TRUE),
              tir_p = sum(TP, na.rm = TRUE),
              tir_c = sum(TC, na.rm = TRUE)) %>%
    arrange(season, desc(pts), desc(diff_but))
Then I apply the framework mentioned above:
l1_0517 <- l1_0517 %>%
    mutate(
        # First, see how many goals the team scores relative to the average
        norm_attack = but_p %>%
            divide_by(mean(but_p)) %>%
            # Then, transform it into an unconstrained scale
            log(),
        # First, see how many goals the team concedes relative to the average
        norm_defense = but_c %>%
            divide_by(mean(but_c)) %>%
            # Invert it, so a higher defense is better
            raise_to_power(-1) %>%
            # Then, transform it into an unconstrained scale
            log(),
        # Now that we have normalized attack and defense ratings, we can compute
        # measures of quality and attacking balance
        quality = norm_attack + norm_defense,
        balance = norm_attack - norm_defense
    ) %>%
    arrange(desc(norm_attack))
When I look at the column norm_attack, I expect to find the same value for equivalent but_p values, which is not the case here:
head(l1_0517, 10)
For instance, where but_p has the value 83 (rows 5 and 7), I get norm_attack values of 0.5612738 and 0.5128357 respectively.
Is this normal? I would expect mean(l1_0517$but_p) to be fixed, and therefore to obtain the same result whenever the same value of l1_0517$but_p is log-normalised.
UPDATE
I have tried to work on a simpler example but I can't reproduce this issue:
df <- tibble(a = as.integer(runif(200, 15, 100)))
df <- df %>%
    mutate(norm_a = a %>%
               divide_by(mean(a)) %>%
               log())
I found the solution after looking at the class of l1_0517: it is a grouped_df, hence the different results. summarise() only drops the last level of grouping, so l1_0517 was still grouped by season, and mean(but_p) in the later mutate() was computed per season rather than over the whole table.
The correct code is:
l1_0517 <- tmp3 %>%
    group_by(season, team) %>%
    summarise(j = n(),
              pts = sum(Pts),
              diff_but = sum(BP) - sum(BC),
              diff_t_ca = sum(TCP, na.rm = TRUE) - sum(TCC, na.rm = TRUE),
              diff_t = sum(TP, na.rm = TRUE) - sum(TC, na.rm = TRUE),
              but_p = sum(BP),
              but_c = sum(BC),
              tir_ca_p = sum(TCP, na.rm = TRUE),
              tir_ca_c = sum(TCC, na.rm = TRUE),
              tir_p = sum(TP, na.rm = TRUE),
              tir_c = sum(TC, na.rm = TRUE)) %>%
    ungroup() %>%
    arrange(season, desc(pts), desc(diff_but))
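A small illustration of why the grouping matters: on a grouped_df, mean() inside mutate() is computed within each group, so equal values in different groups normalise differently:
library(dplyr)
df <- tibble(g = c("a", "a", "b"), x = c(1, 3, 5))
df %>% group_by(g) %>% mutate(m = mean(x))   # per-group means: 2, 2, 5
df %>% ungroup() %>% mutate(m = mean(x))     # global mean: 3, 3, 3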

dplyr: Maximum across arbitrary number of variables

I want to take the maximum of a number of variables within a pipe:
library(dplyr)
library(purrr)
df_foo <- tibble(
    a = rnorm(100),
    b = rnorm(100),
    c = rnorm(100)
) %>%
    mutate(
        `Max 1` = max(a, b, c, na.rm = TRUE),
        `Max 2` = pmap_dbl(list(a, b, c), max, na.rm = TRUE),
        `Max 3` = pmax(a, b, c, na.rm = TRUE)
    )
The purrr::pmap_dbl solution appears clunky, in that it requires spelling out the variable names in a list. Is there a way to avoid the list() call so that it is usable programmatically?
We can use . to refer to the dataset:
df_foo %>%
    mutate(Max2 = pmap_dbl(.l = ., max, na.rm = TRUE))
and if we are doing this on a subset of columns:
nm1 <- c("a", "b")
df_foo %>%
    mutate(Max2 = pmap_dbl(.l = .[nm1], max, na.rm = TRUE))
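For completeness, pmax() can also be applied programmatically to an arbitrary set of columns with do.call(), since a data frame is a list of columns:
# df_foo[nm1] is a list of columns, so it can be spliced into pmax()
df_foo$Max4 <- do.call(pmax, c(df_foo[nm1], na.rm = TRUE))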
