I am trying to store different outputs in one table so I can perform further analysis on them. Below is my code, which I need to run four times (once for each company's stock). How can I store all the values from the four companies in one table?
tapply(Ford_R_ER, as.integer(gl(length(Ford_R_ER), 12, length(Ford_R_ER))), FUN = mean, na.rm = TRUE)
tapply(GE_R_ER, as.integer(gl(length(GE_R_ER), 12, length(GE_R_ER))), FUN = mean, na.rm = TRUE)
tapply(MICROSOFT_R_ER, as.integer(gl(length(MICROSOFT_R_ER), 12, length(MICROSOFT_R_ER))), FUN = mean, na.rm = TRUE)
tapply(ORACLE_R_ER, as.integer(gl(length(ORACLE_R_ER), 12, length(ORACLE_R_ER))), FUN = mean, na.rm = TRUE)
If there are multiple columns, use summarise with across: create a data.frame/tibble from the vectors (assuming they are all the same length), create the grouping column with gl, and summarise across the numeric columns to get the mean by group.
library(dplyr)
dat %>%
  # one group per block of 12 consecutive rows
  group_by(grp = as.integer(gl(n(), 12, n()))) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Or using aggregate from base R
aggregate(. ~ grp, data = transform(dat,
    grp = as.integer(gl(nrow(dat), 12, nrow(dat)))),
  mean, na.rm = TRUE, na.action = NULL)
If the vectors have different lengths, wrap the tapply call in a function and reuse it
f1 <- function(vec, n = 12) {
  # mean of each consecutive block of n observations
  tapply(vec, as.integer(gl(length(vec), n, length(vec))),
         FUN = mean, na.rm = TRUE)
}
and then run the function either on a single vector or a list of vectors
f1(Ford_R_ER)
lapply(list(Ford_R_ER = Ford_R_ER, GE_R_ER = GE_R_ER,
MICROSOFT_R_ER = MICROSOFT_R_ER, ORACLE_R_ER = ORACLE_R_ER), f1)
data
dat <- data.frame(Ford_R_ER, GE_R_ER, MICROSOFT_R_ER, ORACLE_R_ER)
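The list returned by lapply can then be collapsed into a single table, which is what the question asks for. The following is a minimal base R sketch, not the original answer's code: the four vectors are simulated stand-ins (one deliberately shorter), and the column names company, block and mean are made up for illustration.

```r
# Simulated stand-ins for the real return series (assumption: monthly data)
set.seed(1)
Ford_R_ER      <- rnorm(36)
GE_R_ER        <- rnorm(36)
MICROSOFT_R_ER <- rnorm(24)   # shorter on purpose
ORACLE_R_ER    <- rnorm(36)

# Same helper as above: mean of each consecutive block of n observations
f1 <- function(vec, n = 12) {
  tapply(vec, as.integer(gl(length(vec), n, length(vec))),
         FUN = mean, na.rm = TRUE)
}

res <- lapply(list(Ford_R_ER = Ford_R_ER, GE_R_ER = GE_R_ER,
                   MICROSOFT_R_ER = MICROSOFT_R_ER,
                   ORACLE_R_ER = ORACLE_R_ER), f1)

# Long format: one row per company and 12-observation block
combined <- do.call(rbind, lapply(names(res), function(nm) {
  data.frame(company = nm,
             block   = as.integer(names(res[[nm]])),
             mean    = as.vector(res[[nm]]))
}))

# Wide format: one row per block, one column per company
# (blocks a company does not have are filled with NA)
wide <- reshape(combined, idvar = "block", timevar = "company",
                direction = "wide")
```

The long format works regardless of vector length; the wide format is usually easier to read when the companies cover roughly the same periods.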
If I want to get the mean and sum of all the numeric columns in the mtcars data set, I would use the following code:
mtcars %>%
  group_by(gear) %>%
  summarise(across(where(is.numeric), list(mean = mean, sum = sum)))
But if I have missing values in some of the columns, how do I take that into account? Here is a reproducible example:
test.df1 <- data.frame("Year" = sample(2018:2020, 20, replace = TRUE),
"Firm" = head(LETTERS, 5),
"Exporter"= sample(c("Yes", "No"), 20, replace = TRUE),
"Revenue" = sample(100:200, 20, replace = TRUE),
stringsAsFactors = FALSE)
test.df1 <- rbind(test.df1,
data.frame("Year" = c(2018, 2018),
"Firm" = c("Y", "Z"),
"Exporter" = c("Yes", "No"),
"Revenue" = c(NA, NA)))
test.df1 <- test.df1 %>% mutate(Profit = Revenue - sample(20:30, 22, replace = TRUE ))
test.df_summarized <- test.df1 %>% group_by(Firm) %>% summarize(across(where(is.numeric)), list(mean = mean, sum = sum)))
If I were just summarizing each variable separately, I could use the following:
test.df1 %>% group_by(Firm) %>% summarize(Revenue_mean = mean(Revenue, na.rm = TRUE),
  Profit_mean = mean(Profit, na.rm = TRUE))
But I am trying to figure out how I can adapt the code I wrote above for mtcars to the example data set I have provided here.
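Because your functions all have an na.rm argument, you can pass it along through the ... argument of across(). (From dplyr 1.1.0 onwards, passing arguments this way is deprecated in favour of the lambda form shown further down.)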
test.df1 %>% summarize(across(where(is.numeric), list(mean = mean, sum = sum), na.rm = TRUE))
# Year_mean Year_sum Revenue_mean Revenue_sum Profit_mean Profit_sum
# 1 2019.045 44419 162.35 3247 138.25 2765
(I left out the group_by because it's not specified properly in your code and the example is still well-illustrated without it. Also make sure that your functions are inside across().)
Just for the record, you could also do it like this (and this works when the different functions have different arguments):
test.df1 %>%
  summarise(across(where(is.numeric),
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      sum = ~ sum(.x, na.rm = TRUE))
  )
  )
# Year_mean Year_sum Revenue_mean Revenue_sum Profit_mean Profit_sum
# 1 2019.045 44419 144.05 2881 119.3 2386
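To tie this back to the grouped summary the question was after, here is a small sketch that keeps the group_by(Firm) step. The data frame toy and its values are invented for illustration (they are not the question's random data), so the numbers can be checked by hand.

```r
library(dplyr)

# Hypothetical toy data: firm "Y" has only missing values
toy <- data.frame(
  Firm    = c("A", "A", "B", "B", "Y"),
  Revenue = c(100, 120, 150, NA, NA),
  Profit  = c(80, 95, 130, NA, NA)
)

out <- toy %>%
  group_by(Firm) %>%
  summarise(
    across(where(is.numeric),
           list(mean = ~ mean(.x, na.rm = TRUE),
                sum  = ~ sum(.x, na.rm = TRUE))),
    .groups = "drop"
  )
```

One thing to be aware of: for a group whose values are all NA (firm "Y" here), mean(..., na.rm = TRUE) returns NaN and sum(..., na.rm = TRUE) returns 0, so such groups may need separate handling.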
I have extracted some summary statistics from R:
group_by(starters, starters$Programme, starters$Gender ) %>% summarise(
count = n(),
# mean = mean(Total_testscore, na.rm = TRUE),
# sd = sd(Total_testscore, na.rm = TRUE),
percentage = (n()/238)*100)
group_by(starters, starters$Programme ) %>% summarise(
count = n(),
mean = mean(Total_testscore, na.rm = TRUE),
sd = sd(Total_testscore, na.rm = TRUE), percentage = (n()/238)*100)
and would like to get a table that looks like this:
I am using xtable to export my output to LaTeX for all my other results. For xtable, all my results have to be in one table. How can I combine the two outputs in order to get a table like the one pictured?
I am working with the iris dataset, and manipulating it as follows to get a species, feature1, feature2, value data frame:
gatherpairs <- function(data, ...,
xkey = '.xkey', xvalue = '.xvalue',
ykey = '.ykey', yvalue = '.yvalue',
na.rm = FALSE, convert = FALSE, factor_key = FALSE) {
vars <- quos(...)
xkey <- enquo(xkey)
xvalue <- enquo(xvalue)
ykey <- enquo(ykey)
yvalue <- enquo(yvalue)
data %>% {
cbind(gather(., key = !!xkey, value = !!xvalue, !!!vars,
na.rm = na.rm, convert = convert, factor_key = factor_key),
select(., !!!vars))
} %>% gather(., key = !!ykey, value = !!yvalue, !!!vars,
na.rm = na.rm, convert = convert, factor_key = factor_key)%>%
filter(!(.xkey == .ykey)) %>%
mutate(var = apply(.[, c(".xkey", ".ykey")], 1, function(x) paste(sort(x), collapse = ""))) %>%
arrange(var)
}
test = iris %>%
gatherpairs(sapply(colnames(iris[, -ncol(iris)]), eval))
This was taken from https://stackoverflow.com/a/47731111/8315659
What this does is give me that data frame with all combinations of feature1 and feature2, but I want to remove duplicates where it is just the reverse being shown. For example, Petal.Length vs Petal.Width is the same as Petal.Width vs Petal.Length. But if there are two rows with identical values for Petal.Length vs Petal.Width, I do not want to drop that row. Therefore, just dropping rows where all values are identical except that .xkey and .ykey are reversed is what I would want to do. Essentially, this is just to recreate the bottom triangle of the ggplot matrix shown in the above linked answer.
How can this be done?
I think this could be accomplished using the first part of the source code, which performs a single gathering operation. Using the iris example, this will produce 600 rows of output, one for each of the 150 rows x 4 columns in iris.
gatherpairs <- function(data, ...,
xkey = '.xkey', xvalue = '.xvalue',
ykey = '.ykey', yvalue = '.yvalue',
na.rm = FALSE, convert = FALSE, factor_key = FALSE) {
vars <- quos(...)
xkey <- enquo(xkey)
xvalue <- enquo(xvalue)
ykey <- enquo(ykey)
yvalue <- enquo(yvalue)
data %>% {
cbind(gather(., key = !!xkey, value = !!xvalue, !!!vars,
na.rm = na.rm, convert = convert, factor_key = factor_key),
select(., !!!vars))
} # %>% gather(., key = !!ykey, value = !!yvalue, !!!vars,
# na.rm = na.rm, convert = convert, factor_key = factor_key)%>%
# filter(!(.xkey == .ykey)) %>%
# mutate(var = apply(.[, c(".xkey", ".ykey")], 1, function(x) paste(sort(x), collapse = ""))) %>%
# arrange(var)
}
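A different way to get each unordered pair only once is to build the pairs directly rather than gathering twice and filtering. The sketch below is a base R alternative, not the linked answer's gatherpairs function: combn enumerates each pair of numeric columns exactly once, and no de-duplication on values is done, so rows that happen to share identical values are all kept. The column names mirror the .xkey/.ykey convention used above.

```r
# Numeric feature columns of iris (everything except Species)
num_cols <- names(iris)[sapply(iris, is.numeric)]

# Every unordered pair of features appears exactly once
pairs <- combn(num_cols, 2, simplify = FALSE)

# Stack one 150-row block per pair
lower_tri <- do.call(rbind, lapply(pairs, function(p) {
  data.frame(Species = iris$Species,
             .xkey   = p[1], .xvalue = iris[[p[1]]],
             .ykey   = p[2], .yvalue = iris[[p[2]]])
}))
```

With four features there are six pairs, so the result has 6 x 150 rows, which corresponds to one triangle of the scatterplot matrix.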
I want to take the maximum of a number of variables within a pipe:
library(dplyr)
library(purrr)
df_foo = data_frame(
a = rnorm(100),
b = rnorm(100),
c = rnorm(100)
) %>%
mutate(
`Max 1` = max(a, b, c, na.rm = TRUE),
`Max 2` = pmap_dbl(list(a, b, c), max, na.rm = TRUE),
`Max 3` = pmax(a, b, c, na.rm = TRUE)
)
The purrr::pmap_dbl solution appears to be clunky -- in that it requires specifying the names of the variables as a list. Is there a way to do away with having to use the list keyword so that it is potentially usable programmatically?
We can use . to pass the whole data frame to pmap_dbl as the list of columns
df_foo %>%
mutate(Max2 = pmap_dbl(.l = ., max, na.rm = TRUE))
and if we only want the maximum over a subset of columns, then
nm1 <- c("a", "b")
df_foo %>%
mutate(Max2 = pmap_dbl(.l = .[nm1], max, na.rm = TRUE))
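Since pmax is already vectorised across rows, the same programmatic column selection can also be done without purrr by calling it through do.call. This is a base R sketch under the assumption that all selected columns are numeric; the small df_foo here is a stand-in for the question's data, and max_all and max_ab are illustrative names.

```r
set.seed(42)
# Small stand-in for the question's data
df_foo <- data.frame(a = rnorm(5), b = rnorm(5), c = rnorm(5))

all_cols <- c("a", "b", "c")
nm1 <- c("a", "b")

# c() turns the selected columns into a plain list and appends na.rm,
# so the call is equivalent to pmax(a, b, c, na.rm = TRUE)
df_foo$max_all <- do.call(pmax, c(df_foo[all_cols], na.rm = TRUE))
df_foo$max_ab  <- do.call(pmax, c(df_foo[nm1], na.rm = TRUE))
```

Because the column names are held in a character vector, they can be computed at run time.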
My dataset has about 20 columns and I would like to create 7 new columns with lagged data for each of the 20 current columns.
For example, I have columns x, y, and z. I would like to create columns xlag1, xlag2, xlag3, xlag4, xlag5, xlag6, xlag7, ylag1, ylag2, etc.
My current attempt is with dplyr in R -
aq %>% mutate(.,
xlag1 = lag(x, 1),
xlag2 = lag(x, 2),
xlag3 = lag(x, 3),
xlag4 = lag(x, 4),
xlag5 = lag(x, 5),
xlag6 = lag(x, 6),
xlag7 = lag(x, 7),
)
As you can see, it will take a lot of lines of code to cover all 20 columns. Is there a more efficient way of doing this? If possible, in dplyr, as that is the package I'm most familiar with.
We can use data.table. The shift from data.table can take a sequence of 'n'.
library(data.table)
setDT(aq)[, paste0('xlag', 1:7) := shift(x, 1:7)]
If there are multiple columns,
setDT(aq)[, paste0(rep(c("xlag", "ylag"), each = 7), 1:7) :=
c(shift(x, 1:7), shift(y, 1:7))]
If we have many columns, then specify the columns in .SDcols and loop through the dataset, get the shift, unlist and assign to new columns
setDT(aq)[, paste0(rep(c("xlag", "ylag"), each = 7), 1:7) :=
unlist(lapply(.SD, shift, n = 1:7), recursive = FALSE) , .SDcols = x:y]
We can also use data.table's shift inside a dplyr pipeline
library(dplyr)
aq %>%
do(setNames(data.frame(., shift(.$x, 1:7)), c(names(aq), paste0('xlag', 1:7))))
and for multiple columns
aq %>%
do(setNames(data.frame(., shift(.$x, 1:7), shift(.$y, 1:7)),
c(names(aq), paste0(rep(c("xlag", "ylag"), each = 7), 1:7) )))
data
aq <- data.frame(x = 1:20, y = 21:40)
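For a dplyr-only route, one option is to build a named list of lag functions and hand it to across(). This is a sketch assuming dplyr 1.0 or later; lag_funs is a made-up helper name, and the .names pattern is chosen so the new columns come out as xlag1, xlag2, and so on.

```r
library(dplyr)

aq <- data.frame(x = 1:20, y = 21:40)

lags <- 1:7

# One function per lag, named lag1 ... lag7 (lag_funs is a hypothetical name)
lag_funs <- setNames(
  lapply(lags, function(k) {
    force(k)                       # fix k for this closure
    function(v) dplyr::lag(v, k)
  }),
  paste0("lag", lags)
)

# Apply every lag function to every selected column;
# "{.col}{.fn}" yields names such as xlag1, ylag7
out <- aq %>%
  mutate(across(c(x, y), lag_funs, .names = "{.col}{.fn}"))
```

To cover all 20 columns, replace c(x, y) with a selection such as everything() or where(is.numeric).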