Simplify R code for computing group statistics across multiple factors - r

I am trying to compute the variance for each group in a data set with multiple factors. For example, the data set below is the first 6 lines of a data frame with 5 columns: 4 factors of two levels each (No and Yes) and 1 continuous variable:
Factor A  Factor B  Factor C  Factor D  VarX
Yes       Yes       Yes       No        66.8
No        Yes       Yes       No        66.0
Yes       No        No        No        58.4
No        Yes       Yes       Yes       68.3
Yes       Yes       Yes       No        61.8
Yes       No        No        No        67.3
What I want to do is produce a summary table such as the one below:
Factor    SD (NO)  SD (YES)  SD Ratio
Factor A  3.79     3.51      1.08
Factor B  3.44     3.83      1.11
Factor C  3.77     3.53      1.07
Factor D  3.92     3.32      1.18
For each factor, I have calculated the standard deviation at each level ("No" and "Yes") as well as the ratio of the two standard deviations.
Here is the code I am using to do this:
library(dplyr)   # %>%, group_by, summarize, mutate, relocate
library(purrr)   # modify2
library(tibble)  # tibble
#
# Define modify function for SD ratio column
#
sd_ratio <- function(x, y) {
  return(max(x, y) / min(x, y))
}
#
# Set up storage
#
nc <- 4 # number of factors in data
testDataSum <- tibble(SD_No = rep(NA, nc),
                      SD_Yes = rep(NA, nc),
                      SD_Ratio = rep(NA, nc))
#
Factor <- vector("list", 4)
SDList <- vector("list", 4)
#
# For Loop. Group data by factors 1,2,3,4
#
for (i in 1:4) {
  Factor[[i]] <- names(testData[, i])
  SDList[[i]] <- testData %>%
    group_by(testData[, i]) %>%
    summarize(SD = sd(VarX))
}
# Load summary DF with data by unlisting SDList
#
testDataSum$SD_No <- as.vector(matrix(unlist(SDList), ncol = 4, byrow = T)[, 3])
testDataSum$SD_Yes <- as.vector(matrix(unlist(SDList), ncol = 4, byrow = T)[, 4])
testDataSum$SD_Ratio <- modify2(testDataSum$SD_No, testDataSum$SD_Yes, sd_ratio)
#
# Load formatted factor names and put it at the front
#
testDataSum <- testDataSum %>%
  mutate(Factor = unlist(Factor)) %>%
  relocate(Factor)
# Show results
testDataSum
My request is for help in simplifying this code. It works, but it is ugly and complex, and it will be difficult to come back to and modify later. I believe there is a much simpler way to do it without a for-loop and without the ungainly process of unlisting SDList using the "as.vector(matrix(..." lines. I have reviewed the documentation for dplyr and plyr, especially the grouping section, but I am baffled. Any suggestions are much appreciated.
Here is a link to a github repository with the code and a csv file with 192 rows that you can use to produce the result table.
Git Hub Link for code and Data

You may try using reshape2, dplyr, and tidyr.
When I read your data, column names get broken, so I rename them beforehand.
library(dplyr)
library(tidyr)
library(reshape2)
names(df) <- c("A","B","C","D","VarX")
df %>%
  melt(id.vars = "VarX", variable.name = "Factor") %>%
  group_by(Factor, value) %>%
  summarize(sd = sd(VarX)) %>%
  pivot_wider(id_cols = Factor, values_from = sd, names_from = value, names_glue = "sd_{value}") %>%
  mutate(SD_ratio = pmax(sd_No, sd_Yes) / pmin(sd_No, sd_Yes))
Factor sd_No sd_Yes SD_ratio
<fct> <dbl> <dbl> <dbl>
1 A 3.51 3.79 1.08
2 B 3.83 3.44 1.11
3 C 3.53 3.77 1.07
4 D 3.92 3.32 1.18
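The same reshape-then-summarise idea can also be written with tidyr alone, using pivot_longer in place of reshape2's melt. The following is a minimal sketch, not the answerer's code: the data frame testData here is a hypothetical stand-in with made-up values and only two factor columns, and the column names level, SD_No and SD_Yes are choices made for this illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the asker's CSV (values are made up)
testData <- tibble(
  A    = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"),
  B    = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"),
  VarX = c(66.8, 66.0, 58.4, 68.3, 61.8, 67.3, 60.1, 64.9)
)

sd_table <- testData %>%
  # One row per (factor name, level, VarX) combination
  pivot_longer(cols = -VarX, names_to = "Factor", values_to = "level") %>%
  group_by(Factor, level) %>%
  summarise(sd = sd(VarX), .groups = "drop") %>%
  # Spread the two levels into SD_No and SD_Yes columns
  pivot_wider(names_from = level, values_from = sd, names_prefix = "SD_") %>%
  # Ratio of the larger SD to the smaller one, so it is always >= 1
  mutate(SD_Ratio = pmax(SD_No, SD_Yes) / pmin(SD_No, SD_Yes))

sd_table
```

Because every factor column is pivoted, adding a fifth factor to the data needs no change to the pipeline.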


combining 2 series and dropping NA in R

I am trying to combine the dividend history of 2 different stocks and I have the below code:
library(quantmod)
library(tidyr)
AAPL<-round(getDividends("AAPL"),4)
MSFT<-round(getDividends("MSFT"),4)
dividend<-(cbind(AAPL,MSFT))
Because the 2 stocks pay out dividends on different dates, there are NAs after combining, so I tried to use the drop_na function from tidyr like below:
drop_na(dividend)
#Error in UseMethod("drop_na") :
no applicable method for 'drop_na' applied to an object of class "c('xts', 'zoo')"
What did I do wrong here? Many thanks for your help.
Update 1:
Tried na.omit, which returns with the following:
> dividend<-(cbind(AAPL,MSFT))%>%
> na.omit(dividend)%>%
> print(dividend)
[,1] [,2]
drop_na only works on data frames but even if you converted it to data frame it would not give what you want. Every row contains an NA so drop_na and na.omit would both remove every row leaving a zero row result.
Instead, try aggregating over year/month, which gives the zoo object both. If you need an xts object, use as.xts(both).
both <- aggregate(dividend, as.yearmon, sum, na.rm = TRUE)
Optionally we could remove the rows with zero dividends in both columns:
both[rowSums(both) > 0]
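To see why row-wise NA removal leaves nothing while monthly aggregation works, here is a small sketch on made-up data. The series a and b, their dates and their values are hypothetical and only stand in for the downloaded dividends; the aggregate call is the one from the answer above.

```r
library(xts)  # also attaches zoo, which provides as.yearmon

# Hypothetical dividend series with made-up values and payment dates
a <- xts(c(0.10, 0.11), order.by = as.Date(c("2020-02-07", "2020-05-08")))
b <- xts(c(0.50, 0.51), order.by = as.Date(c("2020-02-19", "2020-05-20")))
colnames(a) <- "A.div"
colnames(b) <- "B.div"

# Merging on dates: each date belongs to only one stock
dividend <- cbind(a, b)

# TRUE for every row, because each row has an NA in one of the two columns
rows_with_na <- rowSums(is.na(coredata(dividend))) > 0
rows_with_na

# Summing within each year-month puts both stocks on the same row
both <- aggregate(dividend, as.yearmon, sum, na.rm = TRUE)
both
```

This only lines the series up when both stocks pay in the same calendar month; months where one stock pays nothing show 0 for that stock.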
We could use na.omit
na.omit(dividend)
drop_na expects a data.frame as input, whereas the object dividend is an xts object, not a data.frame.
We could fortify it into a data.frame and then use drop_na after moving the NAs to the end within each month:
library(dplyr)
library(lubridate)
fortify.zoo(dividend) %>%
  group_by(Index = format(ymd(Index), '%Y-%b')) %>%
  mutate(across(everything(), ~ .x[order(is.na(.x))])) %>%
  ungroup() %>%
  drop_na()
-output
# A tibble: 41 × 3
Index AAPL.div MSFT.div
<chr> <dbl> <dbl>
1 2012-Aug 0.0034 0.2
2 2012-Nov 0.0034 0.23
3 2013-Feb 0.0034 0.23
4 2013-May 0.0039 0.23
5 2013-Aug 0.0039 0.23
6 2013-Nov 0.0039 0.28
7 2014-Feb 0.0039 0.28
8 2014-May 0.0042 0.28
9 2014-Aug 0.0294 0.28
10 2014-Nov 0.0294 0.31
# … with 31 more rows

How do I only report selected summary statistics in a table that lists variables as rows using R?

I have a dataset and I need to create a simple table with the number of observations, means, and standard deviations of all the variables (columns). I can't find a way to get only the required 3 summary statistics. Everything I tried keeps giving me min, max, median, 1st and 3rd quartiles, etc. The table should look something like this (with a title):
Table 1: Table Title
_______________________________________
Variables Observations Mean Std.Dev
_______________________________________
Age 30 24 2
... . . .
... . . .
_______________________________________
The summary() function does not work because it gives too many other summary statistics. I have done this:
sapply(dataset, function(x) list(means=mean(x,na.rm=TRUE), sds=sd(x,na.rm=TRUE)))
But how do I form the table from this? And is there a better way to do this than using "sapply"?
sapply does return the values that you want but it is not properly structured.
Using mtcars data as an example :
#Get the required statistics and convert the data into dataframe
summ_data <- data.frame(t(sapply(mtcars, function(x)
list(means = mean(x,na.rm=TRUE), sds = sd(x,na.rm=TRUE)))))
#Change rownames to new column
summ_data$variables <- rownames(summ_data)
#Remove rownames
rownames(summ_data) <- NULL
#Make variable column as 1st column
cbind(summ_data[ncol(summ_data)], summ_data[-ncol(summ_data)])
Another way would be using dplyr functions :
library(dplyr)
mtcars %>%
  # Name the columns explicitly; calling across() without .cols is deprecated in recent dplyr
  summarise(across(everything(), list(means = mean, sds = sd),
                   .names = '{col}_{fn}')) %>%
  tidyr::pivot_longer(cols = everything(),
                      names_to = c('variable', '.value'),
                      names_sep = '_')
# A tibble: 11 x 3
# variable means sds
# <chr> <dbl> <dbl>
# 1 mpg 20.1 6.03
# 2 cyl 6.19 1.79
# 3 disp 231. 124.
# 4 hp 147. 68.6
# 5 drat 3.60 0.535
# 6 wt 3.22 0.978
# 7 qsec 17.8 1.79
# 8 vs 0.438 0.504
# 9 am 0.406 0.499
#10 gear 3.69 0.738
#11 carb 2.81 1.62
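Neither approach above includes the number of observations that the question asks for. As one possible way to get all three statistics in the requested layout, here is a base R sketch using vapply on mtcars; the column names Variables, Observations, Mean and Std.Dev are taken from the table in the question, and counting non-missing values as "observations" is an assumption.

```r
# One row per variable: non-missing count, mean and standard deviation
summ <- data.frame(
  Variables    = names(mtcars),
  Observations = vapply(mtcars, function(x) sum(!is.na(x)), integer(1)),
  Mean         = vapply(mtcars, mean, numeric(1), na.rm = TRUE),
  Std.Dev      = vapply(mtcars, sd, numeric(1), na.rm = TRUE),
  row.names    = NULL
)

summ
```

vapply is used instead of sapply because it checks that every statistic is a single number, so the result is always a plain vector rather than a list.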

Reshaping database using reshape package-part 2 [duplicate]

Following the discussion in the previous post, Reshaping database using reshape package, I am creating this one to ask another question.
Briefly: I have a database in which some rows repeat the same value in the Id column, and I would like to transpose it. The following code shows a small example of my database.
test<-data.frame(Id=c(1,1,2,3),
St=c(20,80,80,20),
gap=seq(0.02,0.08,by=0.02),
gip=c(0.23,0.60,0.86,2.09),
gat=c(0.0107,0.989,0.337,0.663))
I would like a final database like this figure that I attached:
With one row for each Id value and the different columns attached.
Can you give me any suggestions?
You can use dcast from data.table. This function allows you to spread multiple value variables at once.
library(data.table)
setDT(test) # convert test to a data.table
test1 <- dcast(test, Id ~ rowid(Id),
value.var = c('St', 'gap', 'gip', 'gat'), fill = 0)
test1
# Id St_1 St_2 gap_1 gap_2 gip_1 gip_2 gat_1 gat_2
#1: 1 20 80 0.02 0.04 0.23 0.6 0.0107 0.989
#2: 2 80 0 0.06 0.00 0.86 0.0 0.3370 0.000
#3: 3 20 0 0.08 0.00 2.09 0.0 0.6630 0.000
If you want to continue with a data.frame call setDF(test1) at the end.
A dplyr/tidyr alternative is to first gather to long format, group_by Id and key and create a sequential row identifier for each group (new_key) and finally spread it back to wide form.
library(dplyr)
library(tidyr)
test %>%
  gather(key, value, -Id) %>%
  group_by(Id, key) %>%
  mutate(new_key = paste0(key, row_number())) %>%
  ungroup() %>%
  select(-key) %>%
  spread(new_key, value, fill = 0)
# Id gap1 gap2 gat1 gat2 gip1 gip2 St1 St2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.02 0.04 0.0107 0.989 0.23 0.6 20 80
#2 2 0.06 0 0.337 0 0.86 0 80 0
#3 3 0.08 0 0.663 0 2.09 0 20 0
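In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider, and pivot_wider can spread several value columns in one step. The following sketch assumes the same test data as in the question; the helper column rep is a name introduced for this illustration.

```r
library(dplyr)
library(tidyr)

test <- data.frame(Id  = c(1, 1, 2, 3),
                   St  = c(20, 80, 80, 20),
                   gap = seq(0.02, 0.08, by = 0.02),
                   gip = c(0.23, 0.60, 0.86, 2.09),
                   gat = c(0.0107, 0.989, 0.337, 0.663))

wide <- test %>%
  # Number the repeated rows within each Id (1, 2, ...)
  group_by(Id) %>%
  mutate(rep = row_number()) %>%
  ungroup() %>%
  # Spread all four value columns at once; missing combinations become 0
  pivot_wider(id_cols = Id,
              names_from = rep,
              values_from = c(St, gap, gip, gat),
              values_fill = 0)

wide
```

This keeps the value columns in their original order (St, gap, gip, gat) rather than sorting them alphabetically as spread does.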

Can someone find out how this dplyr table doesn't work?

I have to make a table in which the mean and median of the logsales per month are set out from the "txhousing" dataset. The exercise I got is the following: “The table below show the means and medians of the log of sales per month, sorted by mean”
Insert a new r chunk and type the code in it to display that table
Use na.omit to get rid of cases with missing values
Use the dplyr command mutate to make the variable logsales
Use the dplyr command group_by to group by month
Use the dplyr command summarise to display the table
Use the dplyr command arrange to sort by mean
Connect the commands with the pipe operator %>%
I've tried to mix up the code multiple times but I can't find out why it keeps on giving me NA's in my table.
library(tidyverse)
summary(txhousing)
na.omit(txhousing)
txhousing<- as.data.frame(txhousing)
logsales <- log(txhousing$sales)
group_by(txhousing, txhousing$month)
txhousing<- txhousing %>% mutate(logsales= log(txhousing$sales))
txhousing %>% group_by(txhousing$month) %>% summarise(mean(logsales), median(logsales)) %>% arrange(mean)
I expect to get a table with the mean and median of the logsales per month, but what I get is only NA in the mean and median columns, and the arrange gives the following error:
Error: cannot arrange column of class 'function' at position 1
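There are NA values in logsales (the result of na.omit(txhousing) was never assigned), so you need to tell mean and median to ignore them with na.rm = TRUE. You also need to name the columns in summarise; otherwise arrange(mean) refers to the function mean rather than to a column.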
txhousing %>%
  group_by(month) %>%  # use the bare column name, so no rename is needed afterwards
  summarise(mean = mean(logsales, na.rm = TRUE),
            med = median(logsales, na.rm = TRUE)) %>%
  arrange(mean)
This creates the following tibble:
# A tibble: 12 x 3
month mean med
<int> <dbl> <dbl>
1 1 4.95 4.74
2 2 5.13 4.93
3 11 5.19 4.96
4 12 5.24 5.02
5 10 5.29 5.08
6 9 5.32 5.09
7 3 5.38 5.15
8 4 5.42 5.21
9 5 5.52 5.29
10 7 5.53 5.30
11 8 5.53 5.33
12 6 5.56 5.34
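For completeness, the exercise steps can also be chained into a single pipeline that starts from the untouched txhousing data. This is a sketch under two assumptions: txhousing comes from ggplot2, and na.omit is applied to the whole data frame as the exercise asks, which drops rows with a missing value in any column. Its numbers can therefore differ from the table above, which only ignores missing logsales.

```r
library(dplyr)
library(ggplot2)  # provides the txhousing data set

result <- txhousing %>%
  na.omit() %>%                        # drop rows with any missing value
  mutate(logsales = log(sales)) %>%    # new variable inside the pipe
  group_by(month) %>%                  # bare column name, not txhousing$month
  summarise(mean = mean(logsales),     # named columns, so arrange can find them
            median = median(logsales)) %>%
  arrange(mean)

result
```

Keeping every step inside the pipe avoids the two problems in the original attempt: an na.omit result that is computed but never stored, and unnamed summary columns.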

R: How to aggregate with NA values

To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions, namely for every country, year, and category, there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough just to have one row for each country, year, and category, and just trivially apply the sum() function on each row. This will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors but that didn't change the behavior. I read something on the na.action argument but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about trying to turn all the NAs to 0s, and I can't think of what that would hurt but it feels like a hack that might bite me later on--not sure. But if I try to do it, I'm not sure how I would. When I wrote a function with the is.na() function in it, it didn't apply the if (is.na(x)) test in a vectorized way and gave the error that it would just use the first element of the vector. I thought about perhaps using lapply() on the column and coercing it back to a vector and sticking that in the column, but that also sounds kind of hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
Since you already load dplyr before creating your data, you can use dplyr's summarise function. Unlike aggregate, group_by followed by summarise treats NA in a grouping column as a group of its own.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for the same group. Hence, the number of summarized rows is the same as the number of original rows.
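To show the behaviour when groups do repeat, here is a small sketch on made-up data in which country A has three rows with a missing Population. The data frame df2 and its values are hypothetical; the aggregate call uses the same formula style as in the question.

```r
library(dplyr)

# Hypothetical data: several money sources per country and year
df2 <- data.frame(
  Country    = c("A", "A", "A", "B", "B"),
  Year       = c(1, 1, 1, 1, 1),
  Population = c(NA, NA, NA, 0.5, 0.5),
  Money      = c(100, 101, 102, 200, 201)
)

# Formula interface of aggregate: rows with NA in Population are dropped
agg <- aggregate(Money ~ Country + Year + Population, df2, sum)

# dplyr: NA is kept as its own group and its rows are summed
out <- df2 %>%
  group_by(Country, Year, Population) %>%
  summarise(Money = sum(Money), .groups = "drop")

agg
out
```

Here agg contains only country B, while out has one row for A (Population NA, Money 303) and one for B (Money 401).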
