Data cleaning in R, aggregation of multiple columns [duplicate]


This is best illustrated with an example
str(mtcars)
mtcars$gear <- factor(mtcars$gear, labels=c("three","four","five"))
mtcars$cyl <- factor(mtcars$cyl, labels=c("four","six","eight"))
mtcars$am <- factor(mtcars$am, labels=c("auto","manual")) # in mtcars, am: 0 = automatic, 1 = manual
str(mtcars)
tapply(mtcars$mpg, mtcars$gear, sum)
That gives me the summed mpg per gear. But say I wanted a 3x3 table with gear across the top and cyl down the side, with the bivariate sums in the 9 cells. How would I get that 'smartly'?
I could do this:
tapply(mtcars$mpg[mtcars$cyl=="four"], mtcars$gear[mtcars$cyl=="four"], sum)
tapply(mtcars$mpg[mtcars$cyl=="six"], mtcars$gear[mtcars$cyl=="six"], sum)
tapply(mtcars$mpg[mtcars$cyl=="eight"], mtcars$gear[mtcars$cyl=="eight"], sum)
This seems cumbersome.
Then how would I bring a 3rd variable in the mix?
This question is somewhat in the space I'm thinking about: "Summary statistics using ddply".
Update: this gets me there, but it's not pretty.
aggregate(mpg ~ am+cyl+gear, mtcars,sum)
Cheers

How about this, still using tapply()? It's more versatile than you knew!
with(mtcars, tapply(mpg, list(cyl, gear), sum))
#       three  four five
# four   21.5 215.4 56.4
# six    39.5  79.0 19.7
# eight 180.6    NA 30.8
Or, if you'd like the printed output to be a bit more interpretable:
with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
If you want to use more than two cross-classifying variables, the idea's exactly the same. The results will then be returned in a 3-or-more-dimensional array:
A <- with(mtcars, tapply(mpg, list(cyl, gear, carb), sum))
dim(A)
# [1] 3 3 6
lapply(1:6, function(i) A[,,i]) # To convert results to a list of matrices
# But eventually, the curse of dimensionality will begin to kick in...
table(is.na(A))
# FALSE TRUE
# 12 42
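As a possible base R alternative to the tapply() calls above, xtabs() builds the same kind of cross-classified table of sums from a formula. This is only a sketch: it assumes the stock mtcars data (numeric cyl, gear and am rather than the relabelled factors from the question), and the names sums and sums3 are made up for illustration. Note that xtabs() reports 0 rather than NA for combinations that never occur.

```r
# Assumes the stock mtcars data set, not the relabelled factors from the question.
data(mtcars)

# Two-way table: sum of mpg for every cyl x gear combination.
sums <- xtabs(mpg ~ cyl + gear, data = mtcars)
print(sums)

# Adding a third classifier only means adding a term to the formula;
# ftable() flattens the resulting 3-d array for printing.
sums3 <- xtabs(mpg ~ cyl + gear + am, data = mtcars)
print(ftable(sums3))
```

The formula interface keeps the call the same length no matter how many grouping variables are involved, which may be easier to read than a growing list() inside tapply().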

I think the answers already on this question are fantastic options, but I wanted to share an additional option based on the dplyr package (this came up for me because I'm teaching a class right now where we use dplyr for data manipulation, so I wanted to avoid introducing students to specialized base R functions like tapply or aggregate).
You can group on as many variables as you want using the group_by function and then summarize information from these groups with summarize. I think this code is more readable to an R newcomer than the formula-based interface of aggregate, yielding identical results:
library(dplyr)
mtcars %>%
group_by(am, cyl, gear) %>%
summarize(mpg=sum(mpg))
# am cyl gear mpg
# (dbl) (dbl) (dbl) (dbl)
# 1 0 4 3 21.5
# 2 0 4 4 47.2
# 3 0 6 3 39.5
# 4 0 6 4 37.0
# 5 0 8 3 180.6
# 6 1 4 4 168.2
# 7 1 4 5 56.4
# 8 1 6 4 42.0
# 9 1 6 5 19.7
# 10 1 8 5 30.8
With two variables, you can summarize with one variable on the rows and the other on the columns by adding a call to the spread function from the tidyr package:
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl, gear) %>%
summarize(mpg=sum(mpg)) %>%
spread(gear, mpg)
# cyl 3 4 5
# (dbl) (dbl) (dbl) (dbl)
# 1 4 21.5 215.4 56.4
# 2 6 39.5 79.0 19.7
# 3 8 180.6 NA 30.8

I like Josh's answer for this, but reshape2 can also provide a nice framework for these types of problems:
library(reshape2)
#use subset to only grab the variables of interest...
mtcars.m <- melt(subset(mtcars, select = c("mpg", "gear", "cyl")), measure.vars="mpg")
#cast into appropriate format
dcast(mtcars.m, cyl ~ gear, fun.aggregate=sum, value.var="value")
cyl three four five
1 four 21.5 215.4 56.4
2 six 39.5 79.0 19.7
3 eight 180.6 0.0 30.8

This answer reproduces the same output with both tapply and aggregate.
I would like to add some information to Josh O'Brien's answer. You can use either aggregate or tapply, depending on the output shape you want. To use more than one factor variable in tapply, use the method Josh has shown.
Loading dataset
data("mtcars")
Using tapply
with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
The output of the above code is:
         Gear#
Cylinder#     3     4    5
        4  21.5 215.4 56.4
        6  39.5  79.0 19.7
        8 180.6    NA 30.8
Using aggregate function
with(mtcars, aggregate(mpg, list(Cylinder = cyl, Gear = gear), sum))
Output of aggregate function
Cylinder Gear x
1 4 3 21.5
2 6 3 39.5
3 8 3 180.6
4 4 4 215.4
5 6 4 79.0
6 4 5 56.4
7 6 5 19.7
8 8 5 30.8
To get the same long-format output as aggregate, but using tapply:
as.data.frame(as.table(with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))))
Output of the tapply version:
Cylinder. Gear. Freq
1 4 3 21.5
2 6 3 39.5
3 8 3 180.6
4 4 4 215.4
5 6 4 79.0
6 8 4 NA
7 4 5 56.4
8 6 5 19.7
9 8 5 30.8
Unlike aggregate, this keeps the empty combination as an NA row. The NAs can be kept or removed, depending on your requirements.

Related

How to move R code into functions to generalise behaviour

I have a huge messy piece of R code with loads of ugly repetition. There is an opportunity to massively reduce it. Starting with this piece of code:
table <-
risk_assigned %>%
group_by(rental_type, room_type) %>%
summarise_all(funs( sum(!is.na(.)) / length(.) ) ) %>%
select(-c(device_id, ts, room, hhi, temp)) %>%
adorn_pct_formatting()
I would like to generalise it into a function so it can be reused.
LayKable = function(kableDetails) {
table <-
risk_assigned %>%
group_by(kableDetails$group1 , kableDetails$group2) %>%
summarise_all(funs( sum(!is.na(.)) / length(.) ) ) #%>%
select(-c(device_id, ts, room, hhi, temp)) %>%
adorn_pct_formatting()
...
kable <- table
return(kable)
}
kableDetails <- list(
group1 = "rental_type",
group2 = "room_type"
)
newKable <- LayKable(kableDetails)
This rather half-hearted attempt serves to explain what I want to do. How can I pass stuff into this function inside a list? (I'm a C programmer, pretending it's a struct.)

When passing function arguments to a dplyr verb inside a function, you have to use rlang's tidy evaluation tools. But it should be simple to define a function that you can pass any number of grouping terms to:
library(dplyr)
test_func <- function(..., data = mtcars) {
# Passing `data` as a default argument as it's nice to be flexible!
data %>%
group_by(!!!enquos(...)) %>%
summarise(across(everything(), sum), .groups = "drop")
}
test_func(cyl, gear)
#> # A tibble: 8 x 11
#> cyl gear mpg disp hp drat wt qsec vs am carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 3 21.5 120. 97 3.7 2.46 20.0 1 0 1
#> 2 4 4 215. 821 608 32.9 19.0 157. 8 6 12
#> 3 4 5 56.4 215. 204 8.2 3.65 33.6 1 2 4
#> 4 6 3 39.5 483 215 5.84 6.68 39.7 2 0 2
#> 5 6 4 79 655. 466 15.6 12.4 70.7 2 2 16
#> 6 6 5 19.7 145 175 3.62 2.77 15.5 0 1 6
#> 7 8 3 181. 4291. 2330 37.4 49.2 206. 0 0 37
#> 8 8 5 30.8 652 599 7.76 6.74 29.1 0 2 12
Update - adding a list
I see your ideal would be to build a list of arguments for each function call and pass that, rather than writing out the arguments in each call. You can do this with do.call, which passes a list of named arguments to a function. Again, when using dplyr verbs, quote the variable names when constructing the list (so that R doesn't try to look them up in the global environment while building the list) and then !!enquo each one inside the function:
library(dplyr)
test_func2 <- function(.summary_var, .group_var, data = mtcars) {
data %>%
group_by(!!enquo(.group_var)) %>%
summarise(mean = mean(!!enquo(.summary_var)))
}
# Test with bare arguments
test_func2(hp, cyl)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 82.6
#> 2 6 122.
#> 3 8 209.
# Construct and pass list
args <- list(.summary_var = quote(hp), .group_var = quote(cyl))
do.call(test_func2, args = args)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 82.6
#> 2 6 122.
#> 3 8 209.
See a handy guide to tidy evaluation, where most of these ideas are explained more clearly.
Created on 2021-12-21 by the reprex package (v2.0.1)
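If the struct-like list should hold plain column names as strings, as in the question's kableDetails, one base R way to sketch the idea is to build a formula with reformulate() and hand it to aggregate(). The function name sum_by, the list element value, and the use of mtcars are all assumptions made for this illustration, not part of the original code.

```r
# Hypothetical helper: 'details' is a list of column names given as strings.
sum_by <- function(details, data = mtcars) {
  # Build the formula value ~ group1 + group2 from the strings in the list.
  f <- reformulate(c(details$group1, details$group2), response = details$value)
  # Sum the response within every observed combination of the grouping columns.
  aggregate(f, data = data, FUN = sum)
}

details <- list(group1 = "cyl", group2 = "gear", value = "mpg")
result <- sum_by(details)
print(result)
```

Because the list holds ordinary strings, it can be built, stored and modified like any other R list, and no quoting or unquoting is needed at the call site.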

row bind list columns using dplyr

I would like to find a better way to bind together the results of any number of regressions after adding an identifier for each model. The code below is my current solution but is too manual for a large number of regressions. This is part of a larger tidy workflow so a solution inside of the tidyverse is preferred but whatever works is fine. Thanks
library(tidyverse)
library(broom)
model_dat=mtcars %>%
do(lm_1 = tidy(lm(disp~ wt*vs, data = .),conf.int=T),
lm_2=tidy(lm(cyl ~ wt*vs, data = .),conf.int=T ),
lm_3=tidy(lm(mpg ~ wt*vs, data = .),conf.int=T ))
df=model_dat %>%
select(lm_1) %>%
unnest(c(lm_1)) %>%
mutate(model="one") %>%
select(model,term,estimate,p.value:conf.high) %>%
bind_rows(
model_dat %>%
select(lm_2) %>%
unnest(c(lm_2)) %>%
mutate(model="two") %>%
select(model,term,estimate,p.value:conf.high)) %>%
bind_rows(
model_dat %>%
select(lm_3) %>%
unnest(c(lm_3)) %>%
mutate(model="three") %>%
select(model,term,estimate,p.value:conf.high))

It may be easier with map2: loop across the columns and the corresponding English word for each column's position, pluck the list element, create the 'model' column from the second argument (the English word, .y), select the columns of interest, and combine everything into a single dataset by using the _dfr variant of map2.
library(purrr)
library(english)
library(dplyr)
library(broom)
map2_dfr(model_dat, as.character(english(seq_along(model_dat))),
~ .x %>%
pluck(1) %>%
mutate(model = .y) %>%
select(model, term, estimate, p.value:conf.high) )
Output:
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
Or use summarise with across, unclass and then bind with bind_rows
model_dat %>%
summarise(across(everything(), ~ {
# // get the column name
nm1 <- cur_column()
# // extract the list element (.[[1]])
list(.[[1]] %>%
# // create new column by extracting the numeric part
mutate(model = english(readr::parse_number(nm1))) %>%
# // select the subset of columns, wrap in a list
select(model, term, estimate, p.value:conf.high))
}
)) %>%
# // unclass to list
unclass %>%
# // bind the list elements
bind_rows
Output:
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <english> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
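For comparison, the same "fit several models, label each one, stack the coefficient tables" pattern can be sketched in base R without broom or purrr. This is an illustration under assumptions: the list name formulas, the labels one/two/three, and the object coef_table are invented here, and the columns are assembled by hand from summary() and confint() instead of tidy().

```r
# Named list of model formulas; the names become the model identifier.
formulas <- list(one = disp ~ wt * vs, two = cyl ~ wt * vs, three = mpg ~ wt * vs)

rows <- lapply(names(formulas), function(nm) {
  fit <- lm(formulas[[nm]], data = mtcars)
  co <- coef(summary(fit))   # matrix of estimates, standard errors, t and p values
  ci <- confint(fit)         # 95% confidence limits, one row per term
  data.frame(model = nm,
             term = rownames(co),
             estimate = co[, "Estimate"],
             p.value = co[, "Pr(>|t|)"],
             conf.low = ci[, 1],
             conf.high = ci[, 2],
             row.names = NULL)
})

# Stack the per-model data frames into one table.
coef_table <- do.call(rbind, rows)
print(coef_table)
```

Adding another regression then only requires adding one more entry to the formulas list.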

R describeby function subscript out of bounds error

I'm fairly new to R and I'm trying to get descriptive statistics grouped by multiple variables using the describeBy function from the psych package.
Here's what I'm trying to run:
JL <- describeBy(df$JL, group=list(df$Time, df$Cohort, df$Gender), digits=3, skew=FALSE, mat=TRUE)
And I get the error message Error in `[<-`(`*tmp*`, var, group + 1, value = dim.names[[group]][[groupi]]) :
subscript out of bounds
I only get this error message with my Gender variable (which is dichotomous in this dataset). I'm able to run the code when I take out the mat=TRUE argument, and I see that it's generating groupings with NULL for Gender. I saw in other answers that this has something to do with the array being out of bounds, but I'm not sure how to troubleshoot. Any advice is appreciated.
Thanks so much.

You could use dplyr, with some custom functions added.
library(dplyr)
se <- function(x) sd(x, na.rm=TRUE)/sqrt(length(na.omit(x)))
rnge <- function(x) diff(range(x, na.rm=TRUE))
group_by(df, Time, Cohort, Gender) %>%
summarise_at(vars(JL), .funs=list(n=length, mean=mean, sd=sd, min=min, max=max, range=rnge, se=se)) %>%
as.data.frame()
Using the mtcars dataset:
group_by(mtcars, vs, am, cyl) %>%
summarise_at(vars(mpg), .funs=list(n=length, mean=mean, sd=sd, min=min, max=max, range=rnge, se=se)) %>% as.data.frame()
vs am cyl n mean sd min max range se
1 0 0 8 12 15.1 2.774 10.4 19.2 8.8 0.801
2 0 1 4 1 26.0 NA 26.0 26.0 0.0 NA
3 0 1 6 3 20.6 0.751 19.7 21.0 1.3 0.433
4 0 1 8 2 15.4 0.566 15.0 15.8 0.8 0.400
5 1 0 4 3 22.9 1.453 21.5 24.4 2.9 0.839
6 1 0 6 4 19.1 1.632 17.8 21.4 3.6 0.816
7 1 1 4 7 28.4 4.758 21.4 33.9 12.5 1.798
Using the describeBy function from the psych package reproduces your error:
library(psych)
describeBy(mtcars$mpg, group=list(mtcars$vs, mtcars$am, mtcars$cyl), digits=3, skew=FALSE, mat=TRUE)
Error in `[<-`(`*tmp*`, var, group + 1, value = dim.names[[group]][[groupi]]) :
subscript out of bounds
This happens because not all combinations of the three groups exist in the data:
with(mtcars, ftable(table(vs, am, cyl)))
#       cyl  4  6  8
# vs am
# 0  0       0  0 12
#    1       1  3  2
# 1  0       3  4  0
#    1       7  0  0
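To list the missing combinations explicitly instead of reading them off the printed table, the counts can be turned into a data frame and filtered. A small sketch using mtcars (the names counts and empty are just for illustration):

```r
# One row per vs x am x cyl combination, with its number of observations.
counts <- as.data.frame(with(mtcars, table(vs, am, cyl)))

# Keep only the combinations that never occur in the data.
empty <- counts[counts$Freq == 0, ]
print(empty)
```

Running the same check on your own Time, Cohort and Gender columns should show which group combinations are absent.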

hierarchical clustering with gower distance - hclust() and philentropy::distance()

I've got a mixed data set (categorical and continuous variables) and I'd like to do hierarchical clustering using Gower distance.
I base my code on an example from https://www.r-bloggers.com/hierarchical-clustering-in-r-2/, which uses base R dist() for Euclidean distance. Since dist() doesn't compute Gower distance, I've tried using philentropy::distance() to compute it but it doesn't work.
Thanks for any help!
# Data
data("mtcars")
mtcars$cyl <- as.factor(mtcars$cyl)
# Hierarchical clustering with Euclidean distance - works
clusters <- hclust(dist(mtcars[, 1:2]))
plot(clusters)
# Hierarchical clustering with Gower distance - doesn't work
library(philentropy)
clusters <- hclust(distance(mtcars[, 1:2], method = "gower"))
plot(clusters)

The error is in the distance function itself.
I don't know if it's intentional or not, but the current implementation of philentropy::distance with the "gower" method cannot handle any mixed data types, since the first operation is to transpose the data.frame, producing a character matrix which then throws the typing error when passed to the DistMatrixWithoutUnit function.
You might try using the daisy function from cluster instead.
library(cluster)
x <- mtcars[,1:2]
x$cyl <- as.factor(x$cyl)
dist <- daisy(x, metric = "gower")
cls <- hclust(dist)
plot(cls)
EDIT: For future reference, it seems that philentropy will be updated to include better type handling in the next version. From the vignette:
In future versions of philentropy I will optimize the distance()
function so that internal checks for data type correctness and correct
input data will take less termination time than the base dist()
function.
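To make clear what a Gower distance on mixed data involves, here is a hand-rolled base R sketch: numeric columns contribute the absolute difference scaled by the column's range, categorical columns contribute 0 for a match and 1 for a mismatch, and the contributions are averaged over columns. The function gower_matrix is hypothetical and written only for illustration; it assumes no missing values and no constant numeric columns, and it is not the implementation used by cluster::daisy() or any other package.

```r
# Hypothetical helper: simple Gower distance for numeric and factor columns.
# Assumes no NAs and that every numeric column has a non-zero range.
gower_matrix <- function(df) {
  n <- nrow(df)
  total <- matrix(0, n, n)
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x)) {
      # Range-scaled absolute difference, always between 0 and 1.
      part <- abs(outer(x, x, "-")) / diff(range(x))
    } else {
      # Simple matching: 0 if the categories agree, 1 otherwise.
      part <- outer(as.character(x), as.character(x), "!=") * 1
    }
    total <- total + part
  }
  d <- total / ncol(df)   # average over all variables
  dimnames(d) <- list(rownames(df), rownames(df))
  as.dist(d)
}

x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)
d <- gower_matrix(x)
cls <- hclust(d)
plot(cls)
```

For real work a maintained implementation is the safer choice, since it also deals with missing values, ordinal variables and weights.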

Sorry, my English is limited, so I can't explain much. Here is an attempt.
But the code is good ;-)
library(philentropy)
clusters <- hclust(
as.dist(
distance(mtcars[, 1:2], method = "gower")))
plot(clusters)
Good luck

You can do it pretty efficiently with the gower package
library(gower)
d <- sapply(1:nrow(mtcars), function(i) gower_dist(mtcars[i,],mtcars))
d <- as.dist(d)
h <- hclust(d)
plot(h)

Many thanks for this great question and thanks to all of you who provided excellent answers.
Just to resolve the issue for future readers:
# import example data
data("mtcars")
# store example subset with correct data type
mtcars_subset <- tibble::tibble(mpg = as.numeric(as.vector(mtcars$mpg)),
cyl = as.numeric(as.vector(mtcars$cyl)),
disp = as.numeric(as.vector(mtcars$disp)))
# transpose the data.frame to conform to the philentropy input format
mtcars_subset <- t(mtcars_subset)
# cluster
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower")))
plot(clusters)
# When using the developer version on GitHub you can also specify 'use.row.names = TRUE'
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower",
use.row.names = TRUE)))
plot(clusters)
As you can see, clustering works perfectly fine now.
The problem is that in the example dataset the column cyl stores factor values, not the double values that philentropy::distance() requires. Since the underlying code is written in Rcpp, nonconforming data types will cause problems. As Esther correctly noted, I will implement a better way to check type safety in future versions of the package.
head(tibble::as_tibble(mtcars))
# A tibble: 6 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
To overcome this limitation, I stored the columns of interest from the mtcars dataset in a separate data.frame/tibble and converted all columns to double values via as.numeric(as.vector(mtcars$mpg)).
The resulting subset data.frame now stores only double values as required.
mtcars_subset
# A tibble: 32 x 3
mpg cyl disp
<dbl> <dbl> <dbl>
1 21 6 160
2 21 6 160
3 22.8 4 108
4 21.4 6 258
5 18.7 8 360
6 18.1 6 225
7 14.3 8 360
8 24.4 4 147.
9 22.8 4 141.
10 19.2 6 168.
# … with 22 more rows
Please also note that if you provide the philentropy::distance() function only 2 input vectors, then only one distance value will be returned and the hclust() function won't be able to compute any clusters with one value. Hence, I added a third column disp to enable visualization of the clusters.
I hope this helps.

How to fit a function for different groups in a data set using R

How can I fit a function for different groups in a data set (Soil) using R? The first column is the group (Plot) and the second column is the observed variable (Depth).
Plot Depth
1 12.5
1 14.5
1 15.8
1 16.1
1 18.9
1 21.2
1 23.4
1 25.7
2 13.1
2 15.0
2 15.8
2 16.3
2 17.4
2 18.6
2 22.6
2 24.1
2 25.6
3 11.5
3 12.2
3 13.9
3 14.7
3 18.9
3 20.5
3 21.6
3 22.6
3 24.1
3 25.8
4 10.2
4 21.5
4 15.1
4 12.3
4 10.0
4 13.5
4 16.5
4 19.2
4 17.6
4 14.1
4 19.7
I used a 'for' loop but only saw output for Plot 1.
This is how I wrote the loop, after importing my data into R and saving it as SNq:
for (i in 1:SNq$Plot[i]) {
dp <- SNq$Depth[SNq$Plot==SNq$Plot[i]]
fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
fit1
}

I think this should work. Just make one change in your code: loop over unique(SNq$Plot).
Why does it work?
Because unique() returns the distinct values (1, 2, 3, 4), which are exactly the groups in the Plot column. With each unique value, we can subset the data using SNq$Depth[SNq$Plot==i] and get the depth values for that group.
for (i in unique(SNq$Plot)) { # <- here
dp <- SNq$Depth[SNq$Plot==i]
fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
plot(fit1)
}

A tidyverse suggestion:
library("tidyverse")
library("fitdistrplus")
fits <- SNq %>%
group_by(Plot) %>%
nest() %>%
mutate(fits = map(data, ~ fitdist(data = .$Depth, distr = "gamma")),
summaries = map(fits, summary))
You could continue with print(fits$fits) and print(fits$summaries) to access the different fits and their summary. Alternatively you can use a syntax like fits$fits[[1]] and fits$summaries[[1]] to access them.

Try:
for (p in unique(SNq$Plot)) { # one iteration per plot, not per row
dp <- SNq$Depth[SNq$Plot==p]
fit1 = fitdist(dp, "gamma")
print(fit1) # objects are not auto-printed inside a for loop
}
}
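The per-group loop in the answers above can also be written with split() and lapply(), which returns one result per plot in a named list. The sketch below is self-contained, so it rebuilds SNq from the values in the question and, to avoid depending on fitdistrplus, swaps in a simple method-of-moments gamma estimate (shape = mean^2 / variance, rate = mean / variance). The helper gamma_mom is hypothetical and is not equivalent to fitdist(), which uses maximum likelihood by default; replace it with function(x) fitdist(x, "gamma") to get the original fits.

```r
# Data from the question: 8, 9, 10 and 11 depth readings for plots 1 to 4.
SNq <- data.frame(
  Plot = rep(1:4, times = c(8, 9, 10, 11)),
  Depth = c(12.5, 14.5, 15.8, 16.1, 18.9, 21.2, 23.4, 25.7,
            13.1, 15.0, 15.8, 16.3, 17.4, 18.6, 22.6, 24.1, 25.6,
            11.5, 12.2, 13.9, 14.7, 18.9, 20.5, 21.6, 22.6, 24.1, 25.8,
            10.2, 21.5, 15.1, 12.3, 10.0, 13.5, 16.5, 19.2, 17.6, 14.1, 19.7)
)

# Hypothetical stand-in for fitdist(): method-of-moments gamma estimates.
gamma_mom <- function(x) {
  m <- mean(x)
  v <- var(x)
  c(shape = m^2 / v, rate = m / v)
}

# split() yields one Depth vector per plot; lapply() fits each of them.
fits <- lapply(split(SNq$Depth, SNq$Plot), gamma_mom)

# Collect the per-plot estimates into a matrix with one row per plot.
fit_table <- do.call(rbind, fits)
print(fit_table)
```

Individual results are then available by plot label, for example fits[["3"]].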
