How does ddply split the data?

I have this data frame.
mydf <- data.frame(c("a","a","b","b","c","c"),
                   c("e","e","e","e","e","e"),
                   c(1,2,3,10,20,30),
                   c(5,10,20,20,15,10))
colnames(mydf) <- c("Model", "Class", "Length", "Speed")
I'm trying to get a better understanding on how ddply works.
I'd like to get the average length and speed for each pairing of model and class.
I know this is one way to do it: ddply(mydf, .(Model, Class), .fun = summarize, mSpeed = mean(Speed), mLength = mean(Length)).
I wonder if I can get the means using ddply without specifying each column one at a time.
I tried ddply(mydf, .(Model, Class), .fun = mean) but I get this warning:
Warning messages: 1: In mean.default(piece, ...) : argument is not
numeric or logical: returning NA
What does ddply pass on to the function argument? Is there a way to apply one function to every column using ddply?
My goal is to learn more about ddply. I will only accept answers that use ddply.

Here's a solution using dplyr and its summarize_if() function.
library(dplyr)
mydf <- data.frame(c("a","a","b","b","c","c"),
                   c("e","e","e","e","e","e"),
                   c(1,2,3,10,20,30),
                   c(5,10,20,20,15,10))
colnames(mydf) <- c("Model", "Class", "Length", "Speed")
#summarize data by Model & Class
mydf %>% group_by(Model, Class) %>% summarize_if(is.numeric, mean)
#> # A tibble: 3 x 4
#> # Groups: Model [3]
#> Model Class Length Speed
#> <fct> <fct> <dbl> <dbl>
#> 1 a e 1.5 7.5
#> 2 b e 6.5 20
#> 3 c e 25 12.5
Created on 2019-04-16 by the reprex package (v0.2.1)
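Since the question specifically asked for a ddply-based approach, here is a minimal plyr-only sketch (not part of the answer above): ddply() splits mydf into one sub-data-frame per Model/Class combination and passes each piece to .fun, so wrapping a function with numcolwise() applies it to every numeric column of each piece.
library(plyr)
# numcolwise(mean) applies mean() to every numeric column of each piece
ddply(mydf, .(Model, Class), numcolwise(mean))
#>   Model Class Length Speed
#> 1     a     e    1.5   7.5
#> 2     b     e    6.5  20.0
#> 3     c     e   25.0  12.5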

Related

How to replace vector tidier

I am looking for another function that can replace the broom::tidy() function after it gets removed. Here is what the broom package warning says:
Tidy Atomic Vectors
Vector tidiers are deprecated and will be removed from an upcoming release of broom.
Here is a description of the function:
tidy() produces a tibble() where each row contains information about an important component of the model. For regression models, this often corresponds to regression coefficients. This can be useful if you want to inspect a model or create custom visualizations.
Thank you,
John
As I understand the warning, there is no general deprecation of the function broom::tidy; the warning only occurs when it is called with an atomic vector. In this case tibble() seems to be a drop-in replacement:
No deprecation warning for tidy() when called for a linear model:
library(broom)
fit <- lm(Volume ~ Girth + Height, trees)
tidy(fit)
## A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 (Intercept) -58.0 8.64 -6.71 2.75e- 7
#2 Girth 4.71 0.264 17.8 8.22e-17
#3 Height 0.339 0.130 2.61 1.45e- 2
#Deprecation warning:
tidy(1:5)
## A tibble: 5 x 1
# x
# <int>
#1 1
#2 2
#3 3
#4 4
#5 5
#Warning messages:
#1: 'tidy.numeric' is deprecated.
#See help("Deprecated")
#2: `data_frame()` is deprecated as of tibble 1.1.0.
#Please use `tibble()` instead.
No warning for tibble, same output:
tibble(1:5)
## A tibble: 5 x 1
# `1:5`
# <int>
#1 1
#2 2
#3 3
#4 4
#5 5
The deprecation warning is letting you know that the method tidy.numeric is being removed.
broom:::tidy.numeric
function (x, ...)
{
    .Deprecated()
    if (!is.null(names(x))) {
        dplyr::data_frame(names = names(x), x = unname(x))
    }
    else {
        dplyr::data_frame(x = x)
    }
}
You can see the call to .Deprecated there, and the rest of the function just calls data_frame. As data_frame() is also being deprecated, tibble() is the new solution. Since tibble() does not keep the names of a vector, if you want to save them you could create something similar to the above:
tidy.numeric <- function(x, ...) {
    if (!is.null(names(x))) {
        tibble::tibble(names = names(x), x = unname(x))
    } else {
        tibble::tibble(x = x)
    }
}
If you are converting a named vector, as mentioned by @Miff, you can also use the function enframe(). It creates a tibble with two columns, one with the names in the vector and one with the values.
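For example, a quick sketch with a made-up named vector:
library(tibble)
x <- c(a = 1, b = 2, c = 3)
enframe(x)
#> # A tibble: 3 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 a         1
#> 2 b         2
#> 3 c         3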

How to fix dplyr filter() Error in UseMethod("filter_")

When I try to select data from the data matrix that I have created, I receive an error; I would appreciate it if someone could help me fix it.
Error in UseMethod("filter_") : no applicable method for 'filter_'
applied to an object of class "c('matrix', 'double', 'numeric')"
I have tried calling the function explicitly with dplyr::, using pipe operations such as mydata %>% filter(2010), and even installing and loading the package "conflicted" to give dplyr priority, but nothing works. I am new to R.
Matrix_5c_AVG_Year <- cbind(AVG_SWE_YEAR,AVG_NO[,2],AVG_FI[,2],AVG_EE[,2],AVG_LV[,2],AVG_LT[,2])
colnames(Matrix_5c_AVG_Year) <- c("Year","AVG_SWE1", "AVG_NO1", "AVG_FI1", "AVG_EE1", "AVG_LV1", "AVG_LT1")
mydata<-Matrix_5c_AVG_Year
mydata %>% filter(2010)
I would like to get an output of only the row of 2010 data, and preferably be able to select only one column.
As commented by @brettljausn, you need to convert your matrix to a data.frame. You will also get an error in the call to filter if you do not include the column name against which you want to compare your conditional value.
This should illustrate your problem and a solution (continuing in the tidyverse, since you are using filter):
library(tidyverse)
(a <- matrix(c(5,1), 2, 2))
#> [,1] [,2]
#> [1,] 5 5
#> [2,] 1 1
colnames(a) <- c("Year", "AVG_SWE1")
a %>%
filter(Year == 5)
#> Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "c('matrix', 'double', 'numeric')"
(a2 <- as_tibble(a))
#> # A tibble: 2 x 2
#> Year AVG_SWE1
#> <dbl> <dbl>
#> 1 5 5
#> 2 1 1
a2 %>%
filter(Year == 5)
#> # A tibble: 1 x 2
#> Year AVG_SWE1
#> <dbl> <dbl>
#> 1 5 5
Created on 2019-07-31 by the reprex package (v0.3.0)
Since you are new, I would recommend you to read chapter 1-16 of https://r4ds.had.co.nz/.
Thanks to larsoevlisen I understood that my data were stored as a matrix and could not be manipulated that way, so I had to transform them into a data.frame() to filter out the data I need.
Final solution:
Matrix_5c_AVG_Year <- cbind(AVG_SWE_YEAR, AVG_NO_YEAR[,2], AVG_FI_YEAR[,2], AVG_EE_YEAR[,2], AVG_LV_YEAR[,2], AVG_LT_YEAR[,2])
Matrix_5c_AVG_Year <- data.frame(Matrix_5c_AVG_Year)
colnames(Matrix_5c_AVG_Year) <- c("Year","AVG_SWE1", "AVG_NO1",
"AVG_FI1", "AVG_EE1", "AVG_LV1", "AVG_LT1")

Using aggregate functions on multiple columns at once in R

Let's say I have a data set that has multiple rows and columns, and I want to record the min, max and mean for each column and store this data in its own table. How do I loop through the data frame in such a way that I can find this data for each column?
Edit: My initial data is stored in a tbl that looks like this Initial Data and I want the output to look like this Output Data
Take a look at package dplyr, which will make this task more straightforward!
Here's an approach that just uses dplyr. The format isn't exactly what's in Output Data...
> df <- data.frame(A=c(7,2,4), B=c(5,4,6), C=c(7,9,1)) # Your Initial Data
> library(dplyr)
> df %>% summarise_all(.funs=funs(mean, min, max)) ## Approach 1: just dplyr
A_mean B_mean C_mean A_min B_min C_min A_max B_max C_max
1 4.333333 5 5.666667 2 4 1 7 6 9
Alternatively, if you also use package tidyr, you can get exactly the format you wanted for your output data:
> library(tidyr)
> df %>%
+ gather(Column, Value) %>% ## Converts dataframe from wide to long format
+ group_by(Column) %>% ## Groups by the new column containing old column names
+ summarise(Max=max(Value), Min=min(Value), Mean=mean(Value)) ## The summary functions
# A tibble: 3 x 4
Column Max Min Mean
<chr> <dbl> <dbl> <dbl>
1 A 7.00 2.00 4.33
2 B 6.00 4.00 5.00
3 C 9.00 1.00 5.67
One advantage of using these packages is that it may be more efficient, especially if df is large, than using an explicit loop.
I suggest you work with long tables instead of wide ones. While the latter are simpler for the human eye, the former are easier to manipulate for data analysis. That said, I think you could use the data.table package to achieve this:
# create a data frame
df <- data.frame(A=c(7,2,4), B=c(5,4,6), C=c(7,9,1))
# load data.table package
require(data.table)
# convert df to a data.table
setDT(df)
#Explanation of the following code:
# melt: turns your wide table into a long one
# .(val_mean ...) calculate and give names to calculated variables
# by = ... : group by variable. See data.table vignette
melt(df)[, .(val_mean = mean(value),
             val_min  = min(value),
             val_max  = max(value)),
         by = variable]
which produces:
variable val_mean val_min val_max
1: A 4.333333 2 7
2: B 5.000000 4 6
3: C 5.666667 1 9

Looping through variables in a dataframe to find summary stats

I don't know much about R, and I have variables in a dataframe that I am trying to calculate some stats for, with the hope of writing them into a csv. I have been using a basic for loop, like this:
for(i in x) {
  mean(my_dataframe[, c(i)], na.rm = TRUE)
}
where x is colnames(my_dataframe)
Not every variable is numeric - but when I add a print to the loop, this works fine - it just prints means when applicable, and NA when not. However, when I try to assign the result of this loop to a variable (means <- for....), it produces an empty list. Similarly, when I try to directly write the results to a csv, I get an empty csv. Does anyone know why this is happening and how to fix it?
This should work for you. You don't need a loop; just use the summary() function.
summary(cars)
The for loop executes the code inside, but it doesn't put any results together. To do that, you need to create an object to hold the results and explicitly assign each one:
my_means = rep(NA, ncol(my_dataframe))
for(i in seq_along(x)) {
  my_means[i] = mean(my_dataframe[, x[i]], na.rm = TRUE)
}
Note that I have also changed your loop to use i = 1, 2, 3, ... instead of each name.
sapply, as shown in another answer, is a nice shortcut that does the loop and combines the results for you, so you don't need to worry about pre-allocating the result object. It's also smart enough to iterate over columns of a data frame by default.
my_means_2 = sapply(my_dataframe, mean, na.rm = T)
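Since the goal was to write the stats into a csv, here is a minimal sketch of saving the sapply result (the file name is made up):
write.csv(data.frame(variable = names(my_means_2), mean = my_means_2),
          "column_means.csv", row.names = FALSE)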
Please give a reproducible example the next time you post a question.
The input below is how I imagine your data looks.
Input:
library(nycflights13)
library(tidyverse)
input <- flights %>% select(origin, air_time, carrier, arr_delay)
input
# A tibble: 336,776 x 4
origin air_time carrier arr_delay
<chr> <dbl> <chr> <dbl>
1 EWR 227. UA 11.
2 LGA 227. UA 20.
3 JFK 160. AA 33.
4 JFK 183. B6 -18.
5 LGA 116. DL -25.
6 EWR 150. UA 12.
7 EWR 158. B6 19.
8 LGA 53. EV -14.
9 JFK 140. B6 -8.
10 LGA 138. AA 8.
# ... with 336,766 more rows
The way I see it, there are 2 ways to do it:
Use summarise_all()
summarise_all() will summarise all your columns, including those that are not numeric.
Method:
input %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 1 x 4
origin air_time carrier arr_delay
<dbl> <dbl> <dbl> <dbl>
1 NA 151. NA 6.90
Warning messages:
1: In mean.default(origin, na.rm = TRUE) :
argument is not numeric or logical: returning NA
2: In mean.default(carrier, na.rm = TRUE) :
argument is not numeric or logical: returning NA
You will get a result and a warning if you were to use this method.
Use summarise_if
summarise_if() summarises only the numeric columns, so you avoid the warnings this way.
Method:
input %>% summarise_if(is.numeric, funs(mean(., na.rm = TRUE)))
# A tibble: 1 x 2
air_time arr_delay
<dbl> <dbl>
1 151. 6.90
You can then create NA columns for the other (non-numeric) variables if needed.
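For instance, a sketch that adds the non-numeric columns back as NA:
input %>%
  summarise_if(is.numeric, funs(mean(., na.rm = TRUE))) %>%
  mutate(origin = NA, carrier = NA)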
You can use lapply or sapply for this sort of thing, e.g.
sapply(my_dataframe, mean)
will get you all the means. You can also give it your own function e.g.
sapply(my_dataframe, function(x) sum(x^2 + 2)/4 - 9)
If all variables are not numeric you can use summarise_if from dplyr to get the results just for the numeric columns.
require(dplyr)
my_dataframe %>%
summarise_if(is.numeric, mean)
Without dplyr, you could do
sapply(my_dataframe[sapply(my_dataframe, is.numeric)], mean)

ACF by group in R

I would like to calculate the acf of a time series grouped by a grouping variable. Specifically, I have a data frame containing a single time series (variable a) and a grouping variable (e.g. weekday, variable b). Here is an example:
data <- data.frame(a=rnorm(1:150), b=rep(rep(1:3, each=5), 10))
Now, I would like to calculate the acf for the different values of the grouping variable. For example, for lag 2 and group 1 I would like to get the correlation between t and t-2 calculated only over time points t with b=1 (the value of b for t-2 does not matter). I know that the function acf can easily calculate the acf but I don't find a way to include the grouping variable.
I could manually calculate the desired correlation but as I have a large data set and a lot of lags and values for the grouping variables, I would hope that there is a more elegant and faster way. Here is the manual calculation for the example above (lag 2, b=1):
sel <- which(data$b==1)
cor(data$a[sel[sel > 2]], data$a[sel[sel>2] - 2])
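For reference, the same manual calculation can be wrapped into a small helper over lag and group (a sketch, not from the original post):
lag_cor <- function(d, lag, grp) {
  # time points t whose own group is grp and that have a t - lag predecessor
  sel <- which(d$b == grp)
  sel <- sel[sel > lag]
  cor(d$a[sel], d$a[sel - lag])
}
lag_cor(data, lag = 2, grp = 1)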
If the time series object is a tsibble, the following works for me, assuming the data frame is called df and the variable you are interested in is called var. You can additionally specify the maximum lag:
df %>% group_by(Region) %>% ACF(var, lag_max = 18) %>% autoplot()
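For context, ACF() here comes from the feasts package and expects a tsibble; a rough sketch of the full pipeline (Region, var and date are placeholder names, not from the question):
library(tsibble)
library(feasts)
# convert a plain data frame to a tsibble first (placeholder key/index columns)
df_ts <- as_tsibble(df, key = Region, index = date)
df_ts %>% group_by(Region) %>% ACF(var, lag_max = 18) %>% autoplot()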
I'm not sure I understand exactly what information you are looking for but if you just want the acf values for multiple groups this should accomplish that. Some people have mentioned creating a tidy solution and this uses dplyr, tidyr, and purrr to do grouped calculations.
library(dplyr)
library(tidyr)
library(purrr)
sample_data <- dplyr::data_frame(group = sample(c("a", "b", "c"), size = 100, replace = T), value = sample.int(30, size = 100, replace = T))
head(sample_data)
#> # A tibble: 6 × 2
#> group value
#> <chr> <int>
#> 1 c 28
#> 2 c 9
#> 3 c 13
#> 4 c 11
#> 5 a 9
#> 6 c 9
grouped_acf_values <- sample_data %>%
  tidyr::nest(-group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = F)),
                acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))
head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#>
#> group acf_values lag
#> <chr> <dbl> <int>
#> 1 c 1.00000000 0
#> 2 c -0.20192774 1
#> 3 c 0.07191805 2
#> 4 c -0.18440489 3
#> 5 c -0.31817935 4
#> 6 c 0.06368096 5
You can have a look at split to separate your data.frame into buckets and then lapply to apply your function to each group. Something like:
groups_data <- split(data, data$b)
groups_acf <- lapply(groups_data, acf,...)
Then you have to extract the required information from the output list, for instance with sapply(groups_acf, function(acfobject) acfobject$acf).
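Putting those pieces together with the example data from the question, a minimal sketch:
groups_data <- split(data, data$b)
# plot = FALSE returns the acf objects instead of drawing them
groups_acf <- lapply(groups_data, acf, plot = FALSE)
# pull the autocorrelation values out of each acf object
sapply(groups_acf, function(acfobject) drop(acfobject$acf))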
For group computations, I would also definitely go with the new ways "à la" Hadley Wickham, using the %>% operator and group_by(); studying that is on my to-do list...
