Looping through variables in a dataframe to find summary stats - r

I don't know much about R. I have variables in a data frame that I am trying to calculate some stats for, with the hope of writing them to a CSV. I have been using a basic for loop, like this:
for(i in x) {
  mean(my_dataframe[,c(i)], na.rm = TRUE)
}
where x is colnames(my_dataframe)
Not every variable is numeric - but when I add a print to the loop, this works fine - it just prints means when applicable, and NA when not. However, when I try to assign this loop to a value (means <- for....), it produces an empty list. Similarly, when I try to directly write the results to a csv, I get an empty csv. Does anyone know why this is happening/how to fix it?

This should work for you. You don't need a loop; just use the summary() function.
summary(cars)

The for loop executes the code inside, but it doesn't put any results together. To do that, you need to create an object to hold the results and explicitly assign each one:
my_means = rep(NA, ncol(my_dataframe))
for(i in seq_along(x)) {
  my_means[i] = mean(my_dataframe[, x[i]], na.rm = TRUE)
}
Note that I have also changed your loop to use i = 1, 2, 3, ... instead of each name.
sapply, as shown in another answer, is a nice shortcut that does the loop and combines the results for you, so you don't need to worry about pre-allocating the result object. It's also smart enough to iterate over columns of a data frame by default.
my_means_2 = sapply(my_dataframe, mean, na.rm = TRUE)
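To make the difference concrete, here is a minimal sketch on a made-up data frame (the column names and values are assumptions for illustration, not the asker's data). It shows that the value of a for loop is NULL, while sapply returns a named vector that can be written to a CSV.

```r
# Hypothetical stand-in for my_dataframe
my_dataframe <- data.frame(
  height = c(1.5, 2.5, NA),
  label  = c("x", "y", "z"),
  weight = c(10, 20, 30),
  stringsAsFactors = FALSE
)
x <- colnames(my_dataframe)

# A for loop always evaluates to NULL, whatever happens inside it
loop_value <- for (i in x) {
  suppressWarnings(mean(my_dataframe[, i], na.rm = TRUE))
}
is.null(loop_value)

# sapply collects one result per column; non-numeric columns give NA (with a warning)
my_means <- suppressWarnings(sapply(my_dataframe, mean, na.rm = TRUE))

# One row per variable, written to a temporary CSV file
out <- data.frame(variable = names(my_means), mean = unname(my_means))
csv_path <- tempfile(fileext = ".csv")
write.csv(out, csv_path, row.names = FALSE)
```

Writing a two-column data frame (variable, mean) rather than the bare vector keeps the column names next to their values in the CSV.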

Please give a reproducible example the next time you post a question.
The input below is how I imagine your data looks.
Input:
library(nycflights13)
library(tidyverse)
input <- flights %>% select(origin, air_time, carrier, arr_delay)
input
# A tibble: 336,776 x 4
origin air_time carrier arr_delay
<chr> <dbl> <chr> <dbl>
1 EWR 227. UA 11.
2 LGA 227. UA 20.
3 JFK 160. AA 33.
4 JFK 183. B6 -18.
5 LGA 116. DL -25.
6 EWR 150. UA 12.
7 EWR 158. B6 19.
8 LGA 53. EV -14.
9 JFK 140. B6 -8.
10 LGA 138. AA 8.
# ... with 336,766 more rows
The way I see it, there are 2 ways to do it:
Use summarise_all()
summarise_all() will summarise all your columns, including those that are not numeric.
Method:
input %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 1 x 4
origin air_time carrier arr_delay
<dbl> <dbl> <dbl> <dbl>
1 NA 151. NA 6.90
Warning messages:
1: In mean.default(origin, na.rm = TRUE) :
argument is not numeric or logical: returning NA
2: In mean.default(carrier, na.rm = TRUE) :
argument is not numeric or logical: returning NA
With this method you get a result, plus one warning for each non-numeric column.
Use summarise_if
This summarises only the numeric columns, so you avoid the warnings.
Method:
input %>% summarise_if(is.numeric, funs(mean(., na.rm = TRUE)))
# A tibble: 1 x 2
air_time arr_delay
<dbl> <dbl>
1 151. 6.90
You can then add NA columns for the others.
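If you want the NA entries and no warnings in one step, a base R sketch along these lines might do. The helper name mean_or_na and the small data frame are assumptions made up for illustration; they are not part of dplyr or of the flights data.

```r
# Hypothetical data standing in for the flights columns above
input <- data.frame(
  origin    = c("EWR", "LGA", "JFK"),
  air_time  = c(227, NA, 160),
  carrier   = c("UA", "UA", "AA"),
  arr_delay = c(11, 20, 33),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean for numeric columns, NA for everything else,
# so mean() is never called on a character column and no warning is raised
mean_or_na <- function(col) {
  if (is.numeric(col)) mean(col, na.rm = TRUE) else NA_real_
}

# vapply guarantees exactly one numeric value per column
col_means <- vapply(input, mean_or_na, numeric(1))
```

The result is a named numeric vector with one entry per original column, in the original column order.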

You can use lapply or sapply for this sort of thing. e.g.
sapply(my_dataframe, mean)
will get you all the means. You can also give it your own function e.g.
sapply(my_dataframe, function(x) sum(x^2 + 2)/4 - 9)
If not all variables are numeric, you can use summarise_if from dplyr to get the results just for the numeric columns.
require(dplyr)
my_dataframe %>%
summarise_if(is.numeric, mean)
Without dplyr, you could do
sapply(my_dataframe[sapply(my_dataframe, is.numeric)], mean)

Related

is_quosure(x) error when forwarding ... inside map

I would like to define a wrapper to an inner function.
The idea is to repeat random sampling that uses one of the base r* functions (e.g. runif, rnorm, etc.) and let the user easily change this inner function and define custom ones.
The code below is a reproducible example that I cannot make work with tidyeval patterns and, more precisely, within a purrr::map. The evaluation of ... does not seem to happen properly. I am missing something about quosure evaluation but I cannot figure out what. I also show below a workaround that works fine with good old replicate.
I would like to implement such behaviour in more complex cases and, more generally, I would be grateful for any pointers and would like to understand why the following does not work.
# use the tidyverse and define a dummy tibble
library(tidyverse)
df <- tibble(col1=seq(10, 50, 10), col2=col1+5)
# first random function, on top of stats::runif
random_1 <- function(x, min, max){
x %>%
rowwise() %>%
mutate(new=runif(1, min={{min}}, max={{max}})) %>%
ungroup()
}
# second random function, on top of stats::rnorm
random_2 <- function(x, mean, sd){
x %>%
rowwise() %>%
mutate(new=rnorm(1, mean={{mean}}, sd={{sd}})) %>%
ungroup()
}
# at top level, everything works fine
> df %>% random_1(min=col1, max=col2)
> df %>% random_2(mean=col1, sd=col2)
# So far so good
# now we wrap it for a single shot
random_fun <- function(x, random_fun, ...){
random_fun(x, ...)
}
random_fun(df, random_1, min=col1, max=col2)
# Still fine.
# Here comes the trouble:
random_fun_k <- function(df, k, random_fun, ...){
map(1:k, ~random_fun(df, ...))
}
random_fun_k(df, k=2, random_1, min=col1, max=col2)
Error in is_quosure(x) : argument "x" is missing, with no default
The following workaround based on replicate works fine, yet I would like to stick to the tidyeval spirit:
random_fun_k_oldie <- function(df, k, random_fun, ...){
f <- random_fun(df, ...)
replicate(k, f, simplify=FALSE)
}
random_fun_k_oldie(df, k=2, random_1, min=col1, max=col2)
random_fun_k_oldie(df, k=2, random_2, mean=col1, sd=col2)
It may be better to use an ordinary anonymous function, i.e. function(x), instead of the ~ lambda:
library(purrr)
random_fun_k <- function(df, k, random_fun, ...){
map(seq_len(k), function(x) random_fun(df, ...))
}
Testing:
> random_fun_k(df, k=2, random_1, min=col1, max=col2)
[[1]]
# A tibble: 5 × 3
col1 col2 new
<dbl> <dbl> <dbl>
1 10 15 12.6
2 20 25 21.4
3 30 35 34.1
4 40 45 40.7
5 50 55 53.8
[[2]]
# A tibble: 5 × 3
col1 col2 new
<dbl> <dbl> <dbl>
1 10 15 13.1
2 20 25 24.2
3 30 35 33.8
4 40 45 41.6
5 50 55 50.9
NOTE: The function name and the argument name are both random_fun, which could cause some confusion as well (though it is not the source of the error). It may be better to give the function argument a different name:
random_fun <- function(x, rn_fun, ...){
rn_fun(x, ...)
}
purrr's lambdas support positional arguments with the ..1, ..2, etc. syntax. This is implemented via the ... mechanism. Because of this, the ... inside a ~ lambda refers to the lambda's own arguments rather than to the ... of random_fun_k, so you're not passing the correct arguments to random_fun.
The solution is to use a normal lambda function as akrun suggested. Maybe you could use the \(x) x syntax of R 4.1.
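The shadowing can be sketched in base R without purrr. This is only an analogy, not purrr's actual implementation, and the function names outer_shadowed and outer_forwarded are made up for illustration: an inner function that declares its own ... hides the outer one, whereas an inner function with a named argument leaves ... pointing at the outer call.

```r
# Inner function declares its own `...`, much as a ~ lambda does:
# the outer arguments (min, max) never reach list(...)
outer_shadowed <- function(...) {
  lapply(1:2, function(...) list(...))  # inner `...` holds only the mapped element
}

# Inner function takes a named argument, so `...` is looked up
# in the enclosing call and the outer arguments are forwarded
outer_forwarded <- function(...) {
  lapply(1:2, function(i) list(...))
}

shadowed  <- outer_shadowed(min = 1, max = 2)
forwarded <- outer_forwarded(min = 1, max = 2)

names(forwarded[[1]])  # "min" "max"
length(shadowed[[1]])  # 1: just the element lapply passed in
```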

How to avoid number rounding when using as.numeric() in R?

I am reading well structured, textual data in R and in the process of converting from character to numeric, numbers lose their decimal places.
I have tried using round(digits = 2) but it didn't work since I first had to apply as.numeric. At one point, I did set up options(digits = 2) before the conversion but it didn't work either.
Ultimately, I want to get a data.frame whose numbers are exactly the same as the ones seen as characters.
I looked up for help here and did find answers like this, this, and this; however, none really helped me solve this issue.
How can I prevent number rounding when converting from character to numeric?
Here's a reproducible piece of code I wrote.
library(purrr)
my_char = c(" 246.00 222.22 197.98 135.10 101.50 86.45
72.17 62.11 64.94 76.62 109.33 177.80")
# Break characters between spaces
my_char = strsplit(my_char, "\\s+")
head(my_char, n = 2)
#> [[1]]
#> [1] "" "246.00" "222.22" "197.98" "135.10" "101.50" "86.45"
#> [8] "72.17" "62.11" "64.94" "76.62" "109.33" "177.80"
# Convert from characters to numeric.
my_char = map_dfc(my_char, as.numeric)
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 NA
#> 2 246
# Delete first value because it's empty
my_char = my_char[-1,1]
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 246
#> 2 222.
It's just how R displays data in a tibble.
The function map_dfc is not rounding your data; the tibble print method only shows about three significant digits by default.
If you want to print the data with the usual format, use as.data.frame, like this:
head(as.data.frame(my_char), n = 4)
#>       V1
#> 1 246.00
#> 2 222.22
#> 3 197.98
#> 4 135.10
Showing that your data has not been rounded.
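The same point can be checked in base R alone. The sketch below uses a shortened version of the input string (an assumption for brevity) and shows that the stored values keep their decimals, and that formatting is a separate display step.

```r
# Shortened version of the input string from the question
my_char <- " 246.00 222.22 197.98 135.10"

# trimws() drops the leading space, so no empty first element appears
parts <- strsplit(trimws(my_char), "\\s+")[[1]]
nums  <- as.numeric(parts)

# The decimals survive the conversion
isTRUE(all.equal(nums[2], 222.22))

# Display formatting is applied afterwards and does not change nums
shown <- format(nums, nsmall = 2)
```

Settings such as options(digits = 2) likewise only change how numbers are printed, not what is stored.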
Hope this helps.

R: `Error in f(x): could not find function "f"` when trying to use column of functions as argument in a tibble

I'm experimenting with using functions in dataframes (tidyverse tibbles) in R and I ran into some difficulties. The following is a minimal (trivial) example of my problem.
Suppose I have a function that takes in three arguments: x and y are numbers, and f is a function. It performs f(x) + y and returns the output:
func_then_add = function(x, y, f) {
result = f(x) + y
return(result)
}
And I have some simple functions it might use as f:
squarer = function(x) {
result = x^2
return(result)
}
cuber = function(x) {
result = x^3
return(result)
}
Done on its own, func_then_add works as advertised:
> func_then_add(5, 2, squarer)
[1] 27
> func_then_add(6, 11, cuber)
[1] 227
But let's say I have a dataframe (tidyverse tibble) with two columns for the numeric arguments, and one column for which function I want:
library(tidyverse)
library(magrittr)
test_frame = tribble(
~arg_1, ~arg_2, ~func,
5, 2, squarer,
6, 11, cuber
)
> test_frame
# A tibble: 2 x 3
arg_1 arg_2 func
<dbl> <dbl> <list>
1 5 2 <fn>
2 6 11 <fn>
I then want to make another column result that is equal to func_then_add applied to those three columns. It should be 27 and 227 like before. But when I try this, I get an error:
> test_frame %>% mutate(result=func_then_add(.$arg_1, .$arg_2, .$func))
Error in f(x) : could not find function "f"
Why does this happen, and how do I get what I want properly? I confess that I'm new to "functional programming", so maybe I'm just making an obvious syntax error ...
Not the most elegant but we can do:
test_frame %>%
mutate(Res= map(seq_along(.$func), function(x)
func_then_add(.$arg_1, .$arg_2, .$func[[x]])))
EDIT: The above passes the entire arg_1 and arg_2 columns on every iteration, which isn't really what OP wants. As suggested by @January, this is better written as:
Result <- test_frame %>%
mutate(Res= map(seq_along(.$func), function(x)
func_then_add(.$arg_1[x], .$arg_2[x], .$func[[x]])))
Result$Res
The above is still not ideal since it returns a list. A better alternative (again as suggested by @January) is to use map_dbl, which returns a double vector:
test_frame %>%
mutate(Res= map_dbl(seq_along(.$func), function(x)
func_then_add(.$arg_1[x], .$arg_2[x], .$func[[x]])))
# A tibble: 2 x 4
arg_1 arg_2 func Res
<dbl> <dbl> <list> <dbl>
1 5 2 <fn> 27
2 6 11 <fn> 227
This happens because mutate calls the function once and supplies the whole columns as arguments, so you need to map over the rows instead.
The second problem is that test_frame$func[1] is not a function, but a list with one element. You can't have "function" columns, only list columns.
Try this:
test_frame$result <- with(test_frame,
map_dbl(1:2, ~ func_then_add(arg_1[.], arg_2[.], func[[.]])))
Result:
# A tibble: 2 x 4
arg_1 arg_2 func result
<dbl> <dbl> <list> <dbl>
1 5 2 <fn> 27
2 6 11 <fn> 227
EDIT: a simpler solution using dplyr, mutate and rowwise:
test_frame %>% rowwise %>% mutate(res=func_then_add(arg_1, arg_2, func))
Quite frankly, I am slightly puzzled by this last one. Why func and not func[[1]]? func should be a list, and not function. mutate and rowwise are doing here something sinister, like automatically converting a list to a vector.
Edit 2: actually, this is written explicitly in the rowwise manual:
Its main impact is to allow you to work with list-variables in
‘summarise()’ and ‘mutate()’ without having to use ‘[[1]]’.
Final edit: I became so fixated on tidyverse recently that I did not think of the simplest option – using base R:
apply(test_frame, 1, function(x) func_then_add(x$arg_1, x$arg_2, x$func))
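Another base R route is mapply, which pairs up the i-th elements of each argument and pulls list elements out with [[ for you. The sketch below uses a plain data.frame named test_df as a stand-in for the tibble (an assumption, so that it runs without the tidyverse) and compact versions of the question's functions.

```r
# Compact versions of the functions from the question
func_then_add <- function(x, y, f) f(x) + y
squarer <- function(x) x^2
cuber   <- function(x) x^3

# Plain data frame with a list column of functions (stand-in for test_frame)
test_df <- data.frame(arg_1 = c(5, 6), arg_2 = c(2, 11))
test_df$func <- list(squarer, cuber)

# Passing the whole list column fails, because a list is not a function;
# the error message is captured instead of stopping the script
whole_column <- tryCatch(
  func_then_add(test_df$arg_1, test_df$arg_2, test_df$func),
  error = function(e) conditionMessage(e)
)

# mapply calls func_then_add once per row with scalar x, scalar y and one function
test_df$result <- mapply(func_then_add, test_df$arg_1, test_df$arg_2, test_df$func)
test_df$result  # 27 227
```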

How does ddply split the data?

I have this data frame.
mydf<- data.frame(c("a","a","b","b","c","c"),c("e","e","e","e","e","e")
,c(1,2,3,10,20,30),
c(5,10,20,20,15,10))
colnames(mydf)<-c("Model", "Class","Length", "Speed")
I'm trying to get a better understanding of how ddply works.
I'd like to get the average length and speed for each pairing of model and class.
I know this is one way to do it: ddply(mydf, .(Model, Class), .fun = summarize, mSpeed = mean(Speed), mLength = mean(Length)).
I wonder if I can get the mean using ddply and without specifying it one at a time.
I tried ddply(mydf, .(Model, Class), .fun = mean) but I get this warning:
Warning messages: 1: In mean.default(piece, ...) : argument is not numeric or logical: returning NA
What does ddply pass on to the function argument? Is there a way to apply one function to every column using ddply?
My goal is to learn more about ddply. I will only accept answers with ddply.
Here's a solution using dplyr and the summarize function.
library(dplyr)
mydf<- data.frame(c("a","a","b","b","c","c"),c("e","e","e","e","e","e")
,c(1,2,3,10,20,30),
c(5,10,20,20,15,10))
colnames(mydf)<-c("Model", "Class","Length", "Speed")
#summarize data by Model & Class
mydf %>% group_by(Model, Class) %>% summarize_if(is.numeric, mean)
#> # A tibble: 3 x 4
#> # Groups: Model [3]
#> Model Class Length Speed
#> <fct> <fct> <dbl> <dbl>
#> 1 a e 1.5 7.5
#> 2 b e 6.5 20
#> 3 c e 25 12.5
Created on 2019-04-16 by the reprex package (v0.2.1)
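On the question of what ddply hands to the function: as the warning's mention of piece suggests, each piece is a whole data frame (the rows of one Model/Class combination), and mean() cannot average a data frame. The base R sketch below imitates that split-apply-combine flow; it is an approximation for illustration, not plyr's implementation.

```r
mydf <- data.frame(
  Model  = c("a", "a", "b", "b", "c", "c"),
  Class  = c("e", "e", "e", "e", "e", "e"),
  Length = c(1, 2, 3, 10, 20, 30),
  Speed  = c(5, 10, 20, 20, 15, 10)
)

# Split: one data frame per Model/Class combination
pieces <- split(mydf, list(mydf$Model, mydf$Class), drop = TRUE)
class(pieces[[1]])  # "data.frame"

# Apply: column means of the numeric columns of each piece
piece_means <- lapply(pieces, function(piece) {
  num <- piece[sapply(piece, is.numeric)]
  data.frame(Model = piece$Model[1], Class = piece$Class[1],
             as.list(colMeans(num)))
})

# Combine: stack the one-row results
result <- do.call(rbind, piece_means)
rownames(result) <- NULL

# With plyr loaded, its numcolwise() helper should express the same idea:
# ddply(mydf, .(Model, Class), numcolwise(mean))
```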

Renaming doesn't work for column names starting with two dots

I updated my tidyverse and my read_excel() function (from readxl) has also changed. Columns without titles are now called ..1, ..2 and so on, when they used to be called X__1, X__2.
I'm trying to rename() these columns starting with two dots, but I'm getting an error message.
Here's an example:
library(tidyverse)
df <- tibble(a = 1:3,
..1 = 4:6)
df <- df %>%
rename(b = ..1)
Throws the error:
Error in .f(.x[[i]], ...) :
..1 used in an incorrect context, no ... to look in
I get the same error if I use backticks around the name: rename(b = `..1`).
..1 is a reserved word in R. See help("reserved") and help("..1"). Try quoting it:
df %>% rename(b = "..1")
giving:
# A tibble: 3 x 2
a b
<int> <int>
1 1 4
2 2 5
3 3 6
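A base R alternative is to edit the names vector directly. Names are plain strings there, so ..1 is never evaluated as code. The sketch uses a small data.frame built for illustration rather than real read_excel() output.

```r
# Build a data frame and give its second column the awkward name
df <- data.frame(a = 1:3, x = 4:6)
names(df)[2] <- "..1"

# Rename by matching the name as a string
names(df)[names(df) == "..1"] <- "b"
names(df)  # "a" "b"
```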
The janitor package has a very handy function clean_names for tasks like this. In this case, it replaces any .. that come from readxl with x. I added another .. column to show how the replacement works.
library(tidyverse)
df <- tibble(a = 1:3,
..1 = 4:6,
..5 = 10:12)
df %>%
janitor::clean_names()
#> # A tibble: 3 x 3
#> a x1 x5
#> <int> <int> <int>
#> 1 1 4 10
#> 2 2 5 11
#> 3 3 6 12
It seems like the naming setup in readxl is a topic of debate: see this issue, among others on the best way to convert unusable names from Excel sheets. There's also a vignette on it. To be honest, the last couple times I've needed to mess with readxl names, I just passed the data frame to janitor.
