summarise function in R - r

I am trying to create a R database including some numerical variable.
While doing this, I made a typing mistake whose result looks weird to me and I would like to understand why (for sure I am missing something, here).
I have tried to look around for possible explanation but haven' t found what I am looking for.
library("dplyr")
library("tidyr")
data <-
data.frame(FS = c(1), FS_name = c("Armenia"), Year = c(2015), class =
c("class190"), area_1000ha = c(66.447)) %>%
mutate(FS_name = as.character(FS_name)) %>%
mutate(Year = as.integer(Year)) %>%
mutate(class = as.character(class)) %>%
tbl_df()
data
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = TRUE)) %>%
ungroup()
As you can see, the mistake is
rm.na=
rather than
na.rm=
When I type correctly, I have the right result on area_1000ha variable (10.5).
If I don't - i.e. keeping rm.na= I get 11.5, instead (+1, in fact).
What am I missing?

I think rm.na=TRUE is added to the sum, and as TRUE is considered as 1, it sums your initial sum and 1.
If you change TRUE to 2 for example
x <- data %>%
group_by(FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = 2)) %>%
ungroup()
The result is
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 12.5

There is no function in R as rm.na hence R is considering it as a variable which has value TRUE i.e. 1.
Try keeping it na.rm = T and you will get the right result.
Even if you change the name of the variable
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, tester = TRUE)) %>%
ungroup()
I have replaced rm.na with tester variable.
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 11.5

Related

Standard deviation of average events per ID in R

Background
I've got this dataset d:
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
It's got 2 people (IDs) in it, and they each have some events.
The problem
I'm trying to get an average number (count) of events per person, along with a standard deviation for that average, all in one result (it can be a dataframe or not, doesn't matter).
In other words I'm looking for something like this:
| Mean | SD |
|------|------|
| 4.00 | 2.83 |
What I've tried
I'm not far off, I don't think -- it's just that I've got 2 separate pieces of code doing these calculations. Here's the mean:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event))
# A tibble: 1 x 1
ratio
<dbl>
1 4
And here's the SD:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(sd = sd(event))
# A tibble: 1 x 1
sd
<dbl>
1 2.83
But I when I try to pipe them together like so...
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event)) %>%
summarise(sd = sd(event))
... I get an error:
Error in `h()`:
! Problem with `summarise()` column `sd`.
i `sd = sd(event)`.
x object 'event' not found
Any insight?
You have to put the last two calls to summarise() in the same call. The only remaining columns after summarise() will be those you named and the grouping columns, so after your second summarise, the event column no longer exists.
library(dplyr)
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
d %>%
group_by(ID) %>%
# the next summarise will be within ID
summarise(event = length(event)) %>%
# this summarise is overall
summarise(sd = sd(event),
ratio = mean(event))
#> # A tibble: 1 × 2
#> sd ratio
#> <dbl> <dbl>
#> 1 2.83 4
The code is a bit confusing because you are renaming the event variable, and doing the first summarise() within groups and the second without grouping. This code would be a little easier to read and get the same result:
d %>%
count(ID) %>%
summarise(sd = sd(n),
ratio = mean(n))
Created on 2022-05-25 by the reprex package (v2.0.1)

Calculating average rle$lengths over grouped data

I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
ID
count
avg_length
1 L0
2
2
2 L1
2
2
L0 is correct, L1 has no instances of state == 1 so the average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
As it is a data.frame column, we may need to unnest afterwards
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do code, the column that should be extracted should be not from the whole data, but from the data coming fromt the lhs i.e. . (Note that do is kind of deprecated. So it may be better to make use of the summarise with unnest/unpack
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN

Pivot_longer on integer and factor

I have a dataset that looks like the following.
# A tibble: 1 x 4
hhm1q001 hhm2q001 hhm1q002 hhm2q002
<chr> <chr> <int> <int>
1 blue red 30 50
I have been trying to transform it to long using tidyr::pivot_longer
my expected output looks like this:
hhm q001 q002
<int> <chr> <int>
1 1 blue 30
2 2 red 50
I have tried the following code
HHS_long <- pivot_longer(HHS_all,
cols= starts_with("hhm"), #identifies the column from which to go from wide to long
names_to = ("hhm"), #name(s) of new column(s) created from cols=
values_drop_na = FALSE
)
head(HHS_long)
Unfortunately i get the following error
.Error: No common type for hhm1q101 <factor<b5064>> and hhm1q102 <integer>.
Not sure how to go around this, i get they are not the same class, but i have quite a lot of variable in the dataset and they are definitely of a different class.
Hope this is the correct format to posting.
Thanks for any help
I struggle a little bit when I use the new pivot_longer. Sometimes, I feel that renaming variable names before pivot_longer could make it much easier:
library(tidyverse)
HHS_all <- data.frame(hhm1q001 = "blue", hhm2q001 = "red", hhm1q002 = 30, hhm2q002 = 50)
df <- HHS_all %>%
rename(q001hhm1 = hhm1q001, q001hhm2 = hhm2q001,
q002hhm1 = hhm1q002, q002hhm2 = hhm2q002)
df %>%
pivot_longer(everything(), names_to = c(".value", "hhm"), names_sep = "hhm")
# A tibble: 2 x 3
hhm q001 q002
<chr> <fct> <dbl>
1 1 blue 30
2 2 red 50
I'm not familiar with the pivot_longer of tidyr and whether it would be able to do so. Using a combination of tidyr::separate and the melt and cast verbs from reshape2 I was able to produce your expected output:
df <- data.frame(hhm1q001 = "blue", hhm2q001 = "red", hhm1q002 = 30, hhm2q002= 50 )
df %>%
mutate(id = row_number()) %>%
reshape2::melt(id.vars = "id") %>%
tidyr::separate(variable, into = c("hhm", "q00"), sep = 4) %>%
tidyr::separate(hhm, into = c("prefix", "hhm"), sep = 3) %>%
select(-prefix, -id) %>%
reshape2::dcast(hhm ~ q00)

Parse and Evaluate Column of String Expressions in R?

How can I parse and evaluate a column of string expressions in R as part of a pipeline?
In the example below, I produce my desired column, evaluated. But I know this isn't the right approach. I tried taking a tidyverse approach. But I'm just very confused.
library(tidyverse)
df <- tibble(name = LETTERS[1:3],
to_evaluate = c("1-1+1", "iter+iter", "4*iter-1"),
evaluated = NA)
iter = 1
for (i in 1:nrow(df)) {
df[i,"evaluated"] <- eval(parse(text=df$to_evaluate[[i]]))
}
print(df)
# # A tibble: 3 x 3
# name to_evaluate evaluated
# <chr> <chr> <dbl>
# 1 A 1-1+1 1
# 2 B iter+iter 2
# 3 C 4*iter-1 3
As part of a pipeline, I tried:
df %>% mutate(evaluated = eval(parse(text=to_evaluate)))
df %>% mutate(evaluated = !!parse_exprs(to_evaluate))
df %>% mutate(evaluated = parse_exprs(to_evaluate))
df %>% mutate(evaluated = eval(parse_expr(to_evaluate)))
df %>% mutate(evaluated = parse_exprs(to_evaluate))
df %>% mutate(evaluated = eval(parse_exprs(to_evaluate)))
df %>% mutate(evaluated = eval_tidy(parse_exprs(to_evaluate)))
None of these work.
You can try:
df %>%
rowwise() %>%
mutate(iter = 1,
evaluated = eval(parse(text = to_evaluate))) %>%
select(-iter)
name to_evaluate evaluated
<chr> <chr> <dbl>
1 A 1-1+1 1
2 B iter+iter 2
3 C 4*iter-1 3
Following this logic, also other possibilities could work. Using rlang::parse_expr():
df %>%
rowwise() %>%
mutate(iter = 1,
evaluated = eval(rlang::parse_expr(to_evaluate))) %>%
select(-iter)
On the other hand, I think it is important to quote #Martin Mächler:
The (possibly) only connection is via parse(text = ....) and all good
R programmers should know that this is rarely an efficient or safe
means to construct expressions (or calls). Rather learn more about
substitute(), quote(), and possibly the power of using
do.call(substitute, ......).
Here's a slightly different way that does everything within mutate.
df %>% mutate(
evaluated = pmap_dbl(., function(name, to_evaluate, evaluated)
eval(parse(text=to_evaluate)))
)
# A tibble: 3 x 3
name to_evaluate evaluated
<chr> <chr> <dbl>
1 A 1-1+1 1
2 B iter+iter 2
3 C 4*iter-1 3
Note that values of additional variables (such as iter=1 in your case) can be passed directly to eval():
df %>%
mutate( evaluated = map_dbl(to_evaluate, ~eval(parse(text=.x), list(iter=1))) )
One advantage is that it automatically restricts the scope of the variable, keeping its value right next to where it is used.

filtering within the summarise function of dplyr

I am struggling a little with dplyr because I want to do two things at one and wonder if it is possible.
I want to calculate the mean of values and at the same time the mean for the values which have a specific value in an other column.
library(dplyr)
set.seed(1234)
df <- data.frame(id=rep(1:10, each=14),
tp=letters[1:14],
value_type=sample(LETTERS[1:3], 140, replace=TRUE),
values=runif(140))
df %>%
group_by(id, tp) %>%
summarise(
all_mean=mean(values),
A_mean=mean(values), # Only the values with value_type A
value_count=sum(value_type == 'A')
)
So the A_mean column should calculate the mean of values where value_count == 'A'.
I would normally do two separate commands and merge the results later, but I guess there is a more handy way and I just don't get it.
Thanks in advance.
We can try
df %>%
group_by(id, tp) %>%
summarise(all_mean = mean(values),
A_mean = mean(values[value_type=="A"]),
value_count=sum(value_type == 'A'))
You can do this with two summary steps:
df %>%
group_by(id, tp, value_type) %>%
summarise(A_mean = mean(values)) %>%
summarise(all_mean = mean(A_mean),
A_mean = sum(A_mean * (value_type == "A")),
value_count = sum(value_type == "A"))
The first summary calculates the means per value_type and the second "sums" only the mean of value_type == "A"
You can also give the following function a try:
?summarise_if
(the function family is summarise_all)
Example
The dplyr documentation serves a quite good example of this, i think:
# The _if() variants apply a predicate function (a function that
# returns TRUE or FALSE) to determine the relevant subset of
# columns. Here we apply mean() to the numeric columns:
starwars %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
#> # A tibble: 1 x 3
#> height mass birth_year
#> <dbl> <dbl> <dbl>
#> 1 174. 97.3 87.6
The interesting thing here is the predicate function. This represents the rule by which the columns, that will have to be summarized, are selected.

Resources