R - automate exclusion of groups from the split_quantile function

I have a dataframe that looks like this:
Var1 Var2 Var3
100 B 15
200 A 16
700 A 13
500 C 10
This is just preview data; the actual dataframe has 10,000+ rows.
I am doing the following:
data %>%
  group_by(Var2) %>%
  mutate(Tercile = fabricatr::split_quantile(Var3, 3)) %>%
  group_by(Var2, Tercile) %>%
  summarise(Var1 = mean(Var1))
This results in the following error message:
The `x` argument provided to quantile split must be non-null and length at least 2.
As far as I understand, this means that for some values of Var2 there is only 1 unique value of Var3 and the tercile split cannot be accomplished. My first question is: Is this interpretation correct? I am confused by the part that says "length at least 2" because I expect that length should be at least 3 to perform a tercile split, right?
If the interpretation is correct, my second question is: How to automate the exclusion of such cases? I don't have nearly enough time to go through some 300 values of Var2 and examine the values of Var3. I need a coding solution that excludes such levels of Var2, so that the error mentioned previously doesn't appear.

As the error message says, split_quantile needs a vector of length at least 2, so we can remove the groups that have fewer than 2 rows and then apply the function:
library(dplyr)
data %>%
  group_by(Var2) %>%
  filter(n() >= 2) %>%
  mutate(Tercile = fabricatr::split_quantile(Var3, 3)) %>%
  group_by(Var2, Tercile) %>%
  summarise(Var1 = mean(Var1))
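If you also want to guard against groups where a three-way split isn't meaningful (the "length at least 2 vs. 3" concern in the question), a variation is to require at least three distinct values of Var3 per group. This is a minimal sketch reusing data/Var2/Var3 from the question; note that heavily tied data could still produce non-unique cut points:
library(dplyr)

data %>%
  group_by(Var2) %>%
  # keep only groups with at least 3 distinct values of Var3,
  # so a tercile split has something to work with
  filter(n_distinct(Var3) >= 3) %>%
  mutate(Tercile = fabricatr::split_quantile(Var3, 3)) %>%
  group_by(Var2, Tercile) %>%
  summarise(Var1 = mean(Var1))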


Is there an equivalent of COUNTIF in R?

I have some forestry data I want to work with. There are two variables in question for this portion of the data frame:
species
status (0 = alive, 2 = dead, 3 = ingrowth, 5 = grew with another tree)
MY GOAL is to count the number of trees that are 0 or 3 (the live trees) and create a tibble with species and number present as columns.
I have tried:
spp_pres_n <- plot9 %>% count(spp, status_2021, sort = TRUE)
Which gives a tibble of every species with each status. But I need a condition that selects only status 0 and 3 to be counted. Would if_else or a simple if statement then count suffice?
A simple way with dplyr:
plot9 %>%
  filter(status_2021 %in% c(0, 3)) %>%
  count(spp, status_2021, sort = TRUE)
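Since the stated goal is one row per species, a small variation (a sketch, using the plot9/spp/status_2021 names from the question) counts both live statuses together rather than separately:
library(dplyr)

plot9 %>%
  filter(status_2021 %in% c(0, 3)) %>%      # keep live trees only
  count(spp, name = "n_live", sort = TRUE)  # one row per species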

Tidyr: pivot_wider error: Can't convert <double> to <list>

I have a dataframe which lists species observations across multiple survey plots (the data is here). I'm trying to use tidyr's pivot_wider to spread that abundance data across several columns, with the new columns being each of the observed species. Here's the line of code I'm trying to use to do that:
data %>% pivot_wider(names_from = Species, values_from = Total.Abundance, values_fill = 0)
However, this gives me two error messages:
Error: Can't convert <double> to <list>.
Values are not uniquely identified; output will contain list-cols.
I'm not sure what the issue is, because this has worked fine for several other dataframes that are (seemingly) identical to this one. I've tried googling the first error message and have not been able to find what conditions cause it; I don't know which double R is trying to convert to a list, nor why it's trying to convert to a list at all. The Total.Abundance column should be integers, but I wonder if somehow it's a double data type?
From what I've been able to find, the second error message appears when there are identical rows in the dataframe. However, the error persists when I modify my statement to
unique(data) %>% pivot_wider(names_from = Species, values_from = Total.Abundance, values_fill = 0)
Which I would have thought would remove duplicate rows.
Any help would be much appreciated!
Expanding on my comment, there are duplicates in your data that cannot be removed by unique() or, in dplyr, distinct():
dat %>%
  distinct() %>%
  group_by(Plot.ID, Species) %>%
  count()
# Plot.ID Species n
# <dbl> <chr> <int>
# 1 1 Calliopius 1
# 2 1 Idotea 2
# 3 1 Lacuna vincta 2
# 4 1 Mitrella lunata 2
# 5 1 Podoceropsis nitida 1
# 6 1 Unk. Amphipod 1
# 7 1 Unk. Bivalve 1
# 8 2 Calliopius 1
# 9 2 Caprella penantis 1
#10 2 Corophium insidiosum 1
You need to find out why you have duplicates like this and reconcile them, say by summing them up. The duplicates might come from bugs in earlier data wrangling, in which case summing is not necessarily suitable. Or, if you sampled the same plot twice, you might want the mean instead of the sum to normalize for sampling effort, or perhaps an extra column indicating sampling effort. In any case, this works:
dat %>%
  group_by(Plot.ID, Species) %>%
  summarise(abundance = sum(Total.Abundance)) %>%
  tidyr::pivot_wider(names_from = Species, values_from = abundance,
                     values_fill = 0)
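Alternatively, pivot_wider() can perform the aggregation itself via its values_fn argument, which also avoids the list-column warning; this sketch assumes summing duplicates is the right reconciliation for your data:
library(tidyr)

dat %>%
  pivot_wider(names_from = Species, values_from = Total.Abundance,
              values_fn = sum,    # collapse duplicated Plot.ID/Species pairs
              values_fill = 0)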

R code not working to merge data (unduplicate) in columns based on identical values

I am trying to do some data wrangling on a tibble (dataframe) using dplyr, to unduplicate records, where if an id shows up twice, the resulting record will contain the same values if they are all identical, or an NA if there is a discrepancy in one of the records. For example, if I have df:
id date amount tag
--- ---- ------ ---
1 2018-01-03 10 big
2 2019-01-16 20 small
3 2020-01-05 30 big
3 2001-03-04 30 big
1 2018-01-03 5 big
The result should look like:
id date amount tag
--- ---- ------ ---
1 2018-01-03 NA big
2 2019-01-16 20 small
3 NA 30 big
Based on other answers I've found on stack overflow, I have tried various methods of using summarise_all including:
new_df <- df %>% group_by(id) %>% summarise_all(function(x) ifelse(all(x[1] == x),x[1],NA))
new_df <- df %>% group_by(id) %>% summarise_all(list(~ if(all(.[1] == .)) .[1] else NA))
new_df <- df %>% group_by(id) %>% summarise_all(funs(if(all(.[1] == .)) .[1] else NA))
Since I was able to use ifelse(all(x[1] == x),x[1],NA) on its own with a vector and it worked fine, I thought that would work with summarise_all. But when I use it with summarise_all or the other variants I show above, I get the error:
Error in summarise_impl(.data, dots): Column `date` can't promote group 2 to character
I suspect I just need to make a small tweak to my code to get it to work, but I've been working on this all day, and I don't know why it isn't working... So any help that the community can provide would be appreciated. This is the first time I've actually asked a question on stack overflow, because I almost always can find the answer from other people's questions :-) Thank you so much for any help!
First, the solution:
d %>%
  group_by(id) %>%
  summarise_all(~ if (n_distinct(.) == 1) first(.) else c(NA, .)[1])
This is actually a little tricky. You'd think one could write simply:
d %>%
  group_by(id) %>%
  summarise_all(~ if (n_distinct(.) == 1) first(.) else NA)
Which is just an alternative to your if (all ...) ... else ..., using some more dplyr functions.
However, dplyr doesn't accept a bare NA here; you need to be type-specific, e.g. provide NA_character_ or NA_integer_ to match the column's data type. This is why your code is failing: the error says that group 2 (i.e. id == 2 in this case) is failing to be "promoted" to character, meaning the logical NA produced for the date column isn't being coerced to character, so the new column can't be created.
Since you don't want to spell out all the correct NA types, I use a little trick here: combining an NA with the original variable via c(NA, .)[1] coerces that NA to the correct type, which I then take. You can probably use other tricks to get a correctly typed NA too.
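For what it's worth, summarise_all() has since been superseded by across(); here is a sketch of the same idea in the newer idiom, where .x[NA_integer_] is another way to get an NA of the column's own type:
library(dplyr)

d %>%
  group_by(id) %>%
  summarise(across(everything(),
                   ~ if (n_distinct(.x) == 1) first(.x) else .x[NA_integer_]))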

Dplyr/Lubridate: How to summarise overlapping intervals after grouping

I would like to group agreements and then compare how much their periods overlap (or are apart).
My dataframe may look like:
library(tidyverse)
library(lubridate)
dft <- tribble(
  ~ShipTo, ~Code, ~Start, ~End,
  "xxxx", "AAA11", "2018-01-01", "2018-03-01",
  "yyyy", "BBB23", "2018-02-01", "2018-05-11",
  "yyyy", "BBB23", "2018-03-01", "2018-06-11",
  "cccc", "AAA11", "2018-01-06", "2018-03-12",
  "yyyy", "CCC04", "2018-01-16", "2018-03-31",
  "xxxx", "DDD", "2018-01-21", "2018-03-25"
) %>%
  mutate(Start = ymd(Start), End = ymd(End))
I would like to mutate a column to create lubridate periods and evaluate them after grouping by ShipTo and Code. What I tried was:
dft3 <- dft %>% filter(concat1 %in% to_filter2) %>%
  arrange(ShipTo, Code) %>%
  group_by(ShipTo, Code) %>%
  mutate(period = interval(Start, End),
         nextperiod = interval(lead(Start), lead(End)),
         interv = day(as.period(intersect(period, nextperiod), "days"))) %>%
  group_by(ShipTo, Code) %>%
  summarise(count = n(),
            intervmax = max(interv),
            intervmin = min(interv))
If I remove the line group_by(ShipTo, Code) %>%, the intervals are created correctly and the lead intervals are correctly calculated from the next row. But when I naively use group_by, the intervals are not calculated correctly. I suspect that my dataframe should perhaps be split into many tables by group and, after creating and comparing the intervals, glued back together.
Is there a succinct way to do it? Or perhaps there is a simpler way I have not yet learned? Thank you in advance for the hint in the right direction.
EDIT: The desired output should be a column with the overlap of intervals in days (or the distance between intervals if there is no overlap). Grouping destroys the calculation. I would like these values calculated within groups (not across them).
EDIT2: I am trying to solve the problem by splitting the dataframe into a list of dataframes and then combining them, but I am not sure of the syntax. It does not quite work and produces tables with one column; this is help I was given on another forum (perhaps it can illustrate the issue). The idea is to split the database, create the new columns, and combine the tables back into a single table.
fnOverlaps <- function(x) {
  mutate(x, okres = interval(Start, End),
         nastokres = interval(lead(Start), lead(End)),
         interv = day(as.period(intersect(okres, nastokres), "days")))
}
dft3 <- dft3 %>%
  split(list(.$ShipTo, .$Code), drop = TRUE) %>%
  map_df(fnOverlaps) %>%
  flatten_dfr()
The result (for one group) that I expect would look like this:
tribble(
  ~ShipTo, ~Code, ~interv,
  "yyyy", "BBB23", 70, # say there is a 70-day overlap
  "yyyy", "BBB23", NA  # there is no next row to compare
)
It looks like the issue is caused by trying to combine vectors of class "Interval": they appear to get converted to numeric and lose their inherent information.
I think the only viable solution is to split the data.frame, run the analysis on each component separately with lapply, then bring them back together with bind_rows. Groups with only one entry present an issue, as max and min return -Inf and Inf when the argument is empty after removing NAs. But that is easy enough to correct for.
This code should work. Note that I am using group_by to ensure the ShipTo/Code columns are kept, though you could do that in other ways.
dft %>%
  split(paste(.$ShipTo, "XXX", .$Code)) %>%
  lapply(function(x){
    x %>%
      arrange(ShipTo, Code) %>%
      mutate(period = interval(Start, End),
             nextperiod = interval(lead(Start), lead(End)),
             interv = day(as.period(intersect(period, nextperiod), "days"))) %>%
      group_by(ShipTo, Code) %>%
      summarise(count = n(),
                intervmax = max(interv, na.rm = TRUE),
                intervmin = min(interv, na.rm = TRUE)) %>%
      ungroup()
  }) %>%
  bind_rows() %>%
  mutate(intervmax = ifelse(is.infinite(intervmax), NA, intervmax),
         intervmin = ifelse(is.infinite(intervmin), NA, intervmin))
Returns
# A tibble: 5 x 5
ShipTo Code count intervmax intervmin
<chr> <chr> <int> <dbl> <dbl>
1 cccc AAA11 1 NA NA
2 xxxx AAA11 1 NA NA
3 xxxx DDD 1 NA NA
4 yyyy BBB23 2 71.0 71.0
5 yyyy CCC04 1 NA NA
I am putting this here just for the record. I received an answer from Jake Knaupp on the r4ds Slack group using the modern map_df() syntax; it calculates the overlap of periods, but it converts the periods to numeric, and there are a bunch of warnings that it will do so.
myFun <- function(x) {
  mutate(x, period = interval(Start, End),
         nextperiod = interval(lead(Start), lead(End)),
         interv = day(as.period(intersect(period, nextperiod), "days")))
}
df %>%
  split(list(.$ShipTo, .$Code), drop = TRUE) %>%
  map_df(myFun)
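Another way around the underlying problem (Interval vectors losing their class when groups are combined) is to skip the Interval class altogether and compute the overlap in plain numeric days, which keeps group_by()/mutate() well-behaved. A sketch, assuming Start/End are Dates as in the tribble above; negative values indicate a gap between agreements rather than an overlap:
library(dplyr)

dft %>%
  arrange(ShipTo, Code, Start) %>%
  group_by(ShipTo, Code) %>%
  # overlap (in days) with the next agreement in the same group:
  # NA for the last row of a group, negative when the periods are apart
  mutate(interv = as.numeric(pmin(End, lead(End)) - pmax(Start, lead(Start)),
                             units = "days")) %>%
  ungroup()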

R: Multiple levels of grouping in dplyr::summarise using lapply

I'm relatively new to R and I seem to be having trouble applying a list of criteria to a data.frame I'm trying to summarize. I've been reading a bunch of different posts, but they only seem to be concerned with one level of grouping, not a second one.
Assuming my df looks like this (my actual data frame is much larger: there are 35 different Codes and about 20 different Colors):
Code Color Value
[1] A Red 10
[2] A Blue 15
[3] A Red 5
[4] B Green 20
[5] B Red 15
[6] C Green 10
Ideally, I'd like to create a summary table which enables me to group the data by Code (I've been successful doing this with group_by and split), but I'd also like to create a sum of values by the "Color" criteria. Currently, I've only been able to accomplish this by running the criteria one by one.
So far I've been able to do this:
# this gives me the total value by each code, like a pivot or a sumif
dfsummary <- df %>% group_by(Code) %>% summarise(total = sum(Value))

# then I was able to come up with this to give me, by Code, value by Color
dfsummary2 <- df %>% filter(Color == "Red") %>% group_by(Code) %>%
  summarise(sumRed = sum(Value))
The results in dfsummary2 are:
Code sumRed
[1] A 15
[2] B 15
[3] C 0
What I'd like to accomplish is creating a data frame for all "Color" without having to specify each one individually.
My desired output, let's call it dfsummaryall, looks like:
Code sumRed sumBlue sumGreen
[1] A 15 15 0
[2] B 15 0 20
[3] C 0 0 10
This is where I get stumped. I can run each one individually and then merge them into one table, but I'd like to find a way to work in an apply function (lapply, I would think). This is where I'm definitely a novice.
My attempt so far, and this is where I'm sure I'm egregiously wrong, goes like this:
colors <- c("Red","Blue","Green")
dfsummaryall <- lapply(colors, function(x){
  dftmp <- df %>%
    dplyr::filter(Color == x) %>%
    group_by(Code) %>%
    summarise(x == sum(MktValue))
})
I know there's definitely a problem here in the "summarise(x == sum(MktValue)" part, but I'm really stumped as to how to pull this off.
Any help would be truly appreciated!
From user duckmayr in the comments:
df %>% group_by(Code, Color) %>% summarise(Sum = sum(Value)) %>% tidyr::spread(Color, Sum, fill = 0)
This worked perfectly for my purposes.
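For readers on current tidyr, spread() is superseded by pivot_wider(); here is a sketch of the equivalent call, where names_prefix simply recreates the sumRed/sumBlue/sumGreen column names from the desired output:
library(dplyr)

df %>%
  group_by(Code, Color) %>%
  summarise(Sum = sum(Value), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = Color, values_from = Sum,
                     values_fill = 0, names_prefix = "sum")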
