Count rows by group in Dplyr: Evaluation error - r

I'm trying to count the number of rows using dplyr after using group_by. I have the following data:
scenario pertubation population
A 1 20
B 1 30
C 1 40
D 1 50
A 2 15
B 2 25
And I'm using the following code to group_by and mutate:
test <- all_scenarios %>%
group_by(scenario) %>%
mutate(rank = dense_rank(desc(population)),
exceedance_probability = rank / count(pertubation)) %>%
select(scenario, pertubation, All.ages, rank, exceedance_probability)
But I keep encoutering this error message and I am unsure of what it means, or why I keep getting it?
Error in mutate_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "c('integer', 'numeric')".
I would like my output data to look something like this:
scenario pertubation population rank exceedance_probability
A 1 20 12 0.06
B 1 30 7 0.035
C 1 40 2 0.01
D 1 50 1 0.005
A 2 15 34 0.17
B 2 25 28 0.14
To calculate the exceedance probability I just need to divide the rank by the number of observations, but I've found it hard to do this in dplyr after a group_by statement. Am I ordering the dplyr statements incorrectly?

We can get the count separately and join with the original dataset
all_scenarios %>%
count(pertubation) %>%
left_join(all_scenarios, ., by = 'pertubation') %>%
group_by(scenario) %>%
mutate(rank = dense_rank(desc(population)), exceedance_probability = rank /n)
Or instead of using count, we can do a second group_by and get the n()
all_scenarios %>%
group_by(scenario) %>%
mutate(rank = dense_rank(desc(population))) %>%
group_by(pertubation) %>%
mutate( exceedance_probability = rank /n())

Your issue comes from the
count(pertubation)
part of the code. You cannot use count in a group_by scenario. I can't find a good explanation why, but it won't work. Just use
n()
in place of it in the code. Since youre grouping by scenario, and each scenario-pertubation is unique in your dataset, by counting the number of rows in each scenario you are effectively counting the number of values or pertubation for each scenario.

Related

How to group data to specific ranges and mutate vectors in R?

I am trying to organize and mutate my data in R.
Essentially I am trying to graph the average of B, for data ranges in A
Original Data Set
A B
<dbl> <dbl>
1 200 28
2 1053 67.3
3 17000. 30
4 7565. 12
5 14525 56
6 3411 30
What I am trying to transform my data into
Ranges Average
0 - 999.99 23%
1000 - 1999.99 45%
2000 - 2999.99 32%
3000 - 3999.99 50%
This is what I have so far for this function
A1 <- read_excel("file")
DataRange <- data.frame( A= A1$C,
B= A1$R)
# Function 1
ranges1 <- DataRange %>% mutate(new_range=cut(A, breaks = seq(min(A),max(A)), by = 999))
The Output of range1 is
232 699.00 23.00000 (699,700]
233 445.00 33.00000 (445,446]
234 3112.00 28.00000 (3112,3113]
235 1235.00 98.00000 (1235,1236]
This is a breakdown from the function I am working with
# Function 2
ranges1 <- DataRange %>% mutate(new_range=cut(A, breaks = seq(min(A),max(A)), by = 999)
%>% group_by(new_range)
%>% dplyr::summarize(mean_1 = mean(B))
%>% as.data.frame())
The output of range1 is:
Error in `mutate()`:
! Problem while computing `new_range = ... %>% as.data.frame()`.
Caused by error in `UseMethod()`:
! no applicable method for 'group_by' applied to an object of class "factor"
Run `rlang::last_error()` to see where the error occurred.
As you can tell I am jumping the gun on the first problem, but the later function is where I am trying to take this expression.
I am really confused about how to fix the first function, any suggestions?
This is a syntax error. You need to have the %>% pipes at the ends of lines, not the start of lines. When your line ends after the mutate() R thinks that command is complete. Then the next line starts with %>% and the data didn't actually get piped through.
Change it to this:
ranges1 <- DataRange %>%
mutate(new_range=cut(A, breaks = seq(min(A),max(A)), by = 999)) %>%
group_by(new_range) %>%
dplyr::summarize(mean_1 = mean(B)) %>%
as.data.frame())

Melt Data and fill new column with desired data

Hello coding community
I have a two part question that is 1/2 answered
transpose, aka melt data frame, to my liking - done
add rows of data based on results found in "removed" column, a column created in the transposing step - stuck here
df<- read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t")
df_transformed<-tidyr::gather(df, day, removed, -(1:2), na.rm = TRUE) # melted data
In my example here (df), I have an experiment ran over 8 days. On certain days, I remove data points, and I am only interested in these days (hence why I added na.rm = TRUE in the transposing process). I sometimes remove 1 data point, or 4 (but this could be any number really)
I would like the removed data points to be called "individuals", and for them to be counted in chronological order. Therefore, I first need to add a column called "individuals"
df_transformed$individual <- ""
I would like to fill in the "individual" column based on the results in the "removed" column.
example: cage 2 had only 1 data point removed, and it was on day_8. I would therefore like to add, in the "individual" column, a 1. Cage 4, on the other hand, had data points removed on day_5 (1 data point) and day_7 (3 data points), for a total of 4 data points , aka , 4 "individuals". Therefore, Cage 4, when starting with day_5, I would like to add a 1 in the "individuals" column, and for day 7, create 3 total rows of data, and continue my "individual count" with 2,3,4. IF day_8 had 3 more data points removed, the individual count would continue with 5,6,7.
My desired result for my example data set today would be this:
desired_results <- read.table("https://pastebin.com/raw/r7QrC0y3", header=T, sep="\t") # 68 total rows of data
Interesting piece of information: The total number of rows in my final data set should equal the sum of all removed data points:
sum(df_transformed$removed) # 68
Thank you StackOverflow community. Looking forward to seeing the results.
We can use complete to create a sequence from 1 to each individual grouped by cage and day. We then fill the NA values in columns experiment and removed.
library(dplyr)
library(tidyr)
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
fill(experiment, removed, .direction = "up")
# cage day individual experiment removed
#1 2 day_8 1 sugar 1
#2 3 day_5 1 sugar 1
#3 4 day_5 1 sugar 3
#4 4 day_5 2 sugar 3
#5 4 day_5 3 sugar 3
#6 4 day_7 1 sugar 1
#7 7 day_7 1 sugar 1
#8 7 day_8 1 sugar 1
#9 8 day_5 1 sugar 2
#10 8 day_5 2 sugar 2
# … with 58 more rows
To update individual only based on cage we can do
df_transformed %>%
mutate(individual = removed) %>%
group_by(cage, day) %>%
complete(individual = seq_len(individual)) %>%
group_by(cage) %>%
mutate(individual = row_number()) %>%
fill(experiment, removed, .direction = "up")
I think the following bit of code does what you need:
library(tidyverse)
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE))
I have used the pipe operator (%>%), which enables cleaner syntax. I have also used the newer pivot_longer function instead of gather. Then, grouping by cage and later summing over the individual column with summarize you get how many individuals were removed per cage.
I checked the sum of all the individuals and it seems to work:
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE)) %>%
pull(individual) %>%
sum()
#> [1] 68
The result is slightly different to your desired result. I am not 100% your desired result is actually correct... From your question, I understand that cage 4 should have 4 individuals, but in your desired_result it appears 4 times with values 1, 2, 3 and 4. The code I sent you generates a data frame where each appears in a single row.

Difference between two higher numbers in a column in R

I have a data frame like these:
NUM_TURNO CODIGO_MUNICIPIO SIGLA_PARTIDO SHARE
1 1 81825 PPB 38.713318
2 1 81825 PMDB 61.286682
3 1 09717 PMDB 48.025900
4 1 09717 PL 1.279217
5 1 09717 PFL 50.694883
6 1 61921 PMDB 51.793868
This is a data.frame of elections in Brazil. Grouping by NUM_TURNO and CODGIDO_MUNICIPIO I want to compare the SHARE of the FIRST and SECOND most votted politics in each city and round (1 or 2) and create a new column.
What am I having problem to do? I don't know how to calculate the difference only for the two biggest SHARES of votes.
For the first case, for example, I want to create something that gives me the difference between 61.286682 and 38.713318 = 22.573364 and so on.
Something like this:
df %>%
group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
mutate(Diff = HIGHER SHARE - 2º HIGHER SHARE))
You can also use top_n from dplyr with grouping and summarizing. Keep in mind that in the data you provided, you will get an error in summarize if you use diff with a single value, hence the use of ifelse.
df %>%
group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
top_n(2, SHARE) %>%
summarize(Diff = ifelse(n() == 1, NA, diff(SHARE)))
# A tibble: 3 x 3
# Groups: NUM_TURNO [?]
NUM_TURNO CODIGO_MUNICIPIO Diff
<dbl> <dbl> <dbl>
1 1 9717 2.67
2 1 61921 NA
3 1 81825 22.6
You could arrange your dataframe by Share and then slice the first two values. Then you could use summarise to get the diff between the values for every group:
library(dplyr)
df %>%
group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
arrange(desc(Share)) %>%
slice(1:2) %>%
summarise(Diff = -diff(Share))

(R): Calculate quantile by unique row value unification

I have a df like this:
> df<-data.frame(Client.code =
c(100451,100451,100523,100523,100523,100525),dayref = c(24,30,15,13,17,5))
> df
Client.code dayref
1 100451 24
2 100451 30
3 100523 15
4 100523 13
5 100523 17
6 100525 5
It is a one-year distribution of payments period from issue.
Usign this data above and given a df2 like this:
Client.Code Days
1 100451 16
1 100523 16
1 100460 35
As i have enough data for a reasonable quantile prob. calculations.I will like to know how to build a loop for assing to every row in this df2 of days a quantile according with the first df.
We can use data.table
library(data.table)
setDT(df)[, .(Quantile = quantile(dayref)), Client.code]
Or with tidyverse
library(dplyr)
library(tidyr)
df %>%
group_by(Client.code) %>%
summarise(Quantile = list(quantile(dayref))) %>%
unnest
tapply(df$dayref, df$Client.code, quantile)
You can specify specific percentiles by adding a vector of them
tapply(df$dayref, df$Client.code, quantile, 1:19/20)
You may need to formulate like this
tapply(df$dayref, df$Client.code, quantile, probs = 1:19/20)
And you can add na.rm = TRUE as another argument if you might have NAs

Find monthly plane/aircraft usage from the nycflights13 data set

I would like to find the monthly usage of all the aircrafts(based on tailnum)
lets say this is required for some kind of maintenance activity that needs to be done after x number of trips.
As of now i am doing it like below;
library(nycflights13)
N14228 <- filter(flights,tailnum=="N14228")
by_month <- group_by(N14228 ,month)
usage <- summarise(by_month,freq = n())
freq_by_months<- arrange(usage, desc(freq))
This has to be done for all aircrafts and for that the above approach wont work as there are 4044 distinct tailnums
I went through the dplyr vignette and found an example that comes very close to this but it is aimed at finding overall delays as shown below
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
Apart from this i tried using aggregate and apply but couldnt get the desired results.
Check out the data.table package.
library(data.table)
flt <- data.table(flights)
flt[, .N, by = c("tailnum", "month")]
tailnum month N
1: N14228 1 15
2: N24211 1 14
3: N619AA 1 1
4: N804JB 1 29
5: N668DN 1 4
---
37984: N225WN 9 1
37985: N528AS 9 1
37986: N3KRAA 9 1
37987: N841MH 9 1
37988: N924FJ 9 1
Here, the .N means "count occurrence of".
Not sure if this is exactly what you're looking for, but regardless, for these kinds of counts, it's hard to beat data.table for execution speed and syntactical simplicity.

Resources