I would like to find the monthly usage of every aircraft (identified by tailnum).
Let's say this is required for some kind of maintenance activity that needs to be done after x number of trips.
As of now I am doing it like below:
library(nycflights13)
library(dplyr)
N14228 <- filter(flights,tailnum=="N14228")
by_month <- group_by(N14228 ,month)
usage <- summarise(by_month,freq = n())
freq_by_months<- arrange(usage, desc(freq))
This has to be done for all aircraft, and the above approach won't work for that because there are 4044 distinct tailnums.
I went through the dplyr vignette and found an example that comes very close to this, but it is aimed at finding overall delays, as shown below:
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
Apart from this, I tried using aggregate and apply but couldn't get the desired results.
Check out the data.table package.
library(data.table)
flt <- data.table(flights)
flt[, .N, by = c("tailnum", "month")]
tailnum month N
1: N14228 1 15
2: N24211 1 14
3: N619AA 1 1
4: N804JB 1 29
5: N668DN 1 4
---
37984: N225WN 9 1
37985: N528AS 9 1
37986: N3KRAA 9 1
37987: N841MH 9 1
37988: N924FJ 9 1
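Here, .N is a special data.table symbol that holds the number of rows in each group, so the call counts the flights for every combination of tailnum and month.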
Not sure if this is exactly what you're looking for, but regardless, for these kinds of counts, it's hard to beat data.table for execution speed and syntactical simplicity.
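For comparison, the same per-aircraft monthly count can also be written in dplyr, which is where the question started. The following is only a sketch on a small made-up table (toy_flights is hypothetical, not part of nycflights13); on the real data, flights would take its place.

```r
library(dplyr)

# Hypothetical stand-in for nycflights13::flights, with only the two columns needed
toy_flights <- data.frame(
  tailnum = c("N14228", "N14228", "N14228", "N24211", "N24211"),
  month   = c(1, 1, 2, 1, 1)
)

# count() is shorthand for group_by(tailnum, month) %>% summarise(freq = n())
usage <- toy_flights %>%
  count(tailnum, month, name = "freq") %>%
  arrange(desc(freq))

print(usage)
```

Because every tailnum is handled in one grouped call, there is no need to filter one aircraft at a time.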
I am trying to organize and mutate my data in R.
Essentially I am trying to graph the average of B, for data ranges in A
Original Data Set
A B
<dbl> <dbl>
1 200 28
2 1053 67.3
3 17000. 30
4 7565. 12
5 14525 56
6 3411 30
What I am trying to transform my data into
Ranges Average
0 - 999.99 23%
1000 - 1999.99 45%
2000 - 2999.99 32%
3000 - 3999.99 50%
This is what I have so far for this function
A1 <- read_excel("file")
DataRange <- data.frame( A= A1$C,
B= A1$R)
# Function 1
ranges1 <- DataRange %>% mutate(new_range=cut(A, breaks = seq(min(A),max(A)), by = 999))
The Output of range1 is
232 699.00 23.00000 (699,700]
233 445.00 33.00000 (445,446]
234 3112.00 28.00000 (3112,3113]
235 1235.00 98.00000 (1235,1236]
This is a breakdown from the function I am working with
# Function 2
ranges1 <- DataRange %>% mutate(new_range=cut(A, breaks = seq(min(A),max(A)), by = 999)
%>% group_by(new_range)
%>% dplyr::summarize(mean_1 = mean(B))
%>% as.data.frame())
The output of range1 is:
Error in `mutate()`:
! Problem while computing `new_range = ... %>% as.data.frame()`.
Caused by error in `UseMethod()`:
! no applicable method for 'group_by' applied to an object of class "factor"
Run `rlang::last_error()` to see where the error occurred.
As you can tell, I am jumping the gun on the first problem, but the latter function is where I am trying to take this expression.
I am really confused about how to fix the first function, any suggestions?
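The error comes from where the pipeline sits. In Function 2 the opening parenthesis of mutate() is not closed until the very end, so group_by(), summarize() and as.data.frame() are all evaluated inside mutate(), and the factor returned by cut() is what gets piped into group_by(). Close mutate() before the first pipe. You should also put the %>% pipes at the ends of lines, not the start: once a line forms a complete expression R treats the command as finished, and a following line that starts with %>% is a syntax error.
Change it to this:
ranges1 <- DataRange %>%
  mutate(new_range = cut(A, breaks = seq(min(A), max(A)), by = 999)) %>%
  group_by(new_range) %>%
  dplyr::summarize(mean_1 = mean(B)) %>%
  as.data.frame()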
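Separately from the pipe problem, the bins in Function 1 come out one unit wide because by = 999 is passed to cut(), which silently ignores it, rather than to seq(). Below is a minimal sketch of binning into 1000-wide ranges. It uses the six rows shown in the question; starting the ranges at 0, the bin_width name, and the half-open [lower, upper) intervals are assumptions on my part.

```r
library(dplyr)

# Toy data copied from the "Original Data Set" in the question
DataRange <- data.frame(
  A = c(200, 1053, 17000, 7565, 14525, 3411),
  B = c(28, 67.3, 30, 12, 56, 30)
)

# Break points every 1000 units, from 0 to just past max(A);
# note that `by` is an argument of seq(), not of cut()
bin_width <- 1000
breaks <- seq(0, max(DataRange$A) + bin_width, by = bin_width)

ranges1 <- DataRange %>%
  # right = FALSE gives intervals of the form [lower, upper)
  mutate(new_range = cut(A, breaks = breaks, right = FALSE, dig.lab = 5)) %>%
  group_by(new_range) %>%
  summarise(mean_1 = mean(B)) %>%
  as.data.frame()

print(ranges1)
```

Empty ranges are dropped by group_by() by default, so only bins that actually contain data appear in the result.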
So I am trying to write an automated report in R with functions. One of the questions I am trying to answer is this: "During the first week of the month, what were the 10 most viewed products? Show the results in a table with the product's identifier, category, and count of the number of views." To do this I wrote the following function:
most_viewed_products_per_week <- function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
}
print(most_viewed_products_per_week)
However the output I get is this:
function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
How do I fix that?
This report has more questions like this, so I am trying to get my function writing as correct as possible from the start.
Thanks in advance,
Edo
It is a good practice to code in functions. Still I recommend you get your code doing what you want and then think about what parts you want to wrap in a function (for future re-use). This is to get you going.
In general: to support your analysis, make sure that your data is in the right class. I.e. dates are formatted as dates, numbers as double or integers, etc. This will give you access to many helper functions and packages.
For the case at hand, read up on {tidyverse}, in particular {dplyr} which can help you with coding pipes.
simulate data
As mentioned, you will find many friends on Stack Overflow if you provide a reproducible example.
Your question suggests your data look a bit like the following simulated data.
Adapt as appropriate (or provide example)
library(tibble) # tibble are modern data frames
library(dplyr) # for crunching tibbles/data frames
library(lubridate) # tidyverse package for date (and time) handling
df <- tribble( # create row-tibble
~date, ~identifier, ~category, ~views
,"2020-02-01", 1, "TV", 27
,"2020-02-02", 2, "PC", 40
,"2020-02-03", 1, "TV", 12
,"2020-02-03", 2, "PC", 2
,"2020-02-08", 3, "UV", 200
) %>%
mutate(date = ymd(date)) # date is read in as character - lubridate::ymd() converts it to a date
This yields
> df
# A tibble: 5 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
5 2020-02-08 3 UV 200
Notice: date-column is in date-format.
work your algorithm
From your attempt it follows you want to extract the first 7 days.
Since we have a "date"-column, we can use a date-function to help us here.
{lubridate}'s day() extracts the "day-number".
> df %>% filter(day(date) <= 7)
# A tibble: 4 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
Anything outside the first 7 days is gone.
Next you want to summarise to get your product views total.
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
## ---------- summarise in bins that you need := groups -------
group_by(identifier, category) %>%
summarise(total_views = sum(views)
, .groups = "drop" ) # if grouping is not needed "drop" it
This gives you:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 1 TV 39
2 2 PC 42
Now pick the top-10 and sort the order:
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
group_by(identifier, category) %>%
summarise(total_views = sum(views), .groups = "drop" ) %>%
## ---------- make use of another helper function of dplyr
top_n(n = 10, total_views) %>% # note top-10 makes here no "real" sense :), try top_n(1, total_views)
arrange(desc(total_views)) # arrange in descending order on total_views
This yields:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 2 PC 42
2 1 TV 39
wrap in function
Now that the workflow is in place, think about breaking your code into the blocks you think are useful.
I leave this to you. You can assign interim results to new data frames and wrap the preparation of the data into a function and then the top_n() %>% arrange() in another function, ...
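One possible way to wrap the workflow above into a single function is sketched below. The function name top_viewed_first_week and its top argument are made up for illustration, and it uses slice_max() (available in dplyr 1.0.0 and later) where the steps above use top_n(). Note that day(date) <= 7 keeps the first seven days of every month present in the data.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: the `top` most viewed products during days 1-7 of the month
top_viewed_first_week <- function(data, top = 10) {
  data %>%
    filter(day(date) <= 7) %>%                        # keep the first week only
    group_by(identifier, category) %>%
    summarise(total_views = sum(views), .groups = "drop") %>%
    slice_max(total_views, n = top) %>%               # keep the `top` largest totals
    arrange(desc(total_views))                        # largest first
}

# Same simulated data as above, built with base R
df <- data.frame(
  date       = as.Date(c("2020-02-01", "2020-02-02", "2020-02-03",
                         "2020-02-03", "2020-02-08")),
  identifier = c(1, 2, 1, 2, 3),
  category   = c("TV", "PC", "TV", "PC", "UV"),
  views      = c(27, 40, 12, 2, 200)
)

result <- top_viewed_first_week(df)
print(result)
```

Calling the function creates the table; printing the function object itself, as in the question, only shows its source code.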
I'm trying to count the number of rows using dplyr after using group_by. I have the following data:
scenario pertubation population
A 1 20
B 1 30
C 1 40
D 1 50
A 2 15
B 2 25
And I'm using the following code to group_by and mutate:
test <- all_scenarios %>%
group_by(scenario) %>%
mutate(rank = dense_rank(desc(population)),
exceedance_probability = rank / count(pertubation)) %>%
select(scenario, pertubation, All.ages, rank, exceedance_probability)
But I keep encountering this error message and I am unsure what it means or why I keep getting it:
Error in mutate_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "c('integer', 'numeric')".
I would like my output data to look something like this:
scenario pertubation population rank exceedance_probability
A 1 20 12 0.06
B 1 30 7 0.035
C 1 40 2 0.01
D 1 50 1 0.005
A 2 15 34 0.17
B 2 25 28 0.14
To calculate the exceedance probability I just need to divide the rank by the number of observations, but I've found it hard to do this in dplyr after a group_by statement. Am I ordering the dplyr statements incorrectly?
We can get the count separately and join with the original dataset
all_scenarios %>%
count(pertubation) %>%
left_join(all_scenarios, ., by = 'pertubation') %>%
group_by(scenario) %>%
mutate(rank = dense_rank(desc(population)), exceedance_probability = rank /n)
Or instead of using count, we can do a second group_by and get the n()
all_scenarios %>%
group_by(scenario) %>%
mutate(rank = dense_rank(desc(population))) %>%
group_by(pertubation) %>%
mutate( exceedance_probability = rank /n())
Your issue comes from the
count(pertubation)
part of the code. count() expects a data frame as its first argument, but inside mutate() pertubation is a plain numeric vector, which is why the error complains about an object of class "c('integer', 'numeric')". Just use
n()
in place of it in the code. Since you're grouping by scenario, and each scenario-pertubation pair is unique in your dataset, counting the number of rows in each scenario effectively counts the number of pertubation values for each scenario.
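As a minimal sketch of that replacement, here is the question's pipeline with n() swapped in, run on the six rows shown in the question (the full data set is not available, so the numbers will differ from the desired output listed there):

```r
library(dplyr)

# The six example rows from the question
all_scenarios <- data.frame(
  scenario    = c("A", "B", "C", "D", "A", "B"),
  pertubation = c(1, 1, 1, 1, 2, 2),
  population  = c(20, 30, 40, 50, 15, 25)
)

test <- all_scenarios %>%
  group_by(scenario) %>%
  mutate(rank = dense_rank(desc(population)),
         # n() is the number of rows in the current group (here: per scenario)
         exceedance_probability = rank / n()) %>%
  ungroup()

print(test)
```

n() takes no arguments and can only be used inside dplyr verbs such as mutate(), summarise() and filter().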
I am trying to calculate family sizes from a data frame that also records two types of events: family members who died, and those who left the family. I would like to take both into account in order to compute the actual family size.
Here is a reproducible example of my problem, with 3 families only:
family <- factor(rep(c("001","002","003"), c(10,8,15)), levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left) ; DF
I could count N = total family members (in each family) in a second dataframe DF2, by simply using table()
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N" ; DF2
family N
1 001 10
2 002 8
3 003 15
But I cannot find a proper way to get the actual number of people (for example, by creating a new variable N2 in DF2), calculated by subtracting from N the number of members who died or left the family. I suppose I have to relate the two data frames DF and DF2 in some way. I have looked for related questions on this site but could not find the right answer...
If anyone has a good idea, it would be great !
Thank you in advance..
Deni
Logic: first we group_by(family), then we calculate two numbers: (i) the total number of observations in each group, and (ii) that total minus sum(dead) + sum(left).
In dplyr, n() gives the number of observations in each group.
In data.table, .N does the same job.
library(dplyr)
DF %>% group_by(family) %>% summarise( total = n(), current = n()-sum(dead,left, na.rm = TRUE))
# family total current
# (fctr) (int) (dbl)
#1 001 10 6
#2 002 8 4
#3 003 15 7
library(data.table)
# setDT() converts a data.frame to a data.table by reference; if DF is already a data.table, just use DF.
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
# family total current
#1: 001 10 6
#2: 002 8 4
#3: 003 15 7
Here is a base R option
do.call(data.frame, aggregate(dl~family, transform(DF, dl = dead + left),
FUN = function(x) c(total=length(x), current=length(x) - sum(x))))
Or a modified version is
transform(aggregate(. ~ family, transform(DF, total = 1,
current = dead + left)[c(1,4:5)], FUN = sum), current = total - current)
# family total current
#1 001 10 6
#2 002 8 4
#3 003 15 7
I finally found another solution that works fine (from another post) and computes everything from the original DF table. It uses the ddply function from the plyr package:
library(plyr)
DF <- ddply(DF,.(family),transform,total=length(family))
DF <- ddply(DF,.(family),transform,actual=length(family)-sum(dead=="1")-sum(left=="1"))
DF
Thanks a lot to everyone who helped ! Deni
After performing a survey on perceived problems per neighborhood I get this dataframe. Since the survey had different options to choose from + an open one, the results on the open question are frequently irrelevant (see below):
library(dplyr)
library(splitstackshape)
df = read.csv("http://pastebin.com/raw.php?i=tQKHWMvL")
# Splitting multiple answers into different rows.
df = cSplit(df, "Problems", ",", direction = "long")
df = df %>%
group_by(Problems) %>%
summarise(Total = n()) %>%
mutate(freq = Total/sum(Total)*100) %>%
arrange(rank = desc(rank(freq)))
Resulting in this data frame:
> df
Source: local data table [34 x 3]
Problems Total freq
1 Hurtos o robos sin violencia 245 25.6008359
2 Drogas 232 24.2424242
3 Peleas callejeras 162 16.9278997
4 Ningún problema 149 15.5694880
5 Agresiones 66 6.8965517
6 Robos con violencia 62 6.4785789
7 Quema contenedores 6 0.6269592
8 Ruidos 5 0.5224660
9 NS/NC 4 0.4179728
10 Desempleo 2 0.2089864
.. ... ... ...
>
As you can see, results after row 9 are mostly irrelevant (only one or two respondents per option), so I'd like them to be grouped into a single option (such as "others") without losing their relation to the neighborhood (that's why I can't rename the values now). Any suggestions?
The splitstackshape package imports data.table (so you don't even need to load it with library()) and assigns the data.table class to your data set, so I would simply proceed with data.table syntax from there, especially because nothing beats data.table when it comes to assignments in a subset.
In other words, instead of this long dplyr piping, you can simply do
df[, freq := .N / nrow(df) * 100 , by = Problems]
df[freq < 6, Problems := "OTHER"]
And you're good to go.
You can check the new summary table using
df[, .(freq = .N/nrow(df) * 100), by = Problems][order(-freq)]
# 1: Hurtos o robos sin violencia 25.600836
# 2: Drogas 24.242424
# 3: Peleas callejeras 16.927900
# 4: Ningún problema 15.569488
# 5: Agresiones 6.896552
# 6: Robos con violencia 6.478579
# 7: OTHER 4.284222
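To see that the relabelling keeps each answer tied to its neighborhood, here is a small self-contained sketch of the same two steps. The survey table, its Neighborhood values and the 20 percent cutoff are all made up for illustration; the real data come from the question's CSV.

```r
library(data.table)

# Hypothetical long-format survey: one row per (neighborhood, reported problem)
survey <- data.table(
  Neighborhood = c("North", "North", "South", "South", "South",
                   "East", "East", "East", "East", "West"),
  Problems     = c("Drogas", "Ruidos", "Drogas", "Drogas", "Ruidos",
                   "Drogas", "Ruidos", "Ruidos", "Drogas", "Desempleo")
)

# Share of all answers for each problem, added by reference as a new column
survey[, freq := .N / nrow(survey) * 100, by = Problems]

# Relabel rare problems in place; the Neighborhood column is left untouched
rare_cutoff <- 20  # assumed threshold, in percent
survey[freq < rare_cutoff, Problems := "OTHER"]

# Recompute the summary on the relabelled data
summary_dt <- survey[, .(freq = .N / nrow(survey) * 100), by = Problems][order(-freq)]
print(summary_dt)
```

Only the Problems value of the matching rows changes, so a later count by Neighborhood and Problems still works on the same table.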