creating conditioned sub-table in R - r

I'm having the following table (this is just a sample):
custNbr channel custBranchNbr totalTransactions
1 Web 901 7
2 store 903 5
3 Cel 901 10
etc...
and I'd like to create "sub_table" which summarize the number of transactions in each custBranchNbr conditioned on the specific channels (Web+Cel only); something like this:
custBranchNbr sum(totalTransaction)
901 17
I know how to use conditional sum (like this: sum(DF[which(DF[,1]>30 & DF[,4]>90),2])), but I don't know how can I implement this to get the "sub-table" I described above.
your help will be appreciated.

use the aggregate function
sub_table <- aggregate(custBranchNbr, df[df$channel %in% c('Web', 'Cel'), ], sum)

we can also do this with library(dplyr):
df %>% filter(channel %in% c("Web", "Cel") %>%
group_by(custBranchNbr) %>%
summarise(sum_totalTransactions = sum(totalTransactions))
# A tibble: 1 × 2
custBranchNbr sum_totalTransactions
<int> <int>
1 901 17

An option using data.table
library(data.table)
setDT(df)[channel %chin% c('Web', 'Cel'), .(Sum = sum(totalTransaction)).( , custBranchNbr]

Related

Function for writing an automated report in R

So I am trying to write an automated report in R with Functions. One of the questions I am trying to answer is this " During the first week of the month, what were the 10 most viewed products? Show the results in a table with the product's identifier, category, and count of the number of views.". To to this I wrote the following function
most_viewed_products_per_week <- function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
}
print(most_viewed_products_per_week)
However the output I get is this:
function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
How do I fix that?
This report has more questions like this, so I am trying to get my function writing as correct as possible from the start.
Thanks in advance,
Edo
It is a good practice to code in functions. Still I recommend you get your code doing what you want and then think about what parts you want to wrap in a function (for future re-use). This is to get you going.
In general: to support your analysis, make sure that your data is in the right class. I.e. dates are formatted as dates, numbers as double or integers, etc. This will give you access to many helper functions and packages.
For the case at hand, read up on {tidyverse}, in particular {dplyr} which can help you with coding pipes.
simulate data
As mentioned - you will find many friends on Stackoverflow, if you provide a reproducible example.
Your questions suggests your data look a bit like the following simulated data.
Adapt as appropriate (or provide example)
library(tibble) # tibble are modern data frames
library(dplyr) # for crunching tibbles/data frames
library(lubridate) # tidyverse package for date (and time) handling
df <- tribble( # create row-tibble
~date, ~identifier, ~category, ~views
,"2020-02-01", 1, "TV", 27
,"2020-02-02", 2, "PC", 40
,"2020-02-03", 1, "TV", 12
,"2020-02-03", 2, "PC", 2
,"2020-02-08", 3, "UV", 200
) %>%
mutate(date = ymd(date)) # date is read in a character - lubridate::ymd() for date
This yields
> df
# A tibble: 5 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
5 2020-02-08 3 UV 200
Notice: date-column is in date-format.
work your algorithm
From your attempt it follows you want to extract the first 7 days.
Since we have a "date"-column, we can use a date-function to help us here.
{lubridate}'s day() extracts the "day-number".
> df %>% filter(day(date) <= 7)
# A tibble: 4 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
Anything outside the first 7 days is gone.
Next you want to summarise to get your product views total.
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
## ---------- summarise in bins that you need := groups -------
group_by(identifier, category) %>%
summarise(total_views = sum(views)
, .groups = "drop" ) # if grouping is not needed "drop" it
This gives you:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 1 TV 39
2 2 PC 42
Now pick the top-10 and sort the order:
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
group_by(identifier, category) %>%
summarise(total_views = sum(views), .groups = "drop" ) %>%
## ---------- make use of another helper function of dplyr
top_n(n = 10, total_views) %>% # note top-10 makes here no "real" sense :), try top_n(1, total_views)
arrange(desc(total_views)) # arrange in descending order on total_views
wrap in function
Now that the workflow is in place, think about breaking your code into the blocks you think are useful.
I leave this to you. You can assign interim results to new data frames and wrap the preparation of the data into a function and then the top_n() %>% arrange() in another function, ...
This yields:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 2 PC 42
2 1 TV 39

Obtaining Percentage for Date Observations

I am very new to R and am struggling with this concept. I have a data frame that looks like this:
enter image description here
I have used summary(FoodFacilityInspections$DateRecent) to get the observations for each "date" listed. I have 3932 observations, though, and wanted to get a summary of:
Dates with the most observations and the percentage for that
Percentage of observations for the Date Recent category
I have tried:
*
> count(FoodFacilityInspections$DateRecent) Error in UseMethod("count")
> : no applicable method for 'count' applied to an object of class
> "factor"
Using built in data as you did not provide example data
library(data.table)
dtcars <- data.table(mtcars, keep.rownames = TRUE)
Solution
dtcars[, .("count"=.N, "percent"=.N/dtcars[, .N]*100),
by=cyl]
You can use the table function to find out which date occurs the most. Then you can loop through each item in the table (date in your case) and divide it by the total number of rows like this (also using the mtcars dataset):
table(mtcars$cyl)
percent <- c()
for (i in 1:length(table(mtcars$cyl))){
percent[i] <- table(mtcars$cyl)[i]/nrow(mtcars) * 100
}
output <- cbind(table(mtcars$cyl), percent)
output
percent
4 11 34.375
6 7 21.875
8 14 43.750
A one-liner using table and proportions in within.
within(as.data.frame.table(with(mtcars, table(cyl))), Pc <- proportions(Freq)*100)
# cyl Freq Pc
# 1 4 11 34.375
# 2 6 7 21.875
# 3 8 14 43.750
An updated solution with total, percent and cumulative percent table based on your data.
library(data.table)
data<-data.frame("ScoreRecent"=c(100,100,100,100,100,100,100,100,100),
"DateRecent"=c("7/23/2021", "7/8/2021","5/25/2021","5/19/2021","5/20/2021","5/13/2021","5/17/2021","5/18/2021","5/18/2021"),
"Facility_Type_Description"=c("Retail Food Stores", "Retail Food Stores","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment"),
"Premise_zip"=c(40207,40207,40207,40206,40207,40206,40207,40206,40206),
"Opening_Date"=c("6/27/1988","6/29/1988","10/20/2009","2/28/1989","10/20/2009","10/20/2009","10/20/2009","10/20/2009", "10/20/2009"))
tab <- function(dataset, var){
dataset %>%
group_by({{var}}) %>%
summarise(n=n()) %>%
mutate(total = cumsum(n),
percent = n / sum(n) * 100,
cumulativepercent = cumsum(n / sum(n) * 100))
}
tab(data, Facility_Type_Description)
Facility_Type_Description n total percent cumulativepercent
<chr> <int> <int> <dbl> <dbl>
1 Food Service Establishment 7 7 77.8 77.8
2 Retail Food Stores 2 9 22.2 100

How do I sort an as.data.frame table

To try and get the frequency of variable within a column, I used the following code:
s = table(students$Sport)
t = as.data.frame(s)
names(t)[1] = 'Sport'
t
Although this works, it gives me a massive list that is not sorted, such as this:
1 Football 20310
2 Rugby 80302
3 Tennis 5123
4 Swimming 73132
… … …
68 Basketball 90391
How would I go about sorting this table, so that the most frequent sport is at the top. Also, is there a way to only display the top 5 options? Rather than all 68 different sports?
Or, alternatively, if there's a better way to approach this.
Any help would be appreciated!
you can use dplyr and do it all in a single line, below an example
library(dplyr)
students = data.frame(sport = c(rep("Football", 200),
rep("Rugby", 130),
rep("Tennis", 100),
rep("Swimming", 40),
rep("Basketball", 10),
rep("Baseball", 300),
rep("Gimnastics", 70)
)
)
students %>% group_by(sport) %>% summarise( n = length(sport)) %>% arrange(desc(n)) %>% top_n(5, n)
# A tibble: 5 x 2
sport n
<fct> <int>
1 Baseball 300
2 Football 200
3 Rugby 130
4 Tennis 100
5 Gimnastics 70
You can use the plyr packages count function to count the words and frequency. A more elegant way of doing it compared to converting it to a dataframe.
library(plyr)
d<-count(students,"Sport") #convert it to a dataframe first before using count.
Order function helps you to order the output. using the - makes in sort in descending order. [1:5] gives you the top 5 rows. You can remove it if you want all entries.
d[order(-d$freq)[1:5],]

R For loop fails applying max function

I premise I'm new with R and actually I'm trying to get the fundamentals.
Currently I'm workin on a large dataframe (called "ppl") which I have to edit in order to filter some rows. Each row is included in a group and it is characterized by an intensity (into) value and a sample value.
mz rt into sample tracker sn grp
100.0153 126 2.762664 3 11908 7.522655 0
100.0171 127 2.972048 2 5308 7.718521 0
100.0788 272 30.217969 2 5309 19.024807 1
100.0796 272 17.277916 3 11910 7.297716 1
101.0042 128 37.557324 3 11916 27.991320 2
101.0043 128 39.676014 2 5316 28.234918 2
Well, the first question is: "How can I select from each group the sample with the highest intensity?"
I tried a for loop:
for (i in ppl$grp) {
temp<-ppl[ppl$grp == i,]
sel<-rbind(sel,temp[max(temp$into),])
}
The fact is that it works for ppl$grp == 0, but the next cycles return NAs rows.
Then the filtered dataframe(called "sel") also should store the sample values of the removed rows. It should be as follows:
mz rt into sample tracker sn grp
100.0171 127 2.972048 c(2,3) 5308 7.718521 0
100.0788 272 30.217969 c(2,3) 5309 19.024807 1
101.0043 128 39.676014 c(2,3) 5316 28.234918 2
In order to get this I would use this approach:
lev<-factor(ppl$grp)
samp<-ppl$sample
samp2<-split(samp,lev)
sel$sample<-samp2
Any hint? Because I cannot test it since I still don't have solved the previous problem.
Thanks a lot.
Not sure if I follow your question. But maybe this will get you started.
library(dplyr)
ppl %>% group_by(grp) %>% filter(into == max(into))
A base R option using ave is
ppl[with(ppl, ave(into, grp, FUN = max)==into),]
If the 'sample' column in the expected output have the unique elements in each 'grp', then after grouping by 'grp', update the 'sample' as the pasted unique elements of 'sample', then arrange the 'into' descendingly and slice the 1st row.
library(dplyr)
ppl %>%
group_by(grp) %>%
mutate(sample = toString(sort(unique(sample)))) %>%
arrange(desc(into)) %>%
slice(1L)
# mz rt into sample tracker sn grp
# <dbl> <int> <dbl> <chr> <int> <dbl> <int>
#1 100.0171 127 2.972048 2, 3 5308 7.718521 0
#2 100.0788 272 30.217969 2, 3 5309 19.024807 1
#3 101.0043 128 39.676014 2, 3 5316 28.234918 2
A data.table alternative:
library(data.table)
setkey(setDT(ppl),grp)
ppl <- ppl[ppl[,into==max(into),by=grp]$V1,]
## mz rt into sample tracker sn grp
##1: 100.0171 127 2.972048 2 5308 7.718521 0
##2: 100.0788 272 30.217969 2 5309 19.024807 1
##3: 101.0043 128 39.676014 2 5316 28.234918 2
I have no idea why this code would work
for (i in ppl$grp) {
temp<-ppl[ppl$grp == i,]
sel<-rbind(sel,temp[max(temp$into),])
}
max(temp$into) should return the maximum value--which appears to not be an integer in most cases.
Also, building a data.frame with rbind in every for loop instance is not good practice (in any language). It requires quit a bit of type checking and array growing that can get very expensive.
Also, max will return NA when there are any NAs for that group.
There is also a question about what you want to do about ties? Do you just want one result or all of them? The code Akrun gives will give you all of them.
This code will write a new column that has the group max
ppl$grpmax <- ave(ppl$into, ppl$grp, FUN=function(x) { max(x, na.rm=TRUE ) } )
You can then select all values in a group that are equal to the max with
pplmax <- subset(ppl, into == grpmax)
If you want just one per group then you can remove duplicates
pplmax[!duplicated(pplmax$grp),]

Find monthly plane/aircraft usage from the nycflights13 data set

I would like to find the monthly usage of all the aircrafts(based on tailnum)
lets say this is required for some kind of maintenance activity that needs to be done after x number of trips.
As of now i am doing it like below;
library(nycflights13)
N14228 <- filter(flights,tailnum=="N14228")
by_month <- group_by(N14228 ,month)
usage <- summarise(by_month,freq = n())
freq_by_months<- arrange(usage, desc(freq))
This has to be done for all aircrafts and for that the above approach wont work as there are 4044 distinct tailnums
I went through the dplyr vignette and found an example that comes very close to this but it is aimed at finding overall delays as shown below
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
Apart from this i tried using aggregate and apply but couldnt get the desired results.
Check out the data.table package.
library(data.table)
flt <- data.table(flights)
flt[, .N, by = c("tailnum", "month")]
tailnum month N
1: N14228 1 15
2: N24211 1 14
3: N619AA 1 1
4: N804JB 1 29
5: N668DN 1 4
---
37984: N225WN 9 1
37985: N528AS 9 1
37986: N3KRAA 9 1
37987: N841MH 9 1
37988: N924FJ 9 1
Here, the .N means "count occurrence of".
Not sure if this is exactly what you're looking for, but regardless, for these kinds of counts, it's hard to beat data.table for execution speed and syntactical simplicity.

Resources