Obtaining Percentage for Date Observations - r

I am very new to R and am struggling with this concept. I have a data frame that looks like this:
enter image description here
I have used summary(FoodFacilityInspections$DateRecent) to get the observations for each "date" listed. I have 3932 observations, though, and wanted to get a summary of:
Dates with the most observations and the percentage for that
Percentage of observations for the Date Recent category
I have tried:
*
> count(FoodFacilityInspections$DateRecent) Error in UseMethod("count")
> : no applicable method for 'count' applied to an object of class
> "factor"

Using built in data as you did not provide example data
library(data.table)
dtcars <- data.table(mtcars, keep.rownames = TRUE)
Solution
dtcars[, .("count"=.N, "percent"=.N/dtcars[, .N]*100),
by=cyl]

You can use the table function to find out which date occurs the most. Then you can loop through each item in the table (date in your case) and divide it by the total number of rows like this (also using the mtcars dataset):
table(mtcars$cyl)
percent <- c()
for (i in 1:length(table(mtcars$cyl))){
percent[i] <- table(mtcars$cyl)[i]/nrow(mtcars) * 100
}
output <- cbind(table(mtcars$cyl), percent)
output
percent
4 11 34.375
6 7 21.875
8 14 43.750

A one-liner using table and proportions in within.
within(as.data.frame.table(with(mtcars, table(cyl))), Pc <- proportions(Freq)*100)
# cyl Freq Pc
# 1 4 11 34.375
# 2 6 7 21.875
# 3 8 14 43.750

An updated solution with total, percent and cumulative percent table based on your data.
library(data.table)
data<-data.frame("ScoreRecent"=c(100,100,100,100,100,100,100,100,100),
"DateRecent"=c("7/23/2021", "7/8/2021","5/25/2021","5/19/2021","5/20/2021","5/13/2021","5/17/2021","5/18/2021","5/18/2021"),
"Facility_Type_Description"=c("Retail Food Stores", "Retail Food Stores","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment"),
"Premise_zip"=c(40207,40207,40207,40206,40207,40206,40207,40206,40206),
"Opening_Date"=c("6/27/1988","6/29/1988","10/20/2009","2/28/1989","10/20/2009","10/20/2009","10/20/2009","10/20/2009", "10/20/2009"))
tab <- function(dataset, var){
dataset %>%
group_by({{var}}) %>%
summarise(n=n()) %>%
mutate(total = cumsum(n),
percent = n / sum(n) * 100,
cumulativepercent = cumsum(n / sum(n) * 100))
}
tab(data, Facility_Type_Description)
Facility_Type_Description n total percent cumulativepercent
<chr> <int> <int> <dbl> <dbl>
1 Food Service Establishment 7 7 77.8 77.8
2 Retail Food Stores 2 9 22.2 100

Related

How to group and average a dataset by order then plot a broken line plot?

I have a data frame that contains 5000 examinee's ability estimation with their test score, and they are both continuous variables. Since there are too many examinees, it would be messy to plot out all their scores, so I wish to draw a 'broken line plot' or 'conditional mean plot', that average the test scores of several examines that have similar ability levels at a time, and plot their average score against their average ability. Like the plot below.
I already managed to do this with the codes below.
df<-cbind(rnorm(100,set.seed(123)),sample(100,set.seed(123)),) %>%
as.data.frame() %>%
setNames(c("ability","score")) #simulate the dataset
df<-df[order(df$ability),] #sort the data from low to high according to the ability varaible
seq<-round(seq(from=1, to=nrow(df), length.out=10),0) #divide the data equally to nine groups (which is also gonna be the 9 points that appear in my plot)
b<-data.frame()
for (i in 1:9) {
b[i,1]<-mean(df[seq[i]:seq[i+1],1]) #calculate the mean of the ability by group
b[i,2]<-mean(df[seq[i]:seq[i+1],2]) # calculate the mean of test score by group
}.
I got the mean of the ability and test score using this for loop, and it looks like this
and finally, do the plot
plot(b$V1,b$V2, type='b',
xlab="ability",
ylab="score",
main="Conditional score")
These codes meet my goal, but I can't help thinking if there's a simpler way to do this. Drawing a broken line plot by averaging the data that is sorted from low to high seems to be a normal task.
I wonder if there is any function or trick for this. All ideas are welcome! :)
Here is a solution to create the data to be plotted using dplyr:
set.seed(123)
df<-cbind(rnorm(100,1),sample(100,50)) %>%
as.data.frame() %>%
setNames(c("ability","score")) #simulate the dataset
df<-df[order(df$ability),] #sort the data from low to high according to the ability varaible
df$id <- seq(1, nrow(df))
df %>% mutate(bin = ntile(id, 10)) %>%
group_by(bin) %>%
dplyr::summarize(meanAbility = mean(ability, na.rm=T),
meanScore = mean(score, na.rm=T)) %>%
as.data.frame()
bin meanAbility meanScore
1 1 -0.81312770 41.6
2 2 -0.09354171 52.3
3 3 0.29089892 54.4
4 4 0.68490709 45.8
5 5 0.93078744 59.8
6 6 1.17380069 34.0
7 7 1.42942368 41.3
8 8 1.64965315 40.1
9 9 1.95290596 35.6
10 10 2.50277510 52.9
I would approach the whole thing a bit differently (note also that your code has several errors and won't run the way you were showing.
The exmaple below will lead to different numbers than yours (due to the random generation of numbers and your non-working code).
library(tidyverse)
df <- data.frame(ability = rnorm(100),
score = sample(100)) %>%
arrange(ability) %>%
mutate(seq = ntile(n = 9)) %>%
group_by(seq) %>%
summarize(mean_ability = mean(ability),
mean_score = mean(score))
which gives:
# A tibble: 9 x 3
seq mean_ability mean_score
<int> <dbl> <dbl>
1 1 -1.390807 45.25
2 2 -0.7241746 56.18182
3 3 -0.4315872 49
4 4 -0.2223723 48.81818
5 5 0.06313174 56.36364
6 6 0.3391321 42
7 7 0.6118022 53.27273
8 8 1.021438 50.54545
9 9 1.681746 53.54545

Hos can I add a column porcentaje based on other column in my data frame?

I would like to create a column in my data frame that gives the percentage of each category. The total (100%) would be the summary of the column Score.
My data looks like
Client Score
<chr> <int>
1 RP 125
2 DM 30
Expected
Client Score %
<chr> <int>
1 RP 125 80.6
2 DM 30 19.3
Thanks!
Note special character in column names is not good.
library(dplyr)
df %>%
mutate(`%` = round(Score/sum(Score, na.rm = TRUE)*100, 1))
Client Score %
1 RP 125 80.6
2 DM 30 19.4
Probably the best way is to use dplyr. I recreated your data below and used the mutate function to create a new column on the dataframe.
#Creation of data
Client <- c("RP","DM")
Score <- c(125,30)
DF <- data.frame(Client,Score)
DF
#install.packages("dplyr") #Remove first # and install if library doesn't load
library(dplyr) #If this doesn't run, install library using code above.
#Shows new column
DF %>%
mutate("%" = round((Score/sum(Score))*100,1))
#Overwrites dataframe with new column added
DF %>%
mutate("%" = round((Score/sum(Score))*100,1)) -> DF
Using base R functions the same goal can be achieved.
X <- round((DF$Score/sum(DF$Score))*100,1) #Creation of percentage
DF$"%" <- X #Storage of X as % to dataframe
DF #Check to see it exists
In base R, may use proportions
df[["%"]] <- round(proportions(df$Score) * 100, 1)
-output
> df
Client Score %
1 RP 125 80.6
2 DM 30 19.4

R dplyr - running total with row-wise calculations

I have a dataframe that keeps track of the activities associated with a bank account (example below).
The initial balance is $5,000 (type "initial). If type is "in", that means a cash deposit. In this example each deposit is $1,000. If type is "out", that means a withdrawal from the account. In this example each withdrawal is 10% of the account balance.
data <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000))
Is there a dplyr solution to keep track of the balance after each activity?? I have tried several ways but I can't seem to find a way to efficiently calculate running totals and the withdrawal amount (which depends on the running total).
For this example the output should be:
result <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000),
balance=c(5000,6000,5400,4860,5860,6860))
Thanks in advance for any suggestions or recommendations!
You can use purrr::accumulate2() to condition the calculation on the value of type:
library(dplyr)
library(purrr)
library(tidyr)
data %>%
mutate(balance = accumulate2(input, type[-1], .f = function(x, y, type) if(type == "out") x - x * y/100 else x + y)) %>%
unnest(balance)
# A tibble: 6 x 4
activity type input balance
<int> <chr> <dbl> <dbl>
1 1 initial 5000 5000
2 2 in 1000 6000
3 3 out 10 5400
4 4 out 10 4860
5 5 in 1000 5860
6 6 in 1000 6860

R - Normalise column groups based on column names and add column with conditional value

I'm new to R and dplyr (coming from a pandas/python background) and am currently trying to do more data manipulation in R. The dplyr syntax has really grown on me, but working on my current data normalisation I can't help but think "there must be a 'cleaner' way of doing it".
I have two data.frames, the first has values which I'd like to use to dynamically normalise subsets of the second. I then want to average all columns which have names that end identically, then group and classify rows by the higher/lower means. If this was too unclear, I hope the code clears some things up.
Normalisation (works, but messy?)
> main
airport location x1.takeoffs x1.landings x2.takeoffs x2.landings x3.takeoffs x3.landings x4.takeoffs x4.landings
1 YYZ N.A. 301029 300976 291615 291614 259649 259613 40326 40297
2 LHR U.K. 211013 210983 360456 360389 241972 241964 309509 309495
3 JFK N.A. 432521 432491 205626 205592 1877087 1877060 865802 865771
4 MUC E.U. 101023 101011 43562 43509 234134 234071 30110 30087
5 VIE E.U. 250102 250079 128620 128561 152017 152015 1418485 1418471
> norm
name counts
1 x1 10
2 x2 20
3 x3 30
4 x4 40
What I'd like to do is take all columns that start with x1, and divide them by norm[which(norm$name == "x1"),]$counts, and so on for x2, x3, and x4.
Here's my code:
mainNorm <- main
for (n in norm$name) {
mainNorm[grep(n, colnames(mainNorm))] <- main %>%
select(starts_with(n)) %>%
mutate_each(funs(. / norm[which(norm$name == n),]$counts))
}
Now I average all .takeoffs and .landings:
mainNorm <- mainNorm %>%
mutate(avg.takeoff=select(., ends_with(".takeoffs")) %>%
rowMeans(na.rm=T))
mainNorm <- mainNorm %>%
mutate(avg.landings=select(., ends_with(".landings")) %>%
rowMeans(na.rm=T))
Dynamic column assignment based on min/max of other column
Last, I would like to add a new column, which looks at location groups and assigns either "high", or "low" based on the value in avg.takeoff
I've been trying the rowSums approach suggested in a different question ( R - Assign a value/factor in a data.frame to column conditioned on value(s) of other columns ) but am hitting a bit of a wall.
> mainNorm %>%
group_by(location) %>%
mutate(volume=c("high", "low")[rowSums(select(., avg.takeoff) <1)+1])
Error: Position must be between 0 and n
TL;DR
So, in summary my questions are:
Is there a more dplyrish way around the for loop? I wouldn't mind meltin the data from norm in to main if that helps?
How do I assign "low" and "high" in the group_by call? I'm guessing I'll have to pass it to a custom function?
Regarding my second question I'm guessing this always would be an option:
mainNorm %>%
group_by(location) %>%
filter(avg.takeoff == min(avg.takeoff)) %>%
mutate(volume="low")
But if I now want to handle the other half of the data I'd have to repeat, and then join the two tables. Is there a way of doing this in a single filter call? (Back to functions, I guess?)
Edit: Exptected result
Incorporating #alistair's suggestion helped, but I'm still unsure about the last part: assigning "high", "low". What I'd like to end up with (in some shape or form) is the following table:
# A tibble: 40 × 9
airport location name variable value_norm counts avg.takeoff avg.landings volume
<fctr> <fctr> <chr> <chr> <dbl> <int> <dbl> <dbl> <fctr>
1 YYZ N.A. x1 takeoffs 30102.9 10 13586.692 13584.873 low
2 LHR U.K. x1 takeoffs 21101.3 10 13731.890 13730.148 high
3 JFK N.A. x1 takeoffs 43252.1 10 34437.000 34435.410 high
4 MUC E.U. x1 takeoffs 10102.3 10 5209.404 5207.773 low
5 VIE E.U. x1 takeoffs 25010.2 10 17992.640 17991.220 high
6 YYZ N.A. x1 landings 30097.6 10 13586.692 13584.873 low
7 LHR U.K. x1 landings 21098.3 10 13731.890 13730.148 high
8 JFK N.A. x1 landings 43249.1 10 34437.000 34435.410 high
9 MUC E.U. x1 landings 10101.1 10 5209.404 5207.773 low
10 VIE E.U. x1 landings 25007.9 10 17992.640 17991.220 high
# ... with 30 more rows

How to subset data for a specific column with ddply?

I would like to know if there is a simple way to achieve what I describe below using ddply. My data frame describes an experiment with two conditions. Participants had to select between options A and B, and we recorded how long they took to decide, and whether their responses were accurate or not.
I use ddply to create averages by condition. The column nAccurate summarizes the number of accurate responses in each condition. I also want to know how much time they took to decide and express it in the column RT. However, I want to calculate average response times only when participants got the response right (i.e. Accuracy==1). Currently, the code below can only calculate average reaction times for all responses (accurate and inaccurate ones). Is there a simple way to modify it to get average response times computed only in accurate trials?
See sample code below and thanks!
library(plyr)
# Create sample data frame.
Condition = c(rep(1,6), rep(2,6)) #two conditions
Response = c("A","A","A","A","B","A","B","B","B","B","A","A") #whether option "A" or "B" was selected
Accuracy = rep(c(1,1,0),4) #whether the response was accurate or not
RT = c(110,133,121,122,145,166,178,433,300,340,250,674) #response times
df = data.frame(Condition,Response, Accuracy,RT)
head(df)
Condition Response Accuracy RT
1 1 A 1 110
2 1 A 1 133
3 1 A 0 121
4 1 A 1 122
5 1 B 1 145
6 1 A 0 166
# Calculate averages.
avg <- ddply(df, .(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT))
# The problem: response times are calculated over all trials. I would like
# to calculate mean response times *for accurate responses only*.
avg
Condition N nAccurate RT
1 6 4 132.8333
2 6 4 362.5000
With plyr, you can do it as follows:
ddply(df,
.(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1]))
this gives:
Condition N nAccurate RT
1: 1 6 4 127.50
2: 2 6 4 300.25
If you use data.table, then this is an alternative way:
library(data.table)
setDT(df)[, .(N = .N,
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1])),
by = Condition]
Using dplyr package:
library(dplyr)
df %>%
group_by(Condition) %>%
summarise(N = n(),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy == 1]))

Resources