Finding Avg/Sum of a Column Value

I have a nice jitter plot of my data, but I'd like to dig further into it by finding the mean, sum, median, etc.
I don't know the syntax to separate the data by column value.
My data frame consists of 2 variables: Year (2010-2017) and Followers (numeric).
Code I used:
ggplot(MyData, aes(factor(Date), Followers)) +
geom_jitter(aes(color = factor(Date)))
This separated each Numeric data point into categorized groups of each year.
I was able to use sum(MyData$Followers) to get total Followers for all years,
as well as count(MyData, 'Date') to get the frequency for each year.
But I'm not sure how to combine them to get total followers/avg followers for each individual year.

You can use dplyr:
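library(dplyr)
df <- MyData %>%
  group_by(Year) %>%  # use Date here if that is the column's name in MyData
  summarize(Mean = mean(Followers), Total = sum(Followers), Count = n())  # n() takes no arguments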
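To see what a grouped summary returns, here is a minimal sketch on a made-up data frame. The values in MyData below are invented for illustration and are not the asker's data; the column names Year and Followers follow the question's description.

```r
library(dplyr)

# Hypothetical stand-in for MyData; the values are invented for illustration.
MyData <- data.frame(
  Year      = c(2010, 2010, 2011, 2011, 2011),
  Followers = c(100, 300, 50, 150, 400)
)

# One output row per Year, with several statistics computed per group.
by_year <- MyData %>%
  group_by(Year) %>%
  summarize(
    Total  = sum(Followers),
    Mean   = mean(Followers),
    Median = median(Followers),
    Count  = n()  # number of rows in the group; n() takes no arguments
  )

print(by_year)
```

Each statistic is computed separately within each year, so the result has one row per year rather than one row per data point.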

Related

How do I use a loop to iterate through multiple datasets to get one output per dataset in R?

I have 60 data sets that I created from one massive original one. They are split by year and named after their year number: Year1, Year2, Year3, Year4, and so on up to Year60. Each data set has the columns "Car" and "Week". I am trying to loop through every data set, find the row with the largest "Car" value, and get the "Week" value for that row (basically the week in which the most cars were sold, for each of the 60 years).
My code is:
Year1$Car <- as.integer(Year1$Car)
df.1 <- aggregate(Car ~ Week, Year1, max)
df.a <- merge(df.1, Year1)
print(paste("Year 1 Most Cars Sold in Week", df.a$Week))
I am trying to find a way to run through this quicker than just manually typing for each dataset Year1, Year2, etc all the way to Year60.
I tried:
for (i in 1:60){
Year"i"$Car <- as.integer(Year"i"$Car)
df.1 <- aggregate(Car ~ Week, Year"i", max)
df.a <- merge(df.1, Year"i")
print(paste("Year "i" Most Cars Sold in Week", print(df.a$Week))
}
that didn't work :/ Would really appreciate any suggestions!

If you want to keep the data frames separate, you can use sapply to go through each one and extract the Week number of the row with the maximum Car value.
sapply(mget(paste0('Year', 1:60)), function(x) x$Week[which.max(x$Car)])
Or, with dplyr, you can combine all the datasets into one, group_by each year, and select the row with the maximum value of Car.
library(dplyr)
bind_rows(mget(paste0('Year', 1:60)), .id = "id") %>%
group_by(id) %>%
slice(which.max(Car))
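The loop in the question fails because Year"i" is not valid R: a variable name cannot be assembled by placing a string next to an identifier. Building the names with paste0 and looking them up with mget gets around that. Below is a small sketch of the sapply approach on three made-up data frames (the values are invented; the real data would have Year1 to Year60).

```r
# Hypothetical stand-ins for Year1..Year60 (only three here, invented values).
Year1 <- data.frame(Week = 1:4, Car = c(10, 25, 7, 12))
Year2 <- data.frame(Week = 1:4, Car = c(3, 8, 30, 5))
Year3 <- data.frame(Week = 1:4, Car = c(40, 2, 9, 11))

# Look the data frames up by name and collect them into a named list.
years <- mget(paste0("Year", 1:3))

# For each data frame, return the Week of the row with the largest Car value.
best_week <- sapply(years, function(x) x$Week[which.max(x$Car)])

# Print one line per data set.
for (nm in names(best_week)) {
  cat(nm, "most cars sold in week", best_week[[nm]], "\n")
}
```

Note that which.max returns only the first maximum, so ties within a year are resolved in favour of the earliest row.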

grouping & plotting by textual column value

I've got a (very) basic level of competency with R when working with numbers, but when it comes to manipulating data based on text values in columns I'm stuck. For example, if I want to plot meal frequency vs. day of week (is Tuesday really for tacos?) using the following data frame, how would I do that? I've seen suggestions of tapply, aggregate, colSums, and others, but those have all been for slightly different scenarios and nothing gives me what I'm looking for. Should I be looking at something other than R for this problem? My end goal is a graph with day of week on the X-axis, count on the Y-axis, and a line plot for each meal.
df <- data.frame(meal= c("tacos","spaghetti","burgers","tacos","spaghetti",
"spaghetti"), day = c("monday","tuesday","wednesday","monday","tuesday","wednesday"))
This is as close as I've gotten, and, to be honest, I don't fully understand what it's doing:
tapply(df$day, df$meal, FUN = function(x) length(x))
It will summarize the meal counts, but a) it doesn't have column names (my understanding is that's due to tapply returning a vector), and b) it doesn't keep an association with the day of the week.
Edit: The melt() suggestion below works for this dataset, but it won't scale to the size I need. I was, however, able to get a working graph from the dataframe produced by the melt. If anybody runs across this in the future, try:
ggplot(new, aes(day, value, group=meal, col=meal)) +
geom_line() + geom_point() + scale_y_continuous(breaks = function(x)
unique(floor(pretty(seq(0, (max(x) + 1) * 1.1)))))
(The part after geom_point() is to force the Y-axis to only be integers, which is what makes sense in this case.)

I tried to cut this into smaller pieces so you can understand what's going on:
library(tidyverse)
# defining the dataframes
df <- data.frame(meal = c("tacos","spaghetti","burgers","tacos","spaghetti","spaghetti"),
day = c("monday","tuesday","wednesday","monday","tuesday","wednesday"))
# define a vector of days of week ( will be useful to display x axis in the correct order)
ordered_days =c("sunday","monday","tuesday","wednesday",
"thursday","friday",'saturday')
# count the number of meals per day of week
df_count <- df %>% group_by(meal,day) %>% count() %>% ungroup()
# a lot of combinations are missing, for example no burgers on monday
# so i am creating all combinations with count 0
fill_0 <- expand.grid(
meal=factor(unique(df$meal)),
day=factor(ordered_days),
n=0)
# append this fill_0 to df_count
# as some combinations already exist, group by again and sum n
# so only one row per (meal,day) combination
df_count <- rbind(df_count, fill_0) %>%
  group_by(meal, day) %>%
  summarise(n = sum(n)) %>%
  mutate(day = factor(day, levels = ordered_days, ordered = TRUE))
# plot this by grouping by meal
ggplot(df_count,aes(x=day,y=n,group=meal,col=meal)) + geom_line()
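As an alternative sketch (not the answerer's method), base R's table() can produce the same counts, including the zero rows, without the expand.grid and rbind steps. It relies on day being a factor that lists all seven levels; the column name n is chosen here to match the code above.

```r
df <- data.frame(
  meal = c("tacos", "spaghetti", "burgers", "tacos", "spaghetti", "spaghetti"),
  day  = c("monday", "tuesday", "wednesday", "monday", "tuesday", "wednesday")
)
ordered_days <- c("sunday", "monday", "tuesday", "wednesday",
                  "thursday", "friday", "saturday")

# With all seven levels declared, table() reports a zero for unseen days.
df$day <- factor(df$day, levels = ordered_days)

# Cross-tabulate meal by day, then flatten to one row per (meal, day) pair.
counts <- as.data.frame(table(meal = df$meal, day = df$day), responseName = "n")

print(head(counts))
```

The resulting counts data frame has the columns meal, day and n, so it can be passed to the same ggplot call as df_count.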

The magic is here, courtesy of @fmarm:
df_count <- df %>% group_by(meal,day) %>% count() %>% ungroup()
The fill_0 and rbind bits in the sample provided by @fmarm are also necessary to keep from bombing out on unspecified combinations, but it's the line above that handles counting meals by day.

Dividing values in each cell by the group average in R

I am trying to generate a new column with values derived from the original table. I would like to calculate the group average for the same hotel and same date first, then divide the original sales by this group average.
Here is my code. I tried to calculate the group average using group_by and summarise from the dplyr package, but it did not generate the results I expected.
hotel = c(rep("Hilton",3), rep("Caesar",3))
date1 = c(rep('2018-01-01',2), '2018-01-02', rep('2018-01-01',3))
dba = c(2,0,1,3,2,1)
sales = c(3,5,7,5,2,3)
df = data.frame(cbind(hotel, date1, dba, sales))
df1 = df %>%
group_by(date1, hotel) %>%
dplyr::summarise(avg = mean(sales)) %>%
acast(., date1~hotel)
Any suggestion would be highly appreciated!

Instead of summarise, we can use mutate. After grouping by 'date1', 'hotel', divide the 'sales' by the mean of 'sales' to create a new column
library(tidyverse)
df %>%
group_by(date1, hotel) %>%
mutate(SalesDividedByMean = sales/mean(sales))
NOTE: When the columns have different types, cbind produces a matrix, and a matrix can hold only a single type, so one character vector turns the whole matrix into character. Wrapping that matrix in data.frame carries the change over: every column becomes either a factor (stringsAsFactors = TRUE, the default before R 4.0.0) or character. Passing the vectors straight to data.frame keeps sales and dba numeric.
data
df <- data.frame(hotel, date1, dba, sales)
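For comparison, here is a base R sketch of the same calculation using ave(), which returns the group mean for every row. This is an alternative to the dplyr answer, not a replacement for it; the column names group_mean and bad are introduced here for illustration. It also shows the cbind pitfall from the note.

```r
hotel <- c(rep("Hilton", 3), rep("Caesar", 3))
date1 <- c(rep("2018-01-01", 2), "2018-01-02", rep("2018-01-01", 3))
sales <- c(3, 5, 7, 5, 2, 3)

# The pitfall: going through cbind turns sales into text (or a factor).
bad <- data.frame(cbind(hotel, date1, sales))
print(is.numeric(bad$sales))

# Building the data frame directly keeps sales numeric.
df <- data.frame(hotel, date1, sales)

# ave() gives each row the mean of sales within its (date1, hotel) group.
df$group_mean <- ave(df$sales, df$date1, df$hotel, FUN = mean)
df$SalesDividedByMean <- df$sales / df$group_mean

print(df)
```

Because ave() keeps the original row count, the division lines up row by row, which is the same reason mutate works here where summarise does not.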

summing my data by two specific columns

I have my data frame below. I want to sum the data like I have in the first row in the image below (labelled row 38): the total flowering summed over Sections A-D for each date. I also have multiple plots, not just Dry1, but Dry2, Dry3, etc.
It's so simple to do in my head, but I can't work out how to do it in R.
Essentially I want to do this:
with(dat1, sum(dat1$TotalFlowering[dat1$Date=="1997-07-01" & dat1$Plot=="Dry1"]))
Which tells me that the sum of total flowers for sections "A,B,C,D" in plot "Dry1" for the date "1997-07-01" = 166
I want a way to write code that does this for every date and plot combination, and then puts the result in the data frame,
in the same format as the first row in the image I included :)

Based on your comment it seems like you just want to add a variable to keep track of TotalFlowering for every Date and Plot combination. If that's the case then can you just add a field like TotalCount below?
library(dplyr)
df %>%
group_by(Date, Plot) %>%
mutate(TotalCount = sum(TotalFlowering)) %>%
ungroup()
Or, alternatively, if all you want is the sum you could make use of dplyr's summarise like below
library(dplyr)
df %>%
group_by(Date, Plot) %>%
summarise(TotalCount = sum(TotalFlowering))
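The difference between the two versions is easiest to see side by side. The sketch below uses a made-up data frame: the Section column and all the counts are assumptions for illustration, since the original data was only shown in an image.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data; the counts are invented.
dat1 <- data.frame(
  Date           = rep(c("1997-07-01", "1997-07-08"), each = 4),
  Plot           = "Dry1",
  Section        = rep(c("A", "B", "C", "D"), times = 2),
  TotalFlowering = c(10, 20, 30, 40, 1, 2, 3, 4)
)

# mutate keeps every row and repeats the group total on each of them.
with_total <- dat1 %>%
  group_by(Date, Plot) %>%
  mutate(TotalCount = sum(TotalFlowering)) %>%
  ungroup()

# summarise collapses each (Date, Plot) group to a single row.
totals_only <- dat1 %>%
  group_by(Date, Plot) %>%
  summarise(TotalCount = sum(TotalFlowering)) %>%
  ungroup()

print(with_total)
print(totals_only)
```

Use the mutate form when the per-section rows are still needed, and the summarise form when only one row per date and plot is wanted.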

ggvis: plotting data in multiple series

Here is what I have:
A data frame which contains a date field, and a number of summary statistics.
Here's what I want:
I want a chart that allows me to compare the time series week over week, to see how the performance of the process this week compares to the previous one, for example.
What I have done so far:
##Get the week day name to display
summaryData$WeekDay <- format(summaryData$Date, format = '%A')
##Get the week number to differentiate the weeks
summaryData$Week <- format(summaryData$Date, format = '%V')
summaryData %>%
ggvis(x = ~WeekDay, y = ~Referrers) %>%
layer_lines(stroke = ~Week)
I expected it to create a chart with multiple coloured lines, each one representing a week in my data set. It does not do what I expect.

Try looking at reshaper to convert your data with a factor variable for each week, or split up the data with a dplyr::lag() command.
A general way of plotting multiple columns in ggvis is to use the following format:
summaryData %>%
ggvis() %>%
layer_lines(x = ~WeekDay, y = ~Referrers)%>%
layer_lines(x=~WeekDay, y= ~Other)
I hope this helps