Create df aggregating from multiple rows into single row in R

I'm working with an events dataset and need help in creating a new df by summing a specific variable based on certain conditions.
For example, let's say I had a dataset of all cars sold in a county with the name of the dealership, the month the car was sold, the year the car was sold, and the number of cars sold for the past n years. I want to create a new df where each row presents the number of cars sold by a particular dealership at the year level.
In other words, I want to go from something like this:
Dealership  Month  Year  # of Cars
Bobs        April  2016  12
Toms        March  2016  8
Bobs        July   2016  20
Toms        June   2016  4
...
To
Dealership  Month  Year  # of Cars
Bobs        ?      2016  32
Toms        ?      2016  12
...
I'm not sure if that will give me an error because the month data (or other columns in a bigger dataset) will be different. I just don't need that information.
Can anyone help? Many thanks.

We can only do so much without a reproducible example, but this is probably covered by dplyr:
library(dplyr)
yourdata %>% group_by(Dealership, Year) %>% summarise(Ncars = sum(`# of Cars`))
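To make the answer above reproducible, here is a minimal sketch on made-up data that mirrors the table in the question. The data frame name `cars` and the column name `n_cars` (standing in for `# of Cars`) are assumptions for illustration. Columns that are not part of the grouping, such as Month, are simply dropped by summarise(), so differing month values do not cause an error.

```r
library(dplyr)

# Hypothetical data mirroring the question's example table
cars <- data.frame(
  Dealership = c("Bobs", "Toms", "Bobs", "Toms"),
  Month      = c("April", "March", "July", "June"),
  Year       = c(2016, 2016, 2016, 2016),
  n_cars     = c(12, 8, 20, 4)
)

# One row per dealership and year; Month is dropped because it is
# neither a grouping column nor a summary
yearly <- cars %>%
  group_by(Dealership, Year) %>%
  summarise(Ncars = sum(n_cars), .groups = "drop")

yearly
```

The `.groups = "drop"` argument returns an ungrouped result, which avoids surprises in later steps.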

Related

Year on year growth rates for multiple columns basis unique ID & years in rows

I have a data set with a unique proposal ID, an application year, and a financial statement year. Each proposal ID has one application year (t) and may have financial statements for t-1 and/or t-2. I have multiple columns for debt, equity, net worth, etc., and want two columns for YOY growth -F1 and YOY growth-2.
Dataset:
Proposal ID  Application Year  Financial statement year  Net sales
P1           2019              2019                      100
P1           2019              2018                      120
P1           2019              2017                      130
Now, for each proposal ID, I need additional columns with the growth rates between financial statement years, relative to my application year.
Desired output:
Proposal ID  Application Year  Financial statement year  Net sales  YOY - netsales-g1
P1           2019              2019                      100        (100-120)/120...
P1           2019              2018                      120
P1           2019              2017                      130
I need to do this same step for all the columns I have.
What I want is a function that, for each proposal ID, estimates the YOY growth and keeps the latest application date as the final row, with YOY growth columns for all numeric variables in the dataset.
Thank you in advance for the help! :)
I am not sure, but is this what you need? The growth rate is computed relative to the previous year's value, matching the (100-120)/120 formula in the question.
library(dplyr)
data %>% arrange(Financial_Statement_Year) %>%
  mutate(Growth_Difference = Net_Sales - lag(Net_Sales)) %>%
  mutate(Growth_Rate = (Growth_Difference / lag(Net_Sales)) * 100)
Proposal_ID  Application_Year  Financial_Statement_Year  Net_Sales  Growth_Difference  Growth_Rate
P3           2019              2017                      130        NA                 NA
P2           2019              2018                      120        -10                -7.692
P1           2019              2019                      100        -20                -16.667
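This can be done using the dplyr::lead() function inside mutate(). janitor::clean_names() is optional and just makes the code easier to write. Because lead() looks at the next row, this assumes the rows are sorted from the newest to the oldest financial statement year, as in the question's data.
library(dplyr)
df %>%
  janitor::clean_names() %>%
  mutate(YoY_net_sales=(net_sales-lead(net_sales,n=1L))/lead(net_sales,n=1L))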
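The question also asks for growth per proposal ID and across several numeric columns. A minimal sketch of one way to do that, assuming snake_case column names and a made-up data frame called `fin`, is to group by proposal, sort by financial statement year, and apply the same formula with across(). Only `net_sales` comes from the question's table; further columns (debt, equity, and so on) would be added to the selection inside across().

```r
library(dplyr)

# Hypothetical data based on the question's table
fin <- data.frame(
  proposal_id              = c("P1", "P1", "P1"),
  application_year         = c(2019, 2019, 2019),
  financial_statement_year = c(2019, 2018, 2017),
  net_sales                = c(100, 120, 130)
)

yoy <- fin %>%
  group_by(proposal_id) %>%
  # oldest statement first within each proposal, so lag() is the prior year
  arrange(financial_statement_year, .by_group = TRUE) %>%
  # growth relative to the previous year; list more columns in c(...) as needed
  mutate(across(c(net_sales),
                ~ (.x - dplyr::lag(.x)) / dplyr::lag(.x),
                .names = "yoy_{.col}")) %>%
  ungroup()

yoy
```

The first statement year of each proposal has no prior year, so its growth is NA.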

Using indexing to perform mathematical operations on data frame in r

I'm struggling to perform basic indexing on a data frame to perform mathematical operations. I have a data frame containing all 50 US states with an entry for each month of the year, so there are 600 observations. I wish to find the difference between a value for the month of December minus the January value for each of the states. My data looks like this:
> head(df)
state year month value
1 AL 2020 01 2.7
2 AK 2020 01 5
3 AZ 2020 01 4.8
4 AR 2020 01 3.7
5 CA 2020 01 4.2
7 CO 2020 01 2.7
For instance, AL has a value in Dec of 4.7 and Jan value of 2.7 so I'd like to return 2 for that state.
I have been trying to do this with the group_by and summarize functions, but can't figure out the indexing piece of it to grab values that correspond to a condition. I couldn't find a resource for performing these mathematical operations using indexing on a data frame, and would appreciate assistance as I have other transformations I'll be using.
With dplyr:
library(dplyr)
df %>%
group_by(state) %>%
summarize(year_change = value[month == "12"] - value[month == "01"])
This assumes that your data is as you describe: every state has a single value for every month. If you have missing rows, or multiple observations for a state in a given month, I would not expect this code to work.
Another approach, based on row order rather than month value, might look like this:
library(dplyr)
df %>%
## make sure things are in the right order
arrange(state, month) %>%
group_by(state) %>%
summarize(year_change = last(value) - first(value))
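If some states might be missing a month, one possible variation (a sketch, not part of the original answers) is to look the months up with match(), which returns NA when a month is absent, so the difference becomes NA instead of an empty result. The small data frame below is made up: AL has both months, AK has no December row.

```r
library(dplyr)

# Hypothetical data: AK is missing its December observation
df <- data.frame(
  state = c("AL", "AL", "AK"),
  month = c("01", "12", "01"),
  value = c(2.7, 4.7, 5.0)
)

change <- df %>%
  group_by(state) %>%
  # match() gives the position of the first matching month, or NA if absent;
  # indexing with NA yields NA, so each group still returns a single value
  summarize(year_change = value[match("12", month)] - value[match("01", month)])

change
```

With duplicate rows for a month, match() uses the first one, so duplicates should still be resolved beforehand.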

Sum up ride_length for weekdays vs weekends and compare for casual and annual members

Hi, I'm a beginner in R, so I don't know much of the functionality needed for this operation. I know in my head what to do, just not how to do it.
I have ride length data that I want to sum up for weekdays vs weekends and compare between annual and casual members.
I have used wday() to convert the dates to '1' to '7'. Now I want to filter '2' to '6' (weekdays) and sum the ride_length, filter '1' and '7' (weekends) and sum that ride_length, and then use aggregate() to compare usage between casual and annual members.
That is what I have decided.
member_type  ride_length  date        month  day  year  day_of_week  weekday_num
casual       5280         2020-07-01  Jul    01   2020  Wednesday    4
casual       9840         2020-07-01  Jul    01   2020  Wednesday    4
Any other path to this would be welcome too.
Unfortunately I cannot test the code because the input and desired output are missing, but you should be able to make these lines work for you:
library(dplyr)
# your data.frame/tibble
df %>%
# create variable to indicate weekend or not (check the weekend day names)
dplyr::mutate(day_type = ifelse(day_of_week %in% c("Saturday", "Sunday"), "WEEKEND","WEEK")) %>%
# build grouping by member type and day type
dplyr::group_by(member_type, day_type) %>%
# summarise total ride length
dplyr::summarize(total_ride_length = sum(ride_length, na.rm = TRUE))
Just a piece of advice: there may be holidays you should consider, as they can fall on a working day but show the behaviour of a weekend day (because most people have free time to rent and ride bikes, or vice versa if people predominantly rent to get to and from work).
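Since the question mentions wday() numbers and aggregate(), the same idea can also be sketched in base R. This assumes the numbering described in the question (1 and 7 are the weekend days). The data frame `rides` and the "annual" rows are made up for illustration; only the two casual rows come from the question.

```r
# Hypothetical data; weekday_num follows the question's 1-7 coding
rides <- data.frame(
  member_type = c("casual", "casual", "annual", "annual"),
  ride_length = c(5280, 9840, 600, 1200),
  weekday_num = c(4, 4, 7, 2)
)

# Label each ride as weekend (1 or 7) or weekday (2 to 6)
rides$day_type <- ifelse(rides$weekday_num %in% c(1, 7), "weekend", "weekday")

# Total ride length per member type and day type
totals <- aggregate(ride_length ~ member_type + day_type, data = rides, FUN = sum)

totals
```

Combinations that never occur in the data (here, casual weekend rides) do not appear in the result.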

Updating Data Frames

I have the following dataset, which originates from two datasets taken from an API at different points in time. df1 simply shows the state after I appended them. My goal is to generate the newest version of my API data, without forgetting the old data. This means I am looking to create some kind of update mechanism. I thought about creating a unique number for each dataset to identify its state, append the new version to the old one and then filter out the duplicates while keeping the newer data.
The data frames look like this:
df (after simply appending the two)
"Year" "Month" "dataset"
2017 December 1
2018 January 1
2018 January 2
2018 February 1
2018 February 2
2018 March 2
2018 April 2
df2 (the update)
"Year" "Month" "dataset"
2017 December 1
2018 January 2
2018 February 2
2018 March 2
2018 April 2
As df2 shows, the update mechanism prefers the data from dataset 2. January and February data were in both datasets, but only the rows from dataset 2 are kept.
On the other hand, if there is no overlap between the datasets, it keeps the old and the new data.
Is there a simple solution in order to create the described update mechanism in R?
This is the Code for df1:
df1 <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
                  Month = c("December","January","January","February","February","March","April"),
                  Dataset = c(1,1,2,1,2,2,2))
Let me see if I have this right: you have 2 datasets (named 1 and 2) which you want to combine. Currently, you're getting the format shown above as df but you want the output to be df2. Is this correct? The below code should solve your problem. It is important that your newer dataset appears first in the full_join call. Whichever appears first will be given priority by distinct when it decides which duplicated rows to remove.
library(dplyr)
dfx <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
Month = c("December","January","January","February",
"February","March","April"),
Dataset = c(1,1,2,1,2,2,2))
df1 <- dfx[dfx$Dataset == 1,]
df2 <- dfx[dfx$Dataset == 2,]
df.updated <- dplyr::full_join(df2, df1) %>%
distinct(Year, Month, .keep_all = TRUE)
df.updated
  Year    Month Dataset
1 2018  January       2
2 2018 February       2
3 2018    March       2
4 2018    April       2
5 2017 December       1
full_join joins the two data frames on matching variables, keeping all rows from both. Then distinct tosses out the duplicated rows. By specifying variable names in distinct, we tell it to only consider the values in Year and Month when determining uniqueness, so when a specific Year/Month combination appears in more than one dataset, only one row will be kept.
Normally, distinct only keeps the variables it uses to determine uniqueness. By providing the argument .keep_all = TRUE, it will keep all variables. When there are conflicts (for example, 2 rows from February 2018 with different values of Dataset) it will keep whichever row appears first in the data frame. This is why it's important for your newer dataset to appear first in the full_join: this gives rows that appear in df2 priority over rows that also appear in df1.
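If the two versions have already been appended into one data frame, as in the question, a similar result can be sketched without splitting and re-joining: sort so that the newest dataset comes first, then keep the first row per Year/Month. The name `appended` is an assumption for illustration, and this relies on a higher Dataset number meaning newer data.

```r
library(dplyr)

# The appended data from the question
appended <- data.frame(
  Year    = c(2017, 2018, 2018, 2018, 2018, 2018, 2018),
  Month   = c("December", "January", "January", "February",
              "February", "March", "April"),
  Dataset = c(1, 1, 2, 1, 2, 2, 2)
)

updated <- appended %>%
  # newest dataset first, so distinct() keeps its rows on conflicts
  arrange(desc(Dataset)) %>%
  distinct(Year, Month, .keep_all = TRUE)

updated
```

Rows that exist only in the older dataset (December 2017 here) survive because nothing newer replaces them.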

new column created in dataframe based on conditions in other columns - but data being read incorrectly

I have a dataframe describing observations of bird species in various locations based on month and year. It looks like this:
COMMON.NAME       OBSERVATION.COUNT  LOCALITY                  Month  Year
Bushtit           1                  Vancouver                 Jan    2000
Lapland Longspur  1                  Vancouver - general area  Jan    2000
Mew Gull          1                  Vancouver                 Jan    2000
American Coot     4                  Maplewood 00              Jan    2000
American Coot     2                  Maplewood 00              Jan    2000
American Coot     1                  Iona Island (general)     Jan    2000
I am trying to create another column in the data frame called "Season" which groups the months of Jan, Feb, Mar and calls them Winter and groups the months Oct, Nov, Dec into Fall. This is the code I wrote to do this:
metrobirds$Season<-ifelse(metrobirds$Month==c("Jan","Feb","Mar"),"Winter","Fall")
However, when I view the dataframe, R has not correctly grouped the data in the new column. For example, many of the rows with Jan, Feb, or Mar are indicated as Fall and some are correctly indicated as Winter. What's wrong with my code? Any suggestions to correct this error?
When I read the csv file into R, I converted columns that were identified as factors into characters (e.g. common name, month, and locality), so those columns should be being read as characters.
Thanks for your help with this!
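Try using %in% instead of ==. With ==, R recycles the three-element vector c("Jan","Feb","Mar") along Month and compares element by element (row 1 against "Jan", row 2 against "Feb", row 3 against "Mar", row 4 against "Jan" again, and so on), so a row only matches when its month happens to line up with the recycled position. %in% tests whether each month appears anywhere in the set:
metrobirds$Season<-ifelse(metrobirds$Month %in% c("Jan","Feb","Mar"),"Winter","Fall")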
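The difference between the two operators is easy to see on a small made-up vector. In this sketch, `month` and `winter` are illustrative names, not columns from the question's data.

```r
# Hypothetical month values; the first five are all winter months
month  <- c("Jan", "Jan", "Jan", "Mar", "Feb", "Oct")
winter <- c("Jan", "Feb", "Mar")

# == recycles `winter` to Jan, Feb, Mar, Jan, Feb, Mar and compares pairwise
eq_season <- ifelse(month == winter, "Winter", "Fall")

# %in% checks each month against the whole set
in_season <- ifelse(month %in% winter, "Winter", "Fall")

eq_season
in_season
```

Only positions 1 and 5 line up with the recycled vector, so the == version labels just those two rows as Winter, while the %in% version labels all five winter months correctly.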
