Updating Data Frames - r

I have the following dataset, which originates from two datasets taken from an API at different points in time. df1 simply shows the state after I appended them. My goal is to generate the newest version of my API data, without forgetting the old data. This means I am looking to create some kind of update mechanism. I thought about creating a unique number for each dataset to identify its state, append the new version to the old one and then filter out the duplicates while keeping the newer data.
The data frames look like this:
df (after simply appending the two)
"Year" "Month" "dataset"
2017 December 1
2018 January 1
2018 January 2
2018 February 1
2018 February 2
2018 March 2
2018 April 2
df2 (the update)
"Year" "Month" "dataset"
2017 December 1
2018 January 2
2018 February 2
2018 March 2
2018 April 2
As df2 shows, the update mechanism prefers the data from dataset 2. January and February data were in both datasets, but only the rows from dataset 2 are kept.
On the other hand, if there is no overlap between the datasets, it keeps the old and the new data.
Is there a simple solution in order to create the described update mechanism in R?
This is the Code for df1:
df1 <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
Month =
c("December","January","January","February","February","March","April"),
Dataset = c(1,1,2,1,2,2,2))

Let me see if I have this right: you have 2 datasets (named 1 and 2) which you want to combine. Currently, you're getting the format shown above as df but you want the output to be df2. Is this correct? The code below should solve your problem. It is important that your newer dataset appears first in the full_join call. Whichever appears first will be given priority by distinct when it decides which duplicated rows to remove.
library(dplyr)
df <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
Month = c("December","January","January","February",
"February","March","April"),
Dataset = c(1,1,2,1,2,2,2))
df1 <- df[df$Dataset == 1,]
df2 <- df[df$Dataset == 2,]
df.updated <- dplyr::full_join(df2, df1) %>%
distinct(Year, Month, .keep_all = TRUE)
df.updated
Year Month Dataset
1 2018 January 2
2 2018 February 2
3 2018 March 2
4 2018 April 2
5 2017 December 1
full_join joins the two data frames on matching variables, keeping all rows from both. Then distinct tosses out the duplicated rows. By specifying variable names in distinct, we tell it to only consider the values in Year and Month when determining uniqueness, so when a specific Year/Month combination appears in more than one dataset, only one row will be kept.
Normally, distinct only keeps the variables it uses to determine uniqueness. By providing the argument .keep_all = TRUE, it will keep all variables. When there are conflicts (for example, 2 rows from February 2018 with different values of Dataset) it will keep whichever row appears first in the data frame. This is why it's important for your newer dataset to appear first in the full_join: this gives rows that appear in df2 priority over rows that also appear in df1.
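A variation that does not rely on row order is sketched below. It is not the answerer's code: update_rows is a hypothetical helper name, and old and new are made-up stand-ins for the two API snapshots. The idea is to keep every row of the newer data and add only those older rows whose Year/Month key is absent from it.

```r
library(dplyr)

# Hypothetical helper: keep all rows of `new`, plus the rows of `old`
# whose key columns do not occur in `new`.
update_rows <- function(old, new, keys) {
  old_only <- anti_join(old, new, by = keys)  # old rows with no match in new
  bind_rows(new, old_only)
}

# Made-up snapshots mirroring the question's example
old <- data.frame(Year = c(2017, 2018, 2018),
                  Month = c("December", "January", "February"),
                  Dataset = 1)
new <- data.frame(Year = c(2018, 2018, 2018, 2018),
                  Month = c("January", "February", "March", "April"),
                  Dataset = 2)

updated <- update_rows(old, new, keys = c("Year", "Month"))
updated
```

Because the priority is stated explicitly through anti_join, the result does not change if the argument order of a later join or bind is rearranged.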

Related

Using indexing to perform mathematical operations on data frame in r

I'm struggling to perform basic indexing on a data frame to perform mathematical operations. I have a data frame containing all 50 US states with an entry for each month of the year, so there are 600 observations. I wish to find the difference between a value for the month of December minus the January value for each of the states. My data looks like this:
> head(df)
state year month value
1 AL 2020 01 2.7
2 AK 2020 01 5
3 AZ 2020 01 4.8
4 AR 2020 01 3.7
5 CA 2020 01 4.2
7 CO 2020 01 2.7
For instance, AL has a value in Dec of 4.7 and Jan value of 2.7 so I'd like to return 2 for that state.
I have been trying to do this with the group_by and summarize functions, but can't figure out the indexing piece of it to grab values that correspond to a condition. I couldn't find a resource for performing these mathematical operations using indexing on a data frame, and would appreciate assistance as I have other transformations I'll be using.
With dplyr:
library(dplyr)
df %>%
group_by(state) %>%
summarize(year_change = value[month == "12"] - value[month == "01"])
This assumes that your data is as you describe: every state has a single value for every month. If you have missing rows, or multiple observations for a state in a given month, I would not expect this code to work.
Another approach, based on row order rather than month value, might look like this:
library(dplyr)
df %>%
## make sure things are in the right order
arrange(state, month) %>%
group_by(state) %>%
summarize(year_change = last(value) - first(value))
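To try both versions side by side, a small made-up data frame is enough. The values below are invented for illustration (only the AL figures come from the question), and toy, by_value and by_order are placeholder names.

```r
library(dplyr)

# Made-up values for two states and two months
toy <- data.frame(
  state = c("AL", "AK", "AL", "AK"),
  year  = 2020,
  month = c("01", "01", "12", "12"),
  value = c(2.7, 5.0, 4.7, 6.5)
)

# Version 1: pick the values by month label
by_value <- toy %>%
  group_by(state) %>%
  summarize(year_change = value[month == "12"] - value[month == "01"])

# Version 2: rely on row order within each state
by_order <- toy %>%
  arrange(state, month) %>%
  group_by(state) %>%
  summarize(year_change = last(value) - first(value))

by_value
by_order
```

With exactly one row per state and month, the two results should agree. They can diverge once a December or January row is missing, since the second version would then silently use whichever months come first and last.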

How to assign and dynamically change the name of a dataframe in R in a loop

Hi, I am trying to create dataframes in a loop with different names; what is assigned to them is a filter from another dataframe inside the loop.
Here is the code I have so far:
for (i in 1:nrow(data_s_y_revenue)){
name_y <- data_s_y_revenue$year[i]
data_y_y <-filter(data_y, year==name_y)
}
name_y is a variable that takes a year value (2018, 2019, 2020, etc.) as the loop runs. As the code stands, the dataframe data_y_y gets overwritten on every iteration. What I would like is for the name of each dataframe to contain the VALUE of name_y, so that I end up with as many dataframes as there are years. For example, if I only have the years 2019 and 2020, I would end up with 2 dataframes named 2020_data_y_y and 2019_data_y_y, holding the filtered values for those years.
Thanks for the help.
Some example data:
data_s_y_revenue data:
year
2018
2019
data_y data:
year value value2
2018 1 4
2018 2 4
2019 3 2
2019 3 2
The expected result for this sample would be 2 dataframes called 2018_data_y_y and 2019_data_y_y with the filtered values.
With Waldi's suggestion I was able to solve it:
for (i in 1:nrow(data_s_y_revenue)){
name_y <- data_s_y_revenue$year[i]
name_y1 <- paste("data_y_y",name_y, sep="_")
data_y_y <-filter(data_y, year==name_y)
assign(name_y1, data_y_y)
}
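A commonly suggested alternative to assign, shown here only as a sketch and not as part of the solution above, is to keep the pieces in one named list. Base R's split does this in a single call. The data frame below repeats the sample data from the question, and by_year is a placeholder name.

```r
# Sample data from the question
data_y <- data.frame(year   = c(2018, 2018, 2019, 2019),
                     value  = c(1, 2, 3, 3),
                     value2 = c(4, 4, 2, 2))

# One list element per year, named after the year
by_year <- split(data_y, data_y$year)

names(by_year)      # "2018" "2019"
by_year[["2019"]]   # the rows for 2019
```

A list like this can be looped over with lapply, which tends to be easier to manage than a set of separately named objects in the global environment.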

Filtering Data based on another dataframe based on two rows

I have two Datasets.
The first dataset includes Companies, the Quarter and the corresponding value from the whole timespan.
Quarter Date Company value
2012.1 2012-12-28 x 1
2013.1 2013-01-02 y 2
2013.1 2013-01-03 z 3
Companies again are in the dataset over the whole time and show up multiple times.
The other dataset is an index which includes a company identifier and the quarter in which it existed in the index (Companies can be in the index in multiple quarters).
Quarter Date Company value
2012.1 2012-12-28 x 1
2014.1 2013-01-02 y 2
2013.1 2013-01-03 x 3
Now I need to only select the companies which are in the index at the same time (quarter) as I have data from the first dataset.
In the example above I would need the data from company x in both quarters, but company y needs to get kicked out because the data is available in the wrong quarter.
I tried multiple functions including filter, subset and match but never got the desired result. It always filters either too much or too little.
data %>% filter(Company == index$Company & Quarter == index$Quarter)
or
data[Company == index$Company & Quarter = index$Quarter,]
Something with my conditions doesn't seem right. Any help is appreciated!
Have a look at dplyr's powerful join functions. Here inner_join might help you (df1 being your data and df2 the index):
dplyr::inner_join(df1, df2, by=c("Company", "Quarter"))
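The attempts in the question filter too much or too little because Company == index$Company compares the two columns position by position (recycling the shorter one) rather than testing membership. If only the rows of the first dataset are wanted, without any columns from the index, a filtering join is another option. The sketch below uses dplyr's semi_join on cut-down versions of the question's tables; storing Quarter as character is an assumption made here to avoid comparing decimal numbers.

```r
library(dplyr)

# Cut-down versions of the two tables from the question
data <- data.frame(
  Quarter = c("2012.1", "2013.1", "2013.1"),
  Company = c("x", "y", "z"),
  value   = c(1, 2, 3)
)
index <- data.frame(
  Quarter = c("2012.1", "2014.1", "2013.1"),
  Company = c("x", "y", "x")
)

# Keep rows of `data` whose Company/Quarter pair also occurs in `index`
kept <- semi_join(data, index, by = c("Company", "Quarter"))
kept
```

Here only company x in quarter 2012.1 survives: y is in the index in a different quarter, and z is not in the index at all.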

Calculations by Subgroup in a Column [duplicate]

I have a dataset that looks approximately like this:
> dataSet
month detrend
1 Jan 315.71
2 Jan 317.45
3 Jan 317.5
4 Jan 317.1
5 Jan 315.71
6 Feb 317.45
7 Feb 313.5
8 Feb 317.1
9 Feb 314.37
10 Feb 315.41
11 March 316.44
12 March 315.73
13 March 318.73
14 March 315.55
15 March 312.64
.
.
.
How do I compute the average by month? E.g., I want something like
> by_month
month ave_detrend
1 Jan 315.71
2 Feb 317.45
3 March 317.5
What you need to focus on is a means to group your column of interest (the "detrend") by the month. There are ways to do this within "vanilla R", but a convenient way is to use tidyverse's dplyr.
I will use an example adapted from the dplyr documentation (the summaries get new names so that sd is computed from the original disp column, not from the mean that would otherwise overwrite it):
mtcars %>%
group_by(cyl) %>%
summarise(mean_disp = mean(disp), sd_disp = sd(disp))
In your case, that would be:
by_month <- dataSet %>%
group_by(month) %>%
summarize(avg = mean(detrend))
This new "tidyverse" style looks quite different, and you seem quite new, so I'll explain what's happening (sorry if this is overly obvious):
First, we are grabbing the dataframe, which I'm calling dataSet.
Then we are piping that dataset to our next function, which is group_by. Piping means that we're taking the result of the last command (which in this case is just the dataframe dataSet) and using it as the first argument of our next function. The function group_by takes a dataframe as its first argument.
Then the results of that group_by are piped to the next function, which is summarize (or summarise if you're from down under, as the author is). On its own, summarize calculates using all the data in the column. However, the group_by function creates partitions in that column, so we now have the mean calculated for each partition that we've made, which is month.
This is the key: group_by creates "flags" so that summarize calculates the function (mean, in this case) separately on each group. So, for instance, all of the Jan values are grouped together and then the mean is calculated only on them. Then for all of the Feb values, the mean is calculated, etc.
HTH!!
R has an inbuilt mean function: mean(x, trim = 0, na.rm = FALSE, ...)
I would do something like this:
january <- dataSet[dataSet[, "month"] == "Jan",]
januaryVector <- january[, "detrend"]
januaryAVG <- mean(januaryVector)
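The per-month subsetting above can be extended to all months at once in base R. The sketch below uses aggregate with a formula. The four rows are a cut-down copy of the question's data, so the means shown by it are for this subset only.

```r
# A few rows from the question's data
dataSet <- data.frame(
  month   = c("Jan", "Jan", "Feb", "Feb"),
  detrend = c(315.71, 317.45, 317.45, 313.50)
)

# Mean of detrend within each month
by_month <- aggregate(detrend ~ month, data = dataSet, FUN = mean)
by_month
```

Note that aggregate orders the groups by the sorted month labels, so with plain character months the rows come out alphabetically rather than in calendar order.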

Create df aggregating from multiple rows into single row in R

I'm working with an events dataset and need help in creating a new df by summing a specific variable based on certain conditions.
For example, let's say I had a dataset of all cars sold in a county with the name of the dealership, the month the car was sold, the year the car was sold, and the number of cars sold for the past n years. I want to create a new df where each row would present the number of cars sold by a particular dealership at the year level.
In other words, I want to go from something like this:
Dealership Month Year # of Cars
Bobs April 2016 12
Toms March 2016 8
Bobs July 2016 20
Toms June 2016 4
...
To
Dealership Month Year # of Cars
Bobs ? 2016 32
Toms ? 2016 12
...
I'm not sure if that will give me an error because the month data (or other columns in a bigger dataset) will be different. I just don't need that information.
Can anyone help? Many thanks.
We can only do so much without a reproducible example, but this is probably covered by dplyr:
library(dplyr)
yourdata %>% group_by(Dealership, Year) %>% summarise(Ncars = sum(`# of Cars`))
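As a minimal reproducible sketch of the same idea, the four rows from the question can be typed in directly. The column name n_cars is an assumption standing in for "# of Cars", and sales and yearly are placeholder names.

```r
library(dplyr)

# The four example rows from the question
sales <- data.frame(
  Dealership = c("Bobs", "Toms", "Bobs", "Toms"),
  Month      = c("April", "March", "July", "June"),
  Year       = 2016,
  n_cars     = c(12, 8, 20, 4)
)

# One row per dealership and year; Month is not mentioned, so it is dropped
yearly <- sales %>%
  group_by(Dealership, Year) %>%
  summarise(Ncars = sum(n_cars), .groups = "drop")
yearly
```

The differing Month values do not cause an error: summarise keeps only the grouping columns and the summaries it is asked to compute, so Bobs ends up with 32 and Toms with 12 for 2016.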
