How to assign and dynamically change the name of a dataframe in R in a loop - r

Hi, I am trying to create data frames in a loop, each with a different name. Each one is assigned a filtered subset of another data frame inside the loop.
Here is the code I have so far:
for (i in 1:nrow(data_s_y_revenue)){
name_y <- data_s_y_revenue$year[i]
data_y_y <-filter(data_y, year==name_y)
}
name_y is a variable that takes a year value (2018, 2019, 2020, etc.) on each iteration. As the code stands, the data frame data_y_y is overwritten every time. What I would like is for the name of each data frame to contain the VALUE of name_y, so that I end up with as many data frames as there are years. For example, if I only have the years 2019 and 2020, I would end up with two data frames named 2020_data_y_y and 2019_data_y_y, holding the filtered values for those years.
Thanks for the help.
Some example data:
data_s_y_revenue data:
year
2018
2019
data_y data:
year value value2
2018 1 4
2018 2 4
2019 3 2
2019 3 2
The expected result would be two data frames called 2018_data_y_y and 2019_data_y_y with the filtered values.

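With Waldi's suggestion I was able to solve it:
library(dplyr)
for (i in seq_len(nrow(data_s_y_revenue))) {
  name_y <- data_s_y_revenue$year[i]
  # Build the target name, e.g. "data_y_y_2018"
  name_y1 <- paste("data_y_y", name_y, sep = "_")
  data_y_y <- filter(data_y, year == name_y)
  # Create a variable with that name in the current environment
  assign(name_y1, data_y_y)
}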
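As an alternative to assign(), the per-year data frames can also be kept in a named list. The following is only a sketch in base R, using the sample data from the question; the list name data_by_year is made up for illustration.

```r
# Sample data taken from the question
data_y <- data.frame(
  year   = c(2018, 2018, 2019, 2019),
  value  = c(1, 2, 3, 3),
  value2 = c(4, 4, 2, 2)
)

# split() returns one data frame per distinct year, named by the year
data_by_year <- split(data_y, data_y$year)

# Rename the list elements to match the naming used in the loop above
names(data_by_year) <- paste("data_y_y", names(data_by_year), sep = "_")

# Access a single year by name
data_by_year[["data_y_y_2019"]]
```

A list keeps the related data frames together, so they can be processed with lapply() instead of being looked up by constructed variable names. If separate variables are still wanted, list2env(data_by_year, envir = globalenv()) should create them from the list.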

Related

How to match data to two conditions in a loop?

I'm having trouble building a data table that matches numbers based on two conditions (ID and date). Below is an example of a table snippet containing batch data.
ID      Power  Fuel  Starting_date  Shutting_down_date
El_Bel  344    WB    1983           2030
El_Opo  256    WK    1987           2027
El_Tur  400    WB    2019           2049
The question is how to fill the table below efficiently, so that each ID's column holds its "Power" value for every year up to its last year of operation, given by the "Shutting_down_date" column.
Date  El_Bel  El_Opo  El_Tur
2017
2018
2019
2020
2021
Many thanks for any suggestions.
Let us call the first dataframe x and the second data frame y and let us further assume that each ID only occurs once in the first table. The problem is that you have a different number of years for each ID which means that they cannot be stored in a data.frame (requires all columns to have the same length). You can use a list, though:
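result <- list()
for (i in seq_len(nrow(x))) {
  # as.character() guards against ID being stored as a factor
  id <- as.character(x[i, "ID"])
  end_date <- x[i, "Shutting_down_date"]
  # Keep only the years up to and including the shutdown year
  result[[id]] <- subset(y[, c("Date", id)], Date <= end_date)
}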
Then you can query the results as result[["El_Bel"]] or result$El_Bel etc.
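If a single wide table is preferred over a list, one option is to fill the years outside the operating period with NA. The sketch below rebuilds both tables from the question in base R. It also takes Starting_date into account, which is an assumption on my part and goes beyond what the loop above does.

```r
# First table from the question (Fuel omitted, it is not needed here)
x <- data.frame(
  ID                 = c("El_Bel", "El_Opo", "El_Tur"),
  Power              = c(344, 256, 400),
  Starting_date      = c(1983, 1987, 2019),
  Shutting_down_date = c(2030, 2027, 2049),
  stringsAsFactors   = FALSE
)

# Second table: one row per year, one column per ID to be filled
years <- 2017:2021
y <- data.frame(Date = years)

for (i in seq_len(nrow(x))) {
  # TRUE for the years in which this unit is operating
  active <- years >= x$Starting_date[i] & years <= x$Shutting_down_date[i]
  # Power while operating, NA otherwise
  y[[x$ID[i]]] <- ifelse(active, x$Power[i], NA)
}

y
```

With this sample data, El_Tur is NA for 2017 and 2018 because it only starts in 2019, while the other two units are filled for all five years.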

Using lubridate with multiple date formats

I have a column of dates that was stored in the format 8/7/2001, 10/21/1990, etc. Two values are just four-digit years. I converted the entire column to class Date using the following code.
lubridate::parse_date_time(eventDate, orders = c('mdy', 'Y'))
It works great, except the values that were just years are converted to yyyy-01-01 and I want them to just be yyyy. Is there a way to keep lubridate from adding on any information that wasn't already there?
Edit: Code to create data frame
id = (1:5)
eventDate = c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id, eventDate)
I do not think it is possible to convert your values to dates and keep the "yyyy" values intact. By transforming your "yyyy" values into "yyyy-01-01", lubridate is doing the right thing: dates have an order, and if other values in your column have days and months defined, all values need to have these components too.
For example, take the data.frame below. If I ask R to order the table by the dates column, does the value in the first row ("2020") come before or after the value in the second row ("2020-02-28")? The value "2020" stands for the whole year 2020, so it could mean any day in that year. How should R treat it? By filling in the first day of the year, lubridate defines the missing components and keeps the ordering unambiguous.
dates <- c("2020", "2020-02-28", "2020-02-20", "2020-01-10", "2020-05-12")
id <- 1:5
df <- data.frame(
id,
dates
)
id dates
1 1 2020
2 2 2020-02-28
3 3 2020-02-20
4 4 2020-01-10
5 5 2020-05-12
So if you want to keep the "yyyy" values intact, they very likely should not sit in your eventDate column alongside values with a different structure ("mm/dd/yyyy"). If it is really necessary to keep these values intact, I think it is best to keep the eventDate column as character and store the parsed dates in another column, like this:
df$as_dates <- lubridate::parse_date_time(df$eventDate, orders = c('mdy', 'Y'))
id eventDate as_dates
1 1 10/7/2001 2001-10-07
2 2 1989 1989-01-01
3 3 <NA> <NA>
4 4 5/5/2016 2016-05-05
5 5 9/18/2011 2011-09-18
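To remember which rows were year-only, a flag column can be stored next to the parsed dates. The sketch below uses base R's as.Date() rather than lubridate so that it runs without extra packages; the column names year_only and label are made up for illustration.

```r
eventDate <- c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id = 1:5, eventDate, stringsAsFactors = FALSE)

# TRUE where the original string is just a four-digit year (NA gives FALSE)
df$year_only <- grepl("^[0-9]{4}$", df$eventDate)

# Parse the month/day/year strings; everything else becomes NA for now
df$as_dates <- as.Date(df$eventDate, format = "%m/%d/%Y")

# Fill in the year-only rows as January 1st of that year
df$as_dates[df$year_only] <-
  as.Date(paste0(df$eventDate[df$year_only], "-01-01"))

# Display label that shows only the year where nothing more was known
df$label <- ifelse(df$year_only,
                   format(df$as_dates, "%Y"),
                   format(df$as_dates, "%Y-%m-%d"))
df
```

The as_dates column stays sortable as a proper Date, while year_only records that the month and day of those rows were filled in rather than observed.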

R: Subset/extract rows of a data frame in steps of 12

I have a data frame with data for each month of a 26 years period (1993 - 2019), which makes 312 rows in total.
Unfortunately, I had to lag the data, so each year goes now from July t to June t+1. So I can't just extract the year from the date.
Now, I want to extract the 12-month data for each year into a separate data frame. My first idea is to insert the year in the first column and use the lapply function to filter afterward.
For this, I created the following loop:
n <- 1
m <- 1993
for (a in 1:26) {
for (i in n:(n+11)) {
t.monthly.ret.lag[i,1] <- m
}
n <- n+1
m <- m+1
}
Unfortunately, R isn't naming the year in steps of 12. Instead, it is counting directly in steps of 1.
Does anyone know how to solve this or maybe know a better way of doing it?
y.first <- 1993
# The last lagged year runs from July 2018 to June 2019
y.last <- 2018
month.col <- rep(c(7:12, 1:6), y.last - y.first + 1)
year.col <- rep(y.first:y.last, each = length(month.name))
df <- data.frame(year = year.col, month = month.col)
This yields a data frame of 312 rows in which each row is tagged with its month and its (lagged) year, which then lets you use dplyr::group_by() and so on.
You could just create a 312 element long vector giving the year (and one giving the month) using rep() and seq(). Then you can attach them as additional columns to your data.frame or just use them as reference for month and year.
month <- rep(1:12, 27)
year <- rep(1993:2019, each = 12)
# Drop the first six and the last six entries (Jan-Jun 1993, Jul-Dec 2019)
month <- month[7:(length(month) - 6)]
year <- year[7:(length(year) - 6)]
The month vector counts from 1 to 12, beginning at 7 (July); the year vector repeats each calendar year 12 times (the first and last only 6 times).
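Once every row carries a year label, the 12-row blocks can be pulled apart with split(). The sketch below uses a stand-in data frame with a single made-up column ret in place of the real t.monthly.ret.lag, and it labels each block with the year in which it starts (July).

```r
# Stand-in for the real data: 312 monthly rows, July 1993 to June 2019
t.monthly.ret.lag <- data.frame(ret = seq_len(312))

# One label per row: each lagged year covers 12 consecutive rows
lag.year <- rep(1993:2018, each = 12)

# List of 26 data frames, one per lagged year, named "1993" ... "2018"
by.year <- split(t.monthly.ret.lag, lag.year)

# Rows 13 to 24, i.e. July 1994 to June 1995
by.year[["1994"]]
```

The resulting list can then be passed to lapply() to run the same computation on every 12-month block.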

Updating Data Frames

I have the following dataset, which originates from two datasets taken from an API at different points in time. df1 simply shows the state after I appended them. My goal is to generate the newest version of my API data, without forgetting the old data. This means I am looking to create some kind of update mechanism. I thought about creating a unique number for each dataset to identify its state, append the new version to the old one and then filter out the duplicates while keeping the newer data.
The data frames look like this:
df1 (after simply appending the two)
"Year" "Month" "dataset"
2017 December 1
2018 January 1
2018 January 2
2018 February 1
2018 February 2
2018 March 2
2018 April 2
df2 (the update)
"Year" "Month" "dataset"
2017 December 1
2018 January 2
2018 February 2
2018 March 2
2018 April 2
As df2 shows, the update mechanism prefers the data from dataset 2: January and February appear in both datasets, but only the rows from dataset 2 are kept.
On the other hand, if there is no overlap between the datasets, it keeps the old and the new data.
Is there a simple solution in order to create the described update mechanism in R?
This is the Code for df1:
df1 <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
                  Month = c("December","January","January","February",
                            "February","March","April"),
                  Dataset = c(1,1,2,1,2,2,2))
Let me see if I have this right: you have two datasets (named 1 and 2) that you want to combine. Currently you get the appended format shown above (called df in the code below), but you want the output to be df2. Is this correct? The code below should solve your problem. It is important that your newer dataset appears first in the full_join call: whichever appears first is given priority by distinct when it decides which duplicated rows to remove.
library(dplyr)
df <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
Month = c("December","January","January","February",
"February","March","April"),
Dataset = c(1,1,2,1,2,2,2))
df1 <- df[df$Dataset == 1,]
df2 <- df[df$Dataset == 2,]
df.updated <- dplyr::full_join(df2, df1) %>%
distinct(Year, Month, .keep_all = TRUE)
df.updated
Year Month Dataset
1 2018 January 2
2 2018 February 2
3 2018 March 2
4 2018 April 2
5 2017 December 1
full_join joins the two data frames on matching variables, keeping all rows from both. Then distinct tosses out the duplicated rows. By specifying variable names in distinct, we tell it to only consider the values in Year and Month when determining uniqueness, so when a specific Year/Month combination appears in more than one dataset, only one row will be kept.
Normally, distinct only keeps the variables it uses to determine uniqueness. By providing the argument .keep_all = TRUE, it will keep all variables. When there are conflicts (for example, 2 rows from February 2018 with different values of Dataset) it will keep whichever row appears first in the data frame. This is why it's important for your newer dataset to appear first in the full_join: this gives rows that appear in df2 priority over rows that also appear in df1.
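The same "newer rows win" rule can be sketched without dplyr, using rbind() and duplicated() from base R. The data frames old and new below are hypothetical stand-ins for the two API snapshots.

```r
# Older snapshot (dataset 1) and newer snapshot (dataset 2)
old <- data.frame(Year = c(2017, 2018, 2018),
                  Month = c("December", "January", "February"),
                  Dataset = 1)
new <- data.frame(Year = c(2018, 2018, 2018, 2018),
                  Month = c("January", "February", "March", "April"),
                  Dataset = 2)

# Put the newer rows first so they win when a Year/Month pair repeats
combined <- rbind(new, old)

# duplicated() flags the second and later occurrences of each Year/Month
updated <- combined[!duplicated(combined[c("Year", "Month")]), ]
updated
```

As with distinct, the order of the inputs decides which row survives, so the newer snapshot has to come first in rbind().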

Producing Ordered Columns of Integers in R for odd-numbered ranges

Total newb R question, but here it is: let's say I want to create a data frame with two columns, one with all years in a range, and the other with every month in each year. When I'm done, I should have this:
year month
1990 1
1990 2
1990 3
Et cetera. This seems like a pretty obvious job for cbind, to turn a range into a column, and rep, to produce 12 instances of each year. This works great, but only for an even number of years in the range. So, for instance:
df <- data.frame(cbind(year=rep(c(1990:2000), 12)))
Works fine. And so does this:
df <- data.frame(cbind(year=rep(c(1990:2000), 12), month=c(1:12)))
But this produces overt nonsense:
df <- data.frame(cbind(year=rep(c(1990:2001), 12), month=c(1:12)))
The first line of code produces 12 instances of each year in the range, just as you'd expect; the second line produces the desired result. The third line produces 12 instances of each year, where each year only gets one month number. Thus:
year month
1990 1
1990 1
1990 1
Is there a way around this that doesn't require always adding a year and trimming it off later?
You are looking for expand.grid, which builds every combination of its arguments:
df <- expand.grid(year = 1990:2001, month = 1:12)
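The cbind version breaks for 1990:2001 because the year vector then repeats with period 12, exactly like the recycled month vector, so each year is always paired with the same month. A sketch of two ways to get rows ordered by year and then month, assuming that ordering is what is wanted:

```r
years <- 1990:2001

# Explicit version: repeat each year 12 times, cycle the months
df <- data.frame(
  year  = rep(years, each = 12),
  month = rep(1:12, times = length(years))
)

# expand.grid varies its first argument fastest, so list month first
# and then reorder the columns
grid <- expand.grid(month = 1:12, year = years)[, c("year", "month")]

head(df, 3)
```

Both data frames have 144 rows, with each year/month pair appearing exactly once.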
