How to match data to two conditions in a loop? - r

I'm having trouble building a data table that matches numbers based on two conditions (ID and date). Below is an example of a table snippet containing batch data.
ID      Power  Fuel  Starting_date  Shutting_down_date
El_Bel  344    WB    1983           2030
El_Opo  256    WK    1987           2027
El_Tur  400    WB    2019           2049
The question is how to fill the table below efficiently, so that each ID's "Power" value is entered for every year until its last year of operation, given by the column "Shutting_down_date", is reached.
Date  El_Bel  El_Opo  El_Tur
2017
2018
2019
2020
2021
Many thanks for any suggestions.

Let us call the first dataframe x and the second data frame y and let us further assume that each ID only occurs once in the first table. The problem is that you have a different number of years for each ID which means that they cannot be stored in a data.frame (requires all columns to have the same length). You can use a list, though:
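result <- list()
for (i in seq_len(nrow(x))) {
  # as.character() guards against ID being stored as a factor
  id <- as.character(x[i, "ID"])
  end_date <- x[i, "Shutting_down_date"]
  # keep only the years up to and including the shutdown year
  result[[id]] <- subset(y[, c("Date", id)], Date <= end_date)
}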
Then you can query the results as result[["El_Bel"]] or result$El_Bel etc.
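If padding with NA is acceptable, a single wide table is also possible. The following is only a sketch under assumptions not stated in the question: the example tables x and y are reconstructed by hand, the shutdown year is treated as the last year of operation (inclusive), and years before Starting_date are also left empty.

```r
# Hypothetical reconstruction of the two tables from the question
x <- data.frame(
  ID = c("El_Bel", "El_Opo", "El_Tur"),
  Power = c(344, 256, 400),
  Starting_date = c(1983, 1987, 2019),
  Shutting_down_date = c(2030, 2027, 2049),
  stringsAsFactors = FALSE
)
y <- data.frame(Date = 2017:2030)

for (i in seq_len(nrow(x))) {
  # TRUE for the years in which this unit is in operation
  running <- y$Date >= x$Starting_date[i] & y$Date <= x$Shutting_down_date[i]
  # one column per ID: Power while running, NA otherwise
  y[[x$ID[i]]] <- ifelse(running, x$Power[i], NA)
}
y
```

Each ID becomes one column of y, so all columns keep the same length and the years after shutdown simply hold NA.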

Related

How to assign and dynamically change the name of a dataframe in R in a loop

Hi, I am trying to create dataframes with different names in a loop. Each one is assigned a filtered subset of another dataframe inside the loop.
Here is the code I have so far:
for (i in 1:nrow(data_s_y_revenue)){
  name_y <- data_s_y_revenue$year[i]
  data_y_y <- filter(data_y, year==name_y)
}
name_y is a variable that takes a year value (2018, 2019, 2020, etc.) as the loop runs. As the code is right now, the dataframe data_y_y gets overwritten every time. What I would like is for the name of each dataframe to contain the VALUE of name_y, so that I end up with as many dataframes as there are years. For example, if I only have the years 2019 and 2020, I would end up with 2 dataframes named 2020_data_y_y and 2019_data_y_y, holding the filtered values for those years.
Thanks for the help.
some data example
data_s_y_revenue data:
year
2018
2019
data_y data:
year value value2
2018 1 4
2018 2 4
2019 3 2
2019 3 2
the expected result would be 2 dataframes called 2018_data_y_y and 2019_data_y_y with the filtered values
With the suggestion of Waldi I was able to solve it:
for (i in 1:nrow(data_s_y_revenue)){
  name_y <- data_s_y_revenue$year[i]
  name_y1 <- paste("data_y_y", name_y, sep="_")
  data_y_y <- filter(data_y, year==name_y)
  assign(name_y1, data_y_y)
}
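An alternative to assign() is to keep the pieces in one named list, which tends to be easier to loop over later. This is only a sketch in base R using the sample data from the question; the names wanted and by_year are made up for illustration.

```r
# Sample data from the question
data_s_y_revenue <- data.frame(year = c(2018, 2019))
data_y <- data.frame(
  year   = c(2018, 2018, 2019, 2019),
  value  = c(1, 2, 3, 3),
  value2 = c(4, 4, 2, 2)
)

# keep only the years listed in data_s_y_revenue
wanted <- data_y[data_y$year %in% data_s_y_revenue$year, ]

# one data frame per year, stored in a named list
by_year <- split(wanted, wanted$year)
names(by_year) <- paste("data_y_y", names(by_year), sep = "_")

by_year$data_y_y_2018
```

The individual data frames are then available as by_year$data_y_y_2018, by_year[["data_y_y_2019"]], and so on, without creating variables dynamically.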

R: Subset/extract rows of a data frame in steps of 12

I have a data frame with data for each month of a 26-year period (1993 - 2019), which makes 312 rows in total.
Unfortunately, I had to lag the data, so each year now runs from July of year t to June of year t+1. So I can't just extract the year from the date.
Now, I want to extract the 12-month data for each year into a separate data frame. My first idea is to insert the year in the first column and use the lapply function to filter afterward.
For this, I created the following loop:
n <- 1
m <- 1993
for (a in 1:26) {
  for (i in n:(n+11)) {
    t.monthly.ret.lag[i,1] <- m
  }
  n <- n+1
  m <- m+1
}
Unfortunately, R isn't naming the year in steps of 12. Instead, it is counting directly in steps of 1.
Does anyone know how to solve this or maybe know a better way of doing it?
y.first <- 1993
y.last <- 2019
month.col <- rep(c(7:12, 1:6), y.last-y.first+1)
year.col <- rep(c(y.first:y.last), each=length(month.name))
df <- data.frame(year=year.col, month=month.col)
This yields a data frame with the months and years tagged correspondingly, which then allows you to use dplyr::group_by() and so on.
You could just create a 312 element long vector giving the year (and one giving the month) using rep() and seq(). Then you can attach them as additional columns to your data.frame or just use them as reference for month and year.
month = rep(1:12, 27)
year = c(matrix(rep(1:27, 12), ncol = 27, byrow = TRUE) + 1992)
# drop the first and the last 6 months (January-June 1993, July-December 2019)
month = month[7:(length(month)-6)]
year = year[7:(length(year)-6)]
The month vector counts from 1 to 12, beginning at 7 (July); the year vector repeats each year 12 times (the first and last only 6 times).
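The loop in the question advances n by 1 instead of 12, which is why the year label shifts by a single row per iteration. The tagging can also be done without a loop. The following sketch assumes that the 312 rows are sorted by time and start in July 1993, so that there are 26 lagged years (1993 to 2018); the ret column is a placeholder for the real data and lag_year is a made-up column name.

```r
n_years <- 26

# placeholder for the real 312-row data frame
t.monthly.ret.lag <- data.frame(ret = seq_len(n_years * 12))

# label each block of 12 rows with its lagged year (July t to June t+1)
t.monthly.ret.lag$lag_year <- rep(1993:2018, each = 12)
t.monthly.ret.lag$month <- rep(c(7:12, 1:6), times = n_years)

# one data frame per lagged year
yearly <- split(t.monthly.ret.lag, t.monthly.ret.lag$lag_year)
```

yearly[["1993"]] then holds the 12 rows from July 1993 to June 1994.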

Updating Data Frames

I have the following dataset, which originates from two datasets taken from an API at different points in time. df1 simply shows the state after I appended them. My goal is to generate the newest version of my API data, without forgetting the old data. This means I am looking to create some kind of update mechanism. I thought about creating a unique number for each dataset to identify its state, append the new version to the old one and then filter out the duplicates while keeping the newer data.
The data frames look like this:
df (after simply appending the two)
"Year" "Month" "dataset"
2017 December 1
2018 January 1
2018 January 2
2018 February 1
2018 February 2
2018 March 2
2018 April 2
df2 (the update)
"Year" "Month" "dataset"
2017 December 1
2018 January 2
2018 February 2
2018 March 2
2018 April 2
As df2 shows, the update mechanism prefers the data from dataset 2. January and February data were in both datasets, but only the data from dataset 2 is kept.
On the other hand, if there is no overlap between the datasets, it keeps the old and the new data.
Is there a simple solution in order to create the described update mechanism in R?
This is the code for df1:
df1 <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
                  Month = c("December","January","January","February","February","March","April"),
                  Dataset = c(1,1,2,1,2,2,2))
Let me see if I have this right: you have 2 datasets (named 1 and 2) which you want to combine. Currently, you're getting the format shown above as df but you want the output to be df2. Is this correct? The below code should solve your problem. It is important that your newer dataset appears first in the full_join call. Whichever appears first will be given priority by distinct when it decides which duplicated rows to remove.
library(dplyr)
# the appended data, containing rows from both datasets
dfx <- data.frame(Year = c(2017,2018,2018,2018,2018,2018,2018),
                  Month = c("December","January","January","February",
                            "February","March","April"),
                  Dataset = c(1,1,2,1,2,2,2))
df1 <- dfx[dfx$Dataset == 1,]
df2 <- dfx[dfx$Dataset == 2,]
# newer dataset first, so that its rows win when duplicates are dropped
df.updated <- dplyr::full_join(df2, df1) %>%
  distinct(Year, Month, .keep_all = TRUE)
df.updated
  Year    Month Dataset
1 2018  January       2
2 2018 February       2
3 2018    March       2
4 2018    April       2
5 2017 December       1
full_join joins the two data frames on matching variables, keeping all rows from both. Then distinct tosses out the duplicated rows. By specifying variable names in distinct, we tell it to only consider the values in Year and Month when determining uniqueness, so when a specific Year/Month combination appears in more than one dataset, only one row will be kept.
Normally, distinct only keeps the variables it uses to determine uniqueness. By providing the argument .keep_all = TRUE, it will keep all variables. When there are conflicts (for example, 2 rows from February 2018 with different values of Dataset) it will keep whichever row appears first in the data frame. This is why it's important for your newer dataset to appear first in the full_join: this gives rows that appear in df2 priority over rows that also appear in df1.
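The same "newer rows win" idea can be sketched in base R, without dplyr. This is an illustrative variant rather than the code from the answer above: update_data and its keys argument are made-up names, and old and new stand for the two API snapshots.

```r
# Hypothetical helper: keep one row per key combination, preferring `new`
update_data <- function(old, new, keys = c("Year", "Month")) {
  combined <- rbind(new, old)                  # newer rows come first
  combined[!duplicated(combined[keys]), ]      # keep the first row per key
}

old <- data.frame(Year = c(2017, 2018, 2018),
                  Month = c("December", "January", "February"),
                  Dataset = 1)
new <- data.frame(Year = 2018,
                  Month = c("January", "February", "March", "April"),
                  Dataset = 2)

updated <- update_data(old, new)
updated
```

Because duplicated() marks every repeat after the first occurrence, putting new first plays the same role as putting the newer dataset first in the full_join call.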

Combine different rows

Consider a dataframe of the form
id start end
2009.36220 65693384 2010-03-20 2010-07-04
2010.36221 65693592 2010-01-01 2010-12-31
2010.36222 65698250 2010-01-01 2010-12-31
2010.36223 65704349 2010-01-01 2010-12-31
where I have around 20k observations per year for 15 years.
I need to combine the rows by the following rule:
if for the same id, there exists a record that ends at the last day of the year
and a record that starts at the first day of the following year
then
- create a new row with start value of the earlier row and end value of the later year
- and delete the two original rows
Given that the same id can appear several times (since I have more than 2 years), I will then just run the script several times to combine ids that have, for example, 4 rows in consecutive years that satisfy the condition.
The Question
I'd know how to program this in an iterative manner, where I would go over every single row and check whether there's a row somewhere in the whole data frame with a start date next year that corresponds to the end date this year, but that's extremely slow and unsatisfying from an aesthetic point of view. I'm a complete beginner with R, so I have no clue where to even look to do such a thing in a more efficient manner. I'm open to any suggestion.
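Warning: growing a data frame with rbind() inside a loop is slow and generally discouraged, but this is the easiest solution I could think of. Note that it collapses all rows of an id into one, from its earliest start to its latest end. Let df be your data.
df$start = as.POSIXct(df$start)
df$end = as.POSIXct(df$end)
df2 = data.frame()
for (i in unique(df$id)){
  s = subset(df, id==i)
  # one row per id; building a data.frame keeps the dates as dates
  df2 = rbind(df2, data.frame(id = i, start = min(s$start), end = max(s$end)))
}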
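A loop-free variant that only merges rows meeting the rule from the question (a record ending on 31 December followed, for the same id, by a record starting on 1 January of the next year) could look like the sketch below. It is written under assumptions: merge_consecutive is a made-up name, start and end can be converted with as.Date(), and the example data is invented.

```r
merge_consecutive <- function(df) {
  df$start <- as.Date(df$start)
  df$end <- as.Date(df$end)
  df <- df[order(df$id, df$start), ]
  n <- nrow(df)

  # id and end date of the previous row (NA for the first row)
  prev_id <- c(NA, df$id[-n])
  prev_end <- c(as.Date(NA), df$end[-n])

  # TRUE when a row continues the previous one across a year boundary
  continues <- !is.na(prev_id) & df$id == prev_id &
    df$start == prev_end + 1 & format(prev_end, "%m-%d") == "12-31"

  # every row that does not continue its predecessor opens a new group
  grp <- cumsum(!continues)
  first <- !duplicated(grp)
  last <- !duplicated(grp, fromLast = TRUE)

  data.frame(id = df$id[first], start = df$start[first], end = df$end[last])
}

# invented example: id 1 has a chain of three rows, id 2 has a gap
df <- data.frame(
  id = c(1, 1, 1, 2, 2),
  start = c("2010-03-20", "2011-01-01", "2012-01-01", "2010-01-01", "2011-01-01"),
  end = c("2010-12-31", "2011-12-31", "2012-06-30", "2010-07-04", "2011-12-31")
)
merged <- merge_consecutive(df)
merged
```

Because chains of any length fall into the same group, a single pass is enough and the script does not need to be run repeatedly.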

Selecting Specific Dates in R

I am wondering how to create a subset of data in R based on a list of dates, rather than by a date range.
For example, I have the following data set data which contains 3 years of 6-minute data.
date zone month day year hour minute temp speed gust dir
1 09/06/2009 00:00 PDT 9 6 2009 0 0 62 2 15 156
2 09/06/2009 00:06 PDT 9 6 2009 0 6 62 13 16 157
I have used breeze<-subset(data, ws>=15 & wd>=247.5 & wd<=315, select=date:dir) to select the rows which meet my criteria for a sea breeze, which is fine, but what I want to do is create a subset of the days which contain those times that meet my criteria.
I have used...
as.character(breeze$date)
trimdate<-strtrim(breeze$date, 10)
breezedate<-as.Date(trimdate, "%m/%d/%Y")
breezedate<-format(breezedate, format="%m/%d/%Y")
...to extract the dates from each row that meets my criteria so I have a variable called breezedate that contains a list of the dates that I want (not the most eloquent coding to do this, I'm sure). There are about two-hundred dates in the list. What I am trying to do with the next command is in my original dataset data to create a subset which contains only those days which meet the seabreeze criteria, not just the specific times.
breezedays<-(data$date==breezedate)
I think one of my issues here is that I am comparing one value to a list of values, but I am not sure how to make it work.
Let's assume your breezedate list looks like this and data$date is a simple string:
breezedate <- as.Date(c("2009-09-06", "2009-10-01"))
This is probably what you want (note the trailing comma, which selects rows rather than columns):
breezedays <- data[as.Date(data$date, '%m/%d/%Y') %in% breezedate, ]
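Putting the steps together, a small self-contained sketch might look like this. The data frame is invented, the sea-breeze criterion is reduced to speed >= 15 purely for illustration, and to_day is a made-up helper name.

```r
# invented sample in the question's date format
data <- data.frame(
  date = c("09/06/2009 00:00", "09/06/2009 00:06",
           "09/07/2009 00:00", "10/01/2009 12:00"),
  speed = c(2, 16, 3, 20),
  stringsAsFactors = FALSE
)

# hypothetical helper: drop the time part and parse the day
to_day <- function(x) as.Date(substr(x, 1, 10), format = "%m/%d/%Y")

# rows that meet the (simplified) criterion, and the days they fall on
breeze <- subset(data, speed >= 15)
breezedate <- unique(to_day(breeze$date))

# every row of every day that has at least one qualifying reading
breezedays <- data[to_day(data$date) %in% breezedate, ]
breezedays
```

Unlike ==, which compares element by element, %in% checks each date against the whole set of breeze dates, so all readings of a qualifying day are kept.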
The intersect() function returns the values that two vectors have in common. Note that it returns those shared values, not the matching rows of data, and that both vectors must be in the same format.
To use it, run the following:
breezedays <- intersect(data$date, breezedate) # the values shared between data$date and breezedate
