How to summarize two data frames by matching date columns? - r

I have two data frames, Original and base:
Original <- data.frame(Bond = c("A", "B", "C", "D"),
                       Date = c("19-11-2021", "19-11-2021", "19-11-2021", "17-11-2021"),
                       Rate = c("O_11", "O_12", "O_13", "O_31"))
base <- data.frame(Date = c("19-11-2021", "18-11-2021", "17-11-2021"),
                   Rate = c("B_1", "B_2", "B_3"))
Here I would like to calculate the rate differential between Original and base for each bond on each date, relative to the base rate for that date. The desired output (shown as an image in the original post) has one row per bond, keeping the original rate and adding a Rate_Diff column.
Note: the real data frames contain numerical values for the Original and base rates; the strings above are placeholders.
I was trying group_by() but wasn't able to get much further. Any suggestions would help.

It seems like you want to join on date. With dplyr you can use an inner_join, assuming that a matching date exists in base for every record in Original:
library(dplyr)

Output <- Original %>%
  inner_join(base, by = "Date") %>%
  mutate(Rate_Diff = paste0(Rate.x, "-", Rate.y), Rate = Rate.x) %>%
  select(-Rate.x, -Rate.y)
> Output
  Bond       Date Rate_Diff Rate
1    A 19-11-2021  O_11-B_1 O_11
2    B 19-11-2021  O_12-B_1 O_12
3    C 19-11-2021  O_13-B_1 O_13
4    D 17-11-2021  O_31-B_3 O_31
Edit: I see the note now; in that case you can just replace the paste0 call with actual arithmetic on the columns:
mutate(Rate_Diff = Rate.x - Rate.y, Rate=Rate.x)
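For illustration, a minimal runnable sketch with made-up numeric rates (the values below are hypothetical stand-ins, not the asker's data):

library(dplyr)

# Hypothetical numeric versions of the example frames
Original <- data.frame(Bond = c("A", "B", "C", "D"),
                       Date = c("19-11-2021", "19-11-2021", "19-11-2021", "17-11-2021"),
                       Rate = c(1.10, 1.20, 1.30, 3.10))
base <- data.frame(Date = c("19-11-2021", "18-11-2021", "17-11-2021"),
                   Rate = c(1.00, 2.00, 3.00))

Output <- Original %>%
  inner_join(base, by = "Date") %>%                  # Rate.x = Original rate, Rate.y = base rate
  mutate(Rate_Diff = Rate.x - Rate.y, Rate = Rate.x) %>%
  select(-Rate.x, -Rate.y)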

Related

R data long-wide restructure

I am currently working with data in R. My dataset (about 3 million observations) has columns Store, Year, and Sales; the current long structure was shown as a screenshot. I have two objectives.
Objective 1 is to restructure it into a wide layout with one row per Store and one sales column per Year (also shown as a screenshot); my second objective is to go back from that to the original long structure.
I have tried variations of aggregate with dcast (which is apparently being deprecated?). So, for example, I have tried this:
df2 <- dcast(df1, Store + Sales ~ Year, value.var = "Sales")
or even
df %>%
  group_by(Store, Year) %>%
  summarise(across(starts_with('Sales'), sum))
And I get a diagonal of sales totals across years, but then I'm unable to summarize them so that it looks like
Store Year1 Year2
A     $$    $$
B     $$    $$
Since there is so much data, the result looks like a bunch of stacked identity matrices, except that instead of 1's there are sales values for the years (there are many years, not just two).
I am looking for suggestions on how to proceed. One package I found was 'pivottabler', but I have not used it yet; I wanted to see if anyone had better suggestions first.
Much appreciated.
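A hedged sketch of one possible fix: the diagonal appears because Sales sits on the left-hand side of the dcast formula, so every distinct Store/Sales pair gets its own row. Aggregating first and then widening avoids that (assuming columns named Store, Year, and Sales):

library(dplyr)
library(tidyr)

# One total per Store and Year, then spread the years into columns
df_wide <- df %>%
  group_by(Store, Year) %>%
  summarise(Sales = sum(Sales), .groups = "drop") %>%
  pivot_wider(names_from = Year, values_from = Sales)

# The reshape2 equivalent: drop Sales from the left-hand side of the formula
# df2 <- dcast(df1, Store ~ Year, value.var = "Sales", fun.aggregate = sum)

Going back to the long structure would then be pivot_longer(df_wide, -Store, names_to = "Year", values_to = "Sales").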

R: Create column showing days leading up to/ since the maximum value in another column was reached?

I have a dataset with repeated measures: measurements nested within participants (ID) nested in groups. A variable G (with range 0-100) was measured on the group-level. I want to create a new column that shows:
The first day on which the maximum value of G was reached in a group, coded as zero.
How many days each measurement (in this same group) occurred before or after the day on which the maximum was reached. For example: a measurement taken 2 days before the maximum is then coded -2, and a measurement 5 days after the maximum is coded as 5.
Here is an example of what I'm aiming for (attached as an image). I highlighted the days on which the maximum value of G was reached in the different groups; the column 'New' is what I'm trying to get.
I've been trying with dplyr, and I managed to get the maximum for each group with group_by, arrange(desc), slice. I then recoded those maxima to zero and joined that data frame with my original data frame. However, I cannot manage to produce the 'sequence' of days leading up to / following the maximum.
EDIT: sorry I didn't include a reprex. This is the code I've used so far.
To find the maximum value, first order by date:
data <- data[with(data, order(G, Date)),]
Find maximum and join with original data:
data2 <- data %>%
  dplyr::group_by(Group) %>%
  arrange(desc(c(G)), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()
data2$New <- data2$G
data2 <- data2 %>%
  dplyr::select(c("ID", "New", "Date"))
data3 <- full_join(data, data2, by = c("ID", "Date"))
data3$New[!is.na(data3$New)] <- 0
This gives me the maxima coded as zero and all the other measurements in column New as NA, but not yet the number of days leading up to or since the maximum. I have no idea how to get to that.
It would help if you could provide the data using dput() in your question, as opposed to an image.
It looked like you wanted to group_by(Group) in your example to compute the number of days before and after the maximum date in a Group. However, you have an ID of 3 with a Group of A, which suggests otherwise and could maybe be clarified.
Here is one approach using the tidyverse that I hope will be helpful. After grouping and arranging by Date, you can take the difference between each Date and the Date where G is at its maximum (the first maximum detected in date order).
Also note that as.numeric is included to return a plain number, since the difference of two dates is a difftime (e.g., "7 days").
library(tidyverse)

data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)]))
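Since the original data only exists as an image, here is a hypothetical frame in the described shape to make the answer reproducible (all values invented):

library(tidyverse)

data <- tibble(
  Group = rep(c("A", "B"), each = 4),
  ID    = rep(1:4, each = 2),
  Date  = as.Date("2021-01-01") + c(0, 2, 5, 7, 0, 1, 3, 6),
  G     = c(10, 50, 90, 40, 20, 80, 30, 10)
)

# Within each Group, days relative to the first date where G peaks:
# group A peaks on day 5, so New = -5, -3, 0, 2
data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)]))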

How do I reformat a data set to have this particular structure without for loops?

I am trying to reorganize some raw data into a more condensed form. Currently the data looks like the output from the R code below. I would like the final output to have columns for time, ID, and all possible desired prices. Each ID should then have only one row per time, with the quantity filled in at the different desired prices (i.e., how many units an ID wants at a particular price during that time). So, for example, a particular ID might have a quantity of 1 at 100 and a quantity of 2 at 101. If it is a buy, the value should be negative; if it is a sell, positive. For example, -1 for a buy at 100 and 2 for a sell at 101.
I originally tried a double for loop, with the outer loop over time and the inner loop over ID. I looked at the quantity column and desired price for each ID, put them into a vector, combined all the vectors together for that time, and repeated. In practice this was not feasible: the code was far too slow, as there are hundreds of IDs and thousands of times.
Can someone help me do this in a faster and cleaner way?
set.seed(1)
time <- rep(seq(1, 5), each = 15)
id <- sample(342:450, 75, replace = TRUE)
price <- sample(99:103, 75, replace = TRUE)
Desire.Price <- sample(97:105, 75, replace = TRUE)
quantity <- sample(1:4, 75, replace = TRUE)
data <- data.frame(time = time, id = id, price = price,
                   Desire.Price = Desire.Price, quantity = quantity)
data$buysell <- ifelse(data$Desire.Price <= data$price, "BUY", "SELL")
I expect the final data set would look something like this.
# check.names = FALSE keeps the numeric column names (otherwise they become X97, X98, ...)
Final.df <- data.frame(time = NA, id = NA, "97" = NA, "98" = NA, "99" = NA,
                       "100" = NA, "101" = NA, "102" = NA, "103" = NA,
                       "104" = NA, "105" = NA, check.names = FALSE)
It would basically condense the original raw data to have all the information for a particular ID in a row during each time period.
Edit: If an ID did not get sampled in a given time (for example, ID 342 is not in time 1), it should have a row of NAs for that time period. I edited the code that generates the samples to include more ids, so that they can't all possibly be sampled in every time period.
Here's a tidyverse approach. First, make quantity signed based on BUY/SELL, then sum quantity for each id / time / Desire.Price, then spread those into wide format with a column for each Desire.Price.
library(dplyr); library(tidyr)

data %>%
  mutate(quantity_signed = if_else(buysell == "BUY", -quantity, quantity)) %>%
  count(id, time, Desire.Price, wt = quantity_signed) %>%
  complete(id, time) %>%  # EDIT: bring in all times for all id's
  spread(Desire.Price, n) %>%
  View("output")
I think this approach is comparatively simple:

library(reshape2)

# Turn BUY quantity values negative
data[which(data$buysell == "BUY"), ]$quantity <- -(data[which(data$buysell == "BUY"), ]$quantity)

# Use dcast to spread Desire.Price into columns, summing quantities
final.df <- dcast(data, time + id ~ Desire.Price, fun.aggregate = sum, value.var = 'quantity')
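Neither snippet fills in rows for ids that never appear in a given time period (the edit to the question). A hedged sketch of one way to add them to the reshape2 result, using a full id-by-time grid (final.complete is a made-up name):

# Every id/time combination, left-joined onto the wide result,
# so ids missing from a period get a row of NAs
grid <- expand.grid(time = unique(data$time), id = unique(data$id))
final.complete <- merge(grid, final.df, by = c("time", "id"), all.x = TRUE)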

Assigning a variable in one dataset to multiple fields in another dataset

I'm trying to assign a variable in one dataframe into multiple rows of another dataframe - namely the AWND variable here (average wind speed).
I'm trying to obtain the AWND from one data frame (dfWeather, shown in a screenshot) and match it to multiple rows of another data frame (dfFlight) based on the date (also shown in a screenshot).
Here's what I've tried so far.
dfNew <- merge(dfWeather, dfFlight, by="DATE")
I'm not sure how to proceed with this.
Should I do a join?
EDIT: Here's the data: https://shrib.com/#-7dXevTkb12Bt6Kdfxim (this is the dput output of the data I am getting AWND from).
I got the flights data (that I am trying to match dates with) from the nycflights13 package, and then subset it to include only the carriers that had at least 1,000 flights depart from LaGuardia.
The flights data has the date-time class, as shown in your tibble. First, make sure that the elements you want to join on are the same, i.e. 2013-01-01 05:00:00 will not match 2013-01-01 in your dfWeather data.frame:
# Make sure dates match between data.frames
dfFlight$DATE <- stringr::str_extract(dfFlight$DATE, "\\S*")
# Join AWND wherever dates match to left-hand side
dfNew <- dplyr::left_join(dfFlight, dfWeather, by = "DATE")
I did assume some things about your data, since I couldn't fully see what you're working with from the screenshots. This is my first answer on Stack Overflow, so feel free to edit or leave me suggestions.
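As an alternative to the string extraction, converting both columns to the Date class should also line them up; a sketch, assuming dfFlight$DATE is a date-time and dfWeather$DATE parses as a date:

library(dplyr)

# Truncate the date-time to a calendar date, then join on the shared column
dfFlight$DATE  <- as.Date(dfFlight$DATE)
dfWeather$DATE <- as.Date(dfWeather$DATE)
dfNew <- left_join(dfFlight, dfWeather, by = "DATE")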

R: replace identical rows with average

I have data which looks like this:
patient day response
Bob "08/08/2011" 5
However, sometimes we have several responses for the same day from the same patient. I want to replace all such rows with a single row, where patient and day are whatever they are in those rows, and the response is their average.
So if we also had
patient day response
Bob "08/08/2011" 6
then we'd remove both these rows and replace them with
patient day response
Bob "08/08/2011" 5.5
How do I write code in R to do this for a data frame that spans tens of thousands of rows?
EDIT: I might need the code to generalize to several covariates. For example, apart from day, we might have "location"; then we'd only want to average the rows that correspond to the same patient on the same day at the same location.
The required output can be obtained by:
aggregate(a$response, by = list(patient = a$patient, day = a$day), FUN = mean)
You can do this with the dplyr package pretty easily:
library(dplyr)

df %>%
  group_by(patient, day) %>%
  summarize(response_avg = mean(response))
This groups by whatever variables you pass to group_by, so you can add more. I named the new variable "response_avg", but you can change that to what you want as well.
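For the edit about extra covariates, the grouping just takes more columns; a sketch, assuming the extra column is named location:

library(dplyr)

# Average only rows that share patient, day, AND location
df %>%
  group_by(patient, day, location) %>%
  summarize(response_avg = mean(response), .groups = "drop")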
Just to add a data.table solution, in case any reader is a data.table user:
library(data.table)
setDT(df)
df[, response := mean(response, na.rm = T), by = .(patient, day)]
df <- unique(df) # to remove duplicates
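One caveat with the replace-then-unique pattern: unique() only collapses rows whose remaining columns are also identical. A sketch of the aggregation form, which sidesteps that by building a fresh table:

library(data.table)
setDT(df)

# One row per patient/day; columns other than response are dropped
df_avg <- df[, .(response = mean(response, na.rm = TRUE)), by = .(patient, day)]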
