I have a df consisting of daily returns for various maturities. The first column consists of dates and the next 12 are maturities. I want to create a new df that calculates the difference in consecutive daily rates. Not sure where to start.
With multiple columns, diff can be applied
rbind(0, diff(as.matrix(df[-1])))
Or we can use dplyr
library(dplyr)
df %>%
mutate_at(vars(-Date), ~ . - lag(.))
Reproducible example
diff(as.matrix(head(mtcars)))
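As a minimal sketch of how the result could be put back next to the dates, assuming a toy data frame (the names rates, m1 and m2 are made up for illustration, and the first row is padded with NA rather than 0 because it has no previous day to compare with):

```r
# Hypothetical toy data: a Date column plus two maturity columns.
rates <- data.frame(
  Date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03")),
  m1   = c(1.0, 1.5, 1.2),
  m2   = c(2.0, 2.1, 2.6)
)

# Day-over-day differences of every column except Date.
# The first row has no predecessor, so it is padded with NA.
changes <- data.frame(
  Date = rates$Date,
  rbind(NA, diff(as.matrix(rates[-1])))
)
changes
```

Padding with 0 instead, as in rbind(0, ...) above, also keeps the row count but makes the first day look like "no change", so which one to use depends on what you do with the result.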
In the future, please refrain from posting just a picture and provide a reprex instead!
Here is one way to get what you're looking for:
library(dplyr)
df <-
data.frame(
dates= c("2019-01-01", "2019-01-02", "2019-01-03"),
original_numbers = c(1,2,3)
)
df2 <- df %>%
mutate(
difference = original_numbers - lag(original_numbers)
)
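Since the question has 12 maturity columns rather than one, here is a rough sketch of the same "value minus previous value" step applied to several columns at once with across(). The column names rate_a and rate_b are invented for the example, and it assumes a dplyr version that provides across() (1.0.0 or later):

```r
library(dplyr)

# Hypothetical data with two rate columns; the names are illustrative only.
df <- data.frame(
  dates  = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03")),
  rate_a = c(1, 2, 4),
  rate_b = c(10, 7, 8)
)

# Subtract the previous row's value in every column except dates.
# lag() returns NA for the first row, so the first difference is NA.
df_diff <- df %>%
  mutate(across(-dates, ~ .x - lag(.x)))
df_diff
```

This overwrites the original columns with their differences; the .names argument of across() could be used if you want to keep both.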
I'm trying to collapse all rows with the same date/time into one row and add a "count" column. That way I'll get two columns: one for the date/time and one for the count.
I used this call to round my observations into 15-minute time periods:
dat$by15 <- cut(dat$Date_Time, breaks = "15 min")
I then tried this call, but it seems to be "jumping" to a previous dataset and gives me the wrong observations for some reason:
dat <- aggregate(dat, by = list(dat$by15), length )
Thank you guys!
I'm not sure I understood the question, but if you are trying to group by date and count the observations for each date, it's really simple:
library(dplyr)
grouped_dates <- dat %>%
group_by(Date_Time) %>%
summarise(Count = n())
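If the goal is a count per 15-minute period rather than per exact timestamp, the grouping would have to use the binned column instead of Date_Time. A small sketch under that assumption, with made-up timestamps standing in for the real data:

```r
library(dplyr)

# Hypothetical timestamps standing in for dat$Date_Time.
dat <- data.frame(
  Date_Time = as.POSIXct(
    c("2019-01-01 10:02:00", "2019-01-01 10:07:00",
      "2019-01-01 10:16:00", "2019-01-01 10:31:00",
      "2019-01-01 10:44:00"),
    tz = "UTC"
  )
)

# Bin each timestamp into a 15-minute interval, then count rows per bin.
counts <- dat %>%
  mutate(by15 = cut(Date_Time, breaks = "15 min")) %>%
  count(by15, name = "Count")
counts
```

This gives exactly two columns, the interval label and the count. By contrast, aggregate(dat, by = list(dat$by15), length) applies length to every column of dat, which is probably why the output in the question looked odd.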
I am using the code below to group by month and then sum or count. However, the SLARespond column seems to contain the sum for the whole data set, not for each month.
Is there any way I can fix the problem?
Also, instead of the sum function, can I count the rows where SLAIncident$IsSlaRespondByViolated == 1?
I appreciate any help!
SLAIncident <- SLAIncident %>%
mutate(month = format(SLAIncident$CreatedDateLocal, "%m"), year = format(SLAIncident$CreatedDateLocal, "%Y")) %>%
group_by(year, month) %>%
summarise(SLARespond = sum(SLAIncident$IsSlaRespondByViolated))
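If you could provide a small bit of the dataset to illustrate your example, that would be great. The whole-dataset total appears because SLAIncident$IsSlaRespondByViolated inside summarise() refers to the full column of the original data frame rather than to the current group, so use the bare column name instead. I would also make sure that your months/years are characters or factors so that dplyr can group on them. An ifelse function wrapped in a sum should fit your criteria for the second part of the question. I am using your code here to convert the dates into month and year, but I recommend lubridate for that.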
SLAIncident <- SLAIncident %>%
  mutate(month = as.character(format(CreatedDateLocal, "%m")),
         year = as.character(format(CreatedDateLocal, "%Y"))) %>%
  group_by(year, month) %>%
  # bare column names are evaluated per (year, month) group
  summarise(SLARespond = sum(IsSlaRespondByViolated),
            sla_1 = sum(ifelse(IsSlaRespondByViolated == 1, 1, 0)))
Also, as hinted at in the comments, these column names are really long and could use some tidying.
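To make the difference between a bare column name and data$column inside summarise() concrete, here is a small sketch on invented data (the data frame incidents and its values are hypothetical, only the column names follow the question):

```r
library(dplyr)

# Hypothetical stand-in for SLAIncident covering two months.
incidents <- data.frame(
  CreatedDateLocal = as.Date(c("2020-01-05", "2020-01-20",
                               "2020-02-03", "2020-02-14", "2020-02-28")),
  IsSlaRespondByViolated = c(1, 0, 1, 1, 0)
)

by_month <- incidents %>%
  mutate(month = format(CreatedDateLocal, "%m"),
         year  = format(CreatedDateLocal, "%Y")) %>%
  group_by(year, month) %>%
  summarise(
    # Bare column name: evaluated within each (year, month) group.
    violated   = sum(IsSlaRespondByViolated == 1),
    # data$column: ignores the grouping and uses the whole column.
    whole_data = sum(incidents$IsSlaRespondByViolated),
    .groups = "drop"
  )
by_month
```

Here violated differs by month (1 for January, 2 for February), while whole_data repeats the overall total of 3 in every row, which matches the symptom described in the question. Summing the logical comparison directly is a shorter alternative to the ifelse version.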
I have a data.frame with information on racing performance on horses. I have a variable Competition.year that has a "Total" row and then a row for each year the horse competed. I also have a variable Competition.age that describes the age the horses were in each specific year they competed.
I am trying to create a subsetted df based on each horse's best racing time and the age at which it was achieved. In the "Total" row, the racing time shown is the horse's best one. So I need to tell R: when the race time in a yearly row equals the time in the "Total" row, include the age from that yearly row in the new data frame. I am super new to R, so I have no idea where to even begin. I've tried some things I've seen in other questions, but I can't get it right. Any help would be much appreciated!
My df looks like this:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
The desired df should have 223 rows (since that is the total number of horses I have) with the columns Name, Competition.year == "Total", Time.record.auto.start, Time.record.volt.start and Competition.age.
Firstly, I had to change your sample data to make sure all 5 variables only had 14 observations each. I did this by removing the final NA in the Competition.age variable. I also had to swap around the 94.3 and 98.3 values in the Time.record.volt.start variable so that the values lined up with what was expected in the Total column for the horse with Name equal to Pelson Poika.
Here is the corrected data:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
And here is a simple dplyr solution, which I think does what you want.
library(dplyr)
df1 <- travdata %>%
  group_by(Name) %>%
  filter(Competition.year == "Total") %>%
  select(Name, Time.record.auto.start, Time.record.volt.start)
df2 <- travdata %>% filter(Competition.year != "Total")
df3 <-
inner_join(
df1,
df2,
by = c(
"Name" = "Name",
"Time.record.auto.start" = "Time.record.auto.start",
"Time.record.volt.start" = "Time.record.volt.start"
)
)
The data frame df3 should contain what you were after.
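The idea behind the join can be seen more easily on a cut-down example. The sketch below uses a hypothetical miniature of the data with a single record column (horses, Time.record and the values are invented for illustration):

```r
library(dplyr)

# Hypothetical miniature of travdata with one record column.
horses <- data.frame(
  Name             = c("A", "A", "A", "B", "B"),
  Competition.year = c("Total", "2004", "2005", "Total", "2003"),
  Time.record      = c(93.5, 93.5, 98.4, 96.5, 96.5),
  Competition.age  = c(NA, 7, 8, NA, 4)
)

# Best time per horse, taken from the "Total" rows.
totals <- horses %>%
  filter(Competition.year == "Total") %>%
  select(Name, Time.record)

# The per-year rows, which carry the age.
yearly <- horses %>%
  filter(Competition.year != "Total")

# Keep only the yearly rows whose time equals that horse's best time.
best <- inner_join(totals, yearly, by = c("Name", "Time.record"))
best
```

Each horse ends up with the year and age at which its best time was set. Note that a horse that hit the same best time in more than one year would get one row per matching year, so the row count can exceed the number of horses.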
I am trying to clean my stocks df and I need to get rid of the stocks that have fewer than 10 observations per month.
Already checked these 2 threads:
subsetting-based-on-observations-in-a-month
and ddply-for-sum-by-group-in-r
But I'm a noob and I cannot figure it out yet.
In short: please help me eliminate the IDs (stocks) whose observations per month are <10 (for any month, if possible). They are identified by the permanent number from CRSP (permno).
Here is the df: Lessthan10days.csv
Thank you so much,
Leo
We could create a column 'MonthYr' from the 'date' column after converting it to 'Date' class. Get the number of observations ('n') per group ('permno', 'MonthYr') and use that to remove the IDs ('permno') that have at least one 'n' less than 10.
library(dplyr)
res <- df1 %>%
mutate(MonthYr=format(as.Date(date, format='%m/%d/%Y'), '%Y-%m')) %>%
group_by(permno, MonthYr) %>%
mutate(n=n()) %>%
group_by(permno) %>%
filter(all(n>=10))
all(res$n>=10)
#[1] TRUE
tbl <-table(res$permno, res$MonthYr)
all(tbl[tbl!=0]>=10)
#[1] TRUE
Or, using a similar approach with data.table
library(data.table)
setDT(df1)[, N := .N, by = .(permno, MonthYr = format(as.Date(date,
format = '%m/%d/%Y'), '%Y-%m'))][, if (all(N >= 10)) .SD, by = permno]
data
df1 <- read.csv('Lessthan10days.csv', header=TRUE, stringsAsFactors=FALSE)
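For anyone without the CSV, the dplyr version can be tried on a small made-up data set. The sketch below assumes the same column names and date format as above; the repeated dates are unrealistic and only serve to control the number of rows per month:

```r
library(dplyr)

# Hypothetical data: permno 1 has 10 rows in each of two months,
# permno 2 has 10 rows in January but only 3 in February.
toy <- data.frame(
  permno = c(rep(1, 20), rep(2, 13)),
  date   = c(rep("01/15/2015", 10), rep("02/15/2015", 10),
             rep("01/15/2015", 10), rep("02/15/2015", 3))
)

kept <- toy %>%
  mutate(MonthYr = format(as.Date(date, format = "%m/%d/%Y"), "%Y-%m")) %>%
  group_by(permno, MonthYr) %>%
  mutate(n = n()) %>%            # observations per stock and month
  group_by(permno) %>%
  filter(all(n >= 10)) %>%       # keep a stock only if every month has >= 10
  ungroup()

unique(kept$permno)
```

Only permno 1 survives, because a single short month is enough to drop permno 2 entirely, including its complete January.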
I'd just like to add that the following commands only partially work:
library(dplyr)
res <- df1 %>%
mutate(MonthYr=format(as.Date(date, format='%m/%d/%Y'), '%Y-%m')) %>%
group_by(permno, MonthYr) %>%
mutate(n=n()) %>%
group_by(permno) %>%
filter(all(n>=10))
all(res$n>=10)
#[1] TRUE
tbl <-table(res$permno, res$MonthYr)
all(tbl[tbl!=0]>=10)
#[1] TRUE
They do not perfectly clean the sample. I believe that some NA values are counted as observations, so they might 'escape' the subsetting/cleaning.
Therefore I did it manually to be sure. One suggestion would be to use just:
tbl <- table(res$permno, res$MonthYr)
write.csv(tbl, "tbl.csv")
Then you can look through the spreadsheet yourself and clean out the entries with obs<10 (for each year/stock).
On top of that, you can filter out the NA values for Price and remove the 5-10 stocks (ids) that have a couple of months with <10 observations.
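A rough sketch of how the NA filtering could be folded into the same pipeline instead of being done by hand. It assumes the price column is literally called Price and that missing prices are stored as NA, neither of which I can verify from the file, and it uses invented data:

```r
library(dplyr)

# Hypothetical data: one stock with 12 rows in a month, 3 of them
# with a missing price. The column name Price is an assumption.
toy <- data.frame(
  permno = rep(1, 12),
  date   = rep("01/15/2015", 12),
  Price  = c(rep(10, 9), NA, NA, NA)
)

kept <- toy %>%
  filter(!is.na(Price)) %>%      # drop missing prices before counting
  mutate(MonthYr = format(as.Date(date, format = "%m/%d/%Y"), "%Y-%m")) %>%
  group_by(permno, MonthYr) %>%
  mutate(n = n()) %>%
  group_by(permno) %>%
  filter(all(n >= 10)) %>%
  ungroup()

nrow(kept)
```

With the NA rows removed first, the stock has only 9 valid observations in that month and is dropped, whereas counting all 12 rows would have let it through.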
Hope this helps a bit. Thanks again for your help!