I have a big data set that contains monthly returns of a given stock. I'd like to delete rows that do not belong to a full year of data. A subset of the data is shown below as an example:
Date Return Year
9/1/2009 0.71447 2009
10/1/2009 0.48417 2009
11/1/2009 0.90753 2009
12/1/2009 -0.7342 2009
1/1/2010 0.83293 2010
2/1/2010 0.18279 2010
3/1/2010 0.19416 2010
4/1/2010 0.38907 2010
5/1/2010 0.37834 2010
6/1/2010 0.6401 2010
7/1/2010 0.62079 2010
8/1/2010 0.42128 2010
9/1/2010 0.43117 2010
10/1/2010 0.42307 2010
11/1/2010 -0.1994 2010
12/1/2010 -0.2252 2010
Ideally, the code will remove the first four observations, since 2009 does not have a full year of observations.
The OP has asked to remove all rows from a large data set of monthly values that do not make up a full year. Although the solution suggested by Wen seems to be working for the OP, I would like to suggest a more robust approach.
Wen's solution counts the number of rows per year, assuming that there is exactly one row per month. It is more robust to count the number of unique months per year, in case there are duplicate entries in the production data set.
(From my experience, one cannot be careful enough when dealing with production data; it pays to check all assumptions.)
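The code below assumes the OP's data is already in a data.table called DT. For reproducibility, DT could be rebuilt from the posted sample roughly like this (a sketch; the OP's actual import may differ, and the Date column is parsed so that month() can be applied to it):
library(data.table)
# rebuild the sample data as a data.table
DT <- fread(text = "Date,Return,Year
9/1/2009,0.71447,2009
10/1/2009,0.48417,2009
11/1/2009,0.90753,2009
12/1/2009,-0.7342,2009
1/1/2010,0.83293,2010
2/1/2010,0.18279,2010
3/1/2010,0.19416,2010
4/1/2010,0.38907,2010
5/1/2010,0.37834,2010
6/1/2010,0.6401,2010
7/1/2010,0.62079,2010
8/1/2010,0.42128,2010
9/1/2010,0.43117,2010
10/1/2010,0.42307,2010
11/1/2010,-0.1994,2010
12/1/2010,-0.2252,2010")
# parse the character dates into IDate so month() works on them
DT[, Date := as.IDate(Date, format = "%m/%d/%Y")]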
library(data.table)
# count number of unique months per year,
# keep only complete years, omit counts
# result is a data.table with one column Year
full_years <- DT[, uniqueN(month(Date)), by = Year][V1 == 12L, -"V1"]
full_years
Year
1: 2010
# right join with original table, only rows belonging to a full year will be returned
DT[full_years, on = "Year"]
Date Return Year
1: 2010-01-01 0.83293 2010
2: 2010-02-01 0.18279 2010
3: 2010-03-01 0.19416 2010
4: 2010-04-01 0.38907 2010
5: 2010-05-01 0.37834 2010
6: 2010-06-01 0.64010 2010
7: 2010-07-01 0.62079 2010
8: 2010-08-01 0.42128 2010
9: 2010-09-01 0.43117 2010
10: 2010-10-01 0.42307 2010
11: 2010-11-01 -0.19940 2010
12: 2010-12-01 -0.22520 2010
Note that this approach avoids adding a count column to each row of a potentially large data set.
The code can be written more concisely as:
DT[DT[, uniqueN(month(Date)), by = Year][V1 == 12L, -"V1"], on = "Year"]
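For comparison, the count-column alternative mentioned above (a sketch, not a verbatim reproduction of Wen's answer; n_months is just a helper column name) attaches the count to every row before filtering:
# attach a per-row month count, keep complete years, then drop the helper column
copy(DT)[, n_months := uniqueN(month(Date)), by = Year][n_months == 12L][, n_months := NULL][]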
It is also possible to check the data for any duplicate months, e.g.,
stopifnot(all(DT[, .N, by = .(Year, month(Date))]$N == 1L))
This code counts the number of occurrences for each year and month and halts execution when there is more than one.
I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim would be to convert the Date column, which is currently in a numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry in the Date column, as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal would be to employ the plm package for panel data regression, but when applying the package and using pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be given the tag A, 2004). To solve this issue, I would like to delete row 1 in the original data and keep only the newer observation from the year 2004. This would then provide me with unique ID/Time pairs across the whole data set.
Therefore I was hoping for someone to help me out with a loop or a package suggestion with which I can keep only the row with the newer/later observation within a year, whenever this occurs, in a way that also works for larger data sets. I believe this involves a couple of conditional steps which I am having difficulty putting together. A loop that checks whether the first four digits of consecutive date observations are identical and then deletes the row with the "smaller" date (keeps the "larger" date) would probably do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference for picking the later observation and creating a new column that holds only the year, since you want the latest observation for each year.
library(dplyr)  # for %>%, arrange() and distinct()

Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)

Data %>% arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
Since we arranged by the actual date in descending order, the row dropped for each unique combination of ID and year is guaranteed to be the older one. You can reverse the arrangement to keep the oldest occurrence instead.
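For example, to keep the earliest observation per ID and year, the same pipeline can simply be arranged in ascending date order (a sketch under the same assumptions):
Data %>% arrange(Date_column) %>%
  distinct(ID_column, year, .keep_all = TRUE)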
I have some fantasy football data from my league: 12 teams x 8 years = 96 observations. I'm trying to create tibble(year, team, record). The team and record variables are organized correctly, but my year column is in the wrong order. Its current order is shown below; I need to reverse it so that 2019 starts at the top and 2012 is the last observation. Each value in the year column repeats 12 times since there are 12 teams. There are no NA values. Thanks in advance.
year team record
2012
2012
2012
2012
2012
2012
2012
2012
2012
2012
2012
2012
2013
2013
2013
.
.
.
2019
This was quite easy after all. I'll leave it here for others, and I'll accept any other answer that works. I just reversed year by index: year <- year[96:1], then did tibble(year, team, record).
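A slightly more general alternative that does not depend on there being exactly 96 rows is rev(), which reverses a vector of any length (a sketch, assuming year, team and record are already built as described):
year <- rev(year)   # reverses the vector regardless of its length
tibble::tibble(year, team, record)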
I have this data set:
month Year Rain
10 2010 376.8
11 2010 282.78
12 2010 324.58
1 2011 73.51
2 2011 225.89
3 2011 22.96
I used
df2prnext <- aggregate(Rain ~ Year, data = subdataprnext, mean)
but this returns one mean per calendar year, while I need the single mean value of 217.53 across all the months shown. I am not getting the expected result. Thank you for your help.
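Assuming the goal is a single mean across the whole October–March span rather than one mean per calendar year (an assumption, since no answer is included here; hydroYear is a hypothetical helper column), two sketches are:
# overall mean across all rows, ignoring the calendar-year split
mean(subdataprnext$Rain)

# or group by a hydrological year running Oct-Sep instead of the calendar year
subdataprnext$hydroYear <- subdataprnext$Year + (subdataprnext$month >= 10)
aggregate(Rain ~ hydroYear, data = subdataprnext, mean)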
I have a data frame that holds hourly observational climate data over multiple years; I have included a dummy data frame below that will hopefully illustrate my question.
# hourly timestamps for 2012 (8761 values)
dateTime <- seq(as.POSIXct("2012-01-01"),
                as.POSIXct("2012-12-31"),
                by = (60*60))
WS <- sample(0:20, 8761, rep = TRUE)    # wind speed
WD <- sample(0:390, 8761, rep = TRUE)   # wind direction
Temp <- sample(0:40, 8761, rep = TRUE)  # temperature
df <- data.frame(dateTime, WS, WD, Temp)
df$WS[WS > 15] <- NA                    # introduce missing wind speed values
I need to group by year (or in this example, by month) to find whether df$WS has 75% or more valid data for that month. My filtering criterion is NA, as 0 is still a valid observation; I have real NAs because it is observational climate data.
I have tried dplyr piping with the %>% operator to filter by a new column "Month", and have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something into a longer script, inside a looping function, that will go through all my stations and all the years at each station, and produce a wind rose if this criterion is met for that year/station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one appears quite instructive.
First create a new variable that denotes the month (and accounts for the year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of rows and multiply by 100 to get a percentage.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
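To turn these percentages into the 75%-or-more-valid check the question asks for, one additional line could be (a sketch using the same out list):
# TRUE for months where at least 75% of WS values are non-NA
sapply(out, function(x) mean(!is.na(x$WS)) >= 0.75)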
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
Here is a method using dplyr. It will work even if you have missing rows (not only NA values), because the maximum number of observations is derived from the number of days in each month rather than from the row count.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
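If only the months meeting the 75% criterion are needed, the result can then be filtered, e.g.:
df2 %>% filter(Obs.Rate >= 0.75)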
I am new to R and I want a new data set, derived from my data frame, that includes a new column representing the median of the values in an existing column (called TotalExtras). The data frame consists of around 5,000 individual observations.
I am a bit confused about how to proceed, as the median needs to be calculated based on the following grouping criteria: Property, Month, Year and Market.
Currently, my dataframe (let's call it mydata1) stands as follows (first 5 rows shown):
Property Date Month Year Market TotalExtras
ZIL 1-Jan-15 1 2015 UK 450.00
ZIL 1-Jan-15 1 2015 UK 125.00
ZIL 1-Feb-15 2 2015 UK 300.00
ZIL 1-Feb-16 2 2016 FR 225.00
EBA 1-Feb-15 2 2015 UK 150.00
...
I need my R codes to create a new dataframe (let's call it mydata2) to appear like below:
Property Date Month Year Market MedianTotalExtras
ZIL 1-Jan-15 1 2015 UK 175.00
ZIL 1-Feb-15 2 2015 UK 250.00
ZIL 1-Feb-16 2 2016 FR 400.00
EBA 1-Feb-15 2 2015 UK 328.00
...
The figures above are for illustration purposes only. Basically, mydata2 re-groups the data based on Property, Date and Market, with the column MedianTotalExtras replacing the TotalExtras column of mydata1.
Can this be done with R?
In dplyr the general gist will be something like:
library(dplyr)

mydata1 %>%
  group_by(Property, Date, Market) %>%
  summarise(MedianTotalExtras = median(TotalExtras))
where group_by splits the data set into pieces with unique Property, Date, Market combinations, and summarise with median() computes the median within each piece.
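If Month and Year should appear as explicit grouping columns, as the question's criteria suggest, a variant (a sketch, assuming the columns exist exactly as shown in mydata1) is:
mydata2 <- mydata1 %>%
  group_by(Property, Month, Year, Market) %>%
  summarise(MedianTotalExtras = median(TotalExtras), .groups = "drop")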