subset my df provided that each ID has >10 obs a month - r

I am trying to clean my stocks df and I need to get rid of the ones that have fewer than 10 observations per month.
Already checked these 2 threads:
subsetting-based-on-observations-in-a-month
and ddply-for-sum-by-group-in-r
But I'm a noob and I cannot figure it out yet.
In short: please help me eliminate the IDs (stocks) whose observations per month are <10 (for any month, if possible). They are identified via the permanent number from CRSP (permno).
Here is the df: Lessthan10days.csv
Thank you so much,
Leo

We could create a column 'MonthYr' from the 'date' column after converting it to 'Date' class. Get the number of observations ('n') per group ('permno', 'MonthYr') and use that to remove the IDs ('permno') that have at least one 'n' less than 10.
library(dplyr)
res <- df1 %>%
  # convert 'date' to Date class and extract the year-month
  mutate(MonthYr = format(as.Date(date, format = '%m/%d/%Y'), '%Y-%m')) %>%
  # count observations per stock per month
  group_by(permno, MonthYr) %>%
  mutate(n = n()) %>%
  # keep only the stocks for which every month has at least 10 observations
  group_by(permno) %>%
  filter(all(n >= 10))

# sanity checks
all(res$n >= 10)
#[1] TRUE
tbl <- table(res$permno, res$MonthYr)
all(tbl[tbl != 0] >= 10)
#[1] TRUE
Or, using a similar approach with data.table:
library(data.table)
# count observations per stock per month, then keep only the stocks
# for which every month has at least 10 observations
setDT(df1)[, N := .N, by = .(permno, MonthYr = format(as.Date(date,
    format = '%m/%d/%Y'), '%Y-%m'))]
res2 <- df1[, if (all(N >= 10)) .SD, by = permno]
data
df1 <- read.csv('Lessthan10days.csv', header=TRUE, stringsAsFactors=FALSE)

I'd just like to add that the following commands work only partially:
library(dplyr)
res <- df1 %>%
  mutate(MonthYr = format(as.Date(date, format = '%m/%d/%Y'), '%Y-%m')) %>%
  group_by(permno, MonthYr) %>%
  mutate(n = n()) %>%
  group_by(permno) %>%
  filter(all(n >= 10))
all(res$n >= 10)
#[1] TRUE
tbl <- table(res$permno, res$MonthYr)
all(tbl[tbl != 0] >= 10)
#[1] TRUE
They do not perfectly clean the sample: I believe some NA values are counted as observations, so they can 'escape' the subsetting/cleaning.
Therefore I did it manually to be sure. One suggestion would be to use just:
tbl <- table(res$permno, res$MonthYr)
write.csv(tbl, "tbl.csv")
Then you can look into the spreadsheet yourself and clean the months with obs < 10 (for each year/stock).
On top of that, you can filter out the NA values for Price and erase the 5-10 stocks (IDs) that have a couple of months with <10 observations, as in the sketch below.
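If the NA rows are the culprit, dropping them before counting should make the original pipeline behave. Here is a minimal sketch, assuming the price column is named Price (adjust the name to whatever your df uses):
library(dplyr)

# drop rows with missing prices first, so NA values are not counted
# as observations; 'Price' is an assumed column name
res <- df1 %>%
  filter(!is.na(Price)) %>%
  mutate(MonthYr = format(as.Date(date, format = '%m/%d/%Y'), '%Y-%m')) %>%
  group_by(permno, MonthYr) %>%
  mutate(n = n()) %>%
  group_by(permno) %>%
  filter(all(n >= 10))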
Hope this helps a bit. Thanks again for your help!

Related

Summarise multiple but not all columns

I have a dataset with 51 columns and I want to add summary rows for most of these variables. Currently columns 5:48 are various metrics, with each row being one area from one quarter; I am summing each metric over all quarters for each area and ultimately creating a rate. The code below works fine for one individual metric, but I need to run this for 44 different columns.
example <- test %>%
  group_by(Area) %>%
  summarise(`Metric 1` = (sum(`Metric 1`)) / (mean(Population)) * 10000) %>%
  bind_rows(test) %>%
  arrange(Area, -Quarter) %>% # sort so that total rows come after the quarters
  mutate(Timeframe = if_else(is.na(Quarter), '12 month rolling', 'Quarterly'))
I have tried creating a for loop using the column index values; however, that hasn't worked and just returns various errors. I've also been unable to get the above script working with index values; the below gives an error ('Error: unexpected '=' in: " group_by_at(Local_Authority) %>% summarise(u17_12mo[5]=').
example <- test %>%
  group_by_at(Area) %>%
  summarise(test[5] = (sum(test[5])) / (mean(test[4])) * 10000) %>%
  bind_rows(test) %>%
  arrange(Area, -Quarter) %>% # sort so that total rows come after the quarters
  mutate(Timeframe = if_else(is.na(Quarter), '12 month rolling', 'Quarterly'))
Any help on setting up a for loop for this, or another way entirely, would be great.
Without data, it's tough to help, but maybe this would work for you:
library(tidyverse)
example <- test %>%
  group_by(Area) %>%
  summarise(across(5:48, ~ (sum(.)) / (mean(Population)) * 10000))
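If referring to columns by position inside across() feels fragile, the same idea works with column names. A sketch under the question's assumption that the metrics sit in columns 5:48 of test:
library(dplyr)

# select the metric columns by name rather than by position;
# names(test)[5:48] is an assumption based on the question's description
metric_cols <- names(test)[5:48]
example <- test %>%
  group_by(Area) %>%
  summarise(across(all_of(metric_cols), ~ sum(.x) / mean(Population) * 10000))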

Iterative partial sum on rows with the same dates in R

I would like to do some computation on several rows in a table.
I created an example below:
library(dplyr)
set.seed(123)
year_week <- c(200045:200053, 200145:200152, 200245:200252)
input <- as.vector(sample(1:10, 25, TRUE))
partial_sum <- c(20, 12, 13, 18, 12, 13, 4, 15, 9, 13, 10, 20, 11, 9, 9, 5, 13, 13, 8, 13, 11, 15, 14, 7, 14)
df <- data.frame(year_week, input, partial_sum)
Given are the columns input and year_week. The latter represents dates, but the values are numeric in my case, with the first 4 digits as the year and the last two as the working week of that year.
What I need is to iterate over each week in each year and sum up the values from the same week in the other years, saving the results into a column called partial_sum here. The current value is excluded from the sum.
Week 53 in the leap year 2000 gets the same treatment, but in this case I have only one leap year, so its value of 3 doesn't change.
Any idea on how to make it?
Thank you
I would expect something like this to work, though as pointed out in the comments your example isn't exactly reproducible.
library(dplyr)
df %>%
  mutate(week = substr(year_week, 5, 6)) %>%
  group_by(week) %>%
  mutate(result = sum(input)) # note: this sums all rows in the week, including the current one
Perhaps this helps: group by 'week' (taken as a substring of 'year_week'), then take the difference between the sum of 'input' and 'input' itself, which excludes the current value from each sum.
library(dplyr)
df %>%
  group_by(week = substring(year_week, 5)) %>%
  mutate(partial_sum2 = sum(input) - input)
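For reference, the same exclude-the-current-value sum can be written in base R with ave(); a small sketch using the question's column names:
# group 'input' by week (the digits after the year in 'year_week') and
# subtract each value from its week's total
df$partial_sum2 <- with(df, ave(input, substring(year_week, 5),
                                FUN = function(x) sum(x) - x))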

r - Filter rows by a date that changes each day

The dataset is 1 column with thousands of rows, each containing a date such as "2021-09-23T06:38:53.458Z".
With the following code I am able to subset the rows from yesterday:
rows_from_yesterday <- df[df$timestamp %like% "2021-09-24", ]
It works like a charm! I would now like to automate the process, since I am not able to update the match criterion each day. How would one approach this? Any tips or suggestions?
Just to be clear: I would like the "2021-09-24" to be updated automatically to "2021-09-25" when tomorrow comes. I have tried the following:
rows_from_yesterday <- df[df$timestamp %like% as.character(Sys.Date()-1), ]
This is sadly without success.
If I understood correctly, you want to filter the observations from yesterday, right? If so, here is a solution:
library(dplyr)
library(lubridate)

x <- "2021-09-23T06:38:53.458Z"
df <- tibble(timestamp = x)

df %>%
  mutate(timestamp = ymd_hms(timestamp)) %>%
  # filter dates equal to yesterday
  filter(as_date(timestamp) == (today() - days(1)))
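If you'd rather stay close to your original %like% idea, comparing the date prefix of the string works in base R too; a sketch, assuming timestamp is stored as character in the ISO format shown above:
# extract the 'YYYY-MM-DD' prefix and compare it with yesterday's date
rows_from_yesterday <- df[substr(df$timestamp, 1, 10) == as.character(Sys.Date() - 1), ]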

How to identify number of duplicate rows in R (and remove)

I have a large dataframe in R (1.3 mil rows, 51 columns). I am not sure if there are any duplicate rows, but I want to find out. I tried using the duplicated() function, but it took too long and ended up freezing my RStudio. I don't need to know which entries are duplicates, I just want to delete the ones that are.
Does anyone know how to do this without it taking 20+ minutes and eventually not loading?
Thanks
I don't know how you used the duplicated function. It seems like this way should be relatively quick even if the dataframe is large (I've tested it on a dataframe with 1.4m rows and 32 columns: it took less than 2 min):
# keep only the first occurrence of each row; unlike df[-which(duplicated(df)), ],
# this is also safe when there are no duplicates at all
df[!duplicated(df), ]
The first extracts all rows whose 'col' value occurs more than once (duplicates, triples, and so on).
The second keeps only the rows whose 'col' value occurs exactly once.
duplication <- df %>% group_by(col) %>% filter(any(row_number() > 1))
unique_df <- df %>% group_by(col) %>% filter(!any(row_number() > 1))
You can use these too:
dup <- df[duplicated(df$col) | duplicated(df$col, fromLast = TRUE), ]
# note the parentheses: negate the whole condition to keep only unique values
uni_df <- df[!(duplicated(df$col) | duplicated(df$col, fromLast = TRUE)), ]
If you want to check for duplicates across the whole df, you can use this:
df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
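Given the 1.3m-row size concern, data.table may also be worth trying; unique() on a data.table drops duplicate rows quickly. A minimal sketch:
library(data.table)

# convert in place, then keep the first occurrence of each distinct row
setDT(df)
df_unique <- unique(df)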

Choose top n variables in R when matching values

I have a large timeseries dataset and would like to choose the top 10 observations from each date, based on the values in one of my columns.
I am able to do this using group_by(Date) %>% top_n(10)
However, if the values for the 10th and 11th observations are equal, then they are both picked, so I get 11 observations instead of 10.
Does anyone know what I can do to make sure that only 10 observations are chosen?
You can arrange the data and select the first 10 rows in each group.
library(dplyr)
df %>% arrange(Date, desc(col_name)) %>% group_by(Date) %>% slice(1:10)
Similarly, with filter:
df %>%
  arrange(Date, desc(col_name)) %>%
  group_by(Date) %>%
  filter(row_number() <= 10)
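In dplyr 1.0.0 and later, slice_max() does this in one step and has an explicit switch for ties; a sketch, assuming the ranking column is col_name:
library(dplyr)

# with_ties = FALSE guarantees exactly 10 rows per Date,
# breaking ties by row order
df %>%
  group_by(Date) %>%
  slice_max(col_name, n = 10, with_ties = FALSE)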
With data.table you can do
library(data.table)
setDT(df)
df[order(Date, desc(value))][, .SD[1:10], by = Date]
Change value to match the variable name used to choose which observation should be kept in case of ties. You can also do:
df[order(Date, desc(value))][, head(.SD,10), by = Date]
We can use base R:
df1 <- df[with(df, order(Date, -value)), ]
# number rows within each Date and keep the first 10
df1[with(df1, ave(seq_along(Date), Date, FUN = seq_along)) <= 10, ]
