Choose top n variables in R when matching values - r

I have a large timeseries dataset, and would like to choose the top 10 observations from each date based one the values in one of my columns.
I am able to do this using group_by(Date) %>% top_n(10)
However, if the values for the 10th and 11th observation are equal, then they are both picked, so that I get 11 observations instead of 10.
Do anyone know what i can do to make sure that only 10 observations are chosen?

You can arrange the data and select first 10 rows in each group.
library(dplyr)
df %>% arrange(Date, desc(col_name)) %>% group_by(Date) %>% slice(1:10)
Similarly, with filter
df %>%
arrange(Date, desc(col_name)) %>%
group_by(Date) %>%
filter(row_number() <= 10)

With data.table you can do
library(data.table)
setDT(df)
df[order(Date, desc(value))][, .SD[1:10], by = Date]
Change value to match the variable name used to choose which observation should be kept in case of ties. You can also do:
df[order(Date, desc(value))][, head(.SD,10), by = Date]

We can use base R
df1 <- df[with(df, order(Date, -value)),]
df1[with(df1, ave(seq_along(Date), Date, FUN = function(x) x %in% 1:10)),]

Related

R filter or subset for finding a specific repeat count for data.frame

I want to use filter or subset from dplyr that will give a new dataframe only with rows in which for the selected column the value is counted exactly 2 times in the original data.frame
I try this:
df2 <-
df %>%
group_by(x) %>% mutate(duplicate = n()) %>%
filter(duplicate == 2)
and this
df2 <- subset(df,duplicated(x))
but neither option works
In the group_by, just use the unquoted column name. Also, we don't need to create a column in mutate before filtering. It can be directly done on the fly in filter
library(dplyr)
df %>%
group_by(x) %>%
filter(n() ==2) %>%
ungroup

Loop for aggregated data frame in R

I have a data frame with 58 columns labeled SD1 through to SD58 along with columns for date info (Date, Year, Month, Day).
I'm trying to find the date of the maximum value of each of the SD columns each year using the following code:
maxs<-aggregate(SD1~Year, data=SDtime, max)
SDMax<-merge(maxs,SDtime)
I only need the dates so I made a new df and relabeled the column as below:
SD1Max = subset(SDMax, select = c(Year, Date))
SD1Max %>%
rename(
SD1=Date
)
I want to do the same thing for every SD column but I don't want to have to repeat these steps 58 times. Is there a way to loop the process?
Assuming there are no ties (multiple days with where the variable reached its maximum) this probably does what you want:
library('tidyverse')
SDtime %>%
pivot_longer(
cols = matches('^SD[0-9]{1,2}$')
) %>%
group_by(name) %>%
filter(value == max(value, na.rm = TRUE)) %>%
ungroup()
You might want to pivot_wider afterwards.

dplyr count, sum and calculate the percentage and round to whole number using R

I kindly request for help in grouping by ID, counting non-zeros and presenting the results as the percentage of the total in that particular ID
My data
library(dplyr)
id<-c(1,1,1,1,1,2,2,2,2)
x<-c(0,1,0,1,0,0,0,1,0)
df1<-data.frame(id,x)
head(df1)
in my results, after grouping by id=1, then i need column with total for ones as 2 and another column with the precentage (2/5)= 40. For group id=2 then i need the total for the column as 1 and percentage as (1/4)=25
df1 %>%
group_by(id) %>%
summarise(sum_of_1 = sum(x!=0),
pct= round((sum_of_1/n())*100))
This one works. Thanks for the help Hulk
Try this-
df1 %>%
group_by(id) %>%
summarise(sum_of_1 = sum(x, na.rm = TRUE),
pct = round((sum_of_1/n())*100))

Counting the rows based on two other column values, and manipulate the value in a loop through one of these column values in R

There are three columns: website, Date ("%Y %m"), click_tracking (T/F). I would like to add a variable describing the number of websites whose click tracking = T in each month / the number of all website in that month.
I thought the steps would be something like:
aggregate(sum(df$click_tracking = TRUE), by=list(Category=df$Date), FUN = sum)
as.data.frame(table(Date))
Then somehow loop through Date and divide the two variables above which would have been already grouped by Date. How can I achieve this? Many thanks!
If we are creating a column, then do a group by 'Date' and get the sum of 'click_tracking' (assuming it is a logical column - TRUE/FALSE) iin mutate
library(dplyr)
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(click_tracking))
If the column is factor, convert to logical with as.logical
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(as.logical(click_tracking)))
If it is to create a summarised output
df %>%
group_by(Date) %>%
summarise(countTRUE = sum(click_tracking))
In the OP's code, = (assignment) is used instead of == in sum(df$click_tracking = TRUE) and there is no need to do a comparison on a logical column
aggregate(cbind(click_tracking = as.logical(click_tracking)) ~ Date, FUN = sum)
This will create the proportion of websites with click tracking (out of all websites) per month.
aggregate(data=df, click_tracking ~ Date, mean)

How to count grouped negative values by year

I have a data set of prices in which the first column is year and the rest of the columns are regions. I'm trying to count the number of negative values by year by region.
I've tried to use dplyr to group_by(year) then summarise_at() but I can't figure out the exact code to use.
Neg_Count <- select (BaseCase, Year, 'Hub1': 'Hub15')
Neg_Count <- Neg_Count %>%
group_by(Year) %>%
What is the best way to do this?
We can use summarise_at. Create a logical vector and get the sum
library(dplyr)
Neg_Count %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Hub")), ~ sum(. < 0))

Resources