Getting the median by date using dplyr's summarise() in R - r

I have a data frame of integer-count observations listed by date and time interval. I want to find the median of these observations by date using the dplyr package. I've already formatted the date column correctly, and used group_by like so:
data.bydate <- group_by(data.raw, date)
When I use summarise() to find the median of each date group, all I'm getting are a bunch of zeroes. There are NA's in the data, so I've been stripping them with na.rm = TRUE.
data.median <- summarise(data.bydate, median = median(count, na.rm = TRUE))
Is there another way I should be doing this?

You can do something like this:
data.raw %>% group_by(date) %>% summarise(median = median(count, na.rm = TRUE))

It's possible that each group contains too many zeros. Check the number of unique values in each group to see whether zeros dominate. The code below shows the number of unique values and the total number of rows for the count variable in each group. Note that n() takes no arguments.
summarise(data.bydate, unique_code = n_distinct(count), total_count = n())
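To see how a median of zero can come about, here is a minimal sketch on made-up data (the `toy` data frame and its values are assumptions, not the asker's data). A group's median is 0 whenever more than half of its non-missing counts are 0.

```r
library(dplyr)

# Hypothetical data: two dates, four interval counts each, one NA
toy <- data.frame(
  date  = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01",
                    "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-02")),
  count = c(0L, 0L, 0L, 12L, 3L, NA, 5L, 8L)
)

check <- toy %>%
  group_by(date) %>%
  summarise(
    # as.numeric() keeps the result type the same in every group
    median    = median(as.numeric(count), na.rm = TRUE),
    n_zero    = sum(count == 0, na.rm = TRUE),  # zeros in the group
    n_nonmiss = sum(!is.na(count))              # usable values in the group
  )
check
```

If `n_zero` is more than half of `n_nonmiss` for a date, a median of 0 for that date is the correct answer rather than a bug.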

Here is an example of how I did this using dplyr:
data.median <- data.bydate %>% summarise(median = median(count, na.rm = TRUE))

Related

Trying to compute row wise median, and then group the values by higher and lower than the median

I'm trying to sort a data frame containing time series of stock betas. For each month, I would like to create two portfolios: one with the stocks whose beta is above the median, and another with the stocks whose beta is below it. I need to do this row-wise, as each row is a new month.
This is what I have tried so far, and it is not working. I've removed Names, which in our dataset is the variable containing dates. Then I converted the "N/A" strings to NA, and lastly I'm trying to get a new column with the median value of each row.
I have not started sorting yet and have no idea where to begin.
Thanks for any help.
My data looks like this:
OMXCB_END_MED <- OMXCB_END %>%
  select(-Names) %>%
  mutate(across(where(is.character), ~na_if(.,"N/A"))) %>%
  rowwise() %>%
  mutate(median = median(across(cols = everything()), na.rm = TRUE))
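One likely reason the attempt above fails is that `across()` returns a data frame rather than a plain vector, while `c_across()` is the dplyr helper meant for row-wise summaries. Columns that held "N/A" strings are also still character after `na_if()` and need converting to numeric. The following is a sketch on a made-up stand-in for the data (the `betas` data frame and its column names are assumptions):

```r
library(dplyr)

# Hypothetical stand-in for OMXCB_END: one row per month, betas read as text
betas <- data.frame(
  Names  = c("2020-01", "2020-02"),
  StockA = c("0.8", "1.1"),
  StockB = c("N/A", "0.9"),
  StockC = c("1.4", "1.3"),
  stringsAsFactors = FALSE
)

betas_med <- betas %>%
  # turn "N/A" into NA, then convert the text columns to numeric
  mutate(across(-Names, ~ as.numeric(na_if(.x, "N/A")))) %>%
  rowwise() %>%
  # c_across() collects the selected columns of the current row into a vector
  mutate(median = median(c_across(-Names), na.rm = TRUE)) %>%
  ungroup()
betas_med
```

Keeping Names in the data and excluding it inside `c_across()` means the month label is still available when the stocks are later split into portfolios.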

Better way to apply which.max over dataframe

I'm trying to learn R while playing with a dataset from https://www.kaggle.com/abcsds/pokemon
data = read.csv("Pokemon.csv")
data$Name = sub(".*(Mega)", "Mega", data$Name) # replacing name duplications
I want to find all the Pokemon that have the maximum value in any column (Total, Attack, HP, etc.):
I know I can do this: sapply(data[5:11], max, na.rm = TRUE) to find out the max values and stuff like
data[which.max(data$Total),]
data[which.max(data$HP),]
data[which.max(data$Attack),]
to find all the rows that have a max.
Is there a way I can use something like sapply in order to get all the rows without going through them sequentially?
I believe this is what you want to achieve.
I use tidyverse for this. The data is in wide format, with a separate column for each stat, so I first convert it to long format using pivot_longer. Then I group_by the stats column and filter for the max of each group to get the desired result.
library(tidyverse)
# df is the Pokemon data frame (named data in the question)
df %>%
  select(c(2, 5:11)) %>%
  pivot_longer(-1, names_to = "stats") %>%
  group_by(stats) %>%
  filter(value == max(value))
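Since the question asks for something in the style of sapply, here is a base R sketch of the same idea using lapply over the stat columns. The `pokemon` data frame below is a made-up miniature, not the Kaggle file. Unlike which.max, comparing against max() keeps every row that ties for the maximum.

```r
# Hypothetical miniature of the Pokemon data
pokemon <- data.frame(
  Name   = c("Bulbasaur", "Chansey", "Mewtwo", "Blissey"),
  HP     = c(45, 250, 106, 255),
  Attack = c(49, 5, 110, 10),
  stringsAsFactors = FALSE
)

stat_cols <- c("HP", "Attack")

# For each stat column, keep every row that reaches that column's maximum
max_rows <- lapply(pokemon[stat_cols], function(col) {
  pokemon[which(col == max(col, na.rm = TRUE)), ]
})
max_rows
```

The result is a named list with one data frame per stat, so `max_rows$HP` holds the rows with the highest HP.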

How to Exclude Certain times from one column in r?

I want to find the mean of each column in the airquality dataset in RStudio, including only the rows where the month is July (7). I thought of using filter to subset the rows, assigning the result to a variable, and then finding the mean of each column.
Is there a way to do this in a single line command?
Here's my code so far:
apply(airquality[,1:6],2,mean, na.rm=TRUE)
This should work for you (the %>% pipe requires dplyr or magrittr to be loaded):
lapply(subset(airquality, Month == 7), mean, na.rm = TRUE) %>% unlist()
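A base R alternative that needs no extra packages is to subset the July rows and pass them to colMeans. This is a sketch of one option, not the only way to do it; the names `july` and `july_means` are made up for illustration.

```r
# airquality ships with R in the datasets package
july <- airquality[airquality$Month == 7, ]  # keep only the July rows
july_means <- colMeans(july, na.rm = TRUE)   # mean of every column, ignoring NA
july_means
```

The two steps can be collapsed into the single line `colMeans(airquality[airquality$Month == 7, ], na.rm = TRUE)`.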

How to find the mean and standard deviation of rows in dataframes with some having NAs and others not

I'm trying to find the mean and standard deviation for C and P separately.
I have toyed around with this so far:
C <- rowMeans(dplyr::select(total, C1:41), na.rm=TRUE)
This didn't yield what I needed it to.
Then I thought about just using the summary, but again it didn't give me what I needed.
So then I thought of using na.omit:
Of course though, this would take out all of the data since I have NAs throughout the dataframe.
What am I missing here? Is this a matter of aggregating my data into certain groups?
I know describeBy (from the psych package) could produce these descriptives, but again I'm not sure how to do that.
So, I think the approach I want to take is to order these, then aggregate and find totals, and then find the descriptives using describeBy in order to avoid NAs. I'm stuck, though. Where am I going wrong?
Try this:
library(dplyr)
total %>%
  # Select only columns that have S in their name
  # i.e. SP and SC
  select(starts_with('S')) %>%
  # Get the data in long format, remove NA values
  tidyr::pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  # Create a group for each participant
  group_by(grp = c('Participant1', 'Participant2')[grepl('C\\d+', name) + 1]) %>%
  # Take mean and standard deviation for each group
  summarise(mean = mean(value), sd = sd(value))

How do I aggregate certain columns from data frame by a Unique ID?

I have a list of Statcast data, per day, dating back to 2016. I am attempting to aggregate this data to find the mean for each pitcher ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by=list(pitchingstat$PitcherID),
FUN=mean, na.rm = TRUE)
This call aggregates every single column. I only want to aggregate certain columns.
How would I include only those columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add the summarise_at function like so:
pitchingstat %>%
  group_by(PitcherID) %>%
  summarise_at(vars(col1:coln), mean, na.rm = TRUE)
See the link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
Replace the first argument (pitchingstat) with the name of the column you want to aggregate (or a vector thereof).
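As a sketch of that last suggestion, the first argument to aggregate can be a subset of the data frame holding only the columns to average. The data and the column names `velo`, `spin` and `game_date` below are made up for illustration.

```r
# Hypothetical pitching data; column names are assumptions
pitchingstat <- data.frame(
  PitcherID = c(1, 1, 2, 2),
  velo      = c(92, 94, NA, 90),
  spin      = c(2200, 2300, 2100, 2150),
  game_date = c("2016-04-01", "2016-04-02", "2016-04-01", "2016-04-02")
)

keep <- c("velo", "spin")  # only these columns are averaged

aggpitch <- aggregate(pitchingstat[keep],
                      by = list(PitcherID = pitchingstat$PitcherID),
                      FUN = mean, na.rm = TRUE)
aggpitch
```

Naming the element of the `by` list gives the grouping column a readable name in the result instead of the default `Group.1`.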
How about this?
library(tidyverse)
aggpitch <- pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(variable)) # replace 'variable' with your variable of interest
or
library(tidyverse)
aggpitch <- pitchingstat %>%
  select(PitcherID, var_1, var_2) %>% # keep the grouping column as well
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(var_1),
            pitcher_mean2 = mean(var_2))
I think this works, but I could use a dummy example of your data to play with.
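In dplyr 1.0.0 and later, across() can apply the same function to several columns inside summarise() without writing one line per column. The following is a sketch on made-up data; the `pitches` data frame and its column names are assumptions, not the asker's data.

```r
library(dplyr)

# Hypothetical data: two pitchers, two numeric stats, one missing value
pitches <- data.frame(
  PitcherID = c(10, 10, 20, 20),
  velo      = c(92, 94, NA, 90),
  spin      = c(2200, 2300, 2100, 2150)
)

pitch_means <- pitches %>%
  group_by(PitcherID) %>%
  # average only the listed columns, ignoring missing values
  summarise(across(c(velo, spin), ~ mean(.x, na.rm = TRUE)))
pitch_means
```

The first argument of across() accepts the usual selection helpers, so a range such as `velo:spin` or `starts_with("v")` would also work there.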
