How to Exclude Certain times from one column in r? - r

I want to find the mean of each column in the dataset airquality in RStudio. I want to only include values where the month is July (7) and then find the mean. I thought of doing filter so I could filter through the column and assign it to a variable and then find the mean of each column.
Is there a way to do this in a single line command?
Here's my code so far:
apply(airquality[,1:6],2,mean, na.rm=TRUE)

This should work for you,lapply(subset(airquality, Month == 7), mean, na.rm = T) %>% unlist()

Related

Trying to compute row wise median, and then group the values by higher and lower than the median

Trying to sort a dataframe with time series of stock betas. Would like to create two portfolios, one with stocks that have higher than median beta, and another with lower than median beta for each month. I need to do this row-wise, as each row is a new month.
This is what i have tried so far. not working... I've removed Names which in our dataset is the variable containing dates. Then i have converted characters to NA, and lastly trying to get a new column with median values for each row.
Have not started sorting yet, no idea where to begin here.
Thanks for any help.
My data looks like this:
OMXCB_END_MED <- OMXCB_END %>%
select(-Names) %>%
mutate(across(where(is.character), ~na_if(.,"N/A"))) %>%
rowwise() %>%
mutate(median = median(across(cols = everything()), na.rm = TRUE))

Trying to use ddply to subset a dataframe by two column variables, then find the maximum of a third column in r?

I have a dataframe called data with variables for data, time, temperature, and a group number called Box #. I'm trying to subset the data to find the maximum temperature for each day, for each box, along with the time that temperature occurred at. Ideally I could place this data into a new dataframe with the date, time, maximum temperature and the time is occurred at.
I tried using ddply but was the code only returns one line of output
ddply(data, .('Box #', 'Date'), summarize, max('Temp'))
I was able to find the maximum temperatures for each day using tapply on separate dataframes that only contain the values for individual groups
mx_day_2 <- tapply(box2$Temp, box2$Date, max)
I was unable to apply this to the larger dataframe with all groups and cannot figure out how to also get time from this code.
Is it possible to have ddply subset by both Box # and Date, then return two separate outputs of both maximum temperature and time, or do I need to use a different function here?
Edit: I managed to get the maximum times using a version of the code in the answer below, but still haven't figured out how to find the time at which the max occurs in the same data. The code that worked for the first part was
max_data <- data %>%
group_by(data$'Box #', data$'Date')
max_values <- summarise(max_data, max_temp=max(Temp, na.rm=TRUE))
I would use dplyr/tidyverse in stead of plyr, it's an updated version of the package. And clean the column names with janitor: a space is difficult to work with (it changes 'Box #' to box_number).
library(tidyverse)
library(janitor)
mx_day2 <- data %>%
clean_names() %>%
group_by(date,box_number)%>%
summarise(max_temp=max(temp, na.rm=TRUE)
I found a solution that pulls full rows from the initial dataframe into a new dataframe based on only max values. Full code for the solution below
max_data_v2 <- data %>%
group_by(data$'Box #', data$'Date') %>%
filter(Temp == max(Temp, na.rm=TRUE))

how to average a set of columns and exclude other specific columns in R using the summarise command?

I'm breaking my head here with academic work. I have a data.frame with several numeric columns. I am using the command summarize and group_by in R to perform the average calculations of my data frame.
I tried with the code summarize (across (where (is.numeric), mean), -c(Mes, year_date), but it calculates the average of the entire data.frame and in addition, it creates a new column -c (Mes, year_date)), I would like some numeric columns to be excluded from the media calculation, but continue on the data.frame.
Note that I tried -c(Mes, year_date) to exclude these two columns from the average calculation, but it didn't work.
I tried
library(tidyr)
library(dplyr)
library(lubridate)
sample_station <-c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C','C','C','A','B','C','A','B','C')
Date_dmy <-c('01/01/2000','08/08/2000','16/03/2001','22/09/2001','01/06/2002','05/01/2002','26/01/2002','16/02/2002','09/03/2002','30/03/2002','20/04/2002','04/01/2000','11/08/2000','19/03/2001','25/09/2001','04/06/2002','08/01/2002','29/01/2002','19/02/2002','12/03/2002','13/09/2001','08/01/2000','15/08/2000','23/03/2001','29/09/2001','08/06/2002','12/01/2002','02/02/2002','23/02/2002','16/03/2002','06/04/2002','01/02/2000','01/02/2000','01/02/2000','02/11/2001','02/11/2001','02/11/2001')
temperature <-c(17,20,24,19,17,19,23,26,19,19,21,15,23,18,22,22,23,18,19,26,21,22,23,27,19,19,21,23,24,25,26,29,30,21,25,24,23)
wind_speed<-c(3.001,6.332,9.321,10.9091,6.38,10.5882,10.5,10.4348,10.3846,10.3448,10.3125,8.35,10.2632,10.2439,10.2273,10.2128,10.2,10.1887,10.1786,12,10.1613,10.1538,10.1471,10.1408,10.1351,10.1299,10.125,2.36,10.1163,10.1124,10.1087,11.2,10.102,10.099,10.0962,10.0935,10.0909)
esp<-c(11.6,11.3,11,10.7,10.4,10.1,9.8,9.5,9.2,8.9,8.6,8.3,8,11.2,10.9,10.6,10.3,10,12.8,12.5,12.2,11.9,11.6,11.3,11,4.36,4.06,3.76,3.46,3.16,2.86,2.56,2.26,1.96,1.66,1.36,23)
volum<-c(300,300,300,300,300,300,300,300,250,250,250,250,250,250,400,400,400,400,400,105,105,105,105,105,105,105,105,105,105,81,81,81,81,81,81,81,81)
df<-data.frame(sample_station, Date_dmy, temperature, wind_speed, esp, volum)%>%
mutate(Date_dmy = dmy(Date_dmy)) %>%
mutate(year_date = floor_date(Date_dmy,'year'))%>%
mutate(Ano=year(Date_dmy))%>%
mutate(Mes=month(Date_dmy))%>%
mutate(Epoca = ifelse(Mes %in% 4:9,'dry','rainy'))%>%
group_by(sample_station, Epoca, Ano)%>%
summarise(across(where(is.numeric), mean), -c(Mes, year_date))
I have several columns that I don't want to be averaged (even if they are numeric). For exemple, columns esp and volum.
update
Exit expectation
Because you are summarising only part of the data, you need to specify what data (rows) of the un-summarised data you want to maintain. In your example, you don't want to summarise Mes and year_date, however you have multiple values within each group (sample_station, Epoca, Ano), of these Mes and year_date columns.
Which values of these unsummarised columns do you want to keep?
If you want to keep all values of the unsummarised columns, you may want to include Mes and year_date inside group_by(sample_station, Epoca, Ano) before summarising.
Alternatively, you may use mutate() rather than summarise() to get summary values in a new column for each row of the original dataframe, then choose your rows from there.
Update:
Again, with your edited post including desired output, what values do you expect for Mes. For example, when sample_station == 'A', Epoca == 'rainy' and Ano == 2000, you have values for Mes of 1 & 2, and the same year_date. summarise() wants to calculate one single summary value for this group.
You can use across(c(where(is.numeric), -Mes). Note that year_date is not included in the calculation as it is not of class numeric and also because it is included in group_by.
You can also combine multiple mutate statements into one.
If you want to exclude certain columns from the average calculation but want to keep it in the dataframe you need to decide which value do you want to keep. For example, to keep the 1st value you can use first.
library(dplyr)
library(lubridate)
data.frame(sample_station, Date_dmy, temperature, wind_speed)%>%
mutate(Date_dmy = dmy(Date_dmy),
year_date = floor_date(Date_dmy,'year'),
Ano=year(Date_dmy),
Mes=month(Date_dmy),
Epoca = ifelse(Mes %in% 4:9,'dry','rainy')) %>%
group_by(sample_station, year_date, Epoca) %>%
summarise(across(c(where(is.numeric), -Mes), mean),
across(Mes, first))

How do I aggregate certain columns from data frame by a Unique ID?

I have a list of statcast data, per day dating back to 2016. I am attempting to aggregate this data for finding the mean for each pitching ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by=list(pitchingstat$PitcherID),
FUN=mean, na.rm = TRUE)
This function aggregates every single column. I am looking to only aggregate a certain amount of columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add summarise_at function like so:
pitchingstat %>%
group_by(PitcherID) %>%
summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
Replace the first argument (pitchingstat) with the name of the column you want to aggregate (or a vector thereof)
How about?:
library(tidyverse)
aggpitch <- pitchingstat %>%
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(variable)) #replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
select(var_1, var_2)
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(var_1),
pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.

Getting the median by date using dplyr's summarise() in R

I have a data frame of integer-count observations listed by date and time interval. I want to find the median of these observations by date using the dplyr package. I've already formatted the date column correctly, and used group_by like so:
data.bydate <- group_by(data.raw, date)
When I use summarise() to find the median of each date group, all I'm getting are a bunch of zeroes. There are NA's in the data, so I've been stripping them with na.rm = TRUE.
data.median <- summarise(data.bydate, median = median(count, na.rm = TRUE)
Is there another way I should be doing this?
You can do something like,
data.raw %>% group_by(date) %>% summarise(median = median(count, na.rm = TRUE))
It's possible each group has too many zero values. Try to identify number of unique value in each group to check whether the groups have too many zeros in them. The below code could help to see the number of unique values and total values available for count variable in each group.
summarise(data.bydate, unique_code = n_distinct(count), total_count = n(count))
example how I made this using dplyr
data.median<-data.bydate%>% summarise(median = median(count, na.rm = TRUE))

Resources