I'm new to R and I have a data set of 22x252; the 252 have many repeated values in column 1 (ID). I made another dataset with one row per unique value (with those values already populated), and I want to populate the rest of its columns from the original dataset (basically summing all the values that share the same value in column 1).
Is there a basic function that enables me to do this?
Thanks & Regards
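We can use aggregate in base R. Assuming the first column is named 'ID' and all other columns are numeric, we group by 'ID' and get the sum of the remaining columns. Note that the formula method drops rows containing NA by default, so na.action = na.pass is needed for na.rm = TRUE to have any effect
aggregate(. ~ ID, df1, sum, na.rm = TRUE, na.action = na.pass)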
Or with dplyr
library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise_at(vars(-group_cols()), sum, na.rm = TRUE)
Or, with dplyr 1.0.0 or later, with across (which skips the grouping columns on its own)
df1 %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
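As a quick check, here is a minimal sketch of the base R route on made-up data. The data frame contents and the column names x and y are assumptions for illustration only; no packages are needed.

```r
# Hypothetical toy data: 'ID' repeats, the other columns are numeric
df1 <- data.frame(ID = c("a", "a", "b", "b", "b"),
                  x  = c(1, 2, 3, NA, 5),
                  y  = c(10, 20, 30, 40, 50))

# na.action = na.pass keeps rows that contain NA, so that
# na.rm = TRUE can drop the NA inside sum() instead
out <- aggregate(. ~ ID, data = df1, FUN = sum,
                 na.rm = TRUE, na.action = na.pass)
print(out)
```

The result has one row per unique ID, which is the shape the question asks for, so there is no need to build the table of unique IDs by hand first.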
Related
I was trying to figure out a way to transform all values in selected columns of my dataset using an equation $$x_i = x_{max} - x_i$$ using dplyr. I'm not sure how to correctly do this for one column, let alone multiple columns. My attempt at mutating 1 column:
df1 <- df %>% mutate(column1 = replace(column1, ., x = max(column1) - x)
My x = max(column1) - x part is not literal, I just want to know how I can implement that equation into all row entries in the column. Furthermore, how can I do this for multiple columns in the same line? Any help is appreciated. Thanks!
To replace all values in multiple columns, loop across the numeric columns and subtract each value from that column's max
library(dplyr)
df <- df %>%
  mutate(across(where(is.numeric), ~ max(., na.rm = TRUE) - .))
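For comparison, the same transformation can be sketched in base R. The data frame below and its column names (a, b, g) are made up for illustration; this is one possible equivalent, not the only way to write it.

```r
# Hypothetical data: two numeric columns and one character column
df <- data.frame(a = c(1, 5, 3),
                 b = c(10, NA, 4),
                 g = c("u", "v", "w"))

# Find the numeric columns, then replace each with (column max - value)
num <- vapply(df, is.numeric, logical(1))
df[num] <- lapply(df[num], function(x) max(x, na.rm = TRUE) - x)
print(df)
```

Non-numeric columns such as g are left untouched, and an NA input stays NA in the output.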
I have a tibble, with many observations and variables.
What I want to do is simply calculate (grouping by variable1 and variable2) the mean of variableXXX, and the total number of missing values for each group.
This is what I have written so far:
data%>%
group_by(variable1,variable2)%>%
summarise(mean(variableXXX))
How can I calculate the number of missing values for each group? I am new to R, so the easiest solution is better.
We can take the sum of the logical vector created with is.na
library(dplyr)
data %>%
  group_by(variable1, variable2) %>%
  summarise(Mean = mean(variableXXX, na.rm = TRUE),
            MissingCount = sum(is.na(variableXXX)))
NOTE: Assuming that we are interested in the count of NAs in the 'variableXXX' column grouped by 'variable1' and 'variable2'
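To see why this works: is.na returns a logical vector, and sum counts each TRUE as 1. A small base R sketch on made-up vectors (x and g are hypothetical stand-ins for variableXXX and a grouping variable), with tapply playing the role of group_by plus summarise:

```r
# Hypothetical values and a grouping vector
x <- c(1, NA, 3, NA)
g <- c("a", "a", "b", "b")

is.na(x)                                         # FALSE TRUE FALSE TRUE
missing_count <- tapply(is.na(x), g, sum)        # NA count per group
group_mean    <- tapply(x, g, mean, na.rm = TRUE)  # mean per group, ignoring NA
print(missing_count)
print(group_mean)
```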
If we need the NA count across the whole subset of the dataset (all columns, not just 'variableXXX'), split the data by group and count within each piece
library(purrr)
data %>%
  group_split(variable1, variable2) %>%
  map_dfr(~ .x %>%
            summarise(Mean = mean(variableXXX, na.rm = TRUE),
                      MissingCount = sum(is.na(.))))
There are three columns: website, Date ("%Y %m"), click_tracking (T/F). I would like to add a variable describing the number of websites whose click tracking = T in each month / the number of all websites in that month.
I thought the steps would be something like:
aggregate(sum(df$click_tracking = TRUE), by=list(Category=df$Date), FUN = sum)
as.data.frame(table(Date))
Then somehow loop through Date and divide the two variables above which would have been already grouped by Date. How can I achieve this? Many thanks!
If we are creating a column, then group by 'Date' and get the sum of 'click_tracking' (assuming it is a logical column, TRUE/FALSE) in mutate
library(dplyr)
df %>%
  group_by(Date) %>%
  mutate(countTRUE = sum(click_tracking))
If the column is a factor, convert it to logical with as.logical
df %>%
  group_by(Date) %>%
  mutate(countTRUE = sum(as.logical(click_tracking)))
If it is to create a summarised output
df %>%
  group_by(Date) %>%
  summarise(countTRUE = sum(click_tracking))
In the OP's code, = (assignment) is used instead of == in sum(df$click_tracking = TRUE), and there is no need to do a comparison on a logical column anyway. With aggregate, the count per 'Date' would be
aggregate(cbind(click_tracking = as.logical(click_tracking)) ~ Date, data = df, FUN = sum)
This will create the proportion of websites with click tracking (out of all websites) per month.
aggregate(click_tracking ~ Date, data = df, FUN = mean)
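The mean of a logical column is the share of TRUE values, which is exactly the ratio the question asks for. A minimal base R sketch on made-up data follows; the website names, dates, and the new column name prop_tracking are assumptions for illustration.

```r
# Hypothetical data: five websites over two months
df <- data.frame(website = c("s1", "s2", "s3", "s4", "s5"),
                 Date = c("2020 01", "2020 01", "2020 01", "2020 02", "2020 02"),
                 click_tracking = c(TRUE, FALSE, TRUE, FALSE, FALSE))

# One row per month: share of websites with click tracking
prop <- aggregate(click_tracking ~ Date, data = df, FUN = mean)

# Same value repeated on every row of the original data,
# similar in spirit to a grouped mutate()
df$prop_tracking <- ave(as.numeric(df$click_tracking), df$Date, FUN = mean)
print(prop)
print(df)
```

The first form gives a summary table, the second adds the proportion as a new variable, so no explicit loop over Date is needed.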
Title is self-explanatory. Looking to calculate percent NA by ID group in R. There are lots of posts on calculating NA by variable column but almost nothing on doing it by row groups.
If there are multiple columns, after grouping by 'ID', use summarise_at to loop over the columns, create a logical vector with is.na, get the mean, and multiply by 100
library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise_at(vars(-group_cols()), ~ 100 * mean(is.na(.)))
If we want to get the percentage across all other variables,
library(tidyr)
df1 %>%
  pivot_longer(cols = -ID) %>%
  group_by(ID) %>%
  summarise(Perc = 100 * mean(is.na(value)))
Or with aggregate from base R
aggregate(.~ ID, df1, FUN = function(x) 100 * mean(is.na(x)), na.action = na.pass)
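Or, to get the percentage across all of the other columns, unlist those columns, create a table from the resulting logical vector and the 'ID' column (repeated once per column), and use prop.table with margin = 1 to get the percentages within each 'ID'
other_cols <- setdiff(names(df1), "ID")
100 * prop.table(table(ID = rep(df1$ID, length(other_cols)),
                       value = is.na(unlist(df1[other_cols]))), margin = 1)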
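A small worked example may help to tell the two readings apart (percent NA per column within each ID, versus percent NA over all other columns within each ID). The data and the column names a and b are made up; this is only a sketch in base R.

```r
# Hypothetical data: two IDs, two value columns
df1 <- data.frame(ID = c(1, 1, 2, 2),
                  a  = c(NA, 1, 2, 3),
                  b  = c(NA, NA, 5, 6))

# Percent NA per column within each ID
per_col <- aggregate(. ~ ID, df1,
                     FUN = function(x) 100 * mean(is.na(x)),
                     na.action = na.pass)

# Percent NA over all non-ID columns within each ID:
# rowMeans() gives the NA share per row, tapply() averages it per ID
overall <- 100 * tapply(rowMeans(is.na(df1[c("a", "b")])), df1$ID, mean)
print(per_col)
print(overall)
```

For ID 1, column a is half missing and column b is fully missing, so the per-column result is 50 and 100 while the overall figure is 75.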
I have a list of statcast data, per day dating back to 2016. I am attempting to aggregate this data for finding the mean for each pitching ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by=list(pitchingstat$PitcherID),
FUN=mean, na.rm = TRUE)
This function aggregates every single column. I am looking to aggregate only a certain number of columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add summarise_at function like so:
pitchingstat %>%
  group_by(PitcherID) %>%
  summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
Replace the first argument (pitchingstat) with just the columns you want to aggregate, i.e. a subset of the data frame such as pitchingstat[c("col1", "col2")], rather than the whole data frame
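A minimal sketch of that suggestion, staying with the aggregate call from the question. The data and the column names var_1, var_2, and other are hypothetical.

```r
# Hypothetical data: two numeric columns of interest plus one to ignore
pitchingstat <- data.frame(PitcherID = c(1, 1, 2, 2),
                           var_1 = c(90, 92, 88, 86),
                           var_2 = c(2000, 2100, NA, 2200),
                           other = c("w", "x", "y", "z"))

# Pass only the wanted columns as the first argument;
# naming the list element gives the grouping column a readable name
aggpitch <- aggregate(pitchingstat[c("var_1", "var_2")],
                      by = list(PitcherID = pitchingstat$PitcherID),
                      FUN = mean, na.rm = TRUE)
print(aggpitch)
```

Columns left out of the subset, such as other here, never reach mean, which also avoids warnings from averaging non-numeric data.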
How about:
library(tidyverse)
aggpitch <- pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(variable)) # replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
  select(PitcherID, var_1, var_2) %>% # keep PitcherID so group_by() can find it
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(var_1),
            pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.