This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 4 years ago.
Without using a join or merge, I would like to add a mean(metric) column to this table that averages metric by sector.
symbol  sector         date_bom    recommendation  metric
A       Strip Center   20XX-08-01  BUY               0.01
B       Office Center  20XX-09-01  BUY               0.02
C       Strip Center   20XX-07-01  SELL             -0.01
I've tried a couple of things in dplyr, but it seems I need a group-by within the summarise clause, and that is not allowed.
If the goal is to add a column to the existing rows, use mutate instead of summarise
library(dplyr)
df1 %>%
group_by(sector) %>%
mutate(Mean = mean(metric))
It is also possible to create a list column in summarise and then unnest it, but that is not needed here. That approach is useful when the output length per group is neither 1 nor the number of rows in the group. Besides, summarise returns only the grouping column and the summarised column, dropping all other columns.
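To make the difference concrete, here is a minimal sketch that rebuilds the question's table as df1 (the data frame name comes from the answer, the rows from the question) and runs both verbs on it. The result names with_mean and per_sector are only illustrative.

```r
library(dplyr)

# Toy data copied from the table in the question
df1 <- data.frame(
  symbol = c("A", "B", "C"),
  sector = c("Strip Center", "Office Center", "Strip Center"),
  date_bom = c("20XX-08-01", "20XX-09-01", "20XX-07-01"),
  recommendation = c("BUY", "BUY", "SELL"),
  metric = c(0.01, 0.02, -0.01)
)

# mutate(): one row per original row, all original columns kept
with_mean <- df1 %>%
  group_by(sector) %>%
  mutate(Mean = mean(metric)) %>%
  ungroup()

# summarise(): one row per group, only sector and the summary remain
per_sector <- df1 %>%
  group_by(sector) %>%
  summarise(Mean = mean(metric))
```

with_mean still has three rows and the five original columns plus Mean, while per_sector has two rows and only sector and Mean.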
In base R, we can use ave for this kind of operation
df1$Mean <- with(df1, ave(metric, sector))
Note that ave has a FUN argument, but it defaults to mean, so it is not needed here
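As a small sketch of when FUN does matter, the snippet below uses a cut-down copy of the question's table and adds a group maximum next to the default group mean. The Max column is only an illustration and is not something the question asked for.

```r
# Cut-down copy of the question's table
df1 <- data.frame(
  sector = c("Strip Center", "Office Center", "Strip Center"),
  metric = c(0.01, 0.02, -0.01)
)

# Default: group mean, repeated on every row of the group
df1$Mean <- with(df1, ave(metric, sector))

# FUN must be passed by name, because unnamed arguments after the first
# are treated as additional grouping variables
df1$Max <- with(df1, ave(metric, sector, FUN = max))
```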
This question already has answers here:
filtering within the summarise function of dplyr
(3 answers)
Opposite of %in%: exclude rows with values specified in a vector
(13 answers)
Closed 3 months ago.
EDIT: I want to specify which values NOT to include in my calculation by providing a list of values for records to skip. I do NOT want to provide a list of values to include in my calculation because my dataset is too large.
I want to group records based on a certain value, and then I want to do some other calculations for certain variables; however, I want to exclude certain values from one of those calculations. Here is an example of what the data transformation would look like without any exclusions:
library(dplyr)
grouped <- starwars %>%
group_by(species) %>% #group my data by a particular value
summarise(Total_Mass = sum(mass), #make a calculation
Average_Height = mean(height)) # make another calculation
and here's what I am attempting to do:
exclude <- c("R2-D2","Luke","Darth") #make a list of the names of records I would like to exclude
grouped2 <- starwars %>%
group_by(species) %>%
summarise(Total_Mass = sum(mass) where name !%in% exclude, #sum mass for all records except those where name is in the exclude list
Average_Height = mean(height)) # make another calculation without any exclusions
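One way to get this effect, sketched here under the assumption that subsetting the vector inside the summary call is acceptable, is to index mass with a negated %in% test. The chars data frame is a hypothetical stand-in for starwars so that the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical stand-in for the starwars data
chars <- data.frame(
  name    = c("R2-D2", "C-3PO", "Luke", "Leia", "Darth"),
  species = c("Droid", "Droid", "Human", "Human", "Human"),
  mass    = c(32, 75, 77, 49, 136),
  height  = c(96, 167, 172, 150, 202)
)

exclude <- c("R2-D2", "Luke", "Darth")

grouped2 <- chars %>%
  group_by(species) %>%
  summarise(
    # keep only the mass values whose name is NOT in the exclude vector
    Total_Mass = sum(mass[!name %in% exclude]),
    # no exclusions for this calculation
    Average_Height = mean(height)
  )
```

Note that %in% matches whole strings, so "Luke" would not match "Luke Skywalker" in the real starwars data. That dataset also has missing values in mass and height, so na.rm = TRUE would likely be needed there.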
This question already has answers here:
How to sum a variable by group
(18 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 1 year ago.
This is my current code for this image: data[1:20,c("Job.Family", "Salaries", "Retirement")]. The goal here is to group all the same jobs in the Job.Family column together without losing any of the data associated with them. For example, I would like to find the sum of "Salaries" and "Retirement" for everyone in the "Information System" Job.Family. Hopefully this makes sense.
You are probably looking for some very basic subsetting and summarising operations here.
I strongly recommend you study the dplyr package.
Your example:
library(dplyr)
df %>% filter(Job.Family == "Information Systems") %>%
  summarise(across(c(Salaries, Retirement), sum))
You may want to calculate this for all groups, as in:
df %>% group_by(Job.Family) %>%
  summarise(across(c(Salaries, Retirement), sum))
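Since the question asks for totals per job family, here is a base R sketch of the same idea using aggregate. Only the column names come from the question; the rows and the result name totals are made up for illustration.

```r
# Hypothetical rows; only the column names come from the question
df <- data.frame(
  Job.Family = c("Information Systems", "Information Systems", "Nursing"),
  Salaries   = c(100, 150, 90),
  Retirement = c(20, 30, 15)
)

# cbind() on the left-hand side summarises both columns in one call
totals <- aggregate(cbind(Salaries, Retirement) ~ Job.Family,
                    data = df, FUN = sum)
```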
This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 3 years ago.
I have a large dataset called genetics which I need to break down. There are 4 columns, the first one is patientID that is sometimes duplicated, and 3 columns that describe the patients.
As said before, some of the patient IDs are duplicated and I want to know which ones, without losing the remaining columns.
dedupedGenID<- unique(Genetics$ID)
This only gives me the unique IDs, without the other columns.
In order to subset the df by those unique IDs I did
dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern = dedupedGenID), ]
This gives me an error of "longer object length is not a multiple of shorter object length" and the dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of 1837.
My questions are: how do I perform that subsetting step correctly? And how do I do the same for the IDs that are duplicated, i.e. how do I subset the df so that I get the IDs and other columns of the patients that repeat?
Any thoughts would be appreciated.
We can use duplicated to find the IDs that occur more than once and use them to subset the data
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and keep the rows whose ID occurs more than once.
This can be done in base R :
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
dplyr
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
and data.table
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
library(data.table)
genetics <- data.table(genetics)
genetics[, is_duplicated := duplicated(ID)]
This chunk makes a data.table from your data and adds a new column that is TRUE if the ID is duplicated and FALSE if not. Note that duplicated marks only the repeats, so the first occurrence of each ID is marked FALSE.
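If every copy of a repeated ID should be flagged, including the first one, a common base R idiom is to combine duplicated with its fromLast = TRUE variant. The sketch below uses a made-up genetics data frame; the trait column and the variable names are hypothetical.

```r
# Made-up data: p1 occurs three times, p2 twice, p3 once
genetics <- data.frame(
  ID    = c("p1", "p2", "p1", "p3", "p2", "p1"),
  trait = c("a", "b", "c", "d", "e", "f")
)

# duplicated() alone skips the first occurrence of each repeated ID
later_copies <- duplicated(genetics$ID)

# scanning from both ends flags every row whose ID occurs more than once
all_copies <- duplicated(genetics$ID) | duplicated(genetics$ID, fromLast = TRUE)

# all rows for repeated patients, with the other columns intact
repeated_rows <- genetics[all_copies, ]
```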
This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 6 years ago.
I am analyzing a set of data with many columns (almost 30 columns). I want to group data based on two columns and apply sum and mean functions to all the columns except timestamp.
How would I use summarise_each on all columns except timestamp?
This is the draft code I have, but it is obviously not correct. It also generates an error because it cannot apply sum to the POSIXt data type (Error: 'sum' not defined for "POSIXt" objects)
features <- dataset %>%
group_by(X, Y) %>%
summarise_each(funs(mean,sum)) %>%
arrange(TIMESTAMP)
Try summarise_each(funs(mean,sum), -TIMESTAMP) to exclude TIMESTAMP from the summarisation.
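summarise_each and funs are deprecated in current dplyr releases. A rough equivalent, assuming dplyr 1.0.0 or later, uses across with a negative selection. The dataset below is hypothetical apart from the X, Y and TIMESTAMP names taken from the question.

```r
library(dplyr)

# Hypothetical data; X, Y and TIMESTAMP follow the question's column names
dataset <- data.frame(
  X = c("a", "a", "b"),
  Y = c("u", "u", "v"),
  TIMESTAMP = as.POSIXct(
    c("2020-01-01 00:00:00", "2020-01-01 01:00:00", "2020-01-01 02:00:00"),
    tz = "UTC"
  ),
  v1 = c(1, 3, 5),
  v2 = c(10, 20, 30)
)

features <- dataset %>%
  group_by(X, Y) %>%
  summarise(
    # every non-grouping column except TIMESTAMP gets a mean and a sum
    across(-TIMESTAMP, list(mean = mean, sum = sum)),
    .groups = "drop"
  )
```

The output columns are named after the input column and the function, for example v1_mean and v1_sum. TIMESTAMP no longer exists after the summary, so a later arrange(TIMESTAMP) would fail.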
This question already has answers here:
Extract the maximum value within each group in a dataframe [duplicate]
(3 answers)
Closed 7 years ago.
I have a vector of hourly temperatures (DATA$TEMP) linked to dates (DATA$DATE) and thermometer position (DATA$PLACE).
I want to identify the maximum temperature conditional on date and position. I can easily do this for one date and position at a time if I specify each date and position, e.g.
x <- max(DATA$TEMP[DATA$DATE =="20/12/15" & DATA$PLACE=="room"])
However I have many dates and positions and would like a function that can run through each date / position combination and return a vector of max temps linked to each.
Try this
library(dplyr)
x <- DATA %>%
group_by(DATE, PLACE) %>%
summarise(maximum= max(TEMP))
Another option with data.table
library(data.table)
setDT(DATA)[, list(Max = max(TEMP)) , .(DATE, PLACE)]
Or with base R aggregate
aggregate(TEMP~DATE+PLACE, DATA, FUN= max)
Here is an answer using base R:
by(DATA$TEMP, list(DATA$DATE, DATA$PLACE), max)
On a more general note, this kind of problem falls under the split-apply-combine paradigm. If you google this, you will find that there are quite a few ways of doing it in R, from several base functions to plyr, dplyr, and data.table versions. See e.g. here.
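To spell out the three steps, here is a minimal base R sketch on made-up temperature readings. The column names follow the question; the values and the names pieces and max_temp are hypothetical.

```r
# Made-up readings; column names follow the question
DATA <- data.frame(
  DATE  = c("20/12/15", "20/12/15", "20/12/15", "21/12/15"),
  PLACE = c("room", "room", "attic", "room"),
  TEMP  = c(18.5, 21.0, 12.0, 19.5)
)

# split: one vector of temperatures per DATE/PLACE combination;
# drop = TRUE discards combinations that never occur in the data
pieces <- split(DATA$TEMP, list(DATA$DATE, DATA$PLACE), drop = TRUE)

# apply: take the maximum of each piece
# combine: sapply() collects the results into a named numeric vector
max_temp <- sapply(pieces, max)
```

Each element of max_temp is named after its group, such as "20/12/15.room", so a single maximum can be looked up by name.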