I have a data frame with 8 variables.
For the variable Labor Category, we have 5 factor levels: Holiday Worked, Regular, Overtime, Training, Other Worked.
The question is: can I find a way to aggregate rows that have the same values in every column except Labor Category, and sum up the Sum_FTE variable?
That is, can we reduce the number of rows while adding more columns:
"Labor.CategoryHoliday.Worked", "Labor.CategoryOther.Worked", "Labor.CategoryOvertime", "Labor.CategoryRegular", "Labor.CategoryTraining", using 0 or 1 to indicate the status of each level, and then sum up the total FTE from rows with the same values except Labor Category?
We can do this with a group-by operation. Using dplyr, we specify the column names in group_by as the grouping variables and then get the sum of Sum_FTE with summarise:
library(dplyr)
df1 %>%
  group_by(across(all_of(names(df1)[c(1:2, 4:5)]))) %>%  # group by the identifier columns, selected by position
  summarise(TotalFTE = sum(Sum_FTE))
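For reference, the same first step can be done in base R; a sketch assuming the column names that appear in the dcast formula below:

# base R equivalent: group and sum via aggregate's formula interface
aggregate(Sum_FTE ~ Med.Center + Charged.Job + Month + Pay.Period.End,
          data = df1, FUN = sum)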
For the second part of the question, we can use dcast (it would have been better to share the dataset with dput instead of an image file):
library(data.table)
setDT(df1)
# cast Labor.Category to wide format; counting occurrences with 'length'
# yields the 0/1 indicator columns requested
dcast(df1, Med.Center + Charged.Job + Month + Pay.Period.End ~ Labor.Category,
      value.var = "Labor.Category", fun.aggregate = length)
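A tidyr sketch of the same reshape, in case you prefer to stay in the tidyverse (it assumes the four identifier columns above; the summed TotalFTE from the first step can then be joined back on):

library(dplyr)
library(tidyr)

df1 %>%
  mutate(flag = 1L) %>%  # helper column to count occurrences of each level
  pivot_wider(id_cols = c(Med.Center, Charged.Job, Month, Pay.Period.End),
              names_from = Labor.Category, values_from = flag,
              names_prefix = "Labor.Category",
              values_fn = length, values_fill = 0L)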
I am using R.
I have two dfs, A and B.
A is grouped by trial, so contains numerous observations for each subject (e.g. reaction times per trial).
B is grouped by subject, so contains just one observation per subject (e.g. self-reported individual difference measures).
I want to transfer the B values so they repeat per participant across trials in A. There are numerous variables I wish to transfer from B to A, so I'm looking for an elegant solution.
dplyr::left_join does this elegantly: each row of A is matched to its subject's row in B, so B's values repeat across that subject's trials.
library(dplyr)
# every column of B is carried over and repeated for each of the subject's trials
C <- A %>%
  left_join(B, by = "subject_id")
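A minimal sketch with made-up data (subject_id as the key and the measure names are illustrative assumptions):

library(dplyr)

A <- tibble(subject_id = c(1, 1, 2), trial = c(1, 2, 1), rt = c(350, 410, 500))
B <- tibble(subject_id = c(1, 2), anxiety = c(0.3, 0.7))

A %>% left_join(B, by = "subject_id")
# anxiety now appears on every trial row for each subject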
Without using a join or merge, I would like to add a mean(metric) column to this table, which averages the metric by sector:
symbol  sector         date_bom    recommendation  metric
A       Strip Center   20XX-08-01  BUY              0.01
B       Office Center  20XX-09-01  BUY              0.02
C       Strip Center   20XX-07-01  SELL            -0.01
I've tried a couple of things in dplyr, but it seems like I want/need a group-by within the summarise clause, and that is not allowed.
If we are going to create a new column, use mutate instead of summarise:
library(dplyr)
df1 %>%
group_by(sector) %>%
mutate(Mean = mean(metric))  # every row in a sector gets that sector's mean
It is possible to create a list column in summarise and then unnest, though that is not needed here; the pattern is useful when the output length per group is neither 1 nor the number of rows of the group (a quick sketch follows). Also note that summarise returns only the grouping columns and the summarised columns, dropping everything else.
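A hedged sketch of that list-column pattern, using range (two values per group) as a stand-in statistic:

library(dplyr)
library(tidyr)

# summarise into a list column, then unnest into two rows per sector
df1 %>%
  group_by(sector) %>%
  summarise(rng = list(range(metric))) %>%
  unnest(cols = rng)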
In base R, we use ave for this kind of operation:
df1$Mean <- with(df1, ave(metric, sector))
Note that ave has a FUN argument, but it defaults to the mean, so it is not needed here.
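For any other statistic, pass FUN explicitly; for instance, a per-sector sum:

# same pattern as above, with an explicit FUN
df1$Total <- with(df1, ave(metric, sector, FUN = sum))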
I have data which looks like this:
patient day response
Bob "08/08/2011" 5
However, sometimes we have several responses for the same day (from the same patient). I want to replace all such rows with a single row, where the patient and the day are whatever they happen to be for those rows, and the response is their average.
So if we also had
patient day response
Bob "08/08/2011" 6
then we'd remove both these rows and replace them with
patient day response
Bob "08/08/2011" 5.5
How do I write code in R to do this for a data frame that spans tens of thousands of rows?
EDIT: I might need the code to generalize to several covariables. So, for example, apart from day, we might have "location", in which case we'd only want to average the rows that correspond to the same patient on the same day at the same location.
The required output can be obtained with:

# the formula interface keeps the output column names tidy
aggregate(response ~ patient + day, data = a, FUN = mean)
You can do this with the dplyr package pretty easily:
library(dplyr)
df %>%
  group_by(patient, day) %>%
  summarize(response_avg = mean(response))
This groups by whatever variables you choose in group_by, so you can add more, as in the sketch below. I named the new variable "response_avg", but you can change that to whatever you want.
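Per the EDIT, generalising to several covariables is just a matter of extra grouping variables; a sketch assuming a location column:

df %>%
  group_by(patient, day, location) %>%
  summarize(response_avg = mean(response))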
Just to add a data.table solution, in case any reader is a data.table user.
library(data.table)
setDT(df)
df[, response := mean(response, na.rm = TRUE), by = .(patient, day)]
df <- unique(df)  # grouped rows are now identical, so this drops the duplicates
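A more compact data.table variant aggregates directly instead of overwriting and deduplicating:

# one row per (patient, day); no unique() step needed
df[, .(response = mean(response, na.rm = TRUE)), by = .(patient, day)]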
Sorry if this is a simple question.
If I have a data frame that has an ID column and then an Observation column (containing, say, 'Good' and 'Bad'), with multiple observations per ID,
how can I get R to spread the observations into two columns, Good and Bad, with counts of the observations in each column?
Thanks!
Assumption: df is the data.frame.
table(df$Observation)  # overall counts of each level, across all IDs
If you want the count of observations per ID, then:
library(data.table)
setDT(df)
df[, table(Observation), by = ID]  # counts per ID, in long format
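If the goal is literal Good and Bad columns with one row per ID, a two-way table or a dcast sketch gets there:

# base R: contingency table with IDs as rows and observation levels as columns
table(df$ID, df$Observation)

# data.table: cast to wide, counting occurrences of each level
dcast(df, ID ~ Observation, value.var = "Observation", fun.aggregate = length)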
I have been teaching myself R from scratch, so please bear with me. I have found multiple ways to count observations; however, I am trying to figure out how to count frequencies using (logical?) expressions. I have a massive set of data, approximately 1 million observations. The df is set up like so:
Latitude Longitude ID Year Month Day Value
66.16667 -10.16667 CPUELE25399 1979 1 7 0
66.16667 -10.16667 CPUELE25399 1979 1 8 0
66.16667 -10.16667 CPUELE25399 1979 1 9 0
There are 154 unique IDs and, similarly, 154 unique lat/long pairs. I am focusing on the top 1% of all values for each unique ID. For each unique ID I have calculated the 99th percentile using its associated values. I went further and calculated each ID's 99th percentile for individual years and months, e.g. for CPUELE25399, for 1979, month = 1, the 99th percentile value is 3 (3 being the floor of the top 1%).
Using these threshold values: for each ID, for each year, for each month, I need to count the number of times (per month, per year) that the value >= that ID's 99th percentile.
I have tried at least 100 different approaches, but I think I am fundamentally misunderstanding something, maybe in the syntax. This is the snippet of code that has gotten me the farthest:
ddply(Total,
c('Latitude','Longitude','ID','Year','Month'),
function(x) c(Threshold=quantile(x$Value,probs=.99,na.rm=TRUE),
Frequency=nrow(x$Value>=quantile(x$Value,probs=.99,na.rm=TRUE))))
R throws a warning message saying that >= is not meaningful for factors.
If any one out there understands this convoluted message I would be supremely grateful for your help.
Using these threshold values: for each ID, for each year, for each month, I need to count the number of times (per month, per year) that the value >= that ID's 99th percentile.
Does this mean you want to
calculate the 99th percentile for each ID (i.e. disregarding month, year, etc.), and THEN
work out the number of times you exceed this value, but now split up by month and year as well as ID?
(Note: your example code groups by lat/lon, but this is not mentioned in your question, so I am ignoring it. If you wish to add it back, just add it as a grouping variable in the appropriate places.)
In that case, you can use ddply to calculate the per-ID percentile first:
library(plyr)
# calculate the 99th-percentile threshold for each ID
Total <- ddply(Total, .(ID), transform,
               Threshold = quantile(Value, probs = 0.99, na.rm = TRUE))
And now you can group by (ID, Month, Year) to count how many times you exceed it:
Total <- ddply(Total, .(ID, Month, Year), summarize, Freq = sum(Value >= Threshold))
Note that summarize returns a data frame with only one row per (ID, Month, Year) combination, dropping the Latitude/Longitude columns. If you want to keep them, use transform instead of summarize; the Freq will then be repeated across all the (Lat, Lon) rows for each (ID, Month, Year) combination.
Notes on ddply:
you can use .(ID, Month, Year) rather than c('ID', 'Month', 'Year') as you have done
if you just want to add extra columns, something like summarize, mutate, or transform lets you do it cleanly, without writing Total$ in front of every column name
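For readers on dplyr rather than plyr, a sketch of the same two steps:

library(dplyr)

Total %>%
  group_by(ID) %>%
  mutate(Threshold = quantile(Value, probs = 0.99, na.rm = TRUE)) %>%  # per-ID cutoff
  group_by(ID, Month, Year) %>%
  summarise(Freq = sum(Value >= Threshold), .groups = "drop")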