Spread a column of factors by count of factor - r

Sorry if this is a simple question.
If I have a dataframe that has an ID column and an Observation column (containing, say, 'Good' and 'Bad'), with multiple observations per ID,
how can I get R to spread the observations into two columns, Good and Bad, with the count of each observation in its column?
Thanks!

Assuming df is the data.frame, this counts each observation over the whole data set:
table(df$Observation)
To count the observations per ID instead:
library(data.table)
setDT(df)
df[, table(Observation), by = ID]
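To get the layout the question asks for, one row per ID with a Good and a Bad column, a two-way table is one option. The following is a minimal base R sketch, assuming a small made-up data frame: the values are hypothetical, and only the column names ID and Observation come from the question.

```r
# Hypothetical example data; only the column names come from the question.
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3),
  Observation = c("Good", "Bad", "Good", "Bad", "Bad", "Good"),
  stringsAsFactors = FALSE
)

# Two-way contingency table: one row per ID, one column per observation value.
counts <- table(df$ID, df$Observation)

# Turn the table into an ordinary data frame with an explicit ID column.
wide <- as.data.frame.matrix(counts)
wide <- cbind(ID = rownames(wide), wide)
rownames(wide) <- NULL
```

An ID with no 'Good' (or no 'Bad') observations gets a 0 in that column rather than being dropped.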

Related

Create full data frame from possible combinations of grouping variables

I apologize if this has been asked before, but I could not find the answer I needed when there are three grouping variables.
I need to fill a dataframe with all possible combinations of the grouping variables, but insert NAs in the non-grouping value column when a combination does not appear. Say there is a dataframe with three grouping variables: Year, Geography, and Grouping:
Year <- rep(2008:2019, each = 50)
Geography <- rep(1:60, each = 10)
Grouping <- rep(1:4, each = 150)
value <- rnorm(600, mean = 0, sd = 1)
# Include all three grouping variables alongside the value column
df <- data.frame(Year, Geography, Grouping, value)
But the dataframe is missing some random observations like so:
df2=df[-c(15,60,150,510),]
How would one go about changing the dataframe back into a length of 600 (which is the length it would be if all possible combinations of three grouping variables were present), but inserting NAs where the value would be if the combinations were in the dataframe? Note that all unique observations for each grouping variable are present in the dataset at some point.

Subsetting a dataframe based on a vector of strings [duplicate]

I have a large dataset called genetics which I need to break down. There are 4 columns, the first one is patientID that is sometimes duplicated, and 3 columns that describe the patients.
As said before, some of the patient IDs are duplicated and I want to know which ones, without losing the remaining columns.
dedupedGenID<- unique(Genetics$ID)
Will only give me the unique IDs, without the column.
In order to subset the df by those unique IDs I did
dedupedGenFull <- Genetics[str_detect(Genetics$patientID, pattern = dedupedGenID), ]
This gives me an error of "longer object length is not a multiple of shorter object length" and the dedupedGenFull has only 55 rows, while dedupedGenID is a character vector of 1837.
My questions are: how do I perform that subsetting step correctly? And how do I do the same for the IDs that are duplicated, i.e. how do I subset the df so that I get the IDs and the other columns of those patients that repeat?
Any thoughts would be appreciated.
We can use duplicated to get the IDs that occur more than once and use them to subset the data:
subset(Genetics, ID %in% unique(ID[duplicated(ID)]))
Another approach is to count the number of rows per ID and keep the rows whose ID occurs more than once.
This can be done in base R:
subset(Genetics, ave(seq_along(ID), ID, FUN = length) > 1)
with dplyr:
library(dplyr)
Genetics %>% group_by(ID) %>% filter(n() > 1)
and with data.table:
library(data.table)
setDT(Genetics)[, .SD[.N > 1], ID]
library(data.table)
genetics <- data.table(genetics)
genetics[, is_duplicated := duplicated(ID)]
This chunk makes a data.table from your data and adds a new column that is TRUE if the ID has already appeared in an earlier row and FALSE otherwise. Note that duplicated only marks the repeats, so the first occurrence of each ID is marked FALSE.
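Because the first copy of each ID is marked FALSE, that flag alone does not select every row of a repeated patient. One way around this, shown here as a base R sketch on hypothetical data (only the column name ID comes from the question), is to run duplicated from both ends and combine the results:

```r
# Hypothetical example data; "trait" is a made-up stand-in for the other columns.
genetics <- data.frame(
  ID = c("p1", "p2", "p1", "p3", "p2", "p4"),
  trait = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later copies; scanning from the end as
# well flags the first copy too.
in_repeated_group <- duplicated(genetics$ID) |
  duplicated(genetics$ID, fromLast = TRUE)

repeated_rows <- genetics[in_repeated_group, ]  # every row of a repeated ID
single_rows <- genetics[!in_repeated_group, ]   # IDs that occur exactly once
```

Both subsets keep all the other columns, which is what the question asks for.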

R Using lag() to create new columns in dataframe

I have a dataset with survey score results for 3 hospitals over a number of years. This survey contains 2 questions.
The dataset looks like this -
set.seed(1234)
library(dplyr)
library(tidyr)
dataset= data.frame(Hospital=c(rep('A',10),rep('B',8),rep('C',6)),
YearN=c(2015,2016,2017,2018,2019,
2015,2016,2017,2018,2019,
2015,2016,2017,2018,
2015,2016,2017,2018,
2015,2016,2017,
2015,2016,2017),
Question=c(rep('Overall Satisfaction',5),
rep('Overall Cleanliness',5),
rep('Overall Satisfaction',4),
rep('Overall Cleanliness',4),
rep('Overall Satisfaction',3),
rep('Overall Cleanliness',3)),
ScoreYearN=c(rep(runif(24,min = 0.6,max = 1))),
TotalYearN=c(rep(round(runif(24,min = 1000,max = 5000),0))))
MY OBJECTIVE
To add two columns to the dataset such that -
The first column contains the score for the given question in the given hospital for the previous year
The second column contains the total number of respondents for the given question in the given hospital for the previous year
MY ATTEMPT
I called the first column ScoreYearN-1 and the second column TotalYearN-1
I used the lag function to create the new columns that contain the lagged values from the existing columns.
library(dplyr)
library(tidyr)
dataset$`ScoreYearN-1`=lag(dataset$ScoreYearN)
dataset$`TotalYearN-1`=lag(dataset$TotalYearN)
Which gives me a resulting dataset where I have the desired outcome for the first five rows only (these rows correspond to the first Hospital-Question combination).
The remaining rows do not account for this grouping, and hence the 2015 'N-1' values take on the values of the previous group.
I'm not sure this is the best way to go about this problem. If you have any better suggestions, I'm happy to consider them.
Any help will be greatly appreciated.
You're close! Just use dplyr to group by hospital and question before calling lag, so that the lag restarts within each group:
dataset_lagged <- dataset %>%
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN))
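A lag takes the previous row, not the previous year, so the rows need to be in year order within each group (the example dataset already is). The same idea can be sketched in base R with ave. The data below is hypothetical and deliberately unsorted, and the names scores, shift_one and ScoreYearN_prev are made up for the illustration:

```r
# Hypothetical, deliberately unsorted data using the question's column names.
scores <- data.frame(
  Hospital = c("A", "A", "A", "B", "B"),
  Question = "Overall Satisfaction",
  YearN = c(2016, 2015, 2017, 2016, 2015),
  ScoreYearN = c(0.7, 0.6, 0.8, 0.9, 0.5),
  stringsAsFactors = FALSE
)

# Sort by group and year so "previous row" means "previous year".
scores <- scores[order(scores$Hospital, scores$Question, scores$YearN), ]

# One key per Hospital-Question combination.
group_key <- paste(scores$Hospital, scores$Question)

# Shift each group's values down by one; the first row of a group gets NA.
shift_one <- function(v) c(NA, head(v, -1))
scores$ScoreYearN_prev <- ave(scores$ScoreYearN, group_key, FUN = shift_one)
```

If a group skips a year, the shifted value comes from the last available year rather than from year N-1, so gaps in the data need separate handling.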

R: replace identical rows with average

I have data which looks like this:
patient day response
Bob "08/08/2011" 5
However, sometimes, we have several responses for the same day (from the same patient). For all such rows, I want to replace them all with just one row, where the patient and the day is of course what it happens to be for all those rows, and the response is the average of them.
So if we also had
patient day response
Bob "08/08/2011" 6
then we'd remove both these rows and replace them with
patient day response
Bob "08/08/2011" 5.5
How do I write up a code in R to do this for a data frame that spans tens of thousands of rows?
EDIT: I might need the code to generalize to several covariables. So, for example, apart from day, we might have "location", so then we'd only want to average all the rows which correspond to the same patient on the same day on the same location.
The required output can be obtained with aggregate (here the data frame is called a):
aggregate(a$response, by = list(patient = a$patient, day = a$day), FUN = mean)
You can do this with the dplyr package pretty easily:
library(dplyr)
df %>% group_by(patient, day) %>%
summarize(response_avg = mean(response))
This groups by whatever variables you list in group_by, so you can add more (for example location). I named the new variable "response_avg", but you can change that to whatever you want.
Just to add a data.table solution, in case any reader is a data.table user:
library(data.table)
setDT(df)
# Overwrite each response with its group mean, then drop the now-identical rows
df[, response := mean(response, na.rm = TRUE), by = .(patient, day)]
df <- unique(df) # to remove duplicates
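For the generalisation asked about in the question's edit, the formula interface of aggregate may be convenient, since each extra covariable is just another term. This is a minimal sketch on hypothetical data: the values are made up, and responses and averaged are illustrative names.

```r
# Hypothetical data; "location" is the extra covariable from the question's edit.
responses <- data.frame(
  patient = c("Bob", "Bob", "Bob", "Ann"),
  day = c("08/08/2011", "08/08/2011", "08/08/2011", "09/08/2011"),
  location = c("clinic", "clinic", "home", "clinic"),
  response = c(5, 6, 9, 4),
  stringsAsFactors = FALSE
)

# Every variable to the right of ~ is a grouping variable; add more as needed.
averaged <- aggregate(response ~ patient + day + location,
                      data = responses, FUN = mean)
```

Bob's two clinic responses on the same day collapse into one row with the mean 5.5, while his home response stays separate. Note that the formula interface drops rows containing NA by default.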

How to transfer multiple factors into one row?

I have a dataframe with 8 variables:
For the variable Labor Category, there are 5 levels: Holiday Worked, Regular, Overtime, Training, Other Worked.
The question is: can I aggregate the rows that have the same values in every column except Labor Category, and sum up the Sum_FTE variable?
That is, can we reduce the number of rows while adding more columns:
"Labor.CategoryHoliday.Worked","Labor.CategoryOther.Worked","Labor.CategoryOvertime","Labor.CategoryRegular","Labor.CategoryTraining", using 0 or 1 to indicate the status of each level, and then sum up the total FTE over the rows with the same values except for Labor Category?
We can use a group-by operation. Using dplyr, we pass the column names to group_by as grouping variables and then get the sum of "Sum_FTE" with summarise. (group_by_ with .dots is deprecated, so across with all_of is used here instead.)
library(dplyr)
df1 %>%
  group_by(across(all_of(names(df1)[c(1:2, 4:5)]))) %>%
  summarise(TotalFTE = sum(Sum_FTE))
For the second part of the question, we can use dcast (it would have been better to show the dataset with dput instead of an image file):
library(data.table)
setDT(df1)[, N := 1:.N, by = Labor.Category]
dcast(df1, Med.Center + Charged.Job + Month + Pay.Period.End ~ N,
      value.var = "Labor.Category", fun.aggregate = length)
