I have a dataset containing rows of unique identifiers. Each unique identifier occupies several rows because each person (identifier) has different ratings. For example, unique identifier 1 may have a rating for Goal A, Goal B, and Goal C, each represented in a separate row.
What would be the best way to find the average for each unique identifier (i.e., for manager 1 (unique identifier 1), what is their average score across Goal A, Goal B, and Goal C)?
In Excel, I'd do this by sorting the data, extracting the unique identifier values, pasting them at the bottom of the dataset, and computing the average with a series of conditional statements. I'm sure there must be a way to do this in R. I would appreciate any help/insight.
I started with this code, but am not sure if this is what I need. I'm filtering by department (FSO), then asking for a list of unique IDs, and then computing the average for each manager.
df %>%
  filter(newdept == 'FSO') %>%
  distinct(ID) %>%
  summarize(compmean = mean(CompRating2, na.rm = TRUE))
A base R solution would be to use aggregate:
dat <- data.frame(id = sample(LETTERS, 50, replace = TRUE),
                  score = sample(1:5, 50, replace = TRUE),
                  stringsAsFactors = FALSE)
aggregate(score ~ id, data = dat, mean)
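If you'd rather stay in dplyr, the piece missing from your attempt is group_by(): distinct(ID) drops the rating column before the mean is computed. A sketch using the column names from your code (ID, newdept, CompRating2):
library(dplyr)

# One mean per manager: group by ID instead of taking distinct IDs,
# so CompRating2 is still available inside summarize()
df %>%
  filter(newdept == 'FSO') %>%
  group_by(ID) %>%
  summarize(compmean = mean(CompRating2, na.rm = TRUE))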
I have a data frame with 3000 rows containing the occupations of 2500 different people (many respondents have multiple jobs). There is no ID variable; it is just one column (Occupation) with the list of occupations (e.g., lobbyist, teacher, teacher, lobbyist, government employee, etc.).
I would like to see what percentage of my n = 2500 respondents holds each occupation. Since many people have multiple jobs, the percentages should add up to more than 100%.
Here is the prop table I created; however, it bases the calculations on n = 3000. Is there a way to set prop.table to use n = 2500? If not, is there another function I should use?
This is my code:
# Create the proportion table
Occupation_Perc <- t(prop.table(table(NewData$Occupation))) #* 100

# Filter out uncommon occupations (I'm only interested in common ones)
Occupation_Perc <- data.frame(Occupation_Perc) %>%
  filter(Freq > .01)

# Drop the unnecessary column produced by prop.table, and rename the other column
Occupation_Perc <- as.data.frame(Occupation_Perc) %>%
  select(-Var1) %>%
  rename(Occupation = Var2)
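prop.table() always divides by the total count, so one way to get the n = 2500 denominator, assuming that number is known up front, is to divide the raw counts yourself. A minimal sketch with dplyr, reusing your NewData$Occupation column:
library(dplyr)

# Raw counts per occupation, divided by the number of respondents
# (2500) rather than the number of rows (3000)
n_respondents <- 2500
Occupation_Perc <- as.data.frame(table(NewData$Occupation)) %>%
  rename(Occupation = Var1) %>%
  mutate(Prop = Freq / n_respondents) %>%
  filter(Prop > .01)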
To get matched pairs via PSM (the MatchIt package with method = "full"), I need to specify my command for my longitudinal data frame. Every case has several observations, but I only need the first observation per patient to be included in the matching. So the matching should be based on every patient's first observation, but my later analysis should include the complete dataset of each patient with all observations.
Does anyone have an idea how to achieve this?
I tried using a data subset (the first observation per patient) but wasn't able to get the matching merged back into the full dataset (with all observations per patient) using match.data().
Thanks in advance
Simon (desperately writing his master's thesis)
My understanding is that you want to create matches at just the first time point but have those matches identified for each unit at all time points. Fortunately, this is pretty straightforward: just perform the matching at the first time point and then merge the matched dataset with the full dataset. Here is how this might look. Let's say your original long dataset is d and has an ID column id and a time column time.
library(MatchIt)

# Match at the first time point only
m <- matchit(treat ~ X1 + X2, data = subset(d, time == 1), method = "full")
md1 <- match.data(m)

# Carry the matching variables over to every row of the full long dataset
d <- merge(d, md1[c("id", "subclass", "weights")], by = "id", all.x = TRUE)
Your new dataset should have two new columns, subclass and weights, which contain the matching subclass and matching weight for each unit. Rows with identical IDs (i.e., rows corresponding to the same unit at multiple time points) will have the same values of subclass and weights.
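If it helps for the later analysis, the usual MatchIt workflow supplies those weights to the outcome model. A hedged sketch, where Y and the treat * time model form are placeholders rather than anything from the question:
# Hypothetical outcome model on the full long dataset, weighted by the
# matching weights; Y and the treat * time form are placeholder choices
fit <- lm(Y ~ treat * time, data = d, weights = weights)
summary(fit)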
I am using R.
I have two data frames, A and B.
A is grouped by trial, and contains numerous observations for each subject (e.g., reaction times per trial).
B is grouped by subject, and contains just one observation per subject (e.g., self-reported individual difference measures).
I want to transfer the B values so they repeat per participant across trials in A. There are numerous variables I wish to transfer from B to A, so I'm looking for an elegant solution.
What you want is dplyr::left_join(), which does this elegantly. Assuming both data frames share a subject identifier column (subject_id below), every subject-level column of B is copied onto each matching trial row of A:
library(dplyr)

# Each row of A keeps its trial-level values; the subject-level columns
# of B are repeated across that subject's trials
C <- A %>%
  left_join(B, by = "subject_id")
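A tiny self-contained illustration of the repeating behavior, using made-up column names (subject_id, rt, anxiety):
library(dplyr)

# Toy data: two trials per subject in A, one row per subject in B
A <- data.frame(subject_id = c(1, 1, 2, 2), rt = c(510, 498, 650, 640))
B <- data.frame(subject_id = c(1, 2), anxiety = c(3.2, 4.1))

# Each subject's anxiety score is repeated on every one of their trials
C <- A %>% left_join(B, by = "subject_id")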
I have data in R with over 6000 observations and 96 variables.
The data relate to groups of individuals and their activities, etc. If a group returned, its Group ID number was recorded again and a new observation was made. I need to merge the rows by ID so that the number of individuals takes the highest number recorded, while the activities etc. are a combination of both observations.
The data contain the number of individuals, activities, impacts, time of arrival, etc. The issue is that some of the observations were split across two lines, so there may be activities recorded for the same group in another line. The Group ID for both observations is the same, but one may have the number of individuals recorded plus some activity or impact records, while the second may be incomplete, with only the Group ID and then impacts (which are additional to those in the first record). The number of individuals in the group never changes, so I need some way to combine the rows so that activities are additive, the number of visitors takes the highest value, time of arrival is the earliest recorded, and time of departure is the later of the two observations.
Does anyone know how to merge observations based on Group ID but vary the merging protocol based on the variable?
I'm not sure if this is actually what you want, but to combine rows of a data frame based on multiple conditions you can use the dplyr package and its summarise() function. I generated some data to use in R directly; you would have to modify the code according to your needs.
library(dplyr)

# Generate example data: each ID occurs twice
ID <- rep(1:20, 2)
visitors <- sample(1:50, 40, replace = TRUE)
impact <- sample(rep(c("a", "b", "c", "d", "e"), 8))
arrival <- sample(rep(8:15, 5))
departure <- sample(rep(16:23, 5))
df <- data.frame(ID, visitors, impact, arrival, departure)
df$impact <- as.character(df$impact)

# Summarise rows with identical ID: highest visitor count, earliest arrival,
# latest departure, and all impacts pasted together
df_summary <- df %>%
  group_by(ID) %>%
  summarise(visitors = max(visitors),
            arrival = min(arrival),
            departure = max(departure),
            impact = paste0(impact, collapse = ", "))
Hope this helps!
I have 5 categorical variables: age (5 levels), sex (2 levels), zone (4 levels), qmat (5 levels), and qsoc (5 levels), for a total of 1000 unique combinations. Each unique combination has a corresponding data value (e.g., population size). I would like to assign this data to a 1000 x 6 table where the first five columns contain the indices of age, sex, zone, qmat, and qsoc, and the 6th column holds the data value.
I would like to avoid using nested for loops, which are inefficient in R (some of my datasets will have more than 1000 unique combinations). I know there exist many tools in R for parallel operations (but I am not familiar with them). Is there an efficient way to perform the above variable assignment using parallel/vectorized operations? Any suggestions or references would be appreciated.
It's hard to tell what your original data looks like, but assuming that you have it in a data frame, you may want to use aggregate().
# Simulate a data frame
set.seed(1)
N <- 9000
df <- data.frame(pop = rnorm(N),
                 age = sample(1:5, N, replace = TRUE),
                 sex = sample(1:2, N, replace = TRUE))

# Aggregate this data frame by 'age' and 'sex'
newData <- aggregate(pop ~ age + sex, data = df, FUN = sum)
The R function expand.grid() will solve my problem, e.g.:
expand.grid(list(age, sex, zone, qmat, qsoc))
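For completeness, a minimal sketch of how the full 1000 x 6 table might be built, assuming the data values can be ordered to match the grid rows (the integer level codings and the pop vector here are placeholders):
# 5 * 2 * 4 * 5 * 5 = 1000 combinations, one per row
grid <- expand.grid(age = 1:5, sex = 1:2, zone = 1:4,
                    qmat = 1:5, qsoc = 1:5)

# Placeholder data values, assumed to be in the same order as the grid rows
grid$pop <- runif(nrow(grid))
dim(grid)  # 1000 rows, 6 columns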
Thanks for all the responses, and I apologize for any possible vagueness in the wording of my question.