how to create mean ses by school - tapply function error? - r

I have a data frame that lists studentnumber <- c(1, 2, 3, ..., n) and schoolnumber <- c(1, 1, 2, 3, 4, 4), so pupil 1 is in school 1, pupil 2 is in school 1, pupil 3 is in school 2, and so on.
I have the socioeconomic status (SES) for each pupil, and I want to calculate a new column holding each pupil's SES minus the mean SES of that pupil's school. The function for this is apparently:
mydata$meansocialeconomicstatus <- with(mydata, tapply(ses, schoolnumber, mean))
But I receive an error, because tapply returns only one mean per school, so the result has fewer rows than the data frame and cannot be assigned as a new column.
My question is: what could I add to make the mean SES repeat in the new column for every pupil in the same school?

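You can use the dplyr package: compute one mean per school, then join it back so that it repeats for every pupil.
library(dplyr)
# Calculate the mean socioeconomic status per schoolnumber.
mydata2 <- mydata %>%
  group_by(schoolnumber) %>%
  summarise(meansocialeconomicstatus = mean(ses))
# Join the mean socioeconomic status back to the original dataset based on schoolnumber.
mydata <- left_join(mydata, mydata2, by = "schoolnumber")
# Actual SES minus the school mean (ses_centered is an illustrative column name).
mydata$ses_centered <- mydata$ses - mydata$meansocialeconomicstatus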
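A base R alternative, sketched here on invented toy data, is ave(), which applies a function per group and returns a vector as long as its input, so the school mean repeats for every pupil without a join. The column names follow the question; the values and the name ses_centered are assumptions made for illustration.

```r
# Toy data: column names follow the question, the values are invented.
mydata <- data.frame(
  studentnumber = 1:6,
  schoolnumber  = c(1, 1, 2, 3, 4, 4),
  ses           = c(10, 20, 30, 40, 50, 70)
)

# ave() returns one value per row: the mean of ses within that row's school.
mydata$meansocialeconomicstatus <- ave(mydata$ses, mydata$schoolnumber, FUN = mean)

# Actual SES minus the school mean (ses_centered is an illustrative name).
mydata$ses_centered <- mydata$ses - mydata$meansocialeconomicstatus
```

Because the result already has one entry per pupil, it can be assigned directly as a column, which is what the tapply() call could not do.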

Related

how to find the mean of non numerical values in R

We were given 2 data frames to import.
The first contains a list of gene expression data for 17 patients (non-numerical).
The second contains their gene ID and their treatment group.
These data sets first have to be combined,
and then we have to calculate the mean expression value for each treatment group.
I'm struggling to work out how to calculate the mean and associate it with a certain treatment group.
Apologies if this does not make sense.
patients<-read.table("GSE4922-GPL96_log2Mas5Sc500-N17.tab",sep = "\t", header=TRUE)
patients
attach(patients)
ProbeSetID
patientpID<-read.table("Patient-Groups-N17.tab", sep = "\t",header=TRUE)
patientpID
attach(patientpID)
PatientID
mergeddata<-merge(patientpID,patients)
grouping(TreatmentGroup)
sum(avg_pID = mean("ProbeSetID"))
This is what I have so far, but I need to find the mean of the Probe Set ID
and then group it by treatment group.
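One possible approach, sketched on invented toy data because the layout of the two files is not shown: assuming the expression table has a ProbeSetID column plus one numeric column per patient, and the group table has PatientID and TreatmentGroup columns, the expression table can be reshaped to one row per probe and patient, merged with the groups, and averaged. The patient column names, the values, and the Expression column are all assumptions.

```r
# Toy stand-ins for the two imported tables (layout and values are assumed).
expr <- data.frame(
  ProbeSetID = c("p1", "p2"),
  P1 = c(1, 2), P2 = c(3, 4), P3 = c(5, 6), P4 = c(7, 8)
)
groups <- data.frame(
  PatientID      = c("P1", "P2", "P3", "P4"),
  TreatmentGroup = c("A", "A", "B", "B")
)

# Reshape to long form: one row per (probe, patient) pair.
patient_cols <- setdiff(names(expr), "ProbeSetID")
long <- data.frame(
  ProbeSetID = rep(expr$ProbeSetID, times = length(patient_cols)),
  PatientID  = rep(patient_cols, each = nrow(expr)),
  Expression = unlist(expr[patient_cols], use.names = FALSE)
)

# Attach each patient's treatment group, then average per probe and group.
merged <- merge(long, groups, by = "PatientID")
group_means <- aggregate(Expression ~ ProbeSetID + TreatmentGroup,
                         data = merged, FUN = mean)
```

If the expression columns were read in as text rather than numbers, they would need converting (for example with as.numeric) before mean() can be applied.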

Apply if function to identify the value of variable based on the value of another variable

I am trying to identify the value of a variable in an R data frame conditional on the value of another variable, but I am unable to do it.
Basically, I am giving 3 different doses (Dose) of a vaccine to three groups of animals (5 animals per group, Total) and recording the result as Protec, the number of protected animals in each group. From Protec, I calculate the proportion of protection (Protec/Total) as Prop for each Dose group. For example:
library(dplyr)
Dose=c(1,0.25,0.0625);Dose #Dose or Dilution
Protec=c(5,4,3);Protec
Total=c(rep(5,3));Total
df=as.data.frame(cbind(Dose,Protec,Total));df
df=df %>% mutate(Prop=Protec/Total);df
The question is: what is the log10 of the minimum value of Dose for which Prop == 1? It can be found using the following code:
X0=log10(min(df$Dose[df$Prop==1.0]));X0
The result should be X0=0.
If Protec=c(5,5,3), Prop becomes c(1.0,1.0,0.6) and X0 should be -0.60206.
If Protec=c(5,5,5), Prop becomes c(1.0,1.0,1.0), for which I want X0=0.
If Protec=c(5,4,5), Prop becomes c(1.0,0.8,1.0), and I also want X0=0, because I treat these results as unordered and take the highest dose for calculating X0.
I think this requires an if statement, but I don't know how to write the conditions.
Can someone explain how to do it in R? Thank you in advance.
We can use mutate_at to apply the calculation to every column whose name starts with 'Protec':
library(dplyr)
df1 <- df %>%
  mutate_at(vars(starts_with("Protec")), list(Prop = ~ . / Total))
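For the X0 conditions themselves, here is a minimal sketch of one rule that reproduces the four cases listed in the question. The rule is inferred from those examples, not stated by the asker: sort by descending dose, use the highest dose if every group is fully protected, and otherwise use the lowest dose in the unbroken run of fully protected groups that starts at the highest dose. The function name x0_from_doses is hypothetical.

```r
# Inferred rule (an assumption based on the four examples in the question).
x0_from_doses <- function(Dose, Protec, Total) {
  ord  <- order(Dose, decreasing = TRUE)
  dose <- Dose[ord]
  full <- (Protec[ord] / Total[ord]) == 1

  # No fully protected group at the highest dose: the rule is undefined.
  if (!full[1]) return(NA_real_)
  # Every group fully protected: the question asks for the highest dose.
  if (all(full)) return(log10(dose[1]))
  # Otherwise take the last dose before the first group that is not fully protected.
  run_len <- which(!full)[1] - 1
  log10(dose[run_len])
}

Dose  <- c(1, 0.25, 0.0625)
Total <- rep(5, 3)
x0_from_doses(Dose, c(5, 4, 3), Total)
x0_from_doses(Dose, c(5, 5, 3), Total)
```

The NA for a highest dose without full protection is a design choice of this sketch; the question does not say what should happen in that case.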

R Using lag() to create new columns in dataframe

I have a dataset with survey score results for 3 hospitals over a number of years. This survey contains 2 questions.
The dataset looks like this -
set.seed(1234)
library(dplyr)
library(tidyr)
dataset <- data.frame(
  Hospital = c(rep('A', 10), rep('B', 8), rep('C', 6)),
  YearN = c(2015, 2016, 2017, 2018, 2019,
            2015, 2016, 2017, 2018, 2019,
            2015, 2016, 2017, 2018,
            2015, 2016, 2017, 2018,
            2015, 2016, 2017,
            2015, 2016, 2017),
  Question = c(rep('Overall Satisfaction', 5),
               rep('Overall Cleanliness', 5),
               rep('Overall Satisfaction', 4),
               rep('Overall Cleanliness', 4),
               rep('Overall Satisfaction', 3),
               rep('Overall Cleanliness', 3)),
  ScoreYearN = runif(24, min = 0.6, max = 1),
  TotalYearN = round(runif(24, min = 1000, max = 5000), 0)
)
MY OBJECTIVE
To add two columns to the dataset such that -
The first column contains the score for the given question in the given hospital for the previous year
The second column contains the total number of respondents for the given question in the given hospital for the previous year
MY ATTEMPT
I called the first column ScoreYearN-1 and the second column TotalYearN-1
I used the lag function to create the new columns that contain the lagged values from the existing columns.
library(dplyr)
library(tidyr)
dataset$`ScoreYearN-1`=lag(dataset$ScoreYearN)
dataset$`TotalYearN-1`=lag(dataset$TotalYearN)
Which gives me a resulting dataset where I have the desired outcome for the first five rows only (these rows correspond to the first Hospital-Question combination).
The remaining rows do not account for this grouping, and hence the 2015 'N-1' values take on the values of the previous group.
I'm not sure this is the best way to go about this problem. If you have any better suggestions, I'm happy to consider them.
Any help will be greatly appreciated.
You're close! Just use dplyr to group by hospital and question before lagging:
dataset_lagged <- dataset %>%
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN))
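Note that lag() shifts by row position, so the grouped version relies on the rows already being in year order within each hospital and question, as they are in the example dataset. A dependency-free sketch of the same idea that sorts first, on invented toy values (the helper shift_one and the column PrevScore are illustrative names):

```r
# Toy data with rows deliberately out of year order (values are invented).
scores <- data.frame(
  Hospital   = c("A", "A", "A", "B", "B"),
  Question   = "Overall Satisfaction",
  YearN      = c(2017, 2015, 2016, 2016, 2015),
  ScoreYearN = c(0.9, 0.7, 0.8, 0.6, 0.5)
)

# Sort so that "previous row" means "previous year" within each group.
scores <- scores[order(scores$Hospital, scores$Question, scores$YearN), ]

# Shift a vector down by one position, padding the first element with NA.
shift_one <- function(v) c(NA, v)[seq_along(v)]

# Apply the shift separately within each Hospital-Question combination.
scores$PrevScore <- ave(scores$ScoreYearN, scores$Hospital, scores$Question,
                        FUN = shift_one)
```

The first year of each group gets NA instead of inheriting a value from the previous group.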

How to mutate variables on a rolling time window by groups with unequal time distances?

I have a large df with around 40,000,000 rows, covering a total time period of 2 years and more than 400k unique users.
The time variable is formatted as POSIXct and I have a unique user_id per user. I observe each user at several points in time.
Each row is therefore a unique combination of user_id, time and a set of variables.
Based on a set of dummy variables (df$v1, df$v2), a category variable (df$category_var) and the time variable (df$time_var), I now want to calculate 3 new variables at the user_id level on a rolling time window over the previous 30 days.
So in each row, the new variable should be calculated over the values of the input variables from the previous 30 days.
I do not observe all users over the same time period: some enter later and some leave earlier. The distances between times are also not equal, so I cannot calculate the variables just by number of rows.
So far I have only managed to calculate my new variables per user_id over the whole observation period; I have not managed to calculate them over a rolling window of the previous 30 days per user.
After checking and trying all the related posts here, I assume a data.table solution is the most suitable, but since I have so far mainly worked with dplyr, my attempt at calculating these variables on a rolling time window grouped by user_id has taken more than a week without any results. I would be so grateful for your support!
My df basically looks like :
user_id <- c(1,1,1,1,1,2,2,2,2,3,3,3,3,3)
time_var <- c(1, 2, 3, 4, 5, 1.5, 2, 3, 4.5, 1, 2.5, 3, 4, 5)
category_var <- c("A", "A", "B", "B", "A", "A", "C", "C", "A", …)
v1 <- c(0,1,0,0,1,0,1,1,1,0,1,…)
v2 <- c(1,1,0,1,0,1,1,0,...)
My first needed new variable (new_x1) is basically a cumulative sum based on a condition in dummy variable v1. What I achieved so far:
df <- df %>% group_by(user_id) %>% mutate(new_x1=cumsum(v1==1))
What I need: that variable counting only over the previous 30 days per user.
Needed new variable (new_x2): basically a cumulative count of v1 if v2 has a (so far) unique value. So for each new value in v2 given v1==1, count.
What I achieved so far:
df <- df %>%
  group_by(user_id, category_var) %>%
  mutate(new_x2 = cumsum(!duplicated(v2) & v1 == 1))
I also need this based on the previous 30 days and not the whole observation period per user.
My third variable of interest (new_x3):
The time between two observations given a certain condition (v1==1)
# Inter-event time
df2 <- df %>% group_by(user_id) %>% filter(v1 == 1) %>% mutate(time_between_events = time_var - lag(time_var))
I would also need this over the previous 30 days.
Thank you so much!
Edit after John Springs Post:
My potential solution would then be
setDT(df)[, `:=`(new_x1= cumsum(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]),
new_x2= cumsum(!duplicated(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]))),
by = eval(c("user_id", "time"))]
I am really not familiar with data.table and not sure whether I can nest my conditions on cumsum in one data.table call like that.
Any suggestions?
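To make the intended calculation concrete, here is a minimal base R sketch of new_x1 on a rolling 30-day window per user, using invented toy data. It assumes time_var is numeric and measured in days; with POSIXct, arithmetic is in seconds, so the window would be 30 * 24 * 60 * 60. The helper rolling_count is a hypothetical name. This version compares every row with every other row of the same user, so it only illustrates the logic and would not scale to 40 million rows as written.

```r
# Toy data: time_var is in days here (an assumption); values are invented.
events <- data.frame(
  user_id  = c(1, 1, 1, 1, 2, 2, 2),
  time_var = c(0, 10, 35, 45, 5, 20, 60),
  v1       = c(1, 1, 0, 1, 1, 0, 1)
)

# For each observation, count flagged rows whose time lies in [t - window, t].
rolling_count <- function(time, flag, window = 30) {
  vapply(seq_along(time), function(i) {
    in_window <- time >= time[i] - window & time <= time[i]
    sum(flag[in_window])
  }, integer(1))
}

# Apply the rolling count separately to each user's rows.
events$new_x1 <- NA_integer_
for (u in unique(events$user_id)) {
  rows <- events$user_id == u
  events$new_x1[rows] <- rolling_count(events$time_var[rows],
                                       events$v1[rows] == 1)
}
```

The window here includes the current row and both boundaries; whether the current observation should count toward its own window is a choice the question leaves open.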

Random population sample divided into age groups

I want to create a random population data column of around 4000 rows and then randomly distribute each row of this population data column into 4 age group columns (like 0-24, 25-64, 64-85 and 85+).
Sorry for the silly reply earlier; I think this is what you are looking for:
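# Random population totals for 4000 rows.
Population <- as.integer(runif(4000, 10000, 1000000))
# Four random shares per row, rescaled so that each row sums to 1.
df <- matrix(runif(16000, 0, 1), ncol = 4)
df <- sweep(df, 1, rowSums(df), FUN = "/")
df <- as.data.frame(df)
df <- cbind(Population, df)
colnames(df) <- c('Population', '0-24', '25-64', '64-85', '85 above')
# Split each population total across the age groups according to its shares.
df1 <- cbind(Population, round(df$Population * df[, 2:5], 0))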
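Because each age-group count is rounded separately, the four counts in a row may not add up exactly to Population. If exact totals matter, one alternative (a different technique from the answer above, sketched under that assumption) is to draw the counts from a multinomial distribution with the random shares as probabilities. The variable names and the seed are illustrative.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 4000

# Random population totals and random age-group shares per row.
population <- as.integer(runif(n, 10000, 1000000))
shares <- matrix(runif(n * 4), ncol = 4)
shares <- shares / rowSums(shares)

# One multinomial draw per row: four integer counts that sum to the row's total.
counts <- t(vapply(seq_len(n), function(i) {
  rmultinom(1, size = population[i], prob = shares[i, ])[, 1]
}, integer(4)))
colnames(counts) <- c('0-24', '25-64', '64-85', '85 above')

result <- data.frame(Population = population, counts, check.names = FALSE)
```

The multinomial draw adds sampling noise around each share, so the counts follow the shares only approximately, but every row sums exactly to its population.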
