Histogram in R when using a binary value - r

I have data of students from several schools. I want to show a histogram of the percentage of all students that passed the test in each school, using R.
My data looks like this (id,school,passed/failed):
432342 school1 passed
454233 school2 failed
543245 school1 failed
etc.
(I am only interested in the percentage of students that passed; those that didn't pass have obviously failed. I want one column for each school showing the percentage of students in that school that passed.)
Thanks

There are many ways to do that.
One is:
df<-data.frame(ID=sample(100),
school=factor(sample(3,100,TRUE),labels=c("School1","School2","School3")),
result=factor(sample(2,100,TRUE),labels=c("passed","failed")))
p<-aggregate(df$result=="passed"~school, mean, data=df)
barplot(p[,2]*100,names.arg=p[,1])

My previous answer didn't go all the way. Here's a redo. The example is the one from @eyjo's answer.
students <- 400
schools <- 5
df <- data.frame(
id = 1:students,
school = sample(paste("school", 1:schools, sep = ""), size = students, replace = TRUE),
results = sample(c("passed", "failed"), size = students, replace = TRUE, prob = c(.8, .2)))
r <- aggregate(results ~ school, FUN = table, data = df)
counts <- r$results # matrix of failed/passed counts, one row per school
r <- data.frame(school = r$school, counts, sum = rowSums(counts)) # keep school out of the row sums
r$perc.passed <- round(with(r, (passed/sum) * 100), 0)
library(ggplot2)
ggplot(r, aes(x = school, y = perc.passed)) +
theme_bw() +
geom_bar(stat = "identity")

Since you have individual records (id) and want to calculate by an index (school), I would suggest tapply for this.
students <- 400
schools <- 5
df <- data.frame("id" = 1:students,
"school" = sample(paste("school", 1:schools, sep = ""),
size = students, replace = TRUE),
"results" = sample(c("passed", "failed"),
size = students, replace = TRUE, prob = c(.8, .2)))
p <- tapply(df$results == "passed", df$school, mean) * 100
barplot(p)
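Both the aggregate and the tapply answers rely on the fact that the mean of a logical vector is the proportion of TRUE values. A minimal sketch using only the three sample rows from the question (the ids are left out; the variable names are just for illustration):

```r
# The three example rows from the question
school <- c("school1", "school2", "school1")
result <- c("passed", "failed", "failed")

# TRUE where the student passed; mean() of a logical is the share of TRUE
passed <- result == "passed"

# Percentage passed per school
pct <- tapply(passed, school, mean) * 100
print(pct)

# barplot(pct) would then draw one bar per school
```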

Related

Perform an operation with complete cases without changing the original vectors

I would like to calculate a rank-biserial correlation, but the package (the only one, it seems) can't handle missing values well. It has no built-in "na.omit = TRUE" argument. I could remove the missing values from the data frame, but that would be a hassle with many different calculations.
n <- 500
df <- data.frame(id = seq (1:n),
ord = sample(c(0:3), n, rep = TRUE),
sex = sample(c("m", "f"), n, rep = TRUE, prob = c(0.55, 0.45))
)
df <- as.data.frame(apply (df, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} ))
library(rcompanion)
wilcoxonRG(x = df$ord, g = df$sex, verbose = T)
I imagine something stupidly easy like "complete.cases(wilcoxonRG(x = df$ord, g = df$sex, verbose = T))". It's probably not that hard, but I could only find manipulations of complete data frames. Thanks in advance!
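One way to get this effect, sketched here with base R only: compute the complete-case index once and subset both vectors inside the call, so the original vectors stay untouched. The helper name complete_pairs is hypothetical and not part of rcompanion.

```r
# Hypothetical helper: keep only positions where both vectors are non-missing
complete_pairs <- function(x, g) {
  ok <- complete.cases(x, g)   # TRUE where neither x nor g is NA
  list(x = x[ok], g = g[ok])
}

x <- c(1, NA, 3, 4)
g <- c("m", "f", NA, "f")
cp <- complete_pairs(x, g)
print(cp)

# The originals are unchanged; the filtered copies could then be passed on,
# for example (assuming rcompanion is installed):
# wilcoxonRG(x = cp$x, g = cp$g, verbose = TRUE)
```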

Creating a function that prints out multiple data-frames in r

I have written a function that runs a simulation demonstrating the central limit theorem. Currently it only returns the data frame containing the mean values of all the trials. I'm not sure whether it's possible to return more from it or whether I am better off writing separate functions.
# create function to perform CLT simulation
# where n = sample size, t = number of trials, pop = which population is being used, popmean = population mean,
# popsd = population standard deviation, poptitle = population name used in the plot title
cltsim <- function(n, t, pop, popmean, popsd, poptitle){
popsim <- data.frame()
# Run the simulation
for(i in n) { # for each value of n
col <- c()
for(j in t) { #loop through each value of t
trial <- 1:j
counter <- j #set up an egg timer based on whichever t value we're on
value <- c()
while(counter > 0) { # and extract n samples from the population
bucket <- sample(pop, i, replace = TRUE)
xbar <- mean(bucket) #calculate the sample mean
value <- c(value, xbar) # and add it to a vector
counter <- counter - 1 #egg timer counts down and loops back until it hits 0
}
sbar <- sd(value) #calculate the sample standard deviation
col <- cbind(trial, value, sbar, i, j) #merge all info together
popsim <- rbind(popsim, col) # attach it to empty dataframe
}
}
#clean up so just the finished data frame is left
rm(col, bucket, value, counter, i, j, n, sbar, t, xbar, trial)
#tidy up data frame in order to graph it
names(popsim) <- c("trial#", "value", "sdev", "samples", "trials")
#view the rows of data in popsim data table
popsim
}
When I try to add any more code that creates data tables, the function doesn't store them. Below are the blocks of code I wish to add to the function:
g1 <- ggplot(popsim, aes(x = value)) + geom_density(fill = "#09AB30") +
facet_grid(samples ~ trials, labeller = label_both) +
ggtitle(paste("Demonstrating The Central Limit Theorem with Simulation using", poptitle)) +
geom_vline(xintercept = popmean, linetype = "dashed")
g1
and
#create data frame of simulated sample standard deviations
sdmatrix <- matrix(unique(popsim$sdev), nrow = 4, ncol = 4)
sdf <- as.data.frame(sdmatrix, row.names = c("t10", "t100", "t1000", "t10000"))
names(sdf) <- c("s1", "s10", "s30", "s50")
sdf <- t(sdf)
rm(sdmatrix)
sdf
exvals <- pop1sd/sqrt(c(1, 10, 30, 50))
dfex <- as.data.frame(exvals, row.names = c("s1", "s10", "s30", "s50"))
names(dfex) <- "Predicted Standard Deviations"
dfex
I've had a look around and I can't find a solution anywhere. Am I better off just writing different functions for them? Any advice or input on how to make this code more effective/efficient would be greatly appreciated.
Thanks in advance
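A function can only return one object, but that object can be a list holding several data frames. The following is a simplified sketch of that idea, not the original cltsim: the name clt_sketch and the uniform population are assumptions made for illustration, and the plot is left out.

```r
# Simplified sketch: simulate sample means and return two data frames in a list
clt_sketch <- function(n, t, pop) {
  grid <- expand.grid(samples = n, trials = t)   # every n/t combination
  sims <- lapply(seq_len(nrow(grid)), function(k) {
    # draw `trials` sample means, each from `samples` observations
    means <- replicate(grid$trials[k],
                       mean(sample(pop, grid$samples[k], replace = TRUE)))
    data.frame(trial = seq_along(means), value = means,
               samples = grid$samples[k], trials = grid$trials[k])
  })
  popsim <- do.call(rbind, sims)
  # standard deviation of the simulated means for each combination
  sds <- aggregate(value ~ samples + trials, data = popsim, FUN = sd)
  names(sds)[3] <- "sdev"
  list(popsim = popsim, sds = sds)               # both results, by name
}

set.seed(1)
out <- clt_sketch(n = c(1, 10), t = c(10, 100), pop = runif(1000))
head(out$popsim)
out$sds
```

A ggplot object could be added to the returned list in the same way and printed by the caller.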

Creating non-random matched pairs

I am looking for an R package that would allow me to match each subject in a treatment group to a subject in the general population that has similar characteristics (age, gender, etc).
I use the MatchIt package for doing this type of thing. You may receive advice to use propensity score matching, but there are limitations to that widely used approach (see: PS Not)
library(MatchIt) # use for matching
library(tidyverse) # The overall package. It will load lots of dependencies
set.seed(950)
n.size <- 1000
# This creates a tibble (an easier to use version of a data frame)
myData <- tibble(
a = lubridate::now() + runif(n.size) * 86400,
b = lubridate::today() + runif(n.size) * 30,
ID = 1:n.size,
# d = runif(1000),
ivFactor = sample(c("Level 1", "Level 2", "Level 3", "Level 4" ), n.size, replace = TRUE),
age = round(rnorm(n = n.size, mean = 52, sd = 10),2),
outContinuous = rnorm(n = n.size, mean = 100, sd = 10),
tmt = sample(c(1,0), size = n.size, prob = c(.3, .7), replace = TRUE)
)
# Using matching methods suggestions found in Ho, Imai, King and Stuart
myData.balance <- matchit(tmt~age + ivFactor, data = myData, method = "nearest", distance = "logit")
# Check to see if the matching improved balance between treatment and control group
summary(myData.balance)
# Extract the matched data. Now we can use this in subsequent analyses
myData.matched <- match.data(myData.balance)
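To illustrate the general idea behind nearest-neighbor matching, here is a base R sketch of greedy 1:1 matching on a single covariate (age) without replacement. It is only a toy under those assumptions and is not how MatchIt is implemented; the function name greedy_match is hypothetical, and it assumes there are at least as many controls as treated subjects.

```r
# Greedy 1:1 nearest-neighbor matching on one covariate, without replacement
greedy_match <- function(treated_age, control_age) {
  available <- rep(TRUE, length(control_age))
  matched <- integer(length(treated_age))
  for (i in seq_along(treated_age)) {
    d <- abs(control_age - treated_age[i])  # distance to every control
    d[!available] <- Inf                    # skip controls already used
    j <- which.min(d)
    matched[i] <- j
    available[j] <- FALSE
  }
  matched  # index of the control matched to each treated subject
}

m <- greedy_match(treated_age = c(30, 50), control_age = c(49, 31, 70))
print(m)
```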

R Survey library Difference of Means Test

I am currently using R's survey library to analyze survey data. I have two samples from two different time periods. My goal is to test if the difference between the two weighted sample means is equal to 0. Question: How do I approach this using R's survey library?
I have tried two approaches to doing this:
Approach 1: Create two different postStratify objects. Toy example:
library(survey)
q1 = c(1,1,1,1,0)
group = c(0,0,0,1,1)
df = data.frame(q1, group)
svy_design = svydesign(ids = ~1 , data = df)
pop_data = data.frame(group = c(0,1), Freq = c(10,90))
ps_design = postStratify(svy_design, strata = ~group,pop_data)
first = svymean(~q1, ps_design) #Weighted Mean of first sample
q2 = c(1,1,1,0,0)
g2 = c(1,1,0,0,0)
df2 = data.frame(q2, g2)
pop_data_2 = data.frame(g2 = c(0,1), Freq = c(20,80)) #column name must match the strata variable
svd_2 = svydesign(ids = ~1, data = df2)
psd_2 = postStratify(svd_2, strata = ~g2, pop_data_2)
second = svymean(~q2, psd_2) #Weighted mean of second sample
The problem with this approach is that I do not know how to conduct the difference of means test on "first" and "second", the two svymean objects.
Approach 2: Create only one postStratify object. Toy example:
q1 = c(1,1,1,1,0, 1,1,0,0,1)
group = c(0,0,0,1,1, 0,0,1,1,1)
time = c(0,0,0,0,0, 1,1,1,1,1) #Variable that distinguishes between the samples
df = data.frame(q1, group, time)
svy_design = svydesign(ids = ~1 , data = df)
pop_data = data.frame(group = c(0,1), Freq = c(10,90))
ps_design = postStratify(svy_design, strata = ~group,pop_data)
svyby(~q1, ~time, ps_design, svymean)
svyttest(q1~time, ps_design)
The problem with this approach is that when I run svyby just to check the computed means, its output is not what I expect. It gives mean = 0.5714 for time = 0, when the theoretical weighted mean is 0.55. Any insight into why the theoretical mean differs from the svyby result would be greatly appreciated.
Thank you so much for your time.
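One possible reading of the 0.5714, sketched in base R without the survey package: it assumes that post-stratifying the pooled ten rows gives each row the weight Freq / (number of rows in its group), so the group shares inside time = 0 are no longer 10/90. The variable names below are for illustration only.

```r
q1    <- c(1,1,1,1,0, 1,1,0,0,1)
group <- c(0,0,0,1,1, 0,0,1,1,1)
time  <- c(0,0,0,0,0, 1,1,1,1,1)
pop   <- c("0" = 10, "1" = 90)   # population counts per group

# Assumed post-stratification weights over the pooled sample:
# each group has 5 rows, so the weights are 10/5 = 2 and 90/5 = 18
w <- pop[as.character(group)] / ave(q1, group, FUN = length)

# Weighted mean within time == 0 using the pooled weights: 24/42
pooled_t0 <- weighted.mean(q1[time == 0], w[time == 0])

# Group means within time == 0, combined with the 10/90 population shares
m <- tapply(q1[time == 0], group[time == 0], mean)
separate_t0 <- sum(m * pop[names(m)]) / sum(pop)

print(c(pooled = pooled_t0, separate = separate_t0))
```

Under this assumption the pooled weights reproduce 0.5714, while applying the 10/90 shares within time = 0 alone gives the expected 0.55.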
Are you looking for this? Thanks
library(survey)
q1 = c(1,1,1,1,0, 1,1,0,0,1)
# edited #
group = c(0,0,0,1,1, 2,2,3,3,3)
time = c(0,0,0,0,0, 1,1,1,1,1) #Variable that distinguishes between the samples
df = data.frame(q1, group, time)
svy_design = svydesign(ids = ~1 , data = df)
# edited #
pop_data = data.frame(group = c(0,1,2,3), Freq = c(10,90,20,80))
ps_design = postStratify(svy_design, strata = ~group,pop_data)
svyby(~q1, ~time, ps_design, svymean)
svyttest(q1~time, ps_design)
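For Approach 1 in the question, if the two samples can be treated as independent, one common way to compare two estimated means is a z statistic: the difference divided by the square root of the sum of the squared standard errors. The sketch below uses base R only; the helper name diff_means_z is hypothetical and the numbers passed to it are placeholders, not results from the data above.

```r
# z test for the difference of two independent estimated means
diff_means_z <- function(m1, se1, m2, se2) {
  z <- (m1 - m2) / sqrt(se1^2 + se2^2)  # SE of a difference of independent estimates
  p <- 2 * pnorm(-abs(z))               # two-sided p-value, normal approximation
  c(diff = m1 - m2, z = z, p.value = p)
}

# Placeholder inputs for illustration only
res <- diff_means_z(m1 = 0.60, se1 = 0.10, m2 = 0.45, se2 = 0.12)
print(res)

# With the survey package loaded, the inputs could be taken from the
# svymean objects, for example:
# diff_means_z(coef(first), SE(first), coef(second), SE(second))
```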

Define function that takes variable from data.frame as input and creates new variables in R

Sorry if this has been asked before; I have looked thoroughly but couldn't find anything.
My problem is the following. I have survey data and want to perform the same 3 steps with different variables. I always want the output to be separated by gender. So I want to create a function that automates this and returns three new variables. It should look something like this:
myfunction <- function(x) {
x.mean.by.gender <- svyby(~x, ~gender, svymean, design = s.design)
x.boxplot <- svyboxplot(x~gender, varwidth=FALSE, design = s.design)
x.ttest <- svyttest(x~gender, design = s.design)}
myfunction(data$income)
This is a working example of what I want to do (boxplot doesn't split by gender but not important):
require("survey")
income <- runif(50, 1000, 2500)
wealth <- runif(50, 10000, 100000)
weight <- runif(50, 1.0, 1.99)
id <- seq(from = 1, to = 50, 1)
gender <- sample(0:1, 50, replace = TRUE)
data <- data.frame(income, wealth, weight, id, gender)
data.w <- svydesign(ids = ~id, data = data, weights = ~weight)
data.w <- update(data.w, count=1)
svytotal(~count, data.w)
# Income difference
income.mean.table <- svyby(~income, ~gender, svymean, design = data.w)
income.mean.table
income.boxplot <- svyboxplot(income~gender, varwidth = FALSE, design = data.w)
income.ttest <- svyttest(income~gender, design = data.w)
income.ttest
# Wealth difference
wealth.mean.table <- svyby(~wealth, ~gender, svymean, design = data.w)
wealth.mean.table
wealth.boxplot <- svyboxplot(wealth~gender, varwidth = FALSE, design = data.w)
wealth.ttest <- svyttest(wealth~gender, design = data.w)
wealth.ttest
So the function should perform one of those iterations of svyby, svyboxplot and svyttest and create variables named after the input variable (e.g. income.ttest). I hope this clears things up. Sorry for the confusion.
If the data objects are of different types, just make a list and return it.
UPDATE: a version where you can use named elements of the returned list
fl <- function(x) {
a <- x*12.0
b <- rep(1L, 12)
c <- "qqqqq"
list(A = a, B = b, C = c)
}
q <- fl(c(1, 2, 3, 4, 5))
print(q$A)
print(q$B)
print(q$C)
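Applied to the question, the same pattern could look like the sketch below: pass the variable name as a string, build the formula with reformulate, and return the results as a named list. To keep the sketch self-contained it uses base R's aggregate and t.test as stand-ins for svyby and svyttest; the function name by_gender and the example data are assumptions for illustration.

```r
# Run the same steps for one variable and return everything in a named list
by_gender <- function(var, data) {
  f <- reformulate("gender", response = var)  # e.g. income ~ gender
  list(
    means = aggregate(f, data = data, FUN = mean),  # stand-in for svyby
    ttest = t.test(f, data = data)                  # stand-in for svyttest
  )
}

set.seed(42)
data <- data.frame(income = runif(50, 1000, 2500),
                   wealth = runif(50, 10000, 100000),
                   gender = rep(0:1, each = 25))

# One list entry per variable, so results$income$ttest replaces income.ttest
results <- lapply(c(income = "income", wealth = "wealth"), by_gender, data = data)
results$income$means
results$wealth$ttest
```

With a survey design, the same formula object could be handed to svyby, svyboxplot and svyttest instead.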
