Dummy for lower tertile normalized by multiple variables - r

I need to find the proportion of respondents whose hand grip strength was in the bottom tertile normalised for age, gender, weight and height.
First, I am trying to normalise grip strength given age, sex (the male variable), height and weight. Second, I want to create a variable holding the tertiles of this normalised grip strength. Lastly, I want to create a dummy equal to 1 if an individual is in the lowest tertile.
So far I have written this code, but it is not really working:
df <- within(df, normal <- ave(grip, male, age, weight, height ,FUN=function(x) (x-min(x))/diff(range(x))))
df$tertile <- ave(df$normal,
FUN=function(x){cut(x, labels=1:3, breaks=quantile(x, probs = 0:3/3, na.rm = TRUE), include.lowest=TRUE)})
df$lowgrip <- ifelse(df$tertile==1, 1, 0)
Data could look like this:
set.seed(123)
df <- data.frame(
age = sample(50:79, 40, replace = TRUE),
male = sample(c("1", "0"), 40, replace = TRUE),
grip = sample(5:80, 40, replace = TRUE),
weight = sample(50:100, 40, replace = TRUE),
height = sample(150:200, 40, replace = TRUE)
)
However, my real data has around 7000 observations.
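One possible reading of "normalised for age, gender, weight and height" is regression adjustment: regress grip on the covariates and take tertiles of the residuals. That reading is an assumption, not something the question states. The ave() call above groups on every unique combination of male, age, weight and height, so most groups are likely to hold a single observation, where diff(range(x)) is zero. A minimal sketch of the regression-based alternative on the example data:

```r
set.seed(123)
df <- data.frame(
  age = sample(50:79, 40, replace = TRUE),
  male = sample(c("1", "0"), 40, replace = TRUE),
  grip = sample(5:80, 40, replace = TRUE),
  weight = sample(50:100, 40, replace = TRUE),
  height = sample(150:200, 40, replace = TRUE)
)

# Assumption: "normalised" means the residual from a linear model of grip
# on the covariates; na.exclude keeps residuals aligned with the rows of df
fit <- lm(grip ~ age + male + weight + height, data = df, na.action = na.exclude)
df$normal <- residuals(fit)

# Cut the adjusted values at the 1/3 and 2/3 quantiles
breaks <- quantile(df$normal, probs = 0:3 / 3, na.rm = TRUE)
df$tertile <- cut(df$normal, breaks = breaks, labels = 1:3, include.lowest = TRUE)

# Dummy for the lowest tertile and the proportion of respondents in it
df$lowgrip <- as.integer(df$tertile == 1)
mean(df$lowgrip, na.rm = TRUE)
```

By construction roughly a third of the sample falls in the lowest tertile, so the dummy is mostly useful for comparing subgroups or as a covariate in later models.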

Related

How do I add the difference in proportions among the levels of a categorical variable in R using tbl_svysummary?

I would like to reproduce the following table (image: "Desired table"). However, I can't figure out how to add the p-value next to the statistics. The p-value here compares the difference in proportions between the two groups at each level. I'm using the hdv2003 dataset from the questionr library in RStudio. I tried add_difference(), but it doesn't do what I expected. Here is the R code I have so far:
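library(questionr)
library(survey)    # svydesign()
library(gtsummary) # tbl_svysummary()
data(hdv2003)
d <- hdv2003
d$sport2 <- FALSE # initialise the column so the indexed assignment below is well-defined
d$sport2[d$sport == "Oui"] <- TRUE
d$grpage <- cut(d$age, c(16, 25, 45, 65, 99), right = FALSE, include.lowest = TRUE)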
d$etud <- d$nivetud
levels(d$etud) <- c(
"Primaire", "Primaire", "Primaire",
"Secondaire", "Secondaire", "Technique/Professionnel",
"Technique/Professionnel", "Supérieur"
)
d$etud <- forcats::fct_explicit_na(d$etud, "manquant")
d$sexe <- relevel(d$sexe, "Femme")
dw <- svydesign(ids = ~1, data = d, weights = ~poids)
dw %>%
tbl_svysummary(by = sexe,
include = c(sport,sexe , grpage, etud, relig, heures.tv ))

Creating non-random matched pairs

I am looking for an R package that would allow me to match each subject in a treatment group to a subject in the general population that has similar characteristics (age, gender, etc).
I use the MatchIt package for doing this type of thing. You may receive advice to use propensity score matching, but there are limitations to that widely used approach (see: PS Not)
library(MatchIt) # use for matching
library(tidyverse) # The overall package. It will load lots of dependencies
set.seed(950)
n.size <- 1000
# This creates a tibble (an easier to use version of a data frame)
myData <- tibble(
a = lubridate::now() + runif(n.size) * 86400,
b = lubridate::today() + runif(n.size) * 30,
ID = 1:n.size,
# d = runif(1000),
ivFactor = sample(c("Level 1", "Level 2", "Level 3", "Level 4" ), n.size, replace = TRUE),
age = round(rnorm(n = n.size, mean = 52, sd = 10),2),
outContinuous = rnorm(n = n.size, mean = 100, sd = 10),
tmt = sample(c(1,0), size = n.size, prob = c(.3, .7), replace = TRUE)
)
# Using matching methods suggestions found in Ho, Imai, King and Stuart
myData.balance <- matchit(tmt~age + ivFactor, data = myData, method = "nearest", distance = "logit")
# Check to see if the matching improved balance between treatment and control group
summary(myData.balance)
# Extract the matched data. Now we can use this in subsequent analyses
myData.matched <- match.data(myData.balance)
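To make concrete what 1:1 nearest-neighbour matching does, here is a minimal base R sketch: greedy matching without replacement on age, exact on sex. It is an illustration only, not MatchIt's implementation; the sex column, the sample size and the greedy processing order are assumptions made for the example.

```r
set.seed(950)
n <- 200
pop <- data.frame(
  id  = seq_len(n),
  tmt = rbinom(n, 1, 0.3),                 # 1 = treated, 0 = general population
  age = round(rnorm(n, mean = 52, sd = 10)),
  sex = sample(c("F", "M"), n, replace = TRUE)
)

treated  <- pop[pop$tmt == 1, ]
controls <- pop[pop$tmt == 0, ]

matched_control <- rep(NA_integer_, nrow(treated))
available <- rep(TRUE, nrow(controls))      # each control may be used only once

for (i in seq_len(nrow(treated))) {
  # Candidates: unused controls with the same sex as treated subject i
  ok <- which(available & controls$sex == treated$sex[i])
  if (length(ok) == 0) next                 # leave unmatched (NA)
  # Pick the candidate closest in age
  best <- ok[which.min(abs(controls$age[ok] - treated$age[i]))]
  matched_control[i] <- controls$id[best]
  available[best] <- FALSE
}

pairs <- data.frame(treated_id = treated$id, control_id = matched_control)
head(pairs)
```

Greedy matching depends on the order in which treated subjects are processed, which is one reason a dedicated package is usually preferable for real analyses.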

data manipulation - R

I am struggling with data manipulation in R. My dataset consists of the variables type (5 levels), intensity (3 levels) and damage (continuous). I want to calculate the mean damage (damage1, damage2 and damage3 separately) with respect to intensity and type. In other words, I want to summarize the average damage by type and intensity. I have created this small reproducible example of my data:
type <- sample(seq(from = 1, to = 5, by = 1), size = 50, replace = TRUE)
intensity <- sample(seq(from = 1, to = 3, by = 1), size = 50, replace = TRUE)
damage1 <- sample(seq(from = 1, to = 50, by = 1), size = 50, replace = TRUE)
damage2 <- sample(seq(from = 1, to = 200, by = 1), size = 50, replace = TRUE)
damage3 <- sample(seq(from = 1, to = 500, by = 1), size = 50, replace = TRUE)
dat <- cbind(type, intensity, damage1, damage2, damage3)
Then, to manipulate the data, I used the pipe operator %>%, but my commands do not seem to work very well:
dat <- as.data.frame(dat)
dat %>%
filter(type == 1) %>%
group_by(intensity, damage) %>%
summarise(mean_damage = mean(Value))
I have read about multiple useful functions here:
efficient reshaping using data tables
manipulating data tables
Do Faster Data Manipulation using These 7 R Packages
But I wasn't able to make any progress. My questions are:
What is wrong with my code?
Am I even going in the right direction here?
Is there some alternative how to do this?
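One thing to note about the code above: it groups by damage and averages Value, but the data only contain damage1, damage2 and damage3, and filter(type == 1) keeps a single type. As one possible alternative, here is a minimal base R sketch using aggregate(); the seed and the choice of aggregate() over a dplyr pipeline are assumptions made so the example runs without extra packages.

```r
set.seed(1)
type      <- sample(1:5, size = 50, replace = TRUE)
intensity <- sample(1:3, size = 50, replace = TRUE)
damage1   <- sample(1:50, size = 50, replace = TRUE)
damage2   <- sample(1:200, size = 50, replace = TRUE)
damage3   <- sample(1:500, size = 50, replace = TRUE)

# data.frame() keeps each column's own type; cbind() would build a matrix
dat <- data.frame(type, intensity, damage1, damage2, damage3)

# Mean of each damage column for every observed type/intensity combination
means <- aggregate(cbind(damage1, damage2, damage3) ~ type + intensity,
                   data = dat, FUN = mean)
means
```

Each row of means is one type/intensity combination that occurs in the data, with one column of averages per damage variable.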

Define function that takes variable from data.frame as input and creates new variables in R

Sorry if this has been asked before. I have looked thoroughly but couldn't find anything.
My problem is the following. I have survey data and want to perform the same 3 steps with different variables. I always want the output to be separated by gender. So I want to create a function that automates this and returns three new variables. It should look something like this:
myfunction <- function(x) {
x.mean.by.gender <- svyby(~x, ~gender, svymean, design = s.design)
x.boxplot <- svyboxplot(x~gender, varwidth=FALSE, design = s.design)
x.ttest <- svyttest(x~gender, design = s.design)}
myfunction(data$income)
This is a working example of what I want to do (boxplot doesn't split by gender but not important):
require("survey")
income <- runif(50, 1000, 2500)
wealth <- runif(50, 10000, 100000)
weight <- runif(50, 1.0, 1.99)
id <- seq(from = 1, to = 50, 1)
gender <- sample(0:1, 50, replace = TRUE)
data <- data.frame(income, wealth, weight, id, gender)
data.w <- svydesign(ids = ~id, data = data, weights = ~weight)
data.w <- update(data.w, count=1)
svytotal(~count, data.w)
# Income difference
income.mean.table <- svyby(~income, ~gender, svymean, design = data.w)
income.mean.table
income.boxplot <- svyboxplot(income~gender, varwidth = FALSE, design = data.w)
income.ttest <- svyttest(income~gender, design = data.w)
income.ttest
# Wealth difference
wealth.mean.table <- svyby(~wealth, ~gender, svymean, design = data.w)
wealth.mean.table
wealth.boxplot <- svyboxplot(wealth~gender, varwidth = FALSE, design = data.w)
wealth.ttest <- svyttest(wealth~gender, design = data.w)
wealth.ttest
So the function should perform one iteration of svyby, svyboxplot and svyttest and create variables named after the input variable (e.g. income.ttest). I hope this clears things up. Sorry for the confusion.
If the objects to return are of different types, just put them in a list and return it.
UPDATE: a version where you can use named elements of the returned list:
fl <- function(x) {
a <- x*12.0
b <- rep(1L, 12)
c <- "qqqqq"
list(A = a, B = b, C = c)
}
q <- fl(c(1, 2, 3, 4, 5))
print(q$A)
print(q$B)
print(q$C)
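Applied to the question, the same list-returning pattern could look like the sketch below. To keep it runnable without the survey package it uses unweighted base R stand-ins (aggregate() and t.test()), which is an assumption for illustration; with a survey design, the formulas built by reformulate() would instead be passed to svyby() and svyttest(). The function name by_gender is hypothetical.

```r
set.seed(42)
data <- data.frame(
  income = runif(50, 1000, 2500),
  wealth = runif(50, 10000, 100000),
  gender = factor(sample(0:1, 50, replace = TRUE))
)

# Takes the variable name as a string and returns a named list of results
by_gender <- function(var, data) {
  f <- reformulate("gender", response = var)   # e.g. income ~ gender
  list(
    means = aggregate(f, data = data, FUN = mean),
    ttest = t.test(f, data = data)
  )
}

# One list entry per variable, instead of separately named global objects
results <- lapply(c(income = "income", wealth = "wealth"), by_gender, data = data)
results$income$means
results$wealth$ttest$p.value
```

Passing the variable name as a string avoids the problem that myfunction(data$income) only receives the values, not the name needed to build a formula.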

Histogram in R when using a binary value

I have data of students from several schools. I want to show a histogram of the percentage of all students that passed the test in each school, using R.
My data looks like this (id,school,passed/failed):
432342 school1 passed
454233 school2 failed
543245 school1 failed
etc.
(The point is that I am only interested in the percentage of students that passed; obviously those that didn't pass have failed. I want one bar for each school showing the percentage of students in that school that passed.)
Thanks
There are many ways to do that.
One is:
df<-data.frame(ID=sample(100),
school=factor(sample(3,100,TRUE),labels=c("School1","School2","School3")),
result=factor(sample(2,100,TRUE),labels=c("passed","failed")))
p<-aggregate(df$result=="passed"~school, mean, data=df)
barplot(p[,2]*100,names.arg=p[,1])
My previous answer didn't go all the way. Here's a redo. The example is the one from @eyjo's answer.
students <- 400
schools <- 5
df <- data.frame(
id = 1:students,
school = sample(paste("school", 1:schools, sep = ""), size = students, replace = TRUE),
results = sample(c("passed", "failed"), size = students, replace = TRUE, prob = c(.8, .2)))
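r <- aggregate(results ~ school, FUN = table, data = df)
# "flatten" the result: r$results is a matrix with columns failed and passed;
# keeping school out of cbind() avoids coercing the counts to character
r <- data.frame(school = r$school, r$results)
r$sum <- r$passed + r$failed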
r$perc.passed <- round(with(r, (passed/sum) * 100), 0)
library(ggplot2)
ggplot(r, aes(x = school, y = perc.passed)) +
theme_bw() +
geom_bar(stat = "identity")
Since you have individual records (id) and want to calculate based on index (school) I would suggest tapply for this.
students <- 400
schools <- 5
df <- data.frame("id" = 1:students,
"school" = sample(paste("school", 1:schools, sep = ""),
size = students, replace = TRUE),
"results" = sample(c("passed", "failed"),
size = students, replace = TRUE, prob = c(.8, .2)))
p <- tapply(df$results == "passed", df$school, mean) * 100
barplot(p)
