I have a dataset from a survey that looks something like this.
library(dplyr)
library(modelsummary)
library(Hmisc)
set.seed(123)
# Simulated survey data for 1000 respondents
df <- data.frame(var1 = runif(1000),
                 var2 = runif(1000),
                 var3 = runif(1000),
                 particip = rbinom(1000, size = 1, prob = 0.1)) %>%
  mutate(refusal = ifelse(particip == 1, 0, rbinom(1000, size = 1, prob = 0.06)),
         sampled = ifelse(particip == 1 | refusal == 1, "Sampled", "Not Sampled")) %>%
  arrange(desc(particip))
describe(df[,4:6])
My aim is to check for covariate balance (var1:var3) across different subsets of the data:
particip flags respondents who were sampled and replied to the survey;
refusal flags respondents who were sampled but did not respond to the survey;
sampled covers all respondents who were sampled, regardless of whether they refused or agreed to take the survey.
I would like to create a single balance table that compares the means of var1:var3 across multiple subsamples. In particular, I would like to compare the means for the whole universe of respondents (all 1000 possible respondents) to the means of the sampled respondents, to the means of the respondents who refused to take the survey, and to the means of the respondents who eventually took part in the survey.
I have tried using the function datasummary_balance from the package modelsummary, but I am only able to compare the means of one group at a time (participated, sampled, or refusals). Instead, I would like to create a single table with all of the means by these three different groups.
datasummary_balance(~ particip, fmt = 3, data = df, output = "markdown")
datasummary_balance(~ refusal, fmt = 3, data = df, output = "markdown")
datasummary_balance(~ sampled, fmt = 3, data = df, output = "markdown")
If anyone knows how to do this, it would be a great help.
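One possible approach (a sketch, not the only way): stack the data once per subsample with a group label, then use datasummary() from the same package, which accepts a grouping factor on the right-hand side of the formula:
stacked <- bind_rows(
  df %>% mutate(group = "All"),
  df %>% filter(sampled == "Sampled") %>% mutate(group = "Sampled"),
  df %>% filter(refusal == 1) %>% mutate(group = "Refused"),
  df %>% filter(particip == 1) %>% mutate(group = "Participated")
) %>%
  mutate(group = factor(group, levels = c("All", "Sampled", "Refused", "Participated")))
datasummary(var1 + var2 + var3 ~ group * (Mean + SD),
            data = stacked, fmt = 3, output = "markdown")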
I tried to approach these questions through imputation, but I want to see if this can be done with predictive modelling instead. I am trying to use information from the 2003-2004 NHANES cycle to predict future NHANES cycles. For some context, in 2003-2004 NHANES measured blood contaminants in individual people's blood. In this cycle, they also measured things such as triglycerides and cholesterol that influence the concentration of these blood contaminants.
The first step in my workflow is to impute missing blood contaminant concentrations in 2003-2004 using the measured values of triglycerides, cholesterol, etc. This step is easy and very straightforward. The result will be my training dataset.
For future NHANES cycles (for example 2005-2006), they took individual blood samples, combined (pooled) them, and then measured blood contaminants in the pools. I need to figure out what the individual concentrations were in these cycles. I have individual measurements for triglycerides, cholesterol, etc., and the pooled value is considered the mean of its members. Could I use the pool mean and the 2003-2004 data to unpool, or predict, the individual values? For example, if a pool contains 8 individuals, we know the mean, the distribution (from 2003-2004), and the other parameters (triglycerides, etc.), which we can use in the regression to estimate the blood contaminants for those 8 individuals. This would be my test dataset, where I have the same contaminants as in the training dataset, with a column for the number of individuals in each pool and the mean value. Alternatively, I can create rows of empty values for the contaminants and add the mean values separately.
I can easily run MICE, but I need to make sure that the distribution of the imputed data matches 2003-2004 and that the average of the 8 imputed individuals in each pool equals the measured pool value. So the 8 values for each pool need to average to the measured pool value, while the distribution has to match 2003-2004.
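One simple post-processing idea that would enforce the pool-mean constraint (a sketch only; pool_id and pool_mean are hypothetical column names, and LBX028LA stands in for any contaminant): after MICE, rescale each pool's imputed values so they average exactly to the measured pool value. A multiplicative rescale keeps concentrations positive, though it slightly distorts the imputed distribution, so the match to 2003-2004 would still need to be checked afterwards.
library(dplyr)
test_imputed <- test_imputed %>%
  group_by(pool_id) %>%
  mutate(LBX028LA = LBX028LA * first(pool_mean) / mean(LBX028LA)) %>%
  ungroup()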
Does that make sense? Happy to provide more context if need be. There is outline code below.
library(mice)
library(tidyverse)
library(VIM)
#Papers detailing these functions can be found in the mice CRAN package documentation
df <- read.csv('2003_2004_template.csv', stringsAsFactors = TRUE, na.strings = c("", NA))
#Checking out the NA's that we are working with
non_detect_summary <- df %>% summarise(across(everything(), ~ sum(is.na(.))))
#helpful representation of ND
aggr_plot <- aggr(df[, 7:42], col=c('navyblue', 'red'),
numbers=TRUE,
sortVars=TRUE,
labels=names(df[, 7:42]),
cex.axis=.7,
gap=3,
ylab=c("Histogram of Missing Data", "Pattern"))
#Time for mice; m is the number of imputed datasets (you can think of this as the number of cycles)
#You can list the available imputation methods in the console:
methods(mice)
#Pick a method based on what you think fits best; read up on the options in the mice documentation.
#Now apply the chosen method (mice picks sensible defaults, e.g. pmm for numeric data):
imputed_data <- mice(df, m = 30)
summary(imputed_data)
#if you want to see imputed values
imputed_data$imp
#finish the dataset
finished_imputed_data <- complete(imputed_data)
#Check for any missing values
sapply(finished_imputed_data, function(x) sum(is.na(x))) #All features should have a value of zero
#A helpful plot is the density plot: the density of the imputed data for each imputed dataset is shown
#in magenta, while the density of the observed data is shown in blue.
#Again, under our previous assumptions we expect the distributions to be similar.
densityplot(x = imputed_data, data = ~ LBX028LA+LBX153LA+LBX189LA)
#Print off finished dataset
write_csv(finished_imputed_data, "finished_imputed_data.csv")
#This is where I need to use the finished_imputed_data to impute the values in the future years.
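#One possible sketch (the file name "2005_2006_template.csv" and the cycle labels are
#assumptions): stack the future cycle, with its contaminant columns set to NA, onto the
#imputed 2003-2004 data, so mice borrows the 2003-2004 relationships when imputing.
df_future <- read.csv("2005_2006_template.csv", stringsAsFactors = TRUE, na.strings = c("", NA))
stacked <- bind_rows(finished_imputed_data %>% mutate(cycle = "2003-2004"),
                     df_future %>% mutate(cycle = "2005-2006"))
imputed_future <- mice(select(stacked, -cycle), m = 30)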
I have a set of control and treated plots which were sampled over several years. I ran the prc function in the vegan package and want to perform a permutation test to check whether control vs treated plots differ significantly across years. As my data are unbalanced, I cannot use the strata argument. My code looks like:
library(vegan)
year <- as.factor(c(rep(1995, 8), rep(1999, 8), rep(2001, 8), rep(2013, 4),
                    rep(1995, 4), rep(1999, 4), rep(2001, 4), rep(2013, 4)))
treatment <- as.factor(c(rep("control", 28), rep("treated", 16)))
I've written this, but I'm sure that it is wrong because the treatment is missing here:
h1 <- how(within = Within(type = "series", mirror = FALSE),
          blocks = year, nperm = 999)
Any suggestions are greatly appreciated.
Under the null hypothesis, samples from the control and treated groups are exchangeable, and hence you don't want treatment in the permutation design; you really want those samples to be permuted in order to generate the permutation-based null distribution for the test statistic.
The permutation design is there to indicate what isn't exchangeable.
You haven't explained why you want samples within the blocks to be permuted in series; are the samples within years also time series? If they're not, you don't want this.
You only need to worry about imbalance if you want to permute the strata. Whilst using blocks is similar in some respects to strata, blocks are never permuted so if you can use blocks you can use strata as you won't be permuting them.
If you want to permute the years as groups of samples, then you'll need strata and you'll need balance at the year level, which you don't have.
What you have defined with your call to how() is a design that:
1. groups samples by year, so that samples will never be swapped between years, and
2. permutes the samples within each level of year in series, keeping their temporal order intact while applying cyclic-shift permutations.
If that's not what you want to do, you need to explain in words what you want to do. By "do" I mean what is it you want to test? What is your model in vegan?
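For example, if the samples within a year are not themselves a time series, a design that permutes freely within year blocks (leaving the treatment labels to be shuffled) might look like this. This is only a sketch; resp stands for your community response matrix:
ctrl <- how(within = Within(type = "free"), blocks = year, nperm = 999)
mod <- prc(response = resp, treatment = treatment, time = year)
anova(mod, permutations = ctrl, first = TRUE)  # test the first PRC axis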
I have two groups. The treatment group had exposure to media; the control group had no media. They are distinguished by a categorical variable in the data frame (exposure to media = 1, no media = 0).
Now I want to examine whether there are any clear differences between these two groups. To do this, I want to apply the k-means algorithm with two clusters to four variables (proportion of Black population, proportion of male population, proportion of Hispanic population, and median income on the logarithmic scale).
How can I do this in R? Could anyone give some hints? Thanks!
Try this:
km <- kmeans(your_data, centers = 2, nstart = 10)
Here your_data is a data frame (your whole dataset, or just the variables you are interested in). You need to choose the number of clusters (here it is 2). A good way to understand your data is to try different numbers of clusters and see which fits your data best (using, for example, criteria such as AIC or BIC).
k-means is an approach for clustering data that is assumed to come from a mixture of different distributions, where we would like to know which distribution each observation came from.
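A minimal sketch for the four variables described above (the column names prop_black, prop_male, prop_hispanic, log_income, and media are hypothetical; scaling first is usually advisable because the variables are on different scales):
vars <- df[, c("prop_black", "prop_male", "prop_hispanic", "log_income")]
km <- kmeans(scale(vars), centers = 2, nstart = 10)
table(cluster = km$cluster, media = df$media)  # cross-tabulate clusters against treatment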
You can also have a look at many tutorials about kmeans in R. For example,
https://onlinecourses.science.psu.edu/stat857/node/125
https://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
http://www.statmethods.net/advstats/cluster.html
I have a community matrix (species as columns, samples as rows) from which I would like to generate a species accumulation curve (SAC) using the specaccum() and fitspecaccum() functions in R's vegan package. In order for the resulting SAC and cumulative species richness at sample X to be comparable among regions (I have one community matrix per region), I need specaccum() to choose the same number of sets within each region. My problem is that some regions have more sets than others. I would like to limit the sample size to the minimum number of sets among regions (in my case 45), so I would like specaccum() to randomly sample 45 sets, 100 times (permutations=100), for each region, drawing from the entire dataset available for that region.
The code below has not worked: it doesn't recognize subset=45. The vegan documentation says subset needs to be logical, and I don't understand how a subset size can be logical; maybe I am misinterpreting what subset is. Is there another way to do this? Would it be sufficient to run specaccum() on the entire number of sets available for each region and then just truncate the output to 45?
require(vegan)
pool1 <- specaccum(comm.matrix, gamma = "jack1", method = "random", subset = 45, permutations = 100)
Any help is much appreciated.
Why do you want to limit the function to work on a random sample of 45 cases? Just use the species accumulation up to 45 cases. Taking a random subset of 45 cases gives you the same accumulation, except for the random error of subsampling, while throwing away information. If you want to compare your different cases, just compare them at the sample size that suits all cases, that is, at 45 or fewer. That is the idea of species accumulation models.
The subset argument is intended for situations where you have a (possibly) heterogeneous collection of sampling units and you want to stratify the data. For instance, if you want to see the species accumulation only in the "OldLow" habitat type of the Barro Colorado data, you could do:
data(BCI, BCI.env)
plot(specaccum(BCI, subset = BCI.env$Habitat == "OldLow"))
If you want to have, say, a subset of 30 sample plots of the same data, you could do:
take <- c(rep(TRUE, 30), rep(FALSE, 20))
plot(specaccum(BCI)) # to see it all
# repeat the following to see how taking subset influences
lines(specaccum(BCI, subset = sample(take)), col = "blue")
If you repeat the last line, you will see how taking a random subset influences the results: the lines normally stay within the error bars of the full data, but differ from each other due to random error.
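So, for the original question, a minimal sketch would be to run specaccum() on all sets in each region and simply read the curve off at 45 sampling units:
sac <- specaccum(comm.matrix, method = "random", permutations = 100)
sac$richness[45]  # mean accumulated richness at 45 sets
sac$sd[45]        # its standard deviation across the 100 permutations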
I'm trying to use R to conduct Poisson regression on some data that I have. The current structure of the data is as follows:
Data is stratified based on three occupations. There are four levels of income in the data. Within each stratum, for each level of income there are:
the number of workplace accidents that have occurred, and
the total man-months observed.
Here's an example of the setup. The number in parentheses is the total man months observed and the number not in parentheses is the number of workplace accidents.
My question is how do I set up this data and perform a Poisson regression on the effect of income level on the occurrence of workplace accidents? Ideally I would like to adjust for occupation and find out the effect of only income, but as a starting point, I'm not sure how to set it up as a Poisson regression problem at all. I thought about doing something like dividing the number of injuries by the months of observation, but then that gives non-integer values so I assume that's not the right thing to do.
To reiterate, predictor: income level; response variable: workplace accidents.
BTW, it would be very easy to separate the parentheses numbers and put them into their own column, if that would make sense to do.
I'd really appreciate any suggestions on how to set this up. I am sure other statisticians are working with similarly structured data and might like to gain some insight as well. Thanks so much!
#thelatemail might be correct in thinking this is better suited for stats.stackexchange.com, but here is some R code. That data is in wide format and you need to restructure it to long format. (And you will not want to include the totals columns.) After converting the first four columns to a long format, with 'occupation' and 'level' as factor-class variables and accident 'counts' and exposure 'months' as numeric columns, you could use this call to glm:
fit <- glm( counts ~ level + occup + offset(log(months)), data=dfrm, family="poisson")
The offset needs to be log()-ed to agree with the logged counts modelled by the default log link of the poisson family.
(You cannot really expect us to redo that data entry task, now can you?)
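For concreteness, here is a toy version of that long format (every number below is made up purely to illustrate the structure):
dfrm <- data.frame(
  occup  = rep(c("occ1", "occ2", "occ3"), each = 4),
  level  = factor(rep(c("low", "mid", "high", "top"), times = 3),
                  levels = c("low", "mid", "high", "top")),
  counts = c(5, 8, 6, 3, 10, 12, 9, 7, 4, 6, 5, 2),
  months = c(120, 150, 130, 90, 200, 210, 180, 160, 100, 110, 95, 80)
)
fit <- glm(counts ~ level + occup + offset(log(months)),
           data = dfrm, family = "poisson")
exp(coef(fit))  # rate ratios: accidents per man-month relative to the baseline level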