I am analyzing some survey data in R. Due to the sample design, all analysis has to be done with the "survey" package that can take the sample structure into account, which means I can't just get within-column or within-row percents using prop.table() the way I would on non-survey data.
For anyone not familiar with the row/column percent terminology, what I mean is percents for one variable conditional on being in a specific row/column for another variable. For example:
      | male | female
black | 10   | 20
white | 15   | 15
other | 10   | 15
A row percent would be number of observations in a cell divided by number of observations in that row, for example the percent for "male" in the row "other" is 40% (10/(10+15)). A column percent would be number of observations in a cell divided by number of observations in that column, for example the percent for "other" in the column "female" is 30% (15/(20+15+15)). Normally these are easily calculated with prop.table(), but I can't use prop.table() this time because it doesn't account for survey sample design.
I have been Googling and testing things to figure out how to do this with the "survey" package. So far I have found the svytable() function, which gives me a basic cross-tab of counts (e.g. race by gender) but not survey-weighted percents. I have also found the svymean() and svytotal() functions, but all I've managed to do with them is get univariate weighted percents from svymean() (which appears to dummy-code each category as 0/1 and then take a mean), and combine svymean() with interaction() (e.g. svymean(~interaction(race,gender),...)) to get cell percents (e.g. "black males are XX% of the total sample"). I still can't get within-row and within-column percents.
How do I get the "survey" package to give me survey-adjusted column and row percents for a cross-tab of two variables?
You didn't provide any sample data, so I'll use the built-in datasets of the survey package:
library(survey)
data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
svyby(~awards, by = ~stype, design=dclus1, FUN=svymean)
stype awardsNo awardsYes se.awardsNo se.awardsYes
E E 0.2291667 0.7708333 0.02904587 0.02904587
H H 0.5714286 0.4285714 0.14564997 0.14564997
M M 0.4800000 0.5200000 0.11663553 0.11663553
These are row percentages, i.e. the estimated percentages of each award category (yes / no) within each of the three school types. We see that an estimated 77.1% of elementary schools in the whole state of California were eligible for an awards program.
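For the other direction (column percents), one option is to swap the roles of the two variables in svyby(). If only point estimates are needed, prop.table() applied to the weighted svytable() output should give the same percentages, just without standard errors. A minimal sketch on the same api data (the object names row_pct, col_pct and tab are only illustrative):

```r
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Row percents: distribution of awards within each school type (with SEs)
row_pct <- svyby(~awards, by = ~stype, design = dclus1, FUN = svymean)

# Column percents: distribution of school type within each awards category (with SEs)
col_pct <- svyby(~stype, by = ~awards, design = dclus1, FUN = svymean)

# Point estimates only: prop.table() on the *weighted* table from svytable()
tab <- svytable(~stype + awards, design = dclus1)
row_pct_point <- prop.table(tab, margin = 1)  # rows sum to 1
col_pct_point <- prop.table(tab, margin = 2)  # columns sum to 1

print(row_pct)
print(col_pct)
```

The svyby() route is the one to prefer when standard errors or confidence intervals are required, since prop.table() knows nothing about the design.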
Related
I have the sex ratios (M / M + F) of the offspring of ~35,000 mother birds from >400 species organized in this manner
n <- 99  # sample size; must be a multiple of 3 and at most 101 for this toy example
dat <- data.frame(ID=sample(100:200, n),
                  Species=rep(LETTERS[1:3],n/3),
                  SR=sample(0:100,n,replace=TRUE))
ID is the anonymous ID of the mother bird, Species is the species name of the mother bird, and SR is the sex ratio of the mother bird's offspring. In this sample, SR is an integer between 0 and 100 because I do not know how to create sample datasets of ratios.
I want to group the data by species and calculate medians, IQRs, and sign tests. Using my own messy code I can calculate species' medians and IQRs but I am at a loss at how to calculate sign tests on this data. I want to use these sign tests to see if the species' medians differ significantly from 50/50.
Does anyone know code which would allow me to
(1) calculate medians, IQRs, and sign tests on this data
(2) create a summary table with species names, medians, IQRs, sign test p-values and n's.
Thanks in advance - I appreciate any help as I am pretty new to R and really at a loss.
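One possible approach, sketched on the toy data above: a two-sided sign test against 50 can be run as an exact binomial test on the number of values above 50 among those that differ from 50. The helper sign_test_p() and the table name summary_tab are hypothetical names introduced here, and dropping ties at exactly 50 is an assumption of this sketch.

```r
set.seed(1)
n <- 99
dat <- data.frame(ID = sample(100:200, n),
                  Species = rep(LETTERS[1:3], n / 3),
                  SR = sample(0:100, n, replace = TRUE))

# Hypothetical helper: two-sided sign test of H0 "median = mu".
# Values equal to mu are dropped (an assumption of this sketch).
sign_test_p <- function(x, mu = 50) {
  above <- sum(x > mu)
  below <- sum(x < mu)
  if (above + below == 0) return(NA_real_)
  binom.test(above, above + below, p = 0.5)$p.value
}

# One summary row per species
summary_tab <- do.call(rbind, lapply(split(dat, dat$Species), function(d) {
  data.frame(Species = d$Species[1],
             n       = nrow(d),
             median  = median(d$SR),
             IQR     = IQR(d$SR),
             sign_p  = sign_test_p(d$SR))
}))
print(summary_tab)
```

With more than 400 species tested, some form of multiple-testing correction (for example p.adjust()) is probably worth considering.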
I am beginning my PhD in transcriptome (affymetrix assay) analysis.
I have an expression matrix (trans_data: 32000 genes x 620 samples), and a clinical matrix (clin_data: 620 samples x 42 clinical characteristics).
The samples belong to 1 of the 4 populations A-B-C-D. I'd like to compare gene expression between populations A and B without trying to bind the two matrices.
I'd like to obtain a matrix with the mean expression of each gene in the two populations, then the p-value, then the adjusted p-value.
Then I could select only the differentially expressed genes (padj < 0.05).
thanks for your help.
Alain
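A rough sketch of one way to get such a table without binding the two matrices, using small simulated stand-ins for trans_data and clin_data. The column name population, the per-gene Welch t-test and the Benjamini-Hochberg adjustment are all assumptions of this sketch; for real microarray data a dedicated package such as limma is commonly used instead.

```r
set.seed(42)
# Simulated stand-ins: 200 genes x 40 samples, 10 samples per population
trans_data <- matrix(rnorm(200 * 40), nrow = 200,
                     dimnames = list(paste0("gene", 1:200), paste0("s", 1:40)))
clin_data <- data.frame(population = rep(c("A", "B", "C", "D"), each = 10),
                        row.names = colnames(trans_data))

# Look up each sample's population by name; no cbind/merge needed
pop  <- clin_data[colnames(trans_data), "population"]
in_A <- pop == "A"
in_B <- pop == "B"

# Per-gene Welch t-test between A and B (assumption: a t-test is appropriate)
pval <- apply(trans_data, 1, function(g) t.test(g[in_A], g[in_B])$p.value)

res <- data.frame(mean_A = rowMeans(trans_data[, in_A]),
                  mean_B = rowMeans(trans_data[, in_B]),
                  pvalue = pval,
                  padj   = p.adjust(pval, method = "BH"))

# Genes called differentially expressed at padj < 0.05
de_genes <- res[res$padj < 0.05, ]
head(res)
```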
Can't answer your question directly without a clear reproducible example, but you might want to check out the rather excellent tableone package.
I have a community matrix (species as columns, samples as rows) from which I would like to generate a species accumulation curve (SAC) using the specaccum() and fitspecaccum() functions in R's vegan package. In order for the resulting SAC and cumulative species richness at sample X to be comparable among regions (I have 1 community matrix per region), I need to have specaccum() choose the same number of sets within each region. My problem is that some regions have a larger number of sets than others. I would like to limit the sample size to the minimum number of sets among regions. In my case the minimum is 45, so I would like specaccum() to randomly sample 45 sets, 100 times (permutations=100), for each region, sampling from the entire data set available for that region. The code below has not worked: it doesn't recognize "subset=45". The vegan package info says "subset" needs to be logical. I don't understand how a subset number can be logical, but maybe I am misinterpreting what subset is. Is there another way to do this? Would it be sufficient to run specaccum() for the entire number of sets available for each region and then just truncate the output to 45?
require(vegan)
# does not work: subset expects a logical vector, not a number of sets
pool1<-specaccum(comm.matrix, gamma="jack1", method="random", subset=45, permutations=100)
Any help is much appreciated.
Why do you want to limit the function to work in a random sample of 45 cases? Just use the species accumulation up to 45 cases. Taking a random subset of 45 cases gives you the same accumulation, except for the random error of subsampling and throwing away information. If you want to compare your different cases, just compare them at the sample size that suits all cases, that is, at 45 or less. That is the idea of species accumulation models.
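As a minimal sketch of "just use the accumulation up to 45 cases", using the BCI data that ships with vegan in place of your community matrix (n_min and sac45 are illustrative names):

```r
library(vegan)
data(BCI)  # 50 sample plots x 225 tree species

set.seed(1)
sac <- specaccum(BCI, method = "random", permutations = 100)

# Keep only the first n_min points of the curve; n_min = 45 is the
# smallest number of sets among regions in the question
n_min <- 45
sac45 <- data.frame(sites    = sac$sites[1:n_min],
                    richness = sac$richness[1:n_min],
                    sd       = sac$sd[1:n_min])
tail(sac45, 3)
```

Running the same steps for each region's matrix would give curves that can be compared at any sample size up to 45.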
The subset is intended for situations where you have (possibly) heterogeneous collection of sampling units, and you want to stratify data. For instance, if you want to see only the species accumulation in the "OldLow" habitat type of the Barro Colorado data, you could do:
library(vegan)
data(BCI, BCI.env)
plot(specaccum(BCI, subset = BCI.env$Habitat == "OldLow"))
If you want to have, say, a subset of 30 sample plots of the same data, you could do:
take <- c(rep(TRUE, 30), rep(FALSE, 20))
plot(specaccum(BCI)) # to see it all
# repeat the following to see how taking subset influences
lines(specaccum(BCI, subset = sample(take)), col = "blue")
If you repeat the last line, you see how taking random subset influences the results: the lines are normally within the error bars of all data, but differ from each other due to random error.
I have a file containing survey data. For example, the file looks like this:
IDNUMBER AGE SEX NumPrescr OnPrescr SURV_WGT BSW1 BSW2....BSW500
123456 22 1 6 1 ... ... ... ...
Here, OnPrescr is a binary variable indicating whether or not the subject is on prescription meds, BSW1 - BSW500 are the bootstrap weights, and SURV_WGT is the survey weight per subject. There are roughly 20000 entries.
I am tasked with creating tables of various statistics within certain age-gender group breakdowns. For example, how many males from 17 to 24 are on prescription medications. And I need a count N and 95% CI for each of these types of calculations. I'm not familiar at all with survey methods.
From what I understand, I can't simply add up the number of people in each category to get the final count N for each question/category (i.e., I cannot just add all the males 17 to 24 who are using prescription meds). Instead, I have to take into account the survey weights and bootstrap weights when constructing my final count N and confidence intervals.
I was then told that in Stata this is a one-line command:
svyset [pw=SURV_WGT], brr(bsw1-bsw500)
I am working in R however. What is the equivalent command in R and what exactly is the above command doing?
PS: My sample of roughly 20000 individuals is a sample of a population of roughly 35 million.
You will want to use the survey package in R. This will be your best friend for weighted/complex survey analysis in R.
install.packages("survey")
The survey package has two main steps to your analysis. The first is creating the svydesign object, which stores information about your survey design including weights, replicate weights, data, etc. Then use any number of analysis functions to run analysis/descriptives on those design objects (e.g., svymean, svyby - for subgroup analysis, svyglm, and many more).
Based on your question, you have survey weights and replicate weights (bootstrapped). While the more common svydesign function is used for surveys with a single set of weights, you want to use svrepdesign, which will allow you to specify survey weights and replicate weights. Check out the documentation, but here is what you can do:
library(survey)
mydesign <- svrepdesign(data = mydata,
                        weights = ~SURV_WGT,
                        repweights = "BSW[0-9]+",
                        type = "bootstrap",
                        combined.weights = TRUE)
You should read the documentation, but briefly: data will be your data frame, weights takes your single survey weight vector, usually as a formula, repweights is great in that it accepts a regex string that identifies all the replicate weight columns in your data by column name, type tells the design what your replicate weights are (how they were derived), combined.weights is logical for whether the replicate weights contain sampling weights - I assume this is true but it may not be.
From this design object, you can then run analysis. E.g., let's calculate the average number of prescriptions by sex:
myresult <- svyby(~NumPrescr,          # variable to pass to function
                  by = ~SEX,           # grouping
                  design = mydesign,   # design object
                  vartype = "ci",      # report variation as confidence interval
                  FUN = svymean        # specify function from survey package, mean here
)
Hope this helps!
EDIT: if you want to look at something by age groups, as you suggest, you need to create a character or factor variable that is coded for each age group and use that new variable in your svyby call.
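To make the age-group step concrete, here is a rough sketch for weighted counts with 95% confidence intervals by age group and sex. Everything about the data is simulated: the fake data frame, the way the bootstrap weights are generated, the age breaks, and the assumption that SEX is coded 1 = male and 2 = female are all placeholders for your real file.

```r
library(survey)
set.seed(123)

# Simulated stand-in for the real file (all values are made up)
n <- 500
mydata <- data.frame(AGE      = sample(17:80, n, replace = TRUE),
                     SEX      = sample(1:2, n, replace = TRUE),
                     OnPrescr = rbinom(n, 1, 0.4),
                     SURV_WGT = runif(n, 1000, 3000))

# Fake bootstrap weights: survey weight times a resampling count
for (b in 1:50) {
  draws <- tabulate(sample(n, n, replace = TRUE), nbins = n)
  mydata[[paste0("BSW", b)]] <- mydata$SURV_WGT * draws
}

# Age groups as a factor; [17,25) is labelled "17-24", and so on
mydata$AGEGRP <- cut(mydata$AGE, breaks = c(17, 25, 45, 65, Inf), right = FALSE,
                     labels = c("17-24", "25-44", "45-64", "65+"))
mydata$SEX <- factor(mydata$SEX, levels = 1:2, labels = c("male", "female"))

mydesign <- svrepdesign(data = mydata, weights = ~SURV_WGT,
                        repweights = "BSW[0-9]+", type = "bootstrap",
                        combined.weights = TRUE)

# Weighted number of people on prescription meds in each age-sex cell, with 95% CI
counts <- svyby(~OnPrescr, by = ~AGEGRP + SEX, design = mydesign,
                FUN = svytotal, vartype = "ci")
print(counts)
```

Because OnPrescr is 0/1, its weighted total is the estimated population count of people on prescription meds in each cell.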
I have a data frame in R, df, where each row, X, is a subject (N = 100) and each column, S, is the score for each subject on a task each month over the span of two years. Thus I have a data frame of 100 subjects and 24 observations evenly spaced at 1-month intervals (ignoring month/day variance).
Question1: how do I fit a line (linear regression) to each subject? I have trouble understanding how to do this over columns, as opposed to rows within a column.
Question2: how do I fit a line (linear regression) to the whole data set? I ask because I would like to segment the dataset into groups A and B (i.e. a column is labeled as condition: {A,B}), and fit a line to each subset of subjects over the 24 timepoints.
Apologies if this is a simple question.
I constructed a dataset based on your description. If this is useful, perhaps include it in your question itself.
df <- as.data.frame(matrix(rep(1:24, 100) + rnorm(2400), nrow = 100, byrow = TRUE))
names(df) <- paste("S", 1:24, sep = "")
df$ID <- 1:100
df$group <- as.factor(sample(c("A", "B"), 100, replace = TRUE))
Now melt your data frame to get the S1 to S24 columns as a factor variable.
library(reshape2)
m<- melt(df,id.vars=c("ID","group"))
Then you can use the following kind of call to examine a linear model of time for a particular ID. You can use lapply to do this in one shot for all IDs.
summary(lm(value~as.numeric(variable), data=m, subset=ID==5))
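A sketch of the lapply route for all IDs at once, rebuilding the same simulated data so it runs on its own (the month column and the names fits and coefs are just illustrative):

```r
library(reshape2)
set.seed(1)

# Same simulated data as above
df <- as.data.frame(matrix(rep(1:24, 100) + rnorm(2400), nrow = 100, byrow = TRUE))
names(df) <- paste0("S", 1:24)
df$ID <- 1:100
df$group <- factor(sample(c("A", "B"), 100, replace = TRUE))

m <- melt(df, id.vars = c("ID", "group"))
m$month <- as.numeric(m$variable)  # S1..S24 -> 1..24 (factor levels follow column order)

# One regression of score on month per subject
fits <- lapply(split(m, m$ID), function(d) lm(value ~ month, data = d))

# Collect intercept and slope for every subject in one matrix
coefs <- t(sapply(fits, coef))
head(coefs)
```

Each element of fits is an ordinary lm object, so summary(fits[["5"]]) reproduces the single-ID call above.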
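And this will model all items as predicted by group, pooling all subjects and months. Note that lm() does not coerce the group factor to numeric: it dummy-codes it with treatment contrasts, so A is the reference level and the groupB coefficient is the estimated difference in mean score between B and A.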
summary(lm(value~group, data=m))
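If the goal is a separate line over the 24 timepoints for each group, one option (a sketch, not the only way) is to add time and its interaction with group to the model. The month column and the object name fit_grp are introduced here for illustration, and the data are rebuilt so the block runs on its own.

```r
library(reshape2)
set.seed(1)

df <- as.data.frame(matrix(rep(1:24, 100) + rnorm(2400), nrow = 100, byrow = TRUE))
names(df) <- paste0("S", 1:24)
df$ID <- 1:100
df$group <- factor(sample(c("A", "B"), 100, replace = TRUE))

m <- melt(df, id.vars = c("ID", "group"))
m$month <- as.numeric(m$variable)

# month * group expands to month + group + month:group:
#   (Intercept), month       -> intercept and slope for group A
#   groupB, month:groupB     -> how group B's intercept and slope differ from A's
fit_grp <- lm(value ~ month * group, data = m)
summary(fit_grp)
```

This treats the 2400 observations as independent; since each subject contributes 24 of them, a mixed model with a per-subject random effect may be more appropriate for inference.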