How are BRR weights used in the survey package for R?

Does anyone know how to use BRR weights in Lumley's survey package for estimating variance, if your dataset already has BRR weights in it?
I am working with PISA data, and they already include 80 BRR replicates in their dataset. How can I get as.svrepdesign to use these, instead of trying to create its own? I tried the following and got the subsequent error:
dstrat <- svydesign(id = ~uniqueID, strata = ~strataVar, weights = ~studentWeight,
                    data = data, nest = TRUE)
dstrat <- as.svrepdesign(dstrat, type = "BRR")

Error in brrweights(design$strata[, 1], design$cluster[, 1], ..., fay.rho = fay.rho, :
  Can't split with odd numbers of PSUs in a stratum
Any help would be greatly appreciated, thanks.

no need to use as.svrepdesign() if your data frame already contains the replicate weights :) you can create the replicate-weighted design directly from the data frame.
say your data has a main weight column called mainwgt and 80 replicate weight columns called repwgt1 through repwgt80 -- then you could use this --
yoursurvey <-
svrepdesign(
weights = ~mainwgt ,
repweights = "repwgt[0-9]+" ,
type = "BRR",
data = yourdata ,
combined.weights = TRUE
)
-- this way, you don't have to identify the exact column numbers. then you can run normal survey commands like --
svymean( ~variable , design = yoursurvey )
if you'd like another example, here's some example code and an explanatory blog post using the current population survey.
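as a quick sanity check on a design built this way, the usual survey functions all work the same -- here are a few more (the variable name is a stand-in, just like above) --
# replication degrees of freedom implied by the 80 replicate weights
degf( yoursurvey )
# weighted total and its standard error
svytotal( ~variable , design = yoursurvey )
# confidence interval around the weighted mean
confint( svymean( ~variable , design = yoursurvey ) )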

I haven't used the PISA data, but I used the svrepdesign method last year with the Public Use Microdata Sample from the American Community Survey (US Census Bureau), which also ships with 80 replicate weights. Its documentation states to use the Fay method for that specific survey, so here is how one can construct the svyrep object using that data:
pums_p.rep <- svrepdesign(variables = pums_p[, 2:7],
                          repweights = pums_p[, 8:87],
                          weights = pums_p[, 1], combined.weights = TRUE,
                          type = "Fay", rho = (1 - 1/sqrt(4)), scale = 1, rscales = 1)
attach(pums_p.rep$variables)
# CROSS-TABS
# unweighted
xtabs(~ is5to17youth + withinAMILimit)
table(is5to17youth, withinAMILimit)
# weighted: mean income by sex by race for select age groups
svyby(~PINCP, ~RAC1P + SEX,
      subset(pums_p.rep, AGEP > 25 & AGEP < 35),
      svymean, na.rm = TRUE, vartype = c("se", "cv"))
In getting this to work, I found the article from A. Damico helpful: Damico, A. (2009). Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data. The R Journal, 1(2), 37–44.
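Once the replicate design exists, the usual estimation commands run against it directly. For example, a weighted mean of the income field used above (just an illustration, not part of the original analysis):
# weighted mean personal income, with a replicate-based standard error
svymean(~PINCP, pums_p.rep, na.rm = TRUE)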

Related

How to apply weights associated with the NIS (National Inpatient Sample) in R

I am trying to apply the weights provided with the NIS data using the R package "survey", but I have been unsuccessful. I am fairly new to R and to survey commands.
This is what I have tried:
library(readstata13)
library(survey)

# Create the unweighted dataset
d <- read.dta13(path)
sum(d$DISCWT)  # This produces the correct weighted number of cases I need.

# Create survey object
d$count <- 1
dsvy <- svydesign(id = ~HOSP_NIS, strata = ~NIS_STRATUM, weights = ~DISCWT,
                  nest = TRUE, data = d)
svytotal(~count, dsvy)
However I get the following error after running the survey total:
Error in onestrat(x[index, , drop = FALSE], clusters[index], nPSU[index][1], :
Stratum (1131) has only one PSU at stage 1
Any help would be greatly appreciated, thank you!
The error indicates that you have specified a design where one of the strata has just a single primary sampling unit. It's not possible to get an unbiased estimate of variance for a design like that: the contribution of stratum 1131 will end up as 0/0.
As you see, R's default response is to give an error, because a reasonably likely explanation is that the data or the svydesign statement is wrong. Sometimes, as here, that's not what you want, and the global option 'survey.lonely.psu' describes other ways to respond. You want to set
options(survey.lonely.psu = "adjust")
This and the other options are documented at help(surveyoptions).
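A minimal sketch of the fix applied to the code above: set the option, then re-run the estimate (the design object does not need to be rebuilt, since the option is consulted at estimation time):
# adjust, rather than error, when a stratum contains a single PSU
options(survey.lonely.psu = "adjust")
svytotal(~count, dsvy)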

R won't run the model because it insists the data is not an unmarkedFrameOccu object

I am trying to create a dual-species occupancy model using unmarkedFrameOccuMulti. I've been successful in producing the UMF, and have even got a basic plot of the detections, but when I try to run an individual model I get the error message:
Error in occu(~1, ~Vill_Dist, umf) : Data is not an unmarkedFrameOccu object.
I've made sure the csvs have the same number of rows, etc. I'm a bit miffed because I can't find much online, and the UMF itself has run perfectly; R just can't seem to separate out the aspects of it.
S <- 2    # number of species
M <- 354  # number of sites, i.e. sites with actual data (not NAs; some transects were surveyed 14 times, others as few as 2)
J <- 9.07 # average number of visits per transect
y <- list(matrix(rbinom(354, 1, 0.456)),  # species 1, leopard
          matrix(rbinom(354, 1, 0.033)))  # species 2, wolf
So the above is code I'm following from the R help on unmarkedFrameOccuMulti. The ordering of the numbers is based on the rbinom function, i.e. wolves were seen at 3.3% of the sites surveyed.
obscov <- read.csv("grazcov2.csv")
This line produced the error: obsCovData needs M*obsNum of rows
umf <- unmarkedFrameOccuMulti(y=y, siteCovs = predcovs2, obsCovs = NULL)
predcovs2
summary(umf)
plot(umf)
umf
m1 <- occu(~1, ~Vill_Dist, umf)
This last line is the code that doesn't work; Vill_Dist is one of the covariates in the csv, spelt correctly/the same, etc.
I expected this to produce a model that would predict occurrence of leopards/wolves based on the covariates.
As I was writing this out I had an idea for what might be going wrong. I couldn't get the model to work previously because I was putting in the detection data in csv format rather than using the simple binomial function.
Is it simply that R cannot mix csv/imported data and the binomial data?
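For what it's worth, the error message itself points away from the csvs: occu() only accepts single-species unmarkedFrameOccu objects, so it will reject an unmarkedFrameOccuMulti however the data were prepared. Multi-species UMFs are fitted with occuMulti(), which takes character vectors of formulas. A minimal sketch, assuming the two-species umf built above:
library(unmarked)

# one detection formula per species, and one state formula per natural
# parameter: f1 (leopard), f2 (wolf), f12 (their interaction)
m1 <- occuMulti(detformulas = c("~1", "~1"),
                stateformulas = c("~Vill_Dist", "~Vill_Dist", "~1"),
                data = umf)
summary(m1)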

How to create Naive Bayes in R for numerical and categorical variables

I am trying to implement a Naive Bayes model in R based on known information:
Age group, e.g. "18-24" and "25-34", etc.
Gender, "male" and "female"
Region, "London" and "Wales", etc.
Income, "£10,000 - £15,000", etc.
Job, "Full Time" and "Part Time", etc.
I am experiencing errors when implementing it. My code is below:
library(readxl)
iphone <- read_excel("~/Documents/iPhone_1k.xlsx")
View(iphone)
summary(iphone)
iphone
library(caTools)
library(e1071)
set.seed(101)
sample = sample.split(iphone$Gender, SplitRatio = .7)
train = subset(iphone, sample == TRUE)
test = subset(iphone, sample == FALSE)
nB_model <- naiveBayes(Gender ~ Region + Retailer, data = train)
pred <- predict(nB_model, test, type="raw")
In the above scenario, I have an excel file called iPhone_1k (1,000 entries relating to people who have visited a website to buy an iPhone). Each row is a person visiting the website and the above demographics are known.
I have been trying to make the model work and have resorted to following the below link that uses only two variables (I would like to use a minimum of 4 but introduce more, if possible):
https://rpubs.com/dvorakt/144238
I want to be able to use these demographics to predict which retailer they will go to (also known for each instance in the iPhone_1k file). There are only 3 options. Can you please advise how to complete this?
P.S. Below is a screenshot of a simplified version of the data I have used to keep it simple in R. Once I get some code to work, I'll expand the number of variables and entries.
You are setting the problem up incorrectly. It should be:
naiveBayes(Retailer ~ Gender + Region + AgeGroup, data = train)
or in short
naiveBayes(Retailer ~ ., data = train)
Also, you might need to convert the columns into factors if they are characters. You can do this for all columns, right after reading from Excel, with
iphone[] <- lapply(iphone, factor)
Note that if you add numeric variables in the future, you should not apply this step on them.
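If you do add numeric variables later, one safe variant (a sketch on the same data frame) is to convert only the character columns:
# convert only the character columns to factors, leaving numeric ones alone
chr_cols <- sapply(iphone, is.character)
iphone[chr_cols] <- lapply(iphone[chr_cols], factor)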

R: Difficulty analyzing GSE7864 from NCBI GEO in Limma

I am trying to analyze GSE7864 and would like to know how miR34a, miR34b, and miR34c influence gene expression, i.e., what are the differentially expressed genes (DGE) caused by miR34a, miR34b, and miR34c, respectively?
The following is my code, but I am not sure how to construct a design matrix from the tTarget information (i.e., the targets frame, in the sense of the Limma tutorial). I tried selecting a subset by Cy3, giving the subsetted targets frame called sTarget. I know sTarget corresponds to a two-color design with a common reference (p. 37 of the Limma tutorial), but using sTarget alone I cannot fit a linear model in Limma, since there are not enough replicates for each treatment. In this case, how can I get the DGE perturbed by miR34a, miR34b, and miR34c, respectively? Or is there a way to obtain the DGE using all of the arrays instead of just the 3 in sTarget? If so, how do I construct the design matrix and the contrast matrix? I cannot find a similar example in the Limma tutorial.
If a 2-fold change is used to measure the extent of DGE for GSM190752 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM190752), then FC = 10^VALUE (since VALUE is a LOG10 RATIO)? And are the genes with FC > 2 or FC < 1/2 the DGE perturbed by miR34a?
Any help is appreciated!
Kevin
The code I used is listed:
#https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7864
eset <- getGEO(filename = "GSE7864_series_matrix.txt.gz")
tCy3 = rep(c("A549H1", "HCT116Dicer", "TOV21GH1", "DLDDicer", "HeLa", "A549p53", "TOV21Gp53"), each = 4)
tCy5 = rep(c("Luc", "miR34a", "miR34b", "miR34c"), times = 7)
pd <- pData(eset)
tTarget <- data.frame(gsm = rownames(pd), Cy3 = tCy3, Cy5 = tCy5)
sCy3 = c("A549H1")
sCy5 = c("miR34a", "miR34b", "miR34c")
isSelected <- (tTarget$Cy3 %in% sCy3) & (tTarget$Cy5 %in% sCy5)
sTarget <- tTarget[isSelected, ]
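Since the series-matrix VALUEs are already log ratios against each cell line's reference channel, one way to use all 28 arrays rather than just the 3 in sTarget is to block on cell line and treat the Luc control arrays as the baseline. This is only a sketch of that idea (it assumes the arrays in eset are in the same order as tCy3/tCy5, and it is not necessarily what the Bioconductor answer below does):
library(limma)

treatment <- factor(tCy5, levels = c("Luc", "miR34a", "miR34b", "miR34c"))
cellline <- factor(tCy3)

# with Luc as the baseline level, each miR34x coefficient is the
# treatment-vs-control effect, averaged over the seven cell lines
design <- model.matrix(~ cellline + treatment)
fit <- eBayes(lmFit(eset, design))

topTable(fit, coef = "treatmentmiR34a")  # DGE attributable to miR34a
topTable(fit, coef = "treatmentmiR34b")
topTable(fit, coef = "treatmentmiR34c")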
Update: I asked on the Bioconductor support site and got the answer there: https://support.bioconductor.org/p/91258/#91332

Getting a constant answer while finding patterns with neuralnet package

I'm trying to find patterns in a large dataset using the neuralnet package.
My data file looks something like this (30,204,447 rows):
id.company,EPS.or.Sales,FQ.or.FY,fiscal,date,value
000001,EPS,FY,2001,20020201,-5.520000
000001,SAL,FQ,2000,20020401,70.300003
000001,SAL,FY,2001,20020325,49.200001
000002,EPS,FQ,2008,20071009,-4.000000
000002,SAL,FY,2008,20071009,1.400000
I have split this initial file into four new files for annual/quarterly sales/EPS and it is on those files that I want to use neural networks to see if I can use the variables id.company, fiscal and date in the case below to predict the annual sales results.
To do so, I have written the following code:
dataset <- read.table("fy_sal_data.txt", header = TRUE, sep = "\t")  # my file actually uses tab separators, not commas
#extract training set and testing set
trainset <- dataset[1:1000, ]
testset <- dataset[1001:2000, ]
#building the NN
ann <- neuralnet(value ~ id.company + fiscal + date, trainset, hidden = 3,
lifesign="minimal", threshold=0.01)
#testing the output
temp_test <- subset(testset, select=c("id.company", "fiscal", "date"))
ann.results <- compute(ann, temp_test)
#display the results
cleanoutput <- cbind(testset$value, as.data.frame(ann.results$net.result))
colnames(cleanoutput) <- c("Expected Output", "NN Output")
head(cleanoutput, 30)
Now my problem is that the compute function returns a constant answer no matter the inputs of the testing set.
Expected Output NN Output
1001 2006.500000 1417.796651
1002 2009.000000 1417.796651
1003 2006.500000 1417.796651
1004 2002.500000 1417.796651
I am very new to R and its neural network packages, but I have found online that some of the reasons for such results can be either:
an insufficient number of training examples (here I'm using a thousand, but I've also tried using a million rows and the results were the same, only it took 4h to train)
or an error in the formula.
I am sure I'm doing something wrong but I can't seem to figure out what.
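One further cause worth checking, beyond the two above: neuralnet is sensitive to input scale, and raw inputs like date = 20020201 can saturate the hidden units so that every test row yields the same output. A minimal sketch of rescaling to [0, 1] before training (same column names as above; id.company is dropped here because it is an identifier, not a quantity, and would need encoding to be useful):
library(neuralnet)

# scale inputs and target to [0, 1] using the training-set ranges
cols <- c("fiscal", "date", "value")
mins <- sapply(trainset[cols], min)
maxs <- sapply(trainset[cols], max)
scaled_train <- as.data.frame(scale(trainset[cols], center = mins, scale = maxs - mins))

ann <- neuralnet(value ~ fiscal + date, scaled_train, hidden = 3,
                 lifesign = "minimal", threshold = 0.01)

# scale the test set with the *training* ranges, then map predictions back
scaled_test <- as.data.frame(scale(testset[cols], center = mins, scale = maxs - mins))
ann.results <- compute(ann, scaled_test[, c("fiscal", "date")])
predicted <- ann.results$net.result * (maxs["value"] - mins["value"]) + mins["value"]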
