I'm trying to apply inverse probability weights to a regression, but lm() only supports analytic weights. This is part of a replication I'm working on: the original author uses pweight in Stata, and I'm trying to reproduce the results in R. The analytic weights produce smaller standard errors, which changes whether some of my variables are significant.
I've looked at the survey package, but I'm not sure how to prepare a survey design object for use with svyglm(). Is this the approach I want, or is there an easier way to apply inverse probability weights?
Here is a dput() of the data:
data <- structure(list(lexptot = c(9.1595012302023, 9.86330744180814,
8.92372556833205, 8.58202430280175, 10.1133857229336), progvillm = c(1L,
1L, 1L, 1L, 0L), sexhead = c(1L, 1L, 0L, 1L, 1L), agehead = c(79L,
43L, 52L, 48L, 35L), weight = c(1.04273509979248, 1.01139605045319,
1.01139605045319, 1.01139605045319, 0.76305216550827)), .Names = c("lexptot",
"progvillm", "sexhead", "agehead", "weight"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
Linear Model (using analytic weights)
prog.lm <- lm(lexptot ~ progvillm + sexhead + agehead, data = data, weights = weight)
summary(prog.lm)
Alright, I figured it out and thought I would update the post in case others are trying to work this out. It's actually pretty straightforward.
library(survey)

# give each row its own id: no clustering, no strata, just the weights
data$X <- 1:nrow(data)
des1 <- svydesign(id = ~X, weights = ~weight, data = data)
prog.lm <- svyglm(lexptot ~ progvillm + sexhead + agehead, design = des1)
summary(prog.lm)
The standard errors now match the Stata pweight results.
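For what it's worth, a common alternative (a sketch on my part, not from the original answer) is to keep the weighted lm() fit and compute heteroskedasticity-robust standard errors with the sandwich package; Stata's pweight standard errors generally correspond to HC1-type robust standard errors on a weighted fit:
library(sandwich)
library(lmtest)

# same point estimates as svyglm(); HC1 should approximate the
# design-based (pweight) standard errors
fit <- lm(lexptot ~ progvillm + sexhead + agehead, data = data, weights = weight)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))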
I am running a regression with panel data for EU countries, with observations from 2007-16. I want to use each country's 2007 observation as the weight for all of that country's rows. Is there a simple way to do this?
This is essentially the regression I run, but I don't think the weighting is working as I intend it to:
lm(log(POP25) ~ log(EMPLOY25), weights = POP25, data = data)
structure(list(...1 = 1:6, TIME = 2007:2012, NUTS_ID = c("AT",
"AT", "AT", "AT", "AT", "AT"), NUMBER = c(1L, 1L, 1L, 1L, 1L,
1L), POP15 = c(5529.1, 5549.3, 5558.5, 5572.1, 5601.1, 5620.8
), POP20 = c(5047.1, 5063.2, 5072.6, 5090, 5127.1, 5151.9), POP25 = c(4544,
4560.7, 4571.3, 4587.8, 4621.5, 4639), EMPLOY15 = c(3863.6, 3928.7,
3909.3, 3943.9, 3982.3, 4013.4), EMPLOY20 = c(3676.2, 3737, 3723.8,
3761.9, 3802.3, 3835), EMPLOY25 = c(3333.5, 3390.4, 3384.7, 3424.6,
3454.4, 3486.4)), row.names = c(NA, 6L), class = "data.frame")
You are right, this is not doing what you expect it to. The reason is that you are supplying the full POP25 column as the weight, but you haven't yet made it explicit that you only want the POP25 value from 2007.
A weights vector needs to be the same length as the dependent and independent variables, i.e. one weight per row. The easiest way is to create a weights column in the table, where the value for every row is the POP25 value of that row's NUTS_ID in the year 2007:
library(dplyr)

data <- data |>
  group_by(NUTS_ID) |>
  mutate(weights = POP25[TIME == 2007]) |>  # broadcast the 2007 value to every year
  ungroup()
You can then supply this as the weights vector:
lm(log(POP25) ~ log(EMPLOY25), weights = weights, data = data)
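To check that the weights are what you intend (an optional sanity check, not part of the original answer), each NUTS_ID should map to exactly one weight:
# one row per country: its 2007 POP25 value
data |> distinct(NUTS_ID, weights)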
I'm trying to visualize how a neural network separates simple 2-dimensional points into 2 classes. I use geom_point for the training points and geom_raster to show how the network partitions the 2D space. Here's the code and a sample of the data:
library(tidyverse)
library(neuralnet)
data2 <- structure(list(X1 = c(152, 178, 19, 101, 145, 184), x = c(32.4083268723916,
84.5016641449183, 114.483315175202, 51.914560098842, 79.6402378017537,
82.6861507166177), y = c(18.339864264708, 83.42093185056, 63.2843023451388,
55.7215069333086, 42.6517407153766, 86.5805756277405), label = structure(c(2L,
1L, 1L, 2L, 2L, 1L), .Label = c("1", "2"), class = "factor")), row.names = c(152L,
178L, 19L, 101L, 145L, 184L), class = "data.frame")
nn.model <- neuralnet(label~x+y, data2, hidden=4, linear.output=FALSE)
background <- expand_grid(x=seq(-40,120,0.1), y=seq(0,100,0.1))
background$label <- predict(nn.model, background) %>% apply(1, which.max)
ggplot() +
  geom_raster(data = background, aes(x, y, fill = label)) +
  geom_point(data = data2, aes(x, y, color = label)) +
  scale_color_manual(values = c("white", "red"))
In the original dataset the points lie in the x range (-40, 120) and y range (0, 100), so the background grid covers that region. This approach, of course, takes some time: with a step of 0.1 the network has to predict roughly 1601 x 1001, about 1.6 million, points, which then all have to be rendered on the geom_raster layer.
My question: is there a way to optimize this, or to do it another way in ggplot (or in another package, if this problem is solved well there)? Brute-forcing the background through geom_raster feels wasteful.
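One cheap mitigation (a sketch on my part, assuming the decision regions are smooth at this scale) is to coarsen the grid: a step of 1 instead of 0.1 cuts the prediction work by a factor of about 100, and geom_raster simply draws larger cells over the same area. Converting the predicted class to a factor also gives a discrete fill instead of a gradient:
# coarser grid: 161 x 101 points instead of 1601 x 1001
background <- expand_grid(x = seq(-40, 120, 1), y = seq(0, 100, 1))
background$label <- predict(nn.model, background) %>%
  apply(1, which.max) %>%
  factor()  # discrete classes, so the fill scale is discrete too
ggplot() +
  geom_raster(data = background, aes(x, y, fill = label)) +
  geom_point(data = data2, aes(x, y, color = label)) +
  scale_color_manual(values = c("white", "red"))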
I have already searched the forum for hours (really) and am starting to get the faint feeling that I am slowly going crazy, especially since this looks to me like a really easily solvable problem.
What do I want to do?
Basically, I want to simulate clinical data. Specifically, for each patient (column 1: id) an arbitrary score (column 3: score) that depends on the assigned treatment group (column 2: group).
set.seed(123)
# Number of subjects in study
n_patients = 1000
# Score: Mean and SDs
mean_verum = 70
sd_verum = 20
mean_placebo = 40
sd_placebo = 20
# Allocate patients to treatment groups
data <- data.frame(id = as.character(1:n_patients))
data$group[1:(n_patients/2)] <- "placebo"
data$group[(n_patients/2 + 1):n_patients] <- "verum"

# Attach a score for each treatment group
# (draw n_patients values per arm so nothing is silently recycled)
data$score <- ifelse(data$group == "verum",
                     rnorm(n = n_patients, mean = mean_verum, sd = sd_verum),
                     rnorm(n = n_patients, mean = mean_placebo, sd = sd_placebo))
So far so easy. Now I want to 1) calculate the probability of an event happening (logit function) depending on the score, and then 2) actually assign an event, depending on that probability (rbinom).
I want to do this for n different probabilities/events. This is the code I've used so far:
Calculate probabilities:
a = -1
b = 0.01
p1 = 1-exp(a+b*data$score)/(1+exp(a+b*data$score))
data$p_AE1 <- p1
a = -0.5
b = 0.01
p1 = 1-exp(a+b*data$score)/(1+exp(a+b*data$score))
data$p_AE2 <- p1
…
Assign events:
data$Abbruch_AE1 <- rbinom(n_patients, 1, data$p_AE1)
data$Abbruch_AE2 <- rbinom(n_patients, 1, data$p_AE2)
…
Obviously this is really inefficient, and I would like to be able to scale it up or down easily, depending on how many probabilities/events I want to simulate.
The problem is that I simply don't see how to simultaneously a) generate a new column in the data frame for each event, b) apply the function that assigns the probabilities/events, and c) do this for a number n of different formulas, each with its specific a and b.
I am sure the solution to this problem is a simple one; what I didn't manage was to do all these things at once. I have played around with for loops, all to no avail.
Any help would be greatly appreciated!
This is how my data frame looks:
structure(list(id = structure(1:3, .Label = c("1", "2", "3"), class = "factor"),
group = c("placebo", "placebo", "placebo"), score = c(25.791868726014,
45.1376741831306, 35.0661624307525), p_AE1 = c(0.677450814266315,
0.633816117436442, 0.656861351663365), p_AE2 = c(0.560226492151216,
0.512153420188678, 0.537265362130761), p_AE3 = c(0.435875409622676,
0.389033483248856, 0.413221988111604), p_AE4 = c(0.319098312196655,
0.278608032377073, 0.299294085148527), p_AE5 = c(0.221332386680766,
0.189789774534235, 0.205762225373345), p_AE6 = c(0.147051201194953,
0.124403316086538, 0.135795233451071), p_AE7 = c(0.0946686004658072,
0.0793379289917946, 0.0870131973838217), p_AE8 = c(0.0596409872667201,
0.0496714832182721, 0.0546471270895262), AbbruchAE1 = c(1L,
1L, 1L), AbbruchAE2 = c(1L, 1L, 0L), AbbruchAE3 = c(0L, 0L,
0L), AbbruchAE4 = c(0L, 1L, 0L), AbbruchAE5 = c(1L, 0L, 0L
), AbbruchAE6 = c(1L, 0L, 0L), AbbruchAE7 = c(0L, 0L, 0L),
AbbruchAE8 = c(0L, 0L, 0L)), .Names = c("id", "group", "score", "p_AE1", "p_AE2", "p_AE3", "p_AE4", "p_AE5", "p_AE6", "p_AE7", "p_AE8", "AbbruchAE1", "AbbruchAE2", "AbbruchAE3", "AbbruchAE4", "AbbruchAE5", "AbbruchAE6", "AbbruchAE7", "AbbruchAE8"), row.names = c(NA, 3L), class = "data.frame")
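A minimal sketch of one way to generalize this (my suggestion, not from the original post): store the a and b values for each event in vectors and build both columns inside one loop. Base R's plogis(x) equals exp(x)/(1+exp(x)), so 1 - plogis(a + b*score) is exactly the formula used above. The (a, b) pairs below are placeholders:
# hypothetical (a, b) pairs, one per event; extend as needed
a_vec <- c(-1, -0.5, 0, 0.5)
b_vec <- rep(0.01, length(a_vec))

for (i in seq_along(a_vec)) {
  p <- 1 - plogis(a_vec[i] + b_vec[i] * data$score)  # same logit formula as above
  data[[paste0("p_AE", i)]] <- p                     # probability column p_AE1, p_AE2, ...
  data[[paste0("Abbruch_AE", i)]] <- rbinom(n_patients, 1, p)  # event column
}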
I am trying to make a raster file to predict where species will occur. The model that I want to run is:
glmer(colorSymbol ~ snow.cover_sc + bio2_sc + bio3_sc + alt_sc + y + x + (1|spsCode), family = binomial, data = data)
It is a mixed effects model with lat (x) and long (y) as independent predictors as well as various environmental covariates; x and y are geographical coordinates. Since x and y are predictors, I am trying to use the interpolate function in the raster package. However, the help file for interpolate gives gstat (gstat package) or Krig (fields package) as example model objects.
So my first question is: can a glmer serve as the model for the interpolate function? I realize the model assumes the predictors are not strongly collinear, so I checked the pairwise correlations between them and found none above 0.5.
My second question: if glmer can be the model object, what does the following error mean?
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  object is not a matrix
In addition: Warning message:
closing unused connection 4
For context, I get this error when I run the reduced model:
m4 <- glm(colorSymbol ~ bio_2 + alt + x + y, family = binomial, data = data)
and create a raster stack of bio_2 and alt. Links to the files are below. (Note: they can also be accessed via worldclim(dot)org/current, by downloading the ESRI 30 arc-second grids; sorry, I can't post more than 2 links.)
alt raster file
bio_2 raster file
Here is the full code:
data4<-structure(list(colorSymbol = c(1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L,
1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L), bio_2 =
structure(c(-1.65319533124791,
5.12773277360962, -2.96563302896227, 2.13829135103802, 1.62789891303799,
0.169634804466482, -0.049104811819245, -0.049104811819245,
-0.049104811819245,
-0.049104811819245, -0.267844428104972, 0.315461215323633,
0.315461215323633,
-0.63241045524785, -0.63241045524785, 0.315461215323633, -0.122018017247821,
-0.049104811819245, -0.413670838962123, 0.169634804466482), .Dim = c(20L,
1L)), alt = structure(c(0.751340496188818, 4.17865221830445,
-0.118372064874358, -0.554302064617135, 1.86371359898073,
0.0126216788907128,
-0.595103394642321, -0.595103394642321, -0.558596941461892,
-0.573629010418539,
-0.0840130501163067, -0.0625386658925246, 0.0620127626054117,
2.11925877124374, 2.11925877124374, -0.124814380141493, -0.543564872505244,
-0.719654823140258, 0.495795323925811, 0.20803857532713), .Dim = c(20L,
1L)), y = structure(c(0.0353970033327643, 1.83610974064461,
-4.82625744580285,
-4.36879939725431, 1.11073398331965, 0.101128667461893, 0.171636464096401,
0.171636497654332, 0.168280671013401, 0.168839981046544, 0.173873670671044,
0.10507991246954, 0.0997146033725779, 0.0106082967555351,
0.0106082967555351,
0.105639188944753, 0.182263153378545, 0.186305172589088, 0.133466968853809,
0.10507991246954), .Dim = c(20L, 1L)), x = structure(c(3.73193203850492,
-3.74207501883321, 1.93312018034606, -3.43881737052527, -1.87240343109311,
-0.289046352405738, 0.13805360014565, 0.13805360014565, 0.0955082550467424,
0.0997628661381006, 0.00616320902913118, 0.0869992881355855,
-0.236861953199331, -0.103499155724443, -0.103499410996004,
0.0912538992269437,
0.0997628661381006, 0.0498381307812604, 0.220158634177113,
0.0784903212244285
), .Dim = c(20L, 1L))), .Names = c("colorSymbol", "bio_2", "alt",
"y", "x"), row.names = c(NA, 20L), class = "data.frame")
library("lme4")
library("raster")
library("rgdal")
library("RArcInfo")
# fit the reduced model
m7 <- glm(colorSymbol ~ bio_2 + alt + x + y, family = binomial, data = data4)

# import the rasters
alt <- raster("alt.ovr")
bio_2 <- raster("bio_2.ovr")

# make a raster stack
rasstack <- stack(alt, bio_2)

# try to interpolate the model
test <- raster::interpolate(rasstack, m7)
Any help would be greatly appreciated!
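One possible direction (a sketch under my own assumptions, untested against these files): raster::interpolate() is intended for models whose only predictors are coordinates, while raster::predict() handles models with raster covariates. Since this model mixes both, the coordinates can be turned into raster layers with init() so that predict() treats x and y as ordinary predictors:
# build coordinate layers; init() fills a raster with its cell x or y coordinates
xr <- init(rasstack, fun = "x")
yr <- init(rasstack, fun = "y")
names(xr) <- "x"
names(yr) <- "y"

# stack covariates and coordinates, then predict probabilities from the glm
# note: names(rasstack2) must match the model terms: bio_2, alt, x, y
rasstack2 <- stack(rasstack, xr, yr)
pred <- raster::predict(rasstack2, m7, type = "response")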
This would seem to be a straightforward problem, but I can't find an answer for it...
How do I write a function where one of the arguments refers to a specific variable (column) name?
For example, if I have a data frame:
data=structure(list(x = 1:10, treatment = c(1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L)), .Names = c("x", "treatment"), row.names = c(NA,
-10L), class = "data.frame")
I can write a trivial function that adds treatment to the other variable in the data frame, but this only works if there is a variable called "treatment" in data.
ff <- function(data, treatment) { data + treatment }
ff(data, data$treatment)
This works, but I want to set it up so the user doesn't have to pass data$var to the function.
Is this what you want?
ff <- function(data, colname) {
data + data[[colname]]
}
ff( data, "treatment" )
or
ff <- function(data, column) {
colname <- deparse(substitute(column))
data + data[[colname]]
}
ff( data, treatment )
(the latter can lead to hard-to-find bugs if someone tries something like ff(data, 1:10), because substitute() captures the expression and deparse() turns it into the string "1:10", which is then looked up as a column name)
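A slightly more defensive variant (a sketch on my part, not from the original answer) fails loudly when the deparsed expression is not actually a column:
ff <- function(data, column) {
  colname <- deparse(substitute(column))
  # guard against expressions like 1:10 that deparse to a non-column name
  stopifnot(colname %in% names(data))
  data + data[[colname]]
}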