Trying to track down why I'm getting this error: Error in hist.default(temp_summer[(temp_Period == "1951-1980")], plot = FALSE) : invalid number of 'breaks'
tempdata <- read.csv("NH.Ts+dSST.csv", skip = 1, na.strings = "***")
tempdata$Period <- factor(NA, levels = c("1921-1950", "1951-1980", "1981-2010"), ordered = TRUE)
tempdata$Period[(tempdata$Year > 1920) & (tempdata$Year < 1951)] <- "1921-1950"
tempdata$Period[(tempdata$Year > 1950) & (tempdata$Year < 1981)] <- "1951-1980"
tempdata$Period[(tempdata$Year > 1980) & (tempdata$Year < 2011)] <- "1981-2010"
# Combine the temperature data for June, July, and August
temp_summer <- c(tempdata$June, tempdata$July, tempdata$Aug)
# Mirror the Period information for temp_summer
temp_Period <- c(tempdata$Period, tempdata$Period, tempdata$Period)
# Repopulate the factor information
temp_Period <- factor(temp_Period,
                      levels = 1:nlevels(tempdata$Period),
                      labels = levels(tempdata$Period))
hist(temp_summer[(temp_Period == "1951-1980")], plot = FALSE)
data link:
https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv
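For reference, here is the factor round-trip the code above relies on, in isolation (a minimal sketch; note that on R >= 4.1.0, c() applied to factors keeps the factor class, so the integer-code step behaves differently there):

```r
# On R < 4.1.0, c() applied to a factor drops to the underlying integer codes
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
codes <- c(f, f)                 # integer codes: 1 2 1 1 2 1 (on older R)
# Re-attach the level labels, as done for temp_Period above
back <- factor(codes,
               levels = 1:nlevels(f),
               labels = levels(f))
```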
I have survey data (picture of a sample below) that I'm using to find 95% confidence intervals. The Q#d columns (Q1d, Q2d, etc.) each correspond to a different question on the survey (Likert-scale responses, dichotomized: 1 = yes, 0 = no). The Intervention column indicates whether the results were collected before intervention (FALSE) or after intervention (TRUE). What I want is the 95% confidence interval on the difference in proportions, before vs. after intervention, for each question.
For example, say that for Q1d the proportion that answered "yes" is .2 before intervention and .5 after. The difference is .3, or 30%, and I want to calculate the confidence interval on that difference (say, between 25% and 35%). I want to do this for every question in the survey (all of the Q#d columns). I have not been able to find a way to iterate through all of the questions (columns). I've written a function that can do it successfully for one column, but iterating over column names isn't working for me, and I don't know how to store the results as a vector/data frame. I've included the function below. Any guidance?
Thanks so much!!
get_conf_int <- function(df, colName) {
  myenc <- enquo(colName)
  p <- df %>%
    group_by(Intervention) %>%
    summarize(success = sum(UQ(myenc) == 1, na.rm = TRUE), total = n())
  prop.test(x = pull(p, success), n = pull(p, total))$conf.int[2:1] * -100
}
And I can call the function like:
get_conf_int(db, Q1d)
I'm using prop.test to find confidence interval for now, but open to other methods as well.
I can't tell you whether prop.test is better than binom.test; you should read more about those two.
library(dplyr)
# dummy data just for this example; substitute your actual survey data
df <- data.frame(Intervention = sample(x = c(TRUE, FALSE), size = 20, replace = TRUE),
                 Q1d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q2d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q3d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q4d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q5d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q6d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q7d = sample(x = 0:1, size = 20, replace = TRUE))
# vector with the sum of FALSE and the sum of TRUE
count_Intervention <- c(length(which(!df$Intervention)),length(which(df$Intervention)))
# group by TRUE/FALSE and sum(count) the 1's
df_sum <- df %>%
  group_by(Intervention) %>%
  summarize(across(colnames(df)[-1], list(sum)))
# I also added the p-value, which might be important
new_df <- data.frame(Question = as.character(), LowerConfInt = as.numeric(),
                     UpperConfInt = as.numeric(), Pvalue = as.numeric())
# loop
for (Q_d in colnames(df_sum)[-1]) {
  lower  <- prop.test(as.vector(t(df_sum[, Q_d])), count_Intervention)$conf.int[1]
  upper  <- prop.test(as.vector(t(df_sum[, Q_d])), count_Intervention)$conf.int[2]
  pvalue <- prop.test(as.vector(t(df_sum[, Q_d])), count_Intervention)$p.value
  new_df <- rbind(new_df, data.frame(Q_d, lower, upper, pvalue))
}
new_df
Q_d lower upper pvalue
1 Q1d_1 -0.2067593 0.8661000 0.34844258
2 Q2d_1 -0.9193444 -0.1575787 0.05528499
3 Q3d_1 -0.4558861 0.5218202 1.00000000
4 Q4d_1 -0.4558861 0.5218202 1.00000000
5 Q5d_1 -0.7487377 0.3751114 0.74153726
6 Q6d_1 -0.2067593 0.8661000 0.34844258
7 Q7d_1 -0.4558861 0.5218202 1.00000000
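As a side note, the loop above could also be written without growing new_df row by row, and with a single prop.test() call per column; a sketch assuming the same df_sum and count_Intervention as above:

```r
# Build one result row per question, then bind them all at once
rows <- lapply(colnames(df_sum)[-1], function(Q_d) {
  test <- prop.test(as.vector(t(df_sum[, Q_d])), count_Intervention)
  data.frame(Q_d    = Q_d,
             lower  = test$conf.int[1],
             upper  = test$conf.int[2],
             pvalue = test$p.value)
})
new_df <- do.call(rbind, rows)
```

This also avoids calling prop.test() three times per column just to pull out different pieces of the same result.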
I am trying to do a propensity matched analysis but am having a lot of trouble. I have a large data set with an exposure coded as 0 (no exposure) and 1 (exposure) and am trying to match based on a couple of variables. Basically, I was trying to follow a tutorial on propensity matching via Coursera but am getting a really weird output. My initial dataset has 2,202 distinct observations. However, once I do the matching, my dataset has 3,074 distinct observations, which is obviously not supposed to happen. It creates a matched sample, but I'm not sure where the additional observations come from...
Does anyone know what I'm doing wrong? I have been trying to troubleshoot for the past week but keep coming up empty handed.
Here is what I'm doing:
race <- as.numeric(cohort$race_eth)
insurance <- as.numeric(cohort$privateinsurance)
language <- as.numeric(cohort$primarylanguage)
bloodpressure <- as.numeric(cohort$bloodpressure
bmi <- cohort$bmiatdelivery
exp <- as.numeric(cohort$prechange)
out <- as.numeric(cohort$tdapvaccinedate_yn)
#merge new dataset
propensity <- cbind(race, insurance, language, bloodpressure, bmi, exp, out)
propensity <- data.frame(prop_score)
#covariates to use in matching
xvars <- c("race", "insurance", "language", "bloodpressure", "age", "bmi")
table1 <- CreateTableOne(vars=xvars, strata="exp",data=propensity, test=FALSE)
print(table1, smd=TRUE)
#do matching
greedymatch <- Match(Tr=propensity$exp, M=1, X=propensity[xvars])
matched <- propensity[unlist(greedymatch[c("index.treated", "index.control")]),] # THIS IS WHERE THE PROBLEM OCCURS SHOWING THAT I HAVE 3074 OBSERVATIONS
Your question is not easy to tackle, as the provided code snippet seems to have some issues before the error you report, so I was not able to reproduce it. Still, I have created a dummy dataset based on random numbers below and walked through your steps, adding comments where potential errors could arise in your current code. Maybe this already helps!
#Indicate which Packages are needed
library(tableone)
library(Matching)
##Create Reproducible Example Dataset
# I create dummy data with random variables
race = as.numeric(rep(c(1,2),5))
insurance = as.numeric(sample(1:100,10))
language = as.numeric(sample(1:100,10))
bloodpressure = as.numeric(sample(1:100,10))
bmi = sample(1:100,10)
# it would be safer to rename exp, since exp() is also a base R function
exp = rep(c(1,0),5)
out = sample(1:100,10)
# You did not include an age variable here but refer to it later
# did you maybe forget to include it?
age= sample(1:50,10)
#merge new dataset
propensity <- cbind(race, insurance, language, bloodpressure, bmi, exp, out,age)
# In your example a new dataset "prop_score" appeared
# I can only guess you meant the just-created dataset propensity
propensity <- data.frame(propensity)
#Input Dimension
dim(propensity)
#covariates to use in matching
xvars <- c("race", "insurance", "language", "bloodpressure", "age", "bmi")
table1 <- CreateTableOne(vars=xvars, strata="exp",data=propensity, test=FALSE)
print(table1, smd=TRUE)
#do matching
greedymatch <- Match(Tr=propensity$exp, M=1, X=propensity[,xvars])
matched <- propensity[unlist(greedymatch[c("index.treated", "index.control")]),]
#Output Dimensions
dim(matched) #The Dimensions are fine
Thanks for responding! I ran your code and it still didn't work. What do you mean by saying it would be safer to rename exp because exp() is also a base function? I have attached a reprex of my dataset. Does this help?
cohort <- data.frame(
  outcome = c(0, 0, 0, 1, 0),
  exposure = c(0, 0, 0, 1, 1),
  insurance = c(1, 1, 1, 1, 1),
  language = c(3, 1, 1, 1, 1),
  age = c(32, 36, 22, 26, 38),
  bmi = c(23.8407, 25.354099, 29.709999, 26.9098, 36.290401),
  race_eth = as.factor(c("5", "1", "2", "1", "2")),
  nullip = as.factor(c("1", "0", "1", "1", "0"))
)
library(tableone)
library(Matching)
#recode variables to use in matching
race <- as.numeric(cohort$race_eth)
insurance <- as.numeric(cohort$insurance)
language <- as.numeric(cohort$language)
nullip <- as.numeric(cohort$nullip)
age <- cohort$age
bmi <- cohort$bmi
exp <- as.numeric(cohort$exposure)
out <- as.numeric(cohort$outcome)
#create new dataset
prop_score <- cbind(race, insurance, language, nullip, pnc, age, bmi, exp, out)
prop_score <- data.frame(prop_score)
xvars <- c("race", "insurance", "language", "nullip", "pnc", "age", "bmi")
#table 1
table1 <- CreateTableOne(vars=xvars, strata="exp",data=prop, test=FALSE)
print(table1, smd=TRUE)
#matching
greedymatch <- Match(Tr=prop$exp, M=1, X=prop[xvars])
matched <- prop[unlist(greedymatch[c("index.treated", "index.control")]),]
Just add replace = FALSE to your Match() call. By default, Match() matches with replacement, so the same control row can be paired with several treated units and its index then appears multiple times in index.control; that is why your matched data set ends up with more observations than the original 2,202.
greedymatch <- Match(Tr = propensity$exp, M = 1, X = propensity[xvars], replace = FALSE)
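To confirm that matching with replacement is what inflates the row count, you can check the original greedymatch object (from the call without replace = FALSE) for duplicated control indices; a minimal sketch, assuming greedymatch exists as in the question:

```r
# With the default replace = TRUE, a control row may be matched to
# several treated units, so its index repeats in index.control:
sum(duplicated(greedymatch$index.control))
# Any value > 0 means controls were reused, which explains why the
# matched data set has more rows than the original data.
```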
I have a function called sim.LifeAnnuity, which takes three parameters: nsim (number of samples), a (age), and g (gender, either "Male" or "Female"). There is another table that I am drawing data from called LifeTable, which covers ages 0 to 119. The function is supposed to return the number of payments purchased by an individual of age a and the given gender. The probability is affected by gender and age: it should use the prob values from ages a to 119. This is my code thus far:
sim.lifeAnnuity <- function(nsim, a, g = c("Female", "Male")) {
  age <- c(1:(120 - a))
  malep <- c(Lab1$Male_Prob[(a + 1):120])
  falep <- c(Lab1$Female_Prob[(a + 1):120])
  if (a >= 0 & a <= 119 & g == "Male") {
    male <- sample(age, size = nsim, replace = TRUE, prob = malep)
    return(male)
  }
  if (a >= 0 & a <= 119 & g == "Female") {
    female <- sample(age, size = nsim, replace = TRUE, prob = falep)
    return(female)
  }
}
sim.lifeAnnuity(nsim = 5, a = 40, g = "Male")
I created two vectors, malep and falep, and used them in sample(). The return value should be a vector of nsim payment counts whose average should be close to the age I pick. However, the values I receive are not close to the age I chose.
In a data frame, I am trying to look for data points that are more than (threshold * s.d.) away from the mean. The dimensions of the data frame are as follows:
[1] 4032 4
To find the data points for the above condition, I did:
df$mean = rollapply(df$value, width = 2, FUN = mean, align = "right", fill = "extend")
df$sd = rollapply(df$value, width = 2, FUN = sd, align = "right", fill = "extend")
After the above the head(df) looks like:
timestamp value mean sd
2007-03-14 1393577520 37.718 38.088 0.5232590
2007-03-15 1393577220 38.458 38.088 0.5232590
2007-03-16 1393576920 37.912 38.185 0.3860803
2007-03-17 1393576620 40.352 39.132 1.7253405
2007-03-18 1393576320 38.474 39.413 1.3279465
2007-03-19 1393576020 39.878 39.176 0.9927779
To find the datapoints:
anomaly = df[df$value > abs((threshold*df$sd + df$mean) |
(df$mean - threshold*df$sd)),]
Is the above the correct way to find data points that are more than (threshold * s.d.) away from the mean? The reason I am suspicious is that the dim of anomaly is the same as that of df.
This will do it
# creating some dummy data
m <- matrix(runif(16128,-1,1), ncol = 4)
tresh <- .004+1
m[which(abs(m-mean(m)) > tresh*sd(m), arr.ind = T)]
Where m denotes your matrix (or your value column, depending on which one you take the mean/sd of) and tresh your threshold.
Update: Here are the first few entries of my result:
dat <- df$value[which(abs(df$value-mean(df$value)) > tresh*sd(df$value))]
head(dat)
[1] 51.846 48.568 44.986 49.108 53.404 46.314
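Applied to the rolling mean/sd columns from the question, rather than a single global mean and sd, the same idea becomes (a sketch, assuming df has the value, mean and sd columns shown above and a numeric threshold of your choosing):

```r
threshold <- 2  # illustrative value; use your own threshold
# Keep rows whose value lies more than threshold rolling s.d.s
# away from the rolling mean (in either direction)
anomaly <- df[abs(df$value - df$mean) > threshold * df$sd, ]
```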