Subsetting to find the anomaly in R

In a data frame, I am trying to look for data points that are more than (threshold * s.d.) away from the mean. The dimensions of the data frame are:
[1] 4032 4
To find the data points meeting the above condition, I first computed a rolling mean and rolling standard deviation:
library(zoo)  # provides rollapply()
df$mean <- rollapply(df$value, width = 2, FUN = mean, align = "right", fill = "extend")
df$sd <- rollapply(df$value, width = 2, FUN = sd, align = "right", fill = "extend")
After the above the head(df) looks like:
timestamp value mean sd
2007-03-14 1393577520 37.718 38.088 0.5232590
2007-03-15 1393577220 38.458 38.088 0.5232590
2007-03-16 1393576920 37.912 38.185 0.3860803
2007-03-17 1393576620 40.352 39.132 1.7253405
2007-03-18 1393576320 38.474 39.413 1.3279465
2007-03-19 1393576020 39.878 39.176 0.9927779
To find the data points:
anomaly = df[df$value > abs((threshold*df$sd + df$mean) |
(df$mean - threshold*df$sd)),]
Is the above the correct way to find data points that are more than (threshold * s.d.) away from the mean? The reason I am suspicious is that the dim of anomaly is the same as that of df.

This will do it:
# creating some dummy data
m <- matrix(runif(16128,-1,1), ncol = 4)
tresh <- .004+1
m[which(abs(m-mean(m)) > tresh*sd(m), arr.ind = T)]
where m denotes your matrix (or your value column, depending on which you take the mean/sd over) and tresh your threshold.
Update: here are the first couple of entries of my result:
dat <- df$value[which(abs(df$value-mean(df$value)) > tresh*sd(df$value))]
head(dat)
[1] 51.846 48.568 44.986 49.108 53.404 46.314
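If you want to keep the rolling mean/sd columns from the question, a row-wise version of the same test might look like the following. This is a minimal sketch, assuming threshold is already defined; it is not part of the original answer.
# flag rows whose value is more than threshold rolling SDs away from the rolling mean
anomaly <- df[abs(df$value - df$mean) > threshold * df$sd, ]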

Related

Sampling Random Points Closer to Today?

I have this dataset in R:
date = sample(seq(as.Date('2015-01-01'), as.Date('2022-08-12'), by = "day"), 1000)
var1 = rnorm(1000, 1000,1000)
var2 = rnorm(1000, 1000,1000)
var3 = rnorm(1000, 1000,1000)
question_data = data.frame(date, var1, var2, var3)
question_data$id = 1:nrow(question_data)
I want to take 1000 random samples from this data such that "there are more points closer to today's date compared to the starting date".
I thought of a very simple way to do this - first, I order this dataset by date:
question_data <- question_data[order(question_data$date, decreasing = TRUE), ]
Then, I create a new "date_id":
question_data$date_id = 1:nrow(question_data)
From here, I choose an arbitrary cut-off and arbitrarily take weighted samples:
part_1 <- question_data[which(question_data$date_id > 750), ]
part_2 <- question_data[which(question_data$date_id < 750), ]
library(dplyr)
random_sample = rbind(sample_n(part_1, 250, replace = TRUE), sample_n(part_2, 500, replace = TRUE))
Is there a better way to do this? Perhaps some methods that might be able to perform "smooth" random samples?
Thank you!
We can see the distribution of dates in the original data set if we do:
hist(lubridate::year(question_data$date), breaks = 2014:2022 + 0.5)
If we want to sample the dates more frequently as they get closer to the current time, we can first arrange the data frame in date order:
question_data <- question_data[order(question_data$date),]
Now we can sample from all rows of the data frame, but specify the row number itself as the sampling weight, so that the probability of a particular row being selected rises from essentially 0 for row 1 to about 1 in 500 for the final row. Let's take a sample of 100 using this method and look at the histogram of dates:
n <- 100
samp <- question_data[sample(seq(nrow(question_data)), n, replace = FALSE,
                             prob = seq(nrow(question_data))), ]
hist(lubridate::year(samp$date), breaks = 2014:2022 + 0.5)
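For a tunable, "smoother" recency bias, one possible variation (not in the original answer) is exponential weighting; rate is a hypothetical tuning parameter, with larger values concentrating the sample more strongly near today:
# exponentially increasing sampling weights by date order (data must already be sorted by date)
rate <- 0.005
w <- exp(rate * seq(nrow(question_data)))
samp2 <- question_data[sample(seq(nrow(question_data)), n, replace = FALSE, prob = w), ]
hist(lubridate::year(samp2$date), breaks = 2014:2022 + 0.5)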

How to calculate the average of multiple standard deviations in R?

I am trying to figure out how to calculate the standard deviation of a dataset when all I have is a handful of group standard deviations. Let's look at this MWE:
set.seed(1234)
dummy_data <- data.frame(
  "col_1" = sample(1:7, size = 10, replace = TRUE),
  "col_2" = sample(1:7, size = 10, replace = TRUE),
  "col_3" = sample(1:7, size = 10, replace = TRUE),
  "col_4" = sample(1:7, size = 10, replace = TRUE)
)
Now since I know all the data points I can calculate the total standard deviation as follows:
> sd(as.matrix(dummy_data))
[1] 1.727604
But the real data that I have at hand is the following:
> dplyr::summarise_all(dummy_data, sd)
col_1 col_2 col_3 col_4
1 1.837873 1.873796 1.37032 1.888562
If I follow the usual method of calculating the average of multiple standard deviations with similar sample sizes, I would apply the following:
sds <- dplyr::summarise_all(dummy_data, sd)
vars <- sds^2
mean_sd <- sqrt(sum(vars) / (length(vars) - 1))
> mean_sd
[1] 2.027588
which is not the same! Now I have tried without the minus one:
> sqrt(sum(vars) / (length(vars)))
[1] 1.755942
which does not solve the problem either. I have also tried defining my own standard deviation function:
own_sd <- function(x) {
  sqrt(sum((x - mean(x))^2) / length(x))
}
to eliminate the n - 1 in the dplyr::summarise_all() step, and then averaged according to the step above:
> sqrt(sum(dplyr::summarise_all(dummy_data, own_sd)^2) / 3)
[1] 1.923538
> sqrt(sum(dplyr::summarise_all(dummy_data, own_sd)^2) / 4)
[1] 1.665833
But all seem to give a different result than the sd(as.matrix()) method. What is going wrong here?
You can't calculate a global SD from only knowing group SDs. For example:
x1 = 1:5
x2 = 11:15
x3 = 101:105
## all the SDs are equal
(sd1 = sd(x1))
#[1] 1.581139
(sd2 = sd(x2))
#[1] 1.581139
(sd3 = sd(x3))
#[1] 1.581139
## however, combining the groups in pairs gives very different results
sd(c(x1, x2))
# [1] 5.477226
sd(c(x1, x3))
# [1] 52.72571
This demonstrates that even if the sample sizes are identical, knowing the standard deviation of two groups does not help you calculate the standard deviation of those groups combined.
As per Merijn van Tilborg's comment, if you also know the group sizes and the group means, the calculation is possible, as shown here.
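For completeness, here is a minimal sketch (not from the original answer) of that reconstruction for the dummy_data example, where every group is fully observed so the sizes and means are known. It uses the exact decomposition of the total sum of squares into within-group and between-group parts:
ns <- sapply(dummy_data, length)  # group sizes
ms <- sapply(dummy_data, mean)    # group means
vs <- sapply(dummy_data, var)     # group variances
grand_mean <- sum(ns * ms) / sum(ns)
# total SS = within-group SS + between-group SS
total_ss <- sum((ns - 1) * vs) + sum(ns * (ms - grand_mean)^2)
sqrt(total_ss / (sum(ns) - 1))  # should match sd(as.matrix(dummy_data)), 1.727604 above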

Calculating 95% Confidence Interval for Several Columns at Once in R

I have survey data (picture sample below) that I'm using to find 95% confidence intervals. The Q#d columns (Q1d, Q2d, etc.) each correspond to different questions on the survey (Likert scale with results dichotomized: 1 = yes, 0 = no). The Intervention column describes whether the results were recorded before intervention (FALSE) or after intervention (TRUE). What I want is the 95% confidence interval on the difference in proportions for each question before and after intervention.
For example, say that for Q1d the proportion that answered "yes" before the intervention is .2 and after the intervention is .5. The difference is .3, or 30%, and I want to calculate the confidence interval (say, between 25% and 35%) on that difference. I want to do this for every question in the survey (all Q#d columns). I have not been able to find a way to iterate through all the columns, and I don't know how to store the results as a vector/data frame. I've written a function that successfully handles one column; it is included below. Any guidance?
Thanks so much!!
get_conf_int <- function(df, colName) {
  myenc <- enquo(colName)
  p <- df %>%
    group_by(Intervention) %>%
    summarize(success = sum(UQ(myenc) == 1, na.rm = TRUE), total = n())
  prop.test(x = pull(p, success), n = pull(p, total))$conf.int[2:1] * -100
}
And I can call the function like:
get_conf_int(db, Q1d)
I'm using prop.test to find confidence interval for now, but open to other methods as well.
I can't say whether prop.test is better than binom.test; you should read more about those two.
library(dplyr)
# just for this example; you have your survey here
df <- data.frame(Intervention = sample(x = c(TRUE, FALSE), size = 20, replace = TRUE),
                 Q1d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q2d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q3d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q4d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q5d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q6d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q7d = sample(x = 0:1, size = 20, replace = TRUE))
# vector with the count of FALSE and the count of TRUE
count_Intervention <- c(length(which(!df$Intervention)), length(which(df$Intervention)))
# group by TRUE/FALSE and sum the 1's in each question column
df_sum <- df %>%
  group_by(Intervention) %>%
  summarize(across(colnames(df)[-1], list(sum)))
# data frame for the results; the p-value is included as it might be important
new_df <- data.frame(Question = as.character(), LowerConfInt = as.numeric(),
                     UpperConfInt = as.numeric(), Pvalue = as.numeric())
# loop over the question columns, running one prop.test per question
for (Q_d in colnames(df_sum)[-1]) {
  test <- prop.test(as.vector(t(df_sum[, Q_d])), count_Intervention)
  new_df <- rbind(new_df, data.frame(Q_d, lower = test$conf.int[1],
                                     upper = test$conf.int[2], pvalue = test$p.value))
}
new_df
Q_d lower upper pvalue
1 Q1d_1 -0.2067593 0.8661000 0.34844258
2 Q2d_1 -0.9193444 -0.1575787 0.05528499
3 Q3d_1 -0.4558861 0.5218202 1.00000000
4 Q4d_1 -0.4558861 0.5218202 1.00000000
5 Q5d_1 -0.7487377 0.3751114 0.74153726
6 Q6d_1 -0.2067593 0.8661000 0.34844258
7 Q7d_1 -0.4558861 0.5218202 1.00000000
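For reference, the asker's original per-column function could also be kept. Here is a minimal, untested sketch of it using the newer curly-curly ({{ }}) tidy-evaluation syntax in place of enquo()/UQ():
get_conf_int <- function(df, colName) {
  p <- df %>%
    group_by(Intervention) %>%
    summarize(success = sum({{ colName }} == 1, na.rm = TRUE), total = n())
  prop.test(x = pull(p, success), n = pull(p, total))$conf.int[2:1] * -100
}
# usage, e.g.: get_conf_int(df, Q1d)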

Plotting after Doing Simulation of Linear Regression with R

I am running a linear regression simulation in R.
The regression model I consider is y_i = a + b_1 * x_1i + b_2 * x_2i + e_i.
The parameter design is as follows:
x_1i ~ N(2, 1), x_2i ~ Poisson(4), e_i ~ N(0, 1), theta = (a, b_1, b_2)
In the code below, I generate 100 independent random samples of (y, x_1, x_2) 1000 times using the distributions mentioned above, and I estimate theta_hat (the estimator of theta). After getting theta_hat, I would like to plot the distributions of the estimators of a (a_hat), b_1 (b_1_hat), and b_2 (b_2_hat).
## Construct 1000 replications of each variable, 100 observations each
x_1_1000 <- as.data.frame(replicate(n = 1000, expr = rnorm(n = 100, mean = 2, sd = 1)))
colnames(x_1_1000) <- paste("x_1", 1:1000, sep = "_")
x_2_1000 <- as.data.frame(replicate(n = 1000, expr = rpois(n = 100, lambda = 4)))
colnames(x_2_1000) <- paste("x_2", 1:1000, sep = "_")
error_1000 <- as.data.frame(replicate(n = 1000, expr = rnorm(n = 100, mean = 0, sd = 1)))
colnames(error_1000) <- paste("e", 1:1000, sep = "_")
# true parameters: a = 1, b_1 = 1, b_2 = -2
y_1000 <- 1 + x_1_1000 * 1 + x_2_1000 * (-2) + error_1000
colnames(y_1000) <- paste("y", 1:1000, sep = "_")
######################################################################
lms <- lapply(1:1000, function(x) lm(y_1000[,x] ~ x_1_1000[,x] + x_2_1000[,x]))
theta_hat_1000 <- as.data.frame(sapply(lms, coef))
After running the regressions, I store the results in lms, which is a list. Because I only want the coefficients, I store them in theta_hat_1000. However, when I try to plot the distributions, I cannot get what I want. I have tried two ways to solve the problem but am still confused.
The first way was simply to rename the data frame theta_hat_1000. I successfully renamed column i, for i from 1 to 1000; however, I could not rename the rows:
rownames(theta_hat_1000[1,]) <- "ahat"
rownames(theta_hat_1000[2,]) <- "x1hat"
rownames(theta_hat_1000[3,]) <- "x2hat"
The code above showed no error message but ultimately failed to change the row names. Thus, I tried the following instead:
rownames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
This renamed the rows successfully. However, when I check whether anything is stored under that name, it reports NULL:
theta_hat_1000$ahat
NULL
Therefore, I noticed something weird and tried a second way: I unlisted theta_hat_1000. However, this also did not give what I want. The expected result was three rows, each with 1000 values, but I actually got 3000 observations in a single column.
The ideal result is three columns, each with 1000 values, in a data frame, so that I can do further processing, like using ggplot to show the distributions of the estimated coefficients.
I have been stuck on this for several hours. I would appreciate any help or suggestions.
The line theta_hat_1000$ahat in your code does not work because "ahat" is a row name, not a column name, in the data frame. You would get the result by calling theta_hat_1000["ahat", ].
However, I understand that your desired result is actually a data frame with 3 columns (and 1000 rows) representing the 3 parameters of your regression model (intercept, x1, x2). The line as.data.frame(sapply(lms, coef)) produces a data frame with 3 rows and 1000 columns. You can, for instance, transpose the matrix before converting it to a data frame to get 1000 rows and 3 columns.
theta_hat_1000 <- sapply(lms, coef)
theta_hat_1000 <- as.data.frame(t(theta_hat_1000))
colnames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
head(theta_hat_1000)
ahat x1hat x2hat
1 2.0259326 0.7417404 -2.111874
2 0.7827929 0.9437324 -1.944320
3 1.1034906 1.0091594 -2.035405
4 0.9677150 0.8168757 -1.905367
5 1.0518646 0.9616123 -1.985357
6 0.8600449 1.0781489 -2.017061
Now you could also call the variables with theta_hat_1000$ahat.
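From here, a minimal sketch (not part of the original answer) of the distribution plots the asker described, assuming the reshaped theta_hat_1000 above and the tidyr package:
library(ggplot2)
library(tidyr)
# reshape to long form: one row per (coefficient, estimate) pair
long <- pivot_longer(theta_hat_1000, cols = everything(),
                     names_to = "coefficient", values_to = "estimate")
# one histogram panel per coefficient
ggplot(long, aes(x = estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ coefficient, scales = "free_x")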

Find the y-coordinate at intersection of two curves when x is known

Background and Summary of Objective
I am trying to find the y-coordinate at the intersection of two plotted curves using R. I will provide complete details and sample data below but, in the hope that this is a simple problem, I'll be more concise up front.
The cumulative frequencies of the two curves (c1 and c2 for simplicity) are defined by the following function, where a and b are known coefficients:
f(x) = 1 / (1 + exp(-(a + b*x)))
Using the uniroot() function, I found "x" at the intersection of c1 and c2.
I had assumed that if x is known, then determining y should be simple substitution: for example, if x = 10, then y = 1 / (1 + exp(-(a + b*10))) (again, a and b are known values); however, as shown below, this is not the case.
The objective of this post is to determine how to find the y-coordinate.
Details
This data replicates respondents' stated price at which they find the product's price to be too.cheap (i.e., they question its quality) and the price at which they feel the product is a bargain.
The data will be cleaned before use to ensure that too.cheap is always less than the bargain price.
The cumulative frequency for the bargain price will be inverted to become not.bargain.
The intersection of bargain and too.cheap will represent the point at which an equal share of respondents feel the price is not a bargain and too.cheap: the point of marginal cheapness ("pmc").
Getting to the point where I'm having a challenge will take a number of steps.
Step 1: Generate some data
# load libraries for all steps
library(car)
library(ggplot2)
# function that generates the data
so.create.test.dataset <- function(n, mean){
  step.to.bargain <- round(rnorm(n = n, 3, sd = 0.75), 2)
  price.too.cheap <- round(rnorm(n = n, mean = mean, sd = floor(mean * 100 / 4) / 100), 2)
  price.bargain <- price.too.cheap + step.to.bargain
  df.temp <- cbind(price.too.cheap,
                   price.bargain)
  df.temp <- as.data.frame(df.temp)
  return(df.temp)
}
# create 389 "observations" where too.cheap has a mean value of 10.50;
# the function will also create a "bargain" price by adding random values
# with a mean of 3.00 to the too.cheap price
so.test.df <- so.create.test.dataset(n = 389, mean = 10.50)
Step 2: Create a data frame of cumulative frequencies
so.get.count <- function(p.points, p.vector){
  cc.temp <- as.data.frame(table(p.vector))
  cc.merged <- merge(p.points, cc.temp, by.x = "price.point", by.y = "p.vector", all.x = T)
  cc.extracted <- cc.merged[, "Freq"]
  cc.extracted[is.na(cc.extracted)] <- 0
  return(cc.extracted)
}
so.get.df.price <- function(df){
  # creates cumulative frequencies for three variables
  # using the price points provided by respondents
  # extract and sort all unique price points
  # Thanks to akrun for their help with this step
  price.point <- sort(unique(unlist(round(df, 2))))
  # create a new data frame to work with, having a row for each price point
  dfp <- as.data.frame(price.point)
  # create cumulative frequencies (as percentages) for each variable
  dfp$too.cheap.share <- 1 - (cumsum(so.get.count(dfp, df$price.too.cheap)) / nrow(df))
  dfp$bargain.share <- 1 - cumsum(so.get.count(dfp, df$price.bargain)) / nrow(df)
  dfp$not.bargain.share <- 1 - dfp$bargain.share  # bargain inverted so the curves will intersect
  return(dfp)
}
so.df.price <- so.get.df.price(so.test.df)
so.df.price <- so.get.df.price(so.test.df)
Step 3: Estimate the curves for the cumulative frequencies
# Too Cheap
so.l <- lm(logit(so.df.price$too.cheap.share, percents = TRUE) ~ so.df.price$price.point)
so.cof.TCh <- coef(so.l)
so.temp.nls <- nls(too.cheap.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.TCh[1], b = so.cof.TCh[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.TCh <- predict(so.temp.nls, so.df.price$price.point)
# Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point)
# Thanks to John Fox & Sanford Weisberg, "An R Companion to Applied Regression", second edition
At this point, we can plot and compare the "observed" cumulative frequencies against the estimated frequencies:
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = Pr.NBr, colour = "Not Bargain")) +
  geom_line(aes(y = too.cheap.share, colour = "too.cheap.share")) +
  geom_line(aes(y = not.bargain.share, colour = "not.bargain.share")) +
  scale_y_continuous(name = "Cumulative Frequency")
The estimate appears to fit the observations reasonably well.
Step 4: Find the intersection point for the two estimate functions
so.f <- function(x, a, b){
  # model for the curves
  1 / (1 + exp(-(a + b * x)))
}
# note, this function may also be used in step 3;
# I was building as I went, and I don't want to risk a transpositional error that breaks the example
so.pmc.x <- uniroot(function(x) so.f(x, so.cof.TCh[1], so.cof.TCh[2]) -
                                so.f(x, so.cof.NBr[1], so.cof.NBr[2]),
                    c(0, 50), tol = 0.01)$root
We may visually test the so.pmc.x by plotting it with the two estimates. If it is correct, a vertical line for so.pmc.x should pass through the intersection of too.cheap and not.bargain.
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_vline(aes(xintercept = so.pmc.x))
...which it does.
Step 5: Find y
Here is where I get stumped, and I'm sure I'm overlooking something very basic.
If a curve is defined by f(x) = 1 / (1 + exp(-(a + b*x))), and a, b, and x are all known, then shouldn't y be the result of 1 / (1 + exp(-(a + b*x))) for either estimate?
In this instance, it is not.
# We attempt to use the too.cheap estimate to find y
so.pmc.y <- so.f(so.pmc.x, so.cof.TCh[1], so.cof.TCh[2])
# In theory, y for not.bargain at price.point so.pmc.x should be the same
so.pmc.y2 <- so.f(so.pmc.x, so.cof.NBr[1], so.cof.NBr[2])
EDIT: This is where the error occurs (see the solution below):
a != so.cof.NBr[1] and b != so.cof.NBr[2]; instead, a and b should be the coefficients from so.temp.nls (not so.l).
# Which they are
#> so.pmc.y
#(Intercept)
# 0.02830516
#> so.pmc.y2
#(Intercept)
# 0.0283046
If we calculate the correct value for y, a horizontal line at yintercept = so.pmc.y, should pass through the intersection of too.cheap and not.bargain.
...which it obviously does not.
So how does one estimate y?
I've solved this, and as I suspected, it was a simple error.
My assumption that y = 1/(1+exp(-(a+bx))) is correct.
The issue is that I was using the wrong a, b coefficients.
My curve was defined using the coefficients in so.cof.NBr as defined by so.l.
# Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point)
But the resulting curve is so.temp.nls, NOT so.l.
Therefore, once I find so.pmc.x I need to extract the correct coefficients from so.temp.nls and use those to find y.
# extract coefficients from so.temp.nls
so.co <- coef(so.temp.nls)
# find y
so.pmc.y <- 1 / (1 + exp(-(so.co[1] + so.co[2] * so.pmc.x)))
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_hline(aes(yintercept = so.pmc.y))
The resulting plot shows the horizontal line passing through the intersection of the two curves, which graphically depicts the correct answer.
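To tie the steps together, here is a minimal consolidated sketch that computes both x and y from the same fitted coefficients. The object names so.nls.TCh and so.nls.NBr are hypothetical: they assume the two nls fits from step 3 were saved under distinct names instead of both overwriting so.temp.nls. Incidentally, because both curves share the logistic form, the crossing satisfies a1 + b1*x = a2 + b2*x, so x = (a2 - a1) / (b1 - b2) in closed form.
# hypothetical names: so.nls.TCh and so.nls.NBr are the step-3 nls fits, saved separately
so.co.TCh <- coef(so.nls.TCh)
so.co.NBr <- coef(so.nls.NBr)
# find x at the intersection of the two fitted curves
so.pmc.x <- uniroot(function(x) so.f(x, so.co.TCh[1], so.co.TCh[2]) -
                                so.f(x, so.co.NBr[1], so.co.NBr[2]),
                    c(0, 50), tol = 0.01)$root
# substitute x back into either fitted curve to get y
so.pmc.y <- so.f(so.pmc.x, so.co.TCh[1], so.co.TCh[2])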
