Set Acceptable Region for My Skewness Test in R

I am writing the function below to run a skewness test for a vector of sample sizes (10, 20, 50, 100), with 1000 replicates each.
library(moments)
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
My conditions
I reject the null hypothesis if the statistic satisfies either of two (2) conditions:
less than -2
or greater than +2.
What I have
In my R function I have only managed to express the second condition.
What I want
How do I include both the first and the second condition in my function?

Wrapping the statistic in abs() is probably the easiest way to cover both conditions, since abs(s) > 2 is equivalent to s < -2 or s > 2:
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(abs(unlist(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
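One caveat with calling table() inside sapply(): if no replicate crosses the threshold for some sample size, table() returns a single cell and the results no longer line up into a matrix. A minimal sketch that avoids this by fixing the factor levels is shown here. The skew() helper is a hypothetical stand-in for moments::skewness() (third standardized moment), used only so the block runs without extra packages.

```r
# Hypothetical stand-in for moments::skewness(): third standardized moment.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
sizes <- c(10, 20, 50, 100)

out <- t(sapply(sizes, function(n) {
  stat <- replicate(1000, skew(rgamma(n = n, shape = 3, rate = 0.5)))
  reject <- stat < -2 | stat > 2          # same as abs(stat) > 2
  # Fixing the levels keeps both columns even when one count is zero.
  table(factor(reject, levels = c(FALSE, TRUE)))
}))
row.names(out) <- sizes
out
```

Each row then sums to the number of replicates, and the TRUE column divided by 1000 gives the estimated rejection rate.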

Related

Generate samples from bernoulli(p) using R

Using R, I want to generate data from a Bernoulli(p) distribution for various sample sizes (n = 10, 15, 20, 25, 30, 50, 100, 150, 200) and for p = 0.01, 0.4, 0.8.
I know how to do it for one case using the rbinom function, e.g. for the first scenario: rbinom(n=10, size=1, p=0.01).
My aim is to build a function that computes all of these scenarios so that I don't have to run each one individually.
The following function returns a list of lists. I tried to give the elements appropriate names.
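ff <- function(n, probs) {
  # outer list: one element per sample size
  res <- lapply(n, function(i) {
    # inner list: one Bernoulli sample per probability
    setNames(lapply(probs, function(p) {
      rbinom(n = i, size = 1, prob = p)
    }), paste0("p=", probs))
  })
  names(res) <- paste0("n=", n)
  res
}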
Just call it like ff(n=c(10,15,20),probs = c(0.01,0.4,0.8)) and you will get a list of length 3 (one element per n), each element being a list of length 3 (one per probability) that holds the Bernoulli sample vectors.
You can generate a data frame of all (n, p) combinations with expand.grid and then use the map2 function from the tidyverse package purrr to generate a list of outputs, one per (n, p) pair:
library(tidyverse)
n <- c(10, 15, 20, 25, 30, 50, 100, 150, 200)
p <- c(0.01, 0.4, 0.8)
nps <- expand.grid(n = n, p = p)
# each (n, p) pair yields one Bernoulli(p) sample of length n (size = 1)
outlist <- map2(nps$n, nps$p, function(x, y) rbinom(n = x, size = 1, prob = y))
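For comparison, a base R sketch of the same grid idea, assuming the goal is one Bernoulli(p) vector of length n for every (n, p) pair. Map() plays the role of map2() here, and the element names are my own choice, not part of the original answers.

```r
n <- c(10, 15, 20, 25, 30, 50, 100, 150, 200)
p <- c(0.01, 0.4, 0.8)
nps <- expand.grid(n = n, p = p)

set.seed(42)
# One Bernoulli(p) sample of length n per row of the grid.
samples <- Map(function(n_i, p_i) rbinom(n = n_i, size = 1, prob = p_i),
               nps$n, nps$p)
names(samples) <- paste0("n=", nps$n, ",p=", nps$p)

lengths(samples)[1:3]
```

Because the result is a flat list with one element per grid row, it can be matched back to nps by position.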

rnorm is generating non-random looking realizations

I was debugging my simulation and found that when I run rnorm(), the random normal values don't look random at all. ccc is a vector holding the mean and sd, which are given parametrically. How can I get truly random normal realizations? Since my original simulation is quite long, I don't want to go into Gibbs sampling. Do you know why I get non-random looking realizations of normal random variables?
> ccc
# [1] 144.66667 52.52671
> rnorm(20, ccc)
# [1] 144.72325 52.31605 144.44628 53.07380 144.64438 53.87741 144.91300 54.06928 144.76440
# [10] 52.09181 144.61817 52.17339 145.01374 53.38597 145.51335 52.37353 143.02516 52.49332
# [19] 144.27616 54.22477
> rnorm(20, ccc)
# [1] 143.88539 52.42435 145.24666 50.94785 146.10255 51.59644 144.04244 51.78682 144.70936
# [10] 53.51048 143.63903 51.25484 143.83508 52.94973 145.53776 51.93892 144.14925 52.35716
# [19] 144.08803 53.34002
This comes down to how arguments are matched to a function's parameters. Take rnorm() as an example:
Its signature is rnorm(n, mean = 0, sd = 1). mean and sd are two separate parameters, so each needs its own value. Here is a situation where it is easy to get stuck:
arg <- c(5, 10)
rnorm(1000, arg)
This actually means rnorm(n = 1000, mean = c(5, 10), sd = 1). Because arg sits in the second position, it is matched to mean, and since sd is not supplied, rnorm() falls back to its default of 1. But what does mean = c(5, 10) do? Let's check:
x <- rnorm(1000, arg)
hist(x, breaks = 50, prob = TRUE)
# lines(density(x), col = 2, lwd = 2)
mean = c(5, 10) and sd = 1 will recycle to length 1000, i.e.
rnorm(n = 1000, mean = c(5, 10, 5, 10, ...), sd = c(1, 1, 1, 1, ...))
and hence the final sample x is actually a blend of 500 N(5, 1) samples and 500 N(10, 1) samples which are drawn alternately, i.e.
c(rnorm(1, 5, 1), rnorm(1, 10, 1), rnorm(1, 5, 1), rnorm(1, 10, 1), ...)
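The alternating pattern can be checked numerically. This is a small sketch with an arbitrary seed: splitting the draws by position should give one group centred near 5 and the other near 10, both with a standard deviation near 1.

```r
set.seed(123)
arg <- c(5, 10)
x <- rnorm(1000, arg)            # mean is recycled: 5, 10, 5, 10, ...

odd  <- x[seq(1, 1000, by = 2)]  # draws that used mean = 5
even <- x[seq(2, 1000, by = 2)]  # draws that used mean = 10

round(c(mean_odd = mean(odd), mean_even = mean(even),
        sd_odd = sd(odd), sd_even = sd(even)), 2)
```

This is the same pattern visible in the question's output, where values near 144.7 and values near 52.5 alternate.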
As for your question, it should be:
arg <- c(5, 10)
rnorm(1000, arg[1], arg[2])
and this means rnorm(n = 1000, mean = 5, sd = 10). Check it again, and you will get a normal distribution with mean = 5 and sd = 10.
x <- rnorm(1000, arg[1], arg[2])
hist(x, breaks = 50, prob = TRUE)
# curve(dnorm(x, arg[1], arg[2]), col = 2, lwd = 2, add = TRUE)

R: apply the pclm function

I am having trouble applying the Penalized Composite Link Model (PCLM) function, which only works with vectors. I use the pclm function to generate single-year-of-age (syoa) population data from 5-year age group population data.
pclm() can be installed by following the instructions given by the author on https://github.com/mpascariu/ungroup.
Usage of the function:
pclm(x, y, nlast,control = list())
-x: vector of the cumulative sum points of the sequence in y.
-y: vector of values to be ungrouped.
-nlast: Length of the last interval.
-control: List with additional parameters.
Here's my training dataset:
data<-data.frame(
GEOID= c(1,2),
name= c("A","B"),
"Under 5 years"= c(17,20),
"5-9 years"= c(82,90),
"10-14 years"= c(18, 22),
"15-19 years"= c(90,88),
"20-24 years"= c(98, 100),
check.names=FALSE)
#generating a data.frame storing the fitted values from the pclm for the first row: GEOID=1.
#using the values directly
syoa <- data.frame(fitted(pclm(x=c(0, 5, 10, 15, 20), y=c(17,82,18,90,98), nlast=5, control = list(lambda = .1, deg = 3, kr = 1))))
#or referring to the vector by its rows and columns
syoa <- data.frame(fitted(pclm(x=c(0, 5, 10, 15, 20), y=c(data[1,3:7]), nlast=5, control = list(lambda = .1, deg = 3, kr = 1))))
As my data have many observations, I'd like to apply the pclm() function across all the rows for columns 3-7: data[,3:7].
apply(data[3:7], 1, pclm(x=c(0, 5, 10, 15, 20), y=c(data[,3:7]), nlast=5, control = list(lambda = .1, deg = 3, kr = 1)))
but it's not working and gives the following error message:
Error in eval(substitute(expr), data, enclos = parent.frame()) :
(list) object cannot be coerced to type 'double'
I don't know whether the issue is related to apply() or to the pclm() function. Can anyone help? Thanks.
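It's easier than I thought. The third argument of apply() has to be a function that is called on each row, not the result of a single pclm() call on the whole data frame:
syoa <- data.frame(apply(data[3:7], 1, function(row) {
  # row is the vector of 5-year counts for one GEOID
  fit <- pclm(x = c(0, 5, 10, 15, 20), y = c(row), nlast = 5, control = list(lambda = NA, deg = 3, kr = 1))
  round(fitted(fit))
}))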
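The row-wise pattern can be tried without installing the ungroup package. In the sketch below, ungroup_uniform() is a hypothetical stand-in for fitted(pclm(...)): it simply spreads each 5-year count evenly over five single years and is not a penalized composite link model. Only the apply() mechanics carry over.

```r
data <- data.frame(
  GEOID = c(1, 2),
  name = c("A", "B"),
  "Under 5 years" = c(17, 20),
  "5-9 years" = c(82, 90),
  "10-14 years" = c(18, 22),
  "15-19 years" = c(90, 88),
  "20-24 years" = c(98, 100),
  check.names = FALSE)

# Hypothetical stand-in for fitted(pclm(...)): spreads each 5-year count
# evenly over its five single years. It is NOT the PCLM.
ungroup_uniform <- function(y, width = 5) {
  rep(y / width, each = width)
}

# apply() passes each row of columns 3-7 to the function as a numeric vector.
syoa <- apply(data[3:7], 1, function(row) ungroup_uniform(row))
colnames(syoa) <- data$GEOID
dim(syoa)   # single years of age in rows, one column per GEOID
```

Note that apply() over rows returns one column per input row, so each GEOID ends up as a column of single-year values.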

How to find RMSE by using loop in R

I have a data frame containing 3 variables:
origdata <- data.frame(
  age = c(22, 45, 50, 80, 55, 45, 60, 24, 18, 15),
  bmi = c(22, 24, 26, 27, 28, 30, 27, 25.5, 18, 25),
  hyp = c(1, 2, 4, 3, 1, 2, 1, 5, 4, 5))
I created MCAR (missing completely at random) data:
alpha <- 0.1
# MCAR for attribute (1) age:
mcar <- runif(10, min = 0, max = 1)
age.mcar <- ifelse(mcar < alpha, NA, origdata$age)
# MCAR for attribute (2) bmi:
mcar <- runif(10, min = 0, max = 1)
bmi.mcar <- ifelse(mcar < alpha, NA, origdata$bmi)
# MCAR for attribute (3) hyp:
mcar <- runif(10, min = 0, max = 1)
hyp.mcar <- ifelse(mcar < alpha, NA, origdata$hyp)
After that I used the mice package to impute the missing value as follows:
install.packages("mice")
library("mice")
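# incomplete data set assembled from the MCAR vectors above
df <- data.frame(age = age.mcar, bmi = bmi.mcar, hyp = hyp.mcar)
imp <- mice(df, m = 10) # m = 10 imputed datasets
fill1 <- complete(imp, 1) # dataset 1
fill2 <- complete(imp, 2) # dataset 2
allfill <- complete(imp, "long") # all imputed datasets stacked together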
My question is: I want to find the RMSE for each of the 10 imputed datasets individually by using a loop. This is my RMSE equation:
RMSE <- sqrt((sum((origdata - fill)^2)) / sum(is.na(df)))
That is, I want a loop that computes the RMSE for each imputed dataset individually:
RMSE1 (for dataset #1)
RMSE2 (for dataset #2)
...
RMSE10 (for dataset #10)
I also want to know which dataset is best for imputing the NAs.
A loop in R:
m <- imp$m # number of imputations
RMSE <- rep(NA, m)
for (i in seq_len(m)) {
  fill <- complete(imp, i)
  RMSE[i] <- sqrt(sum((origdata - fill)^2) / sum(is.na(df)))
}
which.min(RMSE) # index of the imputed dataset with the smallest RMSE
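The same calculation can be tried without mice. In this sketch the list fills is a hypothetical stand-in for the datasets returned by complete(imp, i): it holds two simple single imputations (column mean and column median), and the missing cells are placed at fixed positions so the result is reproducible. Only the RMSE bookkeeping is the point here.

```r
origdata <- data.frame(
  age = c(22, 45, 50, 80, 55, 45, 60, 24, 18, 15),
  bmi = c(22, 24, 26, 27, 28, 30, 27, 25.5, 18, 25),
  hyp = c(1, 2, 4, 3, 1, 2, 1, 5, 4, 5))

# Incomplete copy with a few values removed at fixed positions.
df <- origdata
df$age[c(2, 7)] <- NA
df$bmi[4] <- NA
df$hyp[9] <- NA

# Hypothetical stand-ins for complete(imp, i): two simple single imputations.
fill_with <- function(d, f) {
  as.data.frame(lapply(d, function(col) {
    col[is.na(col)] <- f(col, na.rm = TRUE)
    col
  }))
}
fills <- list(mean = fill_with(df, mean), median = fill_with(df, median))

# RMSE over the imputed cells only; observed cells contribute zero error.
rmse <- sapply(fills, function(fill) {
  sqrt(sum((origdata - fill)^2) / sum(is.na(df)))
})
rmse
names(which.min(rmse))   # imputation with the smallest error
```

Using sapply() over a list of completed datasets gives a named vector, so which.min() directly identifies the best-scoring imputation.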

Optimizing an optimization

I have a discrete data set with multiple peaks. I am trying to generate an automatic method for fitting a Gaussian curve to an unknown number of data points. The ultimate goal is to provide a measure of uncertainty on the location (x-axis) of the peak in the y-axis, using the sigma value of a best-fit Gaussian curve. The full data set has a half dozen or so unique peaks of various shapes.
Here is a sample data set.
working <- data.frame(age = seq(1, 30),  # one age per likelihood value (30 in total)
                      likelihood = c(10, 10, 10, 10, 10, 12, 14, 16, 17, 18,
                                     19, 20, 19, 18, 17, 16, 14, 12, 11, 10,
                                     10, 9, 8, 8, 8, 8, 7, 6, 6, 6))
Here is the Gaussian fitting procedure. I found it on SO, but I can't find the page I took it from again, so please forgive the lack of link and citation.
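fitG <- function(x, y, mu, sig, scale) {
  # sum of squared differences between a scaled Gaussian and the data
  f <- function(p) {
    d <- p[3] * dnorm(x, mean = p[1], sd = p[2])
    sum((d - y)^2)
  }
  # p = c(mean, sd, scale), starting from the supplied initial guesses
  optim(c(mu, sig, scale), f)
}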
This works well if I pre-define the area to fit. For instance taking only the area around the peak and using input mean = 10, sigma = 5, and scale = 1:
work2 <- working[5:20, ]
fit1 <- fitG(work2$age, work2$likelihood, 10, 5, 1)
fitpar1 <- fit1$par
plot(work2$age, work2$likelihood, pch = 20)
lines(work2$age, fitpar1[3]*dnorm(work2$age, fitpar1[1], fitpar1[2]))
However, I am interested in automating the procedure in some way, where I define the peak centers for the whole data set using peakwindow from the cardidates package. The ideal function would then iterate the number of data points used in the fit around a given peak in order to optimize the Gaussian parameters. Here is my attempt:
fitG.2 <- function (x, y) {
g <- function (z) {
newdata <- x[(y - 1 - z) : (y + 1 + z), ]
newfit <- fitG( newdata$age, newdata$likelihood, 10, 5, 1)
}
optimize( f = g, interval = c(seq(1, 100)))
}
However, I can't get this type of function to actually work (an error I can't solve). I have also tried creating a function with a for loop and setting break parameters but this method does not work consistently for peaks with widely varying shape parameters. There are likely many other R functions unknown to me that do exactly this.
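Part of the difficulty with the attempt above is that optimize() expects f to return a single number and interval to be c(lower, upper) for a continuous variable, whereas g() returns a whole optim() result and the window half-width is an integer. A rough sketch that replaces optimize() with a plain loop over half-widths follows. The helper fit_window(), the index peak_idx, the starting values, and the selection criterion (sum of squares divided by the number of points minus the 3 parameters) are all my assumptions, not part of the original post.

```r
working <- data.frame(
  age = 1:30,
  likelihood = c(10, 10, 10, 10, 10, 12, 14, 16, 17, 18,
                 19, 20, 19, 18, 17, 16, 14, 12, 11, 10,
                 10, 9, 8, 8, 8, 8, 7, 6, 6, 6))

# Least-squares fit of a scaled Gaussian; returns the optim() result.
fitG <- function(x, y, mu, sig, scale) {
  f <- function(p) {
    d <- p[3] * dnorm(x, mean = p[1], sd = p[2])
    sum((d - y)^2)
  }
  optim(c(mu, sig, scale), f)
}

# Fit windows of increasing half-width around a peak (hypothetical helper).
fit_window <- function(dat, peak_idx, half_widths) {
  rows <- lapply(half_widths, function(z) {
    idx <- max(1, peak_idx - z):min(nrow(dat), peak_idx + z)
    sub <- dat[idx, ]
    # Start at the peak location with a scale that matches the peak height.
    fit <- fitG(sub$age, sub$likelihood,
                mu = dat$age[peak_idx], sig = 5,
                scale = max(sub$likelihood) * 5 * sqrt(2 * pi))
    data.frame(half_width = z, mu = fit$par[1], sigma = abs(fit$par[2]),
               rms = fit$value / (length(idx) - 3))
  })
  do.call(rbind, rows)
}

peak_idx <- which.max(working$likelihood)   # row 12 in this sample
res <- fit_window(working, peak_idx, half_widths = 3:10)
res[which.min(res$rms), ]
```

A pure goodness-of-fit criterion like this may favour narrow windows, so in practice it is worth inspecting how sigma changes across the rows of res before settling on one window.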
