R data simulations using parLapply

I am trying to simulate using the following code and data. For each iteration it simulates n from rpois and then n values from rbeta. This part works fine.
The only issue is that for each of the n values it should get an id from the table below via probability-weighted (id_prob) sampling with sample, but for some reason it is only getting one ID for all n values.
library(parallel)

# Sims must be defined before it is exported to the workers
Sims <- 10000
cl <- makeCluster(num_cores)
clusterEvalQ(cl, library(evir))
clusterExport(cl, varlist = c("Sims", "ID", "id_prob", "beta_a", "beta_b"))
# note: set.seed on the master does not seed the workers;
# clusterSetRNGStream(cl, 0) would make the parallel draws reproducible
set.seed(0)
system.time(x1 <- parLapply(cl, 1:Sims, function(i) {
  id <- sample(ID, 1, replace = TRUE, prob = id_prob)
  n <- rpois(1, 9)
  rbeta(n, beta_a[id], beta_b[id])
}))
ID  Rate  id_prob  beta_a  beta_b
1   1.5   16.7%    0.5     0.5
2   2     22.2%    0.4     0.4
3   1     11.1%    0.3     0.3
4   1.5   16.7%    0.6     0.6
5   2     22.2%    0.1     0.1
6   1     11.1%    0.2     0.2
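A minimal sketch of one likely fix, assuming the intent is to draw a separate id for each of the n values rather than a single id per iteration (draw the ids after n is known):
n <- rpois(1, 9)
# one probability-weighted id per simulated value
ids <- sample(ID, n, replace = TRUE, prob = id_prob)
# rbeta is vectorized over its shape parameters, so each draw
# uses the beta_a/beta_b pair belonging to its own id
rbeta(n, beta_a[ids], beta_b[ids])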

Related

Find a threshold value for confusion matrix in R

I was doing a logistic regression and made a table with the predicted probability, actual class, and predicted class.
If the predicted probability is more than 0.5, I classify the observation as 1, so the predicted class becomes 1. But I want to change the threshold from 0.5 to another value.
I would like to find a threshold that maximizes both the true positive rate and the true negative rate. Here is a simple data frame df to demonstrate what I want to do:
df <- data.frame(actual_class = c(0, 1, 0, 0, 1, 1, 1, 0, 0, 1),
                 predicted_probability = c(0.51, 0.3, 0.2, 0.35, 0.78, 0.69, 0.81, 0.31, 0.59, 0.12),
                 predicted_class = c(1, 0, 0, 0, 1, 1, 1, 0, 1, 0))
If I can find a threshold value, I will classify using that value instead of 0.5.
I don't know how to find a threshold value that both maximizes true positive rate and true negative rate.
You can check a range of values pretty easily:
probs <- seq(0, 1, by=.05)
names(probs) <- probs
results <- sapply(probs, function(x) df$actual_class == as.integer(df$predicted_probability > x))
results is a 10 row by 21 column logical matrix showing when the predicted class equals the actual class:
colSums(results) # Number of correct predictions
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1
5 5 5 4 5 5 4 6 6 6 6 7 8 8 7 7 6 5 5 5 5
predict <- as.integer(df$predicted_probability > .6)
xtabs(~df$actual_class+predict)
# predict
# df$actual_class 0 1
# 0 5 0
# 1 2 3
You can see that thresholds of .6 and .65 each give 8 correct predictions. This conclusion is based on the same data used in the analysis, so it probably overestimates how successful you would be with new data.
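Since the goal is to maximize the true positive rate and true negative rate together rather than raw accuracy, one common summary is Youden's J statistic (TPR + TNR - 1); a hedged sketch reusing the probs vector from above:
j <- sapply(probs, function(x) {
  pred <- as.integer(df$predicted_probability > x)
  tpr <- sum(pred == 1 & df$actual_class == 1) / sum(df$actual_class == 1)  # sensitivity
  tnr <- sum(pred == 0 & df$actual_class == 0) / sum(df$actual_class == 0)  # specificity
  tpr + tnr - 1
})
probs[which.max(j)]  # threshold with the largest J on this sample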

How to remove outliers along with factor variables?

I am trying to remove outliers from a dataset s consisting of 3 variables:
id consumption period
a 0.1 summer
a 0.2 summer
b 0.3 summer
a 0.4 winter
b 10 winter
c 12 winter
I used outliers <- s$consumption[!s$consumption %in% boxplot.stats(s$consumption)$out] to remove the outliers from s and got something like this:
consumption
0.1
0.2
0.3
0.4
However, I want to get something like this below:
id consumption period
a 0.1 summer
a 0.2 summer
b 0.3 summer
a 0.4 winter
But boxplot.stats()$out only works on the numeric vector, so this approach leaves me with just the consumption column and drops the factor variables.
I found a workaround: save the non-outlier values from outliers <- s$consumption[!s$consumption %in% boxplot.stats(s$consumption)$out] as a vector l, which in this case is:
consumption
0.1
0.2
0.3
0.4
Knowing these values, I can then take the subset of s whose consumption is among them:
new <- subset(s, consumption %in% l)
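A more direct sketch, assuming the goal is simply to drop the flagged rows while keeping every column: filter the rows of s instead of the consumption vector.
out_vals <- boxplot.stats(s$consumption)$out  # the values flagged as outliers
new <- s[!s$consumption %in% out_vals, ]      # keep all columns, drop outlier rows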

Calculate average within a specified range

I am using the 'diamonds' dataset from ggplot2 and want to find the average of the 'carat' column. However, I want the average within every 0.1-wide interval:
Between
0.2 and 0.29
0.3 and 0.39
0.4 and 0.49
etc.
You can use aggregate to compute the mean by group, where the group is calculated with carat %/% 0.1:
library(ggplot2)
averageBy <- 0.1
aggregate(diamonds$carat, list(diamonds$carat %/% averageBy * averageBy), mean)
which gives the mean within each 0.1 interval:
Group.1 x
1 0.2 0.2830764
2 0.3 0.3355529
3 0.4 0.4181711
4 0.5 0.5341423
5 0.6 0.6821408
6 0.7 0.7327491
...
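For comparison, a hedged sketch of the same binning with dplyr (assuming you have it installed); carat %/% averageBy * averageBy floors each carat to the lower edge of its bin:
library(dplyr)
diamonds %>%
  mutate(bin = carat %/% averageBy * averageBy) %>%  # lower edge of each 0.1-wide bin
  group_by(bin) %>%
  summarise(mean_carat = mean(carat))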

How to meta-analyze p-values of different observations

I am trying to meta-analyze p-values from different studies. I have a data frame DF1:
pvalue1 pvalue2 pvalue3 m
0.1     0.2     0.3     a
0.2     0.3     0.4     b
0.3     0.4     0.5     c
0.4     0.4     0.5     a
0.6     0.7     0.9     b
0.6     0.7     0.3     c
I am trying to get a fourth column of meta-analyzed p-values from pvalue1 to pvalue3.
I tried to use the metap package:
p <- rbind(DF1$pvalue1, DF1$pvalue2, DF1$pvalue3)
pv <- split(p, p$m)
library(metap)
for (i in 1:length(pv)) {
  pvalue <- sumlog(pv[[i]]$pvalue)
}
But it results in only one p-value. Thank you for any help.
You can try
apply(DF1[,1:3], 1, sumlog)
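sumlog returns an object rather than a single number; its p component holds the pooled p-value, so a hedged sketch for storing one combined value per row is:
# Fisher's method per row; $p extracts the pooled p-value
DF1$meta_p <- apply(DF1[, 1:3], 1, function(x) sumlog(x)$p)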

Dynamic column names in data.table correlation

I've combined the outputs for each user and item (for a recommendation system) into this all-by-all R data.table. For each row, I need to calculate the correlation between user scores 1, 2, 3 and item scores 1, 2, 3 (e.g. for the first row, the correlation between 0.5, 0.6, -0.2 and 0.2, 0.8, -0.3) to see how well the user and the item match.
user item user_score_1 user_score_2 user_score_3 item_score_1 item_score_2 item_score_3
A 1 0.5 0.6 -0.2 0.2 0.8 -0.3
A 2 0.5 0.6 -0.2 0.4 0.1 -0.8
A 3 0.5 0.6 -0.2 -0.2 -0.4 -0.1
B 1 -0.6 -0.1 0.9 0.2 0.8 -0.3
B 2 -0.6 -0.1 0.9 0.4 0.1 -0.8
B 3 -0.6 -0.1 0.9 -0.2 -0.4 -0.1
I have a solution that works - which is:
scoresDT[, cor(c(user_score_1,user_score_2,user_score_3), c(item_score_1,item_score_2,item_score_3)), by= .(user, item)]
...where scoresDT is my data.table.
This is all well and good, and it works...but I can't get it to work with dynamic variables instead of hard coding in the variable names.
Normally with a data.frame I could build the column names as a character vector and pass that in, but data.table doesn't accept that here. I've tried a list with "with=FALSE" and had some success with basic subsetting of the data.table, but not with the correlation syntax I need...
Any help is much, much appreciated!
Thanks,
Andrew
Here's what I would do:
mDT = melt(scoresDT,
id.vars = c("user","item"),
measure.vars = patterns("item_score_", "user_score_"),
value.name = c("item_score", "user_score")
)
mDT[, cor(item_score, user_score), by=.(user,item)]
user item V1
1: A 1 0.8955742
2: A 2 0.9367659
3: A 3 -0.8260332
4: B 1 -0.6141324
5: B 2 -0.9958706
6: B 3 0.5000000
I'd keep the data in its molten/long form, which fits more naturally with R and data.table functionality.
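If you do need to stay in the wide format with dynamic names, a hedged sketch using mget() to look the score columns up by character name (user_cols and item_cols are hypothetical helpers, not from the original post):
user_cols <- paste0("user_score_", 1:3)
item_cols <- paste0("item_score_", 1:3)
# each (user, item) group is a single row, so unlist() yields its 3 scores
scoresDT[, cor(unlist(mget(user_cols)), unlist(mget(item_cols))), by = .(user, item)]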
