Batch distribution fitting using Tidyverse and fitdistrplus - r

I have a dataset that is as follows (10,000+ Rows):
P_ID
SNUM
RNUM
X
ID_233
10
2
40.31
ID_233
10
3
23.21
ID_234
12
5
11.00
ID_234
12
6
0.31
ID_234
13
1
0.00
ID_235
10
2
66.23
From this dataset, I want to fit each distinct P_ID to a Gamma distribution (ignoring the testing of how well the sampled data fits the distribution)
Using the fitdistrplus package, I can achieve this by extracting the X for an individual P_ID into a vector and then run it through fw <- fitdist(data,"gamma") and then extract the shape and rate descriptive variables out, but this is all very manual.
I would like to find a method using tidyverse to go from the data frame above to:
P_ID
Distrib
G_Shape
G_Rate
ID_233
Gamma
1.21557116
0.09206639
ID_234
Gamma
3.23234542
0.34566432
ID_235
Gamma
2.34555553
0.92344521
How would i achieve this with Tidyverse and Pipes and not doing a succession of for loops?

You could apply fitdist for every individual using group_by and extract shape and rate values out of each model.
library(dplyr)
library(purrr)
library(fitdistrplus)
data %>%
group_by(P_ID) %>%
summarise(model = list(fitdist(X, "gamma"))) %>%
mutate(G_Shape = map_dbl(model, pluck, 'estimate', 'shape'),
G_rate = map_dbl(model, pluck, 'estimate', 'rate')) -> result
result

Related

Repeated random sampling and kurtosis on unbalanced sample

I have an unbalanced dataset with people from liberal and conservative background giving rating on an issue (1-7). Would like to see how polarized the issue is.
The sample is heavily skewed towards liberal (70% of the sample). How do I do repeated sampling using R to create a balanced sample (50-50) and calculate kurtosis?
For example, I have total 50 conservatives. How do I randomly sample 50 liberals out of 150 repeatedly?
A sample dataframe below:
political_ort rating
liberal 1
liberal 6
conservative 5
conservative 3
liberal 7
liberal 3
liberal 1
What you're describing is termed 'undersampling'. Here is one method using tidyverse functions:
# Load library
library(tidyverse)
# Create some 'test' (fake) data
sample_df <- data_frame(id_number = (1:100),
political_ort = c(rep("liberal", 70),
rep("conservative", 30)),
ratings = sample(1:7, size = 100, replace = TRUE))
# Take the fake data
undersampled_df <- sample_df %>%
# Group the data by category (liberal / conservative) to treat them separately
group_by(political_ort) %>%
# And randomly sample 30 rows from each category (liberal / conservative)
sample_n(size = 30, replace = FALSE) %>%
# Because there are only 30 conservatives in total they are all included
# Finally, ungroup the data so it goes back to a 'vanilla' dataframe/tibble
ungroup()
# You can see the id_numbers aren't in order anymore indicating the sampling was random
There is also the ROSE package that has a function ("ovun.sample") that can do this for you: https://www.rdocumentation.org/packages/ROSE/versions/0.0-3/topics/ovun.sample

Run DBSCAN against grouped coordinates

I'm attempting to run DBSCAN against some grouped coordinates in order to get sub-clusters. I've clustered some spacial data and I'd now like to further divide these regions according to the density of points within them. I think DBSCAN is probably the best way to do this.
My issue is that I can't figure out how to run DBSCAN against each cluster seperately and then output the cluster assignment as a new column. Here's some sample data:
library(dplyr)
library(dbscan)
# Create sample data
df <- data.frame(
"ID"=1:200,
"X"=c(1.0083,1.3166,1.3072,1.1311,1.2984,1.2842,1.1856,1.3451,1.1932,1.0926,1.2464,1.3197,1.2331,1.2996,1.3482,
1.1944,1.2800,1.3051,1.4471,0.9068,1.3150,1.1846,1.0232,1.0005,1.0640,1.3177,1.1015,0.9598,1.0354,1.2203,
0.8388,0.8655,1.3387,1.0133,1.0106,1.1753,1.3200,1.0139,1.1511,1.3508,1.2747,1.3681,1.1074,1.2735,1.2245,
0.9695,1.3250,1.0537,1.2020,1.3093,0.9268,1.3244,1.2626,1.3123,1.2819,1.1063,0.8759,1.0063,1.0173,1.0187,
1.2396,1.0241,1.2619,1.2682,1.0008,1.0827,1.3639,1.3099,1.0004,0.8886,1.2359,1.1370,1.2783,1.0803,1.1918,
1.1156,1.3313,1.1205,1.0776,1.3895,1.3559,0.8518,1.1315,1.3521,1.2281,1.2589,0.9974,1.1487,1.4204,0.9998,
1.0154,1.0098,0.8851,1.0252,0.9331,1.2197,1.0084,1.2303,1.2808,1.3125,0.5500,0.6694,0.3301,0.3787,0.6492,
0.6568,0.6773,0.3769,0.6237,0.7265,0.5509,0.3579,0.7201,0.2631,0.3881,0.7596,0.3343,0.7049,0.3430,0.2951,
0.5483,0.7699,0.3806,0.6555,0.2524,0.4030,0.6329,0.5006,0.2701,0.0822,0.5442,0.5233,0.7105,0.5660,0.3962,
0.3187,0.3143,0.5673,0.3731,0.7310,0.6376,0.4864,0.8865,0.3352,0.7540,0.0690,0.7983,0.6990,0.4090,0.5658,
0.5636,0.5420,0.7223,0.6146,0.5648,0.2711,0.3422,0.7214,0.2196,0.2848,0.6496,0.7907,0.7418,0.7825,0.4550,
0.4361,0.7417,0.2661,0.8978,0.7875,0.2343,0.3853,0.6874,0.7761,0.2905,0.6092,0.5329,0.6189,0.0684,0.5726,
0.5740,0.7060,0.4609,0.3568,0.7037,0.2874,0.6200,0.7149,0.5100,0.7059,0.2520,0.3105,0.6870,0.7888,0.3674,
0.6514,0.7271,0.6679,0.3752,0.7067),
"Y"=c(-1.2547,-1.1499,-1.1803,-1.0626,-1.2877,-1.1151,-1.0958,-1.1339,-1.0808,-1.5461,-1.0775,-1.1431,-1.0499,
-1.1521,-1.1675,-1.0963,-1.1407,-1.1916,-1.1229,-1.2297,-1.1308,-1.0341,-1.3071,-1.2370,-1.5043,-1.1154,
-1.5452,-1.0349,-1.5412,-1.0348,-1.3620,-1.3776,-1.1830,-1.2552,-1.2354,-1.0838,-1.1214,-1.2396,-1.4937,
-1.0793,-1.1857,-1.0679,-1.5425,-1.1633,-1.1620,-1.0838,-1.0750,-1.3493,-1.4155,-1.1354,-1.0615,-1.1494,
-1.1620,-1.1582,-1.1800,-1.5230,-1.3019,-1.2484,-1.5490,-1.2435,-1.0487,-1.2330,-1.1234,-1.0924,-1.0702,
-1.0446,-1.1077,-1.1144,-1.2170,-1.2715,-1.1537,-1.5077,-1.1305,-1.3396,-1.2107,-1.5458,-1.1482,-1.1224,
-1.3690,-1.2058,-1.1685,-1.3400,-1.5033,-1.2152,-1.3805,-1.1439,-1.5183,-1.4288,-1.1252,-1.2330,-1.2511,
-1.5429,-1.3333,-1.1851,-1.1367,-1.3952,-1.1240,-1.2113,-1.1632,-1.1965,-0.9917,-0.7416,-0.7729,-1.1279,
-0.9323,-0.9372,-0.7013,-1.1746,-0.9191,-0.9356,-0.7873,-1.1957,-0.9838,-0.5825,-1.0738,-0.9302,-0.7713,
-0.9407,-0.7774,-0.8160,-0.9861,-1.0440,-0.9896,-0.6478,-0.8865,-1.0601,-1.0640,-0.9898,-0.5989,-0.7375,
-0.7689,-0.9799,-0.9147,-1.1048,-0.9735,-0.8591,-0.7913,-1.0085,-0.7231,-0.9688,-0.9272,-0.9395,-0.9494,
-0.7859,-1.0817,-0.7262,-0.9915,-0.9329,-1.0953,-1.0425,-1.0806,-1.0132,-0.8514,-1.0785,-1.1109,-0.8542,
-1.0849,-0.9665,-0.5940,-0.6145,-0.7830,-0.9601,-0.8996,-0.7717,-0.7447,-1.0406,-1.0067,-0.5710,-0.9839,
-1.0594,-0.7069,-1.1202,-0.9705,-1.0100,-0.6377,-1.0632,-0.9450,-0.9163,-0.7865,-1.0090,-1.1005,-1.0049,
-0.8042,-1.0781,-0.6829,-0.5962,-1.0759,-0.7918,-0.9732,-0.7353,-0.5615,-1.2002,-0.9295,-0.9944,-1.1570,
-0.9524,-0.9257,-0.9360,-1.1328,-0.7661),
"cluster"=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2))
# How do you run DBSCAN against the points within each cluster?
I first thought I'd try to use the group_by function in dplyr but DBSCAN requires a data matrix input and group_by doesn't work for matrices.
matrix <- as.matrix(df[, -1])
set.seed(1234)
db = matrix %>%
group_by(cluster) %>%
dbscan(matrix, 0.4, 4)
#Error in UseMethod("group_by_") :
# no applicable method for 'group_by_' applied to an object of class "c('matrix', 'double', 'numeric')"
I've also tried using by() but get duplicate results for each cluster grouping, which isn't right:
by(data = df, INDICES = df$cluster, FUN = function(x) {
out <- dbscan(as.matrix(df[, c(2:3)]),eps=.0215,minPts=4)
})
#df$cluster: 1
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
#--------------------------------------------------------------------------
#df$cluster: 2
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
Can anyone point me in the right direction?
To be clear, dbscan::dbscan works fine on data.frame objects. You do not need to convert to matrix. It returns an object that includes a vector with the same dimension as the number of records in your input. The issue is that dplyr exposes variables to other functions as individual vectors, rather than as data.frame or matrix objects. You are free to do something like:
df %>%
group_by(cluster) %>%
mutate(
dbscan_cluster = dbscan::dbscan(
data.frame(X, Y),
eps = 0.0215,
minPts = 4
)[["cluster"]]
)
dplyr is not necessary, by also works, you just need to supply a generic function rather than one that directly references the source object directly. Your data must already be ordered by cluster.
df$dbscan_cluster <- unlist(
by(
df,
INDICES = df$cluster,
function(x) dbscan::dbscan(x[,c(2,3)], eps = 0.0215, minPts = 4)[["cluster"]]
)
)
However, you can still get garbage results if you don't have a good way to pick your epsilon. You might consider using dbscan::optics instead.

Using dplyr to run rma() on multiple subsets

I want to run a subgroup meta-analysis within metafor package. The simplest way to do it is:
model.s.1 <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Level1")
model.s.2 <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Level2")
...
model.s.n <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Leveln")
However, it's very confusing to do it by hand if a factor for subgroups has multiple levels. I tried to use dplyr to solve this and extract simply coefficients for all subgroups:
Dataset %>%
mutate(S=as.factor(S)) %>%
group_by(S) %>%
summarize(Coeff=coef.rma(rma(yi=ES, vi=Va, method="DL", data=.)))
But the result looked like this:
S Coeff
<fct> <dbl>
1 hmdb 0.114
2 HMDB0000123 0.114
3 HMDB0000148 0.114
4 HMDB0000158 0.114
5 HMDB0000159 0.114
6 HMDB0000161 0.114
7 HMDB0000162 0.114
8 HMDB0000167 0.114
9 HMDB0000168 0.114
10 HMDB0000172 0.114
# ... with 14 more rows
It seems that the rma function omits the group_by and calculates the pooled effect for the whole dataset each time. What might be the cause? Are there any alternatives for such approach?
We may do a group_split and then loop through the list elements with map
library(tidyverse)
Dataset %>%
group_split(S= factor(S)) %>%
map_dfr(~ .x %>%
summarise(S = first(S), Coeff=coef.rma(rma(yi=ES,
vi=Va, method="DL", data=.))))
Dear #akrun I have one more question on a similar piece of code (previous one was in wrong window, sorry for that)
Let's assume that for every subset of studies I'd like to add a fixed-effect meta-regression with a binary factor (0/1) - we call it F.
library(tidyverse)
Dataset %>%
group_split(S=factor(S)) %>%
map.dfr(~ .x %>%
summarise(S=first(S), Coeff=coef.rma(rma(yi=ES,vi=Va, mods=~F, method="DL",
data=.))))
If a certain subset from S has only zero's or one's, it will give an error message from rma function. How can I then add a formula to drop such cases from list and repalce them with "NA"?
Thank you,
Jakub
library(metafor)
library(tidyverse)
Results <- Org %>% # Primary analysis - DerSimonian-Laire Estimator
group_split(Metabolite= factor(Metabolite)) %>%
map_dfr(~ .x %>%
summarise(Metabolite = first(Metabolite),
Coeff = ifelse(nlevels(Biospecimen)>1,
ifelse((rma(yi=Est,sei=SE, method="DL", data=.))$k>=5,
coef.rma(rma(yi=Est,sei=SE, mods=~Biospecimen, method="DL", data=.)),NA),NA)))
It worked, but produced warnings from rma function. However results seem to be corrrect. Thanks a lot #akrun

How to get CI 95% for coefficients of linear model using simpleboot package

I'm trying to predict a linear model (basic linear regressions with 4 predictors) with the procedure lm(). This works all fine.
What I want to do now is bootstrapping the model. After a quick research on Google I found out about the package simpleboot which seemed to be quite easy to understand.
I can easily bootstrap the lm.object using something like this:
boot_mod <- lm.boot(mod,R=100,rows=TRUE)
and afterwards print the object boot_mod.
I can also access the list in which the coefficients for each bootstrap sample are among other metrics such as RSS, R² and so on.
Can anyone tell me how I can save all coefficients from the boot list in a list or dataframe?
The result would look like this at best:
boot_coef
sample coef 1 coef 2 coef 3...
1 1,1 1,4 ...
2 1,2 1,5 ...
3 1,3 1,6 ...
library(tidyverse)
library(simpleboot)
### Some Dummy-Data in a dataframe
a <- c(3,4,5,6,7,9,13,12)
b <- c(5,9,14,22,12,5,12,18)
c <- c(7,2,8,7,12,5,3,1)
df <- as_data_frame(list(x1=a,x2=b,y=c))
### Linear model
mod <- lm(y~x1+x2,data=df)
### Bootstrap
boot_mod <- lm.boot(mod,R=10,rows = TRUE)
You can also use the function sample of the same package simpleboot:
given the output from either lm.boot or loess.boot, you can specify what kind of information you want to extract:
samples(object, name = c("fitted", "coef", "rsquare", "rss"))
It outputs either a vector or matrix depending on the entity extracted.
Source:
https://rdrr.io/cran/simpleboot/man/samples.html
Here is a tidyverse option to save all coefficients from the boot.list:
library(tidyverse)
as.data.frame(boot_mod$boot.list) %>%
select(ends_with("coef")) %>% # select coefficients
t(.) %>% as.data.frame(.) %>% # model per row
rownames_to_column("Sample") %>% # set sample column
mutate(Sample = parse_number(Sample))
# output
Sample (Intercept) x1 x2
1 1 5.562417 -0.2806786 0.12219191
2 2 8.261905 -0.8333333 0.54761905
3 3 9.406171 -0.5863124 0.07783740
4 4 8.996784 -0.6040479 0.06737891
5 5 10.908036 -0.7249561 -0.03091908
6 6 8.914262 -0.5094340 0.05549390
7 7 7.947724 -0.2501127 -0.08607481
8 8 6.255539 -0.2033771 0.07463971
9 9 5.676581 -0.2668020 0.08236743
10 10 10.118126 -0.4955047 0.01233728

How to plot ROC curve for cross validation from Weka output for binary class and multiclass data?

I have tried different matlab functions like plotroc and packages in R like pROC, ROCR and cvAUC. Each package or function produces different graph and gives different AUC than Weka result.
I would like to compare multiple classifier using 10-fold-cross-validation and would like to pot ROC of each. I have collected results in Weka but I don't want to plot it in Weka itself.
My experiments are based on both binary class and multi-class data.
My Weka output cross-validated instance predictions are at https://drive.google.com/folderview?id=0BzRIG3kN-8Z9fnh5OElKTExNT2NuZUVna2tKcmMzU1VBankwdVc2OGxBSXFnaFJqSEhHYVE&usp=sharing
Please, suggest me how can I plot graph for attached results for binaryclass as well as multiclass.
This is a placeholder answer, but the first thing to note is that one your observations got cross-validated less than 10 times:
library(pROC)
library(dplyr)
filenameROC = "Data/term3_IBk_3_multiclass.txt"
fileROC = readLines(filenameROC)
dfCV = read.csv2(text = fileROC,
nrows = length(fileROC) - 51 - 19,
header = TRUE,
sep = ",",
skip = 19, stringsAsFactors = FALSE)
dfCV %>%
group_by(inst.) %>%
tally() %>%
filter(n < 10)
Which gives:
> dfCV %>%
+ group_by(inst.) %>%
+ tally() %>%
+ filter( n < 10)
Source: local data frame [1 x 2]
inst. n
1 773 4
Can you explain this?
Additionally, you also need to add a cross-validation iteration identifier. Once you do that it is simply a question of running multiclass.roc from the pROC package by CV iteration.
Edit:
OP claims that there are 7724 *observations` whereas it is easy to see that there are 773 observations repeated 10 times in 772 cases and 4 times for observation number 772 -- consistent with 10-fold cross-validation data:
> dfCV %>%
+ group_by(inst.) %>%
+ tally()
Source: local data frame [773 x 2]
inst. n
1 1 10
2 2 10
3 3 10
4 4 10
5 5 10
6 6 10
7 7 10
8 8 10
9 9 10
10 10 10
.. ... ..
Edit 2:
Here is the code to produce the multi-class ROC by CV fold:
dfCVROC = dfCV %>%
dplyr::filter(inst. != 773) %>%
arrange(inst.) %>%
dplyr::mutate(cvfold = rep.int(1:10, 772)) %>%
group_by(cvfold) %>%
do(multiclass_roc = multiclass.roc(as.factor(.$actual), as.numeric(.$prediction)))
# get the AUCs by CV fold
sapply(dfCVROC$multiclass_roc, function(x) x$auc)
I didn't find exact solution for the issue. However, here are some points that I observe from Weka output
While weka plots the ROC it get predictions directly from classifier evaluation output.
Weka uses predictions upto 6 decimal points number for calculating threshold values (more precision helps in calculating more number of threshold values for ROC curve).
By default in Weka explorer, classifier outputs prediction upto 3 decimal points only (as in my attached experiments results).
Apart from this I didn't understand how Weka calculates the threshold values form the predictions. I observe that with same Weka prediction output I found different threshold values in Weka and R (and Matlab).
Finally, I used Weka API code for plotting ROC Generate ROC Curve and extract the TPR and FPR for the experiments (I re-run all the experiments). After extracting TPR and FPR I can plot graphs in any tool like Excel, gnuplot, Matlab or R.

Resources