I have a hypothetical sample of my data as follows:
df <- read.table(header = TRUE, text =
"time goal book pen temp1 temp2 weight hight
18 13 18 15 13 15 18 13
18 13 20 16 16 15 20 19
20 12 20 14 18 20 17 15
15 19 18 16 18 20 15 15
16 17 14 12 17 20 20 15
17 12 18 16 17 19 14 16
20 18 15 18 19 13 15 18
13 14 19 20 14 18 12 15
20 16 18 12 16 16 14 13
12 19 19 15 20 14 20 16
14 16 13 14 15 15 15 17
18 19 17 13 14 13 15 17
16 15 14 17 12 18 14 14
19 20 15 13 12 16 20 15
17 12 18 16 16 17 12 20
20 16 14 19 17 20 17 13
17 13 15 16 15 15 17 17
12 17 15 16 14 16 18 13
15 18 17 20 15 13 14 19
19 12 13 17 12 20 12 13
19 18 14 19 15 20 12 16
20 17 18 15 13 19 19 17
16 18 17 19 16 16 12 16
12 15 19 18 20 15 17 19"
)
Please note that my real data has more columns. I want to draw boxplots for each group and see them all within a single plot.
The pairing logic is: time with goal; book with pen; temp1 with temp2; and weight with hight.
I would also like to see a p-value for each group, e.g., time with goal, book with pen, and so on.
I struggled to show the expected outcome, but I hope my description makes sense.
Based on your data, this may be what you are looking for:
library(ggpubr)
# get data into long format
long <- data.table::melt(df)
# get column names
nms <- colnames(df)
# get pairs of variables in a list to pass to comparisons
comps <- split(nms, ceiling(seq_along(nms) / 2))
# add group to data for colouring
long$group <- rep(letters[1:4], each = nrow(long) / 4)
# plot
ggboxplot(long, x = "variable", y = "value", fill = "group") +
  stat_compare_means(comparisons = comps)
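If the p-values are also needed as numbers rather than only as plot annotations, they can be computed directly. The following is a minimal sketch in base R, assuming a Wilcoxon rank-sum test (the default method of stat_compare_means) and the same consecutive-pair logic as above. The data frame `toy` and the name `pvals` are made up for illustration.

```r
# Toy data standing in for the real data frame (hypothetical values)
set.seed(42)
toy <- data.frame(
  time = rnorm(20, mean = 16), goal = rnorm(20, mean = 15),
  book = rnorm(20, mean = 17), pen  = rnorm(20, mean = 16)
)

# Pair consecutive columns: (time, goal), (book, pen)
nms   <- colnames(toy)
pairs <- split(nms, ceiling(seq_along(nms) / 2))

# One Wilcoxon rank-sum test per pair; exact = FALSE uses the normal
# approximation, which avoids warnings when the data contain ties
pvals <- vapply(
  pairs,
  function(p) wilcox.test(toy[[p[1]]], toy[[p[2]]], exact = FALSE)$p.value,
  numeric(1)
)
names(pvals) <- vapply(pairs, paste, character(1), collapse = " vs ")
print(pvals)
```

Swapping `wilcox.test` for `t.test` would give the parametric equivalent, provided the same method is also passed to stat_compare_means so the plot and the numbers agree.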
I have a list such as:
> generate_all[1,]
[1] 16 16 19 17 17 16 17 18 13 18 18 21 19 9 27 16 13 18 20 23 17 18 17 14 13 22 20 17 20 16 15 15 19 14 15 13 19 18 18 15 19
[42] 18 18 20 21 24 12 14 14 15 17 16 15 11 19 22 18 14 22 9 21 14 20 17 23 13 21 15 17 19 20 15 14 19 23 14 17 14 20 16 17 20
[83] 14 17 17 22 14 20 26 24 13 13 15 21 20 19 13 17 20 13 17 17 23 19 23 13 22 18 14 15 22 13 21 15 25 15 22 17 19 10 16 18 18
[124] 12 8 22 15 27 21 13 15 19 21 17 11 18 16 22 19 20 17 20 19 22 24 22 18 11 21 16 10 12 29 13 10 19 18 19 19 26 13 16 20 18
[165] 17 18 15 14 16 19 25 21 11 16 16 17 11 22 16 13 19 24 18 13 19 24 11 17 11 25 11 11 24 20 18 15 8 13 14 11 21 19 17 21 21
[206] 16 20 20 21 14 15 17 15 18 17 20 19 13 14 25 13 17 25 9 13 14 19 18 19 18 23 23 24 19 14 24 17 21 17 16 13 20 18 14 20 17
[247] 22 20 12 19 19 20 21 21 12 19 18 16 16 11 15 11 15 26 15 18 13 18 18 17 20 26 13 15 16 12 15 16 18 20 9 20 18 17 19 23 12
[288] 13 25 11 17 16 17 21 19 17 19 29 14 22 15 19 21 20 18 18 16 12 18 21 13 7 17 20 16 16 26 18 9 12 17 18 20 17 20 28 11 17
[329] 21 28 11 20 17 10 8 24 15 20 17 15 17 8 14 13 18 15 17 14 20 28 16 19 17 18 26 19 17 23 19 24 10 12 14 17 15 16 17 12 17
[370] 13 13 18 18 14 18 22 20 14 14 11 16 17 16 14 16 23 10 15 20 16 16 14 16 20 12 10 18 16 15 16 15 25 13 20 22 20 18 13 15 24
[411] 14 21 18 14 20 11 13 19 13 13 12 19 23 21 17 25 16 12 15 19 14 15 21 19 20 17 18 15 16 23 17 14 23 21 15 14 19 20 13 16 18
[452] 13 18 16 23 16 14 17 17 16 23 8 18 22 18 15 19 15 16 25 19 13 20 17 22 18 10 20 19 13 19 18 14 15 15 14 11 16 17 17 25 14
[493] 11 12 19 14 14 16 15 14 11 9 15 21 19 25 19 13 17 12 15 22 12 22 15 14 20 19 19 12 20 18 11 18 17 22 13 17 15 18 17 22 16
[534] 15 14 14 29 14 21 17 24 21 14 17 20 19 19 16 17 13 21 15 20 14 12 22 17 21 13 15 21 9 24 15 14 12 17 15 17 15 21 17 24 19
[575] 18 14 16 12 17 11 14 17 20 21 22 17 16 18 24 24 21 17 12 18 18 17 19 22 27 14 16 15 18 24 17 24 18 13 13 12 14 19 11 16 15
[616] 16 23 17 20 9 15 14 17 16 23 15 17 18 18 7 11 27 21 14 20 17 18 19 17 22 18 11 16 14 18 25 17 19 15 22 20 14 18 19 25 25
[657] 16 17 17 22 16 11 23 7 20 18 17 24 17 16 19 14 19 12 17 21 20 16 23 16 14 14 20 22 12 22 19 19 18 23 19 13 22 14 16 24 20
[698] 18 24 29 16 13 11 25 11 16 16 17 26 20 15 14 16 18 13 15 14 13 23 15 17 17 14 15 26 18 13 14 23 22 17 16 18 17 19 16 17 17
[739] 11 16 21 12 14 14 15 17 18 15 19 20 25 16 19 10 27 12 25 14 23 14 22 19 19 13 18 12 18 16 19 21 13 12 16 17 19 19 18 20 17
[780] 20 18 15 16 13 14 9 13 13 20 20 16 19 13 12 13 17 20 14 18 17 24 14 10 16 16 18 16 20 14 13 12 19 18 11 21 23 10 13 17 21
[821] 15 16 9 16 19 15 15 15 20 19 11 19 21 16 13 16 19 22 20 14 15 12 20 21 10 16 17 13 13 15 12 18 19 21 25 12 23 16 19 17 16
[862] 20 18 18 13 22 17 18 13 10 20 16 17 15 24 9 16 17 14 20 21 15 19 7 19 18 21 19 23 18 22 15 15 14 18 20 21 19 23 25 15 11
[903] 19 13 14 16 17 14 15 15 14 11 21 20 17 21 12 21 16 20 17 22 15 20 13 17 17 16 24 19 17 28 22 15 19 16 21 15 22 17 16 13 18
[944] 9 13 15 26 19 20 21 16 23 22 9 14 20 21 23 17 22 22 24 20 18 17 14 17 18 15 20 15 22 20 19 17 20 21 18 22 13 17 23 16 17
[985] 22 16 17 16 15 15 22 21 14 18 24 14 18 12 13 21
and I would like to calculate an empirical p-value.
Here is the idea (pseudocode):
if value_obs > mean(generate_all[1,]):
pvalue = sum(i > value_obs for i in generate_all[1,]) / length(generate_all[1,])
else:
pvalue = sum(i < value_obs for i in generate_all[1,]) / length(generate_all[1,])
So, because value_obs = 5 and the mean of the example distribution is
> mean(generate_all[1,])
[1] 17.2347
I want to count the values in the distribution that are below 5 and turn that count into a proportion by dividing by the length of generate_all[1,].
Does anyone have an idea how to do this in R?
I am not sure I understand your issue exactly, but if you want to compute your custom p-value for each column of your data.frame, you can use the apply function provided in R:
pvalue <- apply(df, 2, function(x) {
  # compare against the observed value rather than a hard-coded 5
  if (value_obs > mean(x)) {
    sum(x > value_obs) / length(x)
  } else {
    sum(x < value_obs) / length(x)
  }
})
Here, pvalue will be a vector containing the p-values computed for each column of the data frame.
You should check Grouping functions (tapply, by, aggregate) and the *apply family for detailed explanations.
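For a single simulated distribution such as generate_all[1,], the same rule can be wrapped in a small helper. This is only a sketch of the one-sided rule described in the question; the function name `empirical_pvalue` and the simulated vector `sim` are hypothetical.

```r
# One-sided empirical p-value: the proportion of simulated values that are
# more extreme than the observed value, on the side of the mean it falls on
empirical_pvalue <- function(sim, value_obs) {
  if (value_obs > mean(sim)) {
    mean(sim > value_obs)  # upper tail
  } else {
    mean(sim < value_obs)  # lower tail
  }
}

# Simulated stand-in for generate_all[1, ] (hypothetical data)
set.seed(1)
sim <- rpois(1000, lambda = 17)
p_obs <- empirical_pvalue(sim, value_obs = 5)
print(p_obs)
```

Taking the mean of a logical vector is equivalent to sum(...) / length(...), so the helper stays correct whatever the length of the input.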
I want to display a vector consistently across different R environments.
For example, a vector like this
c(1:30)
should display 24 values per row
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[25] 25 26 27 28 29 30
and not
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
The closest thing to what you are looking for is to use options() to configure the width of the results window:
options(width = 75)
c(1:30)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30
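Because options(width) sets the line width in characters and not the number of elements, the count per row still depends on how wide each element prints. If a fixed number of values per row is required, one option is to do the line breaking by hand. The helper below, `print_per_row`, is a hypothetical sketch and not a built-in function.

```r
# Print a vector with a fixed number of elements per row,
# imitating the "[index]" prefix of R's default printing
print_per_row <- function(x, per_row = 24) {
  if (length(x) == 0) return(invisible(x))
  txt     <- format(x)                  # pad all elements to a common width
  label_w <- nchar(length(x)) + 2       # width of the widest "[n]" label
  for (s in seq(1, length(x), by = per_row)) {
    e     <- min(s + per_row - 1, length(x))
    label <- formatC(paste0("[", s, "]"), width = label_w)
    cat(label, " ", paste(txt[s:e], collapse = " "), "\n", sep = "")
  }
  invisible(x)
}

print_per_row(1:30)
```

The output is independent of the console width, at the cost of no longer adapting to narrow terminals.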
I have a dataset where the last columns indicate the number of stops extracted from that dataset.
ColA ColB ColC 1 2 3 4 5 6 7 8 9 10 (...)
a g c a q e r e r q g h q (...)
What I want is to select every column from column 1 to the last one and prefix its name with Stop, ending up with Stop1, Stop2, etc.
The problem is that the number of those columns can vary: sometimes they run from 1 to 10, other times only from 1 to 6.
I've tried with dplyr and data.table, but I'm not sure how to automate this.
EDIT: ColA to ColC are fixed and always the same.
If I understood your problem correctly, the following code is flexible enough to solve it. Start by considering the following dataset:
set.seed(1)
df <- data.frame(matrix(rpois(130, 20),ncol=13))
names(df) <- c(paste("Col",LETTERS[1:3],sep=""),as.character(1:10))
df
#######
ColA ColB ColC 1 2 3 4 5 6 7 8 9 10
1 17 21 20 13 13 15 29 25 16 15 12 23 17
2 25 17 11 24 23 14 22 23 25 14 18 19 15
3 25 18 22 18 19 30 16 19 23 27 18 19 11
4 21 18 24 25 23 19 19 18 27 23 18 16 18
5 13 21 16 18 21 23 22 18 22 24 22 26 15
6 22 16 17 27 17 20 24 24 14 21 19 17 15
7 23 23 18 22 16 16 20 18 21 27 17 22 14
8 22 22 17 17 26 13 19 25 24 17 15 13 20
9 18 24 21 22 28 26 15 22 23 20 19 15 27
10 26 23 19 16 18 20 17 25 16 20 19 18 19
Now rename the columns as required:
k <- which(names(df)=="1")
names(df)[k:ncol(df)] <- paste("Stop",1:(ncol(df)-k+1),sep="")
df
#############
ColA ColB ColC Stop1 Stop2 Stop3 Stop4 Stop5 Stop6 Stop7 Stop8 Stop9 Stop10
1 17 21 20 13 13 15 29 25 16 15 12 23 17
2 25 17 11 24 23 14 22 23 25 14 18 19 15
3 25 18 22 18 19 30 16 19 23 27 18 19 11
4 21 18 24 25 23 19 19 18 27 23 18 16 18
5 13 21 16 18 21 23 22 18 22 24 22 26 15
6 22 16 17 27 17 20 24 24 14 21 19 17 15
7 23 23 18 22 16 16 20 18 21 27 17 22 14
8 22 22 17 17 26 13 19 25 24 17 15 13 20
9 18 24 21 22 28 26 15 22 23 20 19 15 27
10 26 23 19 16 18 20 17 25 16 20 19 18 19
I hope it can help you.
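An alternative sketch, assuming the stop columns are exactly those whose names consist only of digits: a regular expression can add the prefix without locating column "1" first. The small data frame `stops` is made up for illustration.

```r
# check.names = FALSE keeps the purely numeric column names as they are
stops <- data.frame(
  ColA = 1:2, ColB = 3:4, ColC = 5:6,
  `1` = 7:8, `2` = 9:10, `3` = 11:12,
  check.names = FALSE
)

# Prefix "Stop" to every name made up only of digits; other names are untouched
names(stops) <- sub("^([0-9]+)$", "Stop\\1", names(stops))
print(names(stops))
```

This works for any number of stop columns and does not depend on their position, but it would also rename any other all-digit column that is not a stop.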
I am using the hierarchical average-linkage method to do clustering with Euclidean distance. To find the number of clusters (k) at which to cut, I need to make two plots: one of the minimum distance within clusters against the number of clusters (graph 1), and one of the linkage distance between clusters against the number of clusters (graph 2).
> df
Site1 Site2 Site3 Site4 Site5 Site6
1985 11 0 5 15 13 15
1986 12 12 5 31 14 26
1987 23 21 17 14 25 12
1988 22 25 18 17 24 14
1989 11 16 8 18 13 19
1990 7 5 21 8 9 24
1991 20 13 9 21 22 7
1992 15 11 6 19 17 20
1993 19 18 9 11 21 11
1994 33 9 28 17 26 20
1995 16 14 19 33 17 10
1996 14 21 25 4 6 47
1997 4 0 11 22 14 16
1998 10 31 13 26 12 14
1999 24 17 18 41 19 20
2000 21 17 23 19 23 14
2001 12 8 6 7 19 20
2002 19 24 19 31 24 17
2003 13 29 10 28 7 9
2004 19 14 19 22 20 13
2005 16 8 9 10 11 13
2006 8 9 46 9 20 19
2007 12 10 15 13 10 9
2008 12 18 25 12 47 22
2009 19 18 18 23 21 20
2010 23 10 46 35 25 12
2011 20 35 18 30 22 18
2012 23 13 23 34 25 34
2013 17 28 20 13 19 21
2014 19 22 16 16 21 23
df2 <- data.frame(t(df))
tree <- hclust(dist(df2))
Since no question is stated, I'm assuming that you want to reproduce the figure above with the example data set. Please correct me if that assumption is wrong.
(i) Find the number of groups for a sequence of linkage distances. The range of linkage distances in this case was eyeballed from plot(tree):
library(dplyr)
cls.df <- data.frame(h=40:100)
cls.df$k <- sapply(cls.df$h, function(x) cutree(tree, h=x) %>% max )
(ii) Clean the table by retaining only the minimum linkage distance h for each number of groups k:
cls.df <- cls.df %>%
group_by(k) %>%
summarise(h=min(h))
(iii) Plot:
library(ggplot2)
ggplot(cls.df, aes(k, h)) +
geom_line() +
geom_point() +
theme_bw() +
ylab("Linkage Distance") +
xlab("Number of Cluster")
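The heights do not have to be eyeballed: an hclust object stores the linkage distance of every merge in tree$height, and after the i-th merge there are n - i clusters. The sketch below uses the built-in USArrests data as a stand-in and method = "average" to match the linkage named in the question; both choices are assumptions, not taken from the code above.

```r
# Stand-in data; replace with dist(df2) for the data in the question
tree <- hclust(dist(USArrests), method = "average")
n    <- length(tree$order)           # number of clustered observations

# The i-th merge happens at tree$height[i] and leaves n - i clusters
cls <- data.frame(k = (n - 1):1, h = tree$height)

plot(cls$k, cls$h, type = "b",
     xlab = "Number of clusters", ylab = "Linkage distance")
```

Each row of `cls` gives the smallest cut height that yields k clusters, which is the same quantity steps (i) and (ii) approximate on an integer grid.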
Package used:
‘xgboost’ version 0.4-4
I am building a model with the xgboost() function using this code:
fit <- xgboost(data =sparse_matrix , label = trainSet$OutputClass,
max.depth = 4,eta = 1, nthread = 2, nround = 10,
eval_metric = "merror",objective = "multi:softmax",num_class = 45)
When I use the prediction function:
Prediction <- predict(fit, sparse_matrixtestSet)
the output contains the numerical equivalent of each class instead of the class name, even though label = trainSet$OutputClass contains class names:
Output:
[1] 1 1 1 1 1 35 3 3 3 4 31 7 7 7 3 3 9 9 9 9 9 9 9 10 10 11
[27] 11 11 11 11 11 11 11 11 11 13 13 13 13 13 13 13 13 13 14 14 14 14 14 14 10 10
[53] 15 15 15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 16 16 16 18 18 18
[79] 18 18 18 18 35 35 35 18 21 21 21 21 32 1 1 25 25 25 25 26 27 27 27 27 27 27
[105] 27 27 29 29 29 29 29 30 30 30 30 30 30 30 30 30 30 35 35 32 32 32 43 43 32 32
[131] 32 32 32 32 32 32 43 32 32 32 32 32 33
I have also set stringsAsFactors = FALSE while reading the data set.
Can someone please help me get the predicted values as class names instead of numerical values?
Thanks in advance
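One possible approach is to keep the mapping between class names and numeric codes and apply it to the predictions afterwards. The sketch below uses base R only and assumes the labels were encoded as zero-based integer codes of a factor, which is the encoding the multi:softmax objective expects (codes from 0 to num_class - 1). How the labels were actually encoded in the code above is not shown, so the names `class_names`, `codes`, and `pred_codes` are hypothetical.

```r
# Hypothetical class names as they might appear in trainSet$OutputClass
class_names <- c("cat", "dog", "bird", "dog")

# Encode as a factor, then as zero-based integer codes for training
lab_factor <- factor(class_names)
codes      <- as.integer(lab_factor) - 1   # bird = 0, cat = 1, dog = 2

# Hypothetical numeric predictions returned by the model
pred_codes <- c(0, 2, 1)

# Map the codes back to class names through the factor levels
pred_names <- levels(lab_factor)[pred_codes + 1]
print(pred_names)
```

If the codes used for training were one-based instead, the `+ 1` and `- 1` shifts would need to be dropped, so it is worth checking range(codes) before relying on the mapping.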