I have measured values from several sources, and I want to draw an upper and a lower limit around the median of each test ID. The tests are grouped together, as you can see in the picture: each test has about 5 sources, and each source has 3 measured values. I therefore drew one boxplot per source and grouped the boxplots of the different sources by test. My problem starts when I add a z-score limit: only one z-score value per test is registered, so the limit shows up as single points connected across the tests (see the picture). I would rather have a limit line that spans all the boxplots of a test.
Here is my code, without the data:
## Libraries call
library(readxl)
require(tidyverse)
require(rlang)
library(dplyr)
require(tidyr)
require(stringr)
require(plotly)
require(ggplot2)
require(matrixStats)
require(openxlsx)
############################
# source comparison Functions
############################
# Build mean and median
df$Mean = rowMeans(as.matrix(df[,c(6,7,8)]),na.rm = TRUE)
df$Median = rowMedians(as.matrix(df[,c(6,7,8)]),na.rm = TRUE)
# summarize for TestID
df_sum <-df%>%
group_by(TestID)%>%
summarise(Max=max(Mean)
,Min=min(Mean)
,Median=median(Median)
,Std=sd(Mean)
,Mad=mad(Mean)
,z_limit_std=2*Std
,z_limit_mad=2*Mad
,Mean=mean(Mean) # computed last: summarise() lets later expressions see the new scalar Mean
)
# Merge summary and DLG data
df_Median<- df[,c('TestID','Median')]
df_sum_Median <- df_Median%>% group_by(TestID)%>% summarise(Median=median(Median))
df = merge(x = df, y = df_sum, by = "TestID")
############################
#Box Plot
############################
Plot_Data_df <- data.frame(df$TestID
,df$`measured_value 1`
,df$`measured_value 2`
,df$`measured_value 3`
,df$Median.y
,df$z_limit_std)
# Reshape the data into a single column of measured values (subset data may contain NA)
dfboxplot <- data.frame(TestID = rep(paste0(Plot_Data_df$df.TestID, '_Test'), 3)
,measured_value = c(Plot_Data_df$df..measured_value.1.,
Plot_Data_df$df..measured_value.2.,
Plot_Data_df$df..measured_value.3.)
,Median = rep(Plot_Data_df$df.Median.y, 3)
,z_limit = rep(Plot_Data_df$df.z_limit_std, 3)
)
dfboxplot$lower_limit <- dfboxplot$Median - dfboxplot$z_limit
dfboxplot$upper_limit <- dfboxplot$Median + dfboxplot$z_limit
plot <-plot_ly(dfboxplot, x = ~TestID, y = ~measured_value , color = ~Lab, type = "box",inherit=FALSE) %>%
layout(boxmode = "group",
xaxis = list(title='Test ID'),
yaxis = list(title= ' measured_value'))%>%
plotly::add_lines(data = dfboxplot # add the median line
,y= ~Median
,x= ~TestID
,type = 'scatter'
,mode = 'lines'
,showlegend = FALSE
,line = list(color = 'rgb(0, 0, 0)',
width = 1)
,name = 'Median'
)%>% plotly::add_lines(data = dfboxplot # add the upper limit
,y= ~upper_limit
,x= ~TestID
,type = 'scatter'
,mode = 'lines'
,showlegend = FALSE
,line = list(color = 'rgb(200, 0, 0)',
width = 1)
,name = 'upper limit'
)
# print the plot (the trailing pipe would otherwise feed the plot into the next expression)
plot
I'm working on a bubble map for which I generated two columns: one for a color id (column Color) and one for a text label referring to that id (column Class). This is a classification of my individuals (each Color always belongs to the same Class).
Class is a factor with a specific level order, which I set with:
COME1039$Class <- factor(COME1039$Class, levels = c('moins de 100 000 F.CFP',
'entre 100 000 et 5 millions F.CFP',
'entre 5 millions et 1 milliard F.CFP',
'entre 1 milliard et 20 milliards F.CFP',
'plus de 20 milliards F.CFP'))
This is my code
g <- list(
scope = 'world',
visible = F,
showland = TRUE,
landcolor = toRGB("#EAECEE"),
showcountries = T,
countrycolor = toRGB("#D6DBDF"),
showocean = T,
oceancolor = toRGB("#808B96")
)
COM.g1 <- plot_geo(data = COME1039,
sizes = c(1, 700))
COM.g1 <- COM.g1 %>% add_markers(
x = ~LONGITUDE,
y = ~LATITUDE,
name = ~Class,
size = ~`Poids Imports`,
color = ~Color,
colors=c(ispfPalette[c(1,2,3,7,6)]),
text=sprintf("<b>%s</b> <br>Poids imports: %s tonnes<br>Valeur imports: %s millions de F.CFP",
COME1039$NomISO,
formatC(COME1039$`Poids Imports`/1000,
small.interval = ",",
digits = 1,
big.mark = " ",
decimal.mark = ",",
format = "f"),
formatC(COME1039$`Valeur Imports`/1000000,
small.interval = ",",
digits = 1,
big.mark = " ",
decimal.mark = ",",
format = "f")),
hovertemplate = "%{text}<extra></extra>"
)
COM.g1 <- COM.g1%>% layout(geo=g)
COM.g1 <- COM.g1%>% layout(dragmode=F)
COM.g1 <- COM.g1 %>% layout(showlegend=T)
COM.g1 <- COM.g1 %>% layout(legend = list(title=list(text='Valeurs des importations<br>'),
orientation = "h",
itemsizing='constant',
x=0,
y=0)) %>% hide_colorbar()
COM.g1
Unfortunately my data are too big to be added here, but this is the output I get:
As you can see, the order of the legend is not the order of the factor levels. How can I make the legend follow the factor levels? If data are required to help you give me a hint, I will try to limit their size.
Many thanks !
Plotly is going to alphabetize your legend and you have to 'make' it listen. The order of the traces in your plot is the order in which the items appear in your legend. So if you rearrange the traces in the object, you'll rearrange the legend.
I don't have your data, so I used some data from rnaturalearth.
First I created a plot, using plot_geo. Then I used plotly_build() to make sure I had the trace order in the Plotly object. I used lapply to investigate the current order of the traces. Then I created a new order, rearranged the traces, and plotted it again.
The initial plot and build.
library(tidyverse)
library(plotly)
library(rnaturalearth)
canada <- ne_states(country = "Canada", returnclass = "sf")
x = plot_geo(canada, sizes = c(1, 700)) %>%
add_markers(x = ~longitude, y = ~latitude,
name = ~name, color = ~name)
x <- plotly_build(x) # capture all elements of the object
Now for the investigation; this is more so you can see how this all comes together.
# what order are they in?
y = vector()
invisible(
lapply(1:length(x$x$data),
function(i) {
z <- x$x$data[[i]]$name
message(i, " ", z)
})
)
# 1 Alberta
# 2 British Columbia
# 3 Manitoba
# 4 New Brunswick
# 5 Newfoundland and Labrador
# 6 Northwest Territories
# 7 Nova Scotia
# 8 Nunavut
# 9 Ontario
# 10 Prince Edward Island
# 11 Québec
# 12 Saskatchewan
# 13 Yukon
In your question, you show that you made the legend element a factor. That's what I've done as well with this data.
can2 = canada %>%
mutate(name = ordered(name,
levels = c("Manitoba", "New Brunswick",
"Newfoundland and Labrador",
"Northwest Territories",
"Alberta", "British Columbia",
"Nova Scotia", "Nunavut",
"Ontario", "Prince Edward Island",
"Québec", "Saskatchewan", "Yukon")))
I used the data to reorder the traces in my Plotly object. This creates a vector. It starts with the levels and their row number or order (1:13). Then I alphabetized the data by the levels (so it matches the current order in the Plotly object).
The output of this set of function calls is a vector of numbers (i.e., 5, 6, 1, etc.). Since I have 13 names, I have 1:13. You could also make it dynamic with 1:length(levels(can2$name)).
# capture order
df1 = data.frame(who = levels(can2$name), ord = 1:13) %>%
arrange(who) %>% select(ord) %>% unlist()
Now all that's left is to rearrange the object traces and visualize it.
x$x$data = x$x$data[order(c(df1))] # reorder the traces
x # visualize
Originally:
With reordered traces:
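The same reordering index can also be derived with base R's match(). This is a minimal sketch on hypothetical trace names rather than on a real Plotly object; trace_names and wanted are illustrative names that do not appear in the answer above.

```r
# Hypothetical trace names, in the alphabetical order Plotly built them
trace_names <- c("Alberta", "British Columbia", "Manitoba", "New Brunswick")

# Desired legend order (in the answer this would be levels(can2$name))
wanted <- c("Manitoba", "New Brunswick", "Alberta", "British Columbia")

# For each wanted name, look up the position of the matching trace
idx <- match(wanted, trace_names)

# Indexing with idx puts the names (or the traces) in the wanted order
reordered <- trace_names[idx]
reordered
```

Assuming every trace name matches a factor level exactly, the same idx could be used as x$x$data <- x$x$data[idx] on the built object.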
I have tried using a split trace with scatterpolar and it seems to partly work, but I can't get it to plot the values for all 10 variables. I want each row (identified by "ean") to be plotted as its own line, using the values from X1 to X10.
library(tidyverse)
library(vroom)
library(plotly)
types <- rep(times = 10, list(
col_integer(f = stats::runif,
min = 1,
max = 5)))
products = bind_cols(
tibble(ean = sample.int(1e9, 25)),
tibble(kategori = sample(c("kat1", "kat2", "kat3"), 25, replace = TRUE)),
gen_tbl(25, 10, col_types = types)
)
plot_ly(
products,
type = 'scatterpolar',
mode = "lines+markers",
r = ~X1,
theta = ~"X1",
split = ~ean
)
How can I get plotly to plot all variables in the radarchart (X1-X10)? Usually I would select the columns with X1:X10 but I can't do that here (I think it has to do with that ~ is used to select variable here).
I want the result to look something like this (but showing only lines, not filled polygons, and with more products). 25 products is a lot, but I am wiring this up so that the user can select which products to show.
In plotly it's convenient to use data in long format - see ?gather.
Please check the following:
library(dplyr)
library(tidyr)
library(vroom)
library(plotly)
types <- rep(times = 10, list(
col_integer(f = stats::runif,
min = 1,
max = 5)))
products = bind_cols(
tibble(ean = sample.int(1e9, 25)),
tibble(kategori = sample(c("kat1", "kat2", "kat3"), 25, replace = TRUE)),
gen_tbl(25, 10, col_types = types)
)
products_long <- gather(products, "key", "value", -ean, -kategori)
plot_ly(
products_long,
type = 'scatterpolar',
mode = "lines+markers",
r = ~value,
theta = ~key,
split = ~ean
)
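As a side note, tidyr documents gather() as superseded by pivot_longer(). The sketch below shows the same wide-to-long reshape with pivot_longer() on a small hand-made stand-in for products; products_small and its values are hypothetical and only there to keep the example self-contained.

```r
library(tidyr)

# Small stand-in for `products`: 3 rows, 3 measurement columns (made-up values)
products_small <- data.frame(
  ean = c(101L, 102L, 103L),
  kategori = c("kat1", "kat2", "kat1"),
  X1 = c(1L, 2L, 3L),
  X2 = c(4L, 5L, 1L),
  X3 = c(2L, 2L, 5L)
)

# One row per (ean, variable) pair: column names go to `key`, cell values to `value`
products_long2 <- pivot_longer(
  products_small,
  cols = X1:X3,
  names_to = "key",
  values_to = "value"
)
products_long2
```

The resulting key and value columns can be passed to plot_ly() as theta = ~key and r = ~value, exactly as in the gather() version.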
I get 0 values on the y axis when plotting a discreteBarChart inside renderChart(). The highest y-axis value does appear (not 0), but with a weird format and commas (see the 2nd screenshot below, named Chart Plot).
I want to plot 2 columns in rCharts: the x axis is a character (countryname) and the y axis is numeric (Collective_Turnover).
I created this variable (Collective_Turnover) from the data; it is the sum of Net_Turnover.
I tried wrapping it in as.numeric(), but I still get 0 on the y axis.
data$countryname= as.character(data$countryname)
output$top10countries <-renderChart({
topcountries <-
arrange(data%>%
group_by(as.character(countryname)) %>%
summarise(
Collective_Turnover= sum(as.numeric(`Net turnover`))
), desc(Collective_Turnover))
colnames(topcountries )[colnames(topcountries )=="as.character(countryname)"] <- "Country"
topcountries <- subset(topcountries [1:10,], select = c(Country, Collective_Turnover))
p <- nPlot(Collective_Turnover~Country, data = topcountries , type = "discreteBarChart", dom = "top10countries")
p$params$width <- 1000
p$params$height <- 200
p$xAxis(staggerLabels = TRUE)
# p$yAxis(axisLabel = "CollectiveTO", width = 50)
return(p)
})
The output of topcountries in R is a table like this:
that is arranged in descending order...
and the plot that i get is this:
The tick labels are truncated because they are too long. You need to set the left margin and a tick padding. To get rid of the commas, use a number formatter.
dat <- data.frame(
Country = c("Russian", "Italy", "Spain"),
x = c(12748613.6, 5432101.2, 205789.7)
)
p <- nPlot(x ~ Country, data = dat, type = "discreteBarChart")
p$yAxis(tickPadding = 15, tickFormat = "#! function(d) {return d3.format('.1')(d)} !#")
p$chart(margin = list(left = 100))
p
Given a data frame containing mixed variables (i.e. both categorical and continuous) like,
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
I perform unsupervised feature selection using the package FactoMineR
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
The variable df.princomp is a list.
Thereafter, to visualize the principal components I use
fviz_screeplot() and fviz_contrib() like,
#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
which gives the following Fig1
and Fig2
Explanation of Fig1: The Fig1 is a scree plot. A Scree Plot is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component (PC). So we can see the first three PCs collectively are responsible for 43.8% of total variance. The question now naturally arises, "What are these variables?". This I have shown in Fig2.
Explanation of Fig2: This figure visualizes the contribution of rows/columns from the results of Principal Component Analysis (PCA). From here I can see that the variables name, studLoc and finalmark are the most important variables that can be used for further analysis.
Further analysis, where I'm stuck: To derive the contribution of the aforementioned variables name, studLoc and finalmark, I use the principal component variable df.princomp (see above), like df.princomp$quanti.var$contrib[,4] and df.princomp$quali.var$contrib[,2:3].
I have to manually specify the column indices [,2:3] and [,4].
What I want: I want to know how to assign the column indices dynamically, so that I do not have to hard-code the column index [,2:3] in the list df.princomp.
I've already looked at the following similar questions 1, 2, 3 and 4 but could not find a solution. Any help or suggestions to solve this problem will be helpful.
Not sure if my interpretation of your question is correct, apologies if not. From what I gather you are using PCA as an initial tool to show you what variables are the most important in explaining the dataset. You then want to go back to your original data, select these variables quickly without manual coding each time, and use them for some other analysis.
If this is correct then I have saved the data from the contribution plot, filtered out the variables that have the greatest contribution, and used that result to create a new data frame with these variables alone.
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#keep the variables whose contribution is higher than, say, 20
r<-rownames(dat[dat$contrib>20,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
#finalmark name studLoc
#1 53 b POTYQ0002N
#2 73 i LWMTW1195I
#3 95 d VTUGO1685F
#4 39 f YCGGS5755N
#5 97 c GOSWE3283C
#6 58 g APBQD6181U
#7 67 a VUJOG1460V
#8 64 h YXOGP1897F
#9 15 j NFUOB6042V
#10 81 e QYTHG0783G
Based on your comment, where you said you wanted to 'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame', I would do this:
#top contributors to both Dim 1 and 2
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1,2), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#keep the variables whose contribution is higher than 5
r<-rownames(dat[dat$contrib>5,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
(This keeps all the original variables in our new data frame since they all contributed more than 5% to the total variance)
There are many ways to extract the contributions of individual variables to the PCs. For numeric input, one can run a PCA with prcomp and look at $rotation (I spoke too soon and forgot that you have factors here, so prcomp won't work directly). Since you are using factoextra::fviz_contrib, it makes sense to check how that function extracts this information under the hood. Type factoextra::fviz_contrib at the console and read the function:
> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var",
"quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue",
color = "steelblue", sort.val = c("desc", "asc", "none"),
top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(),
...)
{
sort.val <- match.arg(sort.val)
choice = match.arg(choice)
title <- .build_title(choice[1], "Contribution", axes)
dd <- facto_summarize(X, element = choice, result = "contrib",
axes = axes)
contrib <- dd$contrib
names(contrib) <- rownames(dd)
theo_contrib <- 100/length(contrib)
if (length(axes) > 1) {
eig <- get_eigenvalue(X)[axes, 1]
theo_contrib <- sum(theo_contrib * eig)/sum(eig)
}
df <- data.frame(name = factor(names(contrib), levels = names(contrib)),
contrib = contrib)
if (choice == "quanti.var") {
df$Groups <- .get_quanti_var_groups(X)
if (missing(fill))
fill <- "Groups"
if (missing(color))
color <- "Groups"
}
p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill,
color = color, sort.val = sort.val, top = top, main = title,
xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt,
ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib,
linetype = 2, color = "red")
p
}
<environment: namespace:factoextra>
So it's really just calling facto_summarize from the same package. By analogy you can do the same thing, simply call:
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
name contrib
ID ID 0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark 7.1874438
subj2mark subj2mark 16.6831560
name name 26.8610132
studLoc studLoc 26.8610132
And that's the table corresponding to your figure 2. For PC2 use axes = 2 and so on.
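The function body above also shows where the red dashed reference line comes from: for a single axis, theo_contrib is 100 divided by the number of variables. As a rough sketch, the variables above that line can be picked out by name instead of by column index. The data frame below hard-codes the contributions printed above so the example runs without FactoMineR; top_vars is an illustrative name, not something from factoextra.

```r
# Contributions to Dim 1, copied from the facto_summarize() output above
dd <- data.frame(
  name = c("ID", "finalmark", "subj1mark", "subj2mark", "name", "studLoc"),
  contrib = c(0.9924561, 21.4149175, 7.1874438, 16.6831560, 26.8610132, 26.8610132),
  stringsAsFactors = FALSE
)

# Expected contribution if every variable contributed equally (single axis)
theo_contrib <- 100 / nrow(dd)

# Names of the variables that contribute more than the uniform share
top_vars <- dd$name[dd$contrib > theo_contrib]
top_vars
```

With the real objects, df[top_vars] would then select those columns from the original data frame by name.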
Regarding "how to programmatically determine the column indices of the PCs", I'm not 100% sure I understand what you want, but if you just want to say for column "finalmark", grab its contribution to PC3 you can do the following:
library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")
# get the contribution of column 'finalmark' by name
contribution_df %>%
filter(name == "finalmark")
# get the contribution of column 'finalmark' to PC3
contribution_df %>%
filter(name == "finalmark" & PC == 3)
# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
BTW I think ID in your example is treated as numeric instead of factor, but since it's just an example I'm not bothering with it.