Inserting multiple levels into a Sankey diagram using googleVis - R

I am trying to make a Sankey diagram with 6 levels in total in RStudio using the googleVis package. With the help of How to make a googleVis multiple Sankey from a data.frame? I managed three levels with the code presented there. Here it is:
'''
source <- sample(c("NorthSrc", "SouthSrc", "EastSrc", "WestSrc"), 100, replace = TRUE)
mid <- sample(c("NorthMid", "SouthMid", "EastMid", "WestMid"), 100, replace = TRUE)
destination <- sample(c("NorthDes", "SouthDes", "EastDes", "WestDes"), 100, replace = TRUE)
dummy <- rep(1, 100) # For aggregation
dat <- data.frame(source, mid, destination, dummy)
aggdat <- aggregate(dummy ~ source + mid + destination, dat, sum)

library(dplyr)
library(googleVis)
# Flows from source to mid
datSM <- dat %>%
  group_by(source, mid) %>%
  summarise(toMid = sum(dummy)) %>%
  ungroup()
# Flows from mid to destination
datMD <- dat %>%
  group_by(mid, destination) %>%
  summarise(toDes = sum(dummy)) %>%
  ungroup()
colnames(datSM) <- colnames(datMD) <- c("From", "To", "Dummy")
datVis <- rbind(datSM, datMD)
# Note: the weight column is named "Dummy" after the rename above
p <- gvisSankey(datVis, from = "From", to = "To", weight = "Dummy")
plot(p)
'''
This produces the expected three-level Sankey diagram.
My question now is: how can I insert additional levels, and how do I have to adapt the code so that multiple mid-levels are accepted? Here is the example dataset:
'''
source <- sample(c("NorthSrc", "SouthSrc", "EastSrc", "WestSrc"), 100, replace = TRUE)
mid_one <- sample(c("North", "South", "East", "West"), 100, replace = TRUE)
mid_two <- sample(c("WestNorth", "WestSouth", "SouthEast", "NorthWest"), 100, replace = TRUE)
mid_three <- sample(c("NorthMid", "SouthMid", "EastMid", "WestMid"), 100, replace = TRUE)
mid_four <- sample(c("West", "East", "NorthCis", "SouthCis"), 100, replace = TRUE)
destination <- sample(c("NorthDes", "SouthDes", "EastDes", "WestDes"), 100, replace = TRUE)
dummy <- rep(1, 100) # For aggregation
dat <- data.frame(source, mid_one, mid_two, mid_three, mid_four, destination, dummy)
aggdat <- aggregate(dummy ~ source + mid_one + mid_two + mid_three + mid_four + destination, dat, sum)
'''
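A caveat with this example data (my observation, not from the original post): gvisSankey treats identical labels as a single node, and here "West" and "East" appear in both mid_one and mid_four, which can create cycles that the underlying Sankey layout cannot draw. Prefixing the labels per level avoids this:
'''
# Hypothetical fix: make node labels unique per level to avoid cycles
dat$mid_one <- paste0("M1_", dat$mid_one)
dat$mid_four <- paste0("M4_", dat$mid_four)
'''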

Just found the solution myself (in case someone is looking at a similar problem):
'''
datSM1 <- dat %>%
  group_by(source, mid_one) %>%
  summarise(toMid1 = sum(dummy)) %>%
  ungroup()
datSM2 <- dat %>%
  group_by(mid_one, mid_two) %>%
  summarise(toMid2 = sum(dummy)) %>%
  ungroup()
datSM3 <- dat %>%
  group_by(mid_two, mid_three) %>%
  summarise(toMid3 = sum(dummy)) %>%
  ungroup()
datSM4 <- dat %>%
  group_by(mid_three, mid_four) %>%
  summarise(toMid4 = sum(dummy)) %>%
  ungroup()
datMD <- dat %>%
  group_by(mid_four, destination) %>%
  summarise(toDes = sum(dummy)) %>%
  ungroup()
colnames(datSM1) <- colnames(datSM2) <- colnames(datSM3) <- colnames(datSM4) <-
  colnames(datMD) <- c("From", "To", "Dummy")
datVis <- rbind(datSM1, datSM2, datSM3, datSM4, datMD)
'''
Then plot via gvisSankey as before:
'''
p <- gvisSankey(datVis, from = "From", to = "To", weight = "Dummy")
plot(p)
'''
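For many levels, the repeated group_by()/summarise() blocks can also be generated in a loop. A sketch (the helper name sankey_links is mine; it assumes the level columns are given in left-to-right order):
'''
library(dplyr)

# Build From/To/Dummy links for every adjacent pair of level columns
sankey_links <- function(dat, levels, weight = "dummy") {
  pairs <- lapply(seq_len(length(levels) - 1), function(i) {
    dat %>%
      group_by(across(all_of(c(levels[i], levels[i + 1])))) %>%
      summarise(Dummy = sum(.data[[weight]]), .groups = "drop") %>%
      setNames(c("From", "To", "Dummy"))
  })
  bind_rows(pairs)
}

datVis <- sankey_links(dat, c("source", "mid_one", "mid_two",
                              "mid_three", "mid_four", "destination"))
p <- gvisSankey(datVis, from = "From", to = "To", weight = "Dummy")
plot(p)
'''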

Related

Extremely slow calculations in SPARK (using sparklyr)

I'm using the sparklyr library to calculate some metrics on big data. I decided to test my code on a small portion of the data (parquet table, ~1.3M rows, 40 MB), but after some basic filtering Spark needs approximately 2-3 minutes to calculate a sum.
Here is an example of my code:
#Connecting
df <- spark_read_parquet(spark_conn, name = "df", path = "hdfs://****.db/table",
                         header = TRUE, memory = FALSE, repartition = 1, overwrite = FALSE)
#Basic filtering
df <- df %>%
  mutate(ZD = datediff(Maturity_Date, Treaty_Date))  # datediff takes two date arguments
df <- df %>%
  filter(Treaty_Amount <= 30000, ZD <= 30)
df <- df %>%
  dplyr::mutate(MD_Received_From_Borrower_2018_1 = rowSums(.[139:150], na.rm = TRUE))
df <- df %>%
  dplyr::mutate(PD_Received_From_Borrower_2018_1 = rowSums(.[175:186], na.rm = TRUE))
df <- df %>%
  mutate(Total_Amount_Received_From_Borrower_2018_1 = MD_Received_From_Borrower_2018_1 + PD_Received_From_Borrower_2018_1)
#More filtering for project purposes
df <- df %>%
  mutate(Preditog1_2018_01 = ifelse((Debt_Duration_As_Of_31.01.2018 >= 361 & Debt_Duration_As_Of_31.01.2019 == 0 & Total_Amount_Received_From_Borrower_2018_1 == 0), 361, 0))
df <- df %>%
  mutate(Preditog2_2018_01 = ifelse((Date_Of_Debt_Sale <= "2019-01-31" & !is.na(Date_Of_Debt_Sale)), 361, 0))
df <- df %>%
  mutate(Preditog3_2018_01 = ifelse((Date_Of_Debt_Cancellation <= "2019-01-31" & !is.na(Date_Of_Debt_Cancellation)), 361, 0))
df <- df %>%
  mutate(Preditog_2018_01 = ifelse((Preditog1_2018_01 == 361 | Preditog2_2018_01 == 361 | Preditog3_2018_01 == 361), 361, 0))
df <- df %>%
  mutate(Itog_2018_01 = ifelse(Preditog_2018_01 == 361, 361, Debt_Duration_As_Of_31.01.2019))
#Calculations (where the process is extremely slow, especially when calculating the sum)
##Extra filtering
df <- df %>%
  mutate(f_0_v_360_2018_01_d = ifelse((Debt_Duration_As_Of_31.01.2018 == 0 & Itog_2018_01 > 360), Main_Debt_As_Of_31.01.2018, 0))
##Calculating the sum
df %>%
  summarise(res1 = sum(f_0_v_360_2018_01_d))
Maybe I'm doing something wrong?
How can I optimize the calculation speed?
Thanks in advance for any help!
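One pattern worth checking (a hedged suggestion, not a verified fix for this dataset): every mutate()/filter() above only extends a lazy Spark SQL plan, and the whole plan is re-executed for each action, so materializing the filtered table once with compute() and running the remaining steps against the cached copy can cut the repeated work:

library(sparklyr)
library(dplyr)

# Materialize the filtered subset once; later verbs run against the
# cached Spark table instead of replaying the whole lineage.
df_small <- df %>%
  mutate(ZD = datediff(Maturity_Date, Treaty_Date)) %>%
  filter(Treaty_Amount <= 30000, ZD <= 30) %>%
  compute(name = "df_small")

# ...build the derived columns against df_small as before, then:
df_small %>% summarise(res1 = sum(f_0_v_360_2018_01_d, na.rm = TRUE))

(f_0_v_360_2018_01_d would still need to be created on df_small first, as in the original chain.)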

Scraping Oxford5000 words and obtaining two equivalent word lists

I'd like to get two lists of 25 nouns each. The lists have to be balanced in the average length of words and in average semantic distance. It would also be ideal if I can get a balance of categories within each list (i.e., people, places, things, concepts).
So far I have been able to get a list of 25 nouns at CEFR level A1 from the Oxford5000, and obtain semantic distances between each of the selected words (see below).
I need some help in adjusting this so that it gives me two lists of different words, and so that:
The lists have comparable average length of words
The lists have comparable average semantic distance (i.e., the values in lsa.mat)
All semantic distances are greater than 0.7
It would also be great if the words in the two lists had:
The same average ranking of how common they are
A balance of categories (i.e., people, places, things, concepts)
...though I don't know where I could find data on that.
Here's what I have so far:
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(rvest)
library(purrr)
library(magrittr)
library(dplyr)
library(tidyr)
library(LSAfun)
url <- "https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000"
url %>%
  map(. %>%
        read_html() %>%
        html_nodes(".belong-to , .pos , a") %>%
        html_text()
  ) %>%
  unlist() -> words_typesA1
wordtypes.tbl <- words_typesA1 %>% tibble()
full.first <- which(wordtypes.tbl$. == "a")
full.last <- which(wordtypes.tbl$. == "zone")
full.last <- full.last+2
# These cases are hidden in the online list and don't have an associated CEFR level
missing <- which(wordtypes.tbl$. == "accounting")
missing2 <- which(wordtypes.tbl$. == "angrily")
missing3 <- which(wordtypes.tbl$. == "cleaning")
missing4 <- which(wordtypes.tbl$. == "feeding")
missing5 <- which(wordtypes.tbl$. == "major")
wordtypes.tbl <- wordtypes.tbl[-c(missing, missing+1, missing2, missing2+1, missing3, missing3+1, missing4, missing4+1, missing5[2], missing5[2]+1), ]
word.types.full <- wordtypes.tbl[c(full.first:full.last),]
oxford5000 <- word.types.full %>%
  mutate(ind = rep(c(1:3), length.out = n())) %>%
  group_by(ind) %>%
  mutate(id = row_number())
oxford5000 <- oxford5000 %>% dplyr::rename("word" = ".")
oxford5000 <- oxford5000 %>%
  pivot_wider(names_from = ind, values_from = word) %>%
  select(-id)
oxford5000 <- oxford5000 %>%
  rename("Word" = "1", "Type" = "2", "CEFR" = "3")
oxford5000$WordLength <- nchar(oxford5000$Word)
a1nouns <- oxford5000 %>% filter(Type == "noun" & CEFR == "a1") %>% select(Word)
a1nouns
cues <- sample(a1nouns$Word, 25, replace = F)
cues.mat <- expand.grid(cues, cues)
cues.mat <- cues.mat[,c(2,1)]
cues.mat <- cues.mat %>% filter(Var2 != Var1)
cues.mat <- cues.mat %>%
  mutate(case = row_number())
cues.mat$case <- as.factor(cues.mat$case)
x <- cues.mat %>% select(!case) %>% as.matrix() %>% list()
cuenames <- rep(cues, each = 24)  # same as the long c(rep(cues[1], 24), ...) chain
cues.tib <- cues.mat %>%
  mutate(name = paste0(Var2, "_", Var1))
names <- cues.tib$name
cues.tib <- cues.tib %>%
  select(Var2, Var1) %>%
  pivot_longer(everything()) %>%
  mutate(id = paste0(name, ".", 1:n())) %>%
  select(-name) %>%
  pivot_wider(values_from = value, names_from = id)
# TASA corpus downloaded from https://sites.google.com/site/fritzgntr/software-resources/semantic_spaces#:~:text=TASA
load(file = "../data/TASA.rda")
n <- cues.tib %>% nrow()
mat <- matrix(ncol = 600, nrow = n)
k <- 1
for (i in seq(from = 1, to = 1199, by = 2)) {
  inc <- i + 1
  x <- cues.tib[, i:inc]
  for (j in 1:nrow(x)) {
    word1 <- tolower(x[j, 1])
    word2 <- tolower(x[j, 2])
    output <- Cosine(word1, word2, tvectors = TASA)
    mat[j, k] <- 1 - output  # convert cosine similarity to distance
  }
  k <- k + 1
}
# See semantic distances between each of the selected words.
lsa.mat <- mat %>% as.data.frame()
names(lsa.mat) <- names
lsa.mat
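As for producing two balanced lists, one simple approach (a sketch of my own, with the hypothetical helper split_balanced; it assumes a candidates data frame of 50 nouns with columns Word and WordLength, and a dist_fun(words) that returns the mean pairwise semantic distance, e.g. based on 1 - Cosine(..., tvectors = TASA)) is a randomized search that keeps the split minimizing the imbalance:

split_balanced <- function(candidates, dist_fun, n_iter = 1000) {
  best <- NULL
  best_score <- Inf
  for (it in seq_len(n_iter)) {
    idx <- sample(nrow(candidates), 25)
    a <- candidates[idx, ]
    b <- candidates[-idx, ]
    # imbalance = difference in mean word length + difference in mean distance
    score <- abs(mean(a$WordLength) - mean(b$WordLength)) +
      abs(dist_fun(a$Word) - dist_fun(b$Word))
    if (score < best_score) {
      best_score <- score
      best <- list(listA = a, listB = b)
    }
  }
  best
}

The "all distances > 0.7" constraint could be handled by rejecting any split whose minimum pairwise distance falls below 0.7 before scoring.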

Remove comma in thousands in echarts4r

Here's a plot made with echarts4r:
library(tibble)
library(echarts4r)

data_test <- tibble(
  year = seq(1900, 1920, 1),
  variable = seq(200, 400, 10)
)

data_test %>%
  e_charts(year) %>%
  e_x_axis(year) %>%
  e_y_axis(variable) %>%
  e_line(variable)
How can I format the values on the x-axis so that I have 1900 instead of 1,900 for example?
I checked here but couldn't find a solution.
Probably not the most elegant solution, but this should work:
data_test %>%
  e_charts(year) %>%
  e_x_axis(type = 'category') %>%
  e_y_axis(variable) %>%
  e_line(variable)
Adding options for a cleaner output:
data_test %>%
  e_charts(year) %>%
  e_x_axis(type = 'category',
           axisLabel = list(interval = 4),
           axisTick = list(inside = TRUE, alignWithLabel = TRUE, interval = 4)) %>%
  e_y_axis(variable) %>%
  e_line(variable)
I opened a GitHub issue for this question and this is the answer from the developer of echarts4r (I shortened it a little; see the link for the original answer).
There are two solutions.
1) Transform the variable year into a factor. This works only if there are no missing years.
library(tibble)
library(echarts4r)

data_test <- tibble(
  year = seq(1900, 1920, 1),
  variable = seq(200, 400, 10)
)

data_test %>%
  dplyr::mutate(year = as.factor(year)) %>%
  e_charts(year) %>%
  e_line(variable)
2) Modify the JavaScript function. This is more robust, and it works when there are missing years.
library(tibble)
library(echarts4r)

data_test <- tibble(
  year = c(1900, 1901, 1905),
  variable = 1:3
)

label <- list(
  formatter = htmlwidgets::JS(
    'function(value, index){
       return value;
    }'
  )
)

data_test %>%
  e_charts(year) %>%
  e_y_axis(variable) %>%
  e_line(variable) %>%
  e_x_axis(serie = year, axisLabel = label)

R Explaining Random Forest Variable Selection Sample Code

I have sample code for random forest variable selection. We want to choose the combination of variables with the most importance and build the random forest model with the lowest OOB error. Can anyone explain the for-loop part of the code for me?
clinical_variables <- c("Age", "location", "smoke", "perianal_disease", "upper_tract",
                        "LnASCA IgA", "LnASCA IgG", "LnANCA", "LnCbir", "LnOMPC",
                        "CRP", "Albumin", "African American Race")
variable_selected_progress_biomarkers <- vector("list", 50)
error_rate_min_progress_biomarkers <- rep(NA, 50)
for (j in 1:50) {
  risk_progress_biomarker_variables <- risk_full %>%
    select(names(risk), clinical_variables) %>%
    select(-c("STRICTURE", "TIM2STRICTURE", "PENETRATING", "TIM2PENETRATING",
              "BDNF", "LASTFOLLOWUPDAYSPROGRESS", "PROGRESSED")) %>%
    names()
  risk_progress_biomarker_variables_total <- vector("list", 104)
  names(risk_progress_biomarker_variables_total) <- 104:1
  error_rate_tail_progress_biomarker <- rep(NA, 104)
  for (i in 1:104) {
    set.seed(4182019)
    risk_progress_biomarker_variables_total[[i]] <- risk_progress_biomarker_variables
    rf_risk_progress_biomarker <- rfsrc(
      Surv(LASTFOLLOWUPDAYSPROGRESS, PROGRESSED) ~ .,
      data = risk_full %>%
        select(risk_progress_biomarker_variables, LASTFOLLOWUPDAYSPROGRESS, PROGRESSED) %>%
        mutate_if(is.factor, as.numeric),
      ntree = 1000,
      importance = TRUE
    )
    # OOB error of the fitted forest for the current variable set
    error_rate_tail_progress_biomarker[i] <- tail(rf_risk_progress_biomarker$err.rate, n = 1)
    rf_risk_progress_biomarker_importance <- rf_risk_progress_biomarker$importance %>%
      as.data.frame() %>%
      rownames_to_column() %>%
      as_tibble() %>%
      dplyr::rename(VIMP = ".") %>%
      arrange(desc(VIMP))
    # Drop the least important variable before the next iteration
    risk_progress_biomarker_variables <- rf_risk_progress_biomarker_importance %>%
      head(dim(rf_risk_progress_biomarker_importance)[1] - 1) %>%
      # top_n(dim(rf_risk_progress_biomarker_importance)[1] - 1) %>%
      pull(rowname)
    print(i)
  }
  tibble_error_rate_tail_progress_biomarker <- tibble(n = 104:1, error_rate = error_rate_tail_progress_biomarker)
  suppressMessages(n_min_progress_biomarker <- tibble_error_rate_tail_progress_biomarker %>% top_n(-1) %>% pull(n))
  suppressMessages(error_rate_min_progress_biomarker <- tibble_error_rate_tail_progress_biomarker %>% top_n(-1) %>% pull(error_rate))
  variable_selected_progress_biomarkers[[j]] <- str_replace_all(risk_progress_biomarker_variables_total[[105 - n_min_progress_biomarker]], "_", "")
  error_rate_min_progress_biomarkers[j] <- error_rate_min_progress_biomarker
  print(paste("Finish", j))
}
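In short, the inner loop is backward elimination driven by variable importance: fit a survival forest on the current variable set, record the final OOB error, drop the least important variable, and refit; the outer j loop repeats the whole elimination and keeps the best set from each run. A self-contained sketch of the same idea (using the veteran data bundled with randomForestSRC instead of the poster's risk_full):

library(randomForestSRC)
library(survival)

data(veteran, package = "randomForestSRC")
vars <- setdiff(names(veteran), c("time", "status"))
oob <- numeric(length(vars) - 1)
sets <- vector("list", length(vars) - 1)

for (i in seq_along(oob)) {
  sets[[i]] <- vars
  fit <- rfsrc(Surv(time, status) ~ .,
               data = veteran[, c(vars, "time", "status")],
               ntree = 500, importance = TRUE)
  oob[i] <- tail(fit$err.rate, 1)  # final OOB error for this variable set
  # rank by VIMP and drop the least important variable
  vars <- names(sort(fit$importance, decreasing = TRUE))
  vars <- vars[-length(vars)]
}

sets[[which.min(oob)]]  # variable combination with the lowest OOB error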

Shiny not detecting input in output

If I run the following code, everything runs fine:
library(dplyr)
library(tidyr)
library(shiny)
id <- 1:100
gender <- sample(c('M','F'), 100, replace=TRUE)
age <- sample(18:22, 100, replace=TRUE)
ethnicity <- sample(c('W','B','H','A','O'), 100, replace = TRUE)
grade <- sample(LETTERS[1:4], 100, replace=TRUE)
df <- cbind(id,gender,age,ethnicity,grade) %>% as.data.frame()
list1 <- list("id"="id","gender"="gender","age"="age","ethnicity"="ethnicity","grade"="grade")
list2 <- list("id"="id","gender"="gender","age"="age","ethnicity"="ethnicity","grade"="grade")
ui <- fluidPage(
  selectInput("picker1", "PICKER 1", choices = list1, selected = "gender"),
  selectInput("picker2", "PICKER 2", choices = list2, selected = "grade"),
  tableOutput("crosstabs")
)
server <- function(input, output, session) {
  output$crosstabs <- renderTable({
    t <- df %>%
      select_all() %>%
      select(-id) %>%
      pivot_longer(cols = input$picker1) %>%
      count(name, value, grade) %>%
      pivot_wider(names_from = grade, values_from = n)
    t
  })
}
shinyApp(ui, server)
But when I update output$crosstabs to include input$picker2, it returns an error. I'm rather confused, as I'm using the same structure, but simply with grade swapped out for input$picker2:
output$crosstabs <- renderTable({
  t <- df %>%
    select_all() %>%
    select(-id) %>%
    pivot_longer(cols = input$picker1) %>%
    count(name, value, input$picker2) %>%
    pivot_wider(names_from = input$picker2, values_from = n)
  t
})
What exactly is happening here?
input$picker2 is a string, and count() expects an unquoted column name; we can convert the string to a symbol and evaluate it with !!:
server <- function(input, output, session) {
  output$crosstabs <- renderTable({
    df %>%
      select_all() %>%
      select(-id) %>%
      pivot_longer(cols = input$picker1) %>%
      count(name, value, !!rlang::sym(input$picker2)) %>%
      pivot_wider(names_from = input$picker2, values_from = n)
  })
}
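An equivalent spelling (my addition, not from the answer above) uses dplyr's .data pronoun, which looks a column up by its string name:

server <- function(input, output, session) {
  output$crosstabs <- renderTable({
    df %>%
      select(-id) %>%
      pivot_longer(cols = input$picker1) %>%
      # .data[[...]] resolves the string in input$picker2 to a column
      count(name, value, .data[[input$picker2]]) %>%
      pivot_wider(names_from = input$picker2, values_from = n)
  })
}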