How to get the grouping right in R with Plotly

I have a problem grouping my data in Plotly under R. To start with, I was using local data from a csv file, reading it with:
geogrid_data <- read.delim('geogrid.csv', row.names = NULL, stringsAsFactors = TRUE)
and the plotting went well, using the following:
library(plotly)
library(RColorBrewer)
x <- list(
title = 'Date'
)
p <- plotly::plot_ly(geogrid_data,
type = 'scatter',
x = ~ts_now,
y = ~absolute_v_sum,
text = paste('Table: ', geogrid_data$table_name,
'<br>Absolute_v_Sum: ', geogrid_data$absolute_v_sum),
hoverinfo = 'text',
mode = 'lines',
color = list(
color = colorRampPalette(RColorBrewer::brewer.pal(11,'Spectral'))(
length(unique(geogrid_data$table_name))
)
),
transforms = list(
list(
type = 'groupby',
groups = ~table_name
)
)
) %>% layout(showlegend = TRUE, xaxis = x)
Here is the output:
Then I switched the data source to an Oracle database table, reading the data with the ROracle package as follows:
# retrieve data into resultSet object
rs <- dbSendQuery(con, "SELECT * FROM GEOGRID_STATS")
# fetch records from the resultSet into a data.frame
geogrid_data <- fetch(rs)
# free resources occupied by resultSet
dbClearResult(rs)
dbUnloadDriver(drv)
# remove duplicates from dataframe (based on TABLE_NAME, TS_BEFORE, TS_NOW, NOW_SUM)
geogrid_data <- geogrid_data %>% distinct(TABLE_NAME, TS_BEFORE, TS_NOW, NOW_SUM, .keep_all = TRUE)
# alter date columns in place
geogrid_data$TS_BEFORE <- as.Date(geogrid_data$TS_BEFORE, format='%d-%m-%Y')
geogrid_data$TS_NOW <- as.Date(geogrid_data$TS_NOW, format='%d-%m-%Y')
and adjusting the plotting to:
p <- plotly::plot_ly(
type = 'scatter',
x = geogrid_data$TS_NOW,
y = geogrid_data$ABSOLUTE_V_SUM,
text = paste('Table: ', geogrid_data$TABLE_NAME,
'<br>Absolute_v_Sum: ', geogrid_data$ABSOLUTE_V_SUM,
'<br>Date: ', geogrid_data$TS_NOW),
hoverinfo = 'text',
mode = 'lines',
color = list(
color = colorRampPalette(RColorBrewer::brewer.pal(11,'Spectral'))(
length(unique(geogrid_data$TABLE_NAME))
)
),
transforms = list(
list(
type = 'groupby',
groups = geogrid_data$TABLE_NAME
)
)
) %>% layout(showlegend = TRUE, xaxis = x)
Unfortunately, this seems to cause a problem with the grouping:
As you can see from the label text when hovering over the data point, the point represents data from NY_SKOV_PLANTEB_MW_POLY while the legend entry says the trace shows data from NY_BYGN_MW_POLY. Looking at other data points in this trace, I found a wild mix of points of all sorts, some of them representing data of NY_BYGN_MW_POLY, most of them not.
The plotting along the time line does not work any more either: the points are connected in the order Dec. 11 - Dec. 10 - Dec. 10 - Dec. 12 - Dec. 20 - Dec. 17 - Dec. 16 - Dec. 15 instead of chronologically.
Where am I going wrong in handling the data, and what do I have to do to get it right?

Of course, one should look at the data... Thanks, Marco: after your question I did look at my data.
There were some points where I had simply assumed things.
The reason everything plotted fine from the csv file is simple. I had compiled the csv file by hand from emails that were ordered by date, so the rows were already in date order and Plotly had no trouble grouping them by table_name. The rows fetched from the database were not in date order.
After looking at my data I tidied it up, keeping only the columns I need for the plot, and used dplyr to sort the rows by time:
geogrid_data <- dplyr::arrange(geogrid_data, TS_NOW)
I sort only by time, not by time and table name, because the groupby transform in Plotly already separates the rows by table name.
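The effect of the sort can be checked without Plotly. The following is a minimal base R sketch, assuming a small made-up data frame (`geogrid_demo` and its values are hypothetical; only the column names follow the question). It orders the rows by `TS_NOW` and then confirms that the dates inside each table's group are in ascending order, which is what a line trace needs.

```r
# Hypothetical sample: rows arrive in no particular order, as from a database.
geogrid_demo <- data.frame(
  TABLE_NAME = c("A", "B", "A", "B", "A"),
  TS_NOW = as.Date(c("2018-12-11", "2018-12-10", "2018-12-10",
                     "2018-12-12", "2018-12-20")),
  ABSOLUTE_V_SUM = c(5, 3, 4, 6, 9),
  stringsAsFactors = FALSE
)

# Base R equivalent of dplyr::arrange(geogrid_demo, TS_NOW).
geogrid_sorted <- geogrid_demo[order(geogrid_demo$TS_NOW), ]

# Split the dates by table and check that each group is ascending.
by_table <- split(geogrid_sorted$TS_NOW, geogrid_sorted$TABLE_NAME)
ascending <- vapply(by_table,
                    function(d) all(diff(as.numeric(d)) >= 0),
                    logical(1))
print(ascending)
```

Sorting the whole frame by time is enough here because any subset of a sorted sequence is itself sorted.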

Related

How to plot Sankey Graph with R networkD3 values and percentage below each node

Good afternoon,
from the code below I am able to produce a Sankey chart, but it does not show the underlying values.
I tried to tweak the code in this thread, but had no luck. I have never used JavaScript.
What I need is a chart that also shows the value and the percentage under each node, like the following picture.
Thanks
library(dplyr)
library(networkD3)
library(tidyverse)
library(readxl)
library(RColorBrewer)
df = data.frame(Source = c("ABC","CDE","MNB","PCI","UCD"),
Destination = c("Me","You","Him","Her","Her"),
Value = c(200,350,456,450,100))
## Reshape dataframe to long
df2 = pivot_longer(df, c(Destination, Source))
## make unique list for destination and source
dest = unique(as.character(df$Destination))
sources = unique(as.character(df$Source))
## Assign nodes number to each element of the chart
nodes2 = data.frame(node = append(dest,sources), nodeid = c(0:8))
res = merge(df,nodes2, by.x="Source", by.y = "node")
res = merge(res,nodes2, by.x="Destination", by.y = "node")
## Make links
links2 = res[, c("nodeid.x","nodeid.y","Value")]
colnames(links2) <- c("source", "target", "value")
## Add a 'group' column to each connection:
links2$group = as.factor(c("type_a","type_b","type_c","type_d","type_e"))
## defining nodes
nodes2["groups"] = nodes2$node
nodes2$groups = as.factor(nodes2$groups)
# Give a color for each group:
my_color <- 'd3.scaleOrdinal() .domain(["type_a","type_b","type_c","type_d","type_e","Me","You","Him","Her","Her"]) .range(["rgb(165,0,38,0.4)", "rgb(215,48,39, 0.4)", "rgb(244,109,67,0.4)", "rgb(253,174,97,0.4)", "rgb(254,224,139,0.4)",
"rgb(255,255,191,0.4)", "rgb(217,239,139,0.4)", "rgb(166,217,106,0.4)",
"rgb(102,189,99,0.4)","rgb(26,152,80,0.4)"])'
# plot graph
networkD3::sankeyNetwork(Links = links2, Nodes = nodes2,
Source = 'source',
Target = 'target',
Value = 'value',
NodeID = 'node',
units = 'Amount',
colourScale=my_color,
LinkGroup="group",
NodeGroup="groups",
fontFamily = "arial",
fontSize = 8,
nodeWidth = 8)
Update below original content; it is a fully developed solution to your original request.
I'm still working on rendering the string on multiple lines (instead of on one line), which is proving quite difficult with SVG text. In the meantime, here is a method that gets all of the desired information onto your diagram, even if it isn't styled exactly as you wished.
First I created the data to add to the plot. This has to be added to the widget after it's created. (It will just get stripped if you try to add it beforehand.)
The following creates the source and destination percentages and the aggregated sums (where needed).
# for this tiny data frame some of this grouping is redundant---
# however, this method could be used on a much larger scale
df3 <- df %>%
group_by(Source) %>%
mutate(sPerc = paste0(round(sum(Value) / sum(df$Value) * 100, 2), "%")) %>%
group_by(Destination) %>%
mutate(dPerc = paste0(round(sum(Value) / sum(df$Value) * 100, 2), "%")) %>%
pivot_longer(c(Destination, Source)) %>%
mutate(Perc = ifelse(name == "Destination",
dPerc, sPerc)) %>% # determine which % to retain
select(Value, value, Perc) %>% # only fields to add to widget
group_by(value, Perc) %>%
summarise(Value = sum(Value)) # get the sum for 'Her'
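As a cross-check on the pipeline above, the same node totals and percentages can be worked out in base R. This is only a sketch for verifying the arithmetic, not a replacement for the dplyr code; the names `dest_total` and `dest_perc` are introduced here for illustration.

```r
# Same data as in the question.
df <- data.frame(Source = c("ABC", "CDE", "MNB", "PCI", "UCD"),
                 Destination = c("Me", "You", "Him", "Her", "Her"),
                 Value = c(200, 350, 456, 450, 100),
                 stringsAsFactors = FALSE)

grand_total <- sum(df$Value)  # 1556

# Sum per destination node; "Her" receives two links (450 + 100).
dest_total <- tapply(df$Value, df$Destination, sum)

# Share of the grand total, formatted like the labels in the answer.
dest_perc <- paste0(round(dest_total / grand_total * 100, 2), "%")
names(dest_perc) <- names(dest_total)

print(dest_total[["Her"]])  # 550
print(dest_perc[["Her"]])   # "35.35%"
```

Because every source appears only once in this data frame, the source percentages are simply each row's Value divided by 1556.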
I saved the Sankey diagram with the object name plt. This next part adds the new data to the widget plt.
plt$x$nodes <- right_join(plt$x$nodes, df3, by = c("name" = "value"))
This final element adds the value and the percentages to the source and destination node labels.
htmlwidgets::onRender(plt, '
function(el, x) {
d3.select(el).selectAll(".node text")
.text(d => d.name + " " + d.Perc + " " + d.Value)
}')
Update: Multi-line labels
I guess I just needed to sleep on it. This update will get you multi-line text.
You also asked for resources on how you would go about doing this yourself. There are a few things at play here: JavaScript, SVG text, D3, and the package htmlwidgets. When you use onRender, it's important to know the script file that connects the package's R code to htmlwidgets. I would suggest starting by learning about htmlwidgets, for example how to create your own.
Alright, back to answering the original question. This appends the new values using all of the content I originally provided, except the call to onRender.
htmlwidgets::onRender(plt, '
function(el, x) {
d3.select(el).selectAll(".node text").each(function(d){
var arr, val, anc
arr = " " + d.Perc + " " + d.Value;
arr = arr.split(" ");
val = d3.select(this).attr("x");
anc = d3.select(this).attr("text-anchor");
for(var i = 0; i < arr.length; i++) {
d3.select(this).append("tspan")
.text(arr[i])
.attr("dy", i ? "1.2em" : 0)
.attr("x", val)
.attr("text-anchor", anc)
.attr("class", "tspan" + i);
}
})
}')

Census Data Using an API on R studio

So, I am new to using R, so sorry if the questions seem a little basic!
But my work is asking me to look through census data using an API and identify some variables in each tract, then create a csv file they can look at. The code is fully written for me, I believe, but I need to change the variables to:
S2602_C01_023E - black / his
S2602_C01_081E - unemployment rate
S2602_C01_070E - not US citizen (divide by total population)
S0101_C01_030E - # over 65 (divide by total pop)
S1603_C01_009E - # below poverty (divide by total pop)
S1251_C01_010E - # child under 18 (divide by # households)
S2503_C01_013E - median income
S0101_C01_001E - total population
S2602_C01_078E - in labor force
I also need to divide some of the variables, as noted above, and export all of this into a CSV file. I just don't really know what to do with the code; I am lost because I have never used R. When I try changing the variables to the ones I need, an error comes up. Any help would be greatly appreciated!
library(tidycensus)
library(tidyverse)
library(stringr)
library(haven)
library(profvis)
#list of variables possible
v18 <- load_variables(year = 2018,
dataset = "acs5",
cache = TRUE)
#function to get variables for all states. Year, variables can be easily edited.
get_census_data <- function(st) {
Sys.sleep(5)
df <- get_acs(year = 2018,
variables = c(totpop = "B01003_001",
male = "B01001_002",
female = "B01001_026",
white_alone = "B02001_002",
black_alone = "B02001_003",
americanindian_alone = "B02001_004",
asian_alone = "B02001_005",
nativehaw_alone = "B02001_006",
other_alone = "B02001_007",
twoormore = "B02001_008",
nh = "B03003_002",
his = "B03003_003",
noncit = "B05001_006",
povstatus = "B17001_002",
num_households = "B19058_001",
SNAP_households = "B19058_002",
medhhi = "B19013_001",
hsdiploma_25plus = "B15003_017",
bachelors_25plus = "B15003_022",
greater25 = "B15003_001",
inlaborforce = "B23025_002",
notinlaborforce = "B23025_007",
greater16 = "B23025_001",
civnoninstitutional = "B27010_001",
withmedicare_male_0to19 = "C27006_004",
withmedicare_male_19to64 = "C27006_007",
withmedicare_male_65plus = "C27006_010",
withmedicare_female_0to19 = "C27006_014",
withmedicare_female_19to64 = "C27006_017",
withmedicare_female_65plus = "C27006_020",
withmedicaid_male_0to19 = "C27007_004",
withmedicaid_male_19to64 = "C27007_007",
withmedicaid_male_65plus = "C27007_010",
withmedicaid_female_0to19 = "C27007_014",
withmedicaid_female_19to64 = "C27007_017",
withmedicaid_female_65plus ="C27007_020"),
geography = "tract",
state = st )
return(df)
}
#loops over all states
df_list <- setNames(lapply(states, get_census_data), states)
##if you want to keep margin of error, remove everything after %>% in next two lines
final_df <- bind_rows(df_list) %>%
select(-moe)
colnames(final_df)[3] <- "varname"
#cleaning up final data, making it wide instead of long
final_df_wide <- final_df %>%
gather(variable, value, -(GEOID:varname)) %>%
unite(temp, varname, variable) %>%
spread(temp, value)
#exporting to csv file, adjust your path
# backslashes in a Windows path must be doubled inside an R string
write.csv(final_df, "C:\\Users\\NAME\\Documents\\acs_2018_tractlevel_dat.a.csv")
Since you can't really give a reproducible example without revealing your API key, I'll try my best to figure out what could work here:
Let's first edit the function that pulls data from the API:
get_census_data <- function(st) {
Sys.sleep(5)
df <- get_acs(year = 2018,
variables = c(blackHis= "S2602_C01_023E",
unEmployRate = "S2602_C01_081E",
notUSCit = "S2602_C01_070E"),
geography = "tract",
state = st )
return(df)
}
I've only put in three of the variables, but you should get the point.
Check whether this works for you and returns the data stored in the respective variables.
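The question also asks about dividing some counts by a denominator and exporting the result. The following is a minimal base R sketch of that step, assuming the data has already been reshaped to one row per tract with one column per variable. The data frame `tracts`, its column names, and its numbers are all hypothetical and do not come from the Census API.

```r
# Hypothetical wide table: one row per tract, one column per variable.
# Column names and values are made up for illustration.
tracts <- data.frame(
  GEOID = c("01001020100", "01001020200"),
  totpop = c(1000, 2000),
  over65 = c(100, 500),
  noncit = c(50, 100),
  households = c(400, 800),
  children_under18 = c(200, 600),
  stringsAsFactors = FALSE
)

# Divide the counts by their denominators.
tracts$share_over65 <- tracts$over65 / tracts$totpop
tracts$share_noncit <- tracts$noncit / tracts$totpop
tracts$children_per_household <- tracts$children_under18 / tracts$households

# Write to a temporary file; replace with a real path as needed.
out_path <- tempfile(fileext = ".csv")
write.csv(tracts, out_path, row.names = FALSE)

# Read back, keeping GEOID as text so leading zeros survive.
check <- read.csv(out_path, colClasses = c(GEOID = "character"))
print(check)
```

Keeping GEOID as character matters when the file is read again, because tract identifiers can start with a zero.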

C3.ai COVID-19 Data Lake Quickstart in R

I am working on a research assignment on COVID and using the data lake API to fetch the different kinds of datasets available to us.
I am wondering if it's possible to fetch all outbreak countries.
ids = list("Australia") works for an individual country, but it doesn't seem to accept a wildcard or "all".
Can anyone give me any insight on this, please?
# Total number of confirmed cases in Australia and proportion of getting infected.
today <- Sys.Date()
casecounts <- evalmetrics(
"outbreaklocation",
list(
spec = list(
ids = list("Australia"),
expressions = list("JHU_ConfirmedCasesInterpolated","JHU_ConfirmedDeathsInterpolated"),
start = "2019-12-20",
end = today-1,
interval = "DAY"
)
)
)
casecounts
The easiest way to access a list of countries is in the Excel file linked at https://c3.ai/covid-19-api-documentation/#tag/OutbreakLocation. It has a list of countries in the first sheet, and shows which of those have data from JHU.
You could also fetch an approximate list of country-level locations with:
locations <- fetch(
"outbreaklocation",
list(
spec = list(
filter = "not(contains(id, '_'))"
)
)
)
That should contain all of the countries, but could have some non-countries like World Bank regions.
Then, you'd use this code to get the time series data for all of those locations:
location_ids <-
locations %>%
dplyr::select(-location) %>%
unnest_wider(fips, names_sep = ".") %>%
sample_n(15) %>% # include this to test on a smaller set
pull(id)
today <- Sys.Date()
casecounts <- evalmetrics(
"outbreaklocation",
list(
spec = list(
ids = location_ids,
expressions = list("JHU_ConfirmedCasesInterpolated","JHU_ConfirmedDeathsInterpolated"),
start = "2019-12-20",
end = today-1,
interval = "DAY"
)
),
get_all = TRUE
)
casecounts
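The filter used above to fetch locations keeps ids that contain no underscore. The same rule can be tried locally on a character vector, which may help when inspecting the result for non-countries. This is a minimal base R sketch; the example ids are hypothetical, and it assumes that sub-national ids are the ones containing an underscore.

```r
# Hypothetical ids, for illustration only.
ids <- c("Australia", "Brazil", "NewSouthWales_Australia", "Bahia_Brazil")

# Keep only ids without an underscore, mirroring not(contains(id, '_')).
country_like <- ids[!grepl("_", ids, fixed = TRUE)]

print(country_like)
```

Any remaining non-country entries, such as regional aggregates, would still need to be removed by hand or against the list in the Excel file.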

Shiny and Leaflet integration is really slow - how can I speed it up?

Right now I'm almost certain that my current use of shiny and leaflet is sub-optimal.
At a high level my current approach looks like this:
1. Generate a leaflet.
2. Create a reactive dataframe on user input.
3. Create a reactive dataframe of lat lon coordinates on user selection of their area of interest.
4. Merge a spatial dataframe (containing postcode polygon boundaries) with the reactive dataframe from step 2, then draw the map with the joined dataframe. This keeps all the data necessary for drawing polygons, adding colorBins and fillColor and labels inside the same final dataframe.
In more detail, the steps are executed as follows:
Generate a map like this:
output$leaflet_map <- renderLeaflet({
leaflet() %>%
addTiles()
})
Produce a reactive dataframe of marketing data to be joined onto an sf spatial dataframe containing postcode polygons via sp::merge() (the join happens a little later, I'll get to that):
reactive_map_data1 <- reactive({
df %>%
filter(BrandAccount_Brand %in% input$selectBrandRecruitment1) %>%
group_by(POA_CODE, ordertype) %>%
summarise("Number of Orders type and postcode" = n(), "AOV" = round(mean(TotalDiscount), 2)) %>%
left_join(seifa, by = "POA_CODE") %>%
left_join(over25bypostcode, by = "POA_CODE") %>%
mutate(`Proportion of Population Over 25` = round(n() / `25_and_over` * 100, 4))
})
Create a reactive dataframe containing the lat and lon coordinates of the State selected by the user to be fed into the call to render the map:
reactive_state_recruitment1 <- reactive({
australian_states %>%
filter(States == input$selectState_recruitment1)
})
Render the final map - profvis determines that this is in fact the slow part:
observeEvent(
input$gobutton_recruitment1, {
## First I load the spatial data with each call to render the
## map - this is almost certainly sub-optimal however I can't
## think of another way to do this as each time the data are
## joined I have no other way of re-setting the gdal.postcodes2
## spatial dataframe to its original state which is why I reload
## it from .rds each time:
gdal.postcodes_recruitment1 <- readRDS("gdal.postcodes2.rds")
## I then merge the marketing `reactive_map_data1()` dataframe
## created in Step 2 with the freshly loaded `gdal.postcodes2`
## spatial dataframe - `profvis` says this is pretty slow but
## not as slow as the rendering of the map
gdal.postcodes_recruitment1@data <- sp::merge(gdal.postcodes_recruitment1@data, reactive_map_data1(), by.x = "POA_CODE", all.x = TRUE)
## Next I generate the domain of `colorBin` with the `Number of
## Orders type and postcode` variable that only exists after the
## merge and is subject to change from user input - it resides
## within the `reactive_map_data1()` dataframe that gets merged
## onto the `gdal.postcodes2()` spatial dataframe.
pal <- colorBin("YlOrRd",
domain = gdal.postcodes_recruitment1$`Number of Orders type and postcode`,
bins = bins_counts)
## Lastly I update the leaflet with `leafletProxy()` to draw the
## map with polygons and fill colour based on the
## `reactive_map_data1()` values
leafletProxy("leaflet_map_recruitment1", data = gdal.postcodes_recruitment1) %>%
addPolygons(data = gdal.postcodes_recruitment1,
fillColor = ~pal(gdal.postcodes_recruitment1$`Number of Orders type and postcode`),
weight = 1,
opacity = 1,
color = "white",
dashArray = "2",
fillOpacity = .32,
highlight = highlightOptions(
weight = 3.5,
color = "white",
dashArray = "4",
fillOpacity = 0.35,
bringToFront = TRUE),
layerId = gdal.postcodes_recruitment1@data$POA_CODE,
label = sprintf(
"<strong>%s<br/>%s</strong><br/>%s<br/>%s<br/>%s<br/>%s",
paste("Postcode: ", gdal.postcodes_recruitment1$POA_CODE, sep = ""),
paste("% of Population Over 25: ", gdal.postcodes_recruitment1$`Proportion of Population Over 25`, "%"),
paste("Number of Orders: ", gdal.postcodes_recruitment1$`Number of Orders type and postcode`, sep = ""),
paste("Ave Order Value: $", gdal.postcodes_recruitment1$`AOV`, sep = ""),
paste("Advantage & Disadvantage: ", gdal.postcodes_recruitment1$`Relative Socio-Economic Advantage and Disadvantage Decile`, sep = ""),
paste("Education and Occupation: ", gdal.postcodes_recruitment1$`Education and Occupation Decile`, sep = "")
) %>%
lapply(htmltools::HTML),
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto")) %>%
addLegend("bottomright", pal = pal, values = ~bins_counts,
title = "# of Recruits (All Time)",
labFormat = labelFormat(suffix = ""),
opacity = 1
) %>%
setView(lng = reactive_state_recruitment1()$Lon, lat = reactive_state_recruitment1()$Lat, zoom = reactive_state_recruitment1()$States_Zoom)
})
All up the map takes between 7 and 20 seconds to render as the data are quite large.
Some things to note:
The polygons have already been simplified to death, they are currently only displaying at 10% of the detail that was originally provided to define postcode boundaries by the Australian Bureau of Statistics. Simplifying the polygons further is not an option.
sp::merge() is not the fastest of join functions I have come across, but it is necessary in order to merge a spatial dataframe with a non-spatial dataframe (other joins such as those offered by dplyr will not accomplish this task - a look at the sp::merge() documentation reveals that this has something to do with S3 and S4 datatypes, in any case this part is not the slow part according to profvis).
According to profvis the actual rendering of the map in step 4 (drawing polygons) is the slow part. Ideally a solution to speed this whole process up would involve drawing the polygons on the original leaflet, and only updating the fillColor and labels applied to each polygon upon input of the 'Go' actionButton. I have not figured out a way to do this.
Can anyone think of a way to re-structure this whole procedure to optimise efficiency?
Any input is greatly appreciated.
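One part of the procedure above, reloading the .rds file on every click in order to reset the spatial dataframe, can be reasoned about in plain R. Ordinary R objects (including S4 objects such as sp classes) are copied on modification, so a function that adds columns to its argument leaves the original untouched. The following is a minimal base R sketch under that assumption. It uses plain data frames as stand-ins for the `@data` slot and the reactive summary; `attach_orders`, `postcodes_base`, `orders`, and `n_orders` are hypothetical names. It does not address the polygon drawing time that profvis reports as the main cost.

```r
# Hypothetical stand-ins for the spatial object's attribute table and the
# reactive marketing summary; only POA_CODE comes from the question.
postcodes_base <- data.frame(POA_CODE = c("2000", "2010", "3000"),
                             stringsAsFactors = FALSE)
orders <- data.frame(POA_CODE = c("3000", "2000"),
                     n_orders = c(7, 12),
                     stringsAsFactors = FALSE)

attach_orders <- function(base, summary_df) {
  # `base` is changed only inside this function (copy-on-modify),
  # so the object loaded once at start-up keeps its original state.
  idx <- match(base$POA_CODE, summary_df$POA_CODE)
  # match() keeps the row order of `base`, so rows stay aligned
  # with their polygons; postcodes without orders get NA.
  base$n_orders <- summary_df$n_orders[idx]
  base
}

joined <- attach_orders(postcodes_base, orders)
print(joined)
print(names(postcodes_base))  # still only "POA_CODE"
```

With this pattern the .rds file would be read once, outside the observer, and each click would work on a fresh copy.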

How I can select the coordinate X and Y of R plot from Column Filter (R/Knime)?

So, I have this workflow:
I have selected 2 columns (Day and Temperature) from my file using 'Column Filter' and connected it to an 'R plot' node that I configured, but I get this:
The Day column is not used for the X axis (the Row ID is used instead), while the Y axis is OK.
This is my code in R plot:
# Library
library(qcc)
library(readr)
library(Rserve)
Rserve(args = "--vanilla")
# Data column filter from CSV file imported
Test <- kIn
#Background color
qcc.options(bg.margin = "white", bg.figure = "gray95")
#R graph ranges of a continuous process variable
qcc(data = Test,
type = "R",
sizes = 5,
title = "Sample R Chart Title",
digits = 2,
plot = TRUE)
Here is my attempt (using KNIME's own R integration, not the community contribution):
#install.packages("qcc")
library(qcc)
data <- knime.in
#Change the names to use Day instead of row keys
row.names(data) <- data$Day
#Using the updated data
plot(qcc(data = data,
type = "R",
sizes = 5,
title = "Sample R Chart Title",
digits = 2,
plot = TRUE))
With results like:
If you want to select a different column for the X axis, just change the row.names assignment. (The column name can also come from knime.flow.in if it is stored in a flow variable, but as I understand it that is not the case for you.)
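One caveat with the row.names assignment: data frame row names must be unique, so a Day column with repeated values would raise an error. The following is a minimal base R sketch of one way around that, using a made-up data frame in place of knime.in (the name `demo` and its values are hypothetical).

```r
# Hypothetical stand-in for knime.in with a repeated Day value.
demo <- data.frame(Day = c("Mon", "Mon", "Tue"),
                   Temperature = c(21.5, 23.1, 19.8),
                   stringsAsFactors = FALSE)

# make.unique() appends .1, .2, ... to duplicates so the assignment succeeds.
row.names(demo) <- make.unique(demo$Day)

print(rownames(demo))
```

If the Day values are already unique, `make.unique()` returns them unchanged, so the labels on the X axis are the same as with a plain assignment.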
