Finding a table on a webpage with Power Query - table is not always in the same location - web-scraping

I use Power Query to scrape a webpage that has multiple tables on it. The tables can move around: one week the data I need may be in the second table, the next week the third, then back to the second. The table I want always has the same text in its first column, so I could select it reliably if I could find that entry.
My code is as follows. The issue is that Source{2} can point to a different table depending on the actual page layout.
let
    Source = Web.Page(Web.Contents("https://www.eia.gov/naturalgas/weekly/#tabs-supply-1")),
    Data0 = Source{2}[Data],
    #"Changed Type" = Table.TransformColumnTypes(Data0,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}, {"Column4", type text}, {"Column5", type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type","U.S. natural gas supply - ","",Replacer.ReplaceText,{"Column1"}),
    RecRepl = Table.ReplaceValue(#"Replaced Value",null, each _[Column1], Replacer.ReplaceValue,{"Column2"}),
    #"Removed Columns" = Table.RemoveColumns(RecRepl,{"Column1"}),
    #"Removed Top Rows" = Table.Skip(#"Removed Columns",1),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Top Rows", [PromoteAllScalars=true])
in
    #"Promoted Headers"
How can I search the tables in the Data column to find the row I need?
Thanks!!!
I've tried a variety of steps but have been unable to find the table.

Try this:
This code looks for the phrase "Henry Hub" in column zero of every table in the Data column and flags the ones that contain it. It then filters for true and pulls the contents of the first match. You can adjust the column to search and the text to search for, and take it from there.
let
    Source = Web.Page(Web.Contents("https://www.eia.gov/naturalgas/weekly/#tabs-supply-1")),
    #"Added Custom" = Table.AddColumn(Source, "Custom", each List.Contains(Table.Column([Data],Table.ColumnNames([Data]){0}),"Henry Hub")),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each [Custom] = true),
    Data0 = #"Filtered Rows"{0}[Data]
in
    Data0
EDIT
This version looks for the word "Henry" in column zero of every table in the Data column. It uses List.FindText, so a partial match within a cell is enough (List.Contains above requires an exact cell match). It then filters for true and pulls the contents of the first match.
let
    Source = Web.Page(Web.Contents("https://www.eia.gov/naturalgas/weekly/#tabs-supply-1")),
    #"Added Custom" = Table.AddColumn(Source, "Custom", each List.Count(List.FindText(Table.Column([Data],Table.ColumnNames([Data]){0}),"Henry"))>0),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each [Custom] = true),
    Data0 = #"Filtered Rows"{0}[Data]
in
    Data0

Related

Can't convert an Excel Matrix into a transaction file using "read.transactions" in R

I am currently trying to complete a market basket analysis of purchase orders with arules. However, I am unable to convert my matrix of orders into transaction data.
I have a matrix in Excel for purchase orders, with the top row being headers of item names and the 1,800 rows below listing the quantity of each item in each order.
This is the current call I have created:
trans = read.transactions("Market Based Analysis/Table.csv", format = "basket", sep = ",", header = TRUE, rm.duplicates = F)
but it doesn't seem to read the data correctly: it adds an item "0" and gives false quantities when I summarise the figures.

R shiny app DataTable with excel-like filtering to include everything but not some elements or the opposite

I was wondering if it's possible to have an output datatable in an R Shiny app with Excel-like filtering. More specifically, I'm looking for a way to either select all the elements in a given column but exclude a few (imagine we have 100 different values in a column and we want to exclude just 2 of them), or to unselect all the values and choose only a few.
In Excel we have those little square boxes that allow us to tick/untick a specific value in a column.
Now, I know we have the shiny selectInput with the multiple parameter:
selectInput("select_max_mdy_rtg", label = "Max Mdy Rtg",
            choices = list("Aaa" = "aaa", "Aa1"="aa1", "Aa2"="aa2", "Aa3" = "aa3", "A1"="a1","A2"="a2","A3"="a3",
                           "Baa1"="baa1","Baa2"="baa2","Baa3"="baa3","Ba1"="ba1","Ba2"="ba2","Ba3"="ba3","B1"="b1","B2"="b2","B3"="b3",
                           "Caa1"="caa1","Caa2"="caa2","Caa3"="caa3","Ca"="ca","C"="c","D"="d","NR/WR/NA"="na"),
            selected = 'aaa', multiple = TRUE)
However, this doesn't allow me to exclude just a few values.
I have seen a similar question (Shiny datatable filter box), but the solution proposed there doesn't allow selecting all the values and excluding just a few of them; besides, I don't necessarily need the filters at the bottom of the table.
I need precisely what Excel lets us do when filtering values in a given column.
Thoughts on this?
Thanks
You could try looking into pickerInput from the shinyWidgets package.
For example:
pickerInput("select_max_mdy_rtg", "Max Mdy Rtg",
            choices = list("Aaa" = "aaa", "Aa1"="aa1", "Aa2"="aa2", "Aa3" = "aa3", "A1"="a1","A2"="a2","A3"="a3",
                           "Baa1"="baa1","Baa2"="baa2","Baa3"="baa3","Ba1"="ba1","Ba2"="ba2","Ba3"="ba3","B1"="b1","B2"="b2","B3"="b3",
                           "Caa1"="caa1","Caa2"="caa2","Caa3"="caa3","Ca"="ca","C"="c","D"="d","NR/WR/NA"="na"),
            options = list(`actions-box` = TRUE),
            multiple = TRUE)
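For context, here is a minimal self-contained app wiring such a picker to a table (a sketch only: shiny and shinyWidgets must be installed, and mtcars with its cyl column is just a stand-in for the real data and input names):

```r
library(shiny)
library(shinyWidgets)

ui <- fluidPage(
  pickerInput("cyl_filter", "Cylinders",
              choices  = sort(unique(mtcars$cyl)),
              selected = unique(mtcars$cyl),          # start with everything ticked
              options  = list(`actions-box` = TRUE),  # adds Select All / Deselect All buttons
              multiple = TRUE),
  tableOutput("tbl")
)

server <- function(input, output) {
  # keep only the rows whose cyl value is still ticked in the picker
  output$tbl <- renderTable(mtcars[mtcars$cyl %in% input$cyl_filter, ])
}

# shinyApp(ui, server)  # uncomment to run interactively
```

With the actions box enabled you can Deselect All and tick just the values you want, or Select All and untick a couple, which is the Excel-style behaviour asked about.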

Apriori Labels rejected despite other function factors writing rules in R

I have a large data set (3.5 million observations and 185 variables) that I'm running market basket analysis on with apriori(); most of the columns hold a yes/no result. I've converted my data frame correctly, but for some of the yes/no columns one of the factors (usually a yes) will occasionally fail with Error in asMethod(object) : variable is an unknown item label, or it will write no rules while the others run fine. Since my file is so large, I need to narrow down the rules via an lhs = specification, hence my concern about the sporadic behaviour.
I've checked that the label exists in my data frame (it does), and I even factored it again in case that was the issue. When I run labels() on my transaction data I can't find any entries with the problematic label, despite table() showing that some exist. However, I don't have an efficient way to search all the transaction data, so I only searched a few hundred transactions; they could still be there.
My csv is a data frame with a row per transaction and a column per basket item. It's not as wide as it could be because the yes/no values share a column. I've also attached the column name to the cells with a . to make the rules easier to read. df2 is the same as ExportMD1.csv.
Here's my data conversion
tr <- read.transactions('ExportMD1.csv', format = 'basket', sep = ',', cols = 185, header = TRUE)
I'll use isInterestBearing as an example; table() shows that there are 'yes' values:
table(df2$isInterestBearing)
isInterestBearing.n isInterestBearing.y
              69745              276824
I get one of two outputs when I run the following code:
rules <- apriori(tr, paramete = list(supp = 0.5, conf = 0.8, minlen = 2), appearance = list(lhs= "isInterestBearing"))
Option 1
Error in asMethod(object) : isInterestBearing is an unknown item label
4. stop(paste(indicator[!indicator %in% from$labels], "is an unknown item label", collapse = ", "))
3. asMethod(object)
2. as(c(appearance, list(labels = itemLabels(data))), "APappearance")
1. apriori(tr, paramete = list(supp = 0.5, conf = 0.8, minlen = 2), appearance = list(lhs = "isInterestBearing"))
Option 2
Parameter specification:
Algorithmic control:
Absolute minimum support count: 173284
set item appearances ...[1 item(s)] done [0.04s].
set transactions ...[430165 item(s), 346569 transaction(s)] done [24.73s].
sorting and recoding items ... [177 item(s)] done [0.97s].
creating transaction tree ... done [1.35s].
checking subsets of size 1 done [0.02s].
writing ... [0 rule(s)] done [0.04s].
creating S4 object ... done [0.22s].
There's no difference in the dataframe or read.transaction when these issues occur.
Ideally apriori() would run consistently without errors. I suspect I'm not getting rules for some labels because the counts are so low, but I have no idea why the labels aren't reliably recognized.
I think you are just not using the right item label in appearance. Check what item labels your transactions actually contain with
itemLabels(tr)
The correct item label will be something like isInterestBearing=y.
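A minimal sketch of the fix (assuming the arules package is installed; the toy baskets below merely mimic how the "column=value" items are stored, and are not the asker's data):

```r
library(arules)

# Toy transactions whose items carry the "column=value" form described above
baskets <- list(
  c("isInterestBearing=y", "isTreasuryBill=n"),
  c("isInterestBearing=y", "isTreasuryBill=y"),
  c("isInterestBearing=n", "isTreasuryBill=n")
)
tr <- as(baskets, "transactions")

itemLabels(tr)  # inspect the stored labels before writing the appearance spec

# Restrict the LHS with the full item label, not the bare column name
rules <- apriori(tr,
                 parameter  = list(supp = 0.5, conf = 0.8, minlen = 2),
                 appearance = list(lhs = "isInterestBearing=y", default = "rhs"))
```

Passing the bare column name ("isInterestBearing") to lhs reproduces the "unknown item label" error, because no stored item has that exact label.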

Creating different indent for row names in pander

For the purpose of exporting a results table into Word using R Markdown, I found that pander fulfils almost all my needs. However, after a lot of searching I couldn't find a way to indent some of the row names in my table (the equivalent of add_indent in kable). For example, the first row's name is "Level 1", and then I want the second row's name ("intercept") to be indented to the right. Is this possible? (I found pandoc.indent but didn't succeed in applying it to my table.)
Attached is the current code I use for the table.
library(pander)
set.alignment('center', row.names = 'left')
panderOptions('missing', '')
pander::pander(df, split.cells = c(50,15,15,15), split.table = Inf,
               emphasize.rownames = FALSE)
Thanks!

Is it possible to set a cube with a snowflake schema in data.cube R?

Or, can dimensions be nested in some way in data.cube?
Given the following example (accessed via ?data.cube in R, having installed the latest branch of the data.cube-oop package by #jangorecki), for which I post code and an image example.
Consider that I want to expand the cube by adding a new dimension that would turn the schema into a snowflake, so that for each geography location I would have another set of data (a data.table) describing demographic properties (i.e. population by gender, age, etc.).
Image
dotted: possible new dimensions.
black: actual facts and dimensions from code example.
green: new dimension which turns the schema into snowflake.
Code
# multidimensional hierarchical data from fact and dimensions
X = populate_star(N = 1e3)
sales = X$fact$sales
time = X$dims$time
geography = X$dims$geography
# define hierarchies
time.hierarchies = list( # 2 hierarchies in time dimension
    "monthly" = list(
        "time_year" = character(),
        "time_quarter" = c("time_quarter_name"),
        "time_month" = c("time_month_name"),
        "time_date" = c("time_month","time_quarter","time_year")
    ),
    "weekly" = list(
        "time_year" = character(),
        "time_week" = character(),
        "time_date" = c("time_week","time_year")
    )
)
geog.hierarchies = list( # 1 hierarchy in geography dimension
    list(
        "geog_region_name" = character(),
        "geog_division_name" = c("geog_region_name"),
        "geog_abb" = c("geog_name","geog_division_name","geog_region_name")
    )
)
# create dimensions
dims = list(
time = as.dimension(x = time,
id.vars = "time_date",
hierarchies = time.hierarchies),
geography = as.dimension(x = geography,
id.vars = "geog_abb",
hierarchies = geog.hierarchies)
)
# create fact
ff = as.fact(
x = sales,
id.vars = c("geog_abb","time_date"),
measure.vars = c("amount","value"),
fun.aggregate = sum,
na.rm = TRUE
)
# create data.cube
dc = as.data.cube(ff, dims)
str(dc)
Other questions related to the example:
What value is expected for each element? Why
"time_week" = character()
"time_date" = c("time_week","time_year")
instead of
"time_week" = character()
"time_date" = date()
And why this naming, matching the columns of the data.table?
"time_quarter" = c("time_quarter_name"),
"time_month" = c("time_month_name")
The cube model is an underlying structure that the user doesn't have to deal with. data.cube-oop uses the following data model.
Going precisely to the example in your question: you can't add new dimensions in a snowflake schema this way. New dimensions must be connected to the fact table. Tables in a snowflake schema that are not directly connected to the fact table are just hierarchy levels in dimensions. In your example, that means the customer dimension is just a set of higher-level attributes of the geography dimension. You might eventually do the opposite: create a customer dimension and keep geography attributes on the higher levels of the customer hierarchy.
Whichever way you decide to do it, you must create each dimension from a single table (which can of course be the same wide table). You cannot construct a dimension by providing its levels separately; that would be more confusing than a single denormalized dimension table, which is what users most often deal with.
So, to keep geography attributes in the customer dimension, just look up the geography values into your customer table and supply as.dimension with the new table.
As for the second part of your question, the hierarchy lists define relationships between attributes in a hierarchy, not data types. The column name on the LHS defines the key column of a hierarchy level, while the RHS defines the dependent attributes present on that level. You basically define which column goes on which level in the hierarchy; lower levels must refer to upper levels in order to create a real hierarchy. This is enforced by uniqueness of the data, i.e. you must have only a single time_month_name for each time_month.
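That uniqueness rule can be illustrated in plain R (a toy sketch; time_month and time_month_name are the column names from the example above, and the three rows are made up):

```r
# Toy level table: each time_month key must map to exactly one time_month_name
time_level <- data.frame(
  time_month      = c(1L, 1L, 2L),
  time_month_name = c("Jan", "Jan", "Feb")
)

# count distinct dependent values per key
names_per_month <- tapply(time_level$time_month_name,
                          time_level$time_month,
                          function(x) length(unique(x)))

all(names_per_month == 1)  # TRUE: the hierarchy level is well-formed
```

If any key mapped to two different names, the check would return FALSE and the table could not serve as a hierarchy level.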
To see this relationship better, try the following:
library(data.cube)
X = populate_star(N = 1e3)
dc = as.data.cube(X)
dc$dimensions$time$levels
It will print all hierarchy levels in the time dimension; each hierarchy level is a separate table.
