CSS Selector found with rvest and not with RSelenium

CSS Selector found with rvest and not with RSelenium - css

The goal is to read the 1-5yr GIC rates for Guaranteed Investment Certificate - Long-Term and Compound Interest under the Non-Cashable GICs tab.
Selector Gadget tells me that the css identifier is #container-9565195e5e .cmp-chart__chart span. Using rvest:
page <- read_html('https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/')
page %>%
html_nodes("#container-9565195e5e .cmp-chart__chart span")
# {xml_nodeset (5)}
# [1] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:1" data-value="postedRate"></span>
# [2] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:2" data-value="postedRate"></span>
# [3] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:3" data-value="postedRate"></span>
# [4] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:4" data-value="postedRate"></span>
# [5] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:5" data-value="postedRate"></span>}
rvest can't read the actual rates because of the use of JavaScript on the site.
Turning to RSelenium using the same css selector results in an error:
remDr$navigate("https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/")
webElem <- remDr$findElement(using = "css", "#container-9565195e5e .cmp-chart__chart span")
# Selenium message:Unable to locate element: {"method":"css selector","selector":"#container-9565195e5e .cmp-chart__chart span"}
# For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html
# Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
# System info: host: 'ef4080d2cb73', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-135-generic', java.version: '1.8.0_91'
# Driver info: driver.version: unknown
#
# Error: Summary: NoSuchElement
# Detail: An element could not be located on the page using the given search parameters.
# class: org.openqa.selenium.NoSuchElementException
# Further Details: run errorDetails method
So how do I use RSelenium to read the 1-5yr rates for Guaranteed Investment Certificate - Long-Term and Compound Interest for Non-registered and Registered (TFSA, RSP, RIF, RESP)

Replaced RSelenium with Chromote (which is on its way to rvest: r4ds, gh). The selector in question seems to refer to another table, Long-Term and Simple Interest. While values are currently the same, still switched to the one mentioned in question.
library(chromote)
library(rvest)
b <- ChromoteSession$new()
# Display the current session in the Chromote browser:
# b$view()
b$Page$navigate("https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/")
b$Page$loadEventFired()
# Non-Cashable GICs >> Guaranteed Investment Certificate - Long-Term and Compound Interest
b$Runtime$evaluate("document.querySelector('#container-8a263227af table').outerHTML")$result$value %>%
minimal_html() %>%
html_element("table") %>%
html_table()
#> # A tibble: 5 × 2
#> Term `Non-registered and Registered (TFSA, RSP, RIF, RESP)`
#> <chr> <chr>
#> 1 1 year 4.65%
#> 2 2 years 4.35%
#> 3 3 years 3.75%
#> 4 4 years 4%
#> 5 5 years 4.05%
### Few alternatives
# evalute js in runtime:
sapply(1:5, \(x) b$Runtime$evaluate(paste0("document.querySelector('[data-filter-item=\"productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:",x,"\"]').innerText"))$result$value)
#> [1] "4.65" "4.35" "3.75" "4" "4.05"
doc <- b$DOM$getDocument()
# elements where "data-filter-item" attribute starts with "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:"
nodeids <- b$DOM$querySelectorAll(doc$root$nodeId, '[data-filter-item^="productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:"]')
sapply(nodeids$nodeIds, \(x) b$DOM$getOuterHTML(x) %>% minimal_html() %>% html_text())
#> [1] "4.65" "4.35" "3.75" "4" "4.05"
# close session
b$close()
#> [1] TRUE
Created on 2023-01-21 with reprex v2.0.2

The page does an initial POST request that gets all the data (let's call it master) for all the options. It then uses the various data-filter-item attribute values associated with a given table's cells e.g., data-filter-item="productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:1", to filter the master data to the items needed to update that table.
You can replicate a simplified version of this POST request, create a DataFrame of all values in the value part of the response (think of it like a master reference table), then extract the required filters from a request to the original URI
> filters
[1] "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:1" "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:2"
[3] "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:3" "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:4"
[5] "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:5"
and turn those into a DataFrame for filtering.
You can then subset the master table using the smaller DataFrame. Column names will be matched upon if set in the master using the key values from the key:value response.
Finally, update the table, when extracted from request response for initial webpage, by updating the relevant column with the rate % from the filtered master DataFrame.
The html from the initial webpage is invalid so the target table selection was not as straightforward as I would like.
The CSS selector list I went with was designed with hopefully a longer shelf-life, in terms of remaining valid for longer than a more brittle path might.
One other thing to show might be the response from the POST request. The response has the following key:value format, where I use the key column to generate headers for my master DataFrame. The values get turned into the master DataFrame of all possible rates (and other dynamic page info.)
Credit:
I took the approach used by #akrun in their answer here, whereby read.dcf is used to map out a set of rows, with potentially repeated/new headers, into a single DataFrame with all headers present. NA is entered if a particular entry is not present in a given processed row.
This allowed me to turn this list of split filtering instructions:
> lapply(filters, str_split, "\\|") %>% unlist(recursive = F)
[[1]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:1"
[[2]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:2"
[[3]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:3"
[[4]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:4"
[[5]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:5"
into this:
> data_df
productId minimumDepositAmt minimumTermYearCnt
1 703 0.01 1
2 703 0.01 2
3 703 0.01 3
4 703 0.01 4
5 703 0.01 5
i.e. the set of filtering instructions for the master DataFrame as a DataFrame
The master DataFrame looking as follows:
> df %>% head()
productId minimumDepositAmt maximumDepositAmt minimumTermYearCnt maximumTermYearCnt minimumTermDayCnt maximumTermDayCnt postedRate
1 107 0.01 4999.99 0 0 90 119 4
2 107 5000 9999.99 0 0 90 119 4
3 107 10000 24999.99 0 0 90 119 4
4 107 25000 49999.99 0 0 90 119 4
5 107 50000 99999.99 0 0 90 119 4
6 107 100000 249999.99 0 0 90 119 4
minimumMarketGrowthRate maximumMarketGrowthRate stepperYear1Rate stepperYear2Rate stepperYear3Rate stepperYear4Rate stepperYear5Rate
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
The subset master DataFrame:
> filtered_df
productId minimumDepositAmt minimumTermYearCnt maximumDepositAmt maximumTermYearCnt minimumTermDayCnt maximumTermDayCnt postedRate
1 703 0.01 1 4999.99 1 0 364 4.65
2 703 0.01 2 4999.99 2 0 364 4.35
3 703 0.01 3 4999.99 3 0 364 3.75
4 703 0.01 4 4999.99 4 0 364 4
5 703 0.01 5 4999.99 5 0 364 4.05
minimumMarketGrowthRate maximumMarketGrowthRate stepperYear1Rate stepperYear2Rate stepperYear3Rate stepperYear4Rate stepperYear5Rate
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
The extracted table, from initial page, before update:
> table
# A tibble: 5 × 2
Term `Non-registered and Registered (TFSA, RSP, RIF, RESP)`
<chr> <chr>
1 1 year %
2 2 years %
3 3 years %
4 4 years %
5 5 years %
And the table after update using master (df - data from POST request to get rates info):
> print(table)
# A tibble: 5 × 2
Term `Non-registered and Registered (TFSA, RSP, RIF, RESP)`
<chr> <chr>
1 1 year 4.65%
2 2 years 4.35%
3 3 years 3.75%
4 4 years 4%
5 5 years 4.05%
r:
library(rvest)
library(tidyverse)
library(httr2)
page <- read_html("https://www.td.com/ca/en/personal-banking/personal-investing/products/gic/gic-rates-canada")
table_node <- page %>%
html_element('div.container:contains("Guaranteed Investment Certificate - Long-Term") .text:contains("Compound") ~ div table')
filters <- table_node %>%
html_elements("[data-filter-item]") %>%
html_attr("data-filter-item")
res <- request("https://www.td.com/ca/en/personal-banking/getRates/") %>%
req_headers(
"user-agent" = "Mozilla/5.0",
"content-type" = "application/json",
"x-kl-ajax-request" = "Ajax_Request"
) %>%
req_body_json(list("errorText" = "Unable to get the rate", "ratesType" = "gic")) %>%
req_perform() %>%
resp_body_string()
data <- jsonlite::parse_json(res, simplifyVector = T)
df <- set_names(data$value %>% as.data.frame(), data$key)
data_df <- map_dfr(lapply(filters, str_split, "\\|") %>%
unlist(recursive = F), ~ {
new <-
if (length(.x) > 0) {
as.data.frame(read.dcf(textConnection(.x)))
} else {
NULL
}
})
filtered_df <- inner_join(data_df, df)
table <- table_node %>% html_table()
table[2] <- str_c(filtered_df$postedRate, table[[2]])
print(table)

Related

Search for and extract words into new column

I have a dataset with 63,000 rows in R. One of the columns contained a list of words in the format '"Fireplace", "Garage", "One story with balcony", "Off street parking",' etc. They are property characteristics listed from sale websites.
I want to extract words from this column and create a new column that has a '0' or a '1' if the word is present or not (creating a dummy variable for regression). Once that has been completed, I want to be able to merge some of those columns together (ie take 'parking' 'Parking' 'garage' 'Garage' columns and merge them into one that includes all parking and garages). I'm assuming that R is sensitive to upper and lowercase characters but even if not, I need to be able to merge 'parking' and 'garage' together, for example.
This is for a hedonic pricing method so I need as many property characteristic variables as possible.
I don't know how to create new dummy variables or merge them into one column once I have, so am struggling. Would appreciate any help.

Is this what you are looking for?
library(tidyverse)
data.frame(txt) %>%
# tidy up `txt`:
mutate(txt = gsub("(?![, ])\\W", "", txt, perl = TRUE)) %>%
# split into rows
separate_rows(txt, sep = ",") %>%
# extract keywords matched:
mutate(keywords = str_extract(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony"))
# A tibble: 19 × 2
txt keywords
<chr> <chr>
1 "Stunning seaside location" NA
2 " 24hour emergency call system and secure video entry" NA
3 " Mature landscaped gardens with large terraces and seating areas" garden
4 " Walk out balconies to selected apartments" NA
5 " Beautifully decorated homeowners8099 lounge" NA
6 " Parking spaces and car ports are available via an annual permit" Parking
7 " Wheelchair access" NA
8 " Lifts to all floors" NA
9 " Fire detection" NA
10 " Intruder alarm" NA
11 " Village Location" NA
12 " 4 Bedrooms" NA
13 " Gardens" Garden
14 " Balcony" Balcony
15 " On streetresidents parking" parking
16 " Central heating" NA
17 " Double glazing" NA
18 " Fireplace" Fireplace
19 " Ruralsecluded" NA
Data:
txt <- '"[\"Stunning seaside location\", \"24-hour emergency call system and secure video entry\", \"Mature landscaped gardens with large terraces and seating areas\", \"Walk out balconies to selected apartments\", \"Beautifully decorated homeownersâ\200\231 lounge\", \"Parking spaces and car ports are available via an annual permit\", \"Wheelchair access\", \"Lifts to all floors\", \"Fire detection\", \"Intruder alarm\"]", "[\"Village Location, 4 Bedrooms, Garden(s)\"]", "[\"Balcony\", \"On street/residents parking\", \"Central heating\", \"Double glazing\", \"Fireplace\", \"Rural/secluded\"]"'
If there may be more than 1 keyword per substring, then use str_extract_all in this way:
data.frame(txt) %>%
mutate(txt = gsub("(?![, ])\\W", "", txt, perl = TRUE)) %>%
separate_rows(txt, sep = ",") %>%
mutate(keywords = str_extract_all(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony")) %>%
unnest(where(is.list), keep_empty = TRUE)
EDIT:
If the OP is looking to obtain a variable for each keyword, then this works:
data.frame(txt) %>%
mutate(txt = gsub("(?![, /])\\W", "", txt, perl = TRUE)) %>%
separate_rows(txt, sep = ", ") %>%
mutate(keywords = str_extract_all(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony")) %>%
# unnest listed items:
unnest(where(is.list), keep_empty = TRUE) %>%
# capitalize initial letter:
mutate(keywords = sub("^(.)", "\\U\\1", keywords, perl = TRUE)) %>%
# cast each keaword into its own column:
pivot_wider(names_from = keywords, values_from = keywords,
values_fn = function(x) 1, values_fill = 0)
# A tibble: 19 × 6
txt `NA` Garden Parking Balcony Fireplace
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Stunning seaside location 1 0 0 0 0
2 24hour emergency call system and secure video entry 1 0 0 0 0
3 Mature landscaped gardens with large terraces and seating areas 0 1 0 0 0
4 Walk out balconies to selected apartments 1 0 0 0 0
5 Beautifully decorated homeowners8099 lounge 1 0 0 0 0
6 Parking spaces and car ports are available via an annual permit 0 0 1 0 0
7 Wheelchair access 1 0 0 0 0
8 Lifts to all floors 1 0 0 0 0
9 Fire detection 1 0 0 0 0
10 Intruder alarm 1 0 0 0 0
11 Village Location 1 0 0 0 0
12 4 Bedrooms 1 0 0 0 0
13 Gardens 0 1 0 0 0
14 Balcony 0 0 0 1 0
15 On street/residents parking 0 0 1 0 0
16 Central heating 1 0 0 0 0
17 Double glazing 1 0 0 0 0
18 Fireplace 0 0 0 0 1
19 Rural/secluded 1 0 0 0 0

Problem with webscraping selector gadget and rvest

I'm trying to scrape data from https://3g.dxy.cn/newh5/view/pneumonia , using SelectorGadget and rvest
I successfully scrape some text in the page with the following code.
library(rvest)
url <- 'https://3g.dxy.cn/newh5/view/pneumonia'
webpage <- read_html(url)
TEXT_html <- html_nodes(webpage,'.descText___Ui3tV')
TEXT <- html_text(TEXT_html)
But when I try to select the table where the most important data (the number of people infected in the table) see selection using the following code
TABLE_html <- html_nodes(webpage,'.areaBlock1___3V3UU p')
TABLE <- html_text(TABLE_html)
the output is "character 0"
I guess it's because the data in the table can't be seen because they are refreshed via API, but I don't really know how to solve this
Anybody has an idea? thank you very much

On this page, the data isn't retrieved seperately from an API. It is actually present in the html page you downloaded, but it is in JSON format inside a script tag and the reason why rvest can't read it is that the data is only added to the DOM by Javascript after the page loads. To get at the data you need to extract and parse the JSON:
library(rvest)
library(tibble)
library(jsonlite)
data <- 'https://3g.dxy.cn/newh5/view/pneumonia' %>%
read_html() %>%
html_node('#getAreaStat') %>% # This is the tag containing the JSON
html_text() %>% # Get the javascript from the node
strsplit("(getAreaStat = )|(}catch)") %>% # Carve out the JSON
unlist() %>%
`[`(2) %>% # Unlist and extract the JSON
fromJSON() # Parse the JSON
Now data is a data frame containing all the information from the JSON. However, the last column of data is actually a list of city-level data frames. Since these all have the same column names they can be bound together with rbind. The final column can then be removed from data so you have a data frame of province-level data and another one of city-level data.
city_data <- as_tibble(do.call(rbind, data$cities))
province_data <- as_tibble(data[, -8])
So province_data looks like a bit like this (the Chinese symbols haven't copied over but appear in the R console)
province_data
#> # A tibble: 33 x 7
#> provinceName provinceShortNa~ confirmedCount suspectedCount curedCount deadCount
#> <chr> <chr> <int> <int> <int> <int>
#> 1 ??? ?? 2714 0 52 100
#> 2 ??? ?? 207 0 4 0
#> 3 ??? ?? 173 0 3 0
#> 4 ??? ?? 168 0 0 1
#> 5 ??? ?? 143 0 0 0
#> 6 ??? ?? 132 0 0 0
#> 7 ??? ?? 106 0 0 0
#> 8 ??? ?? 95 0 0 0
#> 9 ??? ?? 91 0 2 1
#> 10 ??? ?? 90 0 0 0
#> # ... with 23 more rows, and 1 more variable: comment <chr>
and city_data looks like this (again, with cityName printed properly in the console).
#> # A tibble: 329 x 5
#> cityName confirmedCount suspectedCount curedCount deadCount
#> <chr> <int> <int> <int> <int>
#> 1 ?? 1590 0 47 85
#> 2 ?? 213 0 2 4
#> 3 ?? 173 0 0 1
#> 4 ?? 114 0 0 3
#> 5 ?? 91 0 0 0
#> 6 ?? 71 0 1 2
#> 7 ?? 70 0 0 0
#> 8 ?? 70 0 0 0
#> 9 ?? 65 0 0 0
#> 10 ?? 57 0 0 0
#> # ... with 319 more rows

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
now i require output as
KId sales_month quantity_sold result
100 1 0 1
100 2 0 1
100 3 0 1
496 2 6 1
511 2 10 1
846 1 4 1
846 2 6 1
846 3 1 0
338 1 6 1
338 2 0 1
Here, the calculation has to go as such if quantity sold for the month of march(3) is less than 60% of two months January(1) and February(2) quantity sold then the result should be 1 or else it should display 0. Require solution to perform this.
Thanks in advance.

If I understand well, your requirement is to compare sold quantity in month t with the sum of quantity sold in months t-1 and t-2. If so, I can suggest using dplyr package that offer the nice feature of grouping rows and mutating columns in your data frame.
resultData <- group_by(data, KId) %>%
arrange(sales_month) %>%
mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
group_by(KId, sales_month) %>%
mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
select(KId,sales_month, quantity_sold, result)
The result is as below:
Adding
select(KId,sales_month, quantity_sold, result)
at the end let us display only columns we care about (and not all these intermediate steps).
I believe this should satisfy your requirement. NA is the result column are due to 0/0 division or no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add year column and adjust group_by() arguments appropriately.
For more information on dplyr package, follow this link

Optimization of an R loop taking 18 hours to run

I've got an R code that works and does what I want but It takes a huge time to run. Here is an explanation of what the code does and the code itself.
I've got a vector of 200000 line containing street adresses (String) : data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a matrix of 131x2 string elements which are 5grams (part of word) and the ids of the bags of NGrams (example of a 5Grams bag : ["stack", "tacko", "ackov", "ckover", ",overf", ... ] ) : list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I have also a 200000x31 numerical matrix initialized with 0 : idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop the string addresses and check whether it contains one of the n-grams in my list or not. If it does, I put one in the corresponding column which represents the id of the bag that contains the 5-gram.
Example :
In this address : "15 rue andre lalande residence marguerite yourcenar 91000 evry france". The word "residence" exists in the bag ["resid","eside","dence",...] which the id is 5. So I'm gonna put 1 in the column called 5. Therefore the corresponding line "idv_x_bags" matrix will look like the following :
> idv_x_sacs[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does :
idv_x_sacs <- matrix(rep(0,nrow(data)*31),nrow=nrow(data),ncol=31)
colnames(idv_x_sacs) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for(i in 1:nrow(idv_x_bags))
{
for(ngram in list_ngrams$ngram)
{
if(grepl(ngram,data[i,])==TRUE)
{
idSac <- sqldf(sprintf("select idSac from list_ngramswhere ngram='%s'",ngram))[[1]]
idv_x_bags[i,as.character(idSac)] <- 1
}
}
}
The code does perfectly what I aim to do, but it takes about 18 hours which is huge. I tried to recode it with c++ using Rcpp library but I encountered many problems. I'm tried to recode it using apply, but I couldn't do it.
Here is what I did :
apply(cbind(data,1:nrow(data),1,function(x){
apply(list_ngrams,1,function(y){
if(grepl(y[2],x[1])==TRUE){idv_x_bags[x[2],str_trim(as.character(y[1]))]<-1}
})
})
I need some help with coding my loop using apply or some other method that run faster that the current one. Thank you very much.

Check this one and run the simple example step by step to see how it works.
My N-Grams don't make much sense, but it will work with actual N_Grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
data.frame() %>%
rename(sen = Var1,
id_ngr = Var2) %>%
left_join(dt_ngr, by = "id_ngr") %>%
group_by(sen, id_ngr,ngr) %>%
do(data.frame(match = grepl(.$ngr,.$sen))) %>%
group_by(sen,id_ngr) %>%
summarise(sum_success = sum(match)) %>%
mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1

using graph.adjacency() in R

I have a sample code in R as follows:
library(igraph)
rm(list=ls())
dat=read.csv(file.choose(),header=TRUE,row.names=1,check.names=T) # read .csv file
m=as.matrix(dat)
net=graph.adjacency(adjmatrix=m,mode="undirected",weighted=TRUE,diag=FALSE)
where I used csv file as input which contain following data:
23732 23778 23824 23871 58009 58098 58256
23732 0 8 0 1 0 10 0
23778 8 0 1 15 0 1 0
23824 0 1 0 0 0 0 0
23871 1 15 0 0 1 5 0
58009 0 0 0 1 0 7 0
58098 10 1 0 5 7 0 1
58256 0 0 0 0 0 1 0
After this I used following command to check weight values:
E(net)$weight
Expected output is somewhat like this:
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
But I'm getting weird values (and every time different):
> E(net)$weight
[1] 2.121996e-314 2.121996e-313 1.697597e-313 1.291034e-57 1.273197e-312 5.092790e-313 2.121996e-314 2.121996e-314 6.320627e-316 2.121996e-314 1.273197e-312 2.121996e-313
[13] 8.026755e-316 9.734900e-72 1.273197e-312 8.027076e-316 6.320491e-316 8.190221e-316 5.092790e-313 1.968065e-62 6.358638e-316
I'm unable to find where and what I am doing wrong?
Please help me to get the correct expected result and also please tell me why is this weird output and that too every time different when I run it.??
Thanks,
Nitin

Just a small working example below, much clearer than CSV input.
library('igraph');
adjm1<-matrix(sample(0:1,100,replace=TRUE,prob=c(0.9,01)),nc=10);
g1<-graph.adjacency(adjm1);
plot(g1)
P.s. ?graph.adjacency has a lot of good examples (remember to run library('igraph')).
Related threads
Creating co-occurrence matrix
Co-occurrence matrix using SAC?

The problem seems to be due to the data-type of the matrix elements. graph.adjacency expects elements of type numeric. Not sure if its a bug.
After you do,
m <- as.matrix(dat)
set its mode to numeric by:
mode(m) <- "numeric"
And then do:
net <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

CSS Selector found with rvest and not with RSelenium - css

Related

Search for and extract words into new column

Problem with webscraping selector gadget and rvest

Find a function to return value based on condition using R

Optimization of an R loop taking 18 hours to run

using graph.adjacency() in R

Categories

Resources