Web Scraping in R with loop from data.frame - r

library(rvest)
df <- data.frame(Links = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"))
for(i in 1:3) {
webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
data <- webpage %>%
html_nodes(".specs") %>%
.[[1]] %>%
html_table(fill = TRUE)
}
want to make loop works for all 3 values in df$Links but above code just download the last one, and downloaded data must also be identical with variables (may be a new column with variables name)

The problem is in how you're structuring your for loop. It's much easier just to not use one in the first place, though, as R has great support for iterating over lists, like lapply and purrr::map. One version of how you could structure your data:
library(tidyverse)
library(rvest)
base_url <- "https://www.whatmobile.com.pk/"
models <- data_frame(model = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"),
link = paste0(base_url, model),
page = map(link, read_html))
model_specs <- models %>%
mutate(node = map(page, html_node, '.specs'),
specs = map(node, html_table, header = TRUE, fill = TRUE),
specs = map(specs, set_names, c('var1', 'var2', 'val1', 'val2'))) %>%
select(model, specs) %>%
unnest()
model_specs
#> # A tibble: 119 x 5
#> model var1 var2
#> <chr> <chr> <chr>
#> 1 Qmobile_Noir-M6 Build OS
#> 2 Qmobile_Noir-M6 Build Dimensions
#> 3 Qmobile_Noir-M6 Build Weight
#> 4 Qmobile_Noir-M6 Build SIM
#> 5 Qmobile_Noir-M6 Build Colors
#> 6 Qmobile_Noir-M6 Frequency 2G Band
#> 7 Qmobile_Noir-M6 Frequency 3G Band
#> 8 Qmobile_Noir-M6 Frequency 4G Band
#> 9 Qmobile_Noir-M6 Processor CPU
#> 10 Qmobile_Noir-M6 Processor Chipset
#> # ... with 109 more rows, and 2 more variables: val1 <chr>, val2 <chr>
The data is still pretty messy, but at least it's all there.

it is capturing all three values, but it writes over them with each loop. That's why it only shows one value, and that one value being for the last page
You need to initialise a variable first before you go into your loop, I suggest a list so you can store data for each successive loop. So something like
final_table <- list()
for(i in 1:3) {
webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
data <- webpage %>%
html_nodes(".specs") %>%
.[[1]] %>%
html_table(fill= TRUE)
final_table[[i]] <- data.frame(data, stringsAsFactors = F)
}
In this was, it appends new data to the list with each loop.

Related

Extracting JSON data with asymetric content from a dataframe column in R

I loaded a table from a database which contains a column that has JSON data in each row.
The table looks something like the example below. (I was not able to replicate the data.frame I have, due to the format of the column data)
dataframe_example <- data.frame(id = c(1,2,3),
name = c("name1","name2","name3"),
JSON_col = c({"_inv": [10,20,30,40]}, "_person": ["_personid": "green"],
{"_inv": [15,22]}, "_person": ["_personid": "blue"],
{"_inv": []}, "_person": ["_personid": "red"]))
I have the following two issues:
Some of the items (e.g. "_inv") sometimes have the full 4 numeric entries, sometimes less, and sometimes nothing. Some of the other items (e.g. "_person") usually contain another header, but only one character data point.
My goal is to preserve the existing dataframes colums (such as id and name) and spread the data in the json column such that I have new columns containing each point of information. The target dataframe would look a little like this:
data.frame(id = c(1,2,3),
name = c("name1","name2","name3"),
`_inv_1` = c(10,15,NA),
`_inv_2` = c(20,22,NA),
`_inv_3` = c(30,NA,NA),
`_inv_4` = c(40,NA,NA),
`_person_id` = c("green","blue","red"))
Please bear in mind that I have very little experience handling JSON data and no experience dealing with uneven JSON data.
Using purrr I got:
frame <- purrr::map(dataframe_example$JSON_col, jsonlite::fromJSON)
This gave me a large list with n elements, where n is the length of the original dataframe. The "Name" item contains n lists [[1]], each one with its own type of object, ranging from double to data.frame. The double object contain four numeric observations, (such as _inv), some of the objects are lists themselves (such as _person), which further contains "_personid" and then a single entry. The dataframe contains the datetime stamps for each observation in the JSON data. (each _inv item has a timestamp)
Is there a way to obtain the solution above, either by extracting the data from my "frame" object, or an altogether different solution?
library(tidyverse)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
dataframe_example <-
data.frame(
id = c(1, 2, 3),
name = c("name1", "name2", "name3"),
JSON_col = c(
"{\"_inv\": [10,20,30,40], \"_person\": {\"_personid\": \"green\"}}",
"{\"_inv\": [15,22], \"_person\": {\"_personid\": \"blue\"}}",
"{\"_inv\": [], \"_person\": {\"_personid\": \"red\"}}"
)
)
dataframe_example %>%
as_tibble() %>%
mutate(
JSON_col = JSON_col %>% map(parse_json)
) %>%
unnest_wider(JSON_col) %>%
unnest(`_inv`) %>%
unnest(`_inv`) %>%
unnest(`_person`) %>%
unnest(`_person`) %>%
group_by(id, name) %>%
mutate(inv_id = row_number()) %>%
pivot_wider(names_from = inv_id, values_from = `_inv`, names_prefix = "_inv_")
#> # A tibble: 2 x 7
#> # Groups: id, name [2]
#> id name `_person` `_inv_1` `_inv_2` `_inv_3` `_inv_4`
#> <dbl> <chr> <chr> <int> <int> <int> <int>
#> 1 1 name1 green 10 20 30 40
#> 2 2 name2 blue 15 22 NA NA
Created on 2021-11-25 by the reprex package (v2.0.1)

For loop over a group_by when passing in a variable name

I'm trying to program an R function that will extract the proportion of positives in a community based on column value. More concretely, I have a dataset where each row is an individual. To simplify, column 1-5 has information about their individual characteristics, while column 6 has zip code, column 7 has the phone number they called to report a positive, column 8 has the day of the week, and column 9 has state. The goal is to calculate the proportion and number of positives at the aggregate level for zip code, phone number, day of the week, and state. For any one category, I successfully used code from https://edwinth.github.io/blog/dplyr-recipes/ to build a group and summarize function (below). Feed in a dataframe and column name and it will group by the distinct values on that column and summarize the count and proportion of positives.
group_and_summarize <- function(x, ...) {
grouping = rlang::quos(...)
temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(positive, na.rm = TRUE), number = n())
temp = temp %>% filter(!is.na(!!!grouping))
colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
return(temp)
}
Problem is, that code fails entirely when I try to aggregate across multiple columns. I currently have four fields to group over, but once the data are fully gathered, I expect to have ~15 columns. My strategy here is to store each of those as a separate element of a list for later use. I tried to use
output = vector(mode = "list", length = length(aggregate_cols)) #aggregate_cols lists columns needing count and proportion.
#aggregate_cols = c("ZIP_CODE", "PHONE_NUMBER", "DAY", "STATE")
for(i in 1:length(aggregate_cols)){
output[i] = group_and_summarize(df,aggregate_cols[i])
}
but got the following error message
Warning messages:
1: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
number of items to replace is not a multiple of replacement length
2: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
number of items to replace is not a multiple of replacement length
3: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
number of items to replace is not a multiple of replacement length
4: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
number of items to replace is not a multiple of replacement length
Testing for the first value
> i=1
> group_and_summarize(df,aggregate_cols[i])
# A tibble: 1 x 3
`aggregate_cols[i]` proportion number
<chr> <dbl> <int>
1 ZIP_CODE 0.168 5600
Any ideas how to resolve this? I can't think of a good way involving map or the apply family of functions, although I'd be open to those.
EDIT:
Reproducible code is below.
group_and_summarize_demo <- function(x, ...) {
grouping = quos(...)
temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(am, na.rm = TRUE), number = n())
temp = temp %>% filter(!is.na(!!!grouping))
colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
return(temp)
}
cars_cols = c("gear", "cyl")
output = vector(mode = "list", length = length(cars_cols))
for(i in 1:length(cars_cols)){
output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
}
> group_and_summarize_demo(mtcars, cyl)
# A tibble: 3 x 3
cyl cyl_proportion cyl_count
<dbl> <dbl> <int>
1 4 0.727 11
2 6 0.429 7
3 8 0.143 14
> cars_cols = c("gear", "cyl")
> output = vector(mode = "list", length = length(cars_cols))
> for(i in 1:length(cars_cols)){
+ output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
+ }
Show Traceback
Rerun with Debug
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "function"
> cars_cols[1]
[1] "gear"
> group_and_summarize_demo(mtcars, cars_cols[1])
# A tibble: 1 x 3
`cars_cols[1]` `cars_cols[1]_proportion` `cars_cols[1]_count`
<chr> <dbl> <int>
1 gear 0.406 32
I don't understand why this is different from running group_and_summarize_demo(mtcars,cyl); I suspect understanding that will address this bug.
Outside your loop, you are passing the name directly to the function:
group_and_summarize_demo(mtcars, cyl)
However, inside your loop, you are passing the name as a string:
group_and_summarize_demo(mtcars, "cyl") #error
Indeed, using a string is easier in this setting. In order for this to work, you should not use quos() but syms():
group_and_summarize_demo <- function(x, ..., quosure=TRUE) {
if(quosure)
grouping = quos(...)
else
grouping = syms(...)
temp = x %>%
group_by(!!!grouping) %>%
summarise(proportion = mean(am, na.rm = TRUE), number = n())
temp = temp %>% filter(!is.na(!!!grouping))
colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
return(temp)
}
group_and_summarize_demo(mtcars, cyl)
group_and_summarize_demo(mtcars, "cyl", quosure=F)
Obviously, in your final code you should choose one of these and stick to it.
EDIT:
If you only pass one variable at a time, using the ellipsis looks quite overkill and makes things complicated. Moreover, your example did not seem to work with several variables (group_and_summarize_demo(mtcars, cyl, vs)). You might want to consider these few improvements:
library(tidyverse)
group_and_summarize_demo <- function(x, gp_col) {
gp_col = sym(gp_col)
temp = x %>%
group_by(!!gp_col) %>%
summarise("{{gp_col}}_proportion" := mean(am, na.rm = TRUE),
"{{gp_col}}_count" := n()) %>%
filter(!is.na(!!gp_col))
temp
}
c("gear", "cyl") %>%
map(~group_and_summarize_demo(mtcars, .x)) #try map_dfc() also
#> [[1]]
#> # A tibble: 3 x 3
#> gear gear_proportion gear_count
#> <dbl> <dbl> <int>
#> 1 3 0 15
#> 2 4 0.667 12
#> 3 5 1 5
#>
#> [[2]]
#> # A tibble: 3 x 3
#> cyl cyl_proportion cyl_count
#> <dbl> <dbl> <int>
#> 1 4 0.727 11
#> 2 6 0.429 7
#> 3 8 0.143 14
Created on 2021-04-27 by the reprex package (v2.0.0)
Here, I used the templating feature of dplyr::summarise() using the := operator. I also used purrr::map() instead of the for loop, where the iteration is noted .x.

How to extract specific string in R and puts into another column?

I have data like this, below are the 3 rows from my data set:
total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;
Expected output as below,
total free used shared buffers cached
7871MB 5711MB 2159MB 0MB 304MB 1059MB
5751MB 71MB 5MB 3159MB 30MB 1059MB
5751MB 109MB 5MB 3159MB 30MB 1059MB
and the problem here is I want to make different columns using above data like total value, free value, used value, shared value.
I can do that by splitting using ; but in other rows values are getting shuffled, like first value coming as free then total followed by other values,
Is there any way using REGEX in , if we find total get value till ; and put into one column, if we find free get value till ; and put into another column?
Here is one possibility using strsplit.
df <- as.data.frame(matrix(unlist(lapply(strsplit(x, ";"), strsplit, "=")), nrow = 2))
colnames(df) = df[1,]
df = df[-1,]
df
# total free used shared buffers cached
# 2 7871MB 5711MB 2159MB 0MB 304MB 1059MB
Edit
I don't know how your data are structured. But you can do something like the following:
x <- "total=7871MB;free=5711MB;used=2159MB;shared=0MB; buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"
x %>% str_split("\n") %>% unlist() %>% as_tibble() %>%
mutate(total = str_extract(value, "total=(.*?)MB;"),
free = str_extract(value, "free=(.*?)MB;"),
used = str_extract(value, "used=(.*?)MB;"),
shared = str_extract(value, "shared=(.*?)MB;"),
buffers = str_extract(value, "buffers=(.*?)MB;"),
cached = str_extract(value, "cached=(.*?)MB;")) %>%
select(-value) %>%
mutate_all(~as.numeric(str_extract(.,"[[:digit:]]+")))
# # A tibble: 3 x 6
# total free used shared buffers cached
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 7871. 5711. 2159. 0. 304. 1059.
# 2 5751. 71. 5. 3159. 30. 1059.
# 3 5751. 109. 5. 3159. 30. 1059.
We can try using strsplit followed by sub to separate the data from the labels. Then, create a data frame using this data:
x <- 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;'
y <- unlist(strsplit(x, ';'))
names <- sapply(y, function(x) gsub("=.*$", "", x))
data <- sapply(y, function(x) gsub(".*=", "", x, perl=TRUE))
df <- data.frame(names=names, data=data)
df
Demo

Opposite of unnest_tokens after creating dummy variable

library(NLP)
library(tm)
library(tidytext)
library(tidyverse)
library(topicmodels)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
#sample dataset
tags <- c("product, productdesign, electronicdevice")
web <- c("hardware, sunglasses, eyeware")
tags2 <- data_frame(tags, web, stringsAsFactors = FALSE)
#tokenize the words
toke <- tags2 %>%
unnest_tokens(word, tags)
toke
#create a dummy variable
toke2 <- toke%>% mutate(
product = ifelse(str_detect(word, "^product$"), "1", "0"))
#unnest the toke
nested_toke <- toke2 %>%
nest(word) %>%
mutate(text = map(data, unlist),
text = map_chr(text, paste, collapse = " "))
nested_toke %>%
select(text)
When I nest the column of tokenized words after creating the dummy variable based on the string "product" it seems to be inserting "product" into a new row below the original row where "product" was located.
product underlined should be in the row above
When you add a new column after unnesting, you have to think about what to do with it if you want to nest again. Let's work through it and see what we're talking about.
library(tidyverse)
tags <- c("product, productdesign, electronicdevice")
web <- c("hardware, sunglasses, eyeware")
tags2 <- data_frame(tags, web)
library(tidytext)
tidy_tags <- tags2 %>%
unnest_tokens(word, tags)
tidy_tags
#> # A tibble: 3 x 2
#> web word
#> <chr> <chr>
#> 1 hardware, sunglasses, eyeware product
#> 2 hardware, sunglasses, eyeware productdesign
#> 3 hardware, sunglasses, eyeware electronicdevice
So that is your data set unnested, converted to a tidy form. Next, let's add the new column that detects whether the word "product" is in the word column.
tidy_product <- tidy_tags %>%
mutate(product = ifelse(str_detect(word, "^product$"),
TRUE,
FALSE))
tidy_product
#> # A tibble: 3 x 3
#> web word product
#> <chr> <chr> <lgl>
#> 1 hardware, sunglasses, eyeware product T
#> 2 hardware, sunglasses, eyeware productdesign F
#> 3 hardware, sunglasses, eyeware electronicdevice F
Now think about what your options are for nesting again. If you nest again without taking into account the new column (nest(word)) the structure has a NEW COLUMN and will have to make a NEW ROW to account for the two different values that can take. You could instead do something like nest(word, product) but then the TRUE/FALSE values will end up in your text string. If you are wanting to get back to the original text format, you need to remove the new column you created, because having it there changes the relationships between rows and columns.
nested_product <- tidy_product %>%
select(-product) %>%
nest(word) %>%
mutate(text = map(data, unlist),
text = map_chr(text, paste, collapse = ", "))
nested_product
#> # A tibble: 1 x 3
#> web data text
#> <chr> <list> <chr>
#> 1 hardware, sunglasses, eyeware <tibble [3 × 1]> product, productdesign, …
Created on 2018-02-22 by the reprex package (v0.2.0).

fixing incompatible types error in R using dplyr/mutate

I'm trying to use the tidyverse/dplyr package in R to work with data including vectorized calls to an online API (from Altmetric) to add rows using mutate.
The smallest code I can create that reproduces the error is that below. I get the error "Error: incompatible types, expecting a numeric vector"
library(tidyverse)
library(jsonlite)
fromJSON_wrapper <- function(x,y) {
fromJSON(x)[[c(y)]]
}
toy <- tibble(
doi = c("10.1002/anie.201500251", "10.1080/19443994.2015.1005695", "10.1007/s13721-015-0095-0"),
url = c("https://api.altmetric.com/v1/doi/10.1002/anie.201500251", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695")
)
extracted <- toy %>% rowwise() %>% mutate(score = fromJSON_wrapper(url,"score"))
The code for extracting a single score below works, whether just using the wrapper or on a one row tibble and I'm not sure why my code isn't working.
fromJSON_wrapper("https://api.altmetric.com/v1/doi/10.1007/s13721-015-0095-0")
extracted <- toy[1,] %>% rowwise() %>% mutate(score = fromJSON_wrapper(url, "score"))
Any suggestions would be appreciated.
It's simpler to just iterate over the vector of URLs and extract what you need. purrr::map_dbl makes this simple, though sapply would work fine, too.
library(tidyverse)
toy <- tibble(
doi = c("10.1002/anie.201500251", "10.1080/19443994.2015.1005695", "10.1007/s13721-015-0095-0"),
url = c("https://api.altmetric.com/v1/doi/10.1002/anie.201500251", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695", "https://api.altmetric.com/v1/doi/10.1080/19443994.2015.1005695")
)
extracted <- toy %>% mutate(score = map_dbl(url, ~jsonlite::fromJSON(.x)$score))
extracted %>% select(doi, score)
#> # A tibble: 3 × 2
#> doi score
#> <chr> <dbl>
#> 1 10.1002/anie.201500251 0.25
#> 2 10.1080/19443994.2015.1005695 1.00
#> 3 10.1007/s13721-015-0095-0 1.00

Resources