R For loop unwanted overwrite - r

I would like every result of the loop in a different text(somename).
Right now the loop overwrites;
library(rvest)
main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
urls <- main.page %>% # feed `main.page` to the next step
html_nodes(".alt:nth-child(2) strong a") %>% # get the CSS nodes
html_attr("href") # extract the URLs
for (i in urls){
a01 <- paste0("http://www.imdb.com",i)
text <- read_html(a01) %>% # load the page
html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>% # isloate the text
html_text()
}
How could I code it in such a way that the 'i' from the list is added tot text in the for statement?

To solidify my comment:
main.page <- read_html(x = "http://www.imdb.com/event/ev0000681/2016")
urls <- main.page %>% # feed `main.page` to the next step
html_nodes(".alt:nth-child(2) strong a") %>% # get the CSS nodes
html_attr("href") # extract the URLs
texts <- sapply(head(urls, n = 3), function(i) {
read_html(paste0("http://www.imdb.com", i)) %>%
html_nodes(".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop , .summary_text+ .credit_summary_item .itemprop") %>%
html_text()
}, simplify = FALSE)
str(texts)
# List of 3
# $ /title/tt5843990/: chr [1:4] "Lav Diaz" "Charo Santos-Concio" "John Lloyd Cruz" "Michael De Mesa"
# $ /title/tt4551318/: chr [1:4] "Andrey Konchalovskiy" "Yuliya Vysotskaya" "Peter Kurth" "Philippe Duquesne"
# $ /title/tt4550098/: chr [1:4] "Tom Ford" "Amy Adams" "Jake Gyllenhaal" "Michael Shannon"
If you use lapply(...), you'll get an unnamed list, which may or may not be a problem for you. Instead, using sapply(..., simplify = FALSE), we get a named list where each name is (in this case) the partial url retrieved from urls.
Using sapply without simplify can lead to unexpected outputs. As an example:
set.seed(9)
sapply(1:3, function(i) rep(i, sample(3, size=1)))
# [1] 1 2 3
One may think that this will always return a vector. However, if any of the single elements returned is not the same length (for instance) as the others, then the vector becomes a list:
set.seed(10)
sapply(1:3, function(i) rep(i, sample(3, size=1)))
# [[1]]
# [1] 1 1
# [[2]]
# [1] 2
# [[3]]
# [1] 3 3
In which case, it's best to have certainty in the return value, forcing a list:
set.seed(9)
sapply(1:3, function(i) rep(i, sample(3, size=1)), simplify = FALSE)
# [[1]]
# [1] 1
# [[2]]
# [1] 2
# [[3]]
# [1] 3
That way, you always know exactly how to reference sub-returns. (This is one of the tenets and advantages to Hadley's purrr package: each function always returns a list of exactly the type you declare. (There are other advantages to the package.)

Related

Apply regmatches function to a list of chr in R

I have this list of character stored in a variable called x:
x <-
c(
"images/logos/france2.png",
"images/logos/cnews.png",
"images/logos/lcp.png",
"images/logos/europe1.png",
"images/logos/rmc-bfmtv.png",
"images/logos/sudradio.png",
"images/logos/franceinfo.png"
)
pattern <- "images/logos/\\s*(.*?)\\s*.png"
regmatches(x, regexec(pattern, x))[[1]][2]
I wish to extract a portion of each chr string according to a pattern, like this function does, which works fine but only for the first item in the list.
pattern <- "images/logos/\\s*(.*?)\\s*.png"
y <- regmatches(x, regexec(pattern, x))[[1]][2]
Only returns:
"france2"
How can I apply the regmatches function to all items in the list in order to get a result like this?
[1] "france2" "europe1" "sudradio"
[4] "cnews" "rmc-bfmtv" "franceinfo"
[7] "lcp" "rmc" "lcp"
FYI this is a list of src tags that comes from a scraper
Try gsub
gsub(
".*/(.*)\\.png", "\\1",
c(
"images/logos/france2.png", "images/logos/cnews.png",
"images/logos/lcp.png", "images/logos/europe1.png",
"images/logos/rmc-bfmtv.png", "images/logos/sudradio.png",
"images/logos/franceinfo.png"
)
)
which gives
[1] "france2" "cnews" "lcp" "europe1" "rmc-bfmtv"
[6] "sudradio" "franceinfo"
Output of regmatches(..., regexec(...)) is a list. You may use sapply to extract the 2nd element from each element of the list.
sapply(regmatches(x, regexec(pattern, x)), `[[`, 2)
#[1] "france2" "europe1" "sudradio" "cnews" "rmc-bfmtv" "franceinfo"
#[7] "lcp" "rmc" "lcp"
You may also use the function basename + file_path_sans_ext from tools package which would give the required output directly.
tools::file_path_sans_ext(basename(x))
#[1] "france2" "europe1" "sudradio" "cnews" "rmc-bfmtv" "franceinfo"
#[7] "lcp" "rmc" "lcp"
A possible solution:
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
strings = c("images/logos/france2.png","images/logos/cnews.png",
"images/logos/lcp.png","images/logos/europe1.png",
"images/logos/rmc-bfmtv.png","images/logos/sudradio.png",
"images/logos/franceinfo.png")
)
df %>%
mutate(strings = str_remove(strings, "images/logos/") %>%
str_remove("\\.png"))
#> strings
#> 1 france2
#> 2 cnews
#> 3 lcp
#> 4 europe1
#> 5 rmc-bfmtv
#> 6 sudradio
#> 7 franceinfo
Or even simpler:
library(tidyverse)
df %>%
mutate(strings = str_extract(strings, "(?<=images/logos/)(.*)(?=\\.png)"))
#> strings
#> 1 france2
#> 2 cnews
#> 3 lcp
#> 4 europe1
#> 5 rmc-bfmtv
#> 6 sudradio
#> 7 franceinfo

Extract certain words from dynamic strings vector

I'm working with questionnaire datasets where I need to extract some brands' names from several questions. The problem is each data might have a different question line, for example:
Data #1
What do you know about AlphaToy?
Data #2
What comes to your mind when you heard AlphaCars?
Data #3
What do you think of FoodTruckers?
What I want to extract are the words AlphaToy, AlphaCars, and FoodTruckers. In Excel, I can get those brands' names via flash fill, the illustration is below.
As I working with R, I need to convert the "flash fill" step into an R function, yet I couldn't found out how to do it. Here's desired output:
brandName <- list(
Toy = c(
"1. What do you know about AlphaToy?",
"2. What do you know about BetaToyz?",
"3. What do you know about CharlieDoll?",
"4. What do you know about DeltaToys?",
"5. What do you know about Echoty?"
),
Car = c(
"18. What comes to your mind when you heard AlphaCars?",
"19. What comes to your mind when you heard BestCar?",
"20. What comes to your mind when you heard CoolCarz?"
),
Trucker = c(
"5. What do you think of FoodTruckers?",
"6. What do you think of IceCreamTruckers?",
"7. What do you think of JellyTruckers?",
"8. What do you think of SodaTruckers?"
)
)
extractBrandName <- function(...) {
#some codes here
}
#desired output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
As the title says, the function should work to dynamic strings, so when the function is applied to brandName the desired output is:
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Edit:
The brand name can be in lowercase, uppercase, or even two words or more, for instance: IBM, Louis Vuitton
The brand names might appear in the middle of the sentence, it's not always come at the end of the sentence. The thing is, the sentences are unpredictable because each client might provide different data of each other
Can anyone help me with the function code to achieve the desired output? Thank you in advance!
Edit, here's attempt
The idea (thanks to shs' answer) is to find similar words from the input, then exclude them leaving the unique words (it should be the brand names) behind. Following this post, I use intersect() wrapped inside a Reduce() to get the common words, then I exclude them via lapply() and make sure any two or more words brand names merged together with str_c(collapse = " ").
Code
library(stringr)
extractBrandName <- function(x) {
cleanWords <- x %>%
str_remove_all("^\\d+|\\.|,|\\?") %>%
str_squish() %>%
str_split(" ")
commonWords <- cleanWords %>%
Reduce(intersect, .)
extractedWords <- cleanWords %>%
lapply(., function(y) {
y[!y %in% commonWords] %>%
str_c(collapse = " ")
}) %>% unlist()
return(extractedWords)
}
Output (1st test case)
> #output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Output (2nd test case)
This test case includes two or more words brand names, located at the middle and the beginning of the sentence.
brandName2 <- list(
Middle = c("Have you used any products from AlphaToy this past 6 months?",
"Have you used any products from BetaToys Collection this past 6 months?",
"Have you used any products from Charl TOYZ this past 6 months?"),
First = c("AlphaCars is the best automobile dealer, yes/no?",
"Best Vehc is the best automobile dealer, yes/no?",
"CoolCarz & Bike is the best automobile dealer, yes/no?")
)
> #output
> lapply(brandName2, extractBrandName)
$Middle
[1] "AlphaToy" "BetaToys Collection" "Charl TOYZ"
$First
[1] "AlphaCars" "Best Vehc" "CoolCarz & Bike"
In the end, the solution to this problem is found. Thanks to shs who gave the initial idea and the answer from the post I linked above. If you have any suggestions, please feel free to comment. Thank you.
This function checks which words the first two strings have in common and then removes everything from the beginning of the strings up to and including the common element, leaving only the desired part of the string:
library(stringr)
extractBrandName <- function(x) {
x %>%
str_split(" ") %>%
{.[[1]][.[[1]] %in% .[[2]]]} %>%
str_c(collapse = " ") %>%
str_c("^.+", .) %>%
str_remove(x, .) %>%
str_squish() %>%
str_remove("\\?")
}
lapply(brandName, extractBrandName)
#> $Toy
#> [1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
#>
#> $Car
#> [1] "AlphaCars" "BestCar" "CoolCarz"
#>
#> $Trucker
#> [1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"

Recursive indexing of lists with variable index value per recursion step

Puh... even trying to frame the title properly already gives me a headache.
I have a config.yml with nested values and I would like to define an indexing function get_config() that accepts "path-like" value strings.
The "path entities" of the value string match the nested entity structure of the config file. Based on the path-like value the function should then go and grab the corresponding hierarchy entity (either "branches" or "leaves") from the config file.
Example
Suppose this is the structure of the config.yml:
default:
column_names:
col_id: "id"
col_value: "value"
column_orders:
data_structure_a: [
column_names/col_id,
column_names/col_value
]
data_structure_b: [
column_names/col_value,
column_names/col_id
]
Here's a parsed version for you to play around with:
x <- yaml::yaml.load(
'default:
column_names:
col_id: "id"
col_value: "value"
column_orders:
data_structure_a: [
column_names/col_id,
column_names/col_value
]
data_structure_b: [
column_names/col_value,
column_names/col_id
]'
)
Accessing top-level entities is easy with config::get(value):
config::get("column_names")
# $col_id
# [1] "id"
#
# $col_value
# [1] "value"
config::get("column_orders")
# [1] "hello" "world"
But I would also like to access deeper entities, e.g. column_names: col_id.
In pseudo code:
config::get("column_names:col_id")
or
config::get("column_orders/data_structure_a")
The best I could come up with so far: relying on unlist()
get_config <- function(value, sep = ":") {
if (value %>% stringr::str_detect(sep)) {
value <- value %>% stringr::str_replace(sep, ".")
configs <- config::get() %>% unlist()
configs[value]
} else {
config::get(value)
}
}
get_config("column_names")
# $col_id
# [1] "id"
#
# $col_value
# [1] "value"
get_config("column_names:col_id")
# column_names.col_id
# "id"
Though not elegant, it works for most use cases, but fails for unnamed list entities in the config file
get_config("column_orders:data_structure_a")
# <NA>
# NA
as my indexing approach doesn't play well with the result of unlist() on unnamed lists:
config::get() %>% unlist()
# column_names.col_id column_names.col_value
# "id" "value"
# column_orders.data_structure_a1 column_orders.data_structure_a2
# "column_names/col_id" "column_names/col_value"
# column_orders.data_structure_b1 column_orders.data_structure_b2
# "column_names/col_value" "column_names/col_id"
Thus, I'd like to "go recursive" but my brain says: "no way, dude"
Due diligence
This solution comes close (I guess).
But I keep thinking that I need something like purrr::map2_if() or purrr::pmap_if() (which AFAIK don't exist) instead of purrr::map_if(), as I need to not only traverse the list behind config::get() recursively, but also a listified version of value (e.g. via stringr::str_split(value, sep) %>% unlist() %>% as.list())?
You could also use purrr::pluck to index into a nested list by name if that is what you are after:
x <- yaml::yaml.load('
column_names:
col_id: "id"
col_value: "value"
column_orders:
data_structure_a: [
column_names/col_id,
column_names/col_value
]
data_structure_b: [
column_names/col_value,
column_names/col_id
]
nested_list:
element_1:
element_2:
value: "hello world"
')
purrr::pluck(x, "column_names", "col_id")
#> [1] "id"
purrr::pluck(x, "column_names")
#> $col_id
#> [1] "id"
#>
#> $col_value
#> [1] "value"
purrr::pluck(x, "column_orders", "data_structure_a")
#> [1] "column_names/col_id" "column_names/col_value"
purrr::pluck(x, "column_names", "col_notthere")
#> NULL
I came up with a solution based on Recall().
However, while digging up the internet in trying to get here, I recall having read somewhere that Recall() is generally not a very (memory) efficient way of doing recursion in R? Also would appreciate additional hints on how to do recursion the tidy way with purrr and friends.
Config file content
Being able to call get_config() implies that you have a config.yml file with above content in your project's root directory given by here::here(), but you can test get_list_element_recursively() with this workaround:
x <- yaml::yaml.load('
column_names:
col_id: "id"
col_value: "value"
column_orders:
data_structure_a: [
column_names/col_id,
column_names/col_value
]
data_structure_b: [
column_names/col_value,
column_names/col_id
]
nested_list:
element_1:
element_2:
value: "hello world"
')
Function defs
get_config <- function(value, sep = "/") {
get_list_element_recursively(
config::get(),
stringr::str_split(value, sep, simplify = TRUE)
)
}
get_list_element_recursively <- function(
lst,
el,
.el_trace = el,
.level_trace = 1
) {
# Reached leaf:
if (!is.list(lst)) {
return(lst)
}
# Element not in list:
if (!(el[1] %in% names(lst))) {
message("Current list branch:")
# print(lst)
message(str(lst))
message("Trace of indexing vec (last element is invalid):")
message(stringr::str_c(.el_trace[.level_trace], collapse = "/"))
stop(stringr::str_glue("No such element in list: {el[1]}"))
}
lst <- lst[[ el[1] ]]
if (!is.na(el[2])) {
# Continue if there are additional elements in `el` vec
Recall(lst, el[-1], .el_trace, .level_trace = 1:(.level_trace + 1))
} else {
# Otherwise return last indexing result:
lst
}
}
Testing get_config()
get_config("column_names")
# $col_id
# [1] "id"
#
# $col_value
# [1] "value"
get_config("column_names/col_id")
# [1] "id"
get_config("column_names/col_nonexisting")
# Current list branch:
# List of 6
# $ col_id : chr "id"
# $ col_value : chr "value"
#
# Trace of indexing vec (last element is invalid):
# column_names/col_nonexisting
# Error in get_list_element_recursively(config::get(), stringr::str_split(value, :
# No such element in list: col_nonexisting
get_config("column_orders")
# $data_structure_a
# [1] "column_names/col_id" "column_names/col_value"
#
# $data_structure_b
# [1] "column_names/col_value" "column_names/col_id"
get_config("column_orders/data_structure_a")
# [1] "column_names/col_id" "column_names/col_value"
Testing get_list_element_recursively()
get_list_element_recursively(x, c("column_names"))
# $col_id
# [1] "id"
#
# $col_value
# [1] "value"
get_list_element_recursively(x, c("column_names", "col_id"))
# [1] "id"
get_list_element_recursively(x, c("column_names", "col_notthere"))
# Current list branch:
# List of 2
# $ col_id : chr "id"
# $ col_value: chr "value"
#
# Trace of indexing vec (last element is invalid):
# column_names/col_notthere
# Error in get_list_element_recursively(x$default, c("column_names", "col_notthere")) :
# No such element in list: col_notthere

Create list with specific iteration in R

I have the following dataset containing dates:
> dates
[1] "20180412" "20180424" "20180506" "20180518" "20180530" "20180611" "20180623" "20180705" "20180717" "20180729"
I am trying to create a list where in each position, the name is 'Coherence_' + the first and second dates in dates. So in output1[1] I would have Coherence_20180412_20180424. Then in output1[2] I would have Coherence_20180506_20180518, etc.
I am starting with this code but it is not working they way I need:
output1<-list()
for (i in 1:5){
output1[[i]]<-paste("-Poutput1=", S1_Out_Path,"Coherence_VV_TC", dates[[i]],"_", dates[[i+1]], ".tif", sep="")
}
Do you have any suggestions?
M
Try this:
Without loop
even_indexes<-seq(2,10,2) # List of even indexes
odd_indexes<-seq(1,10,2) # List of odd indexes
print(paste('Coherence',paste(odd_indexes,even_indexes,sep = "_"),sep = "_"))
Link answer from here: Create list in R with specific iteration
Updated (To get data in List)
lst=c(paste('Coherence',paste(odd_indexes,even_indexes,sep = "_"),sep = "_"))
OR
a=c(1:10)
for (i in seq(1, 9, 2)){
print(paste('Coherence',paste(a[i],a[i+1],sep = "_"),sep = "_"))
}
Output:
[1] "Coherence_1_2"
[1] "Coherence_3_4"
[1] "Coherence_5_6"
[1] "Coherence_7_8"
[1] "Coherence_9_10"
You can create these patterns using paste capability to operate on vectors:
dates <- c("20180412", "20180424", "20180506", "20180518", "20180530",
"20180611", "20180623", "20180705", "20180717", "20180729")
paste("Coherence", dates[1:length(dates)-1], dates[2:length(dates)], sep="_")
[1] "Coherence_20180412_20180424" "Coherence_20180424_20180506" "Coherence_20180506_20180518"
[4] "Coherence_20180518_20180530" "Coherence_20180530_20180611" "Coherence_20180611_20180623"
[7] "Coherence_20180623_20180705" "Coherence_20180705_20180717" "Coherence_20180717_20180729"
Or other simple patterns can be generated as:
paste("Coherence", dates[seq(1, length(dates), 2)], dates[seq(2, length(dates), 2)], sep="_")
[1] "Coherence_20180412_20180424" "Coherence_20180506_20180518" "Coherence_20180530_20180611"
[4] "Coherence_20180623_20180705" "Coherence_20180717_20180729"
You can use matrix(..., nrow=2):
dates <- c("20180412", "20180424", "20180506", "20180518", "20180530", "20180611", "20180623", "20180705", "20180717", "20180729")
paste0("Coherence_", apply(matrix(dates, 2), 2, FUN=paste0, collapse="_"))
# > paste0("Coherence_", apply(matrix(dates, 2), 2, FUN=paste0, collapse="_"))
# [1] "Coherence_20180412_20180424" "Coherence_20180506_20180518" "Coherence_20180530_20180611" "Coherence_20180623_20180705"
# [5] "Coherence_20180717_20180729"

Find similar elements between two lists and Replace with a corresponding elements

I have a list of probe ids as below :
> dput(best)
list(c("204639_at", "203440_at", "242136_x_at", "231954_at",
"208388_at", "205942_s_at", "203510_at", "204639_at"), c("204639_at",
"203510_at", "231954_at"))
Then I have used this file:
> head(sym)
x
204639_at ADA
203440_at CDH2
242876_at AKT3
207078_at MED6
208388_at NR2E3
222161_at NAALAD2
> class(sym)
[1] "data.frame"
Then, I want to find alternative names :
("ADA" "CDH2" "AKT3" "MED6" "NR2E3" "NAALAD2")
In sym and replace existing with elements in "best" file. Does anyone have a hack? Thanks
There is no "hack" needed.
#your data:
best <- list(list(c("204639_at", "203440_at", "242136_x_at", "231954_at", "208388_at", "205942_s_at", "203510_at", "204639_at" )),
list(c("204639_at", "203510_at", "231954_at")))
sym <- read.table(text=" x
204639_at ADA
203440_at CDH2
242876_at AKT3
207078_at MED6
208388_at NR2E3
222161_at NAALAD2", header=TRUE)
#iterate through list and match against sym
rapply(best, function(x) {
res <- as.character(sym[x,1])
#omit the following line if you prefer NAs for nomatches
res[is.na(res)] <- x[is.na(res)]
res
}, how="list")
#[[1]]
#[[1]][[1]]
#[1] "ADA" "CDH2" "242136_x_at" "231954_at" "NR2E3" "205942_s_at" "203510_at" "ADA"
#
#
#[[2]]
#[[2]][[1]]
#[1] "ADA" "203510_at" "231954_at"

Resources