Webscrape text using logical grep in R - r

Good afternoon,
Thanks for helping me out with this question.
I have a set of >5000 URLs within a list that I am interested in scraping. I have used lapply and readLines to extract the text for these webpages using the sample code below:
multipleURL <- c("http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1407&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1975&start=1&labeltype=all")
multipleText <- lapply(multipleURL, readLines)
Now I would like to query each of these texts for the word "radioactive". I am simply interested in figuring out if this term is mentioned in the text and have been using the logical grep command:
radioactive <- grepl("radioactive" , multipleText, ignore.case = TRUE)
When I count the number of items in our list that contain the word "radioactive" it returns a count of 0:
count(radioactive)
x freq
1 FALSE 3
However, a cursory review of the webpages for each of these URLs reveals that the first link (http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all) DOES in fact contain the word radioactive. Our "multipleText" list even includes the word radioactive, although our grepl command doesn't seem to pick it up.
Any thoughts on what I am doing wrong would be greatly appreciated.
Many thanks,
Chris

I think you should parse your document using an HTML parser. Here I am using the XML package. I convert each document to an R list and then apply grep to it.
library(XML)
multipleText <- lapply(multipleURL, function(x) {
  y <- xmlToList(htmlParse(x))
  y.flat <- unlist(y, recursive = TRUE)
  length(grep('radioactive', c(y.flat, names(y.flat))))
})
multipleText
[[1]]
[1] 8
[[2]]
[1] 0
[[3]]
[1] 0
EDIT: to search for multiple words:
## define your words here
WORDS <- c('CLINICAL ','solution','Action','radioactive','Effects')
library(XML)
multipleText <- lapply(multipleURL, function(x) {
  y <- xmlToList(htmlParse(x))
  y.flat <- unlist(y, recursive = TRUE)
  ## count the matches for each word; `w` avoids shadowing the outer `y`
  sapply(WORDS, function(w)
    length(grep(w, c(y.flat, names(y.flat)))))
})
do.call(rbind,multipleText)
     CLINICAL  solution Action radioactive Effects
[1,]         6       10      2           8       2
[2,]         1        3      1           0       3
[3,]         6       22      2           0       6
PS: maybe you should use ignore.case = TRUE for the grep command.
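If you would rather keep the raw lines from readLines, one way to get a single TRUE/FALSE per page is to run grepl on each list element and collapse the result with any(). This is only a minimal sketch: the `pages` list below is made-up stand-in data (not real DailyMed output) with the same shape as what lapply(multipleURL, readLines) returns.

```r
# Hypothetical stand-in for lapply(multipleURL, readLines):
# a list with one character vector of lines per page.
pages <- list(
  c("<html>", "<p>Contains a Radioactive tracer</p>", "</html>"),
  c("<html>", "<p>Nothing relevant here</p>", "</html>")
)

# For each page, check every line and report whether any line matches.
has_term <- vapply(
  pages,
  function(lines) any(grepl("radioactive", lines, ignore.case = TRUE)),
  logical(1)
)

has_term
sum(has_term)  # number of pages that mention the term
```

Note that this searches the raw HTML, so it can also match text inside tags or attributes; parsing the page first, as above, avoids that.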

Related

Creating functions with eval(parse()) containing numeric vectors

I have several functions as strings which contain a lot of numeric vectors in the form of
c(1,2,3) , with three fixed values each (3D-coordinates). See test_string below as a small example. I can create a working function test_fun using eval and parse, but there is a problem:
I need these vectors to be recognized as one input, i.e. as double[3] and not as language with the parts 'c' (symbol), 1 (double[1]), 2 (double[1]) and 3 (double[1]). Check this code to see what I mean:
test_string <- "function(x) \n c(1,2,3)*x"
test_fun <- eval(parse(text = test_string))
test_fun(2)
#[1] 2 4 6 <- it's working
View(list(test_fun)) # see 'type' column
str(body(test_fun)[[2]])
# language c(1, 2, 3) <- desired output here: num [1:3] 1 2 3
str(body(test_fun)[[2]][[1]])
# symbol c
Is there an easy solution that works on the full string? I would be very happy to learn about this! If necessary I could also change the code in the function which creates these function strings when the substrings are concatenated with paste("function(x) \n ","c(1,2,3)","*x",sep = "").
Edit: I made a mistake in the 'View' and 'desired output' lines. They are now correct.
I think I found a solution that works for me. If there is a more elegant solution, please let me know!
I go recursively through the function body and evaluate the parts which are numerical vectors a second time (as @Allan Cameron suggested, thanks!). Here is the function:
evalBodyParts <- function(fun_body){
  for (i in seq_along(fun_body)){
    if (typeof(fun_body[[i]])=="language" &&
        typeof(fun_body[[i]][[1]])=="symbol" && fun_body[[i]][[1]]=="c"){
      # a call whose first element is the symbol 'c': evaluating it gives num [1:3]
      fun_body[[i]] <- eval(fun_body[[i]])
    } else {
      if(typeof(fun_body[[i]])=="language"){
        # any other call: recurse into its parts
        fun_body[[i]] <- evalBodyParts(fun_body=fun_body[[i]])
      }
    }
  }
  return(fun_body)
}
Here is a quick example that is a bit more complex than the one in the main question above.
Before:
test_string <- paste("function(x) \n ","c(1,2,3)","*x","+c(7,8,9)",sep = "")
test_fun <- eval(parse(text = test_string))
test_fun(2) # it's working
# [1] 9 12 15
str(body(test_fun)[[2]][[2]])
# language c(1, 2, 3)
str(body(test_fun)[[3]])
# language c(7, 8, 9)
After:
body(test_fun) <- evalBodyParts(fun_body=body(test_fun))
test_fun(2) # it is still working
# [1] 9 12 15
str(body(test_fun)[[2]][[2]])
# num [1:3] 1 2 3
str(body(test_fun)[[3]])
# num [1:3] 7 8 9
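If changing the code that generates the functions is an option, another route is to skip the strings and build the body from calls, so the numeric vectors are embedded as values from the start. The following is only a sketch of that idea under the assumption that the pieces are available as R objects; `vec1` and `vec2` are illustrative names, not part of the original code.

```r
# Illustrative inputs; in the real code these would be the 3D coordinates.
vec1 <- c(1, 2, 3)
vec2 <- c(7, 8, 9)

# call() evaluates its arguments, so vec1 and vec2 are stored in the
# call as numeric vectors, while quote(x) stays a symbol.
test_fun <- function(x) NULL
body(test_fun) <- call("+", call("*", vec1, quote(x)), vec2)

test_fun(2)
str(body(test_fun)[[2]][[2]])  # the embedded vec1
str(body(test_fun)[[3]])       # the embedded vec2
```

With this construction there is no c(...) call left in the body to re-evaluate, so no post-processing step is needed.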

R: Extract list of list

I am trying to extract the elements from a nested list.
I have a list as below
> terms[1:3]
$`1`
mathew
     1

$`2`
    apr expires     gmt     thu
      1       1       1       1

$`3`
distribution        world
           1            1
When I am using unlist I get the following output, where each term is preceded by the number it is present inside the list
> unlist(terms)[1:6]
      1.mathew          2.apr      2.expires          2.gmt          2.thu 3.distribution
             1              1              1              1              1              1
How can I extract each name and the value associated with it? For example, the mathew column has the value 1.
In the end I need to create a data frame with the columns term and count.
Reproducible Example
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
findMostFreqTerms(tdm,10)
findMostFreqTerms will return a named list by default, with one element per document. If you just want to combine those terms into a single vector, ignoring the document names, use
unlist(unname(terms))
But note that this may repeat some words if more than one document shares a most frequent word. If you want to treat the entire corpus as a single document, you can do
findMostFreqTerms(tdm, 10, INDEX=rep(1, ncol(tdm)))[[1]]
Does this help?
data('crude')
library(tm)
tdm <- TermDocumentMatrix(crude)
terms=findMostFreqTerms(tdm,10)
a = unlist(terms)
words = gsub('[0-9.]+', '', attr(a,'names'))
words
df = t(data.frame(a))
colnames(df) = words
# colnames(df)
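To get from the named list to the term,count data frame the question asks for, something along these lines might work. It is a sketch only: `terms` is rebuilt by hand here to mirror the printed output in the question (so the example runs without tm), and the `doc` column is an extra added for illustration.

```r
# Hand-built stand-in for the first three elements of findMostFreqTerms(tdm, 10).
terms <- list(
  `1` = c(mathew = 1),
  `2` = c(apr = 1, expires = 1, gmt = 1, thu = 1),
  `3` = c(distribution = 1, world = 1)
)

# Dropping the outer names first keeps unlist() from prefixing "1.", "2.", ...
flat <- unlist(unname(terms))

df <- data.frame(
  doc   = rep(names(terms), lengths(terms)),  # which list element each term came from
  term  = names(flat),
  count = unname(flat),
  stringsAsFactors = FALSE
)
df
```

Building the names this way avoids having to strip the numeric prefixes with a regular expression afterwards.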

Locate different patterns in a sequence

How can I find two different patterns in a single sequence?
For example:
seq="ATGCAAAGGT"
the patterns are
pattern=c("ATGC","AAGG")
How am I supposed to find these two patterns simultaneously in the sequence?
I also want to find the location of these patterns like for example the patterns locations are 1,4 and 5,8.
Can anyone help me with this ?
Let's say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs, and then return a true / false vector that identifies if both are present using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
Let's say the vector of sequences is a column within data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence matches with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
  id     sequence match
1 s1   ATGCAAAGGT  TRUE
2 s2   ATGCTAAGGT  TRUE
3 s3 NOTINTHISONE FALSE
The following for-loop can return a list of the positions of each of the patterns within the sequence:
require(stringr)
# uses the `pattern` vector defined in the question
for(i in seq_along(d$sequence)){
  out <- str_locate_all(d$sequence[i], pattern)
  first <- c(out[[1]])
  first.o <- paste(first[1], first[2], sep=',')
  second <- c(out[[2]])
  second.o <- paste(second[1], second[2], sep=',')
  print(c(first.o, second.o))
}
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"
You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.
How about this using stringr to find start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
#     start end
#[1,]     1   4
#
#[[2]]
#     start end
#[1,]     6   9
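If you prefer to avoid the stringr dependency, roughly the same start/end positions can be obtained in base R with gregexpr. This is just a sketch; the helper name `locate_pattern` and the variable names are made up for illustration.

```r
seq_str  <- "ATGCAAAGGT"
patterns <- c("ATGC", "AAGG")

# Hypothetical helper: return all start/end positions of one fixed pattern.
locate_pattern <- function(p, s) {
  m <- gregexpr(p, s, fixed = TRUE)[[1]]
  if (m[1] == -1) {
    # gregexpr signals "no match" with -1; return an empty result instead
    return(data.frame(pattern = character(0), start = integer(0),
                      end = integer(0), stringsAsFactors = FALSE))
  }
  starts <- as.integer(m)
  data.frame(pattern = p,
             start   = starts,
             end     = as.integer(starts + attr(m, "match.length") - 1L),
             stringsAsFactors = FALSE)
}

hits <- do.call(rbind, lapply(patterns, locate_pattern, s = seq_str))
hits
```

Like str_locate_all, gregexpr reports only non-overlapping matches of each pattern, so overlapping occurrences of the same motif would need a different approach.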

How to read in list from file in R?

I have a list written to a file created by sink() - "file.txt". That file contains one list, which looks like this, and it contains only numbers:
[[1]]
[1] 1 2
[[2]]
[1] 1 2 3
How can I read the data in as a list from such a file?
EDIT:
I'm going to try reading it as a string, then use some regex to remove '[[*]]' and substitute '[*]' with a special symbol, say '#'. Then I would take every substring between '#', split it into a vector and put it into an empty list.
Something like this should do the trick. (The exact details may vary, but at least this will give you some ideas to work with.)
l <- readLines("file.txt")
l2 <- gsub("\\[{2}\\d+\\]{2}", "#", l) # Replace [[*]] with '#'
l3 <- gsub("\\[\\d+\\]\\s", "", l2)[-1] # Remove all [*]
l4 <- paste(l3, collapse=" ") # Paste together into one string
l5 <- strsplit(l4, "#")[[1]] # Break into list
lapply(l5, function(X) scan(textConnection(X))) # Use scan to convert to numeric
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 1 2 3
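If you control how the file gets written in the first place, it may be simpler to avoid parsing printed output at all and store the list in a format R can read back directly. A minimal sketch using saveRDS/readRDS, with a temporary file standing in for "file.txt":

```r
x <- list(c(1, 2), c(1, 2, 3))

# A temporary file is used here purely for illustration.
tmp <- tempfile(fileext = ".rds")
saveRDS(x, tmp)      # write the object itself, not its printed form

y <- readRDS(tmp)    # read it back as a list
identical(x, y)
```

This does not help with files that already exist in the sink() format; for those, the regex approach above is still needed.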

How to read and print the head of a file in R?

I want to print the head of a file in R. I know how to use read.table and other input methods supported by R. I just want to know the R alternatives to the unix commands cat or head, which read in a file and print some of it.
Thank you,
SangChul
read.table() takes an nrows argument for just this purpose:
read.table(header=TRUE, text="
a b
1 2
3 4
", nrows=1)
# a b
# 1 1 2
If you are instead reading in (possibly less structured) files with readLines(), you can use its n argument instead:
readLines(textConnection("a b
1 2 3 4 some other things
last"), n=1)
# [1] "a b"
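For an actual file on disk, the closest analogue to the unix head command is probably readLines with n, followed by cat or writeLines to print the lines as they are. A small sketch, using a temporary file with made-up contents in place of a real one:

```r
# Create a small example file (the contents are invented for illustration).
tmp <- tempfile(fileext = ".txt")
writeLines(c("a b", "1 2", "3 4", "last"), tmp)

# Read only the first two lines, like `head -n 2`.
first_two <- readLines(tmp, n = 2)

# Print them verbatim, without the [1] index or quotes.
cat(first_two, sep = "\n")
```

Because n limits how much is read, this stays cheap even when the file is large.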
