I am trying to extract the elements from a nested list.
I have a list as below:
> terms[1:3]
$`1`
mathew
1
$`2`
apr expires gmt thu
1 1 1 1
$`3`
distribution world
1 1
When I use unlist I get the following output, where each term's name is prefixed with the index of the list element it came from:
> unlist(terms)[1:6]
1.mathew 2.apr 2.expires 2.gmt 2.thu 3.distribution
1 1 1 1 1 1
>
How can I extract each name and the value associated with it? For example, the term mathew has the value 1.
In the end I need to create a data frame with columns term and count.
Reproducible Example
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
findMostFreqTerms(tdm,10)
findMostFreqTerms will return a named list by default. If you just want to combine those terms into a single vector, ignoring the document names, use
unlist(unname(terms))
But note that this may repeat some words if more than one document shares a most frequent word. If you want to treat the entire corpus as a single document, you can do
findMostFreqTerms(tdm, 10, INDEX=rep(1, ncol(tdm)))[[1]]
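Either way, the resulting named vector can be turned into the term/count data frame the question asks for. A minimal sketch, using a small mock vector in place of the real unlist(findMostFreqTerms(...)) output:

```r
# mock of the named counts that unlist(findMostFreqTerms(...)) produces
a <- c("1.mathew" = 1, "2.apr" = 1, "2.expires" = 1, "3.distribution" = 1)
# strip the "<doc>." prefix that unlist adds, then build term/count columns
df <- data.frame(term  = sub("^[0-9]+\\.", "", names(a)),
                 count = unname(a),
                 stringsAsFactors = FALSE)
df
```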
Does this help?
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
terms <- findMostFreqTerms(tdm, 10)
a <- unlist(terms)
# strip only the leading "<doc>." prefix that unlist adds to the names
words <- gsub('^[0-9]+\\.', '', names(a))
words
# build the term/count data frame
df <- data.frame(term = words, count = unname(a))
Related
If I want to find two different patterns in a single sequence, how am I supposed to do it?
E.g.:
seq="ATGCAAAGGT"
the patterns are
pattern=c("ATGC","AAGG")
How am I supposed to find these two patterns simultaneously in the sequence?
I also want to find the locations of these patterns; for example, the pattern locations are 1,4 and 5,8.
Can anyone help me with this?
Let's say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs and return a TRUE/FALSE vector that identifies whether both are present, using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
Let's say the vector of sequences is a column within a data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence matches with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
id sequence match
1 s1 ATGCAAAGGT TRUE
2 s2 ATGCTAAGGT TRUE
3 s3 NOTINTHISONE FALSE
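If the set of motifs grows beyond two, the pairwise `&` can be generalized over the pattern vector from the question; a sketch:

```r
seq.file <- c('ATGCAAAGGT', 'ATGCTAAGGT', 'NOTINTHISONE')
pattern  <- c('ATGC', 'AAGG')
# TRUE only where every motif in pattern is found in the sequence
all.match <- Reduce(`&`, lapply(pattern, grepl, x = seq.file))
all.match
# [1]  TRUE  TRUE FALSE
```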
The following for-loop can return a list of the positions of each of the patterns within the sequence:
require(stringr)
for (i in 1:length(d$sequence)) {
  out <- str_locate_all(d$sequence[i], pattern)
  first <- c(out[[1]])
  first.o <- paste(first[1], first[2], sep = ',')
  second <- c(out[[2]])
  second.o <- paste(second[1], second[2], sep = ',')
  print(c(first.o, second.o))
}
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"
You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.
How about this using stringr to find start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
# start end
#[1,] 1 4
#
#[[2]]
# start end
#[1,] 6 9
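If stringr isn't available, base R's regexpr gives the same start/end positions for the first match of each pattern; a sketch:

```r
seq <- "ATGCAAAGGT"
pattern <- c("ATGC", "AAGG")
# first match of each pattern: regexpr returns the start index,
# and the "match.length" attribute gives the match width
loc <- lapply(pattern, function(p) {
  m <- regexpr(p, seq, fixed = TRUE)
  c(start = m[1], end = m[1] + attr(m, "match.length") - 1)
})
loc
```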
I am using readLines() to extract the HTML code of a site. Almost every line of the code contains a pattern of the form <td>VALUE1<td>VALUE2<td>. I would like to get the values in between the <td> tags. I tried some regular expressions such as:
output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')
but the output gives back only one value. Any idea how to do that?
string <- "<td>VALUE1<td>VALUE2<td>"
regmatches(string , gregexpr("(?<=<td>)\\w+(?=<td>)" , string , perl = T) )
# use the gregexpr function to get the match indices and the lengths
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)" , string , perl = T)
# this should be the result:
# [1]  5 15
# attr(,"match.length")
# [1] 6 6
# attr(,"useBytes")
# i.e. there are two matches: the first starts at index 5, the second at
# index 15, and each match has length 6
# then pass the match data to the regmatches function to substring
# your string at these indices
regmatches(string , indices)
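Since the question says almost every line has this pattern, note that gregexpr and regmatches are vectorized, so one call handles all lines at once. A sketch with two made-up lines:

```r
# two hypothetical lines of the scraped HTML
lines <- c("<td>VALUE1<td>VALUE2<td>", "<td>A<td>B<td>")
# one character vector of extracted values per input line
m <- regmatches(lines, gregexpr("(?<=<td>)\\w+(?=<td>)", lines, perl = TRUE))
m
```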
Did you take a look at the "XML" package that can extract tables from HTML? You probably need to provide more context of the entire message that you are trying to parse so that we could see if it might be appropriate.
I am trying to do some analysis on Twitter data. So I have tweets:
> head(words)
[1] "#fabulous" "rock" "is" "#destined" "to" "be" "star"
> head(hashtags)
hashtags score
1 #fabulous 7.526
2 #excellent 7.247
3 #superb 7.199
4 #perfection 7.099
5 #terrific 6.922
6 #magnificent 6.672
So I want to check the words character array against the hashtags data frame, and for every match I want the sum of the score values.
In the case above I want the output to be 7.526 + 6.922 = 14.448.
Any help would be greatly appreciated.
Try this
words_hashtags <- words[grepl('^#', words)]
scores <- hashtags[hashtags$hashtags %in% words_hashtags, 'score']
sum(scores)
grepl returns a logical vector indicating which words start with a hashtag. The rest is just basic R syntax.
More options to get words_hashtags:
words_hashtags <- grep('^#', words, value=T)
words_hashtags <- words[grep('^#', words, value=F)]
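Putting it together with the sample data from the question (assuming the full words vector, beyond the head() shown, also contains "#terrific", so that the expected 14.448 comes out):

```r
hashtags <- data.frame(
  hashtags = c("#fabulous", "#excellent", "#superb",
               "#perfection", "#terrific", "#magnificent"),
  score = c(7.526, 7.247, 7.199, 7.099, 6.922, 6.672),
  stringsAsFactors = FALSE
)
words <- c("#fabulous", "rock", "is", "#destined", "to", "be", "star", "#terrific")
# keep only the words that start with '#', then sum their scores
words_hashtags <- words[grepl('^#', words)]
total <- sum(hashtags[hashtags$hashtags %in% words_hashtags, 'score'])
total
# [1] 14.448
```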
I have a list of 120777 records which contains names of people. I want to store an array of name parts for each record in the dataset. I tried this in R:
my_list$name_parts <- strsplit(my_list$name, " ")
I get my_list$name_parts as a list of 120777 items. But when I try querying the number of words in each name using length(my_list$name_parts), I just get 120777, the length of the whole list.
Let's use this simple example:
my_list <- list()
my_list$name <- c("toto t. tutu", "foo bar")
To get the number of words, you can do that:
lapply(strsplit(my_list$name," "), length)
which gives in the simple example above:
[[1]]
[1] 3
[[2]]
[1] 2
To avoid getting a list, you can even do:
unlist(lapply(strsplit(my_list$name," "), length))
[1] 3 2
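Since R 3.2.0, base R's lengths() collapses the unlist(lapply(..., length)) step into one call; with the same toy example:

```r
my_list <- list()
my_list$name <- c("toto t. tutu", "foo bar")
# number of space-separated words in each name
n.words <- lengths(strsplit(my_list$name, " "))
n.words
# [1] 3 2
```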
I have a set of >5000 URLs within a list that I am interested in scraping. I have used lapply and readLines to extract the text for these webpages using the sample code below:
multipleURL <- c("http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1407&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1975&start=1&labeltype=all")
multipleText <- lapply(multipleURL, readLines)
Now I would like to query each of these texts for the word "radioactive". I am simply interested in figuring out if this term is mentioned in the text and have been using the logical grep command:
radioactive <- grepl("radioactive" , multipleText, ignore.case = TRUE)
When I count the number of items in our list that contain the word "radioactive" it returns a count of 0:
count(radioactive)
x freq
1 FALSE 3
However, a cursory review of the webpages for each of these URLs reveals that the first link (http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all) DOES in fact contain the word radioactive. Our "multipleText" list even includes the word radioactive, although our grepl command doesn't seem to pick it up.
Any thoughts on what I am doing wrong would be greatly appreciated.
I think you should parse your document using an HTML parser. Here I am using the XML package: I convert the document to an R list and then apply grep on it.
library(XML)
multipleText <- lapply(multipleURL, function(x) {
  y <- xmlToList(htmlParse(x))
  y.flat <- unlist(y, recursive = TRUE)
  length(grep('radioactive', c(y.flat, names(y.flat))))
})
multipleText
[[1]]
[1] 8
[[2]]
[1] 0
[[3]]
[1] 0
EDIT: to search for multiple words:
## define your words here
WORDS <- c('CLINICAL ','solution','Action','radioactive','Effects')
library(XML)
multipleText <- lapply(multipleURL, function(x) {
  y <- xmlToList(htmlParse(x))
  y.flat <- unlist(y, recursive = TRUE)
  sapply(WORDS, function(w)
    length(grep(w, c(y.flat, names(y.flat)))))
})
do.call(rbind,multipleText)
CLINICAL solution Action radioactive Effects
[1,] 6 10 2 8 2
[2,] 1 3 1 0 3
[3,] 6 22 2 0 6
PS: maybe you should use ignore.case = TRUE for the grep command.
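As a simpler base-R alternative that avoids parsing, keep each page as the character vector that readLines returns and test it element-wise with grepl, collapsing per page with any(). A sketch with mock page text standing in for the real downloads:

```r
# mock pages: each element mimics the readLines() output for one URL
multipleText <- list(
  c("<html>", "This product is RADIOACTIVE.", "</html>"),
  c("<html>", "Nothing relevant here.", "</html>"),
  c("<html>", "Also nothing.", "</html>")
)
# TRUE for each page whose text mentions the word anywhere, in any case
radioactive <- sapply(multipleText, function(page)
  any(grepl("radioactive", page, ignore.case = TRUE)))
radioactive
# [1]  TRUE FALSE FALSE
```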