I am trying to do some analysis on Twitter data, so I have tweets broken into words:
> head(words)
[1] "#fabulous" "rock" "is" "#destined" "to" "be" "star"
> head(hashtags)
hashtags score
1 #fabulous 7.526
2 #excellent 7.247
3 #superb 7.199
4 #perfection 7.099
5 #terrific 6.922
6 #magnificent 6.672
So I want to check the words character vector against the hashtags data frame and, for every match, sum the corresponding score values.
In the above case I want the output to be 7.526 + 6.922 = 14.448 (the scores of #fabulous and #terrific).
Any help would be greatly appreciated.
Try this
words_hashtags <- words[grepl('^#', words)]
scores <- hashtags[hashtags$hashtags %in% words_hashtags, 'score']
sum(scores)
grepl returns a logical vector indicating which elements of words begin with a hashtag. The rest is just basic R syntax.
More options to get words_hashtags:
words_hashtags <- grep('^#', words, value = TRUE)
words_hashtags <- words[grep('^#', words, value = FALSE)]
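Putting it together on the sample data (a minimal sketch; the expected sum implies that the full words vector also contains "#terrific", so it is included here):
# Sample data reconstructed from the question; "#terrific" is assumed
# to be in the full words vector, since the expected sum includes 6.922
words <- c("#fabulous", "rock", "is", "#destined", "to", "be", "star", "#terrific")
hashtags <- data.frame(hashtags = c("#fabulous", "#excellent", "#superb",
                                    "#perfection", "#terrific", "#magnificent"),
                       score = c(7.526, 7.247, 7.199, 7.099, 6.922, 6.672),
                       stringsAsFactors = FALSE)
words_hashtags <- words[grepl('^#', words)]
sum(hashtags[hashtags$hashtags %in% words_hashtags, 'score'])
# [1] 14.448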
I'm relatively new to R and I have a question about data processing. The main issue is that the dataset is too big: I want to write a vectorized function that's faster than a for loop, but I don't know how. The data is about movies and user ratings and is formatted like this:
1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
The 1: and 2: represent movies, while the other lines represent a user id, user rating and date of rating for that movie (in that order from left to right, separated by commas). I want to format the data as an edge list, like this:
Movie | User
1: | 5
1: | 1
1: | 3
2: | 2
2: | 3
2: | 5
I wrote the code below to perform this function. Basically, for every row, it checks whether it's a movie id (containing ':') or user data. It then combines the movie id and user id as two columns for each movie-user pair and row-binds them to a new data frame. At the same time, it only binds those users who rated a movie 5 out of 5.
el <- data.frame(matrix(ncol = 2, nrow = 0))
for (i in 1:nrow(data)) {
  if (grepl(':', data[i, ])) {
    # Movie ID line, e.g. "1:"
    mid <- data[i, ]
  } else if (grepl(',', data[i, ])) {
    # User data line; keep only 5-star ratings
    if (grepl(',5,', data[i, ])) {
      uid <- unlist(strsplit(data[i, ], ','))[1]
      add <- c(mid, uid)
      el <- rbind(el, add)
    }
  }
}
However, I have about 100 million entries, and the for loop ran all night without completing. Is there a way to speed this up? I read about vectorization, but I can't figure out how to vectorize this function. Any help?
You can do this with a few regular expressions, for which I'll use the stringr package, as well as na.locf from the zoo package. (You'll have to install stringr and zoo first).
First we'll set up your data, which it sounds like is in a one-column data frame:
data <- read.table(textConnection("1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
"))
You can then follow these steps (explanation in the comments).
# Pull out the column as a character vector for simplicity
lines <- data[[1]]
library(stringr)
# Figure out which lines represent movie IDs, and extract IDs
movie_ids <- str_match(lines, "(\\d+):")[, 2]
# Carry the last observation forward (locf), so that each line
# picks up the most recent non-NA movie ID
library(zoo)
movie_ids_filled <- na.locf(movie_ids)
# Extract the user IDs
user_ids <- str_match(lines, "(\\d+),")[, 2]
# For each line that has a user ID, match it to the movie ID
result <- cbind(movie_ids_filled[!is.na(user_ids)],
user_ids[!is.na(user_ids)])
This gets the result
[,1] [,2]
[1,] "1" "5"
[2,] "1" "1"
[3,] "1" "3"
[4,] "2" "2"
[5,] "2" "3"
[6,] "2" "5"
The most important part of this code is the use of regular expressions, particularly the capturing groups in parentheses in "(\\d+):" and "(\\d+),". For more on using str_match with regular expressions, do check out this guide.
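To see what a capturing group does, here is a quick illustration (the full match lands in column 1, the captured group in column 2):
str_match("1:", "(\\d+):")
#      [,1] [,2]
# [1,] "1:" "1"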
This is a complete re-write of my original question in an attempt to clarify it and make it as answerable as possible. My objective is to write a function which takes a string as input and returns the information contained therein in tabular format. Two examples of the kind of character strings the function will face are the following
s1 <- " 9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167***\r"
s2 <- " 10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0\r"
(For those who had read my original question, these are smaller strings for simplicity.)
The required output would be:
Rank Code Name Club Class Time Points
9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167
10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0
I have managed to split the string based on where there's a blank space using:
strsplit(s1, " ")[[1]][strsplit(s1, " ")[[1]] != ""]
although a more elegant solution was given by G. Grothendieck in the comments below using:
unlist(strsplit(trimws(s1), " +"))
This results in
"9" "9875" "Γεωργίου" "Άγγελος" "Δημήτρης" "ΑΒ/Γ" "Π/Π" "Β" "00:54:05" "167***\r"
However, this is still problematic as "Γεωργίου" "Άγγελος" and "Δημήτρης" should be combined into "Γεωργίου Άγγελος Δημήτρης" (note that the number of elements could be two OR three) and the same applies to "Π/Π" "Β" which should be combined into "Π/Π Β".
The question
How can I use the additional information that I have, namely:
The order of the elements will always be the same
The Name data will consist of two or three words
The Club data (i.e. ΑΒ/Γ in s1 and ΔΕΖ in s2) will come from a pre-defined list of clubs (e.g. stored in a character vector named sClub)
The Class data (i.e. Π/Π Β in s1 and N in s2) will come from a pre-defined list of classes (e.g. stored in a character vector named sClass)
The Points data will always contain "\r" and won't contain any spaces.
to produce the required output above?
Defining
sClub <- c("ΑΒ/Γ", "ΔΕΖ")
sClass <- c("Π/Π Β", "N")
we may do
library(stringr)
myfun <- function(s)
  gsub("\\*", "", trimws(str_match(
    s,
    paste0("^\\s*(\\d+)\\s*?(\\w+)\\s*?([\\w ]+)\\s*(",
           paste(sClub, collapse = "|"), ")\\s*(",
           paste(sClass, collapse = "|"), ")(.*?)\\s*([^ ]*\r)")
  )[, -1]))
sapply(list(s1, s2), myfun)
# [,1] [,2]
# [1,] "9" "10"
# [2,] "9875" "8954F"
# [3,] "Γεωργίου Άγγελος Δημήτρης" "Smith John"
# [4,] "ΑΒ/Γ" "ΔΕΖ"
# [5,] "Π/Π Β" "N"
# [6,] "00:54:05" "ΔΕΝ ΕΚΚΙΝΗΣΕ"
# [7,] "167" "0"
It works by encoding all of your additional information in one long regex, then erasing the asterisks and trimming leading/trailing whitespace from each field.
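For readability, the same pattern can be assembled piece by piece; this is the identical regex, just annotated:
pattern <- paste0(
  "^\\s*(\\d+)",                                # Rank: leading digits
  "\\s*?(\\w+)",                                # Code: e.g. 9875 or 8954F
  "\\s*?([\\w ]+)",                             # Name: two or three words
  "\\s*(", paste(sClub, collapse = "|"), ")",   # Club: one of the known clubs
  "\\s*(", paste(sClass, collapse = "|"), ")",  # Class: one of the known classes
  "(.*?)",                                      # Time, or status text such as ΔΕΝ ΕΚΚΙΝΗΣΕ
  "\\s*([^ ]*\r)"                               # Points: no spaces, ends in \r
)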
I am trying to extract the elements from a nested list.
I have a list as below
> terms[1:3]
$`1`
mathew
1
$`2`
apr expires gmt thu
1 1 1 1
$`3`
distribution world
1 1
When I use unlist I get the following output, where each term's name is prefixed with the index of the list element it came from:
> unlist(terms)[1:6]
1.mathew 2.apr 2.expires 2.gmt 2.thu 3.distribution
1 1 1 1 1 1
How can I extract the term name and the value associated with it? For example, mathew has the value 1.
In the end I need to create a data frame with columns term and count.
Reproducible Example
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
findMostFreqTerms(tdm,10)
findMostFreqTerms returns a named list by default, with one element per document. If you just want to combine those terms into a single list, ignoring the document names, use
unlist(unname(terms))
But note that this may list some words multiple times if more than one document shares a most frequent word. If you want to treat the entire corpus as a single document, you can do
findMostFreqTerms(tdm, 10, INDEX=rep(1, ncol(tdm)))[[1]]
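From there, the term/count data frame asked for in the question is a short step (a sketch; freq is the named frequency vector returned by the call above):
freq <- findMostFreqTerms(tdm, 10, INDEX = rep(1, ncol(tdm)))[[1]]
data.frame(term = names(freq), count = unname(freq))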
Does this help?
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
terms <- findMostFreqTerms(tdm, 10)
a <- unlist(terms)
# Strip digits and dots (the document-index prefix) from the names
words <- gsub('[0-9.]+', '', attr(a, 'names'))
words
df <- t(data.frame(a))
colnames(df) <- words
Good afternoon,
Thanks for helping me out with this question.
I have a set of >5000 URLs within a list that I am interested in scraping. I have used lapply and readLines to extract the text for these webpages using the sample code below:
multipleURL <- c("http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all",
                 "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1407&start=1&labeltype=all",
                 "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1975&start=1&labeltype=all")
multipleText <- lapply(multipleURL, readLines)
Now I would like to query each of these texts for the word "radioactive". I am simply interested in figuring out if this term is mentioned in the text and have been using the logical grep command:
radioactive <- grepl("radioactive" , multipleText, ignore.case = TRUE)
When I count the number of items in our list that contain the word "radioactive" it returns a count of 0:
count(radioactive)
x freq
1 FALSE 3
However, a cursory review of the webpages for each of these URLs reveals that the first link (http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all) DOES in fact contain the word radioactive. Our multipleText list even includes the word radioactive, although our grepl command doesn't seem to pick it up.
Any thoughts on what I am doing wrong would be greatly appreciated.
Many thanks,
Chris
I think you should parse your document using an HTML parser. Here I am using the XML package: I convert each document to an R list and then apply grep to it.
library(XML)
multipleText <- lapply(multipleURL, function(x) {
  y <- xmlToList(htmlParse(x))
  y.flat <- unlist(y, recursive = TRUE)
  length(grep('radioactive', c(y.flat, names(y.flat))))
})
multipleText
[[1]]
[1] 8
[[2]]
[1] 0
[[3]]
[1] 0
EDIT: to search for multiple words:
## define your words here
WORDS <- c('CLINICAL ','solution','Action','radioactive','Effects')
library(XML)
multipleText <- lapply(multipleURL, function(x) {
  y <- xmlToList(htmlParse(x))
  y.flat <- unlist(y, recursive = TRUE)
  # Count how often each word occurs in the values and the names
  sapply(WORDS, function(w)
    length(grep(w, c(y.flat, names(y.flat)))))
})
do.call(rbind, multipleText)
CLINICAL solution Action radioactive Effects
[1,] 6 10 2 8 2
[2,] 1 3 1 0 3
[3,] 6 22 2 0 6
PS: maybe you should use ignore.case = TRUE for the grep command.
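For instance, the inner counting step could become (a sketch reusing y.flat from the code above):
counts <- sapply(WORDS, function(w)
  length(grep(w, c(y.flat, names(y.flat)), ignore.case = TRUE)))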
I have a bunch of names, and I want to obtain the unique names. However, due to spelling errors and inconsistencies in the data, the names might be written down wrong. I am looking for a way to check, within a vector of strings, whether two of them are similar.
For example:
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")
I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?
This can be done based on, e.g., the Levenshtein distance. There are multiple implementations of this in different packages. Some solutions and packages can be found in the answers to these questions:
agrep: only return best match(es)
In R, how do I replace a string that contains a certain pattern with another string?
Fast Levenshtein distance in R?
But most often agrep will do what you want:
> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3
$`Bush, G.W.`
[1] 2
$`Obama, B.H.`
[1] 1 3
$`Clinton, W.J.`
[1] 4
Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.
lapply(pres, agrep, pres, value = TRUE)
[[1]]
[1] " Obama, B." "Obama, B.H."
[[2]]
[1] "Bush, G.W."
[[3]]
[1] " Obama, B." "Obama, B.H."
[[4]]
[1] "Clinton, W.J."
Let's add another duplicate to show that it works with more than one:
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.", "Bush, G.")
adist computes the string distance between two character vectors:
adist(" Obama, B.", pres)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 9 3 10 7
For example, to select the closest string to " Obama, B." you can take the one with the minimal distance. To avoid matching the identical string itself, consider only distances greater than zero:
d <- adist(" Obama, B.", pres)
pres[which(d == min(d[d > 0]))]
# [1] "Obama, B.H."
To obtain unique names while accounting for spelling errors and inconsistencies, you can compare each string to all previous ones and drop it if a similar one already exists. The keepunique() function below does this; it is applied to the elements of the vector successively with Reduce().
keepunique <- function(previousones, x){
  # Drop x if it is within edit distance 5 of a name already kept
  if (any(adist(x, previousones) < 5)) {
    x <- NULL
  }
  return(c(previousones, x))
}
Reduce(keepunique, pres)
# [1] " Obama, B." "Bush, G.W." "Clinton, W.J."