I am having trouble calculating the average sentiment of each row in a relatively big dataset (N=36140).
My dataset contains review data from an app on the Google Play Store (each row represents one review), and I would like to calculate the sentiment of each review using the sentiment_by() function.
The problem is that this function takes a very long time to run.
Here is the link to my dataset in .csv format:
https://drive.google.com/drive/folders/1JdMOGeN3AtfiEgXEu0rAP3XIe3Kc369O?usp=sharing
I have tried using this code:
library(sentimentr)
e_data = read.csv("15_06_2016-15_06_2020__Sygic.csv", stringsAsFactors = FALSE)
sentiment=sentiment_by(e_data$review)
Then I get the following warning message (after I cancel the process once 10+ minutes have passed):
Warning message:
Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory. It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.
I have also tried the get_sentences() function with the following code, but sentiment_by() still takes a very long time to run:
library(magrittr) # provides %>%
e_sentences = e_data$review %>%
  get_sentences()
e_sentiment = sentiment_by(e_sentences)
I have other Google Play Store review datasets and have used sentiment_by() on them for the past month, and it calculated the sentiment very quickly. The calculations only started taking this long yesterday.
Is there a way to quickly calculate the sentiment for each row of a big dataset?
The algorithm used in sentiment appears to be O(N^2) once you get above 500 or so individual reviews, which is why it's suddenly taking a lot longer when you upped the size of the dataset significantly. Presumably it's comparing every pair of reviews in some way?
I glanced through the help file (?sentiment) and it doesn't seem to do anything which depends on pairs of reviews so that's a bit odd.
library(data.table)
reviews <- iconv(e_data$review, "") # I had a problem with UTF-8, you may not need this
x1 <- rbindlist(lapply(reviews[1:10],sentiment_by))
x1[,element_id:=.I]
x2 <- sentiment_by(reviews[1:10])
produce effectively the same output, which suggests that the sentimentr package has a bug in it causing it to be unnecessarily slow.
One solution is just to batch the reviews. This will break the 'by' functionality in sentiment_by, but I think you should be able to group them yourself before you send them in (or after, as it doesn't seem to matter).
batch_sentiment_by <- function(reviews, batch_size = 200, ...) {
review_batches <- split(reviews, ceiling(seq_along(reviews)/batch_size))
x <- rbindlist(lapply(review_batches, sentiment_by, ...))
x[, element_id := .I]
x[]
}
batch_sentiment_by(reviews)
This takes about 45 seconds on my machine (and should be O(N) for bigger datasets).
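To see what the splitting step does in isolation, here is a minimal sketch using base R only. The helper name make_batches and the toy input are illustrative assumptions, not part of sentimentr.

```r
# Split a vector into consecutive batches of at most `batch_size` elements.
# `make_batches` is a hypothetical helper name used only for illustration.
make_batches <- function(x, batch_size = 200) {
  split(x, ceiling(seq_along(x) / batch_size))
}

toy_reviews <- paste("review", 1:25)   # placeholder data, not the real reviews
batches <- make_batches(toy_reviews, batch_size = 10)

lengths(batches)                       # batch sizes: 10, 10, 5
# Recombining the batches restores the original order, which is why
# element_id can simply be reassigned with .I afterwards.
identical(unlist(batches, use.names = FALSE), toy_reviews)
```

Because the batches are consecutive and come back in order, row i of the combined result still corresponds to review i.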
I have a large dataset, including about 100,000 entries. I am using the tibbletime package to create a rolling version of the DL.test function from the vrtest package.
I am using a rolling window (size=1000), leading to about 99,000 computations. The code looks like this:
#installing packages
install.packages("tibbletime")
install.packages("vrtest")
#importing libraries
library(vrtest)
library(dplyr)
library(tibbletime)
library(tibble)
#generating demo data
data <- data.frame(replicate(1,sample(0:1,1010,rep=TRUE)))
names(data)[names(data) == "replicate.1..sample.0.1..1010..rep...TRUE.."] <- "log_return"
#running DL.test once
DL.test(data, 300, 1)
#creating a rolling window version of DL.test
test <- rollify(DL.test, window=1000, unlist=FALSE)
#applying function and saving results
results <- dplyr::mutate(data, test = test(log_return))
The issue now is that running DL.test even once takes a little less than 5 minutes on my current setup. Having to repeat this step nearly 100,000 times severely limits the practicality of this approach.
What options do I have to speed this process up?
My current idea would be to create many smaller versions of my original dataset (e.g., entries 1 - 1500 for the first 500 computations, 501 - 2000 for the second batch...) and somehow employ parallel processing.
Any hints are highly appreciated!
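The chunk-and-parallelise idea described above could be sketched roughly as follows. This is only an illustration under assumptions: rolling_apply is a hypothetical helper, mean() stands in for the real test, and the optional cl argument is a cluster created with parallel::makeCluster().

```r
# Apply `fun` to every window of length `window` in `x`.
# `rolling_apply` is a hypothetical helper; `cl` is an optional cluster
# created with parallel::makeCluster().
rolling_apply <- function(x, window, fun, cl = NULL) {
  stopifnot(length(x) >= window)
  starts <- seq_len(length(x) - window + 1)
  one_window <- function(s) fun(x[s:(s + window - 1)])
  if (is.null(cl)) {
    lapply(starts, one_window)                   # sequential fallback
  } else {
    parallel::parLapply(cl, starts, one_window)  # windows spread over workers
  }
}

# Stand-in statistic (mean) on toy data; swap in the real test function.
x <- c(1, 2, 3, 4, 5, 6)
res <- unlist(rolling_apply(x, window = 3, fun = mean))
res  # one value per window
```

Each window is independent, so the work divides cleanly across workers. Note that this only divides the total run time by roughly the number of cores; it does not make a single call any faster.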
I'm trying to generate a word cloud for a year's worth of complaint narrative data from the CFPB's public complaint database.
There are roughly 100,000 words per year.
I've been able to generate clouds using samples of about 1,000 words per year. I use a tibble with words and frequencies for each year.
I've tried wordcloud and ggwordcloud so far and both packages seem to run forever or freeze when I try using them on a full year's worth of data. My machine has 16GB of RAM. Is it capable of handling this much data?
Does anyone know if there's a package I can use to generate word clouds for datasets this large?
I've seen previous answers that recommend taking samples or otherwise reducing the size of data that I'm working with. I still want to work with the full dataset if possible.
Reading in chunks to accumulate a word list is one option. Here's some code that uses the function read_lines_chunked from the readr package. Each chunk is processed using the tidytext package and the output from a previous chunk is used to create an accumulating word list. From there, the wordcloud package is used.
library(tidytext)
library(dplyr)
library(wordcloud)
library(readr)
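process_chunk <- function(x, pos, acc) {
  df <- tibble(text = x)
  words <-
    unnest_tokens(df, 'word', 'text') %>%
    anti_join(stop_words, by = 'word') %>%
    count(word) %>%                                  # counts for this chunk
    bind_rows(acc) %>%                               # add the running totals
    count(word, wt = n, name = "n", sort = TRUE)     # sum counts per word
  rm(df)
  words
}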
words <- read_lines_chunked('complaints.csv', AccumulateCallback$new(process_chunk), chunk_size = 100000)
words %>%
head(50) %>%
with(wordcloud(word, n))
Creating the word list took about 30 minutes on my 16 GB laptop.
Obviously, you'll have to tune your code to add or remove words of interest. I eliminated stop words, and you might want to decide what to do with xxxx, which I think is a substitute for the rude words that cross people write.
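The accumulation idea itself (count each chunk, then add those counts to a running total) can be shown without any packages. This is a minimal base R sketch; add_chunk_counts is a hypothetical helper and the tokenisation is deliberately crude.

```r
# Merge the word counts of one text chunk into a running total.
# `add_chunk_counts` is a hypothetical helper; `acc` is a named integer
# vector of counts so far (NULL for the first chunk).
add_chunk_counts <- function(chunk, acc = NULL) {
  words <- unlist(strsplit(tolower(chunk), "[^a-z]+"))
  words <- words[nzchar(words)]
  new_counts <- table(words)
  totals <- c(acc, setNames(as.integer(new_counts), names(new_counts)))
  # Sum counts that share a word, whether they come from `acc` or the chunk.
  summed <- tapply(totals, names(totals), sum)
  sort(setNames(as.integer(summed), names(summed)), decreasing = TRUE)
}

acc <- add_chunk_counts(c("late fee charged", "fee not refunded"))
acc <- add_chunk_counts("another late fee", acc)
acc[["fee"]]   # total count for "fee" across both chunks
```

Only the running totals are kept in memory between chunks, which is what keeps the footprint small on a full year of narratives.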
I get a random number of errors that come back as NA when I try to geocode a lot of places using the getGeoCode function from RgoogleMaps. Can anyone tell me why? (Reproducible code below)
library(RgoogleMaps)
library(foreach)
###Replicating a large search data###
PlaceVector <- c(rep("Anchorage,Alaska", 20), rep("Baltimore,Maryland", 20),
rep("Birmingham,Alabama", 20))
iters <- length(PlaceVector)
###Looping to get each geocode###
geoadd <- foreach(a=1:iters, .combine=rbind) %do% {
getGeoCode(paste(PlaceVector[a]))
}
geoadd <- as.data.frame(geoadd)
geoadd$Place <- PlaceVector
I get a random number of errors, usually around 15, where the latitudes and longitudes in the data frame geoadd come back as NA. I could loop back over the NAs, but that seems utterly inefficient. Do others have the same problem with the sample code provided?
I get NAs in the example as well. I once had a problem with looping and geocoding. The problem was that I was hitting Google Maps too fast, or with too many requests within a minimum time frame. I built in a waiting period with Sys.sleep to solve the issue. The difficulty is finding the right amount of time to wait. This depends on your connection and on Google's response times.
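A waiting period combined with a retry could look roughly like this. It is only a sketch: retry_with_wait and the toy function flaky are hypothetical names, and in the question the function passed in would wrap getGeoCode().

```r
# Call `f(x)` until it returns a non-NA result or `max_tries` is reached,
# pausing `wait` seconds between attempts. `retry_with_wait` and `f` are
# hypothetical names; in the question `f` would wrap getGeoCode().
retry_with_wait <- function(x, f, max_tries = 5, wait = 0.5) {
  for (attempt in seq_len(max_tries)) {
    result <- f(x)
    if (!anyNA(result)) return(result)
    Sys.sleep(wait)   # Sys.sleep() takes seconds (fractions allowed)
  }
  result              # still NA after all attempts
}

# Toy stand-in that fails twice before succeeding, to exercise the loop.
# The returned numbers are dummy values, not real coordinates.
calls <- 0
flaky <- function(place) {
  calls <<- calls + 1
  if (calls < 3) c(lat = NA, lon = NA) else c(lat = 1, lon = 2)
}
retry_with_wait("Anchorage,Alaska", flaky, wait = 0.01)
```

Retrying only the failed lookups avoids slowing down every request, and the wait value can be raised until the NAs stop appearing.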
I'm trying to complete the very simple task of reading in an unphased fasta file and phasing it using ape, and then calculating Tajima's D using pegas, but my data doesn't seem to be reading in correctly. Input and output are as follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the file, but since the data() command was in the manual, I executed
data(DNAbin8c18)
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded
DNAbin8c18
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
so it looks like the data should be fine. Because of this, I tried what I want to do
tajima.test(DNAbin8c18)
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs than I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
https://drive.google.com/open?id=0B6qb8IlaQGFZLVRYeXMwRnpMTUU
I have also sent an earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Ella
Thank you for the comment. Indeed, you are correct. The developer just emailed me with the following very helpful comments.
The problem is that your data are too big (too many sequences) and tajima.test() needs to compute the matrix of all pairwise distances. You could check this by trying:
dist.dna(DNAbin8c18, "N")
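The size in the error message is consistent with a full pairwise distance computation. A quick back-of-the-envelope check, assuming each pairwise distance is stored as one 8-byte double:

```r
n_seq   <- 817452                      # sequences reported when printing the object
n_pairs <- n_seq * (n_seq - 1) / 2     # distinct pairs of sequences
bytes   <- n_pairs * 8                 # assumption: one 8-byte double per pair
gib     <- bytes / 2^30                # convert bytes to the "Gb" unit R reports
round(gib, 1)                          # 2489.3
```

This matches the 2489.3 Gb in the error, so the allocation that fails is the vector of pairwise distances.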
One possibility for you is to sample randomly some observations, and repeat this many times, eg:
n <- nrow(DNAbin8c18) # number of sequences
tajima.test(DNAbin8c18[sample(n, size = 1000), ])
This could be:
N <- 1000 # number of repeats
RES <- matrix(NA_real_, 3, N) # one column per repeat: D, Pval.normal, Pval.beta
for (i in 1:N)
  RES[, i] <- unlist(tajima.test(DNAbin8c18[sample(n, size = 10000), ]))
You may adjust N and 'size =' to have something not too long to run. Then you may look at the distribution of each row of RES (one row per statistic).
I want to use ChemoSpec with mass spectra of about 60,000 data points each.
I already have them in one txt file as a matrix (X + 90 samples = 91 columns; 60,000 rows).
How can I turn this file into spectra data without first exporting every sample to its own csv file (which takes quite long in R given the size of my data)?
The typical (and only?) way to import data into ChemoSpec is by way of the getManyCsv() function, which as the question indicates requires one CSV file for each sample.
Creating 90 CSV files from the 91-column, 60,000-row file described may be somewhat slow and tedious in R, but could be done with a standalone application, whether an existing utility or some ad-hoc script.
An R-only solution would be to create a new function, say getOneBigCsv(), adapted from getManyCsv(). After all, the logic of getManyCsv() is relatively straightforward.
Don't expect such a solution to be sizzling fast, but it should, in any case, compare with the time it takes to run getManyCsv() and avoid having to create and manage the many files, hence overall be faster and certainly less messy.
Sorry I missed your question 2 days ago. I'm the author of ChemoSpec - always feel free to write directly to me in addition to posting somewhere.
The solution is straightforward. You already have your data in a matrix (after you read it in with read.csv("file.txt")), so you can use it to manually create a Spectra object. In the R console, type ?Spectra to see the structure of a Spectra object, which is a list with specific entries. You will need to put your X column (which I assume is mass) into the freq slot. Then the rest of the data matrix will go into the data slot. Then manually create the other needed entries (making sure the data types are correct). Finally, assign the Spectra class to your completed list by doing something like class(my.spectra) <- "Spectra" and you should be good to go. I can give you more details on or off list if you describe your data a bit more fully. Perhaps you have already solved the problem?
By the way, ChemoSpec is totally untested with MS data, but I'd love to find out how it works for you. There may be some changes that would be helpful so I hope you'll send me feedback.
Good Luck, and let me know how else I can help.
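For the layout described in the question (one X column followed by one column per sample), the manual construction might look roughly like this. It is a sketch on a small toy table with placeholder values for the descriptive entries; the exact set of required entries should be checked against ?Spectra and chkSpectra() for the installed ChemoSpec version.

```r
# Toy stand-in for the file described in the question: first column X,
# remaining columns one per sample (here 3 samples, 5 rows).
wide <- data.frame(X = c(100, 101, 102, 103, 104),
                   s1 = runif(5), s2 = runif(5), s3 = runif(5))

n_samples <- ncol(wide) - 1
my.spectra <- list(
  freq    = wide$X,                          # domain axis (assumed to be mass)
  data    = t(as.matrix(wide[, -1])),        # samples in rows, one column per X value
  names   = colnames(wide)[-1],
  groups  = factor(rep("all", n_samples)),   # placeholder grouping
  colors  = rep("black", n_samples),         # placeholder plotting colors
  sym     = rep(1L, n_samples),
  alt.sym = rep("a", n_samples),
  unit    = c("m/z", "intensity"),           # placeholder axis labels
  desc    = "placeholder description"
)
class(my.spectra) <- "Spectra"
dim(my.spectra$data)   # samples x data points
```

The key point is the transpose: the file has samples in columns, while the data entry holds one sample per row with as many columns as there are values in freq.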
Many years have passed and I am not sure if anybody is still interested in this topic. But I had the same problem and wrote a little workaround to convert my data to class 'Spectra' by extracting the information from the data itself:
#Assumption:
# Data is stored as a numeric data.frame with column names presenting samples
# and row names including domain axis
dataframe2Spectra <- function(Spectrum_df,
freq = as.numeric(rownames(Spectrum_df)),
data = as.matrix(t(Spectrum_df)),
names = paste("YourFileDescription", 1:dim(Spectrum_df)[2]),
groups = rep(factor("Factor"), dim(Spectrum_df)[2]),
colors = rainbow(dim(Spectrum_df)[2]),
sym = 1:dim(Spectrum_df)[2],
alt.sym = letters[1:dim(Spectrum_df)[2]],
unit = c("a.u.", "Domain"),
desc = "Some signal. Describe it with 'desc'"){
features <- c("freq", "data", "names", "groups", "colors", "sym", "alt.sym", "unit", "desc")
Spectrum_chem <- vector("list", length(features))
names(Spectrum_chem) <- features
Spectrum_chem$freq <- freq
Spectrum_chem$data <- data
Spectrum_chem$names <- names
Spectrum_chem$groups <- groups
Spectrum_chem$colors <- colors
Spectrum_chem$sym <- sym
Spectrum_chem$alt.sym <- alt.sym
Spectrum_chem$unit <- unit
Spectrum_chem$desc <- desc
# important step
class(Spectrum_chem) <- "Spectra"
# some warnings
if (length(freq)!=dim(data)[2]) print("Dimension of data is NOT #samples X length of freq")
if (length(names)>dim(data)[1]) print("Too many names")
if (length(names)<dim(data)[1]) print("Too few names")
if (length(groups)>dim(data)[1]) print("Too many groups")
if (length(groups)<dim(data)[1]) print("Too few groups")
if (length(colors)>dim(data)[1]) print("Too many colors")
if (length(colors)<dim(data)[1]) print("Too few colors")
if (!is.matrix(data)) print("'data' is not a matrix or it's not numeric")
return(Spectrum_chem)
}
Spectrum_chem <- dataframe2Spectra(Spectrum)
chkSpectra(Spectrum_chem)