I get a random number of errors that come back as NA when I try to geocode a lot of places using the RgoogleMaps getGeoCode function. Can anyone tell me why? (Reproducible code below.)
library(RgoogleMaps)
library(foreach)
### Replicating a large set of search data ###
PlaceVector <- c(rep("Anchorage,Alaska", 20), rep("Baltimore,Maryland", 20),
                 rep("Birmingham,Alabama", 20))
iters <- length(PlaceVector)
### Looping to get each geocode ###
geoadd <- foreach(a = 1:iters, .combine = rbind) %do% {
  getGeoCode(PlaceVector[a])
}
geoadd <- as.data.frame(geoadd)
geoadd$Place <- PlaceVector
I get a random number of errors, usually around 15, where the latitude and longitude in the data frame geoadd come back as NA. I could re-run the loop on just the NAs, but that seems utterly inefficient. Do others see the same problem with the sample code provided?
I get NAs in the example as well. I once had a problem with looping and geocoding: I was hitting Google Maps too fast, with too many requests within a minimum time frame. I built in a waiting period with Sys.sleep to solve the issue. The tricky part is finding the right amount of time to wait, which depends on your connection and on Google's response times.
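A minimal sketch of that idea, combining a short Sys.sleep pause with a retry on NAs (the half-second pause, the back-off, and the three retries are assumptions you will need to tune for your own connection):

library(RgoogleMaps)

# Geocode a vector of places, pausing between requests and retrying NAs.
geocode_slowly <- function(places, pause = 0.5, max_tries = 3) {
  out <- t(sapply(places, function(p) {
    for (i in seq_len(max_tries)) {
      res <- getGeoCode(p)
      if (!any(is.na(res))) break
      Sys.sleep(pause * i)  # back off a little more on each retry
    }
    Sys.sleep(pause)        # always pause before the next place
    res
  }))
  data.frame(out, Place = places, row.names = NULL)
}

geoadd <- geocode_slowly(PlaceVector)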
Related
I have a large dataset containing about 100,000 entries. I am using the tibbletime package to create a rolling version of the DL.test function from the vrtest package.
I am using a rolling window (size = 1000), which leads to about 99,000 computations. The code looks like this:
#installing packages
install.packages("tibbletime")
install.packages("vrtest")
#importing libraries
library(vrtest)
library(dplyr)
library(tibbletime)
library(tibble)
#generating demo data: one column of 0/1 values named "log_return"
data <- data.frame(log_return = sample(0:1, 1010, replace = TRUE))
#running DL.test once
DL.test(data, 300, 1)
#creating a rolling window version of DL.test
test <- rollify(DL.test, window=1000, unlist=FALSE)
#applying function and saving results
results <- dplyr::mutate(data, test = test(log_return))
The issue is that running DL.test even once takes a little less than 5 minutes on my current setup. Having to repeat this step nearly 100,000 times severely limits its practicality.
What options do I have to speed this process up?
My current idea would be to create many smaller versions of my original dataset (e.g., entries 1 - 1500 for the first 500 computations, 501 - 2000 for the second batch...) and somehow employ parallel processing.
Any hints are highly appreciated!
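A rough sketch of that batching-plus-parallel idea. It assumes DL.test accepts a plain numeric vector of returns (an assumption; the single call above passes the whole data frame) and that the real series lives in a column called log_return; both should be checked against the actual data:

library(vrtest)
library(parallel)

window <- 1000
starts <- seq_len(nrow(data) - window + 1)   # ~99,000 window start positions

run_window <- function(i) {
  # DL.test on one rolling window: 300 bootstrap replications, lag 1, as above
  DL.test(data$log_return[i:(i + window - 1)], 300, 1)
}

# Spread the windows over several cores; on Windows use makeCluster()/parLapply() instead
results <- mclapply(starts, run_window, mc.cores = max(1, detectCores() - 1))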
I am using the following code to GET data from the KOBO server via its API, but it downloads only 30,000 observations instead of all 85,000.
library(httr)   # GET, authenticate, content
library(readr)  # read_csv

rawdata <- GET(url, authenticate(u, pw, type = "basic"), progress())
observer <- content(rawdata, "raw", encoding = "UTF-8")
observer <- read_csv(observer)
observer <- as.data.frame(observer)
Using the same code, I am able to download all observations when the dataset is smaller.
Looking for help.
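If the 30,000-row cutoff is a server-side page limit (worth confirming in the KOBO API documentation for the endpoint you are calling), the usual fix is to page through the data. The limit/start query parameters and the 30,000 page size below are assumptions, not verified against KOBO; the sketch only shows the general pattern:

library(httr)
library(readr)

# Hypothetical paging loop: keep requesting pages until a short or empty page comes back.
fetch_all <- function(url, u, pw, page_size = 30000) {
  pages <- list()
  start <- 0
  repeat {
    resp <- GET(url,
                authenticate(u, pw, type = "basic"),
                query = list(limit = page_size, start = start),  # assumed parameter names
                progress())
    page <- as.data.frame(read_csv(content(resp, "raw", encoding = "UTF-8")))
    if (nrow(page) == 0) break
    pages[[length(pages) + 1]] <- page
    start <- start + page_size
    if (nrow(page) < page_size) break   # last (partial) page
  }
  do.call(rbind, pages)
}

observer <- fetch_all(url, u, pw)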
I am having trouble calculating the average sentiment of each row in a relatively big dataset (N = 36,140).
My dataset contains review data from an app on the Google Play Store (each row represents one review), and I would like to calculate the sentiment of each review using the sentiment_by() function.
The problem is that this function takes a very long time to run.
Here is the link to my dataset in .csv format:
https://drive.google.com/drive/folders/1JdMOGeN3AtfiEgXEu0rAP3XIe3Kc369O?usp=sharing
I have tried using this code:
library(sentimentr)
e_data = read.csv("15_06_2016-15_06_2020__Sygic.csv", stringsAsFactors = FALSE)
sentiment = sentiment_by(e_data$review)
Then I get the following warning message (after I cancel the process once 10+ minutes have passed):
Warning message:
Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory. It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.
I have also tried to use the get_sentences() function first, with the following code, but sentiment_by() still takes a very long time to run:
library(magrittr)  # for %>%, in case it is not already attached

e_sentences = e_data$review %>%
  get_sentences()
e_sentiment = sentiment_by(e_sentences)
I have worked with Google Play Store review datasets and the sentiment_by() function for the past month, and it always calculated sentiment very quickly; the calculations only started taking this long yesterday.
Is there a way to quickly calculate sentiment for each row of a big dataset?
The algorithm used in sentiment appears to be O(N^2) once you get above 500 or so individual reviews, which is why it's suddenly taking a lot longer when you upped the size of the dataset significantly. Presumably it's comparing every pair of reviews in some way?
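A quick, unscientific way to see that scaling for yourself (my addition, reusing e_data from the question): time sentiment_by() on growing subsets and watch how the elapsed time grows; roughly quadrupling when the size doubles would be consistent with O(N^2).

library(sentimentr)

# Time sentiment_by() on increasing numbers of reviews
sizes <- c(250, 500, 1000, 2000)
elapsed <- sapply(sizes, function(n) {
  system.time(sentiment_by(e_data$review[seq_len(n)]))["elapsed"]
})
data.frame(n = sizes, seconds = round(elapsed, 1))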
I glanced through the help file (?sentiment) and it doesn't seem to do anything which depends on pairs of reviews so that's a bit odd.
library(data.table)
reviews <- iconv(e_data$review, "")  # I had a problem with UTF-8, you may not need this

# one sentiment_by() call per review, stacked with rbindlist...
x1 <- rbindlist(lapply(reviews[1:10], sentiment_by))
x1[, element_id := .I]

# ...versus a single call on the same ten reviews
x2 <- sentiment_by(reviews[1:10])
These two produce effectively the same output, which means the sentimentr package has a bug in it that makes it unnecessarily slow.
One solution is just to batch the reviews. This will break the 'by' functionality in sentiment_by, but I think you should be able to group them yourself before you send them in (or afterwards, as it doesn't seem to matter).
batch_sentiment_by <- function(reviews, batch_size = 200, ...) {
  review_batches <- split(reviews, ceiling(seq_along(reviews) / batch_size))
  x <- rbindlist(lapply(review_batches, sentiment_by, ...))
  x[, element_id := .I]
  x[]
}

batch_sentiment_by(reviews)
This takes about 45 seconds on my machine (and should be roughly O(N) for bigger datasets).
I'm trying to complete the very simple task of reading in an unphased FASTA file and phasing it using ape, and then calculating Tajima's D using pegas, but my data doesn't seem to be reading in correctly. Input and output are as follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the file, but since the data() command was in the manual, I executed
data(DNAbin8c18)
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded:
DNAbin8c18
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
so it looks like the data should be fine. Because of this, I tried what I actually want to do:
tajima.test(DNAbin8c18)
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs than I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
https://drive.google.com/open?id=0B6qb8IlaQGFZLVRYeXMwRnpMTUU
I have also sent an earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Ella
Thank you for the comment. Indeed, you are correct. The developer just emailed me with the following very helpful comments.
The problem is that your data are too big (too many sequences): tajima.test() needs to compute the matrix of all pairwise distances. You can check this by trying:
dist.dna(DNAbin8c18, "N")
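For scale, a back-of-the-envelope calculation (my addition) shows where the 2489.3 Gb in the error message comes from: a full set of pairwise distances for n sequences is n*(n-1)/2 doubles at 8 bytes each.

n <- 817452                     # number of sequences reported for DNAbin8c18
n * (n - 1) / 2 * 8 / 1024^3    # ~2489.3 GiB, matching the allocation error above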
One possibility for you is to randomly sample some of the sequences and repeat this many times, e.g.:
n <- nrow(DNAbin8c18)  # number of sequences
tajima.test(DNAbin8c18[sample(n, size = 1000), ])
This could be:
N <- 1000  # number of repeats
RES <- matrix(NA, nrow = 3, ncol = N)
for (i in 1:N)
    RES[, i] <- unlist(tajima.test(DNAbin8c18[sample(n, size = 10000), ]))
You may adjust N and 'size =' so the run does not take too long. Then you can look at the distribution of the values in each row of RES.
I have attempted to email the author of this package without success; I'm just wondering if anybody else has experienced this.
I am having an issue using rpart on 4,000 rows of data with 13 attributes.
I can run the same test on 300 rows of the same data with no issue.
When I run it on 4,000 rows, Rgui.exe runs consistently at 50% CPU and the UI hangs; it will stay like this for at least 4-5 hours if I let it run, and never finishes or becomes responsive.
Here is the code I am using on both the 300-row and the 4,000-row subset:
train <- read.csv("input.csv", header=T)
y <- train[, 18]
x <- train[, 3:17]
library(rpart)
fit <- rpart(y ~ ., x)
Is this a known limitation of rpart? Am I doing something wrong? Are there potential workarounds?
Can you reproduce the problem when you feed rpart random data of similar dimensions, rather than your real data (from input.csv)? If not, it's probably a problem with your data (formatting perhaps?). After importing your data using read.csv, check it for format issues by looking at the output from str(train).
# How to do an equivalent rpart fit on some random data of equivalent dimension
dats <- data.frame(matrix(rnorm(4000 * 14), nrow = 4000))
y <- dats[, 1]
x <- dats[, -1]
library(rpart)
system.time(fit <- rpart(y ~ ., x))
The problem here was a data-preparation error: a header row had been re-written far down in the middle of the data set.
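For anyone hitting the same thing, a quick way to spot a stray header row repeated inside the file (my addition; it assumes the affected columns are supposed to be numeric):

# A repeated header line shows up as (a) columns that should be numeric coming
# back as character, and (b) rows whose values equal the column names.
train <- read.csv("input.csv", header = TRUE, stringsAsFactors = FALSE)
str(train)                             # non-numeric columns hint at the problem
which(train[[1]] == names(train)[1])   # rows where column 1 repeats its own header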