httr package in R - Not Getting all the observations from KOBO Data - r

I am using the following code to GET data from the KOBO server via its API, but it downloads only 30,000 observations instead of all 85,000.
library(httr)
library(readr)

rawdata <- GET(url, authenticate(u, pw, type = "basic"), progress())
observer <- content(rawdata, "raw", encoding = "UTF-8")
observer <- read_csv(observer)
observer <- as.data.frame(observer)
Using the same code, I am able to download all observations when the form has a smaller number of observations.
Looking for help.
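For reference, a rough paging sketch of the kind of workaround that might be needed; it assumes the endpoint accepts start/limit query parameters and can return JSON (the exact parameters depend on the KOBO API version, so treat this as a guess rather than a known fix):
library(httr)
library(jsonlite)

page_size <- 30000
start <- 0
pages <- list()
repeat {
  resp <- GET(url, authenticate(u, pw, type = "basic"),
              query = list(format = "json", start = start, limit = page_size),
              progress())
  page <- fromJSON(content(resp, "text", encoding = "UTF-8"))$results
  if (NROW(page) == 0) break            # no more records
  pages[[length(pages) + 1]] <- page
  start <- start + page_size
}
observer <- do.call(rbind, pages)       # or dplyr::bind_rows() if columns differ between pages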

Related

How to use botornot function in R tweetbotornot package?

I am unable to even run the example code given in the botrnot documentation, and I'm unsure what's happening.
# libraries
library(rtweet)
library(tweetbotornot)
# authentication for twitter API
auth <- rtweet_app()
auth_setup_default()
users <- c("kearneymw", "geoffjentry", "p_barbera",
"tidyversetweets", "rstatsbot1234", "RStatsStExBot")
## get most recent 10 tweets from each user
tmls <- get_timeline(users, n = 10)
## pass the returned data to botornot()
data <- botornot(tmls)
I expected a data frame named data to be created, with an additional column giving the probability that each user is a bot. Instead I get this error:
Error in botornot.data.frame(tmls) : "user_id" %in% names(x) is not TRUE
The table at the bottom of the documentation is what I'm hoping to achieve.
https://www.rdocumentation.org/packages/botrnot/versions/0.0.2
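For reference, the check that fails just looks for a user_id column. Newer rtweet releases (1.0 and later) keep user fields in users_data() rather than as columns of the timeline data frame, so one hedged sketch is to copy those columns back; botornot() may still expect more of the old rtweet layout, so this only addresses the reported check:
usr <- rtweet::users_data(tmls)   # one row of user data per tweet
tmls$user_id <- usr$id_str
tmls$screen_name <- usr$screen_name
data <- botornot(tmls)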

Calculate sentiment of each row in a big dataset using R

I am having trouble calculating the average sentiment of each row in a relatively big dataset (N = 36,140).
My dataset contains review data for an app on the Google Play Store (each row represents one review), and I would like to calculate the sentiment of each review using the sentiment_by() function.
The problem is that this function takes a very long time to run.
Here is the link to my dataset in .csv format:
https://drive.google.com/drive/folders/1JdMOGeN3AtfiEgXEu0rAP3XIe3Kc369O?usp=sharing
I have tried using this code:
library(sentimentr)
e_data = read.csv("15_06_2016-15_06_2020__Sygic.csv", stringsAsFactors = FALSE)
sentiment = sentiment_by(e_data$review)
Then I get the following warning message (after I cancel the process once 10+ minutes have passed):
Warning message:
Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory. It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.
I have also tried to use the get_sentences() function with the following code, but the sentiment_by() function still takes a long time to finish the calculations:
e_sentences = e_data$review %>%
  get_sentences()
e_sentiment = sentiment_by(e_sentences)
I have other Google Play Store review datasets on which I have used the sentiment_by() function over the past month, and it always calculated the sentiment very quickly; calculations have only been taking this long since yesterday.
Is there a way to quickly calculate the sentiment of each row of a big dataset?
The algorithm used in sentiment appears to be O(N^2) once you get above 500 or so individual reviews, which is why it's suddenly taking a lot longer when you upped the size of the dataset significantly. Presumably it's comparing every pair of reviews in some way?
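A rough way to see that scaling for yourself is to time sentiment_by() on increasing slices of the reviews (the slice sizes below are arbitrary, and the larger ones may take a few minutes):
library(sentimentr)
# If elapsed time grows much faster than the slice size, the scaling is worse than linear.
for (n in c(250, 500, 1000, 2000)) {
  elapsed <- system.time(sentiment_by(e_data$review[1:n]))["elapsed"]
  cat(n, "reviews:", round(elapsed, 1), "seconds\n")
}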
I glanced through the help file (?sentiment) and it doesn't seem to do anything that depends on pairs of reviews, so that's a bit odd. To confirm that batching doesn't change the results, compare:
library(data.table)
reviews <- iconv(e_data$review, "") # I had a problem with UTF-8, you may not need this
x1 <- rbindlist(lapply(reviews[1:10], sentiment_by))  # scored one review at a time
x1[, element_id := .I]                                # renumber to match the review order
x2 <- sentiment_by(reviews[1:10])                     # scored all ten at once
These produce effectively the same output, which suggests that the sentimentr package has a bug causing it to be unnecessarily slow.
One solution is just to batch the reviews. This will break the 'by' functionality in sentiment_by, but I think you should be able to group them yourself before you send them in (or after, as it doesn't seem to matter; see the sketch after the timing note below).
batch_sentiment_by <- function(reviews, batch_size = 200, ...) {
  review_batches <- split(reviews, ceiling(seq_along(reviews) / batch_size))
  x <- rbindlist(lapply(review_batches, sentiment_by, ...))
  x[, element_id := .I]  # renumber so ids match the original review order
  x[]
}
batch_sentiment_by(reviews)
Takes about 45 seconds on my machine (and should be O(N) for bigger datasets).
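As for regrouping afterwards, a rough sketch: element_id lines up with the original row order, so a grouping column can be attached after the fact (app_version is a placeholder for whatever grouping column your data actually has):
scores <- batch_sentiment_by(reviews)
scores[, group := e_data$app_version[element_id]]  # attach the (placeholder) grouping column
scores[, .(ave_sentiment = mean(ave_sentiment, na.rm = TRUE)), by = group]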

RGooglemaps Random Geocode Error (reproducible code)

I get a random number of failures, where results come back as NA, when I try to geocode a lot of places using the RgoogleMaps getGeoCode function. Can anyone tell me why? (Reproducible code below.)
library(RgoogleMaps)
library(foreach)
### Replicating a large search dataset ###
PlaceVector <- c(rep("Anchorage,Alaska", 20), rep("Baltimore,Maryland", 20),
                 rep("Birmingham,Alabama", 20))
iters <- length(PlaceVector)
### Looping to get each geocode ###
geoadd <- foreach(a = 1:iters, .combine = rbind) %do% {
  getGeoCode(paste(PlaceVector[a]))
}
geoadd <- as.data.frame(geoadd)
geoadd$Place <- PlaceVector
I get a random number of failures, usually around 15, where the latitude and longitude in the data frame geoadd come back as NA. I could loop back over the NAs, but that seems utterly inefficient. Do others have the same problem with the sample code provided?
I get NAs in the example as well. I once had a problem with looping and geocoding: I was hitting Google Maps too fast, or with too many requests within a minimum time frame. I built in a waiting period with Sys.sleep to solve the issue; a rough sketch of that approach follows. The tricky part is finding the right amount of time to wait, which depends on your connection and on Google's response times.
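Something along these lines (the pause length, the retry count, and the helper name geocode_politely are my own placeholders to tune, not anything built into RgoogleMaps):
library(RgoogleMaps)
# Wait between requests and retry rows that come back NA a limited number of times.
geocode_politely <- function(places, pause = 0.5, max_tries = 3) {
  out <- matrix(NA_real_, nrow = length(places), ncol = 2,
                dimnames = list(NULL, c("lat", "lon")))
  for (i in seq_along(places)) {
    for (attempt in seq_len(max_tries)) {
      res <- getGeoCode(places[i])
      if (!any(is.na(res))) { out[i, ] <- res; break }
      Sys.sleep(pause * attempt)  # back off a little more on each retry
    }
    Sys.sleep(pause)              # pause between places to stay under the rate limit
  }
  data.frame(Place = places, out)
}
geoadd <- geocode_politely(PlaceVector)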

use ape to phase a fasta file and create a DNAbin file as output, then test tajima's D using pegas

I'm trying to complete the very simple task of reading in an unphased fasta file and phasing it using ape, and then calculating Tajima's D using pegas, but my data doesn't seem to be reading in correctly. Input and output are as follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the object, but since the data() command was in the manual, I executed
data(DNAbin8c18)
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded:
DNAbin8c18
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
So it looks like the data should be fine. Because of this, I tried what I want to do:
tajima.test(DNAbin8c18)
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs than I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
https://drive.google.com/open?id=0B6qb8IlaQGFZLVRYeXMwRnpMTUU
I have also sent an earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Ella
Thank you for the comment. Indeed, you are correct. The developer just emailed me with the following very helpful comments.
The problem is that your data are too big (too many sequences) and tajima.test() needs to compute the matrix of all pairwise distances. You could check this by trying:
dist.dna(DNAbin8c18, "N")
One possibility for you is to randomly sample some observations and repeat this many times, e.g.:
tajima.test(DNAbin8c18[sample(nrow(DNAbin8c18), size = 1000), ])
This could be:
N <- 1000 # number of repeats
RES <- matrix(NA_real_, nrow = N, ncol = 3)
for (i in 1:N)
  RES[i, ] <- unlist(tajima.test(DNAbin8c18[sample(nrow(DNAbin8c18), size = 10000), ]))
You may adjust N and 'size =' to have something not too long to run. Then you may look at the distribution of the columns of RES.
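For scale, that pairwise distance matrix is exactly what the failed allocation corresponds to:
n_seq <- 817452
n_pairs <- n_seq * (n_seq - 1) / 2  # ~3.3e11 pairwise distances
n_pairs * 8 / 2^30                  # each an 8-byte double: roughly 2489 GiB, matching the error above
And once RES is filled (one row per repeat; tajima.test() returns D, Pval.normal, and Pval.beta), a quick way to look at the distributions might be:
colnames(RES) <- c("D", "Pval.normal", "Pval.beta")
apply(RES, 2, summary)                                      # spread of each statistic across repeats
hist(RES[, "D"], main = "Resampled Tajima's D", xlab = "D")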

R not recognizing Excel cells populated with Bloomberg API code

I have built a dynamically updating spreadsheet in Excel with the BBG add-in that pulls price data using the BBG API. I am trying to pull a table from that sheet into R and create a simple scatterplot using the code below:
library(XLConnect)  # loadWorkbook()/readWorksheet() below come from XLConnect

wb <- loadWorkbook("Fx Vol Framework.xlsx")
data <- readWorksheet(wb, sheet = "Carry", region = "AL40:AN68", header = TRUE, rownames = 1)
plot(data, ylim = c(-2, 12))
with(data, text(data, labels = row.names(data), pos = 1))
reg1 <- lm(data[, 2] ~ data[, 1])
abline(reg1)
The region I am calling (AL40:AN68) is populated with results from an HLOOKUP formula that pulls from cells driven by the BBG API. When I run the code, I get the error below (the same error text repeats for each cell):
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Error when trying to evaluate cell AM41 - Name '_xll.BDP' is completely unknown in the current workbook
If I go back to the Excel sheet and populate that same region AL40:AN68 with numeric values (copy -> paste values), save the workbook, and run the same code, I get the scatterplot I was expecting from the original code. Is there any way for me to get the scatterplot using the cells driven by the Bloomberg API, or do I need to run it with plain numeric values? Do I need the Bbg package for this to work? Thank you.
A simpler approach for other users now might be to use the Rblpapi package from CRAN to connect to the Bloomberg API directly.
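A rough sketch of what that looks like (it needs a running Bloomberg session on the same machine, and the tickers and field below are just placeholders):
library(Rblpapi)
blpConnect()                                    # connect to the local Bloomberg session
tickers <- c("EURUSD Curncy", "USDJPY Curncy")  # placeholder securities
px <- bdp(tickers, "PX_LAST")                   # pull current values straight from the API
px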
I'm not familiar with the BBG add-in, but it seems like after calling readWorksheet you would want to call a function that actually opens the workbook; I think that would, in a sense, "complete the binding". At any rate, I sometimes need to pass data between R and Excel. Here is how I'd tackle the problem using the RDCOMClient package.
R Code:
library(RDCOMClient)
exB <- COMCreate("Excel.Application")                              # start an Excel COM session
book <- exB$Workbooks()$Open("C:/'the right directory/exp.xlsx'")  # open the workbook (path is a placeholder)
dNames <- book$Worksheets("Sheet1")$Range("AL40:AN40")             # header row
dValues <- book$Worksheets("Sheet1")$Range("AL41:AN68")            # data region
dNames <- unlist(dNames[["Value"]])
dValues <- unlist(dValues[["Value"]])
data1 <- matrix(dValues, ncol = 3)
colnames(data1) <- dNames
data1 <- as.data.frame(data1)
plot(data1$v1, data1$v2)  # column names here depend on your header row
Obviously you can plot things, model things or whatever in a number of ways, but this gets things into R which is probably the best place for it. At any rate there is also a good introduction to using the RDCOMClient package to connect R and Excel for quick data tasks at http://www.omegahat.org/RDCOMClient/Docs/introduction.html
