Dear All R Developers,
I maintain the package GENEAread and have recently found a bug that comes from within the function header.info. This function is designed to read in the header information stored in a GENEActiv binary file, produced by the GENEActiv actigraphy watch. This information is stored in the first 100 lines of the binary file.
The part of this function that is reading in values incorrectly relies on the function scan(). Until recently this worked, but the frequency read in by header.info now takes a different form because the output of scan() now varies between calls.
Below is some sample code which demonstrates the issue:
install.packages("GENEAread")
library(GENEAread)
binfile = system.file("binfile/TESTfile.bin", package = "GENEAread")[1]
nobs = 300
info <- vector("list", 15)
# index <- c(2, 20:22, 26:29)
tmpd = readLines(binfile, 300)
# Try to find index positions - so this will accommodate multiple lines in the notes sections.
# Change when a new version of the binfile is produced.
ind.subinfo = min(which((tmpd == "Subject Info") & (1:length(tmpd) >= 37)))
ind.memstatus = max(which(tmpd == "Memory Status"))
ind.recdata = (which(tmpd == "Recorded Data"))
ind.recdata = ind.recdata[ind.recdata > ind.memstatus][1:2]
ind.calibdata = max(which(tmpd == "Calibration Data"))
ind.devid = min(which(tmpd == "Device Identity"))
ind.config = min(which(tmpd == "Configuration Info"))
ind.trial = min(which(tmpd == "Trial Info"))
index = c(ind.devid + 1, ind.recdata[1] + 8, ind.config + 2:3,
          ind.trial + 1:4, ind.subinfo + 1:7, ind.memstatus + 1)
if (max(index) == Inf){
  stop("Corrupt headers or not a GENEActiv file!", call. = FALSE)
}
# Read in header info
nm <- NULL
for (i in 1:length(index)) {
  line = strsplit(tmpd[index[i]], split = ":")[[1]]
  el = ""
  if (length(line) > 1){
    el <- paste(line[2:length(line)], collapse = ":")
  }
  info[[i]] <- el
  nm[i] <- paste(strsplit(line[1], split = " ")[[1]], collapse = "_")
}
info <- as.data.frame(matrix(info), row.names = nm)
colnames(info) <- "Value"
Decimal_Separator = "."
if (length(grep(",", paste(tmpd[ind.memstatus + 8:9], collapse = ""))) > 0){
  Decimal_Separator = ","
}
info = rbind(info,
             Decimal_Separator = Decimal_Separator)
# more here
# if (more){
# grab calibration data etc as well
calibration = list()
fc = file(binfile, "rt")
index = sort(c(ind.config + 4,
               ind.calibdata + 1:8,
               ind.memstatus + 1,
               ind.recdata + 3,
               ind.recdata[1] + c(2, 8)))
#### First appearance of scan() in the function header.info ####
# tmp <- substring(scan(fc,
#                       skip = index[1] - 1,
#                       what = "",
#                       n = 3,
#                       sep = " ",
#                       quiet = TRUE)[3],
#                  c(1, 2, 5),
#                  c(1, 3, 6))

# Isolating scan() and running it multiple times #
scan(fc, skip = index[1] - 1, what = "", n = 3, sep = " ", quiet = TRUE)[3]
scan(fc, skip = index[1] - 1, what = "", n = 3, sep = " ", quiet = TRUE)[3]
scan(fc, skip = (index[1] - 1), what = "", n = 3, sep = " ", quiet = TRUE)[3]

#### Checking the same thing happens with the substring ####
# Checking by using 3.4.3 possibly
substring(scan(fc, skip = index[1] - 1, what = "", n = 3, sep = " ",
               quiet = TRUE)[3],
          c(1, 2, 5), c(1, 3, 6))
substring(scan(fc, skip = index[1] - 1, what = "", n = 3, sep = " ",
               quiet = TRUE)[3],
          c(1, 2, 5), c(1, 3, 6))
substring(scan(fc, skip = index[1] - 1, what = "", n = 3, sep = " ",
               quiet = TRUE)[3],
          c(1, 2, 5), c(1, 3, 6))
Why does the output of scan() vary? I have run the examples given on the scan help page and the output is the same when the code is run more than once. What in the build-up to running this function can cause the output to vary?
Any help would be much appreciated.
You opened the fc connection using
fc = file(binfile, "rt")
This means scan() will read from it and leave it open, with the file pointer advanced to the end of the read. Each time you call scan(), you are reading a later part of the file. That's why the results vary.
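You can see the same behaviour with a plain text file. A minimal sketch (using a throwaway temporary file rather than a GENEActiv binary):
tf <- tempfile()
writeLines(as.character(1:10), tf)
con <- file(tf, "rt")                      # opened connection: position persists
scan(con, what = "", n = 3, quiet = TRUE)  # "1" "2" "3"
scan(con, what = "", n = 3, quiet = TRUE)  # "4" "5" "6" - the read continues
close(con)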
If you want to always read the same part of the file, you would do it something like this:
seek(fc, 0)
scan(fc, ...)
seek(fc, 0)
scan(fc, ...)
Alternatively, don't open fc when you create it, and scan() will open and close it each time. You do this by writing
fc <- file(binfile) # No open specified
Or even more simply (but a tiny bit less efficiently)
fc <- binfile
which will create a new connection each time.
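Continuing the sketch above, an unopened connection gives the repeatable reads you were expecting:
con <- file(tf)                            # not opened: scan() opens and closes it on each call
scan(con, what = "", n = 3, quiet = TRUE)  # "1" "2" "3"
scan(con, what = "", n = 3, quiet = TRUE)  # "1" "2" "3" again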
I am using the following script for an eQTL analysis. The gene expression data in input.xlsx has genes as rows, expression values in columns, and sample names as column headings. I also have a VCF file, sample.vcf, with the same samples as the gene expression data.
library(readxl)
library(vcfR)
library(tidyverse)
library(MatrixEQTL)
all = read_xlsx('input.xlsx')
ge = all
ge[, 2:ncol(ge)] = lapply(ge[, 2:ncol(ge)], as.numeric)
zeroes = apply(ge[, 2:ncol(ge)], 1, function(x) {sum(x, na.rm = T) == 0})
ge = ge[!zeroes, ]
genes = ge$gene
ge = as.matrix(ge[, -1])
rownames(ge) = genes
write.table(ge, 'ge.txt', sep = '\t', row.names = T)
rm(all, ge)
data.lines = length(count.fields('sample.vcf', comment.char = '#'))
chunk.n = 1e5
start.seq = seq(0, data.lines-1, by = chunk.n)
for (n in start.seq){
  first = ifelse(n == 0, T, F)
  lines = ifelse(n == max(start.seq), (data.lines %% chunk.n) - 1, chunk.n)
  tmp = read.vcfR('sample.vcf', nrows = lines, skip = n)
  gt = extract.gt(tmp, as.numeric = T)
  write.table(gt, 'snp.txt', sep = '\t',
              row.names = T, col.names = first,
              append = !first)
}
vcf = bedr::read.vcf('sample.vcf')
snp_pos = bedr::vcf2bed(vcf)
snp_pos = snp_pos %>%
  separate(col = 'chr', into = c('snpid', 'chr'), sep = 'ch', convert = T) %>%
  mutate(snpid = 1:nrow(snp_pos)) %>%
  rename(pos = start) %>%
  select(!end)
snp_pos = snp_pos[, c('snpid', 'chr', 'pos')]
ge = SlicedData$new()
ge$LoadFile('ge.txt', delimiter = '\t')
snps = SlicedData$new()
snps$LoadFile('newFile', delimiter = '\t', skipColumns = 186)
colnames(snps) = colnames(ge)
Matrix_eQTL_main(snps = snps, gene = ge,
                 snpspos = snp_pos, genepos = gene_pos,
                 useModel = modelANOVA,
                 output_file_name = 'eQTL_results.txt',
                 output_file_name.cis = 'eQTL_cis_results.txt',
                 pvOutputThreshold.cis = 1e-3,
                 pvOutputThreshold = 1e-3)
But it gives an out-of-memory error:
CONVERT VCF TO BED
slurmstepd: error: Detected 1 oom-kill event(s) in step. Some of your processes may have been killed by the cgroup out-of-memory handler.
I think this error is from this script line:
vcf = bedr::read.vcf('sample.vcf')
That makes sense, as the VCF file is large and this line tries to read the entire file at once. I was wondering if there is a way to add another loop for this step so that it reads the VCF file in small chunks for the conversion. Thank you for the help!
The following code works on the example VCF file here: https://www.internationalgenome.org/wiki/Analysis/vcf4.0/
The vcfR object is an S4 object, meaning you can access its components using slot and slot<-.
data.lines = length(count.fields('example.vcf', comment.char = '#'))
chunk.n = 2
start.seq = seq(0, data.lines-1, by = chunk.n)
sample_vcf = read.vcfR('example.vcf', nrows = 2)
slotNames(sample_vcf)
# [1] "meta" "fix" "gt"
vcf = new("vcfR")
slot(object = vcf, name = "meta") = slot(object = sample_vcf, name = "meta")
for (n in start.seq) {
  first = n == 0
  lines = ifelse(n == max(start.seq), (data.lines %% chunk.n), chunk.n)
  # each chunk
  ind = which(n == start.seq)
  sample_vcf = vcfR::read.vcfR('example.vcf',
                               nrows = lines, skip = n)
  # partial matrices of the whole file
  if (first) {
    slot(object = vcf, name = "fix") = slot(object = sample_vcf, name = "fix")
    slot(object = vcf, name = "gt") = slot(object = sample_vcf, name = "gt")
  } else {
    slot(object = vcf, name = "fix") = rbind(
      slot(object = vcf, name = "fix"),
      slot(object = sample_vcf, name = "fix"))
    slot(object = vcf, name = "gt") = rbind(
      slot(object = vcf, name = "gt"),
      slot(object = sample_vcf, name = "gt"))
  }
}
# check that worked okay
stopifnot(validObject(vcf))
nrow(slot(object = vcf, name = "fix"))
# [1] 5
As you import each chunk of the file, bind it into the object. The file I used had slots meta, fix and gt. It would be worth checking that these are the only slots present before running the loop on the whole file.
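A minimal guard for that check might be:
stopifnot(identical(slotNames(sample_vcf), c("meta", "fix", "gt")))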
I think your calculation of lines is incorrect; you are not counting from 0, you are skipping 0, so the last chunk contains data.lines %% chunk.n rows, not data.lines %% chunk.n - 1.
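A quick worked example of the chunk sizes (this still assumes data.lines is not an exact multiple of chunk.n):
data.lines = 5; chunk.n = 2
start.seq = seq(0, data.lines - 1, by = chunk.n)  # 0 2 4
sapply(start.seq, function(n)
  ifelse(n == max(start.seq), data.lines %% chunk.n, chunk.n))  # 2 2 1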
The next step you have uses bedr::read.vcf. Unfortunately this creates a pseudo-class list object with elements header and vcf and attribute vcf. It seems that we can convert the vcfR object into a vcf object which can then be processed. The header contains parsed meta information, but it does not appear to be needed by your workflow.
#' @title Convert vcfR to vcf
#' @param x a vcfR object
#' @return a vcf object
as.vcf.vcfR <- function(x) {
  vcf <- list(
    # TODO implement parsing if needed downstream
    header = list(meta = slot(object = x, name = "meta")),
    vcf = cbind(
      data.table::data.table(slot(object = x, name = "fix")),
      data.table::data.table(slot(object = x, name = "gt"))))
  attr(vcf, which = "vcf") <- TRUE
  vcf$vcf$POS <- as.numeric(vcf$vcf$POS)
  vcf
}
vcf2 = as.vcf.vcfR(vcf)
snp_pos = bedr::vcf2bed(vcf2)
# CONVERT VCF TO BED
# Warning messages:
# 1: In bedr::vcf2bed(vcf2) :
# ALT contains a comma and the variant length was decided based on the first element of ALT.
# 2: In bedr::vcf2bed(vcf2) :
# ALT contains a comma and the variant length was decided based on the first element of ALT.
snp_pos
# chr start end
# 1 20 14369 14370
# 2 20 17329 17330
# 3 20 1110695 1110696
# 4 20 1230236 NA
# 5 20 1234566 1234567
I'm using the raster package on an R server to process a large set (30000 files) of data files (10MB each).
For now, processing consists of parsing the data and subsequently rasterizing it via the rasterize function.
The data is very sparse (only along roads) but has a high resolution and large extent. I've seen temporary files of 30GB for a raster created from one of the input files.
Because of the amount of files I'm using a foreach() %dopar% approach to processing the files, giving one file to each thread. I've set the raster options as follows:
rasterOptions(maxmemory = 15000000000)
rasterOptions(chunksize = 14000000000)
rasterOptions(todisk = TRUE)
This should come out to 15GB/thread * 32 threads = 480GB of RAM used at maximum for the rasters. Add some overhead and I would expect somewhere between 10GB and 20GB of the 512GB RAM to remain. However, that is not the case and I can't seem to figure out why.
R gobbles up RAM until only 100MB to 2GB remain, and only then seems to release previously allocated memory, which is then fed straight back into R for the next raster. I checked the RAM usage repeatedly over several hours to observe this.
I'm using SpatialPolygonsDataFrame objects as input for rasterize, and suspected they might take up a lot of RAM as well. But when I checked their size, they were rather small, at about 100MB. Playing around with maxmemory, chunksize and only 16 threads also didn't seem to have any effect.
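(Object sizes can be checked with something like the following, where x is a placeholder for whichever input object you suspect:
format(object.size(x), units = "MB"))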
I also looked at the rasterize source code to see if I could find an explanation there, but that didn't get me far:
setMethod('rasterize', signature(x='SpatialPoints', y='Raster'),
  function(x, y, field, fun='last', background=NA, mask=FALSE, update=FALSE,
           updateValue='all', filename="", na.rm=TRUE, ...){
    .pointsToRaster(x, y, field=field, fun=fun, background=background,
                    mask=mask, update=update, updateValue=updateValue,
                    filename=filename, na.rm=na.rm, ...)
  }
)
I have no clue where to find .pointsToRaster.
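(As an aside, unexported functions like this can usually be inspected with getAnywhere() or the ::: operator, assuming the raster package is installed:
getAnywhere(".pointsToRaster")
raster:::.pointsToRaster)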
Does anyone have an explanation for this behaviour or ideas for things to check? Did I simply overlook something? I'd like not to use up the entire RAM so that other users can still work on the server. From what I understand, my code should regulate how much RAM is used.
Here's the code I use:
library('iterators')
library('parallel')
library('foreach')
library('doParallel')
#init parallelisation
nCores = 32
cCluster = makeCluster(nCores, type = "FORK", outFile = "parseProcess")
registerDoParallel(cCluster)
foreach(j = 1:length(fileList)) %dopar% {
  # load all libraries for every thread
  library('sp')
  library('raster')
  library('spatial')
  library('gstat')
  library('rgdal')
  library('dismo')
  library('deldir')
  library('rgeos')
  library('sjmisc')
  # set raster options per thread
  rasterOptions(maxmemory = 15000000000)
  rasterOptions(chunksize = 14000000000)
  rasterOptions(todisk = TRUE)
  tmpFolder = paste0("[PATH TO STORAGE]/rtmp", j)
  dir.create(tmpFolder)
  rasterOptions(tmpdir = tmpFolder)
  # generate names for raster files
  fileName = basename(fileList[j])
  print(paste("Processing:", fileName))
  rNameMax0 = sub(pattern = ".bin", replacement = "_scan0_max.tif", fileName)
  # repeat this for all 11 scans
  rasterStorage = "[PATH TO OTHER STORAGE]" # path to raster folder
  scanList = parseFile(fileList[j]) # any memory allocated in this function should be released on function return
  # create template raster
  bounds = as.vector(t(bbox(scanList$scan0)))
  resolution = c(0.0000566, 0.0000359)
  tmp = raster(xmn = bounds[1], xmx = bounds[2], ymn = bounds[3], ymx = bounds[4], res = resolution)
  # create rasters from data
  coordinates(scanList$scan0) = ~Long+Lat
  proj4string(scanList$scan0) = WGS84CRS
  rScanMax0 = rasterize(scanList$scan0, tmp, fun = 'max', filename = paste0(rasterStorage, rNameMax0))
  rm('rScanMax0')
  # repeat for scans 1 to 4
  removeTmpFiles(h = 0.2)
  unlink(tmpFolder, recursive = TRUE, force = TRUE)
  dir.create(tmpFolder)
  rasterOptions(tmpdir = tmpFolder)
  coordinates(scanList$scan5) = ~Long+Lat
  proj4string(scanList$scan5) = WGS84CRS
  rScanMax5 = rasterize(scanList$scan5, tmp, fun = 'max', filename = paste0(rasterStorage, rNameMax5))
  rm('rScanMax5')
  # repeat for scans 6 to 10
  removeTmpFiles(h = 0.2)
  unlink(tmpFolder, recursive = TRUE, force = TRUE)
}
stopCluster(cCluster)
Here's the (gutted) code of the parseFile function:
parseFile = function(fileName){
  con = file(fileName, "rb")
  intSize = 4
  fileEndian = "little"
  # create data frames for each scan
  scan0 = data.frame(matrix(ncol = n1, nrow = 0))
  colnames(scan0) = c("Lat", "Long", ...)
  scan1 = data.frame(matrix(ncol = n2, nrow = 0))
  colnames(scan1) = c("Lat", "Long", ...)
  scan2 = data.frame(matrix(ncol = n3, nrow = 0))
  colnames(scan2) = c("Lat", "Long", ...)
  scan3 = data.frame(matrix(ncol = n4, nrow = 0))
  colnames(scan3) = c("Lat", "Long", ...)
  scan4 = data.frame(matrix(ncol = n5, nrow = 0))
  colnames(scan4) = c("Lat", "Long", ...)
  scan5 = data.frame(matrix(ncol = n6, nrow = 0))
  colnames(scan5) = c("Lat", "Long", ...)
  scan6 = data.frame(matrix(ncol = n7, nrow = 0))
  colnames(scan6) = c("Lat", "Long", ...)
  scan7 = data.frame(matrix(ncol = n8, nrow = 0))
  colnames(scan7) = c("Lat", "Long", ...)
  scan8 = data.frame(matrix(ncol = n9, nrow = 0))
  colnames(scan8) = c("Lat", "Long", ...)
  scan9 = data.frame(matrix(ncol = n10, nrow = 0))
  colnames(scan9) = c("Lat", "Long", ...)
  scan10 = data.frame(matrix(ncol = n11, nrow = 0))
  colnames(scan10) = c("Lat", "Long", ...)
  header = readBin(con, raw(), n = 36)
  i = 1
  while(i){
    blockHeader = readBin(con, integer(), n = 3, size = intSize, endian = fileEndian)
    if(...){ # check whether the file ended
      break
    }
    i = i + 1
    # sort data into the correct scan, assign GPS tag
    blockTrailer = readBin(con, raw(), n = 8)
  }
  # clean up
  close(con)
  # return parsed data
  returnList = list("scan0" = scan0, "scan1" = scan1, "scan2" = scan2, "scan3" = scan3, "scan4" = scan4,
                    "scan5" = scan5, "scan6" = scan6, "scan7" = scan7, "scan8" = scan8, "scan9" = scan9, "scan10" = scan10)
  return(returnList)
}
I'm also looking at the solutions posted here as another approach, but I'd still like to know why my code doesn't work as I expect it to.
I'm trying to output some of my code results in knitr. Now the strange thing is that the code generates the error in the title. But running round_any() separately and outputting it in knitr is fine.
knitr code
```{r, echo = FALSE, message=FALSE, warning=FALSE}
source("BooliQuery.R")
BooliQuery()
```
My code
library(digest)
library(stringi)
library(jsonlite)
library(plyr)
BooliQuery <- function(area = "stockholm", type = "lägenhet", sincesold = "", FUN = "", limit = 250, offset = 0, mode = 1) {
  # raw data fetch + adjust
  lOriginal <- GETAPI(area, type, sincesold, FUN, limit, offset)
  lOriginal$AreaSize <- round_any(lOriginal$livingArea, 10, floor)
  lOriginal$PriceDiff <- lOriginal$soldPrice - lOriginal$listPrice
  # create frame overview
  Overview.Return <- Frame.Overview(lOriginal)
  # mode - return selector
  if (mode == 1) return(Overview.Return) else return(lOriginal)
}
Frame.Overview <- function(lOriginal) {
  # aggregate mean
  listPrice <- aggregate(lOriginal, list(lOriginal$AreaSize), FUN = mean, na.rm = TRUE)
  colnames(listPrice)[1] <- "SegGroup"
  listPrice <- listPrice[, c("SegGroup", "listPrice", "soldPrice", "PriceDiff", "rent", "livingArea", "constructionYear")]
  # perform rounding
  listPrice[, c(2:5)] <- round(listPrice[, c(2:5)], digits = 0)
  listPrice[, 6] <- round(listPrice[, 6], digits = 1)
  listPrice[, 7] <- signif(listPrice[, 7], digits = 4)
  return(listPrice)
}
GETAPI <- function(area = "stockholm", type = "lägenhet", sincesold = "", FUN = "", limit = 250, offset = 0) {
  # ID info
  key <- "PRIVATE KEY"
  caller.ID <- "USERNAME"
  #//
  unix.timestamp <- as.integer( as.POSIXct(Sys.time()) )
  random.string <- stri_rand_strings(n = 1, length = 16)
  # SHA1 hash: CallerID + time + key + unique, 40-char hexadecimal
  hash.string <- paste0(caller.ID, unix.timestamp, key, random.string)
  hash.sha1 <- digest(hash.string, "sha1", serialize = FALSE)
  # create URL
  api.string <- "https://api.booli.se/sold?q="
  url.string <- paste0(api.string, area, "&objectType=", type, "&minSoldDate=", sincesold, FUN,
                       "&limit=", limit, "&offset=", offset, "&callerId=", caller.ID, "&time=",
                       unix.timestamp, "&unique=", random.string, "&hash=", hash.sha1)
  # parse JSON
  parsed.JSON <- fromJSON(txt = url.string)
  return(parsed.JSON$sold)
}
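For context, round_any() used in BooliQuery() comes from plyr (loaded above); a minimal illustration of what the AreaSize line computes:
library(plyr)
round_any(137, 10, floor)  # 130 - rounds down to the nearest multiple of 10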
Running the code separately in the console is fine. So what could be wrong?
I am using the PerformanceAnalytics package to analyze some monthly returns. The charts.RollingRegression function should plot the n-month rolling regression against some benchmark.
The data is just six return series from April 2008 to December 2014, which I am trying to regress against SPY.
indexReturns <- read.table("quantIndices.csv", stringsAsFactors = FALSE, sep = ",", fill = TRUE, row.names = 1, header=TRUE)
hfIndexReturns <- read.table("quantHFIndices.csv", stringsAsFactors = FALSE, sep = ",", fill = TRUE, row.names = 1, header=TRUE)
peerReturns <- read.table("quantPeers.csv", stringsAsFactors = FALSE, sep = ",", fill = TRUE, row.names = 1, header=TRUE)
splits <- as.data.frame(strsplit(rownames(indexReturns), "/"))
rownames(indexReturns) <- unname(sapply(splits, function(x) paste0(x[3], "-", x[1], "-", x[2])))
splits <- as.data.frame(strsplit(rownames(peerReturns), "/"))
rownames(peerReturns) <- unname(sapply(splits, function(x) paste0(x[3], "-", x[1], "-", x[2])))
Ret <- xts(peerReturns, order.by = as.Date(row.names(peerReturns)))
Rb <- xts(indexReturns, order.by = as.Date(row.names(indexReturns)))
charts.RollingRegression(Ret, Rb[,2, drop = FALSE], Rf = 0.001, na.pad = TRUE)
This produces the following chart:
I would like it to omit the (meaningless) first 12 months, but there is no documentation on how to do this, and any other depiction of this chart I can find looks like this:
Looking at the source, in the main meat of the function, I see:
for (column.a in 1:columns.a) {
  for (column.b in 1:columns.b) {
    merged.assets = merge(Ra.excess[, column.a, drop = FALSE],
                          Rb.excess[, column.b, drop = FALSE])
    if (attribute == "Alpha")
      column.result = rollapply(na.omit(merged.assets), width = width,
        FUN = function(x) lm(x[, 1, drop = FALSE] ~ x[, 2, drop = FALSE])$coefficients[1],
        by = 1, by.column = FALSE, fill = na.pad, align = "right")
    if (attribute == "Beta")
      column.result = rollapply(na.omit(merged.assets), width = width,
        FUN = function(x) lm(x[, 1, drop = FALSE] ~ x[, 2, drop = FALSE])$coefficients[2],
        by = 1, by.column = FALSE, fill = na.pad, align = "right")
    if (attribute == "R-Squared")
      column.result = rollapply(na.omit(merged.assets), width = width,
        FUN = function(x) summary(lm(x[, 1, drop = FALSE] ~ x[, 2, drop = FALSE]))$r.squared,
        by = 1, by.column = FALSE, align = "right")
    column.result.tmp = xts(column.result)
    colnames(column.result.tmp) = paste(columnnames.a[column.a],
                                        columnnames.b[column.b], sep = " to ")
    column.result = xts(column.result.tmp, order.by = time(column.result))
    if (column.a == 1 & column.b == 1)
      Result.calc = column.result
    else Result.calc = merge(Result.calc, column.result)
  }
}
And we can see there is no na.pad being passed for the "R-Squared" attribute, which is why that chart looks the way I would expect the first two to look. I would like to fix this, but I cannot edit the package code. I tried using assignInNamespace, but it doesn't work: the call seems to succeed, yet the function code does not change in the package. I would also like to remove the leading blank space in the graphs. If you know how to edit this, or know any workarounds, please let me know. (And thanks! You guys are gods!)
OH! And PS - Why the heck is my version of the package seemingly the only one that has this problem??? Why don't my graphs look right by default?
EDIT: This is not the only piece of code from this package which is suspect. I keep having things break and not work as documented (Error in R[, nc] - coredata(Rf) : non-numeric argument to binary operator comes up on about every other function call). Does anyone have suggestions for better packages for this type of work?
The subsetting of the data should be done before passing it to the charts.RollingRegression function. The mighty xts provides this functionality:
charts.RollingRegression(Ret["2009-04::",], Rb["2009-04::",2, drop = FALSE], Rf = 0.001, na.pad = TRUE)
You can read more about how to subset with xts by looking at the help page in R via ?subset.xts.
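For example, a small sketch with made-up monthly data:
library(xts)
dates <- seq(as.Date("2009-01-01"), by = "month", length.out = 6)
x <- xts(1:6, order.by = dates)
x["2009-04::"]  # April 2009 onwards - drops the first three months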
Let's break this down a bit:
charts.RollingRegression is just a wrapper that calculates the rolling beta and then plots it.
Here is an example with the rolling alpha and beta:
require(PerformanceAnalytics)
data(managers)
capm_xts = xts(matrix(nrow=nrow(managers),ncol=2),order.by=index(managers))
for(i in 12:nrow(managers)){
  capm_xts[i,] = coef(lm(managers[(i-11):i,1]~managers[(i-11):i,4]))
}
colnames(capm_xts) = c('alpha','beta')
chart.TimeSeries(capm_xts[12:nrow(capm_xts),2])
I have an R script that scrapes tweets using streamR. It's not 100% reproducible because you'll need your own OAuth token from Twitter to run it. Every time I run the script in interactive mode it finishes successfully. However, I am running it in a cron job every hour, and I receive different sporadic error messages, though it also sometimes finishes successfully. I've listed the error messages at the bottom. It appears to be having a problem with the long data.frame call. Since the errors look scrambled and random, it almost seems like the same issue as when you paste a lot of R code into a terminal at once and some characters get scrambled. I've tried using both Rscript and R CMD BATCH in my cron job, but the errors still persist.
library(ROAuth)
library(streamR)
## load oauth token
load("/home/ubuntu/Documents/Scripts/TwitterScrapes/mr_oauth.dat")
keywords <- "switch Sprint"
tweets <- filterStream( file="", track=keywords, oauth=my_oauth, timeout=3400)
tweets <- parseTweets(tweets)
tweets$text <- as.character(tweets$text)
tweets$text <- gsub("\n", "", tweets$text)
tweets$text <- gsub("[^[:alnum:] ]", "", tweets$text)
tweets$description <- gsub("[^[:alnum:] ]", "", tweets$description)
tweets$description <- gsub("\n", "", tweets$description)
tweets$name <- gsub("[^[:alnum:] ]", "", tweets$name)
tweets$location <- gsub("[^[:alnum:] ]", "", tweets$location)
tweets$att <- 0
tweets$sprint <- 0
tweets$verizon <- 0
tweets$tmobile <- 0
tweets$aio <- 0
tweets$cricket <- 0
tweets$gosmart <- 0
tweets$metropcs <- 0
tweets$virgin <- 0
tweets$boost <- 0
tweets$usc <- 0
cleaned <- data.frame(To = "", From = "", Phone.Availability = 0, Phone.Price = 0, Family.Plan = 0, Coverage.Availability = 0, Coverage.Quality = 0,
Customer.Service =0,Data.Plan = 0, Upgrade.Plan = 0, Device.Promo = 0, Service.Promo = 0, Outage = 0, Plan.Price = 0, Wireline = 0,
Wireline.Programming =0, Corportate = 0, Na = 0,Switch.Phrase = "", tweet = tweets$text, Date=tweets$created_at, Location = tweets$location,
S.reviewed = 0, att = tweets$att, verizon = tweets$verizon,sprint = tweets$sprint, tmobile = tweets$tmobile, aio = tweets$aio,
cricket = tweets$cricket, gosmart = tweets$gosmart, metro = tweets$metropcs, boost = tweets$boost, virgin = tweets$virgin, usc = tweets$usc,
Idstr = tweets$id_str, Retweet = tweets$retweeted, Retweet_Count = tweets$retweet_count,In.reply.to.status.id = tweets$in_reply_to_status_id_str,
In.reply.to.id = tweets$in_reply_to_user_id_str, Listed.count = tweets$listed_count,Verified = tweets$verified, Usr.id.str = tweets$user_id_str,
Description = tweets$description, Geo.enabled = tweets$geo_enabled,Usr.created.at = tweets$user_created_at, Statuses.count = tweets$statuses_count,
Followers.count = tweets$followers_count, Favorites = tweets$favourites_count,Name = tweets$name, Lang=tweets$lang, Utc.offset = tweets$utc_offset,
Friends.count = tweets$friends_count, Screen.name = tweets$screen_name, Country.code = tweets$country_code, Country=tweets$country,
Place.type = tweets$place_type, Full.name = tweets$full_name, Place.name = tweets$place_name, Place.id = tweets$place_id, source=tweets$source)
Examples of the errors:
Error in data.frame(To = "", From = "", Phone.Availability = 0, Phone.Price = 0, :
object 'tweetsd_str' not found
Execution halted
Error in `$<-.data.frame`(`*tmp*`, "place_lat", value = c(NaN, NaN)) :
replacement has 2 rows, data has 0
Calls: parseTweets -> $<- -> $<-.data.frame
Execution halted