Store large dataframes in redis through R

I have a number of large dataframes in R which I was planning to store using redis. I am totally new to redis but have been reading about it today and have been using the R package rredis.
I have been playing around with small data and have saved and retrieved small dataframes using the redisSet() and redisGet() functions. However, when it came to saving my larger dataframes (the largest of which has 4.3 million rows and is 365 MB when saved as an .RData file) using the code redisSet('bigDF', bigDF), I get the following error message:
Error in doTryCatch(return(expr), name, parentenv, handler) :
ERR Protocol error: invalid bulk length
In addition: Warning messages:
1: In writeBin(v, con) : problem writing to connection
2: In writeBin(.raw("\r\n"), con) : problem writing to connection
Presumably this is because the dataframe is too large to save. I know that redisSet writes the dataframe as a string, which is perhaps not the best way to do it with large dataframes. Does anyone know of the best way to do this?
EDIT: I have recreated the error by creating a very large dummy dataframe:
bigDF <- data.frame(
  'lots' = rep('lots', 40000000),
  'of' = rep('of', 40000000),
  'data' = rep('data', 40000000),
  'here' = rep('here', 40000000)
)
Running redisSet('bigDF',bigDF) gives me the error:
Error in .redisError("Invalid agrument") : Invalid agrument
the first time; then, running it again immediately afterwards, I get the error
Error in doTryCatch(return(expr), name, parentenv, handler) :
ERR Protocol error: invalid bulk length
In addition: Warning messages:
1: In writeBin(v, con) : problem writing to connection
2: In writeBin(.raw("\r\n"), con) : problem writing to connection
Thanks

In short: you cannot. Redis can store a maximum of 512 MB of data in a String value, and your serialized demo data frame is bigger than that:
> length(serialize(bigDF, connection = NULL)) / 1024 / 1024
[1] 610.352
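If the data really must live in Redis, one workaround (a sketch of mine, not part of the limit itself) is to split the data frame into row chunks that each serialize to well under 512 MB and store each chunk under its own key:
chunk_size <- 5e6                # hypothetical tuning parameter: rows per chunk
starts <- seq(1, nrow(bigDF), by = chunk_size)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, nrow(bigDF))
  redisSet(paste0("bigDF:", i), bigDF[rows, ])   # each chunk fits the limit
}
# Reassemble by walking the same key sequence:
bigDF2 <- do.call(rbind, lapply(seq_along(starts),
                                function(i) redisGet(paste0("bigDF:", i))))
The key names bigDF:1, bigDF:2, ... are my own convention; pick any scheme you can regenerate when reading back.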
Technical background:
serialize is called in the .cerealize function of the package via redisSet and rredis:::.redisCmd:
> rredis:::.cerealize
function (value)
{
    if (!is.raw(value))
        serialize(value, ascii = FALSE, connection = NULL)
    else value
}
<environment: namespace:rredis>
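Note the raw branch: anything already raw is passed through untouched. So, as a sketch (assuming the compressed blob fits under 512 MB, which it easily does for such repetitive demo data), you could compress the serialized object yourself before storing it:
blob <- memCompress(serialize(bigDF, connection = NULL), type = "gzip")
length(blob) / 1024^2                      # compressed size in MB
redisSet("bigDF_gz", blob)                 # raw vector is stored as-is
# Reading back: rredis should return the raw bytes when it cannot
# unserialize them, but verify that with your rredis version.
bigDF2 <- unserialize(memDecompress(redisGet("bigDF_gz"), type = "gzip"))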
Off-topic: why would you store such a big dataset in Redis anyway? Redis is meant for small key-value pairs. On the other hand, I had some success storing big R datasets in CouchDB and MongoDB (with GridFS) by adding the compressed RData there as an attachment.

Related

RODBC connection issue

I am trying to use RODBC to connect to an Access database. I have used the same structure several times in this project with success. However, in this instance it is now failing and I cannot figure out why. The code is not really a reprex, as I can't provide the DB, but...
This works for a single table:
library(magrittr); library(RODBC)
# xWalk_path is simply the path to the .accdb
# xtabs was generated by querying the available tables
x <- 1
tab <- xtabs$TABLE_NAME[x]
temp <- RODBC::odbcConnectAccess2007(xWalk_path) %>%
  RODBC::sqlFetch(., tab, stringsAsFactors = FALSE)
odbcCloseAll()
# that worked perfectly
However, I really want to use this in a function so I can read several similar tables into a list. As a function it does not work:
xWalk_ls <- lapply(seq_along(xtabs$TABLE_NAME),
                   function(x, xWalk_path = xWalk_path, tab = xtabs$TABLE_NAME[x]) {
  # print(tab) # debug code
  temp <- RODBC::odbcConnectAccess2007(xWalk_path) %>%
    RODBC::sqlFetch(., tab, stringsAsFactors = FALSE)
  return(temp)
  odbcCloseAll()
})
# error every time
The above code will return the error:
Warning in odbcDriverConnect(con, ...) :
[RODBC] ERROR: Could not SQLDriverConnect
Warning in odbcDriverConnect(con, ...) : ODBC connection failed
Error in RODBC::sqlFetch(., tab, stringsAsFactors = FALSE) :
first argument is not an open RODBC channel
I am baffled. I accessed the DB to pull the table names and generate the xtabs variable using sqlTables. Also, earlier in my code I used a similar structure (not identical, but with the same core: sqlFetch to retrieve a table into a list) and it worked without a problem. The only difference between then and now: then I was opening and closing different .accdb files but pulling the same table name from each; now I am opening and closing the same .accdb file but pulling a different table name each time.
Am I somehow opening and closing this too fast, and it is getting irritated with me? That seems unlikely, because if I force it to print(tab) as the first line of the function, it only prints the first table name. If it were annoyed about the speed of opening and closing, I would expect it to print two table names before throwing the error.
return() returns its argument and exits the function, so the remaining code (odbcCloseAll()) is never executed and the opened file (the Access DB) remains locked, as you supposed.
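A minimal sketch of one fix, assuming xtabs and xWalk_path exist in the calling environment: close the channel with on.exit(), which runs even if sqlFetch errors, and drop the self-referential default xWalk_path = xWalk_path, which lazy evaluation cannot resolve:
xWalk_ls <- lapply(xtabs$TABLE_NAME, function(tab) {
  ch <- RODBC::odbcConnectAccess2007(xWalk_path)
  on.exit(RODBC::odbcClose(ch), add = TRUE)   # always release the .accdb
  RODBC::sqlFetch(ch, tab, stringsAsFactors = FALSE)
})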

Insert URL path into database using dbplyr

I'm trying to insert a URL into a PostgreSQL database using
db_insert_into(con, "url", "http://www.google.com")
Error in file(fn, open = "r") : cannot open the connection
In addition: Warning message:
In file(fn, open = "r") :
cannot open file 'http:/www.google.com': No such file or directory
How can I solve this?
You need to specify both the table name and the field name. I'm going to guess that "url" is the field name and the table name is as yet undefined here, but it doesn't matter, frankly; take the solution and adapt as needed.
db_insert_into expects its third argument (values) to be a data.frame, or something that can easily be converted to one. So you can probably do:
newdata <- data.frame(url = "http://www.google.com", stringsAsFactors = FALSE)
db_insert_into(con, "tablename", newdata)
If you're lazy or playing code-golf, you might be able to do it with:
db_insert_into(con, "tablename", list(url = "http://google.com"))
since some of the underlying S3 or S4 methods around dbplyr sometimes check if (!is.data.frame(values)) values <- as.data.frame(values). (But I wouldn't necessarily rely on that, it's usually better to be explicit.)

"embedded nul in string" error with SparkR::collect

I'm pulling in data from an API and keep getting the following error.
I put together the SQL query and am connecting to the instance to pull the data. However, when I run collect, it gives me an error.
soql_query = paste("SELECT Id, subject FROM Table")
myDF2 <- read.df(sqlContext, source="...", username=sf_username, password=sf_password, version=apiVersion, soql=soql_query)
temp2 <- SparkR::collect(myDF2)
Error in rawToChar(string) :
embedded nul in string: 'VOID: \xe5,nq\b\x92ƹ\xc8Y\x8b\n\nAdd a new comment by Asako:\0\xb3\xe1\xf3Ȓ\xfd\xa0\bE\xe4\t06/29 09:23'
In addition: Warning message:
closing unused connection 6 (col)
I've gone through and identified which column it is. It contains a lot of string data and sentences, so the error partially makes sense.
I was wondering if there was any way to get around this issue.
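One possible workaround (a sketch, assuming the offending column is subject): strip the nul bytes on the Spark side with regexp_replace before collecting, so rawToChar() in the R driver never sees an embedded \0:
# "\\x00" is a Java-regex escape for the nul byte; R strings themselves
# cannot hold a literal nul, so the replacement happens inside Spark.
myDF2_clean <- SparkR::withColumn(
  myDF2, "subject",
  SparkR::regexp_replace(SparkR::column("subject"), "\\x00", "")
)
temp2 <- SparkR::collect(myDF2_clean)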

Error in fstRead R

I have been using the new 'fst' package in R for a few weeks to write and read tables in the .fst format. Sometimes I cannot read a table that I have just written, getting the following message:
> tab=read.fst("Tables R/tab.fst",as.data.table=TRUE)
Error in fstRead(fileName, columns, from, to) :
Unknown type found in column.
Do you know why this happens? Is there another way to retrieve the table?

Unable to pass R variable values to Redis using the rredis package

I am using the rredis package as a client to Redis, working with the neural-redis module.
I am performing all pre-processing of the data in R and sending the variables from R to the Redis instance using the redisCmd command. I am using the iris dataset.
When I pass the values directly, the module accepts the input. If I pass variables, it says the input is invalid.
NR.OBSERVE works if the values are given individually:
library(rredis)
redisConnect("localhost",6379)
create <- redisCmd('NR.CREATE','net', 'REGRESSOR', '2','3', '->', '1','NORMALIZE','DATASET','50','TEST','10')
obs<- redisCmd('NR.OBSERVE', 'net','1','2','->','3')
NR.OBSERVE, however, does not work if I pass variables containing the values:
library(rredis)
redisConnect("localhost",6379)
create <- redisCmd('NR.CREATE','net', 'REGRESSOR', '2','3', '->', '1','NORMALIZE','DATASET','50','TEST','10')
a<-1
b<-2
c<-3
obs<- redisCmd('NR.OBSERVE', 'net','a','b','->','c')
This throws the following error:
Error in doTryCatch(return(expr), name, parentenv, handler) :
ERR invalid neural network input: must be a valid float precision floating point number
What is the correct way of doing this?
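The quoted names 'a', 'b', 'c' are sent to Redis as the literal strings "a", "b", "c", which are not valid floats. A sketch of the fix: pass the variables' values, converted to character, exactly as in the working call:
# Send the values held by a, b and c rather than their names.
obs <- redisCmd('NR.OBSERVE', 'net',
                as.character(a), as.character(b), '->', as.character(c))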
