How to create data frame for super large vectors?


I have 7 very large vectors, c1 to c7. My task is simply to create a data frame from them. However, when I use data.frame(), I get an error message.
> newdaily <- data.frame(c1,c2,c3,c4,c5,c6,c7)
Error in if (mirn && nrows[i] > 0L) { :
missing value where TRUE/FALSE needed
Calls: data.frame
In addition: Warning message:
In attributes(.Data) <- c(attributes(.Data), attrib) :
NAs introduced by coercion to integer range
Execution halted
They all have the same length (2,626,067,374 elements), and I’ve checked there’s no NA.
I tried subsetting 1/5 of each vector and data.frame() function works fine. So I guess it has something to do with the length/size of the data? Any ideas how to fix this problem? Many thanks!!
Update
Both data.frame and data.table only accept vectors with at most 2^31-1 elements. I still can't find a way to create one super large data.frame, so I subset my data instead... Hopefully longer vectors will be supported in the future.

R's data.frames don't support such long vectors yet.
Your vectors are longer than 2^31 - 1 = 2147483647, which is the largest value that R's 32-bit integer type can represent. Since the data.frame function/class assumes that the number of rows can be represented by an integer, you get an error:
x <- rep(1, 2626067374)
DF <- data.frame(x)
#Error in if (mirn && nrows[i] > 0L) { :
# missing value where TRUE/FALSE needed
#In addition: Warning message:
#In attributes(.Data) <- c(attributes(.Data), attrib) :
# NAs introduced by coercion to integer range
Basically, something like this happens internally:
as.integer(length(x))
#[1] NA
#Warning message:
# NAs introduced by coercion to integer range
As a result the if condition becomes NA and you get the error.
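The limit can be checked without allocating anything large. The following is a minimal sketch using only base R; the variable names (int_max, n_rows, coerced) are made up for illustration:

```r
# Largest value R's 32-bit integer type can hold (2^31 - 1)
int_max <- .Machine$integer.max

# The vector length from the question, stored as a double
n_rows <- 2626067374

# TRUE: the length does not fit into an integer
exceeds_limit <- n_rows > int_max

# Coercion yields NA (normally with a warning, suppressed here);
# an NA in an if() condition is what stops data.frame()
coerced <- suppressWarnings(as.integer(n_rows))
is.na(coerced)
```

Long vectors themselves are supported by R on 64-bit builds; it is the integer row count that gets in the way here.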
Possibly, you could use the data.table package instead. Unfortunately, I don't have sufficient RAM to test:
library(data.table)
DT <- data.table(x = rep(1, 2626067374))
#Error: cannot allocate vector of size 19.6 Gb

For data of that size you have to optimize your memory use, but how?
You need to write these values to a file.
output_name = "output.csv"
# sep (not collapse) joins the vectors element-wise, giving one line per row
lines = paste(c1,c2,c3,c4,c5,c6,c7, sep = ";")
cat(lines, file = output_name , sep = "\n")
But you will probably need to analyse them too, and (as was said before) that requires a lot of memory.
So you have to read the file in batches of lines (say, 20k lines per iteration) to limit RAM usage: analyse the values, save the results, and repeat.
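# open the connection once, so that each readLines() call continues where the previous one stopped
con = file(output_name, open = "r")
while (length(lines_in_this_round <- readLines(con, n = 20000)) > 0) {
# create data.frame, e.g. with read.table(text = lines_in_this_round, sep = ";")
# analyse data
# save result
}
close(con)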
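As a rough end-to-end sketch of this chunked approach: the example below uses three tiny made-up vectors in place of c1 to c7, a temporary file, and a chunk size of 4 instead of 20000, and it assumes the analysis is something that can be accumulated across chunks (here, the mean of the first column). All names other than output_name and lines_in_this_round are hypothetical.

```r
# Tiny stand-ins for the large vectors in the question
c1 <- 1:10; c2 <- (1:10) * 2; c3 <- (1:10) / 2
output_name <- tempfile(fileext = ".csv")
write.table(data.frame(c1, c2, c3), file = output_name, sep = ";",
            row.names = FALSE, col.names = FALSE)

chunk_size  <- 4   # deliberately small to force several rounds
running_sum <- 0   # result accumulated across chunks
rows_seen   <- 0

# Open once in read mode so successive reads continue through the file
con <- file(output_name, open = "r")
repeat {
  lines_in_this_round <- readLines(con, n = chunk_size)
  if (length(lines_in_this_round) == 0) break   # end of file reached
  chunk <- read.table(text = lines_in_this_round, sep = ";")
  running_sum <- running_sum + sum(chunk$V1)
  rows_seen   <- rows_seen + nrow(chunk)
}
close(con)

mean_c1 <- running_sum / rows_seen
```

Only one chunk is held in memory at a time, so peak memory depends on chunk_size rather than on the total number of rows.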
I hope this helps you.

Related

FindVariableFeatures Function in Seurat Producing "Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments"

I am running Seurat V3 in RStudio and attempting to run PCA on a newly subsetted object. As part of that process, I am using the commands:
tnk.cells <- FindVariableFeatures(tnk.cells, assay = "RNA", selection.method = "vst", nfeatures = 2000)
tnk.cells <- RunPCA(tnk.cells, verbose = TRUE, npcs = 30, features = FindVariableFeatures(tnk.cells))
The first process seems to work, but I am unsure if it actually did, and if so, whether I need to specify that "features" in the second command should refer to those features. Either way, every time I attempt to run the second command, it produces this error, along with three warning messages:
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
In addition: Warning messages:
1: In FindVariableFeatures.Assay(object = assay.data, selection.method = selection.method, :
selection.method set to 'vst' but count slot is empty; will use data slot instead
2: In eval(predvars, data, env) : NaNs produced
3: In hvf.info$variance.expected[not.const] <- 10^fit$fitted :
number of items to replace is not a multiple of replacement length
Does anyone have any idea why these errors/warnings are being produced? I have tried coercing the output of FindVariableFeatures as a vector and a dataframe, to no avail. I also want to ask: do I need to rerun FindVariableFeatures after subsetting a new dataset from a larger one?

FindVariableFeatures() returns the updated Seurat object, not a vector of feature names, which is why passing its result to features fails. The variable features are already stored in the Seurat object. You can access them using VariableFeatures(), for example:
library(Seurat)
pbmc_small = SCTransform(pbmc_small)
pbmc_small = FindVariableFeatures(pbmc_small, nfeatures = 20)
head(VariableFeatures(pbmc_small))
[1] "GNLY" "PPBP" "PF4" "S100A8" "VDAC3" "CD1C"
Then you can run PCA like this, although by default it will use the variable features stored in the object anyway:
pbmc_small <- RunPCA(pbmc_small,features = VariableFeatures(pbmc_small))

R: read.table with colClasses gives Error in integer(n) : vector size cannot be NA/NaN

I'm trying to read a simple dataframe into R using read.table. While reading the table I want to specify that the first 3 columns are of type character, while the remaining 4 columns are of type numeric.
I'm specifying the column types to prevent R from dropping the leading 0's in columns 2 and 3, as they're required for DB lookups. Here's what I'm using:
df.img <- read.table('https://gist.githubusercontent.com/duhaime/46dde948263136d0b52be1575232a83e/raw/80f14650e4f4b9ef38a5dec3f5bbb8c62954ee59/match-stats.tsv',
sep='\t',
colClasses=c(replicate('character', 3), replicate('numeric', 4)))
This returns:
Error in integer(n) : vector size cannot be NA/NaN
In addition: Warning message:
In integer(n) : NAs introduced by coercion
Does anyone know how I can update my read.table command to correctly read in my columns with the desired types? Any help would be appreciated!

Aha, I should have been using rep():
df.img <- read.table('https://gist.githubusercontent.com/duhaime/46dde948263136d0b52be1575232a83e/raw/80f14650e4f4b9ef38a5dec3f5bbb8c62954ee59/match-stats.tsv',
sep='\t',
colClasses=c(rep('character', 3), rep('numeric', 4)))
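The difference comes down to argument order. A small sketch in base R (the variable names are made up for illustration): rep() takes the value first and the count second, while replicate() takes the count first, so replicate('character', 3) tries to use the string as a count.

```r
# rep(x, times): value first, count second
col_classes <- c(rep("character", 3), rep("numeric", 4))

# replicate(n, expr): count first, expression second
same_classes <- c(replicate(3, "character"), replicate(4, "numeric"))

# Both spellings build the same 7-element character vector
identical(col_classes, same_classes)

# Swapping the arguments, as in the question, raises an error
err <- tryCatch(suppressWarnings(replicate("character", 3)),
                error = function(e) conditionMessage(e))
```

The exact wording of the error message may differ between R versions.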

Error in 2:n : NA/NaN argument

The code below produces the following error:
Error in 2:n : NA/NaN argument
How can I resolve this error?
library (pdfetch)
library(tidyverse)
library(xts)
tickers<-c("AXP","MMM","BA","CAT","CVX","CSCO","KO","DWDP","AAPL","XOM","GE","GS","HD","IBM","INTC","HPI","AIV","MCD","MRK","MSFT","NKE","PFE","PG","TRV","JPM","UTX","VZ","V","WMT","DIS")
data<-pdfetch_YAHOO(tickers<- c("^DJI","AXP","MMM","BA","CAT","CVX","CSCO","KO","DWDP","AAPL","XOM","GE","GS","HD","IBM","INTC","HPI","AIV","MCD","MRK","MSFT","NKE","PFE","PG","TRV","JPM","UTX","VZ","V","WMT","DIS"),from = as.Date("2015-03-20"),to = as.Date("2018-03-20"),interval='1mo')
# to remove the nas from the entire data
data[complete.cases(data),]
plus<-data[complete.cases(data),]
plus
str(plus)
head(plus)
tail(plus)
class(plus$Date)
(plus[1:10, "^DJI.adjclose",drop=F])
#Create a new data frame that contains the price data with the dates as the row names
prices <- (plus)[, "^DJI.adjclose", drop = FALSE]
rownames(prices) <-plus$Date
head(prices)
tail(prices)
#to find the return from 3/3/2015-3/8/2018
djia_ret1<- ((prices [2:n,1]-prices [1:(n-1),1])/prices [1:(n-1),1])

Error in 2:n : NA/NaN argument.
This means that one (or both) of the two arguments of : is NA or NaN. 2 is not, so n must be.
In your question you don't show how you created the variable n, but if it was the result of some data that was NA, or a division by zero result for example, that would cause these errors.
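One plausible fix, assuming n was meant to be the number of rows in prices, is to define it explicitly before use. The sketch below uses a made-up three-row price column instead of the downloaded data:

```r
# Hypothetical stand-in for the adjusted-close column in the question
prices <- data.frame(adjclose = c(100, 110, 99))

# Define n before using it; here it is the number of rows
n <- nrow(prices)
stopifnot(!is.na(n), n >= 2)

# Simple period-over-period returns: (p_t - p_{t-1}) / p_{t-1}
djia_ret1 <- (prices[2:n, 1] - prices[1:(n - 1), 1]) / prices[1:(n - 1), 1]

# An NA on either side of ':' reproduces the error from the question
msg <- tryCatch(2:NA_integer_, error = function(e) conditionMessage(e))
```

The stopifnot() guard makes a missing or degenerate n fail early with a clearer message than the one raised by the : operator.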

All possible combinations for large numbers in R

Imagine I have a sequence like this:
seq <- rep(0:9, 10)
I want to know all possible combinations of this sequence. Unsurprisingly, the combn command does not work:
> comb <- combn(seq, 10)
Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
In addition: Warning message:
In combn(seq, 10) : NAs introduced by coercion to integer range
Can you give me a hint how to make my own function for all possible combinations?

Based on your reply to the comment, here is one thing you can do. You need the combinat package installed for this to work.
library(combinat)
seq <- c(1,2,3,4,5,6,7,8,9,0)
permn(seq)
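To see why the original combn(seq, 10) call failed while permn() on 10 distinct values is feasible, the sizes involved can be computed directly. A small base R sketch (variable names are illustrative):

```r
# Number of columns combn() would need for 10 out of 100 elements
n_combinations <- choose(100, 10)

# TRUE: far more than an integer-indexed matrix dimension allows
too_many <- n_combinations > .Machine$integer.max

# Number of orderings of 10 distinct values, which is what permn() returns
n_permutations <- factorial(10)
```

The permutation count is large but still fits in memory on most machines, whereas the combination count exceeds the integer range by several orders of magnitude.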

Error with knn function

I am trying to run this line:
knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)
but I always get this error:
Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
Any ideas, please?
PS: mydades.training and mydades.test are defined as follows:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]

I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:
NA/NaN/Inf in foreign function call (arg 6)
makes me suspect that the knn-function call to the C language implementation fails. Many functions in R actually call underlying, more efficient C implementations, instead of having an algorithm implemented in just R. If you type just 'knn' in your R console, you can inspect the R implementation of 'knn'. There exists the following line:
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you have two of the warnings
NAs introduced by coercion
I think two of the as.double/as.integer calls fail, and introduce NA values. If we start counting the parameters, the 6th argument is:
as.double(train)
that may fail in cases such as:
# as.double can not translate text fields to doubles, they are coerced to NA-values:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while the following text is cast to double without an error:
> as.double("1.23")
[1] 1.23
You get two of the coercion warnings, which are probably given by 'as.double(train)' and 'as.double(test)'. Since you did not provide exact details of what 'mydades' looks like, here are some of my best guesses (using artificial multivariate normal data):
library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))
# This breaks knn
mydades[3,4] <- Inf
# This breaks knn
mydades[4,3] <- -Inf
# The two Inf assignments above break knn, but they do not trigger the "NAs introduced by coercion" warning
# Raw text, on the other hand, breaks knn and gives the same warning as in the question
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"
# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"
# ... or wrong decimal symbol?
mydades[3,3] <- "1,23"
# should be 1.23, as R uses '.' as decimal symbol and not ','
# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)
I would not keep both the numeric data and the class labels in a single matrix; perhaps you could split the data as:
mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]
Using calls
str(mydades); summary(mydades)
may also help you/us locate the problematic data entries, so that they can be corrected to numeric values or the non-numeric fields omitted.
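Beyond str() and summary(), the offending columns and entries can be located programmatically. This is a minimal sketch on a made-up data frame (mydades_df and its columns a and b are hypothetical, not the asker's data):

```r
# Hypothetical data: column b was read as text because of bad entries
mydades_df <- data.frame(a = c(1.5, 2.5, 3.5),
                         b = c("1.23", "1,23", "foo"),
                         stringsAsFactors = FALSE)

# Names of the columns that are not numeric
non_numeric_cols <- names(mydades_df)[!sapply(mydades_df, is.numeric)]

# Rows of column b that cannot be converted to a number
bad_rows <- which(is.na(suppressWarnings(as.numeric(mydades_df$b))))
```

Here bad_rows points at the entry with the comma decimal separator and the raw text entry, which are the kinds of values guessed at above.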
The rest of the run code (after breaking the data), as provided by you:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
# 7th column seems to be the class labels
knn(train=mydades.training[,-7],test=mydades.test[,-7],mydades.training[,7],k=5)

Great answer by @Teemu.
As this is a well-read question, I will give the same answer from an analytics perspective.
The KNN function classifies data points by calculating the Euclidean distance between the points. That's a mathematical calculation requiring numbers. All variables in KNN must therefore be coerce-able to numerics.
The data preparation for KNN often involves three tasks:
(1) Fix all NA or "" values
(2) Convert all factors into a set of booleans, one for each level in the factor
(3) Normalize the values of each variable to the range 0:1 so that no variable's range has an unduly large impact on the distance measurement.
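The three tasks could look roughly like this in base R. It is only a sketch on a made-up data frame: dropping incomplete rows is just one way to handle missing values, and the helper normalise() is a hypothetical name, not part of any package.

```r
# Made-up data: one numeric variable with an NA and one factor
df <- data.frame(height = c(150, 160, NA, 180),
                 colour = factor(c("red", "blue", "red", "green")))

# (1) Fix missing values; here incomplete rows are simply dropped
df <- df[complete.cases(df), ]

# (2) One 0/1 indicator column per factor level (no intercept column)
dummies <- model.matrix(~ colour - 1, data = df)

# (3) Min-max normalisation to the range [0, 1]
normalise <- function(x) (x - min(x)) / (max(x) - min(x))
prepared <- cbind(height = normalise(df$height), dummies)
```

The resulting prepared matrix is entirely numeric and can be passed as the train or test argument. Note that normalise() as written divides by zero for a constant column, so such columns would need separate handling.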

I would also point out that the function seems to fail when using integers. I needed to convert everything into "num" type prior to calling the knn function. This includes the target feature, for which most methods in R use the factor type. Thus, as.numeric(my_frame$target_feature) is required.
