Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73
It contains ratings in a file formatted as
userID::movieID::rating::timestamp
Given this, I want to construct a feature matrix in R, where each row corresponds to a user and each column contains the rating that the user gave to that movie (if any).
For example, if the data file contains
1::1::1::10
2::2::2::11
1::2::3::12
2::1::5::13
3::3::4::14
Then the output matrix would look like:
UserID, Movie1, Movie2, Movie3
1, 1, 3, NA
2, 5, 2, NA
3, NA, NA, 4
So is there some built-in way to achieve this in R? I wrote a simple Python script to do the same thing, but I bet there are more efficient ways to accomplish it.
You can use the dcast function, in the reshape2 package, but the resulting data.frame may be huge (and sparse).
d <- read.delim(
"u1.base",
col.names = c("user", "film", "rating", "timestamp")
)
library(reshape2)
d <- dcast( d, user ~ film, value.var = "rating" )
If your fields are separated by double colons, you cannot pass "::" to the sep argument of read.delim, because sep must be a single character.
If you already do some preprocessing outside R, it is easier to do it there (e.g., in Perl, it would just be s/::/\t/g), but you can also do it in R: read the file as a single column, split the strings, and concatenate the result.
d <- read.delim("a")
d <- as.character( d[,1] ) # vector of strings
d <- strsplit( d, "::" ) # List of vectors of strings of characters
d <- lapply( d, as.numeric ) # List of vectors of numbers
d <- do.call( rbind, d ) # Matrix
d <- as.data.frame( d )
colnames( d ) <- c( "user", "movie", "rating", "timestamp" )
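An alternative sketch (keeping "a" as the placeholder file name from above): substitute the "::" separators in R and let read.table do the parsing in one pass.
d <- read.table(text = gsub("::", "\t", readLines("a")),
                sep = "\t",
                col.names = c("user", "movie", "rating", "timestamp"))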
From the web site pointed to in a previous question, it appears that you want to represent
> print(object.size(integer(10000 * 72000)), units="Mb")
2746.6 Mb
which should be 'easy' with the 8 GB you reference in another question. Also, the total length is less than the maximum vector length in R, so that should be OK too. But see the end of the response for an important caveat!
I created, outside R, a tab-delimited version of the data file. I then read in the information I was interested in
what <- list(User=integer(), Film=integer(), Rating=numeric(), NULL)
x <- scan(fl, what)
The 'NULL' drops the unused timestamp data. The 'User' and 'Film' entries are not sequential, and numeric() on my platform takes up twice as much memory as integer(), so I converted User and Film to factor, and Rating to integer() by doubling (original scores are 1 to 5 in increments of 1/2).
x <- list(User=factor(x$User), Film=factor(x$Film),
Rating=as.integer(2 * x$Rating))
I then allocated the matrix
ratings <- matrix(NA_integer_ ,
nrow=length(levels(x$User)),
ncol=length(levels(x$Film)),
dimnames=list(levels(x$User), levels(x$Film)))
and used the fact that a two-column matrix can be used to index another matrix
ratings[cbind(x$User, x$Film)] <- x$Rating
This is the step where memory use is at its maximum. I'd then remove the unneeded variable
rm(x)
The gc() function tells me how much memory I've used...
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 140609 7.6 407500 21.8 350000 18.7
Vcells 373177663 2847.2 450519582 3437.2 408329775 3115.4
... a little over 3 Gb, so that's good.
Having done that, you'll now run into serious problems. kmeans (from your response to an earlier answer) will not work with missing values:
> m = matrix(rnorm(100), 5)
> m[1,1]=NA
> kmeans(m, 2)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
and as a very rough rule of thumb I'd expect ready-made R solutions to require 3-5 times as much memory as the starting data size. Have you worked through your analysis with a smaller data set?
Quite simply, you can represent it as a sparse matrix, using sparseMatrix from the Matrix package.
Just create a three-column coordinate list, i.e. triplets of the form (i, j, value), say in a data.frame named myDF. Then, execute mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x, dims = c(numRows, numCols)) - you need to decide the number of rows and columns, otherwise the maximum indices will be used to decide the size of the matrix.
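For example, a minimal sketch using the ratings from the question (the data frame name and the 3 x 3 dimensions are just illustrative; note that unrated cells are stored as implicit zeros, not NA):
library(Matrix)
myDF <- data.frame(i = c(1, 2, 1, 2, 3),   # user IDs
                   j = c(1, 2, 2, 1, 3),   # movie IDs
                   x = c(1, 2, 3, 5, 4))   # ratings
mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x,
                            dims = c(3, 3))
mySparseMat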
It's just that simple. Storing sparse data in a dense matrix is inappropriate, if not grotesque.
Related
I am trying to run a for loop to extract multiple substrings in order from a FASTA sequence.
Here is an example (of course the real one is more than 10 thousand bases long):
eg <- "ATCGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
And here is my code
forsubseq <- function(dna){
sta <- for (i in 1:floor(width(dna)/100)) {
seqGC <- Biostrings::subseq(dna, start = 100*i - 99, width = 100) %>%
Biostrings::letterFrequency(letters = "GC", as.prob = TRUE)
}
return(sta)
}
forsubseq(eg)
However, nothing happened after running it, which really confused me. What I want is to compute the GC content for each 100 bp window.
Could anyone kindly offer advice? Thanks.
The library Biostrings is not available for the most recent version of R, but one simplified approach would be to split eg at every n-th character and then use lapply to analyze the pieces. In this example I counted the number of "GC" pairs using str_count, since I don't have the Biostrings library, but you can swap in the Biostrings::letterFrequency function:
eg <- "ACGACGTCGATGCTGATCGATCGATCGATCGTCAGATCGATCAG"
n <- 10 # you would change to 100
blocks <- seq(1, nchar(eg), n) # prep to separate every n base pairs
splits <- substring(eg, blocks, blocks + n - 1) # separate every n base pairs
lapply(splits,
function(x) stringr::str_count(x, "GC")) # replace with Biostrings::letterFrequency
The output is a list counting the number of "GC" pairs for each block of n characters (here, 10). If you want a vector of integers representing these data, simply wrap the lapply call in unlist(): unlist(lapply(...)).
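Note that str_count(x, "GC") counts literal "GC" substrings rather than GC content; if you actually want the fraction of G or C bases per block, a small base-R sketch reusing splits from above could be:
gc_fraction <- function(s) {
  bases <- strsplit(s, "")[[1]]      # split the block into single characters
  mean(bases %in% c("G", "C"))       # proportion of bases that are G or C
}
sapply(splits, gc_fraction)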
I am trying to create 1000 variables, which I want to name with the index number. I don't know how to create these new variables.
for(i in 1:1000) {
Ui <- rnorm(200,0,1)
}
This is a common sort of thing that people want to do, especially when they are coming from other programming languages. However, there are better ways to accomplish the same thing, and you should not follow recommendations to use assign; that is bad advice that you will likely regret later on.
The way we do this sort of thing in R is to use lists, specifically named lists:
x <- replicate(1000,rnorm(200,0,1),simplify = FALSE)
x <- setNames(x,paste0("A",seq_along(x)))
Now x is a named list of length 1000, each element is a vector of length 200 from a normal(0,1) distribution.
You can refer to each one via x[[1]] or x[["A1"]] as needed. Additionally, since they are all in the same object, you can operate on them easily as a group using tools like lapply.
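For example (a small sketch), you can summarize or transform all 1000 vectors in one go:
means <- sapply(x, mean)                           # named vector, one mean per "A1" .. "A1000"
head(means)
x_centered <- lapply(x, function(v) v - mean(v))   # same structure, centered vectors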
Pretty much any time you find yourself wanting to create a sequence of objects with similar names, that should be a signal to you that you should be using a list instead.
There is no point in cluttering your environment with so many variables; try to store them in a named list instead:
l1 <- setNames(lapply(1:5, function(x) rnorm(5)), paste0("A", 1:5))
l1
#$A1
#[1] 0.4951453 -1.4278665 0.5680115 0.3537730 -0.7757363
#$A2
#[1] -0.11096037 0.05958700 0.02578168 1.00591996 0.54852030
#$A3
#[1] 0.1058318 0.6988443 -0.8213525 -0.1072289 0.8757669
#$A4
#[1] -0.6629634 0.8321713 -0.3073465 -0.2645550 -1.0064132
#$A5
#[1] 2.2191246 0.2054360 -0.1768357 1.6875302 -1.1495807
Now you can access individual list elements as
l1[["A1"]]
#[1] 0.4951453 -1.4278665 0.5680115 0.3537730 -0.7757363
Another option is to generate all the numbers at once and then split them into a list:
groups = 5
each = 5
setNames(split(rnorm(groups * each), rep(seq_len(groups), each = each)),
paste0("A", seq_len(groups)))
I agree with the others that this is not a good idea. Anyway, to answer your question, this is how you would do it:
k <- 1000 # number of variables
n <- 200 # sample size of each variable
for(i in 1:k){
assign(paste0("variable", i), rnorm(n, 0, 1))}
variable1
-0.012947062 0.728284959 -1.627796366 0.003471491 ...
However, personally I would prefer another solution. Both answers so far suggest using lists. I find lists quite cumbersome, especially if you are new to R, so I would suggest creating a matrix where every column contains one variable.
# creates a matrix
m <- matrix(rep(NA, n*k), ncol= k)
# generates rnorm() in each column
for(i in 1:k){
m[ , i] <- rnorm(n, 0, 1)
}
# now you can name the columns
colnames(m) <- paste0("variable", 1:k)
m
variable1 variable2 ...
[1,] 0.30950749 -2.07388046
[2,] -1.13232330 -0.55511476
...
I have a list of large (35000 x 3) matrices in R and I want to combine them into a single matrix but it would be about 1 billion rows long and would exceed the maximum object size in R.
The bigmemory package allows for larger matrices but doesn't appear to support rbind to put multiple matrices together.
Is there some other package or technique that supports the creation of a very large matrix from smaller matrices?
Also, before you ask: this is not a RAM issue, simply an R limitation, even on 64-bit R.
You could implement it with a loop:
library(bigmemory)
## Reproducible example
mat <- matrix(1, 50e3, 3)
l <- list(mat)
for (i in 2:100) {
l[[i]] <- mat
}
## Solution
m <- ncol(l[[1]]) ## assuming that all have the same number of columns
n <- sum(sapply(l, nrow))
bm <- big.matrix(n, m)
offset <- 0
for (i in seq_along(l)) {
mat_i <- l[[i]]
n_i <- nrow(mat_i)
ind_i <- seq_len(n_i) + offset
bm[ind_i, ] <- mat_i
offset <- offset + n_i
}
## Verif
stopifnot(offset == n, all(bm[, 1] == 1))
Not quite an answer, but a little more than a comment: are you sure that you can't do it by brute force? R now has long vectors (since version 3.0.0; the question you link to refers to R version 2.14.1): from this page,
Arrays (including matrices) can be based on long vectors provided each of their dimensions is at most 2^31 - 1: thus there are no 1-dimensional long arrays.
while the underlying atomic vector can go up to 2^52 - 1 elements ("in theory .. address space limits of current CPUs and OSes will be much smaller"). That means you should in principle be able to create a matrix with as many as 2^31 - 1 (about 2.1 billion) rows; since the maximum length of a long vector is on the order of 10^15 elements (i.e. literally millions of billions), a matrix of 1 billion rows and 3 columns should (theoretically) not be a problem.
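If you do want to try brute force, a hedged sketch (assuming the matrices sit in a list called mats, and that the machine has roughly 24 GB of free RAM for ~1e9 rows of 3 double columns):
big <- do.call(rbind, mats)   # mats is a hypothetical list of the 35000 x 3 matrices
dim(big)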
I am a noob R programmer. I have written code that needs to apply a function to a data frame split by factors. The data frame contains about 1.3 million (1,324,961) observations, with 64,376 levels in the factor variable that we use to slice it.
The code is as follows:
library("readstata13")
# Reading the Stata Data file into R
bod_fb <- read.dta13("BoD_nonmissing_fb.dta")
gen_fuzzy_blau <- function(bod_sample){
# Here we drop the Variables that are not required in creating the Fuzzy-Blau index
bod_sample <- as.data.frame(bod_sample)
bod_sample$tot_occur <- as.numeric(bod_sample$tot_occur)
bod_sample$caste1_occ <- as.numeric(bod_sample$caste1_occ)
bod_sample$caste2_occ <- as.numeric(bod_sample$caste2_occ)
bod_sample$caste3_occ <- as.numeric(bod_sample$caste3_occ)
bod_sample$caste4_occ <- as.numeric(bod_sample$caste4_occ)
# Calculating the Probabilites of a director belonging to a caste
bod_sample$caste1_occ <- (bod_sample$caste1_occ)/(bod_sample$tot_occur)
bod_sample$caste2_occ <- (bod_sample$caste2_occ)/(bod_sample$tot_occur)
bod_sample$caste3_occ <- (bod_sample$caste3_occ)/(bod_sample$tot_occur)
bod_sample$caste4_occ <- (bod_sample$caste4_occ)/(bod_sample$tot_occur)
#Dropping the Total Occurrences column, as we do not need it anymore
bod_sample$tot_occur<- NULL
# Here we replace all the blanks with NA
bod_sample <- apply(bod_sample, 2, function(x) gsub("^$|^ $", NA, x))
bod_sample <- as.data.frame(bod_sample)
# Here we push all the NAs in the caste names and caste probabilities to the end of the row
# So if there are only two castes against a name, then they become caste1 and caste2
caste_list<-data.frame(bod_sample$caste1,bod_sample$caste2,bod_sample$caste3,bod_sample$caste4)
caste_list = as.data.frame(t(apply(caste_list,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
caste_list_prob<-data.frame(bod_sample$caste1_occ,bod_sample$caste2_occ,bod_sample$caste3_occ,bod_sample$caste4_occ)
caste_list_prob = as.data.frame(t(apply(caste_list_prob,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
# Here we write two functions: 1. gen_castelist
# 2. gen_caste_prob
# gen_castelist: This function takes the row number (serial number of the director)
# and returns the names of all the castes for which he has a non-zero
# probability.
# gen_caste_prob: This function takes the row number (serial number of the director)
# and returns the probability with which he belongs to the caste
#
gen_castelist <- function(x){
y <- caste_list[x,]
y <- as.vector(y[!is.na(y)])
return(y)
}
gen_caste_prob <- function(x){
z <- caste_list_prob[x,]
z <- z[!is.na(z)]
z <- as.numeric(z)
return(z)
}
caste_ls <-list()
caste_prob_ls <- list()
for(i in 1:nrow(bod_sample))
{
caste_ls[[i]]<- gen_castelist(i)
caste_prob_ls[[i]]<- gen_caste_prob(i)
}
gridcaste <- expand.grid(caste_ls)
gridcaste <- data.frame(lapply(gridcaste, as.character), stringsAsFactors=FALSE)
gridcasteprob <- expand.grid(caste_prob_ls)
# Generating the Joint Probability
gridcasteprob$JP <- apply(gridcasteprob,1,prod)
# Generating the Similarity Index
gen_sim_index <- function(x){
x <- t(x)
a <- as.data.frame(table(x))
sim_index <- sum(a$Freq^2)/(sum(a$Freq))^2
return(sim_index)
}
gridcaste$sim_index <- apply(gridcaste,1,gen_sim_index)
# Generating fuzzyblau
gridcaste$fb <- gridcaste$sim_index * gridcasteprob$JP
fuzzy_blau_index <- sum(gridcaste$fb)
remove_list <- c("gridcaste","")
return(fuzzy_blau_index)
}
fuzzy_blau_output <- by(bod_fb,bod_fb$code_year,gen_fuzzy_blau)
# Saving the output as a dataframe with two columns
# Column 1 is the fuzzy blau index
# Column 2 is the code_year
code_year <- names(fuzzy_blau_output)
fuzzy_blau <- as.data.frame(as.vector(unlist(fuzzy_blau_output)))
names(fuzzy_blau) <- c("fuzzy_blau_index")
fuzzy_blau$code_year <- code_year
bod_fb <- merge(bod_fb,fuzzy_blau,by = "code_year")
save.dta13(bod_fb,"bod_fb_example.dta")
If the code is tl;dr, the summary is as follows:
I have a data frame bod_fb. I need to apply the gen_fuzzy_blau function to this data frame, slicing it by the levels of bod_fb$code_year.
Since the function is heavy, sequential processing takes more than a day and ends up running out of memory. The function gen_fuzzy_blau returns a numeric value fuzzy_blau_index for each code_year of the data frame. I use by to apply the function to each slice. I wanted to know if there is a way to run this code in parallel, so that multiple instances of the function run at once on different slices of the data frame. I did not find a parallel implementation of by in the parallel package, and I did not know how to pass the data frames as iterators when using the foreach and doParallel packages.
I have an AMD A8 laptop with 4 GB RAM and Windows 7 SP1 Home Basic. I have set 20 GB of page file memory (this was after I got the memory error).
Thank you
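For reference, one way to do this (a sketch assuming gen_fuzzy_blau and bod_fb are defined as above, and that the worker processes fit in memory) is to split the data frame by code_year and hand the pieces to foreach/doParallel instead of by:
library(foreach)
library(doParallel)
cl <- makeCluster(2)                         # number of workers; adjust to your cores and RAM
registerDoParallel(cl)
pieces <- split(bod_fb, bod_fb$code_year)    # one data frame per code_year
fuzzy_blau_output <- foreach(piece = pieces, .combine = c) %dopar% {
  gen_fuzzy_blau(piece)                      # one numeric result per slice
}
names(fuzzy_blau_output) <- names(pieces)
stopCluster(cl)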
EDIT 1: @milkmotel I have eliminated the redundancy in the code and removed the for loops, but a huge amount of time is still spent in gen_sim_index; I am using the proc.time() function to gauge how long each part of the code takes.
The function is supposed to do the following to a row:
If we have a row (not a vector), say a a b c, the similarity index will be (2/4)^2 + (1/4)^2 + (1/4)^2, i.e., the sum over the unique elements of the row of (number of occurrences of that element / total number of elements in the row)^2.
I am unable to use the apply function directly on the row, because each element in the row has different factor levels and table() does not output the frequencies properly.
What is an efficient way to code the gen_sim_index function?
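For reference, a leaner sketch of such a per-row similarity index (a hypothetical helper, not part of the original code; it converts the row to character so that table() only counts the values actually present):
sim_index_row <- function(row) {
  counts <- table(as.character(row))   # table() drops NAs by default
  p <- counts / sum(counts)            # relative frequency of each unique value
  sum(p^2)
}
sim_index_row(c("a", "a", "b", "c"))   # (2/4)^2 + (1/4)^2 + (1/4)^2 = 0.375
# gridcaste$sim_index <- apply(gridcaste, 1, sim_index_row)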
You're saving your data 6 times over in 6 different variables. Try not doing that.
And it takes a day because you're running character indexing on a ridiculous amount of data with gsub().
Take your code out of your gen_fuzzy_blau function; wrapping it all up in one function rather than running it independently provides no value. Then run it one line at a time. If a step takes too long to run, reconsider your method. Your code is incredibly inefficient.
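As one concrete illustration of the gsub() point, a sketch (the caste column names are taken from the question's code): blank out empty strings only in the few character columns that can contain them, instead of running gsub() over the entire data frame.
is_blank <- function(x) !is.na(x) & !nzchar(trimws(x))   # TRUE for "" or whitespace-only
for (col in c("caste1", "caste2", "caste3", "caste4")) {
  bod_sample[[col]][is_blank(bod_sample[[col]])] <- NA
}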
I have been trying to use tapply, ave, and ddply to create statistics by group of a variable (age, sex). I haven't been able to use the above-mentioned R commands successfully.
library("ff")
df <- as.ffdf(data.frame(a=c(1,1,1:3,1:5), b=c(10:1), c=(1:10)))
tapply(df$a, df$b, length)
The error message I get is
Error in as.vmode(value, vmode) :
argument "value" is missing, with no default
or
Error in byMean(df$b, df$a) : object 'index' not found
There is currently no tapply or ave for ff vectors implemented in package ff.
But what you can do is use functionality in ffbase.
Let's illustrate with a bigger dataset:
require(ffbase)
a <- ffrep.int(ff(1:100000), times=500) ## 50 million records on disk - not in RAM
b <- ffrandom(n=length(a), rfun = runif)
c <- ffseq_len(length(a))
df <- ffdf(a = a, b = b, c = c) ## on disk
dim(df)
For your simple aggregation, you can use binned_sum, from which you can extract the counts easily as follows. Note that binned_sum needs an ff factor object as the bin, which can be obtained with as.character.ff as shown.
df$groupbyfactor <- as.character(df$a)
agg <- binned_sum(x=df$b, bin=df$groupbyfactor, nbins = length(levels(df$groupbyfactor)))
head(agg)
agg[, "count"]
For more complex aggregations you can use ffdfdply in ffbase. What I frequently do is combine it with some data.table statements like this:
require(data.table)
agg <- ffdfdply(df, split=df$groupbyfactor, FUN=function(x){
x <- as.data.table(x)
result <- x[, list(b.mean = mean(b), b.median = median(b), b.length = length(b), whatever = b[c == max(c)][1]), by = list(a)]
result <- as.data.frame(result)
result
})
class(agg)
agg <- as.data.frame(agg) ## Puts the data in RAM!
This pulls your data into RAM in chunks of groups of the split elements, so that you can apply a function (like the data.table statements above) that requires the data to be in RAM. The results of all the chunks are then combined into a new ffdf, so that you can use it further or put it into RAM if your RAM allows that size.
The sizes of the chunks are controlled by getOption("ffbatchbytes"). So the more RAM you have, the better, as it will allow you to get more data into each chunk.
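For example (a small sketch; the factor of 10 is arbitrary), you can raise that option if you have spare RAM:
old <- getOption("ffbatchbytes")
options(ffbatchbytes = old * 10)   # let each chunk use ten times as many bytes
getOption("ffbatchbytes")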