I have a large data frame (50k by 5k). I would like to make a smaller data frame using the following rule: for a given k, 0 < k <= n, select the largest set of columns such that at least k rows have non-NA values in all of these columns.
This seems like it might be too hard for a computer to do on a big data frame, but I'm hoping it is possible. I have written code for this operation.
It seems my way of doing this is too complex. It relies on (1) computing a list of all possible subsets of the set of columns and then (2) checking how many complete (non-NA) rows each subset shares. Even for small numbers of columns, step (1) is too slow (e.g. 45 seconds for 25 columns).
Question: Is it computationally feasible to get the largest set of columns sharing at least k non-NA rows? If so, what is a more realistic approach?
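To make the rule concrete, here is a tiny hypothetical illustration (a made-up 4 x 3 matrix with k = 3):
toy <- matrix(c(1, NA, 3, 4,      # column 1
                5,  6, 7, 8,      # column 2
                NA, 10, 11, 12),  # column 3
              nrow = 4)
# columns 1 and 2 share 3 complete rows, so {1, 2} qualifies for k = 3
sum(complete.cases(toy[, c(1, 2)]))  # 3
# all 3 columns together share only 2 complete rows, so the full set fails
sum(complete.cases(toy))             # 2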
@alexis_laz's elegant answer to a similar question takes an inverse approach to mine, examining all (fixed-size) subsets of the observations/samples/draws/units and checking which variables are present in them.
Taking combinations of n observations is difficult for large n. For example, length(combn(1:500, 3, simplify = FALSE)) yields 20,708,500 combinations for 500 observations, and on my computer it fails to produce the combinations for subset sizes greater than 3. This makes me worry that it will not be feasible for large n and p with either approach.
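For what it's worth, the counts alone can be checked without materializing the lists, which makes the blow-up easy to see:
choose(500, 3)  # 20708500
choose(500, 4)  # roughly 2.6e9 -- far too many to enumerate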
I have included an example matrix for reproducibility.
require(dplyr)
# generate example matrix
set.seed(123)
n = 100
p = 25
missing = 25
mat = rnorm(n * p)
mat[sample(1:(n*p), missing)] = NA
mat = matrix(mat, nrow = n, ncol = p)
colnames(mat) = 1:p
# matrix reporting whether a value is na
hasVal = 1-is.na(mat)
system.time(
  # collect all possible subsets of the column names
  nameSubsets <<- unlist(lapply(1:ncol(mat), combn, x = colnames(mat), simplify = FALSE),
                         recursive = FALSE,
                         use.names = FALSE)
)
# count how many observations have non-NA values for all of the subset's variables
countObsWithVars = function(varsVec){
  selectedCols = as.matrix(hasVal[, varsVec])
  countInRow = rowSums(selectedCols)       # for each row, number of non-NA values among the selected columns
  sum(countInRow == length(varsVec))       # rows that have all selected columns
}
system.time(
numObsWithVars <<- unlist(lapply(nameSubsets, countObsWithVars))
)
# collect results into a data.frame
df = data.frame(subSetNum = 1:length(numObsWithVars),
                numObsWithVars = numObsWithVars,
                numVarsInSubset = unlist(lapply(nameSubsets, length)),
                varsInSubset = I(nameSubsets))
# find the largest set of columns for each number of rows
maxdf = df %>%
  group_by(numObsWithVars) %>%
  filter(numVarsInSubset == max(numVarsInSubset)) %>%
  arrange(numObsWithVars)
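For reference, once maxdf exists, the answer for a particular k can be read off from it; a small sketch (k = 90 is just a hypothetical choice):
k <- 90  # hypothetical target number of complete rows
maxdf %>%
  ungroup() %>%
  filter(numObsWithVars >= k) %>%
  filter(numVarsInSubset == max(numVarsInSubset))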
Related
I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be of a type that the split function can coerce to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
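Since the question mentions recombining the pieces afterwards with rbind(), here is a minimal sketch of that step (assuming the dat_list built above):
# stitch all the pieces back together in one go
dat_recombined <- do.call(rbind, dat_list)
# or one pair at a time, as described in the question
dat_12 <- rbind(dat_list[["newdf_1"]], dat_list[["newdf_2"]])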
Here's a tidyverse based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution: keep only the rows where string == "a",
# reading the file in chunks of 1000 rows
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
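A rough sketch of such a wrapper (the function name read_matching and the column name string are just illustrative):
library(readr)
library(dplyr)
# hypothetical wrapper: keep only the rows whose `string` column equals `keep`
read_matching <- function(file, keep, chunk_size = 1000) {
  read_csv_chunked(
    file,
    DataFrameCallback$new(function(x, pos) filter(x, string == keep)),
    chunk_size = chunk_size
  )
}
partial_a <- read_matching("test.csv", keep = "a")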
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?
I need to run Student's t-test on the columns of two matrices of dimensions 21 x 4044 and 36 x 4044, respectively. The columns are identical in both; only the number of rows differs.
Sample code for my example input data
mat1 <- matrix(rnorm(100), ncol = 5)
mat2 <- matrix(rnorm(125), ncol = 5)
f <- function(x, y){
  test <- t.test(x, y, paired = TRUE)
  out <- data.frame(stat = test$statistic,
                    df   = test$parameter,
                    pval = test$p.value,
                    conl = test$conf.int[1],
                    conh = test$conf.int[2])
  return(out)
}
sapply(seq(ncol(mat1)), function(x) f(mat1[,x], mat2[,x]))
But it gives the following error:
Error in complete.cases(x, y) : not all arguments have the same length
How do I deal with this error? The code works fine for matrices with the same number of rows.
A paired t-test assumes that you have two results for each entity; for example, you might measure the heart rate of the same person before and after a race, leaving you with reading 1 and reading 2 that are 'paired'. This is what you are asserting with paired = TRUE.
In your example, you have differently sized vectors, which suggests that you are not recording two readings for the same entity. So from here:
If you have not been collecting pairs of readings from the same subject, switch to paired = FALSE.
If you have been collecting pairs of readings from the same subject then you are missing some readings (by virtue of one column having more readings than the other) and you should remove the cases where you don't have two readings.
Hopefully that makes sense and helps a little.
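For concreteness, a minimal sketch of option 1, i.e. the question's f with only the paired argument changed (the exact numbers will differ from run to run):
mat1 <- matrix(rnorm(100), ncol = 5)
mat2 <- matrix(rnorm(125), ncol = 5)
f <- function(x, y){
  test <- t.test(x, y, paired = FALSE)  # Welch test, no pairing assumed
  data.frame(stat = test$statistic,
             df   = test$parameter,
             pval = test$p.value,
             conl = test$conf.int[1],
             conh = test$conf.int[2])
}
sapply(seq(ncol(mat1)), function(i) f(mat1[, i], mat2[, i]))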
EDIT: Having just made that change and run your code, I get:
     [,1]       [,2]       [,3]       [,4]      [,5]
stat -0.1336019 -0.8981109 -0.1962769 0.9045503 0.3164153
df   42.35801   42.9418    38.21301   40.52551  41.40109
pval 0.8943501  0.3741347  0.8454336  0.3710499 0.7532772
conl -0.7211962 -1.069044  -0.6361448 -0.363129 -0.5404484
conh 0.6316144  0.4102729  0.5236731  0.9519358 0.7413329
I have a matrix named "fida", from which I have randomly sampled a certain number of rows. On these rows I run a set of commands, at the end of which there is a condition; if it is true, I want to sample another row at random from the same matrix, one that is not among the rows sampled earlier.
I have a condition for doing this. But even before that, when I use the same command to sample from the matrix, it gives me empty data:
reps = 5  # number of samples
randreps = sample(nrow(fida), size = reps, replace = F)
for (loop in randreps) {
  # calculate a
}
if (a == 0) {
  loop = sample(nrow(fida), size = 1, replace = F)
  # calculate a
}
But when I run this, the second sample always gives empty data and a cannot be calculated. When I go back and check my dataframe "fida" for the row that has been selected, there is data in that row. I do not know what is wrong and any help will be much appreciated.
You could approach this problem in the following manner.
set.seed(357)
xy <- matrix(1:30, nrow = 10)
original.rows <- sample(10, size = 3, replace = FALSE)
original <- xy[original.rows, ]
# Your calculations.
# Sample from the original matrix again, but without the already sampled
# samples.
middle <- xy[-original.rows, ]
output.row <- sample(nrow(middle), size = 3, replace = FALSE)
output <- middle[output.row, ]  # index into middle (not xy) so already-sampled rows cannot reappear
In other words, you have a matrix that holds only the unsampled rows, which serves as a source of new rows for your calculations.
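The same idea can also be expressed by tracking the used row indices, which maps more directly onto the question's loop (fida below is just a stand-in matrix):
set.seed(357)
fida <- matrix(rnorm(60), nrow = 20)  # stand-in for the question's matrix
reps <- 5
used <- sample(nrow(fida), size = reps, replace = FALSE)
# ... calculations on fida[used, ] ...
# when another row is needed, draw only from the rows not used yet
remaining <- setdiff(seq_len(nrow(fida)), used)
fresh <- sample(remaining, size = 1)  # note: sample() misbehaves if `remaining` has length 1
used <- c(used, fresh)                # record it so it cannot be drawn again
new.row <- fida[fresh, ]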
I'm trying to create a sparse matrix where each row has a maximum of n entries, each an integer within a certain range, which I could then use as an adjacency matrix for social network analysis. For example, an 80x80 matrix where each row has 10 or fewer entries that are integers from 1-4. The goal is to represent the sort of data you would get from a social networking survey in which respondents select values between 1 and 4 to indicate their relationship with up to 10 of the possibilities/columns in the survey.
I can create a sparse matrix using the "rsparsematrix" function, and using the density argument I can approximate the required number of responses, but I can't control the number of responses per row, and I would have to do additional processing to convert the random values to integers within my desired range.
E.g., I could start with something like:
M1<-rsparsematrix(80, 80, density = .1, symmetric = FALSE)
A more promising approach (from https://www.r-bloggers.com/casting-a-wide-and-sparse-matrix-in-r/) would be to generate the values and then use "transform" to convert them into a matrix. This allows me to control the integer values, but still doesn't get the limited number of responses per row.
Example code from the blog follows:
set.seed(11)
N = 10
data = data.frame(
  row = sample(1:3, N, replace = TRUE),
  col = sample(LETTERS, N, replace = TRUE),
  value = sample(1:3, N, replace = TRUE))
data = transform(data,
                 row = factor(row),
                 col = factor(col))
This could be tweaked to give the required 80x80 matrix, but it doesn't solve the problem of limiting the responses per row, and duplicate entries for the same row/column combination will produce out-of-range values, since duplicates are resolved by summing.
Any suggestions would be most appreciated.
As a bonus question, how would you then create random rows of null responses? For example within the 80*80 matrix, how might you introduce 40 random rows with no values? As in the description above, this would correspond to missing survey data.
You can build the sparse matrix up from its row (i), column (j), and value (x) components. This involves sampling subject to your row and value constraints.
# constraints
values <- 1:4
maxValuesPerRow <- 10
nrow <- 80
ncol <- 80
# sample values: how many values each row should get (at most 10)
set.seed(1)
nValuesForEachRow <- sample(maxValuesPerRow, nrow, replace=TRUE)
# create matrix
library(Matrix)
i <- rep(seq_len(nrow), nValuesForEachRow) # row
j <- unlist(lapply(nValuesForEachRow, sample, x=seq_len(ncol))) # which columns
x <- sample(values, sum(nValuesForEachRow), replace=TRUE) # values
sm <- sparseMatrix(i=i, j=j, x=x)
check
dim(sm)
table(rowSums(sm>0))
table(as.vector(sm))
Note: you can't just sample the columns in one go as below, since this can produce duplicate row/column pairs; hence the per-row sampling via lapply above.
j <- sample(seq_len(ncol), sum(nValuesForEachRow), replace=TRUE)
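For the bonus question (rows with no responses at all), the same i/j/x construction works if you only generate entries for a subset of the rows and pass explicit dims; a sketch under the same constraints:
library(Matrix)
set.seed(2)
nr <- 80; nc <- 80
values <- 1:4
maxValuesPerRow <- 10
# choose which rows get any responses; the other 40 stay completely empty
nonEmptyRows <- sort(sample(nr, nr - 40))
nValues <- sample(maxValuesPerRow, length(nonEmptyRows), replace=TRUE)
i <- rep(nonEmptyRows, nValues)
j <- unlist(lapply(nValues, sample, x=seq_len(nc)))
x <- sample(values, sum(nValues), replace=TRUE)
# explicit dims keep the empty rows (and any unused columns) in the matrix
sm2 <- sparseMatrix(i=i, j=j, x=x, dims=c(nr, nc))
table(rowSums(sm2 > 0) == 0)  # 40 rows should be empty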
The code below will do what you want. It generates your random sparse matrix, rounds it to whole numbers, then for every row that has more than 10 entries randomly makes some entries NA until only 10 remain. It then makes all the non-NA entries a random number between 1 and 4.
library(Matrix)
M1 <- as.data.frame(as.matrix(rsparsematrix(80, 80, density = .1, symmetric = FALSE)))
# round to whole numbers and turn the zeroes into NA
M1 <- as.data.frame(round(M1))
M1 <- as.data.frame(sapply(M1, function(x) ifelse(x == 0, NA, x)))
# for rows with more than 10 non-NA entries, blank out entries at random until 10 remain
rows <- which(apply(M1, 1, function(x) sum(!is.na(x))) > 10)
for (i in rows) {
  filled <- which(!is.na(M1[i, ]))
  toNA <- setdiff(filled, sample(filled, 10, replace = FALSE))
  M1[i, toNA] <- NA
}
# replace every remaining non-NA entry with a random integer between 1 and 4
for (i in 1:nrow(M1)) {
  filled <- which(!is.na(M1[i, ]))
  M1[i, filled] <- sample(1:4, length(filled), replace = TRUE)
}
I have a list of identically sorted data frames. More specifically, these are the imputed data frames I get after doing multiple imputation with the Amelia II package. Now I want to create a new data frame that is identical in structure but contains the mean values of the cells, calculated across the data frames.
The way I achieve this at the moment is the following:
## do the Amelia run ------------------------------------------------------------
a.out <- amelia(merged, m=5, ts="Year", cs ="GEO",polytime=1)
## Calculate the output statistics ----------------------------------------------
left.side <- a.out$imputations[[1]][,1:2]
a.out.ncol <- ncol(a.out$imputations[[1]])
a <- a.out$imputations[[1]][,3:a.out.ncol]
b <- a.out$imputations[[2]][,3:a.out.ncol]
c <- a.out$imputations[[3]][,3:a.out.ncol]
d <- a.out$imputations[[4]][,3:a.out.ncol]
e <- a.out$imputations[[5]][,3:a.out.ncol]
# Calculate the mean of the matrices (requires the abind package)
library(abind)
mean.right <- apply(abind(a, b, c, d, e, along = 3), c(1, 2), mean)
# recombine factors with values
mean <- cbind(left.side,mean.right)
I suppose there is a much better way of doing this using apply, plyr or the like, but as an R newbie I am really a bit lost here. Do you have any suggestions on how to go about this?
Here's an alternate approach using Reduce and plyr::llply
dfr1 <- data.frame(a = c(1,2.5,3), b = c(9.0,9,9), c = letters[1:3])
dfr2 <- data.frame(a = c(5,2,5), b = c(6,5,4), c = letters[1:3])
tst = list(dfr1, dfr2)
require(plyr)
tst2 = llply(tst, function(df) df[,sapply(df, is.numeric)]) # strip out non-numeric cols
ans = Reduce("+", tst2)/length(tst2)
EDIT. You can simplify your code considerably and accomplish what you want in 5 lines of R code. Here is an example using the Amelia package.
library(Amelia)
data(africa)
# carry out imputations
a.out = amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# extract numeric columns from each element of a.out$imputations
tst2 = llply(a.out$imputations, function(df) df[,sapply(df, is.numeric)])
# sum them up and divide by length to get mean
mean.right = Reduce("+", tst2)/length(tst2)
# compute fixed columns and cbind with mean.right
left.side = a.out$imputations[[1]][1:2]
mean0 = cbind(left.side,mean.right)
If I understand your question correctly, then this should get you a long way:
#set up some data:
dfr1<-data.frame(a=c(1,2.5,3), b=c(9.0,9,9))
dfr2<-data.frame(a=c(5,2,5), b=c(6,5,4))
tst<-list(dfr1, dfr2)
#since all variables are numerical, use a threedimensional array
tst2<-array(do.call(c, lapply(tst, unlist)), dim=c(nrow(tst[[1]]), ncol(tst[[1]]), length(tst)))
#To see where you're at:
tst2
#rowMeans for a threedimensional array and dims=2 does the mean over the last dimension
result<-data.frame(rowMeans(tst2, dims=2))
rownames(result)<-rownames(tst[[1]])
colnames(result)<-colnames(tst[[1]])
#display the full result
result
HTH.
After many attempts, I've found a reasonably fast way to calculate cells' means across multiple data frames.
# First create an empty data frame for storing the average imputed values. This
# data frame will have the same dimensions of the original one
imp.df <- df
# Then create an array with the first two dimensions of the original data frame and
# the third dimension given by the number of imputations
a <- array(NA, dim=c(nrow(imp.df), ncol(imp.df), length(a.out$imputations)))
# Then copy each imputation in each "slice" of the array
for (z in 1:length(a.out$imputations)) {
  a[,,z] <- as.matrix(a.out$imputations[[z]])
}
# Finally, for each cell, replace the actual value with the mean across all
# "slices" in the array
for (i in 1:dim(a)[1]) {
  for (j in 1:dim(a)[2]) {
    imp.df[i, j] <- mean(as.numeric(a[i, j, ]))
  }
}
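Equivalently, the two nested loops can be collapsed into a single apply over the first two array dimensions; a one-line sketch, assuming a and imp.df as above and purely numeric columns:
imp.df[] <- apply(a, c(1, 2), function(v) mean(as.numeric(v)))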