I have a problem in which I am trying to create a matrix containing the number of occurrences of specific 'coordinates'. I am working in R.
To illustrate, this is (part of) my data:
pre = c(3,1,3,2,2,4,3,5,3,4,6,5,6,5,4,5,6,6,5,6,5,7,6,7,7,7,4,8,4,8,8,4,4,8,3,9,8,6,9,8)
post = c(4,3,5,3,4,6,5,6,5,4,5,6,6,5,6,5,7,6,7,7,7,4,8,4,8,8,4,4,8,3,9,8,6,9,8,8,9,7,9,9)
df = data.frame(pre,post)
I then define this output matrix with the possible coordinate dimensions (range 1-20 in all data):
matrix = matrix(NA, nrow=20, ncol=20)
colnames(matrix) = seq(1,20,1)
rownames(matrix) = seq(1,20,1)
I then need a loop to run through my data and to store how many of the specific pre-post combinations exist within the data:
for (i in 1:40){matrix[df$post[i], df$pre[i]] = 1}
This works in that the output now shows which 'coordinates' occurred in the data, but it doesn't say how many times.
For example, I know that pre=4, post=4 occurred twice.
The loop therefore needs to remember that a combination has already occurred and add 1 to its count, but I don't know how to program this.
I hope somebody can be of help,
Anne
You could initialize the matrix with zeros instead of NA and then increment the matrix value like this:
pre = c(3,1,3,2,2,4,3,5,3,4,6,5,6,5,4,5,6,6,5,6,5,7,6,7,7,7,4,8,4,8,8,4,4,8,3,9,8,6,9,8)
post = c(4,3,5,3,4,6,5,6,5,4,5,6,6,5,6,5,7,6,7,7,7,4,8,4,8,8,4,4,8,3,9,8,6,9,8,8,9,7,9,9)
df = data.frame(pre,post)
matrix = matrix(0, nrow=20, ncol=20)
colnames(matrix) = seq(1,20,1)
rownames(matrix) = seq(1,20,1)
for (i in 1:40){matrix[df$post[i], df$pre[i]] = matrix[df$post[i], df$pre[i]] + 1}
By the way, setting the matrix colnames and rownames is unnecessary unless you need them for other reasons.
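A quick sanity check against the example in the question, which says pre=4, post=4 occurred twice:
matrix[4, 4] # 2, as expected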
We can use table to do this. Convert the 'pre' and 'post' columns to factor with levels specified as 1 to 20, then call table:
table(factor(df$pre, levels = 1:20), factor(df$post, levels = 1:20))
If we are using the already created 'matrix', an option is
out <- as.data.frame(table(df))
matrix[as.matrix(out[1:2])] <- out$Freq
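Note that this character indexing relies on the dimnames set earlier, and the result is oriented [pre, post], i.e. the transpose of the loop-filled matrix above. A quick check on the diagonal, where the orientation doesn't matter:
matrix["4", "4"] # 2, matching the pre = 4, post = 4 example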
I'm trying to create a sparse matrix where each row has a maximum of n entries, each an integer within a certain range, which I could then use as an adjacency matrix for social network analysis. For example, an 80x80 matrix where each row has 10 or fewer entries that are integers from 1-4. The goal is to represent the sort of data you would get from a social networking survey in which respondents selected values between 1 and 4 to indicate their relationship with up to 10 of the possibilities/columns in the survey.
I can create a sparse matrix using the "rsparsematrix" function, and using the density argument I can approximate the required number of responses, but I can't control the number of responses per row, and I would have to do additional processing to convert the random values to integers within my desired range.
e.g. I could start with something like
M1<-rsparsematrix(80, 80, density = .1, symmetric = FALSE)
A more promising approach (from https://www.r-bloggers.com/casting-a-wide-and-sparse-matrix-in-r/) would be to generate the values and then use "transform" to convert them into a matrix. This allows me to control the integer values, but it still doesn't limit the number of responses per row.
Example code from the blog follows:
set.seed(11)
N = 10
data = data.frame(
row = sample(1:3, N, replace = TRUE),
col = sample(LETTERS, N, replace = TRUE),
value = sample(1:3, N, replace = TRUE))
data = transform(data,
row = factor(row),
col = factor(col))
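The blog then casts this long format into a wide matrix. One way to do the cast, though not necessarily the blog's exact code, is xtabs() with sparse = TRUE; note it sums duplicate row/column entries, which is the summing behaviour mentioned below:
library(Matrix) # needed for the sparse result
sparse_mat <- xtabs(value ~ row + col, data = data, sparse = TRUE)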
This could be tweaked to give the required 80x80 matrix, but it doesn't solve the problem of limiting the responses per row; and, where there are duplicate entries for the same row/column combination, it will produce out-of-range values, since duplicates are resolved by taking their sum.
Any suggestions would be most appreciated.
As a bonus question, how would you then create random rows of null responses? For example, within the 80x80 matrix, how might you introduce 40 random rows with no values? As in the description above, this would correspond to missing survey data.
You can try to build the sparse matrix up from its row (i), column (j), and value (x) components. This involves sampling subject to your row and value constraints.
# constraints
values <- 1:4
maxValuesPerRow <- 10
nrow <- 80
ncol <- 80
# sample how many values each row should get (at most 10)
set.seed(1)
nValuesForEachRow <- sample(maxValuesPerRow, nrow, replace=TRUE)
# create matrix
library(Matrix)
i <- rep(seq_len(nrow), nValuesForEachRow) # row
j <- unlist(lapply(nValuesForEachRow, sample, x=seq_len(ncol))) # which columns
x <- sample(values, sum(nValuesForEachRow), replace=TRUE) # values
sm <- sparseMatrix(i=i, j=j, x=x, dims=c(nrow, ncol)) # dims guarantees the full 80x80 shape even if some column is never sampled
Check:
dim(sm)
table(rowSums(sm>0))
table(as.vector(sm))
Note: you can't just sample the columns as below, as this can give duplicate (row, column) positions; hence the lapply over rows above.
j <- sample(seq_len(ncol), sum(nValuesForEachRow), replace=TRUE)
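For the bonus question, a minimal sketch: sample 40 random rows of the finished matrix, empty them, and drop the explicit zeros again:
nullRows <- sample(nrow, 40) # 40 rows of missing survey data
sm[nullRows, ] <- 0
sm <- drop0(sm) # remove the explicit zeros created by the assignment
sum(rowSums(sm > 0) == 0) # at least 40 empty rows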
The code below will do what you want. It generates your random sparse matrix, rounds it to whole numbers, then for every row that has more than 10 entries, randomly sets entries to NA until only 10 remain. It then makes all the non-NA entries a random number between 1 and 4.
library(Matrix)
M1<-as.data.frame(as.matrix((rsparsematrix(80, 80, density = .1, symmetric = FALSE))))
M1 <- as.data.frame(apply(M1,1,round))
M1<-as.data.frame(sapply(M1,function(x) ifelse(x==0,NA,x)))
rows<-which(apply(M1,1,function(x) sum(!(is.na(x)))) >10)
for(i in rows)
{
toNA<-setdiff(which(!(is.na(M1[i,]))),sample(which(!(is.na(M1[i,]))),10,replace=F))
M1[i,toNA] <- NA
}
for(i in 1:nrow(M1))
{
M1[i, which(!is.na(M1[i,]))] <- sample(1:4, sum(!is.na(M1[i,])), replace = TRUE)
}
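A quick check that the constraints hold (at most 10 entries per row, values within 1 to 4):
max(apply(M1, 1, function(x) sum(!is.na(x)))) # should be <= 10
range(M1, na.rm = TRUE) # should be within 1 and 4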
I am trying to write a for loop where if the cell of one matrix matches a letter it then fills a blank matrix with the entire row that matched. Here is my code
mets<-data.frame(read.csv(file="Metabolite_data.csv",header=TRUE))
full<-length(mets[,6])
A=matrix(,nrow=4930,ncol=8, byrow=T)
for (i in 1:full){
if (mets[i,6]=="A") (A[i,]=(mets[i,]))
}
If I replace the i in the if statement with a single number, it works to fill that row of matrix A; however, it will not fill more than one row. TIA
You might be running into problems going from data frame to matrix. Just using "mets" as a matrix instead of a data frame might solve your problem, or you could use as.matrix within your for loop. Here is an example of the latter with made-up data, since I don't have your "Metabolite_data.csv":
mets <- matrix(sample(LETTERS[1:4], 80, replace = TRUE), nrow = 10, ncol = 8)
mets <- as.data.frame(mets)
A <- matrix(nrow = nrow(mets), ncol = ncol(mets), byrow = TRUE)
for(i in 1:nrow(mets)){
if(mets[i,6] == "A"){
A[i,] = as.matrix(mets[i,])
}
}
print(A)
You may want to specify ncol=dim(mets)[2] to make sure you are providing the same number of inputs to fill the matrix.
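For example, a sketch of that suggestion using the made-up mets above:
A <- matrix(nrow = nrow(mets), ncol = dim(mets)[2])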
I have a large data frame (50k by 5k). I would like to make a smaller data frame using the following rule: for a given k, 0 < k < n, select the largest set of columns such that k rows have non-NA values in all of these columns.
This seems like it might be too hard for a computer to do on a big data frame, but I'm hoping it is possible. I have written code for this operation.
It seems my way of doing this is too complex. It relies on (1) computing a list of all possible subsets of the set of columns, and then (2) checking how many shared rows they have. Even for small numbers of columns, (1) gets too slow (e.g. 45 seconds for 25 columns).
Question: Is it theoretically possible to get the largest set of columns sharing at least k non-na rows? If so, what is a more realistic approach?
#alexis_laz's elegant answer to a similar question takes an inverse approach to mine, examining all (fixed-size) subsets of the observations/samples/draws/units and checking which variables are present in them.
Taking combinations of n observations is difficult for large n. For example, length(combn(1:500, 3, simplify = FALSE)) yields 20,708,500 combinations for 500 observations, and my computer fails to produce the combinations for subset sizes greater than 3. This makes me worry that it won't be feasible for large n and p with either approach (a sketch of this inverse approach follows the example below).
I have included an example matrix for reproducibility.
require(dplyr)
# generate example matrix
set.seed(123)
n = 100
p = 25
missing = 25
mat = rnorm(n * p)
mat[sample(1:(n*p), missing)] = NA
mat = matrix(mat, nrow = n, ncol = p)
colnames(mat) = 1:p
# matrix reporting whether a value is na
hasVal = 1-is.na(mat)
system.time(
# collect all possible subsets of the columns' indices
nameSubsets <<- unlist(lapply(1:ncol(mat), combn, x = colnames(mat), simplify = FALSE),
recursive = FALSE,
use.names = FALSE)
)
#how many observations have all of the subset variables
countObsWithVars = function(varsVec){
selectedCols = as.matrix(hasVal[,varsVec])
countInRow = apply(selectedCols, 1, sum) # for each row, number of matching values
numMatching = sum(countInRow == length(varsVec)) #has all selected columns
}
system.time(
numObsWithVars <<- unlist(lapply(nameSubsets, countObsWithVars))
)
# collect results into a data.frame
df = data.frame(subSetNum = 1:length(numObsWithVars),
numObsWithVars = numObsWithVars,
numVarsInSubset = unlist(lapply(nameSubsets, length)),
varsInSubset = I(nameSubsets))
# find the largest set of columns for each number of rows
maxdf = df %>% group_by(numObsWithVars) %>%
filter(numVarsInSubset== max(numVarsInSubset)) %>%
arrange(numObsWithVars)
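For comparison, here is a minimal sketch of the inverse approach mentioned above: fix a small k, enumerate the k-row subsets, and count the columns that are non-NA in all of them. It reuses n, mat, and hasVal from the example and is still combinatorial in k:
k <- 3
rowSets <- combn(seq_len(n), k, simplify = FALSE) # all k-row subsets
colsOk <- vapply(rowSets, function(r) sum(colSums(hasVal[r, , drop = FALSE]) == k), numeric(1))
best <- rowSets[[which.max(colsOk)]] # rows giving the largest column set
colnames(mat)[colSums(hasVal[best, , drop = FALSE]) == k] # the columns themselves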
I simulated a data matrix of 200 rows x 1000 columns containing 0's and 1's drawn from a binomial distribution. The probability of a 1 occurring depends on a probability matrix that I've created.
I then transpose this data matrix and convert it to a data frame. I created a function that will introduce missing data to each row of the data frame. The function will also add three columns to the data frame after the missing data is introduced. One column is the computed frequency of 1's across each of the 1000 rows. The 2nd column is the computed frequency of 0's across each row. The 3rd column is the frequency of missing values across each row.
I would like to repeat this function 500 times with the same input data frame (the one with no missing values) and output three data frames: one with 500 columns containing all of the computed frequencies of 0's (one column per simulation), one with 500 columns containing all of the computed frequencies of 1's, and one with 500 columns of the missing data frequencies.
I have seen mapply() used for something similar, but was not sure if it would work in my case. How can I repeatedly apply a function to a data frame and store the output of each computation performed within that function every time that function is repeated?
Thank you!
####Load Functions####
###Compute freq of 0's
compute.al0 = function(GEcols){
(sum(GEcols==0, na.rm=TRUE)/sum(!is.na(GEcols)))
}
###Compute freq of 1's
compute.al1 = function(GEcols){
(sum(GEcols==1, na.rm=TRUE)/sum(!is.na(GEcols)))
}
#Introduce missing data
addmissing = function(GEcols){
newdata = GEcols
num.cols = 200
num.miss = 10
set.to.missing = sample(num.cols, num.miss, replace=FALSE) #select num.miss to be set to missing
newdata[set.to.missing] = NA
return(newdata) #why is the matrix getting transposed during this??
}
#Introduce missing data and re-compute freq of 0's and 1's, and missing data freq
rep.missing = function(GEcols){
indata = GEcols
missdata = apply(indata,1,addmissing) # apply() over rows returns its results column-wise, which is why the data come back transposed
missdata.out = as.data.frame(missdata) #have to get the df back in the right format
missdata.out.t = t(missdata.out)
missdata.new = as.data.frame(missdata.out.t)
missdata.new$allele.0 = apply(missdata.new[,1:200], 1, compute.al0) #compute freq of 0's
missdata.new$allele.1 = apply(missdata.new[,1:200], 1, compute.al1) #compute freq of 1's
missdata.new$miss = apply(missdata.new[,1:200], 1, function(x) {(sum(is.na(x)))/200}) #compute missing
return(missdata.new)
}
#Generate a data matrix with no missing values
datasim = matrix(0, nrow=200, ncol=1000) #pre-allocated matrix of 0's of desired size
probmatrix = col(datasim)/1000 #probability matrix, each of the 1000 columns will have a different prob
datasim2 = matrix(rbinom(200 * 1000,1,probmatrix),
nrow=200, ncol=1000, byrow=FALSE) #new matrix of 0's and 1's based on probabilities
#Assign column names
cnum = 1:1000
cnum = paste("M",cnum,sep='')
colnames(datasim2) = cnum
#Assign row names
rnum = 1:200
rnum = paste("L",rnum,sep='')
rownames(datasim2) = rnum
datasim2 = t(datasim2) #data will be used in the transposed form
datasim2 = as.data.frame(datasim2)
#add 10 missing values per row and compute new frequencies
datasim.miss = rep.missing(datasim2)
#Now, how can I repeat the rep.missing function
#500 times and store the output of the new frequencies
#generated from each repetition?
Update:
Frank, thank you for the replicate() suggestion. I am able to return the repetitions by changing return(missdata.new) to return(list(missdata.new)) in the rep.missing() function. I then call the function with replicate(500,rep.missing(datasim2), simplify="matrix").
This is almost exactly what I want. I would like to do
return(list(missdata.new$allele.0, missdata.new$allele.1, missdata.new$miss))
in rep.missing() and return each of these 3 vectors as 3 column bound data frames within a list. One data frame holds the 500 repetitions of missdata.new$allele.0, one holds the 500 repetitions of missdata.new$allele.1, etc.
replicate(500, rep.missing(datasim2), simplify="matrix")
I am not sure I understand which part you don't know how to do.
If you don't know how to repeatedly store your results, one way would be to have a global variable and do <<- assignments inside your function instead of <- or =.
x=c()
func = function(i){x <<- c(x,i) }
sapply(1:5,func)
mapply is for repeating a function over multiple input lists or vectors.
You want to repeat your function 500 times, so you can always do
sapply(1:500, func)
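To get the three data frames described in the update, one sketch (assuming rep.missing() returns missdata.new as in the question) is to keep the replicate() results as a list and extract each statistic afterwards:
reps <- replicate(500, rep.missing(datasim2), simplify = FALSE) # list of 500 data frames
allele0.df <- as.data.frame(sapply(reps, `[[`, "allele.0")) # 1000 rows x 500 columns
allele1.df <- as.data.frame(sapply(reps, `[[`, "allele.1"))
miss.df <- as.data.frame(sapply(reps, `[[`, "miss"))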