I simulated a data matrix containing 200 rows x 1000 columns. It contains 0's and 1's in a binomial distribution. The probability of a 1 occurring depends on a probability matrix that I've created.
I then transpose this data matrix and convert it to a data frame. I created a function that will introduce missing data to each row of the data frame. The function will also add three columns to the data frame after the missing data is introduced. One column is the computed frequency of 1's across each of the 1000 rows. The 2nd column is the computed frequency of 0's across each row. The 3rd column is the frequency of missing values across each row.
I would like to repeat this function 500 times with the same input data frame (the one with no missing values) and output three data frames: one with 500 columns containing all of the computed frequencies of 0's (one column per simulation), one with 500 columns containing all of the computed frequencies of 1's, and one with 500 columns of the missing data frequencies.
I have seen mapply() used for something similar, but was not sure if it would work in my case. How can I repeatedly apply a function to a data frame and store the output of each computation performed within that function every time that function is repeated?
Thank you!
####Load Functions####
###Compute freq of 0's
compute.al0 = function(GEcols){
(sum(GEcols==0, na.rm=TRUE)/sum(!is.na(GEcols)))
}
###Compute freq of 1's
compute.al1 = function(GEcols){
(sum(GEcols==1, na.rm=TRUE)/sum(!is.na(GEcols)))
}
#Introduce missing data
addmissing = function(GEcols){
newdata = GEcols
num.cols = 200
num.miss = 10
set.to.missing = sample(num.cols, num.miss, replace=FALSE) #select num.miss to be set to missing
newdata[set.to.missing] = NA
return(newdata) #why is the matrix getting transposed during this??
}
#Introduce missing data and re-compute freq of 0's and 1's, and missing data freq
rep.missing = function(GEcols){
indata = GEcols
missdata = apply(indata,1,addmissing)
missdata.out = as.data.frame(missdata) #have to get the df back in the right format
missdata.out.t = t(missdata.out)
missdata.new = as.data.frame(missdata.out.t)
missdata.new$allele.0 = apply(missdata.new[,1:200], 1, compute.al0) #compute freq of 0's
missdata.new$allele.1 = apply(missdata.new[,1:200], 1, compute.al1) #compute freq of 1's
missdata.new$miss = apply(missdata.new[,1:200], 1, function(x) {(sum(is.na(x)))/200}) #compute missing
return(missdata.new)
}
#Generate a data matrix with no missing values
datasim = matrix(0, nrow=200, ncol=1000) #pre-allocated matrix of 0's of desired size
probmatrix = col(datasim)/1000 #probability matrix, each of the 1000 columns will have a different prob
datasim2 = matrix(rbinom(200 * 1000,1,probmatrix),
nrow=200, ncol=1000, byrow=FALSE) #new matrix of 0's and 1's based on probabilities
#Assign column names
cnum = 1:1000
cnum = paste("M",cnum,sep='')
colnames(datasim2) = cnum
#Assign row names
rnum = 1:200
rnum = paste("L",rnum,sep='')
rownames(datasim2) = rnum
datasim2 = t(datasim2) #data will be used in the transposed form
datasim2 = as.data.frame(datasim2)
#add 10 missing values per row and compute new frequencies
datasim.miss = rep.missing(datasim2)
#Now, how can I repeat the rep.missing function
#500 times and store the output of the new frequencies
#generated from each repetition?
Update:
Frank, thank you for the replicate() suggestion. I am able to return the repetitions by changing return(missdata.new) to return(list(missdata.new)) in the rep.missing() function. I then call the function with replicate(500,rep.missing(datasim2), simplify="matrix").
This is almost exactly what I want. I would like to do
return(list(missdata.new$allele.0, missdata.new$allele.1, missdata.new$miss))
in rep.missing() and return each of these 3 vectors as 3 column bound data frames within a list. One data frame holds the 500 repetitions of missdata.new$allele.0, one holds the 500 repetitions of missdata.new$allele.1, etc.
replicate(500, rep.missing(datasim2), simplify="matrix")
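One way to get the three data frames described in the update is to have each repetition return a named list of the three vectors, run replicate() with simplify=FALSE, and then bind the vectors of the same name side by side. The sketch below is only an illustration under assumptions: sim.once() is a hypothetical stand-in for rep.missing(), and the toy data (20 rows x 50 columns, 5 repetitions) is made up to keep it fast.

```r
# Hypothetical stand-in for rep.missing(): sets num.miss random entries per
# row to NA and returns the three per-row frequency vectors as a named list.
sim.once <- function(df, num.miss = 10) {
  m <- as.matrix(df)
  for (r in seq_len(nrow(m))) {
    m[r, sample(ncol(m), num.miss)] <- NA
  }
  list(allele.0 = rowMeans(m == 0, na.rm = TRUE),  # freq of 0's among non-missing
       allele.1 = rowMeans(m == 1, na.rm = TRUE),  # freq of 1's among non-missing
       miss     = rowMeans(is.na(m)))              # freq of missing values
}

set.seed(42)
toy <- as.data.frame(matrix(rbinom(20 * 50, 1, 0.3), nrow = 20))  # assumed toy input
n.rep <- 5                                                        # 500 in the question

# Keep every repetition as its own list element
runs <- replicate(n.rep, sim.once(toy), simplify = FALSE)

# Pull one named vector out of every run and bind them as columns
collect <- function(name) as.data.frame(sapply(runs, `[[`, name))
out <- list(allele.0 = collect("allele.0"),
            allele.1 = collect("allele.1"),
            miss     = collect("miss"))
```

Each element of out has one row per row of the input and one column per repetition, so with the real data and 500 repetitions each would be 1000 x 500.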
I am not sure I understand which part you are stuck on.
If you don't know how to store your results across repetitions, one way is to keep a global variable and, inside your function, assign with <<- instead of <- or =.
x=c()
func = function(i){x <<- c(x,i) }
sapply(1:5,func)
mapply is for repeating a function over multiple input lists or vectors.
You want to repeat your function 500 times, so you can always do
sapply(1:500,func)
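Note that sapply() already collects whatever the function returns, so the global variable is optional. A minimal sketch, where one.run() is a hypothetical function made up for illustration:

```r
# Hypothetical per-run function: returns a small named vector of results
one.run <- function(i) c(run = i, square = i^2)

# sapply() binds the returned vectors as columns: one column per run
res <- sapply(1:5, one.run)
```

Here res is a 2 x 5 matrix with rows named run and square.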
New to R btw so I am sorry if it seems like a stupid question.
So basically I have a dataframe with 100 rows and 3 different columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values of each column that are greater than the corresponding threshold.
Edit: Sorry for the incomplete question.
So essentially what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies every threshold to its respective column of the dataframe (so there is one threshold for every column of the dataframe). The number of elements of each column that "respect" their threshold should then be put in a vector. So for example:
Column 1: values = 1,2,3. Threshold = (only values lower than 3)
Column 2: values = 4,5,6. Threshold = (only values lower than 6)
Output: A vector (2,2) since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!!
Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option to resolve your question is to use sapply(), which applies a function over a list or vector. In this case, I create a vector for the columns in df with 1:ncol(df). Inside the function, you can count the number of values less than a given threshold by summing the number of TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
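Since the question asks for a function that takes the data frame and the threshold vector, the same idea can be wrapped up as follows. This is a sketch only: the name count.below is made up, and it uses mapply() to walk the columns and thresholds in parallel instead of indexing by column number.

```r
# Hypothetical helper: counts, per column, the values strictly below
# that column's threshold.
count.below <- function(df, thresholds) {
  stopifnot(ncol(df) == length(thresholds))
  # A data frame is a list of columns, so mapply pairs column k with threshold k
  unname(mapply(function(col, th) sum(col < th), df, thresholds))
}

df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
counts <- count.below(df, threshold)   # 2 2
```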
I'm trying to create a sparse matrix where each row has a maximum of n entries that are each integers within a certain range, which I could then use as an adjacency matrix for social network analysis. For example, an 80x80 matrix where each row has 10 or fewer entries that are integers from 1-4. The goal is to represent the sort of data you would get from a social networking survey in which respondents were selecting values between 1 and 4 to indicate their relationship with up to 10 of the possibilities/columns in the survey.
I can create a sparse matrix using the "rsparsematrix" function, and using the density argument I can approximate the required number of responses, but I can't control the number of responses per row and would have to do additional processing to convert the random values to integers within my desired range.
eg: I could start with something like
M1<-rsparsematrix(80, 80, density = .1, symmetric = FALSE)
A more promising approach (from https://www.r-bloggers.com/casting-a-wide-and-sparse-matrix-in-r/) would be to generate the values and then use "transform" to convert them into a matrix. This allows me to control the integer values, but still doesn't get the limited number of responses per row.
Example code from the blog follows:
set.seed(11)
N = 10
data = data.frame(
row = sample(1:3, N, replace = TRUE),
col = sample(LETTERS, N, replace = TRUE),
value = sample(1:3, N, replace = TRUE))
data = transform(data,
row = factor(row),
col = factor(col))
This could be tweaked to give the required 80x80 matrix, but it doesn't solve the problem of limiting the responses per row and, when the same row/column combination occurs more than once, it will produce out-of-range values, since duplicate entries are resolved by taking their sum.
Any suggestions would be most appreciated.
As a bonus question, how would you then create random rows of null responses? For example within the 80*80 matrix, how might you introduce 40 random rows with no values? As in the description above, this would correspond to missing survey data.
You can try to build the sparse matrix up using the row (i), column (j) and value (x) components. This involves sampling subject to your row and value constraints.
# constraints
values <- 1:4
maxValuesPerRow <- 10
nrow <- 80
ncol <- 80
# sample values : how many values should each row get but <= 10 values
set.seed(1)
nValuesForEachRow <- sample(maxValuesPerRow, nrow, replace=TRUE)
# create matrix
library(Matrix)
i <- rep(seq_len(nrow), nValuesForEachRow) # row
j <- unlist(lapply(nValuesForEachRow, sample, x=seq_len(ncol))) # which columns
x <- sample(values, sum(nValuesForEachRow), replace=TRUE) # values
sm <- sparseMatrix(i=i, j=j, x=x)
# check
dim(sm)
table(rowSums(sm>0))
table(as.vector(sm))
Note: you can't just sample the columns in one go as below, since this can give duplicate row/column pairs, hence the per-row sampling with lapply above.
j <- sample(seq_len(ncol), sum(nValuesForEachRow), replace=TRUE)
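For the bonus question (random rows with no responses), one possible approach with the same triplet construction is to pick the rows to blank out and drop their entries before the matrix is built. The sketch below is an assumption-laden illustration, not part of the answer above: the names n, nPerRow, emptyRows and keep are made up, and the counts (80 rows, 40 empty) come from the question.

```r
library(Matrix)

set.seed(1)
n <- 80                                                   # 80 x 80 matrix
nPerRow <- sample(10, n, replace = TRUE)                  # 1 to 10 responses per row
i <- rep(seq_len(n), nPerRow)                             # row index of each entry
j <- unlist(lapply(nPerRow, function(k) sample(n, k)))    # distinct columns within a row
x <- sample(1:4, sum(nPerRow), replace = TRUE)            # response values 1 to 4

emptyRows <- sample(n, 40)                                # rows with no responses at all
keep <- !(i %in% emptyRows)                               # drop every entry in those rows

# dims is given explicitly so trailing empty rows/columns are not lost
sm <- sparseMatrix(i = i[keep], j = j[keep], x = x[keep], dims = c(n, n))
```

Passing dims matters here: without it, sparseMatrix() infers the size from the largest index that is still present.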
The code below will do what you want. It generates your random sparse matrix, rounds it to whole numbers, then for every row that has more than 10 entries, randomly makes some entries NA until only 10 remain. It then makes all the non NA entries a random number between 1 and 4.
library(Matrix)
M1<-as.data.frame(as.matrix((rsparsematrix(80, 80, density = .1, symmetric = FALSE))))
M1 <- as.data.frame(apply(M1,1,round))
M1<-as.data.frame(sapply(M1,function(x) ifelse(x==0,NA,x)))
rows<-which(apply(M1,1,function(x) sum(!(is.na(x)))) >10)
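for(i in rows)
{
nonNA <- which(!is.na(M1[i,])) #columns that currently hold a value
toNA <- setdiff(nonNA, sample(nonNA,10,replace=F)) #everything except 10 randomly kept entries
M1[i,toNA] <- NA
}
for(i in 1:nrow(M1))
{
nonNA <- which(!is.na(M1[i,]))
M1[i,nonNA] <- sample(1:4,length(nonNA),replace=T) #replace remaining entries with random integers 1-4
}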
I have two datasets: m and s. The first data set includes variables Frequency, p1, p2 and p3.
The second dataset includes the value for type of regression, mean and sample size. Column names are z, mean, and samplesize, respectively.
I need to add three columns to the first dataset m as follows:
The first column m$reg1 should be m$p1 times the value of s$samplesize corresponding to s$z == 'Regression1'.
The second column m$reg2 should be m$p2 times the value of s$samplesize corresponding to s$z == 'Regression2'.
The third column m$reg3 should be m$p3 times the value of s$samplesize corresponding to s$z == 'Regression3'.
I was wondering how I can write a loop or function for calculating these new columns in the m data set.
See how the datasets are created in the code below:
Frequency<-seq(1,27,1)
p1<-seq(2,28,1)
p2<-seq(10,36,1)
p3<-seq(0,26,1)
m<-data.frame(Frequency,p1,p2,p3)
z<-c('Regression1','Regression2','Regression3','Regression4')
mean<-c(2,28,1,17)
samplesize<-c(10,20,30,40)
s<-data.frame(z,mean,samplesize)
Use the same principle as we applied in this answer. First, define names of columns or row values that would subset tables and then perform the calculation, filling the values into a new, similarly constructed, column.
# custom function that calculates column values
add.col <- function(i){
# name in the s$z that defines the correct row
reg <- paste0("Regression", i)
# name of the m column
p <- paste0("p", i)
# multiply the named column from m with respective samplesize in s
return(m[, p] * s$samplesize[s$z == reg])
}
# loop through all indices
for(i in 1:3){
# create a new column with the compound name and fill it with appropriate values
m[, paste0("reg", i)] <- add.col(i = i)
}
No need for a loop, if I understand your question correctly. Just do:
m$reg1 <- m$p1*s$samplesize[s$z=="Regression1"]
m$reg2 <- m$p2*s$samplesize[s$z=="Regression2"]
m$reg3 <- m$p3*s$samplesize[s$z=="Regression3"]
If you want to do a for loop, this might work as well (it pairs the k-th selected column with the k-th row of s):
desired_col = c(2,3,4) # this can be any selection
for(i in desired_col) { m[[paste0("reg",match(i,desired_col))]] = m[,i]*s[match(i,desired_col),3] }
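The loop can also be avoided without writing each column out by hand, by looking up all the sample sizes at once and multiplying column by column. A minimal sketch using the data from the question; the names idx and size are made up for illustration:

```r
# Data from the question
m <- data.frame(Frequency = seq(1, 27, 1), p1 = seq(2, 28, 1),
                p2 = seq(10, 36, 1), p3 = seq(0, 26, 1))
s <- data.frame(z = c("Regression1", "Regression2", "Regression3", "Regression4"),
                mean = c(2, 28, 1, 17), samplesize = c(10, 20, 30, 40))

idx <- 1:3
# Look up the sample size by name, so the row order of s does not matter
size <- s$samplesize[match(paste0("Regression", idx), s$z)]

# Multiply p1, p2, p3 by their respective sample sizes and store as reg1..reg3
m[paste0("reg", idx)] <- Map(`*`, m[paste0("p", idx)], size)
```

Matching on s$z rather than on row position is a design choice here: it keeps the result correct if s is ever reordered.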
I have been trying for more than an hour to randomly split my data frame into two data frames based on a given percentage, but I can't make it work and I don't know why.
I saw these posts:
How to split data into training/testing sets using sample function in R program
R: How to split a data frame into training, validation, and test sets?
How can divide a dataset based on percentage?
What I want is basically a function that takes as input a data frame df and a real number α ∈ (0, 1), and returns a list consisting of two data frames df1 and df2. df1 contains (α * 100)% of the rows of df, and df2 the rest of df, i.e. the unselected rows.
For example, if df has 100 rows, and α = 0.4, then df1 will consist of 40 randomly selected rows of df, and df2 will consist of the other 60 rows.
I could do it with a big function, loops, and my own algorithm, but I'm pretty sure a simpler way exists, and I would like to share the solution with the community!
Thanks for your help!
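Here is a function that splits the data into two data.frames using sample. Note that each row is assigned independently with probability prob, so the two parts contain only approximately prob and 1 - prob of the rows, not an exact count: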
splitTable <- function(df, prob) {
variant <- sample(seq(1, 0), size = nrow(df), replace = TRUE, prob = c(prob, 1 - prob))
res <- split(df, variant)
return(res)
}
res <- splitTable(iris, 0.4)
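If the split has to contain an exact number of rows, as in the question's example (40 of 100 rows for α = 0.4), one option is to sample row indices without replacement. This is a sketch, not the answer's method: splitExact is a made-up name, and rounding α * nrow(df) to the nearest integer is an assumption about how fractional counts should be handled.

```r
# Hypothetical helper: df1 gets round(alpha * nrow(df)) randomly chosen rows,
# df2 gets all remaining rows.
splitExact <- function(df, alpha) {
  stopifnot(alpha > 0, alpha < 1)
  n1 <- round(alpha * nrow(df))
  picked <- sample(nrow(df), n1)             # row indices, without replacement
  sel <- seq_len(nrow(df)) %in% picked       # logical mask, safe even if n1 is 0
  list(df1 = df[sel, , drop = FALSE],
       df2 = df[!sel, , drop = FALSE])
}

set.seed(123)
parts <- splitExact(iris, 0.4)               # iris has 150 rows
```

With iris this gives 60 rows in parts$df1 and 90 in parts$df2. The logical mask is used instead of negative indexing because df[-picked, ] would return no rows at all when picked is empty.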
In R, I'm trying to resample my dataset.
The database A includes some codes in the first column (integer) and characteristics of each row as follows:
A <- as.matrix(cbind(floor(runif(1000, 1,101)), matrix(rexp(20000, rate=.1), ncol=20) ))
Some codes are repeated in the first column.
I want to resample randomly codes from the first column and create a new matrix or dataframe such that for each entry in the resampled code vector it gives me the right hand side. If there are more vectors with the same resampled code it should include both. Also, if I'm resampling the same code twice, all rows in A with the same resample code should appear twice.
---EDIT---
The resampling is done with replacement. So far what I did is:
res <- resample(unique(A[,1]), size = length(unique(A[,1])) , replace = TRUE, prob= NULL)
A.new <- A[which(A[,1] %in% res),]
However, assume that two lines in A have the same code (say 2), and that the vector res selects 2 four times. In A.new I will only have code 2 twice (because there are two lines coded as 2 in A[,1]), instead of having these two lines repeated four times.
We can do it like this:
A.new = lapply(res, function(x) A[A[,1] == x, , drop=FALSE]) # lapply always returns a list; drop=FALSE keeps single-row matches as matrices
A.new = do.call(rbind, A.new)
The first line makes a list of matrices in which each value of res creates a list item that is the subset of A for which the 1st column equals that value of res. If res contains the same number more than once, a matrix will be created for each occurrence of that value.
The 2nd line uses rbind to condense this list into a single matrix.
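A small worked example may make the duplication behaviour easier to see. The data below is made up for illustration (four rows, with code 2 appearing twice), and res is fixed by hand rather than sampled:

```r
# Toy matrix: first column is the code, second a made-up characteristic
A <- cbind(code = c(1, 2, 2, 3), value = c(10, 20, 21, 30))
res <- c(2, 2, 3)   # code 2 was drawn twice, code 3 once

# One sub-matrix per drawn code; drop = FALSE keeps single-row matches as matrices
pieces <- lapply(res, function(x) A[A[, 1] == x, , drop = FALSE])
A.new <- do.call(rbind, pieces)
```

A.new has five rows: both rows with code 2 appear twice (once per draw), followed by the single row with code 3.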