In R, I'm trying to resample my dataset.
The database A includes some codes in the first column (integer) and characteristics of each row as follows:
A <- as.matrix(cbind(floor(runif(1000, 1,101)), matrix(rexp(20000, rate=.1), ncol=20) ))
Some codes are repeated in the first column.
I want to randomly resample codes from the first column and create a new matrix or data frame such that, for each entry in the resampled code vector, it gives me the corresponding right-hand-side columns. If more than one row has the same resampled code, all of them should be included. Also, if I resample the same code twice, all rows in A with that code should appear twice.
---EDIT---
The resampling is done with replacement. So far what I did is:
res <- resample(unique(A[,1]), size = length(unique(A[,1])) , replace = TRUE, prob= NULL)
A.new <- A[which(A[,1] %in% res),]
However, assume that two rows in A have the same code (say 2), and that the vector res selects 2 four times. In A.new I will only have code 2 twice (because there are two rows coded as 2 in A[,1]), instead of having these two rows repeated four times.
We can do it like this:
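# lapply always returns a list (sapply may simplify); drop = FALSE keeps single-row matches as matrices
A.new = lapply(res, function(x) A[A[,1] == x, , drop = FALSE])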
A.new = do.call(rbind, A.new)
The first line makes a list of matrices in which each value of res creates a list item that is the subset of A for which the 1st column equals that value of res. If res contains the same number more than once, a matrix will be created for each occurrence of that value.
The second line uses rbind to stack this list into a single matrix.
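As a small check of the behaviour described above, here is a minimal sketch on a toy matrix. The values in A and res are made up for illustration, and it collects row indices instead of sub-matrices, which is an alternative formulation rather than the code from the answer:

```r
# Toy data (hypothetical): codes 1, 1, 2, 3 in the first column
A <- cbind(c(1, 1, 2, 3), matrix(1:8, ncol = 2))

# Pretend the resampling step drew code 1 twice and code 3 once
res <- c(1, 3, 1)

# For every drawn code, collect the row numbers of A carrying that code;
# a code drawn twice contributes its rows twice
idx <- unlist(lapply(res, function(x) which(A[, 1] == x)))

# Index once; drop = FALSE keeps a matrix even if only one row is selected
A.new <- A[idx, , drop = FALSE]
A.new
```

Code 1 occupies two rows and was drawn twice, so it contributes four rows; code 3 contributes one.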
New to R btw so I am sorry if it seems like a stupid question.
So basically I have a dataframe with 100 rows and 3 different columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values of each column that are greater than that column's threshold.
Edit: Sorry for the incomplete question.
So essentially what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies every threshold to its respective column of the dataframe (so there is one threshold for every column of the dataframe). The number of elements of each column that "respect" their threshold should then be put in a vector. So for example:
Column 1: values = 1,2,3. Threshold = (only values lower than 3)
Column 2: values = 4,5,6. Threshold = (only values lower than 6)
Output: A vector (2,2) since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!!
Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option to resolve your question is to use sapply(), which applies a function over a list or vector. In this case, I create a vector for the columns in df with 1:ncol(df). Inside the function, you can count the number of values less than a given threshold by summing the number of TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
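Since the question asks for a function that takes the data frame and the threshold vector, the same idea can be wrapped up as follows. This is only a sketch: the name count_below is made up here, and it uses mapply() to walk over columns and thresholds in parallel instead of indexing by column number:

```r
# Hypothetical helper: count, per column, the values strictly below that column's threshold
count_below <- function(df, thresholds) {
  # one threshold per column is assumed
  stopifnot(ncol(df) == length(thresholds))
  # a data frame is a list of columns, so mapply pairs column k with threshold k
  unname(mapply(function(col, th) sum(col < th, na.rm = TRUE), df, thresholds))
}

df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
count_below(df, threshold)
```

With the example data this gives the vector (2, 2) from the question.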
I am not very experienced with R, and have been struggling for days to repeat a string of code to fill a data matrix. My instinct is to create a for loop.
I am a biology student working on colour differences between sets of images, making use of the R package colordistance. The relevant data has been loaded in R as a list of 8x4 matrices (each matrix describes the colours in one image). Five images make up one set and there are 100 sets in total. Each set is identified by a number (not 1-100, it's an interrupted sequence, but I have stored the sequence of numbers in a vector called 'numberlist'). I have written the code to extract the desired data in the right format for the first set, and it is as follows;
## extract the list of matrices belonging to the first set (A3) from the full list
A3<-histlist[grep('^3',names(histlist))]
## create a colour distance matrix (cdm), ie a pairwise comparison of "similarity" between the five matrices stored in A3
cdm3<-colordistance::getColorDistanceMatrix(A3, method="emd", plotting=FALSE)
## convert to data frame to fix row names
cdm3df<-as.data.frame(cdm3)
## remove column names
names(cdm3df)<-NULL
## return elements in the first row and column 2-5 only (retains row names).
cdm3filtered<-cdm3df[1,2:5]
Now I want to replace "3" in the code above with each number in 'numberlist' (not sure whether they should be as.factor or as.numeric). I've had many attempts starting with for (i in numberlist) {...} but with no successful output. To me it makes sense to store the output from the loop in a storage matrix; matrix(nrow=100,ncol=4) but I am very much stuck, and unable to populate my storage matrix row by row by iterating the code above...
Any help would be greatly appreciated!
Updates
What I want the outputs of the loop to look like (+ appended in the storage matrix):
> cdm17filtered
17clr 0.09246918 0.1176651 0.1220622 0.1323586
This is my attempt:
for (i in numberlist$X) {
A[i] <- histlist[grep(paste0('^',i),names(histlist))]
cdm[i] <- colordistance::getColorDistanceMatrix(A[i], method="emd", plotting=FALSE)
cdm[i]df <- as.data.frame(cdm[i])
cdm[i]filtered <- cdm[i]df[1,2:5]
print(A[i]) # *insert in n'th column of storage matrix
}
The above is not working, and I'm missing the last bit needed to store the outputs of the loop in the storage matrix. (I was advised against using rbind to populate the storage matrix because it is slow..)
In your attempt, names such as cdm[i]df and cdm[i]filtered are not valid R names: they contain unescaped non-alphanumeric characters. It seems you intend to index into a larger container, such as a list of objects.
To generalize your ^3 process to every item in numberlist, build empty lists and, inside the loop, assign into them by position with [[k]]. Loop over positions rather than over the set numbers themselves: the numbers are an interrupted sequence, so using them directly as indices would leave gaps in the lists.
# INITIALIZE LISTS (ONE ELEMENT PER SET NUMBER)
set_ids <- numberlist$X
A <- vector(mode="list", length = length(set_ids))
cdm_matrices <- vector(mode="list", length = length(set_ids))
cdm_dfs <- vector(mode="list", length = length(set_ids))
cdm_filtered_dfs <- vector(mode="list", length = length(set_ids))
# POPULATE LISTS
for (k in seq_along(set_ids)) {
  i <- set_ids[k]
  ## extract the list of matrices belonging to set i
  A[[k]] <- histlist[grep(paste0('^', i), names(histlist))]
  ## create a colour distance matrix (cdm)
  cdm_matrices[[k]] <- colordistance::getColorDistanceMatrix(A[[k]], method="emd", plotting=FALSE)
  ## convert to data frame to fix row names and remove column names
  cdm_dfs[[k]] <- setNames(as.data.frame(cdm_matrices[[k]]), NULL)
  ## return elements in the first row and column 2-5 only (retains row names).
  cdm_filtered_dfs[[k]] <- cdm_dfs[[k]][1, 2:5]
}
Alternatively, if you only need the last object, cdm_filtered_df, use lapply. You then do not need to create or index lists yourself, and all intermediate objects are local to the function (i.e., never saved in the global environment):
cdm_build <- function(i) {
A <- histlist[grep(paste0('^', i), names(histlist))]
cdm <- colordistance::getColorDistanceMatrix(A, method="emd", plotting=FALSE)
cdm_df <- setNames(as.data.frame(cdm), NULL)
cdm_filtered_df <- cdm_df[1,2:5]
return(cdm_filtered_df) # REDUNDANT AS LAST LINE IS RETURNED BY DEFAULT
}
# LIST OF FILTERED CDM DATA FRAMES
cdm_filtered_dfs <- lapply(numberlist$X, cdm_build)
Finally, with either solution above, should you want to build a single data frame, run rbind in a do.call():
cdm_final_df <- do.call(rbind, cdm_filtered_dfs)
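If you would still rather fill the pre-allocated storage matrix from the question row by row, the pattern can be sketched as below. Everything here is a stand-in so the example runs without the colordistance package: the set numbers, the "_img" naming of histlist, and toy_cdm() (a plain Euclidean distance) are all assumptions, not the real data or the real getColorDistanceMatrix():

```r
set.seed(1)
ids <- c(3, 17, 42)                      # hypothetical interrupted sequence of set numbers

# Toy stand-in for histlist: 5 images per set, each an 8x4 matrix
histlist <- list()
for (id in ids) for (j in 1:5)
  histlist[[paste0(id, "_img", j)]] <- matrix(runif(32), nrow = 8)

# Placeholder for colordistance::getColorDistanceMatrix(): pairwise Euclidean
# distances between the flattened matrices of one set
toy_cdm <- function(mats) as.matrix(dist(t(sapply(mats, as.vector))))

# Pre-allocate one row per set, four columns for image 1 vs images 2-5
storage <- matrix(NA_real_, nrow = length(ids), ncol = 4,
                  dimnames = list(as.character(ids), NULL))

for (k in seq_along(ids)) {
  # the trailing "_" stops "^3" from also matching a set such as 30
  set_k <- histlist[grep(paste0("^", ids[k], "_"), names(histlist))]
  storage[k, ] <- toy_cdm(set_k)[1, 2:5]  # first row, columns 2-5
}
storage
```

Indexing the rows by position k (not by the set number) is what keeps the matrix at exactly one row per set.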
I have two datasets: m and s. The first data set includes variables Frequency, p1, p2 and p3.
The second dataset includes the value for type of regression, mean and sample size. Column names are z, mean, and samplesize, respectively.
I need to add three columns to the first dataset m as follows:
The first column m$reg1 should be m$p1 times the value of s$samplesize corresponding to s$z == 'Regression1'.
The second column m$reg2 should be m$p2 times the value of s$samplesize corresponding to s$z == 'Regression2'.
The third column m$reg3 should be m$p3 times the value of s$samplesize corresponding to s$z == 'Regression3'.
I was wondering how I can write a loop for calculating these new columns in the m data set.
See how the datasets are created in the code below:
Frequency<-seq(1,27,1)
p1<-seq(2,28,1)
p2<-seq(10,36,1)
p3<-seq(0,26,1)
m<-data.frame(Frequency,p1,p2,p3)
z<-c('Regression1','Regression2','Regression3','Regression4')
mean<-c(2,28,1,17)
samplesize<-c(10,20,30,40)
s<-data.frame(z,mean,samplesize)
Use the same principle as we applied in this answer. First, define names of columns or row values that would subset tables and then perform the calculation, filling the values into a new, similarly constructed, column.
# custom function that calculates column values
add.col <- function(i){
# name in the s$z that defines the correct row
reg <- paste0("Regression", i)
# name of the m column
p <- paste0("p", i)
# multiply the named column from m with respective samplesize in s
return(m[, p] * s$samplesize[s$z == reg])
}
# loop through all indices
for(i in 1:3){
# create a new column with the compound name and fill it with appropriate values
m[, paste0("reg", i)] <- add.col(i = i)
}
No need for a loop, if I understand your question correctly. Just do:
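m$reg1 <- m$p1*s$samplesize[s$z=="Regression1"]
m$reg2 <- m$p2*s$samplesize[s$z=="Regression2"]
m$reg3 <- m$p3*s$samplesize[s$z=="Regression3"]
If you want to do a for loop this might work as well:
desired_col = c(2,3,4) # column positions of p1, p2, p3 in m; this can be any selection
for(i in desired_col) {
  k = match(i, desired_col)   # 1, 2, 3: position within the selection
  # relies on the rows of s being in the same order as the selected columns
  m[[paste0("reg", k)]] = m[,i]*s$samplesize[k]
}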
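For completeness, all three columns can also be created in one step without an explicit loop. This is a sketch of an alternative, not something from the answers above; it looks the sample sizes up by name with match(), so it does not depend on the row order of s:

```r
# Data from the question
m <- data.frame(Frequency = seq(1, 27, 1), p1 = seq(2, 28, 1),
                p2 = seq(10, 36, 1), p3 = seq(0, 26, 1))
s <- data.frame(z = c('Regression1', 'Regression2', 'Regression3', 'Regression4'),
                mean = c(2, 28, 1, 17), samplesize = c(10, 20, 30, 40))

idx <- 1:3
# sample size belonging to Regression1..3, found by name
sizes <- s$samplesize[match(paste0("Regression", idx), s$z)]

# Map multiplies column p_k by sizes[k]; the result is assigned to reg_k
m[paste0("reg", idx)] <- Map(`*`, m[paste0("p", idx)], sizes)
head(m, 3)
```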
I have this piece of script for R and I want to adjust it a little bit.
Here's the script I have, mydata is an imported .csv file of n columns:
library(orddom)
R=6
delta = numeric (R)
for (i in 1:R) {
a <- data.matrix(sample(mydata, 2, replace=FALSE))
drops <- c(colnames(a))
b <- data.matrix(mydata[,!(names(mydata) %in% drops)])
a1 <- na.omit(t(matrix(a,1)))
b1 <- na.omit(t(matrix(b,1)))
colnames(a1) <- c("Group 1")
colnames(b1) <- c("Group 2")
delta[i] <- abs(as.numeric(orddom(a1, b1, alpha = 0.05, paired=FALSE)[13,1]))
}
The problem is that for a, the columns of mydata are sampled randomly, so several delta values come out equal: on every iteration there is a chance that the same set of columns gets selected again.
Instead of random sampling, I want to go through all possible column combinations exactly once. Order should not matter (columns 1, 2 and 3 are the same combination as columns 2, 1 and 3, and so on), and a column should never be combined with itself.
Is there a way to exclude column combinations that have already been selected before?
Then I would like to calculate delta for every combination and store it in a vector.
orddom: Ordinal Dominance Statistics
You can try the following:
#get the combos outside the loop
combos<-combn(length(mydata),2)
R<-ncol(combos)
delta<-numeric(R)
#in the loop, replace the first line
a <- mydata[,combos[,i]]
#the rest should be ok
There are some improvements you could make in the code but they are not relevant in what you are asking.
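Put together, the loop over all combinations might look like the sketch below. To keep it runnable without the orddom package, mydata is simulated and effect_size() is a made-up placeholder (absolute difference in means) standing in for the orddom() call from the question:

```r
set.seed(1)
# Toy stand-in for the imported CSV (assumption: four numeric columns)
mydata <- data.frame(c1 = rnorm(10), c2 = rnorm(10), c3 = rnorm(10), c4 = rnorm(10))

# Hypothetical placeholder for abs(as.numeric(orddom(a1, b1, ...)[13, 1]))
effect_size <- function(g1, g2) abs(mean(g1) - mean(g2))

# Each column of combos is one unordered pair of column indices, listed once
combos <- combn(ncol(mydata), 2)
delta <- numeric(ncol(combos))

for (i in seq_len(ncol(combos))) {
  keep <- combos[, i]
  # pooled values of the two selected columns vs. all remaining columns
  a1 <- na.omit(unlist(mydata[, keep], use.names = FALSE))
  b1 <- na.omit(unlist(mydata[, -keep, drop = FALSE], use.names = FALSE))
  delta[i] <- effect_size(a1, b1)
}

# label each delta with the pair of columns it came from
names(delta) <- apply(combos, 2, paste, collapse = "-")
delta
```

Because combn() enumerates each pair exactly once, no bookkeeping of "already selected" combinations is needed.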
I simulated a data matrix containing 200 rows x 1000 columns. It contains 0's and 1's in a binomial distribution. The probability of a 1 occurring depends on a probability matrix that I've created.
I then transpose this data matrix and convert it to a data frame. I created a function that will introduce missing data to each row of the data frame. The function will also add three columns to the data frame after the missing data is introduced. One column is the computed frequency of 1's across each of the 1000 rows. The 2nd column is the computed frequency of 0's across each row. The 3rd column is the frequency of missing values across each row.
I would like to repeat this function 500 times with the same input data frame (the one with no missing values) and output three data frames: one with 500 columns containing all of the computed frequencies of 0's (one column per simulation), one with 500 columns containing all of the computed frequencies of 1's, and one with 500 columns of the missing data frequencies.
I have seen mapply() used for something similar, but was not sure if it would work in my case. How can I repeatedly apply a function to a data frame and store the output of each computation performed within that function every time that function is repeated?
Thank you!
####Load Functions####
###Compute freq of 0's
compute.al0 = function(GEcols){
(sum(GEcols==0, na.rm=TRUE)/sum(!is.na(GEcols)))
}
###Compute freq of 1's
compute.al1 = function(GEcols){
(sum(GEcols==1, na.rm=TRUE)/sum(!is.na(GEcols)))
}
#Introduce missing data
addmissing = function(GEcols){
newdata = GEcols
num.cols = 200
num.miss = 10
set.to.missing = sample(num.cols, num.miss, replace=FALSE) #select num.miss to be set to missing
newdata[set.to.missing] = NA
return(newdata) #why is the matrix getting transposed during this??
}
#Introduce missing data and re-compute freq of 0's and 1's, and missing data freq
rep.missing = function(GEcols){
indata = GEcols
missdata = apply(indata,1,addmissing)
missdata.out = as.data.frame(missdata) #have to get the df back in the right format
missdata.out.t = t(missdata.out)
missdata.new = as.data.frame(missdata.out.t)
missdata.new$allele.0 = apply(missdata.new[,1:200], 1, compute.al0) #compute freq of 0's
missdata.new$allele.1 = apply(missdata.new[,1:200], 1, compute.al1) #compute freq of 1's
missdata.new$miss = apply(missdata.new[,1:200], 1, function(x) {(sum(is.na(x)))/200}) #compute missing
return(missdata.new)
}
#Generate a data matrix with no missing values
datasim = matrix(0, nrow=200, ncol=1000) #pre-allocated matrix of 0's of desired size
probmatrix = col(datasim)/1000 #probability matrix, each of the 1000 columns will have a different prob
datasim2 = matrix(rbinom(200 * 1000,1,probmatrix),
nrow=200, ncol=1000, byrow=FALSE) #new matrix of 0's and 1's based on probabilities
#Assign column names
cnum = 1:1000
cnum = paste("M",cnum,sep='')
colnames(datasim2) = cnum
#Assign row names
rnum = 1:200
rnum = paste("L",rnum,sep='')
rownames(datasim2) = rnum
datasim2 = t(datasim2) #data will be used in the transposed form
datasim2 = as.data.frame(datasim2)
#add 10 missing values per row and compute new frequencies
datasim.miss = rep.missing(datasim2)
#Now, how can I repeat the rep.missing function
#500 times and store the output of the new frequencies
#generated from each repetition?
Update:
Frank, thank you for the replicate() suggestion. I am able to return the repetitions by changing return(missdata.new) to return(list(missdata.new)) in the rep.missing() function. I then call the function with replicate(500,rep.missing(datasim2), simplify="matrix").
This is almost exactly what I want. I would like to do
return(list(missdata.new$allele.0, missdata.new$allele.1, missdata.new$miss))
in rep.missing() and return each of these 3 vectors as 3 column bound data frames within a list. One data frame holds the 500 repetitions of missdata.new$allele.0, one holds the 500 repetitions of missdata.new$allele.1, etc.
replicate(500, rep.missing(datasim2), simplify="matrix")
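One way to get from there to the three data frames is to have each repetition return a small matrix of the three frequencies and let replicate() stack the repetitions into a 3-D array. The sketch below is scaled down (6 rows, 20 columns, 5 repetitions) and uses made-up names (dat, one.rep, n.rep); it is an illustration of the pattern under those assumptions, not a drop-in replacement for rep.missing():

```r
set.seed(42)
# Small stand-in for datasim2: 6 rows x 20 columns of 0/1
dat <- matrix(rbinom(6 * 20, 1, 0.5), nrow = 6)

# One repetition: set n.miss entries per row to NA, then return the three
# per-row frequencies as the columns of a matrix
one.rep <- function(x, n.miss = 3) {
  # apply() over rows returns its results as columns, hence the t()
  miss <- t(apply(x, 1, function(r) { r[sample(length(r), n.miss)] <- NA; r }))
  cbind(allele.0 = rowMeans(miss == 0, na.rm = TRUE),  # share of 0's among non-missing
        allele.1 = rowMeans(miss == 1, na.rm = TRUE),  # share of 1's among non-missing
        miss     = rowMeans(is.na(miss)))              # share of missing values
}

n.rep <- 5
# array of dimension rows x 3 x n.rep
sims <- replicate(n.rep, one.rep(dat), simplify = "array")

# one data frame per statistic, one column per repetition
freq0 <- as.data.frame(sims[, "allele.0", ])
freq1 <- as.data.frame(sims[, "allele.1", ])
fmiss <- as.data.frame(sims[, "miss", ])
```

Scaling up should only need the real data in place of dat, n.miss = 10 and n.rep = 500.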
I am not sure I understand which part you don't know how to do.
If the problem is how to store your results across repetitions, one way would be to use a global variable and, inside your function, assign with <<- instead of <- or =.
x=c()
func = function(i){x <<- c(x,i) }
sapply(1:5,func)
mapply is for applying a function over multiple input lists or vectors in parallel.
You want to repeat your function 500 times, so you can always do
sapply(1:500,func)