Choosing the Best Combination of Values from a Data Frame in R

I have a data frame with 20 rows and 10 columns. Each value in the data is a number between 0 and 10.
I want to pick the combination of values with the highest sum, and I have to pick one and only one value from each column.
Is there a ready R function that does this, or an implementation of a known algorithm?
Alternatively, is there an R function that generates all the possible combinations, from which I could pick the one with the highest sum?

Is this what you're trying to do? (I'm assuming your data frame is named df.)
maxList <- which(df$col1 == max(df[, 1])) #initialize the rows already used for a chosen maximum
total <- max(df[, 1]) #initialize the running sum of chosen values
combination <- c(total) #initialize the list of those chosen values
for (i in 2:ncol(df)) { #for the remaining columns in df
  subCol <- df[, i]
  subCol[maxList] <- -Inf #exclude rows whose value was already chosen (0 would be ambiguous, since 0 is a legal value)
  maxList <- c(maxList, which.max(subCol)) #record the row of this column's maximum
  combination <- c(combination, max(subCol))
  total <- total + max(subCol) #update total
}
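For reference: if the same row may supply values to more than one column (i.e. the only constraint really is one value per column), no search is needed at all, because the per-column maxima are already optimal. A minimal sketch under that assumption:
best <- apply(df, 2, max) #the chosen value from each column
sum(best) #the highest attainable sum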

Related

Function to show which quartile number data belong to in a large list of elements

I'm experimenting with the quantile function in independent dataframes.
A very easy example to illustrate my case:
# get quartiles
quantile(x <- rnorm(1001))
0% 25% 50% 75% 100%
-2.930587810 -0.687108751 0.004405246 0.644589258 2.839597566
# subdivide the quantile results into 5 independent results (data frames). For example:
list2env(setNames(as.list(quantile(x <- rnorm(1001))),paste0("Q",1:5)),.GlobalEnv)
So now, in a new column next to the data, I want each value grouped into its corresponding quartile number: Q0, Q1, Q2, Q3, Q4.
Now I'd like to apply the same to a large list (large_list) with more than 400 elements, so I guess I need a different approach (a function) that applies it globally to all 400 elements of my list.
Here I need the community's help; this is my approach:
#Read all elements of the list in the environment,create a new column to be named,
# Elementname.Quartilenumber that contains which
# Q (0,1,2,3,4) number the data belongs to.
Qnumber <- function(x) {
  element_name <- stringi::stri_extract(names(x)[1], regex = "^[A-Z]+")
  element_name <- paste0(element_name, ".Quartilenumber")
  column_names <- c(names(x), element_name)
  x$quartile <- quantile(large_list$.)
  x <- setNames(x, column_names)
  return(x)
}
Any help will be very much appreciated.
Thank you very much.
For each element in your list, do the following:
1. Calculate the quantiles, as you have done: qx <- quantile(x).
2. Count how many of the quantile cut points each datum is greater than or equal to: sum(x[i] >= qx). This corresponds to the quartile number in all but one case, the maximum value (there the sum is 5, which indexes past the end of the four quartile names and gives NA).
3. Set the quartile for the maximum value to the 4th quartile ('Q4').
Here are some fake data (a list of data frames):
list.1 <- list()
for (i in 1:5) {
  list.1[[i]] <- data.frame(elem_data = rnorm(10))
}
Step through the list of data.frames and add the quartile column.
qnames <- c('Q1','Q2','Q3','Q4')
for (i in 1:5) {
  qx <- quantile(list.1[[i]]$elem_data)
  list.1[[i]]$qnum <- sapply(list.1[[i]]$elem_data, function(x) qnames[sum(x >= qx)])
  list.1[[i]]$qnum[is.na(list.1[[i]]$qnum)] <- qnames[4] #the maximum indexes past Q4; fix it up
}
I tried this with a list of 1000 data.frames with 1000 data elements each, and it took about 2.5 seconds (on a mid-2013 MacBook Air).
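To also name the new column after the list element, as the question asks, here is a sketch assuming large_list is a named list of single-column data frames; findInterval replaces the NA fix-up:
qnames <- c('Q1','Q2','Q3','Q4')
large_list <- setNames(lapply(names(large_list), function(nm) {
  x <- large_list[[nm]]
  qx <- quantile(x[[1]])
  # findInterval returns 1..5 against the five cut points; capping at 4 puts the maximum in Q4
  x[[paste0(nm, ".Quartilenumber")]] <- qnames[pmin(findInterval(x[[1]], qx), 4)]
  x
}), names(large_list))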

How to Delete Every Row and Column Which Contains a Negative Value

I have a data frame called lexico which has dimensions 11293 x 512.
I'd like to purge every row and column if any element in that row or column holds a negative value.
How could I do this?
Following is the code that I tried, but it takes too long to run since it has a nested loop structure.
(I was about to first collect every column number that holds a negative value:)
colneg <- c()
for (i in 1:11293) {
  for (j in 1:512) {
    if (as.numeric(as.character(lexico[i, j])) < 0)
      colneg <- c(colneg, j)
  }
}
I would appreciate your advice, however harsh, for this novice.
A possible solution:
# create an index of columns with negative values
col_index <- !colSums(d < 0)
# create an index of rows with negative values
row_index <- !rowSums(d < 0)
# subset the dataframe with the two indexes
d2 <- d[row_index, col_index]
What this does:
colSums(d < 0) gives a numeric vector with the number of negative values in each column.
Negating it with ! creates a logical vector in which the columns with no negative values get TRUE.
It works the same way for rows.
Subsetting the data frame with row_index and col_index gives you a data frame in which the rows as well as the columns where negative values appeared are removed.
Reproducible example data:
set.seed(171228)
d <- data.frame(matrix(rnorm(1e4, mean = 3), ncol = 20))
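Applied to the question's data frame, a sketch that assumes lexico's columns may need the same numeric coercion the original loop used:
lexico_num <- as.data.frame(lapply(lexico, function(x) as.numeric(as.character(x))))
# na.rm guards against NAs introduced by the coercion (an assumption about the data)
keep_cols <- !colSums(lexico_num < 0, na.rm = TRUE)
keep_rows <- !rowSums(lexico_num < 0, na.rm = TRUE)
lexico2 <- lexico[keep_rows, keep_cols]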

Faster method of counting specified values from rows in large matrix in R

MC is a very large matrix, 1e6 rows (or more) and 500 columns. I am trying to get the number of occurrences of the values 1 through 13 for each of the columns. Sometimes the number of occurrences for one of these values will be zero. I would like my final output to be a 500 x 13 matrix (or data frame) with these count values. I am wondering if anyone can suggest a more efficient manner than what I currently have, which is the following:
MCct <- matrix(0, 500, 13)
for (j in 1:500) {
  for (i in 1:13) {
    MCct[j, i] <- length(which(MC[, j] == i))
  }
}
I don't think that table works, because I also need to know when zero occurrences occur; I couldn't figure out how to do that, if it is possible. And I am only somewhat familiar with apply, so maybe there is a method using that; I haven't been successful in figuring it out yet.
Thanks for the help,
Vivien
You could do this with sapply (to iterate from 1 to 13) and colSums (to add up the columns of j):
MCct <- sapply(1:13, function(i) {
  colSums(MC == i)
})
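A quick check on small simulated data (a sketch; the dimensions are made up):
set.seed(1)
MC <- matrix(sample(0:15, 1e4, replace = TRUE), ncol = 500)
MCct <- sapply(1:13, function(i) colSums(MC == i))
dim(MCct) # 500 x 13: one row per column of MC, one column per counted value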
Suppose you have a set of values you're interested in
set <- 1:4
n = length(set)
and you have a matrix that includes those values, and others
m <- matrix(sample(10, 120, TRUE), 12, 10)
Create a vector indicating the index in the set of each matching value
idx <- match(m, set)
then make the index unique to each column
idx <- idx + (col(m) - 1) * n
idx ranges from 1 (occurrences of the first set element in the first column) to n * ncol(m) (occurrence of the nth set element in the last column of m). Tabulate the unique values of idx
v <- tabulate(idx, nbins = n * ncol(m))
The first n elements of v summarize the number of times set elements 1..n appear in the first column of m. The second n elements of v summarize the number of times set elements 1..n appear in the second column of m, etc. Reshape as the desired matrix, where each row represents the corresponding member of the set.
matrix(v, ncol=ncol(m))
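Putting these steps together as a small helper (a sketch; count_set is a made-up name):
count_set <- function(m, set) {
  n <- length(set)
  idx <- match(m, set) + (col(m) - 1) * n # NA for values outside the set; tabulate ignores NAs
  matrix(tabulate(idx, nbins = n * ncol(m)), ncol = ncol(m),
         dimnames = list(set, colnames(m)))
}
count_set(m, set) # rows are the set members, columns are the columns of m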
table can count zero occurrences; you just need to create a factor that covers the whole range of levels, e.g.
apply(MC, 2, function(x) table(factor(x, levels = 1:13)))
Note that this returns a 13 x 500 matrix (levels by columns of MC), so transpose it if you want the orientation above. It is not as efficient as @Patronus' solution, though.

Combining information of two data sets with a loop function

I have two datasets: m and s. The first data set includes variables Frequency, p1, p2 and p3.
The second dataset includes the value for type of regression, mean and sample size. Column names are z, mean, and samplesize, respectively.
I need to add three columns to the first dataset m, as follows:
The first column m$reg1 should be m$p1 times the value of s$samplesize corresponding to s$z == 'Regression1'.
The second column m$reg2 should be m$p2 times the value of s$samplesize corresponding to s$z == 'Regression2'.
The third column m$reg3 should be m$p3 times the value of s$samplesize corresponding to s$z == 'Regression3'.
I was wondering how I can write a loop for calculating these new columns in the m dataset.
See how the datasets are created in the code below:
Frequency<-seq(1,27,1)
p1<-seq(2,28,1)
p2<-seq(10,36,1)
p3<-seq(0,26,1)
m<-data.frame(Frequency,p1,p2,p3)
z<-c('Regression1','Regression2','Regression3','Regression4')
mean<-c(2,28,1,17)
samplesize<-c(10,20,30,40)
s<-data.frame(z,mean,samplesize)
Use the same principle as we applied in this answer. First, define names of columns or row values that would subset tables and then perform the calculation, filling the values into a new, similarly constructed, column.
# custom function that calculates column values
add.col <- function(i) {
  # name in s$z that defines the correct row
  reg <- paste0("Regression", i)
  # name of the m column
  p <- paste0("p", i)
  # multiply the named column from m with the respective samplesize in s
  return(m[, p] * s$samplesize[s$z == reg])
}
# loop through all indices
for (i in 1:3) {
  # create a new column with the compound name and fill it with appropriate values
  m[, paste0("reg", i)] <- add.col(i = i)
}
No need for a loop, if I understand your question correctly. Just do:
m$regr1 <- m$p1*s$samplesize[s$z=="Regression1"]
m$regr2 <- m$p2*s$samplesize[s$z=="Regression2"]
m$regr3 <- m$p3*s$samplesize[s$z=="Regression3"]
If you want to do a for loop, this might work as well:
desired_col <- c(2, 3, 4) # this can be any selection
for (i in desired_col) {
  m[[paste0(i, "reg")]] <- m[, i] * s[match(i, desired_col), 3]
}

Repeat a function on a data frame and store the output

I simulated a data matrix with 200 rows and 1000 columns. It contains 0's and 1's drawn from a binomial distribution. The probability of a 1 occurring depends on a probability matrix that I've created.
I then transpose this data matrix and convert it to a data frame. I created a function that will introduce missing data to each row of the data frame. The function will also add three columns to the data frame after the missing data is introduced. One column is the computed frequency of 1's across each of the 1000 rows. The 2nd column is the computed frequency of 0's across each row. The 3rd column is the frequency of missing values across each row.
I would like to repeat this function 500 times with the same input data frame (the one with no missing values) and output three data frames: one with 500 columns containing all of the computed frequencies of 0's (one column per simulation), one with 500 columns containing all of the computed frequencies of 1's, and one with 500 columns of the missing data frequencies.
I have seen mapply() used for something similar, but was not sure if it would work in my case. How can I repeatedly apply a function to a data frame and store the output of each computation performed within that function every time that function is repeated?
Thank you!
####Load Functions####
###Compute freq of 0's
compute.al0 = function(GEcols){
  (sum(GEcols==0, na.rm=TRUE)/sum(!is.na(GEcols)))
}
###Compute freq of 1's
compute.al1 = function(GEcols){
  (sum(GEcols==1, na.rm=TRUE)/sum(!is.na(GEcols)))
}
#Introduce missing data
addmissing = function(GEcols){
  newdata = GEcols
  num.cols = 200
  num.miss = 10
  set.to.missing = sample(num.cols, num.miss, replace=FALSE) #select num.miss positions to be set to missing
  newdata[set.to.missing] = NA
  return(newdata) #why is the matrix getting transposed during this??
}
#Introduce missing data and re-compute freq of 0's and 1's, and missing data freq
rep.missing = function(GEcols){
  indata = GEcols
  missdata = apply(indata, 1, addmissing)
  missdata.out = as.data.frame(missdata) #have to get the df back in the right format
  missdata.out.t = t(missdata.out)
  missdata.new = as.data.frame(missdata.out.t)
  missdata.new$allele.0 = apply(missdata.new[,1:200], 1, compute.al0) #compute freq of 0's
  missdata.new$allele.1 = apply(missdata.new[,1:200], 1, compute.al1) #compute freq of 1's
  missdata.new$miss = apply(missdata.new[,1:200], 1, function(x) {(sum(is.na(x)))/200}) #compute missing freq
  return(missdata.new)
}
#Generate a data matrix with no missing values
datasim = matrix(0, nrow=200, ncol=1000) #pre-allocated matrix of 0's of the desired size
probmatrix = col(datasim)/1000 #probability matrix; each of the 1000 columns gets a different prob
datasim2 = matrix(rbinom(200 * 1000, 1, probmatrix),
                  nrow=200, ncol=1000, byrow=FALSE) #new matrix of 0's and 1's based on the probabilities
#Assign column names
colnames(datasim2) = paste0("M", 1:1000)
#Assign row names
rownames(datasim2) = paste0("L", 1:200)
datasim2 = t(datasim2) #data will be used in the transposed form
datasim2 = as.data.frame(datasim2)
#add 10 missing values per row and compute new frequencies
datasim.miss = rep.missing(datasim2)
#Now, how can I repeat the rep.missing function
#500 times and store the output of the new frequencies
#generated from each repetition?
Update:
Frank, thank you for the replicate() suggestion. I am able to return the repetitions by changing return(missdata.new) to return(list(missdata.new)) in the rep.missing() function. I then call the function with replicate(500,rep.missing(datasim2), simplify="matrix").
This is almost exactly what I want. I would like to do
return(list(missdata.new$allele.0, missdata.new$allele.1, missdata.new$miss))
in rep.missing() and return each of these 3 vectors as 3 column bound data frames within a list. One data frame holds the 500 repetitions of missdata.new$allele.0, one holds the 500 repetitions of missdata.new$allele.1, etc.
replicate(500, rep.missing(datasim2), simplify="matrix")
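A sketch of that collection step, assuming rep.missing() is changed to end with return(list(allele.0 = missdata.new$allele.0, allele.1 = missdata.new$allele.1, miss = missdata.new$miss)):
reps <- replicate(500, rep.missing(datasim2), simplify = FALSE)
# bind the 500 repetitions of each component into its own data frame, one column per run
allele0.df <- as.data.frame(lapply(reps, `[[`, "allele.0"), col.names = paste0("rep", 1:500))
allele1.df <- as.data.frame(lapply(reps, `[[`, "allele.1"), col.names = paste0("rep", 1:500))
miss.df <- as.data.frame(lapply(reps, `[[`, "miss"), col.names = paste0("rep", 1:500))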
I am not sure I understand which part you don't know how to do.
If you don't know how to repeatedly store your results, one way would be to use a global variable and, inside your function, do <<- assignments instead of <- or =:
x <- c()
func <- function(i) { x <<- c(x, i) }
sapply(1:5, func)
mapply is for repeating a function over multiple input lists or vectors.
You want to repeat your function 500 times, so you can always do
sapply(1:500, func)
