R: Refer to the chromosome in the evaluation function of rbga.bin

I am trying to use the rbga.bin genetic function in R.
I have a dataframe with 40 observations (rows) and 189 metrics (columns). In the evaluation function, I have to run a Principal Component Analysis on both the original dataset and the "chromosome dataset" (i.e., the dataframe with some of the metrics columns - the ones that have 1s in the chromosome) in order to produce the fitness score.
For example, a possible solution (chromosome) is the following:
(1,1,1,0,0,...,0)
The dataset that I would want to run the PCA on would then contain only the first 3 columns of the original dataset.
How can I refer to that "reduced" dataset inside the evaluation function?

It seems that the variable you provide to the evaluation function is the chromosome, i.e. the binary vector. You can get the reduced dataset the following way.
Assume chromosome is the binary vector, original is the starting dataframe and reduced is the resulting dataframe with only the columns that are 1 in the chromosome.
keep <- as.logical(chromosome)              # TRUE where the chromosome has a 1
reduced <- original[, keep, drop = FALSE]   # keep only the selected columns
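To show where that subsetting would sit, here is a minimal sketch of an evaluation function. The toy data, the helper pc1_share and the score itself are assumptions for illustration only; the question does not say how the fitness is computed from the two PCAs.

```r
# Toy stand-in for the real data: 40 observations, 6 metrics (assumed)
set.seed(1)
original <- as.data.frame(matrix(rnorm(40 * 6), nrow = 40))

# Share of the total variance captured by the first principal component
pc1_share <- function(data) {
  pca <- prcomp(data, center = TRUE, scale. = TRUE)
  pca$sdev[1]^2 / sum(pca$sdev^2)
}

# Hypothetical evaluation function: its only input is the chromosome
evaluate <- function(chromosome) {
  keep <- as.logical(chromosome)
  if (sum(keep) < 2) return(Inf)               # assumed penalty for tiny subsets
  reduced <- original[, keep, drop = FALSE]    # the "chromosome dataset"
  abs(pc1_share(original) - pc1_share(reduced))  # placeholder score, not the asker's
}

chromosome <- c(1, 1, 1, 0, 0, 0)
score <- evaluate(chromosome)
```

Because original is looked up in the enclosing environment, the function only needs the chromosome as an argument. With the genalg package it would presumably be passed as rbga.bin(size = ncol(original), evalFunc = evaluate), which, as far as I know, treats lower scores as better.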

Related

Apply function in R that multiplies Df Columns by matrix rows

Fairly new to R (used for much simpler stuff), coming from a deeper SAS background
I have a Dataframe which contains multiple types of data, amongst which 5 ratios, used as factors in logistic regression.
The factors are then transformed using a logistic transformation, subject to parameters that are given.
I need to apply those parameters to a longer dataset to essentially apply that logistic model to my own dataset (This is for validation purposes, so the parameters have to be exactly applied).
The dataframe would look something like:
obs unique_identifier event Regressor_1 Regressor_2 ... Regressor_5 Obs_date
no no factor no no no date
The dataset also has other columns, but let's keep it short.
Parameters are contained in a separate dataframe that looks like
Regressor Slope Sign Mid-point mean deviation
Regressor1 .. .. .... ... ....
Regressor2
and so forth. What I need is to perform an operation so that I get:
Regressor1_Score = F(Regressor1, parameters in matrix)
What is the best way to get that in R? Something like mapply? How can you specify that the parameters (rows in the 2nd df) have to be applied to the relevant columns in the first df?
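One possible approach, sketched under assumptions: the Regressor column of the parameter table holds the exact column names of the data, and the transformation is a plain logistic curve. The function logistic_score, the toy values and the reduced parameter set (Slope, Sign, Midpoint) are all hypothetical; the real formula would have to replace them.

```r
# Toy data and parameter table (values are made up for illustration)
df <- data.frame(Regressor_1 = c(0.1, 0.5, 0.9),
                 Regressor_2 = c(10, 20, 30))
params <- data.frame(Regressor = c("Regressor_1", "Regressor_2"),
                     Slope     = c(2, 0.1),
                     Sign      = c(1, -1),
                     Midpoint  = c(0.5, 20),
                     stringsAsFactors = FALSE)

# Hypothetical transformation; substitute the model's actual formula
logistic_score <- function(x, slope, sign, midpoint) {
  1 / (1 + exp(-sign * slope * (x - midpoint)))
}

# Map walks the parameter rows in parallel and picks the column by name
scores <- Map(function(name, slope, sign, midpoint) {
  logistic_score(df[[name]], slope, sign, midpoint)
}, params$Regressor, params$Slope, params$Sign, params$Midpoint)

names(scores) <- paste0(params$Regressor, "_Score")
df <- cbind(df, as.data.frame(scores))
```

Selecting the column with df[[name]] ties each parameter row to its regressor by name, so the order of the rows in the parameter table does not matter.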

how to see order rank of a correlation matrix using corrgram

I wanted to generate correlation matrices made up of the correlations between pairs of rows. I used the corrgram function to generate them. In my first attempt, the function generated a correlation matrix whose diagonal was filled with ranks.
corrgram(t(datasetA),order="GW")
a sample of the output
However, when I use it on my second dataset, the diagonal of the correlation matrix is somehow filled with varxxx strings instead of the correlation ranks.
corrgram(t(datasetB),order="GW")
The datasets contain nearly the same type of values (ints) and both are data frames. How can I solve this?
Edit:
Here is the list of commands that generates a correlation matrix with varxxx's on the diagonal:
erase <- matrix(c(1,5,2,6,8,4,1,5,6),nrow=3)
corrgram(t(erase),order="HC")
output:
Because it is a huge dataset and contains sensitive data, I cannot share the dataset and show the series of operations by which I ended up with the first output above.
Renaming the columns with numbers fixed the issue:
names(datasetB) <- c(1:totalNumberOfColumn)

Creating a function that has all columns of a data frame as input in r

I have two data frames; "clinical" and "expression":
The "clinical" dataframe contains data about various clinical parameters (columns) in patients with breast cancer (rows). The "expression" dataframe contains gene expression levels (columns) in patients with breast cancer (rows). The column names in the "expression" dataframe are the various "gene.ID"s.
Both dataframes have the same patients (rows), and only differ from each other in the columns. However, the rows in each dataframe are not in exactly the same order as in the other dataframe.
I want to test and plot the correlation between the expression level of a certain gene and the clinical parameter of the individuals in the cohort.
In order to do so, I am trying to create a function that (1) will receive these dataframes and the gene.ID of a specific gene, (2) extract the expression pattern of this gene, (3) match the patients from both dataframes, (4) go over all the clinical parameters, (5) and do some computations, each time on another clinical parameter.
My main issue is the "go over all the clinical parameters" part, although I'm pretty sure the rest of my code is not much better.
So far, my code looks something like this:
my_function <- function(clinical_data, expression_data, gene.ID){
  gene.ID <- (expression_data$gene.ID)
  expression.pattern <- as.numeric(expression$gene.ID)
  matched.samples <- match(row.names(clinical), row.names(expression))
  for(i in names(clinical)){
    # *** here will come an if statement ***
  }
}
I also think I have a serious problem with the "gene.ID".
I would like to know what I should change in my function so that it will do the job once I write the if statement.
I hope my question is clear enough.
"Both dataframes have the same patients (rows), and only differ from each other in the columns. However, the rows in each dataframe are not exactly at the same order as the other dataframe."
The function cbind lets you join ("bind") two datasets by columns ("c"). Because each dataset has the same patients, but in a different order, you would need to first sort the rows of one dataset to match the other dataset.
cbind(clinical_data, expression_data[rownames(clinical_data), ])
Now you just have one data.frame that contains everything needed for the rest of the analysis.
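For the "go over all the clinical parameters" part, a minimal sketch follows. It assumes the gene is passed as a character string, that patients are identified by row names, and that a plain Pearson correlation on the numeric clinical columns is the computation wanted. The toy data and the name correlate_gene are made up for illustration.

```r
# Toy data: same patients, different row order (values are made up)
clinical <- data.frame(age        = c(50, 61, 47, 70),
                       tumor_size = c(2.1, 3.4, 1.8, 4.0),
                       row.names  = c("p1", "p2", "p3", "p4"))
expression <- data.frame(geneA = c(8.0, 5.1, 6.2, 4.9),
                         geneB = c(1, 2, 3, 4),
                         row.names = c("p4", "p1", "p2", "p3"))

correlate_gene <- function(clinical_data, expression_data, gene_id) {
  # Index by row name so the expression values follow the clinical row order;
  # gene_id is a string, so [ , gene_id] is used rather than $gene_id
  expr <- expression_data[rownames(clinical_data), gene_id]

  # Only numeric clinical parameters can be correlated directly
  numeric_cols <- names(clinical_data)[sapply(clinical_data, is.numeric)]

  # One correlation per clinical parameter, returned as a named vector
  sapply(numeric_cols, function(col) cor(clinical_data[[col]], expr))
}

result <- correlate_gene(clinical, expression, "geneA")
```

The $ operator looks for a column literally called gene.ID, which is likely the "serious problem" mentioned in the question; indexing with a string variable avoids it.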

how to make groups of variables from a data frame in R?

Dear Friends I would appreciate if someone can help me in some question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to produce a specific statistic based on these groupings and then compare which subset produces a better statistic. My problem is how I can produce these combinations.
Thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time. To cover every subset size, repeat the call for m = 1 through 8.
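To collect every non-empty subset rather than only those of size 3, combn can be called once per subset size. A minimal sketch, assuming the variable names V1 to V8 from the example above:

```r
varnames <- paste0("V", 1:8)

# One combn call per subset size m; simplify = FALSE returns a list of vectors
subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)  # 2^8 - 1 = 255 non-empty subsets
```

Each element of subsets is a character vector of column names, so yourdataframe[, subsets[[i]], drop = FALSE] would give the corresponding reduced data frame.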
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)

nn <- 8L
dt <- setnames(as.data.table(cbind(1:100, matrix(rnorm(100 * nn), ncol = nn))),
               c("id", paste0("V", 1:nn)))
# should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
# basically, this generates the indices to include when subsetting
x <- cbind(rep(c(0, 1), each = 128),
           rep(rep(c(0, 1), each = 64), 2),
           rep(rep(c(0, 1), each = 32), 4),
           rep(rep(c(0, 1), each = 16), 8),
           rep(rep(c(0, 1), each = 8), 16),
           rep(rep(c(0, 1), each = 4), 32),
           rep(rep(c(0, 1), each = 2), 64),
           rep(c(0, 1), 128)) *
  t(matrix(rep(1:nn), nrow = nn, ncol = 2^nn))
# now get the correct column names for each subset
# by subscripting the nonzero elements
incl <- lapply(1:(2^nn), function(y) {paste0("V", 1:nn)[x[y, ][x[y, ] != 0]]})
# now subset the data.table for each subset
ans <- lapply(1:(2^nn), function(y) {dt[, incl[[y]], with = FALSE]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2 <- lapply(1:(2^nn), function(y) {unlist(dt[, incl[[y]], with = FALSE])})
# exclude the first element, which is empty (no columns selected)
means <- lapply(2:(2^nn), function(y) {mean(ans2[[y]])})
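The hand-written index matrix could probably be generalized with expand.grid, which enumerates every TRUE/FALSE pattern for any number of variables. A base R sketch, reusing the names nn and incl from the code above (the name mask is introduced here):

```r
nn <- 8L

# Every inclusion pattern: one row per subset, one logical column per variable
mask <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))

# Drop the all-FALSE row, which corresponds to the empty subset
mask <- mask[rowSums(mask) > 0, , drop = FALSE]

# Column names selected by each pattern
incl <- lapply(seq_len(nrow(mask)), function(i) paste0("V", seq_len(nn))[mask[i, ]])
```

Since the empty subset is removed up front, the later steps no longer need to skip the first element.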

Binning column and getting corresponding values from other column in R

I have two columns of paired values in a data frame. I want to bin the data in one column using the cut2 function from the Hmisc package so that there are at least, say, 25 data points in each bin. However, I also need the corresponding values from the other column. Is there a convenient way to do that in R? I have to bin the column B.
A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348
.........
It's not clear what you mean by wanting the "corresponding values of the other column". The first part is easy to accomplish using the g (# of groups) argument:
dfrm$Agrp <- cut2(dfrm$A, g=trunc(length(dfrm$A)/25) )
You can aggregate means or medians of B within Agrp's using tapply or ave or one of the Hmisc summary functions. There are several worked examples in one of today's questions: How to get Summary statistics by group as well as many other examples of using those functions or aggregate or the pkg:plyr functions.
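As an illustration of the tapply route, here is a sketch on the question's sample rows plus one extra made-up row. To stay free of package dependencies it swaps in base R's cut with quantile breaks in place of cut2, so the bin edges may differ slightly from what cut2 would produce.

```r
dat <- data.frame(
  A = c(-10.834510, 11.012966, -16.491415, -14.485036,
        2.629104, -3.597291, 3.5943),
  B = c(1.680173, 1.866603, 1.868667, 1.900002,
        1.960929, 2.005348, 3.796)
)

# Two roughly equal-sized bins of A, using quantile-based breaks
breaks <- quantile(dat$A, probs = c(0, 0.5, 1))
dat$Agrp <- cut(dat$A, breaks = breaks, include.lowest = TRUE)

# Mean of B within each bin of A
group_means <- tapply(dat$B, dat$Agrp, mean)

# The individual B values that fall into each bin of A
group_values <- split(dat$B, dat$Agrp)
```

tapply returns one summary value per bin, while split keeps every paired B value, which may be closer to the "corresponding values" asked for.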
Given that the number of B values will not necessarily be constant across groups the only way I can think to deliver the individual values by A-grouped-value would be with split. I added an extra row to illustrate that a non-even split might need to return a list rather than a more "rectangular" object :
library(Hmisc)   # provides cut2

dat <- read.table(text="A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348
3.5943 3.796", header=TRUE)
dat$Agrp <- cut2(dat$A, g=trunc(length(dat$A)/3) )
split(dat$B, dat$Agrp)
#-----
$`[-16.49, 2.63)`
[1] 1.680173 1.868667 1.900002 2.005348
$`[ 2.63,11.01]`
[1] 1.866603 1.960929 3.796000
If you want the vector of values on which the splits were done then that can be accomplished by using regex on levels(dat$Agrp).
