R - Problems creating dataframe from sapply results that combine lists

I'm trying to create a dataframe from the results of sapply.
The variable fits is a collection of nls objects, all with the same names for coefficients. I am able to create a dataframe where the columns are the coefficients and the rows are the individual fits using this:
Coefficients <- sapply(fits,function(x){coef(x)},simplify = TRUE,USE.NAMES = TRUE)
Coefficients_df <- data.frame(t(Coefficients))
This works fine and when I do summary(Coefficients_df), I get what I expect, the normal summary of a dataframe with 3 columns.
But what I want is to have a dataframe that includes the coefficients and the average error for each fit. I can get a matrix using this:
Coef_and_Error <- sapply(fits,function(x){c(coef(x),list(error = summary(x)$sigma))},simplify = TRUE,USE.NAMES = TRUE)
But then I do not get a proper dataframe using this:
Coef_and_Error_df <- data.frame(t(Coef_and_Error))
RStudio reports this as a dataframe (4105 obs. of 4 variables), but when I try to see a summary of the dataframe, it prints a long list of "1 -none- numeric" lines, flooding the console.
Also, when I click the triangle next to "Coef_and_Error_df" in RStudio, it does not show the sort of summary I'm used to. Instead of a simple summary with one row per column (e.g. k num 233 189 391 ...), it says "k: list of 4105".
I have been able to get the output I want by creating two separate matrices, converting both to dataframes and adding one as a new column to the other, but that shouldn't be necessary.
I tried using "append" instead of "c" to combine the values, but no luck.
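One likely cause is that c() applied to a numeric vector and a list returns a list, so sapply builds a matrix of list cells rather than a numeric matrix. The sketch below passes the error as a plain named number instead. It assumes two nls fits on the built-in Puromycin data as a stand-in for the real fits object, which is not shown in the question.

```r
# Two nls fits on a built-in dataset, standing in for the real `fits`
# (assumption: the original fits are not available here).
fits <- lapply(split(Puromycin, Puromycin$state), function(d) {
  nls(rate ~ SSmicmen(conc, Vm, K), data = d)
})

# c(numeric, list) yields a list; c(numeric, named number) stays numeric,
# so sapply can simplify the result to an ordinary numeric matrix.
Coef_and_Error <- sapply(fits, function(x) {
  c(coef(x), error = summary(x)$sigma)
})

# One row per fit, one numeric column per coefficient plus the error.
Coef_and_Error_df <- data.frame(t(Coef_and_Error))
summary(Coef_and_Error_df)
```

With numeric columns, summary() should show the usual per-column quantiles instead of the "-none- numeric" listing.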

Related

R: Refer to the chromosome in the evaluation function of rbga.bin

I am trying to use the rbga.bin genetic function in R.
I have a dataframe with 40 observations (rows) and 189 metrics (columns). In the evaluation function, I have to run a Principal Component Analysis on both the original dataset and the "chromosome dataset" (i.e., the dataframe with some of the metrics columns - the ones that have 1s in the chromosome) in order to produce the fitness score.
For example, a possible solution (chromosome) is the following:
(1,1,1,0,0,...,0)
The solution dataset that I would want to run a PCA on, would just have only the first 3 columns of the original dataset.
How can I refer to that "reduced" dataset inside the evaluation function?
It seems that the variable you provide to the evaluation function is the chromosome, i.e. the binary vector. You can get the reduced dataset the following way.
Assume chromosome is the binary vector, original is the starting dataframe and reduced is the resulting dataframe with only the columns that are 1 in the chromosome.
keep <- as.logical(chromosome)              # 1 becomes TRUE, 0 becomes FALSE
reduced <- original[, keep, drop = FALSE]   # keep only the selected columns
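As a rough sketch of how this subsetting could sit inside an evaluation function: the code below uses mtcars as a stand-in for the original dataframe and a made-up fitness score (the negated share of variance captured by the first principal component), since the question does not define the real score. The function name evaluate and the penalty for near-empty chromosomes are also assumptions. The function is called directly here rather than through rbga.bin.

```r
# Stand-in for the 40 x 189 dataframe from the question (assumption).
original <- mtcars

# Hypothetical evaluation function: takes the binary chromosome and
# returns one number. The score itself is illustrative only.
evaluate <- function(chromosome) {
  keep <- as.logical(chromosome)
  # PCA needs at least two columns; penalise smaller selections.
  if (sum(keep) < 2) return(Inf)
  reduced <- original[, keep, drop = FALSE]
  pca <- prcomp(reduced, scale. = TRUE)
  # Share of total variance on the first component, negated so that
  # a smaller value means a "better" chromosome.
  -(pca$sdev[1]^2 / sum(pca$sdev^2))
}

# Chromosome (1,1,1,0,...,0): only the first three columns are used.
chromosome <- c(1, 1, 1, rep(0, ncol(original) - 3))
score <- evaluate(chromosome)
score
```

Because original is looked up in the enclosing environment, the evaluation function only needs the chromosome as its argument.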

Using a variable to point to a vector

I am trying to run comparisons between many different items using a Spearman test (from the package pspearman). I would like to automate switching the variables in, so that rather than editing and running the test one variable at a time, I can run them all at once.
I tried to pass the list of the vectors that I would like to compare it to.
spearman.test(access_sam2$Area,access_sam2$B)
All the columns are in the dataframe access_sam2. In the y position, there is a list of columns that I need to run:
"CD8_PD1, CD8_PDL1, CD8_GBNEG_FOXP3, CD8_GBNEG_FOXP3_CD45RO, CD8_GBNEG_FOXP3NEG_CD45RO, CD8NEG_PD1, CD8NEG_PDL1, CD8NEG_FOXP3, CD8NEG_FOXP3_CD45RO,CD68_PDL1, CK_PDL1."
The problem is that I cannot use column indexes, because the columns are not sequential and the dataframe has 660+ columns.
I could write 7 Spearman tests, but changing all 7 for each Area variable seems inefficient.
First set yvars to be a character vector that names the columns you want, or a numeric vector that gives their column numbers. We have shown the first few elements of it below. Then we define a function which takes a variable name and outputs the Spearman test. Finally, use Map to apply that function to each component of yvars.
yvars <- c("CD8_PD1", "CD8_PDL1", "CD8_GBNEG_FOXP3")
sptest <- function(yvar) spearman.test(access_sam2$Area, access_sam2[[yvar]])
Map(sptest, yvars)
Reproducible example
Below is a reproducible example using the mtcars data frame that comes with R.
library(pspearman)
yvars <- c("cyl", "disp", "hp")
sptest <- function(yvar) spearman.test(mtcars$mpg, mtcars[[yvar]])
Map(sptest, yvars)
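If a compact table is more convenient than a list of test objects, the estimates and p-values can be collected into one data frame. The sketch below swaps in base R's cor.test with method = "spearman" in place of pspearman's spearman.test so that it runs without extra packages; exact = FALSE is set because mtcars contains ties. The column names of the result are arbitrary choices.

```r
yvars <- c("cyl", "disp", "hp")

# One row per y variable: Spearman's rho and the p-value.
# cor.test() from base R is used here instead of spearman.test().
results <- do.call(rbind, lapply(yvars, function(yvar) {
  res <- cor.test(mtcars$mpg, mtcars[[yvar]],
                  method = "spearman", exact = FALSE)
  data.frame(variable = yvar,
             rho = unname(res$estimate),
             p.value = res$p.value,
             stringsAsFactors = FALSE)
}))
results
```

The same pattern should carry over to access_sam2 by replacing mtcars$mpg with access_sam2$Area and yvars with the column names of interest.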

How to fill dataframe rows for progressive files in a for loop in R

I'm trying to analyze some data acquired from experimental tests with several variables being recorded. I've imported a dataframe into R and I want to obtain some statistical information by processing these data.
In particular, I want to fill in an empty dataframe with the same variable names of the imported dataframe but with statistical features like mean, median, mode, max, min and quantiles as rows for each variable.
The input dataframes are something like 60 columns x 250k rows each.
I've already managed to do this using apply as in the following lines of code for a single input file.
df[1,] <- apply(mydata,2,mean,na.rm=T)
df[2,] <- apply(mydata,2,sd,na.rm=T)
...
Now I need to do this in a for loop for a number of input files mydata_1, mydata_2, mydata_3, ... in order to build several summary statistics dataframes, one for each input file.
I tried in several different ways, trying with apply and assign but I can't really manage to access each row of interest in the output dataframes cycling over the several input files.
I would like to do something like the code below (I know that this code does not work; it's just to give an idea of what I want to do).
The output df dataframes are already defined and empty.
for (xx in 1:number_of_mydata_files) {
df_xx[1,]<-apply(mydata_xx,2,mean,na.rm=T)
df_xx[2,]<-apply(mydata_xx,2,sd,na.rm=T)
...
}
I can't remember the exact error message this code gives, but the point is that it does not run at all.
I'm a beginner in R, so I don't have much experience with the language. Is there a way to do this? Are there other functions that could be used instead of apply and assign?
EDIT:
I add here a simple table description that represents the input dataframes I’m using. Sorry for the poor data visualization right here. Basically the input dataframes I’m using are .csv imported files, looking like tables with the first row being the column description, aka the name of the measured variable, and the following rows being the acquired data. I have 250 000 acquisitions for each variable in each file, and I have something like 5-8 files like this being my input.
Current [A] | Force [N] | Elongation [%] | ...
------------|-----------|----------------|----
Value_a_1   | Value_b_1 | Value_c_1      | ...
I just want to obtain a data frame like this as an output, with the same variable names, but instead with statistical values as rows. For example, the first row, instead of being the first values acquired for each variable, would be the mean of the 250k acquisitions for each variable. The second row would be the standard deviation, the third the variance and so on.
I’ve managed to build empty dataframes for the output summary statistics, with just the columns and no rows yet. I just want to fill them and do this iteratively in a for loop.
Not sure what your data looks like but you can do the following where lst represents your list of data frames.
lst <- list(iris[, -5], mtcars, airquality)
lapply(seq_along(lst),
       function(i) sapply(lst[[i]], function(x)
         data.frame(Mean = mean(x, na.rm = TRUE),
                    sd = sd(x, na.rm = TRUE))))
Or as suggested by @G. Grothendieck simply:
lapply(lst, sapply, function(x)
data.frame(Mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
If all your files are in the same directory, set that as the working directory and use list.files() to walk along your input files. If the data frames are already loaded in the workspace, ls() can list their names instead.
If they share the same column names, you can rbind the result into a single data set.
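A minimal sketch of the whole workflow, assuming the inputs are held in a named list: three built-in datasets stand in for mydata_1, mydata_2 and mydata_3, and the helper name summarise_columns and the choice of statistics are illustrative. Returning a named numeric vector per column gives a plain numeric table with one row per statistic, which matches the layout described in the question.

```r
# Stand-ins for the imported files. With real CSV files this could be,
# for example: inputs <- lapply(list.files(pattern = "\\.csv$"), read.csv)
inputs <- list(mydata_1 = iris[, -5],
               mydata_2 = mtcars,
               mydata_3 = airquality)

# Hypothetical helper: one row per statistic, one column per variable.
summarise_columns <- function(d) {
  stats <- sapply(d, function(col) {
    c(mean   = mean(col, na.rm = TRUE),
      sd     = sd(col, na.rm = TRUE),
      median = median(col, na.rm = TRUE),
      min    = min(col, na.rm = TRUE),
      max    = max(col, na.rm = TRUE))
  })
  as.data.frame(stats)
}

# One summary data frame per input, kept together in a list,
# so no numbered df_1, df_2, ... variables or assign() are needed.
summaries <- lapply(inputs, summarise_columns)
summaries$mydata_2["mean", "mpg"]
```

Keeping the results in a list means each summary is reachable by name or position, for example summaries[[1]] or summaries$mydata_3.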

R chisq.test test on data frame

I am attempting to run a chi-square analysis on the data frame below (called "habitat.re"), but I'm having difficulty: I've gotten it to read the data, but it gives the wrong results. When I prompt it with $expected, it returns 18 different columns when there should be 3 (one for each site).
All the tutorials I've been able to find have the data as a table; however, I've not been able to convert it correctly myself.
The chisq.test function is intended to work with two variables, or columns in this case. If you want to compare all three of your columns, then I suspect you would want to compare 1-2, 2-3, and 1-3, e.g.
chisq.test(x=habitat.re$Gidgee, y=habitat.re$`Ian's Place`)
chisq.test(x=habitat.re$`Ian's Place`, y=habitat.re$`Saw Mulga`)
chisq.test(x=habitat.re$Gidgee, y=habitat.re$`Saw Mulga`)
Actually, just typing in the above should reveal much useful information directly to the R console, something like this:
data: habitat.re$Gidgee and habitat.re$`Ian's Place`
X-squared = 5.5569, df = 1, p-value = 0.01841
A sufficiently low p-value might indicate that the two columns are in fact dependent.
Pearson's chi-squared test needs the data frame converted to a matrix (a contingency table) that contains only the variables you need, as numeric values. N.B. my data frame is called "habitat.re".
habitat.df <- data.matrix(habitat.re, rownames.force = NA)  # convert to a numeric matrix
habitat.df <- habitat.df[, -c(1, 2, 3)]                     # delete first 3 columns
rownames(habitat.df) <- habitat.re$COMMON.NAME              # pull names from original
chisq.test(habitat.df)                                      # do chi-square test
chisq.test(habitat.df)$expected                             # return expected values
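To illustrate the shape chisq.test expects, here is a small sketch with made-up counts (the species names and numbers are invented for illustration; only the three site names come from the question). Rows are species, columns are sites, and every cell is a count.

```r
# Invented counts: 3 species (rows) observed at 3 sites (columns).
counts <- matrix(c(12,  5,  9,
                    7, 14,  3,
                    4,  8, 11),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(species = c("sp_a", "sp_b", "sp_c"),
                                 site = c("Gidgee", "Ian's Place", "Saw Mulga")))

result <- chisq.test(counts)
result            # test statistic, degrees of freedom and p-value
result$expected   # expected counts: same shape as the input, one column per site
```

With the data in this form, $expected has one column per site, which is the layout the question was looking for.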
(Images of the two data frames, habitat.re and habitat.df, are not reproduced here.)

correlation of several columns need to be calculated

I'm trying to get the correlation coefficient for corresponding columns of two CSV files. I simply use the following but get errors. Each CSV file has 50 columns.
first_values <- read.csv("")
second_values <- read.csv("")
correlation.csv <- cor(x = first_values, y = second_values, method = "spearman")
But I get the error: 'x' must be numeric
subset of one csv file
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets, have the same number of rows and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values , second_values)
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
There must be a categorical (non-numeric) variable in x. Separate that variable out first, then pass the remaining numeric columns to cor().
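Both suggestions can be combined in one short sketch: drop the non-numeric columns, then correlate matching columns pairwise. The two data frames below are built from mtcars purely as stand-ins for the CSV files, with a text column added to reproduce the error condition. method = "spearman" is passed through MoreArgs, since mapply(cor, ...) on its own would use the default Pearson method.

```r
# Stand-ins for the two CSV files (assumption), each with a text column.
first_values  <- data.frame(id = rownames(mtcars),
                            mtcars[, c("mpg", "disp", "hp")],
                            stringsAsFactors = FALSE)
second_values <- data.frame(id = rownames(mtcars),
                            mtcars[, c("wt", "qsec", "drat")],
                            stringsAsFactors = FALSE)

# Keep only column positions that are numeric in both data frames.
numeric_cols <- sapply(first_values, is.numeric) &
                sapply(second_values, is.numeric)

# Spearman correlation of column i in the first file with column i
# in the second; the result is named after the first file's columns.
correlations <- mapply(cor,
                       first_values[numeric_cols],
                       second_values[numeric_cols],
                       MoreArgs = list(method = "spearman"))
correlations
```

The result is a named numeric vector with one correlation per pair of columns, rather than the full cross-correlation matrix that cor(x, y) would return.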
