Creating a data table of regression coefficients in R

I have a model with the following regression coefficient values:
(Intercept) radius perimeter compactness concavepoints
-2.3003926746 0.0743984303 -0.0111031732 -2.5826629017 5.3127565914
radius.stderr smoothness.stderr compactness.stderr concavity.stderr radius.worst
0.4256225882 16.9805981122 -3.8819567231 0.9488969352 0.1408605366
texture.worst area.worst concavity.worst symmetry.worst fractaldimension.worst
0.0105317616 -0.0009867991 0.3504860653 0.8536208289 4.7503948408
I want to make a data table with the variable names in one column, and the corresponding regression coefficient values in the other column.
This is what I have tried so far:
var_names = coef(summary(model_B))[, 0]
coef_vals = coef(summary(model_B))[, 1]
data.table(Variables = c(var_names), RegressionCoefficients = c(coef_vals))
But I get the following output with the 'Variables' column all NA:
Variables RegressionCoefficients
<dbl> <dbl>
NA -2.3003926746
NA 0.0743984303
NA -0.0111031732
NA -2.5826629017
NA 5.3127565914
NA 0.4256225882
NA 16.9805981122
NA -3.8819567231
NA 0.9488969352
NA 0.1408605366

Use names() to access the names of the coefficients.
var_names=names(coef(model_B))
coef_vals=coef(model_B)
data.table(Variables=var_names, RegressionCoefficients=coef_vals)
Variables RegressionCoefficients
1: (Intercept) 2.984208e-16
2: radius 1.000000e+00
3: perimeter 1.000000e+00
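For readers without data.table, the same idea works in base R: a named coefficient vector becomes a two-column table directly. In this sketch, an lm() fit on mtcars is a hypothetical stand-in for the asker's model_B.

```r
# Base-R sketch: turn a named coefficient vector into a two-column table.
# mtcars/lm() stand in for the asker's actual model_B.
model_B <- lm(mpg ~ wt + hp, data = mtcars)
cf <- coef(model_B)
coef_table <- data.frame(Variables = names(cf),
                         RegressionCoefficients = unname(cf),
                         row.names = NULL)
coef_table
```

names() on the coefficient vector gives the variable names directly, which is why indexing the summary matrix with column 0 (an empty selection in 1-indexed R) produced the NA column above.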

Related

ERROR with Aggregate differentially expressed genes across all contrast results using DEVis/DESeq

I'm using DEVis for differential expression analysis. After running DESeq, I get an "object of class “NULL” is not valid" error when I create the aggregated data.
Running BiocManager::valid() returns [1] TRUE, and restarting RStudio didn't solve it either.
#Run DESeq on my previously prepared DESeq2 object.
dds <- DESeq(dds)
#determine the contrasts we are interested in examining by using DESeq2's results() function
res.SAMPLE3.vs.C2 <- results(dds, contrast=c("condition_ppGpp", "untreated_0mM", "treated_0.5mM"))
res.SAMPLE4.vs.C2 <- results(dds, contrast=c("condition_ppGpp", "untreated_0mM", "treated_1mM"))
print(res.SAMPLE3.vs.C2)
log2 fold change (MLE): condition_ppGpp untreated_0mM vs treated_0.5mM
Wald test p-value: condition ppGpp untreated 0mM vs treated 0.5mM
DataFrame with 3996 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
1 352.326 0.2830664 0.391703 0.7226560 0.4698913 0.6771991
2 335.373 0.5624211 0.270855 2.0764617 0.0378513 0.1542207
3 315.891 0.6081361 0.237291 2.5628297 0.0103823 0.0653239
4 326.854 -0.0200133 0.275640 -0.0726069 0.9421189 0.9702167
5 360.061 -0.6693134 0.317713 -2.1066623 0.0351469 0.1471803
... ... ... ... ... ... ...
3992 0.1194548 -3.02243 4.02648 -0.750639 0.45287 NA
3993 0.0481303 0.00000 4.06264 0.000000 1.00000 NA
3994 0.0481303 0.00000 4.06264 0.000000 1.00000 NA
3995 0.0481303 0.00000 4.06264 0.000000 1.00000 NA
3996 0.1218624 0.00000 4.06264 0.000000 1.00000 NA
#Make a list of all of our contrasts.
result_list <- list(res.SAMPLE3.vs.C2, res.SAMPLE4.vs.C2)
print(result_list)
#Aggregate differentially expressed genes across all contrast results.
master_dataframe <- create_master_res(result_list, filename="master_DE_list.txt", method="union", lfc_filter=TRUE)
Error in (function (cl, name, valueClass) :
  assignment of an object of class “NULL” is not valid for ‘allNames’ in an object of class “DESeqResMeta”; is(value, "character") is not TRUE
I have tried to replace the NA data points in my data with zeros using an is.na() <- 0 assignment, but then I get an error:
Error in create_master_res(result_list, filename = "master_DE_list.txt", :
create_master_res() requires list type object containing DESeq result sets.
Changing the data.frame to a list didn't help either.
What am I missing?
Thanks!
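One note on the NA-replacement attempt: is.na(x) <- value is R's syntax for marking positions as NA, not for filling NAs, so is.na() <- 0 cannot work. A minimal sketch of the usual fill pattern, using plain data frames as stand-ins for the DESeqResults objects (whether filling padj with 0 is statistically sensible for create_master_res is a separate question):

```r
# Sketch: fill NA values in a column without changing the object's class.
# Plain data frames stand in for the DESeqResults objects in result_list.
result_list <- list(data.frame(padj = c(0.01, NA, 0.3)),
                    data.frame(padj = c(NA, 0.5, NA)))
result_list <- lapply(result_list, function(res) {
  res$padj[is.na(res$padj)] <- 0  # index with is.na(); is.na(x) <- 0 marks NAs instead
  res
})
```

Because lapply returns a list, this also keeps result_list in the list form that create_master_res() complained about.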

Compute diff of rows with NA values in a data frame using R

I have a data frame (9000 x 304), but it looks like this:
date          a         b
1997-01-01    8.720551  10.61597
1997-01-02    NA        NA
1997-01-03    8.774251  NA
1997-01-04    8.808079  11.09641
I want to calculate row differences such as:
first <- data[i-1,] - data[i-2,]
second <- data[i,] - data[i-1,]
third <- data[i,] - data[i-2,]
I want to ignore the NA values: whenever a value is NA, use the last non-NA value in that column instead.
For example, for the second diff with i = 4 in column b:
11.09641 - 10.61597 is the value of b_diff on 1997-01-04.
This is what I did, but it keeps generating data with NAs:
first <- NULL
for (i in 3:nrow(data)) {
  first <- rbind(first, data[i - 1, ] - data[i - 2, ])
}
second <- NULL
for (i in 3:nrow(data)) {
  second <- rbind(second, data[i, ] - data[i - 1, ])
}
third <- NULL
for (i in 3:nrow(data)) {
  third <- rbind(third, data[i, ] - data[i - 2, ])
}
The aggregate function might offer a way, but I need a solution that works on big data without specifying each column name separately; moreover, my column names are in a foreign language.
Thank you very much! I hope I gave you all the information you need to help me; otherwise, please let me know.
You can use fill to replace NAs with the closest previous value, and then use across and lag to compute the new variables. It is unclear what exactly your expected output is, but you can also replace the default value of lag when it does not exist (e.g. for the first value) using lag(.x, default = ...).
library(dplyr)
library(tidyr)
data %>%
  fill(a, b) %>%
  mutate(across(a:b, ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(a:b, ~ .x - lag(.x), .names = "second_{.col}"),
         across(a:b, ~ .x - lag(.x, n = 2), .names = "third_{.col}"))
date a b first_a first_b second_a second_b third_a third_b
1 1997-01-01 8.720551 10.61597 NA NA NA NA NA NA
2 1997-01-02 8.720551 10.61597 NA NA 0.000000 0.00000 NA NA
3 1997-01-03 8.774251 10.61597 0.0000 0 0.053700 0.00000 0.053700 0.00000
4 1997-01-04 8.808079 11.09641 0.0537 0 0.033828 0.48044 0.087528 0.48044
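If dplyr/tidyr are not an option, the same fill-then-diff idea can be sketched in base R; na_fill below is a hypothetical helper (a minimal last-observation-carried-forward), not a library function.

```r
# Base-R sketch: carry the last non-NA value forward, then difference rows.
na_fill <- function(x) {                # minimal LOCF helper
  idx <- cumsum(!is.na(x))
  idx[idx == 0] <- NA                   # leading NAs stay NA
  x[!is.na(x)][idx]
}
df <- data.frame(a = c(8.720551, NA, 8.774251, 8.808079),
                 b = c(10.61597, NA, NA, 11.09641))
filled <- as.data.frame(lapply(df, na_fill))
second <- rbind(NA, diff(as.matrix(filled)))               # x_t - x_{t-1}
third  <- rbind(NA, NA, diff(as.matrix(filled), lag = 2))  # x_t - x_{t-2}
```

second[4, "b"] reproduces the 11.09641 - 10.61597 example from the question, and diff() on a matrix works column-by-column, so no column names need to be spelled out.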

Correct variable values in a dataframe applying a function using variable-specific values in another dataframe in R

I have a df called 'covs' with sites in rows and, in columns, 9 different environmental variables for each of these sites. I need to recalculate the value of each cell using the function (x - center_values(x)) / scale_values(x). However, 'center_values' and 'scale_values' are different for each environmental covariate, and they are located in another df called 'correction'.
I have found many solutions for applying a function to a whole df, but not for applying variable-specific values according to the id of the value being transformed.
covs <- read.table(text = "X elev builtup river grip pa npp treecov
384879-2009 1 24.379101 25188.572 1241.8348 1431.1082 5.705152e+03 16536.664 60.23175
385822-2009 2 29.533478 32821.770 2748.9053 1361.7772 2.358533e+03 15773.115 62.38455
385823-2009 3 30.097059 28358.244 2525.7627 1073.8772 4.340906e+03 14899.451 46.03269
386765-2009 4 33.877861 40557.891 927.4295 1049.4838 4.580944e+03 15362.518 53.08151
386766-2009 5 38.605156 36182.801 1479.6178 1056.2130 2.517869e+03 13389.958 35.71379",
header= TRUE)
correction <- read.table(text = "var_name center_values scale_values
1 X 196.5 113.304898393671
2 elev 200.217889868483 307.718211316278
3 builtup 31624.4888660664 23553.2438790344
4 river 1390.41023742909 1549.88661649406
5 grip 5972.67361738244 6996.57793554527
6 pa 2731.33431010861 4504.71055521749
7 npp 10205.2997576655 2913.19658598938
8 treecov 47.9080656134352 17.7101565911347
9 nonveg 7.96755640452006 4.56625351682905", header= TRUE)
Could someone help me write code to recalculate the environmental covariate values in 'covs' using the covariate-specific values reported in 'correction'? E.g. for each value in the column 'elev' of the df 'covs', I need to subtract the 'center_values' entry reported for 'elev' in the 'correction' df, and then divide by the 'scale_values' entry for 'elev'. Thank you for your kind help.
You may assign var_name to the row names, then loop over the names of covs to do the calculations in an sapply.
rownames(correction) <- correction$var_name
res <- as.data.frame(sapply(names(covs), function(x)
  (covs[, x] - correction[x, "center_values"]) / correction[x, "scale_values"]))
res
res
# X elev builtup river grip pa npp treecov
# 1 -1.725433 -0.5714280 -0.27324970 -0.09586213 -0.6491124 0.66015733 2.173339 0.6958541
# 2 -1.716607 -0.5546776 0.05083296 0.87651254 -0.6590217 -0.08275811 1.911239 0.8174114
# 3 -1.707781 -0.5528462 -0.13867495 0.73253905 -0.7001703 0.35730857 1.611340 -0.1058927
# 4 -1.698956 -0.5405596 0.37928543 -0.29871910 -0.7036568 0.41059457 1.770295 0.2921174
# 5 -1.690130 -0.5251972 0.19353224 0.05755748 -0.7026950 -0.04738713 1.093183 -0.6885470
Check e.g. "elev":
(covs[,"elev"] - correction["elev", "center_values"]) / correction["elev", "scale_values"]
# [1] -0.5714280 -0.5546776 -0.5528462 -0.5405596 -0.5251972
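An equivalent vectorised route uses match() to align the rows of 'correction' with the columns of 'covs', then sweep() for the subtraction and division. Below is a small toy stand-in for the asker's data (two columns, values copied from the question):

```r
# Toy stand-in for covs/correction (two columns from the question's data):
covs <- data.frame(elev = c(24.379101, 29.533478),
                   treecov = c(60.23175, 62.38455))
correction <- data.frame(var_name = c("treecov", "elev"),
                         center_values = c(47.9080656, 200.2178899),
                         scale_values = c(17.7101566, 307.7182113))
i <- match(names(covs), correction$var_name)  # align correction rows to covs columns
m <- sweep(as.matrix(covs), 2, correction$center_values[i], "-")
res <- sweep(m, 2, correction$scale_values[i], "/")
```

match() makes the lookup robust to 'correction' listing the variables in a different order than the columns of 'covs'; res[1, "elev"] reproduces the -0.5714280 check above.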

R - Find percentiles of all the features for 1 of the observations from a dataset (Boston Housing Dataset)

I'm working on the Boston Housing dataset. I filtered the observations (towns) having the lowest 'medv' and, after transposing, saved them to a new dataframe. I want to insert a column in this new dataframe that contains the percentile ranks, based on the original data, of these filtered observations' feature values.
Here's the R code:
# load the library containing the dataset
library(MASS)
# save the data with custom name
boston = Boston
# suburb with lowest medv
low.medv = data.frame(t(boston[boston$medv == min(boston$medv),]))
low.medv
# The values I want populated in new columns:
# Finding percentile rank for crim
ecdf(boston$crim)(38.3518)
# >>> 0.9881423
ecdf(boston$crim)(67.9208)
# >>> 0.9960474
# percentile rank for lstat
ecdf(boston$lstat)(30.59)
# >>> 0.9782609
ecdf(boston$lstat)(22.98)
# >>> 0.8992095
Desired output:
Is there a way to use the ecdf function with sapply?
I think it would be easier if you don't transpose the data beforehand:
low.medv <- boston[boston$medv == min(boston$medv),]
res <- mapply(function(x, y) ecdf(x)(y), boston, low.medv)
res
# crim zn indus chas nox rm age dis rad
#[1,] 0.9881 0.7352 0.8874 0.9308 0.8577 0.07708 1 0.05731 1
#[2,] 0.9960 0.7352 0.8874 0.9308 0.8577 0.13636 1 0.04150 1
# tax ptratio black lstat medv
#[1,] 0.9901 0.8893 1.0000 0.9783 0.003953
#[2,] 0.9901 0.8893 0.3498 0.8992 0.003953
Now, if you want the result in the 4-column form shown, you can do:
cbind(t(low.medv), t(res))
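As for the sapply part of the question: yes. Since MASS ships with R, the same table can be produced by iterating over the column names with sapply, one ecdf per original column (equivalent to the mapply above):

```r
# sapply over column names: build an ecdf from each original column and
# evaluate it at the filtered observations' values for that column.
library(MASS)
low.medv <- Boston[Boston$medv == min(Boston$medv), ]
res <- sapply(names(Boston), function(nm) ecdf(Boston[[nm]])(low.medv[[nm]]))
res
```

res[1, "crim"] and res[2, "lstat"] reproduce the 0.9881423 and 0.8992095 values computed by hand in the question.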

Calculating parameters in an equation

SMDI_t = p * SMDI_(t-1) + q * SMD_t   (eqn 1)
SMDI_t = SMD_t / 50                   (eqn 2)
I want to apply the above equations to my dataset (SMD). I first need to divide the first row of my dataset by 50 (eqn 2) and call it SMDI, then move to eqn 1, where SMDI_(t-1) is combined with the original SMD. I have two values each of p and q (p_dry/p_wet and q_dry/q_wet); in eqn 1 I want to use p_dry and q_dry if the cell value is positive, otherwise p_wet and q_wet. I wrote the following code, but it gives me an "NA/NaN argument" error. Please help.
3.343327144 0.076583722 -4.316073117 -6.064319011 -1.034313982 1.711678831 2.062381759 5.632386548 6.017760438
4.467709087 1.632745678 -2.045736377 -3.601413064 1.695347213 3.295933998 4.070685302 7.743864617 8.348716373
8.256385028 5.635534811 2.707796712 1.572985845 6.066710978 7.095101029 7.941167874 11.37490758 12.15712496
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
-47.4749727 -62.45954133 -69.42311677 -68.04854477 -69.86363461 -56.6566393 -44.02624374 -34.68257496 -5.528397863
-57.44464723 -74.11667952 -83.07777747 -81.88546602 -84.32488173 -72.37428075 -61.04778523 -51.84892678 -20.81696219
-12.6032741 -26.27089119 -36.55478576 -30.40468773 -36.15889518 -33.71339142 -16.63378788 -4.849972012 -1.667644897
-28.28948158 -38.05693676 -43.2879285 -35.34546364 -40.09848824 -34.40754496 -18.41988896 -9.867125675 -7.493617422
NA NA NA NA NA NA NA NA NA
-35.04117468 -38.74252722 -42.69080876 -43.06064215 -40.85844545 -36.79603495 -37.92408262 -34.51428202 -32.54118632
-29.35688054 -33.7004665 -37.88555224 -39.06340145 -37.19884049 -29.8488303 -32.48244008 -28.52426895 -28.39245064
-1.422800439 -6.972537109 -11.86824507 -13.14543917 -9.893061342 1.11258721 -0.415834635 2.424939039 2.65615071
Code:
data <- read.table('SMD.csv', header=TRUE, sep=',')
SMD <- data.matrix(data)
p_dry <- 0.1542
q_dry <- 0.0338
p_wet <- 0.1660
q_wet <- 0.0333
SMDI <- matrix(0, nrow=nrow(SMD), ncol=ncol(SMD))
for (i in 2:nrow(SMD)) {
  for (j in 1:ncol) {
    if (is.na(SMD[i,j])) {
      SMD[i,j] <- NaN
      SMDI[1,j] <- SMD[1,j]/50
      if (SMD[i,j] < 0)
        SMDI[i,j] <- p_dry[j]*SMDI[i-1,j] + SMD[i,j]*q_dry[j]
      else
        SMDI[i,j] <- p_wet[j]*SMDI[i-1,j] + SMD[i,j]*q_wet[j]
    }
  }
}
write.table(SMDI, file='SMDI.csv')
You can vectorise the choice of p and q with ifelse(). Note, however, that eqn 1 is recursive: each row depends on the previous row's SMDI, so one loop over rows remains (the inner loop over columns is not needed).
SMDI <- SMD/50                       # eqn 2: initial values
p <- ifelse(SMD > 0, p_dry, p_wet)   # dry parameters where the cell is positive
q <- ifelse(SMD > 0, q_dry, q_wet)   # wet parameters otherwise
for (i in 2:nrow(SMD)) {
  SMDI[i, ] <- p[i, ]*SMDI[i-1, ] + q[i, ]*SMD[i, ]   # eqn 1
}
Edit: replaced SMD[, 1] with SMD to calculate values for the whole matrix.
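A minimal self-contained run of the recurrence on a toy matrix (values taken from the sample data above) shows the dry/wet switching; this is a sketch of eqn 1 and eqn 2 as stated, not of the full SMD.csv workflow:

```r
# Toy 3x2 matrix: column 1 positive (dry parameters), column 2 negative (wet).
p_dry <- 0.1542; q_dry <- 0.0338
p_wet <- 0.1660; q_wet <- 0.0333
SMD <- matrix(c( 3.343327, -47.474973,
                 4.467709, -57.444647,
                 8.256385, -12.603274), nrow = 3, byrow = TRUE)
SMDI <- SMD / 50                      # eqn 2: initial row
p <- ifelse(SMD > 0, p_dry, p_wet)    # parameter choice per cell
q <- ifelse(SMD > 0, q_dry, q_wet)
for (i in 2:nrow(SMD)) {
  SMDI[i, ] <- p[i, ] * SMDI[i - 1, ] + q[i, ] * SMD[i, ]  # eqn 1
}
SMDI
```

Each SMDI row combines the previous row's SMDI with the current row's SMD, which is why the computation cannot be fully vectorised over rows.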
