I want to make a SummarizedExperiment. I have the count table in this format in FeatureCount.txt:
SRR1554537 SRR1554538 SRR1554541 SRR1554535 SRR1554536 SRR1554539
1/2-SBSRNA4 39 66 72 23 16 7
A1BG 221 113 226 146 36 126
A1BG-AS1 393 296 527 276 39 258
A1CF 8 7 5 1 0 4
A2LD1 97 208 171 181 72 110
I have the phenotype data in this format:
SampleName RUN Age sex tissue disease
SRR1554537 R3452_DLPFC_polyA_RNAseq_total SRR1554537 -0.384 female DLPFC control
SRR1554538 R3462_DLPFC_polyA_RNAseq_total SRR1554538 -0.4027 female DLPFC control
SRR1554541 R3485_DLPFC_polyA_RNAseq_total SRR1554541 -0.3836 male DLPFC control
SRR1554535 R2869_DLPFC_polyA_RNAseq_total SRR1554535 41.58 male DLPFC control
SRR1554536 R3098_DLPFC_polyA_RNAseq_total SRR1554536 44.17 female DLPFC control
SRR1554539 R3467_DLPFC_polyA_RNAseq_total SRR1554539 36.5 female DLPFC control
Here is my code:
count_feature <- as.matrix(read.table("featureCount.txt", header = TRUE, stringsAsFactors = FALSE))
phenoData <- read.csv("Pheno_Data.csv", header = TRUE)
col_data <- DataFrame(phenoData)
row_data <- relist(GRanges(), vector("list", length= nrow(count_feature)))
mcols(row_data) <- rownames(count_feature)
Brain_Es <- SummarizedExperiment( assays = list(feature_Count= feature_Count), rowRanges = row_data, colData = col_data)
Error in rownames<-(*tmp*, value = c("X", "SRR1554537", "SRR1554538", :
invalid rownames length
Can you explain the error?
I don't understand what you're trying to do with row_data, but it's clearly not working. You already have the gene names from the count table, so why not do:
Brain_Es <- SummarizedExperiment(assays = list(counts = count_feature), colData = col_data, rowData = rownames(count_feature))
Have a look at ?SummarizedExperiment, and at the examples in the package vignette's section "Constructing a SummarizedExperiment".
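As a minimal sketch of the constructor (assuming the Bioconductor SummarizedExperiment package is installed; the gene names, sample names, and tissue column below are made up for illustration), note that rowData is usually given as a DataFrame rather than a bare character vector:

```r
library(SummarizedExperiment)

# Hypothetical small count matrix with gene names as rownames
count_feature <- matrix(1:12, nrow = 3,
                        dimnames = list(c("A1BG", "A1CF", "A2LD1"),
                                        paste0("SRR", 1:4)))

# Sample annotations; row names must match the matrix column names
col_data <- DataFrame(tissue = rep("DLPFC", 4),
                      row.names = colnames(count_feature))

se <- SummarizedExperiment(assays  = list(counts = count_feature),
                           rowData = DataFrame(gene = rownames(count_feature)),
                           colData = col_data)
se
```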
I attempted to use linear regression and feature selection to predict a target variable (Y) from a set of predictors (X1 through X8). I started with a full model including all predictors, then used stepwise regression (backward, forward, and both directions) to select the most relevant variables. I compared the models using AIC, BIC, and root mean squared error (RMSE) to choose the best one for my data. Finally, I used the best model to predict Y for a specific set of predictor values and compared it to the actual value to assess accuracy. However, I ran into a problem: the value of Y for the 39th semester is missing, so I couldn't evaluate the prediction.
# Dataset: Classeur2.xlsx
setwd("D:/third year/alaoui")
# load
library(readxl)
data2 <- read_excel("D:/third year/alaoui/tpnote/Classeur2.xlsx")
data <- data2[-c(39),]
View(data)
# Descriptive analysis
summary(data)
str(data)
# Correlation analysis
#install.packages("psych")
library(psych)
# Check that all columns are numeric; convert those that are not
num_cols <- sapply(data, is.numeric)
data[, !num_cols] <- lapply(data[, !num_cols], as.numeric)
#
matrice_correlation <- cor(data[,c("Y","X1","X2","X4","X5...5","X5...6","X6","X7","X8")])
KMO(matrice_correlation)
cortest.bartlett(matrice_correlation, n = nrow(data))
# Principal component analysis
library("FactoMineR")
library("factoextra")
library("corrplot")
p=PCA(data,graph=FALSE)
p
pca=PCA(data,ncp=2)
print(pca)
eig.val <- get_eigenvalue(pca)
eig.val
fviz_eig(pca)
fviz_pca_var(pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
# Linear regression
model <- lm(Y ~ X1 + X2 + X4 + X5...5 + X5...6 + X6 + X7 + X8, data = data)
summary(model)
# Check the linear regression assumptions
# 1. Linearity
par(mfrow = c(2, 2))
plot(model)
# 2. Homoscedasticity
library(car)
ncvTest(model)
# 3. Normality of residuals
library(lmtest)
library(tseries)
residuals <- resid(model)
qqnorm(residuals)
qqline(residuals)
shapiro.test(residuals)
# 4. Independence of residuals
plot(residuals ~ fitted(model))
durbinWatsonTest(model)
# Variable selection
# Fit the full model
full_model <- lm(Y ~ X1 + X2 + X4 + X5...5 + X5...6 + X6 + X7 + X8, data = data)
# Fit the null model (constant only)
null_model <- lm(Y ~ 1, data = data)
# Perform backward stepwise selection
backward_model <- step(full_model, direction = "backward")
# Perform forward stepwise selection
forward_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
# Perform both stepwise selection
both_model <- step(null_model, scope = list(upper = full_model), direction = "both")
# Compare AIC, BIC and RMSE for each model
AIC_full <- AIC(full_model)
AIC_backward <- AIC(backward_model)
AIC_forward <- AIC(forward_model)
AIC_both <- AIC(both_model)
BIC_full <- BIC(full_model)
BIC_backward <- BIC(backward_model)
BIC_forward <- BIC(forward_model)
BIC_both <- BIC(both_model)
RMSE_full <- sqrt(mean((resid(full_model))^2))
RMSE_backward <- sqrt(mean((resid(backward_model))^2))
RMSE_forward <- sqrt(mean((resid(forward_model))^2))
RMSE_both <- sqrt(mean((resid(both_model))^2))
#Print the model selection criteria for each model
cat("Full model:")
cat("\tAIC:", AIC_full, "\tBIC:", BIC_full, "\tRMSE:", RMSE_full, "\n")
cat("Backward model:")
cat("\tAIC:", AIC_backward, "\tBIC:", BIC_backward, "\tRMSE:", RMSE_backward, "\n")
cat("Forward model:")
cat("\tAIC:", AIC_forward, "\tBIC:", BIC_forward, "\tRMSE:", RMSE_forward, "\n")
cat("Both model:")
cat("\tAIC:", AIC_both, "\tBIC:", BIC_both, "\tRMSE:", RMSE_both, "\n")
#Select the model with the lowest AIC, BIC, and RMSE
model_names <- c("Full Model", "Backward Model", "Forward Model", "Both Model")
best_model <- model_names[which.min(c(AIC_full, AIC_backward, AIC_forward, AIC_both))]
print(best_model)
# Predict the value of Y for the 39th semester
predicted_Y <- predict(backward_model, newdata = data.frame(X1 = 500, X2 = 100, X4 = 83, X5...5 = 30, X5...6= 50, X6 = 90, X7 = 300, X8 = 200))
print(predicted_Y)
# To check whether the prediction is correct
#Calculate mean squared error
MSE <- mean((predicted_Y - data$Y[39])^2)
#Calculate root mean squared error
RMSE <- sqrt(MSE)
#Calculate R-squared value
R_squared <- summary(backward_model)$r.squared
#Print the results
print(paste("Predicted value of Y:", predicted_Y))
print(paste("Mean Squared Error:", MSE))
print(paste("Root Mean Squared Error:", RMSE))
print(paste("R-Squared value:", R_squared))
#Compare the predicted value with the actual value
print(paste("Actual value of Y:", data$Y[39]))
print(paste("Difference:", abs(predicted_Y - data$Y[39])))
#Plot the model
par(xpd = TRUE)
plot(backward_model,which=1)
abline(backward_model,col="blue")
#Plot the residuals
plot(backward_model, which=2)
#Normality test on residuals
shapiro.test(residuals(backward_model))
#Homoscedasticity test on residuals
ncvTest(backward_model)
# Independence (Durbin-Watson) test on residuals
dwtest(backward_model)
This is my file.csv:
X4 X5 X5 X6 X7 X8 Y
56 12 50 77 229 98 5540
59 9 17 89 177 225 5439
57 29 89 51 166 263 4290
58 13 107 40 258 321 5502
59 13 143 52 209 407 4872
60 11 61 21 180 247 4708
60 25 -30 40 213 328 4627
60 21 -45 32 201 298 4110
63 8 -28 12 176 218 4123
62 11 76 68 175 410 4842
65 22 144 52 253 93 5741
65 24 113 77 208 307 5094
64 14 128 96 195 107 5383
66 15 10 48 154 305 4888
67 22 -25 27 181 60 4033
67 23 117 73 220 239 4942
66 13 120 62 235 141 5313
68 8 122 25 258 291 5140
69 27 71 74 196 414 5397
71 18 4 63 279 206 5149
69 8 47 29 207 80 5151
70 10 8 91 213 429 4989
73 27 128 74 296 273 5927
73 16 -50 16 245 309 4704
73 32 100 43 276 280 5366
75 20 -40 41 211 315 4630
73 15 68 93 283 212 5712
74 11 88 83 218 118 5095
74 27 27 75 307 345 6124
77 20 59 88 211 141 4787
79 35 142 74 270 83 5036
77 23 126 21 328 398 5288
78 36 30 26 258 124 4647
78 22 18 95 233 118 5316
81 20 42 93 324 161 6180
80 16 -22 50 267 405 4801
81 35 148 83 257 111 5512
82 27 -18 91 267 170 5272
83 30 50 90 300 200 .
I need to create a kinship matrix. For this purpose I wanted to use the kinship2 library in R, but the sex variable is required, which I don't have. The documentation says you can use the value 3 for unknown sex, but it doesn't work: my code below returns a 1x1 matrix instead of the full kinship matrix.
My data:
nr.os nr.oj nr.ma ferma
1 169 152 84 3
2 170 152 84 3
3 171 152 84 3
4 172 152 84 3
5 173 152 84 3
6 174 152 84 3
My code:
library(kinship2)
my_data <- read.table("Zeszyt_s1.csv",header = TRUE, sep = ";")
my_data$sex <- as.integer(3)
df_fixed <- fixParents(id = my_data$nr.os, dadid=my_data$nr.oj, momid=my_data$nr.ma, sex=my_data$sex)
pedAll <- with(df_fixed,pedigree(
id = id,
dadid = dadid,
momid = momid,
sex = sex))
kinship(pedAll["1"])
Output:
1
1 0.5
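Note that pedAll["1"] subsets the pedigree down to a single subject, so kinship() on it can only ever return a 1x1 matrix. A minimal sketch of getting the full matrix (assuming kinship2 is installed; the four-person toy pedigree below is made up, with sex = 3 for everyone as in the question):

```r
library(kinship2)

# Toy pedigree: 169 and 170 are children of 152 (father) and 84 (mother);
# fixParents() adds missing parents and repairs the sex codes of parents
df_fixed <- fixParents(id    = c(169, 170, 152, 84),
                       dadid = c(152, 152, 0, 0),
                       momid = c(84, 84, 0, 0),
                       sex   = rep(3, 4))

pedAll <- with(df_fixed, pedigree(id = id, dadid = dadid,
                                  momid = momid, sex = sex))

# Call kinship() on the whole pedigree object, not on pedAll["1"]
K <- kinship(pedAll)
dim(K)  # full n x n kinship matrix
```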
I have a dataset called PimaDiabetes.
PimaDiabetes <- read.csv("PimaDiabetes.csv")
From which I derived a logistic model:
chosen_glm = glm(PimaDiabetes$Outcome ~ PimaDiabetes$Pregnancies+PimaDiabetes$Glucose
+PimaDiabetes$SkinThickness+PimaDiabetes$BMI
+PimaDiabetes$DiabetesPedigree, data = PimaDiabetes)
However, whenever I try to run it against a new dataset called ToPredict:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigree Age
4 136 70 0 0 20 31.2 22
1 121 78 39 74 20 39 28
3 108 62 24 0 20 26 25
0 181 88 44 510 20 43.3 26
8 154 78 32 0 20 32.4 45
I get the following warning:
> predict(chosen_glm, ToPredict, type = "response")
Warning message:
'newdata' had 5 rows but variables found have 750 rows
And I'm not sure what's wrong.
The column names
colnames(PimaDiabetes)
[1] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" "Insulin" "BMI" "DiabetesPedigree" "Age"
[9] "Outcome"
are the same as in ToPredict:
colnames(ToPredict)
[1] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" "Insulin" "BMI" "DiabetesPedigree" "Age"
Try this:
PimaDiabetes = read.csv("diabetes.csv")
chosen_glm = glm(
  Outcome ~ Pregnancies + Glucose + SkinThickness + BMI + DiabetesPedigreeFunction,
  family = binomial,  # needed for an actual logistic regression
  data = PimaDiabetes
)
ToPredict = PimaDiabetes[sample(nrow(PimaDiabetes),5),]
predict(chosen_glm,ToPredict,type="response")
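The root cause can be reproduced with any data frame: when the formula hard-codes `df$col`, predict() ignores newdata and reuses the original rows (hence "variables found have 750 rows"). A minimal sketch with the built-in mtcars data:

```r
# Model whose formula hard-codes the data frame: predict() ignores newdata
bad  <- lm(mtcars$mpg ~ mtcars$wt)
# Model that references bare column names via `data =`
good <- lm(mpg ~ wt, data = mtcars)

newd <- mtcars[1:5, ]
length(predict(bad,  newdata = newd))  # 32 -- all original rows, with a warning
length(predict(good, newdata = newd))  # 5  -- uses newdata as intended
```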
I want to work with a filtered subset of my dataset.
Example: healthstats.csv
age weight height gender
A 25 150 65 female
B 24 175 78 male
C 26 130 72 male
D 32 200 69 female
E 28 156 66 male
F 40 112 78 female
I would start with
patients = read.csv("healthstats.csv")
but how do I import only the subset of rows where
patients$gender == "female"
when I run
patients = read.csv("healthstats.csv")
If you want to import only a subset of rows without reading the whole file into memory, you can use sqldf, which accepts a query to filter the data.
library(sqldf)
read.csv.sql("healthstats.csv", sql = "select * from file where gender == 'female'")
We can also use read_csv_chunked from readr:
readr::read_csv_chunked('healthstats.csv',
  callback = readr::DataFrameCallback$new(function(x, pos) subset(x, gender == "female")))
I have following dataframe:
olddf <- structure(list(test = structure(1:6, .Label = c("test1", "test2",
"test3", "test4", "test5", "test6"), class = "factor"), month0_gp1 = c("163±28",
"133±20", "177±29", "153±30", "161±31", "159±23"), month0_gp2 = c("122±17",
"167±20", "146±26", "150±27", "148±33", "161±37"), month1_gp1 = c("157±32",
"152±37", "151±24", "143±25", "144±29", "126±30"), month1_gp2 = c("181±14",
"133±34", "152±38", "144±30", "148±20", "137±19"), month3_gp1 = c("139±38",
"161±39", "166±38", "162±39", "151±38", "155±38"), month3_gp2 = c("151±40",
"161±33", "137±25", "161±31", "168±30", "147±34")), .Names = c("test",
"month0_gp1", "month0_gp2", "month1_gp1", "month1_gp2", "month3_gp1",
"month3_gp2"), row.names = c(NA, 6L), class = "data.frame")
test month0_gp1 month0_gp2 month1_gp1 month1_gp2 month3_gp1 month3_gp2
1 test1 163±28 122±17 157±32 181±14 139±38 151±40
2 test2 133±20 167±20 152±37 133±34 161±39 161±33
3 test3 177±29 146±26 151±24 152±38 166±38 137±25
4 test4 153±30 150±27 143±25 144±30 162±39 161±31
5 test5 161±31 148±33 144±29 148±20 151±38 168±30
6 test6 159±23 161±37 126±30 137±19 155±38 147±34
I have to split columns 2:7 into 2 each (one for mean and other for sd):
test month0_gp1_mean month0_gp1_sd month0_gp2_mean month0_gp2_sd month1_gp1_mean month1_gp1_sd ....
I checked earlier posts and used the do.call(rbind, ...) method:
mydf <- data.frame(do.call(rbind, strsplit(olddf$month0_gp1,'±')))
mydf
X1 X2
1 163 28
2 133 20
3 177 29
4 153 30
5 161 31
6 159 23
But this works for one column at a time. How can I modify this to loop for 2:7 columns, and combine them to form one new dataframe? Thanks for your help.
First, get my cSplit function from this GitHub Gist.
Second, split it up:
cSplit(olddf, 2:ncol(olddf), sep = "±")
# test 2_1 2_2 3_1 3_2 4_1 4_2 5_1 5_2 6_1 6_2 7_1 7_2
# 1: test1 163 28 122 17 157 32 181 14 139 38 151 40
# 2: test2 133 20 167 20 152 37 133 34 161 39 161 33
# 3: test3 177 29 146 26 151 24 152 38 166 38 137 25
# 4: test4 153 30 150 27 143 25 144 30 162 39 161 31
# 5: test5 161 31 148 33 144 29 148 20 151 38 168 30
# 6: test6 159 23 161 37 126 30 137 19 155 38 147 34
If you want to do the column renaming in the same step, try (setnames comes from the data.table package):
Nam <- names(olddf)[2:ncol(olddf)]
setnames(
cSplit(olddf, 2:ncol(olddf), sep = "±"),
c("test", paste(rep(Nam, each = 2), c("mean", "sd"), sep = "_")))[]
Another option would be to look at dplyr + tidyr.
Here's the best I could come up with, but I'm not sure if this is the correct way to do this with these tools....
library(dplyr)
library(tidyr)

olddf %>%
  gather(GM, value, -test) %>%         # Makes the data somewhat long
  separate(value, c("MEAN", "SD")) %>% # Splits the "value" column; we're wide again
  gather(MSD, value, -test, -GM) %>%   # Makes the data long again
  unite(var, GM, MSD) %>%              # Combines the GM and MSD columns
  spread(var, value)                   # Goes from long back to wide
This is sort of the equivalent of melting the data once, using colsplit on the resulting "value" column, melting the data again, and using dcast to get the wide format.
Here's a qdap approach:
library(qdap)
for (i in seq(2, 13, by = 2)) {  # step by 2: each split adds a column, shifting the indices
  olddf <- colsplit2df(olddf, i,
    paste0(names(olddf)[i], "_", c("mean", "sd")), sep = "±")
}
olddf[,-1] <- lapply(olddf[,-1], as.numeric)
olddf
I looked at Ananda's splitstackshape package first as I figured there was an easy way to do this but I couldn't figure out a way.
Not sure if you need the last line converting the columns to numeric but assumed you would.
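For completeness, here is a base-R sketch (no extra packages) of the same idea: loop over the mean±sd columns with strsplit and bind the renamed pairs back together. The mini data frame below is a made-up two-column version of the question's data; on the real data you would pass 2:7.

```r
# Hypothetical mini version of the data (two split-able columns)
olddf <- data.frame(test = c("test1", "test2"),
                    month0_gp1 = c("163±28", "133±20"),
                    month0_gp2 = c("122±17", "167±20"),
                    stringsAsFactors = FALSE)

split_cols <- function(df, cols, sep = "±") {
  pieces <- lapply(cols, function(i) {
    # Split "mean±sd" strings into a two-column character matrix
    m <- do.call(rbind, strsplit(as.character(df[[i]]), sep, fixed = TRUE))
    out <- data.frame(as.numeric(m[, 1]), as.numeric(m[, 2]))
    names(out) <- paste(names(df)[i], c("mean", "sd"), sep = "_")
    out
  })
  cbind(df[1], do.call(cbind, pieces))
}

newdf <- split_cols(olddf, 2:3)  # use 2:7 on the full data
names(newdf)
# "test" "month0_gp1_mean" "month0_gp1_sd" "month0_gp2_mean" "month0_gp2_sd"
```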