I am looking to achieve something along the lines of what is done here, with the intention of creating a drug-target interaction network.
I have downloaded data from here and I would like to reproduce that network.
My data has the following form:
#Drug Gene
DB00357 P05108
DB02721 P00325
DB00773 P23219
DB07138 Q16539
DB08136 P24941
DB01242 P23975
DB01238 P08173
DB00186 P48169
DB00338 P10635
DB01151 P08913
DB01244 P05023
DB01745 P07477
DB01996 P08254
I consulted this previous post as a first step in order to create the similarity matrix. The resulting matrix on the entire data set is large, so I tried recreating the procedure on a smaller data frame, as below.
# packages used
library("qgraph")
library("dplyr")
drugs <- c("DB00357","DB02721","DB00773",
"DB07138","DB08136",
"DB01242","DB01238",
"DB00186","DB00338",
"DB01151","DB01244",
"DB01745","DB01996")
genes <- c("P05108", "P00325","P23219",
"Q16539","P24941",
"P23975","P08173",
"P48169","P10635",
"P08913","P05023",
"P07477","P08254")
# Dataframe with a small subset of observations
df <- data.frame(drugs, genes)
# Consulting the other post: self-join on genes, then tabulate
# drugs against genes to get a drug x gene incidence table
b <- df %>% full_join(df, by = "genes")
tb <- table(b$drugs.x, b$genes)
My next step, I believe, is to create the correlation matrix and the network as per the guide I'm trying to replicate. Here I run into issues; my attempts are documented below:
# Follow guide trying to replicate correlation matrix
cormatrix <- cor_auto(tb)
### Error ###
"Removing factor variables: Var1; Var2
Error in data[, sapply(data, function(x) mean(is.na(x))) != 1] :
incorrect number of dimensions"
So I instead tried cor(), which works. (The cor_auto() error presumably occurs because cor_auto() expects a data frame of variables rather than a table object; something like cor_auto(as.data.frame.matrix(tb)) may avoid it.) However, when I apply cor() to the entire data set it just keeps running and never produces output.
# Second way, using cor() instead to replicate correlation matrix
cormatrix <- cor(tb)
graph1 <- qgraph(cormatrix, verbose = FALSE) # plot the correlation matrix rather than the raw table
I therefore wonder whether anyone has ideas on how to get this to run properly and produce the network as intended.
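One possible workaround (my own sketch, not from the guide): for bipartite drug-target data, a drug-drug similarity matrix can be computed as shared-target counts with tcrossprod() on a sparse incidence matrix from the Matrix package, which scales far better than cor() on a dense table. This assumes the df defined above; note that in the toy subset every drug targets a distinct gene, so the toy network has no edges, but the shared-target structure appears on the full data set.
library(Matrix)
library(qgraph)
# sparse drug x gene incidence matrix (1 = interaction present)
d <- factor(df$drugs)
g <- factor(df$genes)
inc <- sparseMatrix(i = as.integer(d), j = as.integer(g), x = 1,
                    dimnames = list(levels(d), levels(g)))
# drug-drug similarity: number of target genes two drugs share
sim <- as.matrix(tcrossprod(inc))
diag(sim) <- 0  # drop self-loops before plotting
graph1 <- qgraph(sim, layout = "spring")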
I worked a lot with MaxEnt in R recently (the dismo package), but only using cross-validation to validate my model of bird habitats (a single species). Now I want to use a self-created test sample file. I had to pick these points for validation by hand and can't use random test points.
So my R-script looks like this:
library(raster)
library(dismo)
setwd("H:/MaxEnt")
memory.limit(size = 400000)
punkteVG <- read.csv("Validierung_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteTG <- read.csv("Training_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteVG$X <- as.numeric(punkteVG$X)
punkteVG$Y <- as.numeric(punkteVG$Y)
punkteTG$X <- as.numeric(punkteTG$X)
punkteTG$Y <- as.numeric(punkteTG$Y)
##### mask NA ######
mask <- raster("final_merge_8class+le_bb_mask.img")
dataframe_VG <- extract(mask, punkteVG)
dataframe_VG[dataframe_VG == 0] <- NA
dataframe_TG <- extract(mask, punkteTG)
dataframe_TG[dataframe_TG == 0] <- NA
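# assuming the mask is 1 inside the study area: multiplying keeps valid
# coordinates and sets points outside the mask to NA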
punkteVG <- punkteVG*dataframe_VG
punkteTG <- punkteTG*dataframe_TG
#### add the raster dataset ####
habitat_all <- stack("blockstats_stack_8class+le+area_8bit.img")
#### MODEL FITTING #####
library(rJava)
system.file(package = "dismo")
options(java.parameters = "-Xmx1g" )
setwd("H:/MaxEnt/results_8class_LE_AREA")
### backgroundpoints ###
set.seed(0)
backgrVMmax <- randomPoints(habitat_all, 100000, tryf=30)
backgrVM <- randomPoints(habitat_all, 1000, tryf=30)
### Renner (2015) PPM modelfitting Maxent ###
maxentVMmax_Renner<-maxent(habitat_all,punkteTG,backgrVMmax, path=paste('H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner',sep=""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=10",
"replicatetype=subsample",
"randomtestpoints=20",
"randomseed=true",
"testsamplesfile=H:/MaxEnt/Validierung_FL_XY_2016_swd_NA"))
After the "maxent()"-command I ran into multiple errors. First I got an error stating that he needs more than 0 (which is the default) "randomtestpoints". So I added "randomtestpoints = 20" (which hopefully doesn't stop the program from using the file). Then I got:
Error: Test samples need to be in SWD format when background data is in SWD format
Error in file(file, "rt") : cannot open the connection
The thing is, when I ran the script with the default cross-validation like this:
maxentVMmax_Renner<-maxent(habitat_all,punkteTG,backgrVMmax, path=paste('H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner',sep=""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=10"))
...all works fine.
I also tried multiple things to get my CSV validation data into the correct format: two columns (labeled X and Y), three columns (labeled species, X and Y), and other variants. I would rather use the "punkteVG" object (the validation data) I created with read.csv, but it seems MaxEnt wants its own file.
I can't imagine my problem is so uncommon. Someone must have used the "testsamplesfile" argument before.
I found out what the problem was. So here it is, for others to enjoy:
The correct maxent() command for a subsample file looks like this:
maxentVMmax_Renner<-maxent(habitat_all, punkteTG, backgrVMmax, path=paste('H:/MaxEnt',sep=""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=1",
"replicatetype=Subsample",
"testsamplesfile=H:/MaxEnt/swd.csv"))
Of course, there cannot be multiple replicates, since you have only one subsample.
Most importantly, the "swd.csv" subsample file has to include:
the X and Y coordinates
the values at the respective points (e.g. obtained with extract(habitat_all, punkteVG))
a first column consisting of the word "species" with the header "Species" (since MaxEnt uses the default "species" if you don't define one in the occurrence data)
The last point was the issue here. Basically, if you don't define the species column in the subsample file, MaxEnt will not know how to assign the data.
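For completeness, a minimal sketch (my own illustration, not part of the original answer) of how such an SWD file could be assembled in R, assuming the punkteVG and habitat_all objects from the script above:
library(raster)
library(dismo)
# environmental values at the hand-picked validation points
vals <- extract(habitat_all, punkteVG)
# SWD layout: species label first, then coordinates, then covariate values
swd <- data.frame(Species = "species", X = punkteVG$X, Y = punkteVG$Y, vals)
swd <- na.omit(swd)  # drop points with missing raster values
write.csv(swd, "H:/MaxEnt/swd.csv", row.names = FALSE)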
I am experimenting with different regression models. My end goal is to have a nice, easy-to-read data frame with 3 columns:
model_results <- data.frame(name = character(),
rmse = numeric(),
r2 = numeric())
Then, after running each model, add the corresponding output to the data frame and, at the end, review it and make some decisions about which model to use.
I tried this:
mod.spend_transactions.results <- list("mod.spend_transactions",
rsme(residuals(mod.spend_transactions)),
summary(mod.spend_transactions)$r.squared)
I tried using a list because I know vectors can only store one datatype (right?).
Output:
rbind(model_results, mod.spend_transactions.results)
X.mod.spend_transactions. X12.6029444519635 X0.912505643567096
1 mod.spend_transactions 12.60294 0.9125056
Close, but not what I wanted, since the data frame's column names have been changed and I did not expect that.
So I tried vectors, which works but seems "clunky" in that I'm sure I could do this by writing less code:
vect_modname <- vector()
vect_rsme <- vector()
vect_r2 <- vector()
Then after running a model
vect_modname <- c(vect_modname, "mod.spend_transactions")
vect_rsme <- c(vect_rsme, rsme(residuals(mod.spend_transactions)))
vect_r2 <- c(vect_r2, summary(mod.spend_transactions)$r.squared)
Then at the end of running all the models I'm testing out
data.frame(vect_modname, vect_rsme, vect_r2)
Again, the vector method does work. But is there a "better", more elegant way of doing this?
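One tidier pattern (my own sketch, not from the post; it assumes lm-style models, and rsme() is the poster's own RMSE helper): wrap the metric extraction in a function that returns a one-row data frame, collect those rows in a list, and bind them once at the end.
collect_metrics <- function(name, model) {
  data.frame(name = name,
             rmse = rsme(residuals(model)),
             r2   = summary(model)$r.squared)
}
results <- list()
results[["spend_transactions"]] <- collect_metrics("mod.spend_transactions",
                                                   mod.spend_transactions)
# ... one call per additional model ...
model_results <- do.call(rbind, results)
Growing a list and calling rbind once avoids both the column-name mangling seen above and the repeated copying that c(vect, ...) causes.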
I am using the fpp package to forecast multiple time series of different customers at the same time. I am already able to extract the point forecasts of several simple forecast methods (snaive, meanf, etc.) into a CSV document. However, I am still trying to figure out how to extract the measures of the accuracy() command for every time series into a CSV file at the same time.
I constructed an example:
# loading of the "fpp"-package into R
install.packages("fpp")
require("fpp")
# Example customers
customer1 <- c(0,3,1,3,0,5,1,4,8,9,1,0,1,2,6,0)
customer2 <- c(1,3,0,1,7,8,2,0,1,3,6,8,2,5,0,0)
customer3 <- c(1,6,9,9,3,1,5,0,5,2,0,3,2,6,4,2)
customer4 <- c(1,4,8,0,3,5,2,3,0,0,0,0,3,2,4,5)
customer5 <- c(0,0,0,0,4,9,0,1,3,0,0,2,0,0,1,3)
#constructing the timeseries
all <- ts(data.frame(customer1,customer2,customer3,customer4,customer5),
f=12, start=2015)
train <- window(all, start=2015, end=2016-0.01)
test <- window(all, start=2016)
CustomerQuantity <- ncol(train)
# Example of extracting easy forecast method into csv-document
horizon <- 4
fc_snaive <- matrix(NA, nrow=horizon, ncol=CustomerQuantity)
for(i in 1:CustomerQuantity){
fc_snaive[, i] <- snaive(train[, i], h = horizon)$mean
}
write.csv2(fc_snaive, file ="fc_snaive.csv")
The following part is exactly where I need some help: I would like to extract the accuracy measures into a CSV file all at the same time. In my real dataset I have 4000 customers, not only 5! I tried to use loops and lapply(), but unfortunately my code didn't work.
accuracy(fc_snaive[,1], test[,1])
accuracy(fc_snaive[,2], test[,2])
accuracy(fc_snaive[,3], test[,3])
accuracy(fc_snaive[,4], test[,4])
accuracy(fc_snaive[,5], test[,5])
The following uses lapply() to run accuracy() on each column of fc_snaive against the corresponding column of test.
Then, with do.call, we bind the results by row (rbind), so we end up with a matrix that we can, in turn, export using write.csv.
new_matrix <- do.call(what = rbind,
args = lapply(1:ncol(fc_snaive), function(x){
accuracy(fc_snaive[, x], test[, x])
}))
write.csv(x = new_matrix,
file = "a_filename.csv")
I am a beginner in R; for me R is really only a means to analyse my statistical data, so I am far from being a programmer. I need some help with building percentages of my variables from an Excel sheet. I need R.total expressed as a percentage, with R.max as the 100% base. This is what I did:
library(readxl)  # read_excel() comes from the readxl package
DB <- read_excel("WechslerData.xlsx", sheet=1, col_names=TRUE,
col_types=NULL, na="", skip=0)
I wanted to use prop.table,
but this does not work for me. Then I tried to make a data frame:
R.total <- DB$R.total
R.max <- DB$R.max
DB.rus <- data.frame(R.total, R.max)
but prop.table still does not work. Can somebody give me a hint?
Not really sure what you want, but for this mock data:
r.total <- runif(100,min=0, max=.6) # generate random variable
r.max <- runif(100,min=0.7, max=1) # generate random variable
df <- data.frame(r.total, r.max) # create mock data frame
You could try
# new column: r.total as a fraction of r.max (multiply by 100 for a percentage)
df$percentage <- df$r.total / df$r.max
Hope it helps.
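For context (my addition, not part of the original answer): prop.table() expresses each cell of a matrix or table as a share of a margin, which is a different operation from the column-by-column ratio above. For example:
m <- as.matrix(df[, c("r.total", "r.max")])
head(prop.table(m, margin = 1))  # each row rescaled to sum to 1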
New user to R here (like 2 days of use new), and coming from MATLAB, the syntax nuances are driving me a little crazy. If anyone can point me in a direction on this topic, I would really appreciate it. I have a dataset (fl1.back) with 32 variables (columns) and 513 measurements (rows), and I want to create a table of basic stats for 9 of the 32 columns. There's a separate dataset (fl2.back) from which I would also like to pull 1 column of data for the final table.
Here's the code I used to do the above tasks for 1 of the columns of data (sodium measurements) from fl1.back and fl2.back:
fl1.back <- read.delim("web.flat",comment.char="#",colClasses="character")
fl1.back <- fl1.back[-1,]
fl2.back <- read.delim("web.flat2",comment.char="#",colClasses="character")
fl2.back <- fl2.back[-1,]
head(fl1.back)
head(fl2.back)
#for rep criteria for sodium
back.sod.rep <- fl2.back[fl2.back$P00930!="",]
back.sod.rep$P00930 <- as.numeric(back.sod.rep$P00930)
back.sod.rep$P00930
#for samples...sodium
back.sod <- fl1.back[fl1.back$P00930!="",]
back.sod$P00930 <- as.numeric(back.sod$P00930)
back.sod$P00930
head(back.sod)
back.sod.summ <- data.frame("Sodium")
back.sod.summ
colnames(back.sod.summ) <- "Compound"
back.sod.summ$WQ_crit <- "20 mg/L"
back.sod.summ$n <- nrow(back.sod)
back.sod.summ$n_det <- nrow(back.sod[back.sod$R00930!="<",])
back.sod.summ$min <- min(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$max <- max(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$mean <- mean(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$median <- median(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$percent_samp_det <- 100*(back.sod.summ$n_det/back.sod.summ$n)
back.sod.summ$percent_samp_above_crit <- 100*(length(back.sod[back.sod$P00930>20,"P00930"])/back.sod.summ$n)
back.sod.summ$percent_rep_above_crit <- 100*(sum(back.sod.rep$P00930>=20)/nrow(back.sod.rep)) # scaled by 100 like the other percent columns
back.sod$P00930
length(back.sod[back.sod$P00930>20,"P00930"]) # compare with the numeric 20; WQ_crit is the character string "20 mg/L"
back.sod.summ
final <- data.frame(back.sod.summ)
Instead of rewriting or copying and pasting the above code to create the data frame final, I would like to loop over the two datasets, since I'm repeating the same task on different columns of data. I really don't know where to start, and there doesn't seem to be much literature on for loops in R.
Any insight is appreciated!
Here is an example of what I think you want with the iris dataset:
library(plyr)
dlply(iris, .(Species), summary)
This can be extended if you need additional stats. In any case, you should probably use (as shown above) the "split-apply-combine" approach as implemented in various functions and packages.
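A loop-based sketch closer to the original code (my own illustration; only the sodium entry below comes from the question, the lookup structure and helper function are hypothetical): wrap the per-compound summary in a function keyed by the value column and remark column, then lapply() over a small lookup of the nine compounds.
summarize_compound <- function(dat, name, val_col, rem_col, crit) {
  d <- dat[dat[[val_col]] != "", ]           # keep rows with a measurement
  d[[val_col]] <- as.numeric(d[[val_col]])
  det <- d[d[[rem_col]] != "<", val_col]     # detected values (not censored "<")
  data.frame(Compound = name,
             n = nrow(d),
             n_det = length(det),
             min = min(det), max = max(det),
             mean = mean(det), median = median(det),
             percent_samp_det = 100 * length(det) / nrow(d),
             percent_samp_above_crit = 100 * sum(d[[val_col]] > crit) / nrow(d))
}
params <- list(
  list(name = "Sodium", val = "P00930", rem = "R00930", crit = 20)
  # ... entries for the other eight compounds ...
)
final <- do.call(rbind, lapply(params, function(p) {
  summarize_compound(fl1.back, p$name, p$val, p$rem, p$crit)
}))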