Using a test sample file with MaxEnt in R

I have worked a lot with MaxEnt in R recently (dismo package), but so far I have only used cross-validation to validate my model of bird habitats (a single species). Now I want to use a self-created test sample file. I had to pick these validation points by hand and cannot use random test points.
So my R-script looks like this:
library(raster)
library(dismo)
setwd("H:/MaxEnt")
memory.limit(size = 400000)
punkteVG <- read.csv("Validierung_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteTG <- read.csv("Training_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteVG$X <- as.numeric(punkteVG$X)
punkteVG$Y <- as.numeric(punkteVG$Y)
punkteTG$X <- as.numeric(punkteTG$X)
punkteTG$Y <- as.numeric(punkteTG$Y)
##### mask NA ######
mask <- raster("final_merge_8class+le_bb_mask.img")
dataframe_VG <- extract(mask, punkteVG)
dataframe_VG[dataframe_VG == 0] <- NA
dataframe_TG <- extract(mask, punkteTG)
dataframe_TG[dataframe_TG == 0] <- NA
punkteVG <- punkteVG*dataframe_VG
punkteTG <- punkteTG*dataframe_TG
#### add the raster dataset ####
habitat_all <- stack("blockstats_stack_8class+le+area_8bit.img")
#### MODEL FITTING #####
library(rJava)
system.file(package = "dismo")
options(java.parameters = "-Xmx1g" )
setwd("H:/MaxEnt/results_8class_LE_AREA")
### backgroundpoints ###
set.seed(0)
backgrVMmax <- randomPoints(habitat_all, 100000, tryf=30)
backgrVM <- randomPoints(habitat_all, 1000, tryf=30)
### Renner (2015) PPM modelfitting Maxent ###
maxentVMmax_Renner<-maxent(habitat_all,punkteTG,backgrVMmax, path=paste('H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner',sep=""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=10",
"replicatetype=subsample",
"randomtestpoints=20",
"randomseed=true",
"testsamplesfile=H:/MaxEnt/Validierung_FL_XY_2016_swd_NA"))
After the maxent() call I ran into multiple errors. First I got an error stating that it needs more than 0 (the default) "randomtestpoints". So I added "randomtestpoints=20" (which hopefully doesn't stop the program from using the file). Then I got:
Error: Test samples need to be in SWD format when background data is in SWD format
Error in file(file, "rt") : cannot open the connection
The thing is, when I ran the script with the default crossvalidation like this:
maxentVMmax_Renner<-maxent(habitat_all,punkteTG,backgrVMmax, path=paste('H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner',sep=""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=10"))
...all works fine.
I also tried several things to get my CSV validation data into the correct format: two columns (labeled X and Y), three columns (labeled species, X and Y), and other variants. I would rather use the "punkteVG" data frame (the validation data) that I created with read.csv, but it seems MaxEnt wants its own file.
I can't imagine my problem is so uncommon. Someone must have used the argument "testsamplesfile" before.

I found out what the problem was, so here it is for others:
The correct maxent command for a subsample file looks like this:
maxentVMmax_Renner<-maxent(habitat_all, punkteTG, backgrVMmax, path=paste('H:/MaxEnt',sep=""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=1",
"replicatetype=Subsample",
"testsamplesfile=H:/MaxEnt/swd.csv"))
Of course, there cannot be multiple replicates, since there is only one subsample.
Most importantly, the "swd.csv" subsample file has to include:
the X and Y coordinates
the values at the respective points (e.g. from "extract(habitat_all, punkteVG)")
a first column with the header "Species" in which every row contains the word "species" (MaxEnt uses the default name "species" if you don't define one in the occurrence data)
So the last point was the issue here. If you don't define the species column in the subsample file, MaxEnt will not know how to assign the data.
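The file layout described above can be sketched as follows. This is a minimal illustration, not the exact file used here: the coordinates and the environmental columns (layer1, layer2) are made up, and in the real workflow the values would come from extract() on habitat_all at the validation points, with column names presumably matching the layer names of the raster stack.

```r
# Hypothetical validation points (in practice: the X/Y columns of punkteVG)
pts <- data.frame(X = c(10.1, 10.2, 10.3), Y = c(50.1, 50.2, 50.3))

# Stand-in for the values extracted from the raster stack: one column per layer
env <- data.frame(layer1 = c(3, 5, 2), layer2 = c(120, 98, 143))

# SWD layout: species name first, then the coordinates, then the layer values
swd <- data.frame(Species = "species", pts, env)

# Written to a temporary directory here; the real file would go next to the model inputs
swd_file <- file.path(tempdir(), "swd.csv")
write.csv(swd, swd_file, row.names = FALSE, quote = FALSE)

readLines(swd_file)
```

The resulting path is what would be passed to MaxEnt through the "testsamplesfile=" argument.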


Creating a drug-target interaction network in R

I am looking to achieve something along the lines of what is done here, with the intention of creating a drug-target interaction network.
I have downloaded data from here and I would like to reproduce that network.
My data has the below form:
#Drug Gene
DB00357 P05108
DB02721 P00325
DB00773 P23219
DB07138 Q16539
DB08136 P24941
DB01242 P23975
DB01238 P08173
DB00186 P48169
DB00338 P10635
DB01151 P08913
DB01244 P05023
DB01745 P07477
DB01996 P08254
I consulted this previous post as a first step in order to create the similarity matrix. The resulting matrix for the entire data set is large, so I tried recreating the procedure on a smaller data frame, as shown below.
# packages used
library("qgraph")
library("dplyr")
drugs <- c("DB00357","DB02721","DB00773",
"DB07138","DB08136",
"DB01242","DB01238",
"DB00186","DB00338",
"DB01151","DB01244",
"DB01745","DB01996")
genes <- c("P05108", "P00325","P23219",
"Q16539","P24941",
"P23975","P08173",
"P48169","P10635",
"P08913","P05023",
"P07477","P08254")
# Dataframe with a small subset of observations
df <- data.frame(drugs, genes)
# Consulting the other post
b <- df %>% full_join(df, by = "genes")
tb <- table(b$drugs.x, b$genes)
My next step, I believe, is to create the correlation matrix and the network as in the guide I'm trying to replicate. This is where I run into issues; my attempts are documented below:
# Follow guide trying to replicate correlation matrix
cormatrix <- cor_auto(tb)
### Error ###
"Removing factor variables: Var1; Var2
Error in data[, sapply(data, function(x) mean(is.na(x))) != 1] :
incorrect number of dimensions"
So I tried cor() instead, and this works. However, when I apply it to the entire data frame it keeps running and never produces any output.
# Second way, using cor() instead to replicate correlation matrix
cormatrix <- cor(tb)
graph1 <- qgraph(tb, verbose = FALSE)
Therefore I wonder whether anyone has ideas for getting it to run properly and produce the network as intended.

Why can I not see my variants with chromoqc?

I have a vcf file with 20 variants in chromosome 1 that I would like to visualise using vcfR.
What I am doing is the following:
library(vcfR)
#Read in my mouse genome and filter and rename chromosome 1
ref_genome <- ape::read.FASTA("mouse_genome/Mus_musculus.GRCm38.dna.primary_assembly.fa", type = "DNA")
ref_genome_chr1 <- ref_genome[ grep("GRCm38:1:", names(ref_genome))]
names(ref_genome_chr1) <- "1"
ref_genome_chr1 <- as.matrix(ref_genome_chr1)
#Read in my vcf file and also a mouse gff annotation file
vis_test_vcf <- read.vcfR("test_data/filter_chr1_test.recode.vcf", verbose = TRUE)
mouse_gff <- read.table("mouse_genome/Mus_musculus.GRCm38.102.gff3", sep="\t", quote="")
#Generate chromR object
chrom_test <- create.chromR(name="chr1_test", vcf=vis_test_vcf, seq=ref_genome_chr1, ann=mouse_gff, verbose=TRUE)
#Now try and plot this
chromoqc(chrom_test)
When I inspect the various objects with head() etc. they look OK, and I don't get any warnings about chromosome names not matching or anything. However, the plot is missing the "Variants per site" track, which is all I care about. It's also not showing DP and MQ, but I'm not so worried about that at this stage.
Has anyone had a similar issue? I would be grateful for any pointers!
Kind regards
Cora
Ok so I just found the answer!
I needed to use
proc.chromR()
to process my chromR object; now it plots the variants.
I still need to work out the DP and MQ tracks.
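For anyone following along, the order of the calls can be sketched with the example data that ships with vcfR, which is an assumption here and stands in for the mouse files above. data(vcfR_example) loads the objects vcf, dna and gff; the name "example_chrom" is made up.

```r
library(vcfR)

# Example objects bundled with vcfR: vcf, dna and gff
data(vcfR_example)

# Build the chromR object, as in the question
chrom <- create.chromR(name = "example_chrom", vcf = vcf, seq = dna, ann = gff, verbose = FALSE)

# Process it before plotting; this fills in the per-window summaries
chrom <- proc.chromR(chrom, verbose = FALSE)

# Plot the processed object
chromoqc(chrom)
```

Note that proc.chromR() returns the processed object, so the result has to be assigned back before calling chromoqc().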

Reading ZIP file of machine-written data won't "plot" in RStudio

Summary: Despite a complicated lead-up, the solution was very simple: In order to plot a row of a dataframe as a line instead of a lattice, I needed to transpose the data in order to invert from x obs of y variables to y obs of x variables.
I am using RStudio on a Windows 10 computer.
I am using scientific equipment to write measurements to a csv file. Then I ZIP several files and read them into R using read.csv. However, the data frame behaves strangely. The commands "length" and "dim" disagree and the "plot" function throws errors. Because I can create simulated data that doesn't throw the errors, I think the problem is either in how the machine wrote the data or in my loading and processing of the data.
Two ZIP files are located in my stackoverflow repository (with "Monterey Jack" in the name):
https://github.com/baprisbrey/stackoverflow
Here is my code for reading and processing them:
library(stringr) # str_remove()
library(magrittr) # %>%
# Unzip the folders
unZIP <- function(folder){
orig.directory <- getwd()
setwd(folder)
zipped.folders <- list.files(pattern = ".*zip")
for (i in zipped.folders){
unzip(i)}
setwd(orig.directory)
}
folder <- "C:/Users/user/Documents/StackOverflow"
unZIP(folder)
# Load the data into a list of lists
pullData <- function(folder){
orig.directory <- getwd()
setwd(folder)
#zipped.folders <- list.files(pattern = ".*zip")
#unzipped.folders <- list.files(folder)[!(list.files(folder) %in% zipped.folders)]
unzipped.folders <- list.dirs(folder)[-1] # Removing itself as the first directory.
oData <- vector(mode = "list", length = length(unzipped.folders))
names(oData) <- str_remove(unzipped.folders, paste(folder,"/",sep=""))
for (i in unzipped.folders) {
filenames <- list.files(i, pattern = "*.csv")
#setwd(paste(folder, i, sep="/"))
setwd(i)
files <- lapply(filenames, read.csv, skip = 5, header = TRUE, fileEncoding = "UTF-16LE") #Note unusual encoding
oData[[str_remove(i, paste(folder,"/",sep=""))]] <- vector(mode="list", length = length(files))
oData[[str_remove(i, paste(folder,"/",sep=""))]] <- files
}
setwd(orig.directory)
return(oData)
}
theData <- pullData(folder) #Load the data into a list of lists
# Process the data into frames
bigFrame <- function(bigList) {
#where bigList is theData is the result of pullData
#initialize the holding list of frames per set
preList <- vector(mode="list", length = length(bigList))
names(preList) <- names(bigList)
# process the data
for (i in 1:length(bigList)){
step1 <- lapply(bigList[[i]], t) # transpose each data
step2 <- do.call(rbind, step1) # roll it up into its own matrix #original error that wasn't reproduced: It showed length(step2) = 24048 when i = 1 and dim(step2) = 48 501. Any comments on why?
firstRow <- step2[1,] #holding onto the first row to become the names
step3 <- as.data.frame(step2) # turn it into a frame
step4 <- step3[grepl("µA", rownames(step3)),] # Get rid of all those excess name rows
rownames(step4) <- 1:(nrow(step4)) # change the row names to rowID's
colnames(step4) <- firstRow # change the column names to the first row steps
step4$ID <- rep(names(bigList[i]),nrow(step4)) # Add an I.D. column
step4$Class[grepl("pos",tolower(step4$ID))] <- "Yes" # Add "Yes" class
step4$Class[grepl("neg",tolower(step4$ID))] <- "No" # Add "No" class
preList[[i]] <- step4
}
# bigFrame <- do.call(rbind, preList) #Failed due to different number of measurements (rows that become columns) across all the data sets
# return(bigFrame)
return(preList) # Works!
}
frameList <- bigFrame(theData)
monterey <- rbind(frameList[[1]],frameList[[2]])
# Odd behaviors
dim(monterey) #48 503
length(monterey) #503 #This is not reproducing my original error of length = 24048
rowOne <- monterey[1,1:(ncol(monterey)-2)]
plot(rowOne) #Error in plot.new() : figure margins too large
#describe the data
quantile(rowOne, seq(0, 1, length.out = 11) )
quantile(rowOne, seq(0, 1, length.out = 11) ) %>% plot #produces undesired lattice plot
# simulate the data
doppelganger <- sample(1:20461,501,replace = TRUE)
names(doppelganger) <- names(rowOne)
# describe the data
plot(doppelganger) #Successful scatterplot. (With my non-random data, I want a line where the numbers in colnames are along the x-axis)
quantile(doppelganger, seq(0, 1, length.out = 11) ) #the random distribution is mildly different
quantile(doppelganger, seq(0, 1, length.out = 11) ) %>% plot # a simple line of dots as desired
# investigating structure
str(rowOne) # results in a dataframe of 1 observation of 501 variables. This is a correct interpretation.
str(as.data.frame(doppelganger)) # results in 501 observations of 1 variable. This is not a correct interpretation but creates the plot that I want.
How do I convert the rowOne to plot like doppelganger?
It looks like one of my errors is not reproducing, where calls to "dim" and "length" apparently disagree.
However, I'm confused as to why the "plot" function is producing a lattice plot on my processed data and a line of dots on my simulated data.
What I would like is to plot each row of data as a line. (Next, and out of the scope of this question, is I would like to classify the data with adaboost. My concern is that if "plot" behaves strangely then the classifier won't work.)
Any tips or suggestions or explanations or advice would be greatly appreciated.
Edit: Investigating the structure with ("str") of the two examples explains the difference between plots. I guess my modified question is, how do I switch between the two structures to enable plotting a line (like doppelganger) instead of a lattice (like rowOne)?
I am answering my own question.
I am leaving behind the part about the discrepancy between "length" and "dim" since I can't provide a reproducible example. However, I'm happy to leave up for comment.
The answer is that in order to produce my plot, I simply have to transpose the row as follows:
rowOne %>% t() %>% as.data.frame() %>% plot
This inverts the structure from one observation of 501 variables to 501 obs of one variable as follows:
rowOne %>% t() %>% as.data.frame() %>% str()
#'data.frame': 501 obs. of 1 variable:
# $ 1: num 8712 8712 8712 8712 8712 ...
Because of the unusual encoding I used, and the strange "length" result, I failed to see a simple solution to my "plot" problem.
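A small self-contained sketch of both points, using made-up numbers rather than the instrument data. It shows the row being flattened so that it plots as a line with the column names on the x-axis, and it also shows why "length" and "dim" can appear to disagree: length() counts every element of a matrix but only the columns of a data frame, and 48 × 501 = 24048.

```r
# A one-row data frame: 1 observation of 5 variables, like rowOne (values are invented)
row_one <- as.data.frame(matrix(c(8712, 8712, 8650, 8590, 8400), nrow = 1,
                                dimnames = list(NULL, as.character(1:5))))

# plot() on a data frame with many columns draws a scatterplot matrix;
# flattening the row into a numeric vector gives an ordinary series instead
y <- unlist(row_one)
x <- as.numeric(names(row_one))
plot(x, y, type = "l")

# length() counts elements for a matrix but columns for a data frame
m <- matrix(0, nrow = 48, ncol = 501)
length(m)                 # 24048
length(as.data.frame(m))  # 501
```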

Creating a comparative object in R from two dataframes for comparative phylogenetics

I'm trying to read two dataframes into a comparative object so I can plot them using pgls.
I'm not sure what the error being returned means, and how to go about getting rid of it.
My code:
library(ape)
library(geiger)
library(caper)
taxatree <- read.nexus("taxonomyforzeldospecies.nex")
LWEVIYRcombodata <- read.csv("LWEVIYR.csv")
LWEVIYRcombodataPGLS <-data.frame(LWEVIYRcombodata$Sum.of.percentage,OGT=LWEVIYRcombodata$OGT, Species=LWEVIYRcombodata$Species)
comp.dat <- comparative.data(taxatree, LWEVIYRcombodataPGLS, "Species")
Returns error:
> comp.dat <- comparative.data(taxatree, LWEVIYRcombodataPGLS, 'Species')
Error in if (tabulate(phy$edge[, 1])[ntips + 1] > 2) FALSE else TRUE :
missing value where TRUE/FALSE needed
This might come from your data set and your phylogeny having some discrepancies that comparative.data struggles to handle (by the look of the error message).
You can try cleaning both the data set and the tree using dispRity::clean.data:
library(dispRity)
## Reading the data
taxatree <- read.nexus("taxonomyforzeldospecies.nex")
LWEVIYRcombodata <- read.csv("LWEVIYR.csv")
LWEVIYRcombodataPGLS <- data.frame(LWEVIYRcombodata$Sum.of.percentage,OGT=LWEVIYRcombodata$OGT, Species=LWEVIYRcombodata$Species)
## Cleaning the data
cleaned_data <- clean.data(LWEVIYRcombodataPGLS, taxatree)
## Preparing the comparative data object
comp.dat <- comparative.data(cleaned_data$tree, cleaned_data$data, "Species")
However, as @MrFlick suggests, it's hard to know whether that solves the problem without a reproducible example.
The error here was that I was using a nexus file. Although ?comparative.data does not specify which phylo objects it accepts, Newick trees seem to work fine, whereas nexus files do not.
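Both suggestions can be tried out on a toy example. The sketch below uses only ape; the tree, the species names and the OGT values are invented for illustration. It first lists names that appear in only one of the two inputs, then re-saves a tree that was read from a nexus file as Newick and reads it back.

```r
library(ape)

# Toy tree and data (hypothetical species names and values)
tree <- read.tree(text = "((A,B),(C,D));")
dat <- data.frame(Species = c("A", "B", "C", "E"), OGT = c(30, 37, 45, 60))

# Names present on one side only are a common source of trouble
setdiff(tree$tip.label, dat$Species)  # tips without data
setdiff(dat$Species, tree$tip.label)  # data rows without a tip

# Re-save a tree read from nexus as Newick and read it back
nex_file <- tempfile(fileext = ".nex")
nwk_file <- tempfile(fileext = ".nwk")
write.nexus(tree, file = nex_file)
from_nexus <- read.nexus(nex_file)
write.tree(from_nexus, file = nwk_file)
from_newick <- read.tree(nwk_file)
class(from_newick)
```

The object read back from the Newick file could then be passed to comparative.data() in place of the tree read directly from nexus.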

R code automation

I am doing PCA. Here is the code:
### Read .csv file #####
data<-read.csv(file.choose(),header=T,sep=",")
names(data)
data$qcountry
#### for the country-ARGENTINA#######
ar_data<-data[which(data$qcountry=="ar"),]
ar_data$qcountry<-NULL
names(ar_data)
names(ar_data)<-c("01_insufficient_efficacy","02_safety_issues","03_inconvenient_dosage_regimen","04_price_issues"
,"05_not_reimbursed","06_not_inculed_govt","07_insuficient_clinicaldata","08_previously_used","09_prescription_opted_for_some_patients","10_scientific_info_NA","12_involved_in_diff_clinical_trial"
,"13_patient_inappropriate_for_TT","14_patient_inappropriate_Erb","16_patient_over_65","17_Erbitux_alternative","95_Others")
names(ar_data)
ar_data_wdt_zero_columns<-ar_data[, colSums(ar_data != 0) > 0]
####Testing multicollinearity####
vif(ar_data_wdt_zero_columns)
#### Testing appropriatness of PCA ####
KMO(ar_data_wdt_zero_columns)
cortest.bartlett(ar_data_wdt_zero_columns)
#### Run PCA ####
pca<-prcomp(ar_data_wdt_zero_columns,center=F,scale=F)
summary(pca)
#### Compute the loadings for deciding the top4 most correlated variables###
load<-pca$rotation
write.csv(load,"loadings_argentina_2015_Q4.csv")
I have shown the code for one country here, but I have to run it for 9 countries. I am sure there must be an easier way to automate this. Please suggest one!
Thanks!!
Yes, this is doable for every country. You can write a custom function that takes appropriate parameters, e.g. the country name and the data. You do the magic inside and return an appropriate object (or not). Apply this function to processed data that you import and tidy up only once. The code below is not tested but should get you started.
A few comments.
Don't use file.choose() as it breaks your code 3 days down the line. How do you know what file to choose? Why click every time you run the script when you can make the script work for you? Be lazy in that sense.
You have a lot of clutter in your script. Adhere to some style and don't leave in random lines you try out for "shits and giggles". Use spaces in your code, at least.
Be more imaginative in choosing object names. Check first whether the name already exists as a base function, e.g. load.
library(psych) # KMO() and cortest.bartlett(); vif() comes from whichever package the question used
myPCA <- function(my.country, my.data) {
  # keep only the rows for the requested country, then drop the country column
  country_data <- my.data[my.data$qcountry %in% my.country, ]
  country_data$qcountry <- NULL
  # drop columns that are zero in every row
  nonzero_data <- country_data[, colSums(country_data != 0) > 0]
  #### Run PCA ####
  pca <- prcomp(nonzero_data, center = FALSE, scale. = FALSE)
  #### Compute the loadings for deciding the top4 most correlated variables###
  write.csv(pca$rotation, paste("loadings_", my.country, ".csv", sep = "")) # may need tweaking
  return(list(pca = pca, vif = vif(nonzero_data),
              kmo = KMO(nonzero_data), correlation = cortest.bartlett(nonzero_data)))
}
data <- read.csv("relative_link_to_file", header = TRUE, sep = ",")
# rename every column except qcountry (assumes the 16 remaining columns are in this order)
names(data)[names(data) != "qcountry"] <- c("01_insufficient_efficacy","02_safety_issues","03_inconvenient_dosage_regimen","04_price_issues"
,"05_not_reimbursed","06_not_inculed_govt","07_insuficient_clinicaldata","08_previously_used","09_prescription_opted_for_some_patients","10_scientific_info_NA","12_involved_in_diff_clinical_trial"
,"13_patient_inappropriate_for_TT","14_patient_inappropriate_Erb","16_patient_over_65","17_Erbitux_alternative","95_Others")
# call the function once per country and keep the results in a named list
countries <- unique(as.character(data$qcountry))
results <- lapply(countries, myPCA, my.data = data)
names(results) <- countries
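The same idea can be tried on synthetic data using base R only. This is a sketch under stated assumptions: the data frame toy, its columns q1 to q3 and the helper run_pca are invented for illustration, and the VIF, KMO and Bartlett steps are left out.

```r
set.seed(1)

# Synthetic stand-in for the survey data: a country column plus three numeric columns
toy <- data.frame(qcountry = rep(c("ar", "br", "cl"), each = 20),
                  q1 = rnorm(60), q2 = rnorm(60), q3 = rnorm(60))

# One PCA per country; the grouping column is dropped before prcomp()
run_pca <- function(country_rows) {
  prcomp(country_rows[, names(country_rows) != "qcountry"],
         center = FALSE, scale. = FALSE)
}

# split() gives one data frame per country, lapply() runs the PCA on each
pca_by_country <- lapply(split(toy, toy$qcountry), run_pca)

# Loadings for one country
pca_by_country[["ar"]]$rotation
```

Because split() names the list elements after the countries, each result can be looked up by its country code afterwards.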
