I'm currently working in R. My database has a lot of zeros and NA values, so when I run lm() with a log-log model it doesn't work. Any advice?
This is not the best answer, but it got things working for me: I store the database results in a new variable and replace all missing values (in my case, cells that are literally empty, i.e. "") with NA, like this:
library(RODBC)      # the database lives on a SQL box
library(tidyverse)

myConnection <- odbcConnect("name_of_your_DSN")  # I create a DSN first
Data <- sqlFetch(channel = myConnection, sqtable = "name_of_your_table")

# Turn empty strings into NA; factors are converted to character first,
# since ifelse() won't work with factors.
empty_as_na <- function(x) {
  if ("factor" %in% class(x)) x <- as.character(x)
  ifelse(as.character(x) != "", x, NA)
}

Data <- Data %>% mutate_all(funs(empty_as_na))  # with newer dplyr: mutate(across(everything(), empty_as_na))
You might also consider experimenting with adding this after loading the RODBC library:
options(stringsAsFactors = FALSE)
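With the empty cells converted to NA, the original log-log question then comes down to dropping the zero and NA rows before calling lm(). A minimal sketch, assuming hypothetical numeric columns y and x in Data (these names are not from the original post):

# Drop rows where either variable is missing or non-positive,
# since log() is undefined for 0 and NA. `y` and `x` are placeholder column names.
model_data <- subset(Data, !is.na(y) & !is.na(x) & y > 0 & x > 0)

# Fit the log-log model on the cleaned data
fit <- lm(log(y) ~ log(x), data = model_data)
summary(fit)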
I have been stuck for several hours on some R code. Initially I thought it would be something very simple, but nothing I have tried has worked. The code imports a number of databases and, for each of them, calculates the average ratio of NA, zero, and empty values. For every database it builds an auxiliary data frame that holds the variable names and the ratio of missing values for each variable. The problem is storing that auxiliary data frame: it should be saved under the name of the original database, i.e. its name should depend on the index k that iterates over the databases. Every alternative I have tried to produce something like base_[k], where k varies with the database name, has failed.
Has anyone experienced something like this? I don't know what to do anymore. Thanks a lot. I leave the code below so you can understand it a little better.
rm(list = ls())
setwd("C:/Users/Kevin/Escritorio/UK 2022.05.24")

listcsv <- dir(pattern = "*.csv")  # creates the list of all the csv files in the directory
results <- as.data.frame(listcsv)
results$mean_na_ratio       <- -777
results$mean_zero_ratio     <- -777
results$mean_no_value_ratio <- -777

for (k in 1:length(listcsv)) {
  df <- read.csv(listcsv[k], stringsAsFactors = FALSE)

  c1 <- colMeans(is.na(df))
  results[k, "mean_na_ratio"] <- mean(c1)

  vars_vector <- colnames(df)
  vars_dataframe <- as.data.frame(vars_vector)
  rownames(vars_dataframe) <- vars_dataframe$vars_vector

  for (i in vars_vector) {
    df[, i] <- as.character(df[, i])
    df$temp <- df[, i]
    vars_dataframe[i, "mean_zero_ratio"]     <- nrow(subset(df, temp == "0")) / nrow(df)
    vars_dataframe[i, "mean_no_value_ratio"] <- nrow(subset(df, temp == "")) / nrow(df)
  }

  vars_dataframe[is.na(vars_dataframe)] <- 0
  results[k, "mean_zero_ratio"]     <- mean(vars_dataframe$mean_zero_ratio)
  results[k, "mean_no_value_ratio"] <- mean(vars_dataframe$mean_no_value_ratio)

  data_k <- vars_dataframe   # <-- this is the problematic line
}
The problem is the last line inside the loop, marked with a comment.
Thank you so much.
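For reference, a name that depends on k can be built either with a named list or with assign(). A minimal sketch (not part of the original code, using tools::file_path_sans_ext() to turn each csv file name into an object name):

## A sketch: pick one of these for the problematic line inside the loop.

# Option 1 (usually preferred): collect the auxiliary data frames in a named list.
# Create `aux <- list()` once before the loop, then inside the loop:
aux[[tools::file_path_sans_ext(listcsv[k])]] <- vars_dataframe

# Option 2: create one object per database, e.g. base_mydata, with assign():
assign(paste0("base_", tools::file_path_sans_ext(listcsv[k])), vars_dataframe)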
I am trying to optimize a simple R code I wrote on two aspects:
1) For loops
2) Writing data into my PostgreSQL database
For 1), I know for loops should be avoided at all costs and that lapply is recommended instead, but I am not clear on how to translate my code below using lapply.
For 2), what I do below works, but I am not sure it is the most efficient way (compared, for example, to rbinding all data into an R data frame and then loading the whole data frame into my PostgreSQL database).
EDIT: I updated my code with a reproducible example below.
library(rvest)  # read_html(), html_nodes(), html_text()
library(DBI)    # dbWriteTable(); `con` is an existing DBI connection to the PostgreSQL database

for (i in 1:100) {
  search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
  download.file(search, destfile = 'scrape.html', quiet = TRUE)
  url <- read_html('scrape.html')
  github_title <- url %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
  github_link  <- url %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
  df <- data.frame(github_title, github_link)
  colnames(df) <- c("title", "link")
  dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
  cat(i)
}
Thanks a lot for all your inputs!
First of all, it is a myth, and one that should be thoroughly debunked, that lapply is in any way faster than equivalent code using a for loop. For years this has not been the case, and a for loop should in every case be at least as fast as the equivalent lapply.
I will illustrate this using a for loop, since you seem to find that more intuitive. Note, however, that I work mostly in T-SQL, so some conversion might be necessary.
n <- 1e5
outputDat <- vector('list', n)

for (i in seq_len(n)) {
  id            <- element_a[i]   # element_a .. element_d are placeholders for your scraped vectors
  location      <- element_b[i]
  language      <- element_c[i]
  date_creation <- element_d[i]
  df <- data.frame(id, location, language, date_creation)
  colnames(df) <- c("id", "location", "language", "date_creation")
  outputDat[[i]] <- df
}

## Combine the data.frames
outputDat <- do.call('rbind', outputDat)

## Write the combined data.frame into the database.
##dbBegin(con)  #<= might speed up, might not.
dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
##dbCommit(con) #<= might speed up, might not.
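For reference, since the question asked about lapply: the same accumulation can be written with lapply and should perform comparably (a sketch only, using the same placeholder vectors as above):

outputDat <- lapply(seq_len(n), function(i) {
  data.frame(id            = element_a[i],
             location      = element_b[i],
             language      = element_c[i],
             date_creation = element_d[i])
})
outputDat <- do.call('rbind', outputDat)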
Using Transact-SQL, you could alternatively combine the entire string into a single INSERT INTO statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop would once again be just as fast if done properly.
## Create the VALUES part of the statement.
statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')",
                    collapse = ",\n")  # \n can be removed, but makes printing nicer.

## Optional: print a bit of the statement
# cat(substr(statement, 1, 2000))

##dbBegin(con)  #<= might speed up, might not.
statement <- paste0("
/*
  SET NOCOUNT ON seems to be necessary in the DBI API.
  It seems to react to 'n rows affected' messages.
  Note: this only affects this method, not the one using dbWriteTable.
*/
--SET NOCOUNT ON
INSERT INTO [my table] VALUES ", statement)
dbExecute(con, statement)
##dbCommit(con) #<= might speed up, might not.
Note, as I mention in the comment, that this might simply fail to upload the table properly, as the DBI package sometimes seems to fail this kind of transaction if it results in one or more messages about n rows affected.
Last but not least, once the statements are made, they could be copied from R and pasted into any GUI that accesses the database directly, using for example writeLines(statement, 'clipboard') or by writing them to a text file (a file is more stable if your data contains a lot of rows). In rare cases this last resort can be faster, if for whatever reason DBI or alternative R packages run overly slowly. As this seems to be somewhat of a personal project, that might be sufficient for your use.
I have worked a lot with MaxEnt in R recently (the dismo package), but only using cross-validation to validate my model of bird habitats (a single species). Now I want to use a self-created test sample file. I had to pick these validation points by hand and can't use random test points.
So my R-script looks like this:
library(raster)
library(dismo)
setwd("H:/MaxEnt")
memory.limit(size = 400000)
punkteVG <- read.csv("Validierung_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteTG <- read.csv("Training_FL_XY_2016.csv", header=T, sep=";", dec=",")
punkteVG$X <- as.numeric(punkteVG$X)
punkteVG$Y <- as.numeric(punkteVG$Y)
punkteTG$X <- as.numeric(punkteTG$X)
punkteTG$Y <- as.numeric(punkteTG$Y)
##### mask NA ######
mask <- raster("final_merge_8class+le_bb_mask.img")
dataframe_VG <- extract(mask, punkteVG)
dataframe_VG[dataframe_VG == 0] <- NA
dataframe_TG <- extract(mask, punkteTG)
dataframe_TG[dataframe_TG == 0] <- NA
punkteVG <- punkteVG*dataframe_VG
punkteTG <- punkteTG*dataframe_TG
#### add the raster dataset ####
habitat_all <- stack("blockstats_stack_8class+le+area_8bit.img")
#### MODEL FITTING #####
library(rJava)
system.file(package = "dismo")
options(java.parameters = "-Xmx1g" )
setwd("H:/MaxEnt/results_8class_LE_AREA")
### backgroundpoints ###
set.seed(0)
backgrVMmax <- randomPoints(habitat_all, 100000, tryf=30)
backgrVM <- randomPoints(habitat_all, 1000, tryf=30)
### Renner (2015) PPM modelfitting Maxent ###
maxentVMmax_Renner <- maxent(habitat_all, punkteTG, backgrVMmax, path = paste('H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner', sep = ""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=10",
"replicatetype=subsample",
"randomtestpoints=20",
"randomseed=true",
"testsamplesfile=H:/MaxEnt/Validierung_FL_XY_2016_swd_NA"))
After the "maxent()"-command I ran into multiple errors. First I got an error stating that he needs more than 0 (which is the default) "randomtestpoints". So I added "randomtestpoints = 20" (which hopefully doesn't stop the program from using the file). Then I got:
Error: Test samples need to be in SWD format when background data is in SWD format
Error in file(file, "rt") : cannot open the connection
The thing is, when I ran the script with the default cross-validation like this:
maxentVMmax_Renner <- maxent(habitat_all, punkteTG, backgrVMmax, path = paste('H:/MaxEnt/Ergebnisse_8class_LE_AREA/maxVMmax_Renner', sep = ""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=10"))
...all works fine.
I also tried multiple layouts to get my csv validation data into the correct format: two columns (labelled X and Y), three columns (labelled species, X and Y), and other variants. I would rather use the "punkteVG" object (which is the validation data) I created with read.csv, but it seems MaxEnt wants its own file.
I can't imagine my problem is so uncommon. Someone must have used the argument "testsamplesfile" before.
I found out what the problem was. So here it is, for others to enjoy:
The correct maxent() command for a subsample file looks like this:
maxentVMmax_Renner <- maxent(habitat_all, punkteTG, backgrVMmax, path = paste('H:/MaxEnt', sep = ""),
args=c("-P",
"noautofeature",
"nothreshold",
"noproduct",
"maximumbackground=400000",
"noaddsamplestobackground",
"noremoveduplicates",
"replicates=1",
"replicatetype=Subsample",
"testsamplesfile=H:/MaxEnt/swd.csv"))
Of course, there cannot be multiple replicates, since you have only one subsample.
Most importantly, the "swd.csv" subsample file has to include:
the X and Y coordinates
the values at the respective points (e.g. obtained with extract(habitat_all, punkteVG))
a first column that consists of the word "species", with the header "Species" (since MaxEnt uses the default "species" if you don't define one in the occurrence data)
So the last point was the issue here. Basically, if you don't define the species column in the subsample file, MaxEnt will not know how to assign the data.
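A minimal sketch of how such an swd.csv could be built from the objects above (the column names and the exact file path are assumptions, not from the original post):

# Extract the predictor values at the hand-picked validation points
vg_values <- extract(habitat_all, punkteVG[, c("X", "Y")])

# SWD layout: species label, coordinates, then the predictor values
swd <- data.frame(species = "species",
                  punkteVG[, c("X", "Y")],
                  vg_values)

write.csv(swd, "H:/MaxEnt/swd.csv", row.names = FALSE)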
I am experimenting with different regression models. My end goal is to have a nice, easy-to-read data frame with 3 columns:
model_results <- data.frame(name = character(),
rmse = numeric(),
r2 = numeric())
Then, after running each model, I add the corresponding output to the data frame so that, at the end, I can review the results and make a decision on which model to use.
I tried this:
mod.spend_transactions.results <- list("mod.spend_transactions",
rsme(residuals(mod.spend_transactions)),
summary(mod.spend_transactions)$r.squared)
I tried using a list because I know vectors can only store one datatype (right?).
Output:
rbind(model_results, mod.spend_transactions.results)
X.mod.spend_transactions. X12.6029444519635 X0.912505643567096
1 mod.spend_transactions 12.60294 0.9125056
Close, but not what I wanted, since the column names have been changed and I did not expect that.
So I tried vectors, which works but seems "clunky" in that I'm sure I could do this by writing less code:
vect_modname <- vector()
vect_rsme <- vector()
vect_r2 <- vector()
Then after running a model
vect_modname <- c(vect_modname, "mod.spend_transactions")
vect_rsme <- c(vect_rsme, rsme(residuals(mod.spend_transactions)))
vect_r2 <- c(vect_r2, summary(mod.spend_transactions)$r.squared)
Then, at the end, after running all the models I'm testing out:
data.frame(vect_modname, vect_rsme, vect_r2)
Again, the vector method does work. But is there a "better", more elegant way of doing this?
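For reference, one common pattern (a sketch, not the only way) is to build a one-row data frame per model, collect the rows in a list, and bind them together at the end; this avoids growing vectors and keeps the column names intact. rsme() below stands for whatever error function is already in use:

results_list <- list()

# after fitting each model:
results_list[["mod.spend_transactions"]] <- data.frame(
  name = "mod.spend_transactions",
  rmse = rsme(residuals(mod.spend_transactions)),
  r2   = summary(mod.spend_transactions)$r.squared,
  stringsAsFactors = FALSE
)

# at the end, combine all rows into one data frame:
model_results <- do.call(rbind, results_list)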
I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new data set is rather large, and the predict procedure runs out of memory if I do it all at once. So I'd like to convert the procedure below, which worked fine for small sets, into a batch mode that processes 500 lines at a time and outputs a file for each scored batch of 500.
I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:
trainingdata <- read.csv('in.csv', stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
newdata <- read.csv('newstuff.csv', stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=filename)
to something like:
trainingdata <- read.csv('in.csv', stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
con <- file("newstuff.csv", open = "r")
i = 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
i = i+1
newdata <- as.data.frame(mylines, stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=paste(filename,i,'.csv',sep=''))
}
close(con)
However, when I print the mylines object inside the loop, it isn't parsed into columns the way read.csv's output is: the headers are still a mess, and the splitting of each line into the right number of columns that read.csv does under the hood isn't happening.
Whenever I find myself writing barbaric things like cutting the first row and wrapping the columns by hand, I generally suspect R has a better way to do things. Any suggestions for how I can get read.csv-like output from a readLines csv connection?
You can read your data into memory in chunks with read.csv by using its skip and nrows arguments. In pseudo-code:
read_chunk = function(start, n) {
  # header = FALSE: only the very first line of the file is a header row, so each
  # chunk is read without one; supply col.names yourself if predict() needs them.
  read.csv(file, skip = start, nrows = n, header = FALSE)
}

start_indices = (0:no_chunks) * chunk_size + 1

lapply(start_indices, function(x) {
  dat = read_chunk(x, chunk_size)
  pred = predict(fit, dat)
  write.csv(pred, file = paste0("pred_", x, ".csv"))
})
Alternatively, you could put the data into an SQLite database and use the RSQLite package to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.
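To get read.csv-like parsing from the readLines chunks in the original loop, another sketch (assuming a simple comma-separated header with no quoted commas) is to read the header line once and feed each chunk back to read.csv via its text argument:

con <- file("newstuff.csv", open = "r")
headerline <- readLines(con, n = 1)            # consume the header once
col_names  <- strsplit(headerline, ",")[[1]]

i <- 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
  i <- i + 1
  # paste the chunk back into one string so read.csv can parse it normally
  newdata <- read.csv(text = paste(mylines, collapse = "\n"),
                      header = FALSE, col.names = col_names,
                      stringsAsFactors = FALSE)
  preds <- predict(fit, newdata)
  write.csv(preds, file = paste0("preds_", i, ".csv"))
}
close(con)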