Writing to a database in parallel in R

I am trying to write a table that is a processed subset of a global data variable. In a normal for loop this piece of code works fine, but when I try to do it in parallel it raises an error.
Here is my code:
library(doParallel)
library(foreach)
library(odbc)
library(data.table)
nc <- detectCores() - 1
cs <- makeCluster(nc)
registerDoParallel(cs)
con <- dbConnect(odbc(), driver = 'SQL Server', server = 'localserver',
                 database = 'mydb', encoding = 'utf-8', timeout = 20)
range_to <- 1e6
set.seed(1)
random_df <- data.table(a = rnorm(n = range_to, mean = 2, sd = 1),
                        b = runif(n = range_to, min = 1, max = 300))
foreach(i = 1:1000, .packages = c('odbc', 'data.table')) %dopar% {
  subk <- random_df[i, ]
  subk <- subk^2
  odbc::dbWriteTable(conn = con, name = 'parallel_test', value = subk,
                     row.names = FALSE, append = TRUE)
}
This code raises this error:
Error in {: task 1 failed - "unable to find an inherited method for function 'dbWriteTable' for signature '"Microsoft SQL Server", "character", "data.table"'"
As I said, in a normal for loop it works fine.
Thanks in advance.

I solved the issue by changing only the way the connection object is created: instead of one connection on the master, each worker now creates its own connection via clusterEvalQ:
parallel::clusterEvalQ(cs, {
  library(odbc)
  con <- dbConnect(odbc(), driver = 'SQL Server', server = 'localserver',
                   database = 'mydb', encoding = 'utf-8', timeout = 20)
})
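For completeness, here is a minimal sketch of the full working pattern, assuming the same cluster, table, and connection settings as above. DBI connection objects are external pointers and cannot be serialized and shipped from the master to the workers, so each worker opens its own connection, and the con referenced inside %dopar% resolves to that worker-local object:
library(doParallel)
library(foreach)
library(odbc)
library(data.table)
nc <- detectCores() - 1
cs <- makeCluster(nc)
registerDoParallel(cs)
# One connection per worker; connections cannot be exported from the master
parallel::clusterEvalQ(cs, {
  library(odbc)
  con <- dbConnect(odbc(), driver = 'SQL Server', server = 'localserver',
                   database = 'mydb', encoding = 'utf-8', timeout = 20)
})
range_to <- 1e6
set.seed(1)
random_df <- data.table(a = rnorm(n = range_to, mean = 2, sd = 1),
                        b = runif(n = range_to, min = 1, max = 300))
foreach(i = 1:1000, .packages = c('odbc', 'data.table')) %dopar% {
  subk <- random_df[i, ]^2
  # 'con' here is the connection created in this worker's global environment
  odbc::dbWriteTable(conn = con, name = 'parallel_test', value = subk,
                     row.names = FALSE, append = TRUE)
}
stopCluster(cs)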

Related

How do I resolve an integration error in Seurat?

I am new to Seurat, and am trying to run an integrated analysis of two different single-nuclei RNAseq datasets. I have been following the Seurat tutorial on integrated analysis (https://satijalab.org/seurat/articles/integration_introduction.html) to guide me, but when I ran the last line of code, I got an error.
# Loading required libraries
library(Seurat)
library(cowplot)
library(patchwork)
# Set up the Seurat Object
vgat.data <- Read10X(data.dir = "~/Desktop/VGAT Viral Data 1/")
vglut.data <- Read10X(data.dir = "~/Desktop/VGLUT3 Viral/")
# Initialize the Seurat object with the raw (non-normalized data)
vgat <- CreateSeuratObject(counts = vgat.data, project = "VGAT/VGLUT Integration", min.cells = 3, min.features = 200)
vglut <- CreateSeuratObject(counts = vglut.data, project = "VGAT/VGLUT Integration", min.cells = 3, min.features = 200)
# Merging the datasets
vgat <- AddMetaData(vgat, metadata = "VGAT", col.name = "Cell")
vglut <- AddMetaData(vglut, metadata = "VGLUT", col.name = "Cell")
merged <- merge(vgat, y = vglut, add.cell.ids = c("VGAT", "VGLUT"), project = "VGAT/VGLUT Integration")
# Split the dataset into a list of two seurat objects (vgat and vglut)
merged.list <- SplitObject(merged, split.by = "Cell")
# Normalize and Identify variable features for each dataset independently
merged.list <lapply(X = merged.list, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nFeatures = 2000)
})
After running the last line of code, I get the following error:
Error in merged.list < lapply(X = merged.list, FUN = function(x) { :
comparison of these types is not implemented
I was wondering if anyone is familiar with Seurat and knows how I can troubleshoot this error. Any help would be greatly appreciated.
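No answer was recorded for this one, but the error message itself points at the cause: merged.list <lapply(...) is missing the hyphen of the assignment arrow, so R parses it as a comparison, merged.list < lapply(...), between a list and the result of lapply(); hence "comparison of these types is not implemented". A minimal sketch of the likely fix (note also that the FindVariableFeatures() argument is spelled nfeatures, all lowercase, in current Seurat releases):
# Assign with <-, don't compare with <
merged.list <- lapply(X = merged.list, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
  x  # return the processed object
})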

Conceptual Structure doesn't accept clust

I use the latest version of the package and I try to run this example:
library(bibliometrix)
download.file("https://www.bibliometrix.org/datasets/joi.zip", destfile = temp <- tempfile())
M <- convert2df(readLines(unz(temp, "joi.txt")), dbsource = "isi", format = "plaintext")
CS <- conceptualStructure(M, method = "MCA", field = "ID", minDegree = 10,
                          clust = 5, stemming = FALSE, labelsize = 8, documents = 20)
but I receive this error:
Error in conceptualStructure(M, method = "MCA", field = "ID", minDegree = 10, :
unused argument (clust = 5)
What should I change?
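No answer is recorded for this one, but an "unused argument" error generally means the installed version of the package does not have that argument in its function signature, i.e. the installed bibliometrix differs from the one the example was written for. A quick, version-agnostic check (base R only, nothing bibliometrix-specific assumed beyond the function name used above):
# Which arguments does the installed conceptualStructure() actually accept?
args(bibliometrix::conceptualStructure)
# Which bibliometrix version is installed?
packageVersion("bibliometrix")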

R Parallel Programming: Error in { : task 1 failed - "could not find function "%>%""

I tried to do parallel programming in R by modifying my script. The script contains two parallel sections. The first one ran fine, but the second one failed, even though the structure of both is the same. Below is my code:
library(rvest)
library(RMySQL)
library(curl)
library(gdata)
library(doMC)
library(foreach)
library(doParallel)
library(raster)
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
setwd('/home/chandra/R/IlmuOne/MisterAladin')
no_cores <- detectCores()
cl <- makeCluster(no_cores)
registerDoParallel(cl)
MasterData = read.xls("Master Hotels - FINAL.xlsx", sheet = 1, header = TRUE)
MasterData$url_agoda = as.character(MasterData$url_agoda)
today = as.Date(format(Sys.time(), "%Y-%m-%d"))+2
ntasks <- nrow(MasterData)
#This section performed well
foreach(i=1:ntasks) %dopar% {
  url = MasterData$url_agoda[i]
  if (trim(url)!='-' & trim(url)!='')
  {
    from = gregexpr(pattern ='=',url)[[1]][1]
    piece1 = substr(url,1,from)
    from = gregexpr(pattern ='&los=',url)[[1]][1]
    piece2 = substr(url,from,nchar(url))
    MasterData$url_agoda[i] = paste0(piece1,today,piece2)
  }
}
con <- dbConnect(RMySQL::MySQL(), username = "root", password = "master",host = "localhost", dbname = "mister_aladin")
#Try the first 10 rows
#The section below errors, always returning: Error in { : task 1 failed - "could not find function "%>%""
foreach(a=1:10, .packages='foreach') %dopar% {
  hotel_id = MasterData$id[a]
  vendor = 'Agoda'
  url = MasterData$url_agoda[a]
  if (url!='-')
  {
    tryCatch({
      hotel <- curl(url) %>%
        read_html() %>%
        html_nodes(xpath='//*[@id="room-grouping"]') %>%
        html_table(fill = TRUE)
      hotel <- hotel[[1]]
      hotel$hotel_id= hotel_id
      hotel$vendor= vendor
      colnames(hotel)[1] = 'TheSpace'
      colnames(hotel)[4] = 'PricePerNight'
      room = '-'
      hotel$NormalPrice = 0
      hotel$FinalPrice = 0
      for(i in 1:nrow(hotel))
      {
        if (i==1 | (!grepl('See photos',hotel$TheSpace[i]) & hotel$TheSpace[i]!='') )
        {
          room = hotel$TheSpace[i]
        }
        hotel$TheSpace[i] = room
        #Normal Price
        if (gregexpr(pattern ='IDR',hotel$PricePerNight[i])[[1]][1][1]==1)
        {
          split = strsplit(hotel$PricePerNight[i],'\n')[[1]]
          NormalPrice = trim(split[2])
          hotel$NormalPrice[i] = NormalPrice
          NormalPrice = as.integer(gsub(",","",NormalPrice))
          hotel$NormalPrice[i] = NormalPrice
        }
        #Final Price
        if (gregexpr(pattern ='IDR',hotel$PricePerNight[i])[[1]][1][1]==1)
        {
          split = strsplit(hotel$PricePerNight[i],'\n')[[1]]
          FinalPrice = trim(split[6])
          hotel$FinalPrice[i] = FinalPrice
          FinalPrice = as.integer(gsub(",","",FinalPrice))
          hotel$FinalPrice[i] = FinalPrice
        }
        hotel$NormalPrice[is.na(hotel$NormalPrice)] <- 0
        hotel$FinalPrice[is.na(hotel$FinalPrice)] <- 0
      }
      hotel = hotel[which(hotel$FinalPrice!=0),c("TheSpace","NormalPrice","FinalPrice")]
      colnames(hotel) = c('room','normal_price','final_price')
      hotel$log = format(Sys.time(), "%Y-%m-%d %H:%M:%S")
      hotel$hotel_id = hotel_id
      hotel$vendor = vendor
      Push = hotel[,c('hotel_id','room','normal_price','final_price','vendor','log')]
      #print(paste0('Agoda: push one record, hotel id ',hotel_id,'!'))
      #cat(paste(paste0('Agoda: push one record, hotel id ',hotel_id,'!'),'\n'))
      dbWriteTable(conn=con,name='prices_',value=as.data.frame(Push), append = TRUE, row.names = F)
    },
    error = function(e) {
      Sys.sleep(2)
      e
    })
  }
}
dbDisconnect(con)
stopImplicitCluster()
Every time I run the script it gives me the error: Error in { : task 1 failed - "could not find function "%>%""
I have already checked every related post on this forum and tried the suggested fixes, but none of them work.
Please advise a solution.
You have to use .packages = c("magrittr", ...) and include all the packages that are necessary to run the code within the foreach loop; .packages = "foreach" alone does not help.
You can think of it this way: all the packages you list in .packages are forwarded to / loaded on each parallel worker.
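A self-contained sketch of the principle, reduced to a toy body (in the question's loop, .packages would also list curl, rvest, and RMySQL, the packages its body actually uses):
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
# .packages loads the listed packages on every worker,
# so %>% is defined where the loop body runs
res <- foreach(a = 1:10, .packages = c("magrittr")) %dopar% {
  a %>% sqrt()
}
stopCluster(cl)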
The %>% operator requires the package magrittr. In this case, however, it does not suffice to load it at the beginning of your script: it needs to be loaded on each of the nodes. You could add this line to the creation of your cluster to accomplish this:
cl <- makeCluster(no_cores)
registerDoParallel(cl)
clusterCall(cl, function() library(magrittr))

Unable to save the R output in HDFS

I am running a SparkR program and want to save the output to HDFS. The output saves locally perfectly, but if I use an HDFS path it throws an error.
I am executing it from a shell script. This is my shell script:
/SparkR-pkg/lib/SparkR/sparkR-submit --master yarn-client examples/pi.R yarn-client 4
This is my R code:
library(SparkR)
getwd()
setwd('hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/output/')
args <- commandArgs(trailing = TRUE)
if (length(args) < 1) {
  print("Usage: pi <master> [<slices>]")
  q("no")
}
sc <- sparkR.init(args[[1]], "PiR")
slices <- ifelse(length(args) > 1, as.integer(args[[2]]), 2)
n <- 100000 * slices
piFunc <- function(elem) {
  rands <- runif(n = 2, min = -1, max = 1)
  val <- ifelse((rands[1]^2 + rands[2]^2) < 1, 1.0, 0.0)
  val
}
piFuncVec <- function(elems) {
  message(length(elems))
  rands1 <- runif(n = length(elems), min = -1, max = 1)
  rands2 <- runif(n = length(elems), min = -1, max = 1)
  val <- ifelse((rands1^2 + rands2^2) < 1, 1.0, 0.0)
  sum(val)
}
rdd <- parallelize(sc, 1:n, slices)
count <- reduce(lapplyPartition(rdd, piFuncVec), sum)
output <- paste("Pi is roughly", 4.0 * count / n, "\n")
output <- paste(output, "Num elements in RDD ", count(rdd), "\n")
writeLines(output, con = "file.txt", sep = "\n", useBytes = FALSE)
cat("Num elements in RDD ", count(rdd), "\n")
I tried many methods to save the output to HDFS, like sink, write.data, writetype, etc. I also tried to change the working directory with setwd(), but that does not work either; it throws this error:
Error in setwd("hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/output/") :
cannot change working directory
Execution halted
I have been troubleshooting for 2 days. Any help will be appreciated.
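No answer is recorded for this one, but the error is telling: setwd() and base-R file functions such as writeLines() only understand local filesystem paths, not hdfs:// URLs. One possible workaround, assuming the hdfs command-line client is available on the node running the driver, is to keep writing the file locally and then push it into HDFS; the destination path below is the one from the question:
# Write locally first: base R cannot open hdfs:// paths
writeLines(output, con = "file.txt", sep = "\n", useBytes = FALSE)
# Then copy the local file into HDFS with the Hadoop CLI (-f overwrites)
system(paste("hdfs dfs -put -f file.txt",
             "hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/output/file.txt"))
Saving through Spark itself (as an RDD rather than a local file) would be the more idiomatic route if the SparkR build in use exposes a save API.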

Unused arguments in R error

I am new to R. I am trying to run the example given in the "rebmix-help" pdf. It uses the galaxy dataset, and here is the code:
library(rebmix)
devAskNewPage(ask = TRUE)
data("galaxy")
write.table(galaxy, file = "galaxy.txt", sep = "\t",eol = "\n", row.names = FALSE, col.names = FALSE)
REBMIX <- array(list(NULL), c(3, 3, 3))
Table <- NULL
Preprocessing <- c("histogram", "Parzen window", "k-nearest neighbour")
InformationCriterion <- c("AIC", "BIC", "CLC")
pdf <- c("normal", "lognormal", "Weibull")
K <- list(7:20, 7:20, 2:10)
for (i in 1:3) {
  for (j in 1:3) {
    for (k in 1:3) {
      REBMIX[[i, j, k]] <- REBMIX(Dataset = "galaxy.txt",
        Preprocessing = Preprocessing[k], D = 0.0025,
        cmax = 12, InformationCriterion = InformationCriterion[j],
        pdf = pdf[i], K = K[[k]])
      if (is.null(Table))
        Table <- REBMIX[[i, j, k]]$summary
      else
        Table <- merge(Table, REBMIX[[i, j, k]]$summary, all = TRUE, sort = FALSE)
    }
  }
}
It is giving me this error:
unused argument (InformationCriterion = InformationCriterion[j])
Please help.
I'm running R 3.0.2 (Windows), and the rebmix package defines a function REBMIX in which the named argument is not InformationCriterion but Criterion. In brief, invoke REBMIX as:
REBMIX[[i, j, k]] <- REBMIX(Dataset = "galaxy.txt",
  Preprocessing = Preprocessing[k], D = 0.0025,
  cmax = 12, Criterion = InformationCriterion[j],
  pdf = pdf[i], K = K[[k]])
It looks as though there have been substantial changes to the rebmix package since the example mentioned in the OP was created. Among the most noticeable changes is the use of S4 classes.
There's also an updated demo in the rebmix package using the galaxy data (see demo("rebmix.galaxy")).
To get the above example to produce results (note: I am not familiar with this package or the rebmix algorithm!):
Change the argument to Criterion, as mentioned by @Giupo
Use the S4 slot access operator @ instead of $
Don't name the results object REBMIX, because that's already the function name
library(rebmix)
data("galaxy")
## Don't re-name the REBMIX object!
myREBMIX <- array(list(NULL), c(3, 3, 3))
Table <- NULL
Preprocessing <- c("histogram", "Parzen window", "k-nearest neighbour")
InformationCriterion <- c("AIC", "BIC", "CLC")
pdf <- c("normal", "lognormal", "Weibull")
K <- list(7:20, 7:20, 2:10)
for (i in 1:3) {
  for (j in 1:3) {
    for (k in 1:3) {
      myREBMIX[[i, j, k]] <- REBMIX(Dataset = list(galaxy),
        Preprocessing = Preprocessing[k], D = 0.0025,
        cmax = 12, Criterion = InformationCriterion[j],
        pdf = pdf[i], K = K[[k]])
      if (is.null(Table)) {
        Table <- myREBMIX[[i, j, k]]@summary
      } else {
        Table <- merge(Table, myREBMIX[[i, j, k]]@summary, all = TRUE, sort = FALSE)
      }
    }
  }
}
I guess this is late, but I encountered a similar problem just a few minutes ago, and I realized what is really going on when you get this kind of error message: it's just a version conflict.
You may be using a different version of the R package than the tutorial, so the argument names can differ between the version you are running and the one the example code uses.
So please check the package version before you try to manually edit the file. It can also happen that an old version of the package is still on your library path and overrides the new one. That was exactly my situation, since I had manually installed the old and new versions separately.
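A quick way to check for exactly this situation (base R only):
# Which version is loaded, and from which library path?
packageVersion("rebmix")
find.package("rebmix", verbose = TRUE)
# Several entries here can hide an outdated copy earlier on the path
.libPaths()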
