U-SQL script with R

I found out that you can execute R within U-SQL. So I took an R script from one of our data scientists and built a U-SQL script based on this sample script.
The adapted script:
DECLARE @INPUT_DAT string =
@"/Samples/Data/dat2json/validationData.dat.201805271617";
DECLARE @OUTPUT string = @"/Samples/Output/validationdata.out";
REFERENCE ASSEMBLY [ExtR];
DECLARE @myRScript = @"
datavector <- as.vector(readBin(@INPUT_DAT, "double", size = 4, n = 99000))
Size <- length(datavector)
numberOfPixels <- Size / 84
MaterialBase <- factor(rep(c("Plastic", "Aluminum"), each = (Size / 2)))
ThicknessBase <- factor(rep(c(rep(c(0, 10, 20, 30, 40, 50), times = 7),
rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0), each = 6)), each = numberOfPixels))
ThicknessIterated <- factor(rep(c(rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
each = 6), rep(c(0, 10, 20, 30, 40, 50), times = 7)), each = numberOfPixels))
Pixel <- rep(1:numberOfPixels, times = 84)
dflabel <- data.frame(MaterialBase, ThicknessBase, ThicknessIterated, Pixel,
Value = datavector)
";
@RScriptOutput = REDUCE @myRScript USING new
Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");
OUTPUT @ScriptOutput
TO @OUTPUT
USING Outputters.Tsv();
The problem is that when I build the code, Visual Studio stops on line 6, after @". IntelliSense also shows a red squiggle indicating that something is wrong. The error it generates is: Expected one of: OPTION ';'
The R script works perfectly in RStudio.
Update 2018-07-19:
I have narrowed it down a bit. The problem is the double quotes in the @myRScript variable. So I changed the code to the following:
DECLARE @INPUT_DAT string =
@"/dat2json/data/validationData.dat.201805271617";
DECLARE @OUTPUT string = @"/dat2json/data/validationdata.out";
DECLARE @vartype string = "double";
DECLARE @var1 string = "Plastic";
DECLARE @var2 string = "Aluminum";
REFERENCE ASSEMBLY [ExtR];
DECLARE @myRScript string = @"
datavector <- as.vector(readBin(@INPUT_DAT, @vartype, size = 4, n = 99000))
Size <- length(datavector)
numberOfPixels <- Size / 84
MaterialBase <- factor(rep(c(@var1, @var2), each = (Size / 2)))
ThicknessBase <- factor(rep(c(rep(c(0, 10, 20, 30, 40, 50), times = 7),
rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0), each = 6)), each = numberOfPixels))
ThicknessIterated <- factor(rep(c(rep(c(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
each = 6), rep(c(0, 10, 20, 30, 40, 50), times = 7)), each = numberOfPixels))
Pixel <- rep(1:numberOfPixels, times = 84)
dflabel <- data.frame(MaterialBase, ThicknessBase, ThicknessIterated, Pixel,
Value = datavector)
";
@RScriptOutput = REDUCE @myRScript ON MaterialBase USING new
Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");
OUTPUT @ScriptOutput
TO @OUTPUT
USING Outputters.Tsv();
But now I get another error:
E_CSC_USER_ROWSETVARIABLENOTFOUND: Rowset variable @myRScript was not found.
Description:
Rowset variables must be assigned to before they can be referenced.
Resolution:
Assign a rowset to the rowset variable or remove the reference.
It looks like I have to put the result of the R script into a variable and use that one in the REDUCE statement. But how do I do that?
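On the quoting problem described in the update: a U-SQL verbatim string ends at the first unescaped double quote, so a literal such as "double" inside the embedded R script closes the string early. R treats single and double quotes the same way, so one possible workaround (an assumption, not tested against Data Lake Analytics) is to write the R string literals with single quotes. A minimal R sketch showing that both forms give the same values:

```r
# Single- and double-quoted literals produce identical character values in R,
# so an embedded script can avoid double quotes entirely.
material_double <- rep(c("Plastic", "Aluminum"), each = 2)
material_single <- rep(c('Plastic', 'Aluminum'), each = 2)

# TRUE when both vectors hold exactly the same strings in the same order.
same_labels <- identical(material_double, material_single)
print(same_labels)
```

This only addresses the quoting; it does not make U-SQL variables visible inside the R script text.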

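Deploy the R script as a resource. The script file is deployed into the vertex workspace and is accessible from any custom code. Note that REDUCE takes a rowset as its input, not a string variable, which is what the E_CSC_USER_ROWSETVARIABLENOTFOUND error is complaining about.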
DECLARE @rScriptFile string = @"MyR2.R";
DECLARE @rScriptDeploy string = @"/rscripts/" + @rScriptFile;
DEPLOY RESOURCE @rScriptDeploy;
@inputQuery2 =
REDUCE @inputQuery1
ON Par
PRODUCE Par,
...
READONLY Par
USING new Extension.R.Reducer(scriptFile : @rScriptFile, rReturnType : "dataframe");
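For orientation, here is a rough sketch of what a deployed script such as MyR2.R might contain. It assumes the R extension hands the input rowset to the script as a data frame named inputFromUSQL and reads the result from a data frame named outputToUSQL; those names, the Value column, and the summary computed are assumptions for illustration, not taken from the code above. The first assignment is only a stand-in so the sketch runs outside U-SQL.

```r
# Stand-in for the data frame the reducer would supply (assumption: the
# extension exposes the rows of one partition under the name inputFromUSQL).
inputFromUSQL <- data.frame(Par = c(1, 1, 1), Value = c(0.5, 1.5, 4.0))

# Hypothetical script body: summarise the rows of the current partition.
# The READONLY column Par is not returned from R.
outputToUSQL <- data.frame(
  MeanValue = mean(inputFromUSQL$Value),
  RowCount = nrow(inputFromUSQL)
)
print(outputToUSQL)
```

The columns returned here would have to match the ones listed after PRODUCE in the U-SQL statement.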

Related

r circlize: missing value where TRUE/FALSE needed

I am trying to plot (for the first time) a chord diagram with the circlize package in RStudio. I am going through the manual chapters (Circular Visualization in R). The first step is to allocate the sectors on a circle using the circos.initialize command. However, when I get to this step, I get an error stating "missing value where TRUE/FALSE needed".
A reproducible example
library(circlize)
Types <- data.frame(Types = c("OOP", "UVA", "MAT", "OIC", "FIN", "WSE"))
stack.df <- data.frame(Year = c(rep(2019, 1), rep(2020, 4), rep(2021, 7), rep(2022, 11), rep(2023, 11)), Invoice = c(paste0("2019.", "10", ".INV"),
paste0("2020.", seq(from = 20, to = 23, by = 1), ".INV"),
paste0("2021.", seq(from = 30, to = 36, by = 1), ".INV"),
paste0("2022.", seq(from = 40, to = 50, by = 1), ".INV"),
paste0("2023.", seq(from = 50, to = 60, by = 1), ".INV")))
stack.df <- cbind(stack.df, Org_1 = Types[sample(nrow(Types), nrow(stack.df), replace = TRUE), ], Org_2 = Types[sample(nrow(Types), nrow(stack.df), replace = TRUE), ])
Making Chord Diagram
My overall objective: Make a chord diagram where the sectors are the stack.df$Year and track 1 is the stack.df$Invoice, with the circos.links from stack.df$Org_1 to stack.df$Org_2.
Initialize
circos.initialize(sectors = stack.df$Year, x = stack.df$Invoice)
Error in if (sector.range[i] == 0) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In circos.initialize(sectors = stack.df$Year, x = stack.df$Invoice) :
NAs introduced by coercion
What am I missing? My sector.range is not 0, since stack.df$Year runs from 2019 to 2023. Any help in overcoming this error is greatly appreciated.
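The warning "NAs introduced by coercion" suggests that the x values are being converted to numbers, and invoice labels such as "2019.10.INV" cannot be converted. A small base R sketch of that coercion, together with one hypothetical way to build numeric positions (the running index of each invoice within its year), is shown below. The sample values are shortened from the question's data, and whether circos.initialize then accepts the result is not verified here; a sector with a single point may also need an explicit xlim.

```r
# Shortened sample of the question's columns.
invoice <- c("2019.10.INV", "2020.20.INV", "2020.21.INV")
year <- c(2019, 2020, 2020)

# Coercing the labels to numeric yields NA for every element.
x_numeric <- suppressWarnings(as.numeric(invoice))
print(x_numeric)

# Hypothetical replacement: position of each invoice within its year.
x_index <- ave(seq_along(invoice), year, FUN = seq_along)
print(x_index)
```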

How to define printing whole graph in ggplot2?

I am using a package called microbiomeMarker to plot a cladogram (ggtree). It was taking too long to print, so I used the Cairo package to make the printing faster (there's a considerable change, thankfully).
cladogram<-plot_cladogram(
OTU_lefse_res,
color = c("#7570B3","#D95F02"),
only_marker = TRUE,
branch_size = 0.2, #beyond this point everything is default values
alpha = 0.2,
node_size_scale = 1,
node_size_offset = 1,
clade_label_level = 4,
clade_label_font_size = 4,
annotation_shape = 22,
annotation_shape_size = 5,
group_legend_param = list(),
marker_legend_param = list()
)
ggsave(
filename = "16S_Marker_Cladogram.png",
units = "px",
dpi = 300,
type = "cairo-png"
)
My issue is that when it gets saved, the graph is zoomed in (as far as I can figure to the bottom right corner of the legend) so I can't see the cladogram. I've also tried saving it as a pdf, changing the dpi to 900, defining the size by pixels, setting in the ggplot "coord_fixed()"... I'm at a loss, I googled the issue but I can't find anyone having a similar issue. The example data of the package works fine. I've got about 10x the amount of data to be plotted, so I expect some delay, but I don't understand why I can't get it right.
If I don't set a printing device and just do:
plot_cladogram(
OTU_lefse_res,
color = c("#7570B3","#D95F02"),
only_marker = TRUE,
branch_size = 0.2,
alpha = 0.2,
node_size_scale = 1,
node_size_offset = 1,
clade_label_level = 4,
clade_label_font_size = 4,
annotation_shape = 22,
annotation_shape_size = 5,
group_legend_param = list(),
marker_legend_param = list()
)
I don't get any output in the plot window. Any ideas of something else I can try?

Error while running WTC (Wavelet Coherence) Codes in R

I am doing wavelet analysis in R using the biwavelet package. However, I receive the error message:
Error in check.datum(y) :
The step size must be constant (see approx function to interpolate)
When I run the following code:
wtc.AB = wtc(t1, t2, nrands = nrands)
Any help is appreciated. The complete code is:
# Import your data
Data <- read.csv("https://dl.dropboxusercontent.com/u/18255955/Tutorials/Commodities.csv")
# Attach your data so that you can access variables directly using their
# names
attach(Data)
# Define two sets of variables with time stamps
t1 = cbind(DATE, ISLX)
t2 = cbind(DATE, GOLD)
# Specify the number of iterations. The more, the better (>1000). For the
# purpose of this tutorial, we just set it = 10
nrands = 10
wtc.AB = wtc(t1, t2, nrands = nrands)
# Plotting a graph
par(oma = c(0, 0, 0, 1), mar = c(5, 4, 5, 5) + 0.1)
plot(wtc.AB, plot.phase = TRUE, lty.coi = 1, col.coi = "grey", lwd.coi = 2,
lwd.sig = 2, arrow.lwd = 0.03, arrow.len = 0.12, ylab = "Scale", xlab = "Period",
plot.cb = TRUE, main = "Wavelet Coherence: A vs B")
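The error message suggests that the time column passed to wtc (the first column of t1 and t2) is not evenly spaced. A base R sketch of how one might check this, and one hypothetical workaround that replaces the time stamps with the observation index, follows. The dates and prices are made-up sample values, not taken from Commodities.csv.

```r
# Made-up trading dates with a weekend gap between the first two entries.
dates <- as.Date(c("2015-01-02", "2015-01-05", "2015-01-06", "2015-01-07"))
gold <- c(1180.5, 1204.0, 1219.4, 1210.7)

# The spacing is regular only if every consecutive difference is the same.
steps <- diff(as.numeric(dates))
is_regular <- length(unique(steps)) == 1
print(is_regular)

# Hypothetical workaround: use the observation index as the time column,
# which has a constant step of 1 by construction.
t2 <- cbind(seq_along(gold), gold)
print(diff(t2[, 1]))
```

Using the index discards the real calendar spacing; interpolating onto a regular grid, as the error message hints, would be the alternative.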

How do I specify numerical and categorical variables in catboost with R?

The tutorial for catboost with R says this:
library(catboost)
countries = c('RUS','USA','SUI')
years = c(1900,1896,1896)
phone_codes = c(7,1,41)
domains = c('ru','us','ch')
dataset = data.frame(countries, years, phone_codes, domains)
label_values = c(0,1,1)
fit_params <- list(iterations = 100,
loss_function = 'Logloss',
ignored_features = c(4,9),
border_count = 32,
depth = 5,
learning_rate = 0.03,
l2_leaf_reg = 3.5)
pool = catboost.load_pool(dataset, label = label_values, cat_features = c(0,3))
model <- catboost.train(pool, params = fit_params)
However, this results in:
Error in catboost.from_data_frame(data, label, pairs, weight, group_id, :
Unsupported column type: character
Many thanks,
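The error indicates that the data frame contains character columns. One hedged reading is that the pool loader wants categorical features stored as factors; under that assumption, the character columns can be converted in base R before the pool is built. The sketch below only prepares the data frame and does not call catboost.

```r
# Same columns as in the tutorial snippet; stringsAsFactors = FALSE keeps the
# text columns as character, which reproduces the situation in the error.
dataset <- data.frame(
  countries = c('RUS', 'USA', 'SUI'),
  years = c(1900, 1896, 1896),
  phone_codes = c(7, 1, 41),
  domains = c('ru', 'us', 'ch'),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are untouched.
is_char <- vapply(dataset, is.character, logical(1))
dataset[is_char] <- lapply(dataset[is_char], factor)
str(dataset)
```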

Why write.fwf() did not follow the fixed width set

I want to write a set of randomly generated numbers to a text file in a fixed-width format. But for some reason, write.fwf only wrote the first column correctly; all the other columns got one extra character. How can I fix it? Thanks!
set.seed(1899)
library(sensitivity)
library(randtoolbox)
par_lower <- c( 0.12, 0.13, 0.038, 0.017)
par_upper <- c(12.00, 13.00, 3.800, 1.700)
sample_size <- 5
lim_para8 <- c(par_lower[1], par_upper[1])
lim_para9 <- c(par_lower[2], par_upper[2])
lim_parb8 <- c(par_lower[3], par_upper[3])
lim_parb9 <- c(par_lower[4], par_upper[4])
par_rand <- parameterSets(par.ranges = list(lim_para8, lim_para9,
lim_parb8, lim_parb9),
samples = sample_size, method = "sobol")
par_rand
# write to file
library(gdata)
file2write <- paste("par.txt", sep = "")
write.fwf(par_rand, file = file2write, width = c(10, 10, 10, 10), colnames = FALSE)
The results:
6.060 6.56500 1.91900 0.858500
9.030 3.34750 2.85950 0.437750
3.090 9.78250 0.97850 1.279250
4.575 4.95625 2.38925 0.227375
10.515 11.39125 0.50825 1.068875
If I change it to
write.fwf(par_rand, file = file2write, width = c(10, 9, 9, 9),
colnames = FALSE, quote = FALSE, rownames = FALSE)
I got this error
Error in write.fwf(par_rand, file = file2write, width = c(10, 9, 9, 9), :
'width' (9) was too small for columns: V4
'width' should be at least (10)
Please try the code below, it works for me. I tested with several formats and all worked. Both code segments return a fixed format file with width 4 x 10.
This of course implies that setting sep in the definition of file2write does not work for getting the desired output with write.fwf
write.fwf(par_rand, file = "par2.txt", width = c(10, 10, 10, 10), colnames = FALSE, sep = "")
write.fwf(par_rand, file = file2write, width = c(10, 10, 10, 10), colnames = FALSE, sep = "")
The following generates the same output but with column widths of 1 x 10 and 3 x 9, as I think you wanted:
write.fwf(par_rand, file = "par3.txt", width = c(10, 9, 9, 9), colnames = FALSE, sep = "")
Please let me know whether this is what you wanted.
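The extra character per column is consistent with a one-space separator being written between fields. The arithmetic can be illustrated in base R without gdata: four fields of width 10 give a 40-character line when joined with an empty separator and a 43-character line when joined with single spaces. The values below are the first row of the output shown in the question, and formatC is used here only as a stand-in for the padding that write.fwf performs.

```r
# First row of the sample output, padded to a width of 10 per field.
values <- c(6.06, 6.565, 1.919, 0.8585)
fields <- formatC(values, width = 10, format = "f", digits = 4)

# Joining with "" keeps the line at 4 x 10 characters; joining with " "
# adds one character for each of the three gaps between fields.
line_no_sep <- paste(fields, collapse = "")
line_space_sep <- paste(fields, collapse = " ")
print(nchar(line_no_sep))
print(nchar(line_space_sep))
```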
