Polars Rust melt() significantly slower than R stack()

I have some R code that takes a wide data.frame and stacks it into a narrow one. I rewrote this in Rust, but I am finding it painfully slow. I am wondering whether I am doing something that kills performance.
Original R version:
df = cbind(df[ncol(df)], df[ncol(df)-3], df[ncol(df)-2], df[ncol(df)-1], stack(df[1:(ncol(df)-4)]))
The stack(df[1:(ncol(df)-4)]) part takes all but the last 4 columns (usually 1,000) and stacks them. It also creates a second column which indicates which column a row came from. Then I cbind the other 4 columns back to it. R automatically repeats them to match the new length of the narrow df.
Here is my Polars eager version:
let n = 1000;
let sample_cols = (0..n).collect::<Vec<i32>>()
    .par_iter()
    .map(|l| format!("{}", l))
    .collect::<Vec<String>>();
let mut df = df.melt(&["A", "B", "C", "D"], sample_cols).unwrap();
sample_cols is a Vec containing the column names to be stacked, which are strings from 0 to 999, for the 1000 samples.
Here is the lazy version:
let n = 1000;
let sample_cols = (0..n).collect::<Vec<i32>>()
    .par_iter()
    .map(|l| format!("{}", l))
    .collect::<Vec<String>>();
let melt_args = MeltArgs {
    id_vars: vec!["A".into(), "B".into(), "C".into(), "D".into()],
    value_vars: sample_cols,
    variable_name: None,
    value_name: None,
};
let mut df = df.lazy().melt(melt_args).collect()?;
Both Rust versions run at a similar speed, but much slower than R. With n = 100,000 the R code takes 0.45s on average, and sometimes as little as 0.23s, while both Rust versions take 13.5s to 14.5s.
If you would like to run it yourself, the following should generate dummy data and run the melt. Just make sure to use only the eager or the lazy version at a time:
use rand_distr::{Normal, Distribution};
use rayon::prelude::*;
use ndarray::Array2;
#[macro_use]
extern crate fstrings;
use polars::prelude::*;
use std::time::Instant;

// Fills a (means.len() x n) matrix; row i holds n draws from Normal(means[i], sds[i]).
fn multi_rnorm(n: usize, means: Vec<f64>, sds: Vec<f64>) -> Array2<f64> {
    let mut preds: Array2<f64> = Array2::zeros((means.len(), n));
    preds.axis_iter_mut(ndarray::Axis(0)).into_par_iter().enumerate().for_each(|(i, mut row)| {
        let mut rng = rand::thread_rng();
        (0..n).into_iter().for_each(|j| {
            let normal = Normal::new(means[i], sds[i]).unwrap();
            row[j as usize] = normal.sample(&mut rng);
        })
    });
    preds
}
fn main() -> std::result::Result<(), PolarsError> {
    let n = 100000;
    let means: Vec<f64> = vec![0.0; 15];
    let sds: Vec<f64> = vec![1.0; 15];
    // 15 rows by n columns of normally distributed draws
    let preds = multi_rnorm(n as usize, means, sds);
    // One Series per sample column, named "0" through "n-1"
    let mut df: DataFrame = DataFrame::new(
        preds.axis_iter(ndarray::Axis(1))
            .into_par_iter()
            .enumerate()
            .map(|(i, col)| {
                Series::new(
                    &f!("{i}"),
                    col.to_vec()
                )
            })
            .collect::<Vec<Series>>()
    )?;
    // Names of the columns to be stacked
    let sample_cols = (0..n).collect::<Vec<i32>>()
        .par_iter()
        .map(|l| format!("{}", l))
        .collect::<Vec<String>>();
    // The four id columns that are repeated for every stacked column
    df.with_column(Series::new("A", &["1", "2", "3", "1", "2", "3'", "1", "2", "3", "1", "2", "3", "1", "2", "3"]))?;
    df.with_column(Series::new("B", &["1", "1", "1", "2", "2", "2", "3", "3", "3", "4", "4", "4", "5", "5", "5"]))?;
    df.with_column(Series::new("C", &["1", "2", "3", "1", "2", "3'", "1", "2", "2", "1", "2", "3'", "1", "2", "3"]))?;
    df.with_column(Series::new("D", (0..df.shape().0 as i32).collect::<Vec<i32>>()))?;

    // Lazy version (works on clones, made outside the timed region,
    // so that the eager version below can still use df and sample_cols)
    let melt_args = MeltArgs {
        id_vars: vec!["A".into(), "B".into(), "C".into(), "D".into()],
        value_vars: sample_cols.clone(),
        variable_name: None,
        value_name: None,
    };
    let lazy_input = df.clone().lazy();
    let start = Instant::now();
    let melted_lazy = lazy_input.melt(melt_args).collect()?;
    let duration = start.elapsed();
    println!("lazy: {:?} {:?}", duration, melted_lazy.shape());

    // Eager version
    let start = Instant::now();
    let melted_eager = df.melt(&["A", "B", "C", "D"], &sample_cols).unwrap();
    let duration = start.elapsed();
    println!("eager: {:?} {:?}", duration, melted_eager.shape());
    Ok(())
}

I submitted an issue on GitHub, and the existing implementation was improved from O(n^2) to O(n). It is now faster than R. The fix is not part of the latest release, so you will need to install from GitHub instead of crates.io.
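To illustrate what a linear-time melt looks like in general, here is a minimal sketch in plain Rust using only the standard library. It is not the Polars implementation, and the names Melted and melt_linear are hypothetical. It assumes every column has the same number of rows, allocates each output buffer once at its final size, and copies every input value exactly once, so the work grows linearly with the size of the output.

```rust
/// Long-format result of melting a wide table (hypothetical type for this sketch).
#[derive(Debug)]
struct Melted {
    /// One vector per id column, each repeated once per stacked column.
    ids: Vec<Vec<String>>,
    /// Name of the source column for every stacked row.
    variable: Vec<String>,
    /// The stacked values.
    value: Vec<f64>,
}

/// Stacks `value_cols` on top of each other and repeats every id column,
/// in the same column-by-column order that R's stack() produces.
fn melt_linear(id_cols: &[Vec<String>], value_cols: &[(String, Vec<f64>)]) -> Melted {
    let n_rows = value_cols.first().map_or(0, |(_, v)| v.len());
    let total = n_rows * value_cols.len();

    // Allocate every output buffer once, at its final size.
    let mut ids: Vec<Vec<String>> = id_cols
        .iter()
        .map(|_| Vec::with_capacity(total))
        .collect();
    let mut variable: Vec<String> = Vec::with_capacity(total);
    let mut value: Vec<f64> = Vec::with_capacity(total);

    for (name, col) in value_cols {
        assert_eq!(col.len(), n_rows, "all value columns must have the same length");
        // Append this column's values and tag each row with the column name.
        value.extend_from_slice(col);
        variable.extend(std::iter::repeat(name.clone()).take(n_rows));
        // Repeat the id columns alongside the stacked values.
        for (out, id) in ids.iter_mut().zip(id_cols) {
            assert_eq!(id.len(), n_rows, "id columns must match the value columns");
            out.extend_from_slice(id);
        }
    }
    Melted { ids, variable, value }
}
```

Calling melt_linear with one id column ["a", "b"] and two value columns named "0" and "1" yields four rows, with the id column repeated as a, b, a, b.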

Related

Octave: concise way of creating and initializing a struct

I have a cell array of strings of length 3
headers_ca =
{
[1,1] = time
[1,2] = x
[1,3] = y
}
I want to create a struct that mimics a python dict, with the values in headers_ca as keys (fieldnames in Octave) and an initializer value ival for all entries.
It has to be a struct: although dict exists in Octave, it has been deprecated.
I could do (brute force) s = struct("time", ival, "x", ival, "y", ival);
What is the most concise way to do this?
I know I can do a for loop.
Can it be avoided?
I would be working with much longer cell arrays.
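You can use either struct or cell2struct to create the structure.
% Option 1: struct, with the names and values interleaved
headers_ca = {'time','x','y'};
headers_ca(2, :) = {ival};
s = struct(headers_ca{:});
% Option 2: cell2struct, with a cell array of values and the field names
headers_ca = {'time','x','y'};
ivals = repmat({ival}, numel(headers_ca), 1);
s = cell2struct(ivals, headers_ca);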

How to iterate over all ways of choosing n b-bit arrays from all possible b-bit arrays?

There are 2^b b-bit arrays. There are "2^b choose n" different ways of choosing n b-bit arrays. I would like to iterate over all "2^b choose n" different ways of choosing n b-bit arrays. Clearly this is only possible in a realistic time frame if b and n are both small.
How could I do that in Julia?
You can use combinations from Combinatorics.jl to generate the various combinations. And, depending on what you're looking for, you can use either string or bitstring to convert integers into their binary representation:
julia> string(123, base=2)
"1111011"
julia> bitstring(123)
"0000000000000000000000000000000000000000000000000000000001111011"
For brevity, I will stick with string. Here's an example of the full calculation for the case of b = 3 and n = 2:
julia> using Combinatorics
julia> r = 0:2^3-1
0:7
julia> b = string.(r, base=2)
8-element Array{String,1}:
"0"
"1"
"10"
"11"
"100"
"101"
"110"
"111"
julia> combs = combinations(b, 2);
julia> foreach(println, combs)
["0", "1"]
["0", "10"]
["0", "11"]
["0", "100"]
["0", "101"]
["0", "110"]
["0", "111"]
["1", "10"]
["1", "11"]
["1", "100"]
["1", "101"]
["1", "110"]
["1", "111"]
["10", "11"]
["10", "100"]
["10", "101"]
["10", "110"]
["10", "111"]
["11", "100"]
["11", "101"]
["11", "110"]
["11", "111"]
["100", "101"]
["100", "110"]
["100", "111"]
["101", "110"]
["101", "111"]
["110", "111"]
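The same enumeration can also be written without a combinatorics library. The following is a minimal sketch in Rust (chosen only for illustration; the function names choose_bit_arrays and to_bits are hypothetical). It represents each b-bit array as an integer in 0..2^b, walks all n-element subsets in lexicographic order, and formats each integer as a zero-padded bit string. Like the Julia version, it is only practical for small b and n.

```rust
/// Returns every way of choosing `n` distinct values from 0..2^b,
/// in lexicographic order. Each value stands for one b-bit array.
fn choose_bit_arrays(b: u32, n: usize) -> Vec<Vec<u32>> {
    let total = 1usize << b; // number of distinct b-bit arrays
    let mut out = Vec::new();
    if n > total {
        return out; // cannot choose more arrays than exist
    }
    // Start with the first combination: 0, 1, ..., n-1.
    let mut idx: Vec<usize> = (0..n).collect();
    loop {
        out.push(idx.iter().map(|&i| i as u32).collect());
        // Find the rightmost position that can still be incremented.
        let mut i = n;
        while i > 0 && idx[i - 1] == total - n + (i - 1) {
            i -= 1;
        }
        if i == 0 {
            break; // the last combination has been emitted
        }
        idx[i - 1] += 1;
        // Reset everything to the right to the smallest valid values.
        for j in i..n {
            idx[j] = idx[j - 1] + 1;
        }
    }
    out
}

/// Formats a value as a bit string padded to `b` digits.
fn to_bits(value: u32, b: u32) -> String {
    format!("{:0width$b}", value, width = b as usize)
}
```

For b = 3 and n = 2 this produces the same 28 pairs as the listing above, except that to_bits pads each entry to three digits (for example "001" instead of "1").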

How to deal with factors in Rcpp

I'm attempting to learn how to use Rcpp in R. Can someone please point out what the problems are with this code? There's probably more than one issue.
When the c object is entered into fun() at the bottom of the code I want it to output a vector/array with the values "Home", "Elsewhere" or "Number".
I'm finding the data type slightly confusing here. My original data set is a factor. If I pass it to storage.mode(), it returns integer. I assume then that I have to declare the x argument as IntegerVector. This confuses me because the data contains letters, i.e. "H" and "E", so how can the data be integer?
When I write == "H" in the if statement, I don't know whether it understands what I mean.
library(Rcpp)
c <- factor(c("E", "H", "E", "12", "10", "60", "80", "11", "H", "H"))
class(c)
storage.mode(c)
cppFunction(' IntegerVector fun(IntegerVector x){
// creates an empty character vector the size/length of x.
CharacterVector y = x.size() ;
int n = x.size() - 1 ;
//loop
for(int i = 0; i <= n; i = i + 1){
if(x[i] == "H"){
y[i] = "Home" ;
}else if(x[i] == "E"){
y[i] = "Elsewhere" ;
}else{
y[i] = "Number" ;
} ;
}
return y ;
}')
fun(c)
Note: Throughout, I will refer to f, not c. It is bad practice to name variables the same name as a builtin function or constant, such as c, T, or F. Therefore I change the beginning of your code as follows:
library(Rcpp)
f <- factor(c("E", "H", "E", "12", "10", "60", "80", "11", "H", "H"))
In addition to looking at class(f) and storage.mode(f), it's useful to look at str(f):
str(f)
# Factor w/ 7 levels "10","11","12",..: 6 7 6 3 1 4 5 2 7 7
In truth, a factor is an integer vector with "levels": a character vector corresponding to each unique integer value. Luckily, you can get this from C++ using the .attr() member function of Rcpp::IntegerVector:
cppFunction('CharacterVector fun(IntegerVector x){
// creates an empty character vector the size/length of x.
CharacterVector y = x.size() ;
// Get the levels of x
CharacterVector levs = x.attr("levels");
int n = x.size() - 1 ;
//loop
for(int i = 0; i <= n; i = i + 1){
if(levs[x[i]-1] == "H"){
y[i] = "Home" ;
}else if(levs[x[i]-1] == "E"){
y[i] = "Elsewhere" ;
}else{
y[i] = "Number" ;
} ;
}
return y ;
}')
fun(f)
# [1] "Elsewhere" "Home" "Elsewhere" "Number" "Number" "Number"
# [7] "Number" "Number" "Home" "Home"
So, to get what you want, you had to do three things:
Change the return type from IntegerVector to CharacterVector (though you were completely right that the input should be IntegerVector)
Get the levels of the factor using CharacterVector levs = x.attr("levels");
Compare levs[x[i]-1] to "H", etc., rather than x[i] -- x[i] will always be an integer, giving the element of the vector of levels it corresponds to. We do -1 since C++ is 0-indexed and R is 1-indexed.
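The relationship between the integer codes and the levels can be modeled outside of R as well. The sketch below is plain Rust and only an illustration of the data layout, not of how R or Rcpp store factors internally; Factor, make_factor, and recode are hypothetical names. It assumes the levels are the sorted distinct labels (R's sort order depends on the locale, so the byte-wise sort used here is an assumption that happens to agree for this data) and that the codes are 1-based positions into that table.

```rust
use std::collections::BTreeSet;

/// A factor: 1-based integer codes plus the sorted table of distinct labels.
struct Factor {
    codes: Vec<usize>,
    levels: Vec<String>,
}

/// Builds a factor from raw labels, mimicking what str(f) displays in R.
fn make_factor(values: &[&str]) -> Factor {
    // Distinct labels in sorted order (BTreeSet keeps its elements sorted).
    let levels: Vec<String> = values
        .iter()
        .map(|v| v.to_string())
        .collect::<BTreeSet<String>>()
        .into_iter()
        .collect();
    // Each code is the 1-based position of the label in `levels`.
    let codes = values
        .iter()
        .map(|&v| {
            levels
                .binary_search_by(|l| l.as_str().cmp(v))
                .expect("every value is one of the levels")
                + 1
        })
        .collect();
    Factor { codes, levels }
}

/// Looks up each code in the levels table (subtracting 1 for 0-based
/// indexing) and maps the label to "Home", "Elsewhere", or "Number".
fn recode(f: &Factor) -> Vec<&'static str> {
    f.codes
        .iter()
        .map(|&code| match f.levels[code - 1].as_str() {
            "H" => "Home",
            "E" => "Elsewhere",
            _ => "Number",
        })
        .collect()
}
```

For the ten labels used in the question, make_factor yields the codes 6 7 6 3 1 4 5 2 7 7, matching the str(f) output, and recode plays the role of the corrected fun().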
Other notes:
It is clear, as you say, that "[you're] attempting to learn how to use Rcpp() in R." You'll definitely want to spend some time with resources such as Rcpp for Everyone (which has a chapter on factors), the Rcpp Gallery (which has an article on factors), Hadley's chapter on Rcpp, and the Rcpp vignettes.

Qt qsTr() handling plurals

I'm developing an application and want to translate some text. The problem is handling plurals in QML.
In C++, handling plurals is as simple as:
int n = messages.count();
showMessage(tr("%n message(s) saved", 0, n));
and this translates without problems.
source:
https://doc.qt.io/qt-5/i18n-source-translation.html#handling-plurals
When I try to do the same in QML, it doesn't work. After a careful review of some literature and comments, I found a "solution" that is actually taken from people reporting a bug:
var second = qsTr("%b radios", "0", map.radio).arg(map.radio)
source: https://bugreports.qt.io/browse/QTBUG-11579
When I run lupdate, Qt Linguist shows the two fields for the singular and plural forms, but the translation does not work in the application.
I tried several modifications such as:
var a = map.totalSongs;
var first = qsTr("%a songs", "0", parseInt(a))
var second = qsTr("%b radios", "0", map.radio)
var first = qsTr("%a songs", "0", parseInt(a)).arg(map.totalSongs)
var second = qsTr("%b radios", "0", map.radio).arg(map.radio)
var first = qsTr("%a songs", "0", a)
var second = qsTr("%b radios", "0", b)
In Qt Linguist I'm writing the translation:
%b radio - Singular
%b radios - Plural
Any modification fails to work.
Can someone tell me how to use qsTr() to handle plurals?
Another related question:
Let's say I want to have the text "%1 songs - %2 radios", which in Spanish should result in
//As example
if(%1 = 10 && %2 = 10) => "10 canciones - 10 radios"
else if(%1 = 1 && %2 = 10) => "1 cancion - 10 radios"
else if(%1 = 10 && %2 = 1) => "10 canciones - 1 radio"
How can I do this? I think neither qsTr() nor tr() can handle this situation, but I just want to verify it with you guys :D
Thanks in advance
I couldn't accept that it was not working, so I dug a little further and found a solution. It may seem obvious, but I don't think it is.
Doesn't work:
var a = map.totalSongs;
var first = qsTr("%a songs", "0", a)
Works, because the placeholder is %n:
var n = map.radio;
var first = qsTr("%n songs", "0", n)

lua:90: attempt to index field '?' (a nil value)

I cannot get my 2 dimensional variable to be recognized correctly when I call it. When I print it, it seems to work fine, but when I attempt to call it from within a function it goes bananas on me.
Here is my code:
math.randomseed(os.time())
math.random(); math.random(); math.random()
--init
local t = ""
--t == type
local year = 2014
--year is placeholder with no real value.
local i = 1
local x = 0
local y = 0
local z = 0
local o = 0
--
local l = 0
local l1 = 0
local l2 = 0
--
local h = 1
--Junk Variables
local month = {"01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"}
local days = {0, 0, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}
--Days linked to months, weeks come as calculated
local m = 1
--"m" is the real month value to be used within the code as the literal value.
fd = {} -- create the matrix
for y=1,5 do
fd[y] = {} -- create a new row
for z=1,5 do
fd[y][z] = 0
end
end
--fd = Family/Day
na = {1, 2, 3, 4 ,5}
--numbers not allocated for the day
fv = {}
--[12][days[m]][5]
--final value to be printed literally into string of txt file.
local s = ""
--string variable
io.write("Please enter a month (ONLY USE NUMBERS)")
io.flush()
m=io.read()
io.write("Please enter a chore creation type (daily, weekly, monthly [Case sensitive])")
t=io.read()
--
m = tonumber(m)
--
for y=1,12 do
fv[y] = {}
for z=1,days[m] do
fv[y][z] = {}
for o=1,5 do
fv[y][z][o] = 0
end
end
end
--
if t == "daily" then
local f,err = io.open("ChoreDaily.txt","w")
if not f then return print(err) end
f:write("")
--
repeat
i = 0
y = 0
print(">>")
repeat
if h <= days[m] then
--
repeat
if h <= days[m] then
--
os.execute("cls")
l1 = math.random(1,2)
l2 = math.random(3,4)
l = math.random(l1,l2)
repeat
o = math.random(1,5)
l = l-1
until l == 0
--
if y == 0 then
--
if na[o] > 0 then
if x < 4 then
s = s .. tostring(na[o]) .. ", "
elseif x >= 4 then
s = s .. tostring(na[o])
end
fd[x][y] = na[o] -- this is the problem.
na[o] = 0
x = x+1
print("!")
end
--
I think it's pretty obvious what I am attempting to make overall: it's a chore list creator. It's pretty primitive and I was hoping I could do it all myself, but if I can't use 2-dimensional variables I'm not going to be able to go much further.
There are some unused variables and whatnot hanging around. I plan to get rid of those later.
x is initialized to 0 and is not changed before you try to access fd[x][y]. But the valid indices of the table fd run from 1 to 5, so fd[x] is nil here and fd[x][y] cannot be indexed.
