Stratified Sampling of a DataFrame - julia

Given a DataFrame with columns "a", "b", and "value", I'd like to sample N rows from each ("a", "b") group. In Python with pandas, this is easy to do with the following syntax:
import pandas as pd
df.groupby(["a", "b"]).sample(n=10)
In Julia, I found a way to achieve something similar using:
using DataFrames, StatsBase
combine(groupby(df, [:a, :b]),
names(df) .=> sample .=> names(df)
)
However, I don't know how to extend this to n>1. I tried
combine(groupby(df, [:a, :b]),
names(df) .=> x -> sample(x, n) .=> names(df)
)
but this returned the error (for n=3):
DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 7")
One method (with slightly different syntax) that I found was:
combine(groupby(df, [:a, :b]), x -> x[sample(1:nrow(x), n), :])
but I'm interested in knowing whether there are better alternatives.

As an additional comment: if your data frame has an id column (holding the row number), then:
df[combine(groupby(df, [:a, :b]), :id => (x -> rand(x, n)) => :id).id, :]
will be a bit faster (but not by much).
Here is an example:
using DataFrames
n = 10
df = DataFrame(a=rand(1:1000, 10^8), b=rand(1:1000, 10^8), id=1:10^8)
combine(groupby(df, [:a, :b]), x -> x[rand(1:nrow(x), n), :]); # around 16.5 seconds on my laptop
df[combine(groupby(df, [:a, :b]), :id => (x -> rand(x, n)) => :id).id, :]; # around 14 seconds on my laptop
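For comparison with the pandas call in the question, the group-then-sample idea can be sketched in plain Python. This is only an illustration under assumptions: stratified_sample and its replace and seed parameters are hypothetical names, not part of any library. It makes one difference explicit: rand(x, n) in the Julia snippets above draws with replacement, while pandas' sample draws without replacement by default.

```python
import random
from collections import defaultdict

def stratified_sample(rows, keys, n, replace=False, seed=None):
    """Draw n rows from every group of `rows` that shares the same `keys` values.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    picked = []
    for members in groups.values():
        if replace:
            # Like Julia's rand(x, n): the same row may be drawn more than once.
            picked.extend(rng.choices(members, k=n))
        else:
            # Like the pandas default: no duplicates. Capping n at the group
            # size is a choice made here; pandas raises an error instead.
            picked.extend(rng.sample(members, min(n, len(members))))
    return picked

# Toy data: 6 groups of 10 rows each.
rows = [{"a": i % 2, "b": i % 3, "value": i} for i in range(60)]
picked = stratified_sample(rows, ["a", "b"], n=3, seed=0)
```

With replace=False every sampled row within a group is distinct, which is usually what "stratified sampling" means; with replace=True small groups can still return n rows.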

Related

Apply function to cartesian product of numeric and function type

I have a function
eval_ = function(f, i) f(i)
for a list of functions, say
fns = list(function(x) x**2, function(y) -y)
and a vector of integers, say
is = 1:2
I would like to get eval_ evaluated at all combinations of fns and is.
I tried the following:
cross = expand.grid(fns, is)
names(cross) = c("f", "i")
results = sapply(1:nrow(cross), function(i) do.call(eval_, cross[i,]))
This throws an error:
Error in f(i) : could not find function "f"
I think the underlying problem is that cross is a data.frame and cannot carry functions directly. Hence, it wraps each function in a list and carries that list instead (indeed, class(cross[1,][[1]]) yields "list"). My ugly hack is to change the third line to:
results = sapply(
1:nrow(cross),
function(i) do.call(eval_, list(f = cross[i,1][[1]], i = cross[i,2]))
)
results
#[1] 1 -1 4 -2
This works, but it defeats the purpose of do.call and is very cumbersome.
Is there a nice solution for this kind of problem?
Note: I would like a solution that generalizes well to cases where the cross product is not only over two, but possibly an arbitrary amount of lists, e.g. functions that map R^n into R.
Edit:
For a more involved example, I think of the following:
fns = list(mean, sum, median)
is1 = c(1, 2, 4, 9), ..., isn = c(3,6,1,2) and my goal is to evaluate the functions on the cartesian product spanned by is1, ..., isn, e.g. on the n-dimensional vector c(4, ..., 6).
You can use mapply() for this:
eval_ <- function(f, i) f(i)
fns <- list(function(x) x**2, function(y) -y)
is <- 1:2
cross <- expand.grid(fns = fns, is = is)
cross$result <- mapply(eval_, cross$fns, cross$is)
print(cross)
#> fns is result
#> 1 function (x) , x^2 1 1
#> 2 function (y) , -y 1 -1
#> 3 function (x) , x^2 2 4
#> 4 function (y) , -y 2 -2
An attempt for my "more involved example" with n = 2.
Let X = expand.grid(c(1, 2, 4, 9), c(3,6,1,2)).
The following pattern generalizes to higher dimensions:
nfns = length(fns)
nn = nrow(X)
res = array(0, c(nfns, nn))
for(i in 1:nfns){
res[i,] = apply(X, MARGIN = 1, FUN = fns[[i]])
}
The shape of the margin of X (i.e. nrow(X)) must correspond to the shape of the slice res[i,] (i.e. nn). The function must map the complement of the margin of X (i.e. slices of the form X[i,]) to a scalar. Note that a function that is not scalar has components that are scalar, i.e. in a non-scalar case, we would loop over all components of the function.
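The same pattern (one result row per function, one column per point of the cartesian product) can be sketched in Python for comparison. This is an illustration, not a translation of the R code: itertools.product varies the last factor fastest, whereas expand.grid varies the first factor fastest, so the column order differs.

```python
from itertools import product
from statistics import mean, median

fns = [mean, sum, median]
is1 = [1, 2, 4, 9]
is2 = [3, 6, 1, 2]

# Every point of the cartesian product is1 x is2, as a tuple.
points = list(product(is1, is2))

# res[i][j] is fns[i] evaluated on points[j]; each function maps a tuple to a scalar.
res = [[f(p) for p in points] for f in fns]
```

Adding a third vector is3 only changes the call to product(is1, is2, is3); the rest stays the same.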

Heatmap in Julia

I am trying to create a heatmap and have the following code. Is there a simple way to create a heatmap such that lst1 is the x-axis, lst2 is the y-axis and lst3 is the intensity in the graph?
lst1 = []
lst2 = []
lst3 = []
for i in range(0,3.5,step = 0.5)
    for j in range(0,4,step = 0.5)
        println(i,j)
        a = f(parameter,i,j)
        push!(lst1,i)
        push!(lst2,j)
        push!(lst3,a)
        print("($i , $j): $a %")
    end
end
A somewhat shorter way of doing this, utilizing broadcasts, might be:
lst1 = 0:0.5:3.5
lst2 = 0:0.5:4
lst3 = f.(Ref(parameter), lst1, lst2')
lst1 and lst2 are constructed using the colon operator but are equivalent to the range call you showed.
lst3 is constructed using the Julia broadcast operator. Here, we wrap parameter in a Ref (think of it as a zero dimensional array or a pointer) to indicate that it should not be expanded during the broadcast. We pass lst1 as is, and its form mimics a column vector. We then pass the transpose of lst2 (obtained by the ' operator) which makes it have the form of a row vector.
These two differing dimensions cause the broadcast to create a Matrix with the first axis being lst1 and the second axis being lst2. For clarity you can examine the output of tuple.(lst1, lst2'), which will show you essentially the values passed into the function f.
In the end, lst3 will be a Matrix of elements of the type which f returns.
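To make the shape of the broadcast result concrete, here is a rough Python sketch of the same outer evaluation using a nested comprehension. The function f and the value of parameter are hypothetical stand-ins, since the question does not show them.

```python
def f(parameter, x, y):
    # Hypothetical stand-in for the function from the question.
    return parameter * x + y

parameter = 2.0
lst1 = [0.5 * k for k in range(8)]  # 0.0, 0.5, ..., 3.5
lst2 = [0.5 * k for k in range(9)]  # 0.0, 0.5, ..., 4.0

# The outer index runs over lst1 and the inner index over lst2, so
# lst3[i][j] == f(parameter, lst1[i], lst2[j]), like the broadcast result.
lst3 = [[f(parameter, x, y) for y in lst2] for x in lst1]
```

The result has len(lst1) rows and len(lst2) columns, matching the Matrix produced by the broadcast.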
To actually plot this you should consider using the Plots.jl or Makie.jl packages.
Sorry I can't provide code examples right now, posting this on mobile. Will reformat later.
You may do it like this with Plots.jl
using Plots
f = (x, y) -> x^2 + y^2
x = range(-1, 1; length=10)
y = range(-2, 2; length=40)
z = fill(NaN, size(y, 1), size(x, 1))
for i in eachindex(x), j in eachindex(y)
    z[j, i] = f(x[i], y[j])
end
heatmap(x, y, z)
Plots.heatmap expects a rectangular matrix of z-values, which I preallocated before the loop. In this matrix, the x-direction corresponds to the columns and the y-direction to the rows. The first row is drawn at the bottom of the plot:
z as matrix    z when plotted
z11 z12 z13    z31 z32 z33
z21 z22 z23    z21 z22 z23
z31 z32 z33    z11 z12 z13
Take a look at
z = [1 2 3; 4 5 6; 7 8 9]
heatmap(z) # heatmap(1:size(z, 2), 1:size(z, 1), z)
Things to improve
replace fill with uninitialised matrix constructor
construct x-y pairs, e.g. by Iterators.product instead of nested loops
use broadcasting

"Sapply" function in R counterpart in MATLAB to convert a code from R to MATLAB

I want to convert some R code to MATLAB (not execute the R code from within MATLAB).
The code in R is as follows:
data_set <- read.csv("lab01_data_set.csv")
# get x and y values
x <- data_set$x
y <- data_set$y
# get number of classes and number of samples
K <- max(y)
N <- length(y)
# calculate sample means
sample_means <- sapply(X = 1:K, FUN = function(c) {mean(x[y == c])})
# calculate sample deviations
sample_deviations <- sapply(X = 1:K, FUN = function(c) {sqrt(mean((x[y == c] - sample_means[c])^2))})
To implement it in MATLAB I write the following:
%% Reading Data
% read data into memory
X=readmatrix("lab01_data_set(ViaMatlab).csv");
% get x and y values
x_read=X(1,:);
y_read=X(2,:);
% get number of classes and number of samples
K = max(y_read);
N = length(y_read);
% Calculate sample mean - 1st method
% funct1 = @(c) mean(c);
% G1=findgroups(y_read);
% sample_mean=splitapply(funct1,x_read,G1)
% Calculate sample mean - 2nd method
for m=1:3
    sample_mean(1,m)=mean(x_read(y_read == m));
end
sample_mean;
% Calculate sample deviation - 2nd method
for m=1:3
    sample_mean=mean(x_read(y_read == m));
    sample_deviation(1,m)=sqrt(mean((x_read(y_read == m)-sample_mean).^2));
    sample_mean1(1,m)=sample_mean;
end
sample_deviation;
sample_mean1;
As you can see, I know how to use a for loop in MATLAB instead of sapply in R (the 2nd method in the code), but I do not know how to do it with a function (possibly splitapply or any other).
PS: Do not know how to upload the data, so sorry for that part.
The MATLAB equivalent to R sapply is arrayfun - and its relatives cellfun, structfun and varfun depending on what data type your input is.
For example, in R:
> sapply(1:3, function(x) x^2)
[1] 1 4 9
is equivalent to MATLAB:
>> arrayfun(@(x) x^2, 1:3)
ans =
1 4 9
Note that if the result of the function you pass to arrayfun, cellfun etc. doesn't have the same type and scalar size for every input, you'll need to specify 'UniformOutput', false, in which case the results are returned in a cell array.
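The "apply a function to each class label" idea behind sapply and arrayfun can be sketched in Python as follows. The x and y values are made-up toy data (the original CSV is not available), and values_of is a hypothetical helper name.

```python
from math import sqrt
from statistics import fmean

# Toy data: x holds the measurements, y the class label (1..K) of each one.
x = [1.0, 2.0, 3.0, 10.0, 12.0, 20.0]
y = [1, 1, 1, 2, 2, 3]
K = max(y)

def values_of(c):
    """Measurements belonging to class c, like x[y == c] in R."""
    return [xi for xi, yi in zip(x, y) if yi == c]

# One result per class, like sapply(1:K, ...) or arrayfun(@(c) ..., 1:K).
sample_means = [fmean(values_of(c)) for c in range(1, K + 1)]
sample_deviations = [
    sqrt(fmean([(v - m) ** 2 for v in values_of(c)]))
    for c, m in zip(range(1, K + 1), sample_means)
]
```

As in the R code, the deviation divides by the class size rather than by the class size minus one.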

Extract rows / columns of a matrix into separate variables

The following question came up in my course yesterday:
Suppose I have a matrix M = rand(3, 10) that comes out of a calculation, e.g. an ODE solver.
In Python, you can do
x, y, z = M
to extract the rows of M into the three variables, e.g. for plotting with matplotlib.
In Julia we could do
M = M' # transpose
x = M[:, 1]
y = M[:, 2]
z = M[:, 3]
Is there a nicer way to do this extraction?
It would be nice to be able to write at least (approaching Python)
x, y, z = columns(M)
or
x, y, z = rows(M)
One way would be
columns(M) = [ M[:,i] for i in 1:size(M, 2) ]
but this will make an expensive copy of all the data.
To avoid this would we need a new iterator type, ColumnIterator, that returns slices? Would this be useful for anything other than using this nice syntax?
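You could return views instead of copies, with either
columns(M) = [ slice(M,:,i) for i in 1:size(M, 2) ]
or
columns(M) = [ sub(M,:,i) for i in 1:size(M, 2) ]
Both return a view, but slice drops all dimensions indexed with scalars. (Both functions were removed in later Julia versions: view(M, :, i) replaces them, and eachcol(M) and eachrow(M) iterate over column and row views directly.)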
A nice alternative that I have just found if M is a Vector of Vectors (instead of a matrix) is using zip:
julia> M = Vector{Int}[[1,2,3],[4,5,6]]
2-element Array{Array{Int64,1},1}:
[1,2,3]
[4,5,6]
julia> a, b, c = zip(M...)
Base.Zip2{Array{Int64,1},Array{Int64,1}}([1,2,3],[4,5,6])
julia> a, b, c
((1,4),(2,5),(3,6))
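For reference, this is roughly what the Python side of the comparison looks like with a plain list of rows. It is a small sketch with made-up data: unpacking gives the rows directly, and zip(*M) plays the role of the zip(M...) call above.

```python
M = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

# Unpacking binds the existing row lists to names; nothing is copied.
x, y, z = M

# zip(*M) regroups the data by column; each column is a new tuple (a copy).
c1, c2, c3, c4 = zip(*M)
```

The unpacking only works when the number of names on the left matches the number of rows or columns.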

Create many new matrices from a larger matrix

I am relatively new to R, though I have done a good amount of simple R programming. I think this should be an easy question but I can't seem to figure it out.
Updated:
The situation is that I need to fragment my data for regression analysis due to memory constraints on my computer. I essentially have three pertinent matrices; call them X (n x k), y (n x 1), and Om (n x n). I need to break these three matrices up by rows, multiply them together in various ways, and then add up the results. Because of the error structure of Om, some groups have to be rows of 3, others rows of 2, and others rows of 1. For a group of 3 we would have:
Xi (3 x k), yi (3 x 1) and Om3 (3 x 3)
I have Om1, Om2, and Om3 already built in R:
Om1<-matrix(2)
vec1<-c(2,-1)
vec2<-c(-1,2)
Om2<-rbind(vec1,vec2)
vec3<-c(2,-1,0)
vec4<-c(-1,2,-1)
vec5<-c(0,-1,2)
Om3<-rbind(vec3,vec4, vec5)
Now the question is how I can break up X and y so that I can match the rows of X, y and Om. I was thinking I would need a loop, but cannot get it to work. Something like:
for(i in 1:280){
assign(paste("xh", i, sep = ""), i)
}
for(i in 1:145){
xhi<- X[i,]
}
for(i in 146:195){
xhi<-X[ seq( from=i , length=2) , ]
}
for(i in 196:280){
xhi<-X[ seq( from=(i+49) , length=3) , ]
}
Here the first 145 xi's correspond to Om1, the next 50 xi's correspond to Om2, and so on. I was thinking I would need a way to index them consistently, since eventually I need to sum up a product of xi, yi and Omi across all i.
Sorry for the long post, I was trying to be thorough. Any advice would be greatly appreciated.
Assume this matrix is named mtx, you want 10 such matrices of increasing row count, and that it has at least 55 rows, since the sum of the lengths that grow by one with each iteration is n(n+1)/2:
mlist <- list()
for (x in 1:10) mlist[[x]] <-mtx[ seq( from=x*(x-1)/2+1 , length=x) , ]
If the desire is as Carl suggests it would be:
mlist <- list()
for (x in 1:10) mlist[[x]] <-mtx[ seq( from=1 , length=x) , ]
Or:
mlist <- list()
for (x in 1:10) mlist[[x]] <-mtx[ seq( from=x , length=2) , ]
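The index arithmetic of the first variant (consecutive blocks whose row count grows by one) can be checked with a short Python sketch. growing_blocks is a hypothetical helper name and mtx is toy data with exactly 55 rows.

```python
def growing_blocks(rows, count):
    """Split `rows` into consecutive blocks of 1, 2, ..., count rows."""
    blocks = []
    for x in range(1, count + 1):
        # 0-based counterpart of from = x*(x-1)/2 + 1 in the R code.
        start = x * (x - 1) // 2
        blocks.append(rows[start:start + x])
    return blocks

# Toy matrix with 55 rows, the minimum needed for 10 blocks (10 * 11 / 2).
mtx = [[r, 10 * r] for r in range(1, 56)]
mlist = growing_blocks(mtx, 10)
```

Block x starts right after the x*(x-1)/2 rows consumed by the earlier blocks, so the blocks never overlap and together cover all 55 rows.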
