How can I run a weighted correlation in Julia?
In Stata, you can run corr x y [aw=weight] to find correlations between columns x and y using weight as the weights. I can't find the same functionality in Julia.
This is the way to do it:
julia> using Statistics, StatsBase
julia> x = [1 2
3 4
1 -2
3 -4
5 -6]
5×2 Matrix{Int64}:
1 2
3 4
1 -2
3 -4
5 -6
julia> cor(x, Weights([1,1,0,0,0]))
2×2 Matrix{Float64}:
1.0 1.0
1.0 1.0
julia> cor(x, Weights([0,0,1,1,1]))
2×2 Matrix{Float64}:
1.0 -1.0
-1.0 1.0
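For the case in the question, where x and y are two separate vectors, a minimal sketch is to put them side by side as the columns of a matrix and read the off-diagonal entry. The names x, y, w, C and r below are only illustrative, and the values are simply the two columns of the matrix above.

```julia
using Statistics, StatsBase

# Illustrative data: the two columns of the matrix used above
x = [1, 3, 1, 3, 5]
y = [2, 4, -2, -4, -6]
w = Weights([0, 0, 1, 1, 1])

# The weighted cor method works on a matrix, so combine the vectors as columns
C = cor(hcat(x, y), w)

# The off-diagonal entry is the weighted correlation between x and y
r = C[1, 2]
```

With these weights only the last three observations count, so r should agree with the -1.0 shown above.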
I have a sample dataframe in which I want to subtract the value in a fixed column (Mean) from the values in several other columns.
For example:
I have a dataframe df:
LK Loc1 Loc2 Loc3 Mean
1 1 2 0 3
2 2 8 4 4.6
3 3 1 2 2
4 2 0 1 1.5
5 1 2 0 1.5
I want to get in a new dataframe:
LK Loc1 Loc2 Loc3
1 -2 -1 -3
2 -2.6 3.4 -0.6
3 1 -1 0
4 0.5 -1.5 -0.5
5 -0.5 0.5 -1.5
I tried something with:
df2 <- df %>%
mutate(across(-LK, ~ accumulate(., `-`)))
But I don't know how to continue..
Any help is appreciated.
Thank you in advance
I think you can use the following solution:
library(dplyr)
df %>%
mutate(across(starts_with("Loc"), ~ .x - Mean))
LK Loc1 Loc2 Loc3 Mean
1 1 -2.0 -1.0 -3.0 3.0
2 2 -2.6 3.4 -0.6 4.6
3 3 1.0 -1.0 0.0 2.0
4 4 0.5 -1.5 -0.5 1.5
5 5 -0.5 0.5 -1.5 1.5
We can also use the pmap function from the purrr package. This is a bit more complicated, but it is nice to know. We use pmap to iterate over every row of a data frame:
Here we use c(...) to capture all values in each row, but keep only those whose names start with Loc, as a vector of 3 elements.
Then we subtract the value of the Mean variable from each element of the resulting vector. Mean is represented by ..5 here because it is the fifth variable in this data set.
The rest is just renaming and reordering the variables.
df %>%
pmap_df(~ {x <- c(...)[startsWith(names(df), "Loc")];
x - ..5}) %>%
bind_cols(df$LK) %>%
rename(LK = ...4) %>%
relocate(LK)
# A tibble: 5 x 4
LK Loc1 Loc2 Loc3
<int> <dbl> <dbl> <dbl>
1 1 -2 -1 -3
2 2 -2.6 3.4 -0.600
3 3 1 -1 0
4 4 0.5 -1.5 -0.5
5 5 -0.5 0.5 -1.5
Another way to do it:
library(tidyverse)
df <-
read_table('LK Loc1 Loc2 Loc3 Mean
1 1 2 0 3
2 2 8 4 4.6
3 3 1 2 2
4 2 0 1 1.5
5 1 2 0 1.5')
cbind( df[1],
map_dfc(select(df,starts_with('Loc')), ~ .x - df$Mean) )
#> LK Loc1 Loc2 Loc3
#> 1 1 -2.0 -1.0 -3.0
#> 2 2 -2.6 3.4 -0.6
#> 3 3 1.0 -1.0 0.0
#> 4 4 0.5 -1.5 -0.5
#> 5 5 -0.5 0.5 -1.5
Created on 2021-06-21 by the reprex package (v2.0.0)
I was able to get what you needed using mutate_at:
df %>%
mutate_at(vars(starts_with("Loc")), ~ .-Mean) %>%
select(-c(Mean))
Here, I use vars(starts_with("Loc")) to tell R that every column whose name starts with "Loc" should be transformed; inside the formula, the current column is referenced as . after the tilde. Then I refer to the column Mean explicitly and drop it afterwards with select. I noticed that not every value in the Mean column is the row-wise mean of the Loc columns (in the first row, for example, Mean is 3 while 1, 2 and 0 average to 1). I wasn't sure if that was on purpose or not, but here is one option that will get you row-wise means in dplyr: rowwise() %>% mutate(Mean = mean(c(Loc1, Loc2, Loc3))). Without rowwise(), mean() would be computed over all rows at once.
I'm using Julia comprehension to achieve the following:
Given a matrix
A = [1 2; 3 4],
I want to expand it into
B =
[1, 1, 1, 2, 2;
1, 1, 1, 2, 2;
1, 1, 1, 2, 2;
3, 3, 3, 4, 4;
3, 3, 3, 4, 4].
Right now I'm doing this with
ns = [3, 2]
B = [fill(B[i, j], ns[i], ns[j]) for i = 1:2, j = 1:2]
However, instead of getting a 5x5 matrix, it gives me:
2×2 Array{Array{Int64,2},2}:
[0 0 0; 0 0 0; 0 0 0] [0 0; 0 0; 0 0]
[0 0 0; 0 0 0] [0 0; 0 0]
So how should I convert this 2d array of matrices to a 2d matrix? Or are there other ways to do the expansion I need?
Here are two ways you could do it (the first one uses your approach, the second one does not generate intermediate matrices):
julia> A = [1 2; 3 4]
2×2 Array{Int64,2}:
1 2
3 4
julia> ns = [3, 2]
2-element Array{Int64,1}:
3
2
julia> hvcat(2, [fill(A[j, i], ns[j], ns[i]) for i = 1:2, j = 1:2]...)
5×5 Array{Int64,2}:
1 1 1 2 2
1 1 1 2 2
1 1 1 2 2
3 3 3 4 4
3 3 3 4 4
julia> nsexpand = reduce(vcat, (fill(k, ns[k]) for k in axes(ns, 1)))
5-element Array{Int64,1}:
1
1
1
2
2
julia> [A[i, j] for i in nsexpand, j in nsexpand]
5×5 Array{Int64,2}:
1 1 1 2 2
1 1 1 2 2
1 1 1 2 2
3 3 3 4 4
3 3 3 4 4
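The second approach can also be written with plain indexing instead of a comprehension. This is only a sketch of the same idea; the name idx is illustrative and plays the role of nsexpand above.

```julia
A = [1 2; 3 4]
ns = [3, 2]

# Repeat each index k exactly ns[k] times: [1, 1, 1, 2, 2]
idx = reduce(vcat, [fill(k, ns[k]) for k in eachindex(ns)])

# Indexing with repeated indices copies the matching rows and columns
B = A[idx, idx]
```

Because the same index vector is used for rows and columns, this assumes a square A expanded by the same counts in both directions, as in the question.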
EDIT
Here is an additional example:
julia> A = [1 4 7 10
2 5 8 11
3 6 9 12]
3×4 Array{Int64,2}:
1 4 7 10
2 5 8 11
3 6 9 12
julia> hvcat(3, A...)
4×3 Array{Int64,2}:
1 2 3
4 5 6
7 8 9
10 11 12
julia> vec(A)
12-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
11
12
So:
the first argument tells you how many columns you want to produce
hvcat has h before v, so it takes elements row-wise
however, arrays store their elements column-wise
so in effect you have to create the temporary array as a transpose of your target (because hvcat will take its columns to create the rows of the target array). Actually this is only a coincidence: hvcat does not know that your original elements were stored in a matrix (it takes them as positional arguments to the call, and at that point the fact that they were stored in a matrix is lost due to the ... operation).
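The points above can be checked with a small sketch. The names M and R are illustrative; permutedims is used here to build the transposed temporary array.

```julia
A = [1 4 7 10
     2 5 8 11
     3 6 9 12]

# hvcat fills its result row by row from the positional arguments
M = hvcat(2, 1, 2, 3, 4)          # rows are (1, 2) and (3, 4)

# Splatting a matrix passes its elements in column-major order, so
# splatting the transposed copy passes the elements of A row by row
R = hvcat(size(A, 2), permutedims(A)...)
```

Here R reproduces A, which is why the comprehension in the first solution builds the blocks in transposed order.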
I have a vector and a matrix (Array{T,1} and Array{T,2}) in my Julia code and I would like to append them such that the vector becomes a new row in the matrix (should be first row). I've tried several methods (cat, etc.) but keep getting errors which I believe are related to the different shape of the data. See the example below.
julia> v = Vector([1, 2, 3])
3-element Array{Int64,1}:
1
2
3
julia> m = Matrix([4 5 6; 7 8 9])
2×3 Array{Int64,2}:
4 5 6
7 8 9
julia> cat(v,m,dims=(1,2))
5×4 Array{Int64,2}:
1 0 0 0
2 0 0 0
3 0 0 0
0 4 5 6
0 7 8 9
What I actually want is
1 2 3
4 5 6
7 8 9
I realize that I can get this to work with transpose(v) but I was hoping to avoid extra calls.
Thanks!
As long as you can change the construction of v to a 1 x 3 array, you can avoid the transpose:
julia> v = [1 2 3]
1×3 Array{Int64,2}:
1 2 3
julia> m = [4 5 6; 7 8 9]
2×3 Array{Int64,2}:
4 5 6
7 8 9
julia> vcat(v, m)
3×3 Array{Int64,2}:
1 2 3
4 5 6
7 8 9
I think that just doing the transpose
julia> v2 = [1, 2, 3]
3-element Array{Int64,1}:
1
2
3
julia> vcat(v2', m)
3×3 Array{Int64,2}:
1 2 3
4 5 6
7 8 9
is almost as efficient though.
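As a rough sketch of the options, the three forms below should give the same 3×3 result. For a real-valued vector, v' is a lazy wrapper that does not copy v, while permutedims allocates a new 1×3 matrix; the names r1, r2 and r3 are illustrative.

```julia
v = [1, 2, 3]
m = [4 5 6; 7 8 9]

r1 = vcat(v', m)               # adjoint: a lazy 1×3 wrapper around v
r2 = vcat(permutedims(v), m)   # permutedims: a new 1×3 Matrix
r3 = [v'; m]                   # literal syntax for vertical concatenation
```

Note that for complex element types v' also conjugates the entries, so permutedims(v) or transpose(v) is the safer choice there.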
I can sample from a 1-D array just fine. E.g.
julia> a = [1; 2; 3]
3-element Array{Int64,1}:
1
2
3
julia> sample(a, 5)
5-element Array{Int64,1}:
1
2
1
3
3
I can also take weighted samples:
julia> myweights = weights([0.8, 0.1, 0.1])
StatsBase.WeightVec{Float64,Array{Float64,1}}([0.8,0.1,0.1],1.0)
julia> sample(a, myweights, 5)
5-element Array{Int64,1}:
2
1
1
1
1
I'd like to do the same thing for a 2D array, but sampling by row and not by element. E.g. if I have the array
julia> b = [1 1 1; 2 2 2; 3 3 3]
3×3 Array{Int64,2}:
1 1 1
2 2 2
3 3 3
I'd like to be able to take unweighted and weighted samples that give me outputs like
1 1 1
2 2 2
1 1 1
1 1 1
3 3 3
How can I do this?
The simplest solution here is to sample from the indices of the rows, and then use that to index into your matrix:
julia> idxs = sample(axes(b, 1), myweights, 10)
10-element Array{Int64,1}:
1
1
1
2
1
1
3
1
1
1
julia> b[idxs, :]
10×3 Array{Int64,2}:
1 1 1
1 1 1
1 1 1
2 2 2
1 1 1
1 1 1
3 3 3
1 1 1
1 1 1
1 1 1
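The same idea can be wrapped in a small helper that covers both the unweighted and the weighted case. This is only a sketch: sample_rows is a hypothetical name, not a StatsBase function, and it samples with replacement, which is the default for sample.

```julia
using StatsBase

# Hypothetical helpers: draw n rows (with replacement) by sampling row indices
sample_rows(b::AbstractMatrix, n::Integer) = b[sample(axes(b, 1), n), :]
sample_rows(b::AbstractMatrix, w::AbstractWeights, n::Integer) =
    b[sample(axes(b, 1), w, n), :]

b = [1 1 1; 2 2 2; 3 3 3]
unweighted = sample_rows(b, 5)                          # 5×3, rows chosen uniformly
weighted = sample_rows(b, Weights([0.8, 0.1, 0.1]), 5)  # 5×3, row 1 favoured
```

The weight vector needs one entry per row of b, since the weights apply to the row indices rather than to individual elements.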
Suppose I have a data.frame in which a combination of columns (say a, b, and c) forms an identifier that is shared by exactly two rows. The two rows differ in the column name and in a set of value columns (x, y, and z).
I'd like to take the difference on the value columns, preserve the key columns, and give the name column a new value like diff.
So for example, suppose I have the following data:
a b c x y z name
1 1 M J 0.0 1.0 2.0 alpha
2 1 M K 0.1 0.9 2.0 alpha
3 1 O J 0.2 0.8 2.0 alpha
4 1 O K 0.3 0.7 2.0 alpha
5 2 M J 0.4 0.6 2.0 alpha
6 2 M K 0.5 0.5 2.0 alpha
7 2 O J 0.6 0.4 2.0 alpha
8 2 O K 0.7 0.3 2.0 alpha
9 1 M J 0.0 2.0 1.0 beta
10 1 M K 0.1 1.9 3.0 beta
11 1 O J 0.2 1.8 1.0 beta
12 1 O K 0.3 1.7 3.0 beta
13 2 M J 0.4 1.6 1.0 beta
14 2 M K 0.5 1.5 3.0 beta
15 2 O J 0.6 1.4 1.0 beta
16 2 O K 0.7 1.3 3.0 beta
Then I want the new data frame to be:
a b c x y z name
1 1 M J 0.0 1.0 2.0 alpha
2 1 M K 0.1 0.9 2.0 alpha
3 1 O J 0.2 0.8 2.0 alpha
4 1 O K 0.3 0.7 2.0 alpha
5 2 M J 0.4 0.6 2.0 alpha
6 2 M K 0.5 0.5 2.0 alpha
7 2 O J 0.6 0.4 2.0 alpha
8 2 O K 0.7 0.3 2.0 alpha
9 1 M J 0.0 2.0 1.0 beta
10 1 M K 0.1 1.9 3.0 beta
11 1 O J 0.2 1.8 1.0 beta
12 1 O K 0.3 1.7 3.0 beta
13 2 M J 0.4 1.6 1.0 beta
14 2 M K 0.5 1.5 3.0 beta
15 2 O J 0.6 1.4 1.0 beta
16 2 O K 0.7 1.3 3.0 beta
17 1 M J 0.0 -1.0 1.0 diff
18 1 M K 0.0 -1.0 -1.0 diff
19 1 O J 0.0 -1.0 1.0 diff
20 1 O K 0.0 -1.0 -1.0 diff
21 2 M J 0.0 -1.0 1.0 diff
22 2 M K 0.0 -1.0 -1.0 diff
23 2 O J 0.0 -1.0 1.0 diff
24 2 O K 0.0 -1.0 -1.0 diff
What's the easiest way to accomplish this?
You could make each column individually:
colx = ave(df$x, paste(df$a, df$b, df$c), FUN=function(x) x[1]-x[2])
coly = ave(df$y, paste(df$a, df$b, df$c), FUN=function(x) x[1]-x[2])
colz = ave(df$z, paste(df$a, df$b, df$c), FUN=function(x) x[1]-x[2])
And then put them together:
df2 = subset(df, name=="alpha")
df2$name = "diff"
df2$x = colx[1:(length(colx)/2)]
df2$y = coly[1:(length(coly)/2)]
df2$z = colz[1:(length(colz)/2)]
Now append it to the original:
df = rbind(df, df2)
That gives:
a b c x y z name
1 1 m j 0.0 1.0 2 a
2 1 m k 0.1 0.9 2 a
3 1 o j 0.2 0.8 2 a
4 1 o k 0.3 0.7 2 a
5 2 m j 0.4 0.6 2 a
6 2 m k 0.5 0.5 2 a
7 2 o j 0.6 0.4 2 a
8 2 o k 0.7 0.3 2 a
9 1 m j 0.0 2.0 1 b
10 1 m k 0.1 1.9 3 b
11 1 o j 0.2 1.8 1 b
12 1 o k 0.3 1.7 3 b
13 2 m j 0.4 1.6 1 b
14 2 m k 0.5 1.5 3 b
15 2 o j 0.6 1.4 1 b
16 2 o k 0.7 1.3 3 b
17 1 m j 0.0 -1.0 1 diff
18 1 m k 0.0 -1.0 -1 diff
19 1 o j 0.0 -1.0 1 diff
20 1 o k 0.0 -1.0 -1 diff
21 2 m j 0.0 -1.0 1 diff
22 2 m k 0.0 -1.0 -1 diff
23 2 o j 0.0 -1.0 1 diff
24 2 o k 0.0 -1.0 -1 diff
If your data frame is always sorted and balanced, then this should work:
half<-1:(nrow(df)/2)
rbind(
df,
cbind(
df[half, 1:3],
df[half, 4:6] - df[half+half[length(half)], 4:6],
name="diff"
)
)