Masking in Dask

I was just wondering if someone could help show me how to apply functions such as "sum" or "mean" to masked arrays using Dask. I want to calculate the sum / mean of the array using only the values that are not masked.
Code:
import dask.array as da
import numpy as np
import numpy.ma as ma
dset = [1, 2, 3, 4]
masked = ma.masked_equal(dset, 4)  # let's say 4 should be masked
print(np.sum(masked))   # output: 6
print(np.mean(masked))  # output: 2.0
print(masked)           # output: [1 2 3 --]
masked_array = da.from_array(masked, chunks=4)
print(masked_array.sum().compute())   # output: 10
print(masked_array.mean().compute())  # output: 2.5
Is there a way to have my masked sum equal np.sum(masked) and my masked mean equal 2, by ignoring the masked "4" value? It seems that NumPy is able to ignore the "4" in its calculations, but Dask is not in this case.

Dask supports several operations on masked arrays; the full list is available in the Dask docs.
Example of computing the sum and mean of a masked array:
import dask
import dask.array as da
dset = da.array([1, 2, 3, 4])
mdata = da.ma.masked_equal(dset, 4)
print(da.sum(mdata).compute()) # output: 6
print(da.ma.average(mdata).compute()) # output: 2

Related

R assign to an evaluated object

I want to be able to change a cell of a data frame by referring to the object's name, rather than to the object itself, but when I attempt to do so it results in the error could not find function "eval<-".
I can change a cell of a standard dataframe using the code below:
my_object = tibble(x = c("Hello", "Goodbye"),
                   y = c(1, 2))
my_object[2, 1] <- "Bye"
But I am having trouble doing the same when using the object's name. I can evaluate the object using its name and extract the relevant cell:
object_name = "my_object"
eval(sym(object_name))[2, 1]
But I can't assign a new value to the object (error: could not find function "eval<-"):
eval(sym(object_name))[2, 1] <- "Bye"
You can use get() instead of eval(sym()) to obtain an object by name. You can also use the [<- function to write a value to it without requiring an intermediate copy:
my_object = dplyr::tibble(x = c("Hello", "Goodbye"),
                          y = c(1, 2))
object_name = "my_object"
`[<-`(get(object_name), 2, 1, value ="Bye")
#> # A tibble: 2 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 Hello     1
#> 2 Bye       2
Created on 2022-06-02 by the reprex package (v2.0.1)
1) environments
1a) Subscript the current environment with object_name.
e <- environment()
e[[object_name]][2, 1] <- "Bye"
1b) or as one line:
with(list(e = environment()), e[[object_name]][2, 1] <- "Bye")
1c) If my_object is in the global environment, as in the question, it could optionally be written as:
.GlobalEnv[[object_name]][2, 1] <- "Bye"
2) assign
We could use assign like this:
assign(object_name, within(get(object_name), x[2] <- "Bye"))
3) without clobbering
3a) If what you really want is to create a new data frame without clobbering the input:
library(dplyr)
mutate(get(object_name), across(1, ~ replace(., 2, "Bye")))
3b) or if we knew that the column name was x then:
library(dplyr)
mutate(get(object_name), x = replace(x, 2, "Bye"))
3c) or without dplyr
within(get(object_name), x[2] <- "Bye")
If you want to define your command as a string, parse it as an expression, and then use eval:
eval(parse(text=paste0(object_name,"[2,1]<-'Bye'")))
> my_object
      x y
1 Hello 1
2   Bye 2

How to sequentially concatenate every nth:(nth+j) object in a list of objects

I wish to concatenate every nth:(nth+j)th object in a list of objects I have. More specifically, I would like every two consecutive objects to be concatenated.
A small sample of the list in question is below.
list("SRR1772151_1.fastq", "SRR1772151_2.fastq", "SRR1772152_1.fastq",
"SRR1772152_2.fastq", "SRR1772153_1.fastq", "SRR1772153_2.fastq")
I would like to make a new list from this which looks closer to this.
list(c("SRR1772151_1.fastq", "SRR1772151_2.fastq"), c("SRR1772152_1.fastq",
"SRR1772152_2.fastq"), c("SRR1772153_1.fastq", "SRR1772153_2.fastq"
))
I have made the following attempt at doing this but my for loop has been unsuccessful.
for (i in seq(1,36, 2)) {
for (j in 1:18) {
unlist(List1[i:i+1]) -> List2[[j]]
}
}
Any help or advice would be very appreciated.
You could divide this into two problems -- split the list, e.g.,
elts = split(lst, 1:2)
and concatenate the elements
Map(c, elts[[1]], elts[[2]])
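Putting the two steps together on the list from the question (a minimal sketch; lst is an assumed name for that list):
lst <- list("SRR1772151_1.fastq", "SRR1772151_2.fastq", "SRR1772152_1.fastq",
            "SRR1772152_2.fastq", "SRR1772153_1.fastq", "SRR1772153_2.fastq")
elts <- split(lst, 1:2)       # elts[[1]] holds the _1 files, elts[[2]] the _2 files
Map(c, elts[[1]], elts[[2]])  # list of length-2 character vectors, as requested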
But I think it's better to follow 'tidy' data practices and to create a single vector with a grouping factor
df = data.frame(fastq = unlist(lst), grp = 1:2, stringsAsFactors = FALSE)
or more descriptively
df = data.frame(
fastq = unlist(lst),
sample = factor(sub("_[12].fastq", "", unlist(lst))),
stringsAsFactors = FALSE
)
It's better to work with tidy data because one can accomplish more while knowing less: working with lists you have to learn about split(), Map(), and c(), whereas working with vectors and data.frames you don't.
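For instance, with the descriptive data frame above, the per-sample pairs fall out of ordinary subsetting (a minimal sketch, assuming the df built above):
df$fastq[df$sample == "SRR1772151"]  # both reads for one sample
split(df$fastq, df$sample)           # or, if a list is still wanted: one character vector per sample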
Here is one other attempt using dataframes. The output is a list.
library(tidyverse)
data.frame(X1 = unlist(my_list), stringsAsFactors = FALSE) %>%
group_by(str_sub(X1, 1, 10)) %>% # assuming the first 10 characters form the sample name
summarise(list_value=list(X1)) %>%
pull(list_value)
For the general case, you can create a vector of consecutive groups of size j with:
ceiling(seq_along(x) / j)
… and then use tapply() to concatenate all elements in those groups. Unlike using Map(), this will also work if the chunk size does not equally divide the length of the list.
x <- list("SRR1772151_1.fastq", "SRR1772151_2.fastq", "SRR1772152_1.fastq",
"SRR1772152_2.fastq", "SRR1772153_1.fastq", "SRR1772153_2.fastq")
tapply(x, ceiling(seq_along(x) / 2), unlist)
#> $`1`
#> [1] "SRR1772151_1.fastq" "SRR1772151_2.fastq"
#>
#> $`2`
#> [1] "SRR1772152_1.fastq" "SRR1772152_2.fastq"
#>
#> $`3`
#> [1] "SRR1772153_1.fastq" "SRR1772153_2.fastq"
tapply(x, ceiling(seq_along(x) / 4), unlist)
#> $`1`
#> [1] "SRR1772151_1.fastq" "SRR1772151_2.fastq" "SRR1772152_1.fastq"
#> [4] "SRR1772152_2.fastq"
#>
#> $`2`
#> [1] "SRR1772153_1.fastq" "SRR1772153_2.fastq"
Created on 2019-06-12 by the reprex package (v0.2.1)

Count Occurrences of All Integers in Matrix

I have an array containing 20,000 rows and 300 columns. Each element is an integer. I would like to count the occurrences of each integer in this matrix.
I have tried the following:
>frequency_Table=read.csv('huge_file.csv',header=FALSE,check.names=FALSE)
>table(frequency_Table)
I get the error "attempt to make a table with >= 2^31 elements", which makes sense after reading it.
I want something like this:
1 2000
2 2023
3 5683
Basically, a frequency table of sorts, for all the numbers. Any advice would be appreciated!
The 'frequency_table' object is a data.frame, so calling table() on it directly tries to cross-tabulate all 300 columns, which is why it needs >= 2^31 cells. We unlist it (assuming the OP wants an R solution, since the dataset was read with R syntax) to make a single vector and then get the frequencies with table:
as.data.frame(table(unlist(frequency_table)))
data
set.seed(24)
frequency_table <- as.data.frame(matrix(sample(22:29, 20*4,
replace=TRUE), ncol=4))
from collections import Counter
import numpy as np

# flatten the table into a 1-D array and count every value
Counter(np.array(frequency_Table).flatten())
numpy.unique can do this:
>>> import numpy as np
>>> table = np.array([[1, 2, 3], [2, 2, 3], [3, 2, 3]])
>>> values, counts = np.unique(table, return_counts = True)
>>> for value, count in zip(values, counts):
... print("{}\t{}".format(value, count))
...
1 1
2 4
3 4
Can you find a way to get all the unique integers in the data.frame quickly?
My thought is that once you have the unique integers, you can use sapply(unique_int, function(x) sum(m == x)) to find the corresponding occurrence count of each integer.
This is the code I tried:
m <- matrix(sample(1:10, size=20000*300, replace=TRUE), ncol=300)
#A way to get the unique integers
unique_int <- unique(c(m))
#Count
count <- sapply(unique_int, function(x) sum(m == x))
names(count) <- unique_int
count
## 10 8 3 9 6 5 4 1 2 7
## 598551 600413 599396 599517 600114 600503 601311 601205 599268 599722
Here is a one-line solution in R:
You can use stack() or unlist to arrange all columns of the dataset into a single column. Based on this, you can convert the stacked values to a factor and use tapply with length as the function, which gives you the frequency of each element:
Using stack():
tapply(stack(frequency_Table)[,1],factor(stack(frequency_Table)[,1]),length)
Using unlist:
tapply(unlist(frequency_Table),factor(unlist(frequency_Table)),length)

Converting a text import of a range in format 1:5 from character with : operator to a numeric string/value for use in a function

I am importing a key in which each row is an argument setting for a function I have programmed. The goal is to batch test my function by producing outputs for all sets of arguments. That's not terribly important. What is important is that I import a column that contains, in each row, a value for a range. For instance, "1:5" is meant to be entered into an argument as the value 1:5. I tried to coerce it using as.numeric("1:5"), but R is not happy with this. Is there a way to coerce the character value "1:5" to the numeric vector c(1,2,3,4,5)?
Your text is valid R code, so you can eval(parse(...)) it:
dat$parsed <- lapply(dat$key, function(x) eval(parse(text=x)))
# key parsed
# 1 1:5 1, 2, 3, 4, 5
# 2 1:6 1, 2, 3, 4, 5, 6
# 3 1:4 1, 2, 3, 4
Data
dat <- read.table(text="key
1:5
1:6
1:4", stringsAsFactors=FALSE, header=TRUE)
Reduce(':', strsplit(x,":")[[1]])
[1] 1 2 3 4 5
If x = "1:5", we can use strsplit to separate the two numbers, and then use Reduce to apply the : operator to the split pieces.
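To apply this to a whole imported column (a sketch reusing the dat data frame from the earlier answer, so dat$key is assumed to hold strings like "1:5"):
dat$parsed <- lapply(dat$key, function(x) Reduce(':', strsplit(x, ":")[[1]]))
dat$parsed[[1]]
# [1] 1 2 3 4 5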

Julia DataFrames.jl - filter data with NA's (NAException)

I am not sure how to handle NA within Julia DataFrames.
For example with the following DataFrame:
> import DataFrames
> a = DataFrames.@data([1, 2, 3, 4, 5]);
> b = DataFrames.@data([3, 4, 5, 6, NA]);
> ndf = DataFrames.DataFrame(a=a, b=b)
I can successfully execute the following operation on column :a
> ndf[ndf[:a] .== 4, :]
but if I try the same operation on :b I get an error NAException("cannot index an array with a DataArray containing NA values").
> ndf[ndf[:b] .== 4, :]
NAException("cannot index an array with a DataArray containing NA values")
while loading In[108], in expression starting on line 1
in to_index at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:85
in getindex at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:210
in getindex at /Users/abisen/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:268
Which is because of the presence of NA value.
My question is how should DataFrames with NA should typically be handled? I can understand that > or < operation against NA would be undefined but == should work (no?).
What's your desired behavior here? If you want to do selections like this you can make the condition (not NA) AND (equal to 4); rows where :b is NA then fail the first test and are excluded.
using DataFrames
a = @data([1, 2, 3, 4, 5]);
b = @data([3, 4, 5, 6, NA]);
ndf = DataFrame(a=a, b=b)
ndf[(!isna(ndf[:b]))&(ndf[:b].==4),:]
In some cases you might just want to drop all rows with NAs in certain columns
ndf = ndf[!isna(ndf[:b]),:]
Regarding this question I asked before, you can change this NA behavior directly in the module's source code if you want. In the file indexing.jl there is a function named Base.to_index(A::DataArray) beginning at line 75, where you can alter the code to set NAs in the boolean array to false. For example you can do the following:
# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    A[A.na] = false
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end
Ignoring NAs with isna() makes the source code less readable and, in big expressions, causes a performance loss:
@timeit ndf[(!isna(ndf[:b])) & (ndf[:b] .== 4),:] # 3.68 µs per loop
@timeit ndf[ndf[:b] .== 4, :] # 2.32 µs per loop
## 71x179 2D Array
@timeit dm[(!isna(dm)) & (dm .< 3)] = 1 # 14.55 µs per loop
@timeit dm[dm .< 3] = 1 # 754.79 ns per loop
In many cases you want to treat NA as a separate instance, i.e. assume that everything that is NA is "equal" and everything else is different.
If this is the behaviour you want, the current DataFrames API doesn't help you much, as both (NA == NA) and (NA == 1) return NA instead of their expected boolean results.
This makes DataFrame filters written as loops extremely tedious:
function filter(df,c)
    for r in eachrow(df)
        if (isna(c) && isna(r[:c])) || ( !isna(r[:c]) && r[:c] == c )
            ...
and it breaks select-like functionality in DataFramesMeta.jl and Query.jl when NA values are present or requested.
One workaround is to use isequal(a,b) in place of a==b
test = @where(df, isequal.(:a,"cc"), isequal.(:b,NA) ) # from DataFramesMeta.jl
I think the new syntax in Julia is to use ismissing:
# drop NAs
df = DataFrame(col=[0,1,1,missing,0,1])
df = df[.!ismissing.(df[:col]),:]

Resources