DolphinDB: Obtain all chunkIds of a partitioned table

When using DolphinDB, I noticed that chunkId is frequently required as a parameter by many functions. How can I obtain all the chunkIds of a partitioned table?

I think you need getChunksMeta or getTabletsMeta

The getTabletsMeta function returns the metadata of the specified tablet chunks on the local node; the first column of the result is chunkId. Its chunkPath parameter accepts wildcards, so a single call can cover multiple database chunks.
Create a database "testDB" partitioned by VALUE, and use getTabletsMeta to obtain the chunk information of table "pt1".
if(existsDatabase("dfs://testDB")){
    dropDatabase("dfs://testDB")
}
db=database("dfs://testDB", VALUE, 1..10)
n=1000000
t=table(rand(1..10, n) as id, rand(100.0, n) as x)
db.createPartitionedTable(t, `pt1, `id).append!(t)
n=2000000
t=table(rand(1..10, n) as id, rand(100.0, n) as x, rand(100, n) as y)
db.createPartitionedTable(t, `pt2, `id).append!(t)
getTabletsMeta("/testDB/%", `pt1, true);
Output: a table of the tablet chunk metadata for "pt1"; its first column, chunkId, contains the chunk IDs you are after.

Related

R: Best way to store packets of data (after a loop)

I have a testing function compareMethods whose goal is to compare different methods across different numbers of dimensions. I use a nested loop to create all method/number-of-dimensions combinations. Then, for each combination, I create a model mod.
What I would like to do is store four pieces of information for each combination: method, number of dimensions, model, and parameter number 10. In the next step of the analysis, I would like to sort the quadruples based on the value of parameter number 10.
How can I store each quadruple on every iteration?
compareMethods <- function(data, dimensions=c(2,5,10,50)){
  for (method in c("pca", "tsne", "umap")){
    for (n in dimensions){
      new.data <- dimReduction(data, method, n)
      mod <- Mclust(new.data)
      # keep (method, n, mod, mod[10])
    }
  }
  return(lst)  # lst: the collection of quadruples I want to build
}
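One straightforward way to keep the quadruples is to append a named list to a results list on every iteration and sort the list at the end. The following is only a sketch under the question's own assumptions: dimReduction() and Mclust() (from the mclust package) behave as in the code above, and mod[[10]] is a single numeric value representing "parameter number 10":
library(mclust)  # provides Mclust(); dimReduction() is assumed to be the question's own helper

compareMethods <- function(data, dimensions = c(2, 5, 10, 50)) {
  results <- list()  # one entry per (method, n) combination
  for (method in c("pca", "tsne", "umap")) {
    for (n in dimensions) {
      new.data <- dimReduction(data, method, n)
      mod <- Mclust(new.data)
      # store the quadruple as a named list; a model object cannot live in a plain data frame column
      results[[length(results) + 1]] <- list(method = method, n = n, mod = mod, p10 = mod[[10]])
    }
  }
  # sort the quadruples by the value of "parameter number 10"
  results[order(sapply(results, `[[`, "p10"))]
}
Each element of the returned list is one quadruple, so results[[1]]$mod, for example, is the model of the combination with the smallest value of parameter 10.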

Delete all duplicated elements in a vector in Julia 1.1

I am trying to write code that deletes all repeated elements in a vector. How do I do this?
I have already tried unique and union, but they both keep one copy of each repeated item. I want every element that appears more than once to be removed entirely.
For example, let x = [1,2,3,4,1,6,2]. Using union or unique returns [1,2,3,4,6]. What I want as my result is [3,4,6].
There are lots of ways to go about this. One approach that is fairly straightforward and probably reasonably fast is to use countmap from StatsBase:
using StatsBase
function f1(x)
    d = countmap(x)
    return [key for (key, val) in d if val == 1]
end
or as a one-liner:
[ key for (key, val) in countmap(x) if val == 1 ]
countmap creates a dictionary mapping each unique value of x to the number of times it occurs in x. The solution is then found by extracting every key whose val is 1, i.e. every element of x that occurs exactly once.
It might be faster in some situations to sort!(x) first and then scan the sorted vector for the elements that occur only once, but that is messier to code and the output comes back in sorted order, which you may not want. Bear in mind that countmap returns a Dict, so the comprehension above does not guarantee any particular output order either; if the original order matters, filter the unique values against the counts instead, e.g. filter(v -> d[v] == 1, unique(x)) with d = countmap(x).

How can I map a vector of strings as argument of a function?

I want to extract data from an Impala connection, using the tbl() and in_schema() functions from dplyr and implyr. I need to do this for each table separately, specifying each table with in_schema() and a string naming the table. However, only a single string (i.e. one table) can be given as an argument, not a vector of strings. Instead of copy-pasting the same code x times, I was wondering whether there is a more elegant way of mapping this. See the example code below for details.
Take this vector of strings for example:
tables <- c("table_a", "table_b", "table_c")
To extract one table, code works like this:
table_a <- tbl(impala, in_schema("schema", "table_a"))
This doesn't work, which makes sense since only a single string value is expected:
tables <- tbl(impala, in_schema("schema", tables))
How can I extract all tables without having to repeat this process for all tables separately?
You can use a loop:
result_tables <- list()
for (t in tables) {
  result_tables[[t]] <- tbl(impala, in_schema("schema", t))  # use [[ ]] so each tbl object is stored intact as one list element
}
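If you prefer to avoid the explicit loop, the same idea can be written with lapply(). This is only a sketch under the question's setup: impala is the existing implyr connection and "schema" is the schema name used above:
# build a named list of lazy table references, one per table name
result_tables <- setNames(
  lapply(tables, function(t) tbl(impala, in_schema("schema", t))),
  tables
)

result_tables$table_a  # each element is the lazy tbl for one table
Either way, nothing is pulled from Impala until you collect() the individual tbl objects.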

SparkR gapply - function returns a multi-row R dataframe

Let's say I want to execute something as follows:
library(SparkR)
...
df = spark.read.parquet(<some_address>)
df.gapply(
  df,
  df$column1,
  function(key, x) {
    return(data.frame(x, newcol1=f1(x), newcol2=f2(x)))
  }
)
where the function's return value has multiple rows. To be clear, the examples in the documentation (which, sadly, echo much of the Spark documentation in being trivially simple) don't help me determine whether this will be handled as I expect.
I would expect that, for k groups created in the DataFrame with n_k output rows per group, the result of the gapply() call would have n_1 + ... + n_k rows, with the key value replicated across the n_k rows of each group k. However, the schema field suggests to me that this is not how it will be handled; in fact, it suggests that the result is expected to be pushed into a single row.
Hopefully this is clear, albeit theoretical (I'm sorry I can't share my actual code example). Can someone verify or explain how such a function will actually be treated?
Exact expectations regarding input and output are clearly stated in the official documentation:
Apply a function to each group of a SparkDataFrame. The function is to be applied to each group of the SparkDataFrame and should have only two parameters: grouping key and R data.frame corresponding to that key. The groups are chosen from SparkDataFrames column(s). The output of function should be a data.frame.
Schema specifies the row format of the resulting SparkDataFrame. It must represent R function’s output schema on the basis of Spark data types. The column names of the returned data.frame are set by user. Below is the data type mapping between R and Spark.
In other words, your function should take a key and a data.frame of rows corresponding to that key, and return a data.frame that can be represented using Spark SQL types with the schema provided as the schema argument. There are no restrictions on the number of rows. You could, for example, apply an identity transformation as follows:
df <- as.DataFrame(iris)
gapply(df, "Species", function(k, x) x, schema(df))
the same way as aggregations:
gapply(df, "Species",
function(k, x) {
dplyr::summarize(dplyr::group_by(x, Species), max(Sepal_Width))
},
structType(
structField("species", "string"),
structField("max_s_width", "double"))
)
although in practice you should prefer aggregations directly on DataFrame (groupBy %>% agg).
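For completeness, here is a minimal sketch of the direct aggregation the answer recommends, assuming df is the SparkDataFrame built from iris above (SparkR replaces the dots in the iris column names with underscores, hence Sepal_Width):
library(SparkR)

# group on the SparkDataFrame itself and aggregate, without routing every group through an R worker
result <- agg(groupBy(df, "Species"), max(df$Sepal_Width))
head(result)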

Comparing two vectors one value at a time without using WHILE

I have two tables, df.author and df.post, which are related by a one-to-many relation. I have changed the primary key of df.author and I want df.post to mirror the change. In the following R script I use match() inside a while loop to compare the foreign key of each row of df.post with the old primary key of df.author and, when they match, replace the foreign key with the new one (from a different column of df.author). Please consider the following:
foreignkey <- c("old_pk1","old_pk2","old_pk3","old_pk4","old_pk5","old_pk1","old_pk7")
df.post <- data.frame(foreignkey,stringsAsFactors=FALSE)
rm(foreignkey)
primarykey_old <- c("old_pk1","old_pk2","old_pk3","old_pk4","old_pk5")
primarykey_new <- c("new_pk1","new_pk2","new_pk3","new_pk4","new_pk5")
df.author <- data.frame(primarykey_old, primarykey_new, stringsAsFactors=FALSE);
rm(primarykey_old); rm(primarykey_new)
i <- 1; N <- length(df.post$foreignkey)
while (i <= N) {
  match <- match(df.post$foreignkey[i], df.author$primarykey_old)
  if (!is.na(match)) {
    df.post$foreignkey[i] <- df.author$primarykey_new[match]
  }
  i <- i + 1
}
rm(N); rm(i); rm(match)
The script works, but because of the while loop it doesn't scale efficiently to a large dataset. I have read that using apply() (in my case after converting to a matrix) is usually better than using while, and I wonder whether that applies here. As you can see from the loop, I need to go through every single row of the data frame to get the foreign key and then search through df.author for a match().
Can I reduce the computation time by not using while?
I think this might do everything in a loopless fashion:
m <- match(df.post$foreignkey, df.author$primarykey_old)  # position of each old key in df.author, NA if no match
df.post$foreignkey[!is.na(m)] <- df.author$primarykey_new[m[!is.na(m)]]
Logic: match() is vectorized, so one call looks up every foreign key at once; only the rows with a match are replaced, and rows with no match are left untouched.
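An equivalent way to express the same replacement is with a named lookup vector built from the two key columns; a small sketch using the data frames defined above:
# names are the old keys, values are the new keys
lookup <- setNames(df.author$primarykey_new, df.author$primarykey_old)

hit <- df.post$foreignkey %in% names(lookup)
df.post$foreignkey[hit] <- lookup[df.post$foreignkey[hit]]
Rows whose foreign key has no counterpart in df.author (here "old_pk7") keep their original value.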
