How to groupby 200 features at once using Julia - julia

I am looking for a way to perform these operations in Julia:
I have a data frame with 200 columns that I want to group by, one at a time, like this:
column1_grouped = combine(groupby(df, :column_1), :fixed_parameter => mean)
column2_grouped = combine(groupby(df, :column_2), :fixed_parameter => mean)
and so on, until column200_grouped.
Is there a way to iterate over a list of these 200 columns to output these 200 grouped data frames? (I want to avoid typing 200 lines like the above.)
I got the list of 200 columns by doing: list = names(df[!,r"factor_"])

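Here is an example where the result will be a Dict mapping each grouping column name to its resulting data frame:
using DataFrames, Statistics  # combine/groupby and mean
# names of all columns matching the regex
list = names(df, r"factor_")
Dict([n => combine(groupby(df, n), :fixed_parameter => mean) for n in list])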

Related

In R, how do I concatenate multiple columns within a list of lists

I have a function in R which returns a list with N columns each of M rows of multiple types - date, numeric and char. I am using sapply to create multiple copies of these lists, which then end up in a top level list. I would like to concatenate the underlying lists together to produce a single list of N columns and M * number of list rows.
I've been trying different combinations of do.call, sapply, rbind, c, etc, but I think I'm missing something pretty fundamental. Below is a simple script that mimics the problem and shows the desired outcome. I've used 3 variables here, but the number of variables is arbitrary.
# Set up test function
testfun <- function(varName)
{
currDate = seq(as.Date('2018-12-31'), as.Date('2019-01-10'), "days")
t1 = runif(11)
t2 = runif(11)
groupNum = c(rep(1,5), rep(2,6))
varName = rep(varName, 11)
dataout= data.frame(currDate, t1, t2, groupNum, varName)
}
# create 3 test variables and run the data
varNames = c('test1', 'test2', 'test3')
tmp = sapply(varNames, testfun)
# I would like it to look like the following, but for any given number of variables
desiredAnswer <- rbind(as.data.frame(tmp[,1]),as.data.frame(tmp[,2]),as.data.frame(tmp[,3]))
The final answer will later be used to create a data table and feed ggplot with the varName as facets.
I'm happy to use any method to get the desired results, there's no reason the function needs to produce a list instead of say a data.frame. I'm certain I'm doing something dumb, any help appreciated.
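One way this could be approached is sketched below, assuming the function can return its data frame directly and that lapply is acceptable in place of sapply. The body of testfun is a condensed version of the one in the question; the name combined is only illustrative.

```r
# Condensed version of the question's testfun, returning the data frame explicitly
testfun <- function(varName) {
  currDate <- seq(as.Date("2018-12-31"), as.Date("2019-01-10"), "days")
  data.frame(currDate,
             t1 = runif(11),
             t2 = runif(11),
             groupNum = c(rep(1, 5), rep(2, 6)),
             varName = rep(varName, 11))
}

varNames <- c("test1", "test2", "test3")

# lapply keeps each result as a data frame (sapply would simplify them to a matrix of lists);
# do.call(rbind, ...) then stacks any number of data frames row-wise
combined <- do.call(rbind, lapply(varNames, testfun))
```

Because lapply never simplifies its result, the same two lines work for any length of varNames.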

Use R to compute ratio when columns holding JSON lists?

I have a data frame with 2 columns where each row in each column is a JSON array, like:
df$col1[1] <- "[14,7,5,3,4,0,1,7,2,3,1,18,13,4,23,7,8,8,11,18,15,6,2,10,2,4,8,5,11,5,1,5,2,4,3,1,6,8,5,5,3,1,1,4,5,2,9,3,4,11,11,14,3,12,2,6,0,0,15,1,18,5,3,6,6,6]"
and a scalar column:
df$scalar <- 10, .... , 10
I want to apply the following formula:
((fromJSON(df$col1) / scalar1) / (fromJSON(df$col2) / scalar2))
I have done something like this:
lapply(df$col1, function(i) {fromJSON(i)/scalar1}) /
lapply(df$col2, function(i) {fromJSON(i)/scalar2})
Is there any other way to do this?
We can loop through the JSON columns and apply fromJSON to each, divide each result by its scalar using Map, and then use Reduce to divide one result by the other, giving a single vector:
Reduce(`/`, Map(`/`, lapply(df[c('col1', 'col2')], fromJSON), df[c('scalar1', 'scalar2')]))
A similar approach using map2 and reduce from purrr would be
library(purrr)
map2(df[c('col1', 'col2')], df[c('scalar1', 'scalar2')], ~ fromJSON(.x)/.y) %>%
reduce(`/`)
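If the data frame has several rows, each holding its own JSON array, a row-wise variant may be easier to reason about. The sketch below assumes fromJSON comes from the jsonlite package (the question does not say which package is used) and uses a small made-up data frame; the name ratio is illustrative.

```r
library(jsonlite)  # assumption: fromJSON is jsonlite::fromJSON

# Made-up example data: two JSON columns and two scalar columns
df <- data.frame(col1 = c("[14,7]", "[5,3]"),
                 col2 = c("[7,7]", "[10,6]"),
                 scalar1 = c(10, 10),
                 scalar2 = c(5, 10),
                 stringsAsFactors = FALSE)

# Walk the four columns in parallel, one row at a time:
# parse both JSON arrays, scale each by its scalar, then take the ratio
ratio <- unname(Map(function(a, b, s1, s2) (fromJSON(a) / s1) / (fromJSON(b) / s2),
                    df$col1, df$col2, df$scalar1, df$scalar2))
```

The result is a list with one numeric vector per row, which can be stored back as a list column with df$ratio <- ratio if needed.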

How to get sum of observations of a variable using sapply/lapply for a list of data frames?

I want to get the sum of values for a variable in each data frame in a list. I have something similar to this:
l <- list(a = mtcars, b = mtcars, c = mtcars)
v <- sapply(l, function(x)sum(l$x$disp))
I expect to get a named vector containing the sums. Instead I get only zeros. My guess would be that only the name of the data frame is passed to the function.
I tried other functions and lapply, but every time I get a list/vector of NULLs or zeros. It is possible to use a for loop for this task, but in my case I have a list containing lists of data frames, and nested loops in R do not seem to be a good choice.
Do you have any ideas what I am missing?
Thanks in advance.
I think you need,
sapply(l, function(x) sum(x["disp"]))
# a b c
#7383.1 7383.1 7383.1
Or, if you need the total of the disp variable across the whole list, you can wrap the entire call in sum
sum(sapply(l, function(x) sum(x["disp"])))
#[1] 22149.3
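The zeros in the original attempt can be traced with a short sketch. Inside the anonymous function, x already is the data frame, while l$x looks up an element of l literally named "x", which does not exist. The nested list below is a made-up stand-in for the "list containing lists of data frames" mentioned in the question.

```r
l <- list(a = mtcars, b = mtcars, c = mtcars)

# `$` does not evaluate x; it looks for an element literally named "x"
missing_element <- l$x        # NULL, because l has no element called "x"
zero_sum <- sum(NULL)         # summing NULL gives 0, hence the zeros

# Using the function argument directly gives the per-data-frame sums
sums <- sapply(l, function(x) sum(x$disp))

# Hypothetical nested case: a list of lists of data frames,
# handled with one apply call per level instead of nested for loops
nested <- list(g1 = l, g2 = l)
nested_sums <- lapply(nested, function(g) sapply(g, function(x) sum(x$disp)))
```

Here sums is a named numeric vector with names a, b and c, and nested_sums is a list of such vectors, one per inner list.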

Select a column from multiple dataframes in a list

My list has multiple data frames with only two columns
DateTime Value
30-06-2016 100
31-07-2016 200
.
.
.
I just want to extract the column Value from each data frame in the list. The following code proved unsuccessful for me. What am I doing wrong here?
actual_data <- lapply(test_data, function(df) df[,is.numeric(df)])
> actual_data[[1]]
data frame with 0 columns and 12 rows
Thank you
purrr::map (an enhanced version of lapply) provides a shortcut for this type of operation:
# Generate test data
set.seed(35156)
test_df <- data.frame('DateTime' = rnorm(100), 'Value' = rnorm(100))
test_data <- rep(list(test_df), 100)
# Use `map` from the purrr package to subset the data.frames
purrr::map(test_data, 'Value')
purrr::map(test_data, 2)
As you can see in the example above, you can select columns in a data.frame either by name, by passing a character string as the second argument to purrr::map, or by position, by passing a number.
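For comparison, the same selection can be sketched in base R, which also shows why the original attempt returned zero columns: is.numeric(df) tests the data frame as a whole and returns a single FALSE, so no column is selected. The two-row test_df below is made up to mirror the question's layout.

```r
# Made-up data mirroring the question: a date-like column and a numeric column
test_df <- data.frame(DateTime = c("30-06-2016", "31-07-2016"),
                      Value = c(100, 200))
test_data <- list(test_df, test_df)

# is.numeric() on a whole data frame is a single FALSE, so df[, FALSE] has 0 columns
whole_frame_check <- is.numeric(test_df)

# Select by name: returns one plain vector per data frame
values <- lapply(test_data, function(df) df[["Value"]])

# Select by type: test each column separately, and keep the data frame shape
numeric_only <- lapply(test_data, function(df) df[, sapply(df, is.numeric), drop = FALSE])
```

The by-name version is the closest match to purrr::map(test_data, 'Value'); the by-type version is closer to what the original code was trying to do.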

Using R to create a table from a list while preserving attributes

I am trying to use R to create a table that links all KEGG orthology IDs to all related Entrez genes. In theory this can be done using the KEGGREST package from bioconductor.
I have a list of all the KEGG orthology IDs, ko_nums, which I want to convert to Entrez IDs using the function keggConv. First I try lapply, but this is a problem because the url query is too long:
library(KEGGREST)
lapply(ko_nums,keggLink("genes",ko_nums))
Error in .get Url: (414) Request-URI Too Long
So that won't work with a query as big as mine. I tried to expand the list and query one at a time using:
output = apply(expand.grid(ko_nums),1,
function(x,y) keggLink("genes",x[1]))
But if you do this with a toy where
ko_nums = c("ko:K00001","ko:K00002","ko:K00003")
output = apply(expand.grid(ko_nums),1,
function(x,y) keggLink("genes",x[1]))
output
you see that my output is a list of three, with many genes per orthology ID in a list. I want to keep each gene paired with its respective orthology number in a data table, BUT
a) wrapping this in an "unlist" function removes all the ko identifiers, and
b) I can't make a dataframe with the list as it is because each row would have a different number of elements.
Is there a way to make a two-column table from this list in which the ko numbers are split into individual orthology/gene pairs? Like this:
ko:K00001 gene_1
ko:K00001 gene_2
ko:K00001 gene_3
ko:K00002 gene_4
ko:K00002 gene_5
ko:K00002 gene_6
etc.
Split your long list of ko_nums into groups of, say, n=1000 identifiers (choosing n so that the URL is not too long)
n = 1000
k = length(ko_nums)
grp = floor((1:k - 1) / n)
ko_groups = split(ko_nums, grp)
Apply keggLink() to each group
res = lapply(ko_groups, keggLink, target="genes")
Combine the results into the desired form
df = data.frame(ko_num=unlist(sapply(res, names)),
value = unname(unlist(res)))
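The chunk-then-combine steps can be tried offline with a sketch like the one below. It does not call KEGG at all: fake_link is a hypothetical stand-in that only mimics the shape assumed for the result (a character vector of genes named by KO ID), and the 2500 identifiers are generated purely for illustration.

```r
# Made-up identifiers, just to have something to split
ko_nums <- sprintf("ko:K%05d", 1:2500)

# Assign consecutive identifiers to groups of at most n
n <- 1000
grp <- floor((seq_along(ko_nums) - 1) / n)
ko_groups <- split(ko_nums, grp)
group_sizes <- lengths(ko_groups)

# Hypothetical stand-in for the remote lookup: values are genes, names are KO IDs
fake_link <- function(ids) {
  genes <- paste0("gene_", seq_along(ids))
  names(genes) <- ids
  genes
}
res <- lapply(ko_groups, fake_link)

# Flatten names and values separately so every gene stays paired with its KO ID
pairs <- data.frame(ko_num = unlist(lapply(res, names), use.names = FALSE),
                    gene = unlist(res, use.names = FALSE))
```

Flattening with use.names = FALSE avoids the group prefixes that unlist would otherwise paste onto the names.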
