LightTable 0.6.7
Julia 0.3
Jewel 0.6.4
June 0.2
Mac OS X 10.9
Hi, I have a problem using DataVector from the DataFrames package.
using DataFrames
df = DataFrame()
df["Name"] = DataVector["JohnSmith", "JaneDoe"]
df["Height"] = DataVector[73.0,68.0]
df["Weight"] = DataVector[NA,130]
df["Gender"] = DataVector["Male","Female"]
After that, Julia says:
no method convert(Type{DataArray{T,1}}, ASCIIString)
in getindex at array.jl:121
I could run this same script on Julia 0.2.
Do LightTable plugins such as Jewel and June not accept this DataFrames function?
I tried dataeye() and other functions, but it doesn't work.
Btw, I found a similar post on the Google group:
https://groups.google.com/forum/#!topic/julia-users/VmgmRnBCo9I
Thanks for reading.
It looks like you're reading very old documentation for DataArrays and DataFrames. You might want to look at more recent docs: http://juliastats.github.io/DataFrames.jl/
Here's how you'd do what you're trying to do:
using DataFrames
df = DataFrame()
df[:Name] = @data(["JohnSmith", "JaneDoe"])
df[:Height] = @data([73.0, 68.0])
df[:Weight] = @data([NA, 130])
df[:Gender] = @data(["Male", "Female"])
I am not very familiar with loops in R, and am having a hard time specifying a variable so that it is recognized by a function, DESeqDataSetFromMatrix.
pls is a table of integers. metaData is a data frame containing sample IDs and conditions corresponding to pls. I verified that the steps below run error-free when run with the individual elements of cond.
I reviewed relevant posts on referencing variables in R:
How to reference variable names in a for loop in R?
How to reference a variable in a for loop?
Based on these posts, I modified i in line 3 with single brackets, double brackets, and as.name(). No luck. DESeqDataSetFromMatrix reads the literal text after ~ and spits out an error.
cond=c("wt","dhx","mpp","taz")
for(i in cond){
dds <- DESeqDataSetFromMatrix(countData=pls,colData=metaData,design=~i, tidy = TRUE)
"sizeFactors"(dds) <- 1
paste0("PLS",i)<-DESeq(dds)
pdf <- paste(i,"-PLS_MA.pdf",sep="")
tsv <- paste(i,"-PLS.tsv",sep="")
pdf(file=pdf,paper = "a4r", width = 0, height = 0)
plotMA(paste0("PLS",i),ylim=c(-10,10))
dev.off()
write.table(results(paste0("PLS",i)),file = tsv,quote=FALSE, sep='\t', col.names = NA)
}
With brackets, an "unexpected symbol" error appears.
With i alone, DESeqDataSetFromMatrix tries to read "i" from my metaData columns.
Is R just not capable of reading variables in some situations? Generally speaking, is it better to write loops outside of R in a more straightforward language, then push them as standalone commands? Thanks for the help; I hope there is an easy fix.
For anyone else who may be having trouble looping with DESeq2 functions, the comments above addressed my issue.
Correct input:
dds <- DESeqDataSetFromMatrix(countData=pls,colData=metaData,design=as.formula(paste0("~", i)), tidy = TRUE)
as.formula worked well with all DESeq functions that I tested.
reformulate(i) also worked well in most situations.
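A minimal sketch of what the two approaches build, using one element of cond:
i <- "wt"
as.formula(paste0("~", i))  # ~wt
reformulate(i)              # also ~wt
One more note for anyone reusing the loop above: the paste0("PLS",i) <- DESeq(dds) line fails for a similar reason, since a function call cannot be the target of an assignment; assign(paste0("PLS", i), DESeq(dds)) together with get(paste0("PLS", i)) is one way to handle the dynamic names.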
Thanks, everyone, for the help!
Currently, I am using a foreach loop from the doParallel library to run function calls in parallel across multiple cores of the same machine, which looks something like this:
library(doParallel)
registerDoParallel()  # register a parallel backend so %dopar% actually runs in parallel

out_results <- foreach(i = 1:length(some_list)) %dopar% {
  out <- function_call(some_list[[i]])
  return(out)
}
Here some_list is a list of data frames, and each data frame has a different number of columns. function_call() is a function that does multiple things to the data, such as data manipulation, then uses a random forest for variable selection, and finally performs a least-squares fit. The variable out is again a list of 3 data frames, so out_results will be a list of lists.
Inside the function call I am using CRAN libraries and some custom libraries I created myself. I want to avoid using Spark's ML libraries due to their limited functionality, as that would mean rewriting the entire code.
I want to leverage Spark for running these function calls in parallel. Is it possible to do so? If yes, in which direction should I be thinking? I have read a lot of the sparklyr documentation, but it doesn't seem to help much, since the examples provided there are very basic.
sparklyr's homepage gives examples of arbitrary R code distributed on the Spark cluster. In particular, see their example with grouped operations.
Your main structure should be a data frame, which you will process rowwise. Probably something like the following (not tested):
library(dplyr)
library(sparklyr)

some_list = list(tibble(a=1[0]), tibble(b=1), tibble(c=1:2))
all_data = tibble(i = seq_along(some_list), df = some_list)
# Replace this with your actual code.
# Should get one dataframe and produce one dataframe.
# Embedded dataframe columns are OK
transform_one = function(df_wrapped) {
# in your example, you expect only one record per group
stopifnot(nrow(df_wrapped)==1)
df = df_wrapped$df
res0 = df
res1 = tibble(x=10)
res2 = tibble(y=10:11)
return(tibble(res0 = list(res0), res1 = list(res1), res2 = list(res2)))
}
# Note: spark_apply() runs over a Spark data frame, so all_data would
# first need to be copied into Spark (e.g. with copy_to()).
all_data %>% spark_apply(
transform_one,
group_by = c("i"),
columns = c("res0"="list", "res1"="list", "res2"="list"),
packages = c("randomForest", "etc")
)
All in all, this approach seems unnatural, as if we were forcing Spark onto a task that does not really fit it. Maybe you should look into another parallelization framework?
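For instance, a minimal sketch with the future.apply package (just one such framework, offered here as my own suggestion):
library(future.apply)

# Run the same function calls on local worker processes;
# plan(cluster, workers = ...) would target remote machines instead.
plan(multisession)
out_results <- future_lapply(some_list, function_call)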
I work with multidimensional arrays, and when I need to plot I usually convert my data to a tibble through tbl_cube and then plot it with ggplot2. Today the new dplyr 1.0.0 was released to CRAN, and I found that tbl_cube is no longer available; I could not find a replacement for it. Before today, I did something like this toy example to get a plot:
library(dplyr)    # pre-1.0.0, where tbl_cube still lived
library(ggplot2)

test_data1 <- array(1:50, c(5,5,2))
test_data2 <- array(51:100, c(5,5,2))
# list of my arrays
test_data <- list(exp1 = test_data1, exp2= test_data2)
# list of the dimensions
dims_list <- list(lat = 1:5, lon = 1:5, var = c('u','v'))
new_data <- as_tibble(tbl_cube(dimensions = dims_list, measures = test_data))
# Make some random plot
ggplot(new_data, aes(x=lon,y=lat)) +
geom_tile(aes(fill=exp2))+
geom_contour(aes(z=exp1),col='black')
This example runs and works with the previous dplyr release, but not anymore, since tbl_cube no longer exists. I know that in this example the third dimension is not used in the plot, but I wanted to show that I need something that works over at least a 3D array, or even 4D.
Any suggestion of how to solve this in a way as easy as tbl_cube?
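One possible fix: the dplyr 1.0.0 release notes say tbl_cube() was moved out of dplyr into the new cubelyr package, so loading that package should make the old code work again. A minimal sketch (not tested against your real arrays):
library(cubelyr)  # tbl_cube() now lives here

new_data <- as_tibble(tbl_cube(dimensions = dims_list, measures = test_data))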
I'm in the swirl() course Getting and Cleaning Data, Lesson 2: Grouping and Chaining with dplyr, and I receive an error when submitting the summarize1.R script. The script I'm submitting is:
pack_sum <- summarize(by_package,
count = n(),
unique = n_distinct(ip_id),
countries = n_distinct(country),
avg_bytes = mean(size))
The resulting error is:
"Error : object '' not found"
I'm using R version 3.6.1, on Windows 10, with {dplyr} version 0.8.4, and {swirl} version 2.4.5.
Thank you!
Natya
I faced the same problem a while ago. The problem is in count = n(); try the following:
pack_sum <- summarize(by_package,
count = sum(n()),
unique = n_distinct(ip_id),
countries = n_distinct(country),
avg_bytes = mean(size))
I have some R code using the readr package that works well on a local computer: I use list.files() to find files with a specific extension and then use readr to operate on the files found.
My question: I want to do something similar with files in AWS S3, and I am looking for pointers on how to adapt my current R code to do the same.
Thanks in advance.
What I want:
Given an AWS folder/file structure like this:
- /folder1/subfolder1/quant.sf
- /folder1/subfolder2/quant.sf
- /folder1/subfolder3/quant.sf
and so on, where every subfolder has the same file quant.sf, I would like to get a data frame of the S3 paths, and I want to use the R code shown below to operate on all the quant.sf files.
Below is the R code that currently works with data on a Linux machine.
get_quants <- function(path1, ...) {
additionalPath = list(...)
suppressMessages(library(tximport))
suppressMessages(library(readr))
salmon_filepaths=file.path(path=path1,list.files(path1,recursive=TRUE, pattern="quant.sf"))
samples = data.frame(samples = gsub(".*?quant/salmon_(.*?)/quant.sf", "\\1", salmon_filepaths) )
row.names(samples)=samples[,1]
names(salmon_filepaths)=samples$samples
# If no tx2gene is available, we will only get isoform-level counts
salmon_tx_data = tximport(salmon_filepaths, type="salmon", txOut = TRUE)
## Get transcript count summarization
write.csv(as.data.frame(salmon_tx_data$counts), file = "tx_NumReads.csv")
## Get TPM
write.csv(as.data.frame(salmon_tx_data$abundance), file = "tx_TPM_Abundance.csv")
if(length(additionalPath) > 0) {
tx2geneFile = additionalPath[[1]]
my_tx2gene=read.csv(tx2geneFile,sep = "\t",stringsAsFactors = F, header=F)
salmon_tx2gene_data = tximport(salmon_filepaths, type="salmon", txOut = FALSE, tx2gene=my_tx2gene)
## Get Gene count summarization
write.csv(as.data.frame(salmon_tx2gene_data$counts), file = "tx2gene_NumReads.csv")
## Get TPM
write.csv(as.data.frame(salmon_tx2gene_data$abundance), file = "tx2gene_TPM_Abundance.csv")
}
}
I find it easiest to use the aws.s3 R package for this. In this case, you would use the s3read_using() and s3write_using() functions to read from and save to S3. Like this:
library(aws.s3)
my_tx2gene=s3read_using(FUN=read.csv, object="[path_in_s3_to_file]",sep = "\t",stringsAsFactors = F, header=F)
It is basically a wrapper around whatever function you want to use for file input/output. It works great with read_json, saveRDS, or anything else!
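For the list.files() part of the question, here is a minimal sketch using aws.s3's get_bucket_df() ("my-bucket" is a hypothetical bucket name):
library(aws.s3)

# List objects under folder1/ as a data frame and keep the quant.sf keys.
objects <- get_bucket_df(bucket = "my-bucket", prefix = "folder1/", max = Inf)
quant_keys <- objects$Key[grepl("quant\\.sf$", objects$Key)]

# tximport() expects local paths, so one option is to download each file first.
local_paths <- vapply(quant_keys, function(k) {
  dest <- file.path(tempdir(), gsub("/", "_", k))
  save_object(object = k, bucket = "my-bucket", file = dest)
}, character(1))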