R: Read a CSV with comma as decimal separator using the sparklyr package

I need to read a ".csv" file using the "sparklyr" library, in which numeric values use a comma as the decimal separator. The idea is to read it directly with "spark_read_csv()".
I am using:
library(sparklyr)
library(dplyr)
f <- data.frame(DNI = c("22-e", "EE-4", "55-W"),
                DD = c("33,2", "33.2", "14,55"),
                CC = c("2", "44,4", "44,9"))
write.csv(f, "aff.csv")
sc <- spark_connect(master = "local", spark_home = "/home/tomas/spark-2.1.0-bin-hadoop2.7/", version = "2.1.0")
df <- spark_read_csv(sc, name = "data", path = "/home/tomas/Documentos/Clusterapp/aff.csv", header = TRUE, delimiter = ",")
tbl <- sdf_copy_to(sc = sc, x = df, overwrite = T)
The problem: the numbers are read as factors.

To manipulate strings inside a Spark DataFrame you can use the regexp_replace function, as mentioned here:
https://spark.rstudio.com/guides/textmining/
For your problem it would work out like this:
tbl <- sdf_copy_to(sc = sc, x = df, overwrite = T)
tbl0 <- tbl %>%
  mutate(DD = regexp_replace(DD, ",", "."),
         CC = regexp_replace(CC, ",", ".")) %>%
  mutate_at(vars(c("DD", "CC")), as.numeric)
To check your result:
> glimpse(tbl0)
Observations: ??
Variables: 3
$ DNI <chr> "22-e", "EE-4", "55-W"
$ DD <dbl> 33.20, 33.20, 14.55
$ CC <dbl> 2.0, 44.4, 44.9

If you don't want to replace the commas with '.', maybe you can try this:
spark_read_csv
Check the documentation. Use the escape parameter to specify which character you are trying to ignore. In this case try using:
df <- spark_read_csv(sc, name = "data", path = "/home/tomas/Documentos/Clusterapp/aff.csv", header = TRUE, delimiter = ",", escape = "\,")

You could replace the "," in the numbers with "." and convert them to numeric. For instance:
df$DD <- as.numeric(gsub(pattern = ",", replacement = ".", x = df$DD))
Does that help?
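Note that gsub is base R, so this operates on a local data frame rather than on the Spark table itself. A minimal sketch, assuming the data are small enough to collect into R first:
# Bring the Spark table into local R memory, then convert the columns
local_df <- collect(df)
local_df$DD <- as.numeric(gsub(",", ".", local_df$DD))
local_df$CC <- as.numeric(gsub(",", ".", local_df$CC))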

Related

R: Importing file using rio and here packages in a nested function

I'm working on functions that take the character string argument GSE_expt. I have written 4 separate functions which take the argument GSE_expt and produce output that I can save as a variable in the R environment.
The code block below has 2 of those functions. I use the paste0 function with the variable GSE_expt to create a file name that the here and rio packages can use to import the file.
# Extracting metadata from 2 different sources and combining them into a single file
extract_metadata <- function(GSE_expt){
  GSE_expt <- deparse(substitute(GSE_expt)) # make sure it is a character string
  metadata_1 <- rnaseq_metadata_allsamples %>% # subset a larger metadata file
    as_tibble %>%
    dplyr::filter(GSE == GSE_expt)
  # metadata from ENA imported using rio and here packages
  metadata_2 <- import(here("metadata", "rnaseq", paste0(GSE_expt, ".txt"))) %>%
    as_tibble %>%
    select("run_accession", "library_layout", "library_strategy", "library_source",
           "read_count", "base_count", "sample_alias", "fastq_md5")
  metadata <- full_join(metadata_1, metadata_2, by = c("Run" = "run_accession"))
  return(metadata)
}
# Extracting coverage stats obtained from samtools
clean_extract_coverage <- function(GSE_expt){
  coverage <- read_tsv(file = here("results", "rnaseq", "2022-01-11", "coverage",
                                   paste0("coverage_stats_", deparse(substitute(GSE_expt)), "_percent.txt")),
                       col_names = FALSE)
  coverage <- data.frame("Run" = coverage$X1[c(TRUE, FALSE)],
                         "stats" = coverage$X1[c(FALSE, TRUE)])
  coverage <- separate(coverage, stats, into = c("num_reads", "covered_bases", "coverage_percent"), convert = TRUE)
  return(coverage)
}
The functions work fine on their own when I use GSE118008 as the argument for GSE_expt.
I am trying to create a nested/combined function so that I can run GSE118008 through both (or more) functions at the same time and save the output as a list.
When I run the nested/combined function,
extract_coverage_metadata <- function(GSE_expt){
  coverage <- clean_extract_coverage(GSE_expt)
  metadata <- extract_metadata(GSE_expt)
  return(metadata)
}
extract_coverage_metadata(GSE118008)
This is the error message I got.
Error: 'results/rnaseq/2022-01-11/coverage/coverage_stats_GSE_expt_percent.txt' does not exist.
Rather than creating the filename
coverage_stats_GSE118008_percent.txt
(which it does fine in the individual function), it is unable to do so in the combined function, and instead produces the filename coverage_stats_GSE_expt_percent.txt.
Traceback
8. stop("'", path, "' does not exist", if (!is_absolute_path(path)) { paste0(" in current working directory ('", getwd(), "')") }, ".", call. = FALSE)
7. check_path(path)
6. (function (path, write = FALSE) { if (is.raw(path)) { return(rawConnection(path, "rb")) ...
5. vroom_(file, delim = delim %||% col_types$delim, col_names = col_names, col_types = col_types, id = id, skip = skip, col_select = col_select, name_repair = .name_repair, na = na, quote = quote, trim_ws = trim_ws, escape_double = escape_double, escape_backslash = escape_backslash, ...
4. vroom::vroom(file, delim = "\t", col_names = col_names, col_types = col_types, col_select = { { col_select ...
3. read_tsv(file = here("results", "rnaseq", "2022-01-11", "coverage", paste0("coverage_stats_", deparse(substitute(GSE_expt)), "_percent.txt")), col_names = FALSE) at rnaseq_functions.R#30
2. clean_extract_coverage(GSE_expt)
1. extract_coverage_metadata(GSE118008)
I would appreciate any recommendations on how to solve this.
Thanks in advance!
Husain
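The culprit is deparse(substitute(GSE_expt)): substitute() captures the expression at the immediate call site, so inside the nested function it sees the wrapper's argument name rather than GSE118008. A minimal sketch of the behaviour (my own illustration, not from the original thread):
# substitute() only sees the expression passed at the immediate call site
f <- function(x) deparse(substitute(x))
g <- function(x) f(x)
f(GSE118008)  # "GSE118008"
g(GSE118008)  # "x" -- the inner argument name, not the original symbol
One possible fix, assuming the inner functions are rewritten to accept plain character strings: call deparse(substitute()) once at the top level of the wrapper and pass the resulting string ("GSE118008") down to clean_extract_coverage() and extract_metadata().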

Using tidyverse to read data from s3 bucket

I'm trying to read a .csv file stored in an S3 bucket, and I'm getting errors. I'm following the instructions here, but either they don't work or I'm making a mistake that I can't spot.
Here's what I'm trying to do:
# I'm working on a SageMaker notebook instance
library(reticulate)
library(tidyverse)
sagemaker <- import('sagemaker')
sagemaker.session <- sagemaker$Session()
region <- sagemaker.session$boto_region_name
bucket <- "my-bucket"
prefix <- "data/staging"
bucket.path <- sprintf("https://s3-%s.amazonaws.com/%s", region, bucket)
role <- sagemaker$get_execution_role()
client <- sagemaker.session$boto_session$client('s3')
key <- sprintf("%s/%s", prefix, 'my_file.csv')
my.obj <- client$get_object(Bucket=bucket, Key=key)
my.df <- read_csv(my.obj$Body) # This is where it all breaks down:
##
## Error: `file` must be a string, raw vector or a connection.
## Traceback:
##
## 1. read_csv(my.obj$Body)
## 2. read_delimited(file, tokenizer, col_names = col_names, col_types = col_types,
## . locale = locale, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment, n_max = n_max, guess_max = guess_max,
## . progress = progress)
## 3. col_spec_standardise(data, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment, guess_max = guess_max, col_names = col_names,
## . col_types = col_types, tokenizer = tokenizer, locale = locale)
## 4. datasource(file, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment)
## 5. stop("`file` must be a string, raw vector or a connection.",
## . call. = FALSE)
When working with Python, I can read a CSV file using something like this:
import pandas as pd
# ... Lots of boilerplate code
my_data = pd.read_csv(client.get_object(Bucket=bucket, Key=key)['Body'])
This is very similar to what I'm trying to do in R, and it works with Python... so why does it not work in R?
Can you point me in the right direction?
Note: Although I could use a Python kernel for this, I'd like to stick to R, because I'm more fluent with it than with Python, at least when it comes to dataframe crunching.
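The direct cause of the error: my.obj$Body is a botocore StreamingBody, a Python object wrapped by reticulate, not a string, raw vector, or connection, which is exactly what the readr message complains about. A minimal workaround sketch (my own suggestion, not from the original answer): read the object's bytes into R first, since reticulate converts Python bytes to an R raw vector and read_csv accepts a raw vector.
# Pull the bytes into R; reticulate converts Python bytes to an R raw vector
raw_bytes <- my.obj$Body$read()
my.df <- read_csv(raw_bytes)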
I'd recommend trying the aws.s3 package instead:
https://github.com/cloudyr/aws.s3
Pretty simple - set your env variables:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey",
           "AWS_DEFAULT_REGION" = "us-east-1",
           "AWS_SESSION_TOKEN" = "mytoken")
and then once that is out of the way:
aws.s3::s3read_using(read.csv, object = "s3://bucket/folder/data.csv")
Update: I see you're also already familiar with boto and trying to use reticulate, so I'm leaving this easy wrapper for that here:
https://github.com/cloudyr/roto.s3
It looks like it has a nice API; for example, the argument layout you're aiming to use:
download_file(
  bucket = "is.rud.test",
  key = "mtcars.csv",
  filename = "/tmp/mtcars-again.csv",
  profile_name = "personal"
)
read_csv("/tmp/mtcars-again.csv")

Why do string entries not get read into the data.frame in R?

I have a data.tsv file (tabs separate entries). The full file can be found here.
The entries in the file look like this:
">173D:C" "TVPGVXTVPGV" "CCSCCCCCCCC"
">173D:D" "TVPGVXTVPGV" "CCCCCCCCSCC"
">185D:A" "SAXVSAXV" "CCBCCCBC"
">1A0M:B" "GCCSDPRCNMNNPDYCX" "CCTTSHHHHHTCTTTCC"
">1A0M:A" "GCCSDPRCNMNNPDYCX" "CGGGSHHHHHHCTTTCC"
">1A0N:A" "PPRPLPVAPGSSKT" "CCCCCCCCSTTCCC"
I am trying to read the string entries into a data frame (a matrix containing 3 columns):
data = data.frame(read.csv(file = './data.tsv', header = FALSE, sep = '\t'))
but only the first column is read. All other columns are empty.
I also tried different commands, such as
data = read.csv(file = './data.tsv', header = FALSE, sep = '\t')
data = read.csv(file = './data.tsv', sep = '\t')
data = data.frame(read.csv(file = './data.tsv'))
but without success. Can someone see why the input does not get read successfully?
Using the file defined reproducibly in the Note at the end, this works (read.table's default separator is any whitespace, so it handles the fields whether they are separated by tabs or by runs of spaces):
DF <- read.table("myfile.dat", as.is = TRUE)
gives:
> DF
V1 V2 V3
1 >173D:C TVPGVXTVPGV CCSCCCCCCCC
2 >173D:D TVPGVXTVPGV CCCCCCCCSCC
3 >185D:A SAXVSAXV CCBCCCBC
4 >1A0M:B GCCSDPRCNMNNPDYCX CCTTSHHHHHTCTTTCC
5 >1A0M:A GCCSDPRCNMNNPDYCX CGGGSHHHHHHCTTTCC
6 >1A0N:A PPRPLPVAPGSSKT CCCCCCCCSTTCCC
Note
Lines <- '">173D:C" "TVPGVXTVPGV" "CCSCCCCCCCC"
">173D:D" "TVPGVXTVPGV" "CCCCCCCCSCC"
">185D:A" "SAXVSAXV" "CCBCCCBC"
">1A0M:B" "GCCSDPRCNMNNPDYCX" "CCTTSHHHHHTCTTTCC"
">1A0M:A" "GCCSDPRCNMNNPDYCX" "CGGGSHHHHHHCTTTCC"
">1A0N:A" "PPRPLPVAPGSSKT" "CCCCCCCCSTTCCC"'
writeLines(Lines, "myfile.dat")
Use sep = '' so that any run of whitespace (spaces or tabs) is treated as a field separator:
data = read.csv(file = './data.tsv', header = FALSE, sep = '')
See this answer.
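If you're unsure what the separator in the file actually is, a quick check (an extra suggestion, not from the original answers) is to inspect the raw first line:
# Show the first line verbatim; tabs print as "\t", spaces print as spaces
readLines("./data.tsv", n = 1)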

R Data file not converting to Stata file

I am getting this error and cannot figure out why. Any advice?
library(foreign)
x <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
write.dta(x, 'x.dta')
Error in write.dta(x, "x.dta") :
4 arguments passed to .Internal(nchar) which requires 3
The haven package works much better than foreign in this case as it will read strings (including empty strings) as string values.
library(haven)
x <- data.frame(a = "", b = 1, stringsAsFactors = FALSE)
write_dta(x, 'x.dta')
Alternatively, if you give parameter a a value other than an empty string when creating the data frame, foreign will be fine.
x <- data.frame(a = "a", b = 1, stringsAsFactors = FALSE)
write.dta(x, "y.dta")
As you're using an older version of Stata, haven is the way to go, since it lets you specify the version of Stata you wish the dta file to be compatible with.
write_dta( x, 'x.dta', version = 13 )
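To sanity-check the result (a quick extra step, not part of the original answer), read the file back in with haven:
# Read the Stata file back to confirm the empty string survived the round trip
haven::read_dta("x.dta")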

Specifying col type in Sparklyr (spark_read_csv)

I am reading a CSV into Spark using sparklyr:
schema <- structType(structField("TransTime", "array<timestamp>", TRUE),
                     structField("TransDay", "Date", TRUE))
spark_read_csv(sc, filename, "path", infer_schema = FALSE, schema = schema)
But get:
Error: could not find function "structType"
How do I specify column types using spark_read_csv?
Thanks in advance.
The structType function comes from Spark's Scala API. In sparklyr, to specify the data types you must pass them in the columns argument as a list. Suppose we have the following CSV (data.csv):
name,birthdate,age,height
jader,1994-10-31,22,1.79
maria,1900-03-12,117,1.32
The function to read the corresponding data is:
mycsv <- spark_read_csv(sc, "mydate",
                        path = "data.csv",
                        memory = TRUE,
                        infer_schema = FALSE, # attention to this
                        columns = list(
                          name = "character",
                          birthdate = "date", # or character because needs date functions
                          age = "integer",
                          height = "double"))
# integer = "INTEGER"
# double = "REAL"
# character = "STRING"
# logical = "INTEGER"
# list = "BLOB"
# date = character = "STRING" # not sure
To manipulate date types you must use Hive date functions, not R functions.
mycsv %>% mutate(birthyear = year(birthdate))
Reference: https://spark.rstudio.com/articles/guides-dplyr.html#hive-functions
We have an example of how to do that in one of the articles on the official sparklyr site; here is the link: http://spark.rstudio.com/example-s3.html#data_import
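The pattern shown there is roughly as follows (a sketch; the bucket and file names are placeholders, and s3a:// access assumes the cluster has the Hadoop AWS connector and AWS credentials configured):
# Read a CSV directly from S3 into Spark; bucket/path are hypothetical
library(sparklyr)
sc <- spark_connect(master = "local")
flights <- spark_read_csv(sc, name = "flights",
                          path = "s3a://my-bucket/flights/data.csv",
                          header = TRUE, infer_schema = TRUE)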
