I want to fetch a Parquet file from my S3 bucket using R. Spark is not installed on my server.
How can I read and write Parquet files in R without Spark? I am able to read and write data from S3 in other formats, but not Parquet.
My code is given below:
# Read csv file from s3
library(aws.s3)
obj <- get_object("s3://mn-dl.sandbox/Internal Data/test.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
data1 <- data

# Write csv data directly to s3
s3write_using(data1, FUN = write.csv,
              bucket = "mn-dl.sandbox",
              object = "Internal Data/abc.csv")
Thanks in advance
I'm definitely a rookie with R and AWS, so hopefully this is a universal solution and not just one that worked for me, but here's what I did.
install.packages("paws")
install.packages("arrow")
library(paws)
library(arrow)
s3 <- paws::s3(config=list(<your configurations here to give access to s3>))
object <- s3$get_object(Bucket = "path_to_bucket", Key = "file_name.parquet")
data <- object$Body
read_parquet(data)
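For writing a data frame back to S3 with the same paws client, something along the lines of the sketch below should work (an assumption on my part; the bucket name and key are placeholders, and arrow is only used to produce the Parquet bytes locally first):

library(arrow)

# `df` is a placeholder for your data frame (e.g. the result of read_parquet() above)
tmp <- tempfile(fileext = ".parquet")
write_parquet(df, tmp)

# upload the raw bytes with the same paws s3 client created above
s3$put_object(
  Body   = readBin(tmp, "raw", n = file.size(tmp)),
  Bucket = "path_to_bucket",      # placeholder bucket name
  Key    = "file_name.parquet"    # placeholder key
)
unlink(tmp)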
I have been using the read function, but this write function should work.
Even though you're using the arrow package, you shouldn't need a Spark server installed.
install.packages("aws.s3")
install.packages("arrow")
library(aws.s3)
library(arrow)
# Read data
obj <-get_object("s3://mn-dl.sandbox/Internal Data/test.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
data1 <- data
# Write csv data directly to s3
aws.s3::s3write_using(x = data1,
FUN = arrow::write_parquet,
bucket = "mn-dl.sandbox",
object = "Internal Data/abc.csv")
I am trying to conduct a basic bibliometric analysis using biblioshiny. However, since I have data from both Scopus and WoS, I am finding it difficult to combine them. So far, I have been able to import both datasets using code in R, and I have also already combined them. But I can't figure out how to use this combined data as input to the biblioshiny() app.
library(bibliometrix)

# Import WoS and Scopus data individually
m1 <- convert2df("WOS.txt", dbsource = "wos", format = "plaintext")
m2 <- convert2df("scopus.csv", dbsource = "scopus", format = "csv")

# Merge them
M <- mergeDbSources(m1, m2, remove.duplicated = TRUE)

# Create the results
results <- biblioAnalysis(M, sep = ";")
I just need to know how to export the results in a relevant format for data input in biblioshiny. Please help!
Put all of the WOS data files (in txt format) into a zip file and upload that zip file into biblioshiny. That's all you have to do.
Use this command:
library(openxlsx)
write.xlsx(results, file = "mergedfile.xlsx")
It will save the results to a file named mergedfile.xlsx.
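If biblioshiny does not accept that file, an alternative worth trying (an assumption on my part, since biblioshiny imports bibliographic data frames rather than biblioAnalysis() results) is to export the merged data frame M itself and load that file in the app:

library(openxlsx)

# export the merged bibliographic data frame M (not the biblioAnalysis results)
write.xlsx(M, file = "merged_data.xlsx")

# or save it as an RData file, which biblioshiny may also load (assumption)
save(M, file = "merged_data.RData")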
I'm working on a bibliometric analysis in R that requires me to work with data obtained from the Web of Science database. I import my data into R using the following code:
library(bibliometrix)

file1 <- "data1.txt"
data1 <- convert2df(file = file1, dbsource = "isi", format = "plaintext")
I have about 35 text files that I need to repeat this code for. I would like to do so using loops, something I don't have much experience with. I tried something like this but it did not work:
list_of_items <- c("file1", "file2")
dataset <- vector(mode = "numeric", length = length(list_of_items))
for (i in list_of_items){
dataset[i] <- convert2df(file = list_of_items[i], dbsource = "isi", format = "plaintext")
print(dataset)}
I get the following error:
Error in file(con, "r") : invalid 'description' argument
I'm not very familiar with using loops but I need to finish this work. Any help would be appreciated!
R wants to open a file called file1, but you only have file1.txt — the file names in your list are missing the .txt extension.
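For completeness, a corrected version of the original loop might look like the sketch below (assuming the files are named file1.txt, file2.txt, ... and sit in the working directory); a list is used because convert2df() returns a data frame per file:

library(bibliometrix)

list_of_items <- c("file1.txt", "file2.txt")   # note the .txt extensions
dataset <- vector(mode = "list", length = length(list_of_items))

for (i in seq_along(list_of_items)) {
  dataset[[i]] <- convert2df(file = list_of_items[i],
                             dbsource = "isi", format = "plaintext")
}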
I once had that problem as well; maybe this solution works for you too.
Maybe put all the text files in a folder and read the whole folder; this might be easier.
FOLDER_PATH <- "C:\\your\\path\\here"  # paste it from the Explorer bar (Windows); beware to replace \ with \\

file_paths <- list.files(path = FOLDER_PATH,
                         full.names = TRUE)  # so you do not change working dir
# please put only txt files in the folder, or see the pattern argument in ?list.files

# using lapply you do not need a for loop, but this is optional
dataset <- lapply(file_paths, convert2df, dbsource = "isi", format = "plaintext")
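If you then need a single data frame for the analysis, one option (a sketch; mergeDbSources() is used here mainly for its duplicate removal, since all files come from the same WoS export) is:

# combine the per-file data frames into one bibliographic data frame
M <- do.call(mergeDbSources, c(dataset, list(remove.duplicated = TRUE)))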
I am trying to read in a data set from SAS using the unz() function in R. I do not want to unzip the file. I have successfully used the following to read one of them in:
dir <- "C:/Users/michael/data/"
setwd(dir)
dir_files <- as.character(unzip("example_data.zip", list = TRUE)$Name)
ds <- read_sas(unz("example_data.zip", dir_files))
That works great. I'm able to read the data set in and conduct the analysis. When I try to read in another data set, though, I encounter an error:
dir2_files <- as.character(unzip("data.zip", list = TRUE)$Name)
ds2 <- read_sas(unz("data.zip", dir2_files))
Error in read_connection_(con, tempfile()) :
Evaluation error: error reading from the connection.
I have read other questions on here saying that the file path may be incorrectly specified. Some answers mentioned submitting list.files() to the console to see what is listed.
list.files()
[1] "example_data.zip" "data.zip"
As you can see, both zip files are listed, and I was able to read the data set in from "example_data.zip", but I cannot access the data in "data.zip".
What am I missing? Thanks in advance.
Your "dir2_files" is String vector of the names of different files in "data.zip". So for example if the files that you want to read have them names at the positions "k" in "dir_files" and "j" in "dir2_files" then let update your script like that:
dir <- "C:/Users/michael/data/"
setwd(dir)
dir_files <- as.character(unzip("example_data.zip", list = TRUE)$Name)
ds <- read_sas(unz("example_data.zip", dir_files[k]))
dir2_files <- as.character(unzip("data.zip", list = TRUE)$Name)
ds2 <- read_sas(unz("data.zip", dir2_files[j]))
I'm trying to convert a large JSON file (6GB) into a CSV to more easily load it into R. I happened upon this solution (from https://community.rstudio.com/t/how-to-read-large-json-file-in-r/13486/33):
library(sparklyr)
library(dplyr)
library(jsonlite)
Sys.setenv(SPARK_HOME="/usr/lib/spark")
# Configure cluster (c3.4xlarge 30G 16core 320disk)
conf <- spark_config()
conf$'sparklyr.shell.executor-memory' <- "7g"
conf$'sparklyr.shell.driver-memory' <- "7g"
conf$spark.executor.cores <- 20
conf$spark.executor.memory <- "7G"
conf$spark.yarn.am.cores <- 20
conf$spark.yarn.am.memory <- "7G"
conf$spark.executor.instances <- 20
conf$spark.dynamicAllocation.enabled <- "false"
conf$maximizeResourceAllocation <- "true"
conf$spark.default.parallelism <- 32
sc <- spark_connect(master = "local", config = conf, version = '2.2.0')
sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                              header = TRUE, memory = FALSE, overwrite = TRUE)
sdf_schema_viewer(sample_tbl)
I've never used Spark before, and I'm trying to understand where the data I loaded actually lives, and how I can write it out to a CSV from RStudio.
I'm not sure about sparklyr, but if you are trying to read a large JSON file and write it out as a CSV using SparkR, below is sample code for that.
This code will run only in a Spark environment, not in RStudio.
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"
# Create a DataFrame from the file(s) pointed to by path
people <- read.json(path)
# Write data frame to CSV
write.df(people, "people.csv", "csv")
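To answer the sparklyr side of the question: the data stays inside Spark, and sample_tbl is only a reference to it in the R session. A minimal sketch (reusing sc and sample_tbl from the question; the output path is a placeholder) would be:

library(sparklyr)
library(dplyr)

# write the Spark table out as CSV (Spark produces a directory of part files)
spark_write_csv(sample_tbl, path = "example_csv")

# or pull a small subset into a regular R data frame
local_df <- sample_tbl %>% head(1000) %>% collect()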
This seems to work for CSV, but I need to upload a Parquet file:
library(AzureStor)
bl_endp_key <- storage_endpoint("url", key="key")
cont <- storage_container(bl_endp_key, "containername")
csv <- serialize(dataframe, connection = NULL, ascii = TRUE)
con <- rawConnection(csv)
upload_blob(cont, src=con, dest="output.csv")
You can use the arrow package to write R data frames to Parquet files. See https://arrow.apache.org/docs/r/
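For example, a minimal sketch (assuming dataframe and cont from the code above) is to write the Parquet file to a temporary path with arrow and upload that file instead of a raw connection:

library(arrow)
library(AzureStor)

# write the data frame to a temporary local Parquet file
tmp <- tempfile(fileext = ".parquet")
write_parquet(dataframe, tmp)

# upload the file to the blob container created above
upload_blob(cont, src = tmp, dest = "output.parquet")
unlink(tmp)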