I am trying to get a .csv file from an S3 bucket.
I am using
x = getFile()
from the RAmazonS3 package, and converting the result to a character string using
y = rawToChar(x)
I wish to convert y to a data table with the same structure as the CSV file (2 columns); lines are separated by "\n" and fields within a line by ",".
Code:
table = rawToChar(getFile("bucket/file.csv"))
Output:
"\"id\",\"name\"\n\"12\",\"Member 12\"\n\"123\",\"Member 123\"\n\"1234\",\"Member 1234\"\n\"12345\....
I would like it to be a two-column data.table/frame of the form:
id name
12 Member 12
123 Member 123
1234 Member 1234
12345 Member 12345
Any suggestions on an efficient way to perform this conversion?
Any other (more useful) suggestions on how to retrieve CSV files from an Amazon S3 bucket into a table in R are more than welcome, of course.
I am using:
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
Any help with this issue would be appreciated,
Thank you
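A conversion sketch: base R's read.csv() (and data.table's fread()) can parse the in-memory string directly, so no temporary file is needed. The y value below is a hard-coded stand-in for the rawToChar(getFile(...)) output shown above.

```r
# stand-in for y <- rawToChar(getFile("bucket/file.csv"))
y <- "\"id\",\"name\"\n\"12\",\"Member 12\"\n\"123\",\"Member 123\"\n"

# base R: the `text` argument parses a character string as CSV input
df <- read.csv(text = y, stringsAsFactors = FALSE)

# data.table: fread() accepts the string directly and is faster on
# large inputs; uncomment if data.table is installed
# dt <- data.table::fread(y)
```

The result is a two-column data.frame with id parsed as integer and name as character.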
Related
Trying to read an SPSS file (.sav format) in R raises:
Error: file is not in any supported SPSS format.
This happens when trying to read the .sav file with foreign and read.spss.
Trying the memisc package and its as.data.set(spss.system.file("my_file")) raises:
Error in spss.readheader(file) : not a sysfile
The file is a very long SPSS file containing over 2 million entries and hundreds of factors. The factors vary: many are categorical ("Yes" / "No" / "Missing" / "None"), some are numerical (IDs etc.), some are labelled with text ("State One" / "State 2" / "State 3"), and some are mixed ("1" / "20" / "3732" / "Technical Problem"). Sadly, I can't give you a subset of my data (severe restrictions on privacy, and I don't have an SPSS license).
Reading this file in and storing it as a feather file (.fea format) has already worked on another computer - that machine may have had another version of R installed, but I have no way of checking which version that was.
Currently, I'm working in R version 3.4.4 (2018-03-15) on Windows 10, and use packages memisc_0.99.17.2 and foreign_0.8-71. The file is stored on a server; my R is installed under a user on the local drive.
This is the code I've tried:
require(foreign)
ws <- "my_workspace_in_local_user"
setwd(ws)
dataDir <- "my_directory_on_the_server_containing_the_file"
fn <- paste0(dataDir, "my_file.sav")
dat <- read.spss(fn, to.data.frame = TRUE)
and
require(foreign)
ws <- "my_workspace_in_local_user"
setwd(ws)
dataDir <- "my_directory_on_the_server_containing_the_file"
fn <- paste0(dataDir, "my_file.sav")
install.packages("memisc")
require("memisc")
dat <- as.data.set(spss.system.file(fn))
Does anybody have an idea why this wouldn't work? I suspect it's a problem with which versions of R and the packages to use...?
Your first set of code worked for me on macOS 10.15.1 (Catalina) and R 3.6.1 with memisc_0.99.17.2 and foreign_0.8-71.
R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)
> require(foreign)
Loading required package: foreign
> dataDir <- "~/Samples/English/"
> fn <- paste0(dataDir, "accidents.sav")
> dat <- read.spss(fn, to.data.frame = TRUE)
> print(dat)
agecat gender accid pop
1 Under 21 Female 57997 198522
2 21-25 Female 57113 203200
3 26-30 Female 54123 200744
4 Under 21 Male 63936 187791
5 21-25 Male 64835 195714
6 26-30 Male 66804 208239
The "accidents.sav" file is example data that ships with IBM SPSS Statistics versions 19.0 through 26.0.
If this code works for you against known data from IBM SPSS, then you can probably rule out your R version and configuration as a cause. Unfortunately that probably means your *.sav file is corrupted in some way.
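As an aside (a suggestion of mine, not something the asker mentioned trying): the haven package uses a different SPSS parser than foreign or memisc and sometimes reads files those packages reject, so it may be worth one more attempt before concluding the file is corrupted:

```r
# install.packages("haven")  # if not already installed
library(haven)

dat <- read_sav(fn)       # returns a tibble with labelled columns
dat <- as_factor(dat)     # optionally convert labelled values to factors
```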
I am working with the following data set: https://www.kaggle.com/crowdflower/twitter-user-gender-classification
My goal is to develop an R script to detect gender through emojis, but they are stored in a strange encoding that I can't convert to Unicode, which would let me look them up in one of the emoji dictionaries. I tried iconv, but it only produced escaped byte sequences, and I don't know how to convert those to Unicode.
Here is an example with one of the tweets from the data set:
new <- iconv("Its a double capsule day _Ù÷ã_Ù÷ã 27 varieties of fruit and veg...in a capsule, simples _Ù÷ã_Ù÷ã #fruitandveg #juiceplus #health", from="utf-8", to="UNICODE", "byte")
[1] "Its a double capsule day _<d9><f7><e3>_<d9><f7><e3> 27 varieties of fruit and veg...in a capsule, simples _<d9><f7><e3>_<d9><f7><e3> #fruitandveg #juiceplus #health"
Any help?
Thanks in advance
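This sketch doesn't decode the mangled sequences themselves (that depends on what encoding the file was really written in, which the sample alone doesn't settle), but it shows the two diagnostic steps that usually help: inspecting the raw bytes, and converting valid UTF-8 text to U+ code points that can be matched against an emoji dictionary. The string s below is a made-up example, not data from the Kaggle file.

```r
# sample string containing a non-ASCII character and an emoji
s <- "caf\u00e9 \U0001F600"

# step 1: look at the raw bytes to see what encoding you actually have
charToRaw(s)

# step 2: once the text is valid UTF-8, map each character to its
# Unicode code point in the usual U+XXXX notation
code_points <- sprintf("U+%04X", utf8ToInt(s))
# "U+0063" "U+0061" "U+0066" "U+00E9" "U+0020" "U+1F600"
```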
I am trying to merge two large data sets, as I need to create a final training set for my models to run.
head(TrainWithAppevents_rel4)
event_id | device_id | gender | age | group | phone_brand | device_model | numbrand | nummodel | app_id
6 6 1476664663289716480 M 19 M22- 华为 Mate 7 29 919 4348659952760821248
and
head(app_labels)
app_id |label_id
1 7324884708820028416 251
The first data set now has unique rows, as I have already removed all duplicates.
I want my final set to have the columns below:
event_id device_id gender age group phone_brand device_model numbrand nummodel app_id label_id
However, when I try to merge using the line below in R (RStudio session),
TrainWithLabels=merge(x=TrainWithAppevents_rel4,y=app_labels,by="app_id",all.x = TRUE)
I get the following error:
**Error: cannot allocate vector of size 512.0 Mb**
The error varies between runs, but only in the reported vector size.
The sizes of my data sets are as follows:
> dim(TrainWithAppevents_rel4)
[1] 4787796 10
> dim(app_labels)
[1] 459943 2
More information about the machine/R I use:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
I use an Intel 2.6 GHz processor / 16 GB RAM / 64-bit OS / Windows 10 / x64-based processor.
I have tried the following:
- Reducing the data set by removing duplicates and unwanted columns; all rows in the first data set are now unique
- Closing all other applications on my laptop and then running the merge - it still fails
- Executing gc() and then running the merge
I have gone through similar questions on SO for R; however, none of them offered a way forward specific to merges failing on a 64-bit machine.
Can anyone please help by suggesting a solution or a workaround to move forward?
Please assume that this is the only machine where I can execute the code, and that running this R script on AWS via Zeppelin is not possible at the moment.
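One workaround sketch: a data.table join avoids some of the intermediate copies that base merge() makes, which can be the difference on a 16 GB machine. This assumes both objects can be converted to data.table in place:

```r
library(data.table)

# convert in place, without copying the data
setDT(TrainWithAppevents_rel4)
setDT(app_labels)

# left join keeping every row of the big table; equivalent to
# merge(x = TrainWithAppevents_rel4, y = app_labels,
#       by = "app_id", all.x = TRUE)
TrainWithLabels <- app_labels[TrainWithAppevents_rel4, on = "app_id"]
```

As with merge(), an app_id that maps to several label_id values will duplicate rows in the result, so the output can be larger than either input.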
I apologize in advance for the somewhat lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and redo the paring each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source, which can take only two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors, and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for the manufacturer names, and it looks like those are leaking into the src column when I use the write_csv function.
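A sketch of a workaround rather than a fix for the CSV issue itself: since this file is only an intermediate cache for your own script, serializing with saveRDS() sidesteps CSV parsing entirely and preserves column types exactly (the paths below mirror the ones in the question):

```r
# serialize the data frame; no delimiters, quoting, or embedded
# nulls to trip over on the way back in
saveRDS(sa_all, "D:\\Open_Payments\\data\\written_files\\sa_all.rds")

# in later sessions, read it back and carry on
sa_all2 <- readRDS("D:\\Open_Payments\\data\\written_files\\sa_all.rds")
table(sa_all2$src)
```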
This question already has answers here:
Save a plot in an object
(4 answers)
Closed 7 years ago.
Two methods of storing plot objects in a list or under a name string are mentioned on this page: Generating names iteratively in R for storing plots. But neither seems to work on my system.
> plist = list()
> plist[[1]] = plot(1:30)
>
> plist
list()
>
> plist[[1]]
Error in plist[[1]] : subscript out of bounds
Second method:
> assign('pp', plot(1:25))
>
> pp
NULL
I am using:
> R.version
_
platform i486-pc-linux-gnu
arch i486
os linux-gnu
system i486, linux-gnu
status
major 3
minor 2.0
year 2015
month 04
day 16
svn rev 68180
language R
version.string R version 3.2.0 (2015-04-16)
nickname Full of Ingredients
Where is the problem?
Use recordPlot and replayPlot. Base graphics functions like plot() draw to the device as a side effect and return NULL, which is why assigning their result stores nothing:
plot(BOD)
plt <- recordPlot()
plot(0)
replayPlot(plt)
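Applied to the list example from the question, a sketch:

```r
plist <- list()

plot(1:30)
plist[[1]] <- recordPlot()   # capture the plot just drawn

plot(1:25)
plist[[2]] <- recordPlot()

replayPlot(plist[[1]])       # redraws the first plot
```

Note that recordPlot() needs an open graphics device with the display list enabled; that is the default for screen devices, but file devices such as pdf() require dev.control(displaylist = "enable") first.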