I have some csv datasets (so-called DATRAS "exchange data" from ICES) which are in an unusual format making it a challenge to efficiently import them to my R workspace.
The files are laid out like so:
V1 V2 V3 ... V60
................
................
................
V2.1 V2.2 V2.3 ... V2.27
........................
........................
V3.1 V3.2 V3.3 ... V3.27
........................
........................
So, the issue is two-fold: there are three sets of (related) data stacked on top of one another in the csv file which I wish to be able to import as three separate objects in R; and these datasets have different dimensions and colnames.
So far, I have been using count.fields() on the files in the directory to identify the boundaries between each dataset, then readLines() to read the entire file as a character vector with one string per row of data, and finally subsetting these character strings according to the values derived with count.fields().
Are there any more direct methods? I feel like the approach I am currently using is quite inelegant. It should be noted that these are large csv files.
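For reference, here is a minimal sketch of the count.fields()/readLines() approach described above, assuming a single exchange file named "exchange.csv" and that each block starts with its own header row (both assumptions are mine):
nfields <- count.fields("exchange.csv", sep = ",")
txt <- readLines("exchange.csv")
# Block boundaries fall where the field count changes between rows
breaks <- which(diff(nfields) != 0)
starts <- c(1, breaks + 1)
ends <- c(breaks, length(txt))
# Re-parse each block separately; read.csv(text = ...) avoids temporary files
datasets <- Map(function(s, e) read.csv(text = txt[s:e]), starts, ends)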
I am currently working with a raw data set that, when downloaded from our device, comes as a log file with values delimited by semi-colons.
I am simply trying to load this data into R so I can put it into a dataframe and analyze it from there. However, as it is a log file, I can't use read_csv or read_delim. When I use read_log, there is no argument where I can define the delimiter, so my columns are being misread and I am receiving error messages, since R does not recognize ; as a delimiter in the file.
I have been unable to find any other instances of people using delimited log files with R, but I am trying to make the code work before I resign myself to uploading the data into Excel (I don't want to do this, both because the files have a lot of associated data and because my computer runs Excel very slowly). Does anyone have suggestions of functions I could use to load the semi-colon delimited log file?
Thank you!
You could use data.table::fread(). fread automatically recognizes most delimiters very reliably and reads most file types like *.csv, *.txt, etc.
If you are facing a situation where it doesn't guess the right delimiter, you can define it explicitly with fread(your_file, sep=";"). But that won't be necessary in your case.
I've created a file named your_file, without any extension, with the following content:
Text1;Text2;Text3;Text4
And now imported it to R:
library(data.table)
df = fread("your_file", header=FALSE)
Output:
> df
V1 V2 V3 V4
1: Text1 Text2 Text3 Text4
I'm new to R and currently working on a project refactoring code that reads from csv files to read from a database instead.
The work includes dumping the csv files into a Postgres database and modifying the existing R scripts to ingest input data from the db tables instead of csv files for subsequent transformation.
Right now I have run into an issue: the dataframe columns returned from dbGetQuery() have different modes and classes than the original dataframe from read_csv().
Since the data I'm reading in has hundreds of columns, it is not that convenient to explicitly specify the mode and class for each column.
Is there an easy way to make the new dataframe have the same schema as the old one, so I can apply the existing data transformation code to it? E.g., when I run a comparison between the old dataframe and the new one from the db, this is what I see:
VARIABLE   CLASS (from csv)   CLASS (from db)
col1       numeric            integer64
col2       numeric            integer
col3       numeric            integer
This won't be possible in general, because some SQL datatypes (e.g. DATE, TIMESTAMP, INTERVAL) have no equivalent in R, and the R data type factor has no equivalent in SQL. Depending on your R version, strings are automatically converted to factors, so it will at least be useful to import the data with stringsAsFactors=FALSE.
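If the only mismatches are integer/integer64 columns that should be numeric, a blanket coercion may be enough. A minimal sketch, assuming df_db is the dataframe returned by dbGetQuery() (the name is hypothetical):
library(bit64)  # integer64 class used by some DBI backends
# Coerce every integer and integer64 column to numeric so the schema
# matches what read_csv() produced
df_db[] <- lapply(df_db, function(x) {
  if (inherits(x, "integer64") || is.integer(x)) as.numeric(x) else x
})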
I have a sas7bdat file of around 80 GB. Since my PC has only 4 GB of memory, the only way I can see is to read some of its rows. I tried the sas7bdat package in R, which gives the error "big endian files are not supported".
The read_sas() function in haven seems to work, but it only supports selecting specific columns, while I need to read a subset of rows with all columns. For example, it would be fine if I could read 1% of the data to understand it.
Is there any way to do this? Any package which can work?
Later on I plan to read parts of the file and divide it into 100 or so sections.
If you have Windows you can use the SAS Universal Viewer, which is free, and export the dataset to CSV. Then you can import the CSV into R in more readable chunks using this method.
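As an illustration, a sketch of chunked import with readr, assuming the exported file is called "bigdata.csv" (the file name and chunk sizes are placeholders):
library(readr)
# Peek at the leading rows to understand the data
head_rows <- read_csv("bigdata.csv", n_max = 50000)
# Or stream the whole file in manageable pieces, keeping roughly 1% of each
sampled <- read_csv_chunked(
  "bigdata.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    chunk[sample(nrow(chunk), nrow(chunk) %/% 100), ]
  }),
  chunk_size = 100000
)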
Recently I have run into a problem exporting data from R to a more "common" format such as .csv or .txt.
My dataset is a data.table with 149000 rows by 124 columns. I used the following lines of code to try to export it:
write.table(data_reduced,"directory/data_reduced.txt",sep="\t",row.names=FALSE)
write.csv2(data_reduced,"directory/data_reduced.csv")
The result, in both cases, is that the .txt or .csv files have fewer rows than they should, and the count changes between trials (from about 900 to 1800). Usually what I get is the first rows and then the very last one.
I have tried converting the data.table into a matrix or a data.frame, but the result is more or less the same. I have also tried the write.xlsx function, but I have some problems with Java (which seems to be common, judging from the SO forum and other web sources).
I have also read about a function called fwrite for exporting very large datasets, but it looks like my R session cannot find it, even though I installed the data.table package.
Can anyone give me an explanation/solution for this problem? I've been reading different sources to sort it out, but with no success so far.
I use RStudio Version 0.99.473.
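Regarding fwrite: it was only added to data.table in version 1.9.8 (late 2016), so an older installation would explain R not finding it. A sketch of the usual fix, using the paths from the question:
# Update data.table to a version that ships fwrite(), then use it
install.packages("data.table")
library(data.table)
fwrite(data_reduced, "directory/data_reduced.csv")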
I downloaded the X12 seasonal adjustment program located here: http://www.census.gov/srd/www/x12a/x12downv03_pc.html
I followed the setup and got the settings correct. When I go to select a file to input, I have four options for file extensions to import: ".spc", ".mta", ".dta" and ".".
The problem is that I have data in Excel, and despite searching extensively through search engines I cannot figure out a way to get the data from Excel into one of these formats so I can run a seasonal adjustment on my data. Thanks
ADDED: After converting to a .dta file (using R, thanks to the comments left below), it looks like the program also makes you convert it to a .spc file. Anyone have a lead on how to do this? Thanks
My first reaction is to:
(1) export the data from Excel in something simple like CSV.
(2) import that data into R.
(3) use the R library "foreign" to export the data in .dta format.
So with the file "test.csv" containing:
V1,V2
1,2
3,4
5,6
you could do the following to produce "test.dta":
library(foreign)
testdata <- read.csv("test.csv")
write.dta(testdata,"test.dta")
Voila, data in .dta format. Would this work for what you have?
I've only ever used the command-line version of X12, but it sounds like you may be using the Windows interface instead? If so the following might not be entirely accurate, but it should be close enough (I hope!).
The .dta and .mta files you refer to are just metafiles containing text lists of either spec files or data files to be processed; in particular the .dta files X12 uses are NOT Stata data format files like those produced by Nathan's R-based answer. It's probably best to ignore using metafiles until you are comfortable enough using the software to adjust a single time series.
You can export your data in tab-separated format (year, month/quarter, value) without headings and use that as your data file. You can also use a simple list of data values separated by spaces, tabs, or newlines, and then tell X12ARIMA what the start and end dates of the series are in the .spc file.
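For instance, a sketch of writing such a data file from R, assuming a data frame df with columns year, month and value (all names illustrative):
# Tab-separated (year, month, value) rows without headings
write.table(df[, c("year", "month", "value")],
            file = "series.dat", sep = "\t",
            row.names = FALSE, col.names = FALSE)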
The .spc file doesn't contain the input data, it's a specification file telling X12 where to find the data file and how you want those data to be processed -- you'll have to write them yourself or create them in Win X-12.
Ideally you should write a separate .spc file for each time series to be adjusted; while you can write a .spc file which invokes many of X12's autoselection and identification procedures, it's usually not a good idea to treat the process as a black box, and a bit of manual intervention in the .spc is often necessary to get a good quality adjustment (and essential if there's a seasonal break involved). I find it helpful to start with a fairly generic skeleton .spc file suitable for your computing environment and then tweak it from there as appropriate for each series.
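As an illustration only, a generic skeleton might look like the following; the file name, title and start date are placeholders, and the option lines are lifted from the example in the other answer below:
series{
  title = "My series"
  file = "series.dat"
  start = 1972.jul
}
transform{function = log}
regression{variables = td}
identify{diff = (0,1) sdiff = (0,1)}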
If you really want to use a single .spc file to adjust multiple series, then you can provide a list of data files in a .dta file and a single .spc file instructing X12ARIMA how to adjust them, but take care to ensure this is appropriate for your data!
The "Getting started with X-12-ARIMA input files on your PC" document on that site is probably a good place to start reading, but you'll probably end up having to consult the complete reference documentation (in particular Chapters 3 and 7) as well.
Edit postscript:
The UK Office for National Statistics has a draft of its guide to seasonal adjustment with X12ARIMA available online here (archive.org), and it is worth a look. It's a good bit easier to work through than the Census Bureau documentation.
Ryan,
This is not elegant, but it might work for you. In this example I'm trying to replicate the spec file from Example 3.2 in the Census documentation.
Concatenate the data into one text string, then save this single text string using the MS-DOS (TXT) format under the SAVE AS command. To make the text string, first insert two cells above your column header, and in the second one type the following text:
series{title=
Next, insert double quotation marks before and after the text in your column header, like this:
"Monthly Retail Sales of Household Appliance Stores"
Directly below the last data row, insert rows of text that list the model specifications, like the following:
)
start= 1972.jul}
transform{function = log}
regression{variables=td}
identify{diff=(0,1) sdiff=(0,1)}
So you should have something like the following:
<blank row>
series{title=
"Monthly Retail Sales of Household Appliance Stores"
530
529
...
592
590
start= 1972.jul}
transform{function = log}
regression{variables=td}
identify{diff=(0,1) sdiff=(0,1)}
For the next instructions I am assuming that the text series{title= appears in cell A2 and that cell B1 is empty. In cell B2, insert the following:
=CONCATENATE(B1,A2," ")
Then copy this formula into every cell down the column to concatenate all of the text in column A into a single cell at the end of column B. Finally, copy the final cell to a new spreadsheet's cell A1 using PASTE SPECIAL/VALUE, and save this spreadsheet using SAVE AS with the TXT (MS-DOS) type, but change the extension to ".spc".
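Incidentally, if you already have R available, the same assembly takes only a few lines there; here is a sketch reproducing the fragment above, with a handful of illustrative values standing in for the full data column:
# Assemble the .spc text and write it out; `values` stands in for the data column
values <- c(530, 529, 592, 590)
spc <- c('series{title=',
         '"Monthly Retail Sales of Household Appliance Stores"',
         values,
         'start= 1972.jul}',
         'transform{function = log}',
         'regression{variables=td}',
         'identify{diff=(0,1) sdiff=(0,1)}')
writeLines(spc, "example.spc")  # c() has already coerced everything to character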
Good luck (and from the little I read of the Census documentation - you'll need it).