I can only give you a picture of the data I'm working with, or of the character that is creating my problems in the .csv file; I don't know how to extract the character itself.
This pillar-shaped character is stopping fread from working. Is there a way to escape it? readr's read_csv reads through it with no problem. I have tried dropping the column, reading it as a character column, and using comment.char = "", but nothing seems to work (a sketch of those attempts is below).
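Roughly the calls I have tried so far (a sketch; the file name here is a placeholder):
library(data.table)
fread("trades.csv", drop = "trade_condition")                          # dropping the column
fread("trades.csv", colClasses = list(character = "trade_condition"))  # forcing a character column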
Here is what I'm hoping to get out (and what I do get with read_csv):
# A tibble: 5 x 4
     X1 trade date       trade_condition
  <dbl> <dbl> <date>     <chr>
1  2902  28.3 2019-01-14 -12------P----
2  2903  28.0 2019-01-14 P
3  2904  28.0 2019-01-14 P
4  2905  28.0 2019-01-14 P
5  2906  28.1 2019-01-14 P
I'm using data.table 1.12.0.
Here is the output with verbose = TRUE:
omp_get_max_threads() = 8
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 8 threads (omp_get_max_threads()=8, nth=8)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file C:/Users/Markku/Desktop/KONECRANES_2019.01.14/trades.csv
File opened, size = 592KB (606768 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<,trade,date,trade_condition,sy>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 9 fields using quote rule 0
Detected 9 columns on line 1. This line is either column names or first data row. Line starts as: <<,trade,date,trade_condition,sy>>
Quote rule picked = 0
fill=false and the most number of columns found is 9
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 10 because (606767 bytes from row 1 to eof) / (2 * 27623 jump0size) == 10
Type codes (jump 000) : 57AAAA5AA Quote rule 0
A line with too-few fields (4/9) was found on line 4 of sample jump 7. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-few fields (4/9) was found on line 13 of sample jump 9. Most likely this jump landed awkwardly so type bumps here will be skipped.
Type codes (jump 010) : 57AAAA5AA Quote rule 0
'header' determined to be true due to column 2 containing a string on row 1 and a lower type (float64) in the rest of the 858 sample rows
=====
Sampled 858 rows (handled \n inside quoted fields) at 11 jump points
Bytes from first data row on line 2 to the end of last row: 606683
Line length: mean=213.01 sd=86.78 min=59 max=372
Estimated number of rows: 606683 / 213.01 = 2849
Initial alloc = 5698 rows (2849 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 57AAAA5AA
[10] Allocate memory for the datatable
Allocating 9 column slots (9 - 0 dropped) with 5698 rows
[11] Read the data
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==1
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==2
jumps=[0..1), chunk_size=606683, total_size=606683
Restarting team from jump 0. nSwept==0 quoteRule==3
jumps=[0..1), chunk_size=606683, total_size=606683
Read 2903 rows x 9 columns from 592KB (606768 bytes) file in 00:00.014 wall clock time
[12] Finalizing the datatable
Type counts:
2 : int32 '5'
1 : float64 '7'
6 : string 'A'
=============================
0.003s ( 21%) Memory map 0.001GB file
0.007s ( 50%) sep=',' ncol=9 and header detection
0.000s ( 0%) Column type detection using 858 sample rows
0.000s ( 0%) Allocation of 5698 rows x 9 cols (0.000GB) of which 2903 ( 51%) rows used
0.004s ( 29%) Reading 1 chunks (0 swept) of 0.579MB (each chunk 2903 rows) using 1 threads
+ 0.000s ( 0%) Parse to row-major thread buffers (grown 0 times)
+ 0.002s ( 14%) Transpose
+ 0.002s ( 14%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.014s Total
Warning message:
In fread(trades_file, verbose = T) :
Stopped early on line 2905. Expected 9 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<2903,28.04,2019-01-14,"P>>
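For reference, the warning itself suggests fill=TRUE; a minimal sketch of that call (using trades_file from the call above, untested against this particular file):
library(data.table)
dt <- fread(trades_file, fill = TRUE, verbose = TRUE)  # pad short lines instead of stopping early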
I'm trying to import a database from SQLite3 into Weka, but even after the database is loaded and displayed, when I click OK to start working with it, the message "couldn't read from database: unknown data type: text" appears. I've tried modifying the DatabaseUtils.props file, but nothing seems to work, so I would really appreciate it if someone could tell me how to solve this issue. Thanks.
I have read these instructions:
https://waikato.github.io/weka-wiki/databases/#configuration-files
Now this is my DatabaseUtils.props file (the jdbcURL entry needs to be changed to point at the actual database):
# Database settings for sqlite 3.x
#
# General information on database access can be found here:
# https://waikato.github.io/weka-wiki/databases
#
# url: http://www.sqlite.org/
# jdbc: http://www.zentus.com/sqlitejdbc/
# author: Fracpete (fracpete at waikato dot ac dot nz)
# version: $Revision: 5836 $
# JDBC driver (comma-separated list)
jdbcDriver=org.sqlite.JDBC,
# database URL
jdbcURL=jdbc:sqlite:/some/path/to/mydb.sqlite
# specific data types
# string, getString() = 0; --> nominal
# boolean, getBoolean() = 1; --> nominal
# double, getDouble() = 2; --> numeric
# byte, getByte() = 3; --> numeric
# short, getShort() = 4; --> numeric
# int, getInteger() = 5; --> numeric
# long, getLong() = 6; --> numeric
# float, getFloat() = 7; --> numeric
# date, getDate() = 8; --> date
# text, getString() = 9; --> string
# time, getTime() = 10; --> date
#SQLITE DATATYPES
#NULL. The value is a NULL value.
null=9
#INTEGER. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.
integer=5
#REAL. The value is a floating point value, stored as an 8-byte IEEE floating point number.
float=6
#TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).
TEXT=9
text=9
#BLOB. The value is a blob of data, stored exactly as it was input.
# other options
CREATE_DOUBLE=DOUBLE
CREATE_STRING=varchar(2000)
CREATE_STRING=TEXT
CREATE_INT=INT
CREATE_DATE=DATETIME
DateFormat=yyyy-MM-dd HH:mm:ss
checkUpperCaseNames=false
checkLowerCaseNames=false
checkForTable=true
# All the reserved keywords for this database
# Based on the keywords listed at the following URL (2009-04-13):
# http://www.sqlite.org/lang_keywords.html
Keywords=\
ABORT,\
ADD,\
AFTER,\
ALL,\
ALTER,\
ANALYZE,\
AND,\
AS,\
ASC,\
ATTACH,\
AUTOINCREMENT,\
BEFORE,\
BEGIN,\
BETWEEN,\
BY,\
CASCADE,\
CASE,\
CAST,\
CHECK,\
COLLATE,\
COLUMN,\
COMMIT,\
CONFLICT,\
CONSTRAINT,\
CREATE,\
CROSS,\
CURRENT_DATE,\
CURRENT_TIME,\
CURRENT_TIMESTAMP,\
DATABASE,\
DEFAULT,\
DEFERRABLE,\
DEFERRED,\
DELETE,\
DESC,\
DETACH,\
DISTINCT,\
DROP,\
EACH,\
ELSE,\
END,\
ESCAPE,\
EXCEPT,\
EXCLUSIVE,\
EXISTS,\
EXPLAIN,\
FAIL,\
FOR,\
FOREIGN,\
FROM,\
FULL,\
GLOB,\
GROUP,\
HAVING,\
IF,\
IGNORE,\
IMMEDIATE,\
IN,\
INDEX,\
INDEXED,\
INITIALLY,\
INNER,\
INSERT,\
INSTEAD,\
INTERSECT,\
INTO,\
IS,\
ISNULL,\
JOIN,\
KEY,\
LEFT,\
LIKE,\
LIMIT,\
MATCH,\
NATURAL,\
NOT,\
NOTNULL,\
NULL,\
OF,\
OFFSET,\
ON,\
OR,\
ORDER,\
OUTER,\
PLAN,\
PRAGMA,\
PRIMARY,\
QUERY,\
RAISE,\
REFERENCES,\
REGEXP,\
REINDEX,\
RELEASE,\
RENAME,\
REPLACE,\
RESTRICT,\
RIGHT,\
ROLLBACK,\
ROW,\
SAVEPOINT,\
SELECT,\
SET,\
TABLE,\
TEMP,\
TEMPORARY,\
THEN,\
TO,\
TRANSACTION,\
TRIGGER,\
UNION,\
UNIQUE,\
UPDATE,\
USING,\
VACUUM,\
VALUES,\
VIEW,\
VIRTUAL,\
WHEN,\
WHERE
# The character to append to attribute names to avoid exceptions due to
# clashes between keywords and attribute names
KeywordsMaskChar=_
#flags for loading and saving instances using DatabaseLoader/Saver
nominalToStringLimit=50
idColumn=auto_generated_id
Try putting the DatabaseUtils.props file in the Weka home directory. Also, in the file you should add something like TEXT=0 or TEXT=9 in the corresponding section.
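For example, in the data-type section (per the legend in the file above, 9 maps a TEXT column to Weka's string type and 0 to nominal):
# map SQLite TEXT columns (both spellings, as in the file above)
TEXT=9
text=9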
I would like to remove duplicate records from my large .xdf file, trans.xdf.
Here are the file details:
File name: /poc/revor/data/trans.xdf
Number of observations: 1000000000
Number of variables: 5
Number of blocks: 40
Compression type: zlib
Variable information:
Var 1: CARD_ID, Type: character
Var 2: SE_NO, Type: character
Var 3: r12m_cv, Type: numeric, Low/High: (-2348.7600, 40587.3900)
Var 4: r12m_roc, Type: numeric, Low/High: (0.0000, 231.0000)
Var 5: PROD_GRP_CD, Type: character
Below is some sample data from the file:
CARD_ID SE_NO r12m_cv r12m_roc PROD_GRP_CD
900000999000000000 1045815024 110 1 1
900000999000000000 1052487253 247.52 2 1
900000999000000000 9999999999 38.72 1 1
900000999000000000 1090389768 1679.96 16 1
900000999000000000 1091226035 0 1 1
900000999000000000 1091241208 538.68 4 1
900000999000000000 9999999999 83 1 1
900000999000000000 1091468041 148.4 3 1
900000999000000000 1092640358 3.13 1 1
900000999000000000 1093468692 546.29 1 1
I have tried using the rxDataStep function with its transformFunc parameter to call the unique() function over the .xdf file. Below is the code:
uniq_dat <- function( dataList )
{
  datalist <- unique(datalist)
  return(datalist)
}
rxDataStepXdf(inFile = "/poc/revor/data/trans.xdf", outFile = "/poc/revor/data/trans.xdf", transformFunc = uniq_dat, overwrite = TRUE)
But I was getting the error below:
Error in unique(datalist) : object 'datalist' not found
Error in transformation function: Error in unique(datalist) : object 'datalist' not found
Error in rxCall("RxDataStep", params) :
Could anybody point out the mistake I am making here, or suggest a better way to remove the duplicate records from the .xdf file? I am avoiding loading the data into an in-memory data frame as the data is pretty huge.
I am running the above code in the Revolution R environment over HDFS.
If the same result can be obtained by any other approach, an example would be appreciated.
Thanks in advance for the help :)
Cheers,
Amit
You can remove the duplicate values by providing the removeDupKeys=TRUE parameter to the rxSort() function. For example, in your case:
XdfFilePath <- file.path("<your file's fully qualified path>/trans.xdf")
# sorting on all five columns makes removeDupKeys drop fully duplicated rows;
# add an outFile argument to write the result to a new .xdf file instead of
# returning an in-memory data frame
rxSort(inData = XdfFilePath, sortByVars = c("CARD_ID", "SE_NO", "r12m_cv", "r12m_roc", "PROD_GRP_CD"), removeDupKeys = TRUE)
If you want to remove duplicate records based on a specific key column, for example the SE_NO column, set sortByVars = "SE_NO".
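As an aside, the error from the original transformFunc is just a case mismatch: the argument is named dataList but the body refers to datalist. Even with that fixed, a transform function sees one chunk of rows at a time, so unique() would only drop duplicates within each chunk, not across the whole file. A corrected sketch for reference (assuming the transform function may return fewer rows than it receives):
uniq_dat <- function(dataList)
{
  # treat the chunk as a data frame so unique() drops duplicated rows
  chunk <- unique(as.data.frame(dataList, stringsAsFactors = FALSE))
  as.list(chunk)
}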
I am trying to use Pyparsing to identify a keyword that does not begin with $. So for the following input:
$abc = 5 # is not a valid one
abc123 = 10 # is valid one
abc$ = 23 # is a valid one
I tried the following:
var = Word(printables, excludeChars='$')
var.parseString('$abc')
But this doesn't allow any $ in var at all, not just in the first position. How can I specify all printable characters other than $ for just the first character position? Any help will be appreciated.
Thanks
Abhijit
You can use the method I used to define "all characters except X" before I added the excludeChars parameter to the Word class:
from pyparsing import Word, printables

# initial character: any printable except '$'; body characters: any printable (including '$')
NOT_DOLLAR_SIGN = ''.join(c for c in printables if c != '$')
keyword_not_starting_with_dollar = Word(NOT_DOLLAR_SIGN, printables)
This should be a bit more efficient than building it up with a Combine and a NotAny. But it will match almost anything: integers, words, valid identifiers, invalid identifiers; so I'm skeptical of the value of this kind of expression in your parser.
I am trying to use the x12 function in the x12 package for R.
My problem is that, when using a monthly time series object (tso) in which each observation is a large number (11 or more digits), the function writes a spec file that the x12a.exe binary cannot read.
The x12 binaries do not allow the spec file to be wider than 132 columns.
In my example the spec file has 144 columns, which I believe is what gives me this error message in R: "ERROR: Input record longer than limit : 133".
When I use smaller numbers (fewer columns in the spec file), there has been no problem so far. When creating spec files by hand for X-12-ARIMA for Windows, I have never seen this problem, because I always use the "free" format (one observation per line) for the series.
My question is: how do I set the format for the time series to "free", or somehow get just one observation per line, in the "Rout.spc" file while using the x12 function in the x12 package for R?
I am using R version 2.15.2 and RStudio version 0.97.318.
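One workaround I can think of, since smaller numbers cause no problem, is to rescale the series before calling x12 so that each observation has fewer digits (a sketch; the scale of the adjusted series would of course change accordingly):
tal_scaled <- tal / 1000  # fewer digits per observation keeps each spec line under 132 characters
x12tal <- x12(tso = tal_scaled, automdl = T, x12path = x12path, period = 12, trendma = 23)
But I would still prefer a proper free-format solution.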
Attached are my example code in RStudio, the output in the R console, and the spec file.
"Rstudio"
library(x12)
alt <- read.csv2("alt.csv",header=T)
tal <- ts(data=alt,start=c(1995,4),freq=12)
x12path <- shortPathName("C:\\Dokumenter\\X_12_Arima_Program\\x12a\\x12a.exe")
x12tal <- x12(tso=tal,automdl=T,x12path=x12path,period=12,trendma=23)
"Console"
C:\Dokumenter\Eksperimentering\x12>md gra
C:\Dokumenter\Eksperimentering\x12>C:\DOKUME~1\X_12_A~2\x12a\x12a.exe Rout -g gra
X-12-ARIMA Seasonal Adjustment Program
Version Number 0.3 Build 192
Execution began Mar 12, 2013 23.46.25
Reading input spec file from Rout.spc
Storing any program output into Rout.out
Storing any program error messages into Rout.err
ERROR: Input record longer than limit : 133
Line 6: start=1995.4
^
ERROR: Expected an real number not "111"
Program error(s) halt execution for Rout.spc
Check error file Rout.err
Error messages generated from processing the X-12-ARIMA spec file
Rout.spc:
Error in readx12Out(file, freq_series = frequency(tso), start_series = start(tso), :
Error! No proper run of x12! Check your parameter settings.
"The spec file: Rout.spc"
series{
title="R Output for X12a"
decimals=2
start=1995.4
period=12
data=(
14056669449 12785389868 12772341230 12342935128 12081332395 12110109950 12367542268 12911930417 12836340370 12214486074 12057940408 11555540809
10002847699 9199284760 8704422249 8492914782 8507816348 8470254675 8665139772 8653204621 9177471163 9676069791 9483990311 9825510541
7613345714 7168896536 7527318694 7721174940 7584049271 7586159794 7411383039 7565724342 7555103032 7148551906 7792379395 7493885451
6636374143 6390731897 6160711917 6003196233 5955867663 5868369296 5858314348 6098506333 6297774946 6074680955 6132163345 5875098456
5198306672 4891946405 4875765641 4834436461 4835096514 4804664875 4684550404 4733459404 5056773308 4912329843 5080643820 4568733581
4286693348 3898776528 3872776341 3842469172 3756957390 3782676505 3924066331 3810475969 3943259720 3665136687 3962811976 3449264257
3120637669 2813261665 2692920289 2652153941 2557247524 2658115616 2777287302 2688976703 2712004412 2596430893 2520548046 2455531008
2429263753 2187017586 2181610529 2139024441 2008850781 2049874584 2110715482 2218937956 2565352715 2635375627 2598584163 2435211675
2433625715 2350144562 2298764466 2242464445 2288528533 2532374821 2696862060 2877128057 3086285374 3309497319 3684989376 3709283880
3483967873 3294407926 3465439983 3546006197 3526166213 3625899404 3774201496 3941610691 4325836434 4466576126 4115121591 4036118609
3824882119 3552896925 3649624960 3570454122 3622089655 3662984491 3601306018 3604389348 3620162022 3401732239 3158217491 2896252892
2800864675 2630474256 2668229303 2631120097 2343131082 2163910930 2108285015 2067601541 2099699134 1803097392 1742652674 1626660618
1560369744 1448264771 1419659828 1547101381 1310783818 1358686467 1300281852 1315247637 1380387680 1286158497 1329769957 1272124521
1185603967 1125238745 1217223861 1265616553 1222054134 1279497332 1499392605 1810208712 2314301847 2908395453 3388479445 3441615991
3432688695 3691000321 3891303059 4111250935 4258776704 4586315450 5050122946 5156728599 5550332779 5769588984 5943764465 6032516246
5765718572 5521116586 5498458566 5374456514 5130561755 5219814632 5542173962 6883624616 7744043244 7913799960 7416210299 7127265644
6790509897 6562709494 6390985216 6126897801 5855125688 6259675447 6439114484 6634617502 6771498442 6674343925 6295709586 5890916431
5545655270 5315444742 5205711894 5115065476 4648229650 4724377012 4816989052 5049928441 5041395923
)
}
transform{
function=auto
}
automdl {
maxorder=(3,2)
maxdiff=(1,1)
balanced=yes
savelog=(adf amd b5m mu)
}
forecast {
}
x11{
sigmalim=(1.5,2.5)
trendma=23
excludefcst=yes
final=(user)
appendfcst=yes
savelog=all
}