A tab-delimited text file, which is actually an export (using bcp) of a database table, has the following form (first few columns shown):
102 1 01 e113c 3224.96 12
102 1 01 e185 101127.25 12
102 2 01 e185 176417.90 12
102A 3 01 e185 26261.03 12
I tried to import it in R with a command like
data <- read.delim("C:\\test.txt", header = FALSE, sep = "\t")
The problem is that the 3rd column, which is actually a varchar (alphanumeric) field, is mistakenly read as integer because there happen to be no letters anywhere in that column, so the leading zeros disappear. The same thing happened when I imported the data directly from the database using odbcConnect: again, that column was read as integer.
str(data)
$ code: int 1 1 1 1 1 1 6 1 1 8 ...
How can I import such a dataset in R correctly, so as to be able to safely populate that db table again, after doing some data manipulations?
EDIT
I did it by adding the following parameter to read.delim:
colClasses = c("factor","integer","factor","factor","numeric","character","factor","factor","factor","factor","integer","character","factor")
Would you suggest "character" or "factor" for varchar fields?
Is it ok to use "character" for datetime ones?
What should I do in order to be able to read a numeric field like this 540912.68999999994 exactly as is and not as 540912.69?
I would like to create that colClasses vector as automatically as possible, based on the data types defined in the relevant table's schema.
Would you suggest "character" or "factor" for varchar fields?
As John mentioned, this depends upon usage. It is simple to switch between the two, so don't worry too much about it. If the column represents a categorical variable, it should eventually be considered as a factor. If you intend on mining the text (e.g. comments fields), then character makes more sense.
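Switching between the two after the import is a one-liner either way; for instance, using the code column from the str() output above:

data$code <- as.character(data$code)  # factor -> character
data$code <- factor(data$code)        # character -> factor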
Is it ok to use "character" for datetime ones?
It's fine for storing the dates in a data frame, but if you want them to be treated correctly for analysis purposes, you'll have to convert it to Date or POSIXct/POSIXlt form.
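For example, a minimal sketch (assuming a hypothetical character column date_col in an ISO-like format; adjust the format string to match your data):

data$date_col <- as.Date(data$date_col, format = "%Y-%m-%d")                          # date only
data$date_col <- as.POSIXct(data$date_col, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")  # date-time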
What should I do in order to be able to read a numeric field like this 540912.68999999994 exactly as is and not as 540912.69?
Values are read in at the usual double precision (about 15 significant figures); in this particular example, 540912.69 is the best accuracy you can achieve. Compare:
print(540912.68999999994) # 540912.7
print(540912.68999999994, digits=22) # 540912.69
print(540912.6899999994) # 540912.7
print(540912.6899999994, digits=22) # 540912.6899999994
EDIT: If you need more precision for your numbers, use the Rmpfr package.
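A minimal sketch of that, assuming Rmpfr is installed; note the value has to be passed as a string so it is not rounded to a double by the parser first:

library(Rmpfr)
x <- mpfr("540912.68999999994", precBits = 128)  # parse the string at 128-bit precision
x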
I would like to create that colClasses vector as automatically as possible, based on the data types defined in the relevant table's schema.
The default for colClasses (when you don't specify it) does a pretty good job of guessing what the columns should be. If you are doing things like reading 01 as a character, then there's no way around specifying it explicitly.
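That said, here is a rough sketch of generating colClasses from the database schema (the DSN, table name, and type mapping below are placeholders/assumptions to adapt to your own setup, using RODBC since you already connect with odbcConnect):

library(RODBC)
ch <- odbcConnect("myDSN")  # placeholder DSN

# column names and SQL types, in table order (SQL Server-style information schema)
schema <- sqlQuery(ch, "SELECT COLUMN_NAME, DATA_TYPE
                          FROM INFORMATION_SCHEMA.COLUMNS
                         WHERE TABLE_NAME = 'mytable'
                         ORDER BY ORDINAL_POSITION")

# map SQL types to R classes; extend as needed for your schema
map <- c(varchar = "character", char = "character", datetime = "character",
         int = "integer", smallint = "integer",
         decimal = "numeric", numeric = "numeric", float = "numeric")

colClasses <- unname(map[as.character(schema$DATA_TYPE)])
colClasses[is.na(colClasses)] <- "character"  # fall back to character for unmapped types

data <- read.delim("C:\\test.txt", header = FALSE, colClasses = colClasses)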
The character versus factor question is something only you can answer. It depends on whether you need to use them later as factors or as characters. It also depends on whether you need to clean them up afterwards. For example, if you plan to apply a number of ifelse() modifications to a factor afterwards, you might as well read it in as character now and turn it into a factor later. Or, if you want to code the factor in some specific way, you will likely be better off reading it in as character.
As an aside, the reason to use read.delim over read.table is its default settings, so don't bother setting sep to the same value as the default.
Related
When I run a proc print where segment ID equals 1234 the output shows segment ID 1235. SAS actually changes the last 4 digits of a 19 digit number. Contents shows the field in a num 8 formatted as a char 20. I just pull the data and print with no additional formatting or processing.
If I run a SQL statement in a different software package where segment ID equals 1234 (the exact same record) the results show 1234 (no change to the last 4). The other vars pulled with the query exactly match those of SAS except for the segment ID.
My best guess is it's a formatting issue even though the field should be large enough, 20 > 19.
Second guess is some sort of encryption on the field. Typically if I don't have proper access a field would be blank. But I am unfamiliar with this new data source.
I'll try adding a specific format to my SAS datapull for that field but would love to hear any other suggestions.
Thank you!
PROC PRINT is not the issue. You cannot store 19 decimal digits exactly as a number in SAS. SAS stores numbers as 64-bit floating point numbers. The maximum number of decimal digits that can be represented as consecutive integers is 15. After that the binary representation will not have enough bits to exactly represent every decimal string.
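This isn't specific to SAS: any system that stores numbers as 64-bit IEEE doubles rounds the same way. As an illustration (in R, purely to show the rounding; the value matches the one in the results further down):

sprintf("%.0f", 1234567890123456789)
# [1] "1234567890123456768"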
Check this description about precision from the documentation: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lrcon/p0ji1unv6thm0dn1gp4t01a1u0g6.htm
You should store such things as character strings. I doubt that you need to do any arithmetic with those values.
If you are getting the data from a remote database, use the DBSASTYPE= dataset option to control what type of SAS variable is created.
Contents shows the field in a num 8 formatted as a char 20. I just pull the data and print with no additional formatting or processing.
That doesn't make sense. A numeric variable shouldn't have a character format.
I think you'll need to re-read the field as character from the source. I'm not sure where you "just pull the data" from, but you'll need to modify that step to ensure the field is brought over as character in the first place; otherwise you'll have this issue no matter what.
You cannot fix the issue with the data as is, as far as I know.
*issue with numeric;
data have;
input segmentID;
format segmentID 32.;
segmentIDChar=put(SegmentID, 32.);
cards;
1234567890123456789
;;;;
run;
proc print data=have;
run;
*no issue with character fields;
data have;
length segmentID $20.;
input segmentID $;
format segmentID $32.;
cards;
1234567890123456789
;;;;
run;
proc print data=have;
run;
Results:
Numeric Issue
Obs segmentID segmentIDChar
1 1234567890123456768 1234567890123456768
Character (No Issue)
Obs segmentID
1 1234567890123456789
I have a data.frame in R, and I want to export it to a SAS file. I am using write.xport to do that. The column names are like:
a.b.c, a.b.d, a.f.g, ...
When I get the data in SAS, column names are like: a(1),a(2),..
How can I keep the original labels in exported SAS file?
I get the error:
Warning messages:
1: In makeSASNames(colnames(df)) :
Truncated 119 long names to 8 characters.
2: In makeSASNames(colnames(df)) : Made 106 duplicate names unique.
In addition to the length, it seems your column names contain the '.' character? SAS doesn't allow that kind of name. SAS uses the '.' to represent, e.g., library.dataset notation, and it has many other uses. Column names cannot contain characters such as '+', '-', or '&' either.
So to summarize: make your column names SAS-compatible. See the SAS documentation for more detail.
SAS uses the column labels, which allow for more complexity, only for output purposes, as far as I know. Thus, if you want to manipulate the data in SAS, you need to rethink your column names first.
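A rough sketch of cleaning the names in R before calling write.xport (the helper name and rules below are just one way to do it; note that the uniqueness suffix can push a name back over 8 characters, so check the result):

sasify_names <- function(x) {
  x <- gsub("[^A-Za-z0-9_]", "_", x)  # replace '.', '+', '-', '&', etc. with '_'
  x <- substr(x, 1, 8)                # truncate to the 8-character transport limit
  make.unique(x, sep = "")            # de-duplicate names that collide after truncation
}

names(df) <- sasify_names(names(df))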
I don't want the display format like this: 2.150209e+06
the format I want is 2150209
because when I export data, format like 2.150209e+06 caused me a lot of trouble.
I did some search found this function could help me
formatC(numeric_summary$mean, digits=1,format="f").
I am wondering whether I can set an option to change this permanently. I don't want to apply this function to every variable of my data because I have this problem very often.
One more question: can I change the class of all integer variables to numeric automatically? With the integer class, summing a whole column often causes trouble and gives "integer overflow - use sum(as.numeric(.))".
I don't need the integer class; all I need is numeric. Can I set an option to change the integer class to numeric, please?
I don't know how you are exporting your data, but when I use write.csv with a data frame containing numeric data, I don't get scientific notation, I get the full number written out, including all decimal precision. Actually, I also get the full number written out even with factor data. Have a look here:
df <- data.frame(c1=c(2150209.123, 10001111),
c2=c('2150209.123', '10001111'))
write.csv(df, file="C:\\Users\\tbiegeleisen\\temp.txt")
Output file:
"","c1","c2"
"1",2150209.123,"2150209.123"
"2",10001111,"10001111"
Update:
It is possible that you are just dealing with a data rendering issue. What you see in the R console or in your spreadsheet does not necessarily reflect the precision of the underlying data. For instance, if you are using Excel, highlight a numeric cell, press CTRL+1, and then change the format: you should be able to see the full/true precision of the underlying data. Similarly, the number you see printed in the R console might use scientific notation only for ease of reading (scientific notation was invented partly for this very reason).
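If the console rendering itself is the problem, a couple of hedged options for discouraging scientific notation in R (the first changes a global option, so use it deliberately):

options(scipen = 999)                       # strongly penalize scientific notation when printing
format(2150209, scientific = FALSE)         # format a single value without scientific notation
formatC(2150209, format = "f", digits = 0)  # the formatC approach you already found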
Thank you all.
For the example above, I tried this:
df <- data.frame(c1=c(21503413542209.123, 10001111),
c2=c('2150209.123', '100011413413111'))
c1 in df is scientific notation, c2 is not.
Then I ran write.csv(df, file="C:\\Users\\tbiegeleisen\\temp.txt"), and it does output all digits.
Can I disable scientific notation in R, please? It still causes me trouble, even though all digits are exported to the txt file.
Sometimes I want to visually compare two big numbers. For example, if I run
df <- data.frame(c1=c(21503413542209.123, 21503413542210.123),
c2=c('2150209.123', '100011413413111'))
df will be
c1 c2
2.150341e+13 2150209.123
2.150341e+13 100011413413111
The two values of c1 are actually different, but I cannot differentiate them in R unless I export them to txt. The numbers here are fake, but I encounter the same problem every day.
I am confused. I input a .csv file in R and want to fit a linear multivariate regression model.
However, R declares all my obviously numeric variables to be factors and my categorical variables to be integers. Therefore, I cannot fit the model.
Does anyone know how to resolve this?
I know this is probably so basic. But I really need to know this. Elsewhere, I found only posts concerning how to declare factors. But this does not apply here.
Any suggestions very much appreciated!
The easiest way, imo, to handle this is to just tell R what type of data your columns contain when you read them into the workspace. For example, if you have a csv file where the first column should be characters, columns 2-21 should be numeric, and column 22 should be a factor, here's how I would read that csv file into the workspace:
Data <- read.csv("MyData.csv", colClasses=c("character", rep("numeric", 20), "factor"))
Sometimes (with certain versions of R, as Andrew points out) float entries in a CSV are long enough that R thinks they are strings and not floats. In this case, you can do the following:
data <- read.csv("filename.csv")
data$some.column <- as.numeric(as.character(data$some.column))
Or you could pass stringsAsFactors=F to the read.csv call, and just apply as.numeric in the next line. That might be a bad idea though if you have a lot of data.
It's a little harder to say what's going on with the categorical variables. You might want to try just treating those as strings and see how that works. Sometimes R will treat factor vectors as being of numeric type, so this is a good first sanity check. If that doesn't work, you can also see if the regression functions in question will let you declare how the variables should be treated.
It is hard to tell without a sample of your data file and the commands that you have been using to try and work with the data, but here are some general problems that can lead to what you describe (though there could be other possibilities as well).
The read.csv and read.table functions (read.csv calls read.table) will try to guess the types of the data when they are not told what each column should be (the colClasses argument). If everything looks like a number, the column is converted to a number, but if anything in the first lines does not look like part of a number, the column is read in as character and converted to a factor. Common reasons why R sees something non-numeric in a column you expect to be numeric include: a finger slip that puts a letter somewhere in the column; similar-looking substitutions, such as O for 0 or l for 1; a comma where one is not expected (many European files use ',' where R expects '.', but there are options to tell R what you want here; see the example below); or using read.table without setting sep when the file really is comma separated.
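For example, for a European-style file that uses ';' as the separator and ',' as the decimal mark (the file name is a placeholder), you can set those arguments explicitly or use read.csv2, which has exactly those defaults:

data <- read.csv("file.csv", sep = ";", dec = ",")
data <- read.csv2("file.csv")  # equivalent convenience wrapper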
If you have a categorical variable represented by integers, then R will read it as integers unless you tell it to make a factor. If you use as.numeric on a factor, it will return the integers used internally to represent the factor levels. How to convert a factor whose labels are numbers into a numeric is a question (and answer) in the R FAQ:
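The FAQ's recommended conversion looks like this (using a small factor f purely as an illustration):

f <- factor(c("10", "20", "30"))
as.numeric(f)                # wrong: returns the internal codes 1 2 3
as.numeric(as.character(f))  # right: 10 20 30
as.numeric(levels(f))[f]     # equivalent and slightly more efficient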
If this does not point you in the right direction then give us a sample of your data and what commands you are using.
First time poster here, so I'll try and make myself as clear as possible on the help I need. I'm fairly new to R, and this is my first real independent programming experience.
I have stock tick data for about 2.5 years; each day has its own file. The files are .txt, consist of approximately 20-30 million rows, and average, I would guess, 360 MB each. I am working on one file at a time for now. I don't need all the data these files contain, and I was hoping I could use the programming to trim my files down a bit.
Now my problem is that I am having some difficulties with writing the proper code so R understands what I need it to do.
Let me first show you some of the data so you can get an idea of the formatting.
M977
R 64266NRE1VEW107 FI0009653869 2EURXHEL 630 1
R 64516SSA0B 80SHB SE0002798108 8SEKXSTO 40 1
R 645730BBREEW750 FR0010734145 8EURXHEL 640 1
R 64655OXS1C 900SWE SE0002800136 8SEKXSTO 40 1
R 64663OXS1P 450SWE SE0002800219 8SEKXSTO 40 1
R 64801SSIEGV LU0362355355 11EURXCSE 160 1
M978
Another snip of data:
M732
D 3547742
A 3551497B 200000 67110 02800
D 3550806
D 3547743
A 3551498S 250000 69228 09900
So as you can see, each line begins with a letter, and each letter denotes what the line means. For instance, R means order book directory message, M means milliseconds after the last second, and H means stock trading action message. There are 14 different letters used in total.
I have used the readLines function to import the data into R. This however seems to take a very long time for R to process when I want to work with the data.
Now I would like to write some sort of if logic that says: if the first letter is R, then offsets 1 to 4 mean Market Segment Identifier, and so on, and have R add columns for these so I can work with the data in a more structured fashion.
What is the best way of importing such data, and also creating some form of structure - i.e. use unique ID information in the line of data to analyze 1 stock at a time for instance.
You can try something like this:

options(stringsAsFactors = FALSE)

# build one row from an "A" line and append it to the running table
f_A <- function(line, tab_A){
  values <- unlist(strsplit(line, " "))[2:5]
  rbind(tab_A, list(name_1 = as.character(values[1]), name_2 = as.numeric(values[2]),
                    name_3 = as.numeric(values[3]), name_4 = as.numeric(values[4])))
}

# empty table to collect the "A" records
tab_A <- data.frame(name_1 = character(), name_2 = numeric(),
                    name_3 = numeric(), name_4 = numeric(), stringsAsFactors = FALSE)

# dispatch on the first character of each line
for(i in readLines(con = "/home/data.txt")){
  switch(strsplit(x = i, split = "")[[1]][1],
         M = cat("1\n"), R = cat("2\n"), D = cat("3\n"),
         A = (tab_A <- f_A(i, tab_A)))
}
Then replace the cat() calls with functions that append values to the corresponding data frames. Use the pattern of the f_A() function to construct the other functions, and do the same for the other table structures.
You can combine your readLines() command with regular expressions. To get more information about regular expressions, look at the R help page for grep():
> ?grep
So you can go through all the lines, check what each line means, and then handle or store the content of the line however you like. (Regular expressions are also useful for splitting the data within one line...)
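A minimal sketch of that approach (the path, record types, and character offsets below are assumptions based on the sample in the question; adjust them to the actual file specification):

lines <- readLines("data.txt")  # placeholder path

# keep only order book directory messages (lines starting with "R")
r_lines <- grep("^R", lines, value = TRUE)

# pull fixed-width fields out by character position (offsets are illustrative only)
r_records <- data.frame(
  order_book_id = substr(r_lines, 3, 7),
  isin          = substr(r_lines, 19, 30),
  stringsAsFactors = FALSE
)

head(r_records)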