I've imported a csv file with lots of columns and sections of data.
v <- read.csv2("200109.csv", header=TRUE, sep=",", skip=6, na.strings=c(""))
The layout of the file is something like this:
Dataset1
time, data, .....
0 0
0 <NA>
0 0
Dataset2
time, data, .....
00:00 0
0 <NA>
0 0
(The headers of the different datasets are exactly the same.)
Now, I can plot the first dataset with:
plot(as.numeric(as.character(v$Calls.served.by.agent[1:30])), type="l")
I am curious if there is a better way to:
Get all the numbers read as numbers, without having to convert.
Address the different datasets in the file in some meaningful way.
Any hints would be appreciated. Thank you.
Status update:
I haven't really found a good solution in R yet, but I've started writing a script in Lua to separate each individual time series into a separate file. I'm leaving this open for now, because I'm curious how well R will deal with all these files. I'll get 8 files per day.
What I personally would do is to make a script in some scripting language to separate the different data sets before the file is read into R, and possibly do some of the necessary data conversions, too.
If you want to do the splitting in R, look up readLines and scan – read.csv2 is too high-level and is meant for reading a single data frame. You could write the different data sets into different files, or if you are ambitious, cook up file-like R objects that are usable with read.csv2 and read from the correct parts of the underlying big file.
Once you have dealt with separating the data sets into different files, use read.csv2 on those (or whichever read.table variant is best – if those are not tabs but fixed-width fields, see read.fwf). If <NA> indicates "not available" in your file, be sure to specify it as part of na.strings. If you don't do that, R thinks you have non-numeric data in that field, but with the right na.strings, you automatically get the field converted into numbers. It seems that one of your fields can include time stamps like 00:00, so you need to use colClasses and specify a class to which your time stamp format can be converted. If the built-in Date class doesn't work, just define your own timestamp class and an as.timestamp function that does the conversion.
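The splitting can also be done in R without temporary files. Below is a minimal sketch, assuming each data set starts with a title line such as Dataset1, followed by a comma-separated header line and data rows, and that missing values appear as <NA> or as empty fields. The function name read_blocks and the example lines are made up for illustration; in practice the lines would come from readLines("200109.csv").

```r
# Assumption: each block starts with a title line matching "^Dataset[0-9]+$",
# followed by a comma-separated header line and then the data rows.
read_blocks <- function(lines, title_pattern = "^Dataset[0-9]+$") {
  starts <- grep(title_pattern, lines)          # positions of the title lines
  ends <- c(starts[-1] - 1, length(lines))      # last line of each block
  blocks <- lapply(seq_along(starts), function(k) {
    body <- lines[(starts[k] + 1):ends[k]]      # header line plus data rows
    read.csv(text = body, header = TRUE, strip.white = TRUE,
             na.strings = c("", "<NA>"))
  })
  names(blocks) <- lines[starts]
  blocks
}

# Small in-memory example standing in for readLines("200109.csv")
example_lines <- c(
  "Dataset1", "time,data", "0,0", "1,<NA>", "2,5",
  "Dataset2", "time,data", "0,3", "1,4"
)
datasets <- read_blocks(example_lines)
str(datasets$Dataset1)
```

Because <NA> is listed in na.strings, the data column is read as a number without any manual conversion, and each data set can be addressed by its title, for example datasets$Dataset2.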
The data I have been working with reads everything just fine, **except** for the date column. It always reads it as characters instead.
This would be fine except that, when I have lots of dates (like over 400 of them), then you can see something like this on a scatterplot:
[Image: scatter plot]
In essence, I have two questions.
The first is: apart from using as.Date, which is fine when I only need something temporary, how do I make R permanently read the date column as proper dates? In other words, is there a way to have that column read as dates when I am using read.csv or read.excel?
When graphing, like the graph I have included here, how can I only include some of the labels throughout so that it won't be so cramped up? I still want all the data, but really do not want all those labels.
I was hoping to add the data file, but I am unaware of how to add excel/csv files on this website and my data set is quite long (n = 491). I do have 9 columns, 1 of which is the date column. The others are numbers or actual letters (the latter of which is in fact a character). I can add maybe a few rows just to help out.
[Image: some of the data set]
Is there a generic command to view the top n lines of data in a spreadsheet to inform a decision about how many lines to skip on the actual read in of the data?
I have used readLines() and read_lines() in some circumstances, but I was wondering if there was a more basic and perhaps more universal function for checking spreadsheet data across multiple formats (e.g., csv, xls, xlsx) to determine how many lines to skip.
You can use data.table::fread() with the nrows argument, which lets you decide how many rows to read.
Its skip argument is also useful for skipping the lines you want to leave out.
See fread's documentation (?fread) for details.
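The same peek-then-skip idea can be tried in base R, which is swapped in here only so that the sketch has no package dependency; read.csv accepts nrows and skip arguments analogous to the ones mentioned for fread. The file contents below are made up, and this only works for text formats such as csv, not for xls or xlsx.

```r
# Write a small example file with two non-data lines before the header (made-up data)
path <- tempfile(fileext = ".csv")
writeLines(c("Report generated 2020-01-01", "", "id,value", "1,10", "2,20", "3,30"), path)

# Look at the first few raw lines to decide how many to skip
head_lines <- readLines(path, n = 4)
print(head_lines)

# The header turns out to be on line 3, so skip the first two lines
# and read only a couple of rows as a check
preview <- read.csv(path, skip = 2, nrows = 2)
print(preview)
```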
Recently I obtained a lot of RTF files that contain economic data for an analysis I need to do. Unfortunately, this is the only way the Bureau of Statistics of my country could provide time series data covering a long period. If I only needed to look up a particular indicator for 10 years or so once, I would be fine finding the values by hand using Word/Notepad/TextEdit (for Mac). My problem is that I have 15 files with data that I need to combine into one dataset for my work. Before I even start, I have no idea whether it is possible to read those files into an appropriate format (a data.frame). I wanted to ask for expert opinion on how to approach this task. An example of one of the files can be downloaded from here:
https://www.dropbox.com/s/863ikx6poid8unc/Export_for_SO.rtf?dl=0
All values are in Russian. Dataset represents export of particular product (first column) across countries (second column) in US dollars for 2 periods.
Thank you.
Use the code found at https://datascienceplus.com/how-to-import-multiple-csv-files-simultaneously-in-r-and-create-a-data-frame/
and replace read_csv with read_rtf.
You may want to manually convert your files to another format using an office suite or a text editor. You should be able to save as in another format.
While in R, you may want to give striprtf a try. I'm guessing you will still have to clean your data a bit afterward.
You can install the package like this:
install.packages("striprtf")
I have a loop that is going through a list of variables (in a csv), accessing a database and extracting the relevant data. It does this for 4 different time periods (which depend on the variables).
I am trying to get R to write this data to a csv, but currently I can only get it to store the data for the last variable in 4 different csv files, as it overwrites the previous variable each time.
I'd like it to have all of the data for these variables for one time period all in the same file/sheet. (So either 4 sheets or 4 csv files with all of the data on them) This is because I need to do some data manipulation on the variables before I feed them into the next loop of the script.
I'd like it to be something like this, but need 4 separate sheets/files so I can cover each time period.
date/time | var1 | var2 | ... | varn
I would post the code, but even only posting the relevant loop and none of the surrounding code would be ~150 lines. I am not familiar with R (I can follow the script but struggle writing my own), I inherited this project and don't have long to work on it.
Note: each variable is recorded at a different frequency - some will only have one data point an hour, others one every minute, so will need to match these up based on time recorded (to the nearest minute).
EDIT: I hope I've explained this clearly enough
Four different .csv files would be easiest, because you could do something like the following in your loop:
# period.data is a placeholder for the data frame holding one time period's results
outfile.name <- paste('Sales', year.of.data, '.csv', sep='')
write.csv(period.data, outfile.name, row.names=FALSE)
You could also append the data into one data.frame and then export it all at once into one sheet. You won't be able to export to multiple sheets for a .csv, because a CSV won't let you have multiple sheets.
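Appending everything for one time period into a single data.frame could look roughly like the sketch below. It assumes that the database loop yields one data frame per variable with a POSIXct time column and a value column; the names results, var1, var2, prepared, combined and the output file name are all hypothetical. Timestamps are rounded to the nearest minute so that variables recorded at different frequencies line up, and minutes with no reading for a variable are left as NA.

```r
# Hypothetical stand-ins for what the database loop returns: one data frame
# per variable, each with a POSIXct 'time' column and a 'value' column.
var1 <- data.frame(
  time  = as.POSIXct(c("2020-01-01 00:00:20", "2020-01-01 00:01:10"), tz = "UTC"),
  value = c(1.5, 2.5)
)
var2 <- data.frame(
  time  = as.POSIXct("2020-01-01 00:01:40", tz = "UTC"),
  value = 10
)
results <- list(var1 = var1, var2 = var2)

# Round each timestamp to the nearest minute and name the value column
# after the variable it belongs to
prepared <- lapply(names(results), function(nm) {
  d <- results[[nm]]
  out <- data.frame(time = as.POSIXct(round(d$time, "mins")), value = d$value)
  names(out) <- c("time", nm)
  out
})

# Full outer join on time: one row per minute, one column per variable
combined <- Reduce(function(a, b) merge(a, b, by = "time", all = TRUE), prepared)

# One file per time period; repeat with a different name for each period
out_file <- file.path(tempdir(), "period1.csv")
write.csv(combined, out_file, row.names = FALSE)
```

Building the combined table inside the loop and writing it once per period avoids the overwriting problem, since each file is only written after all variables have been merged.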
First time poster here, so I'll try and make myself as clear as possible on the help I need. I'm fairly new to R, and this is my first real independent programming experience.
I have stock tick data for about 2.5 years, and each day has its own file. The files are .txt, consist of approximately 20-30 million rows, and average about 360 MB each. I am working on one file at a time for now. I don't need all the data these files contain, and I was hoping that I could use programming to reduce the size of my files a bit.
Now my problem is that I am having some difficulties with writing the proper code so R understands what I need it to do.
Let me first show you some of the data so you can get an idea of the formatting.
M977
R 64266NRE1VEW107 FI0009653869 2EURXHEL 630 1
R 64516SSA0B 80SHB SE0002798108 8SEKXSTO 40 1
R 645730BBREEW750 FR0010734145 8EURXHEL 640 1
R 64655OXS1C 900SWE SE0002800136 8SEKXSTO 40 1
R 64663OXS1P 450SWE SE0002800219 8SEKXSTO 40 1
R 64801SSIEGV LU0362355355 11EURXCSE 160 1
M978
Another snip of data:
M732
D 3547742
A 3551497B 200000 67110 02800
D 3550806
D 3547743
A 3551498S 250000 69228 09900
So as you can see each line begins with a letter. Each letter denotes what the line means. For instance R means order book directory message, M means milliseconds after last second, H means stock trading action message. There are 14 different letters used in total.
I have used the readLines function to import the data into R. This however seems to take a very long time for R to process when I want to work with the data.
Now I would like to write some sort of If function that says if the first letter is R then from offset 1 to 4 the code means Market Segment Identifier etc., and have R add columns to these so I can work with the data in a more structured fashion.
What is the best way of importing such data, and also creating some form of structure - i.e. use unique ID information in the line of data to analyze 1 stock at a time for instance.
You can try something like this :
options(stringsAsFactors = FALSE)
# Append the fields of one "A" line to the table of "A" messages
f_A <- function(line, tab_A) {
  values <- unlist(strsplit(line, " "))[2:5]
  rbind(tab_A, list(name_1 = as.character(values[1]), name_2 = as.numeric(values[2]), name_3 = as.numeric(values[3]), name_4 = as.numeric(values[4])))
}
# Empty table with one column per field of an "A" line
tab_A <- data.frame(name_1 = character(), name_2 = numeric(), name_3 = numeric(), name_4 = numeric(), stringsAsFactors = FALSE)
for (i in readLines(con = "/home/data.txt")) {
  # Dispatch on the first character of the line
  switch(substr(i, 1, 1), M = cat("1\n"), R = cat("2\n"), D = cat("3\n"), A = (tab_A <- f_A(i, tab_A)))
}
Then replace the cat() calls with functions that add values to a data.frame for each message type. Use f_A() as the pattern for the other functions, and tab_A as the pattern for the other table structures.
You can combine your readLines() command with regular expressions. To get more information about regular expressions, see the R help page for grep():
> ?grep
So you can go through all the lines, check for each line what it means, and then handle or store the content of the line however you like. (Regular Expressions are also useful to split the data within one line...)
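As a rough illustration of that idea, the sketch below sorts lines by their first letter and parses all lines of one type at once, rather than looping line by line. The example lines are taken from the question's second snippet; splitting on spaces and the column names name_1 to name_4 are assumptions, since the real field layout is not given here.

```r
# Example lines modelled on the question's second snippet
lines <- c("M732", "D 3547742", "A 3551497B 200000 67110 02800",
           "D 3550806", "A 3551498S 250000 69228 09900")

# The first character of each line is the message type
type <- substr(lines, 1, 1)
by_type <- split(lines, type)

# Parse all "A" lines at once. Splitting on runs of spaces is an assumption
# about the format, and the column names are placeholders.
a_fields <- do.call(rbind, strsplit(by_type[["A"]], " +"))
tab_A <- data.frame(
  name_1 = a_fields[, 2],
  name_2 = as.numeric(a_fields[, 3]),
  name_3 = as.numeric(a_fields[, 4]),
  name_4 = as.numeric(a_fields[, 5]),
  stringsAsFactors = FALSE
)

# Regular expressions can select lines and strip prefixes,
# e.g. the number that follows "M"
m_values <- as.numeric(sub("^M", "", grep("^M", lines, value = TRUE)))
```

Working on whole vectors of lines this way is usually much faster in R than growing a data.frame one row at a time, which matters for files with tens of millions of rows.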