I'm trying to read time series from a CSV file and save them as xts so I can process them with quantmod. The problem is that the numeric values are not parsed as numeric.
CSV file:
name;amount;datetime
test1;3;2010-09-23 19:00:00.057
test2;9;2010-09-23 19:00:00.073
R code:
library(xts)
ColClasses = c("character", "numeric", "character")
Data <- read.zoo("c:\\dat\\test2.csv", index.column = 3, sep = ";", header = TRUE, FUN = as.POSIXct, colClasses = ColClasses)
as.xts(Data)
Result:
name amount
2010-09-23 19:00:00 "test1" "3"
2010-09-23 19:00:00 "test2" "9"
As you can see, the amount column contains character data, but it is expected to be numeric. What's wrong with my code?
The internal data structure of both zoo and xts is a matrix, so you cannot mix data types.
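A minimal sketch (not from the original question) of why that produces the output you saw:
# a matrix can hold only one type, so mixing character and
# numeric columns coerces everything to character
m <- as.matrix(data.frame(name = "test1", amount = 3))
storage.mode(m)  # "character" -- the numeric column was coerced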
Just read in the data with read.table:
Data <- read.table("file.csv", sep=";", header=TRUE, colClasses=ColClasses)
I notice your data have subseconds, so you may be interested in xts::align.time. This code takes Data and creates one merged object, with a column for each "name", aligned to the next second.
NewData <- do.call( merge, lapply( split(Data, Data$name), function(x) {
  align.time( xts(x[,"amount"], as.POSIXct(x[,"datetime"])), n = 1 )
}) )
If you want to create objects test1 and test2 in your global environment, you can do something like:
lapply( split(Data, Data$name), function(x) {
  # assign() needs a single string, so take the first element of the name column
  assign(x[1,"name"], xts(x[,"amount"], as.POSIXct(x[,"datetime"])), envir = .GlobalEnv)
})
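Note that assigning into .GlobalEnv from inside lapply works, but keeping the results in the named list that lapply returns (as in the merge example above) is usually the cleaner design.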
You cannot mix numeric and character data in a zoo or xts object. However, if the name column is not intended to be time series data, but rather distinguishes between multiple time series (one for test1, one for test2, etc.), then you can split on column 1 using split = 1, as shown in the following code. Be sure to set digits.secs, or else you won't see the sub-seconds on output (although they will be there in any case):
options(digits.secs = 3)
z <- read.zoo("myfile.csv", sep = ";", split = 1, index = 3, header = TRUE, tz = "")
x <- as.xts(z)
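With the two-row sample from the question, z (and hence x) should be a two-column object, one column per name, roughly like this:
                        test1 test2
2010-09-23 19:00:00.057     3    NA
2010-09-23 19:00:00.073    NA     9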
I have a text file of names, separated by commas, and I want to read it into R (a data frame or a vector is fine). With read.csv, the names are all read in as headers for separate columns, with 0 rows of data. With header = FALSE, they are read in as separate columns. I could work with that, but what I really want is a single column with one row per name. As it stands, when I print the data frame, it prints all the column headers, which are useless, and then doesn't print the values. One column of names would be much easier to work with.
Since the OP asked me to, I'll post the comment above as an answer.
It's very simple, and it comes down to reading in a sequence of data, numeric or character, using scan:
dat <- scan(file = your_filename, what = 'character', sep = ',')
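scan returns a character vector, so if you want the one-column data frame described in the question, wrap the result (the file name here is hypothetical):
# "names.txt" is a stand-in for your file of comma-separated names
dat <- scan(file = "names.txt", what = "character", sep = ",")
df <- data.frame(name = dat)  # one row per name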
You can use read.csv and let it read the string as a header, then extract the names (using names) and put them into a data.frame:
data.frame(x = names(read.csv("FILE")))
For example:
write.table("qwerty,asdfg,zxcvb,poiuy,lkjhg,mnbvc",
"FILE", col.names = FALSE, row.names = FALSE, quote = FALSE)
data.frame(x = names(read.csv("FILE")))
x
1 qwerty
2 asdfg
3 zxcvb
4 poiuy
5 lkjhg
6 mnbvc
Something like this?
Make some test data:
# test data
list_of_names <- c("qwerty","asdfg","zxcvb","poiuy","lkjhg","mnbvc" )
list_of_names <- paste(list_of_names, collapse = ",")
list_of_names
# write to temp file
tf <- tempfile()
writeLines(list_of_names, tf)
You need this part:
# read from file
line_read <- readLines(tf)
line_read
list_of_names_new <- unlist(strsplit(line_read, ","))
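If you want the one-column data frame from the question, one more line finishes the job (continuing from list_of_names_new above):
df <- data.frame(name = list_of_names_new)
df  # one row per name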
I have an Excel file with several sheets, each with several columns, so I would like not to specify the type of each column separately, but to have the types detected automatically, the way stringsAsFactors = FALSE does, because that interprets the column types correctly. With my current method, a column with "0.492 ± 0.6" is interpreted as a number (returning NA), "because" the stringsAsFactors option is not available in read_excel. So here is a workaround that works more or less well, but that I cannot use in real life, because I am not allowed to create a new file. Note: I need some columns as numbers or integers, and others that contain only text as characters, just as stringsAsFactors does in my read.csv example.
library(readxl)
file= "myfile.xlsx"
firstread<-read_excel(file, sheet = "mysheet", col_names = TRUE, na = "", skip = 0)
# firstread has the problem that a column with "0.492 ± 0.6"
# is interpreted as numeric (returns NA)
colna<-colnames(firstread)
# read every column as character
colnumt<-ncol(firstread)
textcol<-rep("text", colnumt)
secondreadchar<-read_excel(file, sheet = "mysheet", col_names = TRUE,
col_types = textcol, na = "", skip = 0)
# another column, with the number 0.532, is now 0.5319999999999999
# and several other similar cases.
# read again with stringsAsFactors
# critical step, in real life, I "cannot" write a csv file.
write.csv(secondreadchar, "allcharac.txt", row.names = FALSE)
stringsasfactor<-read.csv("allcharac.txt", stringsAsFactors = FALSE)
colnames(stringsasfactor)<-colna
# column with "0.492 ± 0.6" now is character, as desired, others numeric as desired as well
Here is a script that imports all the data in your excel file. It puts each sheet's data in a list called dfs:
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
# Loop through the sheet names and get the data in each sheet
dfs <- lapply(all_sheets, function(x) {
  # Get the number of columns in the current sheet
  col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = x))
  # Get the dataframe with all columns read as text
  df <- read_excel(path = "myfile.xlsx", sheet = x, col_types = rep('text', col_num))
  # Convert from tibble to data.frame
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  # Find numeric fields by trying to convert them into
  # numeric values. If conversion returns NA, the field is
  # not numeric; otherwise it is.
  cond <- apply(df, 2, function(x) {
    x <- x[!is.na(x)]
    all(suppressWarnings(!is.na(as.numeric(x))))
  })
  numeric_cols <- names(df)[cond]
  df[, numeric_cols] <- sapply(df[, numeric_cols], as.numeric)
  # Return df in the desired format
  df
})
# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets
The process goes as follows:
First, you get all the sheet names with excel_sheets and then loop through them to create the data frames. For each data frame, you initially import the data as text by setting the col_types parameter to "text". Once you have the columns as text, you convert the structure from a tibble to a data.frame. After that, you find the columns that are actually numeric and convert them to numeric values.
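The numeric-detection step is worth seeing in isolation; here is a minimal sketch of the same test applied to standalone vectors:
# returns TRUE only if every non-NA value survives as.numeric
is_numeric_col <- function(x) {
  x <- x[!is.na(x)]
  all(suppressWarnings(!is.na(as.numeric(x))))
}
is_numeric_col(c("1.5", "2", NA))       # TRUE
is_numeric_col(c("0.492 ± 0.6", "1"))   # FALSE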
Edit:
As of late April, a new version of readxl was released, and the read_excel function got two enhancements pertinent to this question. The first is that you can have the function guess the column types for you by passing "guess" to the col_types parameter. The second enhancement (a corollary of the first) is that a guess_max parameter was added to read_excel. This new parameter lets you set the number of rows used for guessing the column types. Essentially, what I wrote above can be shortened with the following:
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
dfs <- lapply(all_sheets, function(sheetname) {
  suppressWarnings(read_excel(path = "myfile.xlsx",
                              sheet = sheetname,
                              col_types = 'guess',
                              guess_max = Inf))
})
# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets
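One caveat: guess_max = Inf asks read_excel to use all rows for type guessing (which is also why the call is wrapped in suppressWarnings), and that can be slow on very large sheets; a large finite value such as guess_max = 10000 is a common compromise.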
I would recommend that you update readxl to the latest version to shorten your script and as a result avoid possible annoyances.
I hope this helps.
I'm trying to load data from a CSV into a data frame. Here is what I do:
input <- read.csv("CONCAT_RESULT.CSV", sep = ",", skip = 1, col.names = c("ABS_ERG","MEHRFACH_COUNTER","TECH_KEY","XX_KEY","YY_SCHLUESSEL","CCC","LAND","HIERARCHIE_STICHTAGSABZUG","DATUM_SST_ERZEUGUNG","UHRZEIT_SST_ERZEUGUNG","FRUEHESTES_ABZUGSDATUM","AGGR_KLASSE_ID","ANTWORT_NUM","ANTWORT_TEXT","UMFRAGETYP_ID","ZZZ_ID","TTT_ID","BEANTWORTUNG_TYP","TRANSFORMIERT"))
In the next step I remove a few columns:
input["HIERARCHIE_STICHTAGSABZUG"] <- NULL
input["DATUM_SST_ERZEUGUNG"] <- NULL
input["UHRZEIT_SST_ERZEUGUNG"] <- NULL
input["FRUEHESTES_ABZUGSDATUM"] <- NULL
input["ANTWORT_TEXT"] <- NULL
Then I try to convert it to a data.frame with:
input.data <- as.data.frame(input)
But typeof(input.data) returns: [1] "list"
Can anybody tell me why?
Thanks
A data.frame is a list of vectors of the same length. Thus, list is a correct type for a data.frame.
Try
typeof(data.frame(a=1, b=2, c=3))
to see that a data.frame is just a list. To learn more, see help(mode) and help(data.frame).
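If you want to confirm that you really have a data frame, check the S3 class rather than the storage type:
# typeof() reports the underlying storage; class() reports what
# R's functions actually dispatch on
typeof(input.data)         # "list"
class(input.data)          # "data.frame"
is.data.frame(input.data)  # TRUE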
I have a CSV file containing data whose first column is a Unix timestamp. How can I convert it to xts form directly? Currently I am trying to read the file and convert it using as.xts, but I get error messages every way I try.
An example of a code I used:
Data <- read.zoo("data.csv", index.column = 1, origin="01/01/1970",
sep = ",", header = TRUE, FUN = as.POSIXct)
as.xts(Data)
The first 2 lines of the CSV:
1366930371 143.7 0.25275
1366930368 143.7 0.02664867
There could be several things wrong. First, the first 2 lines of your "csv" are tab-separated, not comma-separated. Next, you specify header = TRUE, but the first 2 lines have no header. Third, origin= is in the wrong format: it should be yyyy-mm-dd.
This works:
library(xts)
Lines <- "1366978862,133.08,0.48180896
1366978862,133.08,0.5"
tc <- textConnection(Lines)
Data <- read.zoo(tc, sep=",", FUN=function(i) as.POSIXct(i, origin="1970-01-01"))
close(tc)
Data <- as.xts(Data)
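If your real file is tab-separated, as the two sample lines suggest, the same approach should work on the file itself with sep = "\t" and header = FALSE (assumptions based only on the sample shown):
# assumes a headerless, tab-separated file with the timestamp in column 1
Data <- read.zoo("data.csv", sep = "\t", header = FALSE,
  FUN = function(i) as.POSIXct(i, origin = "1970-01-01"))
Data <- as.xts(Data)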
I have daily data starting from 1980 in a CSV file, but I want to read the data only from 1985 onward, because the other dataset in another file starts in 1985. How can I skip reading the pre-1985 data in R?
I think you want to take a look at ?read.csv to see all the options.
It's a bit hard to give an exact answer without seeing a sample of your data.
If your data doesn't have a header and you know which line the 1985 data starts on, you can just use something like...
impordata <- read.csv(file, skip = 1825)
...to skip the first 1825 lines (1825 = 5 years × 365 days; adjust for leap years, and for a header line if your file has one).
Otherwise you can always just subset the data after you've imported it if you have a year variable in your data.
impordata <- read.csv("skiplines.csv")
impordata <- subset(impordata,year>=1985)
If you don't know where the 1985 data starts, you can use grep to find the first instance of 1985 in your file's date variable and then only keep from that line onwards:
impordata <- read.csv("skiplines.csv")
impordata <- impordata[min(grep(1985,impordata$date)):nrow(impordata),]
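One caveat with the grep approach: the pattern 1985 will match anywhere in the string, including in a day or month component or another part of the field. If the date strings start with the year, anchoring the pattern is safer:
impordata <- impordata[min(grep("^1985", impordata$date)):nrow(impordata), ]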
Here are a few alternatives. (You may wish to convert the first column to "Date" class afterwards and possibly convert the entire thing to a zoo object or other time series class object.)
# create test data
fn <- tempfile()
dd <- seq(as.Date("1980-01-01"), as.Date("1989-12-31"), by = "day")
DF <- data.frame(Date = dd, Value = seq_along(dd))
# quote = FALSE keeps the file plain for read.csv.sql below
write.table(DF, file = fn, row.names = FALSE, quote = FALSE)
read.table + subset
# if file is small enough to fit in memory try this:
DF2 <- read.table(fn, header = TRUE, as.is = TRUE)
DF2 <- subset(DF2, Date >= "1985-01-01")
read.zoo
# or this which produces a zoo object and also automatically converts the
# Date column to Date class. Note that all columns other than the Date column
# should be numeric for it to be representable as a zoo object.
library(zoo)
z <- read.zoo(fn, header = TRUE)
zw <- window(z, start = "1985-01-01")
If your data is not in the same format as the example you will need to use additional arguments to read.zoo.
multiple read.table's
# if the data is very large read 1st row (DF.row1) and 1st column (DF.Date)
# and use those to set col.names= and skip=
DF.row1 <- read.table(fn, header = TRUE, nrow = 1)
nc <- ncol(DF.row1)
DF.Date <- read.table(fn, header = TRUE, as.is = TRUE,
colClasses = c(NA, rep("NULL", nc - 1)))
n1985 <- which.max(DF.Date$Date >= "1985-01-01")
DF3 <- read.table(fn, col.names = names(DF.row1), skip = n1985, as.is = TRUE)
sqldf
# this is probably the easiest if the data set is large.
# note: read.csv.sql does not strip quotes (hence quote = FALSE above),
# and sep = " " matches the space-separated file written by write.table
library(sqldf)
DF4 <- read.csv.sql(fn, sql = 'select * from file where Date >= "1985-01-01"',
  sep = " ")
A data.table approach, which offers good speed and memory performance:
library(data.table)
fread(file, skip = 1825)
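fread's skip argument can also take a string, in which case reading starts at the first line containing it, so if you know the literal first 1985 date you can skip to it without counting lines (the date format here is an assumption about your file):
library(data.table)
# assumes some line in the file begins with the literal text "1985-01-01"
DT <- fread(file, skip = "1985-01-01")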