I want to load time series data from a csv file. Is this possible using the ts() function?
The data looks like this:
time(ms),value
1390933817000,3775.89624023438
1390933847000,3765.65698242188
1390933877000,3757.01416015625
1390933907000,3768.63623046875
1390933937000,3775.84497070312
1390933967000,3774.53588867188
1390933997000,3771.6240234375
1390934027000,3763.83081054688
As you can see, a value is recorded every 30 seconds.
Try this (the commented-out line shows how to read from a file instead of from text):
Lines <- "time(ms),value
1390933817000,3775.89624023438
1390933847000,3765.65698242188
1390933877000,3757.01416015625
1390933907000,3768.63623046875
1390933937000,3775.84497070312
1390933967000,3774.53588867188
1390933997000,3771.6240234375
1390934027000,3763.83081054688"
library(zoo)
# z <- read.zoo("myfile.dat", sep = ",", header = TRUE, FUN = identity)
z <- read.zoo(text = Lines, sep = ",", header = TRUE, FUN = identity)
as.ts(z)
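If you would rather stay in base R, a minimal sketch along the following lines may help. It assumes the first column holds Unix epoch milliseconds, and it treats one minute as the time unit for ts() (so 30-second data has frequency 2). That choice of unit is an assumption made for illustration, not something ts() requires.

```r
# A few rows of the sample data, inlined so the sketch is self-contained
Lines <- "time(ms),value
1390933817000,3775.89624023438
1390933847000,3765.65698242188
1390933877000,3757.01416015625
1390933907000,3768.63623046875"

# check.names = FALSE keeps the column name "time(ms)" unchanged
df <- read.csv(text = Lines, check.names = FALSE)

# epoch milliseconds -> POSIXct (seconds since 1970-01-01, UTC)
times <- as.POSIXct(df[["time(ms)"]] / 1000, origin = "1970-01-01", tz = "UTC")

# confirm the spacing is constant before treating the series as regular
step <- unique(diff(as.numeric(times)))

# ts() keeps no timestamps, only a start and a frequency;
# assumed time unit: one minute, hence frequency = 60 / step
x <- ts(df$value, start = 0, frequency = 60 / step)
```

Because a ts object does not store the actual timestamps, keep `times` around (or use zoo as above) if you need them later.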
I am trying to combine multiple csv files into one df and add a new column holding the date of each file. I can already read all the files into one df, but I can't add the date column per file. I'm using the following code:
ccn_files <- list.files(pattern = '*.csv', path = "input/CCN/") ##Creates a list of all the files
ccn_data_raw <- do.call("rbind", ##Apply the bind to the files
lapply(ccn_files, ##call the list
function(x) ##apply the next function
read.csv(paste("input/CCN/", x, sep=''),fill = T, header = TRUE,
skip = 4)))
I was also able to get the date from all the files in a vector using this line
test <- ymd(substr(ccn_files,14,19))
How can I add this line inside the first chunk of code so it does what I want?
We can use Map
ccn_data_raw <- do.call(rbind, Map(cbind, lapply(ccn_files,
function(x) read.csv(paste("input/CCN/", x, sep=''),fill = TRUE,
header = TRUE, skip = 4)), date = test))
Or using purrr functions :
library(purrr)
ccn_data_raw <- map2_df(map(ccn_files, function(x)
read.csv(paste("input/CCN/", x, sep=''), fill = TRUE, header = TRUE,
skip = 4)), test, cbind)
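Another option is to do the date extraction inside the reading function, so each file gets its column before the pieces are bound together. The following is a base R sketch only: the file names, the folder, and the position of the yymmdd date inside the name are all made up for the demo, and the `skip = 4` from the question is left out because the demo files have no preamble.

```r
# Hypothetical setup: two small csv files whose names contain a yymmdd date
demo_dir <- file.path(tempdir(), "ccn_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(demo_dir, "ccn_200101.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:5), file.path(demo_dir, "ccn_200102.csv"), row.names = FALSE)

# full.names = TRUE returns paths that can be passed straight to read.csv
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

read_with_date <- function(path) {
  out <- read.csv(path)
  # characters 5-10 of the file name hold the date in this demo (an assumption)
  out$date <- as.Date(substr(basename(path), 5, 10), format = "%y%m%d")
  out
}

combined <- do.call(rbind, lapply(files, read_with_date))
```

For the real files you would adjust the substr() positions (14 to 19 in the question) and add back `skip = 4`.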
Current dilemma: I have a massive data frame that I am trying to break down into smaller files based on a partial string match in the column. I have made a script that works great for this:
df <- read.csv("file.csv", header = TRUE, sep = ",")
newdf <- select(df, matches('threshold1',))
write.csv(newdf,"threshold1.file.csv", row.names = FALSE)
The problem is that I have hundreds of thresholds to break apart into separate files. There must be a way I can loop this script to create all the files for me rather than manually editing the script to say threshold2, threshold3, etc.
You can try to solve it with lapply.
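# Function that selects the columns matching one threshold and saves them
library(dplyr)
split_df <- function(threshold, df){
  newdf <- select(df, matches(threshold))
  # include the threshold in the file name so each call writes its own file
  write.csv(newdf,
            paste(threshold, ".file.csv", sep = ""), row.names = FALSE)
  return(threshold)
}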
df <- read.csv("file.csv", header = TRUE, sep = ",")
# Number of thresholds
N <- 100
threshold_l <- paste("threshold", 1:N, sep = "")
lapply(threshold_l, split_df, df = df)
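One thing to watch for: the pattern is a regular expression, so a bare "threshold1" also matches columns such as threshold10 or threshold100. Below is a small base R sketch that anchors the pattern to avoid this. It assumes the column names look like thresholdN_suffix, which is a guess about the data, and it writes to a temporary folder purely for demonstration.

```r
# Toy data; the column naming scheme is an assumption for this sketch
df <- data.frame(id = 1:3,
                 threshold1_a = 4:6,
                 threshold1_b = 7:9,
                 threshold10_a = 10:12)
out_dir <- tempdir()

# derive the distinct threshold prefixes from the column names themselves
threshold_cols <- grep("^threshold", names(df), value = TRUE)
prefixes <- unique(sub("_.*$", "", threshold_cols))

for (p in prefixes) {
  # "^...": start of name; trailing "_" stops threshold1 matching threshold10
  cols <- grep(paste0("^", p, "_"), names(df), value = TRUE)
  write.csv(df[cols], file.path(out_dir, paste0(p, ".file.csv")),
            row.names = FALSE)
}
```

Deriving the prefixes from the column names also removes the need to know the number of thresholds in advance.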
I'm attempting to import and export, in pieces, a single 10GB CSV file with roughly 10 million observations. I want about 10 manageable RData files in the end (data_1.RData, data_2.RData, etc.), but I'm having trouble making skip and nrows dynamic. My nrows will never change, as I need almost 1 million rows per dataset, but I think I'll need some formula for skip= so that it increases on every loop to catch the next 1 million rows. Also, having header=T might mess up anything after ii=1, since only the first row of the file includes the variable names. The following is the bulk of the code I'm working with:
for (ii in 1:10){
data <- read.csv("myfolder/file.csv",
row.names=NULL, header=T, sep=",", stringsAsFactors=F,
skip=0, nrows=1000000)
outName <- paste("data",ii,sep="_")
save(data,file=file.path(outPath,paste(outName,".RData",sep="")))
}
(Untested but...) You can try something like this:
nrows <- 1000000
# lines to skip before each chunk: the header line plus all rows already read
ind <- 1 + nrows * (0:9)
header <- names(read.csv("myfolder/file.csv", header = TRUE, nrows = 1))
for (i in seq_along(ind)) {
  data <- read.csv("myfolder/file.csv",
                   row.names = NULL, header = FALSE,
                   sep = ",", stringsAsFactors = FALSE,
                   skip = ind[i], nrows = nrows)
  names(data) <- header
  outName <- paste("data", i, sep = "_")
  save(data, file = file.path(outPath, paste(outName, ".RData", sep = "")))
}
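Re-reading with an ever larger skip= means R has to scan past all the earlier lines on every pass. A possible alternative is to open the file once as a connection and let each read.csv call continue where the previous one stopped. The sketch below shows the idea on a tiny temporary file; the chunk size and file contents are invented for the demo, and in real use you would call save() on each chunk rather than collect them in a list.

```r
# Toy file standing in for the large csv (10 rows, read 4 at a time)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), path, row.names = FALSE)

chunk_size <- 4
header <- names(read.csv(path, nrows = 1))

con <- file(path, open = "r")
invisible(readLines(con, n = 1))  # consume the header line once

chunks <- list()
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header, nrows = chunk_size),
    error = function(e) NULL  # read.csv raises an error when no lines remain
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk
  # real use: save(chunk, file = ...) here instead of keeping it in memory
}
close(con)
```

The loop ends on its own when the file is exhausted, so the number of chunks does not have to be known up front.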
I've written the following code to import data into R:
## specify where all the data files are stored
DataFolder <- "DataFolder"
## obtain the name of each file in DataFolder
files <- list.files(DataFolder)
## obtain name of each file
LocNames <- unique(sub("^([^.]*).*", "\\1", files)) # this removes the extension and keeps the unique names
for (i in 1:length(LocNames)){
#
car <- read.table(paste(DataFolder, paste(LocNames[i], ".car", sep=""), sep="/"),
header = TRUE, sep = "\t", colClasses=c(dateTime="POSIXct"))
car <- aggregate(car[colnames(car)[2:length(colnames(car))]],list(dateTime = cut(car$dateTime,breaks = "hour")),mean, na.rm = TRUE)
#
light <- read.table(paste(DataFolder, paste(LocNames[i], ".light", sep=""), sep="/"),
header = TRUE, sep = "\t", colClasses=c(dateTime="POSIXct"))
light <- aggregate(light[colnames(light)[2]],list(dateTime = cut(light$dateTime, breaks = "hour")),mean, na.rm = TRUE)
}
So, here I have a DataFolder where all of my files are stored. The files are named after the location where the data was recorded, and the file extension gives the name of the variable measured. Here we have car sales and light as examples.
From here I would like to shorten the code inside the loop: instead of repeating the same steps for one variable after another, I want to write only the variable name (e.g. car, light) and get the same result as the script shown.
Please let me know if my intentions have not been clear.
Just use a function. Something to the effect of
## specify where all the data files are stored
DataFolder <- "DataFolder"
## obtain the name of each file in DataFolder
files <- list.files(DataFolder)
readMyFiles <- function(DataFolder, LocName, extension){
  # read one location's file for the given variable and average it by hour
  data <- read.table(paste(DataFolder, paste(LocName, ".", extension, sep=""), sep="/"),
                     header = TRUE, sep = "\t", colClasses=c(dateTime="POSIXct"))
  data <- aggregate(data[colnames(data)[2:length(colnames(data))]],list(dateTime = cut(data$dateTime,breaks = "hour")),mean, na.rm = TRUE)
  data
}
## obtain name of each file
LocNames <- unique(sub("^([^.]*).*", "\\1", files)) # this removes the extension and keeps the unique names
for (i in 1:length(LocNames)){
  # pass a single location and the bare extension (the function adds the dot)
  car <- readMyFiles(DataFolder, LocNames[i], "car")
  light <- readMyFiles(DataFolder, LocNames[i], "light")
}
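If the list of variables grows, you could go one step further and loop over the extensions as well, collecting the results in a named list. The following is only a sketch under assumed conditions: the demo folder, the location name siteA, and the file contents are all invented, and `read_hourly` is a hypothetical helper name.

```r
# Hypothetical demo files: siteA.car and siteA.light, tab separated
data_dir <- file.path(tempdir(), "loc_demo")
dir.create(data_dir, showWarnings = FALSE)
stamps <- as.POSIXct("2020-01-15 10:00:00", tz = "UTC") + (0:5) * 1200  # every 20 min
for (ext in c("car", "light")) {
  write.table(data.frame(dateTime = format(stamps, "%Y-%m-%d %H:%M:%S"),
                         value = 1:6),
              file.path(data_dir, paste0("siteA.", ext)),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# hypothetical helper: read one file and average every column by hour
read_hourly <- function(dir, loc, ext) {
  d <- read.table(file.path(dir, paste0(loc, ".", ext)),
                  header = TRUE, sep = "\t",
                  colClasses = c(dateTime = "POSIXct"))
  aggregate(d[-1], list(dateTime = cut(d$dateTime, breaks = "hour")),
            mean, na.rm = TRUE)
}

# one hourly data frame per variable, named after the extension
vars <- c(car = "car", light = "light")
hourly <- lapply(vars, function(ext) read_hourly(data_dir, "siteA", ext))
```

Adding a new variable then only means adding its extension to `vars`.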
I have daily data starting from 1980 in a csv file, but I want to read the data only from 1985 onward, because the dataset in another file starts in 1985. How can I skip the data before 1985 in R?
I think you want to take a look at ?read.csv to see all the options.
It's a bit hard to give an exact answer without seeing a sample of your data.
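If your data doesn't have a header and you know which line the 1985 data starts on, you can just use something like...
impordata <- read.csv(file, header=FALSE, skip=1827)
...to skip the first 1827 lines (the number of days from 1980-01-01 through 1984-12-31, assuming one row per day with no gaps).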
Otherwise you can always just subset the data after you've imported it if you have a year variable in your data.
impordata <- read.csv("skiplines.csv")
impordata <- subset(impordata,year>=1985)
If you don't know where the 1985 data starts, you can use grep to find the first instance of 1985 in your file's date variable and then only keep from that line onwards:
impordata <- read.csv("skiplines.csv")
impordata <- impordata[min(grep(1985,impordata$date)):nrow(impordata),]
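If the date column is in an unambiguous format, converting it to Date class and comparing dates is a little more robust than pattern matching. The sketch below uses a made-up four-row data frame in place of the imported file, and also checks how many daily rows come before 1985 under the assumption that the series starts on 1980-01-01 and has no gaps.

```r
# Toy stand-in for the imported data (column names are assumptions)
impordata <- data.frame(
  date  = c("1984-12-30", "1984-12-31", "1985-01-01", "1985-01-02"),
  value = 1:4
)

# compare real dates rather than matching the text "1985"
impordata$date <- as.Date(impordata$date)
from1985 <- impordata[impordata$date >= as.Date("1985-01-01"), ]

# number of daily rows before 1985 if the data start on 1980-01-01 without gaps
n_before <- length(seq(as.Date("1980-01-01"), as.Date("1984-12-31"), by = "day"))
```

Here `n_before` is the value you would pass as skip= for a file without a header line.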
Here are a few alternatives. (You may wish to convert the first column to "Date" class afterwards and possibly convert the entire thing to a zoo object or other time series class object.)
# create test data
fn <- tempfile()
dd <- seq(as.Date("1980-01-01"), as.Date("1989-12-31"), by = "day")
DF <- data.frame(Date = dd, Value = seq_along(dd))
write.table(DF, file = fn, row.names = FALSE)
read.table + subset
# if file is small enough to fit in memory try this:
DF2 <- read.table(fn, header = TRUE, as.is = TRUE)
DF2 <- subset(DF2, Date >= "1985-01-01")
read.zoo
# or this which produces a zoo object and also automatically converts the
# Date column to Date class. Note that all columns other than the Date column
# should be numeric for it to be representable as a zoo object.
library(zoo)
z <- read.zoo(fn, header = TRUE)
zw <- window(z, start = "1985-01-01")
If your data is not in the same format as the example you will need to use additional arguments to read.zoo.
multiple read.table's
# if the data is very large read 1st row (DF.row1) and 1st column (DF.Date)
# and use those to set col.names= and skip=
DF.row1 <- read.table(fn, header = TRUE, nrow = 1)
nc <- ncol(DF.row1)
DF.Date <- read.table(fn, header = TRUE, as.is = TRUE,
colClasses = c(NA, rep("NULL", nc - 1)))
n1985 <- which.max(DF.Date$Date >= "1985-01-01")
DF3 <- read.table(fn, col.names = names(DF.row1), skip = n1985, as.is = TRUE)
sqldf
# this is probably the easiest if data set is large.
library(sqldf)
DF4 <- read.csv.sql(fn, sql = 'select * from file where Date >= "1985-01-01"')
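A data.table method, which is typically faster and more memory-efficient than read.csv on large files:
library(data.table)
# 1827 = days from 1980-01-01 through 1984-12-31; assumes no header line and no missing days
fread(file, skip = 1827)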