filtering while downloading a dataset R


There is a large dataset that I need to download over the web using R, and I would like to filter it while downloading so that I keep only the dates I need. Right now I download the file, unzip it, and then create another data set with a filter. The file is a ";"-delimited text file.
There is a Date column with the format 1/1/2009, and I need to select only two dates, 3/1/2009 and 3/2/2009. How can I do that in R?
When I import it, R sets the column as a factor. Since I only need those two dates, there is no need for a "between"; I can just select the two factor levels and call it a day.
Thanks!

I don't think you can filter while downloading. To select only these dates you can use the subset function:
# do not convert string to factors
d.all = read.csv(file, ..., stringsAsFactors = FALSE, sep = ';')
# Date column is called DATE:
d.filter = subset(d.all, DATE %in% c("3/1/2009", "3/2/2009"))
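
If the unzipped file is too large to load comfortably, one possible workaround is to read it in chunks and keep only the matching lines before parsing. The following is a minimal sketch, not a way to filter during the download itself. The sample file, the column names DATE and VALUE, the chunk size, and the assumption that the date is the first field are all hypothetical.

```r
# Hypothetical sample file standing in for the unzipped download
path <- tempfile(fileext = ".txt")
writeLines(c("DATE;VALUE",
             "1/1/2009;10",
             "3/1/2009;20",
             "3/2/2009;30",
             "4/1/2009;40"), path)

wanted <- c("3/1/2009", "3/2/2009")

con <- file(path, open = "r")
header <- readLines(con, n = 1)       # keep the header line for parsing later
kept <- character(0)
repeat {
  chunk <- readLines(con, n = 2)      # tiny chunk size, for the demo only
  if (length(chunk) == 0) break
  dates <- sub(";.*$", "", chunk)     # assumes the date is the first field
  kept <- c(kept, chunk[dates %in% wanted])
}
close(con)

# Parse only the lines that survived the filter
d.filter <- read.csv(text = c(header, kept), sep = ";",
                     stringsAsFactors = FALSE)
print(d.filter)
```

For a real file a much larger chunk size (tens of thousands of lines) would make more sense; the idea is only that the full table never has to sit in memory at once.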

Related

Isolate column in R from text file

I want to analyze market data that is being saved in a text file.
The data consists of "Date Time;Price;Size". I want to only look at the Sizes, how can I separate this data in R so that I may do statistical analysis on the sizes?
Example:
20170918 040001;50.42;1
20170918 040002;50.42;1

Just use read.csv with a semicolon as the delimiter:
df <- read.csv(file="path/to/your/file.csv", sep=";", header=TRUE)
The sizes can then be accessed using df$Size (assuming the header line names that column Size).
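
As a small illustration using the two sample lines from the question, here is a sketch that assumes the file has no header line and assigns the column names date_time, price, and size itself (those names are an assumption, not part of the original data):

```r
# The two example rows from the question, read from text instead of a file
sample_lines <- c("20170918 040001;50.42;1",
                  "20170918 040002;50.42;1")

df <- read.csv(text = sample_lines, sep = ";", header = FALSE,
               col.names = c("date_time", "price", "size"),
               stringsAsFactors = FALSE)

sizes <- df$size      # plain vector of sizes, ready for analysis
print(summary(sizes))
```

With a real file, replace the text argument with file = "path/to/your/file.csv".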

You can use the select argument of data.table:
library(data.table)
# [[1L]] extracts the column of the temporary table to a vector;
# you could also use $V2, but this _may_ not be perfectly robust
price = fread('/path/to/file', select = 2L)[[1L]]
fread should be able to detect automatically that your file doesn't have headers, as well as that the field separator is ;. If not, set header = FALSE and/or sep = ';'.
Of course, it's not likely that you will only use the vector of prices independently of the rest of the data. So you should really just store the whole data file in a data.table:
market_data = fread('/path/to/file', col.names = c('date_time', 'price', 'size'))
Then you can manipulate market_data as you would any data.table (see Getting Started), e.g.
market_data[ , mean(price)]
market_data[ , sd(price)]
and so on.

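df=read.table("your file", sep=";")
size=df[3]
Your sizes will be in size as a one-column data frame. Note that sep=";" is needed here: with the default whitespace separator, the line would be split at the space inside the date-time field instead.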

Export a simple R dataframe to txt tsv or csv

I am trying to do something apparently obvious, but have no way to solve it. From a dataframe in R downloaded from the web as follows I need to save the data. Here is how I do download it:
library(tseries)
library(zoo)
ts <- get.hist.quote(instrument="DJIA",
start="2008-07-01", end="2017-03-05",
quote="Close", provider="yahoo", origin="1970-01-01",
compression="d", retclass="zoo")
This returns an object "ts" that looks like a two-column table: the first column holds the dates (with no header, as R prefers) and the other holds the "Close" value of DJIA:
> ts
Close
2008-07-01 11382.26
2008-07-02 11215.51
2008-07-03 11288.53
2008-07-07 11231.96
.
.
.
2016-03-03 16943.90
2016-03-04 17006.77
I need this data exported in txt or a similar format so that I can import it later (I will be processing health information with no internet access), but when I try to save it, the date column with no header is missing. Additionally, a "row number" column is added. I apologize if the question is obvious, but I have no other way to solve it.

The date column has no header because the dates are stored as the index (row names) of the object rather than as a regular column. Although row.names = TRUE is already the default of write.csv, try making it explicit:
write.csv(ts, file = "ts.csv",row.names=TRUE)
EDIT
Strangely, this doesn't work with an object of class "zoo".
According to ?write.table:
write.table prints its required argument x (after converting it to a
data frame if it is not one nor a matrix) to a file or connection.
Apparently this conversion fails somehow. However, this works:
write.csv(data.frame(ts), file = "ts.csv",row.names=TRUE)
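
To show the round trip without internet access, here is a minimal base R sketch. The small data frame is a stand-in for data.frame(ts), built from the first three rows shown in the question, and the temporary file name is hypothetical.

```r
# Stand-in for data.frame(ts): dates as row names, one Close column
df <- data.frame(Close = c(11382.26, 11215.51, 11288.53),
                 row.names = c("2008-07-01", "2008-07-02", "2008-07-03"))

out <- tempfile(fileext = ".csv")
write.csv(df, file = out, row.names = TRUE)

# Later session: treat the first (unnamed) column as row names again
back <- read.csv(out, row.names = 1)
dates <- as.Date(rownames(back))   # recover real Date values from the row names
print(back)
```

Reading with row.names = 1 avoids the extra unnamed column that would otherwise hold the dates.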

The ts object is a zoo object (not a two column table). In this case the zoo object is internally represented by a one column matrix of data and an "index" attribute holding the dates.
1) save/load If the only thing you want to do with the output file is to read it back into R later then there is no reason to require text and any format will do. In particular you could do this:
save(ts, file = "ts.Rda")
Now in a later session:
library(zoo)
load("ts.Rda")
1a) This would also work and produces an R source file that when sourced reconstructs the zoo object:
dump("ts", "ts.R")
and in a later session:
library(zoo)
source("ts.R")
2) write.zoo/read.zoo This will give a text file:
write.zoo(ts, "ts.dat")
and it can be read back in another session using:
library(zoo)
ts <- cbind( read.zoo("ts.dat", header = TRUE) )

How to Make a New Column in a Data Set with Values Corresponding to a Separate Data Set

I have two different csv files, one is called CA_Storms and one is called CA_adj. CA_Storms has many start and end dates/times for storm events (in one column), and CA_adj has a DateTime column that includes many thousand dates/times. I want to see if any of the dates/times in CA_adj correspond with any of the storm events in CA_Storms. To do this, I am trying to make a new column in CA_adj titled Storm_ID that will identify which storm it corresponds with based on the storm start and end times/dates in CA_Storms.
This is the process I have currently undergone:
#Make a value to which the csv files are attached
CA_Storms <- read.csv(file = "CA_Storms.csv", header = TRUE, stringsAsFactors = FALSE)
CA_adj <- read.csv(file = "CA_adj.csv", header = TRUE, stringsAsFactors = FALSE)
#strptime function (do this for both data sets)
CA_adj$DateTime1 <- strptime(CA_adj$DateTime, format = "%m/%d/%Y %H:%M")
CA_Storms$Start.time1 <- strptime(CA_Storms$Start.time, format = "%m/%d/%Y %H:%M")
CA_Storms$End.time1 <- strptime(CA_Storms$End.time, format = "%m/%d/%Y %H:%M")
#Make a new column into CA_adj that says Storm ID. Have it by
#default hold NAs.
CA_adj$Storm_ID <- NA
#Write a which statement to see if it meets the conditions of greater than
#or equal to start time or less than or equal to end time. Put this through a
#for loop to apply it to every row within CA_adj$DateTime1
for (i in nrow(CA_adj$DateTime1))
{
CA_adj$DateTime1[which(CA_adj$DateTime1 >= CA_Storms$Start.time1 | CA_adj$DateTime1 <= CA_Storms$End.time1), "Storm_ID"]
}
This is not giving me any errors, but it's also not replacing any of the values in the Storm_ID column that I have made. In my Global Environment under "Values" it now just says: i is NULL(empty). I am pretty sure what's missing is an i within the for loop, but I do not know where to put it. I also think the other issue is that it doesn't know what value to replace the NA's in the Storm_ID column with. I would like it to replace the NA's with the correct Storm ID that corresponds with the Storm dates (in CA_Storms$Start.time1 and in CA_Storms$End.Time1). For Dates/Times within CA_adj that do not apply to a storm date, I'd just want it to continue to say NA.
Any guidance on how to do this would be greatly appreciated. I'm new to R, and I've been trying to teach it to myself, which can make figuring out how to do these things on my own a bit difficult.
Thanks so much!

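Why not have a look at the lubridate package? It lets you create date/time intervals, which can then be tested against a specific date/time with %within%. Your code should be simpler.
You do need to use the loop index, and you also need to make an assignment to CA_adj$Storm_ID. I'm not certain whether a single CA_adj entry could fall into more than one CA_Storms interval.
library(lubridate)
# make a lubridate interval in CA_Storms
CA_Storms$interval <- interval(mdy_hm(CA_Storms$Start.time), mdy_hm(CA_Storms$End.time))
# make CA_adj$DateTime a lubridate date-time
CA_adj$DateTime <- mdy_hm(CA_adj$DateTime)
# loop through all rows of CA_adj (assumes CA_Storms has a StormID column)
for (i in seq_len(nrow(CA_adj))) {
  hit <- which(CA_adj$DateTime[i] %within% CA_Storms$interval)
  # keep NA when no storm matches; take the first match if several overlap
  if (length(hit) > 0) CA_adj$Storm_ID[i] <- CA_Storms$StormID[hit[1]]
}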
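
The same matching can be sketched in base R without lubridate. This is only an illustration under stated assumptions: the data frames storms and obs, their column names, and the sample times are hypothetical stand-ins for CA_Storms and CA_adj. Two points differ from the code in the question: the times are converted with as.POSIXct (which stores cleanly in a data frame column, unlike the list-based result of strptime), and the two conditions are joined with & rather than |, since a time must be after the start and before the end.

```r
# Hypothetical stand-ins for CA_Storms and CA_adj
fmt <- "%m/%d/%Y %H:%M"
storms <- data.frame(
  StormID = c("S1", "S2"),
  start = as.POSIXct(c("01/01/2017 00:00", "01/03/2017 12:00"), format = fmt, tz = "UTC"),
  end   = as.POSIXct(c("01/01/2017 06:00", "01/03/2017 18:00"), format = fmt, tz = "UTC"),
  stringsAsFactors = FALSE
)
obs <- data.frame(
  DateTime = as.POSIXct(c("01/01/2017 03:00", "01/02/2017 00:00", "01/03/2017 12:00"),
                        format = fmt, tz = "UTC")
)

obs$Storm_ID <- NA_character_          # default: not part of any storm

# Loop over the (few) storms rather than the (many) observations
for (s in seq_len(nrow(storms))) {
  inside <- obs$DateTime >= storms$start[s] & obs$DateTime <= storms$end[s]
  obs$Storm_ID[inside] <- storms$StormID[s]
}
print(obs)
```

If storm intervals overlap, a later storm overwrites an earlier one in this sketch; whether that is acceptable depends on the data.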

Using R to create and merge zoo object time series from csv files

I have a large set of csv files in a single directory. These files contain two columns, Date and Price. The filename of each filename.csv contains the unique identifier of the data series. I understand that missing values for merged data series can be handled when these time series are zoo objects. I also understand that, by using na.locf(merge()), I can fill in the missing values with the most recent observations.
I want to automate the process of:
1) loading the *.csv file columnar Date and Price data into R data frames;
2) establishing each distinct time series within the merged zoo "portfolio of time series" object with an identity equal to its <filename>;
3) merging these zoo time series using MergedData <- na.locf(merge( )).
The ultimate goal, of course, is to use the fPortfolio package.
I've used the following statement to create a data frame of Date,Price pairs. The problem with this approach is that I lose the <filename> identifier of the time series data from the files.
result <- lapply(files, function(x) x <- read.csv(x) )
I understand that I can write code to generate the R statements required to do all these steps instance by instance. I'm wondering if there is some approach that wouldn't require me to do that. It's hard for me to believe that others haven't wanted to perform this same task.

Try this:
z <- read.zoo(files, header = TRUE, sep = ",")
z <- na.locf(z)
I have assumed a header line and lines like 2000-01-31,23.40 . Use whatever read.zoo arguments are necessary to accommodate whatever format you have.

You can get better-formatted output using sapply (which keeps the file names). Here I will keep lapply.
Assuming that all your files are in the same directory, you can use list.files.
It is very handy for this kind of workflow.
I would use read.zoo to get zoo objects directly (this avoids coercing later).
For example:
zoo.objs <- lapply(list.files(path=MY_FILES_DIRECTORY,
                              pattern='^zoo_.*\\.csv$', ## regex: csv files
                                                        ## whose names start with zoo_
                              full.names=TRUE), ## to get full names path+filename
                   read.zoo, header=TRUE, sep=',') ## adjust to your file format
I now use list.files again to name the result:
names(zoo.objs) <- list.files(path=MY_FILES_DIRECTORY,
                              pattern='^zoo_.*\\.csv$')
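
The idea of keeping the file name as the identifier can also be sketched in base R alone. This is a minimal illustration, not the answers' method: the temporary directory, the file names AAA.csv and BBB.csv, and most of the sample prices are hypothetical. The identifier is taken from each file name with its extension removed.

```r
# Hypothetical directory with two small Date,Price files
dir <- tempfile("series_")
dir.create(dir)
write.csv(data.frame(Date = c("2000-01-31", "2000-02-29"), Price = c(23.40, 24.10)),
          file.path(dir, "AAA.csv"), row.names = FALSE)
write.csv(data.frame(Date = "2000-01-31", Price = 10.5),
          file.path(dir, "BBB.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Identifier = file name without directory and extension
ids <- tools::file_path_sans_ext(basename(files))

# Named list: each element keeps the identity of its source file
result <- setNames(lapply(files, read.csv, stringsAsFactors = FALSE), ids)
print(names(result))
```

The same setNames pattern works with read.zoo in place of read.csv, after which the named list can be passed to merge via do.call.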

RODBC sqlQuery as.is returning bad results

I'm trying to import an excel worksheet into R. I want to retrieve a (character) ID column and a couple of date columns from the worksheet. The following code works fine but brings one column in as a date and not another. I think it has something to do with more leading columns being empty in the second date field.
dateFile <- odbcConnectExcel2007(xcelFile)
query <- "SELECT ANIMALID, ST_DATE_TIME, END_DATE_TIME FROM [KNWR_CL$]"
idsAndDates <- sqlQuery(dateFile,query)
So my plan now is to bring in the date columns as character fields and convert them myself using as.POSIXct. However, the following code produces only a single row in idsAndDates.
dateFile <- odbcConnectExcel2007(xcelFile)
query <- "SELECT ANIMALID, ST_DATE_TIME, END_DATE_TIME FROM [KNWR_CL$]"
idsAndDates <- sqlQuery(dateFile,query,as.is=TRUE,TRUE,TRUE)
What am I doing wrong?

I had to move on and ended up using the gdata library (which worked). I'd still be interested in an answer for this though.
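
One possible explanation, not confirmed in this thread, is R's argument matching. In the call above, only the first TRUE is bound to as.is; the two unnamed TRUE values are matched positionally to other parameters. Assuming sqlQuery forwards its extra arguments to sqlGetResults, whose documented arguments include max (the maximum number of rows to fetch), a stray TRUE could end up as max and be treated as 1. The sketch below uses hypothetical stand-in functions that only mimic those argument lists; it is not RODBC code.

```r
# Hypothetical stand-ins mimicking the argument lists only (not RODBC code)
fake_get_results <- function(channel, as.is = FALSE, errors = FALSE, max = 0) {
  list(as.is = as.is, errors = errors, max = max)
}
fake_query <- function(channel, query, errors = TRUE, ...) {
  fake_get_results(channel, errors = errors, ...)
}

# Call shaped like the one in the question: the stray TRUEs land elsewhere
bad <- fake_query("ch", "SELECT ...", as.is = TRUE, TRUE, TRUE)

# One logical vector, one element per column, all bound to as.is
good <- fake_query("ch", "SELECT ...", as.is = c(TRUE, TRUE, TRUE))

str(bad)
str(good)
```

If this is what happens, passing as.is = c(TRUE, TRUE, TRUE) (or simply as.is = TRUE) as a single argument would be worth trying.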
