I have 3 lists of large xts objects: "SMA", "L", and "Marubozu". Here is a quick look at them:
> names(Marubozu)
[1] "TSLA" "AAPL" "NTES" "GOOGL" "ASML" "GOOG" "NFLX" "ADBE" "AMZN" "MSFT" "ADI" "FB"
> names(SMA)
[1] "TSLA" "AAPL" "NTES" "GOOGL" "ASML" "GOOG" "NFLX" "ADBE" "AMZN" "MSFT" "ADI" "FB"
> names(L)
[1] "TSLA" "AAPL" "NTES" "GOOGL" "ASML" "GOOG" "NFLX" "ADBE" "AMZN" "MSFT" "ADI" "FB"
> head(Marubozu$AAPL, n = 2)
WhiteMarubozu BlackMarubozu
2000-01-03 FALSE FALSE
2000-01-04 FALSE FALSE
> head(SMA$AAPL, n = 2)
UpTrend NoTrend DownTrend Trend
2000-01-03 NA NA NA NA
2000-01-04 NA NA NA NA
> head(L$AAPL, n =2)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2000-01-03 0.936384 1.004464 0.907924 0.999442 535796800 0.856887
2000-01-04 0.966518 0.987723 0.903460 0.915179 512377600 0.784643
I want to merge the corresponding xts objects in those lists so that it creates one big list. For example, the output for New_List$AAPL would be:
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted WhiteMarubozu BlackMarubozu UpTrend NoTrend DownTrend Trend
2000-01-03 0.936384 1.004464 0.907924 0.999442 535796800 0.856887 0 0 NA NA NA NA
2000-01-04 0.966518 0.987723 0.903460 0.915179 512377600 0.784643 0 0 NA NA NA NA
I tried to create a list of lists and merge it, but it didn't work. Here you can see:
#That works for a single ticker AAPL
full <- merge.xts(L$AAPL, Marubozu$AAPL, SMA$AAPL)
#This doesn't work
out3 <- Map(function(x) {full$x <- merge.xts(lista[[1]]$x, lista[[2]]$x)}, lista)
I guess it is just some simple two-line thing, but I can't really find the solution. Thanks for any responses!
We could do this with Map. As the lists of xts elements have the same tickers in the same order, just use Map instead of creating a list of lists:
library(xts)
out <- Map(merge.xts, L, Marubozu, SMA)
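If the three lists could ever hold the same tickers in a different order, a safer variant (a sketch, assuming all three lists contain the same set of ticker names) aligns them by name before merging:
nm <- names(L)
# subset Marubozu and SMA by name so every triple lines up with L
out <- Map(merge.xts, L, Marubozu[nm], SMA[nm])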
Here's a small function u() that binds the xts index to an xts object and converts the result to a data.frame.
u <- function(x) cbind.data.frame(index=index(x), unclass(x))
To test it, we create some data using sample_matrix, which comes with xts. We split the first two and the last two columns into two separate xts objects with the same index.
library(xts)
data(sample_matrix)
sample.xts <- as.xts(sample_matrix, descr='my new xts object')
S1 <- as.xts(sample_matrix[,1:2])
S2 <- as.xts(sample_matrix[,3:4])
Now we may easily apply merge and create a new xts object out of it.
res <- merge(u(S1), u(S2)) |>
(\(x) xts(x[-1], x$index, descr='my new xts object'))()
class(res)
# [1] "xts" "zoo"
stopifnot(all.equal(res, sample.xts)) ## proof
I am merging two xts objects with join="left", i.e. keeping all rows in the left object and those that match in the right. I loaded these objects into myEnv.
library(quantmod)
myEnv <- new.env()
getSymbols("AAPL;FB", env=myEnv)
[1] "AAPL" "FB"
MainXTS <- do.call(merge, c(eapply(myEnv, Cl), join = "left"))
head(MainXTS)
AAPL.Close FB.Close
2007-01-03 2.992857 NA
2007-01-04 3.059286 NA
2007-01-05 3.037500 NA
2007-01-08 3.052500 NA
2007-01-09 3.306072 NA
2007-01-10 3.464286 NA
range(index(myEnv$AAPL))
[1] "2007-01-03" "2020-10-27"
range(index(myEnv$FB))
[1] "2012-05-18" "2020-10-27"
So far it is working as expected, since the time index in the merged object above is being picked up from AAPL. The issue is that when I change the order of the tickers so that FB comes first, the merged object still picks up its time index from AAPL.
myEnv <- new.env()
getSymbols("FB;AAPL", env=myEnv)
[1] "FB" "AAPL"
MainXTS <- do.call(merge, c(eapply(myEnv, Cl), join = "left"))
head(MainXTS)
AAPL.Close FB.Close
2007-01-03 2.992857 NA
2007-01-04 3.059286 NA
2007-01-05 3.037500 NA
2007-01-08 3.052500 NA
2007-01-09 3.306072 NA
2007-01-10 3.464286 NA
I was expecting the time index to be picked up from FB. Does anyone know what I am missing?
I think this has something to do with the fact that the order of objects being loaded is the same and in both cases above it is:
ls(myEnv)
[1] "AAPL" "FB"
We can change the order with match:
out <- do.call(merge, c(lapply(mget(ls(myEnv)[match(ls(myEnv),
c("FB", "AAPL"))], myEnv), Cl), join = "left"))
Output:
head(out)
# FB.Close AAPL.Close
#2012-05-18 38.23 18.94214
#2012-05-21 34.03 20.04571
#2012-05-22 31.00 19.89179
#2012-05-23 32.00 20.37714
#2012-05-24 33.03 20.19000
#2012-05-25 31.91 20.08178
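Equivalently, if you already know the order you want, you can mget the symbols in that order directly (a sketch, assuming both tickers are loaded in myEnv):
# pull FB first, then AAPL, so the left join keeps FB's index
out <- do.call(merge, c(lapply(mget(c("FB", "AAPL"), envir = myEnv), Cl), join = "left"))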
I am looking for a way to rename the columns of several objects with a for loop or other method in R. Ultimately, I want to be able to bind the rows of each Stock object into one large data frame, but cannot due to differing column names. Example below:
AAPL <-
Date AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted Stock pct_change
2020-05-14 304.51 309.79 301.53 309.54 39732300 309.54 AAPL 0.61
2020-05-15 300.35 307.90 300.21 307.71 41561200 307.71 AAPL -0.59
GOOG <-
Date GOOG.Open GOOG.High GOOG.Low GOOG.Close GOOG.Volume GOOG.Adjusted Stock pct_change
2020-05-14 1335.02 1357.420 1323.910 1356.13 1603100 1356.13 GOOG 0.50
2020-05-15 1350.00 1374.480 1339.000 1373.19 1705700 1373.19 GOOG 1.26
For this example I have 2 objects (AAPL and GOOG), but realistically I would be working with many more. Can I create a for loop to iterate through each object and rename the 2nd column of each to "Open", the 3rd to "High", the 4th to "Low", etc., so I can then bind all these objects together?
I already have a column named "Stock", so I do not need the Ticker part of the column name.
Using quantmod we can read a set of stock ticker symbols, clean their names & rbind() into a single data frame.
There are three key features illustrated within this answer:
Use of get() to access the objects written by quantmod::getSymbols() once they are loaded into memory.
Use of the symbol names passed into lapply() to add a symbol column to each data frame.
Conversion of the dates stored as row names in the xts objects written by getSymbols() to a data frame column.
First, we'll use getSymbols() to read data from yahoo.com.
library(quantmod)
from.dat <- as.Date("12/02/19",format="%m/%d/%y")
to.dat <- as.Date("12/06/19",format="%m/%d/%y")
theSymbols <- c("AAPL","AXP","BA","CAT","CSCO","CVX","XOM","GS","HD","IBM",
"INTC","JNJ","KO","JPM","MCD","MMM","MRK","MSFT","NKE","PFE","PG",
"TRV","UNH","UTX","VZ","V","WBA","WMT","DIS","DOW")
getSymbols(theSymbols,from=from.dat,to=to.dat,src="yahoo")
# since quantmod::getSymbols() writes an xts object named for each
# symbol, we need to use get() with the symbol names to access them
> head(get(theSymbols[[1]]))
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2019-12-02 267.27 268.25 263.45 264.16 23621800 262.8231
2019-12-03 258.31 259.53 256.29 259.45 28607600 258.1370
2019-12-04 261.07 263.31 260.68 261.74 16795400 260.4153
2019-12-05 263.79 265.89 262.73 265.58 18606100 264.2359
Having illustrated how to access the symbol objects in the global environment, we'll use lapply() to extract the dates from the row names, clean the column headings, and write the symbol name as a column for each symbol's data object.
# convert to list
symbolData <- lapply(theSymbols,function(x){
y <- as.data.frame(get(x))
colnames(y) <- c("open","high","low","close","volume","adjusted")
y$date <- rownames(y)
y$symbol <- x
y
})
Finally, we convert the list of data frames to a single data frame.
#combine to single data frame
combinedData <- do.call(rbind,symbolData)
rownames(combinedData) <- 1:nrow(combinedData)
...and the output:
> nrow(combinedData)
[1] 120
> head(combinedData)
open high low close volume adjusted date symbol
1 267.27 268.25 263.45 264.16 23621800 262.8231 2019-12-02 AAPL
2 258.31 259.53 256.29 259.45 28607600 258.1370 2019-12-03 AAPL
3 261.07 263.31 260.68 261.74 16795400 260.4153 2019-12-04 AAPL
4 263.79 265.89 262.73 265.58 18606100 264.2359 2019-12-05 AAPL
5 120.31 120.36 117.07 117.26 5538200 116.2095 2019-12-02 AXP
6 116.04 116.75 114.65 116.57 3792300 115.5256 2019-12-03 AXP
If you can guarantee the order of these columns, this should do it. Note that looping with for(df in list(AAPL, GOOG)) would only rename copies held in the loop variable, so we loop over the object names and assign the renamed data frame back:
for (nm in c("AAPL", "GOOG")) {
  df <- get(nm)
  colnames(df) <- c("Date", "Open", "High", "Low", "Close", "Volume", "Adjusted", "Stock", "pct_change")
  assign(nm, df)
}
With lapply, we can loop over the list and remove the prefix in the column names with sub. This can be done without any external packages:
lst1 <- lapply(list(AAPL, GOOG), function(x) {
colnames(x) <- sub(".*\\.", "", colnames(x))
x})
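With the prefixes stripped, the renamed objects share identical column names and can be row-bound into the single large data frame you were after. A minimal sketch, assuming the objects are plain data frames as shown (dplyr::bind_rows(lst1) would also work):
# stack all renamed stock tables into one data frame
combined <- do.call(rbind, lst1)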
Question:
How to filter rows based on a nested data frame using dplyr::filter
Problem:
The following code builds an example dataset for a reproducible working example.
Using the example code I can subset using which, but I am having a problem using dplyr due to the nested data frames.
Now, I appreciate I could flatten the data frame using jsonlite; however, I am interested to know if and how I might harness dplyr without flattening the data frame.
All help gratefully received and appreciated.
requiredPackages <- c("devtools","dplyr","tidyr","data.table","ggplot2","ggvis","RMySQL", "jsonlite", "psych", "plyr", "knitr")
ipak <- function(pkg)
{
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
ipak(requiredPackages)
dataDir <- "./data"
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/yelp_dataset_challenge_academic_dataset.zip"
filePath <- file.path(dataDir)
# Does the directory exist? If it doesn't, create it
if (!file.exists(dataDir)) {
dir.create(dataDir)
}
# Now we check if we have downloaded the data already into
# "./data/yelp_dataset_challenge_academic_dataset". If not, then we download the
# zip file... and extract it under the data directory as
# './data/yelp_dataset_challenge_academic_dataset'...
if (!file.exists( file.path(dataDir,"yelp_dataset_challenge_academic_dataset"))) {
temp <- tempfile()
download.file(fileUrl, temp, mode = "wb", method = "curl")
unzip(temp, exdir = dataDir)
unlink(temp)
}
if ( !exists("yelpBusinessData") )
{
if (file.exists( file.path(dataDir,"yelpBusinessData.rds"))) {
yelpBusinessData <- readRDS(file.path(dataDir,"yelpBusinessData.rds"))
} else {
yelpBusinessDataFilePath <- file.path(dataDir,
"yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json")
yelpBusinessData <- fromJSON(sprintf("[%s]",
paste(readLines(yelpBusinessDataFilePath),
collapse = ",")),
flatten = FALSE)
str(yelpBusinessData, max.level = 1)
# Fix the column name duplication issue
# If and when you flatten the data, you create two columns with the same column id
#
# i.e. yelpBusinessData$attributes.Good.for.kids
#
# This fixes the issue by renaming the first column...
#
colnames(yelpBusinessData$attributes)[6] <- "Price_Range"
colnames(yelpBusinessData$attributes)[7] <- "Good_For_Kids"
saveRDS( yelpBusinessData, file.path(dataDir, "yelpBusinessData.rds"))
}
}
The above code loads the example dataframe.
Here is an example of the problem I mentioned above. The first code example works and harnesses which to select four records. The problem is how to do the same with dplyr::filter; what am I missing? Specifically, how do you dereference nested data frames?
# Extract the Phoenix subset using `which`
yelpBusinessData.PA <- yelpBusinessData[which(yelpBusinessData$city == "Phoenix"),]
yelpBusinessData.PA.rest <- yelpBusinessData.PA[which(grepl("Restaurants",
yelpBusinessData.PA$categories)),]
Exp <- yelpBusinessData.PA.rest[which(yelpBusinessData.PA.rest$attributes$Price_Range == 4),]
dim(Exp)
Result - Four records selected :-)
> dim(Exp)
[1] 4 15
Question: How to do this with dplyr?
yelpBusinessData.PA.rest <- yelpBusinessData %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter(attributes$Price_Range == 4)
The above code fails. Now, if I flatten the file, I can get this to work correctly, but...
Note the subtle change from "attributes$Price_Range" to "attributes.Price_Range".
yelpBusinessData2 <- flatten(yelpBusinessData, recursive = TRUE)
dim(yelpBusinessData2)
Exp2 <- yelpBusinessData2 %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter(attributes.Price_Range == 4)
dim(Exp2)
My goal, however, is to understand how to do this without flattening the nested data frames.
I.e., how to use dplyr with nested data frames?
What am I missing here? :-)
One potential answer that I have tried is to index the nested data frame using [[ ]]. This does work, but you lose the elegance of dplyr...
Is there a better way?
Exp2 <- yelpBusinessData %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter( attributes[[6]][] == 4)
The above indexes into "attributes$Price_Range" and returned the correct result when using nested data frames, i.e. Price_Range is the 6th column of the attributes data frame...
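Indexing by name instead of position reads a little better and behaves the same way, since [[ with a name matches [[ with an index (a sketch, assuming the Price_Range rename performed above):
Exp3 <- yelpBusinessData %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter(attributes[["Price_Range"]] == 4)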
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitcitations_1.0.6 pander_0.5.2 plyr_1.8.3 jsonlite_0.9.16 ggvis_0.4.2.9000
[6] tidyr_0.2.0 devtools_1.8.0 qmap_1.0-3 fitdistrplus_1.0-4 knitr_1.11
[11] dplyr_0.4.3.9000 data.table_1.9.4 psych_1.5.6 mapproj_1.2-4 maptools_0.8-36
[16] rworldmap_1.3-1 sp_1.1-1 maps_2.3-11 ggmap_2.5.2 ggplot2_1.0.1
[21] RMySQL_0.10.5 DBI_0.3.1 setwidth_1.0-4 colorout_1.1-1 vimcom_1.2-3
loaded via a namespace (and not attached):
[1] httr_1.0.0 splines_3.2.2 shiny_0.12.2 assertthat_0.1 highr_0.5
[6] yaml_2.1.13 lattice_0.20-33 chron_2.3-47 digest_0.6.8 RefManageR_0.8.63
[11] colorspace_1.2-6 htmltools_0.2.6 httpuv_1.3.3 XML_3.98-1.3 bibtex_0.4.0
[16] xtable_1.7-4 scales_0.3.0 jpeg_0.1-8 git2r_0.11.0 lazyeval_0.1.10.9000
[21] mnormt_1.5-3 proto_0.3-10 survival_2.38-3 RJSONIO_1.3-0 magrittr_1.5
[26] mime_0.3 memoise_0.2.1 evaluate_0.7.2 MASS_7.3-43 xml2_0.1.1
[31] foreign_0.8-66 ggthemes_2.2.1 rsconnect_0.4.1.4 tools_3.2.2 geosphere_1.4-3
[36] RgoogleMaps_1.2.0.7 formatR_1.2 stringr_1.0.0 munsell_0.4.2 rversions_1.0.2
[41] grid_3.2.2 RCurl_1.95-4.7 rstudioapi_0.3.1 rjson_0.2.15 spam_1.0-1
[46] bitops_1.0-6 labeling_0.3 rmarkdown_0.7 gtable_0.1.2 curl_0.9.3
[51] reshape2_1.4.1 R6_2.1.1 lubridate_1.3.3 stringi_0.5-5 parallel_3.2.2
[56] Rcpp_0.12.0 fields_8.2-1 png_0.1-7
There are at least 3 different parts to this question, each of which has very likely been answered well (& thoroughly) elsewhere on SO.
These are:
How to work with a "messy" data.frame in R/dplyr?
The example you give here is messier than a 'nested' data.frame since it contains list-columns as well as data-frame-columns containing data-frame-columns.
How to clean up a "messy" data.frame in R/dplyr?
Is there a better way to work with these data, maintaining their hierarchy?
Working with a "messy" data frame in R/dplyr?
Generally and particularly when starting out I take an approach of iteratively cleaning my data. This means I first identify the columns I most need to work with, the columns that are most problematic, and clean only those at the intersection.
Specifically:
Filter out any column that is problematic but not important
Focus my effort on any column that is problematic AND important
Keep any column that is important and not problematic
Aside: This leaves a fourth group of columns that are BOTH unimportant and not problematic. What you do with these depends on the problem. For example, if I'm preparing a production database, I will exclude them and only include the "cleaned" columns (#2 and #3 above). If I'm doing an exploratory analysis I'll include them since I may change my mind about their importance down the line.
In this example you give, the problematic columns are those that contain data.frames. (These are problematic because they break compatibility with dplyr -- not because they are messy).
You can filter them out using dplyr::select_if:
yelpBusinessData %>%
dplyr::select_if(purrr::negate(is.data.frame)) %>%
dplyr::filter(city == 'Phoenix')
After that, the other dplyr operators in your example will work, provided they don't reference data in columns that are data.frames (for example, the attributes). This brings me to part II...
How to clean a "messy" data.frame in R/dplyr?
One way to handle the "messy" data-frame columns in this data would be to flatten each one & join it back to the original data frame.
Taking the attributes column as an example, we can use jsonlite::flatten on this sub-data-frame & then join it back to our original:
yelpBusinessData %>%
dplyr::select_if(purrr::negate(is.data.frame)) %>%
dplyr::bind_cols(jsonlite::flatten(yelpBusinessData$attributes, recursive = T)) %>%
dplyr::filter(city == 'Phoenix') %>%
dplyr::filter(grepl("Restaurants", categories)) %>%
dplyr::filter(Price_Range == 4)
The hours component, however, you might want to handle differently. In this example, the hours data.frame contains a data.frame for each day of the week, each with two fields ("open" and "close"). Here I use purrr::map to apply a function to each of those columns, simplifying each day's data.frame into a character vector.
hours <-
yelpBusinessData$hours %>%
purrr::map(. %>%
dplyr::transmute(hours = stringr::str_c(open, close, sep = ' - ')) %>%
unlist()) %>%
tibble::as_tibble()
This produces a data.frame with the start - stop time for each day of the week:
> str(hours)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 61184 obs. of 7 variables:
$ Tuesday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Friday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Monday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Wednesday: chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Thursday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Sunday : chr NA NA NA "11:00 - 18:00" ...
$ Saturday : chr NA NA NA "10:00 - 21:00" ...
Similarly, one could use map2_dfc (which automagically calls bind_cols after mapping) to collapse the data frames of this object:
hours <- yelpBusinessData$hours %>%
purrr::map2_dfc(.x = .,
.y = names(.),
.f = ~ .x %>%
dplyr::rename_all(funs(stringr::str_c(.y, ., sep = '_'))))
This produces a single data.frame with the day-specific start & stop times:
> str(hours)
'data.frame': 61184 obs. of 14 variables:
$ Tuesday_close : chr "17:00" NA NA "21:00" ...
$ Tuesday_open : chr "08:00" NA NA "10:00" ...
$ Friday_close : chr "17:00" NA NA "21:00" ...
$ Friday_open : chr "08:00" NA NA "10:00" ...
$ Monday_close : chr "17:00" NA NA "21:00" ...
$ Monday_open : chr "08:00" NA NA "10:00" ...
$ Wednesday_close: chr "17:00" NA NA "21:00" ...
$ Wednesday_open : chr "08:00" NA NA "10:00" ...
$ Thursday_close : chr "17:00" NA NA "21:00" ...
$ Thursday_open : chr "08:00" NA NA "10:00" ...
$ Sunday_close : chr NA NA NA "18:00" ...
$ Sunday_open : chr NA NA NA "11:00" ...
$ Saturday_close : chr NA NA NA "21:00" ...
$ Saturday_open : chr NA NA NA "10:00" ...
Rather than put important information in the field names, however, you might prefer to "denormalize" some data, to produce a more tidy structure:
> purrr::flatten_dfr(yelpBusinessData$hours, .id = 'day')
# A tibble: 61,184 x 3
day close open
<chr> <chr> <chr>
1 1 NA NA
2 1 NA NA
3 1 NA NA
4 1 21:00 10:00
5 1 16:00 10:00
6 1 NA NA
7 1 NA NA
8 1 NA NA
9 1 NA NA
10 1 02:00 08:00
# ... with 61,174 more rows
Is there a better way to filter these data, maintaining their original hierarchy?
At the end of the day, there is a fundamental problem in your original data structure. A data.frame in R is implemented as a list of columns, and yet your data are stored as a data.frame of data.frames. This leads to confusion when indexing into various parts of the structure.
This is a little unorthodox, but one option is to keep your data as a list of lists rather than convert to a data.frame right away. Using the tools in purrr package, you can work with lists pretty easily to filter/flatten your data and then construct a data.frame from the filtered results.
For example:
## read in yelpBusinessData without converting to a data.frame
yelpBusinessData2 <- fromJSON(sprintf("[%s]",
                                      paste(readLines(yelpBusinessDataFilePath),
                                            collapse = ",")),
                              flatten = FALSE,
                              simplify = FALSE)
# filter to Phoenix restaurants _before_ converting to a data.frame
yelpBusinessData2 %>%
  purrr::keep(~ .$'city' == 'Phoenix'
              && grepl("Restaurants", .$categories)) %>%
  jsonlite:::simplify(., flatten = T) %>%
  dplyr::select(business_id, full_address, contains('kids')) %>%
  str()
'data.frame': 8410 obs. of 5 variables:
$ business_id : chr "vcNAWiLM4dR7D2nwwJ7nCA" "x5Mv61CnZLohZWxfCVCPTQ" "2ZnCITVa0abGce4gZ6RhIw" "EmzaQR5hQlF0WIl24NxAZA" ...
$ full_address : chr "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018" "2819 N Central Ave\nPhoenix, AZ 85004" "1850 N Central Ave\nPhoenix, AZ 85004" "132 E Washington St\nPhoenix, AZ 85004" ...
$ attributes.Good for Kids : logi NA FALSE TRUE FALSE NA NA ...
$ attributes.Good For Kids : logi NA NA NA NA NA NA ...
$ attributes.Hair Types Specialized In.kids: logi NA NA NA NA NA NA ...
As a final thought: if you're still left with variable-naming issues, take a look at the janitor package in R, specifically the clean_names() function. This package has some nice features for working with messy data, particularly data read in from Excel.
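For example, a quick sketch of what clean_names() does (the data frame here is hypothetical):
library(janitor)
df <- data.frame("Good For Kids" = TRUE, check.names = FALSE)
names(clean_names(df))
# [1] "good_for_kids"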
I have a character vector (length 1704) like this:
[1] "1871_01" "1871_02" "1871_03" "1871_04" "1871_05" "1871_06" "1871_07" "1871_08" "1871_09" "1871_10" "1871_11" "1871_12"
[13] "1872_01" "1872_02" "1872_03" ...
...
[1681] "2011_01" "2011_02" "2011_03" "2011_04" "2011_05" "2011_06" "2011_07" "2011_08" "2011_09" "2011_10" "2011_11" "2011_12"
[1693] "2012_01" "2012_02" "2012_03" "2012_04" "2012_05" "2012_06" "2012_07" "2012_08" "2012_09" "2012_10" "2012_11" "2012_12"
I want to convert this vector into a vector of dates.
For that I use:
as.Date(vector, format="%Y_%m")
But it returns NA.
I tried for one value:
b <- "1871_01"
as.Date(b, format="%Y_%m")
[1] NA
strptime(b, "%Y_%m")
[1] NA
I don't understand why it doesn't work...
Does anyone have a clue?
as.Date() returns NA here because a complete date needs a day of the month, which the format "%Y_%m" does not supply. If you do regular work in year+month format, the zoo package can come in handy, since it treats yearmon as a first-class citizen (and it is compatible with Date objects/functions):
library(zoo)
my.ym <- as.yearmon("1871_01", format="%Y_%m")
print(my.ym)
## [1] "Jan 1871"
str(my.ym)
## Class 'yearmon' num 1871
my.date <- as.Date(my.ym)
print(my.date)
## [1] "1871-01-01"
str(my.date)
## Date[1:1], format: "1871-01-01"
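For the whole vector, the same conversion applies elementwise. A sketch, where vec stands in for your character vector of "YYYY_MM" strings:
library(zoo)
dates <- as.Date(as.yearmon(vec, format = "%Y_%m"))  # first day of each month
An equivalent base-R route is to supply the missing day yourself:
dates <- as.Date(paste0(vec, "_01"), format = "%Y_%m_%d")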
I have a csv file and extract data using
banknifty <- as.xts(read.zoo("banknifty.csv",sep=",",tz="" ,header=T))
read.zoo() extracts the data frame with numeric values, but when I apply as.xts(), the data frame's numeric values get converted to characters.
# banknifty[1,] gives
2008-01-01 09:34:00 "10" "12" "13"
I want as.xts() to return numeric values.
How can I avoid this problem?
You're confused about the nature of xts/zoo objects: they are matrices with an ordered index attribute, so you cannot mix types in xts/zoo objects like you can in a data.frame.
The reason your object is being converted to character is because some of the values in your file are not numeric. This is also why you got the "NAs introduced by coercion" warning when you tried hd1's solution.
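To see the matrix behaviour concretely, here is a minimal sketch: a single character column forces the whole coredata matrix to character.
library(xts)
x <- xts(data.frame(a = 1:2, b = c("u", "v")), order.by = Sys.Date() + 0:1)
storage.mode(coredata(x))
# [1] "character"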
So the answer to your question is, "fix your CSV file", but we can't help you fix it unless you show us the file's contents.
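A quick way to locate the offending values is to read the file as plain text and count, per column, the entries that fail numeric coercion (a sketch, assuming a plain CSV whose first column holds the timestamps):
raw <- read.csv("banknifty.csv", stringsAsFactors = FALSE)
# entries that are neither NA nor parseable as numbers
sapply(raw[-1], function(col) sum(is.na(suppressWarnings(as.numeric(col))) & !is.na(col)))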
I just ran into a similar problem. In my case, the issue was that the as.xts() function tries to convert the date column along with the numeric columns. Because R does not consider dates to be numeric values, it automatically converts the entire data frame to character. I'm assuming that happens in your example as well (you can check this using your .csv file).
Something like this should help:
data.in <- read.csv("banknifty.csv",sep=",",header=T)
data.in[,1] <- format(as.Date(data.in[,1]), format="%Y-%m-%d", tz="GMT", usetz=TRUE) #change tz to whatever applies
data.in[,1] <- as.POSIXct(data.in[,1], "GMT")
data.ts <- xts(data.in[,c(2,3,4,5)], order.by = data.in[,1])
(Note that data.ts <- xts(data.in, order.by = data.in[,1]) would replicate the unwanted conversion. Also, apologies that this is probably not the cleanest / most concise code, I'm still learning.)
Use as.numeric, and your code will be:
> data.in <- as.xts(read.zoo("banknifty.csv", sep=",", tz="", header=TRUE))
> sapply(c(1:4), function(n) { data.in[,n] <- as.numeric(data.in[,n]) }, simplify=TRUE )
[,1] [,2] [,3] [,4]
[1,] 6032.25 6040.50 6032.17 6036.29
[2,] 6036.29 6036.29 6020.00 6025.05
[3,] 6025.05 6026.00 6020.10 6023.12
[4,] 6023.12 6034.45 6022.73 6034.45
[5,] 6034.45 6034.45 6030.00 6030.00
[6,] 6030.00 6038.00 6028.25 6038.00
> data.in
V2 V3 V4 V5
2007-01-02 10:00:00 6032.25 6040.50 6032.17 6036.29
2007-01-02 10:05:00 6036.29 6036.29 6020.00 6025.05
2007-01-02 10:10:00 6025.05 6026.00 6020.10 6023.12
2007-01-02 10:15:00 6023.12 6034.45 6022.73 6034.45
2007-01-02 10:20:00 6034.45 6034.45 6030.00 6030.00
2007-01-02 10:25:00 6030.00 6038.00 6028.25 6038.00
> sapply(c(1:4), function(n) { data.in[,n] <- as.numeric(data.in[,n]) }, simplify=TRUE )
This command does not make any change to data.in. It returns the data in the same format, with quotes:
> data.in
V2 V3 V4 V5
2007-01-02 10:00:00 "6032.25" "6040.50" "6032.17" "6036.29"
2007-01-02 10:05:00 "6036.29" "6036.29" "6020.00" "6025.05"
2007-01-02 10:10:00 "6025.05" "6026.00" "6020.10" "6023.12"