Using cbind on XTS object changes the dash (-) character in previous column names to a dot (.) - r

I have some R code that creates an XTS object and then performs various cbind operations over the lifetime of that object. Some of my columns have names such as "adx-1". That is fine until another cbind() operation is performed. At that point, any "-" character in a column name is changed to a ".", so "adx-1" becomes "adx.1".
To reproduce:
x = xts(order.by=as.Date(c("2014-01-01","2014-01-02")))
x = cbind(x,c(1,2))
x
..2
2014-01-01 1
2014-01-02 2
colnames(x) = c("adx-1")
x
adx-1
2014-01-01 1
2014-01-02 2
x = cbind(x,c(1,2))
x
adx.1 ..2
2014-01-01 1 1
2014-01-02 2 2
It doesn't just do this with numbers either. It changes "test-text" to "test.text" as well. Multiple dashes are changed too. "test-text-two" is changed to "test.text.two".
Can someone please explain why this happens and, if possible, how to stop it from happening?
I can of course change my naming schemes, but it would be preferred if I didn't have to.
Thanks!

merge.xts converts the column names into syntactic names, which cannot contain -. According to ?Quotes:
Identifiers consist of a sequence of letters, digits, the period
('.') and the underscore. They must not start with a digit nor
underscore, nor with a period followed by a digit.
There is currently no way to alter this behavior.
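To see the renaming rule in isolation, base R's make.names() applies the same syntactic-name conversion. This is a minimal sketch; that merge.xts relies on make.names() internally is an assumption here, but the results match the names reported in the question.

```r
# Column names from the question, all containing dashes
original <- c("adx-1", "test-text", "test-text-two")

# make.names() replaces every character that is not allowed in a
# syntactic name with a dot
converted <- make.names(original)
print(converted)
# "adx.1" "test.text" "test.text.two"
```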

The reason for the behavior is precisely the one Joshua Ulrich highlighted. It's common across many data types in R: you need "valid" names. Here is a great discussion of this "issue".
For data frames, you can pass the option check.names = FALSE as a workaround, but this is not implemented for xts objects. That said, there are plenty of other workarounds available to you.
For instance, you could simply rename the columns of interest after every cbind. Using your code, simply add:
colnames(x)[1] <- c("adx-1")
to force back your desired column name.
Alternatively, you could consider this gsub solution if you wanted something potentially more systematic.
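The rename-after-cbind workaround can be wrapped in a small helper. This is only a sketch: restore_names() is a hypothetical name, and the demonstration uses a base R data frame (where data.frame() applies the same name conversion by default) rather than an xts object. It assumes the combining step keeps the original columns first and in their original order, as cbind(x, new) does in the question.

```r
# Put the saved names back on the leading columns of a combined object
restore_names <- function(combined, original_names) {
  idx <- seq_along(original_names)
  colnames(combined)[idx] <- original_names
  combined
}

# check.names = FALSE keeps the dash when the data frame is first built
before <- data.frame("adx-1" = c(1, 2), check.names = FALSE)

# Combining with the default check.names = TRUE mangles the name
after <- data.frame(before, extra = c(1, 2))
print(colnames(after))   # "adx.1" "extra"

# Restoring the saved names brings the dash back
fixed <- restore_names(after, colnames(before))
print(colnames(fixed))   # "adx-1" "extra"
```

With xts, the same helper would be called after each cbind(), passing the column names saved beforehand. The question shows that assigning "adx-1" through colnames() works on an xts object, which is what this relies on.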


Converting 7 or 8 digit numbers to dates in R

I am importing a very large fixed-width dataset into R and wish to use vroom for much better speed. However, the dates in this dataset are in numeric format with either 7 or 8 digits, depending on whether the day of the month has 1 or 2 digits (examples below).
#8 digit date (1985-03-21):
# 21031985
#7 digit date (1985-03-01):
# 1031985
I cannot see any way to specify this type of format using col_date(format = ) as one normally would. It is easy to make a function that converts these 7/8 digit numbers into dates, but doing that means materialising the imported data and removes the speed advantage that vroom provides.
I am looking for a way to have vroom interpret these numbers on its own, or a workaround that does not sacrifice vroom's speed.
Thanks very much for any help here.
Those formats are horrible in general, but regardless, I expect nothing in readr is going to work right for you because of the 1- or 2-digit day of month. I suggest reading that column in as col_character, then post-processing it with
vec <- c("21031985", "1031985")
as.Date(paste0(strrep("0", pmax(8 - nchar(vec), 0)), vec), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"
Quick walk-through:
8 - nchar(vec) tells us how many 0s need to be padded to the left of each string. In this case, it should be 0 and 1, respectively. This might be a problem if you have length-6 strings; only you know if that's an issue.
strrep("0", ..) repeats the 0 string as many times as we need, including strrep("0", 0) producing "" (no zeroes).
pmax(.., 0) is the defensive programmer: if there's a length-9 string in there, we cannot do strrep("0", -1), so we keep the count from going negative.
paste0(.., vec) to do the actual padding.
From there, all strings should be normalized and able to be converted using "%d%m%Y".
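The walk-through above can be collected into one function. This is a sketch under stated assumptions: parse_dmy() is a hypothetical name, and turning strings that are not 7 or 8 characters long into NA is a choice made here, not something the original answer prescribes.

```r
# Convert 7- or 8-digit day-month-year strings to Date
parse_dmy <- function(x) {
  x <- as.character(x)
  n <- nchar(x)
  # Left-pad with zeros up to 8 characters; pmax() keeps the count >= 0
  padded <- paste0(strrep("0", pmax(8 - n, 0)), x)
  # Assumption: any other length is treated as invalid input
  padded[n < 7 | n > 8] <- NA_character_
  as.Date(padded, format = "%d%m%Y")
}

res <- parse_dmy(c("21031985", "1031985", "31985"))
print(res)
# "1985-03-21" "1985-03-01" NA
```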
Vroom can use a pipe as input. That means you can use a tool like awk to fix the format (e.g. make it always 8 digits, which is easy with sprintf). That way you can still benefit from vroom streaming the file. You could even use R for this, but if you are after performance, you need something lightweight that can process the file as a stream.
I used a test file test.csv:
id,date,text
1,1022020,some
2,12042020,more
3,2012020,text
I could read it with the call below. The awk call needs to be adjusted for your data: $2 refers to the 2nd column, and -F ',' specifies the field separator.
vroom(pipe("awk -F ',' 'BEGIN{OFS=\",\"}; NR==1{print}; NR!=1 {$2=sprintf(\"%08d\",$2);print;}' test.csv"),
col_types=cols(date=col_date(format='%d%m%Y'))
)
giving
# A tibble: 3 × 3
id date text
<int> <date> <chr>
1 1 2020-02-01 some
2 2 2020-04-12 more
3 3 2020-01-02 text
If you have integer data, you can left-pad the lost 0s back on.
vec <- c(21031985L, 1031985L)
as.Date(sprintf("%08d", vec), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"

number some patterns in the string using R

I have a string that contains patterns like this:
my_string = "`d#k`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`n$l`0.4`0.1`0.25`0.28`0.18`0.3`0.17`0.2`0.03`!lk`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`vnabgjd`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`pogk(`1.01`0.71`0.86`0.89`0.79`0.91`0.78`0.81`0.64`r!#^##niw`0.0014`0.0020`9.9999`9.9999`0.0020`0.0022`0.0032`9.9999`0.0000`"
As you can see, the pattern is a [`nonnumber] token followed by repeated [`number.num~] tokens.
I want to count how many [`number.num~] tokens appear between consecutive [`nonnumber] tokens.
I tried to use a regex:
index <- gregexpr("`(\\w{2,20})`\\d\\.\\d(.*?)`\\D",cle)
regmatches(cle,index)
but with this code the [`\D] matches overlap, so it cannot count how many times the pattern occurs.
If you know a method for this, please leave a reply.
Using strsplit: we split at the backtick, coerce the pieces to "numeric", and take the differences between the positions of the values that yield NA. Note that we need to exclude the first element after strsplit and append an NA to the end of the numerics. The result is a vector named with the non-numerical elements using setNames (not very good names actually, but it demonstrates what's going on).
s <- el(strsplit(my_string, "\\`"))[-1]
s.num <- suppressWarnings(as.numeric(s))
setNames(diff(which(is.na(c(s.num, NA)))) - 1,
s[is.na(s.num)])
# d#k n$l !lk vnabgjd pogk( r!#^##niw
# 9 9 9 9 9 9
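The same count can also be obtained by grouping tokens rather than differencing positions. This is an alternative sketch, not the answer's method: count_numeric_runs() is a hypothetical name, the short example string is made up for illustration, and the function assumes the string starts with a non-numeric label.

```r
# Count the numeric tokens that follow each non-numeric label
count_numeric_runs <- function(s) {
  # Split on the literal backtick; drop the empty piece before the first one
  tokens <- strsplit(s, "`", fixed = TRUE)[[1]][-1]
  is_num <- !is.na(suppressWarnings(as.numeric(tokens)))
  # Each non-numeric token starts a new group
  group <- cumsum(!is_num)
  # Summing the logical flags per group counts the numeric tokens in it
  counts <- vapply(split(is_num, group), sum, integer(1))
  setNames(counts, tokens[!is_num])
}

example <- "`a#b`0.1`0.2`c$d`0.3`"
result <- count_numeric_runs(example)
print(result)
# a#b: 2, c$d: 1
```

A label followed by no numbers gets a count of 0 with this approach, since its group contains only the label itself.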

How to shorten multiple timeseries with different dates?

I am using time series data obtained from different providers. As a result, the lengths of the vectors do not match.
e.g.:
nrow(xts_ret) #2176
nrow(xts_trade) #2177
nrow(xts_trans) #2192
nrow(xts_vola_ret) #2177
I have one additional timeseries which contains solely factors:
> head(xts_sentiment)
[,1]
2019-04-29 "neutral"
2019-04-29 "negative"
2019-04-29 "neutral"
2019-04-29 "neutral"
2019-04-29 "neutral"
2019-04-29 "neutral"
Note: all of the above vectors are formatted as "xts" objects.
The main problem with this setup is that the dates of xts_ret, xts_trade, xts_trans, xts_vola_ret and xts_sentiment differ by variable.
I am using R version 3.5.1 (2018-07-02).
I found the "merge" command for xts which does exactly what I want
data_pool <- merge(xts_ret, xts_trade, xts_trans, xts_vola_ret)
If one date (or value) is missing, it replaces its entry in the respective vector with "NA" but lists this entry in the line with the respective date.
> head(data_pool)
xts_ret xts_trade xts_trans xts_vola_ret
2013-04-28 NA NA 40986 NA
2013-04-29 0.04805079 0 50009 0.00000000
2013-04-30 -0.04805079 0 48795 -0.04516775
2013-05-01 -0.14532060 0 50437 -0.13931143
2013-05-02 -0.12327888 0 57278 -0.12424083
2013-05-03 -0.12792566 0 55859 -0.12770457
The "complete.cases" function allows me to drop all rows that contain an "NA" entry, so that all vectors have the same length.
Problem:
If I add the xts_sentiment vector to my pool variable, it contains solely "NA" values and "complete.cases" removes every row of the dataset.
If I take a look at the xts_sentiment variable itself (see above), it contains the correct values.
I also tried to set "as.character(xts_sentiment)" or "as.string(xts_sentiment)" in the "merge" command, but it did not help.
Does anyone have an idea how to get the values of xts_sentiment into the "pool" variable?
BTW: I also tried data.table, which displays xts_sentiment with all of its value but I have not the benefit of the "unique" dates.
Thank you very much for your help!
The solution to my problem was:
The variable xts_sentiment consists of characters.
xts objects are matrices underneath, which means every column needs the same type of content (e.g. all columns contain solely characters or all columns contain solely numbers).
So, it is not possible to create an xts object out of a character vector and a numeric vector.
My solution was to encode the sentiment levels as numbers and use the "merge.xts" command. That worked.
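The two points above can be sketched in base R. Everything specific here is an assumption for illustration: the "positive" level, the -1/0/1 coding, the sample dates, and the per-day averaging (added because the sentiment series in the question has several entries per date).

```r
# A matrix holds a single type, so mixing numbers and text turns
# everything into character
mixed <- cbind(ret = c(0.05, -0.05), sentiment = c("neutral", "negative"))
print(is.character(mixed))   # TRUE

# Hypothetical encoding of sentiment labels as numbers
sentiment_levels <- c("negative", "neutral", "positive")
sentiment <- c("neutral", "negative", "neutral", "positive")
dates <- as.Date(c("2019-04-29", "2019-04-29", "2019-04-30", "2019-04-30"))

# match() gives positions 1..3; subtracting 2 maps them to -1, 0, 1
codes <- match(sentiment, sentiment_levels) - 2L

# One value per date: the mean code of that day's entries
daily <- tapply(codes, dates, mean)
print(daily)
```

A numeric result such as daily could then be wrapped with xts(as.numeric(daily), order.by = as.Date(names(daily))) and passed to merge together with the other series.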

How can I assign a value using if-else conditions in R

I have this dataframe with a column a. I would like to add a different column 'b' based on column 'a'.
For: if a>=10, b='double'. Otherwise b='single'.
How can I do it?
Sample output:
a b
2 single
2 single
4 single
11 double
12 double
12 double
45 double
4 single
You can use ifelse to act on vectors with if statements.
ifelse(a>=10, "double", "single")
So your code could look like this
mydata <- data.frame(a, b = ifelse(a >= 10, "double", "single"))
(The comments below specify that a = 10 should give "double".)
Strictly speaking, if-else is assignable in R, that is
x1 <- if (TRUE) 1 else 2
is legit. For details see https://adv-r.hadley.nz/control-flow.html#choices
However, as this vectorizes over neither the test condition nor the value branches, it's not applicable to the particular case described in the question details, which is about adding a column in a conditional manner. In such a situation ifelse or the more typesafe if_else (from dplyr) can be used.
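The difference between the two forms can be sketched on a small data frame. The sample values are assumptions that loosely follow the question's output, with a 10 added to show the boundary case.

```r
mydata <- data.frame(a = c(2, 2, 4, 10, 11, 12, 45, 4))

# ifelse() tests every element of the column and returns a vector of
# the same length, so it can fill a new column directly
mydata$b <- ifelse(mydata$a >= 10, "double", "single")
print(mydata)

# if / else returns a value too, but only for a single condition
scalar_label <- if (mydata$a[1] >= 10) "double" else "single"
print(scalar_label)   # "single"
```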

Why does R need the name of the dataframe?

If you have a dataframe like this
mydf <- data.frame(firstcol = c(1,2,1), secondcol = c(3,4,5))
Why would
mydf[mydf$firstcol,]
work but
mydf[firstcol,]
wouldn't?
You can do this:
mydf[,"firstcol"]
Remember that the column goes second, not first.
In your example, to see what mydf[mydf$firstcol,] gives you, let's break it down:
> mydf$firstcol
[1] 1 2 1
So really mydf[mydf$firstcol,] is the same as
> mydf[c(1,2,1),]
firstcol secondcol
1 1 3
2 2 4
1.1 1 3
So you are asking for rows 1, 2, and 1. That is, you are asking for your row one to be the same as row 1 of mydf, your row 2 to be the same as row 2 of mydf and your row 3 to be the same as row 1 of mydf; and you are asking for both columns.
Another question is why the following doesn't work:
> mydf[,firstcol]
Error in `[.data.frame`(mydf, , firstcol) : object 'firstcol' not found
That is, why do you have to put quotes around the column name when you ask for it like that, but not when you do mydf$firstcol? The answer is just that the operators you are using require different types of arguments. You can look at ?'$' to see the form x$name, where the second argument can be a name, which is not quoted. You can then look up ?'[', which will actually lead you to the same help page. There you will find the following, which explains it. Note that a "character" vector needs to have quoted entries (that is how you enter a character vector in R and many other languages).
i, j, ...: indices specifying elements to extract or replace. Indices
are ‘numeric’ or ‘character’ vectors or empty (missing) or
‘NULL’. Numeric values are coerced to integer as by
‘as.integer’ (and hence truncated towards zero). Character
vectors will be matched to the ‘names’ of the object (or for
matrices/arrays, the ‘dimnames’): see ‘Character indices’
below for further details.
Nothing to add to the very clear explanation of Xu Wang. You might want to note in addition that the package data.table allows you to use notation such as mydf[firstcol==1,] or mydf[,firstcol], which many find more natural.
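In base R, the indexing forms discussed in this thread can be compared side by side. This is a small sketch using the question's data frame; the variable names on the left are made up for illustration, and with() is shown as one base R way to refer to a column without the mydf$ prefix.

```r
mydf <- data.frame(firstcol = c(1, 2, 1), secondcol = c(3, 4, 5))

# Numeric index: the values 1, 2, 1 are used as row positions
by_position <- mydf[mydf$firstcol, ]

# Logical index: keeps the rows where firstcol equals 1
by_condition <- mydf[mydf$firstcol == 1, ]

# Character index: the column name must be quoted
col_by_name <- mydf[, "firstcol"]

# with() evaluates the expression inside the data frame, so the bare
# name firstcol is found there
with_scope <- with(mydf, mydf[firstcol == 1, ])

print(by_position)
print(by_condition)
```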
