How to shorten multiple timeseries with different dates? - r

I am using time series data obtained from different providers. As a result, the lengths of the vectors do not match.
e.g.:
nrow(xts_ret) #2176
nrow(xts_trade) #2177
nrow(xts_trans) #2192
nrow(xts_vola_ret) #2177
I have one additional timeseries which contains solely factors:
> head(xts_sentiment)
[,1]
2019-04-29 "neutral"
2019-04-29 "negative"
2019-04-29 "neutral"
2019-04-29 "neutral"
2019-04-29 "neutral"
2019-04-29 "neutral"
Note: all of the above vectors are formatted as "xts" objects.
The main problem with this setup is that the dates of xts_ret, xts_trade, xts_trans, xts_vola_ret and xts_sentiment differ from variable to variable.
I am using R version 3.5.1 (2018-07-02).
I found the "merge" command for xts which does exactly what I want
data_pool <- merge(xts_ret, xts_trade, xts_trans, xts_vola_ret)
If a date (or value) is missing from one series, merge fills that series' entry with "NA" but still keeps the row for that date.
> head(data_pool)
xts_ret xts_trade xts_trans xts_vola_ret
2013-04-28 NA NA 40986 NA
2013-04-29 0.04805079 0 50009 0.00000000
2013-04-30 -0.04805079 0 48795 -0.04516775
2013-05-01 -0.14532060 0 50437 -0.13931143
2013-05-02 -0.12327888 0 57278 -0.12424083
2013-05-03 -0.12792566 0 55859 -0.12770457
The "complete.cases" function lets me drop all rows that contain an "NA" entry, so that all vectors have the same length.
Problem:
If I add the xts_sentiment vector to my pool variable, its column contains only "NA" values and "complete.cases" removes every row of the dataset.
If I look at the xts_sentiment variable itself (see above), it contains the correct values.
I also tried passing "as.character(xts_sentiment)" or "as.string(xts_sentiment)" to the "merge" command, but it did not help.
Does anyone have an idea how to get the values of xts_sentiment into the "pool" variable?
BTW: I also tried data.table, which displays xts_sentiment with all of its values, but then I lose the benefit of the "unique" dates.
Thank you very much for your help!

The solution to my problem was:
The variable xts_sentiment consists of characters.
An xts object stores its data as a matrix, which means every column must have the same type (e.g. all columns contain only characters or all columns contain only numbers).
So it is not possible to create an xts object from a character vector and a numeric vector.
My solution was to encode the sentiment levels as numbers and use the "merge.xts" command. That worked.
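The encoding step can be sketched in base R. This is only an illustration: the sample values, the "positive" label and the numeric codes (-1, 0, 1) are assumptions, not taken from the original data. The resulting numeric vector could then be wrapped with xts() and passed to merge() like the other series.

```r
# Hypothetical sample data; labels and return values are made up for illustration.
ret       <- c(0.048, -0.048, -0.145, -0.123)
sentiment <- c("neutral", "negative", "neutral", "positive")

# A matrix holds a single type: mixing numbers and strings coerces
# everything to character, which is why the mixed merge cannot work.
mixed <- cbind(ret = ret, sentiment = sentiment)
typeof(mixed)  # "character"

# Lookup table from label to numeric code (an arbitrary choice).
codes <- c(negative = -1, neutral = 0, positive = 1)
sentiment_num <- unname(codes[sentiment])

# With numeric codes the combined matrix stays numeric.
pool <- cbind(ret = ret, sentiment = sentiment_num)
typeof(pool)   # "double"
```

Keeping the lookup table around makes it possible to map the codes back to labels later, for example with names(codes)[match(sentiment_num, codes)].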

Related

number some patterns in the string using R

I have a string that contains patterns like this:
my_string = "`d#k`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`n$l`0.4`0.1`0.25`0.28`0.18`0.3`0.17`0.2`0.03`!lk`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`vnabgjd`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`pogk(`1.01`0.71`0.86`0.89`0.79`0.91`0.78`0.81`0.64`r!#^##niw`0.0014`0.0020`9.9999`9.9999`0.0020`0.0022`0.0032`9.9999`0.0000`"
As you can see, the pattern is a [`nonnumber] token followed by repeated [`number.num~] tokens.
So I want to count how many [`number.num~] tokens lie between consecutive [`nonnumber] tokens.
I tried to use a regex:
index <- gregexpr("`(\\w{2,20})`\\d\\.\\d(.*?)`\\D",cle)
regmatches(cle,index)
But with this code the trailing [`\D] is consumed by each match, so consecutive matches would have to overlap and the patterns cannot be counted this way.
If you know a method for this, please leave a reply.
Using strsplit: we split at the backtick, coerce the pieces to "numeric", and take the position differences between the pieces that yield NA (the non-numeric tokens). Note that we need to drop the first element after strsplit (an empty string) and append an NA to the numerics so that the last group is counted too. The result is a vector named after the non-numeric elements using setNames (not very good names actually, but it demonstrates what is going on).
s <- el(strsplit(my_string, "\\`"))[-1]
s.num <- suppressWarnings(as.numeric(s))
setNames(diff(which(is.na(c(s.num, NA)))) - 1,
s[is.na(s.num)])
# d#k n$l !lk vnabgjd pogk( r!#^##niw
# 9 9 9 9 9 9
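To make the intermediate steps easier to follow, here is the same idea on a shortened, made-up string (the variable names are illustrative). It uses fixed = TRUE so the backtick is matched literally rather than as a regex.

```r
# Shortened example string: two labels followed by 2 and 3 numbers.
short <- "`ab`0.5`0.6`cd`1.0`2.0`3.0`"

# Split on the literal backtick and drop the leading empty string.
parts <- strsplit(short, "`", fixed = TRUE)[[1]][-1]

# Non-numeric labels become NA when coerced.
nums <- suppressWarnings(as.numeric(parts))

# Positions of the labels, plus a sentinel NA marking the end of the string.
label_pos <- which(is.na(c(nums, NA)))

# The gap between consecutive labels, minus one, is the count of numbers.
counts <- setNames(diff(label_pos) - 1, parts[is.na(nums)])
counts
```

For this input the counts are 2 for "ab" and 3 for "cd".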

Does 0 play any important role in the as.numeric function when using factors in R

Hi guys :) I know this question has been asked before (here, for example), but I would like to ask whether 0 plays any important role when using the as.numeric function. For example, we have the following simple code:
x2<-factor(c(2,2,0,2), label=c('Male','Female'))
as.numeric(x2) # knowing that this is not the appropriate command; as.numeric(levels(x2))[x2] would be more appropriate, but it returns NAs
this returns
[1] 2 2 1 2
Is 0 being replaced here by 1 ? Moreover,
unclass(x2)
seems to give the same thing as well:
[1] 2 2 1 2
attr(,"levels")
[1] "Male" "Female"
It might be simple, but I am trying to figure this out and it seems that I can't. Any help would be highly appreciated as I am new to R.
0 has no special meaning for factor.
As commenters have pointed out, factor recodes the input vector to an integer vector (starting with 1) and slaps a name tag onto each integer (the levels).
In the simplest case, factor(c(2,2,0,2)), the function takes the unique values of the input vector, sorts them, and converts them to a character vector, which becomes the levels. I.e. the levels are c('0','2') and the factor is internally represented as c(2,2,1,2), where 1 corresponds to '0' and 2 to '2'.
You then go further by giving the levels some labels; these are normally identical to the levels. In your case, factor(c(2,2,0,2), labels=c('Male','Female')), the levels are still derived from the sorted, unique values (0, then 2) and the internal codes are still c(2,2,1,2), but the first level is now labelled Male and the second Female.
We can decide which levels should be used, as in factor(c(2,2,0,2), levels=c(2,0), labels=c('Male','Female')). Now we have been explicit about which input value gets which level and label.
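The three cases described above can be checked directly. This is a small sketch; the variable names are illustrative.

```r
x <- c(2, 2, 0, 2)

# Default: levels are the sorted unique values, codes start at 1.
f_default <- factor(x)
levels(f_default)       # "0" "2"
as.integer(f_default)   # 2 2 1 2

# Labels only rename the levels; the codes are unchanged.
f_labelled <- factor(x, labels = c("Male", "Female"))
as.integer(f_labelled)    # 2 2 1 2
as.character(f_labelled)  # "Female" "Female" "Male" "Female"

# Explicit levels control which input value maps to which code and label.
f_explicit <- factor(x, levels = c(2, 0), labels = c("Male", "Female"))
as.integer(f_explicit)    # 1 1 2 1
as.character(f_explicit)  # "Male" "Male" "Female" "Male"

# Recovering the original numbers works only while the levels are numeric strings.
original <- as.numeric(levels(f_default))[f_default]
original                  # 2 2 0 2
```

Once the levels are relabelled as 'Male' and 'Female', as.numeric(levels(x2))[x2] has nothing numeric to parse, which is why it returns NAs in the question.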

Using cbind on XTS object changes the dash (-) character in previous column names to a dot (.)

I have some R code that creates an XTS object, and then performs various cbind operations over the lifetime of that object. Some of my columns have names such as "adx-1". That is fine until another cbind() operation is performed. At that point, the "-" character in any column name is changed to a ".". So "adx-1" becomes "adx.1".
To reproduce:
x = xts(order.by=as.Date(c("2014-01-01","2014-01-02")))
x = cbind(x,c(1,2))
x
..2
2014-01-01 1
2014-01-02 2
colnames(x) = c("adx-1")
x
adx-1
2014-01-01 1
2014-01-02 2
x = cbind(x,c(1,2))
x
adx.1 ..2
2014-01-01 1 1
2014-01-02 2 2
It doesn't just do this with numbers either. It changes "test-text" to "test.text" as well. Multiple dashes are changed too. "test-text-two" is changed to "test.text.two".
Can someone please explain why this happens and, if possible, how to stop it from happening?
I can of course change my naming schemes, but it would be preferred if I didn't have to.
Thanks!
merge.xts converts the column names into syntactic names, which cannot contain -. According to ?Quotes:
Identifiers consist of a sequence of letters, digits, the period
('.') and the underscore. They must not start with a digit nor
underscore, nor with a period followed by a digit.
There is currently no way to alter this behavior.
The reason for the behavior is precisely the one Joshua Ulrich highlighted. It's common across many data types in R: you need "valid" names. Here is a great discussion of this "issue".
For data frames, you can pass the option check.names = FALSE as a workaround, but this is not implemented for xts objects. That said, there are plenty of other workarounds available to you.
For instance, you could simply rename the columns of interest after every cbind. Using your code, simply add:
colnames(x)[1] <- c("adx-1")
to force back your desired column name.
Alternatively, you could consider this gsub solution if you wanted something potentially more systematic.
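The name conversion and the data frame workaround can be illustrated in base R. This sketch assumes the conversion follows the same rules as base R's make.names(); it does not call xts itself, and the variable names are illustrative.

```r
# The column names from the question.
wanted <- c("adx-1", "test-text", "test-text-two")

# make.names() turns arbitrary strings into syntactic names:
# characters that are not allowed, such as "-", become ".".
mangled <- make.names(wanted)
mangled  # "adx.1" "test.text" "test.text.two"

# Data frames apply the same check by default ...
df_checked <- data.frame("adx-1" = 1:2)
names(df_checked)  # "adx.1"

# ... but allow it to be switched off.
df_raw <- data.frame("adx-1" = 1:2, check.names = FALSE)
names(df_raw)      # "adx-1"
```

For an object whose names were already converted, keeping the wanted names in a vector and reassigning them with colnames(x)[seq_along(wanted)] <- wanted after each cbind is one way to make the rename workaround systematic.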

R converts strings into numbers using rownames()

I have a numerical matrix "test" like this:
[1,] 474.00 478.81 468.25 474.98 474.98
[2,] 463.25 470.00 454.12 468.22 468.22
[3,] 456.47 466.50 452.58 457.35 454.70
...
and want to assign rownames, which are strings of dates (stored in a variable called names).
> names
[1] "2013-02-08" "2013-02-07" "2013-02-06" ...
When I call the rownames function on my matrix, the strings are converted to numbers, which I don't understand. Does someone know a solution that would preserve the strings in names as row names?
rownames(test) <- names
15744 474.00 478.81 468.25 474.98 474.98
15743 463.25 470.00 454.12 468.22 468.22
15742 456.47 466.50 452.58 457.35 454.70
...
Try rownames(test) <- as.character(names)
I don't have enough rep to put this as a reply to your comment, but I think those numbers are the internal representation of your dates. R stores a Date as the number of days since 1970-01-01, with negative values for earlier dates, so names is probably a Date vector rather than a character vector.
See: http://www.statmethods.net/input/dates.html
EDIT: Just as a test, I took your first input (February 8th, 2013) and calculated the difference between it and January 1st, 1970, and I do get 15,744 days which matches your rowname.
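The day count and the as.character() fix can be reproduced with a small sketch. It assumes that names in the question was a Date vector (a Date vector also prints with quotes), and the matrix here holds only the first two columns of the original data.

```r
# Dates from the question, stored as Date objects (an assumption).
dates <- as.Date(c("2013-02-08", "2013-02-07", "2013-02-06"))

# Internally a Date is the number of days since 1970-01-01.
day_counts <- as.numeric(dates)
day_counts  # 15744 15743 15742

# First two columns of the matrix from the question.
test <- matrix(c(474.00, 463.25, 456.47,
                 478.81, 470.00, 466.50), nrow = 3)

# Converting to character first keeps the readable date strings.
rownames(test) <- as.character(dates)
rownames(test)  # "2013-02-08" "2013-02-07" "2013-02-06"
```

format(dates, "%Y-%m-%d") gives the same strings and allows a different layout if needed.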

Different behaviour of intersect on vectors and factors

I try to compare multiple vectors of Entrez IDs (integer vectors) by using Reduce(intersect,...). The vectors are selected from a database using "DISTINCT" so a single vector does not contain duplicates.
length(factor(c(l1$entrez)))
gives the same length (and the same IDs w/o the length function) as
length(c(l1$entrez))
When I compare multiple vectors with
length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
or
length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
the result is not the same. I know that factor!=originalVector but I cannot understand why the result differs although the length and the levels of the initial factors/vectors are the same.
Could somebody please explain the different behaviour of the intersect function on vectors and factors? Is it that the intersect of two factor lists are again factorlists and then duplicates are treated differently?
Edit - Example:
> head(l1)
entrez
1 1
2 503538
3 29974
4 87769
5 2
6 144568
> head(l2)
entrez
1 1743
2 1188
3 8915
4 7412
5 51082
6 5538
The lists contain around 500 to 20K Entrez IDs. So the vectors contain pure integer and should give the intersect among all tested vectors.
> length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
[1] 514
> length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
[1] 338
> length(Reduce(intersect,list(l1$entrez,l2$entrez,l3$entrez,l4$entrez)))
[1] 494
I have to apologize profusely. The different behaviour of the intersect function may be caused by a problem with the data. I have found fields in the dataset containing comma-separated Entrez IDs (22038, 23207, ...). I should have had a more detailed look at the data first. Thank you for the answers and your time. Although I do not understand the different results yet, I am sure that this is the cause of the different behaviour. Can somebody confirm that?
As Roman says, an example would be very helpful.
Nevertheless, one possibility is that your variables l1$entrez, l2$entrez etc have the same levels but in different orders.
intersect converts its arguments via as.vector, which turns factors into character variables. This is usually the right thing to do, as it means that varying level order doesn't make any difference to the result.
Passing factor(l1$entrez) as an argument to intersect also removes the impact of varying level order, as it effectively creates a new factor with level ordering set to the default. However, if you pass c(l1$entrez), you strip the factor attributes off your variable and what you're left with is the raw integer codes which will depend on level ordering.
Example:
a <- factor(letters[1:3], levels=letters)
b <- factor(letters[1:3], levels=rev(letters))
# returns 1 2 3
intersect(c(factor(a)), c(factor(b)))
# returns integer(0)
intersect(c(a), c(b))
I don't see any reason why you should use c() in here. Just let R handle factors by itself (although to be fair, there are other scenarios where you do want to step in).
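The dependence on level order can be made explicit with as.integer(), which always returns the raw codes. It is used here instead of c() so that the sketch does not depend on how c() treats factors in a given R version; the variable names are illustrative.

```r
# Same values, different level order.
a <- factor(letters[1:3], levels = letters)
b <- factor(letters[1:3], levels = rev(letters))

# The raw integer codes depend on the level order.
codes_a <- as.integer(a)   # 1 2 3
codes_b <- as.integer(b)   # 26 25 24

# Intersecting the codes compares positions in the level sets, not values.
codes_common <- intersect(codes_a, codes_b)
length(codes_common)       # 0

# Intersecting the values as character strings is independent of level order.
values_common <- intersect(as.character(a), as.character(b))
values_common              # "a" "b" "c"
```

For IDs stored as a factor, as.integer(as.character(f)) recovers the original numbers instead of the codes, which is the safer conversion before calling intersect.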
