R converts strings into numbers using rownames()

I have a numerical matrix "test" like this:
[1,] 474.00 478.81 468.25 474.98 474.98
[2,] 463.25 470.00 454.12 468.22 468.22
[3,] 456.47 466.50 452.58 457.35 454.70
...
and want to assign row names, which are strings of dates (stored in a variable names).
> names
[1] "2013-02-08" "2013-02-07" "2013-02-06" ...
When I invoke the rownames function on my matrix, the strings are converted to numbers, which I don't understand. Does someone know a solution that would preserve the strings in names as row names?
rownames(test) <- names
15744 474.00 478.81 468.25 474.98 474.98
15743 463.25 470.00 454.12 468.22 468.22
15742 456.47 466.50 452.58 457.35 454.70
...

Try rownames(test) <- as.character(names)

I don't have enough rep to put this as a reply to your comment, but I think those numbers are based upon a difference in dates. By default, when R detects a date input, it is represented as the number of days since 1970-01-01, with negative values for earlier dates.
See: http://www.statmethods.net/input/dates.html
EDIT: Just as a test, I took your first input (February 8th, 2013) and calculated the difference between it and January 1st, 1970, and I do get 15,744 days which matches your rowname.
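A quick sanity check of that arithmetic (assuming names holds Date objects, which is what the output suggests):
> as.numeric(as.Date("2013-02-08"))
[1] 15744
So rownames() is seeing the underlying day counts; coercing with as.character(names) first, as suggested above, keeps the formatted strings.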

Related

Why doesn't R show digits after the decimal point in a numeric data type?

I have a data frame read from a file into R. The data contains some numeric columns, some with digits after the decimal point and some without. The columns without decimal digits have an issue like the following:
If I do some manipulation on such a column that results in values with digits after the decimal point, for example 22.5, R does not show the value in this format; it only shows 22. But if I check it in an if condition, it confirms the value is actually 22.5. This does not happen when the original data contains some decimal points.
Could anyone let me know how to resolve this issue?
This is a FAQ. Presentation may differ from content, as display is generally optimised for meaningful output. A really simple example follows:
> df <- data.frame(a=c(10000.12, 10000.13), b=c(42L, 43L))
> df
a b
1 10000.1 42
2 10000.1 43
> all.equal(df[1,"a"], 10000.12)
[1] TRUE
>
So the last digit did not "disappear", as the test confirms; it is simply beyond the digits displayed by default (six here).
Similarly, you can always explicitly display with more decimals than the (compact, default) displays do:
> cat(sprintf("%14.8f", df[1,"a"]), "\n")
10000.12000000
>
Edit: You can also increase the default display size by one or more digits:
options(digits=7) is the minimal change but not all columns use seven digits:
> options(digits=7)
> df
a b
1 10000.12 42
2 10000.13 43
>
Needless to say, if you had digits .123, only the first two would be shown, etc.
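If the goal is to force a fixed number of decimals in the output (e.g. the 22.5 from the question), format() with its nsmall argument is another option; a minimal sketch:
> format(22.5, nsmall = 2)
[1] "22.50"
> sprintf("%.1f", 22.5)
[1] "22.5"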

Trying to extract a date from a 5 or 6-digit number

I am trying to extract a date from a number. The date is stored as the first 6 digits of an 11-digit personal ID number (date-month-year). Unfortunately, the cloud-based database (REDCap) output gets formatted as a number, so the leading zero is dropped and those born in the first nine days of the month end up with a 10-digit ID number instead of an 11-digit one. I managed to extract the 6- or 5-digit number corresponding to the date, i.e. 311230 for 31st December 1930, or 11230 for 1st December 1930. I end up with two problems that I have not been able to solve.
Let's say we use the following numbers:
dato <- c(311230, 311245, 311267, 311268, 310169, 201104, 51230, 51269, 51204)
I convert these into string, and then apply the as.Date() function:
datostr <- as.character(dato)
datofinal <- as.Date(datostr, "%d%m%y")
datofinal
The problems I have are:
Five-digit numbers (e.g. 11230) get reported as NA.
Six-digit numbers are recognized, but those born before 1.1.1969 get reported with 100 years added, i.e. 010160 gets converted to 2060-01-01.
I am sure this must be easy for those who are more knowledgeable about R, but I struggle a bit with solving this. Any help is greatly appreciated.
Greetings
Bjorn
If your 5-digit numbers really just need to be zero-padded, then
dato_s <- sprintf("%06d", dato)
dato_s
# [1] "311230" "311245" "311267" "311268" "310169" "201104" "051230" "051269" "051204"
From there, regarding your question about "dates before 1969", take a look at ?strptime for the '%y' pattern:
'%y' Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2018 POSIX standard, but it does also say
'it is expected that in a future version the default century
inferred from a 2-digit year will change'.
So if you have specific alternate years for those, you need to add the century before sending to as.Date (which uses strptime-patterns).
dato_d <- as.Date(gsub("([0-4][0-9])$", "20\\1",
                       gsub("([5-9][0-9])$", "19\\1", dato_s)),
                  format = "%d%m%Y")
dato_d
# [1] "2030-12-31" "2045-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "2030-12-05" "1969-12-05" "2004-12-05"
In this case, I'm assuming 50-99 will be 1900, everything else 2000. If you need 40s or 30s, feel free to adjust the pattern: add digits to the second pattern (e.g., [3-9]) and remove from the first pattern (e.g., [0-2]), ensuring that all decades are included in exactly one pattern, not "neither" and not "both".
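As a hedged illustration (the cutoff used here is an assumption about your cohort, not something taken from your data), extending the 19xx range to include the 1940s would look like:
dato_d <- as.Date(gsub("([0-3][0-9])$", "20\\1",
                       gsub("([4-9][0-9])$", "19\\1", dato_s)),
                  format = "%d%m%Y")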
Borrowing from Allan's answer, I like that assumption of now() (since you did mention "born on"). Without lubridate, try this:
dato_s <- sprintf("%06d", dato)
dato_d <- as.Date(dato_s, format = "%d%m%y")
dato_d[dato_d > Sys.Date()] <-
  as.Date(sub("([0-9]{2})$", "19\\1", dato_s[dato_d > Sys.Date()]), format = "%d%m%Y")
dato_d
# [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "1930-12-05" "1969-12-05" "2004-12-05"
You can make this a bit easier using lubridate, noting that no-one can have a date of birth that is in the future:
library(lubridate)
dato <- dmy(sprintf("%06d", dato))
dato[dato > now()] <- dato[dato > now()] - years(100)
dato
#> [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31"
#> [6] "2004-11-20" "1930-12-05" "1969-12-05" "2004-12-05"
Of course, without further information, this method will not (nor will any other method) be able to pick out the edge cases of people who are aged over 100. This might be easy to determine from the context.
Converting five digit "numbers" to six digits is straightforward: x <- stringr::str_pad(x, 6, pad="0") or similar will do the trick.
Your problem with years is the Millennium bug revisited. You'll have to consult with whoever compiled your data to see what assumptions they used.
I suspect all dates on or before 31Dec1970 are affected, not just those before 01Jan1960. That's because as.Date uses a default origin of 01Jan1970 when deciding how to handle two-digit years. So your solution is to pick an appropriate origin in your conversion to fix this dataset. Something like d <- as.Date(x, origin="1900-01-01"). And then start using four-digit years in the future! ;)

Number some patterns in the string using R

I have a string and it has some patterns like this:
my_string = "`d#k`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`n$l`0.4`0.1`0.25`0.28`0.18`0.3`0.17`0.2`0.03`!lk`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`vnabgjd`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`pogk(`1.01`0.71`0.86`0.89`0.79`0.91`0.78`0.81`0.64`r!#^##niw`0.0014`0.0020`9.9999`9.9999`0.0020`0.0022`0.0032`9.9999`0.0000`"
As you can see, there is a pattern of [`nonnumber] followed by repeated [`number.num~] entries.
So I want to count how many [`number.num~] occur between each [`nonnumber].
I tried to use regex
index <- gregexpr("`(\\w{2,20})`\\d\\.\\d(.*?)`\\D", my_string)
regmatches(my_string, index)
but with this code the [`\D] parts overlap between matches, so it can't count how many times the pattern occurs.
If you know any method for this, please leave a reply.
Using strsplit. We split at the backtick, coerce to numeric, and take the differences of the positions of the values that yield NA. Note that we need to exclude the first element after strsplit and append an NA at the end of the numeric vector. The result is a vector named with the non-numeric elements using setNames (not very good names, actually, but it demonstrates what's going on).
s <- strsplit(my_string, "`", fixed = TRUE)[[1]][-1]
s.num <- suppressWarnings(as.numeric(s))
setNames(diff(which(is.na(c(s.num, NA)))) - 1,
         s[is.na(s.num)])
# d#k n$l !lk vnabgjd pogk( r!#^##niw
# 9 9 9 9 9 9
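The same counts can also be read off with rle() on the NA pattern; a small alternative sketch (it assumes non-numeric tokens never occur back-to-back, as in your string):
r <- rle(is.na(s.num))
setNames(r$lengths[!r$values], s[is.na(s.num)])
# d#k n$l !lk vnabgjd pogk( r!#^##niw
# 9 9 9 9 9 9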

Numeric data type with "0" digits after dot

Today I had a look at the pop dataset of the wpp2019 package and noticed that the population numbers are shown as numeric values with a "." before the last three digits (e.g. 10500 is shown as 10.500).
library(wpp2019)
data("pop")
pop$`2020`
To remove the dots, I would usually simply turn the column into a character column and then use for example stringr::str_replace(), but as soon as I apply any function (except printing) to the population number columns, the dots disappear.
How can it be that this dataset shows e.g. 10.500 when printing the data.frame even though R usually removes the 0 digits after the dot for numeric values? And what would be the best way to remove the dots in the above example without losing the 0 digits?
Expected output
# instead of
pop$`2020`[153]
#[1] 164.1
# this value should return 164100 because printing the data frame
# shows 164.100
Population estimates in wpp2019 are given in thousands. So multiply by 1000 to get back to the estimated number of individuals:
> pop$`2020`[153]*1000
[1] 164100
R prints the decimal part sometimes but not other times, depending on the digits option used by print and on what else is in the vector being printed. For example:
> print(1234567.890)
[1] 1234568 # max 7 digits printed by default
> print(c(1234567.890,0.011))
[1] 1234567.890 0.011 # but when printed alongside 0.011, all the digits are shown
This explains why your data frame always shows all the digits but you don't see all the digits when you extract individual numbers.
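And if you just want the digits back for a single value, print() takes a digits argument directly; for example:
> print(1234567.890, digits = 10)
[1] 1234567.89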

How to shorten multiple timeseries with different dates?

I am using time-series data obtained from different providers. As a result, the lengths of the vectors do not match.
e.g.:
nrow(xts_ret) #2176
nrow(xts_trade) #2177
nrow(xts_trans) #2192
nrow(xts_vola_ret) #2177
I have one additional timeseries which contains solely factors:
> head(xts_sentiment)
[,1]
2019-04-29 "neutral"
2019-04-29 "negative"
2019-04-29 "neutral"
2019-04-29 "neutral"
2019-04-29 "neutral"
2019-04-29 "neutral"
Note: all of the above vectors are formatted as "xts" objects.
The main problem in this setting is that the dates of xts_ret, xts_trade, xts_trans, xts_vola_ret and xts_sentiment differ by variable.
I am using R version 3.5.1 (2018-07-02).
I found the "merge" command for xts, which does exactly what I want:
data_pool <- merge(xts_ret, xts_trade, xts_trans, xts_vola_ret)
If one date (or value) is missing, merge fills the respective entry with NA but keeps the row for the respective date.
> head(data_pool)
xts_ret xts_trade xts_trans xts_vola_ret
2013-04-28 NA NA 40986 NA
2013-04-29 0.04805079 0 50009 0.00000000
2013-04-30 -0.04805079 0 48795 -0.04516775
2013-05-01 -0.14532060 0 50437 -0.13931143
2013-05-02 -0.12327888 0 57278 -0.12424083
2013-05-03 -0.12792566 0 55859 -0.12770457
The "complete.cases" function allows me to kick out all lines that have an NA entry, so that all vectors have the same length.
Problem:
If I add the xts_sentiment vector to my pool variable, it contains solely NA values, and complete.cases removes every line of the dataset.
If I take a look at the xts_sentiment variable itself (see above), it contains the correct values.
I also tried as.character(xts_sentiment) and as.string(xts_sentiment) in the "merge" command, but it did not help.
Does anyone have an idea how to get the values of xts_sentiment into the "pool" variable?
BTW: I also tried data.table, which displays xts_sentiment with all of its values, but then I lose the benefit of the unique dates.
Thank you very much for your help!
The solution to my problem was:
The variable xts_sentiment consists of characters.
xts objects work like matrices, which means every column needs the same type of content (e.g. all columns contain solely characters, or all columns contain solely numbers).
So it is not possible to create an xts object out of a character vector and a vector of numbers.
My solution was to decode the sentiment levels into numbers and use the "merge.xts" command. That worked.
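A minimal sketch of that decoding step (the level names here are an assumption; substitute whatever levels your sentiment series actually contains):
library(xts)
lev <- c("negative", "neutral", "positive")   # assumed level set
sent_num <- xts(as.numeric(factor(coredata(xts_sentiment), levels = lev)),
                order.by = index(xts_sentiment))
data_pool <- merge(xts_ret, xts_trade, xts_trans, xts_vola_ret, sent_num)
data_pool <- data_pool[complete.cases(data_pool), ]   # drop rows with any NA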
