How to tell if a factor has no value in R

Here is a sample of what the data looks like. I need to replace all those empty values with NA so that as.Date(dat[,i]) produces no errors:
> dat[,i]
[1]
[28]
[55]
[82] 6/26/2007 7/5/2007 7/5/2007 12/6/2007 2/5/2008
[109] 3/27/2008 6/29/2008 9/16/2008 11/3/2008 9/11/2008 11/24/2008 12/29/2008 11/20/2008 1/26/2009 1/8/2009 3/5/2009
[136] 4/7/2009 6/9/2009 8/23/2009 8/16/2009 9/2/2009 10/6/2009 10/14/2009 10/24/2009 10/22/2009 11/5/2009 12/9/2009 2/4/2010 3/18/2010
[163] 7/8/2010 7/7/2010 7/29/2010 10/6/2010 10/7/2010 11/18/2010 1/12/2011 1/6/2011 4/5/2011 4/21/2011 5/25/2011 6/20/2011
[190] 12/12/2011 2/29/2012 2/22/2012 3/7/2012 3/28/2012 5/16/2012 5/23/2012 6/14/2012 8/14/2012 8/16/2012 9/5/2012 9/30/2012 11/5/2012 12/25/2012 12/27/2012 3/14/2013
[217] 7/24/2013 7/31/2013 9/2/2013 10/16/2013 10/30/2013 12/13/2013 2/24/2014 3/9/2014 6/29/2014 6/23/2014
[244] 9/1/2014 9/22/2014 9/22/2014 11/23/2014 2/24/2015 3/17/2015 4/8/2015 6/23/2015 6/23/2015 7/4/2015
[271] ...
[3538] 6/29/2012 11/16/2012 11/23/2012 9/1/2012
916 Levels: 10/10/2008 10/10/2009 10/10/2012 10/11/2010 10/11/2013 10/1/2010 10/12/2009 10/14/2009 10/14/2010 10/14/2011 10/14/2014 10/15/2009 10/15/2014 10/16/2013 10/17/2011 10/19/2009 10/19/2010 10/19/2011 10/20/2012 10/21/2008 10/21/2010 10/21/2013 10/2/2010 10/2/2012 10/2/2013 ... 9/9/2014
But every cell in this column has the same data type, 'factor', and the comparison == "" returns FALSE for both dat[,i][1] (the blank) and dat[,i][3511] (an actual date), so how am I supposed to tell them apart so that I can use apply appropriately to put NA where it needs to go?
> dat[,i][1]
[1]
916 Levels: 10/10/2008 10/10/2009 10/10/2012 10/11/2010 10/11/2013 10/1/2010 10/12/2009 10/14/2009 10/14/2010 10/14/2011 10/14/2014 10/15/2009 10/15/2014 10/16/2013 10/17/2011 10/19/2009 10/19/2010 10/19/2011 10/20/2012 10/21/2008 10/21/2010 10/21/2013 10/2/2010 10/2/2012 10/2/2013 ... 9/9/2014
> class(dat[,i][1])
[1] "factor"
> dat[,i][3511]
[1] 2/20/2012
916 Levels: 10/10/2008 10/10/2009 10/10/2012 10/11/2010 10/11/2013 10/1/2010 10/12/2009 10/14/2009 10/14/2010 10/14/2011 10/14/2014 10/15/2009 10/15/2014 10/16/2013 10/17/2011 10/19/2009 10/19/2010 10/19/2011 10/20/2012 10/21/2008 10/21/2010 10/21/2013 10/2/2010 10/2/2012 10/2/2013 ... 9/9/2014
> class(dat[,i][3511])
[1] "factor"
Also, trying to "go down a level" does nothing; it is still just a factor:
> dat[,i][[1]]
[1]
916 Levels: 10/10/2008 10/10/2009 10/10/2012 10/11/2010 10/11/2013 10/1/2010 10/12/2009 10/14/2009 10/14/2010 10/14/2011 10/14/2014 10/15/2009 10/15/2014 10/16/2013 10/17/2011 10/19/2009 10/19/2010 10/19/2011 10/20/2012 10/21/2008 10/21/2010 10/21/2013 10/2/2010 10/2/2012 10/2/2013 ... 9/9/2014
> dat[,i][1][1]
[1]
916 Levels: 10/10/2008 10/10/2009 10/10/2012 10/11/2010 10/11/2013 10/1/2010 10/12/2009 10/14/2009 10/14/2010 10/14/2011 10/14/2014 10/15/2009 10/15/2014 10/16/2013 10/17/2011 10/19/2009 10/19/2010 10/19/2011 10/20/2012 10/21/2008 10/21/2010 10/21/2013 10/2/2010 10/2/2012 10/2/2013 ... 9/9/2014
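A quick way to see what is actually stored in the "empty" cells is to drop the factor to character first. A minimal sketch, assuming (as the accepted answer below also suspects) that the blank cells hold a literal space rather than the empty string:

```r
# Reproduce the situation: a factor whose "empty" level is a single space
f <- factor(c(" ", "2/20/2012"))

as.character(f[1])                # " "  -- whitespace, not ""
f[1] == ""                        # FALSE, which is why the comparison fails
trimws(as.character(f[1])) == ""  # TRUE once the whitespace is trimmed
```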

It would have been better to show the dput of the example. Based on the OP's post, I assume the blank levels are whitespace (' ') rather than an empty string (''). So we can trim the whitespace down to '' and then use ==:
library(stringr)
# TRUE when a column contains a blank (whitespace-only) value
sapply(dat, function(x) any(str_trim(x) == ''))
#[1] TRUE FALSE
Or use grepl to match whitespace-only strings:
sapply(lapply(dat, grepl, pattern= '^\\s+$'), all)
#[1] TRUE FALSE
data
dat <- list(factor(' ', levels=c(' ', 1:5)), factor(1:5, levels=1:5))
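To actually fix the OP's data frame, the same idea extends naturally: turn the whitespace-only entries into NA and then parse with as.Date. A minimal base-R sketch (the column name and the %m/%d/%Y format are assumptions based on the printed sample):

```r
# Assumed sample data: a date column read in as a factor, with a blank cell
dat <- data.frame(d = factor(c(" ", "6/26/2007", "7/5/2007")))

v <- as.character(dat$d)    # leave the factor world first
v[trimws(v) == ""] <- NA    # whitespace-only cells become NA
dat$d <- as.Date(v, format = "%m/%d/%Y")
dat$d
# [1] NA           "2007-06-26" "2007-07-05"
```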

Related

How can I forecast with the same accuracy in GRETL and R

My time series comes from the datasets package. It is called "USAccDeaths".
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1973 9007 8106 8928 9137 10017 10826 11317 10744 9713 9938 9161 8927
1974 7750 6981 8038 8422 8714 9512 10120 9823 8743 9129 8710 8680
1975 8162 7306 8124 7870 9387 9556 10093 9620 8285 8466 8160 8034
1976 7717 7461 7767 7925 8623 8945 10078 9179 8037 8488 7874 8647
1977 7792 6957 7726 8106 8890 9299 10625 9302 8314 8850 8265 8796
1978 7836 6892 7791 8192 9115 9434 10484 9827 9110 9070 8633 9240
When I make an out-of-sample forecast in GRETL I get one set of results (GRETL's output was shown as an image, not reproduced here). However, when I make the same forecast in R, my results differ substantially.
This is my R code:
library(forecast)
fit <- arima(USAccDeaths[1:60], order = c(1,1,0), method = "ML")
preds <- as.vector(forecast(fit, h = 12))$mean
RMSE <- sqrt(mean((preds - as.vector(USAccDeaths[61:72])) ^ 2))
RMSE
I get an RMSE of 946.024. These are my predictions in R:
[1] 8803.104 8803.199 8803.201 8803.201 8803.201 8803.201 8803.201 8803.201 8803.201
[10] 8803.201 8803.201 8803.201
What is the problem? How can I get the same results in both programs?
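No answer is recorded here, but one plausible explanation (an assumption, since the GRETL output is not shown) is the handling of the constant term: stats::arima() drops the intercept whenever d >= 1, while GRETL includes a constant in ARIMA models by default, so the two programs may not be fitting the same model. A sketch of making the R side comparable by adding a drift term via forecast::Arima():

```r
library(forecast)

# ARIMA(1,1,0) with a drift (constant) term, closer to a default
# GRETL specification; compare this RMSE to GRETL's
fit_drift <- Arima(USAccDeaths[1:60], order = c(1, 1, 0),
                   include.drift = TRUE, method = "ML")
preds <- forecast(fit_drift, h = 12)$mean
sqrt(mean((preds - USAccDeaths[61:72])^2))
```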

How can I convert a time variable from SAS data in R? [duplicate]

This may be an odd question: I have imported a SAS (sas7bdat) dataset into R.
One of my variables is visit_date.
Right now it looks like this. I am wondering how I can transform the values back to MM-DD-YYYY, since I need to exclude data earlier than MDY(08-01-2010).
> chris$visit_date
[1] 17077 17091 17105 17119 17133 17069 17083 17097 17111 17125 17080 17094 17108
[14] 17122 17136 17098 17112 17210 17224 17238 17252 17266 17247 17261 17254 17268
[27] 17282 17296 17324 17237 17251 17265 17279 17293 17329 17343 17357 17385 17413
[40] 17259 17273 17287 17301 17315 17328 17342 17356 17370 17384 17335 17349 17377
[53] 17391 17405 17331 17345 17359 17373 17387 17435 17449 17463 17477 17505 17336
[66] 17364 17378 17392 17406 17352 17366 17380 17394 17408 17427 17441 17469 17483
[79] 17497 17440 17454 17468 17482 17496 17434 17448 17462 17476 17490 17419 17433
[92] 17447 17461 17475 17518 17560 17574 17588 17616 17653 17667 17681 17695 17709
[105] 17644 17658 17686 17700 17728 17755 17769 17783 17811 17825 17825 17610 17624
[118] 17638 17652 17666 18072 18114 18127 18155 18169 17651 17665 17680 17693 17707
[131] 17657 17671 17685 17699 17659 17673 17687 17701 17715 17646 17660 17674 17688
[144] 17702 17721 17735 17749 17763 17770 17734 17748 17762 17790 17861 17736 17750
[157] 17764 17778 17792 17751 17765 17779 17793 17807 17742 17756 17770 17784 17798
[170] 17772 17757 17771 17785 17799 17813 17777 17791 17819 17833 17854 17923 17937
[183] 17965 17979 17993 17825 17839 17853 17867 17909 17832 17846 17860 17874 17888
[196] 17919 17933 17961 17975 17989 17960 17974 17988 18002 18016 18183 18211 18225
[209] 18239 18253 17931 17945 17959 17973 17987 17940 17954 17968 17982 17996 17966
[222] 17980 17994 18022 18036 18021 18035 18049 18063 18091 18050 18064 18078 18092
[235] 18106 18045 18059 18073 18087 18115 18024 18038 18052 18066 18080 18056 18070
[248] 18084 18098 18112 18107 18121 18135 18149 18163 18105 18119 18133 18161 18175
[261] 18143 18171 18185 18199 18213 18203 18246 18274 18288 18302 18316 18248 18276
[274] 18290 18304 18318 18310 18324 18338 18352 18366 18315 18343 18357 18364 18378
[287] 18350 18364 18378 18406 18420 18337 18351 18365 18379 18393 18374 18388 18402
[300] 18430 18472 18344 18358 18386 18400 18414 18353 18381 18395 18409 18423 18387
[313] 18415 18429 18443 18450 18408 18422 18436 18443 18464 18430 18437 18457 18464
[326] 18471 18427 18434 18441 18455 18462 18428 18442 18456 18463 18470
Thanks
Those "dates" are clearly using a different origin than the 1970-01-01 epoch that as.Date() assumes by default (R generally displays dates in YYYY-MM-DD format):
as.Date(ddd, origin="1970-01-01")
> head( as.Date(ddd, origin="1970-01-01") )
[1] "2016-10-03" "2016-10-17" "2016-10-31" "2016-11-14" "2016-11-28" "2016-09-25"
So you need to establish the correct origin. If it is 1960-01-01 (the SAS date epoch), then none of those dates falls on or after 08-01-2010:
> sum( as.Date(ddd, origin="1960-01-01") >= as.Date("2010-08-01") )
[1] 0
> sum( as.Date(ddd, origin="1960-01-01") < as.Date("2010-08-01") )
[1] 336
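A compact sketch tying the two steps together, using the 1960-01-01 SAS epoch and a few of the raw values from the question:

```r
# SAS stores dates as days since 1960-01-01 (the SAS epoch)
ddd <- c(17077, 17091, 17105)          # a few raw values from the question
visit_date <- as.Date(ddd, origin = "1960-01-01")
visit_date
# [1] "2006-10-03" "2006-10-17" "2006-10-31"

# The cutoff then works with ordinary Date comparison
sum(visit_date >= as.Date("2010-08-01"))
# [1] 0
```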

How do you make a sequence using along.with for unique values in R

Let's suppose I have a vector of numeric values:
[1] 2844 4936 4936 4972 5078 6684 6689 7264 7264 7880 8133 9018 9968 9968 10247
[16] 11267 11508 11541 11607 11717 12349 12349 12364 12651 13025 13086 13257 13427 13427 13442
[31] 13442 13442 13442 14142 14341 14429 14429 14429 14538 14872 15002 15064 15163 15163 15324
[46] 15324 15361 15361 15400 15624 15648 15648 15648 15864 15864 15881 16332 16847 17075 17136
[61] 17136 17196 17843 17925 17925 18217 18455 18578 18578 18742 18773 18806 19130 19195 19254
[76] 19254 19421 19421 19429 19585 19686 19729 19729 19760 19760 19901 20530 20530 20530 20581
[91] 20629 20629 20686 20693 20768 20902 20980 21054 21079 21156
and I want to create a sequence along this vector, but for unique numbers only. For example,
length(unique(vector))
is 74, and there are a total of 100 values in the vector. The sequence should contain numbers ranging from 1 to 74 only, but have length 100, since some numbers will be repeated.
Any idea on how this can be done?
Thanks.
Perhaps
res <- as.numeric(factor(v1))
head(res)
#[1] 1 2 2 3 4 5
Or
res1 <- match(v1, unique(v1))
Or
library(fastmatch)
res2 <- fmatch(v1, unique(v1))
Or (valid here because v1 is sorted)
res3 <- findInterval(v1, unique(v1))
data
v1 <- c(2844, 4936, 4936, 4972, 5078, 6684, 6689, 7264, 7264, 7880,
8133, 9018, 9968, 9968, 10247, 11267, 11508, 11541, 11607, 11717,
12349, 12349, 12364, 12651, 13025, 13086, 13257, 13427, 13427,
13442, 13442, 13442, 13442, 14142, 14341, 14429, 14429, 14429,
14538, 14872, 15002, 15064, 15163, 15163, 15324, 15324, 15361,
15361, 15400, 15624, 15648, 15648, 15648, 15864, 15864, 15881,
16332, 16847, 17075, 17136, 17136, 17196, 17843, 17925, 17925,
18217, 18455, 18578, 18578, 18742, 18773, 18806, 19130, 19195,
19254, 19254, 19421, 19421, 19429, 19585, 19686, 19729, 19729,
19760, 19760, 19901, 20530, 20530, 20530, 20581, 20629, 20629,
20686, 20693, 20768, 20902, 20980, 21054, 21079, 21156)
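On this vector all of the proposed approaches agree, but only because v1 is sorted: as.numeric(factor(.)) numbers groups by sorted level order and findInterval() assumes sorted breakpoints, while match() numbers by first appearance. A small base-R sketch of the check, on the first few values:

```r
v1 <- c(2844, 4936, 4936, 4972, 5078, 6684)

res  <- as.numeric(factor(v1))        # numbered by sorted levels
res1 <- match(v1, unique(v1))         # numbered by first appearance
res3 <- findInterval(v1, unique(v1))  # relies on v1 being sorted

stopifnot(identical(res1, res3), identical(res, as.numeric(res1)))
res1
# [1] 1 2 2 3 4 5
```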
You could use .GRP from "data.table" for this:
library(data.table)
y <- as.data.table(x)[, y := .GRP, by = x]
head(y)
# x y
# 1: 2844 1
# 2: 4936 2 ## Note the duplicated value
# 3: 4936 2 ## in these rows, corresponding to x
# 4: 4972 3
# 5: 5078 4
# 6: 6684 5
tail(y)
# x y
# 1: 20768 69
# 2: 20902 70
# 3: 20980 71
# 4: 21054 72
# 5: 21079 73
# 6: 21156 74 ## "y" values go to 74

Intraday high/low clustering

I am attempting to perform a study on the clustering of high/low points based on time. I managed to achieve the above by using to.daily on intraday data and merging the two using:
intraday.merge <- merge(intraday,daily)
intraday.merge <- na.locf(intraday.merge)
intraday.merge <- intraday.merge["T08:30:00/T16:30:00"] # remove record at 00:00:00
Next, I tried to obtain the records where the high == daily.high/low == daily.low using:
intradayhi <- test[test$High == test$Daily.High]
intradaylo <- test[test$Low == test$Daily.Low]
Resulting data resembles the following:
Open High Low Close Volume Daily.Open Daily.High Daily.Low Daily.Close Daily.Volume
2012-06-19 08:45:00 258.9 259.1 258.5 258.7 1424 258.9 259.1 257.7 258.7 31523
2012-06-20 13:30:00 260.8 260.9 260.6 260.6 1616 260.4 260.9 259.2 260.8 35358
2012-06-21 08:40:00 260.7 260.8 260.4 260.5 493 260.7 260.8 257.4 258.3 31360
2012-06-22 12:10:00 255.9 256.2 255.9 256.1 626 254.5 256.2 253.9 255.3 50515
2012-06-22 12:15:00 256.1 256.2 255.9 255.9 779 254.5 256.2 253.9 255.3 50515
2012-06-25 11:55:00 254.5 254.7 254.4 254.6 1589 253.8 254.7 251.5 253.9 65621
2012-06-26 08:45:00 253.4 254.2 253.2 253.7 5849 253.8 254.2 252.4 253.1 70635
2012-06-27 11:25:00 255.6 256.0 255.5 255.9 973 251.8 256.0 251.8 255.2 53335
2012-06-28 09:00:00 257.0 257.3 256.9 257.1 601 255.3 257.3 255.0 255.1 23978
2012-06-29 13:45:00 253.0 253.4 253.0 253.4 451 247.3 253.4 246.9 253.4 52539
The subset contains duplicated results; how do I keep only the first record of each day? I would then be able to plot the count of records for periods in the day.
Also, are there alternate methods to get the results I want? Thanks in advance.
Edit:
Sample output should look like this; the count could be based either on the first result of the day or aggregated (more than one occurrence in that day):
Time Count
08:40:00 60
08:45:00 54
08:50:00 60
...
14:00:00 20
14:05:00 12
14:10:00 30
You can get the first observation of each day, keeping its own timestamp, via:
y <- x[!duplicated(format(index(x), "%Y-%m-%d"))]
(apply.daily(x, first) returns the right values but re-indexes them at each day's endpoint, losing the times you want to tabulate.) Then you can simply count how many days have their first record at each hour/minute:
z <- aggregate(1:NROW(y), by=list(Time=format(index(y),"%H:%M")), FUN=length)
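Putting the two steps together on synthetic data shows the shape of the result (a sketch; the 5-minute series is made up, and the xts package is assumed since the question works with intraday xts objects):

```r
library(xts)

# Synthetic 5-minute observations spanning a couple of days
set.seed(1)
idx <- seq(as.POSIXct("2012-06-19 08:30:00", tz = "UTC"),
           by = "5 min", length.out = 500)
x <- xts(rnorm(500), order.by = idx)

y <- x[!duplicated(format(index(x), "%Y-%m-%d"))]   # first record per day
z <- aggregate(1:NROW(y),
               by = list(Time = format(index(y), "%H:%M")),
               FUN = length)                        # the Time / Count table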

