htmlTable is replacing dataframe contents with sequential numbers - r

I'm using R markdown to create an html document. I've written a function that produces the following data frame as its output:
April ($) April Growth (%) Current ($) Current Growth (%) Change (%)
1 2013:3 253,963.49 0.2 251,771.20 0.7 -0.9
2 2013:4 253,466.09 -0.8 251,515.26 -0.4 -0.8
3 2014:1 255,448.95 3.2 255,300.10 6.2 -0.1
4 2014:2 259,376.84 6.3 259,919.99 7.4 0.2
5 2014:3 261,398.85 3.2 262,486.91 4.0 0.4
6 2014:4 264,309.06 4.5 266,662.59 6.5 0.9
I'm then supplying this data frame to htmlTable as shown:
html.tab <- htmlTable(sample.df, rnames=F)
print(html.tab)
However, when I knit the file, the table that is produced has the data frame contents replaced with sequential numbers.
Can anyone explain what is happening? I thought perhaps it was the data class in the data frame but I didn't see anything in the htmlTable vignette saying it couldn't handle data of certain classes.
This is my first time working with R Markdown and htmlTables so hopefully I've just made some basic mistake but I haven't been able to find anyone else with the same problem.

Thanks to Benjamin for the suggestion. It turns out the problem was the data class: sample.df contained data of class factor, which apparently htmlTable can't handle. Converting the data to character produces the correct table.
sample.df[] <- lapply(sample.df, as.character)
Perhaps someone more familiar with the package can explain why factors are a problem?
I knew it would be something basic like this!
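For anyone who hits this later: factors are stored internally as integer level codes with character labels, so if a column gets coerced to numeric rather than character somewhere along the way, you end up with the codes instead of the values, which would explain the sequential numbers. A minimal sketch of the coercion behaviour (this is just base R, not htmlTable itself):
f <- factor(c("253,963.49", "253,466.09", "255,448.95"))
as.numeric(f)    # 2 1 3 -- the underlying level codes, not the values
as.character(f)  # "253,963.49" "253,466.09" "255,448.95"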

Related

Missing data warning R

I have a dataframe with climatic values like temperature_max, temperature_min... in different locations. The data is a collection of time series, and there are some specific days with no data registered. I would like to impute the missing values taking into account the date and also the location (the place variable in the dataframe).
I have tried to impute those missing values with amelia, but no imputation is done and I get a warning.
Checking variables:
head(df): PLACE, DATE, TEMP_MAX, TEMP_MIN, TEMP_AVG
PLACE DATE TEMP_MAX TEMP_MIN TEMP_AVG
F 12/01/2007 19.7 2.5 10.1
F 13/01/2007 18.8 3.5 10.4
F 14/01/2007 17.3 2.4 10.4
F 15/01/2007 19.5 4.0 9.2
F 16/01/2007
F 17/01/2007 21.5 2.8 9.7
F 18/01/2007 17.7 3.3 12.9
F 19/01/2007 18.3 3.8 9.7
A 16/01/2007 17.7 3.4 9.7
A 17/01/2007
A 18/01/2007 19.7 6.2 10.4
A 19/01/2007 17.7 3.8 10.1
A 20/01/2007 18.6 3.8 12.9
This is just some of the records of my data set.
DF = amelia(df, m=4, ts= c("DATE"), cs = c("PLACE"))
where DATE is time series data (01/01/2001, 02/01/2001, 03/01/2001...), but if you filter by PLACE the time series are not equal (not the same start and end time).
I have 3 questions:
I am not sure if the time series data should be complete for all the places, i.e. the same start and end time for every place.
I am not using the lags or polytime parameters, so am I imputing correctly, taking the time-series influence into account? I am not sure how to use the lag parameter, although I have checked the package documentation.
The last question is that when I try to run that code there is a warning
and no imputation is done.
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 2 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 3 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 4 --
No missing data in bootstrapped sample: EM chain unnecessary
Can someone help me with this?
Thanks very much for your time!
For the software it does not matter if you have different start and end dates for different places; that is more a judgment call about your data. I would ask myself whether those gaps really are missing data (missing at random), and on that basis decide whether or not to create empty rows for them in the data set.
You want to use lags so that past values of a variable can improve the prediction of its missing values. They are not mandatory (the function can impute missing data without them), but they can be useful.
I contacted the author of the package and he told me that you need to specify the splinetime or polytime arguments to make sure that Amelia uses the time-series information to impute. For instance, if you set polytime = 3, it will impute based on a cubic of time. If you do that, I think you shouldn't see that warning anymore.
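To sketch what that might look like with the data above (hedged: the column names come from the head(df) shown, the date format is assumed to be dd/mm/yyyy, and the lags/leads arguments are optional extras rather than something the author required):
library(Amelia)
# amelia() wants numeric data, so turn DATE into a numeric time index first
df$DATE <- as.numeric(as.Date(df$DATE, format = "%d/%m/%Y"))
DF <- amelia(df, m = 4, ts = "DATE", cs = "PLACE",
             polytime = 3,                               # impute using a cubic polynomial of time
             lags  = c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG"),
             leads = c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG"))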

R creating timeseries with existing dates from file

I'm trying to create a time series plot in R where I obtain the dates from a REST request, and then I want to group and count the date occurrences on a one-week interval. I followed the examples of ts() in R and tried plots, which worked great, but I couldn't find any examples that show how to aggregate by date based on existing data. Can someone point me in the proper direction?
This is a sample of my parsed REST data:
REST Response excerpt ....
"2014-01-16T14:51:50.000-0800"
"2014-01-14T15:42:55.000-0800"
"2014-01-13T17:29:08.000-0800"
"2014-01-13T16:19:31.000-0800"
"2013-12-16T16:56:39.000-0800"
"2014-02-28T08:11:54.000-0800"
"2014-02-28T08:11:28.000-0800"
"2014-02-28T08:07:02.000-0800"
"2014-02-28T08:06:36.000-0800"
....
Sincerely,
code B.
You can parse the date with "as.Date" and then create a time series with "xts", which allows aggregating over any period of time.
library(xts)
REST$date <- as.Date(REST$date, format="%Y-%m-%d")
REST$variable <- seq(0,2.4,by=.3)
ts <- xts(REST[,"variable"], order.by=REST[,"date"])
> to.monthly(ts)
ts.Open ts.High ts.Low ts.Close
Dec 2013 1.2 1.2 1.2 1.2
Jan 2014 0.6 0.9 0.0 0.0
Feb 2014 1.5 2.4 1.5 2.4
> to.weekly(ts)
ts.Open ts.High ts.Low ts.Close
2013-12-16 1.2 1.2 1.2 1.2
2014-01-16 0.6 0.9 0.0 0.0
2014-02-28 1.5 2.4 1.5 2.4
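If what you are after is specifically counting how many timestamps fall in each week rather than aggregating a value, a minimal sketch (assuming the parsed timestamps are in REST$date as above) is to attach a 1 to every observation and sum by week:
counts <- xts(rep(1, length(REST$date)), order.by = REST$date)
weekly.counts <- apply.weekly(counts, sum)   # one row per week with the number of occurrences
plot(weekly.counts, type = "h")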
Not sure if this is what you needed. Is it?

R readHTMLTable() function error

I'm running into a problem when trying to use the readHTMLTable function in the R package XML. When running
library(XML)
baseurl <- "http://www.pro-football-reference.com/teams/"
team <- "nwe"
year <- 2011
theurl <- paste(baseurl,team,"/",year,".htm",sep="")
readurl <- getURL(theurl)   # getURL() is from the RCurl package, which also needs to be loaded
readtable <- readHTMLTable(readurl)
I get the error message:
Error in names(ans) = header :
'names' attribute [27] must be the same length as the vector [21]
I'm running 64 bit R 2.15.1 through R Studio 0.96.330. It seems there are several other questions that have been asked about the readHTMLTable() function, but none addressed this specific question. Does anyone know what's going on?
When readHTMLTable() complains about the 'names' attribute, it's a good bet that it's having trouble matching the data with what it's parsed for header values. The simplest way around this is to simply turn off header parsing entirely:
table.list <- readHTMLTable(theurl, header=F)
Note that I changed the name of the return value from "readtable" to "table.list". (I also skipped the getURL() call since 1. it didn't work for me and 2. readHTMLTable() knows how to handle URLs). The reason for the change is that, without further direction, readHTMLTable() will hunt down and parse every HTML table it can find on the given page, returning a list containing a data.frame for each.
The page you are pointing it at is fairly rich, with 8 separate tables:
> length(table.list)
[1] 8
If you were only interested in a single table on the page, you can use the which argument to specify it and receive its contents as a data.frame directly.
This could also cure your original problem if it had choked on a table you're not interested in. Many pages still use tables for navigation, search boxes, etc., so it's worth taking a look at the page first.
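One way to take that look from R itself (assuming the table.list call above) is to check what was actually parsed before deciding on which:
names(table.list)          # table ids, where the page provides them
sapply(table.list, dim)    # rows and columns of each parsed table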
But this is unlikely to be the case in your example since it actually choked on all but one of them. In the unlikely event that the stars aligned and you were only interested in the successfully-parsed third table on the page (passing statistics), you could grab it like this, keeping header parsing on:
> passing.df = readHTMLTable(theurl, which=3)
> print(passing.df)
No. Player Age Pos G GS QBrec Cmp Att Cmp% Yds TD TD% Int Int% Lng Y/A AY/A Y/C Y/G Rate Sk Yds NY/A ANY/A Sk% 4QC GWD
1 12 Tom Brady* 34 QB 16 16 13-3-0 401 611 65.6 5235 39 6.4 12 2.0 99 8.6 9.0 13.1 327.2 105.6 32 173 7.9 8.2 5.0 2 3
2 8 Brian Hoyer 26 3 0 1 1 100.0 22 0 0.0 0 0.0 22 22.0 22.0 22.0 7.3 118.7 0 0 22.0 22.0 0.0

Stock indicator calculations in TTR ( R package): Best way to align the output to the left?

I am using the TTR package to generate stock indicators. However, the indicator functions add NA (where applicable -- e.g. CMO, SMA, CMF, etc.) to the beginning of the series instead of the end. Is there a way to align the output to the left so the NA values are added to the end of the series as opposed to the beginning?
For example:
library(TTR)
x = 1:10
# TTR's simple moving average
SMA(x,n=2)
[1] NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
The zoo package has an align option to pad the series with NAs at the end:
library(zoo)
rollmean(x,2,na.pad=TRUE,align='left')
[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 NA
Is there a way to specify something like this in TTR, as I need to generate indicators beyond moving averages? I guess I could create a wrapper around these functions and manually shift the resulting values, but I'm not sure if there is a better way to do it.
Also, since TTR is heavily used to add indicators to stock prices, I am wondering why the padding is at the beginning as opposed to the end, especially since most historical prices are sorted in descending order (by date). In the above example, if x[1] is the price of a stock today and x[10] the price 10 days ago, shouldn't the moving average (span = 2) for today be the average of today + yesterday? As much as I would like to add NAs at the end, I would also like to make sure I am not misinterpreting how these indicators are used.
Thanks,
-e
I couldn't figure out an option in the function call to shift the series in a different direction. However, now I understand why TTR pads the series at the beginning. Historical stock prices obtained via quantmod's getSymbols() are returned sorted by date in ascending order. When I manually download the quotes from Yahoo! or use ystockquote.py, the order is descending. I just re-sorted my data by date and used the TTR library as-is.
There were certain vectors that I wanted to be shifted up (padded with NAs) and I just used this code:
miss_len = sum(is.na(x))             # count the leading NAs
x = x[!is.na(x)]                     # drop them
length(x) = length(x) + miss_len     # extending the length pads the end with NAs
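A small generic wrapper along the same lines (a sketch that assumes all the NAs an indicator returns are the leading ones, which is how TTR pads) could be:
left_align <- function(x) {
  n.missing <- sum(is.na(x))            # leading NAs added by the indicator
  c(x[!is.na(x)], rep(NA, n.missing))   # move them to the end
}
left_align(SMA(1:10, n = 2))
# [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5  NA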

rank() doesn't rank properly when used with scientific notation numbers

I tried to order a csv file, but the rank() function is acting weird on numbers in E notation.
> comparison = read.csv("e:/thesis/comparison/output.csv", header=TRUE)
> comparison$proxygeneld_full.txt[0:20]
[1] 9.34E-07 4.04E-06 4.16E-06 7.17E-06 2.08E-05 3.00E-05
[7] 3.59E-05 4.16E-05 7.75E-05 9.50E-05 0.0001116 0.00012452
[13] 0.00015494 0.00017892 0.00017892 0.00018345 0.0002232 0.000231775
[19] 0.00023241 0.0002666
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
> rank(comparison$proxygeneld_full.txt[0:20])
[1] 19.0 14.0 16.0 17.0 11.0 12.0 13.0 15.0 18.0 20.0 1.0 2.0 3.0 4.5 4.5
[16] 6.0 7.0 8.0 9.0 10.0
#It should be 1-20 in order ....
It seems to just ignore the E notation right there. It turns out to be fine if I'm not using data from a file:
> rank(c(9.34E-07, 4.04E-06, 7.17E-06))
[1] 1 2 3
Am I missing something ? Thanks.
I guess you have some non-numeric data in your csv file.
What happens if you do the following?
as.numeric(comparison$proxygeneld_full.txt)
If this produces different numbers than you expected, you certainly have some text in this column.
Yep - $proxygeneld_full.txt[0:20] isn't even numeric. It is a factor:
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
So rank() is ranking the numeric codes that lie behind the factor representation, and the E-0X "numbers" sort after the non-E numbers in the levels.
Look at str(comparison) and you'll see that proxygeneld_full.txt is a factor.
I'm struggling to replicate the behaviour you are seeing with E numbers in a csv file: R reads them properly as numeric. Check your CSV to make sure you don't have some non-numeric values in that column, or that the E numbers are not quoted.
Ahh! Looking again at the levels you quote: there is an adjP lurking at the end of the output you show. Check your data again, as this adjP is in there somewhere and is forcing R to code that variable as a factor, hence the ranking behaviour I described above.
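Once that stray adjP (or whatever the non-numeric entry is) has been found, a hedged sketch of getting sensible ranks out of the factor column is to convert via the labels, never via as.numeric() on the factor itself:
x.num <- as.numeric(as.character(comparison$proxygeneld_full.txt))  # "adjP" becomes NA, with a warning
rank(x.num[1:20])   # ranks now follow the numeric order (ties get averaged ranks)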
