R readHTMLTable() function error

I'm running into a problem when trying to use the readHTMLTable function from the R package XML. When running
library(XML)
library(RCurl)  # getURL() below comes from RCurl
baseurl <- "http://www.pro-football-reference.com/teams/"
team <- "nwe"
year <- 2011
theurl <- paste(baseurl,team,"/",year,".htm",sep="")
readurl <- getURL(theurl)
readtable <- readHTMLTable(readurl)
I get the error message:
Error in names(ans) = header :
'names' attribute [27] must be the same length as the vector [21]
I'm running 64 bit R 2.15.1 through R Studio 0.96.330. It seems there are several other questions that have been asked about the readHTMLTable() function, but none addressed this specific question. Does anyone know what's going on?

When readHTMLTable() complains about the 'names' attribute, it's a good bet that it's having trouble matching the data with what it's parsed for header values. The simplest way around this is to simply turn off header parsing entirely:
table.list <- readHTMLTable(theurl, header=F)
Note that I changed the name of the return value from "readtable" to "table.list". (I also skipped the getURL() call since 1. it didn't work for me and 2. readHTMLTable() knows how to handle URLs). The reason for the change is that, without further direction, readHTMLTable() will hunt down and parse every HTML table it can find on the given page, returning a list containing a data.frame for each.
The page you pointed it at is fairly rich, with 8 separate tables:
> length(table.list)
[1] 8
If you were only interested in a single table on the page, you can use the which argument to specify it and receive its contents as a data.frame directly.
This could also cure your original problem if it had choked on a table you're not interested in. Many pages still use tables for navigation, search boxes, etc., so it's worth taking a look at the page first.
But this is unlikely to be the case in your example, since it actually choked on all but one of them. In the unlikely event that the stars aligned and you were only interested in the successfully-parsed third table on the page (passing statistics), you could grab it like this, keeping header parsing on:
> passing.df = readHTMLTable(theurl, which=3)
> print(passing.df)
No. Age Pos G GS QBrec Cmp Att Cmp% Yds TD TD% Int Int% Lng Y/A AY/A Y/C Y/G Rate Sk Yds NY/A ANY/A Sk% 4QC GWD
1 12 Tom Brady* 34 QB 16 16 13-3-0 401 611 65.6 5235 39 6.4 12 2.0 99 8.6 9.0 13.1 327.2 105.6 32 173 7.9 8.2 5.0 2 3
2 8 Brian Hoyer 26 3 0 1 1 100.0 22 0 0.0 0 0.0 22 22.0 22.0 22.0 7.3 118.7 0 0 22.0 22.0 0.0
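To see the list-of-tables behaviour without hitting the network, here is a small self-contained sketch (the HTML snippet is made up for illustration); parsing with htmlParse() first and handing readHTMLTable() the parsed document also sidesteps any URL-fetching trouble:

```r
library(XML)

# A made-up two-table page standing in for the team page above
html <- "<html><body>
<table><tr><th>Player</th><th>Yds</th></tr><tr><td>Tom Brady</td><td>5235</td></tr></table>
<table><tr><td>nav</td><td>links</td></tr></table>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

table.list <- readHTMLTable(doc, header = FALSE)  # one data.frame per <table>
length(table.list)                                # 2 tables found

first.df <- readHTMLTable(doc, which = 1)         # only the first, as a data.frame
```

The second table here plays the role of a navigation table: it parses fine, but it is noise, which is why inspecting the full list before reaching for which= is worthwhile.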

Related

Fill data with linear projection based on scalars

I'm trying to project some variables based on specific scalars, but I'm having some trouble putting it together. The general idea is to make a linear projection (extrapolation) based on the average change rate of each variable, to fill in the missing values at the end of the data and complete it. The main data looks like this:
| t | com | var1 | var2...
---------------------------
1 1 2.2 5.8
1 2 2.4 6.2
... ... ... ...
1 38 1.8 6.4
2 1 2.0 7.2
... ... ... ...
73 1 1.2 9.2
... ... ... ...
73 38 1.4 10.2
74 1 NA NA
... ... ... ...
104 38 NA NA
Basically 38 observations over 104 periods for a bunch of variables, but some variables stop having values at t = 73. (The "..." rows are only there so I can show the actual data size; they are not a missing-value representation.)
I also have the scalars I need by com stored as Tx_Var1:
com | tx_var1
1 2.3
2 1.7
... ...
38 4.5
for every variable I need. These scalars are simply the average change rate for each variable by com, so the actual solution doesn't have to use them; I just built them because I'm trying to solve this step by step.
What I'm looking for is a way to complete the main data for these variables when t >= 73 and the variable is NA, using var1 = lag(var1)*(1+tx_var1), and this would have to be done by com.
I believe I need to mutate after grouping by com, but I don't know how to bring the scalar values from Tx_Var1, Tx_Var2, etc. into the code and combine that with the t and NA restrictions. I'm also unsure how to handle the NA restriction together with the lag(var) part, because each row would need the previously completed value. I'm currently looking into the complete() and fill() functions to see if there's a less complicated way to make this work.
Any help on this problem would be greatly appreciated.
Thanks,
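For what it's worth, here is a base-R sketch of that fill rule on toy data (all numbers, column names, and the tx values are made up; the same loop logic could sit inside a dplyr group_modify()). It walks each com in t order and replaces each NA with the previous value times (1 + tx):

```r
# Toy stand-in for the main data: 2 coms, 6 periods, values missing from t = 5 on
df <- data.frame(
  t    = rep(1:6, each = 2),
  com  = rep(1:2, times = 6),
  var1 = c(2.2, 2.4, 2.3, 2.5, 2.4, 2.6, 2.5, 2.7, NA, NA, NA, NA)
)
# Toy stand-in for Tx_Var1: one average change rate per com
tx_var1 <- data.frame(com = 1:2, tx = c(0.1, 0.2))

df <- df[order(df$com, df$t), ]
for (cm in unique(df$com)) {
  rows <- which(df$com == cm)
  tx   <- tx_var1$tx[tx_var1$com == cm]
  # Walk forward in t so each filled value can feed the next row's lag
  for (k in seq_along(rows)[-1]) {
    i <- rows[k]
    if (is.na(df$var1[i])) {
      df$var1[i] <- df$var1[rows[k - 1]] * (1 + tx)
    }
  }
}
```

Because the loop moves forward in t, each newly filled value becomes the lag for the next row, which is the part a single vectorised mutate(var1 = lag(var1) * (1 + tx)) cannot do on its own.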

R – How to give a common ID to matching data with close values, and arrange the dataframe for paired tests (e.g. from Hmisc::find.matches())

Hi everyone ! I hope you are having a great day.
Aim and context
My two dataframes are built from different methods, but measure the same parameters on the same signals.
I’d like to match every signal in the first dataframe with the same signal in the second dataframe, to compare the parameter values and evaluate the methods against each other.
I would gratefully appreciate any help, as I reached my beginner’s limits in R coding but also in dataframe management.
Basically, I would like to find matches in two separate dataframes and consider that the matches refer to the same entity (for instance through the creation of an ID variable), in order to perform statistical analysis for paired data.
I could have made the matches by hand on a spreadsheet, but because there are hundreds of entries and more comparisons to come, I’d like to automate the matching and the creation of the dataframe.
To give you an idea, my dataframes look like this :
DF1
Recording  Selection  Start (ms)  Freq.max (kHz)
001        1          11.3        42.4
001        2          122.9       46.2
001        3          232.3       47.5
002        1          22.9        30.9
002        2          512.4       31.3
My second dataframe would look something like this :
DF2
Recording  Selection  Start (ms)  Freq.max (kHz)
001        1          10.9        41.8
001        2          122.1       44.5
001        3          231.3       44.4
002        1          513.0       30.2
My ideas
I thought about identifying each signal, but
An ID using "Recording + selection" (001_1, 001_2...) would not work because some signals are not detected in both methods.
So I'd want to use the start position to identify the signals, but rounding to the closest or upper/lower value would not match all the signals.
Hmisc::find.matches() function
I tried the function find.matches() from the package Hmisc, that gives the matches of your columns, given the tolerance threshold you input.
find <- find.matches(DF_method1$start_one, DF_method2$start_two, tol=(2))
(I arbitrarily chose a tolerance of 2 ms, for it to be considered as the same signal)
The output looks like this :
Matches:
Match #1 Match #2 Match #3
[1,] 1 7 0
[2,] 2 42 0
[3,] 3 0 0
[4,] 4 0 0
[5,] 0 0 0
[6,] 5 0 0
[7,] 22 6 0
I feel like it is coming together but I am stuck to these two questions :
How do I find the closest match within each recording, rather than comparing all signals across all recordings? (In the example here, all first matches are correctly identified except no. 7, which is matched with no. 22 from a different recording.) How could I run the function within each recording?
How do I create a dataframe from the output? It would contain only the signals that had a match, along with their related parameter values.
I feel like this function gets close to my aim but if you have any other suggestion, I am all ears.
Thanks a lot
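In case it helps, here is a base-R sketch of per-recording nearest-start matching (column names and all values are made up to mirror the tables above; the 2 ms tolerance comes from the question). It splits DF1 by recording, finds the closest start in the corresponding rows of DF2, and keeps only pairs within tolerance:

```r
df1 <- data.frame(recording = c("001", "001", "001", "002", "002"),
                  selection = c(1, 2, 3, 1, 2),
                  start = c(11.3, 122.9, 232.3, 22.9, 512.4),
                  freq  = c(42.4, 46.2, 47.5, 30.9, 31.3))
df2 <- data.frame(recording = c("001", "001", "001", "002"),
                  selection = c(1, 2, 3, 1),
                  start = c(10.9, 122.1, 231.3, 513.0),
                  freq  = c(41.8, 44.5, 44.4, 30.2))
tol <- 2  # ms, as in the question

matched <- do.call(rbind, lapply(split(df1, df1$recording), function(a) {
  b <- df2[df2$recording == a$recording[1], ]
  if (nrow(b) == 0) return(NULL)
  # Index of the closest start in b for each row of a; NA if outside tolerance
  idx <- vapply(a$start, function(s) {
    d <- abs(b$start - s)
    if (min(d) <= tol) which.min(d) else NA_integer_
  }, integer(1))
  keep <- !is.na(idx)
  data.frame(id     = paste(a$recording[keep], a$selection[keep], sep = "_"),
             start1 = a$start[keep], freq1 = a$freq[keep],
             start2 = b$start[idx[keep]], freq2 = b$freq[idx[keep]])
}))
```

The result has one row per matched pair (four here: recording 002's first signal has no counterpart within 2 ms), with a shared id, so the freq1/freq2 columns are ready for a paired test.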

could not find function "as.triangle"

I need to transform a data table into triangle form.
So I downloaded the ChainLadder package and imported the data from the table. You can see it below:
> head(triangle.csv)
année dev montant
1 2009 1 2147
2 2009 2 3365
3 2009 3 2129
4 2009 4 1070
5 2009 5 0
6 2009 6 300
I need to convert this table into triangle form like :
this table constructed in Excel
So I wrote this code:
data<-as.triangle(triangle.csv)
But an Error is shown:
Error in as.triangle(triangle.csv) : could not find function
"as.triangle"
How do I resolve this problem please?
The "could not find function" error means the package isn't attached: downloading ChainLadder isn't enough, you also need library(ChainLadder) first (or you can qualify the call as ChainLadder::as.triangle(), as pointed out by @Paul in the comments). You should then name the columns explicitly:
as.triangle(triangle.csv, origin="année", dev="dev", value="montant")
EDIT: for future issues, remember to practice with the examples in the man pages (e.g. ?as.triangle).
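Putting the pieces together, a minimal sketch (the three rows are copied from the head() output above; whether read.csv gave you exactly these column names is an assumption):

```r
library(ChainLadder)  # as.triangle() lives here; without this, "could not find function"

triangle.csv <- data.frame(
  année   = c(2009, 2009, 2009),
  dev     = c(1, 2, 3),
  montant = c(2147, 3365, 2129)
)

# Name the origin, development, and value columns explicitly
tri <- as.triangle(triangle.csv, origin = "année", dev = "dev", value = "montant")
```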

Data.table fread position error [duplicate]

Is this a bug in data.table::fread (version 1.9.2) or misplaced user expectation/error?
Consider this trivial example where I have a table of values, TAB separated with possibly missing values. If the values are missing in the first column, fread gets upset, but if missing values are elsewhere I return the data.table I expect:
# Data with missing value in first column, third row and last column, second row:
12 876 19
23 39
15 20
fread("12\t876\t19\n23\t39\t\n\t15\t20")
#Error in fread("12\t876\t19\n23\t39\t\n\t15\t20") :
# Not positioned correctly after testing format of header row. ch=' '
# Data with missing values last column, rows two and three:
"12 876 19
23 39
15 20 "
fread("12\t876\t19\n23\t39\t\n15\t20\t")
# V1 V2 V3
#1: 12 876 19
#2: 23 39 NA
#3: 15 20 NA
# Returns as expected.
Is this a bug or is it not possible to have missing values in the first column (or do I have malformed data somehow?).
I believe this is the same bug that I reported here.
The most recent version that I know will work with this type of input is Rev. 1180. You could check out and build that version by appending @1180 to the svn checkout URL:
svn checkout svn://svn.r-forge.r-project.org/svnroot/datatable@1180
If you're not familiar with checking out and building packages, see here
But a lot of great features, bug fixes, and enhancements have been implemented since Rev. 1180. (The development version at the time of this writing is Rev. 1272.) So a better solution is to replace the R/fread.R and src/fread.c files with the versions from Rev. 1180 or older, and then rebuild the package.
You can find those files online without checking them out here (sorry, I can't figure out how to post links that include '*', so you have to copy/paste):
fread.R:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/fread.R?revision=988&root=datatable
fread.c:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/src/fread.c?revision=1159&root=datatable
Once you've rebuilt the package, you'll be able to read your tsv file.
> fread("12\t876\t19\n23\t39\t\n\t15\t20")
V1 V2 V3
1: 12 876 19
2: 23 39 NA
3: NA 15 20
The downside to doing this is that the old version of fread() does not pass a newer test -- you won't be able to read fields that have quotes in the middle.
> fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
Error in fread("A,B,C\n1.2,Foo\"Bar,\"a\"b\"c\"d\"\nfo\"o,bar,\"b,az\"\"\n") :
Not positioned correctly after testing format of header row. ch=','
With newer versions of fread, you would get this
> fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
A B C
1: 1.2 Foo"Bar a"b"c"d
2: fo"o bar b,az"
So, for now, which version "works" depends on whether you're more likely to have missing values in the first column, or quotes in fields. For me, it's the former, so I'm still using the old code.

rank() doesn't rank properly when used with scientific-notation numbers

I tried to order a csv file, but the rank() function acts strangely on numbers in E notation.
> comparison = read.csv("e:/thesis/comparison/output.csv", header=TRUE)
> comparison$proxygeneld_full.txt[0:20]
[1] 9.34E-07 4.04E-06 4.16E-06 7.17E-06 2.08E-05 3.00E-05
[7] 3.59E-05 4.16E-05 7.75E-05 9.50E-05 0.0001116 0.00012452
[13] 0.00015494 0.00017892 0.00017892 0.00018345 0.0002232 0.000231775
[19] 0.00023241 0.0002666
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
> rank(comparison$proxygeneld_full.txt[0:20])
[1] 19.0 14.0 16.0 17.0 11.0 12.0 13.0 15.0 18.0 20.0 1.0 2.0 3.0 4.5 4.5
[16] 6.0 7.0 8.0 9.0 10.0
#It should be 1-20 in order ....
It seems to just ignore the E notation there. It turns out fine if I'm not using data from a file:
> rank(c(9.34E-07, 4.04E-06, 7.17E-06))
[1] 1 2 3
Am I missing something ? Thanks.
I guess you have some non-numeric data in your csv file.
What happens if you do
as.numeric(as.character(comparison$proxygeneld_full.txt))
If this produces NAs (with a coercion warning) or numbers different from what you expected, you certainly have some text in this column. (Note the as.character() step: calling as.numeric() directly on a factor returns its internal level codes, not the printed values.)
Yep - $proxygeneld_full.txt[0:20] isn't even numeric. It is a factor:
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
So rank() is ranking the numeric codes that lay behind the factor representation, and the E-0X "numbers" sort after the non-E numbers in the levels.
Look at str(comparison) and you'll see that proxygeneld_full.txt is a factor.
I'm struggling to replicate the behaviour you are seeing with E numbers in a csv file: R reads them properly as numeric. Check your CSV to make sure you don't have non-numeric values in that column, and that the E numbers are not quoted.
Ahh! Looking again at the levels you quote: there is an adjP lurking at the end of the output you show. Check your data again, as that adjP is in there somewhere, and it is forcing R to code the variable as a factor, hence the ranking behaviour I described above.
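A tiny reproduction of that factor trap (the values are made up, with one stray text entry playing the role of adjP):

```r
x <- factor(c("9.34E-07", "4.04E-06", "0.0001116", "adjP"))

levels(x)        # sorted as text, with "adjP" last
as.numeric(x)    # the internal level codes, not the printed values
rank(x)          # ranks those codes, hence the odd ordering

# The correct conversion goes through character first;
# the stray "adjP" becomes NA with a coercion warning
as.numeric(as.character(x))
```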
