rank() doesn't rank properly when used with scientific notation numbers - r

I tried to order a CSV file, but the rank() function acts weird on numbers with -E notation.
> comparison = read.csv("e:/thesis/comparison/output.csv", header=TRUE)
> comparison$proxygeneld_full.txt[0:20]
[1] 9.34E-07 4.04E-06 4.16E-06 7.17E-06 2.08E-05 3.00E-05
[7] 3.59E-05 4.16E-05 7.75E-05 9.50E-05 0.0001116 0.00012452
[13] 0.00015494 0.00017892 0.00017892 0.00018345 0.0002232 0.000231775
[19] 0.00023241 0.0002666
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
> rank(comparison$proxygeneld_full.txt[0:20])
[1] 19.0 14.0 16.0 17.0 11.0 12.0 13.0 15.0 18.0 20.0 1.0 2.0 3.0 4.5 4.5
[16] 6.0 7.0 8.0 9.0 10.0
#It should be 1-20 in order ....
It seems to just ignore the -E notation there. It turns out to be fine if I'm not using data from the file:
> rank(c(9.34E-07, 4.04E-06, 7.17E-06))
[1] 1 2 3
Am I missing something? Thanks.

I guess you have some non-numeric data in your csv file.
What happens if you do?
as.numeric(comparison$proxygeneld_full.txt)
If this produces different numbers than you expected, you certainly have some text in this column.

Yep - $proxygeneld_full.txt[0:20] isn't even numeric. It is a factor:
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
So rank() is ranking the numeric codes that lie behind the factor representation, and the E-0X "numbers" sort after the non-E numbers in the levels.
Look at str(comparison) and you'll see that proxygeneld_full.txt is a factor.
I'm struggling to replicate the behaviour you are seeing with E numbers in a csv file; R reads them properly as numeric. Check your CSV to make sure you don't have any non-numeric values in that column, and that the E numbers are not quoted.
Ahh! Looking again at the levels you quote: there is an adjP lurking at the end of the output you show. Check your data again, as this adjP is in there somewhere, and that is forcing R to code that variable as a factor, hence the behaviour you see with ranking as I described above.
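A minimal sketch of the effect described above, using a made-up vector in place of the CSV column (the values and the stray "adjP" entry are illustrative, not the asker's data). Note that since R 4.0 read.csv() returns such a column as character rather than factor unless stringsAsFactors = TRUE is set, so the factor is built explicitly here:

```r
# Hypothetical stand-in for the CSV column: numbers stored as text plus one stray label.
raw <- c("9.34E-07", "4.04E-06", "0.0001116", "0.00012452", "adjP")
f <- factor(raw)

# Levels are sorted as text, so the "0.000..." strings come before the E-notation ones.
levels(f)

# rank() on a factor ranks the integer codes (level order), not the numeric values.
rank_as_factor <- rank(f[1:4])
rank_as_factor

# Convert via character to recover the real numbers; "adjP" becomes NA (with a warning).
x <- suppressWarnings(as.numeric(as.character(f)))
rank_as_numeric <- rank(x[1:4])
rank_as_numeric   # 1 2 3 4
```

Once the stray text value is removed from the file (or turned into NA as here), rank() behaves as expected.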

Related

Fill data with linear projection based on scalars

I'm trying to project some variables based on specific scalars but I'm having some trouble putting it together. The general idea is to make a linear projection (or extension? excuse my English) based on the average change rate of each variable, to fill in the missing values at the end of the data. The main data looks like this:
| t | com | var1 | var2...
---------------------------
1 1 2.2 5.8
1 2 2.4 6.2
... ... ... ...
1 38 1.8 6.4
2 1 2.0 7.2
... ... ... ...
73 1 1.2 9.2
... ... ... ...
73 38 1.4 10.2
74 1 NA NA
... ... ... ...
104 38 NA NA
Basically 38 observations over 104 periods for a bunch of variables, but some variables stop having values at t = 73. (The "..." are there to let me show the actual data size; they do not represent missing values.)
I also have the scalars I need by com stored as Tx_Var1:
com | tx_var1
1 2.3
2 1.7
... ...
38 4.5
for every variable I need. These scalars are simply the average change rate for each variable by com, so the actual solution may not have to use it. I just built it because I'm trying to solve this step by step.
What I'm looking for is a way to complete the main data for these variables when t >= 73 and the variable is NA using var1 = lag(var1)*(1+tx_var1) and this would have to be by com.
I believe I need to mutate after a group_by on com, but I don't know how to call the scalar value from Tx_Var1, Tx_Var2, etc. into the code and combine that with the t and NA restrictions. I am also not sure how to handle the NA restriction together with the lag(var) part, because each row needs the previous (already filled) value to be completed. I'm currently looking into the complete and fill functions to see if there's a less complicated way to make this work.
Any help on this problem would be greatly appreciated.
Thanks,
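One possible approach, as a rough base-R sketch rather than a definitive answer: fill each com's series row by row so that every NA uses the value just computed before it. The toy data, the helper name project_forward, and the reading of tx_var1 as a fractional rate (0.10 meaning 10%) are all assumptions made for illustration.

```r
# Toy data: 2 groups (com), 5 periods, var1 missing from t = 4 onward.
main <- data.frame(
  t    = rep(1:5, each = 2),
  com  = rep(1:2, times = 5),
  var1 = c(2.0, 4.0, 2.2, 4.4, 2.42, 4.84, NA, NA, NA, NA)
)
# Hypothetical per-com growth rates, expressed as fractions.
tx <- data.frame(com = 1:2, tx_var1 = c(0.10, 0.50))

# Fill NAs sequentially: x[i] = x[i - 1] * (1 + rate), reusing values filled earlier.
project_forward <- function(x, rate) {
  for (i in seq_along(x)) {
    if (is.na(x[i]) && i > 1) x[i] <- x[i - 1] * (1 + rate)
  }
  x
}

# Sort by com then t so each group is contiguous and in time order.
main <- main[order(main$com, main$t), ]
rates <- tx$tx_var1[match(main$com, tx$com)]

# Apply the fill within each com and put the results back in row order.
groups <- split(seq_len(nrow(main)), main$com)
main$var1 <- unlist(lapply(groups, function(idx) {
  project_forward(main$var1[idx], rates[idx[1]])
}), use.names = FALSE)

main
```

A plain lag() inside mutate() would not be enough here, because lag() looks at the original column, where the previous value is still NA after the first projected period; that is why the sketch uses an explicit loop.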

R vector numeric expression warning

I created a simple vector in R to store the temperatures of 3 patients.
temperature <- c(98.1, 98.6, 101.4)
then later I tried to retrieve the temperature of the second and third patient.
temperature[2:3]
[1] 98.6 101.4
While trying to retrieve all three values I succeeded but then got this warning from RStudio
temperature[1:2:3]
[1] 98.1 98.6 101.4
Warning message:
In 1:2:3 : numerical expression has 2 elements: only the first used
What does this warning mean?
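The expression temperature[1:2:3] is valid R (valid in the sense that it runs without an error), but it gives the same result as temperature[1:3].
The : operator is evaluated left to right, so 1:2:3 is (1:2):3. The left operand 1:2 has 2 elements and only the first one is used as the start of the sequence, which is what the warning says. In effect R only uses the first and the last indices, so temperature[1:3:4:5:3] is the same as temperature[1:3].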
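A small sketch of that behaviour, assuming a default R session in which a multi-element operand to : produces a warning rather than an error:

```r
temperature <- c(98.1, 98.6, 101.4)

# 1:2:3 is evaluated as (1:2):3; only the first element of 1:2 is used as the start.
idx <- suppressWarnings(1:2:3)
identical(idx, 1:3)

# To pick several positions without the warning, use a single range or c().
temperature[1:3]
temperature[c(1, 2, 3)]
```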

Missing data warning R

I have a dataframe with climatic values like temperature_max, temperature_min... in different locations. The data is a time series, and there are some specific days on which no data was recorded. I would like to impute those values taking into account the date and also the location (the PLACE variable in the dataframe).
I have tried to impute those missing values with Amelia, but no imputation is done and a warning is shown.
Checking variables:
head(df): PLACE, DATE, TEMP_MAX, TEMP_MIN, TEMP_AVG
PLACE DATE TEMP_MAX TEMP_MIN TEMP_AVG
F 12/01/2007 19.7 2.5 10.1
F 13/01/2007 18.8 3.5 10.4
F 14/01/2007 17.3 2.4 10.4
F 15/01/2007 19.5 4.0 9.2
F 16/01/2007
F 17/01/2007 21.5 2.8 9.7
F 18/01/2007 17.7 3.3 12.9
F 19/01/2007 18.3 3.8 9.7
A 16/01/2007 17.7 3.4 9.7
A 17/01/2007
A 18/01/2007 19.7 6.2 10.4
A 19/01/2007 17.7 3.8 10.1
A 20/01/2007 18.6 3.8 12.9
This is just some of the records of my data set.
DF = amelia(df, m=4, ts= c("DATE"), cs = c("PLACE"))
where DATE is time series data (01/01/2001, 02/01/2001, 03/01/2001...), but if you filter by PLACE the time series are not equal (not the same start and end time).
I have 3 questions:
I am not sure if I should have the time series data complete for all the places, I mean the same start and end time for all the places.
I am not using the lags or polytime parameters, so am I imputing correctly, taking into account the time series influence? I am not sure how to use the lags parameter, although I have checked the R package documentation.
The last question is that when I try to use that code there is a warning
and no imputation is done.
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 2 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 3 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 4 --
No missing data in bootstrapped sample: EM chain unnecessary
Can someone help me with this?
Thanks very much for your time!
For the software it does not matter if you have different start and end dates for different places. I think that it is more up to you and your thoughts on the data. I would ask myself whether those dates count as missing data (missing at random), and therefore whether or not to create empty rows for them in your data set.
You want to use lags in order to use past values of the variable to improve the prediction of missing values. It is not mandatory (i.e., the function can impute missing data even without such a specification) but it can be useful.
I contacted the author of the package and he told me that you need to specify the splinetime or polytime arguments to make sure that Amelia will use the time-series information to impute. For instance, if you set polytime = 3, it will impute based on a cubic of time. If you do that, I think you shouldn't see that warning anymore.
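To see which rows the warning in the question refers to, it may help to list them before imputing. This is a minimal base-R sketch on a few rows copied from the question. It assumes that "completely missing" means every measured column (TEMP_MAX, TEMP_MIN, TEMP_AVG) is NA in a row; the names measure_cols and all_missing are made up for the example.

```r
# A few rows from the question; blank measurements are entered as NA.
df <- data.frame(
  PLACE    = c("F", "F", "F", "A", "A"),
  DATE     = c("15/01/2007", "16/01/2007", "17/01/2007", "16/01/2007", "17/01/2007"),
  TEMP_MAX = c(19.5, NA, 21.5, 17.7, NA),
  TEMP_MIN = c(4.0, NA, 2.8, 3.4, NA),
  TEMP_AVG = c(9.2, NA, 9.7, 9.7, NA)
)

# Flag rows in which all measured columns are NA.
measure_cols <- c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG")
all_missing <- rowSums(is.na(df[measure_cols])) == length(measure_cols)

# Show the place and date of the fully missing observations.
df[all_missing, c("PLACE", "DATE")]
```

In this sample every missing value sits in a fully missing row, which would be consistent with the "No missing data in bootstrapped sample" messages.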

htmlTable is replacing dataframe contents with sequential numbers

I'm using R markdown to create an html document. I've written a function that produces the following data frame as its output:
April ($) April Growth (%) Current ($) Current Growth (%) Change (%)
1 2013:3 253,963.49 0.2 251,771.20 0.7 -0.9
2 2013:4 253,466.09 -0.8 251,515.26 -0.4 -0.8
3 2014:1 255,448.95 3.2 255,300.10 6.2 -0.1
4 2014:2 259,376.84 6.3 259,919.99 7.4 0.2
5 2014:3 261,398.85 3.2 262,486.91 4.0 0.4
6 2014:4 264,309.06 4.5 266,662.59 6.5 0.9
I'm then supplying this data frame to htmlTable as shown:
html.tab <- htmlTable(sample.df, rnames=F)
print(html.tab)
However, when I knit the file the following table is produced:
Can anyone explain what is happening? I thought perhaps it was the data class in the data frame but I didn't see anything in the htmlTable vignette saying it couldn't handle data of certain classes.
This is my first time working with R Markdown and htmlTables so hopefully I've just made some basic mistake but I haven't been able to find anyone else with the same problem.
Thanks to Benjamin for the suggestion. It turns out the problem was the data class. sample.df contained data of class factor which apparently htmlTable can't handle. By converting the data to characters the correct table is produced.
sample.df[] <- lapply(sample.df, as.character)
Perhaps someone more familiar with the package can explain why factors are a problem?
I knew it would be something basic like this!
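One plausible explanation for the sequential numbers, not confirmed against the htmlTable source: a factor stores integer codes, and any step that converts the data frame to a matrix of numbers exposes those codes instead of the labels. A small base-R sketch with two rows from the table above (the column names Quarter and April are hypothetical):

```r
# Build the columns as factors, as older R versions did by default for strings.
sample.df <- data.frame(
  Quarter = c("2013:3", "2013:4"),
  April   = c("253,963.49", "253,466.09"),
  stringsAsFactors = TRUE
)

# Which columns are factors?
sapply(sample.df, is.factor)

# data.matrix() replaces each factor by its internal integer codes.
codes <- data.matrix(sample.df)
codes

# Converting every column to character keeps the original text.
sample.df[] <- lapply(sample.df, as.character)
sapply(sample.df, class)
```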

R readHTMLTable() function error

I'm running into a problem when trying to use the readHTMLTable function in the R package XML. When running
library(XML)
library(RCurl)  # provides getURL()
baseurl <- "http://www.pro-football-reference.com/teams/"
team <- "nwe"
year <- 2011
theurl <- paste(baseurl,team,"/",year,".htm",sep="")
readurl <- getURL(theurl)
readtable <- readHTMLTable(readurl)
I get the error message:
Error in names(ans) = header :
'names' attribute [27] must be the same length as the vector [21]
I'm running 64 bit R 2.15.1 through R Studio 0.96.330. It seems there are several other questions that have been asked about the readHTMLTable() function, but none addressed this specific question. Does anyone know what's going on?
When readHTMLTable() complains about the 'names' attribute, it's a good bet that it's having trouble matching the data with what it's parsed for header values. The simplest way around this is to simply turn off header parsing entirely:
table.list <- readHTMLTable(theurl, header=F)
Note that I changed the name of the return value from "readtable" to "table.list". (I also skipped the getURL() call since 1. it didn't work for me and 2. readHTMLTable() knows how to handle URLs). The reason for the change is that, without further direction, readHTMLTable() will hunt down and parse every HTML table it can find on the given page, returning a list containing a data.frame for each.
The page you have sent it after is fairly rich, with 8 separate tables:
> length(table.list)
[1] 8
If you are only interested in a single table on the page, you can use the which argument to specify it and receive its contents as a data.frame directly.
This could also cure your original problem if it had choked on a table you're not interested in. Many pages still use tables for navigation, search boxes, etc., so it's worth taking a look at the page first.
But this is unlikely to be the case in your example since it actually choked on all but one of them. In the unlikely event that the stars aligned and you were only interested in the successfully parsed third table on the page (passing statistics) you could grab it like this, keeping header parsing on:
> passing.df = readHTMLTable(theurl, which=3)
> print(passing.df)
No. Age Pos G GS QBrec Cmp Att Cmp% Yds TD TD% Int Int% Lng Y/A AY/A Y/C Y/G Rate Sk Yds NY/A ANY/A Sk% 4QC GWD
1 12 Tom Brady* 34 QB 16 16 13-3-0 401 611 65.6 5235 39 6.4 12 2.0 99 8.6 9.0 13.1 327.2 105.6 32 173 7.9 8.2 5.0 2 3
2 8 Brian Hoyer 26 3 0 1 1 100.0 22 0 0.0 0 0.0 22 22.0 22.0 22.0 7.3 118.7 0 0 22.0 22.0 0.0
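The 'names' attribute error from the question can be reproduced in base R without any HTML, which may make the header mismatch easier to see. In this sketch the 27 header names and 21 values are placeholders chosen to match the numbers in the error message:

```r
# 21 parsed values but 27 header names: assigning the names fails.
row_values <- as.list(seq_len(21))
header <- paste0("col", seq_len(27))

msg <- tryCatch({
  names(row_values) <- header
  "no error"
}, error = function(e) conditionMessage(e))

msg
```

With header=F no such assignment is attempted, so the tables parse and the columns get default names instead.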
