Retrieve data that have similar values in one column - r

I have the following dataset:
Class Value
Drive 9.5
Analyser 6.35
GameGUI 12.09
Drive 9.5
Analyser 5.5
GameGUI 2.69
Drive 9.5
Analyser 9.10
GameGUI 6.1
I want to retrieve the classes that have similar values, which would be in the case of the example above is Drive. To do that I have the following command:
dataset[as.logical(ave(dataset$Value, dataset$Class, FUN = function(x) all(x==1))), ]
But this command returns only the classes that their values is always one. What I want is different, I don't want to give a specific value.

Related

Missing data warning R

I have a dataframe with climatic values like temperature_max, temperature_min... in diferent locations. The data collection is a time series data there are some especific days in which there are no data registration. I woul like to impute taking in account date and also the location (place variable in the dataframe)
I have tried to impute those missing values with amelia. But no imputation is done with warning information
Checking variables:
head(df): PLACE, DATE, TEMP_MAX, TEMP_MIN, TEMP_AVG
PLACE DATE TEMP_MAX TEMP_MIN TEMP_AVG
F 12/01/2007 19.7 2.5 10.1
F 13/01/2007 18.8 3.5 10.4
F 14/01/2007 17.3 2.4 10.4
F 15/01/2007 19.5 4.0 9.2
F 16/01/2007
F 17/01/2007 21.5 2.8 9.7
F 18/01/2007 17.7 3.3 12.9
F 19/01/2007 18.3 3.8 9.7
A 16/01/2007 17.7 3.4 9.7
A 17/01/2007
A 18/01/2007 19.7 6.2 10.4
A 19/01/2007 17.7 3.8 10.1
A 20/01/2007 18.6 3.8 12.9
This is just some of the records of my data set.
DF = amelia(df, m=4, ts= c("DATE"), cs = c("PLACE"))
where DATE is time series data (01/01/2001, 02/01/2001, 03/01/2001...) but if you filter by PLACE the time series is not equal (not the same star and end time).
I have 3 questions:
I am not sure if I should have the time series data complete for all the places, I mean same start and end time for all the places.
I am not using lags or polytime parameters so, am I imputting correctly taking in account time series influence? I am not sure about how to use lag parameter although I have checked the R package information.
The last question is that when I try to use that code there is a warning
and no imputation is done.
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 2 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 3 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 4 --
No missing data in bootstrapped sample: EM chain unnecessary
Can someone help me with this?
Thanks very much for your time!
For the software it does not matter if you have different start and end dates for different places. I think that it is more up to you and your thoughts on the data. I would ask myself, if those were missing data (missing at random) thus I would create empty rows in your data set or not.
You want to use lags in order to use past values of the variable to improve the prediction of missing values. It is not mandatory (i.e., the function can impute missing data even without such a specification) but it can be useful.
I contacted the author of the package and he told me that you need to specify the splinetime or polytime arguments to make sure that Amelia will use the time-series information to impute. For instance, if you set polytime = 3, it will impute based on a cubic of time. If you do that, I think you shouldn't see that error anymore.

Sparklyr: How to attach a group by to invoke method?

I have this Spark table:
xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
and a handle named xy_df that is connected to this table.
I want to invoke the selectExpr function to calculate the mean, something like:
xy_centered <- xy_df %>%
spark_dataframe() %>%
invoke("selectExpr", list("( y0-mean(y0) ) AS y0mean"))
which is also applicable to all other columns.
But when I run it, it gives this error:
Error: org.apache.spark.sql.AnalysisException: expression 'y0' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
I know this happens because, in common SQL rules, I didn't put a GROUP BY clause for columns contained in the aggregate function (mean). How do I put the GROUP BY to the invoke method?
Previously, I manage to do complete the purpose using another way, which is by:
Calculate the mean of each column by summarize_all
Collect the mean inside R
Apply this mean using invoke and selectExpr
as explained in this answer, but now I'm trying to speed up the execution time a bit by putting all operation inside the Spark itself, without retrieving anything to R.
My Spark version is 1.6.0

R readHTMLTable() function error

I'm running into a problem when trying to use the readHTMLTable function in the R package XML. When running
library(XML)
baseurl <- "http://www.pro-football-reference.com/teams/"
team <- "nwe"
year <- 2011
theurl <- paste(baseurl,team,"/",year,".htm",sep="")
readurl <- getURL(theurl)
readtable <- readHTMLTable(readurl)
I get the error message:
Error in names(ans) = header :
'names' attribute [27] must be the same length as the vector [21]
I'm running 64 bit R 2.15.1 through R Studio 0.96.330. It seems there are several other questions that have been asked about the readHTMLTable() function, but none addressed this specific question. Does anyone know what's going on?
When readHTMLTable() complains about the 'names' attribute, it's a good bet that it's having trouble matching the data with what it's parsed for header values. The simplest way around this is to simply turn off header parsing entirely:
table.list <- readHTMLTable(theurl, header=F)
Note that I changed the name of the return value from "readtable" to "table.list". (I also skipped the getURL() call since 1. it didn't work for me and 2. readHTMLTable() knows how to handle URLs). The reason for the change is that, without further direction, readHTMLTable() will hunt down and parse every HTML table it can find on the given page, returning a list containing a data.frame for each.
The page you have sent it after is fairly rich, with 8 separate tables:
> length(table.list)
[1] 8
If you were only interested in a single table on the page, you can use the which attribute to specify it and receive its contents as a data.frame directly.
This could also cure your original problem if it had choked on a table you're not interested in. Many pages still use tables for navigation, search boxes, etc., so it's worth taking a look at the page first.
But this is unlikely to be the case in your example since it actually choked on all but one of them. In the unlikely event that the stars aligned and you were only interested in the successfully-oarsed third table on the page (passing statistics) you could grab it like this, keeping header parsing on:
> passing.df = readHTMLTable(theurl, which=3)
> print(passing.df)
No. Age Pos G GS QBrec Cmp Att Cmp% Yds TD TD% Int Int% Lng Y/A AY/A Y/C Y/G Rate Sk Yds NY/A ANY/A Sk% 4QC GWD
1 12 Tom Brady* 34 QB 16 16 13-3-0 401 611 65.6 5235 39 6.4 12 2.0 99 8.6 9.0 13.1 327.2 105.6 32 173 7.9 8.2 5.0 2 3
2 8 Brian Hoyer 26 3 0 1 1 100.0 22 0 0.0 0 0.0 22 22.0 22.0 22.0 7.3 118.7 0 0 22.0 22.0 0.0

Memory problems with large-scale social network visualization using R and Cytoscape

I'm relatively new to R and am trying to solve the following problem:
I work on a Windows 7 Enterprise platform with the 32bit version of R
and have about 3GB of RAM on my machine. I have large-scale social
network data (c. 7,000 vertices and c. 30,000 edges) which are
currently stored in my SQL database. I have managed to pull this data
(omitting vertex and edge attributes) into an R dataframe and then
into an igraph object. For further analysis and visualization, I would
now like to push this igraph into Cytoscape using RCytoscape.
Currently, my approach is to convert the igraph object into an
graphNEL object since RCytoscape seems to work well with this object
type. (The igraph plotting functions are much too slow and lack
further analysis functionality.)
Unfortunately, I always run into memory issues when running this
script. It has worked previously with smaller networks though.
Does anyone have an idea on how to solve this issue? Or can you
recommend any other visualization and analysis tools that work well
with R and can handle such large-scale data?
Sorry for taking several days to get back to you.
I just ran some tests in which
1) an adjacency matrix is created in R
2) an R graphNEL is then created from the matrix
3) (optionally) node & edge attributes are added
4) a CytoscapeWindow is created, displayed, and layed out, and redrawn
(all times are in seconds)
nodes edges attributes? matrix graph cw display layout redraw total
70 35 no 0.001 0.001 0.5 5.7 2.5 0.016 9.4
70 0 no 0.033 0.001 0.2 4.2 0.5 0.49 5.6
700 350 no 0.198 0.036 6.0 8.3 1.6 0.037 16.7
1000 500 no 0.64 0.07 12.0 9.8 1.8 0.09 24.9
1000 500 yes 0.42 30.99 15.7 29.9 1.7 0.08 79.4
2000 1000 no 3.5 0.30 73.5 14.9 4.8 0.08 96.6
2500 1250 no 2.7 0.45 127.1 18.3 11.5 0.09 160.7
3000 1500 no 4.2 0.46 236.8 19.6 10.7 0.10 272.8
4000 2000 no 8.4 0.98 502.2 27.9 21.4 0.14 561.8
To my complete surprise, and chagrin, there is an exponential slowdown in 'cw' (the new.CytoscapeWindow method) --which makes no sense at all. It may be that your memory exhaustion is related to that, and is quite fixable.
I will explore this, and probably have a fix in the next week.
By the way, did you know that you can create a graphNEL directly from an adjacency matrix?
g = new ("graphAM", adjMat = matrix, edgemode="directed")
Thanks, Ignacio, for your most helpful report. I should have done these timing tests long ago!
Paul
It has been a while since I used Cytoscape so I am not exactly sure how to do it, but the manual states that you can use text files as input using the "Table Import" feature.
In igraph you can use the write.graph() function to export a graph in a bunch of ways. This way you can circumvent having to convert to a graphNEL object which might be enough to not run out of memory.

rank() doesn't rank properly when using with scienctific notation number

I tried to order csv file but the rank() function acting weird on number with -E notation.
> comparison = read.csv("e:/thesis/comparison/output.csv", header=TRUE)
> comparison$proxygeneld_full.txt[0:20]
[1] 9.34E-07 4.04E-06 4.16E-06 7.17E-06 2.08E-05 3.00E-05
[7] 3.59E-05 4.16E-05 7.75E-05 9.50E-05 0.0001116 0.00012452
[13] 0.00015494 0.00017892 0.00017892 0.00018345 0.0002232 0.000231775
[19] 0.00023241 0.0002666
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
> rank(comparison$proxygeneld_full.txt[0:20])
[1] 19.0 14.0 16.0 17.0 11.0 12.0 13.0 15.0 18.0 20.0 1.0 2.0 3.0 4.5 4.5
[16] 6.0 7.0 8.0 9.0 10.0
#It should be 1-20 in order ....
It seems just ignore -E notation right there. It turn out to be fine if I'm not using data from file
> rank(c(9.34E-07, 4.04E-06, 7.17E-06))
[1] 1 2 3
Am I missing something ? Thanks.
I guess you have some non-numeric data in your csv file.
What happens if you do?
as.numeric(comparison$proxygeneld_full.txt)
If this produces different numbers than you expected, you certainly have some text in this column.
Yep - $proxygeneld_full.txt[0:20] isn't even numeric. It is a factor:
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
So rank() is ranking the numeric codes that lay behind the factor representation, and the E-0X "numbers" sort after the non-E numbers in the levels.
Look at str(comparison) and you'll see that proxygeneld_full.txt is a factor.
I'm struggling to replicate the behaviour you are seeing with E numbers in a csv file. R reads them properly as numeric. Check your CSV to make sure you don't have some none numeric values in that column, or that the E numbers are not quoted.
Ahh! looking again at the levels you quote: there is an adjP lurking at the end of the code you show. Check your data again as this adjP is in there someone where and that is forcing R to code that variable as a factor hence the behaviour you see with ranking as I described above.

Resources