I want to start working with SparkR. I followed tutorials, but I get the error below:
library(SparkR)
Sys.setenv(SPARK_HOME="/Users/myuserhone/dev/spark-2.2.0-bin-hadoop2.7")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
spark <- sparkR.session(appName = "mysparkr", Sys.getenv("SPARK_HOME"), master = "local[*]")
csvPath <- "file:///Users/myuserhome/dev/spark-data/donation"
mySparkDF <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "?")
mySparkDF.show()
But I get:
Error in mySparkDF.show() : could not find function "mySparkDF.show"
Not sure what I'm doing wrong. In addition, I don't get code completion for the Spark functions like read.df(...).
Also, if I try
show(describe(mySparkDF))
or
show(summary(mySparkDF))
I get the metadata in the results, not the expected describe output:
SparkDataFrame[summary:string, id_1:string, id_2:string, cmp_fname_c1:string, cmp_fname_c2:string, cmp_lname_c1:string, cmp_lname_c2:string, cmp_sex:string, cmp_bd:string, cmp_bm:string, cmp_by:string, cmp_plz:string]
Is there anything I'm doing wrong?
show is not used that way in SparkR, nor does it serve the same purpose as the same-named command in PySpark; you should use either head or showDF:
df <- as.DataFrame(faithful)
show(df)
# result:
SparkDataFrame[eruptions:double, waiting:double]
head(df)
# result:
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
showDF(df)
# result:
+---------+-------+
|eruptions|waiting|
+---------+-------+
| 3.6| 79.0|
| 1.8| 54.0|
| 3.333| 74.0|
| 2.283| 62.0|
| 4.533| 85.0|
| 2.883| 55.0|
| 4.7| 88.0|
| 3.6| 85.0|
| 1.95| 51.0|
| 4.35| 85.0|
| 1.833| 54.0|
| 3.917| 84.0|
| 4.2| 78.0|
| 1.75| 47.0|
| 4.7| 83.0|
| 2.167| 52.0|
| 1.75| 62.0|
| 4.8| 84.0|
| 1.6| 52.0|
| 4.25| 79.0|
+---------+-------+
only showing top 20 rows
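Applied to the question's DataFrame (assuming mySparkDF was read as in the code above), a rough sketch of the same idea; note that describe() returns another SparkDataFrame, which is why show(describe(mySparkDF)) only prints its schema:
head(mySparkDF)                # first rows, returned as a local R data.frame
showDF(mySparkDF)              # Spark-style ASCII table, 20 rows by default
showDF(describe(mySparkDF))    # prints the summary statistics themselves
collect(describe(mySparkDF))   # or pull the whole summary into R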
I use the WGCNA package to analyze co-expressed genes. Here I try to form a data frame, analogous to the expression data, that will hold the clinical traits, and I use the following code:
Table for traitData:
| x | sample | NoduleperPlant |
|- |- |- |
| 1 | 1021_verbena_rep_1 | 2 |
| 2 | 1021_verbena_rep_2 | 3 |
| 3 | 1021_verbena_rep_3 | 1 |
| 4 | 1021_camporegio_rep_1 | 2 |
| 5 | 1021_camporegio_rep_2 | 3 |
| 6 | 1021_camporegio_rep_3 | 4 |
| 7 | BL225C_camporegio_rep_1 | 5 |
| 8 | BL225C_camporegio_rep_2 | 4 |
| 9 | BL225C_camporegio_rep_3 | 1 |
Table dfxpr (only some of the genes are shown):
|FIELD1 |aacC-1|aacC4-1|aapJ-1|aapM-1|aapP-1|aapQ-1|aarF-1|
|-----------------------|------|-------|------|------|------|------|------|
|X1021_verbena_rep_1 |42 |46 |12412 |935 |3354 |2876 |550 |
|X1021_verbena_rep_2 |52 |37 |11775 |946 |2970 |2824 |514 |
|X1021_verbena_rep_3 |12 |22 |5077 |397 |1462 |1228 |230 |
|X1021_camporegio_rep_1 |52 |71 |12983 |1454 |3408 |3248 |707 |
|X1021_camporegio_rep_2 |20 |65 |9240 |803 |2807 |3146 |445 |
|X1021_camporegio_rep_3 |28 |53 |11030 |1065 |3480 |3410 |582 |
|BL225C_camporegio_rep_1|29 |19 |6346 |375 |938 |768 |118 |
|BL225C_camporegio_rep_2|51 |62 |12938 |781 |1765 |1629 |291 |
|BL225C_camporegio_rep_3|52 |43 |6462 |504 |1120 |1091 |238 |
traitData = read.csv("NodulPerPlantTraitForLowGroup.csv"); # this CSV file contains 3 columns: the first holds irrelevant information, the second contains the sample names, and the third holds the values measured for the trait
# remove columns that hold information I do not need.
allTraits = traitData[, -1];
allTraits = allTraits[, 1:2];
# Form a data frame analogous to expression data that will hold the clinical traits.
lowNoduleSamples = rownames(dfxpr) #dfxpr is a data frame containing 9 observations (i.e. samples) and 6398 variables (i.e. genes)
traitRows = match(lowNoduleSamples, allTraits$sample); # here is the line where I get wrong values (NAs), although I know they should all match
datTraits = allTraits[traitRows, -1]; # then this line results in NAs too
rownames(datTraits) = allTraits[traitRows, 1];
collectGarbage();
How can I fix the problem?
I added drop = FALSE to this line: datTraits = allTraits[traitRows, -1]
datTraits = allTraits[traitRows, -1, drop = FALSE]
I realized that my allTraits contains only 2 columns; when I remove the first one, I'm left with just one column and R converts that into a single vector unless I add the drop = FALSE argument.
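A minimal illustration of why drop = FALSE matters, using toy data (not the real trait file):
tr <- data.frame(sample = c("s1", "s2"), NoduleperPlant = c(2, 3))
tr[, -1]                # drops to a plain numeric vector: 2 3
tr[, -1, drop = FALSE]  # stays a one-column data.frame, so rownames() can still be set on it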
I'm using the tableby function from the arsenal package to create summary tables. For most of the statistics I need to generate, this package gives me exactly the format I'm asked for, except for one. I need to get something like this in a single cell:
n (%) [95%CI of the percentage]
For now, I'm using the countpct function, which gives me the "n (%)", and binomCI, which gives me the proportion with its 95% CI, but that doubles the number of rows in my final table, so it's not ideal...
How can I get everything on the same line?
I tried to see if I could create another function from the original ones but I don't really understand their syntax...
Thanks for your help.
EDIT: Here is a reproducible example.
Code for the original functions can be found here.
So this is what I have now:
data<-NULL
data$Visit2<-c(rep("Responder",121),rep("Not Responder",29),rep("Responder",4),rep("Not Responder",47))
data$Group<-c(rep("Tx",150),rep("No Tx",51))
data<-as.data.frame(data)
library(arsenal)
my_controls <- tableby.control(test = F, total = F, cat.stats = c("countpct", "binomCI"), conf.level = 0.95)
summary(tableby(Group ~ Visit2,
data = data,
control = my_controls),
digits=2, digits.p=3, digits.pct=1)
# Results :
| | No Tx (N=51) | Tx (N=150) |
|:-------------------------------|:-----------------:|:-----------------:|
|**Visit2** | | |
| Not Responder | 47 (92.2%) | 29 (19.3%) |
| Responder | 4 (7.8%) | 121 (80.7%) |
| Not Responder | 0.92 (0.81, 0.98) | 0.19 (0.13, 0.27) |
| Responder | 0.08 (0.02, 0.19) | 0.81 (0.73, 0.87) |
And this is what I want:
| | No Tx (N=51) | Tx (N=150) |
|:----------------|:-------------------------:|:------------------------:|
|**Visit2**       |                           |                          |
|  Not Responder  |  47 (92.2%) [81.1, 97.8]  |  29 (19.3%) [13.3, 26.6] |
|  Responder      |  4 (7.8%) [2.2, 18.9]     | 121 (80.7%) [73.4, 86.7] |
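For reference, here is a base-R sketch (not an arsenal-native solution) of how the combined "n (%) [95% CI]" cells in the table above could be computed from the example data; fmt_cell is a made-up helper name:
fmt_cell <- function(x, level) {
  n <- sum(x == level)
  ci <- binom.test(n, length(x))$conf.int * 100
  sprintf("%d (%.1f%%) [%.1f, %.1f]", n, 100 * n / length(x), ci[1], ci[2])
}
# one row of the desired table per call, one cell per Group
with(data, tapply(Visit2, Group, fmt_cell, level = "Not Responder"))
with(data, tapply(Visit2, Group, fmt_cell, level = "Responder"))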
I have to combine two R data frames which have trade and quote information. Like a join, but based on a timestamp in seconds. I need to match each trade with the most recent quote. There are many more quotes than trades.
I have this table with stock quotes. The Timestamp is in seconds:
+--------+-----------+-------+-------+
| Symbol | Timestamp | bid | ask |
+--------+-----------+-------+-------+
| IBM | 10 | 132 | 133 |
| IBM | 20 | 132.5 | 133.3 |
| IBM | 30 | 132.6 | 132.7 |
+--------+-----------+-------+-------+
And these are trades:
+--------+-----------+----------+-------+
| Symbol | Timestamp | quantity | price |
+--------+-----------+----------+-------+
| IBM | 25 | 100 | 132.5 |
| IBM | 31 | 80 | 132.7 |
+--------+-----------+----------+-------+
I think a native R function or dplyr could do it - I've used both for basic purposes but not sure how to proceed here. Any ideas?
So the trade at 25 seconds should match the quote at 20 seconds, and the trade at 31 should match the quote at 30, like this:
+--------+-----------+----------+-------+-------+-------+
| Symbol | Timestamp | quantity | price | bid | ask |
+--------+-----------+----------+-------+-------+-------+
| IBM | 25 | 100 | 132.5 | 132.5 | 133.3 |
| IBM | 31 | 80 | 132.7 | 132.6 | 132.7 |
+--------+-----------+----------+-------+-------+-------+
Consider merging on a calculated field by increments of 10. Specifically, calculate a column for multiples of 10 in both datasets, and merge on that field with Symbol.
Below, within is used to assign the helper field, mult10, and transform to drop it afterwards; for creating the column, the two base functions are interchangeable:
final_df <- transform(merge(within(quotes, mult10 <- floor(Timestamp / 10) * 10),
                            within(trades, mult10 <- floor(Timestamp / 10) * 10),
                            by = c("Symbol", "mult10")),
                      mult10 = NULL)
Now, if multiples of 10 do not suit your needs, adjust to the level you require, such as 15, 5, 2, etc.:
within(quotes, mult10 <- floor(Timestamp / 15) * 15)
within(quotes, mult10 <- floor(Timestamp / 5) * 5)
within(quotes, mult10 <- floor(Timestamp / 2) * 2)
You may even need to do the reverse and use ceiling and floor for the two data sets respectively, rounding the quote's Timestamp up to the next multiple and the trade's Timestamp down to the previous one:
within(quotes, mult10 <- ceiling(Timestamp / 15) * 15)
within(trades, mult10 <- floor(Timestamp / 5) * 5)
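As an aside, if an exact "most recent quote" match is needed rather than a bucket approximation, data.table's rolling join does this directly. A minimal sketch (data.table is an alternative to the base-R approach above, not part of it):
library(data.table)

quotes <- data.table(Symbol = "IBM", Timestamp = c(10, 20, 30),
                     bid = c(132, 132.5, 132.6), ask = c(133, 133.3, 132.7))
trades <- data.table(Symbol = "IBM", Timestamp = c(25, 31),
                     quantity = c(100, 80), price = c(132.5, 132.7))

# For each trade, roll the most recent quote (at or before the trade's
# Timestamp) forward onto it.
quotes[trades, on = c("Symbol", "Timestamp"), roll = TRUE]
# expected result:
#    Symbol Timestamp   bid   ask quantity price
# 1:    IBM        25 132.5 133.3      100 132.5
# 2:    IBM        31 132.6 132.7       80 132.7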
I'm using a nested apply call to get a list of p-values from cor.test between each pair of columns from two tables.
hel_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
The otud data.frame is 90x11 (90 rows, 11 columns, i.e. dim(otud) is 90 11) and will be used with different data.frames.
bc and hel are both 90x2 data.frames, so for each I get 2*11 = 22 p-values out of the functions.
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
hel_plist<-apply(hel, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
For bc I get an output with dim = NULL: a nested list of p-values of the form $bcname$otuname (the format I have always gotten from these scripts and am happy with).
But for hel I get an output with dim 11 2: an 11x2 table with the p-values written inside.
Shortened examples of the output:
hel_plist
+--------+--------------+--------------+
| | axis1 | axis2 |
+--------+--------------+--------------+
| Otu037 | 1.126362e-18 | 0.01158251 |
| Otu005 | 3.017458e-2 | NULL |
| Otu068 | 0.00476002 | NULL |
| Otu070 | 1.27646e-15 | 5.252419e-07 |
+--------+--------------+--------------+
bc_plist
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
Why is that, when the input formats are all the same? (Shortened examples below.)
bc
+-------+-----------+-----------+
| group | axis1 | axis2 |
+-------+-----------+-----------+
| 1B041 | 0.125219 | 0.246319 |
| 1B060 | -0.022412 | -0.030227 |
| 1B197 | -0.088005 | -0.305351 |
| 1B222 | -0.119624 | -0.144123 |
| 1B227 | -0.148946 | -0.061741 |
+-------+-----------+-----------+
hel
+-------+---------------+---------------+
| group | axis1 | axis2 |
+-------+---------------+---------------+
| 1B041 | -0.0667782322 | -0.1660606406 |
| 1B060 | 0.0214470932 | -0.0611351008 |
| 1B197 | 0.1761876858 | 0.0927570627 |
| 1B222 | 0.0681058251 | 0.0549292399 |
| 1B227 | 0.0516864361 | 0.0774155225 |
| 1B235 | 0.1205676221 | 0.0181712761 |
+-------+---------------+---------------+
How could I force my scripts to always produce "flat" outputs, as in the case of bc?
OK, the different outputs are caused by the NULL results from the conditional function in the bc_plist case. If I modify the code to replace the possible NULLs with NAs, I get 2D tables in every case.
So, to keep things consistent:
bc_nmds_plist<-apply(bc_nmds, 2, function(x) { apply(stoma_otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}else NA}) })
And I get a 2D table out for bc_nmds_plist too.
So I guess this can be called solved, as I now have a piece of code that produces predictable output on any correct input.
If anyone has an idea how to force the output to conform to the previous bc_plist format instead, I would still be interested, as I actually prefer that form:
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
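One way to always get that nested-list shape, sketched against the same inputs (assuming, as in the original apply calls, that all columns of bc and otud are numeric): iterate over the columns with lapply, which never simplifies its result, and drop the non-significant (NULL) entries afterwards.
bc_plist <- lapply(bc, function(x) {
  Filter(Negate(is.null), lapply(otud, function(y) {
    p <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value
    if (p < 0.05) p   # non-significant pairs return NULL and get filtered out
  }))
})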
When I try to print a table with the knitr::kable function, the word "id" appears in the column names. How can I change that?
Example:
> x <- structure(c(42.3076923076923, 53.8461538461538, 96.1538461538462,
2.56410256410256, 1.28205128205128, 3.84615384615385,
44.8717948717949, 55.1282051282051, 100),
.Dim = c(3L, 3L),
.Dimnames = structure(list(Condition1 = c("Yes", "No", "Sum"),
Condition2 = c("Yes", "No", "Sum")),
.Names = c("Condition1", "Condition2")), class = c("table", "matrix"))
> print(x)
Condition2
Condition1 Yes No Sum
Yes 42,31 2,56 44,87
No 53,85 1,28 55,13
Sum 96,15 3,85 100,00
> library(knitr)
> kable(x)
|id | Yes| No| Sum|
|:----|-----:|-----:|------:|
|Yes | 42,3| 2,56| 44,9|
|No | 53,8| 1,28| 55,1|
|Sum | 96,2| 3,85| 100,0|
Edit: I found the reason for this behavior in the knitr:::kable_mark function. But now I don't understand how to make it more flexible.
An alternative to kable might be the general S3 method of pander:
> library(pander)
> pander(x, style = 'rmarkdown')
| | Yes | No | Sum |
|:---------:|:-----:|:-----:|:-----:|
| **Yes** | 42.31 | 2.564 | 44.87 |
| **No** | 53.85 | 1.282 | 55.13 |
| **Sum** | 96.15 | 3.846 | 100 |
If you need to set the decimal mark to a comma, set the relevant option beforehand in your R session:
> panderOptions('decimal.mark', ',')
> pander(x, style = 'rmarkdown')
| | Yes | No | Sum |
|:---------:|:-----:|:-----:|:-----:|
| **Yes** | 42,31 | 2,564 | 44,87 |
| **No** | 53,85 | 1,282 | 55,13 |
| **Sum** | 96,15 | 3,846 | 100 |
There are also some other possible tweaks: http://rapporter.github.io/pander/#pander-options
I think the easiest way is to rip out and replace kable_mark completely. Note: this is quite dirty – but it seems to work, and there is no current way to customise how kable_mark works (you could submit a patch to knitr though).
km <- edit(knitr:::kable_mark)
# Now edit the code and remove lines 7 and 8.
unlockBinding('kable_mark', environment(knitr:::kable_mark))
assign('kable_mark', km, envir=environment(knitr:::kable_mark))
Explanation: First we edit the function and store the amended definition in a temporary variable. We remove the two lines
if (grepl("^\\s*$", cn[1L]))
cn[1L] = "id"
… of course you can also hard-code the amended function rather than editing it, or change the function around completely.
Next we use unlockBinding to make knitr:::kable_mark overridable. If we don’t do this, the next assign command wouldn’t work.
Finally, we assign the patched function back to knitr:::kable_mark. Done.
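For what it's worth, utils::assignInNamespace is a slightly tidier way to install the same patch, since it takes care of the namespace binding for you; a sketch (you still produce the edited function yourself, as above):
km <- edit(knitr:::kable_mark)   # as above: delete the lines that set cn[1L] to "id"
assignInNamespace("kable_mark", km, ns = "knitr")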