I'm using a double apply function to get a list of p-values for cor.test between any two columns of two tables.
hel_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
The otud data.frame is 90X11 (90rows,11 colums or to say dim(otud) 90 11) and will be used with different data.frames.
bc and hel - are both 90X2 data.frame-s - so for both I get 2*11=22 p-values out of functions
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
hel_plist<-apply(hel, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
For bc I will have an output with dim=NULL a list of elements of otunames$bcnames$ p-value (a format that I have always got from these scripts and are happy with)
But for hel I will get and output of dim(hel) 11 2 - an 11X2 table with p-values written inside.
Shortened examples of output.
hel_plist
+--------+--------------+--------------+
| | axis1 | axis2 |
+--------+--------------+--------------+
| Otu037 | 1.126362e-18 | 0.01158251 |
| Otu005 | 3.017458e-2 | NULL |
| Otu068 | 0.00476002 | NULL |
| Otu070 | 1.27646e-15 | 5.252419e-07 |
+--------+--------------+--------------+
bc_plist
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
Why is it like that when the input formats are all the same? (Shortened examples)
bc
+-------+-----------+-----------+
| group | axis1 | axis2 |
+-------+-----------+-----------+
| 1B041 | 0.125219 | 0.246319 |
| 1B060 | -0.022412 | -0.030227 |
| 1B197 | -0.088005 | -0.305351 |
| 1B222 | -0.119624 | -0.144123 |
| 1B227 | -0.148946 | -0.061741 |
+-------+-----------+-----------+
hel
+-------+---------------+---------------+
| group | axis1 | axis2 |
+-------+---------------+---------------+
| 1B041 | -0.0667782322 | -0.1660606406 |
| 1B060 | 0.0214470932 | -0.0611351008 |
| 1B197 | 0.1761876858 | 0.0927570627 |
| 1B222 | 0.0681058251 | 0.0549292399 |
| 1B227 | 0.0516864361 | 0.0774155225 |
| 1B235 | 0.1205676221 | 0.0181712761 |
+-------+---------------+---------------+
How could I force my scripts to always produce "flat" outputs as in the case of bc
OK different output-s are caused because of the NULL results from conditional function in bc_plist case. If I'd to modify code to replace possible NULL-s with NA-s I'd get 2d tables in any case.
So to keep things constant :
bc_nmds_plist<-apply(bc_nmds, 2, function(x) { apply(stoma_otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}else NA}) })
And I get a 2d tabel out for bc_nmds_plist too.
So I guess this thing can be called solved - as I now have a piece of code that produces predictable output on any correct input.
If anyone has any idea how to force the output to conform to previos bc_plist format instead I would still be interested as I do actually prefer that form:
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
Related
I am trying to parse the below data in Kusto. Need help.
[[ObjectCount][LinkCount][DurationInUs]]
[ChangeEnumeration][[88][9][346194]]
[ModifyTargetInLive][[3][6][595903]]
Need generic implementation without any hardcoding.
ideally - you'd be able to change the component that produces source data in that format to use a standard format (e.g. CSV, Json, etc.) instead.
The following could work, but you should consider it very inefficient
let T = datatable(s:string)
[
'[[ObjectCount][LinkCount][DurationInUs]]',
'[ChangeEnumeration][[88][9][346194]]',
'[ModifyTargetInLive][[3][6][595903]]',
];
let keys = toscalar(
T
| where s startswith "[["
| take 1
| project extract_all(#'\[([^\[\]]+)\]', s)
);
T
| where s !startswith "[["
| project values = extract_all(#'\[([^\[\]]+)\]', s)
| mv-apply with_itemindex = i keys on (
extend Category = tostring(values[0]), p = pack(tostring(keys[i]), values[i + 1])
| summarize b = make_bag(p) by Category
)
| project-away values
| evaluate bag_unpack(b)
--->
| Category | ObjectCount | LinkCount | DurationInUs |
|--------------------|-------------|-----------|--------------|
| ChangeEnumeration | 88 | 9 | 346194 |
| ModifyTargetInLive | 3 | 6 | 595903 |
I have a dataset in which I paste values in a dplyr chain and collapse with the pipe character (e.g. " | "). If any of the values in the dataset are blank, I just get recurring pipe characters in the pasted list.
Some of the values look like this, for example:
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
I want to match all the pipes that occur more than once and delete them, so that just the names appear like so:
correctstring = "| GHOULSBY,SCROGGINS | CAT,JOHNSON | |BURGLAR,PALA |"
I tried the following, but to no avail:
mutate(names = gsub('[\\|]{2,}', '', name_list))
The difficulty in this question is in formulating a regex which can selectively remove every pipe, except the ones we want to remain as actual separators between terms. We can match on the following pattern:
\|\s+(?=\|)
and then replace just empty string. This pattern will remove any pipe (and any following whitespace) so long as what follows is another pipe. A removal would not occur when a pipe is followed by an actual term, or when it is followed by the end of the string.
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
result <- gsub("\\|\\s+(?=\\|)", "", badstring, perl=TRUE)
result
[1] "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
Demo
Edit:
If you expect to have inputs like | | | which are devoid of any terms, and you would expect empty string as the output, then my solution would fail. I don't see an obvious way to modify the above regex, but you can handle this case with one more call to sub:
result <- sub("^\\|$", "", result)
We also might be able to modify the original pattern to use an alternation covering all cases:
result <- gsub("\\|\\s+(?=\\|)|(?:^\\|$)", "", badstring, perl=TRUE)
When rendering tables such as this one (using RStudio + knitr), there is unwanted indentation (see red zone in the image). How can I avoid such indentation?
I imagine there is some CSS involved, but if there was a way to even prevent rmarkdown from "considering" this as a list, it could simplify matters. This is needed for an R package, so heavy hacks are not really an option, but I'll gladly receive all suggestions. Thx.
The (grid) table:
+------------------------+------------------------------------+
| Variable | Stats / Values |
+========================+====================================+
| SomeVar1 | mean (sd) : 1500000.5 (288675.28)\ |
| [numeric] | min < med < max :\ |
| | 1000001 < 1500000.5 < 2e+06\ |
| | IQR (CV) : 499999.5 (0.19) |
+------------------------+------------------------------------+
| SomeVar2 | 1. AAAAAA\ |
| [factor] | 2. BBBBBB\ |
| | 3. CCCCCC\ |
| | 4. DDDDDD\ |
| | 5. EEEEEE\ |
| | 6. FFFFFF\ |
| | 7. GGGGGG\ |
| | 8. HHHHHH\ |
| | 9. IIIIII\ |
| | 10. JJJJJJ\ |
| | [ 102917 others ] |
+------------------------+------------------------------------+
The rendered html table:
I want to create a calculated field to use with the rpivotTable package, similar to the functionality seen in excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate defect rate (which would be Defective Units/Total Units) and I want to aggregate by either of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't seem to work and summed up defect rates (i.e., sum(Defective_Units/Total_Units)), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving my defect rates greater than 1. Is there anywhere I can declare a calculated field, which is just a formula evaluated post aggregation?
You're lucky - the creator of pivottable.js foresaw cases like yours (and mine, earlier today) by implementing an aggregator called "Sum over Sum" and a few more, likewise, cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll use "Sum over Sum" as parameter "aggregatorName", and the columns whose quotient we want in the "vals" parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars,rows="gear", cols=c("cyl","carb"),
aggregatorName = "Sum over Sum",
vals =c("mpg","disp"),
width="100%", height="400px")
Sum of var values by group with certain values excluded conditioned on the other variable.
How to do it elegantly without transposing?
So in the table below for each (fTicker, DATE_f), I seek to sum the values of wght with the value of wght conditioned on sTicker excluded from the sum.
In the table below, (excl_val,sTicker=A) |(fTicker=XLK, DATE_f = 6/20/2003) = wght_AAPL_6/20/2003_XLK + wght_AA_6/20/2003_XLK but not the wght for sTicker=A
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
| | | | | |
| | | | | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups with many sTicker in them (10 to 70), some sTicker may belong to several fTicker. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.
I did it by transposing in SAS with resulting file about 6 gb but the same approach in R, blew memory up to 40 gb and it's basically unworkable.
In R, I got as far as this
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it's just a simple sum (without excluding the necessary observation) and there is mismatch between rows length. If i could condition the sum to exclude the sTicker obs for wght from the summation, i think it might work.
About the excl_val length: i computed it in excel, for just 2 cells, that's why it's short
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful
(in particular, the columns should have the same length):
in this case, excl_val looks like a separate vector.
After putting the information it contains in the data.frame,
things become easier.
# Sample data
k <- 5
d <- data.frame(
sTicker = rep(LETTERS[1:k], k),
fTicker = rep(LETTERS[1:k], each=k),
DATE_f = sample( seq(Sys.Date(), length=2, by=1), k*k, replace=TRUE ),
wght = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))