R binning dataset and surface plot

I have a large data set that I am trying to discretise and create a 3d surface plot with:
rowColFoVCell wpbCount Feret
1 001001001001 1 0.58
2 001001001001 1 1.30
3 001001001001 1 0.58
4 001001001001 1 0.23
5 001001001001 2 0.23
6 001001001001 2 0.58
There are currently 695,302 rows in this data set. I am trying to discretise the third column, 'Feret', based on the second column: for each 'wpbCount', bin the 'Feret' values.
I think the solution will involve using cut but I am not sure how to go about this. I would like to end up with a data frame something like this:
wpbCount Feret Count
1 1 [0.0,0.2] 3
2 1 [0.2,0.4] 5
3 1 [0.4,0.6] 6
4 1 [0.6,0.8] 9
5 2 [0.0,0.2] 6
6 2 [0.4,0.6] 23

This is to answer the first part:
Create some data:
DF <- data.frame(wpbCount = sample(1:1000, 1000),
                 Feret = sample(seq(0, 1, 0.001), 1000))
1) Discretize
Use cut with right = FALSE so the intervals are closed on the left and open on the right, i.e. [).
I normally find this more useful than the default.
DF$cut_it <- cut(DF$Feret, right = FALSE,
                 breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1))
2) Aggregate
TABLE <- data.frame(table(DF$cut_it))
EDIT: Another attempt
library(data.table)
DT <- data.table(DF)
DT <- DT[, list(wpbCount = length(wpbCount),
                Feret = length(Feret)),
         by = cut_it]
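If what you are actually after is a count per (wpbCount, bin) pair, as in the desired output above, here is a sketch of that using data.table's per-group row counter .N (my addition, not part of the original answer):
# count rows in each (wpbCount, bin) combination; DF is the frame built above
counts <- data.table(DF)[, .(Count = .N), by = .(wpbCount, cut_it)]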
Perhaps you are just trying to discretize and not aggregate.
Try this:
DF2 <- data.frame(wpbCount = sample(1:3, 1000, replace = TRUE),
                  Feret = sample(seq(0, 1, 0.001), 1000))
DF2$Feret2 <- cut(DF2$Feret, right = FALSE,
                  breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.1))
DF2 <- DF2[, c(1, 3)]

Thanks very much for your help. I used the following functions in R:
x$bin <- cut(x$Feret, right = FALSE,
             breaks = seq(0, max(wpbFeatures$Feret), by = 0.1))
y <- aggregate(x$bin, by = x[c('wpbCount', 'bin')], length)
From your suggestions I have been able to get the data frame that I require:
  wpbCount       bin   x
1        1 [0.2,0.3)  72
2        2 [0.2,0.3) 142
3        3 [0.2,0.3) 224
4        4 [0.2,0.3) 299
5        5 [0.2,0.3) 421
6        6 [0.2,0.3) 479
Now I need to plot this in 3D, and I am not sure how to do so with a non-numerical column, i.e. the bin column, which is a factor.
Does anyone know how I can plot these three columns against each other?

Check out this link.
There are some 3d plots there. However, 3d plots aren't the greatest tool for analyzing data.
If you insist on the 3d approach, try stat_contour()
from the ggplot2 package.
However, a probably better approach is to do a few plots in 2d, or to use facet_grid().
Take a look at the current ggplot2 documentation as well.
Try this, based on your last answer (not tested):
ggplot(y, aes(wpbCount, x)) +
  geom_point() +
  facet_grid(. ~ bin)
The idea is to use the factor variable (in this case, bin) to facet the plot.
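If you do need a numeric axis for the bins (say, for a true 3d surface), one option beyond the facets suggested above is to replace the factor with the numeric midpoint of each interval. A sketch, assuming 0.1-wide bins starting at 0 as in your cut() call (bin_mid is my own name, not from the question):
# level k of the cut() factor corresponds to [(k-1)*0.1, k*0.1),
# so its midpoint is (k - 0.5) * 0.1
y$bin_mid <- (as.numeric(y$bin) - 0.5) * 0.1
# wpbCount, bin_mid and x are now all numeric, so they can be fed to
# a 3d plotting function such as scatterplot3d()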

Related

Generate buckets based on a column data then create another column storing values assigned to corresponding buckets

I have a dataframe which includes the 2 columns below:
|Systolic blood pressure |Urea Nitrogen|
|------------------------|-------------|
|155.86667|50.000000|
|140.00000|20.33333 |
|135.33333|33.857143|
|126.40000|15.285714|
|...|...|
I want to create 2 more columns, called Sys_points and BUN_points, based on the bucket criteria in the attached image; they will store the (not equally spaced) values of the Points column in the image. I have tried findInterval and cut, but can't find functions that allow me to assign values that are not in sequential order to buckets.
# findInterval
BUN_int <- seq(0, 150, by = 10)
data3$BUN <- findInterval(data3$`Urea Nitrogen`, BUN_int)
# cut
cut(data3$`Urea Nitrogen`, breaks = BUN_int, right = FALSE,
    dig.lab = c(0,2,4,6,8,9,11,13,15,17,19,21,23,25,27,28))
Is there any function that can help me with this?
Here’s how to do it using cut(). Note the use of -Inf and Inf to include <x and >=x bins.
bun_data$Sys_points <- cut(
bun_data$`Systolic blood pressure`,
breaks = c(5:20 * 10, Inf),
labels = c(28,26,24,23,21,19,17,15,13,11,9,8,6,4,2,0),
right = FALSE
)
bun_data$BUN_points <- cut(
bun_data$`Urea Nitrogen`,
breaks = c(-Inf, 1:15 * 10, Inf),
labels = c(0,2,4,6,8,9,11,13,15,17,19,21,23,25,27,28),
right = FALSE
)
Result:
Systolic blood pressure Urea Nitrogen Sys_points BUN_points
1 155.8667 50.00000 9 9
2 140.0000 20.33333 11 4
3 135.3333 33.85714 13 6
4 126.4000 15.28571 15 2
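One follow-up worth noting: cut() returns a factor, so the points above look numeric but are not. If you later want to add the two scores together (a sketch; total_points is my own name, not from the question), convert via the labels rather than the level codes:
# as.numeric() on a factor returns the level indices, not the labels,
# so go through as.character() first
bun_data$total_points <- as.numeric(as.character(bun_data$Sys_points)) +
  as.numeric(as.character(bun_data$BUN_points))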

ggplot2 alternatives to fill in barplots, occurrence of factor in multiple rows

I'm pretty new to R and I have a problem with plotting a barplot out of my data which looks like this:
condition answer
2 H
1 H
8 H
5 W
4 M
7 H
9 H
10 H
6 H
3 W
The data consists of 100 rows with conditions 1 to 10, each randomly generated 10 times (10 times condition 1, 10 times condition 8, ...). Each condition also has an answer, which can be H for Hit, M for Miss, or W for Wrong.
I want to plot the number of Hits for each condition in a barplot (for example 8 Hits out of 10 for condition 1, ...). For that I tried the following in ggplot2:
ggplot(data = test, aes(x = condition, fill = answer == "H")) +
  geom_bar() + labs(x = "Conditions", y = "Hitrate") +
  coord_cartesian(xlim = c(1, 10), ylim = c(0, 10)) +
  scale_x_continuous(breaks = seq(1, 10, 1))
The resulting plot is actually exactly what I need, except for the red color, which covers everything. You can see that conditions 3 to 5 have no blue bar, because there are no hits for these conditions.
Is there any way to get rid of the red color, and maybe to count the number of hits for the different conditions? I tried dplyr's count function, but it only shows the number of H's when there are some for a particular condition; conditions 3-5 were simply ignored by count, without even a 0 in the output, and I still need those numbers for the plot.
I'm sorry for this particularly long post, but I'm really at the end of my knowledge here. I'm open to suggestions or alternatives! Thanks in advance!
This is a situation where a little preprocessing goes a long way. I made sample data that would recreate the issue, i.e. has cases where there won't be any "H"s.
Instead of relying on ggplot to aggregate data in the way you want it, use proper tools. Since you mention dplyr::count, I use dplyr functions.
The preprocessing task is to count observations with answer "H", including cases where the count is 0. To make sure all combinations are retained, convert condition to a factor and set .drop = F in count, which is in turn passed to group_by.
library(dplyr)
library(ggplot2)
set.seed(529)
test <- data.frame(condition = rep(1:10, times = 10),
                   answer = c(sample(c("H", "M", "W"), 50, replace = TRUE),
                              sample(c("M", "W"), 50, replace = TRUE)))
hit_counts <- test %>%
  mutate(condition = as.factor(condition)) %>%
  filter(answer == "H") %>%
  count(condition, .drop = FALSE)
hit_counts
#> # A tibble: 10 x 2
#> condition n
#> <fct> <int>
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 2
#> 5 5 3
#> 6 6 0
#> 7 7 3
#> 8 8 2
#> 9 9 1
#> 10 10 1
Then just plot that. geom_col is the version of geom_bar for when you already have your y-values, instead of having ggplot tally them up for you.
ggplot(hit_counts, aes(x = condition, y = n)) +
  geom_col()
One option is to just filter out anything but where answer == "H" from your dataset, and then plot.
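A minimal sketch of that filtering idea (my own, assuming your test data frame and ggplot2 from above; note that conditions with zero hits simply get no bar):
ggplot(subset(test, answer == "H"), aes(x = condition)) +
  geom_bar() +
  labs(x = "Conditions", y = "Hits") +
  scale_x_continuous(breaks = seq(1, 10, 1))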
An alternative is to use a grouped bar plot, made by setting position = "dodge":
test <- data.frame(condition = rep(1:10, each = 10),
                   answer = sample(c('H', 'M', 'W'), 100, replace = TRUE))
ggplot(data = test) +
  geom_bar(aes(x = condition, fill = answer), position = "dodge") +
  labs(x = "Conditions", y = "Hitrate") +
  coord_cartesian(xlim = c(1, 10), ylim = c(0, 10)) +
  scale_x_continuous(breaks = seq(1, 10, 1))
Also note that if the condition is actually a categorical variable, it may be better to make it a factor:
test$condition <- as.factor(test$condition)
This means that you don't need the scale_x_continuous call, and that the grid lines will be cleaner.
Another option is to pick your fill colors explicitly and make FALSE transparent by using scale_fill_manual. Since FALSE comes alphabetically first, the first value to specify is FALSE, the second TRUE.
ggplot(data = test, aes(x = condition, fill = answer == "H")) +
  geom_bar() + labs(x = "Conditions", y = "Hitrate") +
  coord_cartesian(xlim = c(1, 10), ylim = c(0, 10)) +
  scale_x_continuous(breaks = seq(1, 10, 1)) +
  scale_fill_manual(values = c(alpha("red", 0), "cadetblue")) +
  guides(fill = FALSE)

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude the values that fall within it. I would then like to rewrite the data without those values and use the new columns in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way to write this so that the threshold is easy to change via arguments passed from Java (as I have done with the input file name), that's even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if one of them does not meet my quartile threshold (the 0.25 quantile). For example, if the quartile for O was 45000, then the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read in both columns as a data frame, can I test each column separately? Or can I find the quartiles and then use those values in code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
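To keep the pairs together, as the edited question asks, the same pattern extends to two columns: subset on both conditions at once. A sketch (my own, reusing inputFile and the column names from the question's csv):
vals <- read.csv(inputFile, header = TRUE)
# keep only rows where BOTH columns are above their own bottom quartile
keep <- vals$Abundance_O > quantile(vals$Abundance_O, 0.25) &
        vals$Abundance_S > quantile(vals$Abundance_S, 0.25)
vals_trimmed <- vals[keep, ]
cor(vals_trimmed$Abundance_O, vals_trimmed$Abundance_S)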
Although this is an old question, I came across it during my own research and arrived at a solution that someone may find interesting.
I first defined a function which converts a numerical vector into its quantile groups. The parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = 1:20
qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)
dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863  3  1
2: 0.70373594 0.4389152  3  2
3: 0.04604934 0.5301261  1  3
4: 0.10476643 0.1108709  1  1
5: 0.76907762 0.4913463  4  2
6: 0.38265848 0.9291649  2  4
Lastly, we keep only the rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]

reverse axis in R

I am trying to plot a simple picture like this, using 3 values (x, y, z) loaded from a text file.
I need the x-axis to go from the biggest numbers to the lowest (currently the biggest numbers are on the right; I need them on the left), so that the two zeros meet in the same corner. I am using this simple code:
library(scatterplot3d)
xyz <- read.table("excel")
scatterplot3d(xyz, xlim = c(0, 100000))
xyz
I have tried "rev" with no success; the picture always looks the same. Help will be greatly appreciated.
Sample data stored in file named "excel":
8884 20964 2
8928 5 1
9033 6 2
9261 61307 1
9435 64914 3
9605 5 2
9626 7 3
9718 5 3
10117 48941 7
10599 399 9
20834 5802 10
21337 3 8
21479 556 8
I want my (0, 0, 0) point to be in the front right bottom corner.
You can choose an angle between 90 and 270 degrees:
scatterplot3d(xyz, xlim = c(0, 100000), angle = ang)  # with 90 < ang < 270
for example:
z <- seq(-10, 10, 0.01)
x <- cos(z) + 1
y <- sin(z) + 1
scatterplot3d(x, y, z, highlight.3d = TRUE, col.axis = "blue", angle = 120,
              col.grid = "lightblue", main = "scatterplot3d - 1", pch = 20)
If you don't mind using the cloud function from the lattice package, you can simply pass the xlim arguments in reversed order:
require(lattice)
xyz <- read.table(text =
  "0 1 2
   1 2 3
   2 3 4
   3 4 5")
cloud(V3 ~ V1 * V2, data = xyz, scales = list(arrows = FALSE),
      drape = TRUE, xlim = c(3, 0))
You can rotate the axes with screen parameter to make it look the way you like.
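For example, a sketch (the rotation values are arbitrary, chosen only for illustration):
cloud(V3 ~ V1 * V2, data = xyz, scales = list(arrows = FALSE),
      drape = TRUE, xlim = c(3, 0),
      screen = list(z = 40, x = -70))  # rotate about the z-axis, then the x-axis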

Plotting only a subset of the points?

I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. I am computing it like this:
mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot = ggplot(mycounts, aes(x=Value, y=ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?
I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts)  # number of rows in the data frame
mycounts <- mycounts[sample(n, round(n/10)), ]  # keep a random 10% of the rows
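If you literally want every 10th (or 50th) row rather than a random sample, a one-line sketch of that:
mycounts_thin <- mycounts[seq(1, nrow(mycounts), by = 10), ]  # every 10th row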
Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
                   Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
  value_colname <- deparse(substitute(.value))
  ddply(df, ..., .fun = function(rdf) {
    xs <- rdf[[value_colname]]
    qxs <- sort(unique(.quantizer(xs)))
    data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
  })
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)
