R: creating uneven factor levels for a numeric variable

I have a set of values (100,000 entries) ranging from -0.20 to +0.15, which are return percentages.
The bulk of the values lies between -3.5% and +3.5%.
I am looking to convert this into a factor such that:
any return between -0.035 and +0.035 is binned in equal 0.005 increments,
anything between -0.20 and -0.035 is binned as one level, and
anything between +0.035 and +0.15 is binned as one level.
Any thoughts on how I can achieve this in R? I did try cut, but it seems to bin only in equal increments.

So I generated a vector holding the values (drawn from a uniform distribution):
library(data.table)
set.seed(555)                     # for reproducibility
N <- 100000                       # number of pseudo-random numbers to generate
min1 <- -0.035                    # lower cut-off for the regular bins
max1 <- 0.035                     # upper cut-off for the regular bins
samp <- runif(N, min = -0.2, max = 0.15)   # create the vector
level1 <- as.factor(ifelse(samp <= min1, paste0("(", min(samp), ",", min1, "]"), NA))   # single catch-all level below min1
level2 <- as.factor(ifelse(samp >= max1, paste0("[", max1, ",", max(samp), ")"), NA))   # single catch-all level above max1
incr <- 0.005
level3 <- cut(samp, seq(min1, max1, by = incr))   # the intermediate levels
dt <- data.table(samp, level1, level2, level3)    # put it all together
mylevels <- na.omit(unlist(matrix(t(dt[, -1]))))  # for each samp, the single non-NA range it falls into
For better visualization of results:
mylevels <- factor(mylevels, levels = unique(mylevels))
dt2 <- dt[, .(samp, levels = mylevels)]
samp levels
1: -0.07023653 (-0.199996188434307,-0.035]
2: 0.10889991 [0.035,0.149995080730878)
3: 0.04246077 [0.035,0.149995080730878)
4: -0.01193010 (-0.015,-0.01]
5: 0.02607736 (0.025,0.03]
---
99996: -0.04786692 (-0.199996188434307,-0.035]
99997: -0.08700210 (-0.199996188434307,-0.035]
99998: 0.09989973 [0.035,0.149995080730878)
99999: 0.10095336 [0.035,0.149995080730878)
100000: -0.05555869 (-0.199996188434307,-0.035]
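As an aside, cut() does accept an arbitrary breaks vector, so the two open-ended outer bins and the regular 0.005-wide inner bins can also be produced in a single call. A minimal sketch, reusing samp, min1, max1 and incr from above (brks and lvls are just illustrative names):
# one catch-all bin below min1, regular incr-wide bins in the middle,
# and one catch-all bin above max1
brks <- c(min(samp), seq(min1, max1, by = incr), max(samp))
lvls <- cut(samp, breaks = brks, include.lowest = TRUE)
table(lvls)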

Related

How to generate a sample time series data set containing multiple individuals of different length (rows)?

I want to simulate a time series data frame that contains observations of 5 variables that were taken on 10 individuals. I want the number of rows (observations) to be different between each individual. For instance, I could start with something like this:
ID = rep(c("alp", "bet", "char", "delta", "echo"), times = c(1000,1200,1234,980,1300))
in which case ID represents each unique individual (I would later turn this into a factor), and the number of times each ID was repeated would represent the length of measurements for that factor. I would next need to create a column called Time with sequences from 1:1000, 1:1200, 1:1234, 1:980, and 1:1300 (to represent the length of measurements for each individual). Lastly I would need to generate 5 columns of random numbers for each of the 5 variables.
There are tons of ways to go about generating this data set, but what would be the most practical way to do it?
You can do:
ID = c("alp", "bet", "char", "delta", "echo")
num = c(1000,1200,1234,980,1300)
df <- data.frame(ID = rep(ID, num), num = sequence(num))
df[paste0('rand', seq_along(ID))] <- rnorm(length(ID) * sum(num))
head(df)
# ID num rand1 rand2 rand3 rand4 rand5
#1 alp 1 0.1340386 0.95900538 0.84573154 0.7151784 -0.07921171
#2 alp 2 0.2210195 1.67105483 -1.26068288 0.9171749 -0.09736927
#3 alp 3 1.6408462 0.05601673 -0.35454240 -2.6609228 0.21615254
#4 alp 4 -0.2190504 -0.05198191 -0.07355602 1.1102771 0.88246516
#5 alp 5 0.1680654 -1.75323736 -1.16865142 -0.4849876 0.20559750
#6 alp 6 1.1683839 0.09932759 -0.63474826 0.2306168 -0.61643584
I have used rnorm here; you can use any other distribution to generate the random numbers.
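As a small follow-up sketch under the same setup: the question mentions turning ID into a factor later on, and rnorm can be swapped for any other generator, for example runif:
df$ID <- factor(df$ID)                                              # ID as a factor, as planned in the question
df[paste0('rand', seq_along(ID))] <- runif(length(ID) * sum(num))   # uniform draws instead of normal
str(df)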

New variable: sum of numbers from a list powered by value of different columns

This is my first question in Stackoverflow. I am not new to R, although I sometimes struggle with things that might be considered basic.
I want to calculate the count median diameter (CMD) for each of my rows from a Particle Size Distribution dataset.
My data looks like this (several rows and 53 columns in total):
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
2015-01-01 00:00:00 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2015-01-01 01:00:00 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
Each variable starting with "n" indicates the number of particles for the corresponding size (e.g. variable n3.16 = number of particles with a median size of 3.16 nm). I will divide the values by 100 prior to the calculations, to avoid numbers so large that they break the computation.
To compute the CMD, I need to do the following calculation:
CMD = (D1^n1 * D2^n2 * ... * Di^ni)^(1/N)
where Di is the diameter (to be extracted from the column name), ni is the number of particles for diameter Di, and N is the total sum of particles (sum of all the columns starting with "n").
To get the Di, I created a numeric list from the column names that start with n:
D <- as.numeric(gsub("n", "", names(data)[3:54]))
This is my attempt to create a new variable with the calculation of CMD, although it doesn't work.
data$cmd <- for i in 1:ncol(D) {
prod(D[[i]]^data[,i+2])
}
I also tried to use apply, but again, it didn't work:
data$cmd <- for i in 1:ncol(size) {
apply(data,1, function(x) prod(size[[i]]^data[,i+2])
}
I have different datasets from different sites with different numbers of columns, so I would like to make the code "universal".
Thank you very much
This should work (I had to mutilate your date variable because of read.table, but it is not involved in the calculations, so just ignore that):
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
N <- sum(df[3:11]) # did you mean the sum of all n.columns over all rows? if not, you'd need to edit this
> N
[1] 7235.488
D <- as.numeric(gsub("n", "", names(df)[3:11]))
> D
[1] 3.16 3.55 3.98 4.47 5.01 5.62 6.31 7.08 7.94
new <- t(apply(df[3:11], 1, function(x, y) (x^y), y = D))
> new
n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
[1,] 772457.6 41933406 336296640 9957341349 5.167135e+12 1.232886e+15 3.625318e+17 2.054007e+20 3.621747e+23
[2,] 7980615.0 5922074 348176502 25783108893 1.368736e+12 2.305272e+14 9.119184e+16 5.071946e+20 1.129304e+24
df$CMD <- rowSums(new)^(1/N)
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94 CMD
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082 1.007526
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846 1.007684
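For completeness, here is a log-space sketch of the formula exactly as stated in the question, i.e. the product over Di^ni with N taken as the per-row sum of the n columns (an assumption about the intended definition; ncols, counts and CMD2 are just illustrative names). Working in logs avoids the overflow that motivated dividing the counts by 100:
ncols <- 3:11                                     # the n.* columns in this small example
D <- as.numeric(gsub("n", "", names(df)[ncols]))
counts <- as.matrix(df[ncols])
# log(CMD) = (n1*log(D1) + ... + nk*log(Dk)) / (n1 + ... + nk), row by row
df$CMD2 <- exp(as.vector(counts %*% log(D)) / rowSums(counts))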

data.table slow aggregating on factor column

Came across this issue today. I have a data.table with some categorical fields (i.e. factors). Something like
set.seed(2016)
dt <- data.table(
ID=factor(sample(30000, 2000000, replace=TRUE)),
Letter=factor(LETTERS[sample(26, 2000000, replace=TRUE)])
)
dt
ID Letter
1: 5405 E
2: 4289 E
3: 25250 J
4: 4008 J
5: 14326 G
---
Now, I'd like to calculate the gini impurity for each column of dt, grouped by the values in ID.
My attempt:
giniImpurity <- function(vals){
  # Returns the gini impurity of a set of categorical values
  # vals can either be the raw category instances (vals=c("red", "red", "blue", "green"))
  # or named category frequencies (vals=c(red=2, blue=1, green=1))
  # Gini Impurity is the probability a value is incorrectly labeled when labeled
  # according to the distribution of classes in the set
  if(is(vals, "numeric")) counts <- vals else counts <- table(vals)
  total <- sum(counts)
  return(sum((counts/total)*(1-counts/total)))
}
# Calculate gini impurities
dt[, list(Samples=.N, ID.GiniImpurity=giniImpurity(ID), Letter.GiniImpurity=giniImpurity(Letter)), by=ID]
ID Samples ID.GiniImpurity Letter.GiniImpurity
1: 5405 66 0 0.9527
2: 4289 73 0 0.9484
3: 25250 60 0 0.9394
4: 4008 66 0 0.9431
5: 14326 79 0 0.9531
---
This works but it's incredibly slow. It seems that if I change ID from factor to numeric, it runs much quicker. Is this what I should do in practice or is there a less hacky way to speed up this operation?
Also, I know it's unnecessary to calculate the gini impurity of ID grouped by itself, but please look past this. My real dataset has many more categorical features which add to the slowness.
Also note that I'm using data.table version 1.9.7 (devel)
EDIT
Sorry guys... I just realized that when I tested this with ID as numeric instead of a factor, the speed-up came from inside giniImpurity() itself, due to the nature of how it works. I guess the call to table() is where the slowdown is. Still not 100% sure how to make this quicker, though.
Got it.
giniImpurities <- function(dt){
  # Returns pairs of categorical fields (cat1, cat2, GI) where GI is the weighted gini impurity of
  # cat2 relative to the groups determined by cat1
  #--------------------------------------------------

  # Subset dt by just the categorical fields
  catfields <- colnames(dt)[sapply(dt, is.factor)]
  cats1 <- dt[, catfields, with=FALSE]

  # Build a table to store the results
  varpairs <- CJ(Var1=catfields, Var2=catfields)
  varpairs[Var1==Var2, GI := 0]

  # Loop through each grouping variable
  for(catcol in catfields){
    print(paste("Calculating gini impurities by field:", catcol))
    setkeyv(cats1, catcol)
    impuritiesDT <- cats1[, list(Samples=.N), keyby=catcol]

    # Loop through each of the other categorical columns
    for(colname in setdiff(catfields, catcol)){
      # Get the gini impurity for each pair (catcol, other)
      counts <- cats1[, list(.N), by=c(catcol, colname)]
      impurities <- counts[, list(GI=sum((N/sum(N))*(1-N/sum(N)))), by=catcol]
      impuritiesDT[impurities, GI := GI]
      setnames(impuritiesDT, "GI", colname)
    }

    cats1.gini <- melt(impuritiesDT, id.vars=c(catcol, "Samples"))
    cats1.gini <- cats1.gini[, list(GI=weighted.mean(x=value, w=Samples)), by=variable]
    cats1.gini <- cats1.gini[, list(Var1=catcol, Var2=variable, GI)]
    varpairs[cats1.gini, `:=`(GI=i.GI), on=c("Var1", "Var2")]
  }

  return(varpairs[])
}
giniImpurities(dt)
Var1 Var2 GI
1: Letter Letter 0.0000000
2: Letter Letter2 0.9615258
3: Letter PGroup 0.9999537
4: Letter2 Letter 0.9615254
5: Letter2 Letter2 0.0000000
6: Letter2 PGroup 0.9999537
7: PGroup Letter 0.9471393
8: PGroup Letter2 0.9470965
9: PGroup PGroup 0.0000000
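As a further note on the table() bottleneck mentioned in the edit above, one way to sidestep it is tabulate() on the factor's underlying integer codes, which does the same counting with far less overhead. A sketch, not benchmarked here (giniImpurityFast is just an illustrative name):
giniImpurityFast <- function(vals){
  # count category frequencies via tabulate() on the integer codes instead of table()
  counts <- tabulate(as.integer(vals))
  p <- counts / sum(counts)
  sum(p * (1 - p))
}
dt[, list(Samples=.N, Letter.GiniImpurity=giniImpurityFast(Letter)), by=ID]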

Peak detection in Manhattan plot

The attached plot (a Manhattan plot) has chromosome positions from the genome on the x axis and -log(p) on the y axis, where p is the p-value associated with the points (variants) at that position.
I have used the following R code (from the gap package) to generate it:
require(gap)
affy <-c(40220, 41400, 33801, 32334, 32056, 31470, 25835, 27457, 22864, 28501, 26273,
24954, 19188, 15721, 14356, 15309, 11281, 14881, 6399, 12400, 7125, 6207)
CM <- cumsum(affy)
n.markers <- sum(affy)
n.chr <- length(affy)
test <- data.frame(chr=rep(1:n.chr,affy),pos=1:n.markers,p=runif(n.markers))
oldpar <- par()
par(cex=0.6)
colors <- c("red","blue","green","cyan","yellow","gray","magenta","red","blue","green", "cyan","yellow","gray","magenta","red","blue","green","cyan","yellow","gray","magenta","red")
mhtplot(test,control=mht.control(colors=colors),pch=19,bg=colors)
> head(test)
chr pos p
1 1 1 0.79296584
2 1 2 0.96675136
3 1 3 0.43870076
4 1 4 0.79825513
5 1 5 0.87554143
6 1 6 0.01207523
I am interested in getting the coordinates of the peaks of the plot above a certain threshold (-log(p)) .
If you want the rows with values above the 99th percentile:
# Add new column with log values
test = transform(test, log_p = -log10(test[["p"]]))
# Get the 99th percentile
pct99 = quantile(test[["log_p"]], 0.99)
...and get the values from the original data test:
peaks = test[test[["log_p"]] > pct99,]
> head(peaks)
chr pos p log_p
5 1 5 0.002798126 2.553133
135 1 135 0.003077302 2.511830
211 1 211 0.003174833 2.498279
586 1 586 0.005766859 2.239061
598 1 598 0.008864987 2.052322
790 1 790 0.001284629 2.891222
You can use this with any threshold. Note that I have not calculated the first derivative; see this question for some pointers:
How to calculate first derivative of time series
After calculating the first derivative, you can find the peaks by looking at points in the time series where the first derivative is (almost) zero. After identifying these peaks, you can check which ones are above the threshold.
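If you also want actual local maxima rather than every point above the cut-off, here is a minimal sketch, assuming test already has the log_p column added above and treating each chromosome separately (is_local_max and peak_rows are just illustrative names):
# TRUE where a value is at least as large as both of its neighbours
is_local_max <- function(x) {
  x >= c(-Inf, head(x, -1)) & x >= c(tail(x, -1), -Inf)
}
peak_rows <- do.call(rbind, lapply(split(test, test$chr), function(d) {
  d[is_local_max(d$log_p) & d$log_p > pct99, ]
}))
head(peak_rows)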
Based on my experience, after plotting the graph you can use the following R code to find the peak coordinates:
plot(x[,1], x[,2])
identify(x[,1], x[,2], labels=row.names(x))
Here x[,1] refers to the x coordinate (the genome position) and x[,2] is your -log10(p) value.
Now use the mouse to select a point and hit Enter, which will give you the peak location; then run the following code to get the coordinates:
coords <- locator(type="l")
coords

Plotting only a subset of the points?

I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. I am computing it like this:
mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot = ggplot(mycounts, aes(x=Value, y=ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?
I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts) # number of cases in data frame
mycounts <- mycounts[sample(n, round(n/10)), ] # get an n/10 sample to the same data frame
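If you literally want every n-th point instead of a random subset, a small sketch assuming the mycounts frame from the question with its Type and ecd columns (mycounts_thin is just an illustrative name):
mycounts_thin <- mycounts[order(mycounts$Type, mycounts$Value), ]       # keep the step order within each Type
mycounts_thin <- mycounts_thin[seq(1, nrow(mycounts_thin), by = 50), ]  # keep every 50th row
ggplot(mycounts_thin, aes(x = Value, y = ecd, colour = Type)) + geom_step()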
Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
                   Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
  value_colname <- deparse(substitute(.value))
  ddply(df, ..., .fun = function(rdf) {
    xs <- rdf[[value_colname]]
    qxs <- sort(unique(.quantizer(xs)))
    data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
  })
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)
