I want to repeat plotting for each section - r

I am trying to repeat plotting in R. My main commands are:
data<-read.csv(file.choose(),header=TRUE)
t=data[,1]
PCI=data[,2]
plot(t,PCI,xlim=c(0,30))
boxplot(t,PCI,xlim=c(0,30))
# do the fit (starting values are supplied via start= below)
ydata = data[,2]
xdata = data[,1]
fit = nls(ydata ~ alpha - beta*exp(-theta*(xdata^gama)),
          start=list(alpha=80, beta=15, theta=15, gama=-2))
# overlay the fitted curve on the scatterplot
new = data.frame(xdata = seq(min(xdata), max(xdata), len=200))
lines(new$xdata, predict(fit, newdata=new))
# residual diagnostics
qqnorm(resid(fit))
plot(fitted(fit), resid(fit), xlab="Fitted values", ylab="Residuals")
abline(0, 0)
My first column is the section number, my second column is t (= x) and my third column is PCI (= y). I want to repeat my plotting commands for each section individually, but I think I cannot use a loop since the number of data points in each section is not equal.
I would really appreciate your help since I am new to R.
SecNum t PCI
961 1 94.84
961 2 93.04
961 3 91.69
961 11 80.47
961 12 79.26
961 13 77
962 1 90.46
962 2 90.01
962 3 86.88
962 4 86.36
962 5 84.56
962 6 85.11
963 1 91.33
963 2 90.7
963 3 86.46
963 4 88.47
963 5 81.07
963 6 84.07
963 7 82.55
963 8 73.58
963 9 71.85
963 10 83.8
963 11 82.16

To repeat your code for each different SecNum in your data, do something like:
sections <- unique(data$SecNum)
for (sec in sections) {
  # just get the data for that section
  data.section <- subset(data, SecNum == sec)
  # now do all your plotting commands. `data.section` is the
  # subset of `data` that corresponds to this SecNum only.
}
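For instance, here is a sketch of the full loop with the plotting and fitting commands filled in, assuming the three columns are named SecNum, t and PCI as in your data sample (the nls() starting values are the ones from your question and may need tuning per section):
data <- read.csv(file.choose(), header=TRUE)
for (sec in unique(data$SecNum)) {
  data.section <- subset(data, SecNum == sec)
  plot(data.section$t, data.section$PCI, xlim=c(0,30),
       xlab="t", ylab="PCI", main=paste("Section", sec))
  # wrap nls() in try() so one non-converging section does not stop the loop
  fit <- try(nls(PCI ~ alpha - beta*exp(-theta*(t^gama)), data=data.section,
                 start=list(alpha=80, beta=15, theta=15, gama=-2)))
  if (!inherits(fit, "try-error")) {
    grid <- data.frame(t = seq(min(data.section$t), max(data.section$t), len=200))
    lines(grid$t, predict(fit, newdata=grid))
  }
}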

Related

Create a vector from a specific sequence of intervals

I have 20 intervals:
10 intervals from 1 to 250 of size 25:
[1,25] [26,50] [51,75] [76,100] [101,125] [126,150] ... [226,250]
10 intervals from 251 to 1000 of size 75:
[251,325] [326,400] [401,475] [476,550] [551,625] ... [926,1000]
I would like to create a vector composed of the first 5 elements of each interval, like:
(1,2,3,4,5, 26,27,28,29,30, 51,52,53,54,55, 76,77,78,79,80, ....,
251,252,253,254,255, 326,327,328,329,330, ...)
How can I create this vector using R?
Let's assume you build the two sets of interval start points like:
interval1 <- seq(1, 250, 25)
interval2 <- seq(251, 1000, 75)
We can combine the two and then use mapply to build a 5-element sequence from each start:
new_interval <- c(interval1, interval2)
c(mapply(`:`, new_interval, new_interval + 4))
#[1] 1 2 3 4 5 26 27 28 29 30 51 52 53 54 .....
#[89] ..... 779 780 851 852 853 854 855 926 927 928 929 930
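On R >= 4.0.0, where sequence() gained a from argument, the same vector can be built without mapply (a sketch reusing new_interval from above):
sequence(rep(5L, length(new_interval)), from = as.integer(new_interval))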

lmList - loss of group information

I am using lmList to do linear models on many subsets of a data frame:
library(nlme)  # lmList here comes from nlme; lme4 provides a similar lmList
res <- lmList(Rds.on.fwd ~ Length | Wafer, data=sub, na.action=na.omit, pool=F)
This works fine, and I get the desired output (full output not shown):
(Intercept) Length
2492 5816.726 1571.260
2493 2520.311 1361.317
2494 3058.408 1286.516
2502 4727.328 1344.728
2564 3790.942 1576.223
2567 2350.296 1290.396
I have subsetted by "Wafer" (first column above). However, within my data frame ("sub"), the data is grouped by another factor "ERF" (there are many other factors but I am only concerned with "ERF"):
head(sub):
ERF Wafer Device Row Col Width Length Date Von.fwd Vth.fwd STS.fwd On.Off.fwd Ion.fwd Ioff.fwd Rds.on.fwd
1 474 2492 11.06E 11 6 100 5 09/10/2014 12:05 0.596747 3.05655 0.295971 7874420 0.000104 1.32e-11 9626.54
3 474 2492 11.08E 11 8 100 5 09/10/2014 12:05 0.581131 3.08380 0.299050 7890780 0.000109 1.38e-11 9193.62
5 474 2492 11.09E 11 9 100 5 09/10/2014 12:05 0.578171 3.06713 0.298509 8299740 0.000107 1.29e-11 9337.86
7 474 2492 11.10E 11 10 100 5 09/10/2014 12:05 0.565504 2.95532 0.298349 8138320 0.000109 1.34e-11 9173.15
9 474 2492 11.11E 11 11 100 5 09/10/2014 12:05 0.581289 2.97091 0.297885 8463620 0.000109 1.29e-11 9178.50
11 474 2492 11.12E 11 12 100 5 09/10/2014 12:05 0.578003 3.05802 0.294260 9326360 0.000112 1.20e-11 8955.51
I do not want ERF included in my lm, but I do want to keep the factor "ERF" with the lm results for colouring graphs later, i.e. I want this:
ERF Wafer (Intercept) Length
474 2492 5816.726 1571.260
474 2493 2520.311 1361.317
474 2494 3058.408 1286.516
475 2502 4727.328 1344.728
475 2564 3790.942 1576.223
476 2567 2350.296 1290.396
I know I could do this manually later by just adding a column to the results with a vector containing the correct sequence of ERF. However, I regularly add data to the set and don't want to do this every time. I'm sure there is a more elegant way?
Thanks
Edit - data added for solution:
library(plyr)
res <- ddply(sub, c("ERF", "Wafer"), function(x) coefficients(lm(Rds.on.fwd ~ Length, x)))
head(res)
ERF Wafer (Intercept) Length
1 474 2492 5816.726 1571.260
2 474 2493 2520.311 1361.317
3 474 2494 3058.408 1286.516
4 474 2502 4727.328 1344.728
5 479 2564 3790.942 1576.223
6 479 2567 2350.296 1290.396
If I drop ERF:
res <- ddply(sub, c("Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
head(res)
Wafer (Intercept) Length
1 2492 5816.726 1571.260
2 2493 2520.311 1361.317
3 2494 3058.408 1286.516
4 2502 4727.328 1344.728
5 2564 3790.942 1576.223
6 2567 2350.296 1290.396
Does this make sense? Did I ask the question incorrectly?
Ah, with a bit more research I've answered my own question, based on this answer:
Regression on subset of data set
Must look harder next time. I used ddply instead of lmList (which makes me wonder why anyone uses lmList... maybe I should ask another question?):
res1 <- ddply(sub, c("ERF", "Wafer"), function(x) coefficients(lm(Rds.on.fwd ~ Length, x)))
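If you would rather keep using lmList, a sketch of one alternative (assuming res is the lmList fit from above) is to join its coefficient table back onto the unique (ERF, Wafer) pairs in sub:
cf <- coef(res)                       # one row per Wafer; rownames are the Wafer levels
cf$Wafer <- as.numeric(rownames(cf))
merge(unique(sub[, c("ERF", "Wafer")]), cf, by="Wafer")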

Row wise operation on data.table

Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
                  y=sample(1:1000,1000),
                  z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[i,]))  # for i in 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the max and min across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax, .SD) - do.call(pmin, .SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
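If you only need the result for particular (numbered) rows, the same idea can be combined with data.table's sub-assignment by reference; a sketch (.SDcols pins .SD to the value columns):
dat[c(1, 5, 10), r := do.call(pmax, .SD) - do.call(pmin, .SD), .SDcols=c("x","y","z")]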
How about this:
dat[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
# actually this will be faster:
dat[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
Use .I to give you an index to call with the by= parameter; then you can run the function on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row, for example:
dat_Diffs <- apply(dat[seq(2,1000,by=2),], 1, function(x) diff(range(x)))
Or for rownames 1:10 (since their names weren't specified, they are just numbers counting up):
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,], 1, function(x) diff(range(x)))
But why not just calculate per row then subset later?
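That is, compute the whole column once and index it afterwards, for example (all_diffs is just an illustrative name):
all_diffs <- apply(dat[, c("x","y","z")], 1, function(r) diff(range(r)))
all_diffs[c(2, 4, 6)]  # keep only the rows you care about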

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function:
breaks <- seq(0, 250, by=5)
data <- split(df2, cut(df2$val, breaks))
My split dataframe looks like:
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins look like:
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I take the mean over the bins, I get on average ~900 samples:
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin will contain on average 900 samples (e.g. (0,27] = 900, (27,28.5] = 900, and so on). I found something similar here, which deals with only one variable, not the whole dataframe.
I also tried the Hmisc package; unfortunately the bins don't contain equal frequencies:
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
  df$var,
  breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
  include.lowest=T
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
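For example, a sketch using the same df (g= asks cut2 for that many equal-sized quantile groups):
library(Hmisc)
df.split2 <- split(df, cut2(df$var, g=50))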
This can also be done with the function provided here by Joris Meys:
EqualFreq2 <- function(x, n){
  nx <- length(x)
  nrepl <- floor(nx/n)
  nplus <- sample(1:n, nx - nrepl*n)
  nrep <- rep(nrepl, n)
  nrep[nplus] <- nrepl + 1
  x[order(x)] <- rep(seq.int(n), nrep)
  x
}
data <- split(df2, EqualFreq2(df2$val, 25))
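As a quick check (assuming the df2 from the question), each of the 25 groups should contain a near-identical number of rows:
table(EqualFreq2(df2$val, 25))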

How to divide a set of overlapping ranges into non-overlapping ranges? but in R

Let's say we have two datasets:
assays:
BHID<-c(127,127,127,127,128)
FROM<-c(950,959,960,961,955)
TO<-c(958,960,961,966,969)
Cu<-c(0.3,0.9,2.5,1.2,0.5)
assays<-data.frame(BHID,FROM,TO,Cu)
and litho:
BHID<-c(125,127,127,127)
FROM<-c(940,949,960,962)
TO<-c(949,960,961,969)
ROCK<-c(1,1,2,3)
litho<-data.frame(BHID,FROM,TO,ROCK)
and I want to join the two sets; the result after running the algorithm would be:
BHID FROM TO CU ROCK
125 940 970 - 1
127 949 950 - 1
127 950 958 0.3 1
127 958 959 - 1
127 959 960 0.9 1
127 960 961 2.5 2
127 961 962 1.2 -
127 962 966 1.2 3
127 966 969 - 3
128 955 962 0.5 -
Use merge:
merge(assays, litho, all=T)
In essence, all=T is the SQL equivalent of FULL OUTER JOIN. I haven't specified any columns, because in this case the merge function will perform the join across the columns with the same names (BHID, FROM and TO).
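On the data above this should yield the following, which joins the rows but does not split the overlapping ranges (only the (127, 960, 961) interval matches exactly):
#   BHID FROM  TO  Cu ROCK
# 1  125  940 949  NA    1
# 2  127  949 960  NA    1
# 3  127  950 958 0.3   NA
# 4  127  959 960 0.9   NA
# 5  127  960 961 2.5    2
# 6  127  961 966 1.2   NA
# 7  127  962 969  NA    3
# 8  128  955 969 0.5   NA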
Tough one, but the code seems to work. The idea is to first expand each row into many, each representing a one-unit increment from FROM to TO. After merging, identify contiguous rows and un-expand them. Obviously it is not a very efficient approach, so it may or may not work if your real data has very large FROM and TO ranges.
library(plyr)
ASSAYS <- adply(assays, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO = tail(SEQ, -1),
             Cu)
})
LITHO <- adply(litho, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO = tail(SEQ, -1),
             ROCK)
})
not.as.previous <- function(x) {
  x1 <- head(x, -1)
  x2 <- tail(x, -1)
  c(TRUE, !is.na(x1) & !is.na(x2) & x1 != x2 |
          is.na(x1) & !is.na(x2) |
          !is.na(x1) & is.na(x2))
}
MERGED <- merge(ASSAYS, LITHO, all = TRUE)
MERGED <- transform(MERGED,
                    gp.id = cumsum(not.as.previous(BHID) |
                                   not.as.previous(Cu) |
                                   not.as.previous(ROCK)))
merged <- ddply(MERGED, "gp.id", function(x) {
  out <- head(x, 1)
  out$TO <- tail(x$TO, 1)
  out
})
merged
# BHID FROM TO Cu ROCK gp.id
# 1 125 940 949 NA 1 1
# 2 127 949 950 NA 1 2
# 3 127 950 958 0.3 1 3
# 4 127 958 959 NA 1 4
# 5 127 959 960 0.9 1 5
# 6 127 960 961 2.5 2 6
# 7 127 961 962 1.2 NA 7
# 8 127 962 966 1.2 3 8
# 9 127 966 969 NA 3 9
# 10 128 955 969 0.5 NA 10
Note that the first row is not exactly the same as in your expected output, but I think mine makes more sense.
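As an aside, data.table's foverlaps() performs interval overlap joins natively and may be worth exploring; a sketch on the data above (it reports overlapping pairs, but you would still need to cut the overlaps into non-overlapping sub-intervals yourself):
library(data.table)
A <- as.data.table(assays)
L <- as.data.table(litho)
setkey(L, BHID, FROM, TO)     # the last two key columns define the interval
foverlaps(A, L, type="any")   # each assays row matched to the litho rows it overlaps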
