lmList - loss of group information - r

I am using lmList to do linear models on many subsets of a data frame:
res <- lmList(Rds.on.fwd~Length | Wafer, data=sub, na.action=na.omit, pool=F)
This works fine, and I get the desired output (full output not shown):
(Intercept) Length
2492 5816.726 1571.260
2493 2520.311 1361.317
2494 3058.408 1286.516
2502 4727.328 1344.728
2564 3790.942 1576.223
2567 2350.296 1290.396
I have subsetted by "Wafer" (first column above). However, within my data frame ("sub"), the data is grouped by another factor "ERF" (there are many other factors but I am only concerned with "ERF"):
head(sub):
ERF Wafer Device Row Col Width Length Date Von.fwd Vth.fwd STS.fwd On.Off.fwd Ion.fwd Ioff.fwd Rds.on.fwd
1 474 2492 11.06E 11 6 100 5 09/10/2014 12:05 0.596747 3.05655 0.295971 7874420 0.000104 1.32e-11 9626.54
3 474 2492 11.08E 11 8 100 5 09/10/2014 12:05 0.581131 3.08380 0.299050 7890780 0.000109 1.38e-11 9193.62
5 474 2492 11.09E 11 9 100 5 09/10/2014 12:05 0.578171 3.06713 0.298509 8299740 0.000107 1.29e-11 9337.86
7 474 2492 11.10E 11 10 100 5 09/10/2014 12:05 0.565504 2.95532 0.298349 8138320 0.000109 1.34e-11 9173.15
9 474 2492 11.11E 11 11 100 5 09/10/2014 12:05 0.581289 2.97091 0.297885 8463620 0.000109 1.29e-11 9178.50
11 474 2492 11.12E 11 12 100 5 09/10/2014 12:05 0.578003 3.05802 0.294260 9326360 0.000112 1.20e-11 8955.51
I do not want ERF including in my lm but I do want to keep the factor "ERF" with the lm results for colouring graphs later i.e. I want this:
ERF Wafer (Intercept) Length
474 2492 5816.726 1571.260
474 2493 2520.311 1361.317
474 2494 3058.408 1286.516
475 2502 4727.328 1344.728
475 2564 3790.942 1576.223
476 2567 2350.296 1290.396
I know I could do this manually later by just adding a column to the results with a vector containing the correct sequence of ERF. However, I regularly add data to the set and dont want to do this every time. Im sure there is a more elegant way?
Thanks
Edit - data added for solution:
res <- ddply(sub, c("ERF", "Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
head(res)
ERF Wafer (Intercept) Length
1 474 2492 5816.726 1571.260
2 474 2493 2520.311 1361.317
3 474 2494 3058.408 1286.516
4 474 2502 4727.328 1344.728
5 479 2564 3790.942 1576.223
6 479 2567 2350.296 1290.396
If I drop ERF:
res <- ddply(sub, c("Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
head(res)
Wafer (Intercept) Length
1 2492 5816.726 1571.260
2 2493 2520.311 1361.317
3 2494 3058.408 1286.516
4 2502 4727.328 1344.728
5 2564 3790.942 1576.223
6 2567 2350.296 1290.396
Does this made sense? Did i ask the question incorrectly?

Ah, with a bit more research i've answer my own question based on this answer:
Regression on subset of data set
Must look harder next time. I used ddply instead of lmList (makes me wonder why anyone uses lmList...maybe I should ask another question?):
res1 <- ddply(sub, c("ERF", "Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))

Related

R: How to compare values in a column with later values in the same column

I am attempting to work with a large dataset in R where I need to create a column that compares the value in an existing column to all values that follow it (ex: row 1 needs to compare rows 1-10,000, row 2 needs to compare rows 2-10,000, row 3 needs to compare rows 3-10,000, etc.), but cannot figure out how to write the range.
I currently have a column of raw numeric values and a column of row values generated by:
samples$row = seq.int(nrow(samples))
I have attempted to generate the column with the following command:
samples$processed = min(samples$raw[samples$row:10000])
but get the error "numerical expression has 10000 elements: only the first used" and the generated column only has the value for row 1 repeated for each of the 10,000 rows.
How do I need to write this command so that the lower bound of the range is the row currently being calculated instead of 1?
Any help would be appreciated, as I have minimal programming experience.
If all you need is the min of the specific row and all following rows, then
rev(cummin(rev(samples$val)))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
If you have some other function that doesn't have a cumulative variant (and your use of min is just a placeholder), then one of:
mapply(function(a, b) min(samples$val[a:b]), seq.int(nrow(samples)), nrow(samples))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
sapply(seq.int(nrow(samples)), function(a) min(samples$val[a:nrow(samples)]))
The only reason to use mapply over sapply is if, for some reason, you want window-like operations instead of always going to the bottom of the frame. (Though if you wanted windows, I'd suggest either the zoo or slider packages.)
Data
set.seed(42)
samples <- data.frame(val = sample(1000, size=20))
samples
# val
# 1 561
# 2 997
# 3 321
# 4 153
# 5 74
# 6 228
# 7 146
# 8 634
# 9 49
# 10 128
# 11 303
# 12 24
# 13 839
# 14 356
# 15 601
# 16 165
# 17 622
# 18 532
# 19 410
# 20 882

sample() not sampling randomly? [duplicate]

This question already has an answer here:
Sample() in R returning non-random sample after population vector length > 13. Why? [duplicate]
(1 answer)
Closed 7 years ago.
Can someone explain whats going wrong here? I wanted to simulate 10,000 20-sided dice rolls. I used this code:
x <- sample(1:20,10000,replace=T)
but that give me this:
hist(x)
It seems to be a problem above 12:
What am I not understanding here?
Thanks
It's not actually to do with your sample, it's hist.
If you do this
set.seed(1)
x <- sample(1:20,10000,replace=T)
table(x)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
513 522 482 495 459 549 506 505 518 498 495 492 440 490 459 509 496 528 511 533
you'll notice it's random. However hist reproduces your graph. If you count the bars you'll notice there are 19 and not 20.
Trying this instead:
bins <- seq(0, 20, by=1)
hist(x, breaks=bins)
gives a graph with even bar heights because all 20 bars are shown (i.e. 1 and 2 are not collapsed together).

Loop for subsetting data.frame

I work with neuralnet package to predict values of stocks (diploma thesis). The example data are below
predict<-runif(23,min=0,max=1)
day<-c(369:391)
ChoosedN<-c(2,5,5,5,5,5,4,3,5,5,5,2,1,1,5,5,4,3,2,3,4,3,2)
Profit<-runif(23,min=-2,max=5)
df<-data.frame(predict,day,ChoosedN,Profit)
colnames(df)<-c('predict','day','ChoosedN','Profit')
But I haven't always same period for investments (ChoodedN). For backtest the neural site I have to skip the days when I am still in position even if the neural site says 'buy it' (i.e.predict > 0.5). The frame looks like this
predict day ChoosedN Profit
1 0.6762981061 369 2 -1.6288823350
2 0.0195611224 370 5 1.5682195597
3 0.2442795106 371 5 0.6195915225
4 0.9587601107 372 5 -1.9701975542
5 0.7415729680 373 5 3.7826137026
6 0.4814927997 374 5 4.1228808255
7 0.1340754859 375 4 3.7818792837
8 0.6316874851 376 3 0.7670884461
9 0.1107241728 377 5 -1.3367400097
10 0.5850426450 378 5 2.2848396166
11 0.2809308425 379 5 2.5234691438
12 0.2835292015 380 2 -0.3291319925
13 0.3328713216 381 1 4.7425349397
14 0.4766904986 382 1 -0.4062103292
15 0.5005860797 383 5 4.8612083721
16 0.2734292494 384 5 -0.2320077328
17 0.1488479455 385 4 2.6195679584
18 0.9446908936 386 3 0.4889716264
19 0.8222738281 387 2 0.7362413658
20 0.7570014759 388 3 4.6661250258
21 0.9988698252 389 4 2.6340743946
22 0.8384663551 390 3 1.0428046484
23 0.1938821415 391 2 0.8855748393
And I need to create new data.frame this way.For example:If predict (in first row) > 0.5,delete second and third row (because ChoosedN in first row is 2 so next two after first row has to be delete, because there we were still in position). And continue on fourth the same way (if predict (fourth row) > 0.5, delete next five rows and so. And of course, if predict <=0.5 delete this row too.
Any straightforward way how to do it with some loop?
Thanks
I would create a new dataframe, then bind the rows you want using rbind inside of a for loop
newDF <- data.frame() # New, Empty Dataframe
i = 1 # Loop index Variable
while (i < nrow(df)) {
if (df$predict[i] > 0.5) { # If predict > 0.5,
newDF <- rbind(newDF, df[i,]) # Bind the row
i = i + df$ChoosedN[i] # Adjust for ChoosedN rows
}
i = i + 1 # Move to the next row
}

How to obtain a new table after filtering only one column in an existing table in R?

I have a data frame having 20 columns. I need to filter / remove noise from one column. After filtering using convolve function I get a new vector of values. Many values in the original column become NA due to filtering process. The problem is that I need the whole table (for later analysis) with only those rows where the filtered column has values but I can't bind the filtered column to original table as the number of rows for both are different. Let me illustrate using the 'age' column in 'Orange' data set in R:
> head(Orange)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
Convolve filter used
smooth <- function (x, D, delta){
z <- exp(-abs(-D:D/delta))
r <- convolve (x, z, type='filter')/convolve(rep(1, length(x)),z,type='filter')
r <- head(tail(r, -D), -D)
r
}
Filtering the 'age' column
age2 <- smooth(Orange$age, 5,10)
data.frame(age2)
The number of rows for age column and age2 column are 35 and 15 respectively. The original dataset has 2 more columns and I like to work with them also. Now, I only need 15 rows of each column corresponding to the 15 rows of age2 column. The filter here removed first and last ten values from age column. How can I apply the filter in a way that I get truncated dataset with all columns and filtered rows?
You would need to figure out how the variables line up. If you can add NA's to age2 and then do Orange$age2 <- age2 followed by na.omit(Orange) you should have what you want. Or, equivalently, perhaps this is what you are looking for?
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
Tree age circumference age2
11 2 1004 156 915.1678
12 2 1231 172 876.1048
13 2 1372 203 841.3156
14 2 1582 203 911.0914
15 3 118 30 948.2045
16 3 484 51 1008.0198
17 3 664 75 955.0961
18 3 1004 108 915.1678
19 3 1231 115 876.1048
20 3 1372 139 841.3156
21 3 1582 140 911.0914
22 4 118 32 948.2045
23 4 484 62 1008.0198
24 4 664 112 955.0961
25 4 1004 167 915.1678
Edit: If you know the first and last x observations will be removed then the following works:
x <- 2
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2

creating vector from 'if' function using apply in R

I'm tyring to create new vector in R using an 'if' function to pull out only certain values for the new array. Basically, I want to segregate data by day of week for each of several cities. How do I use the apply function to get only, say, Tuesdays in a new array for each city? Thanks
It sounds as though you don't want if or apply at all. The solution is simpler:
Suppose that your data frame is data. Then subset(data, Weekday == 3) should work.
You don't want to use the R if. Instead use the subsetting function [
dat <- read.table(text=" Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header=TRUE)
dat[ dat$Weekday==3, ]

Resources