Loop linear regression model - R

I have data like this, where Amount is the dependent variable and len, Age, quantity and Pos are explanatory variables. I am trying to fit a stepwise regression of Amount on these variables.
ID Sym Month Amount len Age quantity Pos
11 10 1 500 5 17 0 12
22 10 1 300 6 11 0 11
33 10 1 200 2 10 0 10
44 10 1 100 2 11 0 11
55 10 1 150 4 15 0 12
66 10 1 250 4 16 0 14
11 20 1 500 5 17 0 12
22 20 1 300 6 11 0 11
33 20 1 200 2 10 0 10
44 20 1 100 2 11 0 11
55 20 1 150 4 15 0 12
66 20 1 250 4 16 0 14
77 20 1 700 4 17 0 11
88 20 1 100 2 16 0 12
11 30 1 500 5 17 0 12
22 30 1 300 6 11 0 11
33 30 1 200 2 10 0 10
44 30 1 100 2 11 0 11
55 30 1 150 4 15 0 12
66 30 1 250 4 16 0 14
11 40 1 500 5 17 2000 12
22 40 1 300 6 11 1000 11
33 40 1 200 2 10 1000 10
44 40 1 100 2 11 1000 11
55 40 1 150 4 15 1000 12
66 40 1 250 4 16 1000 14
And after running all the regressions, the output I want is a dataframe that looks like this (it should help me detect outliers):
Id Month Sym Amount len Age Quantity Pos R^2 CookDistance Residuals UpperLimit LowerLimit
11 1 10 500 5 17 null 12 0.7 1.5 -350 -500 1000
22 1 10 300 6 11 null 11 0.8 1.7 -400 -500 1000
This is the code that I am trying to run for Sym = 10, Sym = 20, Sym = 30, Sym = 40.
I have around 400 Sym values to run a regression analysis on.
fit[[i]] <- step(lm(Amount ~ len + Age + Quantity, data = Sym[[i]],
                    na.action = na.omit), direction = "backward")
R_Sq[i] <- summary(fit[[i]])$r.squared
Res[[i]] <- resid(fit[[i]])
D[[i]] <- cooks.distance(fit[[i]])
Q[[i]] <- quantile(resid(fit[[i]]), c(.25, .50, .75, .99))
L[i] <- Q[[i]][1] - 2.2 * (Q[[i]][3] - Q[[i]][1])
U[i] <- Q[[i]][3] + 2.2 * (Q[[i]][3] - Q[[i]][1])
"i" means the results for the regression of sym = i (10,20..).
Any way to do this on loop for every Sym value?
Any help will be highly appreciate.
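One possible approach (a sketch, not tested against your real data): split the data by Sym, fit one stepwise model per group, and bind the per-observation diagnostics back into a single data frame. `dat` and `run_one` below are assumed names; the 2.2 multiplier mirrors the code above. Note that `step()` can fail if a predictor is constant within a group (e.g. quantity is all 0 for some Sym values), so you may need to adjust the formula per group.

```r
# Sketch: one stepwise fit per Sym value, diagnostics collected per row.
# Assumes the data is in a data frame `dat` with the columns shown above
# and no missing rows within a group (so resid() aligns with the data).
run_one <- function(d) {
  fit <- step(lm(Amount ~ len + Age + quantity + Pos, data = d),
              direction = "backward", trace = 0)
  q   <- quantile(resid(fit), c(.25, .75))
  iqr <- q[2] - q[1]
  data.frame(d,
             R2         = summary(fit)$r.squared,
             CookDist   = cooks.distance(fit),
             Residuals  = resid(fit),
             LowerLimit = q[1] - 2.2 * iqr,
             UpperLimit = q[2] + 2.2 * iqr,
             row.names  = NULL)
}

results <- do.call(rbind, lapply(split(dat, dat$Sym), run_one))
```

`split()` returns one data frame per Sym value, so this scales to 400 groups without writing the loop by hand.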


Finding the k-largest clusters in dbscan result

I have a dataframe df consisting of 2 columns: x and y coordinates.
Each row refers to a point.
I feed it into dbscan function to obtain the clusters of the points in df.
library("fpc")
db = fpc::dbscan(df, eps = 0.08, MinPts = 4)
plot(db, df, main = "DBSCAN", frame = FALSE)
By using print(db), I can see the result returned by dbscan.
> print(db)
dbscan Pts=13131 MinPts=4 eps=0.08
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
border 401 38 55 5 2 3 0 0 0 8 0 6 1 3 1 3 3 2 1 2 4 3
seed 0 2634 8186 35 24 561 99 7 22 26 5 75 17 9 9 54 1 2 74 21 3 15
total 401 2672 8241 40 26 564 99 7 22 34 5 81 18 12 10 57 4 4 75 23 7 18
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
border 4 1 2 6 2 1 3 7 2 1 2 3 11 1 3 1 3 2 5 5 1 4 3
seed 14 9 4 48 2 4 38 111 5 11 5 14 111 6 1 5 1 8 3 15 10 15 6
total 18 10 6 54 4 5 41 118 7 12 7 17 122 7 4 6 4 10 8 20 11 19 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
border 2 4 2 1 3 2 1 1 3 1 0 2 2 3 0 3 3 3 3 0 0 2 3 1
seed 15 2 9 11 4 8 12 4 6 8 7 7 3 3 4 3 3 4 2 9 4 2 1 4
total 17 6 11 12 7 10 13 5 9 9 7 9 5 6 4 6 6 7 5 9 4 4 4 5
69 70 71
border 3 3 3
seed 1 1 1
total 4 4 4
From the above summary, I can see cluster 2 consists of 8186 seed points (core points), cluster 1 consists of 2634 seed points and cluster 5 consists of 561 points.
I define the largest cluster as the one containing the largest number of seed points. So, in this case, the largest cluster is cluster 2, and the 1st, 2nd and 3rd largest clusters are 2, 1 and 5.
Is there any direct way to return the rows (points) in the largest cluster, or in the k-largest cluster in general?
I can do it in an indirect way.
I can obtain the assigned cluster number of each point by
db$cluster.
Hence, I can create a new dataframe df2 with db$cluster as the
new additional column besides the original x column and y
column.
Then, I can aggregate the df2 according to the cluster numbers in
the third column and find the number of points in each cluster.
After that, I can find the k-largest groups, which are 2, 1 and 5
again.
Finally, I can select the rows in df2 with third column value equals to 2 to return the points in the largest cluster.
But the above approach re-computes many known results as stated in the summary of print(db).
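In code, the indirect steps above come down to only a few lines (a sketch using the df and db objects already defined; db$isseed marks the core points, matching the seed-count definition of "largest"):

```r
df2 <- cbind(df, cluster = db$cluster)              # attach cluster labels
sizes <- table(df2$cluster[db$isseed])              # seed points per cluster
k_largest <- names(sort(sizes, decreasing = TRUE))  # cluster ids, largest first
biggest <- df2[df2$cluster == k_largest[1], ]       # rows of the largest cluster
```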
The dbscan function doesn't appear to retain the data.
library(fpc)
set.seed(665544)
n <- 600
df <- data.frame(x=runif(10, 0, 10)+rnorm(n, sd=0.2), y=runif(10, 0, 10)+rnorm(n,sd=0.2))
(dbs <- dbscan(df, 0.2))
#dbscan Pts=600 MinPts=5 eps=0.2
# 0 1 2 3 4 5 6 7 8 9 10 11
#border 28 4 4 8 5 3 3 4 3 4 6 4
#seed 0 50 53 51 52 51 54 54 54 53 51 1
#total 28 54 57 59 57 54 57 58 57 57 57 5
attributes(dbs)
#$names
#[1] "cluster" "eps" "MinPts" "isseed"
#$class
#[1] "dbscan"
Your indirect steps are not that indirect (only two lines needed), and these commands won't recalculate the clusters. So just run those commands, or put them in a function and then call the function in one command.
cluster_k <- function(dbs, data, k){
  kth <- names(rev(sort(table(dbs$cluster)))[k])
  data[dbs$cluster == kth, ]
}
cluster_k(dbs=dbs, data=df, k=1)
## x y
## 3 6.580695 8.715245
## 13 6.704379 8.528486
## 23 6.809558 8.160721
## 33 6.375842 8.756433
## 43 6.603195 8.640206
## 53 6.728533 8.425067
## a data frame with 59 rows

R Data.Table Filter by table A in table B

The goal of this code is to find the quadrant of a given point with respect to a given circle equation.
I have two separate data.tables. In Table A, I have different variations of the circle-equation variables. In Table B, I have raw data for counting how many points lie in each circle quadrant. I follow this sequence:
Get the circle equation from Table A
Filter Table B to the points lying within the circle's coordinate range
Find the quadrant each point lies in (getQuadrant function)
Count how many points lie in each quadrant (Quadrants function)
I have made some attempts, but they are rather slow to return results. The tables are as follows:
set.seed(4)
TableA <- data.table(speed = rep(42:44, each = 3),
                     minX = rep(1:12, 3),
                     maxX = rep(10:21, 3),
                     minY = 1,
                     maxY = 10,
                     r = 5,
                     cX = rep(6:17, 3),
                     cY = 6,
                     indx = 1:36)
TableA
speed minX maxX minY maxY r cX cY indx
1: 42 1 10 1 10 1 2 2 1
2: 42 2 11 1 10 1 2 2 2
3: 42 3 12 1 10 1 2 2 3
4: 43 1 10 1 10 1 2 2 4
5: 43 2 11 1 10 1 2 2 5
6: 43 3 12 1 10 1 2 2 6
7: 44 1 10 1 10 1 2 2 7
8: 44 2 11 1 10 1 2 2 8
9: 44 3 12 1 10 1 2 2 9
TableB <- data.table(speed = rep(42:44, each = 100),
                     x = rep(sample(12), 100),
                     y = rep(sample(12), 100),
                     n = rep(sample(12), 100))
TableB
speed x y n
1: 42 8 2 8
2: 42 1 11 10
3: 42 3 5 5
4: 42 10 10 12
5: 42 7 8 11
Function to find quadrant:
getQuadrant <- function(X = 0, Y = 0, R = 1, PX = 10, PY = 10){
  #' X and Y are the center of the circle
  #' R is the radius
  #' PX and PY are an arbitrary point
  # The point is at the center
  if (PX == X & PY == Y)
    return(0)
  val <- (PX - X)^2 + (PY - Y)^2
  # Outside the circle
  if (val > R^2)
    return(5)
  # 1st quadrant
  if (PX > X & PY >= Y)
    return(1)
  # 2nd quadrant
  if (PX <= X & PY > Y)
    return(2)
  # 3rd quadrant
  if (PX < X & PY <= Y)
    return(3)
  # 4th quadrant
  if (PX >= X & PY < Y)
    return(4)
}
Function to return the number of points in each quadrant (note: TableB's point columns are x and y, so those are what the mapply must pass to getQuadrant):
Quadrants <- function(dt, radius, centerX, centerY){
  #' dt is the data filtered for the circle
  #' radius is the radius of the circle equation
  #' centerX and centerY are the center point of the circle equation
  if (nrow(dt) > 0){
    dt[, quadrant := factor(mapply(function(X, Y, R, PX, PY)
          getQuadrant(X = X, Y = Y, R = R, PX = PX, PY = PY),
        centerX, centerY, radius, x, y),
      levels = c("1", "2", "3", "4", "5"))]
    dt <- dt[, .(.N), keyby = .(quadrant)]
    dt <- dt[CJ(levels(dt[, quadrant])), ]
    dd <- list(Q1 = dt$N[1], Q2 = dt$N[2], Q3 = dt$N[3], Q4 = dt$N[4], Q5 = dt$N[5])
  } else {
    dd <- list(Q1 = NA, Q2 = NA, Q3 = NA, Q4 = NA, Q5 = NA)
  }
  return(dd)
}
I have the following attempt, but it doesn't work:
finalTable <- TableA[, c('Q1','Q2','Q3','Q4','Q5') := mapply(
  function(a, b, c, d, e, f, g, h)
    Quadrants(TableB[, .SD[x %between% c(a, b) & y %between% c(c, d) & speed == h]],
              radius = e, centerX = f, centerY = g),
  minX, maxX, minY, maxY, r, cX, cY, speed)]
I don't think I am doing this right, because the results below are not the expected ones:
speed minX maxX minY maxY r cX cY indx Q1 Q2 Q3 Q4 Q5
1: 42 1 10 1 10 5 6 6 1 32 32 100 68 68
2: 42 2 11 1 10 5 7 6 2 32 32 100 68 68
3: 42 3 12 1 10 5 8 6 3 32 32 100 68 68
4: 43 4 13 1 10 5 9 6 4 32 32 100 68 68
...
11: 42 11 20 1 10 5 16 6 11 32 32 100 68 68
12: 42 12 21 1 10 5 17 6 12 32 32 100 68 68
13: 43 1 10 1 10 5 6 6 13 32 32 100 68 68
14: 43 2 11 1 10 5 7 6 14 32 32 100 68 68
15: 43 3 12 1 10 5 8 6 15 32 32 100 68 68
...
22: 43 10 19 1 10 5 15 6 22 32 32 100 68 68
23: 43 11 20 1 10 5 16 6 23 32 32 100 68 68
24: 43 12 21 1 10 5 17 6 24 32 32 100 68 68
25: 44 1 10 1 10 5 6 6 25 32 32 100 68 68
26: 44 2 11 1 10 5 7 6 26 32 32 100 68 68
27: 44 3 12 1 10 5 8 6 27 32 32 100 68 68
28: 42 4 13 1 10 5 9 6 28 32 32 100 68 68
...
35: 44 11 20 1 10 5 16 6 35 32 32 100 68 68
36: 44 12 21 1 10 5 17 6 36 32 32 100 68 68
Can anyone take a look, please? Any help is really appreciated.
Expected Output:
speed minX maxX minY maxY r cX cY indx Q1 Q2 Q3 Q4 Q5
1: 42 2 11 1 10 5 7 6 1 200 100 400 100 200
2: 42 3 12 1 10 5 8 6 2 200 100 300 100 200
3: 42 4 13 1 10 5 9 6 3 200 100 300 100 100
4: 42 5 14 1 10 5 10 6 4 100 200 300 NA 100
...
11: 42 12 21 1 10 5 17 6 11 NA NA NA NA NA
12: 42 13 22 1 10 5 18 6 12 NA NA NA NA NA
13: 43 2 11 1 10 5 7 6 13 200 100 400 100 200
14: 43 3 12 1 10 5 8 6 14 200 100 300 100 200
15: 43 4 13 1 10 5 9 6 15 200 100 300 100 100
...
22: 43 11 20 1 10 5 16 6 22 NA NA NA NA 100
23: 43 12 21 1 10 5 17 6 23 NA NA NA NA NA
24: 43 13 22 1 10 5 18 6 24 NA NA NA NA NA
25: 44 2 11 1 10 5 7 6 25 200 100 400 100 200
26: 44 3 12 1 10 5 8 6 26 200 100 300 100 200
27: 44 4 13 1 10 5 9 6 27 200 100 300 100 100
28: 44 5 14 1 10 5 10 6 28 100 200 300 NA 100
...
35: 44 12 21 1 10 5 17 6 35 NA NA NA NA NA
36: 44 13 22 1 10 5 18 6 36 NA NA NA NA NA
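One way to restructure this (a sketch, not benchmarked): group TableA by row so the filter-and-count runs once per circle, and replace the per-point mapply with vectorized arithmetic. countQuadrants below is a hypothetical helper that reproduces getQuadrant's rules over whole vectors at once.

```r
library(data.table)

# Vectorized re-implementation of getQuadrant: classify all points at once.
countQuadrants <- function(px, py, cx, cy, r) {
  q <- integer(length(px))                         # 0 = on the center
  q[px >  cx & py >= cy] <- 1L                     # 1st quadrant
  q[px <= cx & py >  cy] <- 2L                     # 2nd quadrant
  q[px <  cx & py <= cy] <- 3L                     # 3rd quadrant
  q[px >= cx & py <  cy] <- 4L                     # 4th quadrant
  q[(px - cx)^2 + (py - cy)^2 > r^2] <- 5L         # outside overrides quadrant
  tab <- tabulate(q, nbins = 5)                    # counts for codes 1..5
  as.list(ifelse(tab == 0, NA_integer_, tab))      # NA where a quadrant is empty
}

# One group per circle row: filter TableB once, count once.
finalTable <- TableA[, {
  pts <- TableB[speed == .BY$speed &
                x %between% c(.BY$minX, .BY$maxX) &
                y %between% c(.BY$minY, .BY$maxY)]
  setNames(countQuadrants(pts$x, pts$y, .BY$cX, .BY$cY, .BY$r),
           c("Q1", "Q2", "Q3", "Q4", "Q5"))
}, by = .(indx, speed, minX, maxX, minY, maxY, r, cX, cY)]
```

Grouping by indx guarantees the count is computed per circle rather than once for the whole table, which is what made the mapply attempt repeat the same five numbers on every row.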

Using aggregate in a dataframe with NA without dropping rows [duplicate]

This question already has an answer here:
Blend of na.omit and na.pass using aggregate?
(1 answer)
Closed 5 years ago.
I am using aggregate to get the means of several variables by a specific category (cy), but there are a few NAs in my dataframe. I am using aggregate rather than ddply because, from my understanding, it handles NAs similarly to using na.rm = TRUE. The problem is that it drops all rows containing an NA from the output, so the means are slightly off.
Dataframe:
> bt cy cl pf ne YH YI
1 1 H 1 95 70.0 20 20
2 2 H 1 25 70.0 46 50
3 1 H 1 0 70.0 40 45
4 2 H 1 95 59.9 40 40
5 2 H 1 75 59.9 36 57
6 2 H 1 5 70.0 35 43
7 1 H 1 50 59.9 20 36
8 2 H 1 95 59.9 40 42
9 3 H 1 95 49.5 17 48
10 2 H 1 5 70.0 42 42
11 2 H 1 95 49.5 19 30
12 3 H 1 25 49.5 33 51
13 1 H 1 75 49.5 5 26
14 1 H 1 5 70.0 35 37
15 1 H 1 5 59.9 20 40
16 2 H 1 95 49.5 29 53
17 2 H 1 75 70.0 41 41
18 2 H 1 0 70.0 10 10
19 2 H 1 95 49.5 25 32
20 1 H 1 95 59.9 10 11
21 2 H 1 0 29.5 20 28
22 1 H 1 95 29.5 11 27
23 2 H 1 25 59.9 26 26
24 1 H 1 5 70.0 30 30
25 3 H 1 25 29.5 20 30
26 3 H 1 50 70.0 5 5
27 1 H 1 0 59.9 3 10
28 1 K 1 5 49.5 25 29
29 2 K 1 0 49.5 30 32
30 1 K 1 95 49.5 13 24
31 1 K 1 0 39.5 13 13
32 2 M 1 NA 70.0 45 50
33 3 M 1 25 59.9 3 34'
The full dataframe has 74 rows, and there are NA's peppered throughout all but two columns (cy and cl).
My code looks like this:
meancnty<-(aggregate(cbind(pf,ne,YH,YI)~cy, data = newChart, FUN=mean))
I double-checked in Excel, and the means this function produces are for a dataset of N = 69, i.e. after removing all rows containing NAs. Is there any way to tell R to ignore the NAs rather than remove the rows, other than taking the mean of each variable by county separately (I have a lot of variables to summarize by many different categories)?
Thank you
Using dplyr:
df %>%
  group_by(cy) %>%
  summarize_all(mean, na.rm = TRUE)
# cy bt cl pf ne YH YI
# 1 H 1.785714 0.7209302 53.41463 51.75952 21.92857 29.40476
# 2 K 1.333333 0.8333333 33.33333 47.83333 20.66667 27.33333
# 3 M 1.777778 0.4444444 63.75000 58.68889 24.88889 44.22222
# 4 O 2.062500 0.8750000 31.66667 53.05333 18.06667 30.78571
I think this will work:
meancnty <- aggregate(with(newChart, cbind(pf, ne, YH, YI)),
                      by = list(newChart$cy), FUN = mean, na.rm = TRUE)
I used the following test data:
> q<- data.frame(y = sample(c(0,1), 10, replace=T), a = runif(10, 1, 100), b=runif(10, 20,30))
> q$a[c(2, 5, 7)]<- NA
> q$b[c(1, 3, 4)]<- NA
> q
y a b
1 0 86.87961 NA
2 0 NA 22.39432
3 0 89.38810 NA
4 0 12.96266 NA
5 1 NA 22.07757
6 0 73.96121 24.13154
7 0 NA 22.31431
8 1 62.77095 21.46395
9 0 55.28476 23.14393
10 0 14.01912 28.08305
Using your code from above, I get:
> aggregate(cbind(a,b)~y, data=q, mean, na.rm=T)
y a b
1 0 47.75503 25.11951
2 1 62.77095 21.46395
which is wrong, i.e. it deletes all rows with any NAs and then takes the mean.
This however gave the right result:
> aggregate(with(q, cbind(a, b)), by = list(q$y), mean, na.rm=T)
Group.1 a b
1 0 55.41591 24.01343
2 1 62.77095 21.77076
It applied na.rm = TRUE column by column first, and then took the average by group.
The reason is that the formula interface of aggregate applies its na.action argument, which defaults to na.omit, to whole rows before anything is computed, whereas the non-formula interface simply passes na.rm = TRUE on to mean for each column.
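As in the linked duplicate question, the formula interface can also be kept if the NA rows are passed through instead of omitted: na.action = na.pass keeps them, and na.rm = TRUE is forwarded to mean so each column drops only its own NAs. A short sketch against the test data q above:

```r
# Keep the formula interface, but stop aggregate() from dropping whole rows:
# na.pass lets NA rows through, na.rm = TRUE is passed on to mean().
aggregate(cbind(a, b) ~ y, data = q, FUN = mean,
          na.rm = TRUE, na.action = na.pass)
```

This should match the result of the `with(...)` form, since the NAs are now removed per column inside mean rather than per row by aggregate.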

Count of values in intervals of latitude and years

I have different dataframes with one column containing the latitude of some records (latitude) and another column of the same dataframe containing the date of the records (datecollected).
I would like to count the records falling into each interval of latitude (5 degrees) and year (two years), and export the counts into a new dataframe.
(Hint: you'll make it easier for us to answer by providing some sample data.)
dataset <- data.frame(datecollected=
sample(as.Date("2000-01-01")+(0:3650),1000,replace=TRUE),
latitude=90*runif(1000))
We round the datecollected down to the nearest even year:
year.index <- (as.POSIXlt(dataset$datecollected)$year %/% 2)*2+1900
Similarly, we round the latitude down to the nearest multiple of 5 degrees:
latitude.index <- (floor(dataset$latitude) %/% 5)*5
Then we simply build a table on the rounded years and latitudes:
table(year.index,latitude.index)
latitude.index
year.index 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
2000 12 9 15 7 11 10 11 14 9 13 11 10 8 11 13 25 10 18
2002 11 9 11 16 11 15 12 5 12 13 7 15 8 7 11 7 10 13
2004 8 12 9 10 12 16 12 13 9 7 16 11 6 13 4 15 12 10
2006 14 8 13 10 12 9 12 9 6 11 11 9 13 9 10 5 5 12
2008 8 12 17 12 12 8 12 8 14 12 11 11 10 10 14 16 17 13
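Since the question asks for the counts in a new dataframe rather than a table, the contingency table can be converted directly; responseName is an argument of as.data.frame.table that names the count column:

```r
# Long-format data frame with one row per (year interval, latitude interval)
counts <- as.data.frame(table(year.index, latitude.index),
                        responseName = "n.records")
head(counts)
```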
EDIT: after a bit of discussion in the comments, I'll post my current script. It seems there may be an issue with how you read the data into R. This is what I do and what I get:
rm(list=ls())
dataset <- read.csv("GADUS.csv",header=TRUE,sep=",")
year.index <- (as.POSIXlt(as.character(dataset$datecollected),format="%Y-%m-%d")$year
%/% 2)*2+1900
latitude.index <- (floor(dataset$latitude) %/% 5)*5
table(year.index,latitude.index)
latitude.index
year.index 0 5 20 35 40 45 50 55 60 65 70 75
1752 0 0 0 0 0 20 0 0 0 0 0 0
1754 0 0 0 0 0 27 0 3 0 0 0 0
1756 0 0 0 0 0 21 0 1 0 0 0 0
1758 0 0 0 0 0 46 0 2 0 0 0 0
...
Does this give the same result for you? If not, please edit your question and post the result of str(dataset[,c("datecollected","latitude")]).

Replace last NA of a segment of NAs in a column with last valid value

Here is a sample data frame:
> df = data.frame(rep(seq(0, 120, length.out=6), times = 2), c(sample(1:50, 4),
+ NA, NA, NA, sample(1:50, 5)))
> colnames(df) = c("Time", "Pat1")
> df
Time Pat1
1 0 33
2 24 48
3 48 7
4 72 8
5 96 NA
6 120 NA
7 0 NA
8 24 1
9 48 6
10 72 28
11 96 31
12 120 32
NAs which have to be replaced are identified by which and logical operators:
x = which(is.na(df$Pat1) & df$Time == 0)
I know the na.locf() command from zoo, but it replaces all NAs. How can I replace only the NAs at the positions in x in a multi-column df?
EDIT: Here is a link to my original dataset: link
And this is how far I get:
require(reshape2)
require(zoo)
pad.88 <- read.csv2("pad_88.csv")
colnames(pad.88) = c("Time", "Increment", "Side", 4:length(pad.88)-3)
attach(pad.88)
x = which(Time == 240 & Increment != 5)
pad.88 = pad.88[c(1:x[1], x[1]:x[2], x[2]:x[3], x[3]:x[4], x[4]:x[5], x[5]:x[6],x[6]:x[7], x[7]:x[8], x[8]:nrow(pad.88)),]
y = which(duplicated(pad.88))
pad.88$Time[y] = 0
pad.88$Increment[y] = Increment[x] + 1
z = which(is.na(pad.88[4:ncol(pad.88)] & pad.88$Time == 0), arr.ind=T)
a = na.locf(pad.88[4:ncol(pad.88)])
My next step is something like pat.cols[z] = a[z], which doesn't work.
This is how the result should look:
Time Increment Side 1 2 3 4 5 ...
150 4 0 27,478 24,076 27,862 20,001 25,261
165 4 0 27,053 24,838 27,231 20,001 NA
180 4 0 27,599 24,166 27,862 20,687 NA
195 4 0 27,114 23,403 27,862 20,001 NA
210 4 0 26,993 24,076 27,189 19,716 NA
225 4 0 26,629 24,21 26,221 19,887 NA
240 4 0 26,811 26,228 26,431 20,001 NA
0 5 1 26,811 26,228 26,431 20,001 25,261
15 5 1 ....
The last valid value in column 5 is 25,261. This value replaces the NA at Time 0 / column 5.
You can change it so that x records all the NA values, and then use the first and last entries of x to identify the locations you want.
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 NA
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
x <- which(is.na(df$Pat1))
df[rev(x)[1],"Pat1"] <- df[x[1]-1,"Pat1"]
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 38
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
For the multi-column example, use the same idea in a sapply call:
cbind(df[1], sapply(df[-1], function(x) {
  y <- which(is.na(x))
  x[rev(y)[1]] <- x[y[1] - 1]
  x
}))
Time Pat1 Pat2
1 0 41 42
2 24 8 30
3 48 3 41
4 72 14 NA
5 96 NA NA
6 120 NA NA
7 0 14 41
8 24 5 37
9 48 29 48
10 72 31 11
11 96 50 43
12 120 46 21
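For completeness, a zoo-based sketch of the original na.locf idea: fill the whole column once, then copy back only the positions picked out by the which() condition, so the other NAs stay NA. (Assumes df as above; the multi-column loop is untested against the real pad.88 layout.)

```r
library(zoo)  # for na.locf()

# Fill only the NAs that sit at Time == 0, leaving the other NAs untouched.
idx <- which(is.na(df$Pat1) & df$Time == 0)   # the NAs to replace
filled <- na.locf(df$Pat1, na.rm = FALSE)     # last observation carried forward
df$Pat1[idx] <- filled[idx]                   # touch only those positions

# Multi-column variant over all patient columns:
# for (col in names(df)[-1]) {
#   idx <- which(is.na(df[[col]]) & df$Time == 0)
#   df[[col]][idx] <- na.locf(df[[col]], na.rm = FALSE)[idx]
# }
```

Unlike the rev(x)/x[1] trick, this also behaves correctly when a column contains several separate NA runs, since na.locf fills each run from its own preceding value.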
