Using aggregate in a dataframe with NA without dropping rows [duplicate] - r

This question already has an answer here:
Blend of na.omit and na.pass using aggregate?
(1 answer)
Closed 5 years ago.
I am using aggregate to get the means of several variables by a specific category (cy), but there are a few NAs in my dataframe. I am using aggregate rather than ddply because, as I understand it, aggregate handles NAs much like na.rm = TRUE. The problem is that it drops every row containing an NA from the output, so the means are slightly off.
Dataframe:
> bt cy cl pf ne YH YI
1 1 H 1 95 70.0 20 20
2 2 H 1 25 70.0 46 50
3 1 H 1 0 70.0 40 45
4 2 H 1 95 59.9 40 40
5 2 H 1 75 59.9 36 57
6 2 H 1 5 70.0 35 43
7 1 H 1 50 59.9 20 36
8 2 H 1 95 59.9 40 42
9 3 H 1 95 49.5 17 48
10 2 H 1 5 70.0 42 42
11 2 H 1 95 49.5 19 30
12 3 H 1 25 49.5 33 51
13 1 H 1 75 49.5 5 26
14 1 H 1 5 70.0 35 37
15 1 H 1 5 59.9 20 40
16 2 H 1 95 49.5 29 53
17 2 H 1 75 70.0 41 41
18 2 H 1 0 70.0 10 10
19 2 H 1 95 49.5 25 32
20 1 H 1 95 59.9 10 11
21 2 H 1 0 29.5 20 28
22 1 H 1 95 29.5 11 27
23 2 H 1 25 59.9 26 26
24 1 H 1 5 70.0 30 30
25 3 H 1 25 29.5 20 30
26 3 H 1 50 70.0 5 5
27 1 H 1 0 59.9 3 10
28 1 K 1 5 49.5 25 29
29 2 K 1 0 49.5 30 32
30 1 K 1 95 49.5 13 24
31 1 K 1 0 39.5 13 13
32 2 M 1 NA 70.0 45 50
33 3 M 1 25 59.9 3 34
The full dataframe has 74 rows, and there are NA's peppered throughout all but two columns (cy and cl).
My code looks like this:
meancnty <- aggregate(cbind(pf, ne, YH, YI) ~ cy, data = newChart, FUN = mean)
I double-checked in Excel, and the means this function produces are for a dataset of N = 69, i.e. after removing every row that contains an NA. Is there any way to tell R to ignore the NAs rather than remove the rows, short of taking the mean of each variable by category one at a time (I have a lot of variables to summarize by many different categories)?
Thank you

Using dplyr:
df %>%
  group_by(cy) %>%
  summarize_all(mean, na.rm = TRUE)
# cy bt cl pf ne YH YI
# 1 H 1.785714 0.7209302 53.41463 51.75952 21.92857 29.40476
# 2 K 1.333333 0.8333333 33.33333 47.83333 20.66667 27.33333
# 3 M 1.777778 0.4444444 63.75000 58.68889 24.88889 44.22222
# 4 O 2.062500 0.8750000 31.66667 53.05333 18.06667 30.78571
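Note that summarize_all() has since been superseded in dplyr; on dplyr 1.0 or later the same result can be written with across() (a sketch of the equivalent call, assuming the same df):
df %>%
  group_by(cy) %>%
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))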

I think this will work:
meancnty <- aggregate(with(newChart, cbind(pf, ne, YH, YI)),
                      by = list(newChart$cy), FUN = mean, na.rm = TRUE)
I used the following test data:
> q<- data.frame(y = sample(c(0,1), 10, replace=T), a = runif(10, 1, 100), b=runif(10, 20,30))
> q$a[c(2, 5, 7)]<- NA
> q$b[c(1, 3, 4)]<- NA
> q
y a b
1 0 86.87961 NA
2 0 NA 22.39432
3 0 89.38810 NA
4 0 12.96266 NA
5 1 NA 22.07757
6 0 73.96121 24.13154
7 0 NA 22.31431
8 1 62.77095 21.46395
9 0 55.28476 23.14393
10 0 14.01912 28.08305
Using your code from above, I get:
> aggregate(cbind(a,b)~y, data=q, mean, na.rm=T)
y a b
1 0 47.75503 25.11951
2 1 62.77095 21.46395
which is wrong, i.e. it deletes all rows with any NAs and then takes the mean.
This however gave the right result:
> aggregate(with(q, cbind(a, b)), by = list(q$y), mean, na.rm=T)
Group.1 a b
1 0 55.41591 24.01343
2 1 62.77095 21.77076
It applied na.rm=T by column first, and then took the average by group. The reason is the interface rather than the class of y: the formula method of aggregate applies its default na.action = na.omit before anything is computed, which drops every row that contains an NA, whereas the default (non-formula) method simply passes na.rm = TRUE through to mean separately for each column.
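Given that, you can also keep the formula interface and just override its na.action, which is the approach from the linked duplicate; a minimal sketch on the test data above:
aggregate(cbind(a, b) ~ y, data = q, FUN = mean,
          na.rm = TRUE, na.action = na.pass)
With na.action = na.pass the rows are kept, and na.rm = TRUE removes the NAs inside mean() per column.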

Related

Consecutive Positive or Negative calculation from data frame and filter results using R

I have the following dataset and I am looking to write code that can pull out which stocks have been positive or negative consecutively. The data has the first 3 columns; the last 2 columns were calculated manually in Excel to show the expected results.
This is only a sample; I would have data for 200+ stocks over a few years, with not all stocks trading every day.
In the end, I want to extract which stocks have, say, 3 or 4 or 5 consecutive positive or negative daily changes.
Stocks Date Close Price Change for day Positive/Negative Count
A 11/11/2020 11
B 11/11/2020 50
C 11/11/2020 164
A 11/12/2020 19 8 1
B 11/12/2020 62 12 1
C 11/12/2020 125 -39 -1
A 11/13/2020 7 -12 -1
B 11/13/2020 63 1 2
C 11/13/2020 165 40 1
A 11/16/2020 17 10 1
B 11/16/2020 70 7 3
C 11/16/2020 170 5 2
A 11/17/2020 24 7 2
B 11/17/2020 52 -18 -1
C 11/17/2020 165 -5 -1
A 11/18/2020 31 7 3
B 11/18/2020 61 9 1
C 11/18/2020 157 -8 -2
The difficulty is writing a function that produces the cumulative count, both positive and negative, resetting the count when the sign changes and starting the count with the first value. I managed to write one, but it is not terribly efficient and will probably get slow on a bigger dataset. I suspect there is a way to do better, if only with a simple for loop in C or C++.
library(tidyverse)
df <- read.table(text="Stocks Date Close_Price Change_for_day Positive/Negative_Count
A 11/11/2020 11 NA 0
B 11/11/2020 50 NA 0
C 11/11/2020 164 NA 0
A 11/12/2020 19 8 1
B 11/12/2020 62 12 1
C 11/12/2020 125 -39 -1
A 11/13/2020 7 -12 -1
B 11/13/2020 63 1 2
C 11/13/2020 165 40 1
A 11/16/2020 17 10 1
B 11/16/2020 70 7 3
C 11/16/2020 170 5 2
A 11/17/2020 24 7 2
B 11/17/2020 52 -18 -1
C 11/17/2020 165 -5 -1
A 11/18/2020 31 7 3
B 11/18/2020 61 9 1
C 11/18/2020 157 -8 -2",
header = TRUE) %>%
select(1:3) %>%
as_tibble()
# this formulation could be faster on data with longer stretches
nb_days_cons2 <- function(x){
  n <- length(x)
  if (n < 2) return(x)
  out <- integer(n)
  y <- rle(x)
  cur_pos <- 1
  for (i in seq_along(y$lengths)) {
    out[cur_pos:(cur_pos + y$lengths[i] - 1)] <- cumsum(rep(y$values[i], y$lengths[i]))
    cur_pos <- cur_pos + y$lengths[i]
  }
  out
}
# this formulation was faster on some tests, and would be easier to rewrite in C
nb_days_cons <- function(x){
  n <- length(x)
  if (n < 2) return(x)
  out <- integer(n)
  out[1] <- x[1]
  for (i in 2:n) {
    if (x[i] == x[i-1]) {
      out[i] <- out[i-1] + x[i]
    } else {
      out[i] <- x[i]
    }
  }
  out
}
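As a possible improvement without dropping to C, the same signed counter can be computed fully vectorized from the run-length encoding; this is a sketch (not benchmarked here), with nb_days_cons_vec as a hypothetical name:
# sequence() counts 1, 2, ... within each run of equal signs;
# multiplying by x restores the sign (and keeps 0 at 0)
nb_days_cons_vec <- function(x){
  sequence(rle(x)$lengths) * x
}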
Once we have that function, the dplyr part is quite classic.
df %>%
  group_by(Stocks) %>%
  arrange(Date) %>% # make sure of the order
  mutate(change = c(0, diff(Close_Price)),
         stretch_duration = nb_days_cons(sign(change))) %>%
  arrange(Stocks)
#> # A tibble: 18 x 5
#> # Groups: Stocks [3]
#> Stocks Date Close_Price change stretch_duration
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 A 11/11/2020 11 0 0
#> 2 A 11/12/2020 19 8 1
#> 3 A 11/13/2020 7 -12 -1
#> 4 A 11/16/2020 17 10 1
#> 5 A 11/17/2020 24 7 2
#> 6 A 11/18/2020 31 7 3
#> 7 B 11/11/2020 50 0 0
#> 8 B 11/12/2020 62 12 1
#> 9 B 11/13/2020 63 1 2
#> 10 B 11/16/2020 70 7 3
#> 11 B 11/17/2020 52 -18 -1
#> 12 B 11/18/2020 61 9 1
#> 13 C 11/11/2020 164 0 0
#> 14 C 11/12/2020 125 -39 -1
#> 15 C 11/13/2020 165 40 1
#> 16 C 11/16/2020 170 5 2
#> 17 C 11/17/2020 165 -5 -1
#> 18 C 11/18/2020 157 -8 -2
Created on 2020-11-19 by the reprex package (v0.3.0)
Of course, the final arrange() is just for easy visualization, and you can remove the columns you don't need anymore with select().

R Data.Table Filter by table A in table B

The goal of this code is to find the quadrant of a given point on a given circle.
I have two separate data.tables. In Table A, I have different variations of the circle-equation variables. In Table B, I have the raw data for finding how many points lie in each circle quadrant. The sequence is:
Get the circle equation from Table A
Filter Table B down to the points that fall within the circle's coordinate range
Find the quadrant in which each point lies (getQuadrant function)
Count how many points lie in each quadrant (Quadrants function)
I have made some attempts, but they are rather slow to return results. The tables are as follows:
set.seed(4)
TableA <- data.table(speed=rep(42:44,each=3),
minX = rep(c(1:12),3),
maxX = rep(c(10:21),3),
minY = 1,
maxY = 10,
r = 5,
cX = rep(c(6:17),3),
cY = 6,
indx = 1:36)
TableA
speed minX maxX minY maxY r cX cY indx
1: 42 1 10 1 10 1 2 2 1
2: 42 2 11 1 10 1 2 2 2
3: 42 3 12 1 10 1 2 2 3
4: 43 1 10 1 10 1 2 2 4
5: 43 2 11 1 10 1 2 2 5
6: 43 3 12 1 10 1 2 2 6
7: 44 1 10 1 10 1 2 2 7
8: 44 2 11 1 10 1 2 2 8
9: 44 3 12 1 10 1 2 2 9
TableB <- data.table(speed=rep(42:44,each=100),
x = rep(sample(12),100),
y = rep(sample(12),100),
n = rep(sample(12),100))
TableB
speed x y n
1: 42 8 2 8
2: 42 1 11 10
3: 42 3 5 5
4: 42 10 10 12
5: 42 7 8 11
Function to find quadrant:
getQuadrant <- function(X = 0, Y = 0, R = 1, PX = 10, PY = 10){
  #' X and Y are the center of the circle
  #' R is the radius
  #' PX and PY are a point anywhere
  # The point is at the center
  if (PX == X & PY == Y)
    return(0)
  val = (PX - X)^2 + (PY - Y)^2
  # Outside the circle
  if (val > R^2)
    return(5)
  # 1st quadrant
  if (PX > X & PY >= Y)
    return(1)
  # 2nd quadrant
  if (PX <= X & PY > Y)
    return(2)
  # 3rd quadrant
  if (PX < X & PY <= Y)
    return(3)
  # 4th quadrant
  if (PX >= X & PY < Y)
    return(4)
}
Function to return the number of points in each quadrant (the point columns are called x and y in the sample TableB):
Quadrants <- function(dt, radius, centerX, centerY){
  #' dt is the data filtered for this circle
  #' radius is the radius of the circle equation
  #' centerX and centerY are the center of the circle equation
  if (nrow(dt) > 0) {
    dt[, quadrant := factor(mapply(function(X, Y, R, PX, PY)
                                     getQuadrant(X = X, Y = Y, R = R, PX = PX, PY = PY),
                                   centerX, centerY, radius, x, y),
                            levels = c("1", "2", "3", "4", "5"))]
    dt <- dt[, .(.N), keyby = .(quadrant)]
    setkeyv(dt, "quadrant")
    dt <- dt[CJ(levels(dt[, quadrant])), ]
    dd <- list(Q1 = dt$N[1], Q2 = dt$N[2], Q3 = dt$N[3], Q4 = dt$N[4], Q5 = dt$N[5])
  } else {
    dd <- list(Q1 = NA, Q2 = NA, Q3 = NA, Q4 = NA, Q5 = NA)
  }
  return(dd)
}
I have the following solution, but it doesn't work:
finalTable <- TableA[, c('Q1','Q2','Q3','Q4','Q5') :=
  mapply(function(a, b, c, d, e, f, g, h)
           Quadrants(TableB[, .SD[x %between% c(a, b) & y %between% c(c, d) & speed == h]],
                     radius = e, centerX = f, centerY = g),
         minX, maxX, minY, maxY, r, cX, cY, speed)]
I don't think I am doing this right, because the results below are not what I expect.
speed minX maxX minY maxY r cX cY indx Q1 Q2 Q3 Q4 Q5
1: 42 1 10 1 10 5 6 6 1 32 32 100 68 68
2: 42 2 11 1 10 5 7 6 2 32 32 100 68 68
3: 42 3 12 1 10 5 8 6 3 32 32 100 68 68
4: 43 4 13 1 10 5 9 6 4 32 32 100 68 68
...
11: 42 11 20 1 10 5 16 6 11 32 32 100 68 68
12: 42 12 21 1 10 5 17 6 12 32 32 100 68 68
13: 43 1 10 1 10 5 6 6 13 32 32 100 68 68
14: 43 2 11 1 10 5 7 6 14 32 32 100 68 68
15: 43 3 12 1 10 5 8 6 15 32 32 100 68 68
...
22: 43 10 19 1 10 5 15 6 22 32 32 100 68 68
23: 43 11 20 1 10 5 16 6 23 32 32 100 68 68
24: 43 12 21 1 10 5 17 6 24 32 32 100 68 68
25: 44 1 10 1 10 5 6 6 25 32 32 100 68 68
26: 44 2 11 1 10 5 7 6 26 32 32 100 68 68
27: 44 3 12 1 10 5 8 6 27 32 32 100 68 68
28: 42 4 13 1 10 5 9 6 28 32 32 100 68 68
...
35: 44 11 20 1 10 5 16 6 35 32 32 100 68 68
36: 44 12 21 1 10 5 17 6 36 32 32 100 68 68
Can anyone take a look, please? It would be much appreciated.
Expected Output:
speed minX maxX minY maxY r cX cY indx Q1 Q2 Q3 Q4 Q5
1: 42 2 11 1 10 5 7 6 1 200 100 400 100 200
2: 42 3 12 1 10 5 8 6 2 200 100 300 100 200
3: 42 4 13 1 10 5 9 6 3 200 100 300 100 100
4: 42 5 14 1 10 5 10 6 4 100 200 300 NA 100
...
11: 42 12 21 1 10 5 17 6 11 NA NA NA NA NA
12: 42 13 22 1 10 5 18 6 12 NA NA NA NA NA
13: 43 2 11 1 10 5 7 6 13 200 100 400 100 200
14: 43 3 12 1 10 5 8 6 14 200 100 300 100 200
15: 43 4 13 1 10 5 9 6 15 200 100 300 100 100
...
22: 43 11 20 1 10 5 16 6 22 NA NA NA NA 100
23: 43 12 21 1 10 5 17 6 23 NA NA NA NA NA
24: 43 13 22 1 10 5 18 6 24 NA NA NA NA NA
25: 44 2 11 1 10 5 7 6 25 200 100 400 100 200
26: 44 3 12 1 10 5 8 6 26 200 100 300 100 200
27: 44 4 13 1 10 5 9 6 27 200 100 300 100 100
28: 44 5 14 1 10 5 10 6 28 100 200 300 NA 100
...
35: 44 12 21 1 10 5 17 6 35 NA NA NA NA NA
36: 44 13 22 1 10 5 18 6 36 NA NA NA NA NA
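For reference, here is one way to get per-circle counts: loop over TableA with by = indx, filter TableB by speed and bounding box, and tabulate the quadrants returned by getQuadrant(). This is only a sketch, under the assumptions that row counts per quadrant are wanted and that empty quadrants should show as NA, as in the expected output:
library(data.table)
res <- TableA[, {
  # scalar copies of this row's circle parameters
  sp <- speed; lox <- minX; hix <- maxX; loy <- minY; hiy <- maxY
  cx <- cX; cy0 <- cY; rad <- r
  pts <- TableB[speed == sp & x %between% c(lox, hix) & y %between% c(loy, hiy)]
  if (nrow(pts) == 0L) {
    out <- rep(NA_integer_, 5L)
  } else {
    q <- mapply(getQuadrant, PX = pts$x, PY = pts$y,
                MoreArgs = list(X = cx, Y = cy0, R = rad))
    out <- tabulate(q, nbins = 5L)  # quadrants 1-4 plus "outside the circle" (5)
    out[out == 0L] <- NA_integer_   # match the NA convention of the expected output
  }
  setNames(as.list(out), paste0("Q", 1:5))
}, by = indx]
finalTable <- TableA[res, on = "indx"]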

Calculating table in R with uneven length

I have two tables of data in R:
a = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 2 0 0 0 2
2 3 0 0 10 3
3 4 0 51 25 0
4 5 19 129 14 0
5 6 60 137 1 0
6 7 31 62 15 5
7 8 7 11 7 0
and
b = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 1 0 0 1 266
2 2 1 0 47 335
3 3 1 26 415 142
4 4 3 965 508 5
5 5 145 2535 103 0
6 6 939 2239 15 6
7 7 420 613 86 34
8 8 46 84 36 16
I would like to calculate b/a by matching on Duration. I thought of something like ifelse(), but it does not work. Can someone please help me?
Thanks a lot
Match the order and selection of b with a (in my example, y with x), then do the math:
x <- data.frame(duration = 2:8, v = rnorm(7))
y <- data.frame(duration = 8:1, v = rnorm(8))
m <- match(x$duration, y$duration)  # position of each x duration in y
x$v / y$v[m]
Durations present in x but missing from y come out as NA in the result, btw.
Do you want something like the following:
a <- a[-1]
b <- b[-1]
a <- a[order(a$Duration),]
b <- b[order(b$Duration),]
durations <- intersect(a$Duration, b$Duration)
b[b$Duration %in% durations,] / a[a$Duration %in% durations,]
Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
2 1 Inf NaN Inf 167.50000
3 1 Inf Inf 41.500000 47.33333
4 1 Inf 18.921569 20.320000 Inf
5 1 7.631579 19.651163 7.357143 NaN
6 1 15.650000 16.343066 15.000000 Inf
7 1 13.548387 9.887097 5.733333 6.80000
8 1 6.571429 7.636364 5.142857 Inf
You may want to replace the NaN and Inf values (which come from dividing by zero) with something else.
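A variation on the same idea that keeps Duration intact and handles the non-finite values in one go (a sketch, assuming the value columns of both tables are in the same order):
common <- intersect(a$Duration, b$Duration)
aa <- a[a$Duration %in% common, ]
bb <- b[b$Duration %in% common, ]
res <- cbind(Duration = aa$Duration, bb[-1] / aa[-1])
res[!sapply(res, is.finite)] <- NA  # replace Inf/NaN from division by zero
res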

How to calculate the standard deviation of subsets of data down a column in R

I would like to calculate the standard deviation of every 4 values down a column, from the first to the last observation. I have found lots of answers for moving-SD functions, but I simply need a line of code that will calculate sd() for each block of 4 values and write the answers into a new column in the data frame, as below:
Example data:
Obs Count
1 56
2 29
3 66
4 62
5 49
6 12
7 65
8 81
9 73
10 66
11 71
12 59
Desired output:
Obs Count SD
1 56 16.68
2 29 16.68
3 66 16.68
4 62 16.68
5 49 29.55
6 12 29.55
7 65 29.55
8 81 29.55
9 73 6.24
10 66 6.24
11 71 6.24
12 59 6.24
I tried the below code, but this is obviously incorrect:
a <- for(i in 1: length(df)) sd(df$Count[i:(i+3)])
This should be a very easy task, but I have not been able to find an answer. I am still learning and any help would be appreciated.
In base R, you can use the following to create an index of "every 4 rows":
(seq_len(nrow(mydf))-1) %/% 4
# [1] 0 0 0 0 1 1 1 1 2 2 2 2
Using that, you can use ave to get the desired result:
mydf$SD <- ave(mydf$Count, (seq_len(nrow(mydf))-1) %/% 4, FUN = sd)
mydf
# Obs Count SD
# 1 1 56 16.680827
# 2 2 29 16.680827
# 3 3 66 16.680827
# 4 4 62 16.680827
# 5 5 49 29.545163
# 6 6 12 29.545163
# 7 7 65 29.545163
# 8 8 81 29.545163
# 9 9 73 6.238322
# 10 10 66 6.238322
# 11 11 71 6.238322
# 12 12 59 6.238322
An alternative is using rollapply from the zoo package in combination with rep.
> library(zoo)
> N <- 4 # every four values
> SDs <- rollapply(df[,2], width=N, by=N, sd)
> df$SD <- rep(SDs, each=N)
> df
Obs Count SD
1 1 56 16.680827
2 2 29 16.680827
3 3 66 16.680827
4 4 62 16.680827
5 5 49 29.545163
6 6 12 29.545163
7 7 65 29.545163
8 8 81 29.545163
9 9 73 6.238322
10 10 66 6.238322
11 11 71 6.238322
12 12 59 6.238322
You might want to do it all at once:
df$SD <- rep( rollapply(df[,2], width=N, by=N, sd), each=N)
This looks faster (I didn't test it, though):
# mydf = your data
idxs <- rep(1:nrow(mydf), each = 4, length.out = nrow(mydf))
mydf <- within(mydf, {
  SD <- rep(tapply(Count, idxs, sd), each = 4)
})
print(mydf)
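For completeness, the same block-of-4 grouping translates directly to dplyr (a sketch, assuming the number of rows is a multiple of 4):
library(dplyr)
mydf %>%
  group_by(grp = (row_number() - 1) %/% 4) %>%
  mutate(SD = sd(Count)) %>%
  ungroup() %>%
  select(-grp)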

Replace last NA of a segment of NAs in a column with last valid value

Here is a sample data frame:
> df = data.frame(rep(seq(0, 120, length.out=6), times = 2), c(sample(1:50, 4),
+ NA, NA, NA, sample(1:50, 5)))
> colnames(df) = c("Time", "Pat1")
> df
Time Pat1
1 0 33
2 24 48
3 48 7
4 72 8
5 96 NA
6 120 NA
7 0 NA
8 24 1
9 48 6
10 72 28
11 96 31
12 120 32
NAs which have to be replaced are identified by which and logical operators:
x = which(is.na(df$Pat1) & df$Time == 0)
I know the na.locf() command from the zoo package, but it replaces all NAs. How can I replace only the NAs at positions x in a multi-column data frame?
EDIT: Here is a link to my original dataset: link
And that's how far I've gotten:
require(reshape2)
require(zoo)
pad.88 <- read.csv2("pad_88.csv")
colnames(pad.88) = c("Time", "Increment", "Side", 4:length(pad.88)-3)
attach(pad.88)
x = which(Time == 240 & Increment != 5)
pad.88 = pad.88[c(1:x[1], x[1]:x[2], x[2]:x[3], x[3]:x[4], x[4]:x[5], x[5]:x[6],x[6]:x[7], x[7]:x[8], x[8]:nrow(pad.88)),]
y = which(duplicated(pad.88))
pad.88$Time[y] = 0
pad.88$Increment[y] = Increment[x] + 1
z = which(is.na(pad.88[4:ncol(pad.88)] & pad.88$Time == 0), arr.ind=T)
a = na.locf(pad.88[4:ncol(pad.88)])
My next step is something like pat.cols[z] = a[z], which doesn't work.
This is how the result should look:
Time Increment Side 1 2 3 4 5 ...
150 4 0 27,478 24,076 27,862 20,001 25,261
165 4 0 27,053 24,838 27,231 20,001 NA
180 4 0 27,599 24,166 27,862 20,687 NA
195 4 0 27,114 23,403 27,862 20,001 NA
210 4 0 26,993 24,076 27,189 19,716 NA
225 4 0 26,629 24,21 26,221 19,887 NA
240 4 0 26,811 26,228 26,431 20,001 NA
0 5 1 26,811 26,228 26,431 20,001 25,261
15 5 1 ....
The last valid value in col 5 is 25,261. This value replaces the NA at Time 0/Col 5.
You can change it so that x records all the NA values and use the first and last from that to identify the locations you want.
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 NA
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
x <- which(is.na(df$Pat1))
df[rev(x)[1],"Pat1"] <- df[x[1]-1,"Pat1"]
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 38
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
For the multi-column example use the same idea in a sapply call:
cbind(df[1], sapply(df[-1], function(x) {
  y <- which(is.na(x))
  x[rev(y)[1]] <- x[y[1] - 1]
  x
}))
Time Pat1 Pat2
1 0 41 42
2 24 8 30
3 48 3 41
4 72 14 NA
5 96 NA NA
6 120 NA NA
7 0 14 41
8 24 5 37
9 48 29 48
10 72 31 11
11 96 50 43
12 120 46 21
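If the zoo route from the question's edit is preferred, the missing step (pat.cols[z] = a[z]) can be made to work by filling every column with na.locf() and then copying back only the targeted positions; a sketch on the small example above (cols would widen to all patient columns on the real data):
library(zoo)
cols <- "Pat1"  # e.g. names(df)[-1] for all patient columns
filled <- na.locf(df[cols], na.rm = FALSE)  # carry the last observation forward
idx <- which(is.na(df[cols]) & df$Time == 0, arr.ind = TRUE)
df[cols][idx] <- filled[idx]  # overwrite only the NAs at Time == 0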
