As an example I use the Boston data with 3 columns (id (added), medv, lstat) and 506 observations.
I want to calculate a leave-one-out mean for the variable medv, i.e. for each row the mean over the remaining k-1 observations. For id 1, the mean is calculated from rows 2-506; for id 2, over rows 1 and 3-506; for id 3, over rows 1-2 and 4-506; and so on.
In a second step the mean should be conditional on the median. This means we first check whether a value is above or below the median of its column (medv and lstat). If the value in medv is above the median of medv, we calculate the mean of the lstat values that are above the median of lstat. If the value in medv is below the median of medv, we calculate the mean of the lstat values that are below the median of lstat. See the example table below for the first 10 rows; for these rows the median is 25.55 for medv and 7.24 for lstat.
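In symbols (just restating the rule above), writing m for medv, l for lstat, and med(.) for the median:

$$
\text{meancond}_i =
\begin{cases}
\operatorname{mean}\{\, l_j : l_j > \operatorname{med}(l) \,\} & \text{if } m_i > \operatorname{med}(m) \\
\operatorname{mean}\{\, l_j : l_j < \operatorname{med}(l) \,\} & \text{otherwise}
\end{cases}
$$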
Here is the data:
library(mlbench)
data(BostonHousing)
df <- BostonHousing
df$id <- seq.int(nrow(df))
df <- subset(df, select = c(id, medv, lstat))
id medv lstat mean1out meancond
1 24.0 4.98 26.66667 4.50
2 21.6 9.14 26.93333 4.50
3 34.7 4.03 25.47778 17.55
4 33.4 2.94 25.62222 17.55
5 36.2 5.33 25.31111 17.55
6 28.7 5.21 26.14444 17.55
7 22.9 12.43 26.78889 4.50
8 27.1 19.15 26.32222 17.55
9 16.5 29.93 27.50000 4.50
10 18.9 17.10 27.23333 4.50
The first part of the problem is already solved by @r2evans (see the answer below).
For the second part we can calculate the medians of lstat and medv, compare, and assign values.
# First part, from @r2evans' answer (using the 10-row example from the question)
df <- head(df, 10)
n <- nrow(df)
df$mean1out <- (mean(df$medv)*n - df$medv)/(n-1)
# Second part
med_lstat <- median(df$lstat)
med_medv <- median(df$medv)
higher_lstat <- mean(df$lstat[df$lstat > med_lstat])
lower_lstat <- mean(df$lstat[df$lstat < med_lstat])
df$meancond <- ifelse(df$medv > med_medv, higher_lstat, lower_lstat)
df
# id medv lstat mean1out meancond
#1 1 24.0 4.98 26.66667 4.498
#2 2 21.6 9.14 26.93333 4.498
#3 3 34.7 4.03 25.47778 17.550
#4 4 33.4 2.94 25.62222 17.550
#5 5 36.2 5.33 25.31111 17.550
#6 6 28.7 5.21 26.14444 17.550
#7 7 22.9 12.43 26.78889 4.498
#8 8 27.1 19.15 26.32222 17.550
#9 9 16.5 29.93 27.50000 4.498
#10 10 18.9 17.10 27.23333 4.498
Data:
dat <- BostonHousing
dat$id <- seq.int(nrow(dat))
dat <- subset(dat, select = c(id, medv, lstat))
dat <- head(dat, 10)
mean(dat$medv[-3])
# [1] 25.47778
sapply(seq_len(nrow(dat)), function(i) mean(dat$medv[-i]))
# [1] 26.66667 26.93333 25.47778 25.62222 25.31111 26.14444 26.78889 26.32222 27.50000 27.23333
Alternatively (mathematically), without the sapply, you can get the same numbers this way:
n <- nrow(dat)
(mean(dat$medv)*n - dat$medv)/(n-1)
# [1] 26.66667 26.93333 25.47778 25.62222 25.31111 26.14444 26.78889 26.32222 27.50000 27.23333
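This works because of the leave-one-out identity, obtained by rearranging the definition of the mean:

$$
\bar{x}_{-i} = \frac{\sum_{j \ne i} x_j}{n-1} = \frac{n\bar{x} - x_i}{n-1}
$$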
For your conditional mean, a simple ifelse works:
n <- nrow(dat)
transform(
dat,
a = (mean(dat$medv)*n - dat$medv)/(n-1),
b = ifelse(medv <= median(medv),
mean(lstat[ lstat <= median(lstat) ]),
mean(lstat[ lstat > median(lstat) ]))
)
#    id medv lstat        a      b
# 1   1 24.0  4.98 26.66667  4.498
# 2   2 21.6  9.14 26.93333  4.498
# 3   3 34.7  4.03 25.47778 17.550
# 4   4 33.4  2.94 25.62222 17.550
# 5   5 36.2  5.33 25.31111 17.550
# 6   6 28.7  5.21 26.14444 17.550
# 7   7 22.9 12.43 26.78889  4.498
# 8   8 27.1 19.15 26.32222 17.550
# 9   9 16.5 29.93 27.50000  4.498
# 10 10 18.9 17.10 27.23333  4.498
(Comparing b with your meancond column, I'm inferring that the differences are rounding on data entry.)
Using the functions below I calculate the variance of some 3D points:
# sum of squared distances of the points from their centroid
centroid_3d_sq_dist <- function(point_matrix) {
  if (nrow(point_matrix) == 1) {
    return(0)
  }
  mean_point <- apply(point_matrix, 2, mean)  # the centroid (column means)
  point_sq_distances <- apply(
    point_matrix,
    1,
    function(row_point) {
      sum((row_point - mean_point) ** 2)
    }
  )
  sum_sq_distances <- sum(point_sq_distances)
  return(sum_sq_distances)
}

# variance: sum of squared distances divided by (n - 1)
point_3d_variance <- function(point_matrix) {
  if (nrow(point_matrix) == 1) {
    return(0)
  }
  dist_var <- centroid_3d_sq_dist(point_matrix) /
    (nrow(point_matrix) - 1)
  return(dist_var)
}
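For instance, on a made-up 3-point matrix as a quick sanity check (the centroid is (1/3, 1/3, 0), the squared distances sum to 4/3, and dividing by n-1 = 2 gives 2/3):

pts <- matrix(c(0, 0, 0,
                1, 0, 0,
                0, 1, 0), nrow = 3, byrow = TRUE)
point_3d_variance(pts)
# [1] 0.6666667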
The argument of these functions is a matrix of (x, y, z) coordinates.
Now I have a dataset with two 3D points per row:
ID Trial Size PP PA FkA ciccioX ciccioY ciccioZ pinoX pinoY pinoZ
1 Gigi 1 40 39.6 1050. 31.5 521. 293. 10.6 516. 323. 6.41
2 Gigi 2 20.0 30.7 944. 9.35 525. 300. 12.6 520. 305. 7.09
3 Gigi 3 30 29.5 1056. 24.1 521. 298. 12.3 519. 321. 5.89
4 Gigi 5 60 53.0 1190. 53.0 680. 287. 64.4 699. 336. 68.6
5 Bibi 1 40 38.3 1038. 31.4 524. 289. 10.9 519. 319. 6.17
6 Bibi 2 60 64.7 1293. 47.8 516. 282. 10.4 519. 330. 6.32
7 Bibi 3 20.0 33.8 1092. 17.5 523. 300. 12.8 518. 315. 6.22
8 Bibi 4 30 35.0 1108. 26.4 525. 295. 11.7 517. 320. 5.78
9 Bibi 5 50 46.5 1199. 34.2 515. 289. 11.2 517. 323. 6.27
10 Bibi 6 30 28.7 1016. 17.1 528. 298. 12.7 524. 314. 6.36
The 3D points are:
ciccio: ciccioX ciccioY ciccioZ
pino: pinoX pinoY pinoZ
I want to calculate the variance of ciccio and the variance of pino, grouped by ID and Size.
I tried:
data %>%
group_by(SubjectID, Size) %>%
summarize(as.data.frame(matrix(f4(dd[7:9],dd[10:12]), nr = 1)))
But it doesn't work.
Do you have any advice?
The dataset you've shown is too small to calculate meaningful variances: most ID/Size groups contain only a single row, for which your function returns 0 by design. But you could use:
library(dplyr)
df %>%
group_by(ID, Size) %>%
summarise(var_ciccio = point_3d_variance(as.matrix(across(ciccioX:ciccioZ))),
var_pino = point_3d_variance(as.matrix(across(pinoX:pinoZ))),
.groups = "drop")
This returns
# A tibble: 9 x 4
ID Size var_ciccio var_pino
<chr> <dbl> <dbl> <dbl>
1 Bibi 20 0 0
2 Bibi 30 9.5 42.7
3 Bibi 40 0 0
4 Bibi 50 0 0
5 Bibi 60 0 0
6 Gigi 20 0 0
7 Gigi 30 0 0
8 Gigi 40 0 0
9 Gigi 60 0 0
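A side note: on dplyr >= 1.1.0 you can use pick() instead of across() for plain column selection, which is the now-recommended idiom; a sketch of the same call (result assumed identical):

df %>%
  group_by(ID, Size) %>%
  summarise(var_ciccio = point_3d_variance(as.matrix(pick(ciccioX:ciccioZ))),
            var_pino = point_3d_variance(as.matrix(pick(pinoX:pinoZ))),
            .groups = "drop")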
I was trying to forecast a time-series problem using lm(), and my data looks like below:
Customer_key date sales
A35 2018-05-13 31
A35 2018-05-20 20
A35 2018-05-27 43
A35 2018-06-03 31
BH22 2018-05-13 60
BH22 2018-05-20 67
BH22 2018-05-27 78
BH22 2018-06-03 55
Converted my df to a list by:
library(reshape2)
df <- dcast(df, date ~ customer_key, value.var = "sales")
df <- subset(df, select = -c(date))
demandWithKey <- as.list(df)
Trying to write a function that I can apply across all customers:
my_fun <- function(x) {
## after converting to a list, the ds_load and date column names
## are no longer available for the formula
fit <- lm(ds_load ~ date, data=df)
fit_b <- forecast(fit$fitted.values, h=20) ## forecast using lm()
return(data.frame(c(fit$fitted.values, fit_b[["mean"]])))
}
fcast <- lapply(df, my_fun)
I know the above function doesn't work, but basically I'm looking to get both the fitted and the forecasted values for grouped data.
I've also tried other methods, e.g. tslm() after converting to time-series data, but no luck; I can get lm() to work on just one customer, though. Also, many questions/posts cover only fitting the model, but I would like to forecast at the same time.
lm() fits a regression model,
but here you have a time series, so to forecast the series you would use one of the time-series models (ARMA, ARCH, GARCH, ...).
You can use the auto.arima() function from the "forecast" package in R.
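A minimal sketch for a single customer, using the long data from the question (the weekly frequency and the horizon h are assumptions):

library(forecast)
sales_a35 <- df$sales[df$Customer_key == "A35"]  # one customer's series
fit <- auto.arima(ts(sales_a35, frequency = 52)) # weekly data
fc <- forecast(fit, h = 20)                      # 20 steps ahead
fc$mean                                          # the point forecasts

To cover all customers you could lapply this over split(df$sales, df$Customer_key).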
I don't know what you're up to exactly, but you could make this less complicated.
Using by avoids the need to reshape your data: it splits the data, e.g. by customer ID as in your case, and applies a function to the subsets (i.e. it's a combination of split and lapply; see ?by).
Since you want to compare fitted and forecasted values in your result, you probably need predict rather than $fitted.values; otherwise the vectors won't be of the same length. Because your independent variable is a date in weekly intervals, you may use seq.Date with the first date as starting value; the sequence has length equal to the number of actual values (nrow per customer) plus the h= argument of the forecast.
For demonstration purposes I add the fitted values as first column in the following.
library(forecast)  ## needed for forecast()
res <- by(dat, dat$cus_key, function(x) {
H <- 20 ## define the forecast horizon 'h' once
fit <- lm(sales ~ date, x)
fitted <- fit$fitted.values
pred <- predict(fit, newdata=data.frame(
date=seq(x$date[1], length.out= nrow(x) + H, by="week")))
fcst <- c(fitted, forecast(fitted, h=H)$mean)
fit.na <- `length<-`(unname(fitted), length(pred)) ## for demonstration
return(cbind(fit.na, pred, fcst))
})
Result
res
# dat$cus_key: A28
# fit.na pred fcst
# 1 41.4 41.4 41.4
# 2 47.4 47.4 47.4
# 3 53.4 53.4 53.4
# 4 59.4 59.4 59.4
# 5 65.4 65.4 65.4
# 6 NA 71.4 71.4
# 7 NA 77.4 77.4
# 8 NA 83.4 83.4
# 9 NA 89.4 89.4
# 10 NA 95.4 95.4
# 11 NA 101.4 101.4
# 12 NA 107.4 107.4
# 13 NA 113.4 113.4
# 14 NA 119.4 119.4
# 15 NA 125.4 125.4
# 16 NA 131.4 131.4
# 17 NA 137.4 137.4
# 18 NA 143.4 143.4
# 19 NA 149.4 149.4
# 20 NA 155.4 155.4
# 21 NA 161.4 161.4
# 22 NA 167.4 167.4
# 23 NA 173.4 173.4
# 24 NA 179.4 179.4
# 25 NA 185.4 185.4
# ----------------------------------------------------------------
# dat$cus_key: B16
# fit.na pred fcst
# 1 49.0 49.0 49.0
# 2 47.7 47.7 47.7
# 3 46.4 46.4 46.4
# 4 45.1 45.1 45.1
# 5 43.8 43.8 43.8
# 6 NA 42.5 42.5
# 7 NA 41.2 41.2
# 8 NA 39.9 39.9
# 9 NA 38.6 38.6
# 10 NA 37.3 37.3
# 11 NA 36.0 36.0
# 12 NA 34.7 34.7
# 13 NA 33.4 33.4
# 14 NA 32.1 32.1
# 15 NA 30.8 30.8
# 16 NA 29.5 29.5
# 17 NA 28.2 28.2
# 18 NA 26.9 26.9
# 19 NA 25.6 25.6
# 20 NA 24.3 24.3
# 21 NA 23.0 23.0
# 22 NA 21.7 21.7
# 23 NA 20.4 20.4
# 24 NA 19.1 19.1
# 25 NA 17.8 17.8
# ----------------------------------------------------------------
# dat$cus_key: C12
# fit.na pred fcst
# 1 56.4 56.4 56.4
# 2 53.2 53.2 53.2
# 3 50.0 50.0 50.0
# 4 46.8 46.8 46.8
# 5 43.6 43.6 43.6
# 6 NA 40.4 40.4
# 7 NA 37.2 37.2
# 8 NA 34.0 34.0
# 9 NA 30.8 30.8
# 10 NA 27.6 27.6
# 11 NA 24.4 24.4
# 12 NA 21.2 21.2
# 13 NA 18.0 18.0
# 14 NA 14.8 14.8
# 15 NA 11.6 11.6
# 16 NA 8.4 8.4
# 17 NA 5.2 5.2
# 18 NA 2.0 2.0
# 19 NA -1.2 -1.2
# 20 NA -4.4 -4.4
# 21 NA -7.6 -7.6
# 22 NA -10.8 -10.8
# 23 NA -14.0 -14.0
# 24 NA -17.2 -17.2
# 25 NA -20.4 -20.4
As you can see, prediction and forecast yield the same values, since both methods are based on the same single explanatory variable date in this case.
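If you prefer one data frame over the list-like by result, you can bind the pieces back together; a sketch:

out <- do.call(rbind, lapply(names(res), function(k)
  data.frame(cus_key = k, res[[k]], row.names = NULL)))
head(out)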
Toy data:
set.seed(42)
dat <- transform(expand.grid(cus_key=paste0(LETTERS[1:3], sample(12:43, 3)),
date=seq.Date(as.Date("2018-05-13"), length.out=5, by="week")),
sales=sample(20:80, 15, replace=TRUE))
I have a pretty large dataset with a column for time in seconds, and I want to combine rows whose times are close together (0.1-0.2 seconds apart), taking the mean.
Here is an example of how the data looks:
BPM seconds
63.9 61.899
63.9 61.902
63.8 61.910
62.1 130.94
62.1 130.95
61.8 211.59
63.8 280.5
60.3 290.4
So I would want to combine the first 3 rows, then the 2 rows following those, and the rest would stand alone. Meaning I would want the data to look like this:
BPM seconds
63.9 61.904
62.1 130.95
61.8 211.59
63.8 280.5
60.3 290.4
We need to create groups; this is the important bit, the rest is standard aggregation:
cumsum(!c(0, diff(df1$seconds)) < 0.2)
# [1] 0 0 0 1 1 2 3 4
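Step by step on the example data, to see how this expression builds the groups:

d <- diff(df1$seconds)      # gaps between consecutive rows
# [1]  0.003  0.008 69.030  0.010 80.640 68.910  9.900
new_group <- !c(0, d) < 0.2 # TRUE wherever a gap >= 0.2 starts a new group
cumsum(new_group)           # running group id; the leading 0 keeps row 1 in group 0
# [1] 0 0 0 1 1 2 3 4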
Then aggregate using aggregate:
aggregate(df1[, 2], list(cumsum(!c(0, diff(df1$seconds)) < 0.2)), mean)
# Group.1 x
# 1 0 61.90367
# 2 1 130.94500
# 3 2 211.59000
# 4 3 280.50000
# 5 4 290.40000
Or use dplyr:
library(dplyr)
df1 %>%
group_by(myGroup = cumsum(!c(0, diff(seconds)) < 0.2)) %>%
summarise(BPM = first(BPM),
seconds = mean(seconds))
# # A tibble: 5 x 3
# myGroup BPM seconds
# <int> <dbl> <dbl>
# 1 0 63.9 61.9
# 2 1 62.1 131.
# 3 2 61.8 212.
# 4 3 63.8 280.
# 5 4 60.3 290.
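If you want BPM averaged as well (the question says "combine ... as a mean") rather than taking the first value, swap first() for mean():

df1 %>%
  group_by(myGroup = cumsum(!c(0, diff(seconds)) < 0.2)) %>%
  summarise(BPM = mean(BPM),
            seconds = mean(seconds))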
Reproducible example data:
df1 <- read.table(text = "BPM seconds
63.9 61.899
63.9 61.902
63.8 61.910
62.1 130.94
62.1 130.95
61.8 211.59
63.8 280.5
60.3 290.4", header = TRUE)
I have a problem with my code and I can't trace the error. I have coordinate data (coor, a 40 by 2 matrix) as below, and rainfall data (a 14610 by 40 matrix).
No Longitude Latitude
1 100.69 6.34
2 100.77 6.24
3 100.39 6.11
4 100.43 5.53
5 100.39 5.38
6 101.00 5.71
7 101.06 5.30
8 100.80 4.98
9 101.17 4.48
10 102.26 6.11
11 102.22 5.79
12 102.28 5.31
13 102.02 5.38
14 101.97 4.88
15 102.95 5.53
16 103.13 5.32
17 103.06 4.94
18 103.42 4.76
19 103.42 4.23
20 102.38 4.24
21 101.94 4.23
22 103.04 3.92
23 103.36 3.56
24 102.66 3.03
25 103.19 2.89
26 101.35 3.70
27 101.41 3.37
28 101.75 3.16
29 101.39 2.93
30 102.07 3.09
31 102.51 2.72
32 102.26 2.76
33 101.96 2.74
34 102.19 2.36
35 102.49 2.29
36 103.02 2.38
37 103.74 2.26
38 103.97 1.85
39 103.72 1.76
40 103.75 1.47
# rainfall: 14610 by 40 matrix
# coor: 40 by 2 matrix
library(geosphere)  # for distHaversine
my_prog=function(rainrain,coordinat,misss,distance)
{
rain3<-rainrain # target station i**
# neighboring stations for target station i
a=coordinat # target station i**
diss=as.matrix(distHaversine(a,coor,r=6371))
mmdis=sort(diss,decreasing=F,index.return=T)
mdis=as.matrix(mmdis$x)
mdis1=as.matrix(mmdis$ix)
dist=cbind(mdis,mdis1)
# NA creation
# create missing values in rainfall data
set.seed(100)
b=sample(1:nrow(rain3),(misss*nrow(rain3)),replace=F)
k=replace(rain3,b,NA)
# pick i closest stations
neig=mdis1[distance] # neighbouring selection distance
# target (with NA) and their neighbors
rainB=rainfal00[,neig]
rainA0=rainB[,2:ncol(rainB)]
rainA<-as.matrix(cbind(k,rainA0))
rain2=na.omit(rainA)
x=as.matrix(rain2[,1]) # used to calculate the correlation
n1=ncol(rainA)-1
#1) normal ratio(nr)
jum=as.matrix(apply(rain2,2,mean))
nr0=(jum[1]/jum)
nr=as.matrix(nr0[2:nrow(nr0),])
m01=as.matrix(rainA[is.na(k),])
m1=m01[,2:ncol(m01)]
out1=as.matrix(sapply(seq_len(nrow(m1)),
function(i) sum(nr*m1[i,],na.rm=T)/n1))
print(out1)
}
impute=my_prog(rainrain=rainfall[,1],coordinat=coor[1,],misss=0.05,distance=mdis<200)
I have run this code and the output obtained is:
Error in my_prog(rainrain = rainfal00[, 1], misss = 0.05, coordinat = coor[1, :
object 'mdis' not found
I have checked the program, but cannot trace the problem. I would really appreciate it if someone could help me.
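For what it's worth, the error can be read off the call alone: mdis is created inside my_prog, but the call tries to evaluate mdis < 200 outside the function (to build the distance argument), where no mdis exists. A sketch of one fix, with an illustrative argument name: pass the cutoff itself and do the comparison inside the function, where mdis is available.

my_prog=function(rainrain,coordinat,misss,cutoff_km)
{
# ... everything up to the computation of mdis and mdis1 unchanged ...
neig=mdis1[mdis<cutoff_km] # was: mdis1[distance]
# ... rest unchanged ...
}
impute=my_prog(rainrain=rainfall[,1],coordinat=coor[1,],misss=0.05,cutoff_km=200)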
I have a data frame with 18528 rows and 3 columns, like below:
Sample Target Value
100 A 21.5
100 A 20.5
100 B 19.5
100 B 19.75
100 B 18.15
100 B 21.95
200 A 21.1
200 A 21.6
200 B 23.5
200 B 20.75
100 C 21.25
100 C 22.0
100 C 18.33
100 C 21.84
I need to calculate the difference between each value and the next one within each Sample/Target group (the last value in a group is compared with the first):
Sample Target Value dif
100 A 21.5 1
100 A 20.5 1
100 B 19.5 0.25
100 B 19.75 1.6
100 B 18.15 3.8
100 B 21.95 2.45
200 A 21.1 0.5
200 A 21.6 0.5
200 B 23.5 2.75
200 B 20.75 2.75
100 C 21.25 0.75
100 C 22.0 3.67
100 C 18.33 3.51
100 C 21.84 0.59
Then, if the difference is more than 2, set that Value to NA, like:
Sample Target Value dif
100 A 21.5 1
100 A 20.5 1
100 B 19.5 0.25
100 B 19.75 1.6
100 B 18.15 3.8
100 B NA 2.45
200 A 21.1 0.5
200 A 21.6 0.5
200 B NA 2.75
200 B NA 2.75
100 C 21.25 0.75
100 C 22.0 3.67
100 C NA 3.51
100 C 21.84 0.59
I used combn to calculate the differences, but I got an error; I think the reason may be the different group lengths (2 and 4).
Thanks in advance
You can get the desired output using the dplyr package. If you don't have it installed, first run install.packages("dplyr") or install it manually.
Then what we have:
require("dplyr")
mydf <- read.table(text = "
Sample Target Value
100 A 21.5
100 A 20.5
100 B 19.5
100 B 19.75
100 B 18.15
100 B 21.95
200 A 21.1
200 A 21.6
200 B 23.5
200 B 20.75
100 C 21.25
100 C 22.0
100 C 18.33
100 C 21.84", header = T)
mydf1 <- mydf %>% group_by(Sample, Target) %>%
mutate(ValueShifted = c(Value[-1], Value[1]) ) %>%
mutate(dif = abs(Value - ValueShifted) ) %>%
mutate(NewValue = c(1, NA)[(as.numeric(dif > 2)+1)] * Value )
> mydf1
Source: local data frame [14 x 6]
Groups: Sample, Target
Sample Target Value ValueShifted dif NewValue
1 100 A 21.50 20.50 1.00 21.50
2 100 A 20.50 21.50 1.00 20.50
3 100 B 19.50 19.75 0.25 19.50
4 100 B 19.75 18.15 1.60 19.75
5 100 B 18.15 21.95 3.80 NA
6 100 B 21.95 19.50 2.45 NA
7 200 A 21.10 21.60 0.50 21.10
8 200 A 21.60 21.10 0.50 21.60
9 200 B 23.50 20.75 2.75 NA
10 200 B 20.75 23.50 2.75 NA
11 100 C 21.25 22.00 0.75 21.25
12 100 C 22.00 18.33 3.67 NA
13 100 C 18.33 21.84 3.51 NA
14 100 C 21.84 21.25 0.59 21.84
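The NewValue line uses an indexing trick: c(1, NA)[(dif > 2) + 1] maps FALSE to 1 and TRUE to NA, and multiplying by Value then keeps or blanks the value. A more readable equivalent, as a sketch:

mydf1 <- mydf %>%
  group_by(Sample, Target) %>%
  mutate(dif = abs(Value - c(Value[-1], Value[1])),  # compare with the next value, wrapping around
         NewValue = ifelse(dif > 2, NA_real_, Value)) %>%
  ungroup()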