Transposing and Add column in R in Azure ML Studio

I have the following data set in Azure. Each row is a parameter that is relevant to a forecasting model.
I am relatively new to R. I tried the following code, but it does not give me the expected output. After transposing the data set, I want to add an additional column, "Month-Year".
Can someone help me? Thanks.
Data set
features V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
A 28.21 42.03 48.56 46.85 46.03 54.6 63.87 50 53.34 43.47 34.66 27.48
B 1333 1348.64 1364.28 1379.92 1395.56 1411.2 1426.84 1442.48 1458.11 1473.75 1489.39 1505.03
C 10.05 5.46 4.82 5.27 5.07 4.07 9.53 1.95 6.95 6.54 5.91 0.56
D 18.22 18.41 14.31 30.28 18.16 15.52 12.52 13.14 15.05 8.89 12.51 25.25
R code
# Map 1-based optional input ports to variables
dataset <- maml.mapInputPort(1)
a <- c("A", "B", "C", "D")
data.set <- cbind(a, dataset)
names(data.set)[1] <- c("features")
# first remember the names
n <- dataset$features
# transpose all but the first column (name)
df.aree <- as.data.frame(t(data.set[,-1]))
names(data.set)[1] <- n
df.aree$myfactor <- factor(row.names(df.aree))
maml.mapOutputPort("df.aree")
Expected result
Month-Year A B C D
01-01-15 28.21 1333 10.05 18.22
01-02-15 42.03 1348.64 5.46 18.41
01-03-15 48.56 1364.28 4.82 14.31
01-04-15 46.85 1379.92 5.27 30.28
01-05-15 46.03 1395.56 5.07 18.16
01-06-15 54.6 1411.2 4.07 15.52
01-07-15 63.87 1426.84 9.53 12.52
01-08-15 50 1442.48 1.95 13.14
01-09-15 53.34 1458.11 6.95 15.05
01-10-15 43.47 1473.75 6.54 8.89
01-11-15 34.66 1489.39 5.91 12.51
01-12-15 27.48 1505.03 0.56 25.25

Create "MonYear" using seq with from and to dates.
MonYear <- format(seq(as.Date('2015-01-01'), as.Date('2015-12-01'),
by = 'month'), '%d-%m-%y')
Transpose the numeric columns of the original dataset, here called df1 (the output will be a matrix). We create a data.frame by combining 'MonYear' and the matrix output.
df2 <- data.frame(MonYear,t(df1[-1]))
Change the column names and row names accordingly
colnames(df2)[-1] <- LETTERS[1:4]
row.names(df2) <- NULL
df2
MonYear A B C D
1 01-01-15 28.21 1333.00 10.05 18.22
2 01-02-15 42.03 1348.64 5.46 18.41
3 01-03-15 48.56 1364.28 4.82 14.31
4 01-04-15 46.85 1379.92 5.27 30.28
5 01-05-15 46.03 1395.56 5.07 18.16
6 01-06-15 54.60 1411.20 4.07 15.52
7 01-07-15 63.87 1426.84 9.53 12.52
8 01-08-15 50.00 1442.48 1.95 13.14
9 01-09-15 53.34 1458.11 6.95 15.05
10 01-10-15 43.47 1473.75 6.54 8.89
11 01-11-15 34.66 1489.39 5.91 12.51
12 01-12-15 27.48 1505.03 0.56 25.25
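The steps above can be sketched end to end on a cut-down copy of the data. This is only an illustration under assumptions: df1 is rebuilt by hand with three of the twelve value columns, the start date 2015-01-01 is taken from the expected result, and the column labels are read from the features column instead of being hard-coded with LETTERS.

```r
# Hypothetical cut-down copy of the question's data (V1..V3 only)
df1 <- data.frame(
  features = c("A", "B", "C", "D"),
  V1 = c(28.21, 1333.00, 10.05, 18.22),
  V2 = c(42.03, 1348.64, 5.46, 18.41),
  V3 = c(48.56, 1364.28, 4.82, 14.31),
  stringsAsFactors = FALSE
)

# One date per value column, starting in January 2015
MonYear <- format(
  seq(as.Date("2015-01-01"), by = "month", length.out = ncol(df1) - 1),
  "%d-%m-%y"
)

# Transpose everything except the label column, then prepend the dates
df2 <- data.frame(MonYear, t(df1[-1]), stringsAsFactors = FALSE)

# Take the new column names from the original label column
colnames(df2)[-1] <- df1$features
row.names(df2) <- NULL
df2
```

Deriving the number of dates from ncol(df1) - 1 keeps the sketch working if the input gains more value columns.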

Related

How to convert a list into a data.frame in R?

I've created a frequency table in R with the fdth package using this code
fdt(x, breaks = "Sturges")
The specific result was:
Class limits f rf rf(%) cf cf(%)
[-15.907,-11.817) 12 0.00 0.10 12 0.10
[-11.817,-7.7265) 8 0.00 0.07 20 0.16
[-7.7265,-3.636) 6 0.00 0.05 26 0.21
[-3.636,0.4545) 70 0.01 0.58 96 0.79
[0.4545,4.545) 58 0.00 0.48 154 1.27
[4.545,8.6355) 91 0.01 0.75 245 2.01
[8.6355,12.726) 311 0.03 2.55 556 4.57
[12.726,16.817) 648 0.05 5.32 1204 9.89
[16.817,20.907) 857 0.07 7.04 2061 16.93
[20.907,24.998) 1136 0.09 9.33 3197 26.26
[24.998,29.088) 1295 0.11 10.64 4492 36.90
[29.088,33.179) 1661 0.14 13.64 6153 50.55
[33.179,37.269) 2146 0.18 17.63 8299 68.18
[37.269,41.36) 2525 0.21 20.74 10824 88.92
[41.36,45.45) 1349 0.11 11.08 12173 100.00
It was given as a list:
> class(x)
[1] "fdt.multiple" "fdt" "list"
I need to convert it into a data frame object, so I can have a table. How can I do it?
I'm a beginner at using R :(

Since you did not provide a reproducible example of your data, I have used an example from the help page ?fdt that is close to what you have.
library(fdth)
mdf <- data.frame(c1=sample(LETTERS[1:3], 1e2, TRUE),
c2=as.factor(sample(1:10, 1e2, TRUE)),
n1=c(NA, NA, rnorm(96, 10, 1), NA, NA),
n2=rnorm(100, 60, 4),
n3=rnorm(100, 50, 4),
stringsAsFactors=TRUE)
fdt <- fdt(mdf,breaks='FD',by='c1')
class(fdt)
#[1] "fdt.multiple" "fdt" "list"
You can extract the table part from each list and bind them together.
result <- purrr::map_df(fdt, `[[`, 'table')
#In base R
#result <- do.call(rbind, lapply(fdt, `[[`, 'table'))
result
# Class limits f rf rf(%) cf cf(%)
#1 [8.1781,9.1041) 5 0.20833333 20.833333 5 20.833333
#2 [9.1041,10.03) 6 0.25000000 25.000000 11 45.833333
#3 [10.03,10.956) 10 0.41666667 41.666667 21 87.500000
#4 [10.956,11.882) 3 0.12500000 12.500000 24 100.000000
#5 [53.135,56.121) 4 0.16000000 16.000000 4 16.000000
#6 [56.121,59.107) 8 0.32000000 32.000000 12 48.000000
#7 [59.107,62.092) 8 0.32000000 32.000000 20 80.000000
#....
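The base R variant can be tried without installing fdth by using a mock object. This is a sketch under assumptions: mock_fdt is a hypothetical stand-in that only mimics the structure used above (a list whose elements each hold a data frame named table), and the extra variable column is an addition so that each row can be traced back to its list element.

```r
# Hypothetical stand-in for an fdt.multiple result: a list of lists,
# each holding a data frame under the name "table"
mock_fdt <- list(
  n1 = list(table = data.frame(limits = c("[8,9)", "[9,10)"), f = c(5, 6))),
  n2 = list(table = data.frame(limits = c("[53,56)", "[56,59)"), f = c(4, 8)))
)

# Pull out the "table" element of every list entry
tables <- lapply(mock_fdt, `[[`, "table")

# Tag each table with the name of its list element, then stack them
tagged <- Map(function(tab, nm) cbind(variable = nm, tab), tables, names(tables))
result <- do.call(rbind, tagged)
row.names(result) <- NULL
result
```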

R moving average

As an example I use the Boston data with 3 columns (id (added), medv, lstat) and 506 observations.
I want to calculate a moving average for k-1 observations for the variable medv. This means that the mean value should be calculated over all observations except a certain row. For id 1, the mean value is calculated from line 2-506. For id 2, the mean value is calculated over line 1 + 3-506. For id 3, the mean value is calculated over the lines 1-2 + 4-506 and so on.
In a second step the calculation of the mean value should be conditional, e.g. above the median and below the median in two different columns. This means that we first check whether a value within each column (medv and lstat) is above or below the median. If the value in medv is above the median, we calculate the mean value of lstat from the values that are above the median in lstat. If the value in medv is below the median, we calculate the mean value of lstat from the values that are below the median. See example table below for the first 10 rows. The median for the first 10 rows is 25.55 for medv and 7.24 for lstat.
Here is the data:
library(mlbench)
data(BostonHousing)
df <- BostonHousing
df$id <- seq.int(nrow(df))
df <- subset(df, select = c(id, medv, lstat))
id medv lstat mean1out meancond
1 24.0 4.98 26.66667 4.50
2 21.6 9.14 26.93333 4.50
3 34.7 4.03 25.47778 17.55
4 33.4 2.94 25.62222 17.55
5 36.2 5.33 25.31111 17.55
6 28.7 5.21 26.14444 17.55
7 22.9 12.43 26.78889 4.50
8 27.1 19.15 26.32222 17.55
9 16.5 29.93 27.50000 4.50
10 18.9 17.10 27.23333 4.50

The first part of the problem is already solved by @r2evans.
For the second part, we can calculate the medians of lstat and medv, compare each medv value with its median, and assign the corresponding lstat mean.
# First part, from @r2evans's answer.
n <- nrow(df)
df$mean1out <- (mean(df$medv)*n - df$medv)/(n-1)
#Second part
med_lsat <- median(df$lstat)
med_medv <- median(df$medv)
higher_lsat <- mean(df$lstat[df$lstat > med_lsat])
lower_lsat <- mean(df$lstat[df$lstat < med_lsat])
df$meancond <- ifelse(df$medv > med_medv, higher_lsat, lower_lsat)
df
# id medv lstat mean1out meancond
#1 1 24.0 4.98 26.66667 4.498
#2 2 21.6 9.14 26.93333 4.498
#3 3 34.7 4.03 25.47778 17.550
#4 4 33.4 2.94 25.62222 17.550
#5 5 36.2 5.33 25.31111 17.550
#6 6 28.7 5.21 26.14444 17.550
#7 7 22.9 12.43 26.78889 4.498
#8 8 27.1 19.15 26.32222 17.550
#9 9 16.5 29.93 27.50000 4.498
#10 10 18.9 17.10 27.23333 4.498
data
df <- BostonHousing
df$id <- seq.int(nrow(df))
df <- subset(df, select = c(id, medv, lstat))
df <- head(df, 10)

mean(dat$medv[-3])
# [1] 25.47778
sapply(seq_len(nrow(dat)), function(i) mean(dat$medv[-i]))
# [1] 26.66667 26.93333 25.47778 25.62222 25.31111 26.14444 26.78889 26.32222 27.50000 27.23333
Alternatively (mathematically), without the sapply, you can get the same numbers this way:
n <- nrow(dat)
(mean(dat$medv)*n - dat$medv)/(n-1)
# [1] 26.66667 26.93333 25.47778 25.62222 25.31111 26.14444 26.78889 26.32222 27.50000 27.23333
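The closed form works because the mean of all values except x_i equals (sum(x) - x_i) / (n - 1), and sum(x) is the same as mean(x) * n. A small check of that identity, using the ten medv values from the question typed in by hand:

```r
# The first ten medv values from the question
medv <- c(24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9)
n <- length(medv)

# Leave-one-out mean, computed explicitly for each row
loo_loop <- sapply(seq_len(n), function(i) mean(medv[-i]))

# Same quantity from the closed form: remove x_i from the total, divide by n - 1
loo_closed <- (sum(medv) - medv) / (n - 1)

all.equal(loo_loop, loo_closed)
```

The closed form avoids recomputing a mean for every row, which matters more for the full 506 observations than for this ten-row sample.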
For your conditional mean, a simple ifelse works:
n <- nrow(dat)
transform(
dat,
a = (mean(dat$medv)*n - dat$medv)/(n-1),
b = ifelse(medv <= median(medv),
mean(lstat[ lstat <= median(lstat) ]),
mean(lstat[ lstat > median(lstat) ]))
)
# id medv lstat mean1out meancond a b
# 1 1 24.0 4.98 26.66667 4.50 26.66667 4.498
# 2 2 21.6 9.14 26.93333 4.50 26.93333 4.498
# 3 3 34.7 4.03 25.47778 17.55 25.47778 17.550
# 4 4 33.4 2.94 25.62222 17.55 25.62222 17.550
# 5 5 36.2 5.33 25.31111 17.55 25.31111 17.550
# 6 6 28.7 5.21 26.14444 17.55 26.14444 17.550
# 7 7 22.9 12.43 26.78889 4.50 26.78889 4.498
# 8 8 27.1 19.15 26.32222 17.55 26.32222 17.550
# 9 9 16.5 29.93 27.50000 4.50 27.50000 4.498
# 10 10 18.9 17.10 27.23333 4.50 27.23333 4.498
(I'm inferring that the differences are rounding errors on data entry.)

How to calculate average for columns?

I need to find the average of every 6 months, from V1 to V15. My code below works only because I know there are 15 columns. In practice there will be more than 15 columns, so I need generic code that handles any number of columns.
The logic I am using is: take the average of columns 1:6 and print it, then 2:7, and so on up to column 15, since I know there are 15 columns. The actual data will have more columns.
csv file:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.11 0.04 0.04 0.04 0.04 0.04 0.04 0.04
3 3.29 3.56 3.97 3.23 2.96 2.35 0.06 1.72 2.19 1.92 1.84 2.87 2.57 2.24 3.06
4 11.79 15.01 14.76 13.19 18.29 4.51 16.24 11.92 10.49 13.05 12.74 12.95 12.25 14.46 14.27
5 20.11 21.76 21.92 23.67 19.87 25.59 23.04 16.67 22.78 21.32 20.85 21.57 21.99 22.69 22.96
6 24.85 26.56 29.45 24.96 25.91 16.31 27.51 22.56 28.35 26.96 26.53 28.23 28.24 29.85 29.79
7 29.02 32.75 29.95 27.7 29.6 17.91 32.08 25.71 33.16 31.56 30.89 32.68 34.05 36.26 33.27
8 32.83 33.09 17.03 33.23 31.22 39.71 35.43 28.77 37.09 34.18 34.05 36.98 37.16 38.74 37.32
9 32.86 36.34 35.47 33.6 35 42.79 37.22 30.62 38.74 35.83 36.17 39.48 39.18 42.87 39.54
10 36.02 37.66 36.15 34.79 36.84 22.19 38.9 32.62 40.28 37.87 38.09 41.04 41.62 44.94 42.18
11 36.96 39.22 19.13 36.68 37.43 46.26 40.84 33.88 41.31 39.09 39.14 43.46 42.75 47.2 43.8
12 37.34 40.87 35.91 37.66 39.22 46.95 42.26 35.19 42.93 41 40.61 44.73 45.2 48.14 44.49
13 38.92 38.37 41.01 39.01 41 48.89 43.8 37.16 44.1 42.46 41.3 45.47 46.65 50.48 47.6
14 21.67 43.16 20.98 39.84 42 49.62 44.35 37.46 44.63 43.15 42.64 48.48 48.53 53.55 48.57
a <- t(apply(mat,1,function(x){ c(mean(x[1:6]),mean(x[2:7]),mean(x[3:8]),mean(x[4:9]),mean(x[5:10]),mean(x[6:11]),mean(x[7:12]),mean(x[8:13]),mean(x[9:14]),mean(x[10:15])) }))
Please help. Thanks in advance.

We can do this with a rolling mean (rollmean from the zoo package):
library(zoo)
t(apply(df1, 1, function(x) rollmean(x, 6)))

Using base R:
n=6
d=lapply(1:(ncol(data)-(n-1)),function(x) x:(x+n-1))
sapply(d,function(w) rowMeans(data[,w]))

Another base solution:
rollingRowMeans <- function(matrix, n_meanrows){
  out <- NULL
  # one window per starting column; the last window starts at ncol - n_meanrows + 1
  for(z in 1:(ncol(matrix)-n_meanrows+1)){
    out <- cbind(out, rowMeans(matrix[,z:(z+n_meanrows-1)]))
  }
  return(out)
}
mat <- matrix(rnorm(15*14, 1,10), ncol=15, nrow=14)
rollingRowMeans(mat, 6)
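As a sanity check, the generic window approach can be compared with the hard-coded version from the question on a small matrix. This is a sketch with assumed inputs: the matrix is random, has 9 columns instead of 15, and k is the window width.

```r
set.seed(1)
mat <- matrix(rnorm(4 * 9), nrow = 4, ncol = 9)
k <- 6

# Generic version: one window per possible starting column
starts <- seq_len(ncol(mat) - k + 1)
roll <- sapply(starts, function(s) rowMeans(mat[, s:(s + k - 1), drop = FALSE]))

# Hard-coded version in the style of the question, for 9 columns
hard <- t(apply(mat, 1, function(x) {
  c(mean(x[1:6]), mean(x[2:7]), mean(x[3:8]), mean(x[4:9]))
}))

all.equal(roll, hard)
```

The number of windows is always ncol(mat) - k + 1, so the same code covers any number of columns of at least k.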

Error in producing the output

I have a problem with my code and can't trace the error. I have coor data (a 40 by 2 matrix) as below and rainfall data (a 14610 by 40 matrix).
No Longitude Latitude
1 100.69 6.34
2 100.77 6.24
3 100.39 6.11
4 100.43 5.53
5 100.39 5.38
6 101.00 5.71
7 101.06 5.30
8 100.80 4.98
9 101.17 4.48
10 102.26 6.11
11 102.22 5.79
12 102.28 5.31
13 102.02 5.38
14 101.97 4.88
15 102.95 5.53
16 103.13 5.32
17 103.06 4.94
18 103.42 4.76
19 103.42 4.23
20 102.38 4.24
21 101.94 4.23
22 103.04 3.92
23 103.36 3.56
24 102.66 3.03
25 103.19 2.89
26 101.35 3.70
27 101.41 3.37
28 101.75 3.16
29 101.39 2.93
30 102.07 3.09
31 102.51 2.72
32 102.26 2.76
33 101.96 2.74
34 102.19 2.36
35 102.49 2.29
36 103.02 2.38
37 103.74 2.26
38 103.97 1.85
39 103.72 1.76
40 103.75 1.47
rainfall= 14610 by 40 matrix;
coor= 40 by 2 matrix
my_prog=function(rainrain,coordinat,misss,distance)
{
rain3<-rainrain # target station i**
# neighboring stations for target station i
a=coordinat # target station i**
diss=as.matrix(distHaversine(a,coor,r=6371))
mmdis=sort(diss,decreasing=F,index.return=T)
mdis=as.matrix(mmdis$x)
mdis1=as.matrix(mmdis$ix)
dist=cbind(mdis,mdis1)
# NA creation
# create missing values in rainfall data
set.seed(100)
b=sample(1:nrow(rain3),(misss*nrow(rain3)),replace=F)
k=replace(rain3,b,NA)
# pick i closest stations
neig=mdis1[distance] # neighbouring selection distance
# target (with NA) and their neighbors
rainB=rainfal00[,neig]
rainA0=rainB[,2:ncol(rainB)]
rainA<-as.matrix(cbind(k,rainA0))
rain2=na.omit(rainA)
x=as.matrix(rain2[,1]) # used to calculate the correlation
n1=ncol(rainA)-1
#1) normal ratio(nr)
jum=as.matrix(apply(rain2,2,mean))
nr0=(jum[1]/jum)
nr=as.matrix(nr0[2:nrow(nr0),])
m01=as.matrix(rainA[is.na(k),])
m1=m01[,2:ncol(m01)]
out1=as.matrix(sapply(seq_len(nrow(m1)),
function(i) sum(nr*m1[i,],na.rm=T)/n1))
print(out1)
}
impute=my_prog(rainrain=rainfall[,1],coordinat=coor[1,],misss=0.05,distance=mdis<200)
I have run this code and the output obtained is:
Error in my_prog(rainrain = rainfal00[, 1], misss = 0.05, coordinat = coor[1, :
object 'mdis' not found
I have checked the program, but cannot trace the problem. I would really appreciate if someone could help me.
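One likely cause, going by the error message alone: the argument distance=mdis<200 is evaluated in the environment of the call, where mdis does not exist, because mdis is only created inside my_prog. A common way around this is to pass the threshold as a plain number and apply it inside the function. The sketch below illustrates that pattern only; pick_neighbours, max_dist, and the distances are hypothetical and are not part of the original program.

```r
# Hypothetical helper: return station indices ordered by distance,
# keeping only those closer than max_dist
pick_neighbours <- function(dists, max_dist) {
  ord <- order(dists)          # station indices, nearest first
  ord[dists[ord] < max_dist]   # the threshold is applied inside the function
}

# Made-up distances (km) from a target station to five stations
dists <- c(0, 350, 120, 80, 510)

# The caller passes a number, not an expression on an internal variable
pick_neighbours(dists, max_dist = 200)
```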

Merging two dataframes in R with date

I have the following 2 dataframes:
> bvg1
Parameters X18.Oct.14 X19.Oct.14 X20.Oct.14 X21.Oct.14 X22.Oct.14 X23.Oct.14 X24.Oct.14
1 24K Equivalent Plan 29.00 29.60 33.80 36.60 35.30 31.90 29.00
2 24K Equivalent Act 28.80 31.00 35.40 35.90 34.70 33.40 31.90
3 Plan Rep WS 2463.00 2513.00 2869.00 3115.00 2999.00 2714.00 2468.00
4 Act Rep WS 2447.00 2633.00 3013.00 3054.00 2953.00 2842.00 2714.00
5 Rep WS Var -16.00 120.00 144.00 -61.00 -46.00 128.00 246.00
6 Plan Rep Intakes 568.00 461.00 1159.00 1146.00 1126.00 1124.00 1106.00
7 Act Rep Intakes 707.00 494.00 1106.00 1096.00 1274.00 1087.00 1101.00
8 Rep Intakes Var 139.00 33.00 -53.00 -50.00 148.00 -37.00 -5.00
9 Plan Rep Comps_DL 468.00 54.00 836.00 1190.00 1327.00 1286.00 1108.00
10 Act Rep Comps_DL 471.00 70.00 995.00 1137.00 1323.00 1150.00 1073.00
11 Rep Comps Var_DL 3.00 16.00 159.00 -53.00 -4.00 -136.00 -35.00
12 Plan Rep Mandays_DL 148.00 19.00 260.00 368.00 412.00 398.00 345.00
13 Act Rep Mandays_DL 147.00 19.00 303.00 359.00 423.00 374.00 348.00
14 Rep Mandays Var_DL -1.00 1.00 43.00 -9.00 12.00 -24.00 3.00
15 Plan FVR Mandays_DL 0.00 0.00 4.00 18.00 18.00 18.00 18.00
16 Act FVR Mandays_DL 0.00 0.00 4.00 7.00 8.00 8.00 7.00
17 FVR Mandays Var_DL 0.00 0.00 0.00 -11.00 -10.00 -10.00 -11.00
18 Plan Rep Prod_DL 3.16 2.88 3.21 3.23 3.22 3.23 3.21
19 Act Rep Prod_DL 3.21 3.62 3.28 3.16 3.12 3.07 3.08
20 Rep Prod Var_DL 0.05 0.74 0.07 -0.07 -0.10 -0.16 -0.13
> bvg2
Parameters X18.Oct X19.Oct X20.Oct X21.Oct X22.Oct X23.Oct X24.Oct
1 24K Equivalent Plan 30.50 31.30 35.10 36.10 33.60 28.80 25.50
2 24K Equivalent Act 31.40 33.40 36.60 38.10 36.80 34.40 32.10
3 Plan Rep WS 3419.00 3509.00 3933.00 4041.00 3764.00 3220.00 2859.00
4 Act Rep WS 3514.00 3734.00 4098.00 4271.00 4122.00 3852.00 3591.00
5 Rep WS Var 95.00 225.00 165.00 230.00 358.00 632.00 732.00
6 Plan Rep Intakes 813.00 613.00 1559.00 1560.00 1506.00 1454.00 1410.00
7 Act Rep Intakes 964.00 602.00 1629.00 1532.00 1657.00 1507.00 1439.00
8 Rep Intakes Var 151.00 -11.00 70.00 -28.00 151.00 53.00 29.00
9 Plan Rep Comps_DL 675.00 175.00 1331.00 1732.00 1938.00 1706.00 1493.00
10 Act Rep Comps_DL 718.00 224.00 1389.00 1609.00 1848.00 1698.00 1537.00
11 Rep Comps Var_DL 43.00 49.00 58.00 -123.00 -90.00 -8.00 44.00
12 Plan Rep Mandays_DL 203.00 58.00 428.00 541.00 605.00 536.00 475.00
13 Act Rep Mandays_DL 215.00 63.00 472.00 542.00 608.00 556.00 523.00
14 Rep Mandays Var_DL 12.00 5.00 44.00 2.00 3.00 20.00 48.00
15 Plan FVR Mandays_DL 0.00 0.00 1.00 12.00 2.00 32.00 57.00
16 Act FVR Mandays_DL 0.00 0.00 2.00 2.00 5.00 5.00 5.00
17 FVR Mandays Var_DL 0.00 0.00 1.00 -10.00 3.00 -27.00 -52.00
18 Plan Rep Prod_DL 3.33 3.03 3.11 3.20 3.20 3.18 3.14
19 Act Rep Prod_DL 3.34 3.56 2.94 2.97 3.04 3.05 2.94
20 Rep Prod Var_DL 0.01 0.53 -0.17 -0.23 -0.16 -0.13 -0.20
This is time series data, i.e. 24K Equivalent Plan was 29 on 18th Oct, 29.60 on 19th Oct and 33.80 on 20th Oct. The first dataframe holds the data for one business unit and the second dataframe holds the data for a different business unit.
I want to merge the dataframes into one and analyse the variance, i.e. where they differ in values. I also want to draw ggplots such as two histograms showing the difference, time series plots, etc.
I have tried the following:
I can merge the two dataframes by:
joined = rbind(bvg1, bvg2)
However, I can't identify whether a record belongs to the bvg1 or the bvg2 df.
If I add an additional column, i.e.
bvg1$id = "bvg1"
bvg2$id = "bvg2"
then the rbind command doesn't work and gives the following error:
Error in match.names(clabs, names(xi)) :
names do not match previous names
Any sample code would be much appreciated.

You can make the column names of the two datasets match by stripping the final . and the digits that follow it from the column names of bvg1. This can be done with a regex. The code below uses a lookbehind: (?<=[A-Za-z]) requires a letter immediately before the ., and \\..*$ then matches the . and everything after it up to the end of the string $, which is replaced with "".
colnames(bvg1) <-gsub("(?<=[A-Za-z])\\..*$", "", colnames(bvg1), perl=TRUE)
res <- rbind(bvg1, bvg2)
dim(res)
#[1] 40 9
head(res,3)
# Parameters X18.Oct X19.Oct X20.Oct X21.Oct X22.Oct X23.Oct X24.Oct
#1 24K Equivalent Plan 29.0 29.6 33.8 36.6 35.3 31.9 29.0
#2 24K Equivalent Act 28.8 31.0 35.4 35.9 34.7 33.4 31.9
#3 Plan Rep WS 2463.0 2513.0 2869.0 3115.0 2999.0 2714.0 2468.0
# id
#1 bvg1
#2 bvg1
#3 bvg1
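The effect of the regex is easiest to see on a couple of column names. The sketch below uses cut-down, hand-typed versions of bvg1 and bvg2 (two rows and two date columns each, with shortened Parameters labels), so the data frames are illustrative assumptions and not the full data.

```r
# Hypothetical cut-down versions of the two data frames
bvg1 <- data.frame(Parameters = c("Plan", "Act"),
                   X18.Oct.14 = c(29.0, 28.8),
                   X19.Oct.14 = c(29.6, 31.0))
bvg2 <- data.frame(Parameters = c("Plan", "Act"),
                   X18.Oct = c(30.5, 31.4),
                   X19.Oct = c(31.3, 33.4))

# Mark the origin of each row before stacking
bvg1$id <- "bvg1"
bvg2$id <- "bvg2"

# "X18.Oct.14" becomes "X18.Oct": only a "." preceded by a letter is matched,
# so the "." after "X18" is left alone
colnames(bvg1) <- gsub("(?<=[A-Za-z])\\..*$", "", colnames(bvg1), perl = TRUE)

res <- rbind(bvg1, bvg2)
res
```

Once the names agree, rbind no longer raises the match.names error, and the id column records which business unit each row came from.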
