Computing values for R data frame cells without using for loops

I have an R data frame with the following data:
Serial N year current Average
B 10 14 15
B 10 16 15
C 12 13 12
D 40 20 20
B 11 15 15
C 12 11 12
I would like to add a new column based on the average for each unique serial number, something like:
Serial N year current Average temp
B 10 14 15 (15+12+20)/15
B 10 16 15 (15+12+20)/15
C 12 13 12 (15+12+20)/12
D 40 20 20 (15+12+20)/20
B 11 15 15 (15+12+20)/15
C 12 11 12 (15+12+20)/12
The temp column is the sum of the Average values for each Serial N (one each for B, C and D) divided by the Average value of that row. How can I compute it without using for loops, given that rows 1, 2 and 5 (Serial N: B) share the same Average and temp values? I started with this:
for (i in unique(df$Serial_N))
{
.........
}
but I got stuck as I also need the average for other Serial N. How can I do this?

For example, you can try something like the following (assuming your computation matches):
df$temp <- sum(tapply(df$Average, df$SerialN, mean)) / df$Average
Resulting output:
SerialN year current Average temp
1 B 10 14 15 3.133333
2 B 10 16 15 3.133333
3 C 12 13 12 3.916667
4 D 40 20 20 2.350000
5 B 11 15 15 3.133333
6 C 12 11 12 3.916667

Using unique.data.frame() on the Serial_N and Average columns keeps one row per group, so two groups that happen to share the same Average are still counted separately:
df$temp <- sum((unique.data.frame(df[c("Serial_N","Average")]))$Average) / df$Average

In base R, you can use either
df <- transform(df, temp = sum(tapply(df$Average, df$Serial_N, unique))/df$Average)
or
df$temp <- sum(tapply(df$Average, df$Serial_N, unique))/df$Average
both of which will give you
df
# Serial_N year current Average temp
# 1 B 10 14 15 3.133333
# 2 B 10 16 15 3.133333
# 3 C 12 13 12 3.916667
# 4 D 40 20 20 2.350000
# 5 B 11 15 15 3.133333
# 6 C 12 11 12 3.916667
tapply splits df$Average by the levels of df$Serial_N and then calls unique on each group. Because Average is constant within each serial number, this gives a single average per group, which you can then sum and divide by each row's Average. transform adds the column to the data frame (similar to dplyr::mutate).
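For convenience, here is a self-contained sketch of the tapply approach. The data frame is rebuilt from the table in the question, and the column name Serial_N is an assumption (the question's header shows "Serial N"). The intermediate names group_avg and total are only illustrative.

```r
# Data rebuilt from the question; the column name Serial_N is assumed
df <- data.frame(
  Serial_N = c("B", "B", "C", "D", "B", "C"),
  year     = c(10, 10, 12, 40, 11, 12),
  current  = c(14, 16, 13, 20, 15, 11),
  Average  = c(15, 15, 12, 20, 15, 12)
)

# One Average per serial number (Average is constant within each group)
group_avg <- tapply(df$Average, df$Serial_N, unique)

# Sum of the per-group averages: 15 + 12 + 20
total <- sum(group_avg)

# Vectorised division, no loop needed
df$temp <- total / df$Average
df
```

If Average could vary within a serial number, unique would return more than one value per group, and mean (as in the first answer) would be the safer choice.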

Related

R : Change name of variable using for loop

I have a data frame and several vectors that contain variable names. For each vector, I want to calculate the sum of the variables it names and store the result in a new variable, with a different name for each vector.
Let's say I have this data and three vectors:
>data
Name A B C D E
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 6 9 6
r4 7 8 0 7 18
And I have these vectors, which are generated in a for loop and held in the variable vec:
V1 <- c("A","B","C")
V2 <- c("B","D")
V3 <- c("D","E")
Edit 1:
These vectors are generated in a for loop, and I don't know in advance which vectors will be generated or which elements they will contain; this is just an example. I want to calculate the sum of the variables in each vector and store the result in a new variable in my data frame.
The issue is that I don't know how to give a new name to each variable created (the one that contains the sum for each vector):
data$column[j] <- rowSums(all_data_Second_program[,vec])
j <- j+1
To obtain this result for example
Name A B C Column1 D Column2 E Column3
r1 1 5 12 18 21 26 15 36
r2 2 4 7 13 10 14 9 19
r3 5 15 6 26 9 24 6 15
r4 7 8 0 15 7 15 18 25
But I didn't obtain this result.
Please tell me if you need any more information or clarification.
Can you please tell me how to do that?
Put the vectors in a list and then you can use rowSums in lapply -
list_vec <- list(c("A","B","C"), c("B","D"), c("D","E"))
new_cols <- paste0('Column', seq_along(list_vec))
data[new_cols] <- lapply(list_vec, function(x) rowSums(data[x]))
data
# Name A B C D E Column1 Column2 Column3
#1 r1 1 5 12 21 15 18 26 36
#2 r2 2 4 7 10 9 13 14 19
#3 r3 5 15 6 9 6 26 24 15
#4 r4 7 8 0 7 18 15 15 25
We may use a for loop
for(i in 1:3) {
data[[paste0('Column', i)]] <- rowSums(data[get(paste0('V', i))],
na.rm = TRUE)
}
-output
> data
Name A B C D E Column1 Column2 Column3
1 r1 1 5 12 21 15 18 26 36
2 r2 2 4 7 10 9 13 14 19
3 r3 5 15 6 9 6 26 24 15
4 r4 7 8 0 7 18 15 15 25
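Since the question says the vectors are produced inside a loop and are not known in advance, the new column can also be named at the point where each vector is generated. The following is a minimal sketch under that assumption; the list generated is only a hypothetical stand-in for whatever the real loop produces.

```r
# Data from the question
data <- data.frame(
  Name = paste0("r", 1:4),
  A = c(1, 2, 5, 7),
  B = c(5, 4, 15, 8),
  C = c(12, 7, 6, 0),
  D = c(21, 10, 9, 7),
  E = c(15, 9, 6, 18)
)

# Hypothetical stand-in for the vectors produced by the real loop
generated <- list(c("A", "B", "C"), c("B", "D"), c("D", "E"))

for (j in seq_along(generated)) {
  vec <- generated[[j]]
  # Build the column name from the counter and assign with [[<-
  data[[paste0("Column", j)]] <- rowSums(data[vec])
}
data
```

Assigning with data[[name]] avoids both get() and a manual counter, and it works for any number of vectors.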

How to extract a sample of pairs in grouping variable

My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
To do this, I want to extract a sample of n pairs of cases that are grouped together by variable y
and n pairs of cases that are not grouped together by variable y, in order to calculate the number of
false positives and false negatives (either falsely grouped or falsely not grouped). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6):
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you'd like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data).
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(nrow(x), min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
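Explanation: Split df based on y, then draw up to n rows from each subset x that contains more than one row; rbinding the results gives df.grouped. For df.ungrouped, note that sample(nrow(df.grouped)) returns a random permutation of 1:nrow(df.grouped), so the code as written shuffles the first nrow(df.grouped) rows of df. To draw that many rows from all of df, use df[sample(nrow(df), nrow(df.grouped)), ] instead (the output shown above would then differ).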
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)
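If the goal is to sample pairs of cases rather than single rows, one possible reading of the question is to enumerate all pairs of rows, label each pair as grouped (same y) or not grouped (different y), and sample from each set. The sketch below assumes that reading; sample_pairs is a hypothetical helper, not part of any package, and it reports pairs in a wide layout (x1, y1, x2, y2) rather than the stacked layout in the question.

```r
# Data from the question
df <- data.frame(
  x = 1:20,
  y = c(1, 2, 2, 4, 5, 6, 6, 8, 9, 9, 11, 12, 13, 13, 14, 15, 14, 16, 17, 18)
)

# All unordered pairs of row indices (one pair per row of the matrix)
pairs <- t(combn(nrow(df), 2))

# TRUE where both members of a pair share the same value of y
same_group <- df$y[pairs[, 1]] == df$y[pairs[, 2]]
grouped_pairs   <- pairs[same_group, , drop = FALSE]
ungrouped_pairs <- pairs[!same_group, , drop = FALSE]

# Hypothetical helper: draw up to n pairs and show x and y for both members
sample_pairs <- function(p, n) {
  idx <- sample(nrow(p), min(n, nrow(p)))
  data.frame(x1 = df$x[p[idx, 1]], y1 = df$y[p[idx, 1]],
             x2 = df$x[p[idx, 2]], y2 = df$y[p[idx, 2]])
}

set.seed(2017)  # fixed seed for reproducibility
n <- 3
grouped_sample   <- sample_pairs(grouped_pairs, n)
ungrouped_sample <- sample_pairs(ungrouped_pairs, n)
```

Enumerating all pairs with combn is fine for small data like this, but the number of pairs grows quadratically with the number of rows, so a larger data set would need a different sampling strategy.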

I don't know how to create this tree in R

I would like to maximize revenue by applying the better campaign at each hour.
I would like to create a tree that would help me choose the better campaign.
The data below records the revenue per campaign per hour.
Looking at the data, I can see that campaign A is better between hours 1-12, and that campaign B is better between hours 13-24.
How do I create a tree in R that would tell me that?
hour campaign revenue
1 A 23
1 B 20
2 A 21
2 B 22
3 A 23
3 B 20
4 A 21
4 B 22
5 A 23
5 B 20
6 A 21
6 B 22
7 A 20
7 B 17
8 A 18
8 B 19
9 A 20
9 B 17
10 A 18
10 B 19
11 A 20
11 B 17
12 A 19
12 B 18
13 A 8
13 B 9
14 A 6
14 B 11
15 A 9
15 B 8
16 A 6
16 B 11
17 A 9
17 B 8
18 A 6
18 B 11
19 A 3
19 B 2
20 A 3
20 B 2
21 A 0
21 B 5
22 A 3
22 B 2
23 A 3
23 B 2
24 A 0
24 B 5
I'm not sure what kind of tree you are looking for exactly, but a linear model tree for revenue with regressor campaign and partitioning variable hour might be useful. Using lmtree() in package partykit you can fit a tree that starts out by fitting a linear model with two coefficients (intercept and campaign B effect) and then splits the data as long as there are significant instabilities in at least one of the coefficients:
library("partykit")
(tr <- lmtree(revenue ~ campaign | hour, data = d))
## Linear model tree
##
## Model formula:
## revenue ~ campaign | hour
##
## Fitted party:
## [1] root
## | [2] hour <= 12: n = 24
## | (Intercept) campaignB
## | 20.583333 -1.166667
## | [3] hour > 12: n = 24
## | (Intercept) campaignB
## | 4.666667 1.666667
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
## Number of parameters per node: 2
## Objective function (residual sum of squares): 341.1667
In this (presumably artificial) data, this selects a single split at 12 hours and then has two terminal nodes: one with a negative campaign B effect (i.e., A is better) and one with a positive campaign B effect (i.e., B is better). The fitted tree can be displayed with plot(tr).
The fit also brings out that the split is driven by the change in revenue level and not only by the differing campaign effects (which are fairly small).
The underlying tree algorithm is called "Model-Based Recursive Partitioning" (MOB) and is also applicable to models other than linear regression. See the references in the manual and vignette for more details.
Another algorithm that might be of interest is QUINT (qualitative interaction trees) by Dusseldorp & Van Mechelen, available in the quint package.
For convenient replication of the example above: The d data frame can be recreated by
d <- read.table(textConnection("hour campaign revenue
1 A 23
1 B 20
2 A 21
2 B 22
3 A 23
3 B 20
4 A 21
4 B 22
5 A 23
5 B 20
6 A 21
6 B 22
7 A 20
7 B 17
8 A 18
8 B 19
9 A 20
9 B 17
10 A 18
10 B 19
11 A 20
11 B 17
12 A 19
12 B 18
13 A 8
13 B 9
14 A 6
14 B 11
15 A 9
15 B 8
16 A 6
16 B 11
17 A 9
17 B 8
18 A 6
18 B 11
19 A 3
19 B 2
20 A 3
20 B 2
21 A 0
21 B 5
22 A 3
22 B 2
23 A 3
23 B 2
24 A 0
24 B 5"), header = TRUE)
Would something like this work?
## data is the data frame from the question (columns: hour, campaign, revenue)
## create a vector of the unique hours from column 1 of the data
hr <- as.numeric(unique(data[, 1]))
## Set up vectors to hold the A and B campaign "best" hours
A.hours <- NULL
B.hours <- NULL
## loop over the hours by index, so the hours need not be 1, 2, 3, ...
for (i in seq_along(hr)) {
  ## create a subset of data from the current hour
  sub.data <- data[data[, 1] == hr[i], ]
  ## find the campaign with the highest revenue (first one on a tie)
  best.camp <- sub.data[which.max(sub.data[, 3]), 2]
  if (best.camp == "A") {
    A.hours <- c(A.hours, hr[i])
  }
  if (best.camp == "B") {
    B.hours <- c(B.hours, hr[i])
  }
}
The code indicates that during the A.hours (hours: 1 3 5 7 9 11 12 15 17 19 20 22 23), campaign A is more profitable.
However, during B.hours (hours: 2 4 6 8 10 13 14 16 18 21 24), campaign B is more profitable.
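The per-hour comparison above can also be written without an explicit loop. The sketch below is one way to do it in base R, not the answerer's method: it rebuilds the question's data compactly (rev_A and rev_B are illustrative names), reshapes it into an hour-by-campaign matrix with tapply, and picks the better column per row with max.col.

```r
# Revenue per hour (1 to 24) for each campaign, copied from the question
rev_A <- c(23, 21, 23, 21, 23, 21, 20, 18, 20, 18, 20, 19,
           8, 6, 9, 6, 9, 6, 3, 3, 0, 3, 3, 0)
rev_B <- c(20, 22, 20, 22, 20, 22, 17, 19, 17, 19, 17, 18,
           9, 11, 8, 11, 8, 11, 2, 2, 5, 2, 2, 5)
d <- data.frame(
  hour     = rep(1:24, each = 2),
  campaign = rep(c("A", "B"), times = 24),
  revenue  = c(rbind(rev_A, rev_B))  # interleaves A and B for each hour
)

# Hour-by-campaign matrix of revenue (one observation per cell here)
wide <- tapply(d$revenue, list(d$hour, d$campaign), sum)

# Name of the column with the highest revenue in each row
best <- colnames(wide)[max.col(wide, ties.method = "first")]

A.hours <- as.numeric(rownames(wide))[best == "A"]
B.hours <- as.numeric(rownames(wide))[best == "B"]
```

Note that the hour-by-hour winner alternates, whereas the linear model tree compares average revenue within each segment, which is why it finds a single split at hour 12.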

matrix addition - Multiple unique identifiers

I am trying to add elements from two different matrices. Each matrix has three unique identifiers, as shown below:
Matrix A:
A B C D E F G H
1 x 1 2 10 11 12 13 10
2 y 1 2 11 12 14 12 13
3 y 1 3 12 10 11 12
The second matrix looks like:
A B C D E F G H
1 x 1 2 20 14 17 10 10
2 y 1 2 11 12 14 12 13
3 y 1 3 17 10 19 12
Please note that the variables A, B, and D together form a unique identifier for each participant.
I would like to write code that takes this into account when I sum the matrix values.
You should put your data in the long format.
library(reshape2)
dat.l <- melt(dat,id=c('A','B','D'))
dat1.l <- melt(dat1,id=c('A','B','D'))
Then you just sum value (this assumes both data frames list the participants in the same order):
dat.l$value = dat.l$value + dat1.l$value
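If the two data frames may list participants in a different order, the rows can be matched on the identifier columns before adding. The following is a minimal base R sketch using merge. The toy data is hypothetical (two value columns E and F only, with no missing cells), and it assumes every participant appears exactly once in each data frame.

```r
# Hypothetical toy data: A, B and D identify a participant; E and F are values
dat <- data.frame(A = c("x", "y", "y"), B = c(1, 1, 1), D = c(2, 2, 3),
                  E = c(10, 11, 12), F = c(11, 12, 10))
# Same participants, deliberately listed in a different order
dat1 <- data.frame(A = c("y", "x", "y"), B = c(1, 1, 1), D = c(3, 2, 2),
                   E = c(17, 20, 11), F = c(10, 14, 12))

ids <- c("A", "B", "D")
value_cols <- setdiff(names(dat), ids)

# Match rows on the identifiers; value columns get the suffixes .1 and .2
both <- merge(dat, dat1, by = ids, suffixes = c(".1", ".2"))

# Add the matched value columns element-wise
summed <- both[ids]
summed[value_cols] <- both[paste0(value_cols, ".1")] +
  both[paste0(value_cols, ".2")]
summed
```

merge sorts the result by the identifier columns, so the row order of summed can differ from the row order of either input.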

Is there a way for a column to keep a running total of all columns that precede it?

I have a data frame to which I am adding columns, and I would like one column to sum them up. I will not know the names of the columns ahead of time, so I guess I would need some kind of function that would count the number of columns and then sum them up.
If my data is like this:
w=1:10
x=11:20
z=data.frame(w,x)
I would like the total for z$w and z$x. But then if I were to add z$y, I would like to have that incorporated into the sum as well.
You should consider not adding a column for the sum, and just call rowSums(z) whenever you need it. That removes the hassle of having to update the column whenever you modify your data.frame.
Now if that's really what you want, here is a little function that will update the sum and always keep it as the last column. You'll have to run it every time you make a change to your data.frame:
> refresh.total <- function(df) {
+ df$total <- NULL
+ df$total <- rowSums(df)
+ return(df)
+ }
>
> z <- refresh.total(z)
> z
w x total
1 1 11 12
2 2 12 14
3 3 13 16
4 4 14 18
5 5 15 20
6 6 16 22
7 7 17 24
8 8 18 26
9 9 19 28
10 10 20 30
>
> z$y <- 2:11
> z <- refresh.total(z)
> z
w x y total
1 1 11 2 14
2 2 12 3 17
3 3 13 4 20
4 4 14 5 23
5 5 15 6 26
6 6 16 7 29
7 7 17 8 32
8 8 18 9 35
9 9 19 10 38
10 10 20 11 41
After you've finished adding in all the columns, you can do:
z$total <- rowSums(z)
