How to mutate a mean of certain rows in a data frame - r

I would like to create a new column which equals the mean of several variables (columns) in my data frame. However, I'm afraid I can't use 'rowMeans' directly because I don't want to average all variables. Moreover, I'm hesitant to manually type all the variable names (there are many). For example:
my_data <- data.frame(a = c(1,2,3), b = c(4,5,6), c = c(10,10,10), d = c(13,24,81),
e = c(10, 8, 6), hello = c(1,-1,1), bye = c(1,5,5))
I want to add a column called avg which is the average of variables a, b, c, d, and e only. Because the variable names in my dataset are long (and complex), and there are more than 10 variables, I prefer not to type them out one by one. So I guess I might need to use the dplyr package and the mutate function? Could you please suggest a clever way for me to do that?
The content below was added after your kind comments and answers. Thank you all again:
Actually, the column names that I need are Mcheck5_1_1, Mcheck5_2_1, ..., Mcheck5_8_1 (so there are 8 in total). However, I tried
my_data$avg = rowMeans(select(my_data, Mcheck5_1_1:Mcheck5_8_1), na.rm = TRUE)
but an error was thrown to me:
Error in select(my_data, Mcheck5_1_1:Mcheck5_8_1) :
unused argument (Mcheck5_1_1:Mcheck5_8_1)
Right now I solved the problem by using the following code:
idx = grep("Mcheck5_1_1", names(my_data))
# Parentheses are needed: idx:idx+7 parses as (idx:idx) + 7
my_data$avg = rowMeans(my_data[, idx:(idx+7)], na.rm = TRUE)
But is there a more elegant way to do it? Or why couldn't I use select()? Thanks!

I would do something like this
my_data <- data.frame(a = c(1,2,3), b = c(4,5,6), c = c(10,10,10), d = c(13,24,81),
e = c(10, 8, 6), hello = c(1,-1,1), bye = c(1,5,5))
several_variables <- c('a', 'b', 'c', 'd', 'e') # or `letters[1:5]`
my_data$avg <- rowMeans(my_data[,several_variables])
my_data
#>   a b  c  d  e hello bye  avg
#> 1 1 4 10 13 10     1   1  7.6
#> 2 2 5 10 24  8    -1   5  9.8
#> 3 3 6 10 81  6     1   5 21.2
Obviously, if the variables are at fixed positions, and you know they will stay there, you could use numeric indexing as suggested by Jaap,
my_data$avg <- rowMeans(my_data[,1:5])
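For the Mcheck5_* columns mentioned in the edited question, selecting by name pattern avoids both typing the names and relying on positions. The following is only a sketch in base R: the toy data and its values are invented for illustration, and the regular expression assumes the eight columns are named Mcheck5_<number>_1 exactly as described.

```r
# Toy data standing in for the real data set (all values are invented).
my_data <- data.frame(id = 1:3)
for (i in 1:8) {
  my_data[[paste0("Mcheck5_", i, "_1")]] <- c(i, i + 1, i + 2)
}
my_data$other <- c(100, 200, 300)

# Pick the columns by name pattern instead of typing them out.
wanted <- grep("^Mcheck5_[0-9]+_1$", names(my_data), value = TRUE)
my_data$avg <- rowMeans(my_data[, wanted], na.rm = TRUE)

# Positional alternative. The parentheses matter: idx:idx+7 is (idx:idx) + 7.
idx <- match("Mcheck5_1_1", names(my_data))
avg_by_position <- rowMeans(my_data[, idx:(idx + 7)], na.rm = TRUE)
```

As for the "unused argument" error from select(): a common cause is that another attached package (MASS is a frequent one) masks dplyr's select(), in which case calling dplyr::select() explicitly is worth trying.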

Related

Error in Adabag boosting function

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = c(1, 1, 0, 0, 1))
ada = boosting(formula=var1~., data=df1)
Error in cbind(yval2, yprob, nodeprob) :
number of rows of matrices must match (see arg 2)
Hi everyone, I'm trying to use boosting function from adabag package, but it's telling me that the number of rows from matrix (?) must be equal. This data is not the original, but it seems to throw the same error.
Could you help me?
Thank you.
You should not use ID as explanatory variable.
Unfortunately your df1 dataset is too small, and it is not possible to tell whether ID is the source of your problem.
Below I generate a bigger data set:
library(adabag)
set.seed(1)
n <- 100
df1 <- data.frame(ID = 1:n,
var1 = sample(letters[1:5], n, replace=T),
var2 = sample(c(0,1), n, replace=T))
head(df1)
#   ID var1 var2
# 1  1    b    1
# 2  2    b    0
# 3  3    c    0
# 4  4    e    1
# 5  5    b    1
# 6  6    e    0
ada <- boosting(var1~var2, data=df1)
ada.pred <- predict.boosting(ada, newdata=df1)
ada.pred$confusion
#                Observed Class
# Predicted Class  a  b  c  d  e
#               b  5 20  2  7 11
#               c  2  2 10  2  2
#               d  6  3  7 17  4
Pablo, if we take a closer look at your sample data, we notice a property that makes it impossible for the classification algorithm to handle. Your dataset consists of five samples, each with a unique label (the dependent variable): a, b, c, d, e. The dataset has only one feature (the independent variable var2, as ID should be excluded from the feature list) with two classes: 0 and 1. This means that several labels of the dependent variable correspond to the same class of the independent variable. When the algorithm tries to build a model, it runs into trouble because of this property of the dataset and throws the error (number of rows of matrices must match (see arg 2)).
Marco's data, instead, has some healthy diversity: in the dataset of six samples, there are only three labels (b, c, e) and two classes (0, 1). The data set is diverse and reliable enough for the algorithm to handle it.
So, in order to use adabag's boosting (which builds its trees with rpart), you should make your data more diverse and reliable. Good luck!
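A quick way to see the property described above before calling boosting() is to count the observations per class. This is only a diagnostic sketch in base R; the threshold of two rows per class is a hypothetical rule of thumb chosen for illustration, not a documented adabag requirement.

```r
# The five-row data from the question: every class of var1 occurs exactly once.
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  var1 = factor(c("a", "b", "c", "d", "e")),
                  var2 = c(1, 1, 0, 0, 1))

# Number of observations per class of the response.
class_counts <- table(df1$var1)
print(class_counts)

# Response against the single usable predictor: several labels share each value.
print(table(df1$var1, df1$var2))

# Hypothetical rule of thumb: flag classes with fewer than 2 rows.
sparse_classes <- names(class_counts)[class_counts < 2]
```

Here every class is flagged, which matches the explanation that the sample is too small for the algorithm to learn from.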

Using a custom summary function for factors within multiple columns

I conducted a survey with a large number of items, each of which has distinct categorical response options stored as factors. I need to summarize these columns efficiently, preferably with functionality like that provided by forcats::fct_count(). I also need to know how many non-NA responses were provided for each variable, since different items were shown to different respondents. I wrote a function to make a tidy little summary data frame, but am struggling to run this function over each column efficiently and then combine the results into a single object (à la ddply).
I've tried sapply(), gather()-ing the data to long format and then running ddply(), but the problem of the distinct levels for each variable seems to keep getting in the way. See below for a reproducible example of the data set and my summarizing function. I could run the function for each variable (as shown below), but I know there's gotta be a more efficient way to do this that doesn't involve creating a ton of individual summary data-frame objects. Thanks for any help you can provide.
data <- data.frame(
ID = c(1:50),
X = as.factor(sample(c("yes", "no", NA), 50, replace = TRUE)),
Y = as.factor(sample(c("a", "b", "c", NA), 50, replace = TRUE)),
Z = as.factor(sample(c("d", "e", "f", "g", "h", NA), 50, replace = TRUE))
)
library(tidyverse)
library(forcats)
factorsummaries.f <- function(x) {
x <- na.omit(x)
counts <- fct_count(fct_drop(x), sort = T)
counts$f <- as.character(counts$f)
total <- data.frame(f = "sum", n = as.numeric(sum(counts$n)))
return(bind_rows(counts, total))
}
factorsummaries.f(data$X)
factorsummaries.f(data$Y)
Perhaps you are looking for purrr::map_dfr
map_dfr(data[,2:ncol(data)], factorsummaries.f, .id = "colname")
#output
   colname f         n
   <chr>   <chr> <dbl>
 1 X       no       18
 2 X       yes      17
 3 X       sum      35
 4 Y       a        14
 5 Y       c        13
 6 Y       b        12
 7 Y       sum      39
 8 Z       g        10
 9 Z       d         9
10 Z       h         8
11 Z       f         6
12 Z       e         5
13 Z       sum      38
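If purrr is not available, the same "apply to each column, then stack" pattern can be sketched in base R with lapply() and do.call(rbind, ...). Note that summarise_factor() below is a hypothetical base-R stand-in for the asker's factorsummaries.f (it uses table() instead of forcats::fct_count()), and the seed is set only to make the toy data repeatable.

```r
set.seed(1)
data <- data.frame(
  ID = 1:50,
  X = factor(sample(c("yes", "no", NA), 50, replace = TRUE)),
  Y = factor(sample(c("a", "b", "c", NA), 50, replace = TRUE))
)

# Hypothetical stand-in for factorsummaries.f; table() ignores NA by default.
summarise_factor <- function(x) {
  counts <- sort(table(droplevels(x)), decreasing = TRUE)
  data.frame(f = c(names(counts), "sum"),
             n = c(as.numeric(counts), sum(counts)),
             stringsAsFactors = FALSE)
}

# Summarise every column except ID and label each piece with its column name.
cols <- setdiff(names(data), "ID")
pieces <- lapply(cols, function(nm) {
  cbind(colname = nm, summarise_factor(data[[nm]]))
})
result <- do.call(rbind, pieces)
```

The "sum" row of each piece is the number of non-NA responses for that variable, which is the count the question asks for.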

How to change values in a column of a data frame based on conditions in another column?

I would like to have an equivalent of the Excel function "if". It seems basic enough, but I could not find relevant help.
I would like to assign "NA" to specific cells if two consecutive cells in a different column are not identical. In Excel, the command would be the following (say in C1): if(A1 = A2, B1, "NA"). I would then just need to extend it to the rest of the column.
But in R, I am stuck!
Here is an equivalent of my R code so far.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"))
df
To get the following Type of each Type in another column, I found a useful function on StackOverflow that does the job.
# determines the following Type of each Type
shift <- function(x, n){
c(x[-(seq(n))], rep(6, n))
}
df$TypeFoll <- shift(df$Type, 1)
df
Now, I would like to keep TypeFoll in a specific row when the File for this row is identical to the File on the next row.
Here is what I tried. It failed!
for(i in 1:length(df$File)){
df$TypeFoll2 <- ifelse(df$File[i] == df$File[i+1], df$TypeFoll, "NA")
}
df
In the end, my data frame should look like:
aim = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"),
TypeFoll = c("2","3","4","4","5","6"),
TypeFoll2 = c("2","NA","4","4","NA","6"))
aim
Oh, and by the way, if someone would know how to easily put the columns TypeFoll and TypeFoll2 just after the column Type, it would be great!
Thanks in advance
I would do it as follows (not keeping the result from the shift function)
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"), stringsAsFactors = FALSE)
# This is your shift function
len=nrow(df)
A1 <- df$File[1:(len-1)]
A2 <- df$File[2:len]
# Why do you save the result of the shift function in the df?
Then apply your if(A1 = A2, B1, "NA") rule. As akrun mentioned, ifelse is vectorised. By the way, this is also how you append a column to a data.frame:
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), 6) #Why 6?
As 6 is hardcoded here, something like:
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), max(as.numeric(df$Type))+1)
Is more generic.
First off, explicit 'for' loops are usually slower and clumsier in R than vectorised operations, so try to think of this as vector manipulation instead.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"));
Create shifted types and files and put it in new columns:
df$TypeFoll = c(as.character(df$Type[2:nrow(df)]), "NA");
df$FileFoll = c(as.character(df$File[2:nrow(df)]), "NA");
Now, df looks like this:
> df
  Type File TypeFoll FileFoll
1    1    A        2        A
2    2    A        3        B
3    3    B        4        B
4    4    B        4        B
5    4    B        5        C
6    5    C       NA       NA
Then, create TypeFoll2 by combining these:
df$TypeFoll2 = ifelse(df$File == df$FileFoll, df$TypeFoll, "NA");
And you should have something that looks a lot like what you want:
> df;
  Type File TypeFoll FileFoll TypeFoll2
1    1    A        2        A         2
2    2    A        3        B        NA
3    3    B        4        B         4
4    4    B        4        B         4
5    4    B        5        C        NA
6    5    C       NA       NA        NA
If you want to remove the FileFoll column:
df$FileFoll = NULL;
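The question also asks how to place the new columns right after Type, which the answers above leave open. A minimal sketch in base R follows; it pads the last row with a real NA rather than the string "NA" or the hardcoded 6, which is an assumption about what is wanted there.

```r
df <- data.frame(Type = c("1", "2", "3", "4", "4", "5"),
                 File = c("A", "A", "B", "B", "B", "C"),
                 stringsAsFactors = FALSE)

# Shift each column up by one row; the last row has no successor, so pad with NA.
next_type <- c(df$Type[-1], NA)
next_file <- c(df$File[-1], NA)

df$TypeFoll <- next_type
# Keep the following Type only when the next row belongs to the same File.
df$TypeFoll2 <- ifelse(!is.na(next_file) & df$File == next_file, next_type, NA)

# Reorder columns by name so the new ones sit right after Type.
df <- df[, c("Type", "TypeFoll", "TypeFoll2", "File")]
```

Indexing with a character vector of column names is the simplest way to reorder; any column left out of that vector is dropped, so list them all.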

Inserting function argument as string within body of function

I am trying to write a function using aggregate() that will allow me to easily specify one or more variables to list by and their names.
data:
FCST_VAR  OBS_SID  FCST_INIT_HOUR  ME
WIND      00000    12               4.00000
WIND      11111    12              -0.74948
WIND      22222    12              -0.97792
WIND      00000    00              -2.15822
WIND      11111    00               0.94710
WIND      22222    00              -2.28489
I can do this for a single variable to group by fairly easily:
aggregate.CNT <- function(input.data, aggregate.by) {
  # Calculate mean ME by aggregating specified variable
  output.data <- aggregate(input.data$ME,
                           list(Station_ID = input.data[[aggregate.by]]),
                           mean, na.rm=TRUE)
  output.data
}
However, I'm stumped on two things:
Firstly, a way to be able to call the function specifying a name for the 'group by' column (instead of Group1), eg in the case of:
aggregate.CNT <- function(input.data, aggregate.by, group.name) {
  # Calculate mean ME by aggregating specified variable
  output.data <- aggregate(input.data$ME,
                           list(group.name = input.data[[aggregate.by]]),
                           mean, na.rm=TRUE)
  output.data
}
But this results in the column name in the output being group.name rather than the desired value of the argument.
Secondly, building on that: I want to optionally specify more than one variable to group by, with names. I tried using ... but that doesn't seem to be possible, since the additional arguments obviously need to be in the form:
list(arg1 = input.data[[arg2]], arg3 = input.data[[arg4]])
And I don't think there's a way to place extra arguments into a arg3 = input.data[[arg4]] format.
So I was wondering if there is a way to use an argument to insert a whole string into the function, eg:
aggregate.CNT <- function(input.data, aggregate.by.list) {
  # Calculate mean ME by aggregating specified variable
  output.data <- aggregate(input.data$ME,
                           list(aggregate.by.list),
                           mean, na.rm=TRUE)
}
aggregate.CNT(data, "Station_ID = data$OBS_SID, Init_Hour = data$FCST_INIT_HOUR")
If this isn't possible, suggestions for alternative methods are also greatly appreciated.
Thanks
Mal
Try this:
aggregate.CNT <- function(data, by) {
ag <- aggregate(ME ~., data[c("ME", by)], mean, na.rm = TRUE)
if (!is.null(names(by))) names(ag) <- c(names(by), "ME")
ag
}
Here is an example:
> DF <- data.frame(ME = 1:5, g = c(1, 1, 2, 2, 2), b = c(1, 1, 1, 2, 2))
> aggregate.CNT(DF, "g")
  g  ME
1 1 1.5
2 2 4.0
> aggregate.CNT(DF, c("g", "b"))
  g b  ME
1 1 1 1.5
2 2 1 3.0
3 2 2 4.5
> aggregate.CNT(DF, c(G = "g", B = "b"))
  G B  ME
1 1 1 1.5
2 2 1 3.0
3 2 2 4.5
ADDED: by vector may be named.
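Applied to the data in the question, the function above could be used roughly as follows. The data frame name wind is made up here, and the station IDs and hours are stored as strings to keep their leading zeros, which is an assumption about the original data.

```r
# The answer's function, repeated so the example is self-contained.
aggregate.CNT <- function(data, by) {
  ag <- aggregate(ME ~ ., data[c("ME", by)], mean, na.rm = TRUE)
  if (!is.null(names(by))) names(ag) <- c(names(by), "ME")
  ag
}

# The six rows from the question (IDs and hours kept as character).
wind <- data.frame(
  FCST_VAR = "WIND",
  OBS_SID = c("00000", "11111", "22222", "00000", "11111", "22222"),
  FCST_INIT_HOUR = c("12", "12", "12", "00", "00", "00"),
  ME = c(4.00000, -0.74948, -0.97792, -2.15822, 0.94710, -2.28489),
  stringsAsFactors = FALSE
)

# One grouping variable, renamed in the output via the name in 'by'.
by_station <- aggregate.CNT(wind, c(Station_ID = "OBS_SID"))

# Two grouping variables, both renamed.
by_both <- aggregate.CNT(wind, c(Station_ID = "OBS_SID",
                                 Init_Hour = "FCST_INIT_HOUR"))
```

Passing column names as a (named) character vector sidesteps the need to inject a string of code into the function body.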

How to ddply() without sorting?

I use the following code to summarize my data, grouped by Compound, Replicate and Mass.
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass),
.fun = calculate_T60_Over_T0_Ratio)
An unfortunate side effect is that the resulting data frame is sorted by those fields. I would like to do this and keep Compound, Replicate and Mass in the same order as in the original data frame. Any ideas? I tried adding a "Sorting" column of sequential integers to the original data, but of course I can't include that in the .variables since I don't want to 'group by' that, and so it is not returned in the summaryDataFrame.
Thanks for the help.
This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
col <- ".sortColumn"
data[,col] <- 1:nrow(data)
out <- fn(data, ...)
if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
out <- out[order(out[,col]),]
out[,col] <- NULL
out
}
#Some sample data
d <- structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315,
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677,
0.872558638137971)), .Names = c("g", "v"), row.names = c(NA,
-6L), class = "data.frame")
#This one resorts
ddply(d, .(g), mutate, v=scale(v)) #does not preserve order of d
#This one does not
keeping.order(d, ddply, .(g), mutate, v=scale(v)) #preserves order of d
Please do read the thread for Hadley's notes about why this functionality may not be general enough to roll into ddply, particularly as it probably applies in your case as you are likely returning fewer rows with each piece.
Edited to include a strategy for more general cases
If ddply is outputting something that is sorted in an order you do not like you basically have two options: specify the desired ordering on the splitting variables beforehand using ordered factors, or manually sort the output after the fact.
For instance, consider the following data:
d <- data.frame(x1 = rep(letters[1:3],each = 5),
x2 = rep(letters[4:6],5),
x3 = 1:15,stringsAsFactors = FALSE)
These are strings, for now. ddply will sort the output, which in this case will entail the default lexical ordering:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
  x1 x2 val
1  a  d   5
2  a  e   7
3  a  f   3
4  b  d  17
5  b  e   8
6  b  f  15
7  c  d  13
8  c  e  25
9  c  f  27
> ddply(d[sample(1:15,15),],.(x1,x2),summarise, val = sum(x3))
  x1 x2 val
1  a  d   5
2  a  e   7
3  a  f   3
4  b  d  17
5  b  e   8
6  b  f  15
7  c  d  13
8  c  e  25
9  c  f  27
If the resulting data frame isn't ending up in the "right" order, it's probably because you really want some of those variables to be ordered factors. Suppose that we really wanted x1 and x2 ordered like so:
d$x1 <- factor(d$x1, levels = c('b','a','c'),ordered = TRUE)
d$x2 <- factor(d$x2, levels = c('d','f','e'), ordered = TRUE)
Now when we use ddply, the resulting sort will be as we intend:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
  x1 x2 val
1  b  d  17
2  b  f  15
3  b  e   8
4  a  d   5
5  a  f   3
6  a  e   7
7  c  d  13
8  c  f  27
9  c  e  25
The moral of the story here is that if ddply is outputting something in an order you didn't intend, it's a good sign that you should be using ordered factors for the variables you're splitting on.
I eventually ended up adding an 'indexing' column to the original data frame. It consisted of two columns pasted with sep="_". Then I made another data frame made of only unique members of the 'indexing' column and a counter 1:length(df). I did my ddply() on the data which returned a sorted data frame. Then to get things back in the original order I did merge() the results data frame and the index data frame (making sure the columns are named the same thing makes this easier). Finally, I did order and removed the extraneous columns.
Not an elegant solution, but one that works.
Thanks for the assist. It got me thinking in the right direction.
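The "sort the output after the fact" route can also be done without the merge step, by ordering the summary on the position at which each group first appears in the original data. The sketch below uses made-up data, and base R's aggregate() stands in for ddply() so that it runs without plyr; the re-ordering lines should apply to a ddply() result in the same way.

```r
# Made-up data in which the groups first appear in the order b, a, c.
d <- data.frame(g = c("b", "b", "a", "c", "a", "c"),
                v = c(1, 2, 3, 4, 5, 6),
                stringsAsFactors = FALSE)

# Stand-in for the ddply() call: one summary row per group.
out <- aggregate(v ~ g, data = d, FUN = sum)

# Position of each group's first appearance in the original data.
first_seen <- match(out$g, unique(d$g))

# Re-sort the summary into original order and tidy the row names.
out <- out[order(first_seen), ]
rownames(out) <- NULL
```

With several grouping columns, the same idea should work on a pasted key (for example paste(Compound, Replicate, Mass, sep = "_")), much like the indexing column described above.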
