Creating New Column with Maximum Value - r

I've spent a good deal of time looking into this subject, but have not been able to find much. I would like a new column of data titled "Max Region", that gives the name of the column for which the maximum value occurs in each row.
df <- data.frame(Head=c(9, 6, 2, NA), Thorax=c(9, 2, NA, NA), Abdomen=c(NA, NA, 5, 5), Neck=c(4, 3, 5, 2))
# Head Thorax Abdomen Neck
# 9 9 NA 4
# 6 2 NA 3
# 2 NA 5 5
# NA NA 5 2
So far, I've used:
df$MaxRegion <- names(df)[apply(df, 1, which.max)]
However, in the case of a tie, I would really like both column names to be returned (i.e. HeadThorax or AbdomenNeck), or just "NA". Is this possible with which.max? I've also looked into max.col, but it doesn't seem to offer this either. Thank you so much!

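Building on the OP's code: to get the names of all columns tied for the maximum, compare each row to its maximum with %in% (which returns FALSE where there are NAs) and paste the corresponding names together. A plain == would return NA for missing values, so it would need to be wrapped in which().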
apply(df, 1, function(x) toString(names(x)[x %in% max(x, na.rm = TRUE)]))
#[1] "Head, Thorax" "Head" "Abdomen, Neck" "Abdomen"
NOTE: which.max returns only the first index of the max value
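To illustrate the note, and the question's alternative of returning NA on a tie, here is a minimal sketch. The helper name max_region_or_na is hypothetical and not part of any package.

```r
df <- data.frame(Head = c(9, 6, 2, NA), Thorax = c(9, 2, NA, NA),
                 Abdomen = c(NA, NA, 5, 5), Neck = c(4, 3, 5, 2))

# which.max() reports only the first position of the maximum
first_only <- names(df)[apply(df, 1, which.max)]

# Hypothetical helper: the column name if the maximum is unique, NA on a tie
max_region_or_na <- function(x) {
  hits <- which(x == max(x, na.rm = TRUE))  # which() drops NA comparisons
  if (length(hits) == 1) names(x)[hits] else NA_character_
}
unique_only <- apply(df, 1, max_region_or_na)
```

Here first_only silently picks "Head" and "Abdomen" for the tied rows, while unique_only marks those rows as NA.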

Another base R option
df$MaxRegion <- mapply(
  subset,
  list(names(df)),
  asplit(df == do.call(pmax, c(df, na.rm = TRUE)), 1)
)
gives
> df
  Head Thorax Abdomen Neck     MaxRegion
1    9      9      NA    4  Head, Thorax
2    6      2      NA    3          Head
3    2     NA       5    5 Abdomen, Neck
4   NA     NA       5    2       Abdomen

Related

Creating a function looping over each row in R

I want to write a function that creates a new column with rowmeans for Columns 1-3, only if more than 2 questions for Columns 1-3 per row were answered, otherwise print 'N'.
Here is my dataframe:
test <- data.frame(Manager1 = c(1, 3, 3), Manager2 = c(3, 4, 1), Manager3 = c(NA , 4, 2), Team1 = c(3, 4, 1))
Desired output:
Manager1 Manager2 Manager3 Team1 mean_score
       1        3       NA     3          N
       3        4        4     4    3.66667
       3        1        2     1          2
My code is as follows, but it's not working:
#create function
mean_score <- function(x) {
for (i in 1:nrow(test)){
if (sum(test[i, x] != "NA", na.rm = TRUE) >2){
test$mean_score[i] <- rowMeans(test[i, x], na.rm = TRUE)
} else
test$mean_score[i] <- print("N")
}
}
#compute function
mean_score(1:3)
What am I missing? Suggestions on better code are welcome too.
I think it is not ideal to put a character together with a numeric value, since it will convert the whole column into character. However, if that is what you want:
my_sum <- function(x,min=2){
s <- mean(x, na.rm = T) # get the mean
no_na <- sum(!is.na(x)) # count the number of non NAs
if(no_na>min){s}else{"N"} # return mean if enough non NAs
}
test$mean <- apply(test[,1:3],1,my_sum)
test
Manager1 Manager2 Manager3 Team1 mean
1 1 3 NA 3 N
2 3 4 4 4 3.66666666666667
3 3 1 2 1 2
str(test)
'data.frame': 3 obs. of 5 variables:
$ Manager1: num 1 3 3
$ Manager2: num 3 4 1
$ Manager3: num NA 4 2
$ Team1 : num 3 4 1
$ mean : chr "N" "3.66666666666667" "2"
You can simply use rowMeans, which returns NA for any row that contains an NA. With three columns, that is equivalent to computing the mean only if more than 2 questions in Columns 1-3 were answered.
test$mean_score <- rowMeans(test[,1:3])
# Manager1 Manager2 Manager3 Team1 mean_score
#1 1 3 NA 3 NA
#2 3 4 4 4 3.666667
#3 3 1 2 1 2.000000
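If the "N" marker from the question is really required, the same idea can be vectorised without a loop. This is only a sketch: min_answered is an illustrative name, and the result is a character column because numbers and "N" cannot share a numeric column.

```r
test <- data.frame(Manager1 = c(1, 3, 3), Manager2 = c(3, 4, 1),
                   Manager3 = c(NA, 4, 2), Team1 = c(3, 4, 1))

cols <- 1:3
min_answered <- 2                               # illustrative threshold
answered <- rowSums(!is.na(test[, cols]))       # non-NA answers per row
means <- rowMeans(test[, cols], na.rm = TRUE)   # means ignoring NAs

# Rows with too few answers get "N"; this coerces the column to character
test$mean_score <- ifelse(answered > min_answered, round(means, 5), "N")
```

Keeping NA instead of "N" (as rowMeans does above) leaves the column numeric, which is usually easier to work with afterwards.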
GKi's answer is simpler and is the one you should use, but here is how I changed your code so that it works.
Generally, when writing a function, you want the input to be the data frame (in this case test) and to do the work on that argument inside the function.
Another important point: build a vector of values first and then attach that vector to the data frame, as I do in the code below. In your version, assigning to test$mean_score inside the function only modified a local copy of test, so the changes were lost when the function returned. Building the vector, attaching it, and returning the data frame avoids that.
Also you don't need to use print() to insert a character into a vector either.
Hope this helps explain why your function was having issues, but frankly GKi's answer is better for general R use!
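mean_score <- function(x, cols = 1:3) {
  # cols: the columns to average (Manager1-3), so Team1 is not included
  mean_score <- character(nrow(x))
  for (i in seq_len(nrow(x))) {
    # count the answered (non-NA) questions in this row
    if (sum(!is.na(x[i, cols])) > 2) {
      mean_score[i] <- rowMeans(x[i, cols], na.rm = TRUE)
    } else {
      mean_score[i] <- "N"
    }
  }
  x$mean_score <- mean_score
  return(x)
}
mean_score(test)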

Limit na.locf in zoo package

I would like to do a last observation carried forward for a variable, but only up to 2 observations. That is, for gaps of data of 3 or more NA, I would only carry the last observation forward for the next 2 observations and leave the rest as NA.
If I do this with the zoo::na.locf, the maxgap parameter implies that if the gap is larger than 2, no NA is replaced. Not even the last 2. Is there any alternative?
x <- c(NA,3,4,5,6,NA,NA,NA,7,8)
zoo::na.locf(x, maxgap = 2) # Doesn't replace the first 2 NAs after the 6, as the gap of NAs is 3.
Desired_output <- c(NA,3,4,5,6,6,6,NA,7,8)
First apply na.locf0 with maxgap = 2 giving x0 and define a grouping variable g using rleid from the data.table package. For each such group use ave to apply keeper which if the group is all NA replaces it with c(1, 1, NA, ..., NA) and otherwise outputs all 1s. Multiply na.locf0(x) by that.
library(data.table)
library(zoo)
mg <- 2
x0 <- na.locf0(x, maxgap = mg)
g <- rleid(is.na(x0))
keeper <- function(x) if (all(is.na(x))) ifelse(seq_along(x) <= mg, 1, NA) else 1
na.locf0(x) * ave(x0, g, FUN = keeper)
## [1] NA 3 4 5 6 6 6 NA 7 8
A solution using base R:
ave(x, cumsum(!is.na(x)), FUN = function(i){ i[1:pmin(length(i), 3)] <- i[1]; i })
# [1] NA 3 4 5 6 6 6 NA 7 8
cumsum(!is.na(x)) groups each run of NAs with the most recent non-NA value.
function(i){ i[1:pmin(length(i), 3)] <- i[1]; i } transforms the first two NAs of each group into the leading non-NA value of this group.
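The base R approach above can be wrapped in a small function so that the number of carried observations is a parameter. This is a sketch under that assumption; the name locf_limit and its argument n are hypothetical.

```r
# Hypothetical helper: carry the last observation forward at most n times
locf_limit <- function(x, n = 2) {
  run <- cumsum(!is.na(x))       # leading NAs fall in group 0 and stay NA
  ave(x, run, FUN = function(v) {
    k <- min(length(v), n + 1)   # the value itself plus up to n fills
    v[seq_len(k)] <- v[1]
    v
  })
}

x <- c(NA, 3, 4, 5, 6, NA, NA, NA, 7, 8)
filled <- locf_limit(x, n = 2)
filled_once <- locf_limit(x, n = 1)
```

With n = 2 this reproduces the desired output from the question; with n = 1 only the first NA after the 6 is filled.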

Aggregate NAs in R

I'm having trouble handling NAs while calculating aggregated means. Please see the following code:
tab=data.frame(a=c(1:3,1:3), b=c(1,2,NA,3,NA,NA))
tab
a b
1 1 1
2 2 2
3 3 NA
4 1 3
5 2 NA
6 3 NA
attach(tab)
aggregate(b, by=list(a), data=tab, FUN=mean, na.rm=TRUE)
Group.1 x
1 1 2
2 2 2
3 3 NaN
I want NA instead of NaN if the vector has all NAs i.e. I want the output to be
Group.1 x
1 1 2
2 2 2
3 3 NA
I tried using a custom function:
adjmean=function(x) {if(all(is.na(x))) NA else mean(x,na.rm=TRUE)}
However, I get the following error:
aggregate(b, by=list(a), data=tab, FUN=adjmean)
Error in FUN(X[[1L]], ...) :
unused argument (data = list(a = c(1, 2, 3, 1, 2, 3), b = c(1, 2, NA, 3, NA, NA)))
In short, if the column has all NAs I want NA as an output instead of NaN. If it has few NAs, then it should compute the mean ignoring the NAs.
Any help would be appreciated.
Thanks
This is very close to what you had, but replaces mean(x, na.rm=TRUE) with a custom function which either computes the mean of the non-NA values, or supplies NA itself:
R> with(tab,
aggregate(b, by=list(a), FUN=function(x)
if (any(is.finite(z<-na.omit(x)))) mean(z) else NA))
Group.1 x
1 1 2
2 2 2
3 3 NA
R>
That is really one line, but I broke it up to make it fit into the SO display.
And you already had a similar idea, but I altered the function a bit more to return suitable values in all cases.
There is nothing wrong with your function. What is wrong is that you are using an argument in the default method for aggregate that doesn't exist:
adjmean = function(x) {if(all(is.na(x))) NA else mean(x,na.rm=TRUE)}
attach(tab) ## Just because you did it. I don't recommend this.
## Your error
aggregate(b, by=list(a), data=tab, FUN=adjmean)
# Error in FUN(X[[i]], ...) :
# unused argument (data = list(a = c(1, 2, 3, 1, 2, 3), b = c(1, 2, NA, 3, NA, NA)))
## Dropping the "data" argument
aggregate(b, list(a), FUN = adjmean)
# Group.1 x
# 1 1 2
# 2 2 2
# 3 3 NA
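If you want to use the data argument, use the formula method for aggregate. However, this method handles NA differently: its default na.action = na.omit drops every row containing an NA before grouping, so group 3 disappears entirely. Pass na.action = na.pass to keep those rows.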
Example:
detach(tab) ## I don't like having things attached
aggregate(b ~ a, data = tab, adjmean)
# a b
# 1 1 2
# 2 2 2
aggregate(b ~ a, data = tab, adjmean, na.action = na.pass)
# a b
# 1 1 2
# 2 2 2
# 3 3 NA
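For context, the NaN appears because na.rm = TRUE leaves nothing to average in group 3. The sketch below shows that, plus one possible alternative (not taken from the answers above): aggregate with mean first and convert NaN to NA afterwards.

```r
tab <- data.frame(a = c(1:3, 1:3), b = c(1, 2, NA, 3, NA, NA))

# With every value removed, mean() divides 0 by 0, which yields NaN
all_na_mean <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

# Alternative sketch: aggregate first, then turn NaN back into NA
res <- aggregate(tab$b, by = list(tab$a), FUN = mean, na.rm = TRUE)
res$x[is.nan(res$x)] <- NA
```

This only works as intended when a NaN can arise solely from an all-NA group; if the data itself may contain NaN, the custom adjmean function is the safer route.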

How to sum numeric values in a character vector?

I have a data frame with rows containing several numbers in each row. The data type of each column in the data frame is factor.
My data frame looks like
> df
C1
1 1, 14, 1, 4
2 2
3 NA
4 3, 7, 5
Now I want to sum up the values of each element in df. Such that I get
Sum
1 20
2 2
3 NA
4 15
I tried strsplit(as.character(df$C1),split=","). However, I have no idea how to get the sum...
df <- data.frame(C1= c("1, 14, 1, 4", "2", NA, "3, 7, 5"))
Using your code:
sapply(strsplit(as.character(df$C1), split=","),
function(x) sum(as.numeric(x)))
#[1] 20 2 NA 15
Or
library(splitstackshape)
Sum1 <- rowSums(cSplit(df, 'C1', sep=','), na.rm=TRUE)
#Assuming that there is only one column
Sum1[!Sum1] <- NA
Sum1
#[1] 20 2 NA 15
Or may be this also
unname(sapply(gsub(",", "+", df$C1),
function(x) eval(parse(text=x))))
# [1] 20 2 NA 15
Converting your data to a data.frame and using rowSums (note that with na.rm=TRUE the NA row sums to 0 rather than NA):
rowSums(read.table(text=as.character(df$C1),sep=',',fill=TRUE),na.rm=TRUE)
[1] 20 2 0 15
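The 0 in the third position comes from na.rm=TRUE summing an empty row. If NA is wanted there, as in the question, one possible fix is to reset rows that contain no numbers at all. This is a sketch; the variable names are illustrative.

```r
df <- data.frame(C1 = c("1, 14, 1, 4", "2", NA, "3, 7, 5"))

txt <- as.character(df$C1)
txt[is.na(txt)] <- "NA"                # make the missing row an explicit "NA" line
m <- read.table(text = txt, sep = ",", fill = TRUE)

sums <- rowSums(m, na.rm = TRUE)
sums[rowSums(!is.na(m)) == 0] <- NA    # rows with no numbers at all become NA
```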
A mostly for fun approach using eval parse:
unlist(lapply(paste0("c(", as.character(df$C1), ")"), function(x) sum(eval(parse(text=x)))))
## [1] 20 2 NA 15

finding the max value for values the row before through the row after for each row in R

I have looked for an answer high and low. It seems so simple but I am struggling with getting anything to work.
Using R 3.0 in Win 7.
I am looking for a way to find the max value (row by row) for: the row of interest, the row before, and row after.
an example would look something like this:
x max
1 1 NA
2 7 7
3 3 7
4 4 5
5 5 5
I could do this with a loop but I would like to avoid that if possible. I have explored things similar to rowSums and rollmean but they do not quite fit the bill since I want a max for a row after it as well.
Any thoughts are greatly appreciated!!
You could use embed and pmax in base R for this.
d <- data.frame(x=c(1,7,3,4,5))
transform(d, max=c(NA, do.call(pmax, as.data.frame(embed(d$x, 3))), NA))
# x max
# 1 1 NA
# 2 7 7
# 3 3 7
# 4 4 5
# 5 5 NA
Here's an approach using dplyr's lead() and lag() functions:
library(dplyr)
d <- data.frame(x = c(1,7,3,4,5))
mutate(d, max = pmax(lead(x), x, lag(x)))
#> x max
#> 1 1 NA
#> 2 7 7
#> 3 3 7
#> 4 4 5
#> 5 5 NA
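The same shifting idea works in base R without dplyr. The sketch below uses a hypothetical helper, window_max, and adds an na.rm switch for cases where the edge rows should use whichever neighbours exist instead of returning NA.

```r
# Hypothetical helper: maximum over the previous, current and next element
window_max <- function(x, na.rm = FALSE) {
  prev <- c(NA, head(x, -1))   # x shifted one step down
  nxt  <- c(tail(x, -1), NA)   # x shifted one step up
  pmax(prev, x, nxt, na.rm = na.rm)
}

x <- c(1, 7, 3, 4, 5)
strict  <- window_max(x)                 # edges are NA
relaxed <- window_max(x, na.rm = TRUE)   # edges use the neighbours that exist
```

With na.rm = TRUE the last row becomes 5, as in the question's example, but the first row becomes 7 rather than NA.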
Assuming that you want to do this for a matrix rather than a vector (use rollapply for vectors), this is a straightforward solution, though probably not the fastest:
library(Hmisc)
x <- matrix(runif(10), ncol=2)
rowMaxs <- apply(x, 1, max)
row3Maxs <- apply(cbind(rowMaxs,
Lag(rowMaxs, 1),
Lag(rowMaxs, -1)), 1, max)
cbind(x, row3Maxs)
However, from a performance standpoint, the following might be better:
row3Maxsc <- c(NA,
               sapply(2:(length(rowMaxs)-1),
                      function(i)
                        max(rowMaxs[i], rowMaxs[i-1], rowMaxs[i+1])
               ),
               NA)
cbind(x, row3Maxsc)
