I have a data frame with 2 columns: one with numeric values and one with a name. The names repeat, but each occurrence has a different value.
Data <- data.frame(
Value = c(1:10),
Name = rep(LETTERS, each=4)[1:10])
I would like to write a function that takes the 3 highest numbers for each name and calculates their mean and median (returning NA if fewer than 3 values are present), and then takes all the values for each name and calculates the mean and median.
My initial attempt looks something like this:
my.mean <- function (x,y){
top3.x <- ifelse(x > 3 , NA, x)
return(mean(top3.x), median(top3.x))
}
Any hints on how to improve this will be appreciated.
I would probably recommend by() for this.
Something put together really quickly might look like this (if I understood your question correctly):
myFun <- function(indf) {
do.call(rbind, with(indf, by(Value, Name, FUN=function(x) {
Vals <- head(sort(x, decreasing=TRUE), 3)
if (length(Vals) < 3) {
c(Mean = NA, Median = NA)
} else {
c(Mean = mean(Vals), Median = median(Vals))
}
})))
}
myFun(Data)
# Mean Median
# A 3 3
# B 7 7
# C NA NA
Note that it is not a very useful function in this form because of how many parameters are hard-coded into the function. It's really only useful if your data is in the form you shared.
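One way to remove the hard-coding is to pass the column names and the number of top values as arguments. The following is only a sketch in base R: top_n_stats and its parameters value_col, group_col and n are hypothetical names introduced here, not part of the answer above.

```r
# Hypothetical generalisation of myFun: column names and n are arguments.
top_n_stats <- function(indf, value_col = "Value", group_col = "Name", n = 3) {
  # Split the value column into one vector per group
  groups <- split(indf[[value_col]], indf[[group_col]])
  # One row per group, columns Mean and Median
  t(sapply(groups, function(x) {
    vals <- head(sort(x, decreasing = TRUE), n)
    if (length(vals) < n) {
      c(Mean = NA, Median = NA)   # not enough values in this group
    } else {
      c(Mean = mean(vals), Median = median(vals))
    }
  }))
}

Data <- data.frame(Value = 1:10, Name = rep(LETTERS, each = 4)[1:10])
res  <- top_n_stats(Data)          # same result as myFun(Data)
res2 <- top_n_stats(Data, n = 2)   # group C now has enough values
```

With n = 2, group C is no longer NA because it has two values (9 and 10).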
Here's a data.table solution, assuming that you don't have any other NAs in your data:
require(data.table) ## 1.9.2+
setDT(Data) ## convert to data.table
Data[order(Name, -Value)][, list(m1=mean(Value[1:3]), m2=median(Value[1:3])), by=Name]
# Name m1 m2
# 1: A 3 3
# 2: B 7 7
# 3: C NA NA
Using dplyr
library(dplyr)
myFun1 <- function(dat){
dat %>%
group_by(Name)%>%
arrange(desc(Value))%>%
mutate(n=n(), Value=ifelse(n<3, NA_integer_, Value))%>%
summarize(Mean=mean(head(Value,3)), Median=median(head(Value,3)))
}
myFun1(Data)
#Source: local data frame [3 x 3]
# Name Mean Median
#1 A 3 3
#2 B 7 7
#3 C NA NA
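The question also asks for the mean and median over all values of each name, which the answers above leave out. A minimal base R sketch, assuming the same Data as in the question (all_stats is a name made up for illustration):

```r
Data <- data.frame(Value = 1:10, Name = rep(LETTERS, each = 4)[1:10])

# Mean and median of every value in each group, no top-3 restriction
means   <- tapply(Data$Value, Data$Name, mean)
medians <- tapply(Data$Value, Data$Name, median)

all_stats <- data.frame(Name   = names(means),
                        Mean   = as.vector(means),
                        Median = as.vector(medians))
all_stats
```

This could be merged with the top-3 table by Name if both summaries are wanted side by side.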
I am trying to apply a function to the columns of a tibble or data.frame, where the function depends on the index of each column. This comes up for me regularly; here is just one MWE:
library(tidyverse)
test <- data.frame(a = c(1,2,3), b = c(7,8,9), c = c(3,5,6))
test <- test %>% as_tibble() %>% mutate_all( ~lead(., 2))
This will lead every column by 2 (just an example). But what I want is to lead the first column by 1, the second by 2, and so on, doing something like mutate_all(~lead(., col_number())).
For this little example, I know one way to do it, like:
test <- as.matrix(test)
for (i in 1:ncol(test)){ test[,i] <- lead(test[,i], i) }
There might be other way to do it too, haven't thought about it much (one needs to convert as a matrix first, otherwise it doesn't produce the right result, I don't really know why).
But I'd like to do it with a mutate or apply, being able to get the index of column in general. With a more complex example.
Any idea?
One option is using purrr::map2_df to sequentially lead every column based on column number.
purrr::map2_df(test, seq_along(test), dplyr::lead)
# A tibble: 3 x 3
# a b c
# <dbl> <dbl> <dbl>
#1 2 9 NA
#2 3 NA NA
#3 NA NA NA
We can also use base R Map
test[] <- Map(function(x, y) c(tail(x, -y), rep(NA, y)), test, seq_along(test))
We can use data.table shift
library(data.table)
setDT(test)[, Map(shift, .SD, n = 1:3, type = 'lead')]
# a b c
#1: 2 9 NA
#2: 3 NA NA
#3: NA NA NA
Or using purrr
library(purrr)
map2_dfr(test, 1:3, ~shift(.x, n = .y, type = 'lead'))
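For the more general case of "a function of the column and its index", the base R Map idea can be wrapped in a small helper. This is only a sketch: map_by_index and lead_by are hypothetical names, and lead_by is a stand-in for dplyr::lead so that the example runs without extra packages.

```r
# Hypothetical helper: apply f(column, column_index) to every column
map_by_index <- function(df, f) {
  df[] <- Map(f, df, seq_along(df))
  df
}

# Stand-in for dplyr::lead(): shift x forward by k, padding with NA
lead_by <- function(x, k) c(x, rep(NA, k))[seq_along(x) + k]

test <- data.frame(a = c(1, 2, 3), b = c(7, 8, 9), c = c(3, 5, 6))
res  <- map_by_index(test, lead_by)
res
```

Any two-argument function can be passed as f, so the same helper covers transformations other than lead.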
I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.
For example, I start with this vector of data points, which I convert to a pairwise distance matrix.
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
It looks like this:
> pdist
spA spB spC spD spE spF
spA NA NA NA NA NA NA
spB 6 NA NA NA NA NA
spC 4 2 NA NA NA NA
spD 3 9 7 NA NA NA
spE 1 5 3 4 NA NA
spF 5 1 1 8 4 NA
Converting this table to a data.table
library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
rn spA spB spC spD spE spF
1: spA NA NA NA NA NA NA
2: spB 6 NA NA NA NA NA
3: spC 4 2 NA NA NA NA
4: spD 3 9 7 NA NA NA
5: spE 1 5 3 4 NA NA
6: spF 5 1 1 8 4 NA
If I have some subset that I want to extract the values for,
sub <- c('spB', 'spF', 'spD')
I can do the following, which yields the submatrix that I am interested in:
> pdist[.(sub), sub, with=FALSE]
spB spF spD
1: NA NA NA
2: 1 NA 8
3: 9 NA NA
Now, how can I apply a function, for example taking the mean (but potentially a custom function), of all values in this subset? I can do it this way, but I wonder if there are better ways in line with data.table manipulation.
> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6
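For comparison, the same computation on the plain matrix can take an arbitrary function. A sketch in base R, where apply_to_subset and pmat are hypothetical names (pmat is the matrix form of pdist, before the data.table conversion):

```r
dat  <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pmat <- as.matrix(dist(dat))
pmat[upper.tri(pmat, diag = TRUE)] <- NA

# Apply f to the non-NA pairwise values among the species in `sub`
apply_to_subset <- function(m, sub, f = mean) {
  vals <- m[sub, sub]
  f(vals[!is.na(vals)])
}

sub <- c('spB', 'spF', 'spD')
apply_to_subset(pmat, sub)           # mean of 1, 8, 9
apply_to_subset(pmat, sub, median)   # median of 1, 8, 9
```

Because only the lower triangle is kept, each pair contributes exactly one value regardless of the order of names in sub.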
UPDATE
Following up on this, I decided to see how different in performance a matrix vs a data.table approach would be:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}
> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
user system elapsed
18.116 0.154 18.317
> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
user system elapsed
795.456 13.357 806.820
It appears that going through the data.table step here is leading to a big performance cost.
Please see the solution posted here for an even more general solution. It may also help:
data.table: transforming subset of columns with a function, row by row
To apply the function, you can do the following:
Part 1. A Step-by-Step Solution
(1.a) Get the data into data.table format:
library(data.table)
library(magrittr) #for access to pipe operator
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
(1.b) Then, Get the list of Column Names:
# Get the list of names
sub <- c('spB', 'spF', 'spD')
(1.c) Define the function you want to apply
#Define the function you wish to apply
# Where, normalize is just a function as defined in the question:
normalize <- function(X, X.mean = mean(X, na.rm=T), X.sd = sd(X, na.rm=T)){
X <- (X - X.mean) / X.sd
return(X)}
(1.d) Apply the function:
# Voila:
pdist[, unlist(.SD, use.names = FALSE), .SDcols = sub] %>% normalize()
#Or, you can apply the function inside the [], as below:
pdist[, unlist(.SD, use.names = FALSE) %>% normalize(), .SDcols = sub]
# Or, if you prefer to do it without the pipe operator:
pdist[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = sub]
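As a quick sanity check of what normalize() does, here it is applied to a plain vector, independent of data.table. The input vector x is made up for illustration.

```r
normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
  (X - X.mean) / X.sd
}

x <- c(1, 9, NA, 8)   # illustrative values, including a missing one
z <- normalize(x)

mean(z, na.rm = TRUE)  # approximately 0
sd(z, na.rm = TRUE)    # approximately 1
```

NA values stay NA, and the remaining values are centred and scaled using statistics computed from the non-missing entries.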
Part 2. Some Advantages of the data.table Approach
Since you seem familiar with the matrix approach, I just wanted to point out some advantages of keeping the data.table approach.
(2.a) Apply functions within group by using the "by ="
One advantage over matrix is that you can still apply functions within group by using the "by =" argument.
In the example here, I assume you have a variable called "Grp."
With the by=Grp line, the normalization is within group now.
pdist[, unlist(.SD) %>% normalize(), .SDcols = sub, by=Grp]
(2.b) Another advantage is that you can keep other identifying information, for example, if each row has a "participant identifier" P.Id that you wish to keep and repeat:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
In the first step, done in this portion of the code: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id]
First, we create a new column called Combined.Data for data in all three columns identified in "sub"
Next to each row of the combined data, the appropriate Participant Id will repeat in column P.Id
In the second step, done in this portion of the code:
[order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
We can create a new column called Transformed to store the normalized values that result from applying the function normalize()
In addition, we can also include the Combined.Data column as well
So, with this single line:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
we subset columns,
collapse data across the subset,
keep track of the identifier for each datum (P.Id) even when collapsed,
apply a transformation on the entire collapsed data, and
end-up with a neat output in the form of a data table with 3 columns: (1) P.Id, (2) Transformed, & (3) Combined.Data (original values).
and, the order(P.Id) allows the output to appear meaningfully ordered.
The same would be possible with matrix approach, but would be much more cumbersome and take more lines of code.
Data table allows for powerful manipulation and management of data, especially when you start chaining operations together.
(2.c) Finally, if you just wish to keep row information as simple row.numbers, you can use the .I feature of the data.table package:
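# unlist(.SD) stacks the selected columns, so the row numbers are repeated once per column
pdist[, .(Row = rep(.I, times = length(sub)), Normalized = normalize(unlist(.SD))), .SDcols = sub]
This feature can be quite helpful, especially if you don't have a participant or row identifier that is inherently meaningful.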
Part 3. Disadvantage: Time Cost
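I reran the timing comparison from the question with one correction: spSub is a matrix with one subset per column, so the subsets are iterated with apply(spSub, MARGIN = 2, ...) instead of lapply. The data.table solution still takes significantly longer: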
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
# pdistDT$sp %<>% as.factor()
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
return(pdistDT[sub, sub, with = FALSE] %>%
unlist(., recursive = FALSE, use.names = FALSE) %>%
mean(., na.rm = TRUE))
}
dtMethod1 <- function(pdistDT, sub) {
return(pdistDT[sub, sub, with = FALSE] %>%
melt.data.table(., measure.vars = sub, na.rm=TRUE) %$%
mean(value))
}
system.time(q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x)))
# user system elapsed
# 2.86 0.00 3.27
system.time(q2 <- apply(spSub, MARGIN = 2, function(x) dtMethod(pdistDT, x)))
# user system elapsed
# 57.20 0.02 57.23
system.time(q3 <- apply(spSub, MARGIN = 2, function(x) dtMethod1(pdistDT, x)))
# user system elapsed
# 62.78 0.06 62.91
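The switch from lapply to apply matters because replicate(..., simplify=TRUE) returns a matrix, and lapply over a matrix visits every element rather than every column. A small sketch with made-up sizes (spSub_small is a hypothetical name):

```r
# 4 subsets of 3 names each, stored as a 3 x 4 character matrix
spSub_small <- replicate(4, sample(letters, 3), simplify = TRUE)
dim(spSub_small)

# lapply treats the matrix as a vector: one call per element (12 calls)
n_lapply <- length(lapply(spSub_small, identity))

# apply with MARGIN = 2 makes one call per column, i.e. per subset (4 calls)
n_apply <- length(apply(spSub_small, 2, function(s) paste(s, collapse = "")))
```

Using simplify = FALSE in replicate would instead give a list of subsets, which lapply would then iterate as intended.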
How do I calculate column medians using the colMedians function? I get an error: Argument 'x' must be a matrix or a vector.
col_medians <- round(colMedians(impute_marks[,-1], na.rm=TRUE),0)
k <- which(is.na(impute_marks), arr.ind=TRUE)
impute_marks[k] <- col_medians[k[,-1]]
I need to do the below operation for all the columns other than first column in the data frame.
The code below works fine for a single column, but inside a for loop it gives an "unknown courses" error.
impute_marks$c1[is.na(impute_marks$c1)] <- round(mean(impute_marks$c1[!is.na(impute_marks$c1)]),0)
here, impute_marks is the name of the dataset and c1 is the column name.
using the above operation I am able to find the mean and replace all NA values in c1 (column). But I have 30+ columns. How can I write the above operation in a for loop to loop through each course and replace NA value with the mean?
my function for the operation:
impute_marks$F27SA[is.na(impute_marks$F27SA)] <- round(mean(impute_marks$F27SA[!is.na(impute_marks$F27SA)]),0)
imputing_using_mean <- function()
{
courses <- names(impute_marks)[2:26]
for(i in seq_along(courses))
{
impute_marks$courses[[i]][is.na(impute_marks$courses[[i]])] <- round(mean(impute_marks$courses[[i]][!is.na(impute_marks$courses[[i]])]),0)
}
}
imputing_using_mean()
Essentially the same as the answer from @Aaron on "Replace NA values by row means", tweaked to account for the first column.
marks <- read.table(text="
a 1 NA 3
b 1 2 3
c NA NA NA
")
col_means <- round(colMeans(marks[,-1], na.rm=TRUE), 0)
k <- which(is.na(marks), arr.ind=TRUE)
marks[k] <- col_means[k[,2]-1]
# V1 V2 V3 V4
#1 a 1 2 3
#2 b 1 2 3
#3 c 1 2 3
Below is a solution that calculates the median of each column and replaces each NA value with the median of its column. The same works for the mean, except that the conversion to a matrix is not required.
library(matrixStats)  # provides colMedians()
# first convert the columns to impute (all but the first) to a matrix
matrix_marks <- as.matrix(impute_marks[, -1])
# calculate the median for each column
col_medians <- round(colMedians(matrix_marks, na.rm=TRUE), 0)
# get the (row, col) index of each NA value
k <- which(is.na(matrix_marks), arr.ind=TRUE)
# finally, replace each NA with the median of its column
matrix_marks[k] <- col_medians[k[, "col"]]
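The for loop from the question can also be made to work. The original fails because impute_marks$courses looks for a column literally named "courses"; indexing with [[ ]] and the column name stored in a variable avoids that. A sketch, where impute_with_mean and the small marks data frame are hypothetical stand-ins for the real data:

```r
# Replace NA in each listed column with that column's rounded mean.
# By default every column except the first is processed.
impute_with_mean <- function(df, cols = names(df)[-1]) {
  for (col in cols) {
    missing <- is.na(df[[col]])
    # Note: a column that is entirely NA gets NaN, since its mean is undefined
    df[[col]][missing] <- round(mean(df[[col]], na.rm = TRUE), 0)
  }
  df
}

marks <- data.frame(id = c("a", "b", "c"),
                    c1 = c(1, 1, NA),
                    c2 = c(NA, 2, NA),
                    c3 = c(3, 3, NA))
res <- impute_with_mean(marks)
res
```

The function returns the modified data frame instead of changing a global, so the result has to be assigned, for example impute_marks <- impute_with_mean(impute_marks).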
Suppose I have the following data frame:
m <- data.frame(a = c(".","1",2:10),
b = c(".","2",4:12),
c = c(rep(".",11)))
I use apply to get the max value of each row:
maxrowval <- apply(m,1,max)
fin <- cbind(m,maxrowval)
The problem is that rows 9 and 10 of fin do not give the max values.
I must be missing something here but can't point to the source of the problem. Maybe something to do with factors and the levels. Any help is appreciated.
Combining the character issue mentioned in the comments with the max function and removing "-Inf" from the results.
foo <- function(x){
tmp <- max(as.numeric(as.character(x)), na.rm = TRUE)
ifelse(is.infinite(tmp), NA, tmp)
}
apply(m, 1, foo)
[1] NA 2 4 5 6 7 8 9 10 11 12
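The original apply(m, 1, max) goes wrong because apply first turns the data frame into a character matrix, and max on strings compares them lexicographically (so "8" beats "10"). An alternative sketch that converts each column to numeric once and then takes a vectorized row-wise maximum with pmax; the names num and row_max are made up here:

```r
m <- data.frame(a = c(".", "1", 2:10),
                b = c(".", "2", 4:12),
                c = rep(".", 11),
                stringsAsFactors = FALSE)

# Convert every column to numeric; "." becomes NA (warnings silenced on purpose)
num <- lapply(m, function(col) suppressWarnings(as.numeric(as.character(col))))

# Row-wise maximum across columns, ignoring NA; all-NA rows stay NA
row_max <- do.call(pmax, c(num, na.rm = TRUE))
row_max
```

This avoids both the -Inf special case and the per-row function call.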
I have a data frame:
id <- c(rep(1, 4), rep(2, 3), rep(3, 2), 4)
rate <- c(rep(1, 3), NA, 0.5, 0.6, NA, 0.7, NA, NA)
df <- data.frame(id, rate)
and I need to replace the NA based on the following conditions:
for (i in 1:dim(df)[1]) {
if (is.na(df$rate[i])) {
mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)
if (is.nan(mrate)) {
df$rate[i] <- 1
} else {
df$rate[i] <- mrate
}
}
}
Apparently the for loop is simply too slow on a big data frame with >200K rows. How can I use a much faster way without using for loop?
Thanks!
This is a solution using data.tables:
library(data.table)
dt <- data.table( df, key = "id" )
dt[ , rate := ifelse( is.na(rate), round( mean(rate, na.rm=TRUE), 1), rate ), by = id ]
dt[ is.na(rate), rate := 1 ]
dt
id rate
1: 1 1.0
2: 1 1.0
3: 1 1.0
4: 1 1.0
5: 2 0.5
6: 2 0.6
7: 2 0.6
8: 3 0.7
9: 3 0.7
10: 4 1.0
I am not sure though, if the ifelse could/should be avoided.
As mentioned in my comment, for loops in R are not specifically slow. However, often a for loop indicates other inefficiencies in code. In this case, the subset operation that is repeated for each row to determine the mean is most likely the slowest bit of code.
for (i in 1:dim(df)[1]) {
if (is.na(df$rate[i])) {
mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1) ## This line!
if (is.nan(mrate)) {
df$rate[i] <- 1
} else {
df$rate[i] <- mrate
}
}
}
If instead, these group averages are determined before hand, the loop can do a rapid lookup.
foo <- aggregate(df$rate, list(df$id), mean, na.rm=TRUE)
for (i in 1:dim(df)[1]) {
if (is.na(df$rate[i])) {
mrate <- foo$x[foo$Group.1 == df$id[i]]
...
However, I am still doing a subset at df$id[i] on the large data.frame. Instead, using one of the tools that implements a split-apply-combine strategy is a good idea. Also, let's write a function that takes one group's rows, looks up the pre-computed group average, and does the right thing:
myfun <- function(DF) {
avg <- avgs$rate[avgs$id == unique(DF$id)]
if (is.nan(avg)) {
avg <- 1
}
DF$rate[is.na(DF$rate)] <- avg
return (DF)
}
The plyr version:
library(plyr)
avgs <- ddply(df, .(id), summarise, rate=mean(rate, na.rm=TRUE))
result <- ddply(df, .(id), myfun)
And the likely much faster data.table version:
library(data.table)
DT <- data.table(df)
setkey(DT, id)
DT[, avg := mean(rate, na.rm=TRUE), by=id]
DT[is.nan(avg), avg := 1]
DT[, rate := ifelse(is.na(rate), avg, rate)]
This way, we've avoided all lookup subsetting in favor of adding a pre-calculated column, and can now do row-wise lookups, which are fast and efficient. The extra column can be dropped inexpensively using:
DT[, avg := NULL]
The whole shebang can be written into a function or a data.table expression. But, IMO, that often comes at the expense of clarity!
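The same "pre-compute the group average, then fill" idea can also be sketched in base R with ave(), which returns the group statistic aligned to the original rows. This variant is an illustration rather than part of the answer above; grp_mean is a made-up name, and the rounding to one decimal follows the loop in the question.

```r
id   <- c(rep(1, 4), rep(2, 3), rep(3, 2), 4)
rate <- c(rep(1, 3), NA, 0.5, 0.6, NA, 0.7, NA, NA)
df   <- data.frame(id, rate)

# Group mean (rounded), repeated for every row of the group
grp_mean <- ave(df$rate, df$id,
                FUN = function(x) round(mean(x, na.rm = TRUE), 1))

# Groups with no observed rate have a NaN mean; fall back to 1
grp_mean[is.nan(grp_mean)] <- 1

# Fill only the missing rates
df$rate <- ifelse(is.na(df$rate), grp_mean, df$rate)
df
```

No explicit loop or per-row subsetting is involved, so this should scale far better than the original for loop.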
I'm not sure this exactly answers the OP's question, but for others who read this later, there is a different and much faster method of performing calculations on a subset of data other than actually subsetting the data: vector math. The engineers in the crowd will know what I'm talking about.
Instead of subsetting, assign a very rapid function to create an identity vector and multiply the data by the identity.
Now, this isn't faster for all cases. There are cases where vectorized functions are actually slower than item-explicit functions, and it all depends on your specific application. [Insert O-notation rant of your choice here.]
Here is how we would do a vector math implementation for this case:
# Create the NA identity vector.
na_identity <- is.na(df$rate)
# Initialize the final data frame.
# This is for non-destructive purposes.
df_revised <- df
# Replace all NA occurrences in final
# data frame with zero values.
df_revised$rate[na_identity] <- 0
# Loop through each unique [id]
# value in the data.
# Create an identity vector for the
# current ID, calculate the mean
# rate for that ID (replacing NaN with 1),
# and insert the mean for any NA values
# associated with that ID.
for (i in unique(df$id)){
id_identity <- df$id==i
id_mean <- sum(df_revised$rate * id_identity * !na_identity) / sum(id_identity * !na_identity)
if(is.nan(id_mean)){id_mean <- 1}
df_revised$rate <- df_revised$rate + id_mean * id_identity * na_identity
}
# id rate
# 1 1 1.00
# 2 1 1.00
# 3 1 1.00
# 4 1 1.00
# 5 2 0.50
# 6 2 0.60
# 7 2 0.55
# 8 3 0.70
# 9 3 0.70
# 10 4 1.00
From a vector math perspective, this code is easily readable. In this small example, the code is very fast, but the cycle time increases directly with the number of unique ID values. I'm not sure if this is the correct approach for the OP's larger application, but the solution is workable and theoretically sound and removes the need for complex and difficult to read logic blocks.