I want to replace apply() (and its derivatives) with a purrr function.
I have a data.frame like this one:
> df
V1 V2 V3
1 NA 2 3
2 2 NA 3
3 3 3 NA
And I want to apply two functions row-wise, min(x, na.rm = T) and which.min(x), and return the results as a data frame.
If I know how many columns there are I can do e.g. this:
pmap_dfr(df, function(V1, V2, V3) {data.frame(min = pmin(V1, V2, V3, na.rm = T),
where = which.min(c(V1, V2, V3)))})
min where
1 2 2
2 2 1
3 3 1
How can I make pmap() or any other purrr function take the whole row as an argument, just like apply() does?
func <- function(x) {data.frame(min = min(x, na.rm = T), where = which.min(x))}
> Reduce(rbind, apply(df,1, func))
min where
V2 2 2
V1 2 1
V11 3 1
I probably just missed a feature or some trick. Thanks for your help.
Your solution will work for all columns if you use an ellipsis.
pmap_dfr(df, ~data.frame(min = min(..., na.rm = TRUE), where = which.min(c(...))))
min where
1 2 2
2 2 1
3 3 1
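The ellipsis trick can also be combined with the apply()-style helper from the question. The following is a minimal sketch, assuming a hypothetical helper named row_min (not part of purrr) that collects whatever columns pmap_dfr() passes in into one vector:

```r
library(purrr)

# Example data from the question
df <- data.frame(V1 = c(NA, 2, 3), V2 = c(2, NA, 3), V3 = c(3, 3, NA))

# Hypothetical helper: gather all columns of one row into a single vector,
# so the body can be written as if it received the whole row (as with apply()).
row_min <- function(...) {
  x <- c(...)
  data.frame(min = min(x, na.rm = TRUE),
             where = unname(which.min(x)))  # unname() drops the column label
}

res <- pmap_dfr(df, row_min)
res
```

Because the helper never names the columns, it should work unchanged for a data frame with any number of numeric columns.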
One possibility could be:
df %>%
mutate(min = invoke(pmin, na.rm = TRUE, .),
where = max.col(-replace(as.matrix(.), is.na(.), Inf), ties.method = "first")) # NA -> Inf so it can never be the minimum
V1 V2 V3 min where
1 NA 2 3 2 2
2 2 NA 3 2 1
3 3 3 NA 3 1
Or if you want to keep just the last two columns:
df %>%
transmute(min = invoke(pmin, na.rm = TRUE, .),
where = max.col(-replace(as.matrix(.), is.na(.), Inf), ties.method = "first")) # NA -> Inf so it can never be the minimum
Not a purrr solution, but a data.table one-liner.
library(data.table)
dt <- fread("row V1 V2 V3
1 NA 2 3
2 2 NA 3
3 3 3 NA")
melt(dt, id.vars = "row")[ , .SD[which.min(value)], by = row]
row variable value
1: 1 V2 2
2: 2 V1 2
3: 3 V1 3
I'm trying to add a named row between two data.frames d1 and d2. My Desired_output is shown below.
I have tried a solution but it failed to get me to my desired output. Is there a solution to this?
d1 <- data.frame(b = 1:2, SE = 2:3)
d2 <- data.frame(b = 0:1, SE = 1:2)
a <- "obs"
# Solution failed:
dplyr::bind_rows(d1, !!a := rep(NA, ncol(d1)), d2)
Desired_output =
" b SE
1 1 2
2 2 3
obs NA NA
4 0 1
5 1 2"
In base R, we may also do
rbind(d1, `row.names<-`(d1[NA,][1,], a),
`row.names<-`(d2, nrow(d1) + seq_len(nrow(d2))))
-output
b SE
1 1 2
2 2 3
obs NA NA
3 0 1
4 1 2
Maybe this will help -
d1 <- data.frame(b = 1:2, SE = 2:3)
d2 <- data.frame(b = 0:1, SE = 1:2)
a <- "obs"
d3 <- d1[1, ]
d3[] <- NA
rownames(d3) <- a
rbind(d1, d3, d2)
# b SE
#1 1 2
#2 2 3
#obs NA NA
#11 0 1
#21 1 2
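The rbind() approach leaves the rows of d2 labelled 11 and 21. A minimal sketch of a wrapper that also renumbers them to match the desired output; the helper name insert_named_row is hypothetical, and it assumes both data frames have the same columns and default row names:

```r
d1 <- data.frame(b = 1:2, SE = 2:3)
d2 <- data.frame(b = 0:1, SE = 1:2)

# Hypothetical helper: put an all-NA row called `name` between `top` and `bottom`.
insert_named_row <- function(top, bottom, name) {
  filler <- top[1, , drop = FALSE]   # borrow the column layout from `top`
  filler[] <- NA                     # blank out every value
  rownames(filler) <- name
  out <- rbind(top, filler, bottom)
  # Relabel the bottom block so numbering continues after the named row.
  n_top <- nrow(top)
  rownames(out)[-seq_len(n_top + 1)] <- n_top + 1 + seq_len(nrow(bottom))
  out
}

res <- insert_named_row(d1, d2, "obs")
res
```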
Here is an alternative solution, in case you want it in a pipe!
We wrap the whole procedure around add_row from the tibble package.
With bind_rows we bind both tables together and add a row before index 3.
Then we have to switch between rownames_to_column and column_to_rownames to rename that row.
library(tibble)
library(dplyr)
add_row(bind_rows(d1, d2),
b = NA,
SE = NA,
.before = 3) %>%
data.frame() %>%
rownames_to_column("x") %>%
mutate(x = ifelse(x == "3", "obs", x)) %>%
column_to_rownames("x")
b SE
1 1 2
2 2 3
obs NA NA
4 0 1
5 1 2
I am trying to find a way to group rows and assign an index to each group without using a loop. The difficulty is that the grouping variable num has no unique identifier; num is a vector of numbers (defined as character). I want to group all rows that match at least by any of the numbers in the vector. The vectors are of different lengths and include up to 20 numbers.
Let me give an example:
I have a dataframe like this:
df <- data.frame(id = c(1:5), num = c('111;222', '333;111;444', '000;88888;1', '9999;111', '1'))
I split num by the separator ; such that every row becomes a vector:
library(dplyr)
library(stringr) # provides str_split()
df <- df %>%
mutate(num = str_split(num, ';'))
I want to index all rows for which at least one number in the vector matches with the same index. The result should look as follows:
id num group_index
1 c('111','222') 1
2 c('333','111','444') 1
3 c('000','88888','1') 2
4 c('9999','444') 1
5 '1' 2
The example illustrates another difficulty: Group 1 is identified by '111' as well as '444', even though row 1 is part of group 1 and does not contain '444' in num.
If num was just a string, I would do the following
df <- df %>%
group_by(num) %>%
mutate(group_index = cur_group_id())
Now, I thought I should perhaps start by identifying groups. A first approach, which does not work, is:
df <- df %>%
group_by_if(num, any(num, str_c(num, collapse = '|')) == T)
I know that I could start by writing a loop. However, R is not very efficient with loops, so I would prefer a solution without a loop – if there is any? Any hints would help!
Ok, this answer can probably be shortened (probably a lot), but I think the use of igraph keeps everything nice and visible for visual inspection of the number of groups.
library( data.table )
library( igraph )
#make df a data.table
setDT(df)
#split num-column to v1, v2, ... ,vn
df[, paste0("v", 1:length( tstrsplit(df$num, ";"))) := tstrsplit( num, ";")]
# id num v1 v2 v3
# 1: 1 111;222 111 222 <NA>
# 2: 2 333;111;444 333 111 444
# 3: 3 000;88888;1 000 88888 1
# 4: 4 9999;111 9999 111 <NA>
# 5: 5 1 1 <NA> <NA>
#now melt to long format
df.melt <- melt(df, id.vars = "id", measure.vars = patterns("^v[0-9]"), value.name = "from" )
#create links
df.melt[, to := shift( from, type = "lead"), by = .(id)][]
#drop incomplete rows
df.melt <- df.melt[ complete.cases(df.melt), ]
# id variable from to
# 1: 1 v1 111 222
# 2: 2 v1 333 111
# 3: 3 v1 000 88888
# 4: 4 v1 9999 111
# 5: 2 v2 111 444
# 6: 3 v2 88888 1
g = graph_from_data_frame( df.melt[ , .(from, to)])
# plot(g)
It looks like we have two separate groups to work with. Let's find out which node (number) belongs to which group, and use this info on the original df.
dt.lookup <- as.data.table( components(g)$membership, keep.rownames = TRUE )
# V1 V2
# 1: 111 1
# 2: 333 1
# 3: 000 2
# 4: 9999 1
# 5: 88888 2
# 6: 222 1
# 7: 444 1
# 8: 1 2
#go back to the molten data of the original df
df.melt <- melt(df, id.vars = "id", measure.vars = patterns("^v[0-9]"))
df.melt <- df.melt[ complete.cases(df.melt), ]
#perform update join to get the groupnumber
df.melt[ dt.lookup, group := i.V2, on = .(value = V1) ]
# id variable value group
# 1: 1 v1 111 1
# 2: 2 v1 333 1
# 3: 3 v1 000 2
# 4: 4 v1 9999 1
# 5: 5 v1 1 2
# 6: 1 v2 222 1
# 7: 2 v2 111 1
# 8: 3 v2 88888 2
# 9: 4 v2 111 1
# 10: 2 v3 444 1
# 11: 3 v3 1 2
#summarise to go back to original df form
df.melt[, .(num = paste0( value, collapse = ";"),
group = paste0( unique(group), collapse = ",")),
by = .(id) ][]
final output
# id num group
# 1: 1 111;222 1
# 2: 2 333;111;444 1
# 3: 3 000;88888;1 2
# 4: 4 9999;111 1
# 5: 5 1 2
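The same connected-components idea can be sketched more compactly by linking each id directly to its numbers. This is only an illustrative variant, assuming num is still the original ";"-separated string; the "id_" and "num_" prefixes are made up here to keep row ids and numbers from colliding as vertex names:

```r
library(igraph)

df <- data.frame(id = 1:5,
                 num = c("111;222", "333;111;444", "000;88888;1", "9999;111", "1"))

# One edge per (row, number) pair
nums <- strsplit(df$num, ";", fixed = TRUE)
edges <- data.frame(
  from = paste0("id_", rep(df$id, lengths(nums))),
  to   = paste0("num_", unlist(nums))
)

g <- graph_from_data_frame(edges, directed = FALSE)

# Rows that share any number end up in the same connected component
memb <- components(g)$membership
grp  <- unname(memb[paste0("id_", df$id)])

# Relabel groups in order of first appearance
df$group_index <- match(grp, unique(grp))
df
```

Rows that only share a number indirectly (through a third row) still land in the same group, which is the behaviour the question asks for.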
For each row in my dataframe, I want to find the second highest occurring value, as well as the least occurring value. How can I do this?
Df:
label v1 v2 v3 v4 v5 v6
5 3 3 3 6 6 8
5 7 1 1 1 7 0
5 3 5 6 6 6 5
I want to consider all columns besides 'label'
Expected output:
second largest occuring least occuring
6 8
7 0
5 3
Edit: I have updated the example after the answer was accepted to make it less confusing
A dplyr solution:
df %>%
rowid_to_column() %>%
gather(var, val, -label, -rowid) %>%
group_by(rowid, val) %>%
tally() %>%
summarise(second_largest_occuring = val[dense_rank(n) == 2],
least_occuring = val[n == min(n)]) %>%
ungroup() %>%
select(-rowid)
# A tibble: 3 x 2
second_largest_occuring least_occuring
<int> <int>
1 2 1
2 2 0
3 5 3
Data:
df <- read.table(text = "label v1 v2 v3 v4 v5 v6
5 3 3 3 2 2 1
5 2 1 1 1 2 0
5 3 5 6 6 6 5", header= TRUE)
Another dplyr solution that is a bit more readable and handles errors for NA and instances where there are multiple occurrences of the second largest. This solution also allows you to select multiple columns using dplyr language.
library(dplyr)
dat = read.table(text = 'label v1 v2 v3 v4 v5 v6
5 3 3 3 2 2 1
5 2 1 1 1 2 0
5 3 5 6 6 6 5', header = T)
second_largest <- function(x,na.rm = TRUE) {
if(na.rm) { x <- na.omit(x) } # omit NA values
second_largest <- x[dense_rank(desc(x)) == 2] # desc() makes rank 1 the largest, so rank 2 is the second largest
second_largest <- max(second_largest) # keep one value out of all the second largest, or NA
return(second_largest)
}
df <- dat %>%
mutate(
second_largest = select(., v1:v6) %>% apply(1, second_largest,na.rm = TRUE), # apply second_largest func to every row
min = select(., v1:v6) %>% apply(1,min,na.rm = TRUE) # apply min to every row
)
# label v1 v2 v3 v4 v5 v6 second_largest min
# 1 5 3 3 3 2 2 1 2 1
# 2 5 2 1 1 1 2 0 1 0
# 3 5 3 5 6 6 6 5 5 3
A few things to notice.
In the apply statement the 1 indicates that the function should be applied to the rows.
Update
If you want the value of the second most frequent number you just plug in a new function to do that.
second_most_frequent <- function(x, is_numeric = TRUE) {
out <- x %>%
table() %>% # Create a table of frequencies as characters
as.data.frame(stringsAsFactors = FALSE) %>%
arrange(desc(Freq)) %>% # Arrange with frequency descending
.[,1] %>% # Select the first column
.[2] # select the second most frequent (WARNING: Doesn't check for ties)
if(is_numeric){ out <- as.numeric(out) }
return(out)
}
df <- df %>%
mutate(
second_most_freq = select(., v1:v6) %>% apply(1,second_most_frequent,is_numeric = TRUE)
)
# label v1 v2 v3 v4 v5 v6 second_largest min second_most_freq
# 1 5 3 3 3 2 2 1 2 1 2
# 2 5 2 1 1 1 2 0 1 0 2
# 3 5 3 5 6 6 6 5 5 3 5
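Since second_most_frequent() does not check for ties, here is a hedged base R sketch of a tie-aware variant. The helper name freq_summary is hypothetical; it returns every value that shares the second-highest count and every value that shares the lowest count:

```r
# Hypothetical helper: frequency-based summary of one numeric vector (one row).
freq_summary <- function(x) {
  counts <- table(x)                  # NA values are dropped by default
  vals <- as.numeric(names(counts))   # the distinct values
  n <- as.vector(counts)              # how often each value occurs
  levels_desc <- sort(unique(n), decreasing = TRUE)
  second <- if (length(levels_desc) >= 2) vals[n == levels_desc[2]] else NA_real_
  list(second_most_frequent = second,
       least_frequent = vals[n == min(n)])
}

# First row of the question's example
res <- freq_summary(c(3, 3, 3, 6, 6, 8))
res
```

It could be applied per row with something like apply(df[, -1], 1, freq_summary), which yields one summary per row.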
I was told that passing equations as strings and evaluating them is bad practice. How can I still create a function which takes an equation and evaluates it without the string version and without using third party packages?
This is my function:
replaceFormula <- function(df, column, formula){
df[column] <- eval(parse(text=formula), df)
return(df)
}
This is my use case:
set.seed(24)
dataset <- matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5)
df <- as.data.frame(dataset)
replaceFormula(df, 'V5', 'V3+V4')
Update:
Is this also possible with conditions?
My example function:
replaceFactor <- function(df, column, condition, what){
df[column] <- sapply(df[column],function(x) ifelse(eval(parse(text=condition), df), what, x))
return(df)
}
My use case:
set.seed(24)
dataset <- matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5)
df <- as.data.frame(dataset)
replaceFactor(df, 'V5', 'V1==1', 'GOOD')
It looks like you've crafted yourself a kludgey version of transform
> set.seed(24)
> dataset <- matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5)
> df <- as.data.frame(dataset)
> transform(df, V5 = V3 + V4)
V1 V2 V3 V4 V5
1 1 5 3 5 8
2 1 1 2 1 3
3 4 4 4 NA NA
4 3 4 4 3 7
5 3 1 1 NA NA
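transform() works by capturing the unevaluated expression and evaluating it inside the data frame. A minimal base R sketch of the same idea for both functions from the question, with no strings and no extra packages. The names replace_with_expr and replace_where are made up for this illustration:

```r
# Capture the expression with substitute() and evaluate it in the data frame.
replace_with_expr <- function(df, column, expr) {
  e <- substitute(expr)
  df[[column]] <- eval(e, df, parent.frame())
  df
}

# Same idea for a condition: overwrite `column` wherever the condition is TRUE.
replace_where <- function(df, column, condition, what) {
  hit <- eval(substitute(condition), df, parent.frame())
  df[[column]][which(hit)] <- what   # which() skips NA conditions
  df
}

toy <- data.frame(V1 = c(1, 2, NA), V3 = c(3, 2, 4), V4 = c(5, 1, NA), V5 = c(10, 20, 30))

out1 <- replace_with_expr(toy, "V5", V3 + V4)
out2 <- replace_where(toy, "V5", V1 == 1, "GOOD")
```

Note that writing "GOOD" into a numeric column converts the whole column to character, just as in the ifelse() version from the question.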
We can pass the formula as a quosure and evaluate it by unquoting (!! or UQ) in the devel version of dplyr (or soon to be released 0.6.0)
library(dplyr)
replaceFormula <- function(dat, Col, form){
Col <- quo_name(enquo(Col)) # column name as a string
dat %>%
mutate(!!Col := !!form) # unquote the name and the quoted formula
}
replaceFormula(df, V5, quo(V3 + V4))
# V1 V2 V3 V4 V5
#1 1 5 3 5 8
#2 1 1 2 1 3
#3 4 4 4 NA NA
#4 3 4 4 3 7
#5 3 1 1 NA NA
Update
Based on the OP's comments, we can also pass an expression to evaluate and change the values based on that
replaceFormulaNew <- function(dat, Col, form, what){
Col <- enquo(Col)
ColN <- quo_name(Col) # column name as a string
what <- quo_name(enquo(what)) # replacement value as a string
dat %>%
mutate(!!ColN := ifelse(!!form, what, !!Col))
}
replaceFormulaNew(df, V5, quo(V1==1), GOOD)
# V1 V2 V3 V4 V5
#1 1 5 3 5 GOOD
#2 1 1 2 1 GOOD
#3 4 4 4 NA 4
#4 3 4 4 3 <NA>
#5 3 1 1 NA 1
replaceFormulaNew(df, V5, quo(V3 < V4), GOOD)
# V1 V2 V3 V4 V5
#1 1 5 3 5 GOOD
#2 1 1 2 1 3
#3 4 4 4 NA <NA>
#4 3 4 4 3 <NA>
#5 3 1 1 NA <NA>
enquo() takes the input argument and converts it to a quosure, while quo_name() converts it to a string so that mutate can assign the evaluated output to the column specified in the input.
I have a function in my real-world problem that returns a list. Is there any way to use this with the dplyr mutate()? This toy example doesn't work:
it = data.table(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2))
myfun = function(arg1,arg2) {
temp1 = arg1 + arg2
temp2 = arg1 - arg2
list(temp1,temp2)
}
myfun(1,2)
it %>% mutate(new = myfun(V2, V3))
I see that it is cycling through the output of the function in the first "column" of the new variable, but do not understand why.
Thanks!
The idiomatic way to do this using data.table would be to use the := (assignment by reference) operator. Here's an illustration:
it[, c(paste0("V", 4:5)) := myfun(V2, V3)]
If you really want a list, why not:
as.list(it[, myfun(V2, V3)])
Alternatively, maybe this is what you want, but why don't you just use the data.table functionality:
it[, c(.SD, myfun(V2, V3))]
# V1 V2 V3 V4 V5
# 1: a 1 2 3 -1
# 2: a 2 3 5 -1
# 3: b 3 4 7 -1
# 4: b 4 2 6 2
# 5: c 5 2 7 3
Note that if myfun were to name its output, then the names would show up in the final result columns:
# V1 V2 V3 new.1 new.2
# 1: a 1 2 3 -1
# 2: a 2 3 5 -1
# 3: b 3 4 7 -1
# 4: b 4 2 6 2
# 5: c 5 2 7 3
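For illustration, a version of the function that names its output could look like the sketch below. The name myfun_named is hypothetical; the element names new.1 and new.2 are taken from the output shown above:

```r
library(data.table)

it <- data.table(V1 = c("a", "a", "b", "b", "c"),
                 V2 = c(1, 2, 3, 4, 5),
                 V3 = c(2, 3, 4, 2, 2))

# Same computation as myfun, but the list elements carry names
myfun_named <- function(arg1, arg2) {
  list(new.1 = arg1 + arg2, new.2 = arg1 - arg2)
}

# The list names become the column names of the combined result
res <- it[, c(.SD, myfun_named(V2, V3))]
res
```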
Given the title to this question, I thought I'd post a tidyverse solution that uses dplyr::mutate. Note that myfun needs to output a data.frame to work.
library(tidyverse)
it = data.frame(
v1 = c("a","a","b","b","c"),
v2 = c(1,2,3,4,5),
v3 = c(2,3,4,2,2))
myfun = function(arg1,arg2) {
temp1 = arg1 + arg2
temp2 = arg1 - arg2
data.frame(temp1, temp2)
}
it %>%
nest(data = c(v2, v3)) %>%
mutate(out = map(data, ~myfun(.$v2, .$v3))) %>%
unnest(cols = c(data, out))
#> # A tibble: 5 x 5
#> v1 v2 v3 temp1 temp2
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 2 3 -1
#> 2 a 2 3 5 -1
#> 3 b 3 4 7 -1
#> 4 b 4 2 6 2
#> 5 c 5 2 7 3
Created on 2020-02-04 by the reprex package (v0.3.0)
The mutate() function is designed to add new columns to an existing data frame. A data frame is a list of vectors of the same length, so each new column must supply one value per row. Your function returns a list of two vectors, which mutate() can't store as a single column of that length.
You can rewrite your function as two functions, each of which return a vector. Then apply each of these separately using mutate() and it should work.
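A minimal sketch of that suggestion, assuming the list-returning function is split into two hypothetical helpers, add_fun and sub_fun, each returning a plain vector:

```r
library(dplyr)

it <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                 V2 = c(1, 2, 3, 4, 5),
                 V3 = c(2, 3, 4, 2, 2))

# Each helper returns one vector with one value per row
add_fun <- function(arg1, arg2) arg1 + arg2
sub_fun <- function(arg1, arg2) arg1 - arg2

out <- it %>%
  mutate(V4 = add_fun(V2, V3),
         V5 = sub_fun(V2, V3))
out
```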