I have a roster of employees, and I need to know which department each employee is in most often. It is trivial to tabulate employee ID against department name, but it is trickier to return the department name, rather than the roster count, from the frequency table. A simple example is below (column names = departments, row names = employee IDs).
DF <- matrix(sample(1:9,9),ncol=3,nrow=3)
DF <- as.data.frame.matrix(DF)
> DF
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
Now how do I get
> DF2
RE
1 V3
2 V1
3 V2
One option using your data (for future reference, use set.seed() to make examples using sample reproducible):
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"
A faster solution than using apply might be max.col:
colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"
...where ties.method can be any of "random", "first" or "last".
This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))
[[1]]
V2 V3
2 3
[[2]]
V1
1
[[3]]
V2
2
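If a single value per row is still needed when there are ties, two options are to pick a fixed side with the ties.method argument of max.col, or to collapse all tied names into one string. A minimal base R sketch, assuming the tied data above (the names DF_tie, first_max, last_max and all_max are made up for illustration):

```r
# Same tied data as above: row 1 has V2 == V3 == 7
DF_tie <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(7, 6, 4))

# Resolve ties deterministically by taking the first or the last maximum
first_max <- colnames(DF_tie)[max.col(DF_tie, ties.method = "first")]
last_max  <- colnames(DF_tie)[max.col(DF_tie, ties.method = "last")]

# Or keep every tied column, collapsed into one string per row
all_max <- unname(apply(DF_tie, 1, function(x) {
  paste(names(x)[x == max(x)], collapse = ",")
}))

first_max  # "V2" "V1" "V2"
last_max   # "V3" "V1" "V2"
all_max    # "V2,V3" "V1" "V2"
```

Which of these is appropriate depends on what a tie should mean for the roster; that choice is not settled by the question.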
One solution could be to reshape the data from wide to long, putting all the departments in one column and the counts in another, group by the employee id (in this case, the row number), and then filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.
library(tidyverse)
# sample data frame with a tie
df <- tibble(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5))
# If you aren't worried about ties:
df %>%
rownames_to_column('id') %>% # creates an ID number
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(cnt))
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
# If you're worried about keeping ties:
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
arrange(id)
# A tibble: 4 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
4 3 V3 5.
# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
mutate(dept_rank = rank(-cnt, ties.method = "first")) %>% # or 'last'
filter(dept_rank == 1) %>%
select(-dept_rank)
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 2 V1 8.
2 3 V2 5.
3 1 V3 9.
# if you wanted to keep the original wide data frame
df %>%
rownames_to_column('id') %>%
left_join(
df %>%
rownames_to_column('id') %>%
gather(max_dept, max_cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(max_cnt)),
by = 'id'
)
# A tibble: 3 x 6
id V1 V2 V3 max_dept max_cnt
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2. 7. 9. V3 9.
2 2 8. 3. 6. V1 8.
3 3 1. 5. 5. V2 5.
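The same reshape-then-filter idea can also be sketched without any packages. This is only a base R analogue of the pipeline above, not the answer's own code; the names long, is_max and ties_kept are illustrative:

```r
# Sample data with a tie in row 3 (V2 == V3 == 5)
df <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 5))

# Wide to long: stack() returns the counts in 'values' and the column name in 'ind'
long <- data.frame(id = rep(seq_len(nrow(df)), times = ncol(df)), stack(df))

# Flag the rows whose count equals the maximum within their id
is_max <- long$values == ave(long$values, long$id, FUN = max)

# Keep all maxima (ties included) and order by id
ties_kept <- long[is_max, ]
ties_kept <- ties_kept[order(ties_kept$id), ]
ties_kept
```

To keep only one department per id, `ties_kept[!duplicated(ties_kept$id), ]` would retain the first match for each id.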
If you're interested in a data.table solution, here's one. It's a bit tricky since you prefer to get the id of the first maximum; it's much easier if you'd rather have the last maximum. Nevertheless, it's not that complicated, and it's fast!
Here I've generated data of your dimensions (26746 * 18).
Data
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
data.table answer:
require(data.table)
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
Benchmarking:
# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
# user system elapsed
# 0.174 0.029 0.227
# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
# user system elapsed
# 2.322 0.036 2.602
identical(t1, t2)
# [1] TRUE
It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.
Edit: if any of the max ids is okay, then:
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
Based on the above suggestions, the following data.table solution worked very fast for me:
library(data.table)
set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
system.time(
DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#> user system elapsed
#> 0.15 0.06 0.21
DT[]
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#> 1: 7 4 1 2 3 7 6 6 6 1 V1
#> 2: 4 6 9 10 6 2 7 7 1 3 V4
#> 3: 3 4 9 8 9 9 8 8 6 7 V3
#> 4: 4 8 8 9 7 5 9 2 7 1 V4
#> 5: 4 3 9 10 2 7 9 6 6 9 V4
#> ---
#> 999996: 4 6 10 5 4 7 3 8 2 8 V3
#> 999997: 8 7 6 6 3 10 2 3 10 1 V6
#> 999998: 2 3 2 7 4 7 5 2 7 3 V4
#> 999999: 8 10 3 2 3 4 5 1 1 4 V2
#> 1000000: 10 4 2 6 6 2 8 4 7 4 V1
This also has the advantage that you can always specify which columns .SD should consider by listing them in .SDcols:
DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD:
DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
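The negation trick does not depend on data.table. A small base R sketch on the three-row example from the question, cross-checked against which.min (the variable names are illustrative):

```r
DF <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 4))
m <- as.matrix(DF)

# The column of the row minimum is the column of the row maximum of -m
min_by_maxcol <- colnames(m)[max.col(-m, ties.method = "first")]

# Cross-check with a row-wise which.min
min_by_apply <- colnames(m)[apply(m, 1, which.min)]

min_by_maxcol  # "V1" "V2" "V1"
```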
One option from dplyr 1.0.0 could be:
DF %>%
rowwise() %>%
mutate(row_max = names(.)[which.max(c_across(everything()))])
V1 V2 V3 row_max
<dbl> <dbl> <dbl> <chr>
1 2 7 9 V3
2 8 3 6 V1
3 1 5 4 V2
In some contexts, it could be safer to use pmap() (requires purrr):
DF %>%
mutate(row_max = pmap_chr(across(everything()), ~ names(c(...)[which.max(c(...))])))
Sample data:
DF <- structure(list(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6,
4)), class = "data.frame", row.names = c(NA, -3L))
A dplyr solution:
Idea:
add rowids as a column
reshape to long format
filter for max in each group
Code:
DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
filter(rank(-value) == 1)
Result:
# A tibble: 3 x 3
# Groups: rowname [3]
rowname column value
<chr> <chr> <dbl>
1 2 V1 8
2 3 V2 5
3 1 V3 9
This approach can be easily extended to get the top n columns.
Example for n=2:
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
mutate(rk = rank(-value)) %>%
filter(rk <= 2) %>%
arrange(rowname, rk)
Result:
# A tibble: 6 x 4
# Groups: rowname [3]
rowname column value rk
<chr> <chr> <dbl> <dbl>
1 1 V3 9 1
2 1 V2 7 2
3 2 V1 8 1
4 2 V3 6 2
5 3 V2 5 1
6 3 V3 4 2
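One caveat with this approach: rank() defaults to ties.method = "average", so two tied maxima both get rank 1.5 and filter(rank(-value) == 1) returns no row for that group. As a hedged alternative, the top n column names per row can be sketched in base R with order(), which breaks ties by position (top_n_cols is an illustrative name):

```r
DF <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 4))
n <- 2

# For each row, sort the values in decreasing order and keep the first n names.
# apply() returns one column per row of DF, so transpose to get one row per id.
top_n_cols <- t(apply(DF, 1, function(x) {
  names(x)[order(x, decreasing = TRUE)[seq_len(n)]]
}))

top_n_cols
# row 1: "V3" "V2"; row 2: "V1" "V3"; row 3: "V2" "V3"
```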
This is a fast and simple tidyverse solution that can easily be applied to any subset of columns in a data.frame. The version below also uses ifelse to add missing values if all columns are 0. The missing values would be useful if, e.g., someone wants to use it to recombine one-hot encoded columns. It works on the data in the question, but here's an example of a one-hot encoded data set that it also works on.
data <- data.frame(
oh_a = c(1,0,0,1,0,0)
,oh_b = c(0,1,1,0,0,0)
,oh_c = c(0,0,0,0,1,0)
,d = c("l","m","n","o","p","q"))
f <- function(x){ifelse(rowSums(x)==0, NA, names(x)[max.col(x, "first")])}
data %>%
mutate(transformed = f(across(starts_with("oh"))))
output:
oh_a oh_b oh_c d transformed
1 1 0 0 l oh_a
2 0 1 0 m oh_b
3 0 1 0 n oh_b
4 1 0 0 o oh_a
5 0 0 1 p oh_c
6 0 0 0 q <NA>
A simple for loop can also be handy:
> df<-data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
> df
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
> df2<-data.frame()
> for (i in 1:nrow(df)){
+ df2[i,1]<-colnames(df[which.max(df[i,])])
+ }
> df2
V1
1 V3
2 V1
3 V2
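The loop above grows df2 one cell at a time. A possible variant, sketched here with vapply so the result is built in one step and its type is checked (row_max is an illustrative name; RE is the column name from the question's desired output):

```r
df <- data.frame(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6, 4))

# One column name per row; character(1) makes vapply fail loudly
# if any row does not yield exactly one string
row_max <- vapply(
  seq_len(nrow(df)),
  function(i) names(df)[which.max(unlist(df[i, ]))],
  character(1)
)

df2 <- data.frame(RE = row_max)
df2
```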
Here is an answer that works with data.table and is simpler. This assumes your data.table is named yourDF:
j1 <- max.col(yourDF[, .(V1, V2, V3, V4)], "first")
yourDF$newCol <- c("V1", "V2", "V3", "V4")[j1]
Replace ("V1", "V2", "V3", "V4") and (V1, V2, V3, V4) with your column names
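This one is fast, but it is hard-coded for three columns and assumes each row has a unique maximum (with a tie the index evaluates to 0 and that row is silently dropped from the result):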
with(DF, {
names(DF)[(V1 > V2 & V1 > V3) * 1 + (V2 > V3 & V2 > V1) * 2 + (V3 > V1 & V3 > V2)*3]
})
I constructed the code below. It should assign the value 1 or 2 to vector v2 if an element in vector v1 occurs twice; e.g. "A" appears twice in v1, so v2 should read 1 in one of those rows and 2 in the other.
The code mostly works, except that in some cases the same number is assigned to v2 for both occurrences of an element in v1, which should not happen.
Can anybody help me with the issue? Thanks!
v1 <- c(rep(c("A","B","C","D","E","F","G"),rep(2,7)),c("H","I","J","K"))
v2 <- rep(3,length(v1))
df1 <- data.frame(v1,v2)
for (i in 1:length(df1$v1)) {
if (sum(df1$v1[i]==df1$v1)==2 & df1$v2[i]==3) {
df1$v2[i] <- sample(c(1,2),1,replace=TRUE)
} else if (sum(df1$v1[i]==df1$v1)==2 & df1$v2[i]==1) {
df1$v2[i] <- 2
} else if (sum(df1$v1[i]==df1$v1)==2 & df1$v2[i]==2) {
df1$v2[i] <- 1
} else {
df1$v2[i] <- 2
}
}
I think that I have understood what you require and hopefully the below should do what you want, using dplyr. It will randomly assign integer values from 1 to n, where n is the number of occurrences of a given letter (note this is generalizable from your requirement of 2 occurrences).
library(dplyr)
df1 <- data.frame(v1 = c(rep(c("A","B","C","D","E","F","G"),rep(2,7)),c("H","I","J","K")))
df1 <- df1 %>%
group_by(v1) %>%
mutate(v2 = case_when(n() > 1 ~ sample(c(1:n()), n(), replace = FALSE),
TRUE ~ 1L))
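For comparison, the same "random permutation of 1..n within each group" idea can be sketched in base R with ave(). This is an assumption-labelled alternative, not the answer's code, and set.seed(1) is only there to make the illustration repeatable:

```r
v1 <- c(rep(c("A", "B", "C", "D", "E", "F", "G"), each = 2), "H", "I", "J", "K")

set.seed(1)  # arbitrary seed, only for reproducibility of the sketch

# Within each group of v1, replace the positions by a random permutation of
# 1..n, where n is the group size; sample(1) is always 1 for singletons
v2 <- ave(seq_along(v1), v1, FUN = function(i) sample(length(i)))

df1 <- data.frame(v1, v2)
head(df1)
```

Because each group receives a full permutation, the two occurrences of a letter can never end up with the same number.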
v1 <- c(rep(c("A","B","C","D","E","F","G"),rep(2,7)),c("H","I","J","K"))
value = 1:length(v1)
v2 <- rep(3,length(v1))
df1 <- data.frame(v1,value,v2)
library(dplyr)
set.seed(9)
df1 %>%
sample_frac(1) %>% # shuffle rows
group_by(v1) %>% # for each v1 value
mutate(v2 = row_number()) %>% # count and flag occurrences
ungroup() %>% # forget the grouping
arrange(v1) # order by v1 (only for visualisation purposes)
# # A tibble: 18 x 3
# v1 value v2
# <fct> <int> <int>
# 1 A 1 1
# 2 A 2 2
# 3 B 4 1
# 4 B 3 2
# 5 C 5 1
# 6 C 6 2
# 7 D 7 1
# 8 D 8 2
# 9 E 9 1
#10 E 10 2
#11 F 12 1
#12 F 11 2
#13 G 14 1
#14 G 13 2
#15 H 15 1
#16 I 16 1
#17 J 17 1
#18 K 18 1
Using base R, I think you can arrive at what you want somewhat easily by using table and sequence in connection and manipulating the output.
Edit: after your comments, I now think I understand what you want.
res <- data.frame(v1, v2 = sequence(table(v1)), row.names = NULL)
res <- res[sample(1:nrow(res)), ] # Scramble data order
res <- res[order(res$v1), ] # Reorder by v1 column
# v1 v2
#1 A 1
#2 A 2
#3 B 1
#4 B 2
#5 C 1
#6 C 2
#7 D 2 # note 2 comes first here
#8 D 1
#9 E 1
#10 E 2
#11 F 1
#12 F 2
#13 G 1
#14 G 2
#15 H 1
#16 I 1
#17 J 1
#18 K 1
Edit2 "randomly" sorting before assigning:
df1 <- data.frame(v1)
df1[order(rank(v1, ties.method = "random")), "v2"] <- sequence(table(v1))
df1
I have a data frame like so:
df <- data.frame(
id = c(1, 1, 2, 2),
V1 = c(1:4),
V2 = c(5:8),
V3 = c(9:12))
Printed to the console it looks like this:
# id V1 V2 V3
# 1 1 1 5 9
# 2 1 2 6 10
# 3 2 3 7 11
# 4 2 4 8 12
Now, I would like to transform it to this shape:
# id V1 V2 V3 V4 V5 V6
# 1 1 1 5 9 2 6 10
# 2 2 3 7 11 4 8 12
How can I do this with base R or the tidyverse?
A possible tidyverse solution:
wide <- df %>%
group_by(id) %>%
mutate(obs = row_number()) %>%
gather(var, val, V1:V3) %>%
unite(comb, obs, var) %>%
spread(comb, val)
colnames(wide)[-1] <- paste("V", seq(1,ncol(wide) -1), sep = "")
# A tibble: 2 x 7
# Groups: id [2]
# id V1 V2 V3 V4 V5 V6
#1 1 1 5 9 2 6 10
#2 2 3 7 11 4 8 12
You could also do it using by:
df2 <- do.call(rbind,
by(df, df$id, function(x) c(x[1, "id"], as.vector(t(x[names(x) != "id"]))))
)
colnames(df2) <- c("id", paste0("V", seq(ncol(df2)-1)))
id V1 V2 V3 V4 V5 V6
1 1 1 5 9 2 6 10
2 2 3 7 11 4 8 12
Base R:
lists <- Map(function(x) data.frame(c(x[1,], x[2,-1])), split(df, df$id))
df2 <- do.call(rbind, lists)
To change the column names:
colnames(df2) <- c("id", paste0("V", seq_along(df2[-1])))
And the result:
# > df2
# id V1 V2 V3 V4 V5 V6
# 1 1 1 5 9 2 6 10
# 2 2 3 7 11 4 8 12
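Another base R route is stats::reshape(), which is built for this kind of long-to-wide conversion. A minimal sketch, assuming a helper column obs (a made-up name) that numbers the rows within each id:

```r
df <- data.frame(id = c(1, 1, 2, 2), V1 = 1:4, V2 = 5:8, V3 = 9:12)

# Number the observations within each id: 1, 2, 1, 2
df$obs <- ave(df$id, df$id, FUN = seq_along)

# One row per id; columns come out as V1.1, V2.1, V3.1, V1.2, V2.2, V3.2
wide <- reshape(df, idvar = "id", timevar = "obs", direction = "wide")

# Rename to V1..V6 and drop the leftover row names
names(wide)[-1] <- paste0("V", seq_len(ncol(wide) - 1))
rownames(wide) <- NULL
wide
```

Unlike the split-based version, this does not assume exactly two rows per id; ids with fewer rows would get NA in the missing columns.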
I was told that passing equations as strings and evaluating them is bad practice. How can I still write a function that takes an equation and evaluates it, without the string version and without using third-party packages?
This is my function:
replaceFormula <- function(df, column, formula){
df[column] <- eval(parse(text=formula), df)
return(df)
}
This is my use case:
set.seed(24)
dataset <- matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5)
df <- as.data.frame(dataset)
replaceFormula(df, 'V5', 'V3+V4')
Update:
Is this also possible with conditions?
My example function:
replaceFactor <- function(df, column, condition, what){
df[column] <- sapply(df[column],function(x) ifelse(eval(parse(text=condition), df), what, x))
return(df)
}
My use case:
set.seed(24)
dataset <- matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5)
df <- as.data.frame(dataset)
replaceFactor(df, 'V5', 'V1==1', 'GOOD')
It looks like you've crafted yourself a kludgey version of transform
> set.seed(24)
> dataset <- matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5)
> df <- as.data.frame(dataset)
> transform(df, V5 = V3 + V4)
V1 V2 V3 V4 V5
1 1 5 3 5 8
2 1 1 2 1 3
3 4 4 4 NA NA
4 3 4 4 3 7
5 3 1 1 NA NA
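If a custom function is still wanted, one package-free option is to capture the unquoted expression with substitute() and evaluate it inside the data frame, which is roughly how transform works. A hedged sketch; replace_with_expr, replace_where and the toy data are hypothetical names and values, not from the question:

```r
# Hypothetical helper: takes an unquoted expression instead of a string
replace_with_expr <- function(df, column, expr) {
  # substitute() captures the expression as the caller wrote it;
  # eval() looks names up in df first, then in the caller's environment
  df[[column]] <- eval(substitute(expr), df, parent.frame())
  df
}

# Hypothetical helper for the conditional case from the update
replace_where <- function(df, column, condition, what) {
  hit <- eval(substitute(condition), df, parent.frame())
  df[[column]][which(hit)] <- what  # which() ignores NA conditions
  df
}

toy <- data.frame(V1 = c(1, 1, 4), V3 = c(3, 2, 4), V4 = c(5, 1, NA), V5 = c(0, 0, 0))

res1 <- replace_with_expr(toy, "V5", V3 + V4)      # V5 becomes 8, 3, NA
res2 <- replace_where(toy, "V5", V1 == 1, "GOOD")  # V5 becomes "GOOD", "GOOD", "0"
```

Note that writing "GOOD" into a numeric column coerces the whole column to character, as the ifelse version in the question also does. Functions that capture expressions this way are convenient interactively but awkward to call from other functions.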
We can pass the formula as a quosure and evaluate it by unquoting (!! or UQ) in the devel version of dplyr (or soon to be released 0.6.0)
library(dplyr)
replaceFormula <- function(dat, Col, form){
Col <- quo_name(enquo(Col))
dat %>%
mutate(UQ(Col) := UQ(form))
}
replaceFormula(df, V5, quo(V3 + V4))
# V1 V2 V3 V4 V5
#1 1 5 3 5 8
#2 1 1 2 1 3
#3 4 4 4 NA NA
#4 3 4 4 3 7
#5 3 1 1 NA NA
Update
Based on the OP's comments, we can also pass an expression to evaluate and change the values based on that
replaceFormulaNew <- function(dat, Col, form, what){
Col <- enquo(Col)
ColN <- quo_name(Col)
what <- quo_name(enquo(what))
dat %>%
mutate(UQ(ColN) := ifelse(UQ(form), what, UQ(Col)))
}
replaceFormulaNew(df, V5, quo(V1==1), GOOD)
# V1 V2 V3 V4 V5
#1 1 5 3 5 GOOD
#2 1 1 2 1 GOOD
#3 4 4 4 NA 4
#4 3 4 4 3 <NA>
#5 3 1 1 NA 1
replaceFormulaNew(df, V5, quo(V3 < V4), GOOD)
# V1 V2 V3 V4 V5
#1 1 5 3 5 GOOD
#2 1 1 2 1 3
#3 4 4 4 NA <NA>
#4 3 4 4 3 <NA>
#5 3 1 1 NA <NA>
enquo takes the input argument and converts it to a quosure, while quo_name converts it to a string so that mutate can assign the evaluated output to the column named in the input.
I have a function in my real-world problem that returns a list. Is there any way to use this with the dplyr mutate()? This toy example doesn't work:
it = data.table(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2))
myfun = function(arg1,arg2) {
temp1 = arg1 + arg2
temp2 = arg1 - arg2
list(temp1,temp2)
}
myfun(1,2)
it%.%mutate(new = myfun(V2,V3))
I see that it is cycling through the output of the function in the first "column" of the new variable, but do not understand why.
Thanks!
The idiomatic way to do this using data.table would be to use the := (assignment by reference) operator. Here's an illustration:
it[, c(paste0("V", 4:5)) := myfun(V2, V3)]
If you really want a list, why not:
as.list(it[, myfun(V2, V3)])
Alternatively, maybe this is what you want, but why don't you just use the data.table functionality:
it[, c(.SD, myfun(V2, V3))]
# V1 V2 V3 V4 V5
# 1: a 1 2 3 -1
# 2: a 2 3 5 -1
# 3: b 3 4 7 -1
# 4: b 4 2 6 2
# 5: c 5 2 7 3
Note that if myfun were to name its output, then the names would show up in the final result columns:
# V1 V2 V3 new.1 new.2
# 1: a 1 2 3 -1
# 2: a 2 3 5 -1
# 3: b 3 4 7 -1
# 4: b 4 2 6 2
# 5: c 5 2 7 3
Given the title to this question, I thought I'd post a tidyverse solution that uses dplyr::mutate. Note that myfun needs to output a data.frame to work.
library(tidyverse)
it = data.frame(
v1 = c("a","a","b","b","c"),
v2 = c(1,2,3,4,5),
v3 = c(2,3,4,2,2))
myfun = function(arg1,arg2) {
temp1 = arg1 + arg2
temp2 = arg1 - arg2
data.frame(temp1, temp2)
}
it %>%
nest(data = c(v2, v3)) %>%
mutate(out = map(data, ~myfun(.$v2, .$v3))) %>%
unnest(cols = c(data, out))
#> # A tibble: 5 x 5
#> v1 v2 v3 temp1 temp2
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 2 3 -1
#> 2 a 2 3 5 -1
#> 3 b 3 4 7 -1
#> 4 b 4 2 6 2
#> 5 c 5 2 7 3
Created on 2020-02-04 by the reprex package (v0.3.0)
The mutate() function is designed to add new columns to an existing data frame, and each new column needs one element per row (or a single value that is recycled). myfun() returns a list of two vectors rather than a single vector of that length, so its result can't be added directly as one column.
You can rewrite your function as two functions, each of which returns a vector. Then apply each of these separately using mutate() and it should work.
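A sketch of that split, using base transform() in place of mutate() so that it runs without packages (the same two calls should work unchanged inside mutate()). The names myfun_sum and myfun_diff are made up for illustration:

```r
it <- data.frame(
  v1 = c("a", "a", "b", "b", "c"),
  v2 = c(1, 2, 3, 4, 5),
  v3 = c(2, 3, 4, 2, 2)
)

# myfun split into two functions, each returning a plain vector
myfun_sum  <- function(arg1, arg2) arg1 + arg2
myfun_diff <- function(arg1, arg2) arg1 - arg2

# Each call yields one value per row, so each becomes an ordinary column
out <- transform(it,
                 temp1 = myfun_sum(v2, v3),
                 temp2 = myfun_diff(v2, v3))
out
```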