Following up on my last question, I have a new, related question. After editing my post there and waiting about a week, I want to try again here.
This time with a better example:
Equip<- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,6,6,6)
Notif <-c(1,1,3,4,2,2,2,5,6,7,9,9,15,10,11,12,13,14,16,17,18,19)
rank <- c(1,1,2,3,1,1,1,1,2,3,1,1,2,1,2,3,1,2,3,4,5,6)
Component <- c("Ventil","Motor","Ventil","Ventil","Vergaser","Vergaser","Bremse",
"Lichtmaschine","Bremse","Lichtmaschine","Bremse","Motor","Lichtmaschine",
"Bremse","Bremse","Motor","Vergaser","Motor","Vergaser","Motor",
"Vergaser","Motor")
df <- data.frame(Equip,Notif,rank,Component)
Equip is my subject and rank is the visit number. Component is the item that has to be looked for.
I want to have an output like this:
If an Equip (subject) was visited 2 times (rank 1 and 2), look across all Equips with ranks 1 and 2 and check whether any Component was recorded on both the first and the second visit.
If an Equip (subject) was visited 3 times (rank 1, 2 and 3), look across all such Equips and check whether any Component is listed all 3 times, like Equip 1, rank 1, Component Motor; Equip 1, rank 2, Component Motor; Equip 1, rank 3, Component Motor.
The output should contain the name of the Component, like TRUE "Motor".
I have some code, but with it I can only compare visits 1 and 2, visits 2 and 3, and so on (I cannot split up again by number of ranks, like Equips with 2 ranks, Equips with 3 ranks and so on).
The code is this:
a <- lapply(split(df,df$Equip),function(x){
ll <- split(x,x$rank)
if(length(ll)>1 )
ii <- intersect(ll[[1]]$Component,ll[[2]]$Component ) ## test intersection
else
ii <- NA
c(length(ii)> 0 && !is.na(ii),ii)
})
b <- unlist(a)
c <- table(b,b)
rowSums(c)
Hopefully you can help me. Please ask if there are any questions.
Regarding your question about the output, and following your approach to the solution:
   Equip Component    V1 idx
1:     1    Ventil  TRUE   3
2:     2        NA FALSE   1
3:     3        NA FALSE   3
4:     4        NA FALSE   2
5:     5        NA FALSE   3
6:     6        NA FALSE   6
Something like that, but if it's easier, Equip and idx are not necessarily needed.
for Equip with 2 ranks:
TRUE FALSE
0 1
for Equip with 3 ranks:
TRUE FALSE
1 2
for Equip with 6 ranks:
TRUE FALSE
0 1
Here's the output I think would be of interest to you. It uses data.table.
First, we create a data.table from your data.frame df with keys = Equip, Component as follows.
require(data.table) # load package
# then create the data.table with keys as specified above
# Check that both these columns are already sorted out for you!
dt <- data.table(df, key=c("Equip", "Component"))
Second, we create a function that gives the desired output for a given rank query (2, 3, etc.):
this.check <- function(idx) {
chk <- seq(1, idx)
o <- subset(dt[, all(chk %in% rank), by=c("Equip", "Component")], V1 == TRUE)
if (nrow(o) > 0) o[, idx:=idx]
}
What does this do? Let's run this for rank=1,2. We run this by:
> this.check(2)
# output
Equip Component V1 idx
1: 1 Ventil TRUE 2
2: 5 Bremse TRUE 2
This tells you that for Equip = 1 and 5, the Components Ventil and Bremse, respectively, appear at both rank 1 and rank 2 (indicated by idx=2). You also get the column V1 = TRUE, even though I, as @Carl pointed out already, don't understand the need for it. If required, you can change the column names of this output using setnames.
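As a minimal sketch of the renaming step just mentioned: the table below is a hand-typed stand-in for the result of this.check(2), and the new column names are purely illustrative assumptions.

```r
library(data.table)

# Hand-typed stand-in for the output of this.check(2) shown above
o <- data.table(Equip = c(1, 5),
                Component = c("Ventil", "Bremse"),
                V1 = TRUE,
                idx = 2)

# setnames() renames columns by reference (no copy of the table is made);
# the new names here are hypothetical
setnames(o, old = c("V1", "idx"), new = c("all_ranks_present", "max_rank"))
names(o)
```

Because setnames works by reference, there is no need to assign the result back to o.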
Third, we use this function to query ranks=1,2, then ranks=1,2,3 .. and so on. This can be accomplished with a simple lapply as follows:
# Let's run the function for idx = 2 to 6.
# This will check from rank = 1,2 until rank=1,2,3,4,5,6
o <- lapply(2:6, function(idx) {
this.check(idx)
})
> o
[[1]]
Equip Component V1 idx
1: 1 Ventil TRUE 2
2: 5 Bremse TRUE 2
[[2]]
Equip Component V1 idx
1: 1 Ventil TRUE 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
It shows that for rank=1,2 and rank=1,2,3 you have some Component. For others there's nothing = NULL.
Finally, we can bind all of these together using rbind to get one single data.table as follows:
o <- do.call(rbind, o)
> o
Equip Component V1 idx
1: 1 Ventil TRUE 2
2: 5 Bremse TRUE 2
3: 1 Ventil TRUE 3
Here, idx=2 marks the Components that satisfy rank=1,2 and idx=3 marks the ones that satisfy rank=1,2,3.
Putting it all together:
this.check <- function(idx) {
chk <- seq(1, idx)
o <- subset(dt[, all(chk %in% rank), by=c("Equip", "Component")], V1 == TRUE)
if (nrow(o) > 0) o[, idx:=idx]
}
o <- do.call(rbind, lapply(2:6, function(idx) {
this.check(idx)
}))
I hope this helps.
Edit: (After series of exchanges in comments, this is the new solution I propose. I hope this is what you are after.)
require(data.table)
dt <- data.table(df, key=c("Equip", "Component"))
dt[, `:=`(e.max=max(rank)), by=Equip]
dt[, `:=`(ec.max=max(rank)), by=c("Equip", "Component")]
setkey(dt, "e.max", "ec.max")
this.check <- function(idx) {
t1 <- dt[J(idx,idx)]
t2 <- t1[, identical(as.numeric(seq_len(idx)), as.numeric(rank)),
by=c("Equip", "Component")]
o <- table(t2$V1)
if (length(o) == 1)
o <- c(o, "TRUE"=0)
o <- c("idx"=idx, o)
}
o <- do.call(rbind, lapply(2:6, function(idx) this.check(idx)))
> o
# idx FALSE TRUE
# [1,] 2 1 0
# [2,] 3 2 1
# [3,] 4 1 0
# [4,] 5 1 0
# [5,] 6 1 0
If I make an array of your data, columnwise, as
foo<-cbind(Equip,Notif, rank, Component)
eqp<-1 # later, loop over all values
foo[c( which( foo[,1]==eqp & (foo[,3]==1 | foo[,3]==2) ) ),4]
[1] "Ventil" "Motor" "Ventil"
Feed those results to table and extract items with count ==2
Clearly any item which shows up twice is what you want.
This is not an answer I'd recommend using, since tools like ddply and aggregate will do this much more cleanly, but I want to be sure that this is the answer you're after, assuming a loop over eqp values in the original Equip .
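The "feed to table and keep items with count 2" step could be sketched roughly as follows in base R. The helper name repeated_components is hypothetical. One assumption is added on top of the description above: (rank, Component) pairs are de-duplicated first, because otherwise a Component listed twice within the same visit would also reach a count of 2.

```r
Equip <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,6,6,6)
rank  <- c(1,1,2,3,1,1,1,1,2,3,1,1,2,1,2,3,1,2,3,4,5,6)
Component <- c("Ventil","Motor","Ventil","Ventil","Vergaser","Vergaser","Bremse",
               "Lichtmaschine","Bremse","Lichtmaschine","Bremse","Motor","Lichtmaschine",
               "Bremse","Bremse","Motor","Vergaser","Motor","Vergaser","Motor",
               "Vergaser","Motor")

# Hypothetical helper: Components of one Equip that occur at every rank in `ranks`
repeated_components <- function(eqp, ranks = 1:2) {
  keep  <- Equip == eqp & rank %in% ranks
  # one row per distinct (rank, Component) pair
  pairs <- unique(data.frame(rank = rank[keep], Component = Component[keep]))
  counts <- table(pairs$Component)
  names(counts)[counts == length(ranks)]
}

repeated_components(1)   # "Ventil"
repeated_components(5)   # "Bremse"
repeated_components(2)   # character(0): Equip 2 only has rank 1
```

Looping over unique(Equip) with lapply would then give the result for every Equip.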
Description
The ifelse() function allows filtering the values in a vector through a series of tests, each of which triggers a different action when it succeeds. For instance, let xx be a data.frame, as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx
  a b
1 1 1
2 2 2
3 1 3
4 3 4
Suppose that you want to create a new column, c, from column b, but depending on the values in column a in the following way:
For each row:
if the value in column a is 1, the value in column c is the same as the value in column b;
if the value in column a is 2, the value in column c is 100 times the value in column b;
in any other case, the value in column c is the negative of the value in column b.
Using ifelse(), a solution could be:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
-xx$b))
xx
  a b   c
1 1 1   1
2 2 2 200
3 1 3   3
4 3 4  -4
Problem 1
An aesthetic problem arises when the number of tests increases, say, four tests:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
ifelse(xx$a==3, ...,
ifelse(xx$a==4, ...,
...))))
I found a partial solution to the problem on this page, which consists of defining the functions if.else_(), i_() and e_() as follows:
library(lazyeval)
i_ <- function(if_stat, then) {
if_stat <- lazyeval::expr_text(if_stat)
then <- lazyeval::expr_text(then)
sprintf("ifelse(%s, %s, ", if_stat, then)
}
e_ <- function(else_ret) {
else_ret <- lazyeval::expr_text(else_ret)
else_ret
}
if.else_ <- function(...) {
args <- list(...)
for (i in 1:(length(args) - 1) ) {
if (substr(args[[i]], 1, 6) != "ifelse") {
stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
}
}
if (substr(args[[length(args)]], 1, 6) == "ifelse"){
stop("Last argument needs to be an else_ function.", call. = FALSE)
}
args$final <- paste(rep(')', length(args) - 1), collapse = '')
eval_string <- do.call('paste', args)
eval(parse(text = eval_string))
}
In this way, the problem given in the Description, can be rewritten as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
e_(-xx$b)
)
xx
  a b   c
1 1 1   1
2 2 2 200
3 1 3   3
4 3 4  -4
And the code for the four tests will simply be:
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
i_(xx$a==3, ...), # dots meaning actions for xx$a==3
i_(xx$a==4, ...), # dots meaning actions for xx$a==4
e_(...) # dots meaning actions for any other case
)
Problem 2 & Question
The given code apparently solves the problem. Then, I wrote the following test function:
test.ie <- function() {
dd <- data.frame(a=c(1,2,1,3), b=1:4)
if.else_(
i_(dd$a==1, dd$b),
i_(dd$a==2, dd$b*100),
e_(-dd$b)
) # it should give c(1, 200, 3, -4)
}
When I tried the test:
test.ie()
it produced the following error message:
Error in ifelse(dd$a == 1, dd$b, ifelse(dd$a == 2, dd$b * 100, -dd$b)) :
object 'dd' not found
Question
Since the if.else_() syntactic constructor is not supposed to run only from the console, is there a way for it to 'know' the variables from the function that calls it?
Note
In "Best way to replace a lengthy ifelse structure in R", a similar problem was posted. However, the given solution there focuses on building the table's new column with the given constant output values (the "then" or "else" slots of the ifelse() function), whereas my case addresses a syntactic problem in which the "then" or "else" slots can even be expressions in terms of other data.frame elements or variables.
I think you can use dplyr::case_when inside dplyr::mutate to achieve this.
library(dplyr)
df <- tibble(a=c(1,2,1,3), b=1:4)
df %>%
mutate(
foo = case_when(
.$a == 1 ~ .$b,
.$a == 2 ~ .$b * 100L,
TRUE ~ .$b * -1L
)
)
#> # A tibble: 4 x 3
#> a b foo
#> <dbl> <int> <int>
#> 1 1 1 1
#> 2 2 2 200
#> 3 1 3 3
#> 4 3 4 -4
In the upcoming release of dplyr 0.6.0 you won't need the awkward .$ workaround, and you can just use:
df %>%
mutate(
foo = case_when(
a == 1 ~ b,
a == 2 ~ b * 100L,
TRUE ~ b * -1L
)
)
Taking into account MrFlick's advice, I re-coded the if.else_() function as follows:
if.else_ <- function(...) {
args <- list(...)
for (i in 1:(length(args) - 1) ) {
if (substr(args[[i]], 1, 6) != "ifelse") {
stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
}
}
if (substr(args[[length(args)]], 1, 6) == "ifelse"){
stop("Last argument needs to be an else_ function.", call. = FALSE)
}
args$final <- paste(rep(')', length(args) - 1), collapse = '')
eval_string <- do.call('paste', args)
eval(parse(text = eval_string), envir = parent.frame())
}
Now the test.ie() function runs properly:
test.ie()
[1] 1 200 3 -4
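To see why the envir argument matters, here is a minimal sketch with two hypothetical helpers (run_here and run_in_caller are illustrative names, not part of the code above). By default, eval() evaluates in the frame of the function that calls it, so variables local to that function's caller are not visible; passing envir = parent.frame() moves the evaluation into the caller's frame.

```r
# Evaluates the text in run_here's own frame (the default for eval())
run_here <- function(txt) eval(parse(text = txt))

# Evaluates the text in the frame of whoever called run_in_caller
run_in_caller <- function(txt) eval(parse(text = txt), envir = parent.frame())

works <- function() {
  local_value <- 21
  run_in_caller("local_value * 2")
}

fails <- function() {
  local_value <- 21
  run_here("local_value * 2")
}

works()   # 42
# fails() stops with an "object 'local_value' not found" error; capture the message
msg <- tryCatch(fails(), error = function(e) conditionMessage(e))
msg
```

This assumes there is no global variable named local_value; if there were, run_here would silently pick that one up instead of failing.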
With full respect to the OP's remarkable effort to improve nested ifelse(), I prefer a different approach which I believe is easy to write, concise, maintainable and fast:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
library(data.table)
# coerce to data.table, and set the default first
setDT(xx)[, c:= -b]
xx[a == 1L, c := b] # 1st special case
xx[a == 2L, c := 100L*b] # 2nd special case, note use of integer 100L
# xx[a == 3L, c := ...] # other cases
# xx[a == 4L, c := ...]
#...
xx
# a b c
#1: 1 1 1
#2: 2 2 200
#3: 1 3 3
#4: 3 4 -4
Note that for the 2nd special case b is multiplied by the integer constant 100L to make sure that the right hand sides are all of type integer in order to avoid type conversion to double.
Edit 2: This can also be written in an even more concise (but still maintainable) way as a one-liner:
setDT(xx)[, c:= -b][a == 1L, c := b][a == 2L, c := 100L*b][]
data.table chaining works here, because c is updated in place so that subsequent expressions are acting on all rows of xx even if the previous expression was a selective update of a subset of rows.
Edit 1: This approach can be implemented with base R as well:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
xx$c <- -xx$b
idx <- xx$a == 1L; xx$c[idx] <- xx$b[idx]
idx <- xx$a == 2L; xx$c[idx] <- 100L * xx$b[idx]
xx
# a b c
#1 1 1 1
#2 2 2 200
#3 1 3 3
#4 3 4 -4
I have a matrix of integers
m <- rbind(c(1,2),
c(3,6),
c(5,1),
c(2,1),
c(6,3))
and I am looking for a function that takes this matrix as input and outputs a vector flag with length(flag) == nrow(m) that assigns the same unique (let's say integer) value to all rows that contain the same set of integers.
For the above example, the desired output would be:
flag <- c(1, 2, 3, 1, 2)
So rows 1 and 4 in m get the same flag 1, because they both contain the same set of integers, in this case {1, 2}. Similarly, rows 2 and 5 get the same flag.
The solution should work for any number of columns.
The only thing I could come up with is the following approach ...
FlagSymmetric <- function(x) {
vec_sim <- rep(NA, nrow(x)) # object containing flags
ind_ord <- ncol(x)
counter <- 1
for(i in 1:nrow(x)) {
if(is.na(vec_sim[i])) { # if that row is not flagged yet, proceed ...
vec_sim[i] <- counter # ... and give the next free flag
for(j in (i+1):nrow(x)) {
if( (i+1) > nrow(x) ) next # in case of tiny matrices
ind <- x[j, ] %in% x[i, ]
if(sum(ind)==ind_ord) vec_sim[j] <- counter # if the same, assign flag
}
counter <- counter + 1
}
}
return(vec_sim)
}
... which does what I want:
> FlagSymmetric(m)
[1] 1 2 3 1 2
If n = nrow(m) this needs 1/2 n^2 operations. Of course, I could make it much quicker by writing this in C++, but this only alleviates my problem to some extent, because I am working with matrices with a potentially huge number of rows.
I guess there must be a smarter way of doing this.
EDIT:
Additional, more general example (sorting row and pasting to character string not possible):
m2 <- rbind(c(1,112),
c(11,12),
c(12,11),
c(112,1),
c(6,3))
flag2 <- c(1, 2, 2, 1, 3) # desired output
FlagSymmetric(m2) # works
[1] 1 2 2 1 3
This assumes you only have numeric data in your matrix.
First, convert the matrix to a data frame:
m <- data.frame(m)
Then sort every row and paste its values together. Convert the resulting strings to a factor and then to numeric to get a unique number for every combination:
m$flag <- as.numeric(factor(apply(m, 1, function(x) paste0(sort(x), collapse = ""))))
m
# X1 X2 flag
#1 1 2 1
#2 3 6 3
#3 5 1 2
#4 2 1 1
#5 6 3 3
EDIT
The above solution does not work for every combination, as shown by the new example. To keep the individual numbers distinguishable, as @d.b commented, we can use any non-empty collapse argument. For the updated example:
as.numeric(factor(apply(m2, 1, function(x) paste0(sort(x), collapse = "-"))))
#[1] 1 2 2 1 3
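Note that going through factor() numbers the groups in sorted order of the pasted keys, which is why the first example gives 1 3 2 1 3 rather than the 1 2 3 1 2 asked for. A small variation, sketched here under the assumption that flags should be numbered in order of first appearance, replaces factor() with match() against the unique keys. The function name flag_rows and its sep argument are illustrative only.

```r
# Hypothetical helper: flag rows that contain the same values, in any order.
# Rows are compared after sorting, so repeated values within a row count too.
flag_rows <- function(x, sep = "-") {
  key <- apply(x, 1, function(r) paste(sort(r), collapse = sep))
  match(key, unique(key))   # number groups in order of first appearance
}

m  <- rbind(c(1, 2), c(3, 6), c(5, 1), c(2, 1), c(6, 3))
m2 <- rbind(c(1, 112), c(11, 12), c(12, 11), c(112, 1), c(6, 3))

flag_rows(m)    # 1 2 3 1 2
flag_rows(m2)   # 1 2 2 1 3
```

The work is one sort per row plus a hash-based match, so it avoids comparing every pair of rows.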
table has a peculiar behaviour in the sense that it uses the variable name as table 'title'
> table(c("A","A","B"))
A B
2 1
> a<-c("A","A","B");table(a)
a
A B
2 1
This behaviour is not convenient if you have a function which returns contingency tables
> aux <- function(x) return(table(x))
> aux(a)
x
A B
2 1
Is there a way to remove the table 'title'? Can I remove the table title and not get that blank line? I found a workaround, but I am not entirely satisfied with it.
> aux <- function(x) return(table(identity(x)))
> aux(a)
A B
2 1
You can use deparse.level = 0 in table(). Check help(table) for its possible values with explanation.
a <- c("A", "A", "B")
table(a)
# a
# A B
# 2 1
table(a, deparse.level = 0)
#
# A B
# 2 1
We can use as.vector
as.vector(table(a))
#[1] 2 1
If we need to remove the 'a' in the second case
tbl <- table(a)
names(dimnames(tbl)) <- NULL
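Applied to the function from the question, the last idea could look roughly like this. aux_plain is a hypothetical name; the sketch only shows that the dimension name is gone while counts and labels are kept.

```r
# Hypothetical variant of aux() that returns the table without its 'title'
aux_plain <- function(x) {
  tbl <- table(x)
  names(dimnames(tbl)) <- NULL   # drop the variable name attached by table()
  tbl
}

a <- c("A", "A", "B")
res <- aux_plain(a)

names(dimnames(res))   # NULL
as.vector(res)         # 2 1
names(res)             # "A" "B"
```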
I have a data.table with multiple categorical variables for which I would like to create contrast (or "dummy") variables along with many more numerical variables which I would like to simply pass by reference.
Example dataset:
library('data.table')
d <- data.table(1:3, # there are lots of numerics, so I want to avoid copying
letters[1:3], # convert these to factor then dummy variable
10:12,
LETTERS[24:26])
# >d
# V1 V2 V3 V4
# 1: 1 a 10 X
# 2: 2 b 11 Y
# 3: 3 c 12 Z
The desired result looks like:
>dummyDT(d)
V1 V3 V2.b V2.c V4.Y V4.Z
1: 1 10 0 0 0 0
2: 2 11 1 0 1 0
3: 3 12 0 1 0 1
which can be produced with:
# this does what I want but is slow and inelegant and not idiomatic data.table
categorToMatrix <- function(x, name_prefix='Var'){
# set levels in order of appearance to avoid default re-sort by alpha
m <- contrasts(factor(x, levels=unique(x)))
dimnames(m) <- list(NULL, paste(name_prefix, colnames(m), sep='.') )
m
}
dummyDT <- function(d){
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <-
data.table(
do.call(cbind, lapply(toDummy, function(j) {
categorToMatrix(d[[j]], name_prefix = names(d)[j])
} )
)
)
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
(I do not care about maintaining original column ordering.)
I have tried in addition to the above, the approach of splitting each matrix into a list of columns, as in:
# split a matrix into list of columns and keep track of column names
# expanded from @Tommy's answer at: https://stackoverflow.com/a/6821395/2573061
splitMatrix <- function(m){
setNames( lapply(seq_len(ncol(m)), function(j) m[,j]), colnames(m) )
}
# Example:
splitMatrix(categorToMatrix(d$V2, name_prefix='V2'))
# $V2.b
# [1] 0 1 0
#
# $V2.c
# [1] 0 0 1
which works for an individual column, but then when I try to lapply to multiple columns, these lists get somehow coerced into string-rows and recycled, which is baffling me:
dummyDT2 <- function(d){
stopifnot(inherits(d,'data.table'))
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <- d[, lapply(.SD, function(x) splitMatrix( categorToMatrix(x) ) ) ,
.SDcols=toDummy]
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
dummyDT2(d)
# V1 V3 V2
# 1: 1 10 0,1,0
# 2: 2 11 0,0,1
# 3: 3 12 0,1,0
# Warning message:
# In data.table::data.table(...) :
# Item 2 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
I then tried wrapping splitMatrix with data.table() and got an amusingly laconic error message.
I know that functions like caret::dummyVars exist for data.frame. I am trying to create a data.table optimized version.
Closely related question: How to one-hot-encode factor variables with data.table?
But there are two differences: I do not want full-rank dummy variables (because I'm using this for regression) but rather contrast variables (n-1 of these for n levels) and I have multiple numeric variables that I do not want to OHE.
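One detail worth illustrating about the n-1 contrast columns: contrasts() returns one row per level, not one row per observation, which only coincides with the data in the example because every level occurs exactly once. A base R sketch of expanding it to one row per observation follows. It assumes the default treatment contrasts, and to_contrast_cols is a hypothetical name rather than part of the code above.

```r
# Hypothetical helper: n-1 contrast columns, one row per observation
to_contrast_cols <- function(x, name_prefix = "Var") {
  # levels in order of appearance, as in categorToMatrix above
  f <- factor(x, levels = unique(x))
  # contrasts(f) has one row per level; index by the integer codes
  # of f to get one row per element of x
  m <- contrasts(f)[as.integer(f), , drop = FALSE]
  dimnames(m) <- list(NULL, paste(name_prefix, colnames(m), sep = "."))
  m
}

res <- to_contrast_cols(c("a", "b", "a", "c"), name_prefix = "V2")
res
```

Here the first level ("a") is the reference, so its rows are all zeros, and the result has four rows and the two columns V2.b and V2.c.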
I want to go through a data frame, check whether the first column is T or F, and depending on this add an entry to a new column using data from the second column.
If z[,1] == TRUE, set z[,4] to 2*z[,2]
else set z[,4] to z[,2]
That is, if the entry in column 1 is TRUE, set the new entry to 2 times the second column; otherwise just set it to the value of the second column at that index.
Let's create z:
set.seed(4)
z <- data.frame(first=c(T, F, F, T, F), second=sample(-2:2),
third=letters[5:1], stringsAsFactors=FALSE)
z
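Here is my for loop:
z$newVar2 <- NA  # create the column first so it can be filled row by row
for(i in 1:nrow(z)){
  if(z$first[i]){  # test only row i, not the whole column
    z$newVar2[i] <- 2*z$second[i]
  }
  else{
    z$newVar2[i] <- z$second[i]
  }
}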
Here is without a for loop:
z$newVar<-ifelse(z$first==TRUE, 2*z$second, z$second)
Is there a way to do this with apply? Is there a more efficient way to accomplish this task?
Not exactly what you asked, but if you are working with a matrix-like data structure, you might as well explore the data.table way of going about it:
#Make data.table
setDT(z)
setkey(z)
#Write function to do all the stuff
myfun <- function(first, second){ifelse(first, 2*second, second)}
#Do stuff
z[, newvar2:=myfun(first, second)]
#Printing z
first second third newvar2
1: FALSE -2 d -2
2: FALSE -1 a -1
3: FALSE 1 c 1
4: TRUE 0 e 0
5: TRUE 2 b 4
We can use data.table in a more efficient way still without defining a function, by making use of the fact that TRUE == 1
## use set.seed because we are sampling
set.seed(123)
z <- data.frame(first=c(T, F, F, T, F),
second=sample(-2:2),
third=letters[5:1], stringsAsFactors=FALSE)
library(data.table)
setDT(z)[, newvar2 := (first + 1) * second]
z
# first second third newvar2
# 1: TRUE -1 e -2
# 2: FALSE 1 d 1
# 3: FALSE 2 c 2
# 4: TRUE 0 b 0
# 5: FALSE -2 a -2
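The same arithmetic trick also works without data.table. As a rough base R sketch (the values are typed in by hand instead of sampled, and the column names are illustrative), the ifelse version, the arithmetic version and a logical-index version all agree:

```r
# Hand-typed example data, no random sampling
z <- data.frame(first  = c(TRUE, FALSE, FALSE, TRUE, FALSE),
                second = c(-1L, 1L, 2L, 0L, -2L))

# 1) vectorised ifelse, as in the question
z$by_ifelse <- ifelse(z$first, 2 * z$second, z$second)

# 2) arithmetic on the logical: TRUE + 1 is 2, FALSE + 1 is 1
z$by_arith <- (z$first + 1) * z$second

# 3) start from the default, then overwrite only the TRUE rows
z$by_index <- z$second
z$by_index[z$first] <- 2 * z$second[z$first]

z
```

All three are vectorised, so none of them needs apply or an explicit loop.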