ifelse not working in mutate if column not supplied in condition - r

It seems to me that if a column is not included in the conditional part of the ifelse statement, the ifelse statement in a dplyr mutate function does not work as expected:
mdf <- data.frame(a=c(1,2,3), b=c(3,4,5))
# this works:
> mdf %>% mutate(c=ifelse(a==1,0,1))
a b c
1 1 3 0
2 2 4 1
3 3 5 1
# This does not work (expected column c to be equal to a):
> mdf %>% mutate(c=ifelse(0==1,0,a))
a b c
1 1 3 1
2 2 4 1
3 3 5 1
# This does not work either (expected column c to be equal to a):
> mdf %>% mutate(c=ifelse("a" %in% names(.),a,0))
a b c
1 1 3 1
2 2 4 1
3 3 5 1
If using a regular if-statement, it does work:
> mdf %>% mutate(c=if("a" %in% names(.)){a}else{1})
a b c
1 1 3 1
2 2 4 2
3 3 5 3
However, I was hoping to use the ifelse statement, since it has a cleaner syntax. Is there a way to achieve the desired result with the ifelse statement?

I see that the length of the conditional statement determines what is returned. If the condition evaluates to one True/False value, it only returns one value (instead of the entire column). The value returned seems to be the first value of the desired column:
> mdf %>% mutate(c=ifelse("a" %in% names(.),a,0))
a b c
1 1 3 1
2 2 4 1
3 3 5 1
If I increase the length of the condition to the number of rows, the ifelse statement will return the entire column:
> mdf %>% mutate(c=ifelse(rep("a", nrow(.)) %in% names(.),a,0))
a b c
1 1 3 1
2 2 4 2
3 3 5 3
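The behaviour described above can be reproduced in base R, without dplyr. This is a minimal sketch, assuming the same mdf as in the question; has_a is a hypothetical helper name. ifelse() returns a result shaped like its test argument, so a length-one test yields a length-one result, whereas a plain if/else returns whichever branch is chosen in full.

```r
mdf <- data.frame(a = c(1, 2, 3), b = c(3, 4, 5))

# A scalar condition: one TRUE/FALSE for the whole data frame
has_a <- "a" %in% names(mdf)

# ifelse() shapes its result like the test, so only the first element survives
short <- ifelse(has_a, mdf$a, 0)
length(short)

# A plain if/else returns the chosen branch unchanged, i.e. the whole column
mdf$c <- if (has_a) mdf$a else 0
mdf
```

Inside mutate() the length-one result is then recycled to every row, which is why column c showed 1 everywhere.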

Related

Filtering of dataframe columns displaying a counter intuitive behavior (R)

Take as an example the dataframe below. I need to change the dataframe by keeping only the columns that are in the filter objects.
test <- data.frame(A = c(1,6,1,2,3) , B = c(1,2,1,1,2), C = c(1,7,6,4,1), D = c(1,1,1,1,1))
filter <- c("A", "B", "C", "D")
filter2 <- c("A","B","D")
To do that I'm using this piece of code:
`%ni%` <- Negate(`%in%`)
test <- test[,-which(names(test) %ni% filter2)]
If I use the filter2 object I get what is expected:
A B D
1 1 1 1
2 6 2 1
3 1 1 1
4 2 1 1
5 3 2 1
However, if I use the filter object, I get a dataframe with zero columns:
data frame with 0 columns and 5 rows
I expected to get an untouched dataframe, since filter had all test columns in it. Why does this happen, and how can I write more reliable code that does not produce empty dataframes in these situations?
Use ! instead of -
test[,!(names(test) %ni% filter2)]
test[,!(names(test) %ni% filter)]
Wrapping with which and negating with - works only when which returns at least one index:
> which(names(test) %ni% filter2)
[1] 3
> which(names(test) %ni% filter)
integer(0)
Applying - to integer(0) changes nothing, so the index stays empty and selects no columns:
> -which(names(test) %ni% filter)
integer(0)
> -which(names(test) %ni% filter2)
[1] -3
thus,
> test[integer(0)]
data frame with 0 columns and 5 rows
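If you do want to keep the which() and - idiom, a guard on the length of the index avoids the empty result. This is only a sketch, assuming the test data frame from the question; keep and drop_idx are hypothetical names.

```r
test <- data.frame(A = c(1, 6, 1, 2, 3), B = c(1, 2, 1, 1, 2),
                   C = c(1, 7, 6, 4, 1), D = c(1, 1, 1, 1, 1))
keep <- c("A", "B", "C", "D")

# Positions of the columns to drop; empty when every column is wanted
drop_idx <- which(!(names(test) %in% keep))

# Negating an empty index is still an empty index
identical(-drop_idx, drop_idx)

# Only apply the negative index when there is something to drop
kept <- if (length(drop_idx) > 0) test[, -drop_idx, drop = FALSE] else test
```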
I think you can simplify the column selection process by subsetting the dataframe with a character vector of column names.
test[filter]
# A B C D
#1 1 1 1 1
#2 6 2 7 1
#3 1 1 6 1
#4 2 1 4 1
#5 3 2 1 1
test[filter2]
# A B D
#1 1 1 1
#2 6 2 1
#3 1 1 1
#4 2 1 1
#5 3 2 1
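Subsetting by name raises an "undefined columns selected" error if the vector contains a name that is not in the data frame. A possible variation, assuming that case can occur, is to intersect the wanted names with the existing ones first. The names wanted and present, and the column "Z", are hypothetical.

```r
test <- data.frame(A = c(1, 6, 1, 2, 3), B = c(1, 2, 1, 1, 2),
                   C = c(1, 7, 6, 4, 1), D = c(1, 1, 1, 1, 1))

# "Z" does not exist in test; it stands in for a stale or misspelled name
wanted <- c("A", "D", "Z")

# Keep only the names that are actually present, in the order requested
present <- intersect(wanted, names(test))
subset_df <- test[present]
subset_df
```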

How can I replace values in an R matrix matching on rowname, columnname and value

I have a really big matrix and I'd like to replace the values in it using a lookup table.
I have a table of values (which looks a bit like this):
Origin Destination Distance Final
1 1 1 A
1 1 2 B
1 1 3 E
1 2 2 F
1 3 1 B
1 3 2 C
2 1 1 B
2 2 1 A
2 3 3 C
3 1 1 A
3 1 2 D
3 2 1 B
3 3 2 A
...
and I have a matrix, which looks something like this:
x 1 1 3 1 2 1 ...
1 1 3 2 1 2 1
1 2 2 1 2 2 1
3 2 1 2 1 1 2
1 3 1 2 1 2 1
2 1 1 3 1 1 1
1 2 2 1 3 1 1
...
I'm trying to match my matrix rownames with the Origin column, the matrix Colnames with the Destination Column and the matrix values with the Distance Column and then replace that value with the Final Column.
The Matrix is 4000 by 4000.
The Table is 27 by 4
So when I'm done it should look like:
x 1 1 3 1 2 1 ...
1 A E C A F A
1 B B B B F A
3 D A A A B A
1 E A C A F A
2 B B C B A B
1 B B B E A A
...
I'm currently using a little loop, which looks like this;
for (i in 1:nrow(CategoryTable)){
Origin <- CategoryTable[i,"O"]
Dest <- CategoryTable[i,"D"]
Distance <- CategoryTable[i,"Dist"]
Final <- CategoryTable[i,"Final"]
CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin,colnames(CategoryGrid) %in% Dest]] <-CategoryTable[i,"Final"]
}
Based on this question (Replace all values in a matrix <0.1 with 0) I can replace all the things matching a specific value or the things matching a column or row. But I can't match all at once.
The active ingredient of the current attempt is:
CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin,colnames(CategoryGrid) %in% Dest]] <-CategoryTable[i,"Final"]
So I was trying to match the rows and columns, pass that as a boolean vector to the value match, and then do the assignment from the right-hand side.
However, what I actually get is:
Error in CategoryGrid[row.names(CategoryGrid) %in% Origin, colnames(CategoryGrid) %in% :
incorrect number of dimensions
How would you go about achieving this?
Replacing the active ingredient with the following did the trick.
CategoryGrid[row.names(CategoryGrid) == Origin , colnames(CategoryGrid) == Dest] <- apply(CategoryGrid[row.names(CategoryGrid) == Origin , colnames(CategoryGrid) == Dest], MARGIN=c(1,2),function(x) ifelse(x == Distance, Final, x))
Specify the rules for the rows and columns, then do an apply and put the third variable in its own function.
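For a 4000 by 4000 matrix, a loop-free alternative may be worth trying: build one key per cell from row name, column name and value, and look all keys up in the table at once with match(). This is a sketch under assumptions, not the method used above: lookup holds six rows taken from the example table, grid is a small hypothetical matrix, and cells with no matching table row would come back as NA.

```r
lookup <- data.frame(
  Origin      = c(1, 1, 1, 3, 3, 3),
  Destination = c(1, 2, 3, 1, 2, 3),
  Distance    = c(1, 2, 2, 2, 1, 2),
  Final       = c("A", "F", "C", "D", "B", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical 2 x 3 grid: row names are origins, column names are destinations
grid <- matrix(c(1, 2, 2,
                 2, 2, 1),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("1", "3"), c("1", "3", "2")))

# One "origin_destination_distance" key per cell, in column-major order
cell_key  <- paste(rownames(grid)[row(grid)], colnames(grid)[col(grid)],
                   grid, sep = "_")
table_key <- paste(lookup$Origin, lookup$Destination, lookup$Distance,
                   sep = "_")

# Look every cell up at once and restore the matrix shape
result <- matrix(lookup$Final[match(cell_key, table_key)],
                 nrow = nrow(grid), dimnames = dimnames(grid))
result
```

Because row() and col() give the index of every cell, duplicated row or column names in the real matrix do not need special handling.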

vectorise rows of a dataframe, apply vector function, return to original dataframe r

Given the following df:
a=c('a','b','c')
b=c(1,2,5)
c=c(2,3,4)
d=c(2,1,6)
df=data.frame(a,b,c,d)
a b c d
1 a 1 2 2
2 b 2 3 1
3 c 5 4 6
I'd like to apply a function that normally takes a vector (and returns a vector) like cummax row by row to the columns in position b to d.
Then, I'd like to have the output back in the df, either as a vector in a new column of the df, or replacing the original data.
I'd like to avoid writing it as a for loop that would iterate every row, pull out the content of the cells into a vector, do its thing and put it back.
Is there a more efficient way? I've given the apply family functions a go, but I'm struggling to first get a good way to vectorise content of columns by row and get the right output.
The final output could look something like this (imagining I've applied the cummax() function):
a b c d
1 a 1 2 2
2 b 2 3 3
3 c 5 5 6
or
a b c d output
1 a 1 2 2 (1,2,2)
2 b 2 3 1 (2,3,3)
3 c 5 4 6 (5,5,6)
where output is a vector.
Seems this would just be a simple apply problem that you want to cbind to df:
> cbind(df, apply(df[ , 4:2] # work with columns in reverse order
, 1, # do it row-by-row
cummax) )
a b c d 1 2 3
d a 1 2 2 2 1 6
c b 2 3 1 2 3 6
b c 5 4 6 2 3 6
Ouch. Bitten by failing to notice that this would be returned as a column-oriented matrix and would need to be transposed. Such a newbie mistake. But it does show the value of having a question with a reproducible dataset, I suppose.
> cbind(df, t(apply(df[ , 4:2] , 1, cummax) ) )
a b c d d c b
1 a 1 2 2 2 2 2
2 b 2 3 1 1 3 3
3 c 5 4 6 6 6 6
To destructively assign the result to df you would just use:
df <- # .... that code.
This does the concatenation with commas (and as a result no longer needs to be transposed):
> cbind(df, output=apply(df[ , 4:2] , 1, function(x) paste( cummax(x), collapse=",") ) )
a b c d output
1 a 1 2 2 2,2,2
2 b 2 3 1 1,3,3
3 c 5 4 6 6,6,6
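The examples above run cummax over the columns in reverse order (d, c, b). To get the output shown in the question, the same apply and transpose pattern can be run over b to d in their original order and written back in place. A minimal sketch, assuming the df from the question; cols and row_cummax are hypothetical names.

```r
df <- data.frame(a = c("a", "b", "c"),
                 b = c(1, 2, 5),
                 c = c(2, 3, 4),
                 d = c(2, 1, 6))
cols <- c("b", "c", "d")

# apply() returns one column per input row, so transpose back to row layout
row_cummax <- t(apply(df[, cols], 1, cummax))

# Overwrite the original columns with the row-wise cumulative maxima
df[, cols] <- row_cummax
df
```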

for loop & if function in R

I am writing a for loop with an if statement in R. My table looks like this:
ID category
1 a
1 b
1 c
2 a
2 b
3 a
3 b
4 a
5 a
I want to use the for loop with an if statement to add another column that numbers the rows within each ID, like the Count column below:
ID category Count
1 a 1
1 b 2
1 c 3
2 a 1
2 b 2
3 a 1
3 b 2
4 a 1
5 a 1
My code is (output is the table name):
for (i in 2:nrow(output1)){
if(output1[i,1] == output[i-1,1]){
output1[i,"rn"]<- output1[i-1,"rn"]+1
}
else{
output1[i,"rn"]<-1
}
}
But in the result, every value in the count column is 1:
ID category Count
1 a 1
1 b 1
1 c 1
2 a 1
2 b 1
3 a 1
3 b 1
4 a 1
5 a 1
Please help me out... Thanks
There are packages and vectorized ways to do this task, but if you are practicing with loops try:
output1$rn <- 1
for (i in 2:nrow(output1)){
if(output1[i,1] == output1[i-1,1]){
output1[i,"rn"]<- output1[i-1,"rn"]+1
}
else{
output1[i,"rn"]<-1
}
}
With your original code, when you made this call output1[i-1,"rn"]+1 in the third line of your loop, you were referencing a column that didn't exist on the first pass. By first creating the column and filling it with the value 1, you give the loop something explicit to refer to.
output1
# ID category rn
# 1 1 a 1
# 2 1 b 2
# 3 1 c 3
# 4 2 a 1
# 5 2 b 2
# 6 3 a 1
# 7 3 b 2
# 8 4 a 1
# 9 5 a 1
With the package dplyr you can accomplish it quickly with:
library(dplyr)
output1 %>% group_by(ID) %>% mutate(rn = 1:n())
Or with data.table:
library(data.table)
setDT(output1)[,rn := 1:.N, by=ID]
With base R you can also use:
output1$rn <- with(output1, ave(seq_along(ID), ID, FUN = seq_along))
There are vignettes and tutorials for the two packages mentioned; for the last approach, run ?ave in the R console.
A looping solution will be painfully slow for bigger data. Here is a one-line solution using data.table:
require(data.table)
a<-data.table(ID=c(1,1,1,2,2,3,3,4,5),category=c('a','b','c','a','b','a','b','a','a'))
a[,':='(category_count = 1:.N),by=.(ID)]
What you want is actually the factor level codes of the category column. Do this:
df$count=as.numeric(df$category)
This gives the output:
ID category count
1 1 a 1
2 1 b 2
3 1 c 3
4 2 a 1
5 2 b 2
6 3 a 1
7 3 b 2
8 4 a 1
9 5 a 1
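This works provided category is already a factor, and only because each ID's categories in the sample happen to run a, b, c in order. If category is not a factor, convert it first: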
df$category=as.factor(df$category)
df$count=as.numeric(df$category)
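Factor codes and a per-ID counter agree on the sample table, but they are different things. The sketch below uses a small hypothetical table in which ID 2 starts at category b, to show where the two diverge; codes and counter are hypothetical names.

```r
# Hypothetical data: the second ID does not start at category "a"
df <- data.frame(ID = c(1, 1, 2, 2),
                 category = c("a", "b", "b", "c"))

# Factor codes depend only on the category label (a = 1, b = 2, c = 3)
codes <- as.numeric(factor(df$category))

# A per-ID counter restarts at 1 for every ID
counter <- ave(seq_along(df$ID), df$ID, FUN = seq_along)

data.frame(df, codes, counter)
```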

Conditionally dropping duplicates from a data.frame

I am trying to figure out how to subset my dataset according to the repeated value of the variable s, also taking into account the id associated with the row.
Suppose my dataset is:
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2",
header=TRUE)
What I would like to do is, for each id, to keep only the first row for which s = 3. The result with dat would be:
id s
1 2
1 2
1 1
1 3
2 3
3 2
3 2
I have tried to use both duplicated() and which(), intending to pass the result to subset() afterwards, but I am not getting anywhere. The main problem is that it is not sufficient to isolate the first row of the s = 3 "blocks", because in some cases (as here between id = 1 and id = 2) the 3's overlap between one id and another. Which strategy would you adopt?
Like this:
subset(dat, s != 3 | s == 3 & !duplicated(dat))
# id s
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 3
# 7 2 3
# 9 3 2
# 10 3 2
Note that subset can be dangerous to work with (see Why is `[` better than `subset`?), so the longer but safer version would be:
dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
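The reason this works is that duplicated(dat) flags a row when the whole (id, s) pair has already appeared, so the first s = 3 row of each id is never flagged. The sketch below spells that out step by step on the same data, built with data.frame() instead of read.table(); repeat_pair and kept are hypothetical names.

```r
dat <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
                  s  = c(2, 2, 1, 3, 3, 3, 3, 3, 2, 2))

# TRUE for every row whose (id, s) pair was already seen further up
repeat_pair <- duplicated(dat[c("id", "s")])

# Keep all rows where s is not 3, plus the first s == 3 row of each id
kept <- dat[dat$s != 3 | !repeat_pair, ]
kept
```

Note that this keeps the first s = 3 row per id even when the 3's are not contiguous within that id.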
