How can I replace values in an R matrix matching on row name, column name and value

I have a really big matrix and I'd like to replace the values in it using a lookup table.
I have a table of values (which looks a bit like this):
Origin Destination Distance Final
1 1 1 A
1 1 2 B
1 1 3 E
1 2 2 F
1 3 1 B
1 3 2 C
2 1 1 B
2 2 1 A
2 3 3 C
3 1 1 A
3 1 2 D
3 2 1 B
3 3 2 A
...
and I have a matrix, which looks something like this:
x 1 1 3 1 2 1 ...
1 1 3 2 1 2 1
1 2 2 1 2 2 1
3 2 1 2 1 1 2
1 3 1 2 1 2 1
2 1 1 3 1 1 1
1 2 2 1 3 1 1
...
I'm trying to match my matrix row names with the Origin column, the matrix column names with the Destination column and the matrix values with the Distance column, and then replace each value with the matching entry from the Final column.
The matrix is 4000 by 4000.
The table is 27 by 4.
So when I'm done it should look like:
x 1 1 3 1 2 1 ...
1 A E C A F A
1 B B B B F A
3 D A A A B A
1 E A C A F A
2 B B C B A B
1 B B B E A A
...
I'm currently using a little loop, which looks like this:
for (i in 1:nrow(CategoryTable)){
Origin <- CategoryTable[i,"O"]
Dest <- CategoryTable[i,"D"]
Distance <- CategoryTable[i,"Dist"]
Final <- CategoryTable[i,"Final"]
CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin,colnames(CategoryGrid) %in% Dest]] <-CategoryTable[i,"Final"]
}
Based on this question (Replace all values in a matrix <0.1 with 0) I can replace all the things matching a specific value or the things matching a column or row. But I can't match all at once.
The active ingredient of the current attempt is:
CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin,colnames(CategoryGrid) %in% Dest]] <-CategoryTable[i,"Final"]
So I was trying to match the rows and columns, pass that as a boolean vector to the value match, and then do the assignment from the right-hand side.
However, what I actually get is:
Error in CategoryGrid[row.names(CategoryGrid) %in% Origin, colnames(CategoryGrid) %in% :
incorrect number of dimensions
How would you go about achieving this?

Replacing the active ingredient with the following did the trick (drop = FALSE keeps the selection a matrix even when only one row or column matches, which apply needs):
CategoryGrid[row.names(CategoryGrid) == Origin, colnames(CategoryGrid) == Dest] <- apply(CategoryGrid[row.names(CategoryGrid) == Origin, colnames(CategoryGrid) == Dest, drop = FALSE], MARGIN = c(1, 2), function(x) ifelse(x == Distance, Final, x))
Specify the conditions for the rows and columns in the index, then use apply to handle the third condition (the Distance value) in its own function.
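For a 4000 by 4000 grid, a loop-free variant may also be worth trying. The sketch below is one possible approach, not the method used above: it builds a composite key from row name, column name and value for every cell, then looks all cells up at once with match(). The names lookup, grid, cell_key, table_key and recoded are made up for illustration, and the tiny grid is hypothetical; the lookup rows are the ones shown in the question.

```r
# Lookup rows as shown in the question (the full table has 27 rows)
lookup <- data.frame(
  Origin      = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  Destination = c(1, 1, 1, 2, 3, 3, 1, 2, 3, 1, 1, 2, 3),
  Distance    = c(1, 2, 3, 2, 1, 2, 1, 1, 3, 1, 2, 1, 2),
  Final       = c("A", "B", "E", "F", "B", "C", "B", "A", "C", "A", "D", "B", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical 2 x 2 grid; row and column names play the role of Origin and Destination
grid <- matrix(c(1, 2, 2, 1), nrow = 2,
               dimnames = list(c("1", "3"), c("1", "2")))

# One key per cell: "<row name>_<column name>_<value>", in column-major order
cell_key <- paste(rownames(grid)[as.vector(row(grid))],
                  colnames(grid)[as.vector(col(grid))],
                  as.vector(grid), sep = "_")

# The same key for every row of the lookup table
table_key <- paste(lookup$Origin, lookup$Destination, lookup$Distance, sep = "_")

# Look every cell up in one go and restore the matrix shape
recoded <- matrix(lookup$Final[match(cell_key, table_key)],
                  nrow = nrow(grid), dimnames = dimnames(grid))
```

Cells whose combination does not occur in the lookup table come out as NA in this sketch. Because the keys are built from positions via row() and col(), repeated row or column names are not a problem.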

Related

ifelse not working in mutate if column not supplied in condition

It seems to me that if a column is not included in the conditional part of the ifelse statement, the ifelse statement in a dplyr mutate function does not work as expected:
mdf <- data.frame(a=c(1,2,3), b=c(3,4,5))
# this works:
> mdf %>% mutate(c=ifelse(a==1,0,1))
a b c
1 1 3 0
2 2 4 1
3 3 5 1
# This does not work (expected column c to be equal to a):
> mdf %>% mutate(c=ifelse(0==1,0,a))
a b c
1 1 3 1
2 2 4 1
3 3 5 1
# This does not work either (expected column c to be equal to a):
> mdf %>% mutate(c=ifelse("a" %in% names(.),a,0))
a b c
1 1 3 1
2 2 4 1
3 3 5 1
If using a regular if-statement, it does work:
> mdf %>% mutate(c=if("a" %in% names(.)){a}else{1})
a b c
1 1 3 1
2 2 4 2
3 3 5 3
However, I was hoping to use the ifelse statement, since it has a cleaner syntax. Is there a way to achieve the desired result with the ifelse statement?
I see that the length of the conditional statement determines what is returned. If the condition evaluates to one True/False value, it only returns one value (instead of the entire column). The value returned seems to be the first value of the desired column:
> mdf %>% mutate(c=ifelse("a" %in% names(.),a,0))
a b c
1 1 3 1
2 2 4 1
3 3 5 1
If I increase the length of the condition to the number of rows, the ifelse statement will return the entire column:
> mdf %>% mutate(c=ifelse(rep("a", nrow(.)) %in% names(.),a,0))
a b c
1 1 3 1
2 2 4 2
3 3 5 3
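The behaviour described above follows from how ifelse() is defined: its result has the same length as the test, so a single TRUE or FALSE picks just one element from the yes or no argument. A minimal base R sketch, where the vector a mirrors column a of mdf and the other variable names are for illustration only:

```r
a <- c(1, 2, 3)

# Length-one test: only the first element of `a` is returned
one_value <- ifelse(TRUE, a, 0)

# Test as long as `a`: every element is returned
all_values <- ifelse(rep(TRUE, length(a)), a, 0)

# A plain if/else returns whichever branch is chosen, whole
whole_col <- if (TRUE) a else 0
```

Inside mutate(), the length-one result is then recycled to fill the column, which is why every row shows 1. When the condition is a single value rather than one value per row, a plain if ... else is the construct that hands back the whole column.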

Filtering of dataframe columns displaying a counter intuitive behavior (R)

Take as an example the dataframe below. I need to change the dataframe by keeping only the columns that are in the filter objects.
test <- data.frame(A = c(1,6,1,2,3) , B = c(1,2,1,1,2), C = c(1,7,6,4,1), D = c(1,1,1,1,1))
filter <- c("A", "B", "C", "D")
filter2 <- c("A","B","D")
To do that I'm using this piece of code:
`%ni%` <- Negate(`%in%`)
test <- test[,-which(names(test) %ni% filter2)]
If I use the filter2 object I get what is expected:
A B D
1 1 1 1
2 6 2 1
3 1 1 1
4 2 1 1
5 3 2 1
However, if I use the filter object, I get a dataframe with zero columns:
data frame with 0 columns and 5 rows
I expected to get an untouched dataframe, since filter had all test columns in it. Why does this happen, and how can I write more reliable code that doesn't produce empty dataframes in these situations?
Use ! instead of -
test[,!(names(test) %ni% filter2)]
test[,!(names(test) %ni% filter)]
Wrapping with which and using - works only when the output of which has a length greater than 0:
> which(names(test) %ni% filter2)
[1] 3
> which(names(test) %ni% filter)
integer(0)
Negating with - changes nothing in the integer(0) case:
> -which(names(test) %ni% filter)
integer(0)
> -which(names(test) %ni% filter2)
[1] -3
thus,
> test[integer(0)]
data frame with 0 columns and 5 rows
I think you can simplify the column selection by subsetting the dataframe with a character vector of column names.
test[filter]
# A B C D
#1 1 1 1 1
#2 6 2 7 1
#3 1 1 6 1
#4 2 1 4 1
#5 3 2 1 1
test[filter2]
# A B D
#1 1 1 1
#2 6 2 1
#3 1 1 1
#4 2 1 1
#5 3 2 1
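One caveat with test[filter]: it stops with an "undefined columns selected" error if filter contains a name that is not a column of test. A small sketch of a more defensive selection using a logical index, which also returns the full data frame when nothing needs to be dropped. keep_cols is a hypothetical helper name, and the extra name "Z" is made up for the example:

```r
test <- data.frame(A = c(1, 6, 1, 2, 3), B = c(1, 2, 1, 1, 2),
                   C = c(1, 7, 6, 4, 1), D = c(1, 1, 1, 1, 1))

# Keep only the columns whose names appear in `wanted`;
# drop = FALSE keeps the result a data frame even if one column remains
keep_cols <- function(df, wanted) {
  df[, names(df) %in% wanted, drop = FALSE]
}

all_cols  <- keep_cols(test, c("A", "B", "C", "D"))  # nothing to drop
some_cols <- keep_cols(test, c("A", "B", "D", "Z"))  # "Z" is silently ignored
```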

vectorise rows of a dataframe, apply vector function, return to original dataframe r

Given the following df:
a=c('a','b','c')
b=c(1,2,5)
c=c(2,3,4)
d=c(2,1,6)
df=data.frame(a,b,c,d)
a b c d
1 a 1 2 2
2 b 2 3 1
3 c 5 4 6
I'd like to apply a function that normally takes a vector (and returns a vector) like cummax row by row to the columns in position b to d.
Then, I'd like to have the output back in the df, either as a vector in a new column of the df, or replacing the original data.
I'd like to avoid writing it as a for loop that would iterate every row, pull out the content of the cells into a vector, do its thing and put it back.
Is there a more efficient way? I've given the apply family functions a go, but I'm struggling to first get a good way to vectorise content of columns by row and get the right output.
The final output could look something like this (imagining I've applied cummax()):
a b c d
1 a 1 2 2
2 b 2 3 3
3 c 5 5 6
or
a b c d output
1 a 1 2 2 (1,2,2)
2 b 2 3 1 (2,3,3)
3 c 5 4 6 (5,5,6)
where output is a vector.
Seems this would just be a simple apply problem that you want to cbind to df:
> cbind(df, apply(df[ , 4:2] # work with columns in reverse order
, 1, # do it row-by-row
cummax) )
a b c d 1 2 3
d a 1 2 2 2 1 6
c b 2 3 1 2 3 6
b c 5 4 6 2 3 6
Ouch. Bitten by failing to notice that apply returns the result as a column-oriented matrix, which needs to be transposed. Such a newbie mistake. But it does show the value of having a question with a reproducible dataset, I suppose.
> cbind(df, t(apply(df[ , 4:2] , 1, cummax) ) )
a b c d d c b
1 a 1 2 2 2 2 2
2 b 2 3 1 1 3 3
3 c 5 4 6 6 6 6
To destructively assign the result to df you would just use:
df <- # .... that code.
This does the concatenation with commas (and as a result no longer needs to be transposed):
> cbind(df, output=apply(df[ , 4:2] , 1, function(x) paste( cummax(x), collapse=",") ) )
a b c d output
1 a 1 2 2 2,2,2
2 b 2 3 1 1,3,3
3 c 5 4 6 6,6,6
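The examples above run cummax over the columns in reverse order (d, c, b). If the goal is the first layout in the question, with columns b to d replaced by their row-wise cumulative maximum in the original order, a sketch along the same lines could be:

```r
df <- data.frame(a = c("a", "b", "c"),
                 b = c(1, 2, 5),
                 c = c(2, 3, 4),
                 d = c(2, 1, 6))

# apply() returns one column per input row, so transpose before assigning back
df[2:4] <- t(apply(df[2:4], 1, cummax))
```

Columns b, c and d are overwritten in place here, while column a and the column names stay as they were.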

for loop & if function in R

I was writing a loop with an if statement in R. The table looks like this:
ID category
1 a
1 b
1 c
2 a
2 b
3 a
3 b
4 a
5 a
I want to use a for loop with an if statement to add another column that counts the rows within each ID group, like the Count column below:
ID category Count
1 a 1
1 b 2
1 c 3
2 a 1
2 b 2
3 a 1
3 b 2
4 a 1
5 a 1
My code is (output is the table name):
for (i in 2:nrow(output1)){
if(output1[i,1] == output[i-1,1]){
output1[i,"rn"]<- output1[i-1,"rn"]+1
}
else{
output1[i,"rn"]<-1
}
}
But in the result, all the values in the count column are 1:
ID category Count
1 a 1
1 b 1
1 c 1
2 a 1
2 b 1
3 a 1
3 b 1
4 a 1
5 a 1
Please help me out... Thanks
There are packages and vectorized ways to do this task, but if you are practicing with loops try:
output1$rn <- 1
for (i in 2:nrow(output1)){
if(output1[i,1] == output1[i-1,1]){
output1[i,"rn"]<- output1[i-1,"rn"]+1
}
else{
output1[i,"rn"]<-1
}
}
With your original code, the call output1[i-1,"rn"]+1 in the third line of your loop referenced a column, rn, that didn't exist on the first pass. By first creating the column and filling it with the value 1, you give the loop something explicit to refer to.
output1
# ID category rn
# 1 1 a 1
# 2 1 b 2
# 3 1 c 3
# 4 2 a 1
# 5 2 b 2
# 6 3 a 1
# 7 3 b 2
# 8 4 a 1
# 9 5 a 1
With the package dplyr you can accomplish it quickly with:
library(dplyr)
output1 %>% group_by(ID) %>% mutate(rn = 1:n())
Or with data.table:
library(data.table)
setDT(output1)[,rn := 1:.N, by=ID]
With base R you can also use:
output1$rn <- with(output1, ave(seq_along(ID), ID, FUN = seq_along))  # seq_along keeps rn an integer
There are vignettes and tutorials on the two packages mentioned; for the last approach, see ?ave in the R console.
A looping solution will be painfully slow for bigger data. Here is a one-line solution using data.table:
require(data.table)
a<-data.table(ID=c(1,1,1,2,2,3,3,4,5),category=c('a','b','c','a','b','a','b','a','a'))
a[,':='(category_count = 1:.N),by=.(ID)]
For this particular data, what you want coincides with the factor level code of category. Do this:
df$count=as.numeric(df$category)
This will give the output:
ID category count
1 1 a 1
2 1 b 2
3 1 c 3
4 2 a 1
5 2 b 2
6 3 a 1
7 3 b 2
8 4 a 1
9 5 a 1
provided your category is already a factor. If not, first convert it to a factor:
df$category=as.factor(df$category)
df$count=as.numeric(df$category)
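Note that factor codes match the wanted counter only when the categories happen to appear in the same alphabetical order within every ID, as they do in the question's data. A small sketch with made-up data where the two differ, next to a per-group counter built with ave() and seq_along(); the data frame d and its values are hypothetical:

```r
# Hypothetical data: ID 2 starts at category "b", not "a"
d <- data.frame(ID = c(1, 1, 2, 2),
                category = c("a", "b", "b", "c"))

# Factor codes depend only on the category label: 1 2 2 3
level_code <- as.numeric(factor(d$category))

# Running counter that restarts for each ID: 1 2 1 2
group_count <- ave(seq_along(d$ID), d$ID, FUN = seq_along)
```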

R counting strings variables in each row of a dataframe

I have a dataframe that looks something like this, where each row represents a sample and contains repeats of the same strings:
> df
V1 V2 V3 V4 V5
1 a a d d b
2 c a b d a
3 d b a a b
4 d d a b c
5 c a d c c
I want to be able to create a new dataframe, where ideally the headers would be the string variables in the previous dataframe (a, b, c, d) and the contents of each row would be the number of occurrences of each respective variable from the original dataframe. Using the example from above, this would look like
> df2
a b c d
1 2 1 0 2
2 2 1 1 1
3 2 2 0 1
4 1 1 1 2
5 1 0 3 1
In my actual dataset, there are hundreds of variables, and thousands of samples, so it'd be ideal if I could automatically pull out the names from the original dataframe, and alphabetize them into the headers for the new dataframe.
You may try
library(qdapTools)
mtabulate(as.data.frame(t(df)))
Or
mtabulate(split(as.matrix(df), row(df)))
Or using base R
Un1 <- sort(unique(unlist(df)))
t(apply(df ,1, function(x) table(factor(x, levels=Un1))))
You can stack the columns and then use table:
table(cbind(id = 1:nrow(mydf),
stack(lapply(mydf, as.character)))[c("id", "values")])
# values
# id a b c d
# 1 2 1 0 2
# 2 2 1 1 1
# 3 2 2 0 1
# 4 1 1 1 2
# 5 1 0 3 1
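Another base R possibility, sketched here as an alternative and not taken from the answers above, is to cross-tabulate the row index against the cell values directly. The data frame below reproduces the example from the question; m, tab and df2 are illustrative names:

```r
df <- data.frame(V1 = c("a", "c", "d", "d", "c"),
                 V2 = c("a", "a", "b", "d", "a"),
                 V3 = c("d", "b", "a", "a", "d"),
                 V4 = c("d", "d", "a", "b", "c"),
                 V5 = c("b", "a", "b", "c", "c"),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)

# Two-way table: one row per sample, one column per distinct string (sorted)
tab <- table(row = as.vector(row(m)), value = as.vector(m))

# Convert the table to a wide data frame with columns a, b, c, d
df2 <- as.data.frame.matrix(tab)
```

The column headers come from the distinct values found in the data, in sorted order, so nothing has to be listed by hand.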
