Computing the difference between rows in a data frame in R

I have a data frame. I would like to compute how "far" each row is from a given row. Let us consider it for the 1st row. Let the data frame be as follows:
> sampleDF
X1 X2 X3
1 5 5
4 2 2
2 9 1
7 7 3
What I wish to do is the following:
Compute the difference between the 1st row & others: sampleDF[1,]-sampleDF[2,]
Consider only the absolute value: abs(sampleDF[1,]-sampleDF[2,])
Compute the sum of the newly formed data frame of differences: rowSums(newDF)
Now, to do this for the whole data frame:
newDF <- sapply(2:4,function(x) { return (abs(sampleDF[1,]-sampleDF[x,]));})
This creates a problem in that the result is a transposed list. Hence,
newDF <- as.data.frame(t(sapply(2:4,function(x) { return (abs(sampleDF[1,]-sampleDF[x,]));})))
But another problem arises while computing rowSums:
> class(newDF)
[1] "data.frame"
> rowSums(newDF)
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be numeric
> newDF
X1 X2 X3
1 3 3 3
2 1 4 4
3 6 2 2
>
Puzzle 1: Why do I get this error? I did notice that newDF[1,1] is a list & not a number. Is it because of that? How can I ensure that the result of the sapply & transpose is a simple data frame of numbers?
So I proceed to create a global data frame & modify it within the function:
sapply(2:4,function(x) { newDF <<- as.data.frame(rbind(newDF,abs(sampleDF[1,]-sampleDF[x,])));})
> newDF
X1 X2 X3
2 3 3 3
3 1 4 4
4 6 2 2
> rowSums(newDF)
2 3 4
9 9 10
>
This is as expected.
Puzzle 2: Is there a cleaner way to achieve this? How can I do this for every row in the data frame (shown above is only the "distance" from row 1; I would need to do this for the other rows as well)? Is running a loop the only option?

To put it in words, you are trying to compute the Manhattan distance:
dist(sampleDF, method = "manhattan")
# 1 2 3
# 2 9
# 3 9 10
# 4 10 9 9
Regarding your implementation, I think the problem is that your inner function is returning a data.frame when it should return a numeric vector. Doing return(unlist(abs(sampleDF[1,]-sampleDF[x,]))) should fix it.
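For completeness, here is a minimal sketch of the fixed approach on the sampleDF above (the data-frame construction is mine, reproducing the question's data):
sampleDF <- data.frame(X1 = c(1, 4, 2, 7),
                       X2 = c(5, 2, 9, 7),
                       X3 = c(5, 2, 1, 3))
# unlist() makes each iteration return a numeric vector, so t(sapply(...))
# yields a numeric matrix that rowSums() accepts
newDF <- t(sapply(2:4, function(x) unlist(abs(sampleDF[1, ] - sampleDF[x, ]))))
rowSums(newDF)
# [1]  9  9 10
# And for Puzzle 2, dist() already handles every row at once:
as.matrix(dist(sampleDF, method = "manhattan"))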

Related

Compute column means after assigning one data frame to a field of another data frame in R

I have one data frame for example:
> df=data.frame(a=1:4,b=2:5)
> df
a b
1 1 2
2 2 3
3 3 4
4 4 5
Then I create another data frame and assign the data frame above to a field of the other one:
> df2=data.frame(c=3:6)
> df2$df1=df
> df2
c df1.a df1.b
1 3 1 2
2 4 2 3
3 5 3 4
4 6 4 5
When I compute the column means of the data frame, I get the error:
> colMeans(df2)
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L) X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
Could anyone help to solve this problem?
Check ncol(df2) to see that there are only 2 "columns". The colMeans function cannot take the mean of the second element of the df2 list because it isn't a single column but two. Instead of df2$df1 = df, you can do df2 <- cbind(df2, df). If you want the column names to be the same as in your example, you can do
sapply(1:ncol(df), function(i) df2[,paste0('df1','.',names(df)[i])] <<- df[,i])
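Alternatively, a short sketch of the cbind route that builds the prefixed names with setNames in one step (same "df1." prefix as above):
df <- data.frame(a = 1:4, b = 2:5)
df2 <- data.frame(c = 3:6)
# cbind() flattens df into real columns of df2; setNames() adds the prefix
df2 <- cbind(df2, setNames(df, paste0("df1.", names(df))))
colMeans(df2)
#     c df1.a df1.b
#   4.5   2.5   3.5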

Remove duplicate rows based only on the previous row

I'm trying to remove duplicate rows from a data frame, based only on the previous row. The duplicated and unique functions will remove all duplicates, leaving you only with unique rows, which is not what I want.
I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much too large to use a loop on.
x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)
xy
x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1 #this should be removed
5 3 3 3
6 3 3 2
7 3 3 2 #this should be removed
8 4 4 4
# loop that produces desired output
toRemove <- NULL
for (i in 2:nrow(xy)) {
  test <- as.vector(xy[i, ] == xy[i - 1, ])
  if (!(FALSE %in% test)) {
    toRemove <- c(toRemove, i)  # build a vector of rows to remove
  }
}
xy[-toRemove, ]  # exclude rows
x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4
I've tried using dplyr's lag function, but it only works on single columns; when I try to run it over all 3 columns it doesn't work.
ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])
Any advice on how to accomplish this?
Looks like we want to remove a row whenever it is the same as the row above:
# index: always keep the first row, then keep any row that differs from the previous one
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))
# filter
xy[ix, ]
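Checking against the example data (rows 4 and 7 are the ones dropped):
#   x y z
# 1 1 1 1
# 2 1 1 2
# 3 1 1 1
# 5 3 3 3
# 6 3 3 2
# 8 4 4 4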
Why don't you just iterate over the rows, keeping track of the previous row to compare against the current one?
When two consecutive rows match, remember that row's position and remove all the remembered rows once you are done iterating.
Don't delete rows while iterating, because the shifting indices will invalidate the positions you still have to visit.
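A minimal sketch of that advice in R (assuming the xy data frame from the question): flag duplicates of the immediately preceding row in one pass, then drop them all at the end.
dup <- logical(nrow(xy))  # FALSE for every row, including the first
for (i in 2:nrow(xy)) {
  dup[i] <- all(xy[i, ] == xy[i - 1, ])  # TRUE when the row repeats its predecessor
}
xy[!dup, ]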

R - efficient comparison of subsets of rows between data frames

Thank you for any help.
I need to check the total number of matches between the elements of each row of a data frame (df1) and the rows of another data frame (df2).
The data frames have different numbers of columns (5 in the first one versus 6 in the second one, for instance), and there is no exact formation rule for the rows (so I cannot find a way of doing this through combinatorial analysis).
This routine must check all the rows of the first data frame against all the rows of the second data frame, producing a count of occurrences for each number of hits.
Not all the possible sums are of interest. Actually, I am looking for a specific total (which I call "hits" in this text).
In other words: how many times can a subset of size "hits" of each row of df2 be found in the rows of df1?
Here is an example:
> ### Example
> ### df1 and df2 here are regularly formed just for illustration purposes
>
> require(combinat)
>
> df1 <- as.data.frame(t(combn(6,5)))
> df2 <- as.data.frame(t(combn(7,6)))
>
> df1
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 1 2 3 4 6
3 1 2 3 5 6
4 1 2 4 5 6
5 1 3 4 5 6
6 2 3 4 5 6
>
> df2
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 1 2 3 4 5 7
3 1 2 3 4 6 7
4 1 2 3 5 6 7
5 1 2 4 5 6 7
6 1 3 4 5 6 7
7 2 3 4 5 6 7
>
In this example, please note, for instance, that subsets of size 5 from row #1 of df2 can be found 6 times in the rows of df1. And so on.
I tried something like this:
> ### Check how many times subsets of size "hits" from rows from df2 are found in rows of df1
>
> myfn <- function(dfa,dfb,hits) {
+ sapply(c(1:dim(dfb)[1]),function(y) { sum(c(apply(dfa,1,function(x,i) { sum(x %in% dfb[i,]) },i=y))==hits) })
+ }
>
> r1 <- myfn(df1,df2,5)
>
> cbind(df2,"hits.eq.5" = r1)
V1 V2 V3 V4 V5 V6 hits.eq.5
1 1 2 3 4 5 6 6
2 1 2 3 4 5 7 1
3 1 2 3 4 6 7 1
4 1 2 3 5 6 7 1
5 1 2 4 5 6 7 1
6 1 3 4 5 6 7 1
7 2 3 4 5 6 7 1
This seems to do what I need, but it is too slow! I need to use this routine on large data frames (about 200 K rows).
I am currently using R 3.1.2 GUI 1.65 Mavericks build (6833)
Can anyone provide a faster or more clever way of doing this? Thank you again.
Best regards,
Vaccaro
Using apply(...) on data frames is very inefficient. This is because apply(...) takes a matrix as argument, so if you pass a data frame it will coerce that to a matrix. In your example you convert df1 to a matrix every time you call apply(...), which is nrow(df2) times.
Also, by using sapply(1:nrow(df2),...) and dfb[i,] you are using data frame row indexing, which is also very inefficient. You are much better off converting everything to matrix class at the beginning and then using apply(...) twice.
Finally, there is no reason to use a call to c(...). apply(...) already returns a vector (in this case), so you are just incurring the overhead of another function call to no effect.
Doing these things alone speeds up your code by about a factor of 20.
set.seed(1)
nrows <- 100
df1 <- data.frame(matrix(sample(1:5,5*nrows,replace=TRUE),nc=5))
df2 <- data.frame(matrix(sample(1:6,6*nrows,replace=TRUE),nc=6))
myfn <- function(dfa, dfb, hits) {
  sapply(c(1:dim(dfb)[1]), function(y) {
    sum(c(apply(dfa, 1, function(x, i) { sum(x %in% dfb[i, ]) }, i = y)) == hits)
  })
}
myfn.2 <- function(dfa, dfb, hits) {
  ma <- as.matrix(dfa)
  mb <- as.matrix(dfb)
  apply(mb, 1, function(y) { sum(apply(ma, 1, function(x) { sum(x %in% y) }) == hits) })
}
system.time(r1<-myfn(df1,df2,3))
# user system elapsed
# 1.99 0.00 2.00
system.time(r2<-myfn.2(df1,df2,3))
# user system elapsed
# 0.09 0.00 0.10
identical(r1,r2)
# [1] TRUE
There is another approach which takes advantage of the fact that R is extremely efficient at manipulating lists. Since a data frame is just a list of vectors, we can improve performance by putting your rows into data frame columns and then using sapply(...) on that. This is faster than myfn.2(...) above, but only by about 20%.
myfn.3 <- function(dfa, dfb, hits) {
  df1.t <- data.frame(t(dfa))  # rows into columns
  df2.t <- data.frame(t(dfb))
  sapply(df2.t, function(col2) sum(sapply(df1.t, function(col1) sum(col1 %in% col2) == hits)))
}
library(microbenchmark)
microbenchmark(myfn.2(df1,df2,5),myfn.3(df1,df2,5),times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# myfn.2(df1, df2, 5) 92.84713 94.06418 96.41835 98.44738 99.88179 10
# myfn.3(df1, df2, 5) 75.53468 77.44348 79.24123 82.28033 84.12457 10
If you really have a dataset with 55MM rows, then I think you need to rethink this problem. I have no idea what you are trying to accomplish, but this seems like a brute force approach.

Variable Length Core Name Identification

I have a data set with the following row-naming scheme:
a.X.V
where:
a is a fixed-length core ID
X is a variable-length string that subsets a, which means I should keep X
V is a variable-length ID which specifies the individual elements of a.X to be averaged
. is one of {-,_}
What I am trying to do is take column averages of all the a.X's. A sample:
sampleList <- list("a.12.1"=c(1,2,3,4,5), "b.1.23"=c(3,4,1,4,5), "a.12.21"=c(5,7,2,8,9), "b.1.555"=c(6,8,9,0,6))
sampleList
$a.12.1
[1] 1 2 3 4 5
$b.1.23
[1] 3 4 1 4 5
$a.12.21
[1] 5 7 2 8 9
$b.1.555
[1] 6 8 9 0 6
Currently I am manually gsubbing out the .Vs to get a list of the general a.X names:
sampleList <- t(as.data.frame(sampleList))
y <- rownames(sampleList)
y <- gsub("(\\w\\.\\d+)\\.\\d+", "\\1", y)
Is there a faster way to do this?
This is one half of 2 issues I've encountered in a workflow. The other half was answered here.
You can use a vector of patterns to find the locations of the columns you want to group. I included a pattern I knew wouldn't match anything in order to show that the solution is robust to that situation.
# A *named* vector of patterns you want to group by
patterns <- c(a.12="^a.12",b.12="^b.12",c.12="^c.12")
# Find the locations of those patterns in your list
inds <- lapply(patterns, grep, x=names(sampleList))
# Calculate the mean of each list element that matches the pattern
out <- lapply(inds, function(i)
  if (l <- length(i)) Reduce("+", sampleList[i]) / l else NULL)
# Set the names of the output
names(out) <- names(patterns)
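With the sampleList from the question, the a.12 group averages its two members element-wise, and any pattern without a match comes back as NULL:
out$a.12
# [1] 3.0 4.5 2.5 6.0 7.0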
Perhaps you could consider messing with your data structure to make it easier to apply some standard tools:
sampleList <- list("a.12.1"=c(1,2,3,4,5),
"b.1.23"=c(3,4,1,4,5), "a.12.21"=c(5,7,2,8,9),
"b.1.555"=c(6,8,9,0,6))
library(reshape2)
m1 <- melt(do.call(cbind,sampleList))
m2 <- cbind(m1,colsplit(m1$Var2,"\\.",c("coreID","val1","val2")))
The result looks like this:
head(m2)
Var1 Var2 value coreID val1 val2
1 1 a.12.1 1 a 12 1
2 2 a.12.1 2 a 12 1
3 3 a.12.1 3 a 12 1
Then you can more easily do something like this:
aggregate(value~val1,mean,data=subset(m2,coreID=="a"))
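For the sample data this averages the ten values in the two a.12 columns, giving something like:
#   val1 value
# 1   12   4.6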
R is poised to do this stuff if you would just move to data.frames instead of lists. Make your 'a', 'X', and 'V' into their own columns. Then you can use ave, by, aggregate, subset, etc.
data.frame(do.call(rbind, sampleList),
do.call(rbind, strsplit(names(sampleList), '\\.')))
# X1 X2 X3 X4 X5 X1.1 X2.1 X3.1
# a.12.1 1 2 3 4 5 a 12 1
# b.1.23 3 4 1 4 5 b 1 23
# a.12.21 5 7 2 8 9 a 12 21
# b.1.555 6 8 9 0 6 b 1 555
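From there, standard tools apply directly. For example (a sketch; dat is my name for the combined data frame above), averaging the five value columns within each a.X group:
dat <- data.frame(do.call(rbind, sampleList),
                  do.call(rbind, strsplit(names(sampleList), '\\.')))
# group by the 'a' and 'X' columns (X1.1 and X2.1), averaging each value column
aggregate(dat[1:5], by = list(a = dat$X1.1, X = dat$X2.1), FUN = mean)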

Pull coefficients from a data frame based on information in another data frame

Right now I have two data frames in R. One contains some data that looks like this:
> data
p a i
1 1 1 2.2561469
2 5 2 0.2316390
3 2 3 0.4867456
4 3 1 0.1511705
5 4 2 0.8838884
And one that contains coefficients, which looks like this:
> coef
3 2 1
1 29420.50 31029.75 29941.96
2 26915.00 27881.00 27050.00
3 27756.00 28904.00 28699.40
4 28345.33 29802.33 28377.56
5 28217.00 29409.00 28738.67
These data frames are connected as each value in data$a corresponds to a column name in coef and data$p corresponds to row names in coef.
I need to multiply these coefficients by the values in data$i, matching the row and column names of coef to data$p and data$a.
In other words, for each row in data I need to use data$p and data$a to pull a specific number from coef, which is then multiplied by that row's data$i to create a new column in data that looks something like this:
> data
p a i z
1 1 1 2.2561469 67553
2 5 2 0.2316390 6812
3 2 3 0.4867456 .
4 3 1 0.1511705 .
5 4 2 0.8838884 .
I was thinking I should create factors in my coef data frame based on the row and column names but am unsure of where to go from there.
Thanks in advance,
Ian
If you order your coef data.frame, you can just index it as though the column names weren't there.
coef <- coef[,order(names(coef))]
Then apply a function to each row:
myfun <- function(x) {
  x[3] * coef[x[1], x[2]]
}
data$z <- apply(data, 1, myfun)
> data
p a i z
1 1 1 2.2561469 67553.460
2 5 2 0.2316390 6812.271
3 2 3 0.4867456 13100.758
4 3 1 0.1511705 4338.503
5 4 2 0.8838884 26341.934
>
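A vectorized alternative (a sketch, reusing the reordered coef from above): index coef as a matrix with (row, column) pairs, which avoids the row-by-row apply entirely.
m <- as.matrix(coef)                         # columns now ordered "1", "2", "3"
data$z <- data$i * m[cbind(data$p, data$a)]  # pick coef[p, a] for every row at once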
