I would like to apply a function to an R data table object that compares values in two columns and returns a result. Here's the example, for data table X:
library(data.table)
X <- as.data.table(list(POSITION=c(1,4,5,9,24,36,42,56),
                        FIRST=c("A","BB","AA","B","AAA","B","A,B","A"),
                        SECOND=c("B","AA","B","AAA","BBB","AB,ABB","B,A","B")))
   POSITION FIRST SECOND
1:        1     A      B
2:        4    BB     AA
3:        5    AA      B
4:        9     B    AAA
5:       24   AAA    BBB
6:       36     B AB,ABB
7:       42   A,B    B,A
8:       56     A      B
I would like to perform the following logical comparisons of the data in columns "FIRST" and "SECOND", to create a "RESULT" column:
SAME = length of FIRST and SECOND are both one character
BLOCK = Character length of FIRST and SECOND are the same,
but greater than one, and not mixed (i.e. no comma)
LESS = SECOND has fewer characters, but neither is mixed
MORE = SECOND has more characters, but neither is mixed
MIXED = either FIRST or SECOND contains a comma
Thus, the desired result would look like:
POSITION FIRST SECOND RESULTS
       1     A      B    SAME
       4    BB     AA   BLOCK
       5     A    B,A   MIXED
       9    AA      B    LESS
      24     B    AAA    MORE
      28   BBB    A,B   MIXED
      36   AAA    BBB   BLOCK
      42     B AB,ABB   MIXED
      56   A,B    B,A   MIXED
So the following works, but is slow over a file with 4 million rows!
X[, RESULT := ifelse(nchar(FIRST)+nchar(SECOND)==2,"SAME",
ifelse(grepl(",", FIRST) | grepl(",",SECOND), "MIXED",
ifelse(nchar(FIRST) > nchar(SECOND), "LESS",
ifelse(nchar(FIRST) < nchar(SECOND), "MORE","BLOCK")))]
But it does give the desired result:
   POSITION FIRST SECOND RESULT
1:        1     A      B   SAME
2:        4    BB     AA  BLOCK
3:        5    AA      B   LESS
4:        9     B    AAA   MORE
5:       24   AAA    BBB  BLOCK
6:       36     B AB,ABB  MIXED
7:       42   A,B    B,A  MIXED
8:       56     A      B   SAME
I actually have several more conditions to test, and some of them get more complicated than just character counts. Rather than a long ifelse statement, is it possible to apply a function, taking the two columns as input? For example:
checkType <- function(x) {
  if(nchar(x$FIRST)+nchar(x$SECOND)==2) {
    type <- "SAME"
  } else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) > nchar(x$SECOND))) {
    type <- "LESS"
  } else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) < nchar(x$SECOND))) {
    type <- "MORE"
  } else if (!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) == nchar(x$SECOND)) & nchar(x$SECOND)>1) {
    type <- "BLOCK"
  } else {
    type <- "MIXED"
  }
  return(type)
}
> checkType(X[1,])
[1] "SAME"
for(i in 1:nrow(X)) X[i, RESULT := checkType(X[i,])]
So while the above works, it's obviously not the optimal way to run things with data.table. However, I tried lapply and apply, but neither works:
X[, RESULT3 := lapply(.SD, checkType)]
Error in x$FIRST : $ operator is invalid for atomic vectors
nchar(x$FIRST)
FUN(X[[1L]], ...)
lapply(.SD, checkType)
eval(expr, envir, enclos)
eval(jsub, SDenv, parent.frame())
`[.data.table`(X, , `:=`(RESULT3, lapply(.SD, checkType)))
X[, `:=`(RESULT3, lapply(.SD, checkType))]
Same result with apply(.SD, 1, checkType). Is what I am trying to do possible by applying a function?
Note that the data table produced by your code (first line below, pasted from your snippet above), is not the same as the data table shown in the "desired results" box below it.
Nevertheless, this might actually be faster, and would definitely be easier to understand. It produces a result which I think is consistent with your rules.
X <- as.data.table(list(POSITION=c(1,4,5,9,24,36,42,56),
                        FIRST=c("A","BB","AA","B","AAA","B","A,B","A"),
                        SECOND=c("B","AA","B","AAA","BBB","AB,ABB","B,A","B")))
X$mixed <- grepl(',',X$FIRST) | grepl(',',X$SECOND)
X$nf <- nchar(X$FIRST)
X$ns <- nchar(X$SECOND)
X$RESULT = ""
setkey(X,nf,ns)
X[J(1,1),RESULT:="SAME"]
X[!mixed & nf==ns & nf>1 & ns>1]$RESULT <- "BLOCK"
X[!mixed & nf > ns]$RESULT <- "LESS"
X[!mixed & nf < ns]$RESULT <- "MORE"
X[(mixed)]$RESULT <- "MIXED"
setkey(X,POSITION)
Your categories are not mutually exclusive, so I assume these rules apply in order (for example, what about FIRST="," and SECOND=","?).
Also, I think your definitions of MORE and LESS are the same.
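As a rough sketch of how the ordered rules could be wrapped in a single function that takes the two columns as input: the helper name classifyPair is hypothetical (it does not appear in the posts above), and the rule order (MIXED applied last so it overrides the length-based labels) is an assumption based on the expected output shown in the question. It uses only base R, so it works on plain character vectors.

```r
# Hypothetical helper, not from the original posts: classifies each pair of
# strings with vectorised operations, so it is called once per column
# rather than once per row.
classifyPair <- function(first, second) {
  nf <- nchar(first)
  ns <- nchar(second)
  mixed <- grepl(",", first, fixed = TRUE) | grepl(",", second, fixed = TRUE)

  result <- rep("BLOCK", length(first))  # default: equal lengths, more than one character
  result[nf > ns] <- "LESS"
  result[nf < ns] <- "MORE"
  result[nf == 1 & ns == 1] <- "SAME"
  result[mixed] <- "MIXED"               # assigned last so it overrides the others
  result
}

first  <- c("A", "BB", "AA", "B", "AAA", "B", "A,B", "A")
second <- c("B", "AA", "B", "AAA", "BBB", "AB,ABB", "B,A", "B")
res <- classifyPair(first, second)
print(res)
# "SAME" "BLOCK" "LESS" "MORE" "BLOCK" "MIXED" "MIXED" "SAME"
```

With a data.table X, the same function could presumably be used as X[, RESULT := classifyPair(FIRST, SECOND)], since both columns are passed as whole vectors.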
So both the answers from @Frank and @jlhoward give the desired result, and were much quicker than my initial attempt. Building on these answers, however, this approach (createResult1) was about 4 times faster over a file with 1,000,000 rows:
createResult1 <- function(X) {
X[,`:=`(
cf=nchar(FIRST),
cs=nchar(SECOND),
mf=grepl(',',FIRST),
ms=grepl(',',SECOND)
)]
X[cf==1&cs==1, RESULT:="SAME"]
X[cf > cs, RESULT:="LESS"]
X[cf < cs, RESULT:="MORE"]
X[cf==cs & cs>1, RESULT:="BLOCK"]
X[(mf)|(ms), RESULT:="MIXED"]
X[,c('cf','cs','mf','ms'):=NULL]
return(X)
}
createResult2 <- function(X) { ##Frank
X[,`:=`(
cf=nchar(FIRST),
cs=nchar(SECOND),
mf=grepl(',',FIRST),
ms=grepl(',',SECOND)
)][,RESULT:=ifelse(cf==1&cs==1,"SAME",
ifelse(mf | ms, "MIXED",
ifelse(cf > cs, "LESS",
ifelse(cf < cs, "MORE","BLOCK"))))
][
,c('cf','cs','mf','ms'):=NULL
]
return(X)
}
createResult3 <- function(X) { ##jlhoward
X$mixed <- grepl(',',X$FIRST) | grepl(',',X$SECOND)
X$nf <- nchar(X$FIRST)
X$ns <- nchar(X$SECOND)
X$RESULT = ""
setkey(X,nf,ns)
X[J(1,1),RESULT:="SAME"]
X[!mixed & nf==ns & nf>1 & ns>1]$RESULT <- "BLOCK"
X[!mixed & nf > ns]$RESULT <- "LESS"
X[!mixed & nf < ns]$RESULT <- "MORE"
X[(mixed)]$RESULT <- "MIXED"
X[,c('nf','ns','mixed'):=NULL]
setkey(X,POSITION)
return(X)
}
Create the same data table as above, but with the rows repeated 1,000,000 times:
X <- as.data.table(list(POSITION=rep(c(1,4,5,9,24,36,42,56),1000000),
                        FIRST=rep(c("A","BB","AA","B","AAA","B","A,B","A"),1000000),
                        SECOND=rep(c("B","AA","B","AAA","BBB","AB,ABB","B,A","B"),1000000)))
Y <- copy(X)
Z <- copy(X)
Here are the results:
> system.time(X <- createResult1(X))
user system elapsed
4.06 0.05 4.12
> system.time(Y <- createResult2(Y))
user system elapsed
18.53 0.36 18.94
> system.time(Z <- createResult2(Z))
user system elapsed
18.63 0.29 18.97
> identical(X,Y)
[1] TRUE
> identical(X,Z)
[1] TRUE
How should I modify my code to update variables within a loop?
Specifically, I want to do something like the following:
myMatrix1 <- read.table(someFile)
myMatrix2 <- read.table(someFile2)
for (i in nrow(myMatrix2))
{
myMatrix3 <- myMatrix1[which(doSomeTest),]
myMatrix4 <- rep(myMatrix2$header1,nrow(myMatrix1))
myMatrix5 <- rep(myMatrix2$header2, nrow(myMatrix1))
myMatrix6 <- cbind(myMatrix3, myMatrix4, myMatrix5)
# *see question
}
How can I get myMatrix6 to be appended to, rather than overwritten with the result of cbind(myMatrix3, myMatrix4, myMatrix5)? In other words, if the first iteration (i = 1) gave a myMatrix6 of:
> 1 1 1 1
> 2 2 2 2
and the second iteration (i = 2) gave a myMatrix6 of:
> 3 3 3 3
> 4 4 4 4
how do I get a dataframe(?) of:
> 1 1 1 1
> 2 2 2 2
> 3 3 3 3
> 4 4 4 4
UPDATE:
Thanks to DWin's and Timo's suggestions, I now have the following. However, this code took about 2 hours to run on my datasets. Is there any way to make it run faster (without using a more powerful computer, I may add)?
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
  # rows of myMatrix1 whose begin value lies strictly between
  # myMatrix2$begin[i] and myMatrix2$finish[i]
  myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
  # [i] picks the sedimentation value of the current row of myMatrix2
  myMatrix4 <- rep(myMatrix2$sedimentation[i], nrow(myMatrix3))
  if (is.na(myMatrix2$boreWidth[i])) {
    myMatrix5 <- rep(NA, nrow(myMatrix3))
  }
  else if (myMatrix2$boreWidth[i] == 0) {
    myMatrix5 <- rep(TRUE, nrow(myMatrix3))
  }
  else if (myMatrix2$boreWidth[i] > 0) {
    myMatrix5 <- rep(FALSE, nrow(myMatrix3))
  }
  myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
  myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}
You could instead initialize myMatrix6 to an empty data.frame and rbind the results in each iteration (which may be inefficient). If efficiency is a concern, pre-allocate a data.frame of the final size and fill in its rows by indexing.
# Method # 1 code
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
for (i in seq_len(nrow(myMatrix2))) {
myMatrix3 <- myMatrix1[which(doSomeTest),]
myMatrix4 <- rep(myMatrix2$header1,nrow(myMatrix1))
myMatrix5 <- rep(myMatrix2$header2, nrow(myMatrix1))
myMatrix6 <- rbind( myMatrix6, cbind(myMatrix3, myMatrix4, myMatrix5) )
}
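The pre-allocation alternative could look roughly like the sketch below. Everything in it is made up for illustration: makeChunk is a hypothetical stand-in for whatever one iteration produces, and the number of rows per iteration is assumed to be known in advance. When that size is not known, collecting the pieces in a list and calling rbind once at the end is a common fallback, shown in the last lines.

```r
# Hypothetical stand-in for the result of one loop iteration
makeChunk <- function(i) data.frame(id = rep(i, 2), value = c(i * 10, i * 10 + 1))

nIter <- 3
rowsPerIter <- 2   # assumption: every iteration yields the same number of rows

# Pre-allocate the full data frame, then fill blocks of rows by index
out <- data.frame(id = rep(NA_real_, nIter * rowsPerIter),
                  value = rep(NA_real_, nIter * rowsPerIter))
for (i in seq_len(nIter)) {
  rows <- ((i - 1) * rowsPerIter + 1):(i * rowsPerIter)
  out[rows, ] <- makeChunk(i)
}
print(out)

# Fallback when sizes are unknown: gather pieces in a list, bind once
combined <- do.call(rbind, lapply(seq_len(nIter), makeChunk))
```

Both variants avoid copying the growing result on every iteration, which is what makes repeated rbind inside a loop slow.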
In your code, you are not dealing with matrices (in the R sense) but with data frames, since read.table returns a data frame.
Either way, you can append one matrix or data frame to another (assuming the column names match) with the rbind command.
For example, if
> a = data.frame(x=c(1,2,3),y=c(4,5,6),z=c(7,8,9))
> b = data.frame(x=c(4,5),y=c(5,6),z=c(6,7))
then
> rbind(a,b)
x y z
1 1 4 7
2 2 5 8
3 3 6 9
4 4 5 6
5 5 6 7
There are other gotchas in the code you provide. For example
for (i in length(someVector))
should be
for (i in 1:length(someVector))
R has many functions for iterating over data.frames, vectors etc and can do all kinds of data transformations. Most of the time one does not need to write a for loop.
If you provide more details about what you are trying to do, maybe we can find a simpler solution.
EDIT:
It seems from your post update that you are trying to do some sort of conversion between 'wide' and 'long' format and filter out some lines that fail a test. Correct me if I am wrong.
Anyway, if that is the case, you should check out the reshape command. There is also a reshape package containing the extremely useful commands melt and cast, which can do that kind of transformation quite efficiently, and the merge command for certain "join" operations on data frames. I'm quite sure your problem could be solved with a combination of these commands, but it depends on the exact details.
For filtering rows or columns by some criteria, check out the subset command.
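As a small illustration of merge and subset only (the data frames samples and layers and their columns are invented here and are not the asker's data), a join followed by a row filter might look like this:

```r
# Made-up data frames, for illustration only
samples <- data.frame(id = c(1, 2, 3, 4), depth = c(5, 12, 20, 33))
layers  <- data.frame(id = c(1, 2, 3),
                      sedimentation = c("sand", "clay", "silt"))

# Inner join on the shared id column: id 4 has no partner and is dropped
joined <- merge(samples, layers, by = "id")

# Keep only the rows that pass a test
deep <- subset(joined, depth > 10)
print(deep)
```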
I am trying to write a loop that would search for the right date in the data.frame B (date_B[j]) and would copy the related value X_B[j] into the X_A[i] variable related to the same date date_A[i].
The challenge is that (a) the target data.frame A contains some dates several times, but (b) it does not necessarily contain every date that data.frame B has. B does include all the needed dates, though. Consequently, the data frames are of different lengths.
Questions:
Why does the following loop not work, even though it returns no error messages?
How to fix it?
Are there any other ways to solve this problem in R?
The data frames are:
A =
date_A X_A
1 2010-01-01
2 2010-01-02
3 2010-01-03
4 2010-01-02
5 2010-02-03
6 2010-01-01
7 2010-01-02
.
.
.
20000
B=
date_B X_B
1 2010-01-01 7.9
2 2010-01-02 8.5
3 2010-01-03 2.1
.
.
400
My goal is:
A=
date_A X_A
1 2010-01-01 7.9
2 2010-01-02 8.5
3 2010-01-03 2.1
4 2010-01-02 8.5
5 2010-02-03 2.1
6 2010-01-01 7.9
7 2010-01-02 8.5
I wrote the following loop, but for some reason it does not find its way past the first row. In other words, it does not change the values of the other X_A cells, although the loop keeps running endlessly.
i=1; j=1;
while (i <= nrow(A))
while (j <= nrow(B)) {
if (A$date_A[i]==B$date_B[j]) A$X_A[i] <- B$X_B[j];
j <- j+1; if (j == nrow(B))
i <- i+1;
j=1
}
Thanks for your help.
With this sort of problem, merge makes it much easier. With your example I do not get a match in the seventh row of the result, but perhaps you had a typo. My A data frame only had the date_A column. If you want to rename the X_B column, then names()<- will do it easily.
merge(A, B, by.x=1, by.y=1, all.x=TRUE)
#---result---
date_A X_B
1 2010-01-01 7.9
2 2010-01-01 7.9
3 2010-01-02 8.5
4 2010-01-02 8.5
5 2010-01-02 8.5
6 2010-01-03 2.1
7 2010-02-03 NA
With this data:
A <- data.frame( date_A = c('2010-01-01', '2010-01-02', '2010-01-03', '2010-01-02',
'2010-02-03', '2010-01-01', '2010-01-02') )
B <- data.frame(
date_B = c('2010-01-01','2010-01-02','2010-01-03'),
X_B = c(7.9,8.5,2.1))
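A minimal sketch of the merge followed by the rename, using the same data as above. The column names are given explicitly instead of by position, and stringsAsFactors = FALSE is an addition here so the behaviour does not depend on the R version.

```r
A <- data.frame(date_A = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-02",
                           "2010-02-03", "2010-01-01", "2010-01-02"),
                stringsAsFactors = FALSE)
B <- data.frame(date_B = c("2010-01-01", "2010-01-02", "2010-01-03"),
                X_B = c(7.9, 8.5, 2.1),
                stringsAsFactors = FALSE)

# Left join: keep every row of A, NA where B has no matching date
merged <- merge(A, B, by.x = "date_A", by.y = "date_B", all.x = TRUE)

# Rename the value column with names()<-
names(merged)[names(merged) == "X_B"] <- "X_A"
print(merged)
```

Note that merge sorts the result by the key column by default, so the rows no longer appear in the original order of A.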
You can use match() to index the X_B values in the right order:
A$X_A <- B$X_B[match(A$date_A,B$date_B)]
match() returns, for each element of A$date_A, the position of its first match in B$date_B (or NA if there is none). Another trick is to use the factor levels as the index:
A$X_A <- B$X_B[A$date_A]
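This works because a factor stores its values as integer codes (1, 2, 3, ...) that index its sorted levels, so if B is sorted according to these levels, the codes select the correct rows. It only applies if date_A is a factor (the default for data.frame() before R 4.0.0) whose levels are exactly the dates in B. (You might want to sort B to be sure: B <- B[order(B$date_B),])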
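To make the match() lookup concrete, here is a short sketch on the example data. The dates are stored as character strings (stringsAsFactors = FALSE is an addition here), and the intermediate variable idx is introduced only for illustration.

```r
A <- data.frame(date_A = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-02",
                           "2010-02-03", "2010-01-01", "2010-01-02"),
                stringsAsFactors = FALSE)
B <- data.frame(date_B = c("2010-01-01", "2010-01-02", "2010-01-03"),
                X_B = c(7.9, 8.5, 2.1),
                stringsAsFactors = FALSE)

# For every date in A, the row of B that holds it (NA if B lacks that date)
idx <- match(A$date_A, B$date_B)
print(idx)
# 1 2 3 2 NA 1 2

# Use those positions to pull the values, keeping the row order of A
A$X_A <- B$X_B[idx]
print(A)
```

Unlike merge, this keeps A in its original row order. The fifth row is NA because 2010-02-03 does not occur in B.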
As for why the loop doesn't work: first, I think you really don't want to use ; in R scripts, ever. It makes code much harder to read. It is best to learn to write clear code. In your code you could use the assignment operators more consistently and indent properly. For example:
i <- 1
j <- 1
while (i <= nrow(A))
{
  while (j <= nrow(B))
  {
    if (A$date_A[i]==B$date_B[j]) A$X_A[i] <- B$X_B[j]
    j <- j+1
    if (j == nrow(B)) i <- i+1
    j <- 1
  }
}
This is your code, but it is much easier to read. For me it does not run because the factor levels are not comparable (due to the typo), so I put in an as.character() call. This is probably not needed in the real dataset.
Indenting immediately shows the biggest problem here: you have misplaced j <- 1 outside the if (j == nrow(B)) part. Using ; terminates the statement and thus the conditional part. Because of this, j is reset to 1 on every iteration.
Changing that makes it run better, but you still get an error because the while loop over j might not finish before i becomes larger than the number of rows in A. This can be fixed by combining both conditions with AND and collapsing the two while loops into one. Finally, the if statement needs to test whether j is larger than the number of rows in B, or you skip the last row. This should work:
i <- 1
j <- 1
while (j <= nrow(B) && i <= nrow(A))
{
  if (as.character(A$date_A[i])==as.character(B$date_B[j])) A$X_A[i] <- B$X_B[j]
  j <- j+1
  if (j > nrow(B))
  {
    i <- i+1
    j <- 1
  }
}
But this is only meant to show what went wrong; I would never recommend doing it this way. Even when you really want to use loops, you are probably better off with for loops.
Wow! Your code scares me. At the very least, use a for loop for this kind of thing (although @DWin's solution is the way to go for this problem):
for(i in seq(nrow(A)))
{
  for(j in seq(nrow(B)))
  {
    if(A$date_A[i]==B$date_B[j])
    {
      A$X_A[i] <- B$X_B[j]
    }
  }
}
This will prevent all the ugliness with manually trying to do the increments at the end of your while loops (in your own code, the j=1 needed to be moved outside the inner brackets, by the way).
Note: this code, like yours, does not handle the case where B contains two rows with the same date as in A (it will always use the value of the last row in B for that date). It is only meant to show how to use for instead of while for simple incremental loops.