I am trying to write a loop that would search for the right date in the data.frame B (date_B[j]) and would copy the related value X_B[j] into the X_A[i] variable related to the same date date_A[i].
The challenge is that a) the target data.frame A has several of the same dates but b) not systematically all the dates that the data.frame (B) has. The (B) includes all the needed dates though. Consequently, the data frames are of different lengths.
Questions:
Why does the following loop not work, even though it returns no error messages?
How to fix it?
Are there any other ways to solve this problem in R?
The data frames are:
A =
date_A X_A
1 2010-01-01
2 2010-01-02
3 2010-01-03
4 2010-01-02
5 2010-02-03
6 2010-01-01
7 2010-01-02
.
.
.
20000
B=
date_B X_B
1 2010-01-01 7.9
2 2010-01-02 8.5
3 2010-01-03 2.1
.
.
400
My goal is:
A=
date_A X_A
1 2010-01-01 7.9
2 2010-01-02 8.5
3 2010-01-03 2.1
4 2010-01-02 8.5
5 2010-02-03 2.1
6 2010-01-01 7.9
7 2010-01-02 8.5
I wrote the following loop, but for some reason it does not find its way past the first row. In other words, it does not change the values of the other X_A cells, although the loop keeps running endlessly.
i=1; j=1;
while (i <= nrow(A))
while (j <= nrow(B)) {
if (A$date_A[i]==B$date_B[j]) A$X_A[i] <- B$X_B[j];
j <- j+1; if (j == nrow(B))
i <- i+1;
j=1
}
Thanks for your help.
With this sort of problem merge makes it much easier. With your example I do not get a match with the seventh row but perhaps you had a typo. My A dataframe only had the date_A column. If you want to rename the X_B column, then the names()<- will do it easily;
merge(A, B, by.x=1, by.y=1, all.x=TRUE)
#---result---
date_A X_B
1 2010-01-01 7.9
2 2010-01-01 7.9
3 2010-01-02 8.5
4 2010-01-02 8.5
5 2010-01-02 8.5
6 2010-01-03 2.1
7 2010-02-03 NA
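Note that merge() sorts the result by the key, so the rows no longer follow the original order of A. If that order matters, one option is to tag the rows before merging and re-sort afterwards. The following is only a sketch: the helper column row_id is a made-up name that is not part of the original data, and the strings are kept as plain characters by assumption.

```r
A <- data.frame(date_A = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-02",
                           "2010-02-03", "2010-01-01", "2010-01-02"),
                stringsAsFactors = FALSE)
B <- data.frame(date_B = c("2010-01-01", "2010-01-02", "2010-01-03"),
                X_B = c(7.9, 8.5, 2.1),
                stringsAsFactors = FALSE)

# Hypothetical helper column that remembers the original row order of A
A$row_id <- seq_len(nrow(A))

# Left join: keep every row of A, fill X_B with NA where B has no such date
merged <- merge(A, B, by.x = "date_A", by.y = "date_B", all.x = TRUE)

# Restore the original order, rename the value column, drop the helper
merged <- merged[order(merged$row_id), ]
names(merged)[names(merged) == "X_B"] <- "X_A"
merged$row_id <- NULL
rownames(merged) <- NULL
```

After this, merged has the same row order as A, with NA in X_A for the date that does not occur in B.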
With this data:
A <- data.frame( date_A = c('2010-01-01', '2010-01-02', '2010-01-03', '2010-01-02',
'2010-02-03', '2010-01-01', '2010-01-02') )
B <- data.frame(
date_B = c('2010-01-01','2010-01-02','2010-01-03'),
X_B = c(7.9,8.5,2.1))
You can use match() to index the X_B values in the right order:
A$X_A <- B$X_B[match(A$date_A,B$date_B)]
match() returns, for each element of A$date_A, the position of its first match in B$date_B (or NA if there is none). Another trick is to use the levels of the factor as the index:
A$X_A <- B$X_B[A$date_A]
which works only if date_A is a factor: a factor's levels are kept in sorted order and correspond to integer codes (1, 2, 3, ...), and indexing with a factor uses those codes. So if B is sorted according to these levels, and has a row for every level, this returns the correct values. (You might want to sort B to be sure: B <- B[order(B$date_B),])
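To make the match() step concrete, here is a small sketch on the example data above. It assumes the dates are stored as plain character strings (stringsAsFactors = FALSE), which is an assumption about the real data; idx is just an illustrative name.

```r
A <- data.frame(date_A = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-02",
                           "2010-02-03", "2010-01-01", "2010-01-02"),
                stringsAsFactors = FALSE)
B <- data.frame(date_B = c("2010-01-01", "2010-01-02", "2010-01-03"),
                X_B = c(7.9, 8.5, 2.1),
                stringsAsFactors = FALSE)

# For every date in A: the row of B holding that date, NA if B has no such date
idx <- match(A$date_A, B$date_B)

# Indexing with NA yields NA, so unmatched dates end up with a missing X_A
A$X_A <- B$X_B[idx]
```

Here idx is 1 2 3 2 NA 1 2: the fifth date (2010-02-03) does not occur in B, so its X_A stays NA, and A keeps its original row order.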
As for why the loop doesn't work: first, I think you really don't want to use ; in R scripts, ever. It makes code so much harder to read. It is best to learn to write clear code. In your code you can use the assignment operators more consistently and use proper indenting. For example:
i <- 1
j <- 1
while (i <= nrow(A))
{
    while (j <= nrow(B))
    {
        if (A$date_A[i]==B$date_B[j]) A$X_A[i] <- B$X_B[j]
        j <- j+1
        if (j == nrow(B)) i <- i+1
        j <- 1
    }
}
This is your code, but it is much clearer to read. For me this does not run because the levels are not comparable (due to the typo), so I put in an as.character() call. This is probably not needed in the real dataset.
Indenting immediately shows the biggest problem here: you have misplaced j <- 1 outside the if (j == nrow(B)) part. Using ; terminates the statement and thus the conditional part. Because of this, j is reset to 1 on every pass through the loop.
Changing that makes it run better, but you still get an error because the while loop for j might not finish before i is larger than the number of rows in A. This can be fixed by combining both conditions with AND and collapsing both while loops into one. Finally, you need to change the if statement to test for j larger than the number of rows in B, or you omit one row. This should work:
i <- 1
j <- 1
while (j <= nrow(B) & i <= nrow(A))
{
    if (as.character(A$date_A[i])==as.character(B$date_B[j])) A$X_A[i] <- B$X_B[j]
    j <- j+1
    if (j > nrow(B))
    {
        i <- i+1
        j <- 1
    }
}
But this is only meant to show what went wrong; I'd never recommend doing it this way. Even when you really want to use loops, you are probably better off with for loops.
Wow! Your code scares me. At the very least, use a for loop for this kind of thing (although @Dwin's solution is the way to go for this problem):
for(i in seq(nrow(A)))
{
    for(j in seq(nrow(B)))
    {
        if(A$date_A[i]==B$date_B[j])
        {
            A$X_A[i] <- B$X_B[j]
        }
    }
}
This will prevent all the ugliness with manually trying to do the increments at the end of your while loops (in your own code, the j=1 needed to be moved outside the inner brackets, by the way).
Note: this code, as yours, does not solve the issue when B contains two rows with the same date as in A (it will always use the value of the last row in B for that date). It serves to help you understand for instead of while for simple incremental loops.
Let's say I have data like:
> data[295:300,]
Date sulfate nitrate ID
295 2003-10-22 NA NA 1
296 2003-10-23 NA NA 1
297 2003-10-24 3.47 0.363 1
298 2003-10-25 NA NA 1
299 2003-10-26 NA NA 1
300 2003-10-27 NA NA 1
Now I would like to add all the nitrate values into a new list/vector. I'm using the following code:
i <- 1
my_list <- c()
for(val in data)
{
my_list[i] <- val
i <- i + 1
}
But this is what happens:
Warning message:
In x[i] <- val :
number of items to replace is not a multiple of replacement length
> i
[1] 2
> x
[1] NA
Where am I going wrong? The data is part of the Coursera R Programming coursework. I can assure you that this is not an assignment/quiz. I have been trying to understand what the best way is to append elements to a list with a loop. I have not yet reached the lapply or sapply part of the coursework, so I am thinking about workarounds.
Thanks in advance.
If it's a duplicate question, please direct me to it.
As we mention in the comments, you are not looping over the rows of your data frame, but the columns (also sometimes variables). Hence, loop over data$nitrate.
i <- 1
my_list <- c()
for(val in data$nitrate)
{
my_list[i] <- val
i <- i + 1
}
Now, instead of looping over your values, a better way is to use that you want the new vector and the old data to have the same index, so loop over the index i. How do you tell R how many indexes there are? Here you have several choices again: 1:nrow(data), 1:length(data$nitrate) and several other ways. Below I have given you a few examples of how to extract from the data frame.
my_vector <- c()
for(i in 1:nrow(data)){
my_vector[i] <- data$nitrate[i] ## Version 1 of extracting from data.frame
my_vector[i] <- data[i,"nitrate"] ## Version 2: [row,column name]
my_vector[i] <- data[i,3] ## Version 3: [row,column number]
}
My suggestion: Rather than calling the collection a list, call it a vector, since that is what it is. Vectors and lists behave a little differently in R.
Of course, in reality you don't want to get the data out one by one. A much more efficient way of getting your data out is
my_vector2 <- data$nitrate
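If the goal is only the observed values, the NA entries can be dropped in the same vectorised step. A minimal sketch, rebuilding the six rows shown in the question as a stand-in for the real data set (the names all_nitrate and observed_nitrate are made up for illustration):

```r
# Stand-in for rows 295:300 of the question's data
data <- data.frame(
  Date    = as.Date("2003-10-22") + 0:5,
  sulfate = c(NA, NA, 3.47, NA, NA, NA),
  nitrate = c(NA, NA, 0.363, NA, NA, NA),
  ID      = 1L
)

# Whole column as a vector, missing values included
all_nitrate <- data$nitrate

# Only the values that were actually measured
observed_nitrate <- data$nitrate[!is.na(data$nitrate)]
```

No loop or counter is needed in either case; the logical index !is.na(...) selects the non-missing entries in one go.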
This is my example dataset
> a
V1 V2
1 A1 5437
2 A1 5437
3 A1 5437
4 A2 1819
5 A2 1758
6 A2 1212
7 A2 1212
8 A3 1456
I want to compute unique values for column V2, so the result will be:
A1 1
A2 3
A3 1
I have started to write my code, but I have no idea what it should look like:
old_id <- a[1,2]
old_art <- a[2,1]
for (i in nrow(a)){
if (old_id == a[1,i+2] && old_art == a[i+2,1]){
new_id[i] <- old_id[1,i+2]
new_art[i] <- i
}
i <- i+1
}
I know very simple solution like:
tapply(a[,2], a[,1], function(t) length(unique(t)))
but my task is to use loop function - probably for and if
This sounds like homework. But for loops run through all elements in the vector on the right hand side of in. This also means that your for loop will increment automatically and so you don't need i <- i+1.
Hence, your for loop should look like this
for (i in 1:nrow(a)) {
< your code >
}
# i <- i + 1 # No need for this!
Notice i in 1:nrow(a) and not i in nrow(a). I haven't checked your code, only your for syntax.
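The difference is easy to see by recording which values of i each loop visits. This is a small sketch on a made-up three-row data frame; seq_len(nrow(a)) is used as a common alternative to 1:nrow(a) because it also behaves sensibly when there are zero rows.

```r
a <- data.frame(V1 = c("A1", "A1", "A2"), V2 = c(5437, 5437, 1819))

# nrow(a) is the single number 3, so this loop runs exactly once, with i = 3
visited_single <- c()
for (i in nrow(a)) visited_single <- c(visited_single, i)

# seq_len(nrow(a)) is the vector 1, 2, 3, so this loop visits every row
visited_all <- c()
for (i in seq_len(nrow(a))) visited_all <- c(visited_all, i)
```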
Remember, for loops are just functions; so
for (i in 1:3) {
print(i)
}
#[1] 1
#[1] 2
#[1] 3
is the same as
`for`(i, 1:3, print(i))
#[1] 1
#[1] 2
#[1] 3
See ?"for".
Your question, specifically, concerns the usage of for and if. Here is my approach:
R has only one form of for, the "vector style", which iterates over the elements of a vector. The "classic C style" header for(i = 1; i <= nrow(a); i = i + 1) is not valid R syntax; the closest equivalent is a while loop with a manual counter.
The C-style equivalent would be something like this:
i <- 1
while (i <= nrow(a)) {
    # Your code goes here
    i <- i + 1
}
The "vector style" would be something like this:
for(i in 1:nrow(a)) {
    # Your code goes here
}
Notice that in the for loop it is the for statement itself that advances i; only the while version needs an explicit increment. Also, remember that in R the starting index is one (unlike many C-like languages, where the starting index is usually zero).
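As for your if statement, the && in your original condition is already the right operator: && (and ||) compare single values and short-circuit, which is what if needs, while the single & (and |) work element-wise on whole vectors. So your if statement should be something like this (the indices themselves still need checking against the two columns of a):
if(old_id == a[1,i+2] && old_art == a[i+2,1]) {
    # More code here
}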
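For reference, the behaviour of the two "and" operators can be sketched as follows. The vector x and the result names are made up for illustration.

```r
x <- c(2, -1, 5)

# & works element by element and returns one logical value per element
elementwise <- x > 0 & x < 4

# && combines two single values into one, which is what if() expects
single <- x[1] > 0 && x[1] < 4

# && also short-circuits: the right-hand side is not evaluated
# when the left-hand side is already FALSE
skipped <- FALSE && stop("never evaluated")
```

Here elementwise is TRUE FALSE FALSE, single is TRUE, and skipped is FALSE without the stop() call ever running.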
Finally, if you want to debug your code, check this link.
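Since the task asks for a loop, here is one possible sketch that counts the distinct V2 values per V1 group using only for and if. It is not taken from the question's code; the names counts, seen and key are invented for the illustration, and the data frame rebuilds the example shown in the question.

```r
a <- data.frame(V1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A2", "A3"),
                V2 = c(5437, 5437, 5437, 1819, 1758, 1212, 1212, 1456),
                stringsAsFactors = FALSE)

counts <- integer(0)    # named vector: one counter per V1 group
seen <- character(0)    # "group value" pairs that were already counted

for (i in seq_len(nrow(a))) {
  group <- a$V1[i]
  key <- paste(group, a$V2[i])

  # First time this group appears: start its counter at zero
  if (!(group %in% names(counts))) {
    counts[group] <- 0L
  }

  # First time this value appears within the group: count it
  if (!(key %in% seen)) {
    counts[group] <- counts[group] + 1L
    seen <- c(seen, key)
  }
}
```

For the example data this gives A1 = 1, A2 = 3 and A3 = 1, the same as the tapply() one-liner.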
My data set has about 54,000 rows. I want to set a value (First_Pass) to either T or F depending upon both a value in another column and also whether or not that other column's value has been seen before. I have a for loop that does exactly what I need it to do. However, that loop is only for a subset of the data. I need that same for loop to be run individually for different subsets based upon factor levels.
This seems like the perfect case for the plyr functions as I want to split the data into subsets, apply a function (my for loop) and then rejoin the data. However, I cannot get it to work. First, I give a sample of the df, called char.data.
session_id list Sent_Order Sentence_ID Cond1 Cond2 Q_ID Was_y CI CI_Delta character tsle tsoc Direct
5139 2 b 9 25 rc su 25 correct 1 0 T 995 56 R
5140 2 b 9 25 rc su 25 correct 2 1 h 56 56 R
5141 2 b 9 25 rc su 25 correct 3 1 e 56 56 R
5142 2 b 9 25 rc su 25 correct 4 1 56 37 R
There is some clutter in there. The key columns are session_id, Sentence_ID, CI, and CI_Delta.
I then initialise a column called First_Pass to "F"
char.data$First_Pass <- "F"
I want to now calculate when First_Pass is actually "T" for each combination of session_id and Sentence_ID. I created a toy set, which is just one subset to work out the overall logic. Here's the code of a for loop that gives me just what I want for the toy data.
char.data.toy$First_Pass <- "F"
l <-c(200)
for (i in 1:nrow(char.data.toy)) {
if(char.data.toy[i,]$CI_Delta >= 0 & char.data.toy[i,]$CI %nin% l){
char.data.toy[i,]$First_Pass <- "T"
l <- c(l,char.data.toy[i,]$CI)}
}
I now want to take this loop and run it for every session_id and Sentence_ID subset. I've created a function called set_fp and then called it inside ddply. Here is that code:
#define function
set_fp <- function (df){
l <- 200
for (i in 1:nrow(df)) {
if(df[i,]$CI_Delta >= 0 & df[i,]$CI %nin% l){
df[i,]$First_Pass <- "T"
l <- c(l,df[i,]$CI)}
else df[i,]$First_Pass <- "F"
return(df)
}
}
char.data.fp <- ddply(char.data,c("session_id","Sentence_ID"),function(df)set_fp(df))
Unfortunately, this is not quite right. For a long time, I was getting all "F" values for First_Pass. Now I'm getting 24 T values, when it should be many more, so I suspect, it's only keeping the last subset or something similar. Help?
This is a little hard to test with only the four rows that you've provided. I created random data to see if it works, and it seems to work for me. Try it on your data too.
This uses the data.table library and doesn't try to run loops inside a ddply. I'm assuming the means aren't important.
library(data.table)
dt <- data.table(df)
l <- c(200)
# subsetting to keep only the important fields
dt <- dt[,list(session_id, Sentence_ID, CI, CI_Delta)]
# Initialising First_Pass
dt[,First_Pass := 'F']
# The next two lines are basically rewording your logic -
# Within each group of session_id, Sentence_ID, identify the duplicate CI entries. These would have been inserted in l. The first occurrence of each of these CI entries is marked FALSE, as it wouldn't have been in l when that row was being checked
dt[CI_Delta >= 0,duplicatedCI := duplicated(CI), by = c("session_id", "Sentence_ID")]
# So if the CI value hasn't occurred before within the session_id,Sentence_ID group, and it doesn't appear in l, then mark it as "T"
dt[!(CI %in% l) & !(duplicatedCI), First_Pass := "T"]
# Just for curiosity's sake, calculating l too
l <- c(l,dt[duplicatedCI == FALSE,CI])
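Separately from the data.table route, the set_fp function in the question appears to have its return(df) inside the for loop, so it would return after processing only the first row of each subset. A sketch of the same logic with the return moved after the loop is shown below. It uses !(... %in% ...) instead of %nin% to avoid the extra package, base split()/lapply() instead of ddply, and a made-up toy data frame; all of these are substitutions for illustration, not the original setup.

```r
set_fp <- function(df) {
  seen <- 200                       # same starting value as l in the question
  df$First_Pass <- "F"
  for (i in seq_len(nrow(df))) {
    if (df$CI_Delta[i] >= 0 && !(df$CI[i] %in% seen)) {
      df$First_Pass[i] <- "T"
      seen <- c(seen, df$CI[i])
    }
  }
  df                                # returned once, after the whole loop
}

# Made-up toy data with two session_id / Sentence_ID subsets
toy <- data.frame(session_id  = c(2, 2, 2, 2, 3, 3),
                  Sentence_ID = c(25, 25, 25, 25, 25, 25),
                  CI          = c(1, 2, 1, 3, 1, 1),
                  CI_Delta    = c(0, 1, -1, 2, 0, 0))

# Split into subsets, apply the function to each, and stack the results
pieces <- split(toy, list(toy$session_id, toy$Sentence_ID), drop = TRUE)
result <- do.call(rbind, lapply(pieces, set_fp))
```

With the return in the right place, the same function should also be usable as the function argument of the ddply call from the question.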
I am trying to use data.table where my j function could and will return a different number of columns on each call. I would like it to behave like rbind.fill in that it fills any missing columns with NA.
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
In this case 'result' may end up with two columns, A and B. Both 'A' and 'B' were returned by the first call to 'fetch', and only 'B' was returned by the second. I would like the example code to return this result.
id A B
1 1 a b
2 2 <NA> b
Unfortunately, when run I get this error.
Error in `[.data.table`(data, , fetch(.BY, .SD), by = id) :
j doesn't evaluate to the same number of columns for each group
I can do this with plyr as follows, but in my real world use case plyr is running out of memory. Each call to fetch occurs rather quickly, but the memory crash occurs when plyr tries to merge all of the data back together. I am trying to see if data.table might solve this problem for me.
result <- ddply(data, "id", fetch)
Any thoughts appreciated.
DWin's approach is good. Or you could return a list column instead, where each cell is itself a vector. That's generally a better way of handling variable length vectors.
DT = data.table(A=rep(1:3,1:3),B=1:6)
DT
A B
1: 1 1
2: 2 2
3: 2 3
4: 3 4
5: 3 5
6: 3 6
ans = DT[, list(list(B)), by=A]
ans
A V1
1: 1 1
2: 2 2,3 # V1 is a list column. These aren't strings, the
3: 3 4,5,6 # vectors just display with commas
ans$V1[3]
[[1]]
[1] 4 5 6
ans$V1[[3]]
[1] 4 5 6
ans[,sapply(V1,length)]
[1] 1 2 3
So in your example you could use this as follows:
library(plyr)
rbind.fill(data[, list(list(fetch(.BY))), by = id]$V1)
# A B
#1 a b
#2 <NA> b
Or, just make the list returned conformant :
allcols = c("A","B")
fetch <- function(by) {
if(by == 1)
list(A=c("a"), B=c("b"))[allcols]
else
list(B=c("b"))[allcols]
}
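The idea of making every group return the same set of named columns can also be sketched in base R. The helper conform() below is hypothetical (it is not a data.table or plyr function), and it assumes that all columns are character so that NA_character_ is a suitable filler.

```r
allcols <- c("A", "B")

# Hypothetical helper: pad a named list so it has every column in cols,
# using NA for the columns that are missing
conform <- function(x, cols = allcols) {
  out <- lapply(cols, function(nm) if (is.null(x[[nm]])) NA_character_ else x[[nm]])
  names(out) <- cols
  out
}

fetch <- function(by) {
  if (by == 1)
    conform(list(A = "a", B = "b"))
  else
    conform(list(B = "b"))
}

# Every piece now has the same columns, so the pieces can simply be stacked
pieces <- lapply(c(1, 2), fetch)
result <- do.call(rbind, lapply(pieces, as.data.frame, stringsAsFactors = FALSE))
```

Because each call now returns both A and B, the results have a consistent shape regardless of which branch of fetch was taken.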
Here are two approaches. The first roughly follows your strategy:
data[,list(A=if(.BY==1) 'a' else NA_character_,B='b'), by=id]
And the second does things in two steps:
DT <- copy(data)[,`:=`(A=NA_character_,B='b')][id==1,A:='a']
Using a by just to check for a single value seems wasteful (maybe computationally, but also in terms of clarity); of course, it could be that your application isn't really like that.
Try
data.table(A=NA, B=c("b"))
@NickAllen: I'm not sure from the comments whether you understood my suggestion. (I was posting from a mobile phone that limited my cut-paste capabilities and I suspect my wife was telling me to stop texting to SO or she would divorce me.) What I meant was this:
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(A=NA, B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
I have a rather odd problem.
Let X be a dataset with about 300,000 rows and 300 columns. Assume that a lot of the entries in X are missing (in reality, a missing value here means zero).
What I want to do:
For each row, subtract the rightmost non-missing column from the third column.
Save the difference, as well as the column name. If the difference is not negative, search for the next non-missing value in the row, going left, and subtract that value from the difference already calculated. Do this as long as the difference is not negative, and save the column name each time.
I already wrote something to do this for me. The problem is that it takes about 53 hours to finish, and I reckon the dataset isn't even particularly big.
Could you guys please help me :(
b <- c()
length(b) <- 193145
d <- 0;
for (i in (1:193145))
{
d <- 0;
for (j in (271:4))
{
while(is.na(x[i,j]))
{
j <- j-1;
}
d <- (d+x[i,j]);
if ((x[i,3]-d)&&(j>3))
{
b[i] <- colnames(x)[j]
j <- 2
}
else if (j==3)
{
b[i] <- "older"
}
j<-j-1;
}
i<-i+1;
}
UPDATE:
Hey guys, thanks for the fast responses. The i<-i+1 bit is completely wrong: I forgot that a for loop increments i by itself at the end of each iteration.
okay, a short example
A B C D E F G H I
AB001BWIF085 SS13 2980 NA NA 4000 NA NA 3000
AB001BWCE475 SS12 3800 NA NA 5000 NA NA 2000
AB001BWIF087 SS13 2980 NA NA 2000 NA NA 500
What do I want to do? I want to loop over every row and subtract, from the value in the third column, the values in the following columns, beginning from the rightmost. I want the COLNAME of each non-NA entry to be saved together with its difference from my third-column value.
And do you have some examples for the vectorize package? I couldn't really grasp the ones presented in the help.
Thanks again! :)
EXPECTED RESULT:
A col_name_1 difference_1 col_name_2 difference_2 ...
AB001BWIF085 I -20 NA NA
AB001BWCE475 I 1200 F -3800
AB001BWIF087 I 2480 F 480 "older"
If the difference is not going to drop below 0, i want an entry to be "older", indicating this case.
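One way to avoid the nested element-by-element loop is to do the work of each row with cumsum(). The sketch below rests on several assumptions: the values to subtract start in the fourth column, the subtraction accumulates from right to left over the non-missing entries, and only the name of the column where the remainder first turns negative is kept (as b does in the code above), with "older" when that never happens. The helper name first_negative is made up, and the data frame rebuilds the three example rows.

```r
x <- data.frame(A = c("AB001BWIF085", "AB001BWCE475", "AB001BWIF087"),
                B = c("SS13", "SS12", "SS13"),
                C = c(2980, 3800, 2980),
                D = NA_real_, E = NA_real_,
                F = c(4000, 5000, 2000),
                G = NA_real_, H = NA_real_,
                I = c(3000, 2000, 500))

# Numeric part of the data as a matrix: much faster to index than a data frame
vals <- as.matrix(x[, 4:ncol(x)])

# Hypothetical helper: name of the column at which start minus the running
# total (taken from the right) first drops below zero, or "older" if it never does
first_negative <- function(start, row_vals, col_names) {
  keep <- rev(which(!is.na(row_vals)))       # non-missing columns, rightmost first
  remaining <- start - cumsum(row_vals[keep])
  hit <- which(remaining < 0)
  if (length(hit) == 0) "older" else col_names[keep[hit[1]]]
}

b <- vapply(seq_len(nrow(x)),
            function(i) first_negative(x$C[i], vals[i, ], colnames(vals)),
            character(1))
```

For the three example rows this yields "I", "F" and "older". Converting to a matrix once and skipping the NA entries with which() removes the inner while loop, which is likely where most of the running time goes.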