Iterating through all rows in R, removing those that fit criteria

I have an R data frame with about a dozen columns and 150 or so rows. I want to iterate through each row and remove it under these two conditions:
Its value in column 8 is undefined (NA).
The value for the row ABOVE it, in column 8, IS defined.
My code looks like this, but it keeps crashing. It's gotta be a dumb mistake, but I can't figure it out.
for (i in 2:nrow(newfile)){
  if (is.na(newfile[i,8]) && !is.na(newfile[(i-1),8])){
    newfile <- newfile[-i,]
  }
}
Obviously in this example, newfile is my dataframe.
The error I get
Error in `[.data.frame`(newfile, -i, ) : object 'i' not found
Problem solved, but some test data if you guys wanted to muck around:
23 L8 29141078 744319 27165443
24 L8 27165443 NA NA
25 L8 28357836 8293 25116398
26 L8 25116398 NA NA
27 L8 28357836 21600 25116398
28 L8 25116398 NA NA
29 L8 40929564 NA NA
30 L8 40929564 NA NA
31 L8 41917264 33234 39446503
32 L8 39446503 NA NA
33 L8 41917264 33981 39446503
34 L8 39446503 NA NA
Obviously a little modified here, so now you are comparing column 4 with the one above it (or you can use column 5, either way).

The problem is that you're changing the data frame out from under yourself; the original evaluation of nrow(newfile) doesn't get updated as you go along (it would if you had a C-style loop for (i=1; i<=nrow(newfile); i++) ...). In a while loop, on the other hand, the condition will get re-evaluated every time through the loop, so I think this will work.
i <- 2
while (i <= nrow(newfile)){
  if (is.na(newfile[i,8]) && !is.na(newfile[i-1,8])) {
    newfile <- newfile[-i,]
  }
  i <- i+1
}
You didn't give us an easily reproducible example (i.e. a test dataset with expected answers), so I'm not going to test this right now.
Careful thought (which I don't have time to give this at the moment) might lead to a non-iterative (and hence perhaps very much faster, if that matters) way to do this.

Hmm, if I do this, I get
Error in if (is.na(newfile[i,8]) && !is.na(newfile[(i-1),8])) { :
  missing value where TRUE/FALSE needed
This is because you're removing rows while you're iterating through them, so by the time you get to nrow(newfile) (which is the original number of rows, since nrow(newfile) is evaluated once at the start of the for loop), the row may not exist any more because rows have been removed.
You can avoid looping altogether by constructing a logical index of which rows to keep (i.e. a vector of length nrow(newfile) with TRUE if you want to keep the row and FALSE otherwise):
n <- nrow(newfile)
# first bit says "is the row NA" (for rows 2:n)
# second bit says "is the row above *not* NA" (for rows 1:(n-1))
# the & finds rows satisfying *both* conditions (the first row always gets kept)
toRemove <- c(FALSE, is.na(newfile[-1,8])) & c(FALSE, !is.na(newfile[-n,8]))
toKeep <- !toRemove
newfile <- newfile[toKeep,]
You could do it all in one line if that's your thing:
newfile <- newfile[ !(c(FALSE,is.na(newfile[-1,8])) & c(FALSE,!is.na(newfile[-nrow(newfile),8]))), ]
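As a quick sanity check of how the two shifted vectors line up, here's a toy column (hypothetical values standing in for newfile[,8]):
x <- c(1, NA, NA, 5, NA)   # stand-in for newfile[, 8]
n <- length(x)
c(FALSE, is.na(x[-1]))     # row itself is NA:      FALSE  TRUE  TRUE FALSE  TRUE
c(FALSE, !is.na(x[-n]))    # row above is not NA:   FALSE  TRUE FALSE FALSE  TRUE
# the & of the two drops rows 2 and 5 but keeps row 3, whose predecessor is also NA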

Here is another solution. Note that it keeps NA values if the previous value is also NA.
# create some dummy data
newfile <- matrix(runif(800), ncol = 8)
newfile[rbinom(100, 1, 0.25) == 1, 8] <- NA
# the selection: diff(is.na(...)) == 1 marks the positions where column 8 switches
# from non-NA to NA, so the row just below each of those positions gets dropped
# (-which(...) - 1 is the same as -(which(...) + 1))
newfile[-which(diff(is.na(newfile[, 8])) == 1) - 1, ]

Related

Appending to an R List one by one

Let's say I have data like:
> data[295:300,]
Date sulfate nitrate ID
295 2003-10-22 NA NA 1
296 2003-10-23 NA NA 1
297 2003-10-24 3.47 0.363 1
298 2003-10-25 NA NA 1
299 2003-10-26 NA NA 1
300 2003-10-27 NA NA 1
Now I would like to add all the nitrate values into a new list/vector. I'm using the following code:
i <- 1
my_list <- c()
for (val in data)
{
  my_list[i] <- val
  i <- i + 1
}
But this is what happens:
Warning message:
In x[i] <- val :
number of items to replace is not a multiple of replacement length
> i
[1] 2
> x
[1] NA
Where am I going wrong? The data is part of a Coursera R Programming coursework. I can assure you that this is not an assignment/quiz. I have been trying to understand the best way to append elements to a list with a loop. I have not proceeded to the lapply or sapply part of the coursework, so I'm thinking about workarounds.
Thanks in advance.
If it's a duplicate question, please direct me to it.
As we mention in the comments, you are not looping over the rows of your data frame, but over the columns (also sometimes called variables). Hence, loop over data$nitrate.
i <- 1
my_list <- c()
for (val in data$nitrate)
{
  my_list[i] <- val
  i <- i + 1
}
Now, instead of looping over your values, a better way is to use the fact that you want the new vector and the old data to have the same index, so loop over the index i. How do you tell R how many indices there are? Here you have several choices again: 1:nrow(data), 1:length(data$nitrate), and several other ways. Below I have given you a few examples of how to extract from the data frame.
my_vector <- c()
for (i in 1:nrow(data)){
  my_vector[i] <- data$nitrate[i]    ## Version 1 of extracting from the data.frame
  my_vector[i] <- data[i, "nitrate"] ## Version 2: [row, column name]
  my_vector[i] <- data[i, 3]         ## Version 3: [row, column number]
}
My suggestion: Rather than calling the collection a list, call it a vector, since that is what it is. Vectors and lists behave a little differently in R.
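A quick illustration of that difference, with toy values just for demonstration:
v <- c(1, 2, 3)          # atomic vector: all elements share one type
l <- list(1, "a", TRUE)  # list: elements can have different types
v[2]     # 2           - subsetting a vector gives a vector
l[2]     # list("a")   - [ ] on a list gives a one-element list
l[[2]]   # "a"         - [[ ]] extracts the element itself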
Of course, in reality you don't want to get the data out one by one. A much more efficient way of getting your data out is
my_vector2 <- data$nitrate

ifelse in 'r' returns NA value

I have two data frames, and I want to match the contents of one with the other. For this I am using the following code:
t <- read.csv("F:/M.Tech/Semester4/Thesis/Code/Book1.csv")
s <- read.csv("F:/M.Tech/Semester4/Thesis/Code/a4.csv")
x <- nrow(s)
y <- nrow(t)
for (i in 1:x)
  for (j in 1:y)
    ifelse(match(s[i,2], t[j,1]), s[i,9] <- t[j,2], s[i,9] <- 0)
With this code, when the contents match it works fine, but the else part returns NA. How can I assign 0 in all the places where there is no match?
I am getting the result as:
# word count word tf score word probability log values TFxIDF score Keyword Probability
# yemen 380 yemen 1 0.053938964 2.919902172 2.919902172 NA
# strikes 116 strikes 0.305263158 0.016465578 4.106483233 1.25355804 0.5
# deadly 105 deadly 0.276315789 0.014904187 4.206113074 1.162215455 0.7
# new 88 new 0.231578947 0.012491128 4.38273661 1.014949531 NA
Instead of the NA, I want to store 0 there.
Issue 1: ifelse returns one of two values, depending on the test condition. It's not a flow control function that executes code snippet one or code snippet two based on a condition.
This is right:
my_var <- ifelse(thing_to_test, value_if_true, value_if_false)
This is wrong, and doesn't make sense in R:
ifelse(thing_to_test, my_var <- value_if_true, my_var <- value_if_false)
Issue 2: make sure thing_to_test is a logical expression. Here match() returns an integer index (or NA when there is no match), not TRUE/FALSE.
Putting those things together, you can see you should follow the instruction left by Richard Scriven as a comment above.
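Not necessarily what that comment said, but as one hedged sketch of how the two issues fit together on the question's data (assuming, as in the original loop, that column 2 of s is matched against column 1 of t and the result goes into column 9 of s):
idx <- match(s[, 2], t[, 1])                 # row index in t, or NA when there is no match
s[, 9] <- ifelse(is.na(idx), 0, t[idx, 2])   # 0 where there is no match, the matched value otherwise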

Unexpected row(s) of NAs when selecting subset of dataframe

When selecting a subset of data from a dataframe, I get row(s) entirely made up of NA values that were not present in the original dataframe. For example:
example.df[example.df$census_tract == 27702, ]
returns:
census_tract number_households_est
NA NA NA
23611 27702 2864
Where did that first row of NAs come from? And why is it returned even though example.df$census_tract != 27702 for that row?
That is because there is a missing observation
> sum(is.na(example.df$census_tract))
[1] 1
> example.df[which(is.na(example.df$census_tract)), ]
census_tract number_households_est
64 NA NA
When == evaluates the 64th row it gives NA because, by default, we can't know whether 27702 is equal to the missing value. Therefore the result is missing (aka NA). So an NA is put in the logical vector used for indexing purposes. And this gives, by default, a full-of-NA row, because we are asking for a row but "we don't know which one".
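A minimal illustration of that behaviour with toy data:
df <- data.frame(x = 1:3, y = c(10, 20, 30))
df[c(TRUE, NA, FALSE), ]   # an NA in a logical index...
#     x  y
# 1   1 10
# NA NA NA                 # ...returns an "unknown" row, filled with NAs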
The proper way is
> example.df[example.df$census_tract %in% 27702, ]
census_tract number_households_est
23611 27702 2864
HTH, Luca

operate a custom loop inside ddply

My data set has about 54,000 rows. I want to set a value (First_Pass) to either T or F depending upon both a value in another column and also whether or not that other column's value has been seen before. I have a for loop that does exactly what I need it to do. However, that loop is only for a subset of the data. I need that same for loop to be run individually for different subsets based upon factor levels.
This seems like the perfect case for the plyr functions as I want to split the data into subsets, apply a function (my for loop) and then rejoin the data. However, I cannot get it to work. First, I give a sample of the df, called char.data.
session_id list Sent_Order Sentence_ID Cond1 Cond2 Q_ID Was_y CI CI_Delta character tsle tsoc Direct
5139 2 b 9 25 rc su 25 correct 1 0 T 995 56 R
5140 2 b 9 25 rc su 25 correct 2 1 h 56 56 R
5141 2 b 9 25 rc su 25 correct 3 1 e 56 56 R
5142 2 b 9 25 rc su 25 correct 4 1 56 37 R
There is some clutter in there. The key columns are session_id, Sentence_ID, CI, and CI_Delta.
I then initialise a column called First_Pass to "F"
char.data$First_Pass <- "F"
I want to now calculate when First_Pass is actually "T" for each combination of session_id and Sentence_ID. I created a toy set, which is just one subset to work out the overall logic. Here's the code of a for loop that gives me just what I want for the toy data.
char.data.toy$First_Pass <- "F"
l <- c(200)
for (i in 1:nrow(char.data.toy)) {
  if (char.data.toy[i,]$CI_Delta >= 0 & char.data.toy[i,]$CI %nin% l){
    char.data.toy[i,]$First_Pass <- "T"
    l <- c(l, char.data.toy[i,]$CI)
  }
}
I now want to take this loop and run it for every session_id and Sentence_ID subset. I've created a function called set_fp and then called it inside ddply. Here is that code:
# define function
set_fp <- function(df){
  l <- 200
  for (i in 1:nrow(df)) {
    if (df[i,]$CI_Delta >= 0 & df[i,]$CI %nin% l){
      df[i,]$First_Pass <- "T"
      l <- c(l, df[i,]$CI)
    }
    else df[i,]$First_Pass <- "F"
    return(df)
  }
}
char.data.fp <- ddply(char.data, c("session_id", "Sentence_ID"), function(df) set_fp(df))
Unfortunately, this is not quite right. For a long time, I was getting all "F" values for First_Pass. Now I'm getting 24 T values, when it should be many more, so I suspect it's only keeping the last subset or something similar. Help?
This is a little hard to test with only the four rows that you've provided. I created random data to see if it works and it seems to work for me. Try it on your data too.
This uses the data.table library and doesn't try to run loops inside a ddply. I'm assuming the means aren't important.
library(data.table)
dt <- data.table(df)
l <- c(200)
# subsetting to keep only the important fields
dt <- dt[, list(session_id, Sentence_ID, CI, CI_Delta)]
# initialising First_Pass
dt[, First_Pass := 'F']
# The next two lines are basically rewording your logic -
# within each group of session_id, Sentence_ID, identify the duplicate CI entries.
# These would have been inserted in l. The first occurrence of each such CI entry is
# marked FALSE, as it wouldn't have been in l when that row was being checked.
dt[CI_Delta >= 0, duplicatedCI := duplicated(CI), by = c("session_id", "Sentence_ID")]
# So if the CI value hasn't occurred before within the session_id, Sentence_ID group,
# and it doesn't appear in l, then mark it as "T"
dt[!(CI %in% l) & !(duplicatedCI), First_Pass := "T"]
# just for curiosity's sake, calculating l too
l <- c(l, dt[duplicatedCI == FALSE, CI])
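For what it's worth, the original plyr approach may also work once return(df) is moved outside the for loop; as posted, set_fp returns after processing only the first row of each subset, which would explain the small number of "T" values. A hedged, untested sketch (assuming %nin% comes from the Hmisc package, as in the question):
library(plyr)
library(Hmisc)   # provides %nin%

set_fp <- function(df) {
  df$First_Pass <- "F"
  l <- c(200)
  for (i in 1:nrow(df)) {
    if (df$CI_Delta[i] >= 0 & df$CI[i] %nin% l) {
      df$First_Pass[i] <- "T"
      l <- c(l, df$CI[i])
    }
  }
  df   # return the whole processed subset, not just the first row
}

char.data.fp <- ddply(char.data, c("session_id", "Sentence_ID"), set_fp)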

subtracting columns from other columns in R data.frame

I have a rather odd problem going on.
Let X be a dataset with about 300,000 rows and 300 columns. Assume that a lot of the entries in X have missing values (which in this case equal zero in reality).
What I want to do:
Subtract, for each row, the third column from the farthest-right column that is not missing.
Save the difference, as well as the colname. If the difference is not negative, search for the next non-missing value in the row, going left, and now calculate the difference between the already-calculated difference and the new non-missing value. Do this as long as the difference is not negative, and each time save the colname.
I already wrote something to do this for me - the problem is, it effectively takes about 53h to finish, and I reckon that the dataset isn't even particularly big.
Could you guys please help me :(
b <- c()
length(b) <- 193145
d <- 0
for (i in (1:193145))
{
  d <- 0
  for (j in (271:4))
  {
    while (is.na(x[i,j]))
    {
      j <- j-1
    }
    d <- (d + x[i,j])
    if ((x[i,3] - d) && (j > 3))
    {
      b[i] <- colnames(x)[j]
      j <- 2
    }
    else if (j == 3)
    {
      b[i] <- "older"
    }
    j <- j-1
  }
  i <- i+1
}
UPDATE:
Hey guys, thanks for the fast responses. The i <- i+1 bit is completely wrong, as I forgot that at the end of a for loop iteration, i gets incremented anyway.
Okay, a short example:
A B C D E F G H I
AB001BWIF085 SS13 2980 NA NA 4000 NA NA 3000
AB001BWCE475 SS12 3800 NA NA 5000 NA NA 2000
AB001BWIF087 SS13 2980 NA NA 2000 NA NA 500
What do I want to do? I want to loop over every row and subtract the value in the third column from every value in the following columns, beginning from the farthest right. I want the COLNAME of the entry which is not NA to be saved along with the difference to my value from the third column.
And do you have some examples for the vectorize package? I couldn't really grasp the ones presented inside the help.
Thanks again! :)
EXPECTED RESULT:
A col_name_1 difference_1 col_name_2 difference_2 ...
AB001BWIF085 I -20 NA NA
AB001BWCE475 I 1200 F -3800
AB001BWIF087 I 2480 F 480 "older"
If the difference is not going to drop below 0, I want an entry to be "older", indicating this case.
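In case it helps, here is a hedged, untested sketch of that scan written directly against the small example above (it assumes the numbers sit in columns 3 through 9 of X, with column 3 as the starting point, and it returns the visited colnames and running differences per row as a list rather than the padded wide table shown in EXPECTED RESULT):
scan_row <- function(row, value_cols = 4:9) {
  d <- row[[3]]                  # start from the value in the third column
  out <- list()
  for (j in rev(value_cols)) {   # walk the value columns right to left
    if (is.na(row[[j]])) next    # skip missing entries
    d <- d - row[[j]]
    out[[length(out) + 1]] <- list(col = names(row)[j], diff = d)
    if (d < 0) return(out)       # stop as soon as the difference goes negative
  }
  c(out, "older")                # never went negative before running out of columns
}

results <- lapply(seq_len(nrow(X)), function(i) scan_row(X[i, ]))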
