subtracting columns from other columns in R data.frame - r

I have a rather odd problem.
Let X be a dataset with about 300,000 rows and 300 columns. Assume that a lot of the entries in X are missing (which in this case means they are zero in reality).
What I want to do:
For each row, subtract the third column from the rightmost column that is not missing.
Save the difference as well as the column name. If the difference is not negative, search for the next non-missing value in the row, going left, and calculate the difference between the already calculated difference and the new non-missing value. Repeat this as long as the difference is not negative, saving the column name each time.
I already wrote something to do this for me. The problem is that it takes about 53 hours to finish, and I reckon the dataset isn't even particularly big.
Could you guys please help me? :(
b <- c()
length(b) <- 193145
d <- 0;
for (i in (1:193145))
{
  d <- 0;
  for (j in (271:4))
  {
    while(is.na(x[i,j]))
    {
      j <- j-1;
    }
    d <- (d+x[i,j]);
    if ((x[i,3]-d)&&(j>3))
    {
      b[i] <- colnames(x)[j]
      j <- 2
    }
    else if (j==3)
    {
      b[i] <- "older"
    }
    j<-j-1;
  }
  i<-i+1;
}
UPDATE:
Hey guys, thanks for the fast responses. The i<-i+1 bit is simply wrong; I forgot that i gets incremented at the end of each for loop iteration anyway.
Okay, a short example:
A             B     C     D   E   F     G   H   I
AB001BWIF085  SS13  2980  NA  NA  4000  NA  NA  3000
AB001BWCE475  SS12  3800  NA  NA  5000  NA  NA  2000
AB001BWIF087  SS13  2980  NA  NA  2000  NA  NA  500
What do I want to do? I want to loop over every row and subtract the value in the third column from every value in the following columns, beginning from the far right. I want the column name of each value that is not NA to be saved together with its difference from the value in the third column.
And do you have some examples for the vectorize package? I couldn't really grasp the ones presented in the help.
Thanks again! :)
EXPECTED RESULT:
A             col_name_1  difference_1  col_name_2  difference_2  ...
AB001BWIF085  I           -20           NA          NA
AB001BWCE475  I           1200          F           -3800
AB001BWIF087  I           2480          F           480           "older"
If the difference never drops below 0, I want an entry "older" to indicate this case.
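One possible reading of the loop and the expected result is: start from the value in column C, walk the non-missing values from right to left, and keep subtracting until the running difference turns negative. The sketch below follows that reading only; the helper name walk_row and the list layout of the result are assumptions made for illustration, not something from the question. Under this reading rows 1 and 3 reproduce the expected table, while row 2 comes out as 1800 and -3200. Converting the value columns to a matrix once, instead of indexing the data frame cell by cell, is usually where most of the time is saved.

```r
# Toy data copied from the example in the question
x <- data.frame(
  A = c("AB001BWIF085", "AB001BWCE475", "AB001BWIF087"),
  B = c("SS13", "SS12", "SS13"),
  C = c(2980, 3800, 2980),
  D = NA_real_, E = NA_real_,
  F = c(4000, 5000, 2000),
  G = NA_real_, H = NA_real_,
  I = c(3000, 2000, 500),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 'start' is the value in column C, 'values' is a
# named numeric vector holding the columns to the right of it.
walk_row <- function(start, values) {
  values <- rev(values[!is.na(values)])   # right-to-left, NAs dropped
  remaining <- start - cumsum(values)     # running differences
  stop_at <- which(remaining < 0)[1]      # first negative one, NA if none
  keep <- if (is.na(stop_at)) seq_along(values) else seq_len(stop_at)
  list(cols  = names(values)[keep],
       diffs = unname(remaining[keep]),
       older = is.na(stop_at))            # TRUE if it never drops below 0
}

# Index a numeric matrix rather than the data frame inside the loop
vals <- as.matrix(x[, 4:ncol(x)])
res <- lapply(seq_len(nrow(x)), function(i) walk_row(x$C[i], vals[i, ]))
names(res) <- x$A
```

res[["AB001BWIF087"]] then holds the column names, the running differences and the "older" flag for that row.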

Related

doing for loop in R

I have filtered my SNPs for LD; the retained IDs are in my.filtered.snp.id (see the example below). I want to keep only these SNPs in my genotype matrix (geno_snp). I am trying to write a for loop in R and would appreciate any help fixing my code. I want to keep the rows of the genotype matrix (the whole row, including snp.id and genotype information) whose snp.id matches an entry in my.filtered.snp.id, and delete those that do not match.
head(my.filtered.snp.id)
Chr10_31458
Chr10_31524
Chr10_45901
Chr10_102754
Chr10_102828
Chr10_103480
head (geno_snp)
XRQChr10_103805 NA NA NA 0 NA 0 NA NA NA NA NA 0 0
XRQChr10_103937 NA NA NA 0 NA 1 NA NA NA NA NA 0 2
XRQChr10_103990 NA NA NA 0 NA 0 NA NA NA NA NA 0 NA
I am trying something like this:
for (i in 1:length(geno_snp[,1])){
  for (j in 1:length(my.filtered.snp.id)){
    if geno_snp[i,] == my.filtered.snp.id[j]
      print (the whole line in geno_snp)
  }
  else (remove the line)
}
If I understood it correctly, you want a subset of your data.frame geno_snp in which the row names must match the selected SNP IDs from the vector my.filtered.snp.id.
Please check if this solution works for you:
index <- unlist(sapply(row.names(geno_snp), function(x) grep(pattern = x, x = my.filtered.snp.id)))
selected_subset <- geno_snp[index,]
What I did was create an index addressing the rows whose names match any value in my.filtered.snp.id. Then I used the index to subset the data frame. Since applying grep through sapply returns a list, I used unlist to turn the result into a vector.
EDIT:
I noticed you had some row.names that weren't an exact match with your original my.filtered.snp.id values. In this case, maybe what you want to do is:
index <- unlist(sapply(my.filtered.snp.id, function(x) grep(pattern = x, x = row.names(geno_snp))))
selected_subset <- geno_snp[index,]
The thing is that you have row.names beginning with XRQ, so in this last case the code uses the reference values from my.filtered.snp.id to detect matches in row.names(geno_snp), even when the XRQ string is at the beginning.
Finally, in case I have misunderstood your data and what I'm calling row names here are, in fact, data in a column (the SNP IDs), just use geno_snp[,1] instead of row.names(geno_snp) in both snippets above.
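If the only difference between the two sets of IDs is the XRQ prefix seen in head(geno_snp), an exact comparison avoids the partial matches that grep() can produce (a pattern such as Chr10_3145 would also match Chr10_31458). A minimal sketch, assuming the IDs are stored as row names; the toy values below are made up and only stand in for the real data:

```r
# Toy stand-ins for the real objects (values are illustrative only)
geno_snp <- data.frame(
  g1 = c(0, 1, NA),
  g2 = c(0, 2, 0),
  row.names = c("XRQChr10_103805", "XRQChr10_103937", "XRQChr10_103990")
)
my.filtered.snp.id <- c("Chr10_31458", "Chr10_103937", "Chr10_103990")

# Strip the leading "XRQ" so both sides use the same ID format
ids <- sub("^XRQ", "", row.names(geno_snp))

# Keep only rows whose ID is exactly one of the filtered SNP IDs
selected_subset <- geno_snp[ids %in% my.filtered.snp.id, , drop = FALSE]
```

%in% compares whole strings, so there is no loop and no regular expression to escape.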

Applying percentage change between two columns, same row

I want to compute the % change between the Close and 10ema columns. The desired output is the % difference shown in column 3.
I have my data within a dataframe
Close     10ema        % Difference ((Close - 10ema) / 10ema * 100)
12.81398  NA           NA
13.2636   NA           NA
13.54461  NA           NA
13.76941  NA           NA
13.82561  NA           NA
13.88181  NA           NA
13.76941  NA           NA
13.88181  NA           NA
13.4884   NA           NA
13.4884   13.572704    -0.621128995
13.376    13.53693964  -1.188892325
13.376    13.50767788  -0.974837314
13.4884   13.50417281  -0.11679956
13.4322   13.49108685  -0.436487059
13.376    13.47016197  -0.699041087
13.376    13.45304161  -0.572670563
13.376    13.43903404  -0.469037013
I am looking to compute the % difference between Close and 10ema.
In Excel I would use:
=SUM(Close - 10ema) / 10ema * 100
The R code I am using:
new.dataframe$close.prct.ema.10 <- apply(new.dataframe[,c('Close', 'ema.10')], 1, function(x) { (x[1]-x[2]/x[2]) * 100 } )
I am specifying the columns to apply the function to in [,c('Close', 'ema.10')].
The 1 before the function tells apply to run the function row by row, as in: , 1, function(x)
This portion, (x[1]-x[2]/x[2]) * 100, is an attempt to say:
Close - ema10 / ema10 * 100
The 1 and 2 refer to the order in which the column names are given in [,c('Close', 'ema.10')].
However, the result is wrong. The code looks just fine to me and makes sense, so what am I missing here?
The answer is below. It was an operator precedence issue: division binds tighter than subtraction, so x[1]-x[2]/x[2] is evaluated as x[1]-1. Here is the working solution, which applies a percentage change between two columns.
new.dataframe$close.prct.ema.10 <- apply(new.dataframe[,c('Close', 'ema.10')], 1, function(x) { (x[1]-x[2])/x[2] * 100 } )
(x[1]-x[2])/x[2] * 100
Note the parentheses around x[1]-x[2], so that the subtraction happens before the division by x[2].
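Since arithmetic on data frame columns is already vectorised in R, the same result can be had without apply(). A minimal sketch, assuming the column names Close and ema.10 from the question and using three rows copied from the table as toy data:

```r
# Three rows taken from the table in the question
new.dataframe <- data.frame(
  Close  = c(13.4884, 13.4884, 13.376),
  ema.10 = c(NA, 13.572704, 13.53693964)
)

# Column arithmetic works element by element; rows with NA stay NA
new.dataframe$close.prct.ema.10 <-
  (new.dataframe$Close - new.dataframe$ema.10) / new.dataframe$ema.10 * 100
```

This avoids calling an R function once per row, which tends to matter on long price series.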

Appending to an R List one by one

Let's say I have data like:
> data[295:300,]
Date sulfate nitrate ID
295 2003-10-22 NA NA 1
296 2003-10-23 NA NA 1
297 2003-10-24 3.47 0.363 1
298 2003-10-25 NA NA 1
299 2003-10-26 NA NA 1
300 2003-10-27 NA NA 1
Now I would like to add all the nitrate values into a new list/vector. I'm using the following code:
i <- 1
my_list <- c()
for(val in data)
{
  my_list[i] <- val
  i <- i + 1
}
But this is what happens:
Warning message:
In x[i] <- val :
number of items to replace is not a multiple of replacement length
> i
[1] 2
> x
[1] NA
Where am I going wrong? The data is part of the Coursera R Programming coursework. I can assure you that this is not an assignment/quiz. I have been trying to understand the best way to append elements to a list in a loop. I have not yet reached the lapply or sapply part of the coursework, so I am thinking about workarounds.
Thanks in advance.
If it's a duplicate question, please direct me to it.
As mentioned in the comments, you are not looping over the rows of your data frame but over its columns (also called variables). Hence, loop over data$nitrate.
i <- 1
my_list <- c()
for(val in data$nitrate)
{
  my_list[i] <- val
  i <- i + 1
}
Now, instead of looping over the values, a better way is to use the fact that the new vector and the old data share the same index, so loop over the index i. How do you tell R how many indexes there are? Here you have several choices again: 1:nrow(data), 1:length(data$nitrate) and several other ways. Below are a few equivalent ways of extracting a value from the data frame (any one of the three lines is enough).
my_vector <- c()
for(i in 1:nrow(data)){
  my_vector[i] <- data$nitrate[i]   ## Version 1 of extracting from data.frame
  my_vector[i] <- data[i,"nitrate"] ## Version 2: [row,column name]
  my_vector[i] <- data[i,3]         ## Version 3: [row,column number]
}
My suggestion: Rather than calling the collection a list, call it a vector, since that is what it is. Vectors and lists behave a little differently in R.
Of course, in reality you don't want to get the data out one by one. A much more efficient way of getting your data out is
my_vector2 <- data$nitrate
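To see why the original loop ran over columns, and what an index-based loop looks like with a pre-allocated result, here is a small sketch. The data frame is rebuilt from the six rows printed in the question; the counter name n_iterations is just for illustration.

```r
# Rebuild the six rows shown in the question
data <- data.frame(
  Date    = as.Date("2003-10-22") + 0:5,
  sulfate = c(NA, NA, 3.47, NA, NA, NA),
  nitrate = c(NA, NA, 0.363, NA, NA, NA),
  ID      = 1L
)

# A data frame is a list of columns, so this loop runs once per column
n_iterations <- 0
for (val in data) n_iterations <- n_iterations + 1

# Index-based loop with the result vector allocated up front
my_vector <- numeric(nrow(data))
for (i in seq_len(nrow(data))) {
  my_vector[i] <- data$nitrate[i]
}
```

seq_len(nrow(data)) is used instead of 1:nrow(data) because it yields an empty sequence when the data frame has no rows.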

Iterating through all rows in R, removing those that fit criteria

I have an R data frame with about a dozen columns and 150 or so rows. I want to iterate through the rows and remove a row when both of these conditions hold:
Its value in column 8 is undefined
The value for the row ABOVE it, in column 8, IS defined
My code looks like this, but it keeps crashing. It's gotta be a dumb mistake, but I can't figure it out.
for (i in 2:nrow(newfile)){
  if (is.na(newfile[i,8]) && !is.na(newfile[(i-1),8])){
    newfile<-newfile[-i,]
  }
}
Obviously in this example, newfile is my dataframe.
The error I get
Error in [.data.frame(newfile, -i, ) : object 'i' not found
Problem solved, but some test data if you guys wanted to muck around:
23 L8 29141078 744319 27165443
24 L8 27165443 NA NA
25 L8 28357836 8293 25116398
26 L8 25116398 NA NA
27 L8 28357836 21600 25116398
28 L8 25116398 NA NA
29 L8 40929564 NA NA
30 L8 40929564 NA NA
31 L8 41917264 33234 39446503
32 L8 39446503 NA NA
33 L8 41917264 33981 39446503
34 L8 39446503 NA NA
Obviously a little modified here, so now you are comparing column 4 with the one above it (or you can use column 5, either way)
The problem is that you're changing the data frame out from under yourself; the original evaluation of nrow(newfile) doesn't get updated as you go along (it would if you had a C-style loop for (i=1; i<=nrow(newfile); i++) ...). In a while loop, on the other hand, the condition will get re-evaluated every time through the loop, so I think this will work.
i <- 2
while (i<=nrow(newfile)){
  if (is.na(newfile[i,8]) && !is.na(newfile[i-1,8])) {
    newfile<-newfile[-i,]
  }
  i <- i+1
}
You didn't give us an easily reproducible example (i.e. a test dataset with answers), so I'm not going to test this right now.
Careful thought (which I don't have time to give this at the moment) might lead to a non-iterative (and hence perhaps very much faster, if that matters) way to do this.
Hmm, if I do this, I get
Error in if (is.na(newfile[i,8]) && !is.na(newfile[(i-1),8]) { :
missing value where TRUE/FALSE needed
This is because you're removing rows while you're iterating through them, so by the time you get to row nrow(newfile) (which is the original number of rows, since nrow(newfile) is evaluated once at the beginning of the for loop), that row may not exist any more because rows have been removed.
You can avoid looping altogether by constructing a logical index of which rows to keep (ie vector of length nrow(newfile) with TRUE if you want to keep the row and FALSE otherwise):
n <- nrow(newfile)
# first bit says "is the row NA (for rows 2:n)"
# second bit says "is the row above *not* NA (for rows 1:(n-1))"
# the & finds rows satisfying *both* conditions (first row always gets kept)
toRemove <- c(FALSE,is.na(newfile[-1,8])) & c(FALSE,!is.na(newfile[-n,8]))
toKeep <- !toRemove
newfile <- newfile[toKeep,]
You could do it all in one line if that's your thing:
newfile <- newfile[ !(c(FALSE,is.na(newfile[-1,8])) & c(FALSE,!is.na(newfile[-nrow(newfile),8]))), ]
Here is another solution. But it keeps NA values if the previous value is also NA.
#create some dummy data
newfile <- matrix(runif(800), ncol = 8)
newfile[rbinom(100, 1, 0.25) == 1, 8] <- NA
#the selection
newfile[-which(diff(is.na(newfile[, 8])) == 1) - 1, ]
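The logical-index idea can be wrapped in a small function and checked on data shaped like the sample rows in the question. This is only a sketch: the function name drop_na_after_value and the column names id and V4 are made up for the example. One thing worth knowing about the which()-based variant is that when nothing matches, which() returns an empty vector and negative indexing with it selects zero rows; a logical index does not have that problem.

```r
# Hypothetical helper: drop rows that are NA in 'col' when the row
# directly above has a value in 'col'
drop_na_after_value <- function(df, col) {
  x <- df[[col]]
  n <- length(x)
  prev_defined <- c(FALSE, !is.na(x[-n]))   # first row has no row above it
  df[!(is.na(x) & prev_defined), , drop = FALSE]
}

# Toy data modelled on the sample rows (row labels 23 to 30)
newfile <- data.frame(
  id = 23:30,
  V4 = c(744319, NA, 8293, NA, 21600, NA, NA, NA)
)

result <- drop_na_after_value(newfile, "V4")
```

Rows 29 and 30 survive because the row above each of them was already NA in the original data.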

R: Loop with repetitive values

I am trying to write a loop that searches for the right date in data.frame B (date_B[j]) and copies the related value X_B[j] into the X_A[i] variable for the same date date_A[i].
The challenge is that the target data.frame A a) contains several of the same dates but b) does not systematically contain all the dates that data.frame B has. B does include all the needed dates, though. Consequently, the data frames are of different lengths.
Questions:
Why does the following loop not work, even though it returns no error messages?
How can I fix it?
Are there any other ways to solve this problem in R?
The data frames are:
A =
date_A X_A
1 2010-01-01
2 2010-01-02
3 2010-01-03
4 2010-01-02
5 2010-02-03
6 2010-01-01
7 2010-01-02
.
.
.
20000
B=
date_B X_B
1 2010-01-01 7.9
2 2010-01-02 8.5
3 2010-01-03 2.1
.
.
400
My goal is:
A=
date_A X_A
1 2010-01-01 7.9
2 2010-01-02 8.5
3 2010-01-03 2.1
4 2010-01-02 8.5
5 2010-02-03 2.1
6 2010-01-01 7.9
7 2010-01-02 8.5
I wrote the following loop, but for some reason it does not find its way past the first row. In other words, it does not change the values of the other X_A cells, although the loop keeps running endlessly.
i=1; j=1;
while (i <= nrow(A))
while (j <= nrow(B)) {
if (A$date_A[i]==B$date_B[j]) A$X_A[i] <- B$X_B[j];
j <- j+1; if (j == nrow(B))
i <- i+1;
j=1
}
Thanks for your help.
With this sort of problem merge makes it much easier. With your example I do not get a match for the 2010-02-03 row, but perhaps you had a typo. My A data frame only had the date_A column. If you want to rename the X_B column, then names()<- will do it easily:
merge(A, B, by.x=1, by.y=1, all.x=TRUE)
#---result---
date_A X_B
1 2010-01-01 7.9
2 2010-01-01 7.9
3 2010-01-02 8.5
4 2010-01-02 8.5
5 2010-01-02 8.5
6 2010-01-03 2.1
7 2010-02-03 NA
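Note that merge() returns the rows sorted by the key, not in the original order of A. If the original order matters, one option is to carry a row counter through the merge and sort on it afterwards. A sketch under that assumption; the helper column row_id is introduced here only for illustration:

```r
A <- data.frame(
  date_A = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-02",
             "2010-02-03", "2010-01-01", "2010-01-02"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  date_B = c("2010-01-01", "2010-01-02", "2010-01-03"),
  X_B    = c(7.9, 8.5, 2.1),
  stringsAsFactors = FALSE
)

# Remember the original row order before merging
A$row_id <- seq_len(nrow(A))

# Left join: keep every row of A, NA where B has no matching date
merged <- merge(A, B, by.x = "date_A", by.y = "date_B", all.x = TRUE)

# Restore the original order and rename the value column
merged <- merged[order(merged$row_id), ]
names(merged)[names(merged) == "X_B"] <- "X_A"
```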
With this data:
A <- data.frame( date_A = c('2010-01-01', '2010-01-02', '2010-01-03', '2010-01-02',
'2010-02-03', '2010-01-01', '2010-01-02') )
B <- data.frame(
date_B = c('2010-01-01','2010-01-02','2010-01-03'),
X_B = c(7.9,8.5,2.1))
You can use match() to index the X_B values in the right order:
A$X_A <- B$X_B[match(A$date_A,B$date_B)]
match(A$date_A, B$date_B) returns, for each element of A$date_A, the position of its first match in B$date_B (NA if there is none). Another trick is to use the levels of the factor as the index:
A$X_A <- B$X_B[A$date_A]
which works because the levels of a factor are stored in sorted order and correspond to integer codes (1, 2, 3, ...). So if B is sorted according to these levels, this returns the correct indexes (you might want to sort B to be sure: B <- B[order(B$date_B),]). Note that this only works when date_A is a factor; data.frame() has kept strings as character vectors by default since R 4.0.0.
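To make the match() lookup concrete, here it is on the example data, with the intermediate index kept in a variable (the name idx is just for illustration). The columns are created as character vectors explicitly, so the sketch behaves the same regardless of the R version:

```r
A <- data.frame(
  date_A = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-02",
             "2010-02-03", "2010-01-01", "2010-01-02"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  date_B = c("2010-01-01", "2010-01-02", "2010-01-03"),
  X_B    = c(7.9, 8.5, 2.1),
  stringsAsFactors = FALSE
)

# For each date in A: the row of B holding that date, NA if absent
idx <- match(A$date_A, B$date_B)

# Use those row numbers to pull the values; NA indexes give NA values
A$X_A <- B$X_B[idx]
```

The 2010-02-03 row gets NA because that date does not occur in B.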
As for why the loop doesn't work: first, I think you really don't want to use ; in R scripts, ever. It makes code so much harder to read. It is best to learn to write clear code. In your code you can use assignment operators more consistently and use proper indenting. For example:
i <- 1
j <- 1
while (i <= nrow(A))
{
  while (j <= nrow(B))
  {
    if (A$date_A[i]==B$date_B[j]) A$X_A[i] <- B$X_B[j]
    j <- j+1
    if (j == nrow(B)) i <- i+1
    j <- 1
  }
}
This is your code, but it is much clearer to read. For me this does not run because the factor levels are not comparable (due to the typo), so I put in an as.character() call. This is probably not needed in the real dataset.
Indenting immediately shows the biggest problem here: you have misplaced j <- 1 outside the if (j == nrow(B)) part. Using ; terminates the statement and thus the conditional part. Because of this, j is set to 1 on every pass through the loop.
Changing that makes it run better, but you still get an error because the while loop for j might not finish before i becomes larger than the number of rows in A. This can be fixed by combining both conditions with AND and collapsing the two while loops into one. Finally, you need to change the if statement to test whether j is larger than the number of rows in B, or you omit one row. This should work:
i <- 1
j <- 1
while (j <= nrow(B) & i <= nrow(A))
{
  if (as.character(A$date_A[i])==as.character(B$date_B[j])) A$X_A[i] <- B$X_B[j]
  j <- j+1
  if (j > nrow(B))
  {
    i <- i+1
    j <- 1
  }
}
But this is only meant to show what went wrong; I'd never recommend doing something like this this way. Even when you really want to use loops, you are probably better off with for loops.
Wow! Your code scares me. At the very least, use a for loop for this kind of thing (although @Dwin's solution is the way to go for this problem):
for(i in seq(nrow(A)))
{
  for(j in seq(nrow(B)))
  {
    if(A$date_A[i]==B$date_B[j])
    {
      A$X_A[i] <- B$X_B[j]
    }
  }
}
This avoids all the ugliness of manually incrementing counters at the end of your while loops (in your own code, the j=1 needed to be moved outside the inner brackets, by the way).
Note: this code, like yours, does not handle the case where B contains two rows with the same date as a row in A (it will always use the value of the last such row in B). It is only meant to show how for can replace while for simple incremental loops.
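If the first matching row of B should win instead of the last, the inner loop can stop as soon as it finds a match. A small sketch with made-up toy data containing a duplicated date in B; this is the same choice that match() makes, since it also returns the first match:

```r
# Toy data: 2010-01-01 appears twice in B
A <- data.frame(date_A = c("2010-01-01", "2010-01-02"),
                stringsAsFactors = FALSE)
B <- data.frame(date_B = c("2010-01-01", "2010-01-01", "2010-01-02"),
                X_B    = c(1, 2, 3),
                stringsAsFactors = FALSE)

A$X_A <- NA_real_                 # create the target column up front
for (i in seq_len(nrow(A))) {
  for (j in seq_len(nrow(B))) {
    if (A$date_A[i] == B$date_B[j]) {
      A$X_A[i] <- B$X_B[j]
      break                       # stop at the first match for this row
    }
  }
}
```

The break also saves the remaining comparisons for that row of A.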
