Removing NAs when multiplying columns - r

This is a really simple question, but I am hoping someone will be able to help me avoid extra lines of unnecessary code. I have a simple dataframe:
Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
What I want to do is produce an extra column which is the multiplication of A, B and C, which I will then cbind to the original dataframe.
So, I would normally use:
attach(Df.1)
D<-A*B*C
But obviously where the NAs are in column C, I get an NA in variable D. I don't want to exclude all the NA rows, rather just ignore the NA values in this column (and then the value in D would simply be the multiplication of A and B, or where C was available, A*B*C.
I know I could simply replace the NAs with 1s, so the calculation remains unchanged, or use if statements, but I was wodnering what the simplist way of doing this is?
Any ideas?

You can use prod which has an na.rm argument. To do it by row use apply:
apply(Df.1,1,prod,na.rm=TRUE)
[1] 10 60 14 120 72 36

As #James said, prod and apply will work, but you don't need to waste memory storing it in a separate variable, or even cbinding it
Df.1$D = apply(Df.1, 1, prod, na.rm=T)
Assigning the new variable in the data frame directly will work.
> Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
> Df.1
A B C
1 5 1 2
2 4 5 3
3 7 2 NA
4 6 4 5
5 8 9 NA
6 4 1 9
> Df.1$D = apply(Df.1, 1, prod, na.rm=T)
> Df.1$D
[1] 10 60 14 120 72 36
> Df.1
A B C D
1 5 1 2 10
2 4 5 3 60
3 7 2 NA 14
4 6 4 5 120
5 8 9 NA 72
6 4 1 9 36

Related

R logical indexing by equality on multiple columns

Say I have the following dataframe, df:
A B C
1 4 25 a
2 3 79 b
3 4 25 c
4 6 17 d
5 4 21 e
6 5 25 f
How to I index the lines where elements in certain columns match a vector, e.g. df$A == 4 & df$B == 25?
I would have thought something like: df[df[,c("A", "B")] == c(4, 25),] would have worked, but this doesn't give me the right answer (it returns no lines).
I would like a method that uses a vector of column names to match on, and a vector of values to match.
Try this:
df[colSums(t(df[,c("A", "B")]) == c(4, 25))==2,]
x A B C
1 1 4 25 a
3 3 4 25 c
This works recycling the vector c(4, 25).
You could also use a simple merge, which would have analogies for data.table and dplyr and in sql as well:
merge(df, setNames(list(4,25),c("A","B")))
# A B C
#1 4 25 a
#2 4 25 c

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types. (Unless you want to resort to complex numbers.) head with a negative argument cuts the negated value of the argument from the tail, try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Skip NA values using "FUN=first"

there's probably really an simple explaination as to what I'm doing wrong, but I've been working on this for quite some time today and I still can not get this to work. I thought this would be a walk in the park, however, my code isn't quite working as expected.
So for this example, let's say I have a data frame as followed.
df
Row# user columnB
1 1 NA
2 1 NA
3 1 NA
4 1 31
5 2 NA
6 2 NA
7 2 15
8 3 18
9 3 16
10 3 NA
Basically, I would like to create a new column that uses the first (as well as last) function (within the TTR library package) to obtain the first non-NA value for each user. So my desired data frame would be this.
df
Row# user columnB firstValue
1 1 NA 31
2 1 NA 31
3 1 NA 31
4 1 31 31
5 2 NA 15
6 2 NA 15
7 2 15 15
8 3 18 18
9 3 16 18
10 3 NA 18
I've looked around mainly using google, but I couldn't really find my exact answer.
Here's some of my code that I've tried, but I didn't get the results that I wanted (note, I'm bringing this from memory, so there are quite a few more variations of these, but these are the general forms that I've been trying).
df$firstValue<-ave(df$columnB,df$user,FUN=first,na.rm=True)
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){x,first,na.rm=True})
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){first(x,na.rm=True)})
df$firstValue<-by(df,df$user,FUN=function(x){x,first,na.rm=True})
Failed, these just give the first value of each group, which would be NA.
Again, these are just a few examples from the top of my head, I played around with na.rm, using na.exclude, na.omit, na.action(na.omit), etc...
Any help would be greatly appreciated. Thanks.
A data.table solution
require(data.table)
DT <- data.table(df, key="user")
DT[, firstValue := na.omit(columnB)[1], by=user]
Here is a solution with plyr :
ddply(df, .(user), transform, firstValue=na.omit(columnB)[1])
Which gives :
Row user columnB firstValue
1 1 1 NA 31
2 2 1 NA 31
3 3 1 NA 31
4 4 1 31 31
5 5 2 NA 15
6 6 2 NA 15
7 7 2 15 15
8 8 3 18 18
9 9 3 16 18
If you want to capture the last value, you can do :
ddply(df, .(user), transform, firstValue=tail(na.omit(columnB),1))
Using data.table
library (data.table)
DT <- data.table(df, key="user")
DT <- setnames(DT[unique(DT[!is.na(columnB), list(columnB), by="user"])], "columnB.1", "first")
Using a very small helper function
finite <- function(x) x[is.finite(x)]
here is an one-liner using only standard R functions:
df <- cbind(df, firstValue = unlist(sapply(unique(df[,1]), function(user) rep(finite(df[df[,1] == user,2])[1], sum(df[,1] == user))))
For a better overview, here is the one-liner unfolded into a "multi-liner":
# for each user, find the first finite (in this case non-NA) value of the second column and replicate it as many times as the user has rows
# then, the results of all users are joined into one vector (unlist) and appended to the data frame as column
df <- cbind(
df,
firstValue = unlist(
sapply(
unique(df[,1]),
function(user) {
rep(
finite(df[df[,1] == user,2])[1],
sum(df[,1] == user)
)
}
)
)
)

R self reference

In R I find myself doing something like this a lot:
adataframe[adataframe$col==something]<-adataframe[adataframe$col==something)]+1
This way is kind of long and tedious. Is there some way for me
to reference the object I am trying to change such as
adataframe[adataframe$col==something]<-$self+1
?
Try package data.table and its := operator. It's very fast and very short.
DT[col1==something, col2:=col3+1]
The first part col1==something is the subset. You can put anything here and use the column names as if they are variables; i.e., no need to use $. Then the second part col2:=col3+1 assigns the RHS to the LHS within that subset, where the column names can be assigned to as if they are variables. := is assignment by reference. No copies of any object are taken, so is faster than <-, =, within and transform.
Also, soon to be implemented in v1.8.1, one end goal of j's syntax allowing := in j like that is combining it with by, see question: when should I use the := operator in data.table.
UDPDATE : That was indeed released (:= by group) in July 2012.
You should be paying more attention to Gabor Grothendeick (and not just in this instance.) The cited inc function on Matt Asher's blog does all of what you are asking:
(And the obvious extension works as well.)
add <- function(x, inc=1) {
eval.parent(substitute(x <- x + inc))
}
# Testing the `inc` function behavior
EDIT: After my temporary annoyance at the lack of approval in the first comment, I took the challenge of adding yet a further function argument. Supplied with one argument of a portion of a dataframe, it would still increment the range of values by one. Up to this point has only been very lightly tested on infix dyadic operators, but I see no reason it wouldn't work with any function which accepts only two arguments:
transfn <- function(x, func="+", inc=1) {
eval.parent(substitute(x <- do.call(func, list(x , inc)))) }
(Guilty admission: This somehow "feels wrong" from the traditional R perspective of returning values for assignment.) The earlier testing on the inc function is below:
df <- data.frame(a1 =1:10, a2=21:30, b=1:2)
inc <- function(x) {
eval.parent(substitute(x <- x + 1))
}
#---- examples===============>
> inc(df$a1) # works on whole columns
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 6 25 1
6 7 26 2
7 8 27 1
8 9 28 2
9 10 29 1
10 11 30 2
> inc(df$a1[df$a1>5]) # testing on a restricted range of one column
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 7 25 1
6 8 26 2
7 9 27 1
8 10 28 2
9 11 29 1
10 12 30 2
> inc(df[ df$a1>5, ]) #testing on a range of rows for all columns being transformed
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
# and even in selected rows and grepped names of columns meeting a criterion
> inc(df[ df$a1 <= 3, grep("a", names(df)) ])
> df
a1 a2 b
1 3 22 1
2 4 23 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
Here is what you can do. Let us say you have a dataframe
df = data.frame(x = 1:10, y = rnorm(10))
And you want to increment all the y by 1. You can do this easily by using transform
df = transform(df, y = y + 1)
I'd be partial to (presumably the subset is on rows)
ridx <- adataframe$col==something
adataframe[ridx,] <- adataframe[ridx,] + 1
which doesn't rely on any fancy / fragile parsing, is reasonably expressive about the operation being performed, and is not too verbose. Also tends to break lines into nicely human-parse-able units, and there is something appealing about using standard idioms -- R's vocabulary and idiosyncrasies are already large enough for my taste.

Re-inserting NAs into a vector

I have a vector of values which include NAs. The values need to be processed by an external program that can't handle NAs, so they are stripped out, its written to a file, processed, then read back in, resulting in a vector of the length of the number of non-NAs. Example, suppose the input is 7 3 4 NA 5 4 6 NA 1 NA, then the output would just be 7 values. What I need to do is re-insert the NAs in position.
So, given two vectors X and Y:
> X
[1] 64 1 9 100 16 NA 25 NA 4 49 36 NA 81
> Y
[1] 8 1 3 10 4 5 2 7 6 9
produce:
8 1 3 10 4 NA 5 NA 2 7 6 NA 9
(you may notice that X is Y^2, thats just for an example).
I could knock out a function to do this but I wonder if there's any nice tricksy ways of doing it... split, list, length... hmmm...
na.omit keeps an attribute of the locations of the NA in the original series, so you could use that to know where to put the missing values:
Y <- sqrt(na.omit(X))
Z <- rep(NA,length(Y)+length(attr(Y,"na.action")))
Z[-attr(Y,"na.action")] <- Y
#> Z
# [1] 8 1 3 10 4 NA 5 NA 2 7 6 NA 9
Answering my own question is probably very bad form, but I think this is probably about the neatest:
rena <- function(X,Z){
Y=rep(NA,length(X))
Y[!is.na(X)]=Z
Y
}
Can also try replace:
replace(X, !is.na(X), Y)
Another variant on the same theme
rena <- function(X,Z){
X[which(!is.na(X))]=Z
X
}
R automatically fills the rest with NA.
Edit: Corrected by Marek.

Resources