Skip NA values using "FUN=first" - r

There's probably a really simple explanation for what I'm doing wrong, but I've been working on this for quite some time today and I still cannot get it to work. I thought this would be a walk in the park, but my code isn't quite working as expected.
So for this example, let's say I have a data frame as follows.
df
Row# user columnB
1 1 NA
2 1 NA
3 1 NA
4 1 31
5 2 NA
6 2 NA
7 2 15
8 3 18
9 3 16
10 3 NA
Basically, I would like to create a new column that uses the first (as well as last) function (from the TTR package) to obtain the first non-NA value for each user. So my desired data frame would be this.
df
Row# user columnB firstValue
1 1 NA 31
2 1 NA 31
3 1 NA 31
4 1 31 31
5 2 NA 15
6 2 NA 15
7 2 15 15
8 3 18 18
9 3 16 18
10 3 NA 18
I've looked around, mainly using Google, but I couldn't really find my exact answer.
Here is some of the code I've tried that didn't give the results I wanted (note: I'm writing this from memory, so there are quite a few more variations of these, but these are the general forms I've been trying).
df$firstValue<-ave(df$columnB,df$user,FUN=first,na.rm=True)
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){x,first,na.rm=True})
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){first(x,na.rm=True)})
df$firstValue<-by(df,df$user,FUN=function(x){x,first,na.rm=True})
These failed; they just give the first value of each group, which is NA.
Again, these are just a few examples off the top of my head; I played around with na.rm, na.exclude, na.omit, na.action(na.omit), etc.
Any help would be greatly appreciated. Thanks.
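For reference, a minimal sketch to reproduce the example data, plus a working base-R ave() variant that drops NAs before taking the first element (column names assumed from the tables above):
# rebuild the example data frame
df <- data.frame(user    = c(1,1,1,1,2,2,2,3,3,3),
                 columnB = c(NA,NA,NA,31,NA,NA,15,18,16,NA))
# first non-NA value per user; ave() recycles the scalar result across each group
df$firstValue <- ave(df$columnB, df$user, FUN = function(x) na.omit(x)[1])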

A data.table solution
require(data.table)
DT <- data.table(df, key="user")
DT[, firstValue := na.omit(columnB)[1], by=user]
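Since the question also asks about the last non-NA value, the same idea extends naturally (a sketch mirroring the tail() approach used in the plyr answer below):
DT[, lastValue := tail(na.omit(columnB), 1), by=user]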

Here is a solution with plyr:
library(plyr)
ddply(df, .(user), transform, firstValue=na.omit(columnB)[1])
Which gives:
Row user columnB firstValue
1 1 1 NA 31
2 2 1 NA 31
3 3 1 NA 31
4 4 1 31 31
5 5 2 NA 15
6 6 2 NA 15
7 7 2 15 15
8 8 3 18 18
9 9 3 16 18
10 10 3 NA 18
If you want to capture the last value, you can do:
ddply(df, .(user), transform, firstValue=tail(na.omit(columnB),1))

Using data.table
library(data.table)
DT <- data.table(df, key="user")
# take the first non-NA columnB per user and join it back, renaming the joined column
DT <- setnames(DT[unique(DT[!is.na(columnB), list(columnB = columnB[1]), by="user"])], "columnB.1", "first")

Using a very small helper function
finite <- function(x) x[is.finite(x)]
Here is a one-liner using only standard R functions:
df <- cbind(df, firstValue = unlist(sapply(unique(df[,1]), function(user) rep(finite(df[df[,1] == user,2])[1], sum(df[,1] == user)))))
For a better overview, here is the one-liner unfolded into a "multi-liner":
# for each user, find the first finite (in this case non-NA) value of the second column and replicate it as many times as the user has rows
# then, the results for all users are joined into one vector (unlist) and appended to the data frame as a column
df <- cbind(
  df,
  firstValue = unlist(
    sapply(
      unique(df[,1]),
      function(user) {
        rep(
          finite(df[df[,1] == user,2])[1],
          sum(df[,1] == user)
        )
      }
    )
  )
)

Related

unpivot data frame - get original data series

I need to clean up the following data frame
df <- data.frame(metric=c(10,20,30,40,NA), cnt=c(1,2,1,2,2))
> df
metric cnt
1 10 1
2 20 2
3 30 1
4 40 2
5 NA 2
I need to go back to the original data series (un-pivot?), which would be like below.
metric
1 10
2 20
3 20
4 30
5 40
6 40
7 NA
8 NA
Is this a use case for tidyr? If so, a tidyr-based solution would also be helpful.
We can use rep
df1 <- data.frame(metric = rep(df$metric, df$cnt))
There is the function inverse.rle() for inverse RLE. See help("rle"):
df <- data.frame(metric=c(10,20,30,40,NA), cnt=c(1,2,1,2,2))
names(df) <- c("values", "lengths")
inverse.rle(df) # or
data.frame(metric=inverse.rle(df))
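Since the question explicitly asks about tidyr: a minimal sketch using tidyr::uncount(), which repeats each row according to a weights column (assuming tidyr >= 0.8.0 is available):
library(tidyr)
df <- data.frame(metric=c(10,20,30,40,NA), cnt=c(1,2,1,2,2))
uncount(df, cnt)   # repeats each row cnt times and drops the cnt column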

Full outer join of multiple dataframes stored as elements of a list using data.table

I'm trying to do a full outer join of multiple dataframes stored as elements of a list using data.table. I have successfully done this using the merge_recurse() function of the reshape package, but it is very slow with larger datasets, and I'd like to speed up the merge by using data.table. I'm not sure the best way for data.table to handle the list structure with multiple dataframes. I'm also not sure if I've written the Reduce() function correctly on unique keys to do a full outer join on multiple dataframes.
Here's a small example:
#Libraries
library("reshape")
library("data.table")
#Specify list of multiple dataframes
filelist <- list(data.frame(x=c(1,1,1,2,2,2,3,3,3), y=c(1,2,3,1,2,3,1,2,3), a=1:9),
                 data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,1), b=seq(from=0, by=5, length.out=9)),
                 data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,2), c=seq(from=0, by=10, length.out=9)))
#Merge with merge_recurse()
listMerged <- merge_recurse(filelist, by=c("x","y"))
#Attempt with data.table
ids <- lapply(filelist, function(x) x[,c("x","y")])
unique_keys <- unique(do.call("rbind", ids))
dt <- data.table(filelist)
setkey(dt, c("x","y")) #error here
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
Here's my expected output:
> listMerged
x y a b c
1 1 1 1 0 0
2 1 2 2 5 10
3 1 3 3 10 20
4 2 1 4 15 30
5 2 2 5 20 40
6 2 3 6 25 50
7 3 1 7 30 60
8 3 2 8 35 70
9 3 3 9 NA NA
10 4 1 NA 40 NA
11 4 2 NA NA 80
Here are my resources:
Suggestion to use Reduce() function on data.table (see last comment of answer)
Suggestion to use "unique keys" to do full outer join in data.table
This worked for me:
library("reshape")
library("data.table")
##
filelist <- list(
  data.frame(
    x=c(1,1,1,2,2,2,3,3,3),
    y=c(1,2,3,1,2,3,1,2,3),
    a=1:9),
  data.frame(
    x=c(1,1,1,2,2,2,3,3,4),
    y=c(1,2,3,1,2,3,1,2,1),
    b=seq(from=0, by=5, length.out=9)),
  data.frame(
    x=c(1,1,1,2,2,2,3,3,4),
    y=c(1,2,3,1,2,3,1,2,2),
    c=seq(from=0, by=10, length.out=9)))
##
## I used copy so that this would
## not modify 'filelist'
dtList <- copy(filelist)
lapply(dtList,setDT)
lapply(dtList, function(x){
  setkeyv(x, cols=c("x","y"))
})
##
> Reduce(function(x,y){
    merge(x, y, all=TRUE, allow.cartesian=TRUE)
  }, dtList)
x y a b c
1: 1 1 1 0 0
2: 1 2 2 5 10
3: 1 3 3 10 20
4: 2 1 4 15 30
5: 2 2 5 20 40
6: 2 3 6 25 50
7: 3 1 7 30 60
8: 3 2 8 35 70
9: 3 3 9 NA NA
10: 4 1 NA 40 NA
11: 4 2 NA NA 80
Also I noticed a couple of problems in your code. dt <- data.table(filelist) resulted in
> dt
filelist
1: <data.frame>
2: <data.frame>
3: <data.frame>
which is most likely the cause of the error in setkey(dt, c("x","y")) that you pointed out above. Also, did this work for you?
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
I'm just curious, because I was getting an error when I tried to run it (using dtList instead of filelist)
Error in eval(expr, envir, enclos) : could not find function "J"
which I believe has to do with the changes implemented since version 1.8.8 of data.table, explained by #Arun in this answer.
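As a side note (a sketch assuming a reasonably current data.table), the key-setting steps can be skipped entirely by telling merge() which columns to join on:
library(data.table)
dtList <- lapply(filelist, as.data.table)
# full outer join across all tables on the (x, y) pairs
Reduce(function(x, y) merge(x, y, by=c("x","y"), all=TRUE), dtList)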

get z standardized score within each group

Here is the data.
set.seed(23)
data <- data.frame(ID=rep(1:12), group=rep(1:3, times=4), value=rnorm(12, mean=0.5, sd=0.3))
ID group value
1 1 1 0.4133934
2 2 2 0.6444651
3 3 3 0.1350871
4 4 1 0.5924411
5 5 2 0.3439465
6 6 3 0.3673059
7 7 1 0.3202062
8 8 2 0.8883733
9 9 3 0.7506174
10 10 1 0.3301955
11 11 2 0.7365258
12 12 3 0.1502212
I want to get z-standardized scores within each group, so I try:
library(weights)
data_split<-split(data, data$group) #split the dataframe
stan<-lapply(data_split, function(x) stdz(x$value)) #compute z-scores within group
However, it looks wrong because I want to add the new variable as a column alongside 'value'.
How can I do that? Kindly provide some suggestions (sample code). Any help is greatly appreciated.
Use this instead:
within(data, stan <- ave(value, group, FUN=stdz))
No need to call split nor lapply.
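If you would rather not depend on the weights package, the same ave() pattern works with a plain z-score function (a sketch using the usual (x - mean(x)) / sd(x) definition):
within(data, stan <- ave(value, group, FUN = function(x) (x - mean(x)) / sd(x)))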
One way using the data.table package:
library(data.table)
library(weights)
set.seed(23)
data <- data.table(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
setkey(data, ID)
dataNew <- data[, list(ID, stan = stdz(value)), by = 'group']
the result is:
group ID stan
1: 1 1 -0.6159312
2: 1 4 0.9538398
3: 1 7 -1.0782747
4: 1 10 0.7403661
5: 2 2 -1.2683237
6: 2 5 0.7839781
7: 2 8 0.8163844
8: 2 11 -0.3320388
9: 3 3 0.6698418
10: 3 6 0.8674548
11: 3 9 -0.2131335
12: 3 12 -1.3241632
I tried Ferdinand.Kraft's solution but it didn't work for me. I think the stdz function isn't included in the basic R install. Moreover, the within part troubled me in a large dataset with many variables. I think the easiest way is:
data$value.s <- ave(data$value, data$group, FUN=scale)
Add the new column inside your function, and have the function return the whole data frame.
stanL <- lapply(data_split, function(x) {
  x$stan <- stdz(x$value)
  x
})
stan <- do.call(rbind, stanL)

R - Create a column with entries only for the first row of each subset

For instance if I have this data:
ID Value
1 2
1 2
1 3
1 4
1 10
2 9
2 9
2 12
2 13
And my goal is to find the smallest value for each ID subset, and I want the number to be in the first row of the ID group while leaving the other rows blank, such that:
ID Value Start
1 2 2
1 2
1 3
1 4
1 10
2 9 9
2 9
2 12
2 13
My first instinct is to create an index for the IDs using
A <- transform(A, INDEX=ave(ID, ID, FUN=seq_along)) ## A being the name of my data
Since I am a noob, I get stuck at this point. For each ID=n, I want to find the min(A$Value) for that ID subset and place that into the cell matching condition of ID=n and INDEX=1.
Any help is much appreciated! I am sorry that I keep asking questions :(
Here's a solution:
within(A, INDEX <- "is.na<-"(ave(Value, ID, FUN = min), c(FALSE, !diff(ID))))
ID Value INDEX
1 1 2 2
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 1 10 NA
6 2 9 9
7 2 9 NA
8 2 12 NA
9 2 13 NA
Update:
How does it work? The command ave(Value, ID, FUN = min) applies the function min to each subset of Value along the values of ID. For the example, it returns a vector of five 2s followed by four 9s. Since all values except the first in each subset should be NA, the function "is.na<-" sets to NA the values at the logical index defined by c(FALSE, !diff(ID)). This index is TRUE wherever the ID is identical to the one in the preceding row.
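To see what that logical index looks like, here is a small illustration (not from the original answer):
ID <- c(1, 1, 1, 2, 2)
c(FALSE, !diff(ID))   # FALSE  TRUE  TRUE FALSE  TRUE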
You're almost there. We just need to make a custom function instead of seq_along and to split Value by ID (not ID by ID).
first_min <- function(x){
  nas <- rep(NA, length(x))
  nas[which.min(x)] <- min(x, na.rm=TRUE)
  nas
}
This function makes a vector of NAs and puts the group's minimum at the position where the minimum occurs (in this example, the first row of each group).
transform(dat, INDEX=ave(Value, ID, FUN=first_min))
## ID Value INDEX
## 1 1 2 2
## 2 1 2 NA
## 3 1 3 NA
## 4 1 4 NA
## 5 1 10 NA
## 6 2 9 9
## 7 2 9 NA
## 8 2 12 NA
## 9 2 13 NA
You can achieve this with a tapply one-liner
df$Start <- as.vector(unlist(tapply(df$Value, df$ID, FUN = function(x){ c(min(x), rep("", length(x)-1)) })))
I keep going back to this question and the above answers helped me greatly.
There is a basic solution for beginners too:
A$Start<-NA
A[!duplicated(A$ID),]$Start<-A[!duplicated(A$ID),]$Value
Thanks.
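For completeness, a data.table sketch of the same idea (an addition in the spirit of the other data.table examples in this thread, not from the original answers):
library(data.table)
A <- data.table(ID = c(1,1,1,1,1,2,2,2,2), Value = c(2,2,3,4,10,9,9,12,13))
A[, Start := c(min(Value), rep(NA_real_, .N - 1)), by = ID]   # group minimum in the first row, NA elsewhere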

R self reference

In R I find myself doing something like this a lot:
adataframe[adataframe$col==something]<-adataframe[adataframe$col==something]+1
This way is kind of long and tedious. Is there some way for me
to reference the object I am trying to change such as
adataframe[adataframe$col==something]<-$self+1
?
Try package data.table and its := operator. It's very fast and very short.
DT[col1==something, col2:=col3+1]
The first part, col1==something, is the subset. You can put anything here and use the column names as if they were variables; i.e., there is no need to use $. Then the second part, col2:=col3+1, assigns the RHS to the LHS within that subset, where the column names can be assigned to as if they were variables. := is assignment by reference. No copies of any object are taken, so it is faster than <-, =, within and transform.
Also, soon to be implemented in v1.8.1, one end goal of j's syntax allowing := in j like that is combining it with by, see question: when should I use the := operator in data.table.
UPDATE: That was indeed released (:= by group) in July 2012.
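A small self-contained sketch of that pattern (example data made up for illustration):
library(data.table)
DT <- data.table(col1 = c("a", "b", "a"), col2 = c(10, 20, 30), col3 = c(1, 2, 3))
DT[col1 == "a", col2 := col3 + 1]   # updates col2 by reference, only where col1 == "a"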
You should be paying more attention to Gabor Grothendieck (and not just in this instance). The cited inc function on Matt Asher's blog does all of what you are asking:
(And the obvious extension works as well.)
add <- function(x, inc=1) {
  eval.parent(substitute(x <- x + inc))
}
EDIT: After my temporary annoyance at the lack of approval in the first comment, I took up the challenge of adding yet another function argument. Supplied with one argument that is a portion of a dataframe, it will still increment the range of values by one. Up to this point it has only been very lightly tested on infix dyadic operators, but I see no reason it wouldn't work with any function which accepts only two arguments:
transfn <- function(x, func="+", inc=1) {
  eval.parent(substitute(x <- do.call(func, list(x, inc))))
}
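A quick usage sketch of transfn (examples made up for illustration):
df <- data.frame(a1 = 1:10, a2 = 21:30, b = 1:2)
transfn(df$a1)           # default "+" and inc=1: adds 1 to the whole column in place
transfn(df$a2, "*", 2)   # multiplies a2 by 2 in place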
(Guilty admission: This somehow "feels wrong" from the traditional R perspective of returning values for assignment.) The earlier testing on the inc function is below:
df <- data.frame(a1 =1:10, a2=21:30, b=1:2)
inc <- function(x) {
  eval.parent(substitute(x <- x + 1))
}
#---- examples===============>
> inc(df$a1) # works on whole columns
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 6 25 1
6 7 26 2
7 8 27 1
8 9 28 2
9 10 29 1
10 11 30 2
> inc(df$a1[df$a1>5]) # testing on a restricted range of one column
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 7 25 1
6 8 26 2
7 9 27 1
8 10 28 2
9 11 29 1
10 12 30 2
> inc(df[ df$a1>5, ]) #testing on a range of rows for all columns being transformed
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
# and even in selected rows and grepped names of columns meeting a criterion
> inc(df[ df$a1 <= 3, grep("a", names(df)) ])
> df
a1 a2 b
1 3 22 1
2 4 23 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
Here is what you can do. Let us say you have a dataframe
df = data.frame(x = 1:10, y = rnorm(10))
And you want to increment all the y values by 1. You can do this easily using transform:
df = transform(df, y = y + 1)
I'd be partial to (presumably the subset is on rows)
ridx <- adataframe$col==something
adataframe[ridx,] <- adataframe[ridx,] + 1
which doesn't rely on any fancy/fragile parsing, is reasonably expressive about the operation being performed, and is not too verbose. It also tends to break lines into nicely human-parseable units, and there is something appealing about using standard idioms -- R's vocabulary and idiosyncrasies are already large enough for my taste.
