row numbers for explicit rows in r - r

I need to get row numbers for explicit rows grouped over id. Let's say dataframe (df) looks like this:
id a b
3 2 NA
3 3 2
3 10 NA
3 21 0
3 2 NA
4 1 5
4 1 0
4 5 NA
I need to create one more column that would give row number sequence excluding the case where b == 0.
desired output:
id a b row
3 2 NA 1
3 3 2 2
3 10 NA 3
3 21 0 -
3 2 NA 4
4 1 5 1
4 1 0 -
4 5 NA 2
I used dplyr but was not able to achieve this.
My code:
df <- df %>%
group_by(id) %>%
mutate(row = row_number(id[b != 0]))
Please suggest some better way to do this.

I would propose using the data.table package for its nice capability of operating on subsets, thus avoiding inefficient operations such as ifelse or evaluating the whole data set. Also, it is better to keep your vector numeric (for future operations), so NA is probably preferable to - (character). Here's a possible solution:
library(data.table)
setDT(df)[is.na(b) | b != 0, row := seq_len(.N), by = id]
# id a b row
# 1: 3 2 NA 1
# 2: 3 3 2 2
# 3: 3 10 NA 3
# 4: 3 21 0 NA
# 5: 3 2 NA 4
# 6: 4 1 5 1
# 7: 4 1 0 NA
# 8: 4 5 NA 2
The idea here is to operate only on the rows where is.na(b) | b != 0 and generate a sequence of each group size (.N) while updating row in place (using :=). All the rest of the rows will be assigned with NAs by default.
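For readers without data.table, the same idea (number only the rows that pass the filter and leave the rest as NA) can be sketched in base R. This is an illustrative alternative, not part of the answer above; keep is a helper vector introduced here for the example.

```r
# Sample data from the question
df <- data.frame(
  id = c(3, 3, 3, 3, 3, 4, 4, 4),
  a  = c(2, 3, 10, 21, 2, 1, 1, 5),
  b  = c(NA, 2, NA, 0, NA, 5, 0, NA)
)

# Rows to number: b is missing or non-zero (hypothetical helper vector)
keep <- is.na(df$b) | df$b != 0

# Start with NA everywhere, then fill a per-id sequence for the kept rows only
df$row <- NA_integer_
df$row[keep] <- ave(seq_along(keep)[keep], df$id[keep], FUN = seq_along)
df
```

Unlike the data.table version, this copies the column rather than updating it in place, so it is mainly useful on small data.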

Related

Set values of a column to NA after a given point

I have a dataset like this:
ID NUMBER X
1 5 2
1 3 4
1 6 3
1 2 5
2 7 3
2 3 5
2 9 3
2 4 2
and I'd like to set the values of variable X to NA once the variable NUMBER increases (even if it decreases again afterwards) for each ID, obtaining:
ID NUMBER X
1 5 2
1 3 4
1 6 NA
1 2 NA
2 7 3
2 3 5
2 9 NA
2 4 NA
How can I do it?
Thanks for your help!
Surely not the most elegant solution, but it is quite intuitive:
library(data.table)
setDT(d)
d[, n := ifelse(NUMBER > shift(NUMBER, 1, type = "lag"), 1, 0), by = ID]
d[is.na(n), n := 0]
d[, n := cumsum(n), by=ID]
d[n>0, X := NA ]
d
ID NUMBER X n
1: 1 5 2 0
2: 1 3 4 0
3: 1 6 NA 1
4: 1 2 NA 1
5: 2 7 3 0
6: 2 3 5 0
7: 2 9 NA 1
8: 2 4 NA 1
You can do this with the dplyr package. If your data frame is called df, then you can use this code:
library(dplyr)
df %>% group_by(ID) %>%
mutate(X = c(X[1:min(which(diff(NUMBER) > 0))], rep(NA, length(X) - min(which(diff(NUMBER) > 0))))) %>%
as.data.frame()
I first grouped by ID and then found the first increase in NUMBER with diff and which. Note that this assumes every ID has at least one increase; otherwise which returns an empty vector and the indexing breaks.
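The diff-based logic can also be written so that groups without any increase are left untouched. The following is a minimal base R sketch under that assumption; mask_after_increase is a hypothetical helper name, not something from either answer.

```r
# Sample data from the question
d <- data.frame(
  ID     = rep(1:2, each = 4),
  NUMBER = c(5, 3, 6, 2, 7, 3, 9, 4),
  X      = c(2, 4, 3, 5, 3, 5, 3, 2)
)

# Hypothetical helper: blank out x from the first increase in number onward
mask_after_increase <- function(x, number) {
  increased <- c(FALSE, diff(number) > 0)  # TRUE where number rose vs. the previous row
  x[cumsum(increased) > 0] <- NA           # everything at or after the first rise
  x
}

# Apply per ID and put the pieces back in the original row order
pieces <- lapply(split(d, d$ID), function(g) mask_after_increase(g$X, g$NUMBER))
d$X <- unsplit(pieces, d$ID)
d
```

Because cumsum stays at zero when NUMBER never increases, such a group keeps all of its X values.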

spreading data in R - allowing multiple values per cell

With these data
d <- data.frame(time=1:5, side=c("r","r","r","l","l"), val = c(1,2,1,2,1))
d
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
We can spread to a tidy dataframe like this:
library(tidyverse)
d %>% spread(side,val)
Which gives:
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1 NA
But say we have more than one val for a given time/side. For example:
d <- data.frame(time=c(1:5,5), side=c("r","r","r","l","l","l"), val = c(1,2,1,2,1,2))
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
6 5 l 2
Now this won't work because of duplicated values:
d %>% spread(side,val)
Error: Duplicate identifiers for rows (5, 6)
Is there an efficient way to force this behavior (or an alternative)? The output would be e.g.
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1, 2 NA
The data.table/reshape2 equivalent of tidyr::spread is dcast. It has a more complicated syntax than spread, but it's more flexible. To accomplish your task we can use the below chunk.
We use the formula to 'spread' side by time (filling with the values in the val column), provide the fill value of NA, and specify we want to list elements together when aggregation is needed per value of time.
library(data.table)
d <- data.table(time=c(1:5,5),
side=c("r","r","r","l","l","l"),
val = c(1,2,1,2,1,2))
data.table::dcast(d, time ~ side,
value.var='val',
fill=NA,
fun.aggregate=list)
#OUTPUT
# time l r
# 1: 1 NA 1
# 2: 2 NA 2
# 3: 3 NA 1
# 4: 4 2 NA
# 5: 5 1,2 NA
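If a plain character cell such as "1, 2" is enough (rather than a list column), one possibility is to paste the duplicates together while reshaping. Below is a hedged base R sketch using tapply; it returns a character matrix rather than a data.table, and the name wide is introduced here only for illustration.

```r
# Data from the question, including the duplicated time/side pair
d <- data.frame(time = c(1:5, 5),
                side = c("r", "r", "r", "l", "l", "l"),
                val  = c(1, 2, 1, 2, 1, 2))

# One cell per time/side pair; duplicates are pasted together, empty cells are NA
wide <- tapply(d$val, list(time = d$time, side = d$side),
               function(x) paste(x, collapse = ", "))
wide
```

The trade-off is that the values become strings, so the list-column approach above is preferable when the numbers are needed for later computation.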

data.table and pmin with na.rm=TRUE argument

I am trying to calculate the minimum across rows using the pmin function and data.table (similar to the post row-by-row operations and updates in data.table) but with a character list of columns using something like the with=FALSE syntax, and with the na.rm=TRUE argument.
DT <- data.table(x = c(1,1,2,3,4,1,9),
y = c(2,4,1,2,5,6,6),
z = c(3,5,1,7,4,5,3),
a = c(1,3,NA,3,5,NA,2))
> DT
x y z a
1: 1 2 3 1
2: 1 4 5 3
3: 2 1 1 NA
4: 3 2 7 3
5: 4 5 4 5
6: 1 6 5 NA
7: 9 6 3 2
I can calculate the minimum across rows using columns directly:
DT[,min_val := pmin(x,y,z,a,na.rm=TRUE)]
giving
> DT
x y z a min_val
1: 1 2 3 1 1
2: 1 4 5 3 1
3: 2 1 1 NA 1
4: 3 2 7 3 2
5: 4 5 4 5 4
6: 1 6 5 NA 1
7: 9 6 3 2 2
However, I am trying to do this over a large, automatically generated set of columns, and I want to be able to do it across an arbitrary list of columns stored in a col_names variable, e.g. col_names <- c("a","y","z")
I can do this:
DT[, col_min := do.call(pmin,DT[,col_names,with=FALSE])]
But it gives me NA values. I can't figure out how to pass the na.rm=TRUE argument into the do.call. I've tried defining the function as
DT[, col_min := do.call(function(x) pmin(x,na.rm=TRUE),DT[,col_names,with=FALSE])]
but this gives me an error. I also tried passing in the argument as an additional element in a list, but I think pmin (or do.call) gets confused between the DT non-standard evaluation of column names and the argument.
Any ideas?
If we need the minimum value of each row across the whole dataset, call pmin on .SD through do.call, concatenating na.rm=TRUE as a list element onto .SD so that it is passed to pmin as a named argument.
DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE)))]
DT
# x y z a col_min
#1: 1 2 3 1 1
#2: 1 4 5 3 1
#3: 2 1 1 NA 1
#4: 3 2 7 3 2
#5: 4 5 4 5 4
#6: 1 6 5 NA 1
#7: 9 6 3 2 2
If we want to do this only for a subset of columns whose names are stored in 'col_names', use .SDcols.
DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE))),
.SDcols= col_names]
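The argument-list trick is not specific to data.table. As a small sketch on a plain data frame (the names df and args are chosen here for illustration), do.call receives the selected columns plus na.rm = TRUE as one list:

```r
# Same values as DT in the question, as a base data frame
df <- data.frame(x = c(1, 1, 2, 3, 4, 1, 9),
                 y = c(2, 4, 1, 2, 5, 6, 6),
                 z = c(3, 5, 1, 7, 4, 5, 3),
                 a = c(1, 3, NA, 3, 5, NA, 2))
col_names <- c("a", "y", "z")

# c() on a data frame and a list yields one list: the columns plus na.rm = TRUE
args <- c(df[col_names], list(na.rm = TRUE))

# Equivalent to pmin(a = df$a, y = df$y, z = df$z, na.rm = TRUE)
df$col_min <- do.call(pmin, args)
df
```

The named element na.rm is matched to pmin's na.rm parameter, while the column vectors fall into its ... argument.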

Why didn't mutate fill all rows? Was using mutate and ifelse to look up imputed values from another dataframe

Here is the deal. I was trying to use mutate from the plyr package to look up an appropriate value from another dataframe if the v variable in the original dataframe was NA. The looked-up value is supposed to go into a new variable, imputed. I also defined a custom function for this lookup.
Here is the code:
if(!require(plyr)){
install.packages("plyr")
library(plyr)
}
df = data.frame(d=c(1,1,1,2,2,2,3,3,3),
g=rep(c(1,2,3),3),
v=c(5,NA,NA,5,NA,NA,5,NA,NA))
imputed = data.frame(g=c(1,2,3),
v=c(5,10,15))
getImputed = function(p){
imputed[imputed$g==p,"v"]
}
df = mutate(df,imputed=ifelse(is.na(v),getImputed(g),v))
df
And this is the resulting dataframe:
d g v imputed
1 1 1 5 5
2 1 2 NA 10
3 1 3 NA 15
4 2 1 5 5
5 2 2 NA NA
6 2 3 NA NA
7 3 1 5 5
8 3 2 NA NA
9 3 3 NA NA
As one can see, only the first 3 rows were successfully filled in by mutate. It is likely that the ifelse function is the issue, but I can't see why : (
What is weird is that, if the imputed dataframe has 4 rows, like this:
imputed = data.frame(g=c(1,2,3,4),
v=c(5,10,15,20))
then the df dataframe was filled up properly:
d g v imputed
1 1 1 5 5
2 1 2 NA 10
3 1 3 NA 15
4 2 1 5 5
5 2 2 NA 10
6 2 3 NA 15
7 3 1 5 5
8 3 2 NA 10
9 3 3 NA 15
but R gave me a warning saying:
Warning message:
In imputed$g == p :
longer object length is not a multiple of shorter object length
Am I overlooking something?
The problem is your getImputed function. The mutate function does not iterate over rows. It passes columns as vectors to functions, so each function is basically called once. Your getImputed function works if you pass a single value, but not so well with a vector:
getImputed(1)
# [1] 5
getImputed(c(1,2))
# [1] 5 10
# Warning message:
# In imputed$g == p :
# longer object length is not a multiple of shorter object length
A better way to write the function would be
getImputed2 <- function(p){
imputed$v[match(p, imputed$g)]
}
This will properly handle a vector of values
mutate(df,imputed=ifelse(is.na(v),getImputed2(g),v))
# d g v imputed
# 1 1 1 5 5
# 2 1 2 NA 10
# 3 1 3 NA 15
# 4 2 1 5 5
# 5 2 2 NA 10
# 6 2 3 NA 15
# 7 3 1 5 5
# 8 3 2 NA 10
# 9 3 3 NA 15
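To see why the two lookups differ, it may help to run them side by side on the full g column. This is a small base R sketch; the variable names p, recycled and matched are introduced here for illustration.

```r
imputed <- data.frame(g = c(1, 2, 3), v = c(5, 10, 15))
p <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)   # the g column of df

# == compares element by element, recycling imputed$g; the result is a
# 9-element row filter applied to a 3-row data frame, so rows 4 to 9 are NA
recycled <- imputed[imputed$g == p, "v"]

# match() returns, for each element of p, its position in imputed$g
matched <- imputed$v[match(p, imputed$g)]

recycled
matched
```

The first form only looks right here because g happens to repeat in the same order as imputed$g; match does not depend on that ordering.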
You might also consider joining and replacing
mutate(join(df, setNames(imputed, c("g","v2")), by=c(g="g")),
v=ifelse(is.na(v), v2, v), v2=NULL)

Populating a data frame with corresponding values from another

I have a data frame containing values read in from an experiment with independent variables A and B which doesn't cover all possible permutations of A and B. I need to create a data frame which does contain all permutations, with zeros in those places where that particular pair of values isn't present in the data.
To create some sample data,
interactions <- unique(data.frame(A = sample(1:5, 10, replace=TRUE),
B = sample(1:5, 10, replace=TRUE)))
interactions <- interactions[interactions$A < interactions$B, ]
interactions$val <- runif(nrow(interactions))
possible.interactions <- data.frame(t(combn(1:5, 2)))
names(possible.interactions) <- c('A', 'B')
which creates
interactions
A B val
1 5 0.6881106
1 2 0.5286560
2 4 0.5026426
and
possible.interactions
A B
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
and I want to output
A B val
1 2 0.5286560
1 3 NA
1 4 NA
1 5 0.6881106
2 3 NA
2 4 0.5026426
2 5 NA
3 4 NA
3 5 NA
4 5 NA
What is the fastest way to do this?
Here is a base solution that is much faster (~10x) than merge:
possible.interactions$val <- interactions$val[
match(
do.call(paste, possible.interactions),
do.call(paste, interactions[1:2])
) ]
This produces (note, different to what you expect b/c you didn't set seed):
# A B val
# 1 1 2 0.59809242
# 2 1 3 0.92861520
# 3 1 4 0.64279549
# 4 1 5 NA
# 5 2 3 0.03554058
# 6 2 4 NA
# 7 2 5 NA
# 8 3 4 NA
# 9 3 5 NA
# 10 4 5 NA
This assumes A & B do not contain spaces and that interactions has no duplicate A-B pairs (will always match to first).
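The separator assumption can be made explicit by building the keys by hand. Here is a hedged sketch using the three sample rows printed in the question (so the result is deterministic); the "_" separator and the names key_all and key_known are assumptions made for this example.

```r
# All unordered pairs of 1:5, as in the question
possible.interactions <- data.frame(t(combn(1:5, 2)))
names(possible.interactions) <- c("A", "B")

# The three observed rows shown in the question
interactions <- data.frame(A = c(1, 1, 2), B = c(5, 2, 4),
                           val = c(0.6881106, 0.5286560, 0.5026426))

# One key per row with an explicit separator (assumption: "_" never occurs in A or B)
key_all   <- paste(possible.interactions$A, possible.interactions$B, sep = "_")
key_known <- paste(interactions$A, interactions$B, sep = "_")

# match() gives NA for pairs that were never observed, so val becomes NA there
possible.interactions$val <- interactions$val[match(key_all, key_known)]
possible.interactions
```

If zeros are wanted instead of NA, the missing entries can be replaced afterwards with possible.interactions$val[is.na(possible.interactions$val)] <- 0.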
And the data.table version:
library(data.table)
possible.DT <- data.table(possible.interactions)
DT <- data.table(interactions, key=c("A", "B"))
DT[possible.DT]
Though this is only worthwhile if your tables are large or you have uses for other benefits of data.table. I've found speed to be comparable to match in simple cases if you include the overhead of creating and keying the tables. I'm sure there are cases where data.table is much faster, especially if you key once and then use that key a lot.
For completeness, here is the merge version:
merge(possible.interactions, interactions, all.x=TRUE)
If order is important to you, I recommend using join from the plyr package, as opposed to merge, which does not provide an intuitive ordering when there are unmatched elements.
library(plyr)
join(interactions,possible.interactions,type="right")
Joining by: A, B
A B val
1 1 2 NA
2 1 3 NA
3 1 4 0.007602083
4 1 5 0.853415110
5 2 3 NA
6 2 4 0.321098658
7 2 5 NA
8 3 4 NA
9 3 5 NA
10 4 5 NA
