Spreading data in R - allowing multiple values per cell

With these data
d <- data.frame(time=1:5, side=c("r","r","r","l","l"), val = c(1,2,1,2,1))
d
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
We can spread to a tidy dataframe like this:
library(tidyverse)
d %>% spread(side,val)
Which gives:
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1 NA
But say we have more than one val for a given time/side. For example:
d <- data.frame(time=c(1:5,5), side=c("r","r","r","l","l","l"), val = c(1,2,1,2,1,2))
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
6 5 l 2
Now this won't work because of duplicated values:
d %>% spread(side,val)
Error: Duplicate identifiers for rows (5, 6)
Is there an efficient way to force this behavior (or an alternative)? The output would be, e.g.:
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1, 2 NA

The data.table/reshape2 equivalent of tidyr::spread is dcast. It has a more complicated syntax than spread, but it's more flexible. To accomplish your task we can use the below chunk.
We use the formula to 'spread' side by time (filling with the values in the val column), provide the fill value of NA, and specify we want to list elements together when aggregation is needed per value of time.
library(data.table)
d <- data.table(time=c(1:5,5),
side=c("r","r","r","l","l","l"),
val = c(1,2,1,2,1,2))
data.table::dcast(d, time ~ side,
value.var='val',
fill=NA,
fun.aggregate=list)
#OUTPUT
# time l r
# 1: 1 NA 1
# 2: 2 NA 2
# 3: 3 NA 1
# 4: 4 2 NA
# 5: 5 1,2 NA
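If you would rather have a plain character cell such as "1, 2" than a list column, one base R alternative is sketched below. This is not the dcast method above: it swaps in tapply with toString to paste duplicates together, and time/side pairs with no data come back as NA. The names m and wide are made up for the illustration.

```r
d <- data.frame(time = c(1:5, 5),
                side = c("r", "r", "r", "l", "l", "l"),
                val  = c(1, 2, 1, 2, 1, 2))

# One cell per time/side pair; duplicates collapse into a single string
m <- tapply(d$val, list(d$time, d$side), toString)

# Turn the character matrix back into a data frame with an explicit time column
wide <- data.frame(time = as.integer(rownames(m)), m,
                   row.names = NULL, stringsAsFactors = FALSE)
wide
```

Note that l and r become character columns here, so this suits display more than further arithmetic.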

Related

Set values of a column to NA after a given point

I have a dataset like this:
ID NUMBER X
1 5 2
1 3 4
1 6 3
1 2 5
2 7 3
2 3 5
2 9 3
2 4 2
and I'd like to set the values of variable X to NA from the point where NUMBER first increases within each ID (even if it decreases again afterwards), obtaining:
ID NUMBER X
1 5 2
1 3 4
1 6 NA
1 2 NA
2 7 3
2 3 5
2 9 NA
2 4 NA
How can I do it?
Thanks for your help!
Surely not the most elegant solution, but it is quite intuitive:
library(data.table)
setDT(d)
d[, n := ifelse(NUMBER > shift(NUMBER, 1, type = "lag"), 1, 0), by = ID]
d[is.na(n), n := 0]
d[, n := cumsum(n), by=ID]
d[n>0, X := NA ]
d
ID NUMBER X n
1: 1 5 2 0
2: 1 3 4 0
3: 1 6 NA 1
4: 1 2 NA 1
5: 2 7 3 0
6: 2 3 5 0
7: 2 9 NA 1
8: 2 4 NA 1
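You can do this with the dplyr package. If your data frame is called df, you can use this code:
library(dplyr)
df %>% group_by(ID) %>%
  mutate(X = c(X[1:min(which(diff(NUMBER) > 0))], rep(NA, length(X) - min(which(diff(NUMBER) > 0))))) %>%
  as.data.frame()
I first group by ID, then use diff and which to find the first position where NUMBER increases. X is kept up to that position and set to NA afterwards. Note that this assumes every ID has at least one increase; otherwise which() returns nothing and the indexing fails.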
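Both answers rest on the same idea: flag each increase with diff, then take a running total per ID so that every row from the first increase onwards is marked. A minimal base R sketch of that idea, using the data from the question (after_increase is a name made up for the illustration):

```r
d <- data.frame(ID     = rep(1:2, each = 4),
                NUMBER = c(5, 3, 6, 2, 7, 3, 9, 4),
                X      = c(2, 4, 3, 5, 3, 5, 3, 2))

# Per ID: 1 where NUMBER is larger than the previous row, then a running sum,
# so the flag stays positive once the first increase has happened
after_increase <- ave(d$NUMBER, d$ID,
                      FUN = function(v) cumsum(c(0, diff(v) > 0))) > 0

d$X[after_increase] <- NA
d
```

Unlike the which/min approach, this leaves X untouched for an ID in which NUMBER never increases.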

Index and count unique combination of variables using R, but do NOT remove duplicates

Take this data frame for example:
DT <- data.table(A = rep(1:3, each=4),
B = rep(c(NA,1,2,4), each=3),
C = rep(1:2, 6))
I want to append a column that assign index to unique combinations of A and B, but ignore C. I also want another column that count the number of duplicates, that looks like this:
A B C Index Count
1: 1 NA 1 1 3
2: 1 NA 2 1 3
3: 1 NA 1 1 3
4: 1 1 2 2 1
5: 2 1 1 3 2
6: 2 1 2 3 2
7: 2 2 1 4 2
8: 2 2 2 4 2
9: 3 2 1 5 1
10: 3 4 2 6 3
11: 3 4 1 6 3
12: 3 4 2 6 3
I don't want to trim the data frame and (preferably) I don't want to reorder the rows.
I tried setDT, such as
setDT(DT)[,.(.I, .N), by = names(DT[,1:2])]
But the I column is not the index I want, and Column C is gone.
Thanks in advance!
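One possible approach, sketched here with data.table since DT is already a data.table: add both columns by reference with :=, grouping by A and B. Inside a grouped call, .GRP is the group counter (numbered in order of first appearance) and .N is the group size. Because := updates in place, no rows are dropped or reordered and C is kept.

```r
library(data.table)

DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(c(NA, 1, 2, 4), each = 3),
                 C = rep(1:2, 6))

# .GRP = running group number, .N = number of rows in the group;
# NA in B is treated as its own grouping value
DT[, `:=`(Index = .GRP, Count = .N), by = .(A, B)]
DT
```

The attempt in the question used .I, which is the row number within the whole table rather than a group index, and wrapping the result in .() builds a new table instead of adding columns to the existing one.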

Replace NA in column by value corresponding to column name in separate table

I have a data frame which looks like this
data <- data.frame(ID = c(1,2,3,4,5),A = c(1,4,NA,NA,4),B = c(1,2,NA,NA,NA),C= c(1,2,3,4,NA))
> data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 NA NA 3
4 4 NA NA 4
5 5 4 NA NA
I have a mapping file as well which looks like this
reference <- data.frame(Names = c("A","B","C"),Vals = c(2,5,6))
> reference
Names Vals
1 A 2
2 B 5
3 C 6
I want my data file to be modified using the reference file in a way which would yield me this final data frame
> final_data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 2 5 3
4 4 2 5 4
5 5 4 5 6
What is the fastest way I can achieve this in R?
We can do this with Map
data[as.character(reference$Names)] <- Map(function(x,y) replace(x,
is.na(x), y), data[as.character(reference$Names)], reference$Vals)
data
# ID A B C
#1 1 1 1 1
#2 2 4 2 2
#3 3 2 5 3
#4 4 2 5 4
#5 5 4 5 6
EDIT: Based on @thelatemail's comments.
NOTE: NO external packages used
As we are looking for efficient solution, another approach would be set from data.table
library(data.table)
setDT(data)
v1 <- as.character(reference$Names)
for(j in seq_along(v1)){
set(data, i = which(is.na(data[[v1[j]]])), j= v1[j], value = reference$Vals[j] )
}
NOTE: Only a single efficient external package used.
One approach is to compute a logical matrix of the target columns capturing which cells are NA. We can then index-assign the NA cells with the replacement values. The tricky part is ensuring the replacement vector aligns with the indexed cells:
im <- is.na(data[as.character(reference$Names)]);
data[as.character(reference$Names)][im] <- rep(reference$Vals,colSums(im));
data;
## ID A B C
## 1 1 1 1 1
## 2 2 4 2 2
## 3 3 2 5 3
## 4 4 2 5 4
## 5 5 4 5 6
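To see why the replacement vector lines up, it helps to look at the intermediate values. Indexing with a logical matrix visits cells column by column, so the replacements have to be ordered column by column too: each reference value repeated once per NA in its column. A small sketch on the question's data (cols and fill are names made up for the illustration):

```r
data <- data.frame(ID = c(1, 2, 3, 4, 5),
                   A = c(1, 4, NA, NA, 4),
                   B = c(1, 2, NA, NA, NA),
                   C = c(1, 2, 3, 4, NA))
reference <- data.frame(Names = c("A", "B", "C"), Vals = c(2, 5, 6))

cols <- as.character(reference$Names)
im <- is.na(data[cols])      # logical matrix: TRUE where a cell is NA

colSums(im)                  # number of NAs per column: A = 2, B = 3, C = 1

# Repeat each replacement once per NA in its column: 2 2 5 5 5 6
fill <- rep(reference$Vals, colSums(im))

# Logical-matrix indexing walks the cells in column-major order,
# which is the same order in which fill was built
data[cols][im] <- fill
data
```

This relies on the rows of reference being in the same order as the columns selected by cols, which holds here because cols is taken from reference itself.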
If reference was the same wide format as data, dplyr's new (v. 0.5.0) coalesce function is built for replacing NAs; together with purrr, which offers alternate notations for *apply functions, it makes the process very simple:
library(dplyr)
# spread reference to wide, add ID column for mapping
reference_wide <- data.frame(ID = NA_real_, tidyr::spread(reference, Names, Vals))
reference_wide
# ID A B C
# 1 NA 2 5 6
# now coalesce the two column-wise and return a df
purrr::map2_df(data, reference_wide, coalesce)
# Source: local data frame [5 x 4]
#
# ID A B C
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 2 4 2 2
# 3 3 2 5 3
# 4 4 2 5 4
# 5 5 4 5 6

row numbers for explicit rows in r

I need to get row numbers for explicit rows grouped over id. Let's say dataframe (df) looks like this:
id a b
3 2 NA
3 3 2
3 10 NA
3 21 0
3 2 NA
4 1 5
4 1 0
4 5 NA
I need to create one more column that would give row number sequence excluding the case where b == 0.
desired output:
id a b row
3 2 NA 1
3 3 2 2
3 10 NA 3
3 21 0 -
3 2 NA 4
4 1 5 1
4 1 0 -
4 5 NA 2
I used dplyr but was not able to achieve this.
My code:
df <- df %>%
group_by(id) %>%
mutate(row = row_number(id[b != 0]))
Please suggest some better way to do this.
I would propose using the data.table package for its nice capability of operating on subsets, which avoids inefficient operations such as ifelse or evaluating the whole data set. Also, it is better to keep your vector numeric (for future operations), so NA is probably preferable to - (a character). Here's a possible solution:
library(data.table)
setDT(df)[is.na(b) | b != 0, row := seq_len(.N), by = id]
# id a b row
# 1: 3 2 NA 1
# 2: 3 3 2 2
# 3: 3 10 NA 3
# 4: 3 21 0 NA
# 5: 3 2 NA 4
# 6: 4 1 5 1
# 7: 4 1 0 NA
# 8: 4 5 NA 2
The idea here is to operate only on the rows where is.na(b) | b != 0 and generate a sequence of each group size (.N) while updating row in place (using :=). All the rest of the rows will be assigned with NAs by default.
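For comparison, the same logic can be sketched in base R without data.table: flag the rows to be counted, take a running count of the flags within each id, and blank out the rows that were skipped. This is an alternative formulation, not the code above, and keep is a name made up for the illustration.

```r
df <- data.frame(id = c(3, 3, 3, 3, 3, 4, 4, 4),
                 a  = c(2, 3, 10, 21, 2, 1, 1, 5),
                 b  = c(NA, 2, NA, 0, NA, 5, 0, NA))

# Rows that should receive a number: b is missing or non-zero
keep <- is.na(df$b) | df$b != 0

# Running count of kept rows within each id
df$row <- ave(as.integer(keep), df$id, FUN = cumsum)

# Rows with b == 0 carry the previous count, so reset them to NA
df$row[!keep] <- NA
df
```

The is.na(b) check matters in both versions: b != 0 alone evaluates to NA for missing b, and those rows would not be numbered.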

Populating a data frame with corresponding values from another

I have a data frame containing values read in from an experiment with independent variables A and B which doesn't cover all possible permutations of A and B. I need to create a data frame which does contain all permutations, with zeros in those places where that particular pair of values isn't present in the data.
To create some sample data,
interactions <- unique(data.frame(A = sample(1:5, 10, replace=TRUE),
B = sample(1:5, 10, replace=TRUE)))
interactions <- interactions[interactions$A < interactions$B, ]
interactions$val <- runif(nrow(interactions))
possible.interactions <- data.frame(t(combn(1:5, 2)))
names(possible.interactions) <- c('A', 'B')
which creates
interactions
A B val
1 5 0.6881106
1 2 0.5286560
2 4 0.5026426
and
possible.interactions
A B
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
and I want to output
A B val
1 2 0.5286560
1 3 NA
1 4 NA
1 5 0.6881106
2 3 NA
2 4 0.5026426
2 5 NA
3 4 NA
3 5 NA
4 5 NA
What is the fastest way to do this?
Here is a base solution that is much faster (~10x) than merge:
possible.interactions$val <- interactions$val[
match(
do.call(paste, possible.interactions),
do.call(paste, interactions[1:2])
) ]
This produces the following (the values differ from yours because you didn't set a seed):
# A B val
# 1 1 2 0.59809242
# 2 1 3 0.92861520
# 3 1 4 0.64279549
# 4 1 5 NA
# 5 2 3 0.03554058
# 6 2 4 NA
# 7 2 5 NA
# 8 3 4 NA
# 9 3 5 NA
# 10 4 5 NA
This assumes A & B do not contain spaces and that interactions has no duplicate A-B pairs (will always match to first).
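The question's text asks for zeros rather than NA in the missing places. A minimal sketch of the match approach with that extra step, using the three sample rows printed in the question instead of random data (idx is a name made up for the illustration):

```r
# Sample rows as printed in the question
interactions <- data.frame(A = c(1, 1, 2),
                           B = c(5, 2, 4),
                           val = c(0.6881106, 0.5286560, 0.5026426))

possible.interactions <- data.frame(t(combn(1:5, 2)))
names(possible.interactions) <- c("A", "B")

# Build an "A B" key for every row in both tables and look each possible pair up
idx <- match(do.call(paste, possible.interactions),
             do.call(paste, interactions[1:2]))

possible.interactions$val <- interactions$val[idx]   # NA where no match
possible.interactions$val[is.na(idx)] <- 0            # zeros instead of NA
possible.interactions
```

Testing is.na(idx) rather than is.na(val) keeps any genuine NA measurements in interactions distinct from pairs that are simply absent.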
And the data.table version:
library(data.table)
possible.DT <- data.table(possible.interactions)
DT <- data.table(interactions, key=c("A", "B"))
DT[possible.DT]
Though this is only worthwhile if your tables are large or you have uses for other benefits of data.table. I've found speed to be comparable to match in simple cases if you include the overhead of creating and keying the tables. I'm sure there are cases where data.table is much faster, especially if you key once and then use that key a lot.
For completeness, here is the merge version:
merge(possible.interactions, interactions, all.x=TRUE)
If order is important to you, I recommend using join from the plyr package, as opposed to merge, which does not provide an intuitive ordering when there are unmatched elements.
library(plyr)
join(interactions,possible.interactions,type="right")
Joining by: A, B
A B val
1 1 2 NA
2 1 3 NA
3 1 4 0.007602083
4 1 5 0.853415110
5 2 3 NA
6 2 4 0.321098658
7 2 5 NA
8 3 4 NA
9 3 5 NA
10 4 5 NA
