Populating a data frame with corresponding values from another - r

I have a data frame containing values read in from an experiment with independent variables A and B which doesn't cover all possible permutations of A and B. I need to create a data frame which does contain all permutations, with zeros in those places where that particular pair of values isn't present in the data.
To create some sample data,
interactions <- unique(data.frame(A = sample(1:5, 10, replace=TRUE),
B = sample(1:5, 10, replace=TRUE)))
interactions <- interactions[interactions$A < interactions$B, ]
interactions$val <- runif(nrow(interactions))
possible.interactions <- data.frame(t(combn(1:5, 2)))
names(possible.interactions) <- c('A', 'B')
which creates
interactions
A B val
1 5 0.6881106
1 2 0.5286560
2 4 0.5026426
and
possible.interactions
A B
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
and I want to output
A B val
1 2 NA
1 3 0.5286560
1 4 NA
1 5 0.6881106
2 3 NA
2 4 0.5026426
2 5 NA
3 4 NA
3 5 NA
4 5 NA
What is the fastest way to do this?

Here is a base solution that is much faster (~10x) than merge:
possible.interactions$val <- interactions$val[
match(
do.call(paste, possible.interactions),
do.call(paste, interactions[1:2])
) ]
This produces (note, different to what you expect b/c you didn't set seed):
# A B val
# 1 1 2 0.59809242
# 2 1 3 0.92861520
# 3 1 4 0.64279549
# 4 1 5 NA
# 5 2 3 0.03554058
# 6 2 4 NA
# 7 2 5 NA
# 8 3 4 NA
# 9 3 5 NA
# 10 4 5 NA
This assumes A & B do not contain spaces and that interactions has no duplicate A-B pairs (will always match to first).
And the data.table version:
possible.DT <- data.table(possible.interactions)
DT <- data.table(interactions, key=c("A", "B"))
DT[possible.DT]
Though this is only worthwhile if your tables are large or you have uses for other benefits of data.table. I've found speed to be comparable to match in simple cases if you include the overhead of creating and keying the tables. I'm sure there are cases where data.table is much faster, especially if you key once and then use that key a lot.
For completeness, here is the merge version:
merge(possible.interactions, interactions, all.x=T)

If order is important to you, I recommend using join from the plyr package. As opposed to merge which does not provide an intuitive ordering when there are unmatched elements.
library(plyr)
join(interactions,possible.interactions,type="right")
Joining by: A, B
A B val
1 1 2 NA
2 1 3 NA
3 1 4 0.007602083
4 1 5 0.853415110
5 2 3 NA
6 2 4 0.321098658
7 2 5 NA
8 3 4 NA
9 3 5 NA
10 4 5 NA

Related

spreading data in R - allowing multiple values per cell

With these data
d <- data.frame(time=1:5, side=c("r","r","r","l","l"), val = c(1,2,1,2,1))
d
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
We can spread to a tidy dataframe like this:
library(tidyverse)
d %>% spread(side,val)
Which gives:
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1 NA
But say we have more than one val for a given time/side. For example:
d <- data.frame(time=c(1:5,5), side=c("r","r","r","l","l","l"), val = c(1,2,1,2,1,2))
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
6 5 l 2
Now this won't work because of duplicated values:
d %>% spread(side,val)
Error: Duplicate identifiers for rows (5, 6)
Is there an efficient way to force this behavior (or alternative). The output would be e.g.
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1, 2 NA
The data.table/reshape2 equivalent of tidyr::spread is dcast. It has a more complicated syntax than spread, but it's more flexible. To accomplish your task we can use the below chunk.
We use the formula to 'spread' side by time (filling with the values in the val column), provide the fill value of NA, and specify we want to list elements together when aggregation is needed per value of time.
library(data.table)
d <- data.table(time=c(1:5,5),
side=c("r","r","r","l","l","l"),
val = c(1,2,1,2,1,2))
data.table::dcast(d, time ~ side,
value.var='val',
fill=NA,
fun.aggregate=list)
#OUTPUT
# time l r
# 1: 1 NA 1
# 2: 2 NA 2
# 3: 3 NA 1
# 4: 4 2 NA
# 5: 5 1,2 NA

Replace na in column by value corresponding to column name in seperate table

I have a data frame which looks like this
data <- data.frame(ID = c(1,2,3,4,5),A = c(1,4,NA,NA,4),B = c(1,2,NA,NA,NA),C= c(1,2,3,4,NA))
> data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 NA NA 3
4 4 NA NA 4
5 5 4 NA NA
I have a mapping file as well which looks like this
reference <- data.frame(Names = c("A","B","C"),Vals = c(2,5,6))
> reference
Names Vals
1 A 2
2 B 5
3 C 6
I want my data file to be modified using the reference file in a way which would yield me this final data frame
> final_data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 2 5 3
4 4 2 5 4
5 5 4 5 6
What is the fastest way I can acheive this in R?
We can do this with Map
data[as.character(reference$Names)] <- Map(function(x,y) replace(x,
is.na(x), y), data[as.character(reference$Names)], reference$Vals)
data
# ID A B C
#1 1 1 1 1
#2 2 4 2 2
#3 3 2 5 3
#4 4 2 5 4
#5 5 4 5 6
EDIT: Based on #thelatemail's comments.
NOTE: NO external packages used
As we are looking for efficient solution, another approach would be set from data.table
library(data.table)
setDT(data)
v1 <- as.character(reference$Names)
for(j in seq_along(v1)){
set(data, i = which(is.na(data[[v1[j]]])), j= v1[j], value = reference$Vals[j] )
}
NOTE: Only a single efficient external package used.
One approach is to compute a logical matrix of the target columns capturing which cells are NA. We can then index-assign the NA cells with the replacement values. The tricky part is ensuring the replacement vector aligns with the indexed cells:
im <- is.na(data[as.character(reference$Names)]);
data[as.character(reference$Names)][im] <- rep(reference$Vals,colSums(im));
data;
## ID A B C
## 1 1 1 1 1
## 2 2 4 2 2
## 3 3 2 5 3
## 4 4 2 5 4
## 5 5 4 5 6
If reference was the same wide format as data, dplyr's new (v. 0.5.0) coalesce function is built for replacing NAs; together with purrr, which offers alternate notations for *apply functions, it makes the process very simple:
library(dplyr)
# spread reference to wide, add ID column for mapping
reference_wide <- data.frame(ID = NA_real_, tidyr::spread(reference, Names, Vals))
reference_wide
# ID A B C
# 1 NA 2 5 6
# now coalesce the two column-wise and return a df
purrr::map2_df(data, reference_wide, coalesce)
# Source: local data frame [5 x 4]
#
# ID A B C
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 2 4 2 2
# 3 3 2 5 3
# 4 4 2 5 4
# 5 5 4 5 6

row numbers for explicit rows in r

I need to get row numbers for explicit rows grouped over id. Let's say dataframe (df) looks like this:
id a b
3 2 NA
3 3 2
3 10 NA
3 21 0
3 2 NA
4 1 5
4 1 0
4 5 NA
I need to create one more column that would give row number sequence excluding the case where b == 0.
desired output:
id a b row
3 2 NA 1
3 3 2 2
3 10 NA 3
3 21 0 -
3 2 NA 4
4 1 5 1
4 1 0 -
4 5 NA 2
I used dplyr but not able to achieve the same,
My code:
df <- df %>%
group_by(id) %>%
mutate(row = row_number(id[b != 0]))
Please suggest some better way to do this.
I would propose using the data.table package for its nice capability in operating on subsets and thus avoiding inefficient operations such as ifelse or evaluation the whole data set. Also, it is better to keep you vector in numeric class (for future operations), thus NA will be probably preferable to - (character), here's a possible solution
library(data.table)
setDT(df)[is.na(b) | b != 0, row := seq_len(.N), by = id]
# id a b row
# 1: 3 2 NA 1
# 2: 3 3 2 2
# 3: 3 10 NA 3
# 4: 3 21 0 NA
# 5: 3 2 NA 4
# 6: 4 1 5 1
# 7: 4 1 0 NA
# 8: 4 5 NA 2
The idea here is to operate only on the rows where is.na(b) | b != 0 and generate a sequence of each group size (.N) while updating row in place (using :=). All the rest of the rows will be assigned with NAs by default.

Why didn't mutate fill all rows? Was using mutate and ifelse to look up imputed values from another dataframe

Here is the deal. Was trying to use mutate from the plyr package to look up an appropriate value from another dataframe, if, the v variable in the original dataframe was NA. The looked up value is supposed to go into a new variable imputed. I also defined a custom function for this look up purpose.
Here is the code:
if(!require(plyr)){
install.packages("plyr")
library(plyr)
}
df = data.frame(d=c(1,1,1,2,2,2,3,3,3),
g=rep(c(1,2,3),3),
v=c(5,NA,NA,5,NA,NA,5,NA,NA))
imputed = data.frame(g=c(1,2,3),
v=c(5,10,15))
getImputed = function(p){
imputed[imputed$g==p,"v"]
}
df = mutate(df,imputed=ifelse(is.na(v),getImputed(g),v))
df
And this is the resulting dataframe:
d g v imputed
1 1 1 5 5
2 1 2 NA 10
3 1 3 NA 15
4 2 1 5 5
5 2 2 NA NA
6 2 3 NA NA
7 3 1 5 5
8 3 2 NA NA
9 3 3 NA NA
As one can see, only the first 3 rows were successfully filled in by mutate. It is likely that the ifelse function is the issue, but I can't see why : (
What is weird is that, if the imputed dataframe has 4 rows, like this:
imputed = data.frame(g=c(1,2,3,4),
v=c(5,10,15,20))
then the df dataframe was filled up properly:
d g v imputed
1 1 1 5 5
2 1 2 NA 10
3 1 3 NA 15
4 2 1 5 5
5 2 2 NA 10
6 2 3 NA 15
7 3 1 5 5
8 3 2 NA 10
9 3 3 NA 15
but R gave me a warning saying:
Warning message:
In imputed$g == p :
longer object length is not a multiple of shorter object length
Am I overlooking something?
The problem is your getImputed function. The mutate function does not iterate over rows. It passes columns as a vectors to functions so each function is basically called one. Your getInputed function works if you pass a single value, but not so great with a vector
getImputed(1)
# [1] 5
getImputed(c(1,2))
# [1] 5 10
# Warning message:
# In imputed$g == p :
# longer object length is not a multiple of shorter object length
A better way to write the function would be
getImputed2 <- function(p){
imputed$v[match(p, imputed$g)]
}
This will properly handle a vector of values
mutate(df,imputed=ifelse(is.na(v),getImputed2(g),v))
# d g v imputed
# 1 1 1 5 5
# 2 1 2 NA 10
# 3 1 3 NA 15
# 4 2 1 5 5
# 5 2 2 NA 10
# 6 2 3 NA 15
# 7 3 1 5 5
# 8 3 2 NA 10
# 9 3 3 NA 15
You might also consider joining and replacing
mutate(join(df, setNames(imputed, c("g","v2")), by=c(g="g")),
v=ifelse(is.na(v), v2, v), v2=NULL)

R: remove multiple rows based on missing values in fewer rows

I have an R data frame with data from multiple subjects, each tested several times. To perform statistics on the set, there is a factor for subject ("id") and a row for each observation (given by factor "session"). I.e.
print(allData)
id session measure
1 1 7.6
2 1 4.5
3 1 5.5
1 2 7.1
2 2 NA
3 2 4.9
In the above example, is there a simple way to remove all rows with id==2, given that the "measure" column contains NA in one of the rows where id==2?
More generally, since I actually have a lot of measures (columns) and four sessions (rows) for each subject, is there an elegant way to remove all rows with a given level of the "id" factor, given that (at least) one of the rows with this "id"-level contains NA in a column?
I have the intuition that there could be a build-in function that could solve this problem more elegantly than my current solution:
# Which columns to check for NA's in
probeColumns = c('measure1','measure4') # Etc...
# A vector which contains all levels of "id" that are present in rows with NA's in the probeColumns
idsWithNAs = allData[complete.cases(allData[probeColumns])==FALSE,"id"]
# All rows that isn't in idsWithNAs
cleanedData = allData[!allData$id %in% idsWithNAs,]
Thanks,
/Jonas
You can use the ddply function from the plyr package to 1) subset your data by id, 2)
apply a function that will return NULL if the sub data.frame contains NA in the columns of your choice, or the data.frame itself otherwise, and 3) concatenate everything back into a data.frame.
allData <- data.frame(id = rep(1:4, 3),
session = rep(1:3, each = 4),
measure1 = sample(c(NA, 1:11)),
measure2 = sample(c(NA, 1:11)),
measure3 = sample(c(NA, 1:11)),
measure4 = sample(c(NA, 1:11)))
allData
# id session measure1 measure2 measure3 measure4
# 1 1 1 3 7 10 6
# 2 2 1 4 4 9 9
# 3 3 1 6 6 7 10
# 4 4 1 1 5 2 3
# 5 1 2 NA NA 5 11
# 6 2 2 7 10 6 5
# 7 3 2 9 8 4 2
# 8 4 2 2 9 1 7
# 9 1 3 5 1 3 8
# 10 2 3 8 3 8 1
# 11 3 3 11 11 11 4
# 12 4 3 10 2 NA NA
# Which columns to check for NA's in
probeColumns = c('measure1','measure4')
library(plyr)
ddply(allData, "id",
function(df)if(any(is.na(df[, probeColumns]))) NULL else df)
# id session measure1 measure2 measure3 measure4
# 1 2 1 4 4 9 9
# 2 2 2 7 10 6 5
# 3 2 3 8 3 8 1
# 4 3 1 6 6 7 10
# 5 3 2 9 8 4 2
# 6 3 3 11 11 11 4
Using your example two last commands of it can be transformed in such string. It should produce the same result and it looks simplier.
cleanedData <- allData[complete.cases(allData[,probeColumns]),]
This is a correct version which uses only base package. Just for fun. :) But it's neither compact nor simple. Answer of flodel is neater. Even your initial solution is more compact and I think faster.
cleanedData <- do.call(rbind, sapply(unique(allData[,"id"]), function(x) {if(all(!is.na(allData[allData$id==x, probeColumn]))) allData[allData$id==x,]}))

Resources