Replace na in column by value corresponding to column name in seperate table - r

I have a data frame which looks like this
data <- data.frame(ID = c(1,2,3,4,5),A = c(1,4,NA,NA,4),B = c(1,2,NA,NA,NA),C= c(1,2,3,4,NA))
> data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 NA NA 3
4 4 NA NA 4
5 5 4 NA NA
I have a mapping file as well which looks like this
reference <- data.frame(Names = c("A","B","C"),Vals = c(2,5,6))
> reference
Names Vals
1 A 2
2 B 5
3 C 6
I want my data file to be modified using the reference file in a way which would yield me this final data frame
> final_data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 2 5 3
4 4 2 5 4
5 5 4 5 6
What is the fastest way I can acheive this in R?

We can do this with Map
data[as.character(reference$Names)] <- Map(function(x,y) replace(x,
is.na(x), y), data[as.character(reference$Names)], reference$Vals)
data
# ID A B C
#1 1 1 1 1
#2 2 4 2 2
#3 3 2 5 3
#4 4 2 5 4
#5 5 4 5 6
EDIT: Based on #thelatemail's comments.
NOTE: NO external packages used
As we are looking for efficient solution, another approach would be set from data.table
library(data.table)
setDT(data)
v1 <- as.character(reference$Names)
for(j in seq_along(v1)){
set(data, i = which(is.na(data[[v1[j]]])), j= v1[j], value = reference$Vals[j] )
}
NOTE: Only a single efficient external package used.

One approach is to compute a logical matrix of the target columns capturing which cells are NA. We can then index-assign the NA cells with the replacement values. The tricky part is ensuring the replacement vector aligns with the indexed cells:
im <- is.na(data[as.character(reference$Names)]);
data[as.character(reference$Names)][im] <- rep(reference$Vals,colSums(im));
data;
## ID A B C
## 1 1 1 1 1
## 2 2 4 2 2
## 3 3 2 5 3
## 4 4 2 5 4
## 5 5 4 5 6

If reference was the same wide format as data, dplyr's new (v. 0.5.0) coalesce function is built for replacing NAs; together with purrr, which offers alternate notations for *apply functions, it makes the process very simple:
library(dplyr)
# spread reference to wide, add ID column for mapping
reference_wide <- data.frame(ID = NA_real_, tidyr::spread(reference, Names, Vals))
reference_wide
# ID A B C
# 1 NA 2 5 6
# now coalesce the two column-wise and return a df
purrr::map2_df(data, reference_wide, coalesce)
# Source: local data frame [5 x 4]
#
# ID A B C
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 2 4 2 2
# 3 3 2 5 3
# 4 4 2 5 4
# 5 5 4 5 6

Related

spreading data in R - allowing multiple values per cell

With these data
d <- data.frame(time=1:5, side=c("r","r","r","l","l"), val = c(1,2,1,2,1))
d
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
We can spread to a tidy dataframe like this:
library(tidyverse)
d %>% spread(side,val)
Which gives:
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1 NA
But say we have more than one val for a given time/side. For example:
d <- data.frame(time=c(1:5,5), side=c("r","r","r","l","l","l"), val = c(1,2,1,2,1,2))
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
6 5 l 2
Now this won't work because of duplicated values:
d %>% spread(side,val)
Error: Duplicate identifiers for rows (5, 6)
Is there an efficient way to force this behavior (or alternative). The output would be e.g.
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1, 2 NA
The data.table/reshape2 equivalent of tidyr::spread is dcast. It has a more complicated syntax than spread, but it's more flexible. To accomplish your task we can use the below chunk.
We use the formula to 'spread' side by time (filling with the values in the val column), provide the fill value of NA, and specify we want to list elements together when aggregation is needed per value of time.
library(data.table)
d <- data.table(time=c(1:5,5),
side=c("r","r","r","l","l","l"),
val = c(1,2,1,2,1,2))
data.table::dcast(d, time ~ side,
value.var='val',
fill=NA,
fun.aggregate=list)
#OUTPUT
# time l r
# 1: 1 NA 1
# 2: 2 NA 2
# 3: 3 NA 1
# 4: 4 2 NA
# 5: 5 1,2 NA

Shifting rows up in columns and flush remaining ones

I have a problem with moving the rows to one upper row. When the rows become completely NA I would like to flush those rows (see the pic below). My current approach for this solution however still keeping the second rows.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see still I am having the second rows in each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per gr, you can use na.omit then take the first value from it:
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
You can do it with dplyr like this:
data$ind <- rep(c(1,2), replace=TRUE)
data %>% fill(A,B,C) %>% filter(ind == 2) %>% mutate(ind=NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution using data.table:-
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5

R - Loop through a data table with combination of dcast of sum

I have a table similar this, with more columns. What I am trying to do is creating a new table that shows, for each ID, the number of Counts of each Type, the Value of each Type.
df
ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3
I am able to do it for one single column by using
dcast(df[,j=list(sum(Counts,na.rm = TRUE)),by = c("ID","Type")],ID ~ paste(Type,"Counts",sep="_"))
However, I want to use a loop through each column within the data table. but there is no success, it will always add up all the rows. I have try to use
sum(df[[i]],na.rm = TRUE)
sum(names(df)[[i]] == "",na.rm = TRUE)
sum(df[[names(df)[i]]],na.rm = TRUE)
j = list(apply(df[,c(3:4),with=FALSE],2,function(x) sum(x,na.rm = TRUE)
I want to have a new table similar like
ID A_Counts B_Counts A_Value B_Value
1 1 2 5 4
2 5 3 5 6
My own table have more columns, but the idea is the same. Do I over-complicated it or is there a easy trick I am not aware of? Please help me. Thank you!
You have to melt your data first, and then dcast it:
library(reshape2)
df2 <- melt(df,id.vars = c("ID","Type"))
# ID Type variable value
# 1 1 A Counts 1
# 2 1 B Counts 2
# 3 2 A Counts 2
# 4 2 A Counts 3
# 5 2 B Counts 1
# 6 2 B Counts 2
# 7 1 A Value 5
# 8 1 B Value 4
# 9 2 A Value 1
# 10 2 A Value 4
# 11 2 B Value 3
# 12 2 B Value 3
dcast(df2,ID ~ Type + variable,fun.aggregate=sum)
# ID A_Counts A_Value B_Counts B_Value
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Another solution with base functions only:
df3 <- aggregate(cbind(Counts,Value) ~ ID + Type,df,sum)
# ID Type Counts Value
# 1 1 A 1 5
# 2 2 A 5 5
# 3 1 B 2 4
# 4 2 B 3 6
reshape(df3, idvar='ID', timevar='Type',direction="wide")
# ID Counts.A Value.A Counts.B Value.B
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Data
df <- read.table(text ="ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3",stringsAsFactors=FALSE,header=TRUE)

Separate unique and duplicate entries in dataframe based off id

I have a dataframe with an id variable, which may be duplicated. I want to split this into two dataframes, one which contains only the entries where the id's are duplicated, the other which shows only the id's which are unique. What is the best way of doing this?
For example, say I had the data frame:
dataDF <- data.frame(id = c(1,1,2,3,4,4,5,6),
a = c(1,2,3,4,5,6,7,8),
b = c(8,7,6,5,4,3,2,1))
i.e. the following
id a b
1 1 1 8
2 1 2 7
3 2 3 6
4 3 4 5
5 4 5 4
6 4 6 3
7 5 7 2
8 6 8 1
I want to get the following dataframes:
id a b
1 1 1 8
2 1 2 7
5 4 5 4
6 4 6 3
and
id a b
3 2 3 6
4 3 4 5
7 5 7 2
8 6 8 1
I am currently doing this as follows
dupeIds <- unique(subset(dataDF, duplicated(dataDF$id))$id)
uniqueDF <- subset(dataDF, !id %in% dupeIds)
dupeDF <- subset(dataDF, id %in% dupeIds)
which seems to work but it seems a bit off to subset three times, is there a simpler way of doing this? Thanks
Use duplicated twice, once top down, and once bottom up, and then use split to get it all in a list, like this:
split(dataDF, duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE))
# $`FALSE`
# id a b
# 3 2 3 6
# 4 3 4 5
# 7 5 7 2
# 8 6 8 1
#
# $`TRUE`
# id a b
# 1 1 1 8
# 2 1 2 7
# 5 4 5 4
# 6 4 6 3
If you need to split this out into separate data.frames in your workspace (not sure why you would need to do that), assign names to the list items (eg names(mylist) <- c("nodupe", "dupe")) and then use list2env.

Populating a data frame with corresponding values from another

I have a data frame containing values read in from an experiment with independent variables A and B which doesn't cover all possible permutations of A and B. I need to create a data frame which does contain all permutations, with zeros in those places where that particular pair of values isn't present in the data.
To create some sample data,
interactions <- unique(data.frame(A = sample(1:5, 10, replace=TRUE),
B = sample(1:5, 10, replace=TRUE)))
interactions <- interactions[interactions$A < interactions$B, ]
interactions$val <- runif(nrow(interactions))
possible.interactions <- data.frame(t(combn(1:5, 2)))
names(possible.interactions) <- c('A', 'B')
which creates
interactions
A B val
1 5 0.6881106
1 2 0.5286560
2 4 0.5026426
and
possible.interactions
A B
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
and I want to output
A B val
1 2 NA
1 3 0.5286560
1 4 NA
1 5 0.6881106
2 3 NA
2 4 0.5026426
2 5 NA
3 4 NA
3 5 NA
4 5 NA
What is the fastest way to do this?
Here is a base solution that is much faster (~10x) than merge:
possible.interactions$val <- interactions$val[
match(
do.call(paste, possible.interactions),
do.call(paste, interactions[1:2])
) ]
This produces (note, different to what you expect b/c you didn't set seed):
# A B val
# 1 1 2 0.59809242
# 2 1 3 0.92861520
# 3 1 4 0.64279549
# 4 1 5 NA
# 5 2 3 0.03554058
# 6 2 4 NA
# 7 2 5 NA
# 8 3 4 NA
# 9 3 5 NA
# 10 4 5 NA
This assumes A & B do not contain spaces and that interactions has no duplicate A-B pairs (will always match to first).
And the data.table version:
possible.DT <- data.table(possible.interactions)
DT <- data.table(interactions, key=c("A", "B"))
DT[possible.DT]
Though this is only worthwhile if your tables are large or you have uses for other benefits of data.table. I've found speed to be comparable to match in simple cases if you include the overhead of creating and keying the tables. I'm sure there are cases where data.table is much faster, especially if you key once and then use that key a lot.
For completeness, here is the merge version:
merge(possible.interactions, interactions, all.x=T)
If order is important to you, I recommend using join from the plyr package. As opposed to merge which does not provide an intuitive ordering when there are unmatched elements.
library(plyr)
join(interactions,possible.interactions,type="right")
Joining by: A, B
A B val
1 1 2 NA
2 1 3 NA
3 1 4 0.007602083
4 1 5 0.853415110
5 2 3 NA
6 2 4 0.321098658
7 2 5 NA
8 3 4 NA
9 3 5 NA
10 4 5 NA

Resources