How to replace NAs of a variable with values from another dataframe - r

i hope this one isn´t stupid.
I have two dataframes with Variables ID and gender/sex. In df1, there are NAs. In df2, the variable is complete. I want to complete the column in df1 with the values from df2.
(In df1 the variable is called "gender". In df2 it is called "sex".)
Here is what i tried so far:
#example-data
ID<-seq(1,30,by=1)
df1<-as.data.frame(ID)
df2<-df1
df1$gender<-c(NA,"2","1",NA,"2","2","2","2","2","2",NA,"2","1","1",NA,"2","2","2","2","2","1","2","2",NA,"2","2","2","2","2",NA)
df2$sex<-c("2","2","1","2","2","2","2","2","2","2","2","2","1","1","2","2","2","2","2","2","1","2","2","2","2","2","2","2","2","2")
#Approach 1:
NAs.a <- is.na(df1$gender)
df1$gender[NAs.a] <- df2[match(df1$ID[NAs.a], df2$ID),]$sex
#Approach 2 (i like dplyr a lot, perhaps there´s a way to use it):
library("dplyr")
temp<-df2 %>% select(ID,gender)
#EDIT:
#df<-left_join(df1$gender,df2$gender, by="ID")
df<-left_join(df1,df2, by="ID")
Thank you very much.

Here's a quick solution using data.tables binary join this will join only gender with sex and leave all the rest of the columns untouched
library(data.table)
setkey(setDT(df1), ID)
df1[df2, gender := i.sex][]
# ID gender
# 1: 1 2
# 2: 2 2
# 3: 3 1
# 4: 4 2
# 5: 5 2
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
# 11: 11 2
# 12: 12 2
# 13: 13 1
# 14: 14 1
# 15: 15 2
# 16: 16 2
# 17: 17 2
# 18: 18 2
# 19: 19 2
# 20: 20 2
# 21: 21 1
# 22: 22 2
# 23: 23 2
# 24: 24 2
# 25: 25 2
# 26: 26 2
# 27: 27 2
# 28: 28 2
# 29: 29 2
# 30: 30 2

This would probably be the simplest with base R.
idx <- is.na(df1$gender)
df1$gender[idx] = df2$sex[idx]

You could do
df1 %>% select(ID) %>% left_join(df2, by = "ID")
# ID sex
#1 1 2
#2 2 2
#3 3 1
#4 4 2
#5 5 2
#6 6 2
#.. ..
This assumes - as in the example - that all ID's from df1 are also present in df2 and have a sex/gender information there.
If you have other columns in your data you could also try this instead:
df1 %>% select(-gender) %>% left_join(df2[c("ID", "sex")], by = "ID")

Related

R (data.table): call different columns in a loop

I am trying to call different columns of a data.table inside a loop, to get unique values of each column.
Consider the simple data.table below.
> df <- data.table(var_a = rep(1:10, 2),
+ var_b = 1:20)
> df
var_a var_b
1: 1 1
2: 2 2
3: 3 3
4: 4 4
5: 5 5
6: 6 6
7: 7 7
8: 8 8
9: 9 9
10: 10 10
11: 1 11
12: 2 12
13: 3 13
14: 4 14
15: 5 15
16: 6 16
17: 7 17
18: 8 18
19: 9 19
20: 10 20
My code works when I call for a specific column outside a loop,
> unique(df$var_a)
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, var_a])
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, "var_a"])
var_a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
but not when I do so within a loop that goes through different columns of the data.table.
> for(v in c("var_a","var_b")){
+ print(v)
+ df$v
+ unique(df[, .v])
+ unique(df[, "v"])
+ }
[1] "var_a"
Error in `[.data.table`(df, , .v) :
j (the 2nd argument inside [...]) is a single symbol but column name '.v' is not found. Perhaps you intended DT[, ...v]. This difference to data.frame is deliberate and explained in FAQ 1.1.
>
> unique(df[, ..var_a])
Error in `[.data.table`(df, , ..var_a) :
Variable 'var_a' is not found in calling scope. Looking in calling scope because you used the .. prefix.
For the first problem, when you're referencing a column name indirectly, you can either use double-dot ..v syntax, or add with=FALSE in the data.table::[ construct:
for (v in c("var_a", "var_b")) {
print(v)
print(df$v)
### either one of these will work:
print(unique(df[, ..v]))
# print(unique(df[, v, with = FALSE]))
}
# [1] "var_a"
# NULL
# var_a
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# [1] "var_b"
# NULL
# var_b
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# 11: 11
# 12: 12
# 13: 13
# 14: 14
# 15: 15
# 16: 16
# 17: 17
# 18: 18
# 19: 19
# 20: 20
# var_b
But this just prints it without changing anything. If all you want to do is look at unique values within each column (and not change the underlying frame), then I'd likely go with
lapply(df[,.(var_a, var_b)], unique)
# $var_a
# [1] 1 2 3 4 5 6 7 8 9 10
# $var_b
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
which shows the name and unique values. The use of lapply (whether on df as a whole or a subset of columns) is also preferable to another recommendation to use apply(df, 2, unique), though in this case it returns the same results.
Use .subset2 to refer to a column by its name:
for(v in c("var_a","var_b")) {
print(unique(.subset2(df, v)))
}
following the information on the first error, this would be the correct way to call in a loop:
for(v in c("var_a","var_b")){
print(unique(df[, ..v]))
}
# won't print all the lines
as for the second error you have not declared a variable called "var_a", it looks like you want to select by name.
# works as you have shown
unique(df[, "var_a"])
# works once the variable is declared
var_a <- "var_a"
unique(df[, ..var_a])
You may also be interested in the env param of data.table (see development version); here is an illustration below, but you could use this in a loop too.
v="var_a"
df[, v, env=list(v=v)]
Output:
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

Why can't I remove current observation using .I in data.table?

Recently I saw a question (can't find the link) that was something like this
I want to add a column on a data.frame that computes the variance of a different column while removing the current observation.
dt = data.table(
id = c(1:13),
v = c(9,5,8,1,25,14,7,87,98,63,32,12,15)
)
So, with a for() loop:
res = NULL
for(i in 1:13){
res[i] = var(dt[-i,v])
}
I tried doing this in data.table, using negative indexing with .I, but to my surprise none of the following works:
#1
dt[,var := var(dt[,v][-.I])]
#2
dt[,var := var(dt$v[-.I])]
#3
fun = function(x){
v = c(9,5,8,1,25,14,7,87,98,63,32,12,15)
var(v[-x])
}
dt[,var := fun(.I)]
#4
fun = function(x){
var(dt[-x,v])
}
dt[,var := fun(.I)]
All of those gives the same output:
id v var
1: 1 9 NA
2: 2 5 NA
3: 3 8 NA
4: 4 1 NA
5: 5 25 NA
6: 6 14 NA
7: 7 7 NA
8: 8 87 NA
9: 9 98 NA
10: 10 63 NA
11: 11 32 NA
12: 12 12 NA
13: 13 15 NA
What am I missing? I thought it was a problem with .I being passed to functions, but a dummy example:
fun = function(x,c){
x*c
}
dt[,dummy := fun(.I,2)]
id v var
1: 1 9 2
2: 2 5 4
3: 3 8 6
4: 4 1 8
5: 5 25 10
6: 6 14 12
7: 7 7 14
8: 8 87 16
9: 9 98 18
10: 10 63 20
11: 11 32 22
12: 12 12 24
13: 13 15 26
works fine.
Why can't I use .I in this specific scenario?
You may use .BY:
a list containing a length 1 vector for each item in by
dt[ , var_v := dt[id != .BY$id, var(v)], by = id]
Variance is calculated once per row (by = id). In each calculation, the current row is excluded using id != .BY$id in the 'inner' i.
all.equal(dt$var_v, res)
# [1] TRUE
Why doesn't your code work? Because...
.I is an integer vector equal to seq_len(nrow(x)),
...your -.I not only removes current observation, it removes all rows in one go from 'v'.
A small illustration which starts with your attempt (just without the assignment :=) and simplifies it step by step:
# your attempt
dt[ , var(dt[, v][-.I])]
# [1] NA
# without the `var`, indexing only
dt[ , dt[ , v][-.I]]
# numeric(0)
# an empty vector
# same indexing written in a simpler way
dt[ , v[-.I]]
# numeric(0)
# even more simplified, with a vector of values
# and its corresponding indexes (equivalent to .I)
v <- as.numeric(11:14)
i <- 1:4
v[i]
# [1] 11 12 13 14
x[-i]
# numeric(0)
Here's a brute-force thought:
exvar <- function(x, na.rm = FALSE) sapply(seq_len(length(x)), function(i) var(x[-i], na.rm = na.rm))
dt[,var := exvar(v)]
dt
# id v var
# 1: 1 9 1115.538
# 2: 2 5 1098.265
# 3: 3 8 1111.515
# 4: 4 1 1077.841
# 5: 5 25 1153.114
# 6: 6 14 1132.697
# 7: 7 7 1107.295
# 8: 8 87 822.447
# 9: 9 98 684.697
# 10: 10 63 1040.265
# 11: 11 32 1153.697
# 12: 12 12 1126.424
# 13: 13 15 1135.538

For each value of one column, find which is the last value of another vector that is lower

Finding the last position of a vector that is less than a given value is fairly straightforward (see e.g. this question
But, doing this line by line for a column in a data.frame or data.table is horribly slow. For example, we can do it like this (which is ok on small data, but not good on big data)
library(data.table)
set.seed(123)
x = sort(sample(20,5))
# [1] 6 8 15 16 17
y = data.table(V1 = 1:20)
y[, last.x := tail(which(x <= V1), 1), by = 1:nrow(y)]
# V1 last.x
# 1: 1 NA
# 2: 2 NA
# 3: 3 NA
# 4: 4 NA
# 5: 5 NA
# 6: 6 1
# 7: 7 1
# 8: 8 2
# 9: 9 2
# 10: 10 2
# 11: 11 2
# 12: 12 2
# 13: 13 2
# 14: 14 2
# 15: 15 3
# 16: 16 4
# 17: 17 5
# 18: 18 5
# 19: 19 5
# 20: 20 5
Is there a fast, vectorised way to get the same thing? Preferably using data.table or base R.
You may use findInterval
y[ , last.x := findInterval(V1, x)]
Slightly more convoluted using cut. But on the other hand, you get the NAs right away:
y[ , last.x := as.numeric(cut(V1, c(x, Inf), right = FALSE))]
Pretty simple in base R
x<-c(6L, 8L, 15L, 16L, 17L)
y<-1:20
cumsum(y %in% x)
[1] 0 0 0 0 0 1 1 2 2 2 2 2 2 2 3 4 5 5 5 5

Data.table selecting columns by name, e.g. using grepl

Say I have the following data.table:
dt <- data.table("x1"=c(1:10), "x2"=c(1:10),"y1"=c(10:1),"y2"=c(10:1), desc = c("a","a","a","b","b","b","b","b","c","c"))
I want to sum columns starting with an 'x', and sum columns starting with an 'y', by desc. At the moment I do this by:
dt[,.(Sumx=sum(x1,x2), Sumy=sum(y1,y2)), by=desc]
which works, but I would like to refer to all columns with "x" or "y" by their column names, eg using grepl().
Please could you advise me how to do so? I think I need to use with=FALSE, but cannot get it to work in combination with by=desc?
One-liner:
melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))[,
lapply(.SD, sum), by=desc, .SDcols=x:y]
Long version (by #Frank):
First, you probably don't want to store your data like that. Instead...
m = melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))
desc variable x y
1: a 1 1 10
2: a 1 2 9
3: a 1 3 8
4: b 1 4 7
5: b 1 5 6
6: b 1 6 5
7: b 1 7 4
8: b 1 8 3
9: c 1 9 2
10: c 1 10 1
11: a 2 1 10
12: a 2 2 9
13: a 2 3 8
14: b 2 4 7
15: b 2 5 6
16: b 2 6 5
17: b 2 7 4
18: b 2 8 3
19: c 2 9 2
20: c 2 10 1
Then you can do...
setnames(m[, lapply(.SD, sum), by=desc, .SDcols=x:y], 2:3, paste0("Sum", c("x", "y")))[]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
For more on improving the data structure you're working with, read about tidying data.
Use mget with grep is an option, where grep("^x", ...) returns the column names starting with x and use mget to get the column data, unlist the result and then you can calculate the sum:
dt[,.(Sumx=sum(unlist(mget(grep("^x", names(dt), value = T)))),
Sumy=sum(unlist(mget(grep("^y", names(dt), value = T))))), by=desc]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6

How to lag by an integer variable using R?

Say I have the following historical league results:
Season <- c(1,1,2,2,3,3,4,4,5,5)
Team <- c("Diverpool","Deverton","Diverpool","Deverton","Diverpool","Deverton","Diverpool","Deverton","Diverpool","Deverton")
End.Rank <- c(8,17,4,15,3,6,4,16,3,17)
PLRank <- cbind(Season,Team,End.Rank)
I want to (efficiently) create a one year lagged variable for each team based on two criteria:
lag End.Rank by Season (i.e. t-1 with Season as the time variable)
separate by team (Deverton's lagged End.Rank vs. Diverpool's lagged End.Rank)
Essentially, I'd like the output to be as follows:
l.End.Rank <- c(NA,NA,8,17,4,15,3,6,4,16)
Tried lag(), and lost when trying to do it in a for() loop at the moment.
You can try one of the following...
Note that I've used a data.frame instead of the matrix you get with cbind:
PLRank <- data.frame(Season, Team, End.Rank)
With "data.table":
library(data.table)
setDT(PLRank)[, l.End.Rank := shift(End.Rank), by = .(Team)][]
# Season Team End.Rank l.End.Rank
# 1: 1 Diverpool 8 NA
# 2: 1 Deverton 17 NA
# 3: 2 Diverpool 4 8
# 4: 2 Deverton 15 17
# 5: 3 Diverpool 3 4
# 6: 3 Deverton 6 15
# 7: 4 Diverpool 4 3
# 8: 4 Deverton 16 6
# 9: 5 Diverpool 3 4
# 10: 5 Deverton 17 16
Or, with "dplyr":
library(dplyr)
PLRank %>%
group_by(Team) %>%
mutate(l.End.Rank = lag(End.Rank))
# Source: local data frame [10 x 4]
# Groups: Team [2]
#
# Season Team End.Rank l.End.Rank
# (dbl) (fctr) (dbl) (dbl)
# 1 1 Diverpool 8 NA
# 2 1 Deverton 17 NA
# 3 2 Diverpool 4 8
# 4 2 Deverton 15 17
# 5 3 Diverpool 3 4
# 6 3 Deverton 6 15
# 7 4 Diverpool 4 3
# 8 4 Deverton 16 6
# 9 5 Diverpool 3 4
# 10 5 Deverton 17 16
Update
I had honestly entirely misread that you wanted this grouped by Season.
If you are lagging by season, perhaps you should consider widening the data, so that each season has just one row. Then a lag by season would be easy.
Examples:
Here, we use dcast from "data.table" to spread the values of "End.Rank" out by "Team". Then, we lag just the newly created columns.
library(data.table)
teams <- as.character(unique(PLRank$Team))
dcast(as.data.table(PLRank), Season ~ Team, value.var = "End.Rank")[
, (teams) := lapply(.SD, shift), .SDcols = teams][]
# Season Deverton Diverpool
# 1: 1 NA NA
# 2: 2 17 8
# 3: 3 15 4
# 4: 4 6 3
# 5: 5 16 4
Or, if you wanted both the team names and the values to be in a wide form, you could try something like:
dcast(as.data.table(PLRank)[, ind := sequence(.N), by = Season],
Season ~ ind, value.var = c("Team", "End.Rank"))[
, c("End.Rank_1", "End.Rank_2") := lapply(.SD, shift),
.SDcols = c("End.Rank_1", "End.Rank_2")][]
# Season Team_1 Team_2 End.Rank_1 End.Rank_2
# 1: 1 Diverpool Deverton NA NA
# 2: 2 Diverpool Deverton 8 17
# 3: 3 Diverpool Deverton 4 15
# 4: 4 Diverpool Deverton 3 6
# 5: 5 Diverpool Deverton 4 16
The approach in "dplyr" is similar. Since you're going to a wide form, you also need "tidyr" to be loaded.
library(dplyr)
library(tidyr)
PLRank %>%
spread(Team, End.Rank) %>%
mutate_each(funs(lag), -Season)
# Season Deverton Diverpool
# 1 1 NA NA
# 2 2 17 8
# 3 3 15 4
# 4 4 6 3
# 5 5 16 4

Resources