R data.table value from previous row with conditional statement - r

I would like to update a data.table value depending on whether it meets a criterion, returning either the value from another column or the value from the row above (same column).
As an example:
library( data.table )
data <- data.table( Col1 = 1:5, Col2 = letters[1:5] )
I would like to return the following:
data2 <- data.table( Col1= 1:5, Col2= letters[1:5], Col3= c("NA", "NA", "3", "3", "3"))
I have read the ?shift help page but I can't adapt it to using a conditional statement and returning a value in the same column. To get my desired outcome I have tried:
data[ , ( Col3 ) := ifelse( get( Col2 ) == "c", get( Col1 ) , shift( Col3 ))]
I would be grateful for some advice.
*Please ignore my use of get() for this example as I am aware it may not be the best approach.

This old, so far unanswered question has been revived recently.
As of today I am aware of the following approaches:
1. zoo::na.locf()
According to Frank's comment:
data3 <- data.table(Col1= 1:10, Col2 = c(letters[1:5],letters[1:5]))
data3[Col2=='c', Col3 := Col1][, Col3 := zoo::na.locf(Col3, na.rm=FALSE)]
data3[]
Col1 Col2 Col3
1: 1 a NA
2: 2 b NA
3: 3 c 3
4: 4 d 3
5: 5 e 3
6: 6 a 3
7: 7 b 3
8: 8 c 8
9: 9 d 8
10: 10 e 8
2. cumsum()
data3 <- data.table(Col1= 1:10, Col2 = c(letters[1:5],letters[1:5]))
data3[, Col3 := Col1[which(Col2 == "c")], by = cumsum(Col2 == "c")]
data3[]
Col1 Col2 Col3
1: 1 a NA
2: 2 b NA
3: 3 c 3
4: 4 d 3
5: 5 e 3
6: 6 a 3
7: 7 b 3
8: 8 c 8
9: 9 d 8
10: 10 e 8
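A third possibility, not taken from the answer above, stays within data.table. This is only a sketch and assumes data.table 1.12.4 or later, where nafill() and fifelse() are available. It reuses the data3 example and produces the same Col3 as the two approaches above.

```r
library(data.table)

data3 <- data.table(Col1 = 1:10, Col2 = c(letters[1:5], letters[1:5]))

# Keep Col1 where Col2 is "c" and NA elsewhere, then carry the last
# non-missing value forward. Leading NAs have nothing to carry and stay NA.
data3[, Col3 := nafill(fifelse(Col2 == "c", Col1, NA_integer_), type = "locf")]
data3[]
```

nafill() only handles numeric columns, so this sketch would not carry over directly if Col1 were character.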

Related

Replacing some values of a column based on some match in data.table

Let's say I have the data.table below:
library(data.table)
DT = data.table(Col1 = LETTERS[1:10], Col2 = c(1,4,2,3,6,NA,4,2, 5, 4))
DT
Col1 Col2
1: A 1
2: B 4
3: C 2
4: D 3
5: E 6
6: F NA
7: G 4
8: H 2
9: I 5
10: J 4
Now I want to replace the 4 and NA values in Col2 with 999.
In my actual scenario I have a very large DT, so I am looking for the most efficient way to achieve this.
Any insight will be highly appreciated.
An option with na_if/replace_na (na_if() comes from dplyr, replace_na() from tidyr):
library(dplyr)
library(tidyr)
library(data.table)
DT[, Col2 := replace_na(na_if(Col2, 4), 999)]
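Since the question asks about efficiency on a large table, a plain data.table sub-assignment may also be worth trying. This is a minimal sketch on the DT from the question, not a benchmarked recommendation. It selects the affected rows in i and assigns by reference in j, so only those rows are touched.

```r
library(data.table)

DT <- data.table(Col1 = LETTERS[1:10], Col2 = c(1, 4, 2, 3, 6, NA, 4, 2, 5, 4))

# Rows where Col2 is NA or equal to 4 are updated in place.
# is.na() comes first so that NA rows evaluate to TRUE rather than NA.
DT[is.na(Col2) | Col2 == 4, Col2 := 999]
DT[]
```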

Nested full_join with suffixes for more than 2 data.frames

I want to merge several data.frames with some common columns and append a suffix to the column names to keep track of where the data in each column comes from.
I can do it easily with the suffix term in the first full_join, but when I do the second join, no suffixes are added. I can rename the third data.frame so it has suffixes, but I wanted to know if there is another way of doing it using the suffix term.
Here is an example code:
x = data.frame(col1 = c("a","b","c"), col2 = 1:3, col3 = 1:3)
y = data.frame(col1 = c("b","c","d"), col2 = 4:6, col3 = 1:3)
z = data.frame(col1 = c("c","d","a"), col2 = 7:9, col3 = 1:3)
> df = full_join(x, y, by = "col1", suffix = c("_x","_y")) %>%
full_join(z, by = "col1", suffix = c("","_z"))
> df
col1 col2_x col3_x col2_y col3_y col2 col3
1 a 1 1 NA NA 9 3
2 b 2 2 4 1 NA NA
3 c 3 3 5 2 7 1
4 d NA NA 6 3 8 2
I was expecting that col2 and col3 from data.frame z would have a "_z" suffix. I have tried using empty suffixes while merging two data.frames and it works.
I can work around by renaming the columns in z before doing the second full_join, but in my real data I have several common columns, and if I wanted to merge more data.frames it would complicate the code. This is my expected output.
> colnames(z) = paste0(colnames(z),"_z")
> df = full_join(x, y, by = "col1", suffix = c("_x","_y")) %>%
full_join(z, by = c("col1"="col1_z"))
> df
col1 col2_x col3_x col2_y col3_y col2_z col3_z
1 a 1 1 NA NA 9 3
2 b 2 2 4 1 NA NA
3 c 3 3 5 2 7 1
4 d NA NA 6 3 8 2
I have seen other similar problems in which an extra column is added to keep track of the source data.frame, but I was wondering why the suffix argument does not work with multiple joins.
PS: If I keep the first suffix empty, I can add suffixes in the second join, but that will leave the col2 and col3 from x without a suffix.
> df = full_join(x, y, by = "col1", suffix = c("","_y")) %>%
full_join(z, by = "col1", suffix = c("","_z"))
> df
col1 col2 col3 col2_y col3_y col2_z col3_z
1 a 1 1 NA NA 9 3
2 b 2 2 4 1 NA NA
3 c 3 3 5 2 7 1
4 d NA NA 6 3 8 2
You can do it like this:
full_join(x, y, by = "col1", suffix = c("","_y")) %>%
full_join(z, by = "col1", suffix = c("_x","_z"))
col1 col2_x col3_x col2_y col3_y col2_z col3_z
1 a 1 1 NA NA 9 3
2 b 2 2 4 1 NA NA
3 c 3 3 5 2 7 1
4 d NA NA 6 3 8 2
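Adding the suffix for x at the last join should do the trick. The suffix is only applied to non-key columns whose names occur in both inputs, so once col2 and col3 have been renamed to col2_x and col3_x in the first join, nothing collides with z in the second join and "_z" is never used.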
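For more than three data.frames, juggling the suffix argument gets awkward. One alternative is to tag the columns before joining, so that no collisions remain. The sketch below assumes dplyr is installed; tag_columns is a hypothetical helper name introduced here for illustration, not part of dplyr.

```r
library(dplyr)

x <- data.frame(col1 = c("a", "b", "c"), col2 = 1:3, col3 = 1:3)
y <- data.frame(col1 = c("b", "c", "d"), col2 = 4:6, col3 = 1:3)
z <- data.frame(col1 = c("c", "d", "a"), col2 = 7:9, col3 = 1:3)

# Hypothetical helper: append "_<tag>" to every column except the join key.
tag_columns <- function(df, tag, key = "col1") {
  is_key <- names(df) %in% key
  names(df)[!is_key] <- paste0(names(df)[!is_key], "_", tag)
  df
}

dfs <- list(x = x, y = y, z = z)
tagged <- Map(tag_columns, dfs, names(dfs))

# After tagging, only the key is shared, so no suffix argument is needed.
df <- Reduce(function(a, b) full_join(a, b, by = "col1"), tagged)
df
```

Adding a fourth data.frame then only means adding one more element to dfs.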

Renaming Variables Dynamically

I have a file named 'schema'. Based on that file, I need to rename the columns of other data frames. For example, 'Var1' of table A needs to be renamed to 'Col1'. Similarly, 'VarA' of table B needs to be renamed to 'ColA'. In short, all variables listed in the 'From' column of the schema need to be renamed to the corresponding value in the 'To' column.
Schema <- read.table(header = TRUE, text =
'Tables From To
A Var1 Col1
A Var2 Col2
A Var3 Col3
B VarA ColA
B VarB ColB
B VarC ColC
')
A <- data.frame(Var1 = 1:3,
Var2 = 2:4,
Var3 = 3:5)
B <- data.frame(VarA = 1:3,
VarB = 2:4,
VarC = 3:5)
We could use match:
lapply(list(A = A, B = B), function(i){
setNames(i, Schema$To[ match(names(i), Schema$From) ])
})
# $A
# Col1 Col2 Col3
# 1 1 2 3
# 2 2 3 4
# 3 3 4 5
#
# $B
# ColA ColB ColC
# 1 1 2 3
# 2 2 3 4
# 3 3 4 5
Or:
Anew <- setNames(A, Schema$To[ match(names(A), Schema$From) ])
Bnew <- setNames(B, Schema$To[ match(names(B), Schema$From) ])
Or list2env:
list2env(lapply(list(A = A, B = B), function(i){
setNames(i, Schema$To[ match(names(i), Schema$From) ])
}), envir = globalenv())
Edit: When there is no match in Schema, keep the column name as is:
list2env(lapply(list(A = A, B = B), function(i){
# check if there is a match, if not keep name unchanged
x <- as.character(Schema$To[ match(names(i), Schema$From) ])
ix <- which(is.na(x))
x[ ix ] <- names(i)[ ix ]
# return with updated names
setNames(i, x)
}), envir = globalenv())
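The match() calls above look names up across the whole Schema, which works as long as 'From' names are unique across tables. If two tables could share a 'From' name, the lookup can be restricted to the rows for one table. The following is a sketch under that assumption; rename_from_schema is a hypothetical helper name, and the Extra column is added only to show that unmatched names are kept.

```r
Schema <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
'Tables From To
A Var1 Col1
A Var2 Col2
A Var3 Col3
B VarA ColA
B VarB ColB
B VarC ColC
')

# Hypothetical helper: rename the columns of df using only the Schema rows
# that belong to table_name; columns without a match keep their name.
rename_from_schema <- function(df, table_name, schema = Schema) {
  rows <- schema[schema$Tables == table_name, ]
  idx <- match(names(df), rows$From)
  hit <- !is.na(idx)
  names(df)[hit] <- as.character(rows$To[idx[hit]])
  df
}

A <- data.frame(Var1 = 1:3, Var2 = 2:4, Extra = 3:5)
renamed <- rename_from_schema(A, "A")
names(renamed)
```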
The following code retrieves the names of the tables (A and B) from Schema and performs the name replacement:
r <- Map(function(v) {
r <- get(v)
names(r)[names(r) %in% Schema$From] <- as.character(Schema$To[Schema$From %in% names(r)])
assign(v,r)},
as.character(unique(Schema$Tables)))
which gives
> r
$A
Col1 Col2 Col3
1 1 2 3
2 2 3 4
3 3 4 5
$B
ColA ColB ColC
1 1 2 3
2 2 3 4
3 3 4 5
If you don't want result as list, you can do something like
list2env(Map(function(v) {
r <- get(v)
names(r)[names(r) %in% Schema$From] <- as.character(Schema$To[Schema$From %in% names(r)])
assign(v,r)},
as.character(unique(Schema$Tables))),envir = .GlobalEnv)
or
for (v in as.character(unique(Schema$Tables))) {
r <- get(v)
names(r)[names(r) %in% Schema$From] <- as.character(Schema$To[Schema$From %in% names(r)])
assign(v,r)
}
then you will keep your object A and B
> A
Col1 Col2 Col3
1 1 2 3
2 2 3 4
3 3 4 5
> B
ColA ColB ColC
1 1 2 3
2 2 3 4
3 3 4 5
Another option is a named lookup vector:
lut <- setNames(as.character(Schema$To), Schema$From)
setNames(A, lut[names(A)])
Col1 Col2 Col3
1 1 2 3
2 2 3 4
3 3 4 5
setNames(B, lut[names(B)])
ColA ColB ColC
1 1 2 3
2 2 3 4
3 3 4 5

Editing a dataframe to create a paired sample; removing records without a matching date in another group

I have done a bunch of searching for a solution to this and either can't find one or don't know it when I see it. I've seen some topics that are close to this but deal with matching between two different dataframes, whereas this is dealing with a single dataframe.
I have a dataframe with two groups (factors, col1) and a sampling date (date, col2), and then the measurement (numeric, col3). I would like to eventually run a statistical test on a paired sample between group A and B, so in order to create the paired sample, I want to only keep the records that have a measurement taken on the same day for both groups. In other words, remove the records in group A that do not have a corresponding measurement taken on the same day in group B, and vice versa. In the sample data below, that would result in rows 4 and 8 being removed. Another way of thinking of it is, how do I search for and remove records with only one occurrence of each date?
Sample data:
my.df <- data.frame(col1 = as.factor(c(rep("A", 4), rep("B", 4))),
col2 = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03",
"2001-01-04", "2001-01-01", "2001-01-02", "2001-01-03",
"2001-02-03")),
col3 = sample(8))
Here are a few alternatives:
1) ave
> subset(my.df, ave(col3, col2, FUN = length) > 1)
col1 col2 col3
1 A 2001-01-01 3
2 A 2001-01-02 2
3 A 2001-01-03 6
5 B 2001-01-01 7
6 B 2001-01-02 4
7 B 2001-01-03 1
2) split / Filter / do.call
> do.call("rbind", Filter(function(x) nrow(x) > 1, split(my.df, my.df$col2)))
col1 col2 col3
2001-01-01.1 A 2001-01-01 3
2001-01-01.5 B 2001-01-01 7
2001-01-02.2 A 2001-01-02 2
2001-01-02.6 B 2001-01-02 4
2001-01-03.3 A 2001-01-03 6
2001-01-03.7 B 2001-01-03 1
3) dplyr (2) translates nearly directly into a dplyr solution:
> library(dplyr)
> my.df %>% group_by(col2) %>% filter(n() > 1)
Source: local data frame [6 x 3]
Groups: col2
col1 col2 col3
1 A 2001-01-01 5
2 A 2001-01-02 1
3 A 2001-01-03 7
4 B 2001-01-01 2
5 B 2001-01-02 4
6 B 2001-01-03 6
4) data.table The last two solutions can also be translated to data.table
> data.table(my.df)[, if (.N > 1) .SD, by = col2]
col2 col1 col3
1: 2001-01-01 A 5
2: 2001-01-01 B 2
3: 2001-01-02 A 1
4: 2001-01-02 B 4
5: 2001-01-03 A 7
6: 2001-01-03 B 6
5) tapply
> na.omit(tapply(my.df$col3, my.df[c('col2', 'col1')], identity))
col1
col2 A B
2001-01-01 3 7
2001-01-02 2 4
2001-01-03 6 1
attr(,"na.action")
2001-02-03 2001-01-04
5 4
6) merge
> merge(subset(my.df, col1 == 'A'), subset(my.df, col1 == 'B'), by = 2)
col2 col1.x col3.x col1.y col3.y
1 2001-01-01 A 3 B 7
2 2001-01-02 A 2 B 4
3 2001-01-03 A 6 B 1
7) sqldf (6) is similar to the following sqldf solution:
> sqldf("select * from `my.df` A join `my.df` B
+ on A.col2 = B.col2 and A.col1 = 'A' and B.col1 = 'B'")
col1 col2 col3 col1 col2 col3
1 A 2001-01-01 5 B 2001-01-01 2
2 A 2001-01-02 1 B 2001-01-02 4
3 A 2001-01-03 7 B 2001-01-03 6
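The counting approaches in (1) to (4) keep any date that occurs more than once, which would also keep a date measured twice in the same group. If that can happen in the real data, a stricter check is to count the distinct groups per date. This is a base R sketch using the sample data; set.seed() is added here only to make it reproducible.

```r
set.seed(1)  # assumption: added only so the sketch is reproducible
my.df <- data.frame(col1 = as.factor(c(rep("A", 4), rep("B", 4))),
                    col2 = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03",
                                     "2001-01-04", "2001-01-01", "2001-01-02",
                                     "2001-01-03", "2001-02-03")),
                    col3 = sample(8))

# For every row, count how many distinct groups were measured on its date.
n_groups <- ave(as.integer(my.df$col1), format(my.df$col2),
                FUN = function(g) length(unique(g)))

# Keep only dates on which every group (here A and B) has a measurement.
paired <- my.df[n_groups == nlevels(my.df$col1), ]
paired
```

As in the question, rows 4 and 8 are dropped because their dates occur in only one group.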

Merge data frames and overwrite values

How do I merge 2 similar data frames but have one with greater importance?
For example:
Dataframe 1
Date Col1 Col2
jan 2 1
feb 4 2
march 6 3
april 8 NA
Dataframe 2
Date Col2 Col3
jan 9 10
feb 8 20
march 7 30
april 6 40
Merge these by Date, with dataframe 1 taking precedence and dataframe 2 filling the blanks:
DataframeMerge
Date Col1 Col2 Col3
jan 2 1 10
feb 4 2 20
march 6 3 30
april 8 6 40
EDIT - SOLUTION
commonNames <- names(df1)[which(colnames(df1) %in% colnames(df2))]
commonNames <- commonNames[commonNames != "key"]
dfmerge<- merge(df1,df2,by="key",all=T)
for(i in commonNames){
left <- paste(i, ".x", sep="")
right <- paste(i, ".y", sep="")
dfmerge[is.na(dfmerge[left]),left] <- dfmerge[is.na(dfmerge[left]),right]
dfmerge[right]<- NULL
colnames(dfmerge)[colnames(dfmerge) == left] <- i
}
merdat <- merge(dfrm1, dfrm2, by="Date") # seems self-documenting
# explanation for next line in text below.
merdat$Col2.x[ is.na(merdat$Col2.x) ] <- merdat$Col2.y[ is.na(merdat$Col2.x) ]
Then just rename 'merdat$Col2.x' to 'merdat$Col2' and drop 'merdat$Col2.y'. Filling Col2.x (from dataframe 1) with values from Col2.y gives dataframe 1 precedence, as requested.
In reply to the request for more comments: One way to update only sections of a vector is to construct a logical vector for indexing and apply it using "[" to both sides of an assignment. Another way is to use a logical vector only on the LHS of an assignment and build the RHS with rep() so that its length equals sum(logical.vector). The goal in both instances is for the assigned values to have the same length (and order) as the items being replaced.
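Both variants can be illustrated on small vectors. The names x, y, z and fill below are made up for this sketch and do not come from the question.

```r
x <- c(1, NA, 3, NA)
y <- c(10, 20, 30, 40)

# Variant 1: the same logical index on both sides of the assignment,
# so each NA in x is replaced by the value at the same position in y.
fill <- is.na(x)
x[fill] <- y[fill]
x

# Variant 2: logical index on the LHS only, with an RHS built by rep()
# whose length equals the number of TRUE values in the index.
z <- c(1, NA, 3, NA)
z[is.na(z)] <- rep(0, sum(is.na(z)))
z
```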
Update using the on= argument introduced in data.table v1.9.6 (which allows for ad hoc joins):
setDT(df1)[df2, `:=`(Col2 = ifelse(is.na(Col2), i.Col2, Col2),
Col3 = i.Col3), on="Date"][]
Here's a data.table solution. Make sure the Date column in df1 and df2 is a factor with the desired levels (for ordering).
require(data.table)
dt1 <- data.table(df1, key="Date")
dt2 <- data.table(df2, key="Date")
# Col2 refers to the Col2 of dt1 and i.Col2 refers to that of dt2
dt1[dt2, `:=`(Col3 = Col3, Col1 = Col1,
Col2 = ifelse(is.na(Col2), i.Col2, Col2))]
# the result is stored in dt1
> dt1
# Date Col1 Col2 Col3
# 1: jan 2 1 10
# 2: feb 4 2 20
# 3: march 6 3 30
# 4: april 8 6 40
Here is a dplyr solution. Credit to @docendo discimus.
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4))
y x1
1 A 1
2 B 2
3 C NA
4 D 4
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
y x1
1 A 5
2 B 6
3 C 7
dplyr
left_join(df1, df2, by="y") %>%
transmute(y, x1 = ifelse(is.na(x1.y), x1.x, x1.y))
y x1
1 A 5
2 B 6
3 C 7
4 D 4
Consider this example:
> d1 <- data.frame(x=1:4, a=2:5, b=c(3,4,5,NA))
> d1
x a b
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 NA
> d2 <- data.frame(x=1:4, b=c(6,7,8,9), c=11:14)
> d2
x b c
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
Now use merge and within, with ifelse:
> within(merge(d1, d2, by="x"), {b <- ifelse(is.na(b.x),b.y,b.x); b.x <- NULL; b.y <- NULL})
x a c b
1 1 2 11 3
2 2 3 12 4
3 3 4 13 5
4 4 5 14 9
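The ifelse(is.na(...)) pattern used in several answers above can also be written with dplyr::coalesce(), which returns the first non-missing value position by position. This is a sketch on the d1/d2 example and assumes dplyr is installed. It uses full_join() so that rows present in only one of the inputs would be kept as well.

```r
library(dplyr)

d1 <- data.frame(x = 1:4, a = 2:5, b = c(3, 4, 5, NA))
d2 <- data.frame(x = 1:4, b = c(6, 7, 8, 9), c = 11:14)

merged <- full_join(d1, d2, by = "x", suffix = c(".x", ".y")) %>%
  # d1's value wins; d2's value is used only where d1's is NA
  mutate(b = coalesce(b.x, b.y)) %>%
  select(-b.x, -b.y)
merged
```

Swapping the arguments to coalesce(b.y, b.x) would give d2 precedence instead.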
