Removing certain rows and replacing values based on a condition - r

I have the following data:
set.seed(2)
d <- data.frame(iteration=c(1,1,2,2,2,3,4,5,6,6,6),
value=sample(11),
var3=sample(11))
iteration value var3
1 1 3 7
2 1 8 4
3 2 6 8
4 2 2 3
5 2 7 9
6 3 9 11
7 4 1 10
8 5 4 1
9 6 10 2
10 6 11 6
11 6 5 5
Now, I want the following:
1. If an iteration has more than one row, remove its last row AND replace the value of the row before it (the new last row) with the value from the removed row.
So in the example above here is the output that I want:
d<-data.frame(iteration=c(1,2,2,3,4,5,6,6),
value=c(8,6,7,9,1,4,10,5),
var3=c(7,8,3,11,10,1,2,6))
iteration value var3
1 1 8 7
2 2 6 8
3 2 7 3
4 3 9 11
5 4 1 10
6 5 4 1
7 6 10 2
8 6 5 6

We can use data.table
library(data.table)
setDT(d)[, .(value = if(.N>1) c(value[seq_len(.N-2)], value[.N]) else value), iteration]
# iteration value
#1: 1 8
#2: 2 6
#3: 2 7
#4: 3 9
#5: 4 1
#6: 5 4
#7: 6 10
#8: 6 5
Update
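Based on the update in the OP's post, we can first create a new column 'value1' holding the lead (next) values of 'value' within each 'iteration', assign 'value1' to 'value' only for the rows indexed by 'i1' (the second-to-last row of every group with more than one row), and then drop the last row of those groups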
setDT(d)[, value1 := shift(value, type = "lead"), iteration]
i1 <- d[, if(.N >1) .I[.N-1], iteration]$V1
d[i1, value := value1]
d[d[, if(.N > 1) .I[-.N] else .I, iteration]$V1][, value1 := NULL][]
# iteration value var3
#1: 1 8 7
#2: 2 6 8
#3: 2 7 3
#4: 3 9 11
#5: 4 1 10
#6: 5 4 1
#7: 6 10 2
#8: 6 5 6

This base R solution using the split-apply-combine methodology returns the same values as @akrun's first data.table version, although the logic is different.
do.call(rbind, lapply(split(d, d$iteration),
function(i)
if(nrow(i) >= 3) i[-(nrow(i)-1),] else tail(i, 1)))
iteration value
1 1 8
2.3 2 6
2.5 2 7
3 3 9
4 4 1
5 5 4
6.9 6 10
6.11 6 5
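The idea is to split the data.frame into a list of data.frames along iteration. For each data.frame, check whether it has at least 3 rows: if yes, drop the second-to-last row (for the three-row groups here, that keeps the first and final rows); if no, return only the final row. do.call with rbind then combines these pieces into a single data.frame.
Note that this only works when value is the sole column besides iteration. It keeps the final row whole instead of copying its value into the row before it, so any other variable (such as var3) would come from the final row rather than from the row that is supposed to stay.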
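A minimal base R sketch of a variant that also copes with extra columns such as var3. It is not from the answers above: instead of keeping the final row, it copies the final value into the preceding row and then drops the final row. The data are typed in from the question's printout so that the result does not depend on the random seed.

```r
# Data copied from the printed example in the question
d <- data.frame(iteration = c(1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6),
                value     = c(3, 8, 6, 2, 7, 9, 1, 4, 10, 11, 5),
                var3      = c(7, 4, 8, 3, 9, 11, 10, 1, 2, 6, 5))

res <- do.call(rbind, lapply(split(d, d$iteration), function(g) {
  n <- nrow(g)
  if (n > 1) {
    g$value[n - 1] <- g$value[n]  # move the last value up one row
    g <- g[-n, ]                  # then drop the last row
  }
  g                               # single-row groups pass through unchanged
}))
rownames(res) <- NULL
res
```

All columns other than value stay attached to the rows that are kept, which is what the desired output in the question shows.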

Related

Remove identical values if the same as previous in a time series

I have a time series:
df <- data.frame(t=1:10, x= c(5,7,8,9,5,5,5,5,4,3))
I want to remove values that are identical to the previous value to obtain:
x = c(5,7,8,9,5,4,3)
I tried:
df[unique(df$x),]
But this gives the incorrect answer.
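A short sketch of why that attempt goes wrong: unique() returns the distinct values of x, and those values are then interpreted as row numbers.

```r
df <- data.frame(t = 1:10, x = c(5, 7, 8, 9, 5, 5, 5, 5, 4, 3))

idx <- unique(df$x)  # the distinct values 5 7 8 9 4 3, not row positions
wrong <- df[idx, ]   # so this selects rows 5, 7, 8, 9, 4 and 3
wrong
```

unique() also drops every later occurrence of a value, not only consecutive repeats, so it would lose the second run of 5s even if it were used on the values directly.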
You can do:
df[c(1, diff(df$x)) != 0, ]
t x
1 1 5
2 2 7
3 3 8
4 4 9
5 5 5
9 9 4
10 10 3
With dplyr, you can do:
df %>%
filter(x != lag(x, default = first(x)-1))
t x
1 1 5
2 2 7
3 3 8
4 4 9
5 5 5
6 9 4
7 10 3
In base R, we can use head and tail
subset(df, c(TRUE, head(x, -1) != tail(x, -1)))
# t x
#1 1 5
#2 2 7
#3 3 8
#4 4 9
#5 5 5
#9 9 4
#10 10 3
Another base solution would be using rle.
If you want to subset the data frame itself, you can index with the cumulative sum of the run lengths; note that this keeps the last row of each run (row 8 rather than row 5 for the run of 5s). If you only need the deduplicated x column, extract the values from rle instead:
df[cumsum(rle(df$x)$lengths), ] # dataframe subset
# t x
# 1 1 5
# 2 2 7
# 3 3 8
# 4 4 9
# 8 8 5
# 9 9 4
# 10 10 3
rle(df$x)$values # vector of values
# [1] 5 7 8 9 5 4 3
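If the first row of each run is wanted instead, so that the result matches the other answers, the start positions can be derived from the same rle object. A minimal sketch; first_idx and last_idx are names chosen here for illustration:

```r
df <- data.frame(t = 1:10, x = c(5, 7, 8, 9, 5, 5, 5, 5, 4, 3))

r <- rle(df$x)
last_idx  <- cumsum(r$lengths)            # position where each run ends
first_idx <- last_idx - r$lengths + 1     # position where each run starts
first_rows <- df[first_idx, ]
first_rows
```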
Or using data.table:
library(data.table)
setDT(df)[, rn := 1:.N, by = rleid(x)][rn == 1, .(t, x)]
# t x
# 1: 1 5
# 2: 2 7
# 3: 3 8
# 4: 4 9
# 5: 5 5
# 6: 9 4
# 7: 10 3
library(dplyr)
library(tidyr) # replace_na() comes from tidyr, not dplyr
df <- data.frame(t=1:10, x= c(5,7,8,9,5,5,5,5,4,3))
subsetVec <- df$x - lag(df$x) != 0
subsetVec <- replace_na(subsetVec, TRUE)
df[subsetVec,]

R data table: Assign a value to column based on reference column

I would like to assign a value into a column from a larger table, using another column as a reference.
E.g. data:
require(data.table)
dt <- data.table(N=c(1:5),GPa1=c(sample(0:5,5)),GPa2=c(sample(5:15,5)),
GPb1=c(sample(0:20,5)),GPb2=c(sample(0:10,5)),id=c("b","a","b","b","a"))
N GPa1 GPa2 GPb1 GPb2 id
1: 1 4 10 7 0 b
2: 2 5 15 19 7 a
3: 3 1 5 20 5 b
4: 4 0 13 3 4 b
5: 5 3 7 8 1 a
The idea is to get new columns Val1 and Val2. Any GP column ending in 1 is eligible for Val1 and any ending in 2 is eligible for Val2. The value to be inserted into the column is determined by the id column, per row.
So you can see for Val1, you'd draw on the GPb1 column, then GPa1, GPb1, GPb1 again and finally GPa1.
The final result would be;
N GPa1 GPa2 GPb1 GPb2 id Val1 Val2
1: 1 4 10 7 0 b 7 0
2: 2 5 15 19 7 a 5 15
3: 3 1 5 20 5 b 20 5
4: 4 0 13 3 4 b 3 4
5: 5 3 7 8 1 a 3 7
I did achieve the answer, but in quite a few lines after melting it etc.; I'm sure there must be an elegant way to do this in data.table. I was initially frustrated by the fact that paste0 doesn't work this way in data.table:
dt[1,paste0("GP",id,"1")]
but;
# The following gives a vector that is correct for Val1 (and works for 2)
diag(as.matrix(dt[,.SD,.SDcols=dt[,paste0("GP",id,"1")]]))
# I think the answer lies in `set`, but i've not had any luck.
for (i in 1:nrow(dt)) set(dt, i=dt[i,.SD,.SDcols=dt[,paste0("GP",id,"2")]], j=i, value=0)
The data is quite ugly this way so perhaps it's better to just use the melt method.
dt[id == "a", c("Val1", "Val2") := .(GPa1, GPa2)]
dt[id == "b", c("Val1", "Val2") := .(GPb1, GPb2)]
# N GPa1 GPa2 GPb1 GPb2 id Val1 Val2
#1: 1 2 13 5 8 b 5 8
#2: 2 3 8 7 2 a 3 8
#3: 3 5 11 19 1 b 19 1
#4: 4 4 5 6 9 b 6 9
#5: 5 1 15 1 10 a 1 15
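When there are many distinct id values, writing one line per id gets tedious. A base R sketch of the same lookup using matrix indexing with (row, column) pairs, which also avoids the n-by-n matrix built by the diag() attempt in the question. The helper name pick is hypothetical, and the data are typed in from the question's printout as a plain data.frame.

```r
df <- data.frame(N = 1:5,
                 GPa1 = c(4, 5, 1, 0, 3), GPa2 = c(10, 15, 5, 13, 7),
                 GPb1 = c(7, 19, 20, 3, 8), GPb2 = c(0, 7, 5, 4, 1),
                 id = c("b", "a", "b", "b", "a"))

# Hypothetical helper: for each row, pull the value from column GP<id><suffix>
pick <- function(data, suffix) {
  cols <- paste0("GP", data$id, suffix)   # column name wanted in each row
  m <- as.matrix(data[unique(cols)])      # only the columns actually needed
  # a two-column index matrix selects one (row, column) cell per row
  m[cbind(seq_len(nrow(data)), match(cols, colnames(m)))]
}

df$Val1 <- pick(df, 1)
df$Val2 <- pick(df, 2)
df
```

For a data.table the column selection inside pick would need adapting (for example with with = FALSE), since data[unique(cols)] is data.frame syntax.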

Row conditional column operations in data.table

I have a large data.table where, for each row, I need to make a computation based on part of the full data.table. As an example, consider the following data.table, and assume that for each row I want to compute the sum of the num variable over all rows where id2 matches the id1 of the current row and the time variable is within distance 2 of the time of the current row.
set.seed(123)
dat <- data.table(cbind(id1=sample(1:5,10,replace=T),
id2=sample(1:5,10,replace=T),
num=sample(1:10,10,replace=T),
time=sample(1:10,10,replace=T)))
This could easily be done by looping over each row like this
dat[,val:= 0]
for (i in 1:nrow(dat)){
this.val <- dat[ (id2==id1[i]) & (time>=time[i]-2) & (time<=time[i]+2),sum(num)]
dat[i,val:=this.val]
}
dat
The resulting data.table looks like this:
> dat
id1 id2 num time val
1: 2 5 9 10 6
2: 4 3 7 10 0
3: 3 4 7 7 10
4: 5 3 10 8 9
5: 5 1 7 1 2
6: 1 5 8 5 6
7: 3 2 6 8 17
8: 5 1 6 3 10
9: 3 2 3 4 0
10: 3 5 2 3 0
What is the proper/fast way to do things like this using data.table?
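We can use a non-equi self-join here: build a lookup table with 'id1', 'timeminus2' and 'timeplus2', join it to the data on 'id2' = 'id1' together with the two time inequalities, take the sum of 'num' for each lookup row (by = .EACHI), set the NA sums from unmatched rows to 0, and assign (:=) the result as the 'val' column of the original dataset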
tmp <- dat[.(id1 = id1, timeminus2 = time - 2, timeplus2 = time + 2),
.(val = sum(num)),
on = .(id2 = id1, time >= timeminus2, time <= timeplus2),
by = .EACHI
][is.na(val), val := 0][]
dat[, val := tmp$val][]
# id1 id2 num time val
# 1: 2 5 9 10 6
# 2: 4 3 7 10 0
# 3: 3 4 7 7 10
# 4: 5 3 10 8 9
# 5: 5 1 7 1 2
# 6: 1 5 8 5 6
# 7: 3 2 6 8 17
# 8: 5 1 6 3 10
# 9: 3 2 3 4 0
#10: 3 5 2 3 0
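To check a join like this, it can help to have a slow but obvious reference computation on a tiny table. The sketch below uses base R and three hand-made rows (not the question's data) and applies the same rule: sum num over rows whose id2 equals the current id1 and whose time is within 2 of the current time.

```r
# Hand-made toy data, small enough to verify by eye
dat <- data.frame(id1 = c(1, 2, 1), id2 = c(2, 1, 1),
                  num = c(10, 20, 30), time = c(1, 2, 5))

dat$val <- sapply(seq_len(nrow(dat)), function(i) {
  keep <- dat$id2 == dat$id1[i] & abs(dat$time - dat$time[i]) <= 2
  sum(dat$num[keep])  # sum() of an empty selection is 0, like the NA -> 0 step
})
dat
```

Comparing this column with the join result on a sample of rows is a cheap way to confirm that the on = conditions are written the right way round.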

R Subset matching contiguous blocks

I have a dataframe.
dat <- data.frame(k=c("A","A","B","B","B","A","A","A"),
a=c(4,2,4,7,5,8,3,2),b=c(2,5,3,5,8,4,5,8),
stringsAsFactors = F)
k a b
1 A 4 2
2 A 2 5
3 B 4 3
4 B 7 5
5 B 5 8
6 A 8 4
7 A 3 5
8 A 2 8
I would like to subset contiguous blocks based on variable k. This would be a standard approach.
#using rle rather than levels
kval <- rle(dat$k)$values
for(i in 1:length(kval))
{
subdf <- subset(dat,dat$k==kval[i])
print(subdf)
#do something with subdf
}
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
k a b
3 B 4 3
4 B 7 5
5 B 5 8
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
So the subsetting above obviously does not work the way I intended. Any elegant way to get these results?
k a b
1 A 4 2
2 A 2 5
k a b
1 B 4 3
2 B 7 5
3 B 5 8
k a b
1 A 8 4
2 A 3 5
3 A 2 8
We can use rleid from data.table to create a grouping variable
library(data.table)
setDT(dat)[, grp := rleid(k)]
dat
# k a b grp
#1: A 4 2 1
#2: A 2 5 1
#3: B 4 3 2
#4: B 7 5 2
#5: B 5 8 2
#6: A 8 4 3
#7: A 3 5 3
#8: A 2 8 3
We can group by 'grp' and do all the operations within the 'grp' using standard data.table methods.
Here is a base R option to create 'grp'
dat$grp <- with(dat, cumsum(c(TRUE, k[-1]!= k[-length(k)])))
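Once the run id exists, the blocks the question asks for can be produced with split(). A minimal base R sketch that keeps the id in a separate vector (grp) and resets the row names so that each block prints as in the desired output:

```r
dat <- data.frame(k = c("A", "A", "B", "B", "B", "A", "A", "A"),
                  a = c(4, 2, 4, 7, 5, 8, 3, 2), b = c(2, 5, 3, 5, 8, 4, 5, 8),
                  stringsAsFactors = FALSE)

# Run id: increases by one every time k differs from the previous row
grp <- cumsum(c(TRUE, dat$k[-1] != dat$k[-nrow(dat)]))

blocks <- lapply(split(dat, grp), function(block) {
  rownames(block) <- NULL  # number the rows from 1 within each block
  block
})

for (block in blocks) print(block)  # do something with each block here
```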

Assign value to group based on condition in column

I have a data frame that looks like the following:
> df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
value = c(3,4,3,4,5,6,6,4,9))
> df
group date value
1 1 1 3
2 1 2 4
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 6
8 3 8 4
9 3 9 9
I want to create a new column that contains the date value per group that is associated with the value "4" from the value column.
The following data frame shows what I hope to accomplish.
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
As we can see, group 1 has the newValue "2" because that is the date associated with the value "4". Similarly, group two has newValue 4 and group three has newValue 8.
I assume there is an easy way to do this using ave() or a range of dplyr/data.table functions, but I have been unsuccessful with my many attempts.
Here's a quick data.table one
library(data.table)
setDT(df)[, newValue := date[value == 4L], by = group]
df
# group date value newValue
# 1: 1 1 3 2
# 2: 1 2 4 2
# 3: 1 3 3 2
# 4: 2 4 4 4
# 5: 2 5 5 4
# 6: 2 6 6 4
# 7: 3 7 6 8
# 8: 3 8 4 8
# 9: 3 9 9 8
Here's a similar dplyr version
library(dplyr)
df %>%
group_by(group) %>%
mutate(newValue = date[value == 4L])
Or a possible base R solution using merge after filtering the data (will need some renaming afterwards)
merge(df, df[df$value == 4, c("group", "date")], by = "group")
Here is a base R option
df$newValue = rep(df$date[which(df$value == 4)], table(df$group))
Another alternative using lapply
do.call(rbind, lapply(split(df, df$group),
function(x){x$newValue = rep(x$date[which(x$value == 4)],
each = length(x$group)); x}))
# group date value newValue
#1.1 1 1 3 2
#1.2 1 2 4 2
#1.3 1 3 3 2
#2.4 2 4 4 4
#2.5 2 5 5 4
#2.6 2 6 6 4
#3.7 3 7 6 8
#3.8 3 8 4 8
#3.9 3 9 9 8
One more base R path:
df$newValue <- ave(`names<-`(df$value==4,df$date), df$group, FUN=function(x) as.numeric(names(x)[x]))
df
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
10 3 11 7 8
I tested this on groups of variable length. The date column is assigned as the names of the logical vector value == 4; within each group, the name at the TRUE position is then the date we want.
Data
df = data.frame(group = c(1,1,1,2,2,2,3,3,3,3),
date = c(1,2,3,4,5,6,7,8,9,11),
value = c(3,4,3,4,5,6,6,4,9,7))
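The answers above index with value == 4, which assumes that every group contains exactly one 4. A hedged base R sketch for data where that may not hold: match() returns the position of the first 4 in a group, or NA when there is none. The example data are altered from the question so that group 3 has no 4.

```r
# Group 3 deliberately has no value equal to 4
df <- data.frame(group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 value = c(3, 4, 3, 4, 5, 6, 6, 5, 9))

# ave() hands each group's row numbers to FUN; the single result is recycled
df$newValue <- ave(seq_len(nrow(df)), df$group, FUN = function(i) {
  df$date[i][match(4, df$value[i])]  # first matching date, or NA if no match
})
df
```

Whether NA is the right answer for a group without a 4 depends on the use case; the point is only that the lookup no longer errors or recycles unexpectedly.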

Resources