R data table: modify column values by referencing other columns by name

I have a melted data table with a column containing values that refer to other column names within the same table. I want to replace each row within that same column with the row value of the referenced column.
library("data.table")
## Example input data table
DT_input <- data.table(A = 1:10,
                       B = 11:20,
                       C = 21:30,
                       replace = c(rep("A", 5), rep("B", 3), rep("C", 2)))
## Desired output data table
DT_output <- data.table(A = 1:10,
                        B = 11:20,
                        C = 21:30,
                        replace = c(1:5, 16:18, 29:30))
My old approach, shown here, is very slow because of the row-by-row for loop:
## Attempted looping solution
for (kRow in seq_len(nrow(DT_input))) {
  e <- parse(text = DT_input[kRow, replace])
  DT_input[kRow, replace := eval(e)]
}

If we need a vectorized approach, use base R's row/column matrix indexing:
i1 <- cbind(seq_len(nrow(df1)), match(df1$replace, names(df1)[-4]))
df1$replace <- df1[-4][i1]
df1$replace
#[1] 1 2 3 4 5 16 17 18 29 30
With data.table, an option is Map or a for loop without the eval, but it would still not be vectorized (a sketch follows the data below).
Data
df1 <- as.data.frame(DT_input)
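For reference, the Map idea mentioned above might be sketched like this (my sketch, not from the original answer; it still loops over rows in R, so it is not vectorized):
## Look up, for each row, the value of the column named in `replace`;
## assigning the whole column at once lets := change its type to numeric
DT_input[, replace := unlist(Map(function(col, i) DT_input[[col]][i],
                                 replace, seq_len(.N)))]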

An option using data.table:
DT_input[, rn := .I]
DT_input[, replace :=
  DT_input[, DT_input[.SD, on = c("rn", .BY$replace), get(.BY$replace)],
           by = .(replace)]$V1
]
Output:
     A  B  C replace
 1:  1 11 21       1
 2:  2 12 22       2
 3:  3 13 23       3
 4:  4 14 24       4
 5:  5 15 25       5
 6:  6 16 26      16
 7:  7 17 27      17
 8:  8 18 28      18
 9:  9 19 29      29
10: 10 20 30      30
It will be slower than akrun's base R method.
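For completeness, the same matrix-indexing idea can also be written directly in data.table; a hedged sketch (run on a fresh DT_input; .SDcols = A:C and the column names come from the example data):
## Vectorized: index a matrix of the value columns by (row, matched column)
DT_input[, replace := as.matrix(.SD)[cbind(seq_len(.N), match(replace, names(.SD)))],
         .SDcols = A:C]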


Keep the row if the specific column is the minimum value of that row

I cannot share the dataset, but I will explain it as best as I can.
The dataset has 50 columns, 48 of which are in Y/m/d h:m:s format. The data also has many NAs, but they must not be removed.
Let's say there is a column B. I want to remove a row if the value of B is not the earliest in that row.
How can I do this in R? For example, the original would be like this:
df <- data.frame(
  A = c(11,19,17,6,13),
  B = c(18,9,5,16,12),
  C = c(14,15,8,87,16))
A B C
1 11 18 14
2 19 9 15
3 17 5 8
4 6 16 87
5 13 12 16
but I want this:
A B C
1 19 9 15
2 17 5 8
3 13 12 16
You could use apply() to find the minimum for each row.
df |> subset(B == apply(df, 1, min, na.rm = TRUE))
# A B C
# 2 19 9 15
# 3 17 5 8
# 5 13 12 16
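One caveat for the real data: apply() coerces a data frame to a matrix, so with 48 date-time columns everything would be coerced to character before the minimum is taken. A pmin()-based variant avoids that coercion; a sketch under the example's all-numeric columns (for the real data you would pass only the date-time columns to pmin()):
## pmin() works column-wise without matrix coercion, so POSIXct columns survive;
## na.rm = TRUE handles the NAs mentioned in the question
df |> subset(B == do.call(pmin, c(df, na.rm = TRUE)))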
The tidyverse equivalent is
library(tidyverse)
df %>% filter(B == pmap_dbl(across(A:C), min, na.rm = TRUE))
If you are willing to use data.table, you could do the following for the example.
library(data.table)
setDT(df)
df[(B < A & B < C)]
A B C
1: 19 9 15
2: 17 5 8
3: 13 12 16
More generally, you could do
df <- as.data.table(df)
df[, min := do.call(pmin, .SD)][B == min, !"min"]
.SDcols in the first [ lets you control which columns to take the minimum over, e.g. to exclude some, as in the sketch below. I am not super knowledgeable about the inner workings of data.table, but I believe creating this new column is efficient RAM-wise.
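For instance, a sketch that takes the minimum over B and C only (column names taken from the example data; na.rm = TRUE added for the NAs mentioned in the question):
## Restrict .SD to B and C, then keep rows where B is that minimum
df[, min := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = c("B", "C")][B == min, !"min"]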

Run next iteration on output from previous iteration in R

Lets say I have a data frame
mydata <- data.frame(x = 1:25,
                     y = 26:50)
and another data frame with a set of min and max values
df.remove <- data.frame(min = c(3,10,22,17),
                        max = c(6,13,24,20))
I'm looking to create an output where the rows of mydata whose column x value falls between any min/max pair in df.remove are deleted, thus giving me this output data frame:
x y
1 26
2 27
7 32
8 33
9 34
14 39
15 40
16 41
21 46
25 50
I figured I could use the between() function to delete the values that fall within a range, and since I would be looking at the min and max values from each row in df.remove, I attempted a loop:
result <- data.frame()
for(i in 1:nrow(df.remove)) {
  result <- mydata[!between(mydata$x, df.remove$min[i], df.remove$max[i]), ]
}
This, for obvious reasons, returns the output with only the last set of min and max values removed. To get the output I am after, I would likely have to run each iteration on the output of the previous iteration instead of on the original mydata, but I couldn't find a way to do it.
What you are looking for is known as a non-equi anti-join. This can be done pretty easily with the data.table package. Consider
library(data.table)
mydata <- data.frame(x = 1:25, y = 26:50)
df.remove <- data.frame(min = c(3,10,22,17), max = c(6,13,24,20))
setDT(mydata)[!df.remove, on = .(x >= min, x <= max)] # drop rows where min <= x <= max
Output
x y
1: 1 26
2: 2 27
3: 7 32
4: 8 33
5: 9 34
6: 14 39
7: 15 40
8: 16 41
9: 21 46
10: 25 50
In your code, result can only hold the last update, because each iteration operates on the original mydata data frame and assigns that single update to result.
Instead, you should operate on the updated data frame. You could try the following code.
result <- mydata
for(i in 1:nrow(df.remove)) {
  result <- result[!between(result$x, df.remove$min[i], df.remove$max[i]), ]
}
After assigning the original mydata data frame to result, you are able to update it iteratively.
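The same feed-the-output-back-in pattern can also be written without an explicit loop using Reduce(); a sketch, assuming dplyr is loaded for between():
library(dplyr)
## The accumulator d starts as mydata and shrinks on each min/max pair
result <- Reduce(
  function(d, i) d[!between(d$x, df.remove$min[i], df.remove$max[i]), ],
  seq_len(nrow(df.remove)),
  init = mydata
)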
A base R approach -
res <- subset(mydata, !x %in% unlist(Map(`:`, df.remove$min, df.remove$max)))
res
# x y
#1 1 26
#2 2 27
#7 7 32
#8 8 33
#9 9 34
#14 14 39
#15 15 40
#16 16 41
#21 21 46
#25 25 50
Using Map we create the sequences between the min and max values, unlist them into a single vector, and drop the rows whose x value appears in that vector.
Another option using fuzzyjoin package -
fuzzyjoin::fuzzy_anti_join(mydata, df.remove,
                           by = c('x' = 'min', 'x' = 'max'),
                           match_fun = c(`>=`, `<=`))
Since you're using the dplyr function between(), we can use dplyr's filter(). For each row of mydata we apply between() to each row of df.remove to see whether the value of column x lies in that range. mapply() accomplishes this (since there are two vectors to pass to the function) and produces a matrix of TRUE/FALSE. Then apply() across rows checks whether any value in a row is TRUE. Finally, we negate the result in filter() to drop every row whose x falls inside one of the target ranges:
library(dplyr)
mydata %>%
  filter(
    !mapply(function(left, right) between(mydata$x, left, right),
            left = df.remove$min, right = df.remove$max) %>%
      apply(., 1, any)
  )
Returns:
x y
1 1 26
2 2 27
3 7 32
4 8 33
5 9 34
6 14 39
7 15 40
8 16 41
9 21 46
10 25 50
Just because this is an interesting problem with several possible solutions, here is another approach using metaprogramming.
The idea is that we turn df.remove into a list of expressions, which we then splice into filter(mydata, !!! .) with the !!! operator.
One way to get the list of expressions is rowwise summarise, creating the expressions with bquote(), which lets us evaluate the parts wrapped in .(), in our case the min and max values.
And although this is possible, I'd probably use either @ekoam's data.table or @Ronak's base R approach.
library(dplyr)
df.remove %>%
  rowwise() %>%
  summarise(x = list(bquote(!x %in% c(.(min):.(max))))) %>%
  pull(x) %>%
  filter(mydata, !!! .)
#> `summarise()` has ungrouped output. You can override using the `.groups`
#> argument.
#> x y
#> 1 1 26
#> 2 2 27
#> 3 7 32
#> 4 8 33
#> 5 9 34
#> 6 14 39
#> 7 15 40
#> 8 16 41
#> 9 21 46
#> 10 25 50
Created on 2022-01-23 by the reprex package (v0.3.0)
Using data.table::inrange.
library(data.table)
mydata[!mydata$x %inrange% df.remove, ]
# x y
# 1 1 26
# 2 2 27
# 7 7 32
# 8 8 33
# 9 9 34
# 14 14 39
# 15 15 40
# 16 16 41
# 21 21 46
# 25 25 50
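The %inrange% operator is shorthand for the inrange() function; spelled out as a call (bounds are inclusive by default):
library(data.table)
## Equivalent to mydata$x %inrange% df.remove
mydata[!inrange(mydata$x, df.remove$min, df.remove$max), ]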

lag/lead entire dataframe in R

I am having a very hard time leading or lagging an entire dataframe. I am able to shift individual columns with the following attempts, but not the whole thing:
require('DataCombine')
df_l <- slide(df, Var = "var1", slideBy = -1)
Using colnames(x_ret_mon) as Var does not work; I am told the variable names are not found in the dataframe.
This attempt shifts the columns right but not down:
df_l<- dplyr::lag(df)
This only creates new variables for the lagged values, and then I do not know how to effectively delete the old non-lagged ones:
df_l<-shift(df, n=1L, fill=NA, type=c("lead"), give.names=FALSE)
Use dplyr::mutate_all to apply lags or leads to all columns.
df = data.frame(a = 1:10, b = 21:30)
dplyr::mutate_all(df, lag)
a b
1 NA NA
2 1 21
3 2 22
4 3 23
5 4 24
6 5 25
7 6 26
8 7 27
9 8 28
10 9 29
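A side note: in dplyr 1.0 and later, mutate_all() is superseded; the across() spelling of the same operation is
library(dplyr)
## Apply lag() to every column
mutate(df, across(everything(), lag))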
I don't see the point in lagging all columns in a data.frame. Wouldn't that just correspond to rbinding an NA row to your original data.frame (minus its last row)?
df = data.frame(a = 1:10, b = 21:30)
rbind(NA, df[-nrow(df), ])
# a b
#1 NA NA
#2 1 21
#3 2 22
#4 3 23
#5 4 24
#6 5 25
#7 6 26
#8 7 27
#9 8 28
#10 9 29
And similarly for leading all columns.
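A sketch of that lead analogue: drop the first row and append an NA row.
## Lead all columns by one
rbind(df[-1, ], NA)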
A couple more options
data.frame(lapply(df, lag))
require(purrr)
map_df(df, lag)
If your data is a data.table you can do
require(data.table)
as.data.table(shift(df))
Or, if you're overwriting df
df[] <- lapply(df, lag) # Thanks Moody
require(magrittr)
df %<>% map_df(lag)

fill missing values with value from previous column

I have a data.frame with some columns containing missing values, and I want the missing values to be filled in with data from the previous column. For example:
country <- c('a','b','c')
yr01 <- c(15,16,7)
yr02 <- c(NA,18,NA)
yr03 <- c(20,22,NA)
yr04 <- c(15,18,19)
tab <- data.frame(country,yr01,yr02,yr03,yr04)
tab
country yr01 yr02 yr03 yr04
1 a 15 NA 20 15
2 b 16 18 22 18
3 c 7 NA NA 19
How can I make it so that each NA is replaced by the previous value? For example, in country a, column yr02 will be equal to 15, and in country c, columns yr02 and yr03 will be 7. Thanks!
It's usually easier to work with columns, but we can apply to rows the standard answer from the R-FAQ Replace NAs with latest non-NA value.
tab[-1] = t(apply(tab[-1], 1, zoo::na.locf))
tab
# country yr01 yr02 yr03 yr04
# 1 a 15 15 20 15
# 2 b 16 18 22 18
# 3 c 7 7 7 19
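One caveat: by default zoo::na.locf() drops leading NAs, so a row that starts with NA would come back shorter and break the t(apply(...)) reshaping. Passing na.rm = FALSE keeps leading NAs in place; a sketch:
## Rows beginning with NA stay rectangular (their leading NAs are kept)
tab[-1] = t(apply(tab[-1], 1, zoo::na.locf, na.rm = FALSE))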

Remove rows based on factor-levels

I have a data.frame df in format "long".
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
                 time = c(11,11,11,22,22,22,33),
                 value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]
df
site time value
1 A 11 12
2 A 22 -24
3 A 33 -30
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
Question
How do I remove the rows where a unique element of df$time is not present for each of the levels of df$site?
In this case I want to remove df[3,], because for df$time the timestamp 33 is only present for site A and not for site B and site C.
Desired output:
df.trimmed
site time value
1 A 11 12
2 A 22 -24
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
The data.frame easily has 800k rows and 200k unique timestamps. I don't want to use loops, but I don't know how to use vectorized functions like apply() or lapply() for this case.
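For reference, one compact vectorized base R take (my sketch, not from the original answers; it assumes no duplicated site/time pairs, as in the example data):
## Keep only the timestamps that occur once per site
n_sites <- length(unique(df$site))
df[ave(seq_along(df$time), df$time, FUN = length) == n_sites, ]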
Here's another possible solution using the data.table package:
unTime <- unique(df$time)
library(data.table)
DT <- data.table(df, key = "site")
(notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
# [1] 3
DT[time %in% unTime[-notInAll]]
# site time value
# [1,] A 11 3
# [2,] A 22 11
# [3,] B 11 -6
# [4,] B 22 -2
# [5,] C 11 -19
# [6,] C 22 -14
EDIT from Matthew
Nice. Or a slightly more direct way :
DT = as.data.table(df)
tt = DT[,length(unique(site)),by=time]
tt
time V1
1: 11 3
2: 22 3
3: 33 1
tt = tt[V1==max(V1)] # See * below
tt
time V1
1: 11 3
2: 22 3
DT[time %in% tt$time]
site time value
1: A 11 7
2: A 22 -2
3: B 11 8
4: B 22 -10
5: C 11 3
6: C 22 1
In case no time is present in all sites, where the final result should be empty (as Ben pointed out in the comments), the step marked * above could be:
tt = tt[V1==length(unique(DT$site))]
Would rle work for you?
df <- df[order(df$time), ]
df <- subset(df, !time %in% rle(df$time)$values[rle(df$time)$lengths == 1])
df <- df[order(df$site), ]
df
## site time value
## 1 A 11 17
## 4 A 22 -3
## 2 B 11 8
## 5 B 22 5
## 3 C 11 0
## 6 C 22 13
Re-looking at your data, it seems that this solution might be too simple for your needs though....
Update
Here's an approach that should be better than the rle solution above. Rather than looking for a run length of 1, we delete rows that do not match certain conditions on the results of table(df$site, df$time). To illustrate, I've also added some more fake data.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
                 time = c(11,11,11,22,22,22,33),
                 value = ceiling(rnorm(7)*10))
df2 <- data.frame(site = rep(c("A","B","C"), 1, 7),
                  time = c(14,14,15,15,16,16,16),
                  value = ceiling(rnorm(7)*10))
df <- rbind(df, df2)
df <- df[order(df$site), ]
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
                               >= length(levels(df$site)))))
df2 <- merge(df, data.frame(temp), by.x = "time", by.y = "temp")
df2 <- df2[order(df2$site), ]
df2
## time site value
## 3 11 A -2
## 4 16 A -2
## 7 22 A 2
## 1 11 B -16
## 5 16 B 3
## 8 22 B -6
## 2 11 C 8
## 6 16 C 11
## 9 22 C -10
Here's the result of tabulating and summing up the site/time combination:
colSums(with(df, table(site, time)))
## 11 14 15 16 22 33
## 3 2 2 3 3 1
Thus, if we were interested in keeping timestamps present in at least two of the sites, we could change the line >= length(levels(df$site)) (in this example, 3) to >= length(levels(df$site)) - 1 (obviously, 2).
Not sure if this solution is useful to you at all, but I thought I would share it to show the flexibility in solutions we have with R.
