Find the last row in a data frame that meets certain criteria - r

I'm looking for a way to refer to a previous row in my data frame that has one column value in common with the 'current row'. Basically, if this were my data frame
A B D
1 10
4 5
6 6
3 25
1 40
I would want D(i) to contain the B value of the last row for which A has the same value as A(i). So for the last row that should be 10.

You could try this:
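for (i in seq_len(nrow(dat))) {
  # earlier rows whose A matches the current row's A
  prev <- which(dat$A[seq_len(i - 1)] == dat$A[i])
  # keep the NA when there is no earlier match
  if (length(prev) > 0) dat$D[i] <- dat$B[tail(prev, 1)]
}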
Results:
> dat
A B D
1 1 10 NA
2 4 5 NA
3 6 6 NA
4 3 25 NA
5 1 40 10
Data:
dat <- read.csv(text="A,B,D
1,10
4,5
6,6
3,25
1,40")

You may try
library(dplyr)
df1%>%
group_by(A) %>%
mutate(D=lag(B))
# A B D
#1 1 10 NA
#2 4 5 NA
#3 6 6 NA
#4 3 25 NA
#5 1 40 10
Or
library(data.table)#data.table_1.9.5
setDT(df1)[, D:=shift(B), A][]
data
df1 <- structure(list(A = c(1L, 4L, 6L, 3L, 1L), B = c(10L, 5L, 6L,
25L, 40L)), .Names = c("A", "B"), class = "data.frame",
row.names = c(NA, -5L))
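For comparison, the same grouped lag can be sketched in base R with ave(), which applies a function within each group of A and returns the results in the original row order. This is a minimal sketch that assumes the rows are already in the order that defines "previous"; the helper name lag_one is made up for illustration.

```r
# Example data from the question
dat <- data.frame(A = c(1L, 4L, 6L, 3L, 1L),
                  B = c(10L, 5L, 6L, 25L, 40L))

# Hypothetical helper: shift a vector down by one position, padding with NA
lag_one <- function(b) c(NA, head(b, -1))

# Within each value of A, D takes the B value of the previous row of that group
dat$D <- ave(dat$B, dat$A, FUN = lag_one)
dat
```

Only the last row gets a non-missing D here, because 1 is the only value of A that occurs more than once.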

Related

How to remove rows if values from a specified column in data set 1 does not match the values of the same column from data set 2 using dplyr

I have 2 data sets, both include ID columns with the same IDs. I have already removed rows from the first data set. For the second data set, I would like to remove any rows associated with IDs that do not match the first data set by using dplyr.
In other words, every ID in DF2 must also be in DF1; if it is not, that row must be removed from DF2.
For example:
DF1
ID X Y Z
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
DF2
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
DF2 once rows have been removed
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
I used anti_join() which shows me the difference in rows but I cannot figure out how to remove any rows associated with IDs that do not match the first data set by using dplyr.
Try with paste
i1 <- do.call(paste, DF2) %in% do.call(paste, DF1)
# if it is only to compare the 'ID' columns
i1 <- DF2$ID %in% DF1$ID
DF3 <- DF2[i1,]
DF3
ID A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 5 5 5 5
5 6 6 6 6
DF4 <- DF2[!i1,]
DF4
ID A B C
4 4 4 4 4
7 7 7 7 7
data
DF1 <- structure(list(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L,
5L, 6L), Y = c(1L, 2L, 3L, 5L, 6L), Z = c(1L, 2L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
DF2 <- structure(list(ID = 1:7, A = 1:7, B = 1:7, C = 1:7), class = "data.frame", row.names = c(NA,
-7L))
# Load package
library(dplyr)
# Load dataframes
df1 <- data.frame(
ID = 1:6,
X = 1:6,
Y = 1:6,
Z = 1:6
)
df2 <- data.frame(
ID = 1:7,
X = 1:7,
Y = 1:7,
Z = 1:7
)
# Include all rows in df1
df1 %>%
left_join(df2)
Joining, by = c("ID", "X", "Y", "Z")
ID X Y Z
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
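Since the question asks for dplyr specifically, a filtering join may be the closest fit: semi_join() keeps the rows of its first argument that have a match in the second, without adding any columns, and anti_join() returns the complement. The sketch below uses the DF1 and DF2 from the question and assumes that ID is the only column that should be compared.

```r
library(dplyr)

# Data as given in the question (DF1 has no row with ID 4 or 7)
DF1 <- data.frame(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L, 5L, 6L),
                  Y = c(1L, 2L, 3L, 5L, 6L), Z = c(1L, 2L, 3L, 5L, 6L))
DF2 <- data.frame(ID = 1:7, A = 1:7, B = 1:7, C = 1:7)

# Keep only the DF2 rows whose ID also appears in DF1
kept <- semi_join(DF2, DF1, by = "ID")

# The rows that were dropped, for checking
dropped <- anti_join(DF2, DF1, by = "ID")

kept
dropped
```

With this data, kept holds IDs 1, 2, 3, 5 and 6 with DF2's own columns, and dropped holds IDs 4 and 7.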

r apply functions over list of data frames

Help with applying functions over a list of data frames.
I don't often work with lists or functions so following a 3 hour search and test I need some assistance.
I have a list of 2 data frames as follows (real list has 40+):
df1 <- structure(list(ID = 1:4,
Period = c("C_2021", "C_2021", "C_2021", "C_2021"),
subjects = c(2044L, 2044L, 2058L, 2059L),
Q_1_A = c(1L, 1L, 4L, 6L),
Q_1_B = c(6L, 1L, 6L, NA),
col3 = c(4L, 6L, 5L, 2L),
col4 = c(3L, 5L, 4L, 4L)),
class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(ID = 1:4,
Period = c("C_2022", "C_2022", "C_2022", "C_2022"),
subjects = c(2058L, 2058L, 2065L, 2066L),
Q_1_A = c(2L, 5L, 5L, 6L),
Q_1_B = c(6L, 1L, 4L, NA),
col3 = c(NA, 6L, 5L, 3L),
col4 = c(3L, 6L, 5L, 5L)),
class = "data.frame", row.names = c(NA, -4L))
The structure of the datasets is as follows:
df1
ID Period subjects Q_1_A Q_1_B col3 col4
1 1 C_2021 2044 1 6 4 3
2 2 C_2021 2044 1 1 6 5
3 3 C_2021 2058 4 6 5 4
4 4 C_2021 2059 6 NA 2 4
df2
ID Period subjects Q_1_A Q_1_B col3 col4
1 1 C_2022 2058 2 6 NA 3
2 2 C_2022 2058 5 1 6 6
3 3 C_2022 2065 5 4 5 5
4 4 C_2022 2066 6 NA 3 5
The list of df's
dflist <- list(df1, df2)
I would like to do 2 things:
1. Conditional removal of string before 2nd underscore
I would like to remove characters before the 2nd underscore only in columns beginning with "Q". Column "Q_1_A" would become "A". The code should only impact columns starting with "Q".
Note: The ifelse is important - in the real data there are other columns with 2 underscores that cannot be modified, and the columns in data frames may be in different orders so it needs to be done by column name.
#doesnt work (cant seem to get purr working either)
dflist <- lapply(dflist, function(x) {
names(x) <- ifelse(starts_with(names(x), "Q"), sub("^[^_]*_", "", names(x)), .x)
x})
2. Once column names are updated, remove columns present on a list.
Note: In the real data there are a lot of columns in each df, it's much easier to list the columns to keep rather than remove.
List of columns to keep below
The list is structured assuming the renaming above has been completed.
col_keep <- c("ID", "Period", "subjects", "A", "B")
#doesnt work
dflist <- lapply(dflist, function(x) {
x[(names(x) %in% col_keep)]
x})
**UPDATE** I think actually the following will work
dflist <- lapply(dflist, function(x)
{x <- x %>% select(any_of(col_keep))})
#is the best way to do it?
Help would be greatly appreciated.
For the first requirement, apply this
dflist <- lapply(dflist, function(x) {
names(x) <- ifelse(startsWith(names(x), "Q"),
gsub("[Q_0-9]+", "" , names(x)), names(x))
x})
and the second
col_keep <- c("ID", "Period", "subjects", "A", "B")
dflist <- lapply(dflist, function(x) subset(x , select = col_keep))
In base R:
lapply(dflist, \(x)setNames(x, sub('^Q([^_]*_){2}', '', names(x)))[col_keep])
[[1]]
ID Period subjects A B
1 1 C_2021 2044 1 6
2 2 C_2021 2044 1 1
3 3 C_2021 2058 4 6
4 4 C_2021 2059 6 NA
[[2]]
ID Period subjects A B
1 1 C_2022 2058 2 6
2 2 C_2022 2058 5 1
3 3 C_2022 2065 5 4
4 4 C_2022 2066 6 NA
in tidyverse:
library(tidyverse)
dflist %>%
map(~rename_with(.,~str_remove(.,'([^_]+_){2}'), starts_with('Q'))%>%
select(all_of(col_keep)))
[[1]]
ID Period subjects A B
1 1 C_2021 2044 1 6
2 2 C_2021 2044 1 1
3 3 C_2021 2058 4 6
4 4 C_2021 2059 6 NA
[[2]]
ID Period subjects A B
1 1 C_2022 2058 2 6
2 2 C_2022 2058 5 1
3 3 C_2022 2065 5 4
4 4 C_2022 2066 6 NA
Another solution using base R:
# wrap up code for ease of reading
validate_names <- function(df) {
setNames(df, ifelse(grepl("^Q", names(df)),
gsub("[Q_0-9]", "", names(df)), names(df)))
}
# lapply to transform list, then subset with character vector
lapply(dflist, validate_names) |>
lapply(`[`, col_keep)
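The answers above use two different renaming strategies, and they can diverge on names other than the ones shown in the question. The sketch below compares them on a made-up vector of names (the entries "a_b_c" and "Q_2_Q10" are hypothetical, added only to show the difference): removing every Q, underscore and digit can strip more than the prefix, while an anchored pattern removes only the text up to the second underscore.

```r
# Hypothetical column names; only Q_1_A-style names come from the question
nms <- c("ID", "Q_1_A", "Q_12_B", "a_b_c", "Q_2_Q10")
is_q <- startsWith(nms, "Q")

# Anchored pattern: drop everything up to and including the 2nd underscore
prefix_only <- ifelse(is_q, sub("^[^_]*_[^_]*_", "", nms), nms)

# Character-class pattern: drop every Q, underscore and digit anywhere
char_class <- ifelse(is_q, gsub("[Q_0-9]+", "", nms), nms)

prefix_only
char_class
```

Both leave "ID" and "a_b_c" untouched because of the startsWith() check, but the character-class version reduces "Q_2_Q10" to an empty string, so the anchored pattern is probably the safer choice when the part to keep may itself contain a Q, a digit or an underscore.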

Creating multiple new columns conditional on values of previous columns with modified names in R

I have a dataframe that looks like the following, but with more columns and rows.
dog_c cat_c cheese_c hat_c
3 4 3 2
3 1 2 5
5 2 1 4
I would like to create a dataframe that looks like the following. I want to square all the values in a given column from the original dataframe and then create a column name that's the same as the original but with a 2 on the end. I would then like to attach it to the original dataframe.
dog_c cat_c cheese_c hat_c dog_c2 cat_c2 cheese_c2 hat_c2
3 4 3 2 9 16 9 4
3 1 2 5 9 1 4 25
5 2 1 4 25 4 1 16
I know I could do this in the following way for each new column.
df$dog_c2 <- (df$dog_c)^2
Does anyone have suggestions for how I can do this more efficiently?
You can do the same for the whole data frame at once:
df[paste0(names(df), 2)] <- df^2
df
# dog_c cat_c cheese_c hat_c dog_c2 cat_c2 cheese_c2 hat_c2
#1 3 4 3 2 9 16 9 4
#2 3 1 2 5 9 1 4 25
#3 5 2 1 4 25 4 1 16
If you want to do this in dplyr, you can use mutate_all
library(dplyr)
df %>% mutate_all(list(`2` = ~. ^2))
If we want to do this only for selected columns, we can do :
cols <- c('dog_c','cat_c') #Use regex if there are lot of columns.
df[paste0(cols, 2)] <- df[cols]^2
# dog_c cat_c cheese_c hat_c dog_c2 cat_c2
#1 3 4 3 2 9 16
#2 3 1 2 5 9 1
#3 5 2 1 4 25 4
Or use mutate_at with dplyr
df %>% mutate_at(cols, list(`2` = ~.^2))
data
df <- structure(list(dog_c = c(3L, 3L, 5L), cat_c = c(4L, 1L, 2L),
cheese_c = 3:1, hat_c = c(2L, 5L, 4L)), class = "data.frame",
row.names = c(NA, -3L))
We can use mutate with across in recent versions of dplyr (the naming argument is .names)
library(dplyr)
df1 %>%
mutate(across(everything(), ~ .x^2, .names = "{.col}2"))
# dog_c cat_c cheese_c hat_c dog_c2 cat_c2 cheese_c2 hat_c2
#1 3 4 3 2 9 16 9 4
#2 3 1 2 5 9 1 4 25
#3 5 2 1 4 25 4 1 16
Or for selected columns
nm1 <- c('dog_c','cat_c')
df1 %>%
mutate(across(all_of(nm1), ~ .x^2, .names = "{.col}2"))
# dog_c cat_c cheese_c hat_c dog_c2 cat_c2
#1 3 4 3 2 9 16
#2 3 1 2 5 9 1
#3 5 2 1 4 25 4
Or with data.table
library(data.table)
setDT(df1)[, paste0(nm1, 2) := .SD^2, .SDcols = nm1]
data
df1 <- structure(list(dog_c = c(3L, 3L, 5L), cat_c = c(4L, 1L, 2L),
cheese_c = 3:1, hat_c = c(2L, 5L, 4L)), class = "data.frame",
row.names = c(NA, -3L))
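The comment about using a regex when there are many columns can be sketched in base R as follows. This assumes that the columns to square are exactly those whose names end in "_c"; the extra id column is hypothetical and is there only to show that non-matching columns are left alone.

```r
# Data from the question plus a hypothetical id column that must not be squared
df <- data.frame(dog_c = c(3L, 3L, 5L), cat_c = c(4L, 1L, 2L),
                 cheese_c = 3:1, hat_c = c(2L, 5L, 4L), id = 1:3)

# Pick the columns by name pattern instead of typing them out
cols <- grep("_c$", names(df), value = TRUE)

# Square them and attach the results under the original names plus "2"
df[paste0(cols, 2)] <- df[cols]^2
df
```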

How do you find if a number is between a range of multiple mins and max numbers

In R I have:
DataSet1
A
1
4
13
19
22
DataSet2
(min)B (max)C
4 6
8 9
12 15
16 18
I am looking to set up a binary column D based on whether A is between B and C.
So D would be added to dataset 1 and calculated as follows:
A D
1 0
4 1
13 1
19 0
22 0
I have tried using the InRange function, but it only checks one row of B and C rather than all intervals.
Any help would be much appreciated.
Here is one option using fuzzy_left_join
library(fuzzyjoin)
library(dplyr)
df1 %>% fuzzy_left_join(df2, by = c("A" = "B", "A" = "C"),
match_fun = list(`>=`, `<`)) %>%
mutate(D = ifelse(is.na(B) & is.na(C), 0, 1))
A B C D
1 1 NA NA 0
2 4 4 6 1
3 13 12 15 1
4 19 NA NA 0
5 22 NA NA 0
Data
df1 <- structure(list(A = c(1L, 4L, 13L, 19L, 22L)), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(B = c(4L, 8L, 12L, 16L), C = c(6L, 9L, 15L, 18L)), class = "data.frame", row.names = c(NA, -4L))
Here's a way using sapply from base R -
df1$D <- sapply(df1$A, function(x) {
+any(x >= df2$B & x <= df2$C)
})
df1
A D
1 1 0
2 4 1
3 13 1
4 19 0
5 22 0
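The same check can also be written without an explicit per-element function by comparing every A against every interval at once with outer(). This is a sketch that treats both bounds as inclusive, like the sapply version; it builds a matrix with one row per A and one column per interval, so it may use a lot of memory when both data sets are large.

```r
df1 <- data.frame(A = c(1L, 4L, 13L, 19L, 22L))
df2 <- data.frame(B = c(4L, 8L, 12L, 16L), C = c(6L, 9L, 15L, 18L))

# inside[i, j] is TRUE when A[i] lies within interval j (bounds inclusive)
inside <- outer(df1$A, df2$B, ">=") & outer(df1$A, df2$C, "<=")

# D is 1 when A falls in at least one interval, 0 otherwise
df1$D <- as.integer(rowSums(inside) > 0)
df1
```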

R split each row of a dataframe into two rows

I would like to split each row of a (numeric) data frame into two rows. For example, part of the original data frame looks like this (nrow(original data frame) > 2800000):
ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47
And after splitting each row, we get:
ID X Y Z
1 3 2 6
22 54 NA NA
6 11 5 9
52 71 NA NA
3 7 2 5
2 34 NA NA
5 10 7 1
23 47 NA NA
The "value_1" and "value_2" columns are split off, and their elements are placed in a new row below the original one. For example, value_1 = 22 and value_2 = 54 form a new row.
Here is one option with data.table. We convert the 'data.frame' to 'data.table' by creating a column of rownames (setDT(df1, keep.rownames = TRUE)). Subset the columns 1:5 and 1, 6, 7 in a list, rbind the list element with fill = TRUE option to return NA for corresponding columns that are not found in one of the datasets, order by the row number ('rn') and assign (:=) the row number column to 'NULL'.
library(data.table)
setDT(df1, keep.rownames = TRUE)[]
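# 'rn' is a character column; order it numerically so that "10" does not sort before "2"
rbindlist(list(df1[, 1:5, with = FALSE], setnames(df1[, c(1, 6:7),
with = FALSE], 2:3, c("ID", "X"))), fill = TRUE)[order(as.integer(rn))][, rn:= NULL][]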
# ID X Y Z
#1: 1 3 2 6
#2: 22 54 NA NA
#3: 6 11 5 9
#4: 52 71 NA NA
#5: 3 7 2 5
#6: 2 34 NA NA
#7: 5 10 7 1
#8: 23 47 NA NA
A tidyverse version of the above logic would be
library(dplyr)
tibble::rownames_to_column(df1[1:4]) %>%
bind_rows(., setNames(tibble::rownames_to_column(df1[5:6]),
c("rowname", "ID", "X"))) %>%
arrange(as.integer(rowname)) %>%
select(-rowname)
# ID X Y Z
#1 1 3 2 6
#2 22 54 NA NA
#3 6 11 5 9
#4 52 71 NA NA
#5 3 7 2 5
#6 2 34 NA NA
#7 5 10 7 1
#8 23 47 NA NA
data
df1 <- structure(list(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L
), Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L), value_1 = c(22L,
52L, 2L, 23L), value_2 = c(54L, 71L, 34L, 47L)), .Names = c("ID",
"X", "Y", "Z", "value_1", "value_2"), class = "data.frame",
row.names = c(NA, -4L))
Here's a (very slow) pure R solution using no extra packages:
# Replicate your matrix
input_df <- data.frame(ID = rnorm(10000),
X = rnorm(10000),
Y = rnorm(10000),
Z = rnorm(10000),
value_1 = rnorm(10000),
value_2 = rnorm(10000))
# Preallocate memory to a data frame
output_df <- data.frame(
matrix(
nrow = nrow(input_df)*2,
ncol = ncol(input_df)-2))
# Loop through each input row in turn.
# Put the first four elements into output row 2*i - 1,
# and the next two into output row 2*i
# with two NAs attached.
for(i in seq_len(nrow(input_df))){
  output_df[2*i - 1,] <- input_df[i, c(1:4)]
  output_df[2*i,] <- c(input_df[i, c(5:6)],NA,NA)
}
colnames(output_df) <- c("ID", "X", "Y", "Z")
Which results in
> head(output_df)
ID X Y Z
1 0.5529417 -0.93859275 2.0900276 -2.4023800
2 0.9751090 0.13357075 NA NA
3 0.6753835 0.07018647 0.8529300 -0.9844643
4 1.6405939 0.96133195 NA NA
5 0.3378821 -0.44612782 -0.8176745 0.2759752
6 -0.8910678 -0.37928353 NA NA
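The row-by-row loop can be avoided by filling the odd and even rows of a preallocated matrix in two assignments. The sketch below uses the four example rows from the question and assumes all six columns are numeric, so that they fit in a single matrix; the names top, bottom and out are made up for illustration.

```r
df <- data.frame(ID = c(1, 6, 3, 5), X = c(3, 11, 7, 10),
                 Y = c(2, 5, 2, 7), Z = c(6, 9, 5, 1),
                 value_1 = c(22, 52, 2, 23), value_2 = c(54, 71, 34, 47))
n <- nrow(df)

# First four columns become the odd rows
top <- as.matrix(df[, 1:4])
# value_1 and value_2, padded with two NA columns, become the even rows
bottom <- cbind(as.matrix(df[, 5:6]), NA, NA)

# Preallocate the result and fill odd and even rows in one step each
out <- matrix(NA_real_, nrow = 2 * n, ncol = 4,
              dimnames = list(NULL, names(df)[1:4]))
out[seq(1, 2 * n, by = 2), ] <- top
out[seq(2, 2 * n, by = 2), ] <- bottom
out <- as.data.frame(out)
out
```

Because the work is done in two vectorized assignments rather than one assignment per row, this should scale much better to the 2.8 million rows mentioned in the question, although it has not been timed here.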
This should work
data <- read.table(text= "ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47", header=T)
data1 <- data[,1:4]
data2 <- data[,5:6]
names(data2) <- names(data1)[1:ncol(data2)]
combined <- plyr::rbind.fill(data1,data2)
n <- nrow(data1)
combined[kronecker(1:n, c(0, n), "+"),]
Though why you would need to do this beats me.
