Matching logical subscripts to the size of the indexed input - r

I have created two data frames that I then turn into lists (e.g., list1 and list2). I removed one element from list2 to better represent my example data set.
library(dplyr)
intervals <- rep_len(c("01-01-2022", "01-11-2022", "01-31-2022"), 100)
ID <- rep(c("A","B", "C"), 100)
df <- data.frame(ID = as.factor(ID),
intervals = as.factor(intervals))
list1 <- df %>%
group_by(ID, intervals) %>%
group_split()
intervals <- rep_len(c("01-01-2022", "01-11-2022", "01-31-2022"), 25)
ID <- rep(c("A","B"), 25)
df2 <- data.frame(ID = as.factor(ID),
intervals = as.factor(intervals))
list2 <- df2 %>%
group_by(ID, intervals) %>%
group_split()
list2 <- list2[-6]
For each of these lists I have added an attribute, and I have included a function (check) to inspect the added attribute more readily.
# Convenience function to grab the attributes for you
check <- function(list, attribute_name) {
return(attr(list, attribute_name))
}
# Add an attribute to hold the attributes of each list element
attr(list1, "match") <- data.frame(id = sapply(list1, function(x) paste(x$ID[1])),
interval_start_date = sapply(list1, function(x) paste(x$intervals[1]))
)
# Check the attributes
check(list1, "match")
# Add an attribute "match" to hold the ID and interval of each list element
attr(list2, "match") <- data.frame(id = sapply(list2, function(x) paste(x$ID[1])),
interval_start_date = sapply(list2, function(x) paste(x$intervals[1]))
)
# Check the attributes
check(list2, "match")
I have created an index for the two lists, and the objective here is to remove any list components that don't have the same ID and the same intervals. The goal is to keep only the matching IDs with the same intervals.
# Creates an index for the two lists based on the attributes
dat2 <- check(list1, "match")
dat1 <- check(list2, "match")
# Removes rows where the id isn't present in both data frames, and creates a
# index where both the interval and id are the same.
if (!length(unique(dat2$id)) == length(unique(dat1$id))){
dat3 <- dat2[dat2$id %in% dat1$id, ]
dat4 <- dat1[dat1$id %in% dat2$id, ]
i1 <- paste(dat3[["id"]], format(as.Date(dat3[["interval_"]]),
"%Y-%d")) %in%
paste(dat4[["id"]], format(as.Date(dat4[["interval_"]]),
"%Y-%d"))
}
Now here is where I begin to get an error:
# Error occurs because the length of `i1` is not the same as the length of `list1`
out <- list1[i1]
I know that this is occurring because list1 does not have the same length as i1. I'm wondering if there is a way to append logical values to i1 so that it has the same length as list1, but without removing values from list1 that we actually want to keep. Any thoughts?
Here is my expected output for list1, where I hope it ends up with only the same IDs and intervals as list2.
# Expected output
expected_list1 <- list(list1[1], list1[2],list1[3], list1[4], list1[5])
This attempt is close to what I would like, but it has an additional element. I think ultimately the attribute table should be similar to that of dat4.
test <- list1[dat2$id %in% dat1$id][i1]
# Add an attribute "match" to hold the ID and interval of each list element
attr(test, "match") <- data.frame(id = sapply(test, function(x) paste(x$ID[1])),
interval_start_date = sapply(test, function(x) paste(x$intervals[1]))
)
# Check the attributes
check(test, "match")

The column name was mismatched: the column in dat1 and dat2 is interval_start_date, not interval_. [[ looks for an exact match by default, whereas $ also matches partial names, so [["interval_"]] returns NULL.
if (!length(unique(dat2$id)) == length(unique(dat1$id))){
ids_common <- intersect(dat2$id, dat1$id)
inds1 <- dat2$id %in% ids_common
inds2 <- dat1$id %in% ids_common
i1 <- paste(dat2[["id"]], format(as.Date(dat2[["interval_start_date"]]),
"%Y-%d")) %in%
paste(dat1[["id"]], format(as.Date(dat1[["interval_start_date"]]),
"%Y-%d"))
out <- list1[i1 & inds1]
}
-checking
> length(out)
[1] 5
> i1
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
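The difference between [[ and $ described in this answer can be seen in a small sketch. The data frame below is a made-up stand-in for dat1/dat2 (its values are assumptions for illustration only); the behaviour shown applies to plain data frames, not tibbles.

```r
# Toy data frame standing in for dat1/dat2 (values are made up)
dat <- data.frame(id = c("A", "B"),
                  interval_start_date = c("2022-01-01", "2022-01-11"))

exact   <- dat[["interval_"]]                 # NULL: `[[` needs the full name by default
partial <- dat$interval_                      # `$` partially matches interval_start_date
relaxed <- dat[["interval_", exact = FALSE]]  # opt in to partial matching with `[[`

is.null(exact)               # TRUE
identical(partial, relaxed)  # TRUE
```

Spelling out the full column name, as the answer does, avoids relying on partial matching at all.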

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working, and I don't understand why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I suspect the errors are in the append part of the sapply call and perhaps in the indexing of the lapply call. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, avoid trying to change the values of global variables. Instead, return the values you want. For example, generate a vector of IDs to remove for each item in the list, collected as a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
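To illustrate why the original sapply attempt left ids_to_remove empty, here is a minimal base R sketch using the same toy data. The names ids_vec and filtered are introduced here for illustration and are not from the question.

```r
# Same toy data as the question; note that seq(11:20) also yields 1..10
df1 <- data.frame(id = 1:10, name = LETTERS[1:10])
df2 <- data.frame(id = 1:10, name = LETTERS[11:20])
mylist <- list(df1, df2)

ids_to_remove <- c()
invisible(lapply(mylist, function(df) {
  # This assignment only creates a local variable inside the function
  ids_to_remove <- append(ids_to_remove, df$id[df$id > 8])
}))
length(ids_to_remove)  # still 0: the global vector was never modified

# Returning the ids and flattening gives the vector the question asked for
ids_vec <- unlist(lapply(mylist, function(df) df$id[df$id > 8]))
ids_vec  # 9 10 9 10

# Filtering each data frame directly does not need the intermediate vector
filtered <- lapply(mylist, function(df) df[df$id <= 8, , drop = FALSE])
sapply(filtered, nrow)  # 8 8
```

The matrix in the question arises because each call returns the value of its last assignment (a length-2 vector), and sapply simplifies equal-length results into a matrix by default.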

Using attribute tables to apply a function between specific elements in lists

I have two list objects. l1 contains information that has been read in through file paths. l2 is a list of values whose names share components with those in l1. I have assigned attributes to both lists based on the names of their elements. I would like to reach my expected results using the attributes that I have assigned to my lists.
For example: I would like to apply a function mean() between the elements of l1 whose id attribute is "2013_mean" and the elements of l2 whose year attribute is "2013". I would like to do the same thing for the elements whose year attribute is "2016".
# File List
oldl1 <- list(2,3,4,5)
names(oldl1) <- c("C:/Users/2013_mean.csv",
"C:/Users/2013_median.csv",
"C:/Users/2016_mean.csv",
"C:/Users/2016_median.csv"
)
newl1 <- list(2,3,4,5,8,9)
names(newl1) <- c("C:/Users/2013_mean.csv",
"C:/Users/2013_median.csv",
"C:/Users/2016_mean.csv",
"C:/Users/2016_median.csv",
"C:/Users/2017_mean.csv",
"C:/Users/2017_median.csv"
)
attributes(l1) <- data.frame(id = sub("\\.csv", "", basename(names(l1))),
year = trimws(basename(names(l1)), whitespace = "_.*"))
# Other List
l2 <- list(8,9,10,15,1)
names(l2) <- c("2013_A",
"2013_B",
"2013_C",
"2016_D",
"2016_E")
attributes(l2) <- data.frame(year = trimws(names(l2), whitespace = "_.*"))
expected <- list(mean(c(l1[[1]], l2[[1]])),
mean(c(l1[[1]], l2[[2]])),
mean(c(l1[[1]], l2[[3]])),
mean(c(l1[[3]], l2[[4]])),
mean(c(l1[[3]], l2[[5]]))
)
We may use the attributes to split and match and get the mean
yrs <- intersect(attr(l1, "year"), attr(l2, "year"))
i1 <- grepl("mean", attr(l1, "id"))
i12 <- attr(l1, "year") %in% yrs
i1 <- i1 & i12
i2 <- attr(l2, "year") %in% yrs
l2new <- l2[i2]
l1new <- l1[i1]
attr(l1new, "year") <- attr(l1, "year")[i1]
out <- do.call(c, Map(function(x, y) lapply(x, function(z)
mean(c(z, y))), split(l2new, attr(l2, 'year')[i2]), l1new))
names(out) <- NULL
-checking with OP's expected
> identical(out, expected)
[1] TRUE
Or another option is to convert the list with attributes to a data.frame, do a merge and use rowMeans and then convert to list with as.list
as.list(rowMeans(merge(transform(data.frame(attributes(l2)),
l2 = unlist(l2)),
subset(transform(data.frame(attributes(l1)), l1 = unlist(l1)),
grepl("mean", id), select = c(year, l1)), all.x = TRUE)[-1]))
-output
[[1]]
[1] 5
[[2]]
[1] 5.5
[[3]]
[1] 6
[[4]]
[1] 9.5
[[5]]
[1] 2.5
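As a further sketch of the same idea, match() on the year attributes can pair each element of l2 with its "mean" counterpart in l1. This assumes l1 is the four-element list from the question with id and year attributes set directly; the names is_mean, mean_vals, mean_years, idx and keep are hypothetical helpers, and this is an alternative to the code above rather than a restatement of it.

```r
# Assumed stand-in for l1: four values with id/year attributes as in the question
l1 <- list(2, 3, 4, 5)
attr(l1, "id")   <- c("2013_mean", "2013_median", "2016_mean", "2016_median")
attr(l1, "year") <- c("2013", "2013", "2016", "2016")

l2 <- list(8, 9, 10, 15, 1)
attr(l2, "year") <- c("2013", "2013", "2013", "2016", "2016")

is_mean    <- grepl("mean", attr(l1, "id"))
mean_vals  <- unlist(l1[is_mean])          # 2 4
mean_years <- attr(l1, "year")[is_mean]    # "2013" "2016"

# Position of the matching "mean" entry for every element of l2 (NA if none)
idx  <- match(attr(l2, "year"), mean_years)
keep <- !is.na(idx)

out <- Map(function(x, y) mean(c(x, y)), mean_vals[idx[keep]], l2[keep])
unlist(out)  # 5.0 5.5 6.0 9.5 2.5
```

Elements of l2 whose year has no "mean" entry in l1 are dropped by keep instead of producing NA.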

Nested ifelse() or case_when() for unknown number of queries in R

I have a data frame which I would like to group according to the value in a given row and column of the data frame
my_data <- data.frame(matrix(ncol = 3, nrow = 4))
colnames(my_data) <- c('Position', 'Group', 'Data')
my_data[,1] <- c('A1','B1','C1','D1')
my_data[,3] <- c(1,2,3,4)
grps <- list(c('A1','B1'),
c('C1','D1'))
grp.names = c("Control", "Exp1", "EMPTY")
my_data$Group <- case_when(
my_data$Position %in% grps[[1]] ~ grp.names[1],
my_data$Position %in% grps[[2]] ~ grp.names[2]
)
OR
my_data$Group <- with(my_data, ifelse(Position %in% grps[[1]], grp.names[1],
ifelse(Position %in% grps[[2]], grp.names[2],
grp.names[3])))
These examples work and produce a Group column with appropriate labels, however I need to have flexibility in the length of the grps list from 1 to approximately 25.
I see no way to iterate through case_when or ifelse in a for loop, e.g.
my_data$Group <- for (i in 1:length(grps)){
case_when(
my_data$Well %in% grps[[i]] ~ grp.names[i])
}
This example simply deletes the Group column
What is the most appropriate way to handle a variable grps length?
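The reason the loop version deletes the column can be shown in a short sketch: a for loop always evaluates to NULL, and assigning NULL to a data frame column removes it. The toy and res names below are made up for illustration.

```r
toy <- data.frame(Position = c("A1", "B1"), Group = NA)

res <- for (i in 1:2) i * 2   # the loop runs, but its value is NULL
is.null(res)                  # TRUE

toy$Group <- res              # assigning NULL drops the column
names(toy)                    # "Position"
```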
I believe your question implies that grps is a list in which every element is a vector holding all the positions that belong to that group.
Specifically, with the grps shown below, a Position of "A1" or "B1" belongs to whatever the first entry of grp.names is. Similarly, a Position of "C1" or "D1" belongs to whatever the second entry of grp.names is.
> grps
[[1]]
[1] "A1" "B1"
[[2]]
[1] "C1" "D1"
Assuming that to be the case you can do the following:
matching_group_df <- sapply(grps, function(x){ my_data$Position %in% x})
selected_group <- apply(matching_group_df, 1, function(x){which(x == TRUE)})
my_data$Group <- grp.names[selected_group]
Position Group Data
1 A1 Control 1
2 B1 Control 2
3 C1 Exp1 3
4 D1 Exp1 4
The way it works is as follows:
matching_group_df is a matrix of True/False (created via the sapply function) that specifies what group index the position belongs to:
> matching_group_df
[,1] [,2]
[1,] TRUE FALSE
[2,] TRUE FALSE
[3,] FALSE TRUE
[4,] FALSE TRUE
You then select the column that has the TRUE value row by row using an apply command:
selected_group <- apply(matching_group_df, 1, function(x){which(x == TRUE)})
> selected_group
[1] 1 1 2 2
Finally you pass those indices to your grp.names list to select the appropriate ones and set them into your original dataframe.
grp.names[selected_group]
[1] "Control" "Control" "Exp1" "Exp1"
This also has the small side benefit of just using base R functions if that is important to you.
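One caveat with the apply/which step is that a position matching no group would yield a zero-length result. A possible variant, sketched below under the assumption that unmatched positions should get the "EMPTY" label, uses a named lookup vector instead. The position "E1" and the name lookup are made up for illustration.

```r
my_data <- data.frame(Position = c("A1", "B1", "C1", "D1", "E1"),
                      Data = 1:5)           # "E1" is a made-up unmatched position
grps <- list(c("A1", "B1"), c("C1", "D1"))
grp.names <- c("Control", "Exp1", "EMPTY")

# One label per position, named by the position it belongs to
lookup <- setNames(rep(grp.names[seq_along(grps)], lengths(grps)),
                   unlist(grps))

# Index by name; as.character() guards against Position being a factor
my_data$Group <- unname(lookup[as.character(my_data$Position)])
my_data$Group[is.na(my_data$Group)] <- "EMPTY"   # fallback for unmatched positions
my_data$Group
# "Control" "Control" "Exp1" "Exp1" "EMPTY"
```

Because the lookup is built from grps and grp.names, it works for any number of groups without changing the code.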
Approach 1: Hash table
I would opt for a different approach here, since group makeup might change during analysis: build a lookup table of key-value pairs and write a small accessor function.
library(tidyverse)
# First, a small adjustment to `grps` to reflect an empty group.
grps <- list(c('A1','B1'),
c('C1','D1'),
NULL)
names <- unlist(grps, use.names = F)
values <- rep(grp.names, map_dbl(grps, length))
h = as.list(values) %>%
set_names(names) %>%
list2env()
# find x in h
f <- Vectorize(function(x) h[[x]], c("x")) # scoping here
This takes some time to set up, but usage is quite convenient:
my_data %>%
mutate(Group = f(Position))
Position Group Data
1 A1 Control 1
2 B1 Control 2
3 C1 Exp1 3
4 D1 Exp1 4
This avoids having to change your code in multiple places, and can take on arbitrary length of groups.
Approach 2: Dynamic switch
Alternatively, we can make an arbitrary length switch expression, building it from the group names and their unique values.
constructor <- function(ids, names){
purrr::imap_chr(as.character(ids), ~paste(paste0("\"", .x ,"\""),
paste0("\"", names[.y], "\""),
sep = "=")) %>%
paste0(collapse = ", ") %>%
paste0("Vectorize(function(x) switch(as.character(x), ", ., ", NA))", collapse = "") %>%
str2expression()
}
my_data %>%
mutate(Group = eval(constructor(names, values))(Position))
In this case, it would evaluate the expression
expression(Vectorize(function(x) switch(as.character(x), A1 = "Control",
B1 = "Control", C1 = "Exp1", D1 = "Exp1",
NA)))
For each item in my_data$Position you want to go through each of the grps and look for a match and assign grp.names, if so. If you don't find a match in any grp, assign grp.names[3]:
my_data$Group <- lapply(my_data$Position, function(position){ # Goes through each my_data$Position
for(i in 1:length(grps)){
if(position %in% grps[[i]]){
return(grp.names[i]) # Give matching index of grp.names to grps
} else if (i == length(grps)){ # if no matches assign grp.names[3]
return(grp.names[3])
}
}
}) %>% unlist() # Put the list into a vector

selecting list of columns from dataframe to convert to subset

I am trying to write a function that takes a list of columns to modify as input.
For example, the sample data is:
dataa<-data.frame(
aa = c("q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c","q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c"),
col1=c(1,2,3,2,1,2,3,4,4,4,5,3,4,2,1,2,5,3,2,1,2,4,2,1,3,2,1,2,3,1,2,2,4,4,4,1,2,5,3,5),
col2=c(2,1,1,7,4,1,2,7,5,7,2,6,2,2,6,3,4,3,2,5,7,5,6,4,4,6,5,6,4,1,7,3,2,7,7,2,3,7,2,4)
)
My requirement is that I can create one or more cuts like the ones below, or pass a list of cuts.
In effect, I am recoding my dataset:
dataa$col3 <- ifelse(dataa$aa == "y",1,0)
dataa$col4 <- ifelse(dataa$col2 == 7,1,0)
So in my function I want a subset of the selected variables for the calculation.
for eg:
#i am applying my function like this
dat1 = dataa
var1 = "col1" # variable for which calculation will be done
grouping_var = list(dataa$col3,dataa$col4)
total_var= TRUE
#fun_1 <- function(dat1,var1,grouping_var,total_var){
total_col <- ifelse(total_var== TRUE,1,0)
var1 <- rlang::parse_expr(var1)
var2 <- dat1[unlist(grouping_var)] # i am trying to create a subset dataframe of selected grouping_var
#var2 <- data.frame(sapply(grouping_var,c)) # i have tried this too
dat1 <- dat1 %>% select(!!var1,!!var2)
# so after this line i would have a subset to calculations accordingly
var_lab(dat1[[1]]) <- ""
var_lab(dat1[[2]]) <- ""
tab1 <- expss::cro_cpct(total(),dat1[[1]],dat1[[2]])
tab1 <- as.data.frame(tab1)
#}
It is not possible to select columns based on the values of the column.
An easy way would be to pass them as character vector and select. Try :
dat1 = dataa
var1 = "col1"
grouping_var = c('col3', 'col4') #Passing columns as character vector
total_var= TRUE
#fun_1 <- function(dat1,var1,grouping_var,total_var){
total_col <- as.integer(total_var)
#total_col <- ifelse(total_var== TRUE,1,0)
dat1 <- dat1[grouping_var]
expss::var_lab(dat1[[1]]) <- ""
expss::var_lab(dat1[[2]]) <- ""
tab1 <- expss::cro_cpct(expss::total(),dat1[[1]],dat1[[2]])
tab1 <- as.data.frame(tab1)
#}
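A minimal base R sketch of selecting by column name, assuming the calculation column should be kept alongside the grouping columns. select_vars is a hypothetical helper, not part of expss or dplyr, and the data frame is a shortened, made-up version of dataa.

```r
# Small made-up data frame with the same column names as the question
dataa <- data.frame(aa   = c("q", "y", "d", "y"),
                    col1 = c(1, 2, 3, 4),
                    col2 = c(2, 7, 7, 1))
dataa$col3 <- ifelse(dataa$aa == "y", 1, 0)
dataa$col4 <- ifelse(dataa$col2 == 7, 1, 0)

# Hypothetical helper: keep the calculation column plus the grouping columns
select_vars <- function(dat, var, grouping_var) {
  wanted <- c(var, grouping_var)
  absent <- setdiff(wanted, names(dat))
  if (length(absent) > 0) {
    stop("Unknown column(s): ", paste(absent, collapse = ", "))
  }
  dat[wanted]   # subsetting by a character vector of names
}

sub <- select_vars(dataa, "col1", c("col3", "col4"))
names(sub)  # "col1" "col3" "col4"
```

Passing names rather than column values lets the function check up front that every requested column exists.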

Convert named vector to data frame using attribute values

I have a vector of characters. Each element of the vector has a name attribute which represents the row index of a data frame and the column index of a data frame, separated by a period. Here's a toy data set:
# Create vector of characters
a <- c("foo","bar","dog","cat")
# Assign attributes. The data frame is 2x2:
attr(a, "names") <- c("1.1", "1.2", "2.1", "2.2")
I am trying to use the attribute names to convert the vector into a data frame, where each element in the data frame is the value in the vector and the element's row is the number before the period in the attribute name and the element's column is the number after the decimal in the attribute name. The toy example's output should look like:
data.frame(var1 = c("foo","dog"), var2 = c("bar", "cat"))
My actual vector is quite large so I am looking to do this efficiently.
You can use indexing by row/column value to do this efficiently:
row.nums <- as.numeric(sapply(strsplit(names(a), "\\."), "[", 1))
col.nums <- as.numeric(sapply(strsplit(names(a), "\\."), "[", 2))
mat <- matrix(NA, max(row.nums), max(col.nums))
mat[cbind(row.nums, col.nums)] <- a
mat
# [,1] [,2]
# [1,] "foo" "bar"
# [2,] "dog" "cat"
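Since the question asks for a data frame, the matrix can be converted in one more step. The sketch below repeats the matrix indexing with a single strsplit call and then renames the columns; the var1/var2 names follow the expected output in the question, and idx and result are names chosen here.

```r
a <- c("foo", "bar", "dog", "cat")
names(a) <- c("1.1", "1.2", "2.1", "2.2")

# Split each name once; column 1 holds row numbers, column 2 column numbers
idx <- do.call(rbind, strsplit(names(a), ".", fixed = TRUE))
storage.mode(idx) <- "integer"

mat <- matrix(NA_character_, max(idx[, 1]), max(idx[, 2]))
mat[idx] <- a   # a two-column integer matrix indexes (row, column) pairs

result <- as.data.frame(mat, stringsAsFactors = FALSE)
names(result) <- paste0("var", seq_len(ncol(result)))
result
```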
Split a on the suffix values and coerce the result to a data frame. Omit stringsAsFactors = FALSE if you prefer factor columns, and omit unname if row names on the result are acceptable.
Code--
as.data.frame(split(unname(a), sub(".*[.]", "", names(a))), stringsAsFactors = FALSE)
giving:
X1 X2
1 foo bar
2 dog cat
I would probably use regex to extract row and column positions, as follows.
my.rows <- as.integer(gsub("\\..*$", "", names(a)))
my.cols <- as.integer(gsub("^.*\\.", "", names(a)))
new.data <- data.frame(matrix(NA, nrow = max(my.rows), ncol = max(my.cols)))
for (i in 1:length(a)) {
new.data[my.rows[i], my.cols[i]] <- a[i]
}
new.data
We can use dplyr and tidyr. b2 is the final output.
library(dplyr)
library(tidyr)
b <- tibble(Name = names(a), Value = a)  # data_frame() is deprecated in favor of tibble()
b2 <- b %>%
separate(Name, into = c("Group", "Var")) %>%
spread(Var, Value) %>%
select(-Group)
