Subset matrix rows and columns based on names in R

I want to subset a large matrix (both rows and columns) based on a vector of names that will change dynamically. In the reproducible example below, I have a square matrix (x) with matching row and column names, and a vector (categories) listing the rows and columns I want to keep. How do I subset rows and columns so that the result only contains the rows and columns for a and c (see desired output)?
categories = c("a", "c")
a = c(2,3,4)
b = c(1,9,8)
c = c(5,6,7)
x = cbind(a,b,c)
rownames(x) <- c("a", "b", "c")
x = as.matrix(x)
# attempt:
result = x[x %in% categories == TRUE]
Desired output:
a = c(2,4)
c = c(5,7)
y = cbind(a,c)
rownames(y) <- c("a", "c")
y = as.matrix(y)

You can subset by row and column names directly:
y <- x[c("a", "c"), c("a", "c")]
y
# a c
# a 2 5
# c 4 7
Or, using subset (for a matrix, the first condition selects rows and the second selects columns):
y <- subset(x, rownames(x) %in% c("a", "c"),
            colnames(x) %in% c("a", "c"))
y
# a c
# a 2 5
# c 4 7
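Since the list of names changes dynamically, the name-based indexing above can be wrapped in a small function. This is a minimal sketch, not part of the original answer: subset_by_names is a hypothetical helper, and it assumes that requested names missing from the matrix should be silently ignored rather than raise an error.

```r
# Rebuild the matrix from the question
x <- cbind(a = c(2, 3, 4), b = c(1, 9, 8), c = c(5, 6, 7))
rownames(x) <- c("a", "b", "c")

# Hypothetical helper: keep only the rows and columns named in `keep`
subset_by_names <- function(m, keep) {
  rows <- intersect(keep, rownames(m))  # ignore requested names that are absent
  cols <- intersect(keep, colnames(m))
  m[rows, cols, drop = FALSE]           # drop = FALSE keeps the result a matrix
}

categories <- c("a", "c")
y <- subset_by_names(x, categories)
```

Using drop = FALSE matters when categories has length one: without it, R would collapse the result to a plain vector.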

Related

How do I Identify by row id the values in a data frame column not in another data frame column?

How do I identify, by row id, the values in data frame d2 column c3 that are not in data frame d1 column c1? My which call returns every row when I subset as shown. I need to keep this bracket-subsetting structure rather than the value$field form, which does work:
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("a", "b", "c", "d", "e")
c3 <- c("A", "z", "C", "z", "E", "F")
c4 <- c("a", "x", "x", "d", "e", "f")
d1 <- data.frame(c1, c2, stringsAsFactors = F)
d2 <- data.frame(c3, c4, stringsAsFactors = F)
x <- unique(d1["c1"])
y <- d2[,"c3"]
id <- which(!(y %in% x) ) # incorrect, all row ids returned
I am trying to find the ids of the rows in y whose value does not appear in x.
I believe setdiff would work here. I see z and F are what you want, right? They are not in d1[,"c1"] but are in d2[,"c3"]
includes <- setdiff(d2[,"c3"], d1[,"c1"])
d2_new <- d2[d2[,"c3"] %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
# or
ids <- rownames(d2[d2[,"c3"] %in% includes,])
output
d2_new
# c3 c4 id
#2 z x 2
#4 z d 4
#6 F f 6
ids
#[1] "2" "4" "6"
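The which approach from the question can also be kept as is with one change. A likely cause of the problem is that d1["c1"] returns a one-column data frame rather than a character vector, so %in% compares each value against the column as a whole and never finds a match. A minimal sketch, assuming the same d1 and d2 as in the question:

```r
d1 <- data.frame(c1 = c("A", "B", "C", "D", "E"),
                 c2 = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(c3 = c("A", "z", "C", "z", "E", "F"),
                 c4 = c("a", "x", "x", "d", "e", "f"),
                 stringsAsFactors = FALSE)

x <- unique(d1[["c1"]])   # [[ ]] extracts the column as a character vector
y <- d2[, "c3"]
id <- which(!(y %in% x))  # positions in y whose value is absent from x
id
# [1] 2 4 6
```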
I had the same problem, and this code worked for me. Bracket indexing did not work in my case, but with a slight change to $ indexing it worked perfectly.
includes <- setdiff(d2$c3, d1$c1)
d2_new <- d2[d2$c3 %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
Thank you, @jpsmith.

Select for every row between two columns based on condition in another column in R

Could someone point me to an existing answer or suggest a method? I cannot find a solution.
What I want to do:
For every row if the value in column "x" is "A" then select the value in column "y" from the same row and if the value in column "x" is "B" then select the value in column "z" from the same row.
Ideally collected in a vector to include as a new column in the df afterwards.
df <- data.frame(x = c("A", "B", "B", "A"), y = c(1,2,3,4), z = c(4,3,2,1), fix.empty.names = FALSE)
df
  x y z
1 A 1 4
2 B 2 3
3 B 3 2
4 A 4 1
Desired result:
[1] 1 3 2 4
Thank you very much in advance
If we can assume x is always "A" or "B":
ifelse(df$x == "A", df$y, df$z)
More generally:
ifelse(df$x == "A", df$y, ifelse(df$x == "B", df$z, NA))
You can, of course, assign this directly as a new column: df$result <- ifelse...
If you like dplyr:
library(dplyr)
df %>%
  mutate(
    result = case_when(
      x == "A" ~ y,
      x == "B" ~ z,
      TRUE ~ NA_real_
    )
  )
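If the number of possible values in x grows, nesting ifelse calls becomes unwieldy. One alternative is a lookup vector combined with matrix indexing. This is a sketch under assumptions: lookup, vals and col_idx are names introduced here for illustration, and the candidate columns are assumed to all be numeric so they can be stacked into one matrix.

```r
df <- data.frame(x = c("A", "B", "B", "A"),
                 y = c(1, 2, 3, 4),
                 z = c(4, 3, 2, 1),
                 stringsAsFactors = FALSE)

# Map each value of x to the column that should supply the result
lookup <- c(A = "y", B = "z")

vals <- as.matrix(df[c("y", "z")])                    # numeric matrix of candidates
col_idx <- match(lookup[as.character(df$x)], colnames(vals))

# Index with a two-column (row, column) matrix: one cell per row of df
result <- vals[cbind(seq_len(nrow(df)), col_idx)]
result
# [1] 1 3 2 4
```

Adding a further case then only requires a new entry in lookup and, if needed, another column in vals.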

Replace strings in variable using lookup vector

I have a data frame df with a character variable, plus two vectors, fromvec and tovec.
library(tibble)
df <- tibble(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")
Use strings in fromvec, check them in df and then replace them with the corresponding strings in tovec so that "A" in df gets replaced with "X", "B" with "Y" and so on to get the desired_df.
desired_df <- tibble(var = c("X", "Y", "Z", "X", "E", "D", "Y"))
I tried the following, but did not get the desired result:
library(dplyr)
library(stringr)
from_vec <- paste(fromvec, collapse="|")
to_vec <- paste(tovec, collapse="|")
undesired_df <- df %>%
  mutate(var = str_replace(str_to_upper(var), from_vec, to_vec))
Instead, I get this:
tibble(var = c("X|Y|Z", "X|Y|Z", "X|Y|Z", "X|Y|Z", "E", "D", "X|Y|Z"))
How can I get the desired_df?
You could use chartr:
df$var <- chartr(paste(fromvec, collapse=""),
                 paste(tovec, collapse=""),
                 toupper(df$var))
# # A tibble: 7 x 1
# var
# <chr>
# 1 X
# 2 Y
# 3 Z
# 4 X
# 5 E
# 6 D
# 7 Y
Or we can use recode
library(dplyr)
df$var <- recode(toupper(df$var), !!!setNames(tovec,fromvec))
If you really want to use str_replace you could do:
library(purrr)
library(stringr)
df$var <- reduce2(fromvec, tovec, str_replace, .init=toupper(df$var))
The idiomatic way to do this with stringr is str_replace_all with a named vector (note the var = so that the existing column is overwritten):
mutate(df, var = str_replace_all(str_to_upper(var), setNames(tovec, fromvec)))
(thanks, @Moody_Mudskipper!)
We can use base R
with(df, ifelse(toupper(var) %in% fromvec,
setNames(tovec, fromvec)[toupper(var)], var))
#[1] "X" "Y" "Z" "X" "E" "D" "Y"
which can be also written in two lines by creating a logical condition
i1 <- toupper(df$var) %in% fromvec
df$var[i1] <- setNames(tovec, fromvec)[toupper(df$var)[i1]]
Or using data.table
library(data.table)
setDT(df)[toupper(var) %in% fromvec, var := setNames(tovec, fromvec)[toupper(var)]]
It's not clear whether the result should be case insensitive.
In my opinion, replacement (update) operations that involve an indeterminate number of changes are best accomplished using joins. In this case, it also cements a good practice of tracking your changes in a separate data frame.
Unfortunately, the tidyverse has no "update dataframe" function, which is a glaring omission. That means tidyverse users must use a workaround, coalesce.
#JOIN Operation
tibble(fromvec, tovec) %>% #< dataframe of changes
right_join(df, by = c("fromvec" = "var")) %>% #< join operation
transmute(var = coalesce(tovec, fromvec)) #< coalesce work-around
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 a
5 E
6 D
7 b
If a case insensitive operation is preferred, consider inserting str_to_upper in the pipeline:
tibble(fromvec, tovec) %>%
right_join(df %>% mutate(var = (str_to_upper(var))), #<modify case
by = c("fromvec" = "var")) %>%
transmute(var = coalesce(tovec, fromvec))
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 X
5 E
6 D
7 Y
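The same join-then-coalesce idea can be sketched in base R with match, which behaves like a left join on the lookup key and always keeps the rows of df in their original order. This is an illustration only, assuming the case-insensitive behaviour from the desired output; idx is a name introduced here.

```r
df <- data.frame(var = c("A", "B", "C", "a", "E", "D", "b"),
                 stringsAsFactors = FALSE)
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")

# Position of each (upper-cased) value in fromvec, NA when there is no match
idx <- match(toupper(df$var), fromvec)

# "Coalesce": take the replacement where one exists, else keep the original
df$var <- ifelse(is.na(idx), df$var, tovec[idx])
df$var
# [1] "X" "Y" "Z" "X" "E" "D" "Y"
```

Dropping toupper from the match call gives the case-sensitive variant.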

Subset a Data Frame Based on All Combinations and Sub-combinations of Factor Variables

I need to subset a data.frame based on all combinations and sub-combinations of multiple columns of factor variables. Additionally, the number of factor columns may change, so the method needs to be flexible in accepting different numbers of attributes. I can figure out how to create the combinations of variables in a simple example, but I don't have a good way to subset the data.frame efficiently. Any thoughts?
library(data.table)
# set up an example data.table
a <- c("a", "b", "b", "b", "e")
b <- c("b", "c", "b", "b", "f")
c <- c("c", "d", "b", "b", "g")
df <- data.table(a = a, b = b, c = c)
#build a data.frame of unique combos to subset on
df_unique <- df[!duplicated(df), ]
df_combos <- data.table()
for(i in 1:ncol(df_unique)){
  for(x in 1:ncol(df_unique)){
    df_sub <- df_unique[, i:x, with = FALSE]
    df_combos <- rbind(df_combos, df_sub, fill = TRUE)
  }
}
df_combos <- df_combos[!duplicated(df_combos), ]
rm(df_unique)
#create a loop to build the subsets
combos_out <- data.table()
for(i in 1:nrow(df_combos)){
  df_combos_sub <- df_combos[i, ]
  df_combos_sub <- df_combos_sub[, which(unlist(lapply(df_combos_sub, function(x) !all(is.na(x))))), with = FALSE]
  df_sub <- merge(df, df_combos_sub, by = colnames(df_combos_sub))
  #interesting code here that performs analysis on the subsets
}
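One possible way to avoid the repeated merge calls is to enumerate the column sets with combn and let split build the subsets. This is a sketch under assumptions rather than a tested answer to the question: it uses a plain data.frame instead of a data.table, it treats "sub-combinations" as every non-empty set of columns (the loops above only build contiguous ranges), and col_sets and subsets are names introduced here.

```r
df <- data.frame(a = c("a", "b", "b", "b", "e"),
                 b = c("b", "c", "b", "b", "f"),
                 c = c("c", "d", "b", "b", "g"),
                 stringsAsFactors = FALSE)

cols <- names(df)

# Every non-empty set of columns: {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}
col_sets <- unlist(
  lapply(seq_along(cols), function(k) combn(cols, k, simplify = FALSE)),
  recursive = FALSE
)
names(col_sets) <- vapply(col_sets, paste, character(1), collapse = "+")

# For each column set, split df into one subset per observed value combination
subsets <- lapply(col_sets, function(cs) split(df, df[cs], drop = TRUE))

# Example: rows where a == "b", b == "b" and c == "b"
subsets[["a+b+c"]][["b.b.b"]]
```

The analysis step can then be applied with a nested lapply over subsets. Because the number of column sets grows as 2^n - 1, this is only practical for a modest number of factor columns.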

Merge and paste duplicate columns in R

Suppose I have two data frames with some common variable x:
df1 <- data.frame(
x=c(1, 2, 3, 4),
y=c("a", "b", "c", "d")
)
df2 <- data.frame(
x=c(1, 1, 2, 2, 3, 4, 5),
z=c("A", "B", "C", "D", "E", "F", "G")
)
We can assume that each entry of the variable we're merging over, x, appears exactly once in df1; however, it may appear an arbitrary number of times in df2.
I want to merge df2 'into' df1, while preserving df1. Is there a fast way of merging these two data frames such that the merged output would be of the form (for example):
df_merged <- data.frame(
x=c(1, 2, 3, 4),
y=c("a", "b", "c", "d"),
z=c("A B", "C D", "E", "F")
)
Essentially, I want df_merged to be a composition of the original df1, in addition to any variables in df2 coerced to match the format of df1. The various incantations of merge will append new rows to the merged output, which I want to avoid.
Speed is also a priority since I'll be merging fairly large data frames.
Aggregate df2 first so that each x has a single pasted z, then merge. Using the formula interface names the aggregated column z, which avoids a duplicated column name in the result:
merge(df1,
      aggregate(z ~ x, data = df2, FUN = paste, collapse = " "),
      by = "x")
#   x y   z
# 1 1 a A B
# 2 2 b C D
# 3 3 c E
# 4 4 d F
Another option:
df2.z <- with(df2, tapply(z, x, paste, collapse=' '))
transform(df1, z=df2.z[match(x, names(df2.z))])
# x y z
# 1 1 a A B
# 2 2 b C D
# 3 3 c E
# 4 4 d F
If df1$x is in order, then use df2.z[names(df2.z) %in% x] in the transform statement.
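A variation on the same idea uses split and vapply, then looks the pasted values up by key. This is a minimal sketch assuming the df1 and df2 from the question; z_by_x and df_merged are names used here for illustration. Because the lookup goes through match, df1 keeps its row count and order, and any x in df1 that is absent from df2 gets NA.

```r
df1 <- data.frame(x = c(1, 2, 3, 4),
                  y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1, 1, 2, 2, 3, 4, 5),
                  z = c("A", "B", "C", "D", "E", "F", "G"),
                  stringsAsFactors = FALSE)

# One pasted string per distinct x in df2, named by the x value
z_by_x <- vapply(split(df2$z, df2$x), paste, character(1), collapse = " ")

# Look up each df1$x in the names of z_by_x; unmatched keys become NA
df_merged <- df1
df_merged$z <- unname(z_by_x[match(df1$x, names(z_by_x))])
df_merged
```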
I'm submitting this question with my own potential answer, but it is fairly slow and I'm curious what other methods might be available.
by <- "x"
df2_processed <- as.data.frame(
  sapply(names(df2), function(x) {
    tapply(df2[[x]], df2[[by]], function(xx) {
      if (x == by) {
        return(xx[1])
      } else {
        paste(xx, collapse = " ")
      }
    })
  }), optional = TRUE, stringsAsFactors = FALSE)
merge(df1, df2_processed, all.x = TRUE)
