Removing columns based on a vector of names in R

I have a data.frame called DATA. Using BASE R, how can I remove any variables in DATA that are named any of the following: ar = c("out", "Name", "mdif", "stder", "mpre")?
Currently, I use DATA[ , !names(DATA) %in% ar], but while this removes the unwanted variables, it also leaves some of the remaining variables with a nuisance .1 suffix.
After extraction, is it possible to remove just the suffixes?
Note1: We have NO ACCESS to r, the only input is DATA.
Note2: This is toy data, a functional solution is appreciated.
r <- list(
data.frame(Name = rep("Jacob", 6),
X = c(2,2,1,1,NA, NA),
Y = c(1,1,1,2,1,NA),
Z = rep(3, 6),
out = rep(1, 6)),
data.frame(Name = rep("Jon", 6),
X = c(1,NA,3,1,NA,NA),
Y = c(1,1,1,2,NA,NA),
Z = rep(2, 6),
out = rep(1, 6)))
DATA <- do.call(cbind, r) ## DATA
ar = c("out", "Name", "mdif" , "stder" , "mpre") # The names for exclusion
DATA[ , !names(DATA) %in% ar] ## Current solution
#>
# X Y Z X.1 Y.1 Z.1 ## X.1 Y.1 Z.1 are automatically created but not needed
# 1 2 1 3 1 1 2
# 2 2 1 3 NA 1 2
# 3 1 1 3 3 1 2
# 4 1 2 3 1 2 2
# 5 NA 1 3 NA NA 2
# 6 NA NA 3 NA NA 2

Ideally, column names should be unique, but if you want to keep the duplicated column names, you can remove the suffixes with sub after extraction:
DATA1 <- DATA[ , !names(DATA) %in% ar]
names(DATA1) <- sub("\\.\\d+$", "", names(DATA1)) # strip only a trailing ".<digits>" suffix
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
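The pattern passed to sub decides which part of a name gets stripped. The following is a minimal sketch, using made-up column names that are not part of the question's data, of how a pattern anchored with $ differs from an unanchored one:

```r
# Hypothetical column names, chosen only to show the difference:
# one has ".1" in the middle, the other ends in ".1".
nms <- c("rate.1.adj", "X.1")

unanchored <- sub("\\.\\d+", "", nms)   # removes the first ".<digits>" found anywhere
anchored   <- sub("\\.\\d+$", "", nms)  # removes ".<digits>" only at the end of the name

print(unanchored)
print(anchored)
```

Even the anchored pattern cannot tell an automatically added suffix from a column that was genuinely named something like dose.1, so restoring the names from the original index is the more robust route when such names may occur.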

In base R, if we store the logical index in an object, we can reuse it to restore the original names instead of editing the suffixed names afterwards:
i1 <- !names(DATA) %in% ar
DATA1 <- setNames(DATA[i1], names(DATA)[i1])
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
For reusability, we can wrap this in a function:
f1 <- function(dat, vec) {
i1 <- !names(dat) %in% vec
setNames(dat[i1], names(dat)[i1])
}
f1(DATA, ar)
If the datasets are stored in a list, use lapply to loop over the list and apply f1 to each element:
lst1 <- list(DATA, DATA)
lapply(lst1, f1, vec = ar)
If the names to exclude also differ between list elements, put them in a second list and use Map:
ar1 <- c("out", "Name")
ar2 <- c("mdif", "stder", "mpre")
arLst <- list(ar1, ar2)
Map(f1, lst1, vec = arLst)
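Here is another option using the tidyverse (set_names comes from purrr, so it has to be loaded as well):
library(dplyr)
library(purrr) # provides set_names()
library(stringr)
DATA %>%
set_names(make.unique(names(.))) %>% # make the names unique so select() can work with them
select(-matches(str_c(ar, collapse="|"))) %>% # matches() takes a regex, so partial matches are dropped too
set_names(str_remove(names(.), "\\.\\d+$"))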
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
NOTE: It is not recommended to have duplicate column names
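To illustrate why duplicate names are discouraged, here is a small base R sketch on a toy data frame (not the question's DATA). It shows how to detect repeated names and what happens when a duplicated column is accessed by name:

```r
# Toy data frame with a deliberately duplicated column name
d <- setNames(data.frame(1:2, 3:4), c("X", "X"))

has_dups <- anyDuplicated(names(d)) > 0  # TRUE when any name repeats
by_name  <- d$X                          # lookup by name returns the first match only
by_pos   <- d[[2]]                       # the second "X" is only reachable by position

print(has_dups)
print(by_name)
print(by_pos)
```

Any later code that refers to X by name silently ignores the second column, which is the main risk of keeping the duplicated names.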

Related

find rows that are the same as a vector in R

I want to search row by row and, if a row matches a predefined vector, assign a value to a variable in that row. I would prefer to solve this with dplyr so that I can stay in the pipeline.
A simplified example:
a=c(1,2,NA)
b=c(1,NA,NA)
c=c(1,2,3)
d=c(1,2,NA)
D= data.frame(a,b,c,d)
My attempt is:
D %>% mutate(
i= case_when(
identical(c(a,b,c),c(1,1,1)) ~ 1,
identical(c(a,b,c),c(NA,NA,3)) ~ 2
)
)
I hope it gives me:
a b c d i
1 1 1 1 1 1
2 2 NA 2 2 NA
3 NA NA 3 NA 2
But my code doesn't work. I guess it's because it's not comparing a row to a vector.
I do not want to simply type a==1 & b==1 & c==1 ~ 1 inside case_when because there will be too many variables to type in my dataset.
Thank you for your advice.
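For this example, the following code would work. Because paste is vectorised, it builds one key string per row, which can then be compared with the key of the target vector:
library(dplyr)
a=c(1,2,NA)
b=c(1,NA,NA)
c=c(1,2,3)
d=c(1,2,NA)
D= data.frame(a,b,c,d)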
D %>% mutate(
i= case_when(
paste(a,b,c, sep=',') == paste(1,1,1, sep=",") ~ 1,
paste(a,b,c, sep=',') == paste(NA,NA,3, sep=",") ~ 2
)
)
a b c d i
1 1 1 1 1 1
2 2 NA 2 2 NA
3 NA NA 3 NA 2
If we have multiple conditions, create a key/value dataset and then do a join (dplyr joins treat NA as matching NA by default):
library(dplyr)
keydat <- data.frame(a =c(1, NA), b = c(1, NA), c = c(1, 3), i = c(1, 2))
left_join(D, keydat, by = c("a", "b", "c"))
# a b c d i
#1 1 1 1 1 1
#2 2 NA 2 2 NA
#3 NA NA 3 NA 2
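The same key/value lookup can be sketched without dplyr, which may help when there are too many columns to type out. This is only an illustration under stated assumptions: row_key is a hypothetical helper name, the "\r" separator is an arbitrary choice assumed not to occur in the data, and cols lists the columns to compare.

```r
D <- data.frame(a = c(1, 2, NA), b = c(1, NA, NA), c = c(1, 2, 3), d = c(1, 2, NA))
keydat <- data.frame(a = c(1, NA), b = c(1, NA), c = c(1, 3), i = c(1, 2))

cols <- c("a", "b", "c")  # columns that make up the pattern to match

# Hypothetical helper: one key string per row, built from the chosen columns
row_key <- function(df) {
  do.call(paste, c(unname(as.list(df[cols])), sep = "\r"))
}

# Look up each row of D in keydat; rows with no matching pattern get NA
D$i <- keydat$i[match(row_key(D), row_key(keydat))]
print(D)
```

Since the keys are plain strings, NA is compared as the text "NA", so a pattern containing NA matches rows with NA in the same positions.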

Appending data frames in R based on column names

I am relatively new to R, so bear with me. I have a list of data frames that I need to combine into one data frame. For example:
dfList <- list(
df1 = data.frame(x=letters[1:2],y=1:2),
df2 = data.frame(x=letters[3:4],z=3:4)
)
comes out as:
$df1
x y
1 a 1
2 b 2
$df2
x z
1 c 3
2 d 4
I want to combine the common columns and add any columns that are not already there. The result would be:
final result
x y z
1 a 1
2 b 2
3 c 3
4 d 4
Is this even possible?
Yep, it's pretty easy, actually:
library(dplyr)
df_merged <- bind_rows(dfList)
df_merged
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4
And if you don't want NA in the empty cells, you can replace them like this:
df_merged[is.na(df_merged)] <- 0 # or whatever you want to replace NA with
Or just use do.call with rbind.fill from the plyr package:
library(plyr)
do.call(rbind.fill,dfList)
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4
You could do that with base function merge():
merge(dfList$df1, dfList$df2, by = "x", all = TRUE)
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
Or with dplyr package with function full_join:
dplyr::full_join(dfList$df1, dfList$df2, by = "x")
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
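Both perform a full outer join on x: every row from either data.frame is kept and the missing cells are filled with NA. Unlike stacking the data frames, rows that share the same x value are combined into a single row.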
Hope that works for you.
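If no extra package is available, the stacking approach can also be sketched in base R. This is a minimal illustration, not a replacement for the package functions: it pads each data frame with NA columns for the names it lacks and then calls rbind. The names all_cols, padded and res are made up for this example.

```r
dfList <- list(
  df1 = data.frame(x = letters[1:2], y = 1:2),
  df2 = data.frame(x = letters[3:4], z = 3:4)
)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(dfList, names)))

# Add the missing columns to each data frame as NA, then align the column order
padded <- lapply(dfList, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
  d[all_cols]
})

res <- do.call(rbind, padded)
rownames(res) <- NULL  # drop the "df1.1", "df2.1", ... row names
print(res)
```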

Subsetting data with a vec

I want to subset a data frame by a vector, but replicate the subsetting for each value in the vector:
data = data.frame(A = c(1,2,3,1), B = c(1,2,3,4))
vec = c(1, 1, 1)
subset(data, A %in% vec)
A B
1 1 1
4 1 4
Instead of this result I want this:
A B
1 1 1
4 1 4
1 1 1
4 1 4
1 1 1
4 1 4
If you use the purrr package, you can do
library(purrr)
map_df(vec, function(x) subset(data, A == x))
With base R, it would be
do.call("rbind", lapply(vec, function(x) subset(data, A == x)))
You need to repeat the rows of the subset once per element of vec, i.e.
df2 <- subset(data, A %in% vec)
df2[rep(rownames(df2), length(vec)),]
# A B
#1 1 1
#4 1 4
#1.1 1 1
#4.1 1 4
#1.2 1 1
#4.2 1 4
One option with data.table:
library(data.table)
setDT(data, key = 'A')[.(vec)]
# A B
#1: 1 1
#2: 1 4
#3: 1 1
#4: 1 4
#5: 1 1
#6: 1 4
Or use merge, which gives cartesian product as you need when there are duplicated values in the merge-by column:
merge(data, data.frame(A = vec))
# A B
#1: 1 1
#2: 1 1
#3: 1 1
#4: 1 4
#5: 1 4
#6: 1 4
A base R solution along the lines of split-apply-combine would be
do.call(rbind, lapply(vec, function(i) data[data$A == i, ]))
A B
1 1 1
4 1 4
11 1 1
41 1 4
12 1 1
42 1 4
This could be useful if vec contained an uneven mixture of values. However, it could be expensive if there are many repetitions in vec. In that case, the computation can be reduced by combining it with the rep idea in soto's answer as follows.
# count the number of repetitions by unique value
uni <- table(vec)
# subset the data once per unique value
temp <- lapply(as.numeric(names(uni)), function(i) data[data$A == i, ])
# combine results, repeating each data.frame according to its count
do.call(rbind, temp[rep(seq_along(uni), times = as.vector(uni))])
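Another way to limit the cost is to collect row indices first and subset the data frame only once. This is a small base R sketch of that idea, with idx and res as illustrative names; it keeps the order of vec even when vec mixes different values.

```r
data <- data.frame(A = c(1, 2, 3, 1), B = c(1, 2, 3, 4))
vec  <- c(1, 1, 1)

# For every element of vec, find the matching row numbers, then flatten
idx <- unlist(lapply(vec, function(v) which(data$A == v)))

# A single subsetting step repeats the rows as often as they were matched
res <- data[idx, ]
print(res)
```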

R - New column based on previous columns, for multiple similar variables

This question is similar to previous questions (based on my search) but with a twist. I hope to use [s,l,v]apply to perform this action for efficiency.
df <- data.frame(id = c(1,2,3,1,2), var1_dose_v1 = c(2,4,NA,1,NA),
var1_dose_v2 = c(NA,NA,4,NA,3),
var2_dose_v1 = c(NA,4,2,3,5),
var2_dose_v2 = c(1,NA,NA,NA,NA),
var3_dose_v1 = c(NA,NA,2,3,5),
var3_dose_v2 = c(1,4,NA,NA,NA))
Which looks like this below
id var1_dose_v1 var1_dose_v2 var2_dose_v1 var2_dose_v2 var3_dose_v1 var3_dose_v2
1 2 NA NA 1 NA 1
2 4 NA 4 NA NA 4
3 NA 4 2 NA 2 NA
1 1 NA 3 NA 3 NA
2 NA 3 5 NA 5 NA
I want to create a new feature that amalgamates the information from version 1 (v1) and version 2 (v2) of each var#, producing the output below.
id var1_dose var2_dose var3_dose
1 2 1 1
2 4 4 4
3 4 2 2
4 1 3 3
5 3 5 5
It's important for me to use apply since there are thousands of var#s.
Thanks for your help!
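This works, provided at most one of v1 and v2 is non-missing in each row (otherwise the two values are added together):
df[is.na(df)] <- 0
new_df <- sapply(seq_len((ncol(df)-1)/2), function(x)
{
df[, paste0("var",x,"_dose_v1")] + df[, paste0("var",x,"_dose_v2")]
})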
For a solution that generalises to any number of variables or versions, dplyr provides a function called coalesce that is built for this:
library(dplyr)
grps <- unique(sub("_v.*$", "", names(df)[-1]))
mat <- sapply(grps, function(g) {
do.call("coalesce", unname(as.list(df[grep(g, names(df))])))
})
df2 <- data.frame(id=df$id, mat)
# id var1_dose var2_dose var3_dose
# 1 1 2 1 1
# 2 2 4 4 4
# 3 3 4 2 2
# 4 1 1 3 3
# 5 2 3 5 5
Another option is to sum the versions of each variable with rowSums:
func <- function(i){
col <- paste0("var",i,"_dose")
xx <- grep(col, colnames(df), value = TRUE) # names of all versions of this variable
rowSums(df[xx], na.rm = TRUE)
}
n_vars <- (ncol(df)-1)/2
l = lapply(seq_len(n_vars), func)
df1 = as.data.frame(l)
colnames(df1) <- paste0("var",seq_len(n_vars),"_dose")
# > df1
# var1_dose var2_dose var3_dose
# 1 2 1 1
# 2 4 4 4
# 3 4 2 2
# 4 1 3 3
# 5 3 5 5
If the two versions are always side by side, a more concise version of my code would be
l = lapply(1:((dim(df)[2]-1)/2),
function(i) rowSums(df[colnames(df)[c(i*2,i*2+1)]], na.rm = TRUE))
df1 = as.data.frame(l)
colnames(df1) <- paste0("var",1:((dim(df)[2]-1)/2),"_dose")
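The sum-based answers and the coalesce answer agree only when at most one version is present per row. As a hedged sketch of the coalescing idea in base R, the code below keeps the first non-missing value across the versions of each variable. The helper first_non_na and the names stems, merged and res are hypothetical, and the data is a shortened copy of the question's df.

```r
df <- data.frame(id = c(1, 2, 3, 1, 2),
                 var1_dose_v1 = c(2, 4, NA, 1, NA),
                 var1_dose_v2 = c(NA, NA, 4, NA, 3),
                 var2_dose_v1 = c(NA, 4, 2, 3, 5),
                 var2_dose_v2 = c(1, NA, NA, NA, NA))

# Hypothetical helper: keep x where it is present, otherwise fall back to y
first_non_na <- function(x, y) ifelse(is.na(x), y, x)

# Variable stems such as "var1_dose", obtained by dropping the "_v<digits>" suffix
stems <- unique(sub("_v\\d+$", "", names(df)[-1]))

merged <- lapply(stems, function(s) {
  versions <- df[startsWith(names(df), paste0(s, "_v"))]
  Reduce(first_non_na, versions)  # fold across v1, v2, ... in column order
})
names(merged) <- stems

res <- data.frame(id = df$id, merged)
print(res)
```

lapply runs once per stem, so the number of iterations grows with the number of variables rather than the number of rows.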

How can I subset a dataframe according to group membership?

I want to write a function so that a (potentially large) dataframe can be subset according to group membership, where a 'group' is a unique combination of a set of column values.
For example, I would like to subset the following data frame according to unique combination of the first two columns (Loc1 and Loc2).
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF=data.frame(Loc1,Loc2,Dat1,Dat2,Dat3)
Loc1 Loc2 Dat1 Dat2 Dat3
1 A a 1 1 2
2 A a 1 2 2
3 A b 1 1 4
4 A b 1 2 4
5 B a 1 1 6
6 B a 1 2 5
7 B b 1 2 3
I want to return (i) the number of groups (i.e. 4), (ii) the number in each group (i.e. c(2,2,2,1)), and (iii) to relabel the rows so that I can further analyse the data frame according to group membership (e.g. for ANOVA and MANOVA) (i.e.
Group<-as.factor(c(1,1,2,2,3,3,4))
Data <- cbind(Group,DF[,-1:-2])
Group Dat1 Dat2 Dat3
1 1 1 1 2
2 1 1 2 2
3 2 1 1 4
4 2 1 2 4
5 3 1 1 6
6 3 1 2 5
7 4 1 2 3
).
So far all I have managed is to get the number of groups, and I'm suspicious that there's a better way to do even this:
nrow(unique(DF[,1:2]))
I was hoping to avoid for-loops as I am concerned about the function being slow.
I have tried converting to a data matrix so that I could concatenate the row values but I couldn't get that to work either.
Many thanks
You could try:
Create Group column by using unique level combination of Loc1 and Loc2.
indx <- paste(DF[,1], DF[,2])
DF$Group <- as.numeric(factor(indx, unique(indx))) #query No (iii)
DF1 <- DF[-(1:2)][,c(4,1:3)]
# Group Dat1 Dat2 Dat3
#1 1 1 1 2
#2 1 1 2 2
#3 2 1 1 4
#4 2 1 2 4
#5 3 1 1 6
#6 3 1 2 5
#7 4 1 2 3
table(DF$Group) #(No. ii)
#1 2 3 4
#2 2 2 1
length(unique(DF$Group)) #(i)
#[1] 4
Then, if you need to subset the dataset by group, you could split it on Group to create a list of 4 data frames
split(DF1, DF1$Group)
Update
If you have multiple columns, you could still try:
ColstoGroup <- 1:2
indx <- apply(DF[,ColstoGroup], 1, paste, collapse="")
as.numeric(factor(indx, unique(indx)))
#[1] 1 1 2 2 3 3 4
You could also wrap this in a function:
fun1 <- function(dat, GroupCols){
FactGroup <- dat[, GroupCols]
if(length(GroupCols)==1){
dat$Group <- as.numeric(factor(FactGroup, levels=unique(FactGroup)))
}
else {
indx <- apply(FactGroup, 1, paste, collapse="")
dat$Group <- as.numeric(factor(indx, unique(indx)))
}
dat
}
fun1(DF, "Loc1")
fun1(DF, c("Loc1", "Loc2"))
This gets all three of your queries.
Begin with a table of the first two columns and then work with that data.
> (tab <- table(DF$Loc1, DF$Loc2))
#
# a b
# A 2 2
# B 2 1
#
> (ct <- c(tab)) ## (ii)
# [1] 2 2 2 1
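> sum(tab > 0) ## (i) counts the non-empty Loc1/Loc2 combinations
# [1] 4
> cbind(Group = rep(seq_along(ct), ct), DF[-c(1,2)]) ## (iii) relies on the rows of DF being sorted by group, with group sizes in the same order as ct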
# Group Dat1 Dat2 Dat3
# 1 1 1 1 2
# 2 1 1 2 2
# 3 2 1 1 4
# 4 2 1 2 4
# 5 3 1 1 6
# 6 3 1 2 5
# 7 4 1 2 3
Borrowing a bit from this answer and using some dplyr idioms:
library(dplyr)
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF <- data.frame(Loc1, Loc2, Dat1, Dat2, Dat3)
emitID <- local({
idCounter <- -1L
function(){
idCounter <<- idCounter + 1L
}
})
DF %>% group_by(Loc1, Loc2) %>% mutate(Group=emitID())
## Loc1 Loc2 Dat1 Dat2 Dat3 Group
## 1 A a 1 1 2 0
## 2 A a 1 2 2 0
## 3 A b 1 1 4 1
## 4 A b 1 2 4 1
## 5 B a 1 1 6 2
## 6 B a 1 2 5 2
## 7 B b 1 2 3 3
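For comparison, here is a base R sketch that answers all three queries without a counter that changes state between calls. It is an illustration under assumptions: group_cols, key, n_groups and group_sizes are made-up names, and the "\r" separator is an arbitrary choice assumed not to occur in the data.

```r
DF <- data.frame(Loc1 = c("A", "A", "A", "A", "B", "B", "B"),
                 Loc2 = c("a", "a", "b", "b", "a", "a", "b"),
                 Dat1 = c(1, 1, 1, 1, 1, 1, 1),
                 Dat2 = c(1, 2, 1, 2, 1, 2, 2),
                 Dat3 = c(2, 2, 4, 4, 6, 5, 3))

group_cols <- c("Loc1", "Loc2")

# One key string per row; the separator prevents collisions such as
# "ab" + "c" and "a" + "bc" both collapsing to "abc"
key <- do.call(paste, c(unname(as.list(DF[group_cols])), sep = "\r"))

DF$Group    <- match(key, unique(key))  # (iii) IDs in order of first appearance
n_groups    <- length(unique(key))      # (i)   number of groups
group_sizes <- tabulate(DF$Group)       # (ii)  rows per group

print(n_groups)
print(group_sizes)
print(DF)
```

The IDs start at 1 and do not depend on how often the code has been run before. In dplyr 1.0.0 and later, cur_group_id() inside a grouped mutate is another way to get a group number.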
