Subsetting efficiently on multiple columns and rows - r

I am trying to subset my data to drop rows with certain values of certain variables. Suppose I have a data frame df with many columns and rows. I want to drop rows based on the values of variables G1 and G9, keeping only rows where both variables take on values of 1, 2, or 3. In this way, I aim to subset on the same values across multiple variables.
I am trying to do this with few lines of code and in a manner that allows quick changes to the variables or values I would like to use. For example, assuming I start with data frame df and want to end with newdf, which excludes observations where G1 and G9 do not take on values of 1, 2, or 3:
# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3)), ]
newdf <- newdf[which(newdf$G9 %in% c(1,2,3)), ]
# Better approach (requires manually changing variable names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals), ]
newdf <- newdf[which(newdf$G9 %in% vals), ]
If I wanted to not only subset on G1 and G9 but MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong but I am not sure how to implement an alternative.
vals <- c(1,2,3)
vars <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals), ]
It is my understanding I want to use apply() but I am not sure how.

You do not need to use which() with %in%, because %in% already returns a logical vector. How about the following:
keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
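To scale the same idea to many variables, one option is to build one logical vector per column and combine them with Reduce(). This is only a sketch: the toy data frame and the names vars and keep are assumptions made for illustration, not part of the question's data.

```r
# Toy data standing in for the asker's df (assumed for illustration)
df <- data.frame(G1 = c(1, 2, 5, 3),
                 G5 = c(9, 9, 9, 9),
                 G9 = c(3, 7, 1, 2))

vars <- c("G1", "G9")   # columns to test; edit this vector to add more
vals <- c(1, 2, 3)      # values to keep

# One logical vector per column, then AND them together row by row
keep <- Reduce(`&`, lapply(df[vars], `%in%`, vals))

newdf <- df[keep, , drop = FALSE]
```

Changing the variables or the allowed values then only requires editing vars or vals.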

Use data.table
First, melt your data
library(data.table)
DT <- melt(as.data.table(df)) # melt() needs a data.table, so convert df first
Then split into lists
DTLists <- split(DT, by = "variable") # one list element per original column
Now you can operate on each element of the list using lapply
DTresult <- lapply(DTLists, function(x) {
...
})

Related

How to replace several variables with several variables from another dataframe in R using a loop?

I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see, the variable names are the same in both data frames. The numeric values differ slightly, however: the correct versions are in df2 but need to be in df1. I need to do this for many, many variables in a complex data set and wonder whether someone could help with a more efficient way to code this (possibly without using column references).
Here is some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help but I will try and summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their orders already match, then you can achieve this really easily with the following subset notation:
df1[, c({column names df1})] <- df2[, c({column names df2}), drop = FALSE]
Let's say that df1 has columns a, b, and c and you want to replace b and c with two columns of df2, whose columns are x, y, and z.
df1[, c("b","c")] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop argument on the right-hand side is just added protection to ensure the subset stays a data.frame rather than collapsing to a vector; it is not accepted on the left-hand side of an assignment.
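When the column names are identical in both data frames, as in the question's example, the list of columns can be computed instead of typed. A minimal sketch, assuming the rows are already aligned; the extra column keep and the name shared are hypothetical additions for illustration:

```r
# Example data modelled on the question; 'keep' is an assumed extra column
df1 <- data.frame(var1 = 1:10, var2 = 1:10, keep = letters[1:10])
df2 <- data.frame(var1 = 11:20, var2 = 11:20)

# Columns present in both data frames
shared <- intersect(names(df1), names(df2))

# Positional replacement only makes sense if the row counts agree
stopifnot(nrow(df1) == nrow(df2))

# Overwrite every shared column in df1 with the version from df2
df1[shared] <- df2[shared]
```

Columns that exist only in df1 are left untouched.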
If you do NOT know that the order is correct, or one data frame may differ in size from the other, BUT there is a unique identifier shared by the two data frames, then I would personally use a function that is designed for merging two data frames. Depending on your preference you can use merge from base or use the *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))
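Note that a join adds the columns of df2 next to those of df1 rather than overwriting them. If the goal is to overwrite in place, a base R sketch using match() on the identifier might look like the following. The column name id and the sample values are assumptions for illustration only.

```r
# Assumed example: both data frames share a unique key column 'id',
# but the rows are in a different order
df1 <- data.frame(id = c(3, 1, 2), var1 = c(30, 10, 20), var2 = c(0.3, 0.1, 0.2))
df2 <- data.frame(id = c(1, 2, 3), var1 = c(11, 21, 31), var2 = c(1.1, 2.1, 3.1))

cols <- c("var1", "var2")          # columns to overwrite

# For each row of df1, the position of the same id in df2 (NA if absent)
idx <- match(df1$id, df2$id)
matched <- !is.na(idx)

# Overwrite only the rows that have a partner in df2
df1[matched, cols] <- df2[idx[matched], cols]
```

Rows of df1 without a matching id in df2 keep their original values.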

Subset the remaining of a dataframe using another subset

I have a sample dataset. I've created a subset of the original data frame using some condition. Now I need to extract the remaining contents of the original sample data frame, except the subset created. How can I do this?
data("mtcars")
fulldf <- mtcars
subdf <- subset.data.frame(fulldf, subset = fulldf$disp < 100)
restdf <- subset.data.frame(fulldf, subset = <fulldf without subdf>)
There are a lot of questions on subsetting data frames in R, but I couldn't find one that satisfied my requirement.
Also the final solution need not necessarily be using subset.data.frame. Any method/package will do.
In base R, it is better to store the logical condition in an object and then negate it (!) to get the remaining rows
i1 <- fulldf$disp < 100
subdf <- subset.data.frame(fulldf, subset = i1)
restdf <- subset.data.frame(fulldf, subset = !i1)
Also another option is to create a list of two datasets with split
lst1 <- split(fulldf, i1)
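As a short sketch of how the two pieces can be retrieved from that list: split() on a logical vector names the elements "FALSE" and "TRUE", so they can be extracted by name.

```r
data("mtcars")
fulldf <- mtcars

i1 <- fulldf$disp < 100          # logical condition
lst1 <- split(fulldf, i1)        # list with elements "FALSE" and "TRUE"

subdf  <- lst1[["TRUE"]]         # rows where the condition holds
restdf <- lst1[["FALSE"]]        # everything else
```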
If 'subdf' is created with multiple conditions (this is not clear from the question), one option is to add a sequence variable to the data and then subset with %in%
fulldf$ind <- seq_len(nrow(fulldf))
then after the 'subdf' step
restdf <- subset(fulldf, !ind %in% subdf$ind)
and remove the 'ind' column from both
restdf$ind <- NULL
subdf$ind <- NULL
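A variation on the same idea, not taken from the answer above: when the data frame has unique row names (as mtcars does), the row names can play the role of the 'ind' column, so no helper column is needed. The second condition on mpg is an assumption added to show a multi-condition subset.

```r
data("mtcars")
fulldf <- mtcars

# Subset built from several conditions (the mpg condition is illustrative)
subdf <- subset(fulldf, disp < 100 & mpg > 30)

# Keep every row whose row name does not appear in subdf
restdf <- fulldf[!rownames(fulldf) %in% rownames(subdf), ]
```

This relies on subset() preserving the original row names, and it would not work for data frames with duplicated or reset row names.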

Split Apply Combine

I have a large list, and would like to apply the exact technique detailed in the answer here:
Create mutually exclusive dummy variables from categorical variable in R
However, my data is much larger, and I would like to split, apply and combine the operation to each individual row.
This code, which of course does not work, illustrates what I am trying to do:
id <- c(1,1,1,1)
time <- c(1,2,3,4)
time <- as.character(time)
df <- data.frame(id,time)
unique.time <- as.character(unique(df$time))
df1 <- split(df, row(df))
sapply(df1, (unique.time, function(x)as.numeric(df1$time == x)))
z <- unsplit(lapply(df1, row(df)), scale), x)
Thanks!
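One possible reading of the goal is a set of mutually exclusive 0/1 columns, one per unique value of time. Under that assumption, a vectorised sketch avoids splitting by row altogether. The names dummies and result, and the "time" prefix on the new columns, are illustrative choices.

```r
# Data from the question, with time stored as character
df <- data.frame(id = c(1, 1, 1, 1),
                 time = as.character(1:4),
                 stringsAsFactors = FALSE)

unique.time <- unique(df$time)

# One column per unique time value: 1 where the row matches, 0 otherwise
dummies <- sapply(unique.time, function(x) as.numeric(df$time == x))
colnames(dummies) <- paste0("time", unique.time)

# Combine the indicator columns with the original data
result <- cbind(df, dummies)
```

For data too large to process at once, the same function could be applied to chunks of rows and the pieces bound back together with rbind, but that is a design choice that depends on the actual data size.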

Using a loop to create multiple dataframes from a single dataset

Quick question for you. I have the following:
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c)
What I am looking to do is use a for loop to create multiple data frames from rows based on the contents of column b (i.e. a df for the "a," the "d," and so on).
At the same time, I would also like to name each data frame after the corresponding value from column b (the data frame created from the "a" rows will be named "a").
I tried making it work based on the answers provided in "Using a loop to create multiple data frames in R" but I had no luck.
If it helps, I have variables created with levels() and nlevels() to use in the loop to keep it scalable based on how my data changes. Any help would be much appreciated.
Thanks!
This should do:
require(dplyr)
df$b <- as.character(df$b)
col.filters <- unique(df$b)
# one filtered data frame per unique value of b
df.list <- lapply(seq_along(col.filters), function(x) {
  filter(df, b == col.filters[x])
})
names(df.list) <- col.filters
# create one object per list element in the global environment
list2env(df.list, .GlobalEnv)
Naturally, you don't need dplyr to do this. You can just use base syntax:
df$b <- as.character(df$b)
col.filters <- unique(df$b)
df.list <- lapply(seq_along(col.filters), function(x) {
  df[df[, "b"] == col.filters[x], ]
})
names(df.list) <- col.filters
list2env(df.list, .GlobalEnv)
But I find dplyr much more intuitive.
Cheers
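As a shorter base R alternative to the lapply() calls above, split() produces the same kind of named list in one step. This is a sketch using the question's data; the name dfs is an illustrative choice.

```r
a <- c(1, 5, 2, 3, 4, 5, 3, 2, 1, 3)
b <- c("a", "a", "f", "d", "f", "c", "a", "r", "a", "c")
c <- c(.2, .6, .4, .545, .98, .312, .112, .4, .9, .5)
df <- data.frame(a, b, c)

# One data frame per distinct value of b, named after that value
dfs <- split(df, df$b)

# Access a piece by name; keeping the pieces in a list is usually
# easier to work with than separate objects in the global environment
dfs[["a"]]
```

If separate objects are still wanted, list2env(dfs, .GlobalEnv) can be applied to this list as well. Note that names such as "c" would then overwrite existing objects of the same name, here the vector c.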

R: Add columns to a data frame on the fly

New to R and programming in general over here. I have several binary matrices of presence/absence data for species (columns) and plots (rows). I'm trying to use them in several dissimilarity indices, which require that they all have the same dimensions. Although there are always 10 plots, there is a variable number of columns depending on which species were observed at that particular time. My attempt to add the 'missing' columns to each matrix so I can perform the analyses went as follows:
df1 <- read.csv('file1.csv', header=TRUE)
df2 <- read.csv('file2.csv', header=TRUE)
newCol <- unique(append(colnames(df1),colnames(df2)))
diff1 <- setdiff(newCol,colnames(df1))
diff2 <- setdiff(newCol,colnames(df2))
for (i in 1:length(diff1)) {
df1[paste(diff1[i])]
}
for (i in 1:length(diff2)) {
df2[paste(diff2[i])]
}
No errors are thrown, but df1 and df2 both remain unchanged. I suspect my issue is with my use of paste, but I couldn't find any other way to add columns to a data frame on the fly like that. When added, the new columns should have 0s in the matrix as well, but I think that's the default, so I didn't add anything to specify it.
Thanks all.
Using your code, you can generate the columns without the for loop by:
df1[, diff1] <- 0 #I guess you want `0` to fill those columns
df2[, diff2] <- 0
identical(sort(colnames(df1)), sort(colnames(df2)))
#[1] TRUE
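For reference, the loop in the question leaves the data frames unchanged because df1[paste(diff1[i])] only looks a column up and never assigns anything. A sketch of a loop that does assign, using small made-up data frames (the species names and the name all.cols are assumptions for illustration):

```r
# Small stand-ins for the two presence/absence tables
df1 <- data.frame(SpeciesA = c(1, 0), SpeciesB = c(0, 1))
df2 <- data.frame(SpeciesB = c(1, 1), SpeciesC = c(0, 1))

all.cols <- union(colnames(df1), colnames(df2))

# Iterate over the names directly; this is also safe when nothing is
# missing, whereas 1:length(x) would iterate over c(1, 0) for an empty x
for (col in setdiff(all.cols, colnames(df1))) df1[[col]] <- 0
for (col in setdiff(all.cols, colnames(df2))) df2[[col]] <- 0

# Put the columns in the same order in both data frames
df1 <- df1[, all.cols]
df2 <- df2[, all.cols]
```

The 0 has to be assigned explicitly; there is no default fill value for a new column.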
Or if you want to combine the datasets into one, you could use rbindlist from data.table with fill=TRUE (note that the missing columns are then filled with NA rather than 0)
library(data.table)
rbindlist(list(df1, df2), fill=TRUE)
Data
set.seed(22)
df1 <- as.data.frame(matrix(sample(0:1, 10*6, replace=TRUE), ncol=6,
    dimnames=list(NULL, sample(paste0("Species", 1:10), 6, replace=FALSE))))
set.seed(35)
df2 <- as.data.frame(matrix(sample(0:1, 10*8, replace=TRUE), ncol=8,
    dimnames=list(NULL, sample(paste0("Species", 1:10), 8, replace=FALSE))))
