Subset the remainder of a data frame using another subset - r

I have a sample dataset. I've created a subset of the original data frame using some condition. Now I need to extract the remaining contents of the original sample data frame, except the subset created. How can I do this?
data("mtcars")
fulldf <- mtcars
subdf <- subset.data.frame(fulldf, subset = fulldf$disp < 100)
restdf <- subset.data.frame(fulldf, subset = <fulldf without subdf>)
There are a lot of questions on subsetting data frames in R, but I couldn't find one that satisfied my requirement.
Also the final solution need not necessarily be using subset.data.frame. Any method/package will do.

In base R, it is better to assign the logical condition to an object and then negate it with ! to get the remaining rows:
i1 <- fulldf$disp < 100
subdf <- subset.data.frame(fulldf, subset = i1)
restdf <- subset.data.frame(fulldf, subset = !i1)
Another option is to create a list of two datasets with split:
lst1 <- split(fulldf, i1)
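As a rough sketch of how the split result might be used: the list elements are named after the values of the logical vector, so the two pieces can be pulled out by the names "TRUE" and "FALSE". One caveat, as far as I know: rows where the condition is NA end up in neither group (not an issue for mtcars$disp, which has no missing values).

```r
data("mtcars")
fulldf <- mtcars
i1 <- fulldf$disp < 100

# split() returns a list with one data frame per value of i1
lst1 <- split(fulldf, i1)
subdf  <- lst1[["TRUE"]]   # rows where the condition holds
restdf <- lst1[["FALSE"]]  # everything else

# the two pieces together account for every row of the original
nrow(subdf) + nrow(restdf) == nrow(fulldf)
```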
If 'subdf' is created with multiple conditions (the question is not clear on this), one option is to add a sequence variable to the data and then subset with %in%:
fulldf$ind <- seq_len(nrow(fulldf))
then after the 'subdf' step
restdf <- subset(fulldf, !ind %in% subdf$ind)
and remove the 'ind' columns
restdf$ind <- NULL
subdf$ind <- NULL
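Put together, the index-column steps above might look like the following. The two-part condition (disp < 100 and mpg > 30) is only a made-up stand-in for "multiple conditions"; any way of building subdf should work as long as the ind column is carried along.

```r
data("mtcars")
fulldf <- mtcars

# add a row identifier before subsetting
fulldf$ind <- seq_len(nrow(fulldf))

# hypothetical multi-condition subset (illustrative only)
subdf <- subset(fulldf, disp < 100 & mpg > 30)

# keep every row whose identifier is not in the subset
restdf <- subset(fulldf, !ind %in% subdf$ind)

# drop the helper column again
fulldf$ind <- NULL
restdf$ind <- NULL
subdf$ind  <- NULL
```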

Related

Find difference of same column names across different data frames in a list in R

I have a list of data frames with same column names where each dataframe corresponds to a month
June_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(100,200,250,450), Metric2=c(1000,2000,5000,6000))
July_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(140,250,125,400), Metric2=c(2000,3000,2000,3000))
Aug_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(200,150,250,600), Metric2=c(1500,2000,4000,2000))
Sep_2018 <- data.frame(Features=c("abc","def","ghi","jkl"), Metric1=c(500,500,1000,100), Metric2=c(500,4000,6000,8000))
lst1 <- list(Aug_2018,June_2018,July_2018,Sep_2018)
names(lst1) <- c("Aug_2018","June_2018","July_2018","Sep_2018")
I intend to create a new column in each of the data frames in the list named Percent_Change_Metric1 and Percent_Change_Metric2 by doing below calculation
for (i in names(lst1)){
lst1[[i]]$Percent_Change_Metric1 <- ((lst1[[i+1]]$Metric1-lst1[[i]]$Metric1)*100/lst1[[i]]$Metric1)
lst1[[i]]$Percent_Change_Metric2 <- ((lst1[[i+1]]$Metric2-lst1[[i]]$Metric2)*100/lst1[[i]]$Metric2)
}
However, the i in the for loop iterates over names(lst1), so lst1[[i+1]] obviously won't work.
Also, the data frames in my list are in random order and not ordered by month-year, so subtracting successive data frames' columns isn't entirely accurate.
Please advise:
1. How do I go about adding Percent_Change_Metric1 and Percent_Change_Metric2?
2. How do I choose the data frame corresponding to the next month to arrive at the correct Percent_Change?
Thanks for the guidance
Here is one option with base R
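# x is the following list element, y the current one;
# the last element has no successor and is left unchanged
lst1[-length(lst1)] <- Map(function(x, y)
  transform(y, Percent_Change_Metric1 = (x$Metric1 - Metric1) * 100 / Metric1,
               Percent_Change_Metric2 = (x$Metric2 - Metric2) * 100 / Metric2),
  lst1[-1], lst1[-length(lst1)])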
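The Map call pairs each list element with the one that follows it, so it relies on the list already being in calendar order. A possible way to handle the ordering part of the question is to sort the list by its names first. The sketch below assumes every name has the form Month_Year with an English month name or abbreviation (as in the sample data); make_month is just a hypothetical helper to keep the example short.

```r
# hypothetical helper to rebuild the sample data compactly
make_month <- function(m1, m2) {
  data.frame(Features = c("abc", "def", "ghi", "jkl"), Metric1 = m1, Metric2 = m2)
}
lst1 <- list(
  Aug_2018  = make_month(c(200, 150, 250, 600), c(1500, 2000, 4000, 2000)),
  June_2018 = make_month(c(100, 200, 250, 450), c(1000, 2000, 5000, 6000)),
  July_2018 = make_month(c(140, 250, 125, 400), c(2000, 3000, 2000, 3000)),
  Sep_2018  = make_month(c(500, 500, 1000, 100), c(500, 4000, 6000, 8000))
)

# order by year, then by month (first three letters matched against month.abb)
parts <- strsplit(names(lst1), "_", fixed = TRUE)
mon <- match(substr(vapply(parts, `[`, character(1), 1), 1, 3), month.abb)
yr  <- as.integer(vapply(parts, `[`, character(1), 2))
lst1 <- lst1[order(yr, mon)]

# percent change relative to the following month; the last month gets no new columns
n <- length(lst1)
lst1[-n] <- Map(function(nxt, cur) {
  cur$Percent_Change_Metric1 <- (nxt$Metric1 - cur$Metric1) * 100 / cur$Metric1
  cur$Percent_Change_Metric2 <- (nxt$Metric2 - cur$Metric2) * 100 / cur$Metric2
  cur
}, lst1[-1], lst1[-n])

names(lst1)
```

Matching on month.abb rather than parsing dates keeps the sort independent of the session locale, at the cost of only understanding English month names.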

Using a loop to create multiple dataframes from a single dataset

Quick question for you. I have the following:
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c)
What I am looking to do is use a for loop to create multiple data frames from rows based on the contents of column b (i.e. a df for the "a", the "d", and so on).
At the same time, I would also like to name each data frame after the corresponding value from column b (the data frame created from the "a" rows will be named "a").
I tried making it work based on the answers provided in "Using a loop to create multiple data frames in R", but I had no luck.
If it helps, I have variables created with levels() and nlevels() to use in the loop to keep it scalable based on how my data changes. Any help would be much appreciated.
Thanks!
This should do:
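library(dplyr)
df$b <- as.character(df$b)
col.filters <- unique(df$b)
# one data frame per distinct value of b
# (stored as df.list rather than list, to avoid masking base::list)
df.list <- lapply(seq_along(col.filters), function(x) {
  filter(df, b == col.filters[x])
})
names(df.list) <- col.filters
# creates one object per name in the global environment
list2env(df.list, .GlobalEnv)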
Naturally, you don't need dplyr to do this. You can just use base syntax:
df$b <- as.character(df$b)
col.filters <- unique(df$b)
# stored as df.list rather than list, to avoid masking base::list
df.list <- lapply(seq_along(col.filters), function(x) {
  df[df[, "b"] == col.filters[x], ]
})
names(df.list) <- col.filters
list2env(df.list, .GlobalEnv)
But I find dplyr much more intuitive.
Cheers
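For what it's worth, base R's split can probably replace the lapply step entirely, since it already returns a named list with one data frame per value of b. The sketch below keeps the pieces in a list (df_list is just an illustrative name). Note that with this sample data, exporting the list with list2env would create objects called a and c and so overwrite the original vectors of the same names, which is one reason the call is left commented out here.

```r
a <- c(1, 5, 2, 3, 4, 5, 3, 2, 1, 3)
b <- c("a", "a", "f", "d", "f", "c", "a", "r", "a", "c")
c <- c(.2, .6, .4, .545, .98, .312, .112, .4, .9, .5)
df <- data.frame(a, b, c, stringsAsFactors = FALSE)

# one data frame per distinct value of b, named after that value
df_list <- split(df, df$b)

names(df_list)   # group names, in sorted order
df_list[["a"]]   # the rows where b == "a"

# optional: export each element as its own object
# (this would overwrite the vectors a and c defined above)
# list2env(df_list, envir = .GlobalEnv)
```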

Extract Columns that Do Not Exist in Another Matrix Based on Column Names

I have two matrices df_matrix and df_subset. One is a subset of the other one. Therefore, df_matrix has 10000 rows and columns and df_subset contains only 8222 columns and rows of df_matrix.
I want to select only those columns from df_matrix that are NOT in df_subset. I thought it is best to do it by column names, so I tried executing this code:
newdf <- df_matrix[, which( (colnames(df_matrix)) != (colnames(KroneckerProducts)) )]
However, this is not working at all. Is there any other way to do this?
As a general rule, do not use == or != to compare objects of different lengths: the shorter one is recycled and the comparison is done position by position.
Use %in% with ! instead:
newdf <- df_matrix[, !(colnames(df_matrix) %in% colnames(KroneckerProducts))]
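A toy example may make the difference clearer. The 3 x 4 matrix and its column names below are invented for illustration, and df_subset follows the naming used in the question text rather than KroneckerProducts.

```r
# small stand-in for the real matrices
df_matrix <- matrix(1:12, nrow = 3,
                    dimnames = list(NULL, c("A", "B", "C", "D")))
df_subset <- df_matrix[, c("B", "D"), drop = FALSE]

# != recycles the shorter vector and compares by position,
# so it does not answer "is this name present in the other set?"
by_position <- colnames(df_matrix) != colnames(df_subset)

# %in% tests membership, which is what is wanted here
keep <- !(colnames(df_matrix) %in% colnames(df_subset))
newdf <- df_matrix[, keep, drop = FALSE]

# setdiff gives the same set of names directly
missing_cols <- setdiff(colnames(df_matrix), colnames(df_subset))
```

Here by_position is TRUE TRUE TRUE FALSE, which would wrongly keep column B and drop column D, whereas keep selects columns A and C. The drop = FALSE argument keeps the result a matrix even if only one column is left.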

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only one column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects rows based on conditions for specific columns, whereas I want any of the three/two columns to fulfill my criteria
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
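To see why this works: data == 1 gives a logical matrix, and rowSums counts the TRUE values in each row, i.e. how many columns equal 1. The small data frame below is made up so the counts are easy to check by hand.

```r
data <- data.frame(x = c(1, 1, 1, 2, 3),
                   y = c(1, 1, 2, 2, 1),
                   z = c(1, 3, 3, 2, 1))

# number of columns equal to 1 in each row
ones <- rowSums(data == 1)

all_three   <- data[ones == 3, ]  # every column is 1
exactly_two <- data[ones == 2, ]  # any two columns are 1
exactly_one <- data[ones == 1, ]  # any single column is 1
```

If the data can contain missing values, rowSums(data == 1, na.rm = TRUE) is probably what you want, since otherwise any row containing an NA gets a count of NA.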
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list, especially when there is some structure that guides this storage, as is the case here. Following @richard-scriven's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
  data[as.integer(names(rowCounts)[rowCounts == i]), ])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
@eddi below suggests a nice shortcut to my method of pulling out the table names, using tabulate and the arr.ind argument of which. When which is applied to a multidimensional object such as an array or a data.frame, setting arr.ind = TRUE produces the row and column indices where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the vector of row indices where a 1 is found in any variable. The tabulate function is then applied to these row indices and returns a vector in which each element represents a row, with rows that contain no 1 filled in with 0. Passing nbins = nrow(data) makes sure the vector has one entry per row even when the last rows contain no 1.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1], nbins = nrow(data))
returns a vector from which you can immediately pull the values. You can combine it with the lapply above to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
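One thing worth checking with the tabulate approach is the length of its result. By default tabulate only counts up to the largest value it sees, so if the last rows of the data contain no 1, the vector comes back shorter than nrow(data) and a logical index built from it would be recycled. The toy data below is invented to show this; supplying nbins appears to be the simplest safeguard.

```r
data <- data.frame(x = c(1, 2, 1, 2),
                   y = c(1, 2, 3, 2),
                   z = c(1, 3, 3, 3))

# row index of every cell equal to 1
rows_with_one <- which(data == 1, arr.ind = TRUE)[, 1]

# default: bins stop at the largest row index seen (row 3 here)
short_counts <- tabulate(rows_with_one)

# one bin per row of the data, trailing rows without a 1 get 0
full_counts <- tabulate(rows_with_one, nbins = nrow(data))

exactly_one <- data[full_counts == 1, ]
```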

Subsetting efficiently on multiple columns and rows

I am trying to subset my data to drop rows with certain values of certain variables. Suppose I have a data frame df with many columns and rows, I want to drop rows based on the values of variables G1 and G9, and I only want to keep rows where those variables take on values of 1, 2, or 3. In this way, I aim to subset on the same values across multiple variables.
I am trying to do this with few lines of code and in a manner that allows quick changes to the variables or values I would like to use. For example, assuming I start with data frame df and want to end with newdf, which excludes observations where G1 and G9 do not take on values of 1, 2, or 3:
# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3)), ]
newdf <- newdf[which(newdf$G9 %in% c(1,2,3)), ]
# Better approach (requires manually changing variable names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals), ]
newdf <- newdf[which(newdf$G9 %in% vals), ]
If I wanted to not only subset on G1 and G9 but MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong but I am not sure how to implement an alternative.
newdf <- c(1,2,3)
newdf <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals, ]
It is my understanding I want to use apply() but I am not sure how.
You do not need to use which with %in%; it already returns a logical vector. How about the below:
keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
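To scale this to many variables without writing one %in% test per column, the column names could be kept in a character vector and the per-column tests combined with Reduce. This is only a sketch; the sample data frame and the names vars and keep are made up for illustration.

```r
# toy data; "other" stands in for the columns that are not filtered on
df <- data.frame(G1 = c(1, 2, 5, 3, 4),
                 G9 = c(3, 7, 1, 2, 1),
                 other = letters[1:5])

vals <- c(1, 2, 3)      # values to keep
vars <- c("G1", "G9")   # columns that must all take one of those values

# one logical vector per column, then AND them together row by row
keep <- Reduce(`&`, lapply(df[vars], `%in%`, vals))
newdf <- df[keep, ]
```

Changing the variables or the values then only means editing vars or vals.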
Use data.table
First, melt your data
library(data.table)
DT <- melt(as.data.table(df)) # df is a data frame, so convert it to a data.table first
Then split into lists
DTLists <- split(DT, list(DT[1:9])) #this is the number of columns that you have.
Now you can operate on the lists recursively using lapply
DTresult <- lapply(DTLists, function(x) {
  ...
})
