R: Leaving out undefined columns when subsetting a data frame

I have a data frame df1 with 300+ columns, and I am trying to subset it to df2 by column name, using a couple dozen names stored in list1. However, some of the items in list1 are not actually column names in df1, so I get an "undefined columns" error:
df2 <- df1[,list1]
Error in `[.data.frame`(df1, , list1) :
  undefined columns selected
I understand the reason for this error, but I can't go through list1 by hand to see which items occur in df1 and which don't, because I have to do this many times with many different lists. Is there a way to simply ignore the items of list1 that are not column names, so that they are left out of the subset df2?
Thanks for your kind help.

How about something like: df2 <- df1[,list1[list1 %in% names(df1)]]

I was hoping for a nicer (shorter) solution for my own use.
Here is another
df1[,intersect(names(df1), list1)]
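The two answers above select the same columns but may return them in a different order, which can matter downstream. A minimal sketch with made-up data (the contents of df1 and list1 here are toy stand-ins, not the asker's data); drop = FALSE is added so the result stays a data frame even if only one name matches:

```r
# Toy data: df1 has columns a, b, c; list1 requests c, x (not a column) and a.
df1 <- data.frame(a = 1:3, b = 4:6, c = 7:9)
list1 <- c("c", "x", "a")

# %in% filter: keeps the order in which the names appear in list1.
by_request <- df1[, list1[list1 %in% names(df1)], drop = FALSE]

# intersect(): keeps the order of the columns in df1.
by_frame <- df1[, intersect(names(df1), list1), drop = FALSE]

names(by_request)  # "c" "a"
names(by_frame)    # "a" "c"
```

Swapping the arguments, as in intersect(list1, names(df1)), would give the list1 order instead.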

Related

How can I get the column names that are resulting in the "undefined columns selected" error?

I got this error in a piece of code where I subset a data frame using a list of the columns I wish to keep. I managed to find the problematic column (its name was misspelled in the list) and solved the problem.
But to accomplish that, I had to go through the list and check the column names one by one.
Is there a way (some function, perhaps) to make R show which columns are undefined?
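You may try the purrr functions map_depth and vec_depth. I learned this from "Extract colnames from a nested list of data.frames"; credit goes to Ronak Shah.
library(purrr)
# Return the column names of a data frame, or of every data frame in a nested list.
# Note: purrr 1.0.0 deprecated vec_depth() in favour of pluck_depth().
return_names <- function(x) {
  if (inherits(x, "list")) {
    map_depth(x, vec_depth(x) - 2, names)
  } else {
    names(x)
  }
}
map(yourlist, return_names)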
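For the question as asked (finding which requested names are missing from a single data frame), a base R sketch using setdiff() may be simpler. The names df and wanted are hypothetical and used only for illustration:

```r
# Toy data frame and a requested column list containing a misspelled name.
df <- data.frame(height = 1:3, weight = 4:6)
wanted <- c("height", "wieght")

# Names that were requested but are not columns of df.
missing_cols <- setdiff(wanted, names(df))
missing_cols  # "wieght"

# Report the offending names before attempting the subset.
if (length(missing_cols) > 0) {
  message("Undefined columns: ", paste(missing_cols, collapse = ", "))
}
```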

How make R ignore undefined columns selected error message in cbind()?

I have a data frame A, and I want to combine a column that exists in A with ones that do not exist in A. I want cbind() to ignore the columns that do not exist and bind only the existing ones. Something similar to cbind(A$Key.Name,A$Dummy1,A$Dummy2), but preserving the data frame class and the column names.
library(jsonlite)
A <- fromJSON('[{"Key":{"Name":"Victor","ID":61426},"Type":"Unknown","Domain":"Cooking" }]',
              flatten = TRUE)
names(A)
cbind(A["Key.Name"], A["Dummy1"], A["Dummy2"])  # fails: Dummy1 and Dummy2 are not columns of A
Use intersect to select only those columns that are present in the data.
cols_to_select <- c('Key.Name', 'Dummy1', 'Dummy2')
result <- A[intersect(names(A), cols_to_select)]
In dplyr you can use any_of():
library(dplyr)
A %>% select(any_of(cols_to_select))

How do I show which variables are not shared by two datasets in R?

I have two data sets (A and B), one with 1600 observations/rows and 1002 variables/columns and one with 860 observations/rows and 1040 variables/columns. I want to quickly check which variables are not contained in dataset A but are in dataset B, and vice versa. I am only interested in the column names, not in the observations contained within these columns.
I found this great function here: https://cran.r-project.org/web/packages/arsenal/vignettes/comparedf.html and essentially I would want to get an output similar to this:
The code I am trying is: summary(comparedf(dataA, dataB)) However, the table is not printed because R does a row-by-row comparison of both data sets and then runs out of space when printing the results in the console. Is there a quick way of achieving what I need here?
I think you can use the anti_join() function from the dplyr package to find the unmatched records. It will give you the rows that data sets A and B do not have in common. Here is an example:
table1 <- data.frame(id = 1:5,
                     animal = c("cat", "dog", "parakeet", "lion", "duck"))
table2 <- table1[c(1, 3, 5), ]
library(dplyr)
anti_join(table1, table2, by = "id")
#   id animal
# 1  2    dog
# 2  4   lion
This will return the unshared rows by ID.
Edit
If you want to find which column names/variables appear in one data frame but not the other, then you could use this solution:
df1 <- data.frame(a=rnorm(100), b=rnorm(100), not=rnorm(100))
df2 <- data.frame(a=rnorm(100), b=rnorm(100))
df1[, !names(df1) %in% names(df2), drop = FALSE] #returns column/variable that appears in df1 but not in df2
I hope this answers your question. It returns the actual values beneath each unshared column/variable; drop = FALSE keeps the result a data frame even when only one column is unshared, so you can save the output to an object and run colnames() on it to print the unshared column/variable names.
It may be a bit clunky, but combining setdiff() with colnames() may work.
Doing both setdiff(colnames(DataA),colnames(DataB)) and setdiff(colnames(DataB),colnames(DataA)) will give you 2 vectors, each with the names of the columns present in one of the datasets but not in the other one.
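A small sketch of the setdiff() approach on toy data (the columns below are invented for illustration; only the column names matter, so the differing row counts are irrelevant):

```r
# Toy stand-ins for the two datasets, with different numbers of rows.
DataA <- data.frame(id = 1:2, age = c(30, 40), height = c(170, 180))
DataB <- data.frame(id = 1:3, age = c(30, 40, 50), weight = c(60, 70, 80))

# Column names present in one dataset but not in the other.
only_in_A <- setdiff(colnames(DataA), colnames(DataB))
only_in_B <- setdiff(colnames(DataB), colnames(DataA))

only_in_A  # "height"
only_in_B  # "weight"
```

Because only the name vectors are compared, this avoids the row-by-row comparison that made the comparedf() summary too large to print.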

Merge dataframes with unequal rows, and no matching column names R

I am trying to take df1 (a summary table), and merge it into df2 (master summary table).
This is a snapshot of df2; ignore the random 42, it is just the answer to the ultimate question.
This is an example of what df1 looks like.
Lastly, I have a vector called Dates. This matches the dates that are the column names for df2.
I am trying to cycle through 20 files and gather the summary statistics of each file. I then want to enter that data into df2 to be stored permanently. I only need to enter the Earned column.
I have tried to use merge but since they do not have shared column names, I am unable to.
My next attempt was to try this. But it gave an error, because of unequal row numbers.
df2[,paste(Dates[i])] <- cbind(df2,df1)
Then I thought that maybe if I specified the exact location, it might work.
df2[1:length(df1$Earned),Dates[i]] <- df1$Earned
But that gave an error: "New columns would leave holes after existing columns"
So then I thought of trying that again, but with cbind.
df2[1:length(df1$Earned),Dates[i]] <- cbind(df2, df1$Earned)
##This gave an error for differing row numbers
df2 <- cbind(df2[1:length(df1$Earned),Dates[i]],df1$earned)
## This "worked" but it replaced all of df2 with df1$earned, so I basically lost the rest of the master table
Any ideas would be greatly appreciated. Thank you.
Something like this might work:
df1[df1$TreatyYear %in% df2$TreatyYear, Dates] <- df2$Earned
Example
df <- data.frame(matrix(NA,4,4))
df$X1 <- 1:4
df[df$X1 %in% c(1,2),c("X3","X4")] <- c(1,2)
The only solution that I have found so far is to force df1$Earned into a vector and pad that vector with zeros until it is exactly as long as df2 has rows. Then I am able to insert the values into df2 under the specific column.
temp_values <- append(df1$Earned,rep(0,(length(df2$TreatyYear)-length(df1$TreatyYear))),after=length(df1$Earned))
df2[,paste(Dates[i])] <- temp_values
This fixes it, but in a roundabout and not very pleasant way. Any better ideas would be appreciated.
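One alternative to padding is to align the values by key with match(). This is only a sketch under assumptions: it supposes TreatyYear identifies rows in both tables (the padding code above suggests both have that column), the data are invented, and new_col is a hypothetical stand-in for Dates[i]:

```r
# Toy master table (df2) and per-file summary table (df1).
df2 <- data.frame(TreatyYear = 2010:2014)
df1 <- data.frame(TreatyYear = c(2011, 2013), Earned = c(100, 250))
new_col <- "2020-01-31"  # hypothetical stand-in for Dates[i]

# For each row of df2, find the row of df1 with the same TreatyYear (NA if none).
idx <- match(df2$TreatyYear, df1$TreatyYear)

# Indexing with NA yields NA, so unmatched years are left as NA.
df2[[new_col]] <- df1$Earned[idx]

df2
```

Unlike the zero padding, this places each value on the row with the matching TreatyYear and leaves unmatched rows as NA; replace the NAs with 0 afterwards if zeros are wanted.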

Select non NA data from Single Row

This should be pretty easy, not sure what I'm missing here. I want to select a single row from a data frame, let's say row 1000, and get all columns where that specific row is not NA.
This works
df<- df[1000,]
df<- df[, !is.na(df)]
This fails
df<- df[1000, !is.na(df)]
ERROR "undefined columns selected"
You forgot to index the row inside is.na(), so it tests every cell of the data frame instead of only row 1000. Here's an approach:
df <- df[1000, !is.na(df[1000, ])]
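To see why the one-step version fails, it may help to look at the shape of what is.na() returns. A toy sketch, where row 2 of an invented three-column data frame plays the role of row 1000:

```r
# Toy data frame; row 2 has an NA in column a.
df <- data.frame(a = c(1, NA), b = c(NA, 2), c = c(3, 4))

# is.na(df) is a logical matrix with one entry per cell,
# so it has more elements than there are columns to index.
dim(is.na(df))  # 2 3

# Restricting is.na() to the row of interest gives one logical per column.
keep <- !is.na(df[2, ])
row2 <- df[2, keep, drop = FALSE]
names(row2)  # "b" "c"
```

drop = FALSE is an addition here so the result stays a data frame even if only one column is non-NA.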
