R: retrieve dataframe name from another dataframe

I have a dataframe datasetselect that tells me what dataframe to use for each case of an analysis (let's call this the relevant dataframe).
The case is assigned dynamically, and therefore which dataframe is relevant depends on that case.
Based on the case, I would like to assign the relevant dataframe to a pointer "relevantdf". I tried:
datasetselect <- data.frame(case=c("case1","case2"),dataset=c("df1","df2"))
df1 <- data.frame(var1=letters[1:3],var2=1:3)
df2 <- data.frame(var1=letters[4:10],var2=4:10)
currentcase <- "case1"
relevantdf <- get(datasetselect[datasetselect$case == currentcase,"dataset"]) # relevantdf should point to df1
I don't understand if I have a problem with the get() function or the subsetting process.

You are almost there. The problem is that the dataset column of datasetselect is a factor (in R versions before 4.0.0, data.frame() converts strings to factors by default), and get() needs a character string, so you just need to convert the column to character.
You can add this line after the definition of datasetselect:
datasetselect$dataset <- as.character(datasetselect$dataset)
And you get your expected output:
> relevantdf
  var1 var2
1    a    1
2    b    2
3    c    3
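
For completeness, here is a minimal sketch of the same lookup that avoids the factor problem from the start by passing stringsAsFactors = FALSE. The intermediate name dataset_name and the list datasets are only illustrative, not part of the original question. The last lines show a possible alternative to get(): keeping the candidate dataframes in a named list and indexing it by name.

```r
# Lookup table; stringsAsFactors = FALSE keeps `dataset` as character
# in every R version (it is only the default from R 4.0.0 onwards).
datasetselect <- data.frame(case = c("case1", "case2"),
                            dataset = c("df1", "df2"),
                            stringsAsFactors = FALSE)
df1 <- data.frame(var1 = letters[1:3], var2 = 1:3)
df2 <- data.frame(var1 = letters[4:10], var2 = 4:10)

currentcase <- "case1"

# Name of the relevant dataframe, as a plain character string
dataset_name <- datasetselect$dataset[datasetselect$case == currentcase]

# Option 1: look the object up by name in the current environment
relevantdf <- get(dataset_name)

# Option 2 (illustrative alternative): index a named list instead of get()
datasets <- list(df1 = df1, df2 = df2)
relevantdf_from_list <- datasets[[dataset_name]]
```

Both options return a copy of df1 here; the list version has the side benefit of failing in an obvious way (NULL) when the name is not one of the expected datasets.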

Related

Building a Dataframe Column-by-Column in R

Is there a way for me to iteratively build a dataframe in R? I would be interested in knowing how I would do so either by adding column-by-column or row-by-row. I have been trying for some time now and find myself stuck.
Here is some code that I have tried:
line <- as.list(strsplit(line, ", "))[[1]] # make into list
col_names = names(idx_for_cell_counts_by_gene_id)
df <- data.frame() # here is where I get stuck - want an empty dataframe
for (x in 1:length(col_names)) {
column_name <- col_names[[x]]
information <- line[[x]]
df$column_name <- information
}
I have tried looking at some SO examples (#1, #2) but to no avail. Is there something I should do to instantiate an empty dataframe (or, better yet, a dataframe with only 'column headers' and no rows) in R?
One issue is that df$column_name creates a column named column_name. It doesn't use the value in the object named column_name. Making a representative example and walking through it will show you:
df <- data.frame(placeholder = 0)
column_name <- "my_col"
# The following will create a column named "column_name"
df$column_name <- 0
# df
#   placeholder column_name
# 1           0           0
# The following will create a column with the value inside of the object `column_name`
df[,column_name] <- 0
# df
#   placeholder column_name my_col
# 1           0           0      0
Another issue you have is that you're making a data.frame with zero rows. That means that any column you add needs to have a matching length (zero). All columns in a dataframe must be the same length.
One way to deal with this is to create a placeholder column when you create the dataframe and then remove it later: df <- data.frame(placeholder = logical(length(line[[1]]))). There may be other, more elegant ways to handle this.
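
One possible way to avoid the placeholder column is to collect the columns in a named list first and convert the list to a dataframe at the end. The sketch below assumes made-up column names and a made-up input line (col_names, line and the gene_* names are hypothetical stand-ins for the objects in the question). It also shows one way to get a dataframe with column headers and no rows.

```r
# Hypothetical inputs standing in for the objects in the question
col_names <- c("gene_a", "gene_b", "gene_c")
line <- "12, 7, 30"
values <- strsplit(line, ", ")[[1]]   # character vector: "12" "7" "30"

# Build the columns in a named list; [[ ]] uses the *value* of the name
columns <- list()
for (i in seq_along(col_names)) {
  columns[[col_names[i]]] <- values[i]
}

# Convert once at the end: one row, one column per name
df <- as.data.frame(columns, stringsAsFactors = FALSE)

# A dataframe with only column headers and zero rows
empty <- as.data.frame(
  setNames(rep(list(character(0)), length(col_names)), col_names),
  stringsAsFactors = FALSE
)
```

Note that every column is character here because the values come from strsplit(); convert them with as.numeric() or similar if needed.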

subsetting using column names as objects

I am trying to subset a data frame using a column name stored in an object. Is this possible? Here is an example:
ReallyLongColNameA <- c(1,2,3,4,5,6)
ReallyLongColNameB <- c(6,5,4,3,2,1)
ReallyLongColNameC <- c(7,8,9,10,11,12)
X <- data.frame(ReallyLongColNameA, ReallyLongColNameB, ReallyLongColNameC)
Can I store a column name as such:
ShortColNameB <- names(X[2])
and then subset using the column name stored in the object ShortColNameB?
I can subset the following:
subX <- X[X$ReallyLongColNameB == 6,]
To get:
  ReallyLongColNameA ReallyLongColNameB ReallyLongColNameC
1                  1                  6                  7
But what if I wanted the following desired output by using the column name stored in an object (ShortColNameB)?
  ReallyLongColNameA ReallyLongColNameB
1                  1                  6
You can easily remove the last column by subsetting on column numbers.
X[X[[ShortColNameB]]==6,c(1,2)]
You define what rows you want by filtering on the ==6 for ShortColNameB, and you define the columns you want by selecting the numbers (e.g. 1st and 2nd column, A & B).
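
If you would rather not depend on column positions, the same idea can be written with names only. This is a sketch under the assumption that the columns to keep are stored in a second object, here called keep (a name introduced only for this example).

```r
X <- data.frame(ReallyLongColNameA = c(1, 2, 3, 4, 5, 6),
                ReallyLongColNameB = c(6, 5, 4, 3, 2, 1),
                ReallyLongColNameC = c(7, 8, 9, 10, 11, 12))

# Column name used for the row filter
ShortColNameB <- names(X)[2]

# Column names to keep in the result (hypothetical helper object)
keep <- names(X)[1:2]

# [[ ]] looks the column up by the string stored in ShortColNameB
subX <- X[X[[ShortColNameB]] == 6, keep]
```

X$ShortColNameB would not work here, because $ takes the name literally instead of using the string stored in the object.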

Counting non-missing occurrences

I need help counting the number of non-missing data points across files and subsetting out only two columns of the larger data frame.
I was able to limit the data to only valid responses, but then I struggled getting it to return only two of the columns.
I found http://www.statmethods.net/management/subset.html and tried their solution, but myvars did not house my column label; it returned the vector of data (1:10). My code was:
myvars <- c("key")
answer <- data_subset[myvars]
answer
But instead of printing out my data subset with only the "key" column, it returns the following errors:
"Error in [.data.frame(observations_subset, myvars) : undefined columns selected" and "Error: object 'answer' not found"
Lastly, I'm not sure how I count occurrences. In Excel, they have a simple "Count" function, and in SPSS you can aggregate based on the count, but I couldn't find a command similarly titled in R. The incredibly long way that I was going to go about this once I had the data subsetted was adding in a column of nothing but 1's and summing those, but I would imagine there is an easier way.
To count unique occurrences, use table.
For example:
# load the "iris" data set that's built into R
data(iris)
# print the count of each species
table(iris$Species)
Take note of the handy function prop.table for converting a table into proportions, and of the fact that table can actually take a second argument to get a cross-tab. There's also an argument useNA, to include missing values as unique items (instead of ignoring them).
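
Since the question is specifically about non-missing values, here is a small sketch of the usual idiom, sum(!is.na(x)), next to table(). The dataframe scores and its contents are invented for illustration.

```r
# Invented example data with some missing values
scores <- data.frame(key   = c(1, NA, 3, 3, NA),
                     other = c("a", "b", NA, "d", "e"),
                     stringsAsFactors = FALSE)

# Number of non-missing values in one column
n_valid_key <- sum(!is.na(scores$key))

# Number of non-missing values in every column at once
n_valid_by_col <- colSums(!is.na(scores))

# Frequency of each value, with missing values counted as their own category
counts <- table(scores$key, useNA = "ifany")
```

is.na() returns a logical vector, and summing it counts the TRUE values, so no helper column of 1's is needed.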
Not sure whether this is what you wanted.
First, create some data to stand in for the multiple files mentioned in the post.
set.seed(42)
d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))
set.seed(49)
d2 <- as.data.frame(matrix(sample(c(NA,0:8), 5*10, replace=TRUE), ncol=10))
Create a list with the datasets as the list elements (the anchored pattern matches only names of the form d1, d2, ...)
l1 <- mget(ls(pattern="^d\\d+$"))
Create an index to subset the list element that has the maximum number of non-missing elements
indx <- which.max(sapply(l1, function(x) sum(!is.na(x))))
Key of columns to subset from the larger (non-missing) dataset
key <- c("V2", "V3")
Subset the dataset
l1[[indx]][key]
# V2 V3
#1 1 1
#2 1 3
#3 0 0
#4 4 5
#5 7 8
names(l1[indx])
#[1] "d2"
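
The same steps can be sketched with an explicit named list instead of mget(), which may be easier to follow when the dataframes come from files. The objects file_a, file_b and files below are hypothetical stand-ins with hand-made values.

```r
# Hypothetical stand-ins for data read from two files
file_a <- data.frame(key = c(1, NA, 3), value = c(NA, NA, 6))
file_b <- data.frame(key = c(1, 2, 3),  value = c(4, NA, 6))
files <- list(file_a = file_a, file_b = file_b)

# Count the non-missing cells in each dataframe
n_valid <- sapply(files, function(d) sum(!is.na(d)))

# Name of the dataframe with the most non-missing cells
best <- names(which.max(n_valid))

# Keep only the "key" column; single brackets return a dataframe
key_only <- files[[best]]["key"]
```

Subsetting with a character vector of column names only works when those names exist in the dataframe; the "undefined columns selected" error in the question usually means that the name is not among names() of that dataframe.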

Dynamically indexing a data frame by column name

I have a data frame and I want to extract the rows where particular columns have a particular value. The column names are stored in a character array and the values are stored in a list.
data <- data.frame(A=c("a","b","b"), B=c(1,2,2), C=c(3,3,4))
column_key <- c("A", "B")
value_key <- list("b", 2)
Obviously, I can extract the information I want by simple indexing if I hardcode the column names of the keys:
desired_rows <- data[data$A=="b" & data$B==2,]
desired_rows =
  A B C
2 b 2 3
3 b 2 4
But how do I do this if the column names are stored in variables? Ideally, it would be something like this:
key <- value_key
names(key) <- column_key
desired_rows <- data[key,]
But I cannot index a data.frame with a list.
I found this trick just before posting the question.
I can compare a data frame to a list that has the same length as a row, which returns a logical matrix indicating which element in each row matches the corresponding element in the list. Because I want to find rows that match entirely, I apply the all function across the rows to get a logical index into the rows of data.
desired_rows <- data[apply(data[column_key]==value_key, 1, all),]
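
An alternative sketch that builds the row index one column at a time, using Map() to compare each key column with its value and Reduce() to combine the results with &. The names matches and row_index are introduced only for this example.

```r
data <- data.frame(A = c("a", "b", "b"), B = c(1, 2, 2), C = c(3, 3, 4))
column_key <- c("A", "B")
value_key <- list("b", 2)

# One logical vector per key column: does the column equal its value?
matches <- Map(function(col, val) data[[col]] == val, column_key, value_key)

# A row qualifies only if it matches on every key column
row_index <- Reduce(`&`, matches)

desired_rows <- data[row_index, ]
```

This avoids the conversion to a matrix that apply() performs, which may matter when the key columns have different types.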

Lookup of entries with multiplicities

Suppose I have a vector data <- c(1,2,2,1) and a reference table, say : ref <- cbind(c(1,1,2,2,2,2,4,4), c(1,2,3,4,5,6,7,8))
I would like my code to return the following vector : result <- c(1,2,3,4,5,6,3,4,5,6,1,2). It's like using the R function match(). But match() only returns the first occurrence of the reference vector. Similar for %in%.
I have tried functions like merge(), join() but I would like something with only the combination of rep() and seq() R functions.
You can try
ref[ref[,1] %in% data,2]
This returns the second column value whenever the first column value is in the given set, but each matching row of ref appears only once. To repeat the matches in the order of data, you can wrap the lookup in lapply:
unlist(lapply(data, function(x) ref[ref[,1] ==x, 2]))
You can get the indices you are looking for like this:
indices <- sapply(data,function(xx)which(ref[,1]==xx))
Of course, that is a list, since the number of hits will be different for each entry of data. So you just unlist() this:
ref[unlist(indices),2]
[1] 1 2 3 4 5 6 3 4 5 6 1 2
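
Another possible approach, sketched here, is to group the second column of ref by the first column once with split() and then index the groups by the values in data. The name groups is introduced only for this example.

```r
data <- c(1, 2, 2, 1)
ref <- cbind(c(1, 1, 2, 2, 2, 2, 4, 4), c(1, 2, 3, 4, 5, 6, 7, 8))

# One list element per distinct key in the first column, named "1", "2", "4"
groups <- split(ref[, 2], ref[, 1])

# Pick the groups in the order of `data` (repeats allowed) and flatten
result <- unlist(groups[as.character(data)], use.names = FALSE)
```

The grouping is done only once, so this scans ref a single time instead of once per element of data.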
