Subset my R list by character vector from dataframe - r

I have an R object, 'd', that is a list. I want a data frame of references to subsets of this list to pass as arguments to a function, 'myfunction'. This function will be called thousands of times using rslurm, each time with a different subset of d.
Example: d[['1']][[3]] references a data matrix within the list.
myfunction(d[['1']][[3]])
works fine, but I want to be able to call these subsets from a data frame.
I want a data frame, 'ds', containing all of my subset references.
>ds
d
1 d[['1']][[3]]
2 d[['1']][[4]]
>myfunction(get(ds[1,1]))
Error in get(ds[1, 1]) : object 'd[['1']][[3]]' not found
Is there something like 'get' that will let me call a subset of my object, d?
Or something I can put in 'myfunction' that will clarify that this string references a subset of d?

A list:
my_list <- list('peanut', 'butter', 'is', 'amazing')
A data frame containing the subset references as strings:
my_dataframe <- data.frame(keys=c("my_list[[1]]", "my_list[[2]]", "my_list[[3]]", "my_list[[4]]"), stringsAsFactors=FALSE)
A function that evaluates a reference string against the list passed to it:
my_function <- function(key, my_list) {
  # parse() turns the string into an expression; eval() runs it inside
  # this function, where my_list is the argument
  from_list <- eval(parse(text=key))
  print(from_list)
}
Getting the value from the list by passing in the chosen data frame cell and the list:
my_function(my_dataframe[1,1], my_list)

I solved this by changing myfunction to take two arguments, c and w, and extracting the subset of d with bracket notation in the first line of the updated function. My ds now has two columns, c and w, with column c stored as character (via as.character), and it works!
myfunction <- function(c, w) {
  d <- d[[c]][[w]]
  # ....rest of function
}
>ds
c w
1 1 3
2 1 4
>test <- myfunction(ds[1,1],ds[1,2])
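The two-column approach above can be sketched end to end as follows. This is a minimal illustration, not the original code: the contents of d, the body of myfunction (a plain sum) and the extra data argument are placeholders assumed for the example. mapply walks the rows of ds, which is roughly what a batch of rslurm jobs would do with one row each.

```r
# Toy stand-in for the nested list 'd' (hypothetical data)
d <- list("1" = list("x", "y", matrix(1:4, nrow = 2), matrix(5:8, nrow = 2)))

# One row per subset: 'c' is the name of the outer element (character),
# 'w' is the position inside it
ds <- data.frame(c = c("1", "1"), w = c(3, 4), stringsAsFactors = FALSE)

# Placeholder function: looks up the subset, then does some work on it
myfunction <- function(c, w, data = d) {
  m <- data[[c]][[w]]
  sum(m)  # stand-in for the rest of the function
}

# Call myfunction once per row of ds
results <- mapply(myfunction, ds$c, ds$w)
```

Storing the indices rather than the text "d[['1']][[3]]" avoids eval(parse()) entirely, and it keeps ds easy to build programmatically, for example with expand.grid.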

Related

Is there a way of assigning multiple variables to elements in a named R list in one line?

I would like to assign variables to elements in a named list. These variable names are the same as the names in the list. Is there a way that I can assign them all in one line instead of one at a time like what I am doing below?
params <- data[data$Month == m,]
a <- params$a
b <- params$b
c <- params$c
I know that in JavaScript you can destructure an array like so:
const [a, b, c] =[1,2,3]
Or a dictionary (which is perhaps more similar to an R named list):
const {a, b, c} = {a:1, b:2, c:3}
Each of these assigns the values 1, 2 and 3 to the variables a, b and c respectively.
Is there a similar approach that I can take with R?
Use list2env to create individual objects for each column in params.
params <- data[data$Month == m,]
list2env(params, .GlobalEnv)
If you want to keep data in a named list use as.list.
as.list(params)
Establish a named list (lst) in advance. Then you can assign the variables in the data frame (params) in one line.
lst <- vector(mode="list", length = 3)
lst <- list(params$a,params$b,params$c)
names(lst) <- c("a","b","c")
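If the goal is to use the values without filling the global environment, two variations may be worth considering. The sketch below assumes a small hypothetical params list and a made-up function called compute. It shows list2env writing into a function's own environment, and with(), which evaluates an expression using the list elements as variables.

```r
# Hypothetical named list standing in for one row of the data
params <- list(a = 1, b = 2, c = 3)

# Variation 1: unpack into the local environment of a function, so the
# variables disappear again when the function returns
compute <- function(p) {
  list2env(p, envir = environment())
  a + b + c
}
total_local <- compute(params)

# Variation 2: no assignment at all; evaluate an expression "inside" the list
total_with <- with(params, a + b + c)
```

Both variations leave the global environment untouched, which matters when existing objects share names with the list elements (here, c would otherwise mask the base function c for non-call uses).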

R remove NA value from factor in split function

I'm using the split function to group my data.frame into three categories (C, Q or S). Now, when I execute the split function, I notice that there are now 4 lists in the variable (C, Q, S and empty string).
I expect this has to do with an NA value, or an empty string. How do I filter this correctly?
Currently, my code looks like this:
# Read the data from the CSV file.
train.csv <- read.csv("train.csv")
# Create some handy variables
ship.embarked <- split(train.csv, train.csv$Embarked)
ship.pclass <- split(train.csv, train.csv$Pclass)
ship.embarked returns 4 lists (C, Q S and empty string), while I expect to have 3 (C, Q and S). How do I solve this correctly?
To remove the "" group, convert the column to character, use nzchar to get a logical vector, subset the rows with it, and remove the unused levels with droplevels:
train.csv <- droplevels(train.csv[nzchar(as.character(train.csv$Embarked)),])
Now we can do the split and there won't be a "" group.
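A self-contained sketch of the same steps, using a small made-up data frame in place of train.csv (the column values are assumptions for illustration only):

```r
# Made-up data with one empty-string entry in Embarked
train <- data.frame(
  Embarked = factor(c("C", "Q", "S", "", "S")),
  Pclass   = c(1, 2, 3, 1, 2)
)

# TRUE for rows whose Embarked value is a non-empty string
keep <- nzchar(as.character(train$Embarked))

# Drop the empty rows, then drop the now-unused "" factor level
train_clean <- droplevels(train[keep, ])

# Split into one data frame per remaining level
ship_embarked <- split(train_clean, train_clean$Embarked)
group_names <- names(ship_embarked)
```

Without droplevels, split would still return an empty "" group, because the level stays attached to the factor even after its rows are gone.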

Method in [R] for arrays of data frames

I am looking for a best practice to store multiple vector results of an evaluation performed at several different values. Currently, my working code does this:
q <- 55
value <- c(0.95, 0.99, 0.995)
a <- rep(0,q) # Just initialize the vector
b <- rep(0,q) # Just initialize the vector
df <- list() # list that will hold one data frame per value
for(j in 1:length(value)){
  for(i in 1:q){
    a[i]<-rnorm(1, i, value[j]) # just as an example function
    b[i]<-rnorm(1, i, value[j]) # just as an example function
  }
  df[[j]] <- data.frame(a,b)
}
I am trying to find the best way to store individual a and b for each value level
To be able to iterate through the variable "value" later for graphing
To have the value of the variable "value" and/or a description of it available
I'm not exactly sure what you're trying to do, so let me know if this is what you're looking for.
q = 55
value <- c(sd95=0.95, sd99=0.99, sd995=0.995)
a = sapply(value, function(v) {
rnorm(q, 1:q, v)
})
In the code above, we avoid the inner loop by vectorizing. For example, rnorm(55, 1:55, 0.95) will give you 55 random normal deviates, the first drawn from a distribution with mean=1, the second from a distribution with mean=2, etc. Also, you don't need to initialize a.
sapply takes the place of the outer loop. It applies a function to each value in value and returns the three vectors of random draws as the columns of a matrix, a (wrap the call in as.data.frame if you need a data frame). I've added names to the values in value, and sapply uses those as the column names of a. (It would be more standard to make value a list, rather than a vector with named elements. You can do that with value <- list(sd95=0.95, sd99=0.99, sd995=0.995) and the code will otherwise run the same.)
You can also generate several result sets and store them in a list as follows:
q <- list(a=10, b=20)
value <- list(sd95=0.95, sd99=0.99, sd995=0.995)
df.list = sapply(q, function(i) {
  sapply(value, function(v) {
    rnorm(i, 1:i, v)
  })
})
This time we have two different values for q and we wrap the sapply code from above inside another call to sapply. The inner sapply does the same thing as before, but now it gets the value of q from the outer sapply (using the dummy variable i). We're creating two matrices, one called a and the other called b; a has 10 rows and b has 20 (due to the values we set in q). Because the two results differ in size, the outer sapply cannot simplify them into a single array, so both are stored in a list called df.list.
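To keep a and b together for each level of value, as the question asks, one option is a named list with one data frame per level. The sketch below is an assumption about the desired layout rather than part of the answer above. It uses lapply instead of sapply so that nothing gets simplified, and set.seed is only there to make the draws repeatable.

```r
set.seed(1)  # repeatable draws for the example

q <- 55
value <- c(sd95 = 0.95, sd99 = 0.99, sd995 = 0.995)

# One data frame per level; lapply keeps the names of 'value'
df_list <- lapply(value, function(v) {
  data.frame(
    a = rnorm(q, mean = 1:q, sd = v),
    b = rnorm(q, mean = 1:q, sd = v)
  )
})

# Iterate later by name; value[[nm]] recovers the numeric level
row_counts <- integer(0)
for (nm in names(df_list)) {
  row_counts[nm] <- nrow(df_list[[nm]])
}
```

The names double as a description of each level, and value[[nm]] gives the number back when it is needed for a plot title or legend.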

in R, fix an argument for use the lapply function

This post contains two questions. The first is somewhat related to the second.
First, suppose that I want to define a function that receives two arguments, a data frame and a variable (column), and I would like to compute some counts or statistics. To begin with, I have to determine the variable's position. For example, suppose that the first two rows of my df are
> df
person age rent
1 23 1000
2 35 1.500
and my function is like this
myfun <- function(df, var) {
  # determining the variable
  ind <- which(names(df) %in% var)
  # selecting the variable
  v <- df[, ind]
  # rest of function
  ....
}
I think there may be an easier way. Is there some way to determine v directly?
Second question: I have a large list of data frames (samples of one population). All data frames have the same variables, and one of these variables is the rent. I would like to calculate the mean of the rent variable for each sample using the lapply function. For one sample, I can use the following code
> mean(sample$rent , na.rm = T)
All that I want is to do something like this
> apply(list, mean( , variablefix = rent))
One option is to create a new mean function with the rent argument fixed, or with only one argument, and pass it to lapply:
>mean_rent <- function(df){...}
>lapply(df, mean_rent)
But I want a way to use the apply function directly in only one line.
Any ideas?
Question One: you can also use a name (i.e. a character string), or a variable containing the name, to index data.frames (and vectors, matrices etc.), so you just have to do:
myfun<- function(df, var) {
# select the column
v <- df[,var]
# rest of function
}
However, it is more common to define the function on a vector and then just call it with myfun(df[,var]).
Question Two: instead of assigning the new function to a variable, you can also pass it directly as an anonymous function, i.e.
lapply(list_of_dfs, function(df){ mean( df$rent ) })
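A runnable sketch of the one-line approach, using two made-up samples (the data and the names samples, s1 and s2 are assumptions). It also shows a variant that first extracts the rent column with `[[` and then passes na.rm straight through to mean, which is close to the "fix an argument" idea in the question.

```r
# Two hypothetical samples with the same columns
samples <- list(
  s1 = data.frame(person = 1:3, rent = c(1000, 1500, NA)),
  s2 = data.frame(person = 1:2, rent = c(800, 1200))
)

# Anonymous function, with NA values ignored
mean_rents <- sapply(samples, function(df) mean(df$rent, na.rm = TRUE))

# Variant: pull out the column first, then let lapply/sapply forward
# the extra argument na.rm = TRUE to mean()
mean_rents2 <- sapply(lapply(samples, `[[`, "rent"), mean, na.rm = TRUE)
```

sapply is used here only to get a named numeric vector back; lapply with the same arguments returns the same values as a list.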

What is the difference between colnames(x[1])<-"name" and colnames(x)[1]<-"name"?

I want to rename columns in a data.frame:
> x=data.frame(name=c("n1","n2"),sex=c("F","M"))
> colnames(x[1])="Name"
> x
name sex
1 n1 F
2 n2 M
> colnames(x)[1]="Name"
> x
Name sex
1 n1 F
2 n2 M
>
Why does colnames(x[1]) = "Name" not work, while colnames(x)[1]="Name" does?
What is the reason? What is the difference between them?
The too much information answer:
If you look at what each of the options "de-sugars" to:
# 1.
`[<-`(x, 1, value=`colnames<-`(x[1], 'Name'))
# 2.
`colnames<-`(x, `[<-`(colnames(x), 1, 'Name'))
The first option makes a new data.frame from just the first column, renames that column (successfully), and then tries to assign that data.frame back over the first column. [<-.data.frame will propagate the values, but it will not rename existing columns based on the names of value.
The second option gets the colnames of the data.frame, updates the first value, and creates a new data.frame with the updated names.
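The two de-sugared forms can be followed step by step. The sketch below spells out the intermediate objects; tmp, x1 and x2 are names introduced only for this illustration.

```r
x <- data.frame(name = c("n1", "n2"), sex = c("F", "M"))

# Option 1, spelled out: rename a one-column copy, then write it back
tmp <- x[1]
colnames(tmp) <- "Name"   # the copy is renamed successfully
x1 <- x
x1[1] <- tmp              # values are copied, the column name in x1 is kept

# Option 2: modify the names vector of the data frame itself
x2 <- x
colnames(x2)[1] <- "Name"

names_option1 <- colnames(x1)
names_option2 <- colnames(x2)
```

After running this, tmp carries the new name but x1 does not, while x2 does, which matches the behaviour seen in the question.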
(Answer to @Peng Peng's question here because I can't figure out how to get backtick quoting to work in a comment...)
The backtick is to quote the variable name. Consider the difference here:
x<-1
`x<-`<-1
The first assigns 1 to a variable called x, but the second assigns to a variable called x<-. These unusual variable names are actually used by the <- primitive function: you are allowed arbitrary function calls on the lhs of an assignment, and a function with <- appended to the name specifies how to perform the update (similar to setf in lisp).
Because you want to modify the column names attribute of x, a data.frame. Hence
colnames(x) <- ....
is correct, whether or not you assign one or more at the same time.
