R: creating variable tables from a list of variable names

I am currently trying to create a table from a list of variable names (something I feel should be relatively simple) and I can't, for the life of me, figure out how to do it correctly.
I have a data table that I've named 'file', and there is a list of 3 variable names within this file. What I want to do is create a table of each variable and then rbind them together. For further context, these few lines of code will be worked into a much larger function, so the list of variable names must be able to accommodate however many variables the user defines.
I have tried the following:
file<-as.data.table(dt)
variable_list<-list("outcome", "type")
for (variable in variable_list){
  var_table<-as.data.table(table(file$variable_list))
  na_table<-as.data.table(table(is.na(file$variable)))
}
When I run the above code, R returns empty tables of var_table and na_table. What am I doing wrong?

An option is to loop over the 'variable_list', extract the column, and apply `table` and `rbind` within `do.call`:
do.call(rbind, lapply(variable_list, function(nm) table(file[[nm]])))
NOTE: assuming that the levels of the columns are the same
If the levels are not the same, make them the same by converting the columns to factor with the levels specified:
lvls <- na.omit(sort(unique(unlist(file[, unlist(variable_list), with = FALSE]))))
do.call(rbind, lapply(variable_list, function(nm)
  table(factor(file[[nm]], levels = lvls))))
Or if we have a data.table, use the data.table methods:
rbindlist(lapply(variable_list, function(nm) file[, .N, by = c(nm)]), fill = TRUE)
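To see both approaches end to end, here is a minimal, hypothetical reproduction (the dt/file/variable_list names follow the question; the column values are invented):
library(data.table)
# Invented stand-in for the question's 'dt'
dt <- data.frame(outcome = c("a", "b", "a", NA),
                 type = c("a", "a", "b", "b"))
file <- as.data.table(dt)
variable_list <- list("outcome", "type")
# Base method: one row of counts per variable (assumes shared levels)
do.call(rbind, lapply(variable_list, function(nm) table(file[[nm]])))
# data.table method: long table of counts, one row per variable/value pair
rbindlist(lapply(variable_list, function(nm) file[, .N, by = c(nm)]), fill = TRUE)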

The problem (at least one of the problems) might be that you are attempting to use the $ operator incorrectly. You cannot substitute text values into the second argument. You can use its syntactic equivalent [[ instead of $, however. So this would be a possible improvement. (I've not tested it since you provided no test material.)
file<-as.data.table(dt)
variable_list<-list("outcome", "type")
for (variable in variable_list){
  var_table<-as.data.table(table(file[[variable]])) # clearly not variable_list
  na_table<-as.data.table(table(is.na(file[[variable]])))
}
I'm guessing you might have done something like, ...
var_table <- file[, table(variable)]
... since data.table syntax evaluates expressions in the environment of the data.table itself (which in this case is confusingly named "file"). It's better not to use such names, since there's also an R function by that name.
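A quick way to see the difference between $ and [[ (a hypothetical one-column example):
library(data.table)
dt <- data.table(outcome = c(1, 2, 2))
variable <- "outcome"
dt$variable    # NULL: looks for a column literally named "variable"
dt[[variable]] # 1 2 2: uses the value stored in `variable`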

Related

Problems with renaming columns via variables in R

I'm having issues with a specific problem. I have a dataset of a ton of matrices that all have V1 as their column names, essentially NULL. I'm trying to write a loop to replace all of these with column names from a list, but I'm running into some issues.
To break this down to the most simple form, this code isn't functioning as I'd expect it to.
nameofmatrix <- paste('column_', i, sep = "")
colnames(eval(as.name(nameofmatrix))) <- c("test")
I would expect this to take the object named column_1, for example, and (in the 2nd line) set "test" as its column name.
I tried to break this down smaller. For example, if I run print(eval(as.name(nameofmatrix))) I get the object's columns/rows printed as expected, and if I run print(colnames(eval(as.name(nameofmatrix)))) I get NULL as expected for the column header (since it was set as V1).
I've even tried to manually type in the column name, such as colnames(column_1) <- c("test"), and this successfully works to rename the column. But once the variable is put in the literal name's place as shown above, it does not work the same. I'm having difficulty finding a solution for renaming several matrix columns after they have been created this way. Does anyone have any advice or suggestions?
Note: the error I receive when trying to run this is
Error in eval(as.name(nameofmatrix)) <- `vtmp` : could not find function "eval<-"
We could return the values of the objects in a list with `get` (if there are multiple objects, use `mget`), then rename the objects in the list and update those objects in the global env with `list2env`:
list2env(lapply(mget(nameofmatrix), function(x) {
  colnames(x) <- newnames
  x
}), .GlobalEnv)
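A self-contained sketch of that idea, using invented matrices whose names match the question's paste0 pattern:
# Invented example objects and new column names
column_1 <- matrix(1:4, nrow = 2)
column_2 <- matrix(5:8, nrow = 2)
newnames <- c("test", "test2")
nameofmatrix <- paste0("column_", 1:2)
list2env(lapply(mget(nameofmatrix), function(x) {
  colnames(x) <- newnames
  x
}), .GlobalEnv)
colnames(column_1)
#[1] "test"  "test2"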
It can also be done with assign
data(mtcars)
nameofobject <- 'mtcars'
assign(nameofobject, `colnames<-`(get(nameofobject),
                                  c('mpg1', names(mtcars)[-1])))
Now, check the names of 'mtcars'
names(mtcars)[1]
#[1] "mpg1"

R - Unexpected behavior when typechecking columns of tidyverse data frame

I'm working with some data that has hundreds of covariates, so I decided to write some functions to make pre-processing much faster and cleaner (like scaling certain numeric variables). An important part of all of these functions is type-checking the columns before I apply a particular function to them.
Here is my function for scaling continuous columns:
# rm (vector): names of columns not to be scaled
scale.continuous <- function(df, rm = NULL) {
  cols <- setdiff(colnames(df), rm)
  for (col in cols) {
    if (is.numeric(df[, col])) {
      df[, col] <- as.numeric(scale(df[, col]))
    }
  }
  df
}
This works perfectly fine if I load the data frame using read.csv(), but the data I have is huge so the speed boost of using read_csv() from readr/tidyverse is significant. Unfortunately, if I load my data using read_csv() all of my functions break.
I narrowed down the issue to the type-checking, specifically when type-checking a column I am accessing by a string of its column name. Here's some code to demonstrate what I mean:
# When using read.csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] TRUE
# When using read_csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] FALSE
I realized the issue here was that indexing the data frame with a string the way I do above returns a tibble rather than the plain vector that other methods of indexing return. What I don't understand is why this behavior exists, why is.numeric() (or any type-check) does not work on a one-column tibble, and in general why there is this difference in the way the default and tidyverse data frames are constructed. Also, it would be nice to know if there is a parameter I can change in read_csv() that will make the behavior of this type of indexing the same as with a default data frame.
I should mention, I realize there are probably better ways of writing this code (for example, just using df$"col" to index fixes the issue), but I still don't understand what the root of the issue was with my first approach. I am now working with much larger data sets that require much more involved pre-processing than what I have been used to in the past so I want to have as complete an understanding of the data structures I am using as possible.
Tibbles have a slightly different default behaviour from regular data frames when using the `[` extraction function, which can be a bit of a gotcha. Specifically, df[,"col"] on a tibble will return a one-column tibble, whereas on a regular data frame it will return a vector. So you need to use:
df[["col"]]
Or explicitly state that you want to coerce to the lowest dimension and do:
df[, "col", drop = TRUE]
From the documentation:
df[, j] returns a tibble; it does not automatically extract the column
inside. df[, j, drop = FALSE] is the default.
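Applied to the question's function, a sketch (untested against the asker's data) that uses [[ so the same type-check works on both data.frames and tibbles:
# rm (vector): names of columns not to be scaled
scale.continuous <- function(df, rm = NULL) {
  cols <- setdiff(colnames(df), rm)
  for (col in cols) {
    if (is.numeric(df[[col]])) {   # [[ returns a plain vector for both classes
      df[[col]] <- as.numeric(scale(df[[col]]))
    }
  }
  df
}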

R Generic References to Data Frames and Variables

I would like to know how to make a reference to a data frame and variable generic, please. Say I have a data frame named 's' and a variable in that data frame named 'Y'.
Regular R code:
look = s$Y
What I would like to do:
data = s
variable = Y
look = data$variable (which functions the same as look = s$Y)
Any thoughts? The reason I would like to do this is that I have s$Y throughout my code, and later I may want to change s for t (or Y for some other variable), and don't want to have to go through all of my code manually replacing s$Y with t$Y where I need it changed.
Thanks!
This is the reason that the $ operator is considered poor practice inside function definitions, i.e. it "locks you in" to a particular spelling of a column name. You are not going to do this, however:
variable = Y
Rather you are going to do this:
variable = "Y"
And that is because the first version would have caused the R interpreter to go out and try to identify a value for the symbol Y somewhere in what is known as its "search path", which is, roughly speaking, all the functions and values that have been called and are still being processed since the code was started. In the case of the second version, "Y" is its own value and no further searching is needed. With that fundamental confusion corrected, you would now do this:
look <- data[[variable]] # although using 'data' as a name is another poor practice
Whereupon R will look for a value of variable, find it in the global environment, and return the character "Y", delivering a column named "Y" from the dataset s. Column names are not considered first-class objects in R, whereas named dataframes are. The "names" of columns are not true R names (even though they are called colnames). The $ operator is just shorthand for `[[` with a character value. Here's a full transcript to test this:
> s <- data.frame(Y=1:10, X=LETTERS[1:10]); data = s
>
> variable <- "Y"
>
> look1 <- data$Y; look2 <- data[["Y"]]
> identical(look1, look2)
[1] TRUE
The confusion that this "non-standard evaluation" (NSE) shorthand feature of R has caused new users appears to be one of the motivations for the creation of, first, the ggplot aes function and, later, the dplyr package and the tidyverse bundle of packages. Those packages allow the use of non-quoted names or tokens to refer to column identities.
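For instance, a hedged sketch of that tidyverse style (assuming dplyr is installed), where all_of() and the .data pronoun let a character variable stand in for a column name:
library(dplyr)
s <- data.frame(Y = 1:10, X = LETTERS[1:10])
variable <- "Y"
select(s, all_of(variable))       # one-column data frame
filter(s, .data[[variable]] > 5)  # rows where s$Y > 5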
In addition to #42-'s answer, you can dynamically reference columns like this:
colName <- "something"
myDataFrame[, colName]
Edit: Since you also asked about dynamically referencing data.frames: @Rich Scriven suggested making a function that takes the data.frame as an argument, which is one working solution. You can also just load the data you need at the top of your script, which is easy to change on the fly if you need:
fileName <- "file1.csv"
data <- read.table(fileName, header = TRUE, stringsAsFactors = FALSE)
As per 42-'s answer above, the best choice seems to be the packages referenced. Using a function is close, but doesn't seem to allow 'data' and 'variable' to be generic in 'data$variable'.
Thanks everyone!

R: dynamically create subsets of a data.table

I would like to create dynamically subsets of my data.table based on the values in some columns.
In my data.table, I have the following variables: owner, 2G, 3G, 4G.
2G, 3G, 4G are binary.
I want to create three subsets: one where 2G==1, one where 3G==1, one where 4G==1.
Example:
a=c("Paul",1,1,0)
b=c("George",1,0,0)
x=cbind(a,b)
colnames(x)=c("Owner","2G","3G","4G")
Here is my code:
all_names_df=c()
for(value in 2:4){
  techno=paste0(value,"G")
  name=paste0("arcep",techno)
  all_df=c(all_names_df,name)
  df=arcep[techno==1]
  assign(name,df)
}
My new data.tables are created, but empty. I have tried several things (with eval, quote, changing the syntax, etc.) but I fail to reference the column properly.
EDIT:
I have tried something else, but it also fails:
techno=c("2G","3G","4G")
for(value in techno){
  index=grep(value,colnames(arcep))
  print(index)
  set1=subset(arcep,arcep[,index]==1)
  print(dim(set1))
  assign(set1,paste0("ARCEP_",value))
}
Error in `[.data.table`(arcep, , index) :
j (the 2nd argument inside [...]) is a single symbol but column name 'index' is not found. Perhaps you intended DT[,..index] or DT[,index,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.
Why does it say "column name 'index' is not found"? Why isn't the value of "index" taken into account? Using eval on index doesn't change anything.
Vectors and matrices cannot contain both numbers and characters; in your case the numbers are converted to characters.
This will work better to define your table, though column names can't start with a number in a data.frame:
x <- data.frame(Owner = c("Paul","George"),
                G2 = c(1,1),
                G3 = c(1,0),
                G4 = c(0,0),
                stringsAsFactors = FALSE)
Then here's your subset:
subset(x,G3 == 1)
(also, you've used cbind instead of rbind in your question, you may want to edit it)
I finally found the answer:
for(value in techno){
  set1 <- subset(arcep, arcep[, get(colnames(arcep)[grep(value, colnames(arcep))])] == 1)
  assign(paste0("ARCEP_", value), set1)
}
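An alternative worth noting (my suggestion, not from the thread): keep the subsets in a named list instead of assigning into the global environment, using data.table's get() to look each column up by name:
# Assumes 'arcep' is a data.table with binary columns 2G, 3G, 4G
techno <- c("2G", "3G", "4G")
subsets <- lapply(techno, function(g) arcep[get(g) == 1])
names(subsets) <- paste0("ARCEP_", techno)
subsets[["ARCEP_2G"]] # same rows as the assigned ARCEP_2G above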

Move variables in global environment to a dataframe

How do I save multiple variables from the global environment into a dataframe in R? I can do it one by one using $, but is there a command that moves all variables at once?
You can do:
do.call(data.frame, lapply(ls(), get))
But let me just say this sounds like a horrible idea.
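If you do want the variable names preserved as column names, a variant using mget (the same caveat applies):
# mget() returns a named list, so the columns keep the variable names
do.call(data.frame, mget(ls()))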
If there is a named list of objects, then:
cbind(dfrm, list.obj)
E.g.
dados <- data.frame(x=1:10, v1=rnorm(10), v2=rnorm(10))
dat1 <- rnorm(10)
dat2 <- letters[1:5]
cbind(dados, list(new1=dat1, new2=dat2))
Can also use this form:
cbind(dados, new1=dat1, new2=dat2)
If you had a bunch of variables in the global environment with the character string "zzz" in their names and you wanted to append them columnwise to an existing dataframe, you could use this:
dados[, ls(patt="zzz") ] <- sapply( ls(patt="zzz"), get)
You might notice that I was objecting to using data.frame(cbind(...)) and yet I am saying to use cbind. Well, I am assuming that the first argument in my case is already a dataframe, so that the default cbind does not get called, but rather `cbind.data.frame` gets called, and it will allow the usual list-behavior of data.frame objects to persist. If the first argument is only a vector, then `.Internal(cbind(...))` will do coercion to a matrix, which will remove attributes of all the arguments and coerce them into a common storage mode.
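A small invented demo of that pattern, with two global vectors deliberately named to match the "zzz" pattern:
zzz_a <- rnorm(3)
zzz_b <- rnorm(3)
dados <- data.frame(x = 1:3)
dados[, ls(patt = "zzz")] <- sapply(ls(patt = "zzz"), get)
names(dados)
#[1] "x"     "zzz_a" "zzz_b"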
