Is it possible to create new variable names on the fly?
I'd like to read data frames from a list into new variables with numbers at the end. Something like orca1, orca2, orca3...
If I try something like
paste("orca",i,sep="")=list_name[[i]]
I get this error
target of assignment expands to non-language object
Is there another way around this?
Use assign:
assign(paste("orca", i, sep = ""), list_name[[i]])
It seems to me that you might be better off with a list rather than using orca1, orca2, etc.; then it would be orca[[1]], orca[[2]], ...
Usually you're making a list of variables differentiated by nothing but a number because that number would be a convenient way to access them later.
orca <- list()
orca[[1]] <- "Hi"
orca[[2]] <- 59
Otherwise, assign is just what you want.
Don't make data frames. Keep the list, name its elements but do not attach it.
The biggest reason for this is that if you create variables on the fly, you will almost always have to iterate through each of them later to do something useful, and at that point you will again be forced to iterate through each of the names you created.
It is far easier to name the elements of the list and iterate through the names.
As far as attach is concerned, it's really bad programming practice in R and can lead to a lot of trouble if you are not careful.
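As a sketch of that approach (using head()s of built-in data sets as stand-ins for real data frames):

```r
# Keep a named list instead of separate orca1, orca2, ... variables.
orca <- setNames(
  list(head(iris), head(swiss), head(airquality)),
  paste0("orca", 1:3)
)

# Then iterate through the names to do something useful with each:
for (nm in names(orca)) {
  cat(nm, "has", ncol(orca[[nm]]), "columns\n")
}
```

Access is then orca[["orca2"]] (or orca$orca2), with no need for get() or attach().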
FAQ says:
If you have
varname <- c("a", "b", "d")
you can do
get(varname[1]) + 2
for
a + 2
or
assign(varname[1], 2 + 2)
for
a <- 2 + 2
So it looks like you use get() when you want to evaluate an expression that refers to a variable by its constructed name, and assign() when you want to assign a value to a variable named by a character string.
Syntax for assign:
assign(x, value)
x: a variable name, given as a character string. No coercion is done, and the first element of a character vector of length greater than one will be used, with a warning.
value: value to be assigned to x.
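Putting assign() and get() together, a minimal sketch of the loop the question describes (the data frames here are stubbed with head()s of built-in data sets):

```r
list_name <- list(head(iris), head(swiss), head(airquality))

# Create orca1, orca2, orca3 on the fly:
for (i in seq_along(list_name)) {
  assign(paste0("orca", i), list_name[[i]])
}

# Retrieve one of them again by constructed name:
nrow(get("orca2"))

# Or fetch all of them back into a named list in one call:
mget(paste0("orca", seq_along(list_name)))
```

Note that mget() essentially rebuilds the list you started from, which is another hint that keeping the list in the first place is usually simpler.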
Another (tricky) solution is to name the elements of the list and attach it:
list_name = list(
  head(iris),
  head(swiss),
  head(airquality)
)
names(list_name) <- paste("orca", seq_along(list_name), sep="")
attach(list_name)
orca1
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
And this option?
list_name <- list()
for (i in 1:100) {
  list_name[[i]] <- paste("orca", i, sep = "")
}
It works perfectly. In the example you posted, the first line (initializing list_name) is missing, which is why it gives you the error message.
Related
I have a question about using distinct() from dplyr on a tibble/data.frame. From the documentation it is clear that you can use it by explicitly naming the columns. I have a data frame with >100 columns and want to use the function on just a subset. My intuition was to put the column names in a vector and pass it as an argument to distinct(). But distinct() uses only the first vector element.
Example on iris:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct_(iris, exclude.columns)
This is different from
exclude.columns <- c('Sepal.Width', 'Species')
distinct_(iris, exclude.columns)
I think distinct() is not made for this operation. Another option would be to subset the data.frame, then use distinct() and join again with the excluded columns. But my question is whether there is an option using just one function?
As suggested in my comment, you could also try:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct(iris, !!! syms(exclude.columns))
Output (first 10 rows):
Sepal.Width Species
1 3.5 setosa
2 3.0 setosa
3 3.2 setosa
4 3.1 setosa
5 3.6 setosa
6 3.9 setosa
7 3.4 setosa
8 2.9 setosa
9 3.7 setosa
10 4.0 setosa
However, that was suggested more than two years ago. With the latest dplyr, the more idiomatic usage would be:
distinct(iris, across(all_of(exclude.columns)))
It is not entirely clear to me whether you would like to keep only the exclude.columns or actually exclude them; if the latter, just put a minus in front, i.e. distinct(iris, across(-all_of(exclude.columns))).
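A sketch of both readings, using the iris example from the question:

```r
library(dplyr)

exclude.columns <- c("Species", "Sepal.Width")

# Reading 1: distinct combinations of just those columns
# (only those columns appear in the result):
distinct(iris, across(all_of(exclude.columns)))

# Reading 2: distinct rows judged on every column EXCEPT those;
# add .keep_all = TRUE if you want the other columns retained too:
distinct(iris, across(-all_of(exclude.columns)), .keep_all = TRUE)
```

Without .keep_all = TRUE, distinct() returns only the columns it was asked to compare, which is easy to trip over when negating a selection.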
Your objective sounds unclear. Are you trying to get all distinct rows across all columns except $Species and $Sepal.Width? If so, that doesn't make sense.
Let's say two rows are the same in all other variables except for $Sepal.Width. Using distinct() in the way you described would throw out the second row because it was not distinct from the first. Except that it was in the column you ignored.
You need to rethink your objective and whether it makes sense.
If you are just worried about duplicate rows, then
data %>%
distinct(across(everything()))
will do the trick.
How does an external function inside dplyr::filter() know the columns just by their names, without referencing the data frame they come from?
For example consider the following code:
filter(hflights, Cancelled == 1, !is.na(DepDelay))
How does is.na know that DepDelay is from hflights? There could possibly have been a DepDelay vector defined elsewhere in my code. (Assuming that hflights has columns named 'Cancelled', 'DepDelay').
In Python we would have to use the column name along with the name of the dataframe. Therefore here I was expecting something like
!is.na(hflights$DepDelay)
Any help would be really appreciated.
While I'm not an expert enough to give a precise answer, hopefully I won't lead you too far astray.
It is essentially a question of environment. filter() first looks for an object of that name within the data frame given as its first argument. If it doesn't find it, it then goes "up a level", so to speak, to the global environment and looks for any other object of that name. Consider:
library(dplyr)
Species <- iris$Species
iris2 <- select(iris, -Species) # Remove the Species variable from the data frame.
filter(iris2, Species == "setosa")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3.0 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5.0 3.6 1.4 0.2
More information on the topic can be found here (warning, the book is a work in progress).
Most functions from the dplyr and tidyr packages are specifically designed to handle data frames, and all of those functions take the data frame as their first argument. This enables the pipe (%>%), which lets you build a more intuitive workflow. Think of the pipe as the equivalent of saying "... and then ...". In the context shown above, you could do:
iris %>%
select(-Species) %>%
filter(Species == "setosa")
And you get the same output as above. Combining the concept of the pipe and focusing the lexical scope of variables to the referenced data frames is meant to lead to more readable code for humans, which is one of the principles of the tidyverse set of packages, which both dplyr and tidyr are components of.
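To see the lookup order concretely, here is a small sketch with hypothetical data: when a column and a global vector share a name, the column in the data frame wins inside filter().

```r
library(dplyr)

# A global vector sharing a name with a column (hypothetical data):
DepDelay <- 999
df <- data.frame(Cancelled = c(1, 1, 0), DepDelay = c(NA, 5, 10))

# Inside filter(), the df column masks the global DepDelay,
# so this keeps rows where the COLUMN is non-NA and Cancelled == 1:
filter(df, Cancelled == 1, !is.na(DepDelay))
```

The global DepDelay is only consulted when no column of that name exists, as in the iris2/Species example above.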
I have a list of data frames, where every data frame is similar (has the same columns with the same names) but contains information on a different, related "thing" (say, species of flower). I need an elegant way to re-categorize one of the columns in all of these data frames from continuous to categorical using the function cut(). The problem is each "thing" (flower) has different cut-points and will use different labels.
I got as far as putting the cut-points and labels in a separate list. If we're following my fake example, it basically looks like this:
iris <- iris
peony <- iris #pretending that this is actually different data!
flowers <- list(iris = iris, peony = peony)
params <- list(iris_param = list(cutpoints = c(1, 4.5),
labels = c("low", "medium", "high")),
peony_param = list(cutpoints = c(1.5, 2.5, 5),
labels = c("too_low", "kinda_low", "okay", "just_right")))
#And we want to cut 'Sepal.Width' on both peony and iris
I am now really stuck. I have tried using some combinations of lapply() and do.call() but I'm kind of just guessing (and guessing wrong).
More generalized, I want to know: how can I use a changing set of arguments to apply a function over different data frames in a list?
I think this is a great time for a for loop. It's straightforward to write and clear:
for (petal in seq_along(flowers)) {
  flowers[[petal]]$Sepal.Width.Cut <- cut(
    x      = flowers[[petal]]$Sepal.Width,
    breaks = c(-Inf, params[[petal]]$cutpoints, Inf),
    labels = params[[petal]]$labels
  )
}
Note that (a) I had to augment your breaks to make cut happy about the length of the labels, (b) really I'm just iterating 1, 2. A more robust version would possibly iterate over the names of the list and as a safety check would require the params list to have the same names. Since the names of your lists were different, I just used the indexes.
This could probably be done using mapply. I see no advantage to that - unless you're already comfortable with mapply the only real difference will be that the mapply version will take you 10 times longer to write.
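A sketch of that more robust variant, assuming params is in the same order as flowers so its elements can simply be renamed to match:

```r
# Assumption: params elements line up with flowers elements,
# so we can give them matching names.
names(params) <- names(flowers)

for (nm in names(flowers)) {
  stopifnot(nm %in% names(params))  # the safety check mentioned above
  flowers[[nm]]$Sepal.Width.Cut <- cut(
    x      = flowers[[nm]]$Sepal.Width,
    breaks = c(-Inf, params[[nm]]$cutpoints, Inf),
    labels = params[[nm]]$labels
  )
}
```

Iterating over names instead of indexes means a reordered or partially missing params list fails loudly rather than silently mismatching cut-points to flowers.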
I like Gregor's solution, but I'd probably stack the data instead:
library(data.table)
# rearrange parameters
params0   = setNames(params, c("iris", "peony"))
my_params = c(list(.id = names(params0)), do.call(Map, c(list, params0)))
# stack
DT = rbindlist(flowers, idcol = TRUE)
# merge and make cuts
DT[my_params, Sepal.Width.Cut :=
    cut(Sepal.Width, breaks = c(-Inf, cutpoints[[1]], Inf), labels = labels[[1]])
  , on = ".id", by = .EACHI]
(I've borrowed Gregor's translation of the cutpoints.) The result is:
.id Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Width.Cut
1: iris 5.1 3.5 1.4 0.2 setosa kinda_low
2: iris 4.9 3.0 1.4 0.2 setosa kinda_low
3: iris 4.7 3.2 1.3 0.2 setosa kinda_low
4: iris 4.6 3.1 1.5 0.2 setosa kinda_low
5: iris 5.0 3.6 1.4 0.2 setosa kinda_low
---
296: peony 6.7 3.0 5.2 2.3 virginica okay
297: peony 6.3 2.5 5.0 1.9 virginica kinda_low
298: peony 6.5 3.0 5.2 2.0 virginica okay
299: peony 6.2 3.4 5.4 2.3 virginica okay
300: peony 5.9 3.0 5.1 1.8 virginica okay
I think stacked data usually make more sense than a list of data.frames. You don't need to use data.table to stack or make the cuts, but it's designed well for those tasks.
How it works.
I guess rbindlist is clear.
The code
DT[my_params, on = ".id"]
makes a merge. To see what that means, look at:
as.data.table(my_params)
# .id cutpoints labels
# 1: iris 1.0,4.5 low,medium,high
# 2: peony 1.5,2.5,5.0 too_low,kinda_low,okay,just_right
So, we're merging this table with DT by their common .id column.
When we do a merge like
DT[my_params, j, on = ".id", by=.EACHI]
this means
Do the merge, matching each row of my_params with related rows of DT.
Do j for each row of my_params, using columns found in either of the two tables.
j in this case is of the form column_for_DT := cut(...), which makes a new column in DT.
R's abbreviate() is useful for truncating, among other things, the column names of a data frame to a set length, with nice checks to ensure uniqueness, etc.:
abbreviate(names(dframe), minlength=2)
One could, of course, use this function to abbreviate the column names in-place and then print out the altered data frame
names(dframe) <- abbreviate(names(dframe), minlength=2)
dframe
But I would like to print out the data frame with abbreviated column names without altering the data frame in the process. Hopefully this can be done through a simple format option in the print() call, though my search through the help pages of print and format methods like print.data.frame didn't turn up any obvious solution (the available options seem more for formatting the column values, not their names).
So, does print() or format() have any options that call abbreviate() on the column names? If not, is there a way to apply abbreviate() to the column names of a data frame before passing it to print(), again without altering the passed data frame?
The more I think about it, the more I think that the only way would be to pass print() a copy of the data frame with already abbreviated column names. But this is not a solution for me, because I don't want to constantly be updating this copy as I update the original during an interactive session. The original column names must remain unaltered, because I use which(colnames(dframe)=="name_of_column") to interface with the data.
My ultimate goal is to work better remotely on the small screen of my mobile device when working in ssh apps like Server Auditor. If the the column names are abbreviated to only 2-3 characters I can still recognize them but can fit much more data on the screen. Perhaps there even are R packages that are better suited for condensed printing?
You could define your own print method
print.myDF <- function(x, abbr = TRUE, minlength = 2, ...) {
  if (abbr) {
    names(x) <- abbreviate(names(x), minlength = minlength)
  }
  print.data.frame(x, ...)
}
Then add the class myDF to the data and print
class(iris) <- c("myDF", class(iris))
head(iris, 3)
# S.L S.W P.L P.W Sp
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), abbr = FALSE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), minlength = 5)
# Spl.L Spl.W Ptl.L Ptl.W Specs
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
Just rewrite print.data.frame:
print.data.frame <- function(x, ...) {
  names(x) <- abbreviate(names(x), minlength = 2)
  base::print.data.frame(x, ...)
}
(You will probably want an auxiliary printfull.data.frame, copied from the original print.data.frame before you mask it, so that you can still print the full names.)