Abbreviate column names during data frame print - r

R's abbreviate() is useful for truncating, among other things, the column names of a data frame to a set length, with nice checks to ensure uniqueness, etc.:
abbreviate(names(dframe), minlength=2)
One could, of course, use this function to abbreviate the column names in-place and then print out the altered data frame:
names(dframe) <- abbreviate(names(dframe), minlength=2)
dframe
But I would like to print out the data frame with abbreviated column names without altering the data frame in the process. Hopefully this can be done through a simple format option in the print() call, though my search through the help pages of print and format methods like print.data.frame didn't turn up any obvious solution (the available options seem more for formatting the column values, not their names).
So, does print() or format() have any options that call abbreviate() on the column names? If not, is there a way to apply abbreviate() to the column names of a data frame before passing it to print(), again without altering the passed data frame?
The more I think about it, the more I think that the only way would be to pass print() a copy of the data frame with already abbreviated column names. But this is not a solution for me, because I don't want to constantly be updating this copy as I update the original during an interactive session. The original column names must remain unaltered, because I use which(colnames(dframe)=="name_of_column") to interface with the data.
My ultimate goal is to work better remotely on the small screen of my mobile device when working in ssh apps like Server Auditor. If the column names are abbreviated to only 2-3 characters I can still recognize them but can fit much more data on the screen. Perhaps there are even R packages that are better suited for condensed printing?

You could define your own print method:
print.myDF <- function(x, abbr = TRUE, minlength = 2, ...) {
  if (abbr) {
    names(x) <- abbreviate(names(x), minlength = minlength)
  }
  print.data.frame(x, ...)
}
Then add the class myDF to the data and print:
class(iris) <- c("myDF", class(iris))
head(iris, 3)
# S.L S.W P.L P.W Sp
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), abbr = FALSE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), minlength = 5)
# Spl.L Spl.W Ptl.L Ptl.W Specs
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
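If adding a class to the data frame is not wanted, the same idea can be sketched as a plain helper function. This is only an illustration: print_abbr is a hypothetical name, not part of base R, and it relies on R's copy-on-modify semantics, so renaming the columns inside the function leaves the caller's data frame untouched.

```r
# Hypothetical helper (not a base R function): prints a data frame with
# abbreviated column names. The rename only affects the local copy of df.
print_abbr <- function(df, minlength = 2, ...) {
  names(df) <- abbreviate(names(df), minlength = minlength)
  print(df, ...)
  invisible(NULL)
}

dframe <- head(datasets::iris, 3)
print_abbr(dframe)   # printed with short column names
names(dframe)        # the original names are still in place
```

Because nothing is stored, there is no abbreviated copy to keep in sync during an interactive session, and lookups such as which(colnames(dframe) == "name_of_column") keep working.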

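Just rewrite print.data.frame. First copy the original to an auxiliary printfull.data.frame, so that the new method can call it without recursing into itself and so that you can still print the full names:
printfull.data.frame <- base::print.data.frame
print.data.frame <- function(x, ...) {
  printfull.data.frame(setNames(x, abbreviate(names(x), minlength = 2)), ...)
  invisible(x)
}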

Related

Selecting columns by name and ignoring other column attributes such as "label" in R

Any tips on selecting columns by just column name when other column attributes are present?
When using dplyr::subset and selecting from a c(x, y, z) sort of list, which should work when given exact column names, I am not able to get columns that also contain "*,label" attributes.
This is new to me, something I've not seen or dealt with before.
I can use select and starts_with but it's not limited enough as a character search and is getting other columns with similar character strings.
This doesn't work, perhaps because of the "label" attribute?
subset(NHSDA_2001_W, select = c("YEAR", "IRPINC3"))
This works but yields far too many columns:
NHSDA_2001_W %>% select(starts_with("YEAR") | starts_with("IRPINC3"))
Is it possible to subset by name and ignore other column attributes?
It's a little unclear what the issue is, as labels do not prevent you from selecting by column name. Note: I used a different function to print the labels in the console.
library(tidyverse)
library(labelled)
df <- iris
var_label(df) <- list(Petal.Length = "Length of petal", Petal.Width = "Width of Petal")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 Length of petal Width of Petal
#2 5.1 3.5 1.4 0.2 setosa
#3 4.9 3 1.4 0.2 setosa
We can just use select as normal from dplyr:
df %>%
select(Petal.Length, Petal.Width) %>%
head()
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
Same for using subset (not part of dplyr):
subset(df, select = c(Petal.Length, Petal.Width))
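For exact-name selection from a character vector, a minimal base R sketch is shown here. It assumes the labels are stored as a "label" attribute on each column (which is how labelled data usually arrives); the attribute is set by hand so the example needs no extra packages, and wanted is just an illustrative variable name.

```r
# Attach "label" attributes by hand to mimic labelled columns.
df <- datasets::iris
attr(df$Petal.Length, "label") <- "Length of petal"
attr(df$Petal.Width, "label") <- "Width of Petal"

# Exact-name selection: only columns whose names match exactly are returned.
wanted <- c("Petal.Length", "Petal.Width")
sub <- df[wanted]

names(sub)                        # only the two requested columns
attr(sub$Petal.Length, "label")   # the label attribute is carried along
```

With dplyr, select(all_of(wanted)) should behave the same way, since all_of() matches names exactly rather than by prefix.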

Create several subsets at the same time

I have a dataset (insti) and I want to create 3 different subsets according to a factor (xarxa) with three levels (linkedin, instagram, twitter).
I used this:
linkedin <- subset(insti, insti$xarxa=="linkedin")
twitter <- subset(insti, insti$xarxa=="twitter")
instagram <- subset(insti, insti$xarxa=="instagram")
It does work, however, I was wondering if this can be done with tapply, so I tried:
tapply(insti, insti$xarxa, subset)
It gives this error:
Error in tapply(insti, insti$xarxa, subset) : arguments must have same length
I think that there might be some straightforward way to do this, but I cannot work it out. Can you help me with this without using loops?
Thanks a lot.
It's usually better to deal with data frames in a named list. This makes them easy to iterate over, and stops your global workspace being filled up with lots of different variables. The easiest way to get a named list is with split(insti, insti$xarxa).
If you really want the variables written directly to your global environment rather than kept in a list, you can do it in a single line:
list2env(split(insti, insti$xarxa), globalenv())
Example
Obviously, I don't have the insti data frame, since you did not supply any example data in your question, but we can demonstrate that the above solution works using the built-in iris data set.
First we can see that my global environment is empty:
ls()
#> character(0)
Now we get the iris data set, split it by species, and put the result in the global environment:
list2env(split(datasets::iris, datasets::iris$Species), globalenv())
#> <environment: R_GlobalEnv>
So now when we check the global environment's contents, we can see that we have three data frames: one for each Species:
ls()
#> [1] "setosa" "versicolor" "virginica"
head(setosa)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
And of course, we can also access versicolor and virginica in the same way.
Created on 2021-11-12 by the reprex package (v2.0.0)
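For comparison, here is a small sketch of the named-list approach recommended at the start of this answer, again using iris as a stand-in for insti. The variable names by_species, row_counts and mean_sepal are illustrative only.

```r
# split() returns a named list with one data frame per factor level.
by_species <- split(datasets::iris, datasets::iris$Species)
names(by_species)

# The same operation can then be applied to every subset without a loop.
row_counts <- vapply(by_species, nrow, integer(1))
mean_sepal <- sapply(by_species, function(d) mean(d$Sepal.Length))

# Individual subsets are still easy to reach by name.
head(by_species$setosa, 3)
```

Keeping the subsets in one list means nothing is written to the global environment, and adding a new level to the factor needs no new code.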

Elements of one list as arguments to a function acting on another list

I have a list of data frames, where every data frame is similar (has the same columns with the same names) but contains information on a different, related "thing" (say, species of flower). I need an elegant way to re-categorize one of the columns in all of these data frames from continuous to categorical using the function cut(). The problem is each "thing" (flower) has different cut-points and will use different labels.
I got as far as putting the cut-points and labels in a separate list. If we're following my fake example, it basically looks like this:
iris <- iris
peony <- iris #pretending that this is actually different data!
flowers <- list(iris = iris, peony = peony)
params <- list(iris_param = list(cutpoints = c(1, 4.5),
                                 labels = c("low", "medium", "high")),
               peony_param = list(cutpoints = c(1.5, 2.5, 5),
                                  labels = c("too_low", "kinda_low", "okay", "just_right")))
#And we want to cut 'Sepal.Width' on both peony and iris
I am now really stuck. I have tried using some combinations of lapply() and do.call() but I'm kind of just guessing (and guessing wrong).
More generalized, I want to know: how can I use a changing set of arguments to apply a function over different data frames in a list?
I think this is a great time for a for loop. It's straightforward to write and clear:
for (petal in seq_along(flowers)) {
  flowers[[petal]]$Sepal.Width.Cut = cut(
    x = flowers[[petal]]$Sepal.Width,
    breaks = c(-Inf, params[[petal]]$cutpoints, Inf),
    labels = params[[petal]]$labels
  )
}
Note that (a) I had to augment your breaks to make cut happy about the length of the labels, (b) really I'm just iterating 1, 2. A more robust version would possibly iterate over the names of the list and as a safety check would require the params list to have the same names. Since the names of your lists were different, I just used the indexes.
This could probably be done using mapply. I see no advantage to that - unless you're already comfortable with mapply the only real difference will be that the mapply version will take you 10 times longer to write.
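For readers who do want the mapply route, a minimal sketch using Map() (a thin wrapper around mapply) might look like this. It rebuilds the question's toy data so it runs on its own, pairs the two lists by position exactly as the loop does, and stores the result in a new list, flowers_cut, which is an illustrative name.

```r
# Toy data from the question: peony is just a copy of iris.
flowers <- list(iris = datasets::iris, peony = datasets::iris)
params <- list(
  iris_param  = list(cutpoints = c(1, 4.5),
                     labels = c("low", "medium", "high")),
  peony_param = list(cutpoints = c(1.5, 2.5, 5),
                     labels = c("too_low", "kinda_low", "okay", "just_right"))
)

# Map() walks both lists in parallel: the i-th data frame is paired with
# the i-th parameter set (matched by position, not by name).
flowers_cut <- Map(function(df, p) {
  df$Sepal.Width.Cut <- cut(df$Sepal.Width,
                            breaks = c(-Inf, p$cutpoints, Inf),
                            labels = p$labels)
  df
}, flowers, params)

table(flowers_cut$peony$Sepal.Width.Cut)
```

The result keeps the names of the first list (iris, peony), and the original flowers list is left unchanged.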
I like Gregor's solution, but I'd probably stack the data instead:
library(data.table)
# rearrange parameters
params0 = setNames(params, c("iris", "peony"))
my_params = c(list(.id = names(params0)), do.call(Map, c(list, params0)))
# stack
DT = rbindlist(flowers, idcol = TRUE)
# merge and make cuts
DT[my_params, Sepal.Width.Cut :=
cut(Sepal.Width, breaks = c(-Inf,cutpoints[[1]],Inf), labels = labels[[1]])
, on=".id", by=.EACHI]
(I've borrowed Gregor's translation of the cutpoints.) The result is:
.id Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Width.Cut
1: iris 5.1 3.5 1.4 0.2 setosa kinda_low
2: iris 4.9 3.0 1.4 0.2 setosa kinda_low
3: iris 4.7 3.2 1.3 0.2 setosa kinda_low
4: iris 4.6 3.1 1.5 0.2 setosa kinda_low
5: iris 5.0 3.6 1.4 0.2 setosa kinda_low
---
296: peony 6.7 3.0 5.2 2.3 virginica okay
297: peony 6.3 2.5 5.0 1.9 virginica kinda_low
298: peony 6.5 3.0 5.2 2.0 virginica okay
299: peony 6.2 3.4 5.4 2.3 virginica okay
300: peony 5.9 3.0 5.1 1.8 virginica okay
I think stacked data usually make more sense than a list of data.frames. You don't need to use data.table to stack or make the cuts, but it's designed well for those tasks.
How it works.
I guess rbindlist is clear.
The code
DT[my_params, on = ".id"]
makes a merge. To see what that means, look at:
as.data.table(my_params)
# .id cutpoints labels
# 1: iris 1.0,4.5 low,medium,high
# 2: peony 1.5,2.5,5.0 too_low,kinda_low,okay,just_right
So, we're merging this table with DT by their common .id column.
When we do a merge like
DT[my_params, j, on = ".id", by=.EACHI]
this means:
1. Do the merge, matching each row of my_params with related rows of DT.
2. Do j for each row of my_params, using columns found in either of the two tables.
j in this case is of the form column_for_DT := cut(...), which makes a new column in DT.

R: new variable in the for loop [duplicate]

Is it possible to create new variable names on the fly?
I'd like to read data frames from a list into new variables with numbers at the end. Something like orca1, orca2, orca3...
If I try something like
paste("orca",i,sep="")=list_name[[i]]
I get this error
target of assignment expands to non-language object
Is there another way around this?
Use assign:
assign(paste("orca", i, sep = ""), list_name[[i]])
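As a rough sketch of how that call would sit inside the loop from the question (the contents of list_name here are made up for illustration):

```r
# Example list of data frames; the contents are placeholders.
list_name <- list(head(iris), head(swiss), head(airquality))

# Create orca1, orca2, orca3 in the current environment.
for (i in seq_along(list_name)) {
  assign(paste("orca", i, sep = ""), list_name[[i]])
}

exists("orca3")   # the variable now exists
```

By default assign() writes into the environment it is called from, so at the top level of a script the variables end up in the global environment.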
It seems to me that you might be better off with a list rather than using orca1, orca2, etc. Then it would be orca[[1]], orca[[2]], ...
Usually you're making a list of variables differentiated by nothing but a number because that number would be a convenient way to access them later.
orca <- list()
orca[1] <- "Hi"
orca[2] <- 59
Otherwise, assign is just what you want.
Don't make data frames. Keep the list, name its elements but do not attach it.
The biggest reason for this is that if you make variables on the go, almost always you will later on have to iterate through each one of them to perform something useful. There you will again be forced to iterate through each one of the names that you have created on the fly.
It is far easier to name the elements of the list and iterate through the names.
As far as attach is concerned, it's really bad programming practice in R and can lead to a lot of trouble if you are not careful.
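A small sketch of what "name the elements and iterate through the names" can look like in practice. The data frames are placeholders, and n_cols is an illustrative name.

```r
# Placeholder list of data frames.
list_name <- list(head(iris), head(swiss), head(airquality))

# Give each element a name instead of creating separate variables.
names(list_name) <- paste0("orca", seq_along(list_name))

# Iterate through the names; each data frame is reached via list_name[[nm]].
for (nm in names(list_name)) {
  cat(nm, ":", nrow(list_name[[nm]]), "rows,",
      ncol(list_name[[nm]]), "columns\n")
}

# Or apply a function to every element at once, keeping the names.
n_cols <- vapply(list_name, ncol, integer(1))
```

A single element is still available as list_name$orca1 or list_name[["orca1"]], without attach and without extra variables in the workspace.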
FAQ says:
If you have
varname <- c("a", "b", "d")
you can do
get(varname[1]) + 2
for
a + 2
or
assign(varname[1], 2 + 2)
for
a <- 2 + 2
So it looks like you use get() when you want to evaluate an expression that uses a variable whose name is held in a character string, and assign() when you want to assign a value to a variable whose name is held in a character string.
Syntax for assign:
assign(x, value)
x: a variable name, given as a character string. No coercion is done, and the first element of a character vector of length greater than one will be used, with a warning.
value: value to be assigned to x.
Another tricky solution is to name elements of list and attach it:
list_name = list(
head(iris),
head(swiss),
head(airquality)
)
names(list_name) <- paste("orca", seq_along(list_name), sep="")
attach(list_name)
orca1
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
And this option?
list_name<-list()
for(i in 1:100){
  paste("orca",i,sep="")->list_name[[i]]
}
It works perfectly. In the example you posted, the first line is missing, and that is what gives you the error message.

Resources