I've got a dataset that is updated once in a while. I want to run an automatic analysis of that dataset, so I've made an R script. The problem is that with every update the names of the columns change, but their order stays the same. I want to rename the columns regardless of what names they have this time. I wanted to use rename() from dplyr, but it requires the old names of the columns. I tried something like this:
dataset %<>% rename('new.name1'=.[[1]], 'new.name2'=.[[2]], 'new.name3'=.[[3]])
but it didn't work. So how can I use the column number in place of the old name in rename()? Or what other function can I use to get this done?
A full example, following up on my comment:
dataset <- ...
new_names <- c("new_name_1", "new_name_2", ...)
dataset <- dataset %>% set_names(new_names)
If you only want to replace some of the old names, use something like this:
dataset <- ...
mtch <- c("old_name_2" = "new_name_2", ...)
new_names <- names(dataset)
new_names[match(names(mtch), new_names)] <- as.character(mtch)  # replace only the matched names
dataset <- dataset %>% set_names(new_names)
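As a concrete illustration of the partial-rename pattern (iris stands in for the real dataset, and the old/new names are invented for the example):

```r
library(magrittr)  # provides %>% and set_names

dataset <- iris  # stand-in for the real dataset
mtch <- c("Sepal.Length" = "sepal_len", "Species" = "species")

new_names <- names(dataset)
# match() finds the positions of the old names, so only those entries change
new_names[match(names(mtch), new_names)] <- as.character(mtch)
dataset <- dataset %>% set_names(new_names)

names(dataset)
#> [1] "sepal_len"    "Sepal.Width"  "Petal.Length" "Petal.Width"  "species"
```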
Probably too late to be of use to the OP, but for new readers stumbling here: you can use rename() in the usual way with column numbers as the old names. Just drop the .[[i]]s and use the column number by itself:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>%
rename("NEW_NAME" = 1) %>%
head
#> NEW_NAME Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
Created on 2022-03-02 by the reprex package (v2.0.1)
As with @David's second solution, this lets you rename only specific columns, though in a somewhat more straightforward way IMO.
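For instance, several columns can be renamed by position in a single call (the new names here are arbitrary):

```r
library(dplyr)

iris %>%
  rename(sepal_len = 1, petal_len = 3) %>%
  names()
#> [1] "sepal_len"   "Sepal.Width" "petal_len"   "Petal.Width" "Species"
```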
You can use setNames:
dataset <- setNames(dataset, paste0('new.name', seq_along(dataset)))
I have a dataset (insti) and I want to create 3 different subsets according to a factor (xarxa) with three levels (linkedin, instagram, twitter).
I used this:
linkedin <- subset(insti, insti$xarxa=="linkedin")
twitter <- subset(insti, insti$xarxa=="twitter")
instagram <- subset(insti, insti$xarxa=="instagram")
It does work; however, I was wondering if this can be done with tapply, so I tried:
tapply(insti, insti$xarxa, subset)
It gives this error:
Error in tapply(insti, insti$xarxa, subset) : arguments must have same length
I think there might be some straightforward way to do this, but I cannot work it out. Can you help me with this without using loops?
Thanks a lot.
It's usually better to deal with data frames in a named list. This makes them easy to iterate over, and stops your global workspace being filled up with lots of different variables. The easiest way to get a named list is with split(insti, insti$xarxa).
If you really want the variables written directly to your global environment rather than in a list with a single line, you can do
list2env(split(insti, insti$xarxa), globalenv())
Example
Obviously, I don't have the insti data frame, since you did not supply any example data in your question, but we can demonstrate that the above solution works using the built-in iris data set.
First we can see that my global environment is empty:
ls()
#> character(0)
Now we get the iris data set, split it by species, and put the result in the global environment:
list2env(split(datasets::iris, datasets::iris$Species), globalenv())
#> <environment: R_GlobalEnv>
So now when we check the global environment's contents, we can see that we have three data frames: one for each Species:
ls()
#> [1] "setosa" "versicolor" "virginica"
head(setosa)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
And of course, we can also access versicolor and virginica in the same way
Created on 2021-11-12 by the reprex package (v2.0.0)
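The "easy to iterate over" point above can be shown on the same split: applying a function to every data frame in the named list yields a result that is itself named by species.

```r
by_species <- split(datasets::iris, datasets::iris$Species)

# Mean sepal length for each species, with names carried over from the list
sapply(by_species, function(d) mean(d$Sepal.Length))
#>     setosa versicolor  virginica
#>      5.006      5.936      6.588
```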
I am trying to get multiple summary() outputs from a data frame. I want to subset it according to some characteristics multiple times, get the summary() of a certain variable for each slice, and put all the summary() outputs together in either a data frame or a list.
Ideally I would like to use the building_id by which I slice the data as the name for that row of summary() output, so I thought of using a for loop.
The data are fairly large (about 20 million rows); I am using the train and building_metadata data frames, joined into one, from the ASHRAE energy prediction competition on Kaggle.
I have created a tibble which holds the building ids I want to subset by. I want the summary() of the variable "energy_sqm" (which I have already created), so I am trying to put this slicing in a for loop:
Warning 1: My building_id tibble has values like 50, 67, 778, 1099, etc. One of the problems I have is with using these numbers for any sort of indexing or for naming my summary outputs: I think R tries to make them row 50, 67, etc. in the several different attempts I made.
summaries_output <- tibble()  # or list()
for (id in building_id){
temp_stats <- joined %>%
filter(building_id == "id") %>%
pull(energy_sqm) %>%
summary() %>%
broom:tidy()
summaries_output <- bind_rows(summaries_output, temp_stats, .id = "id")
}
My problems:
a) Whatever I use to initialize summaries_output, I can't get it to retain anything inside the loop, so I am guessing I am messing up the loop as well.
b) Ideally I would like to have the building_id as an identifier for each summary() statistic.
c) Could someone propose the good-practice approach for this kind of loop, in terms of using a list, a tibble, or whatever?
Details: the class() of a summary() result is "summaryDefault" "table", which I don't know anything about.
Thanks for the help.
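For what it's worth, here is a sketch of how the loop itself could be repaired: accumulate into a named list and bind once at the end, filter on the unquoted loop variable (not the string "id"), and use broom::tidy (double colon, not broom:tidy). The joined and building_ids objects below are toy stand-ins, since the real data isn't shown.

```r
library(dplyr)
library(broom)

# Toy stand-ins for the question's objects
joined <- data.frame(
  building_id = rep(c(50, 67), each = 5),
  energy_sqm  = c(1:5, 11:15)
)
building_ids <- c(50, 67)

summaries_output <- list()
for (id in building_ids) {
  summaries_output[[as.character(id)]] <- joined %>%
    filter(building_id == id) %>%   # unquoted: compares against the loop variable
    pull(energy_sqm) %>%
    summary() %>%
    broom::tidy()
}

# Naming the list elements (rather than indexing by the raw id number) avoids
# the "row 50, 67" problem; .id turns those names into an identifier column
result <- bind_rows(summaries_output, .id = "building_id")
```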
We can also use the tidyverse. After grouping by 'Species', tidy the summary output of 'Sepal.Length'. The tidy output is a tibble/data.frame. In dplyr 1.0.0 we could use it without wrapping it in a list, but the resulting column names would then be prefixed with out$ (the name of the summarised column pasted onto the names from tidy with $). To avoid that, we wrap it in a list and then unnest the column created.
library(dplyr)
library(broom)
library(tidyr)
iris %>%
group_by(Species) %>%
summarise(out = list(tidy(summary(Sepal.Length)))) %>%
unnest(c(out))
# A tibble: 3 x 7
# Species minimum q1 median mean q3 maximum
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 setosa 4.3 4.8 5 5.01 5.2 5.8
#2 versicolor 4.9 5.6 5.9 5.94 6.3 7
#3 virginica 4.9 6.22 6.5 6.59 6.9 7.9
This appears to be summarizing by group. Here's a way to do it with data.table, although I am unsure of your exact expected output:
library(broom)
library(data.table)
dt_iris = as.data.table(iris)
dt_iris[, tidy(summary(Sepal.Length)), by = Species]
#> Species minimum q1 median mean q3 maximum
#> 1: setosa 4.3 4.800 5.0 5.006 5.2 5.8
#> 2: versicolor 4.9 5.600 5.9 5.936 6.3 7.0
#> 3: virginica 4.9 6.225 6.5 6.588 6.9 7.9
Created on 2020-07-11 by the reprex package (v0.3.0)
How does an external function inside dplyr::filter know the columns just by their names without the use of the data.frame from which it is coming?
For example consider the following code:
filter(hflights, Cancelled == 1, !is.na(DepDelay))
How does is.na know that DepDelay is from hflights? There could be a DepDelay vector defined elsewhere in my code. (Assume that hflights has columns named 'Cancelled' and 'DepDelay'.)
In Python we would have to use the column name along with the name of the dataframe, so here I was expecting something like
!is.na(hflights$DepDelay)
Any help would be really appreciated.
While I'm not enough of an expert to give a precise answer, hopefully I won't lead you too far astray.
It is essentially a question of environments. filter() first looks for a vector of the given name within the data frame named in its first argument. If it doesn't find one, it goes "up a level", so to speak, to the calling environment (e.g. the global environment) and looks for a vector of that name there. Consider:
library(dplyr)
Species <- iris$Species
iris2 <- select(iris, -Species) # Remove the Species variable from the data frame.
filter(iris2, Species == "setosa")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3.0 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5.0 3.6 1.4 0.2
More information on the topic can be found here (warning, the book is a work in progress).
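The lookup also works the other way around: if a name exists both as a column and in the calling environment, the column wins, because the data frame is searched first.

```r
library(dplyr)

Sepal.Width <- 999  # a global object sharing its name with an iris column

# If the global 999 were used, 999 > 3 would be TRUE and all 150 rows would be
# kept; instead the Sepal.Width column is found first, so only some rows pass
nrow(filter(iris, Sepal.Width > 3))
```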
Most functions from the dplyr and tidyr packages are specifically designed to handle data frames, and all of them take a data frame as their first argument. This allows for use of the pipe (%>%), which lets you build a more intuitive workflow. Think of the pipe as the equivalent of saying "... and then ...". In the context shown above, you could do:
iris %>%
select(-Species) %>%
filter(Species == "setosa")
And you get the same output as above. Combining the pipe with scoping variable names to the referenced data frame is meant to lead to more readable code for humans, which is one of the principles of the tidyverse set of packages, of which both dplyr and tidyr are components.
I have implemented a wrapper library, part of RStudio's htmlwidgets, that renders a pivot table.
The package is here.
The package works well with data.frames (as it should!). For example, it works with iris.
On the other hand, if I convert iris to a data.table, my package (actually htmlwidgets, which internally uses RJSONIO) throws an error.
I know it seems convoluted, but you can more or less reproduce the error by checking the difference between the following snippets:
library(data.table)
library(RJSONIO)
data.table(fromJSON(toJSON(data.table(iris))))
The result is different from the dear iris dataset:
V1
1: 5.1,4.9,4.7,4.6,5.0,5.4,
2: 3.5,3.0,3.2,3.1,3.6,3.9,
3: 1.4,1.4,1.3,1.5,1.4,1.7,
4: 0.2,0.2,0.2,0.2,0.2,0.4,
5: setosa,setosa,setosa,setosa,setosa,setosa,
On the other hand jsonlite is able to re-build iris properly (just remember to detach RJSONIO before running the code):
library(data.table)
library(jsonlite)
data.table(fromJSON(toJSON(data.table(iris))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
I am not sure whether the problem lies with data.table or RJSONIO...
This is not related to JSON.
RJSONIO::fromJSON returns a list, while jsonlite::fromJSON returns a data.frame.
It is related to calling data.table() on a list, which behaves differently from calling it on a data.frame, but both still behave as documented.
Try as.data.table instead of data.table in the outer call.
as.data.table(fromJSON(toJSON(data.table(iris))))
This was already discussed on the data.table GitHub. I've replied to your issue with a reference to the discussion.
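The difference can be seen without any JSON at all; here is a minimal sketch using a small named list (mirroring the shape RJSONIO::fromJSON returns):

```r
library(data.table)

l <- list(a = 1:3, b = 4:6)

data.table(l)     # wraps the whole list into a single list column (like V1 above)
as.data.table(l)  # converts each list element into a proper column
```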
R's abbreviate() is useful for truncating, among other things, the column names of a data frame to a set length, with nice checks to ensure uniqueness, etc.:
abbreviate(names(dframe), minlength=2)
One could, of course, use this function to abbreviate the column names in place and then print out the altered data frame:
names(dframe) <- abbreviate(names(dframe), minlength=2)
dframe
But I would like to print out the data frame with abbreviated column names without altering the data frame in the process. Hopefully this can be done through a simple format option in the print() call, though my search through the help pages of print() and format() methods like print.data.frame didn't turn up any obvious solution (the available options seem geared toward formatting the column values, not their names).
So, does print() or format() have any options that call abbreviate() on the column names? If not, is there a way to apply abbreviate() to the column names of a data frame before passing it to print(), again without altering the passed data frame?
The more I think about it, the more I think that the only way would be to pass print() a copy of the data frame with already abbreviated column names. But this is not a solution for me, because I don't want to constantly be updating this copy as I update the original during an interactive session. The original column names must remain unaltered, because I use which(colnames(dframe)=="name_of_column") to interface with the data.
My ultimate goal is to work better remotely on the small screen of my mobile device in SSH apps like Server Auditor. If the column names are abbreviated to only 2-3 characters, I can still recognize them but can fit much more data on the screen. Perhaps there are even R packages better suited for condensed printing?
You could define your own print method
print.myDF <- function(x, abbr = TRUE, minlength = 2, ...) {
if (abbr) {
names(x) <- abbreviate(names(x), minlength = minlength)
}
print.data.frame(x, ...)
}
Then add the class myDF to the data and print
class(iris) <- c("myDF", class(iris))
head(iris, 3)
# S.L S.W P.L P.W Sp
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), abbr = FALSE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
print(head(iris, 3), minlength = 5)
# Spl.L Spl.W Ptl.L Ptl.W Specs
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
Just rewrite print.data.frame:
print.data.frame <-
  function(x, ...) base::print.data.frame(
    setNames(x, abbreviate(names(x), minlength = 2)), ...)
(Note the base:: prefix: without it, the call would dispatch back to this masking method and recurse forever. You will probably want an auxiliary printfull.data.frame that calls base::print.data.frame directly, so the full names remain viewable.)