Same names in Columns in R - r

In R, I am importing some files with read_excel(). The problem is that my files have several columns with the same name, and read_excel() renames them. Is there any way to force it to keep the duplicated names? (I know it's not good practice, but this is a very specific case.)
New names:
* `44228` -> `44228...4`
* `44229` -> `44229...5`
* `44230` -> `44230...6`
* `44231` -> `44231...7`
* `44232` -> `44232...8`
I need to apply a conversion to these column names, since they are dates, so I need to keep the names exactly as they appear in the file.

You can use the .name_repair argument of read_excel() to control, and turn off, the checks applied to column names by tibble(). So to allow duplicate names:
library("readxl")
library("writexl") # Only needed to generate an example xlsx file
x <- data.frame(a = 1:3, a = 1:3, a = 1:3, check.names = FALSE)
write_xlsx(x, "data.xlsx")
read_xlsx("data.xlsx", .name_repair = "minimal")
#> # A tibble: 3 x 3
#> a a a
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 2 2
#> 3 3 3 3
Although do be aware that duplicate column names are closer to a syntax error than "bad practice", so the resulting object will behave in strange ways:
df <- read_xlsx("data.xlsx", .name_repair = "minimal")
df$a
#> [1] 1 2 3
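Because name-based access returns only the first match, the remaining columns are reachable only by position. Here is a minimal sketch of two ways to cope, using a made-up in-memory data frame in place of the imported sheet. The "_" separator is an arbitrary choice, not something readxl requires.

```r
# Hypothetical stand-in for a sheet imported with .name_repair = "minimal"
df <- data.frame(a = 1:3, a = 4:6, a = 7:9, check.names = FALSE)

# Name-based access silently returns only the first matching column
first_a <- df$a

# Positional access still reaches every column
second_a <- df[[2]]

# Once the duplicated names are no longer needed, make them unique
unique_names <- make.unique(names(df), sep = "_")
names(df) <- unique_names
names(df)
```

A reasonable workflow is to do whatever needs the original names first (for example, copying names(df) into a separate vector) and only then switch to unique names for the rest of the analysis.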

Related

Is there a clean way of converting an Excel location string (e.g. "F3" or "BG5") into an R row-column index?

I have a list of xlsx comments imported into R. It is a list of lists, where one of the elements present in each comment is a string representing the comment's location in Excel. I'd like to represent that as an [X,Y] numerical index, as it would be done in R.
list_of_comments
$ :List of 2
..$ location : chr "BA5"
..$ content : chr "some content"
$ :List of 2
you get the picture
I've tried doing it the hardcoded way by creating a data.frame of predefined cell names and returning the index of the matching entry. I soon realised I don't even know how to extend that into double-letter territory (e.g. AA2). And even if I did, I'd be left with a massive data.frame.
Is there a smart way of converting an Excel cell location into a row-column index?
The cellranger package, which helps power readxl, can do this, specifically its as.cell_addr() function:
library(cellranger)
library(dplyr)
list_of_comments <- list(list(location = "BG5", content = "abc"),
list(location = "AA2", content = "xyz"))
bind_rows(list_of_comments) %>%
mutate(as.cell_addr(location, strict = FALSE) %>%
unclass() %>%
as_tibble())
# A tibble: 2 x 4
location content row col
<chr> <chr> <int> <int>
1 BG5 abc 5 59
2 AA2 xyz 2 27
In base R you can do:
excel <- c("F3", "BG5")
r_rows <- as.numeric(sub("^.*?(\\d+)$", "\\1", excel))
r_cols <- sapply(strsplit(sub("^(.+)\\d+$", "\\1", excel), ""), function(x) {
val <- match(rev(x), LETTERS)
sum(val * 26^(seq_along(val) - 1))
})
data.frame(excel = excel, r_row = r_rows, r_col = r_cols)
#> excel r_row r_col
#> 1 F3 3 6
#> 2 BG5 5 59
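The column part of this works because Excel column letters are effectively a base-26 number with digits A = 1 through Z = 26. As a rough sketch, the same idea can be wrapped in a small helper; col_letters_to_index is a name made up for this illustration and is not part of any package.

```r
# Hypothetical helper: convert Excel column letters to a 1-based column index
col_letters_to_index <- function(letters_chr) {
  vapply(strsplit(toupper(letters_chr), ""), function(chars) {
    digits <- match(chars, LETTERS)  # A = 1, ..., Z = 26
    # Accumulate left to right: each new letter shifts the total by a factor of 26
    Reduce(function(acc, d) acc * 26 + d, digits, 0)
  }, numeric(1))
}

col_index <- col_letters_to_index(c("F", "AA", "BG", "XFD"))
col_index
#> [1]     6    27    59 16384
```

XFD is the last column in a modern worksheet, so 16384 is a handy sanity check for the arithmetic.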
Or if you want to just replace location within your list (using Ritchie's example data) you can do:
lapply(list_of_comments, function(l) {
excel <- l$location
r_row <- as.numeric(sub("^.*?(\\d+)$", "\\1", excel))
r_col <- sapply(strsplit(sub("^(.+)\\d+$", "\\1", excel), ""), function(x) {
val <- match(rev(x), LETTERS)
sum(val * 26^(seq_along(val) - 1))
})
l$location <- c(row = r_row, col = r_col)
l
})
#> [[1]]
#> [[1]]$location
#> row col
#> 5 59
#>
#> [[1]]$content
#> [1] "abc"
#>
#>
#> [[2]]
#> [[2]]$location
#> row col
#> 2 27
#>
#> [[2]]$content
#> [1] "xyz"
Created on 2022-04-06 by the reprex package (v2.0.1)
If you read the Excel file using the tidyxl package, you can derive everything directly from the row/col columns:
library(tidyxl)
cells <- xlsx_cells("./temp/xl_test.xlsx")
cells[!is.na(cells$comment), c(1:4,17)]
# A tibble: 1 x 5
# sheet address row col comment
# <chr> <chr> <int> <int> <chr>
# 1 Blad1 C2 2 3 "wim:\r\ncomment1"

Renaming doesn't work for column names starting with two dots

I updated my tidyverse and my read_excel() function (from readxl) has also changed. Columns without titles are now called ..1, ..2 and so on, when they used to be called X__1, X__2.
I'm trying to rename() these columns starting with two dots, but I'm getting an error message.
Here's an example:
library(tidyverse)
df <- tibble(a = 1:3,
..1 = 4:6)
df <- df %>%
rename(b = ..1)
Throws the error:
Error in .f(.x[[i]], ...) :
..1 used in an incorrect context, no ... to look in
I get the same error if I use backticks around the name: rename(b = `..1`).
..1 is a reserved word in R. See help("reserved") and help("..1"). Try quoting it:
df %>% rename(b = "..1")
giving:
# A tibble: 3 x 2
a b
<int> <int>
1 1 4
2 2 5
3 3 6
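If you would rather avoid tidy evaluation altogether, assigning to names() works purely on strings, so the parser never treats ..1 as a symbol. Here is a small base R sketch; the data frame is a made-up stand-in for the imported sheet.

```r
# Hypothetical stand-in for a sheet imported with an untitled column
df <- setNames(data.frame(1:3, 4:6), c("a", "..1"))

# Replace the name by string comparison; no symbol lookup is involved
names(df)[names(df) == "..1"] <- "b"
names(df)
#> [1] "a" "b"
```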
The janitor package has a very handy function clean_names for tasks like this. In this case, it replaces any .. that come from readxl with x. I added another .. column to show how the replacement works.
library(tidyverse)
df <- tibble(a = 1:3,
..1 = 4:6,
..5 = 10:12)
df %>%
janitor::clean_names()
#> # A tibble: 3 x 3
#> a x1 x5
#> <int> <int> <int>
#> 1 1 4 10
#> 2 2 5 11
#> 3 3 6 12
It seems like the naming setup in readxl is a topic of debate: see this issue, among others on the best way to convert unusable names from Excel sheets. There's also a vignette on it. To be honest, the last couple times I've needed to mess with readxl names, I just passed the data frame to janitor.

Command for renaming several variables

After reshaping my data, I have a large dataset with column names that look like this:
1_abc 1_vwxyz 2_abc 2_vwxyz
I would like to change my column names to look like this: abc_1 vwxyz_1 abc_2 vwxyz_2
My code looks like this:
data <- tibble("1_abc" = c(1,2,3), "1_vwxyz" = c(10,11,12),
"2_abc" = c(1,1,2),"2_vwxyz" = c(9,11,15))
data_renamed <- data %>%
rename_(.dots=setNames(names(.), paste(substr(names(.), start=3, stop=nchar(names(.))),
substr(names(.), start=1, stop=1))))
I get this error:
Error in parse(text = x) : <text>:1:2: unexpected input
1: 1_
^
Here's a solution in base R. You first take the column names as a character vector, convert them to a list of two-element character vectors, reverse the order of each and put them back together with _.
ll <- strsplit(colnames(data), split = "_")
# apply across this list of character vectors to reverse the order and concatenate
ll1 <- lapply(ll, function(x) paste(rev(x), collapse = "_"))
# unlist and assign them to the new data frame
data_renamed <- data
colnames(data_renamed) <- unlist(ll1)
# A tibble: 3 x 4
# abc_1 vwxyz_1 abc_2 vwxyz_2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 10 1 9
# 2 2 11 1 11
# 3 3 12 2 15
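An alternative that also copes with names containing more than one underscore is a single regular expression with two capture groups. This is only a sketch, and it assumes every name to be changed starts with digits followed by an underscore; names that do not match are left as they are.

```r
old_names <- c("1_abc", "1_vwxyz", "2_abc", "2_vwxyz")

# Group 1 captures the leading digits, group 2 the rest; swap them around the "_"
new_names <- sub("^(\\d+)_(.*)$", "\\2_\\1", old_names)
new_names
#> [1] "abc_1"   "vwxyz_1" "abc_2"   "vwxyz_2"
```

Applied to the data frame, this becomes names(data) <- sub("^(\\d+)_(.*)$", "\\2_\\1", names(data)).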

unquote string as variable in pipe

I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. My problem is that the column names I need to pass to distinct will change, so I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to be unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with the number of times the data is filtered (and it adds an extra column, though I could live with that). So, how do I obtain the same result as if I had used a, b, but by using a variable instead?
additional information
I actually obtain gen by reading the column names of a dataframe: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen into c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquotes(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure if every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list in GlobalEnv (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
out <- list()
i <- 1
for (s in string) {
t <- ensym(s)
out[i] <-dplyr::quos(!!t)
i <- i+1
}
return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for:
rm_dup_rows <- function(data, ...){
vars = dplyr::quos(...)
data %>% distinct(!!! vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
2 3 5
2 4 5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 4 5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars){
data %>% distinct(!!! vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 3 5
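Since gen starts out as a character vector, another option is to convert the strings to symbols and splice them in. This is a sketch rather than the approach used in the answers above, and it relies on syms() from rlang. It also suggests why the eval(parse()) attempt misbehaved: parse(text = gen) yields one expression per string and eval() returns only the value of the last one, so that call was in effect de-duplicating on b alone under a new computed column, which would explain the 2 x 4 result.

```r
library(dplyr)

data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5, 4, 5, 5))
gen <- colnames(data)[1:2]  # c("a", "b"), held as strings

# syms() turns each string into a symbol; !!! splices them in as if
# distinct(a, b) had been typed by hand
filtered_data <- data %>%
  distinct(!!!rlang::syms(gen), .keep_all = TRUE)

dim(filtered_data)
#> [1] 3 3
```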

R - Creating DFs (tibbles) in a loop. How to rename them and columns inside, to include date? (I do it with eval(..), but is there a better solution?)

I have a loop that creates a tibble, tbl, at the end of each iteration. The loop uses a different date, date, each time.
Assume:
tbl <- tibble(colA=1:5,colB=5:9)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
colA colB
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
(contents are changing every loop, but tbl, date and all columns (colA, colB) names remain the same)
The output that I want needs to start with output - outputdate1, outputdate2 etc.
With columns inside it as colAdate1, colBdate1, and colAdate2, colBdate2 and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
eval(parse(text = (
paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...)) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
colA2017February colB2017February
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!
This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you want this style but still to put the objects separately in the global environment.
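Put together with the loop from the question, the list approach might look roughly like this. The dates and the placeholder table are invented for the illustration, and a plain data.frame stands in for the tibble so that the sketch needs no packages. format(date, "%Y%B") is used in place of year(date) and months(date); like months(), its month names depend on the locale.

```r
dates <- as.Date(c("2017-01-31", "2017-02-28"))  # hypothetical loop dates
outputs <- list()

# Iterate by index: for (d in dates) would drop the Date class
for (i in seq_along(dates)) {
  tbl <- data.frame(colA = 1:5, colB = 5:9)  # placeholder for the real per-iteration result
  ym <- format(dates[i], "%Y%B")             # e.g. "2017February" in an English locale
  outputs[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
}

names(outputs)
```

Each result can then be retrieved by name, for example outputs[["output2017February"]] in an English locale, and the whole collection can be passed to lapply() or saved in one go.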
dplyr
Here it is using dplyr and rlang but it does involve increased complexity:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))
