data frame column names no longer unique when subsetting - r

I have a data frame that contains duplicate column names. I'm aware that using duplicated column names is non-standard, but these names are reassigned downstream based on user inputs. For now, I'm attempting to subset the data frame by column position, but the duplicated column names get made unique in the result. Here's an example.
> df <- data.frame(x = 1:4, y = 2:5, y = LETTERS[2:5], y = (2+(2:5)), check.names = F)
> df
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
4 4 5 E 7
However, when I attempt to subset, the names change...
> df[, 1:3]
x y y.1
1 1 2 B
2 2 3 C
3 3 4 D
4 4 5 E
Is there any way to prevent this from happening? It only occurs when I subset on columns, not rows.
> df[1:3,]
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
Edit for others noticing this behavior:
I've done some digging into this behavior and found the relevant section in the help page for Extract.data.frame (type ?'['). It states:
If [ returns a data frame it will have unique (and non-missing) row
names, if necessary transforming the row names using make.unique.
Similarly, if columns are selected column names will be transformed to
be unique if necessary (e.g., if columns are selected more than once,
or if more than one column of a given name is selected if the data
frame has duplicate column names).
This explains the why; I appreciate the comments so far on how best to navigate it.

Here is an option, although I think it is not a good idea to have duplicated column names.
as.data.frame(as.list(df)[1:3], check.names = F)
# x y y
# 1 1 2 B
# 2 2 3 C
# 3 3 4 D
# 4 4 5 E
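A related workaround (my own sketch, not part of the original answer) is to subset as usual and then restore the original names afterwards, since assigning names to a data frame with names<- does not force them to be unique:

df2 <- setNames(df[, 1:3], names(df)[1:3])
df2
# x y y
# 1 1 2 B
# 2 2 3 C
# 3 3 4 D
# 4 4 5 E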

Related

Why does dropping which(FALSE) columns delete all columns?

This answer warns of some scary behavior from which(). Specifically, if you take any data frame, say df <- data.frame(x=1:5, y=2:6), and then try to drop columns with something that evaluates to -which(FALSE) (i.e. -integer(0)), you will delete every column in the data frame. Why is this? Why would dropping the columns that correspond to integer(0) delete everything? Deleting nothing shouldn't destroy everything.
Example:
> df <- data.frame(x=1:5, y=2:6)
> df
x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
> df <- df[,-which(FALSE)]
> df
data frame with 0 columns and 5 rows
Consider:
identical(integer(0), -integer(0))
# [1] TRUE
So, actually you're selecting nothing, rather than deleting nothing.
If you want to delete nothing, you could drop a column index that can never exist, e.g. the negative of the largest possible integer.
df[, -.Machine$integer.max]
# x y
# 1 1 2
# 2 2 3
# 3 3 4
# 4 4 5
# 5 5 6
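If the index vector comes from which() and might be empty, a less magical guard (my own sketch) is to drop columns only when there is actually something to drop:

idx <- which(FALSE)   # integer(0): nothing matched
df2 <- if (length(idx)) df[, -idx, drop = FALSE] else df
df2
# x y
# 1 1 2
# 2 2 3
# 3 3 4
# 4 4 5
# 5 5 6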

add a new column conditional on another character column in R

I have been trying to make a procedure in R. I want to ADD a new column based on the categories of another column.
I put an example :
Column New Column
A 1
B 2
C 3
D 4
D 4
A 1
My question is how to add this new column with a particular value based on the (character) values of the first column.
It seems really similar to using mutate() and case_when(). The problem is that, as far as I can tell, that approach only takes numeric values into consideration, and in this case I want to take characters (categories) and, based on them, give a specific value to the new column.
Assuming you have a column of categories (not necessarily letters), you can convert it to an ordered factor to order the categories and then convert that to integers.
x <- c("A", "B", "C", "D", "D", "A")
# make the data frame: the second column holds the ordered-factor codes of x
v <- data.frame(x, as.integer(as.ordered(x)))
# assign the desired column names
colnames(v) <- c("Column", "New Column")
v
# output
> v
Column New Column
1 A 1
2 B 2
3 C 3
4 D 4
5 D 4
6 A 1
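Note that as.ordered() orders the categories alphabetically. If the mapping should follow some other order, one possibility (my own sketch) is to spell out the levels explicitly:

x <- c("A", "B", "C", "D", "D", "A")
# the position of each level determines the integer code
as.integer(factor(x, levels = c("D", "C", "B", "A")))
# [1] 4 3 2 1 1 4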
If I understand you correctly you want to create a new column that has numbers corresponding to letters, with 1 corresponding to the first letter of the alphabet A, 2 corresponding to B, 3 to C, and so on. If that premise is correct, then this code will work for you:
ILLUSTRATIVE DATA
set.seed(12)
df <- data.frame(
  Column = sample(LETTERS[1:5], 10, replace = TRUE)
)
df
Column
1 A
2 E
3 E
4 B
5 A
6 A
7 A
8 D
9 A
10 A
SOLUTION:
Assign the indices of LETTERS (an ordered sequence of integers starting at 1) to the letters in df$Column where they match the letters in LETTERS:
df$Newcolumn <- seq(LETTERS)[match(df$Column, LETTERS)]
RESULT:
df
Column Newcolumn
1 A 1
2 E 5
3 E 5
4 B 2
5 A 1
6 A 1
7 A 1
8 D 4
9 A 1
10 A 1
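Since the question mentions mutate() and case_when(): case_when() handles character comparisons fine, so a dplyr version is also possible. A minimal sketch (assuming the same df as above):

library(dplyr)
df %>%
  mutate(Newcolumn = case_when(Column == "A" ~ 1L,
                               Column == "B" ~ 2L,
                               Column == "C" ~ 3L,
                               Column == "D" ~ 4L,
                               Column == "E" ~ 5L))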

populate Data Frame based on lookup data frame in R

How does one go about switching a data frame based on column names between two tables, with a lookup table in between?
Orig
A B C
1 2 3
2 2 2
4 5 6
Ret
D E
7 8
8 9
2 4
lookup <- data.frame(Orig=c('A','B','C'),Ret=c('D','D','E'))
Orig Ret
1 A D
2 B D
3 C E
So that the final data frame would be
A B C
7 7 8
8 8 9
2 2 4
We can match the column names of 'Orig' with the 'Orig' column in 'lookup' to find the numeric index (here they happen to be in the same order, but they could differ in other cases), and get the corresponding 'Ret' elements based on that. We then use those to subset the 'Ret' dataset and assign the output back to the original dataset. Here I made a copy of "Orig".
OrigN <- Orig
OrigN[] <- Ret[as.character(lookup$Ret)[match(colnames(Orig),
                                              as.character(lookup$Orig))]]
OrigN
# A B C
#1 7 7 8
#2 8 8 9
#3 2 2 4
NOTE: as.character was used as the columns in 'lookup' were factor class.
I believe that the following will work as well.
OrigN <- Orig
OrigN[, as.character(lookup$Orig)] <- Ret[, as.character(lookup$Ret)]
This method applies a column shuffle to Orig (actually a copy OrigN following #Akrun) and then fills these columns with the appropriately ordered columns of Ret using the lookup.
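For reference, a reproducible sketch of the example (the data frames below are my own construction from the tables in the question), using the second approach:

Orig <- data.frame(A = c(1, 2, 4), B = c(2, 2, 5), C = c(3, 2, 6))
Ret  <- data.frame(D = c(7, 8, 2), E = c(8, 9, 4))
lookup <- data.frame(Orig = c('A', 'B', 'C'), Ret = c('D', 'D', 'E'))

OrigN <- Orig
OrigN[, as.character(lookup$Orig)] <- Ret[, as.character(lookup$Ret)]
OrigN
#   A B C
# 1 7 7 8
# 2 8 8 9
# 3 2 2 4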

Finding the top values in data frame using r

How can I find the 5 highest values of a column in a data frame?
I tried the order() function, but it gives me only the indices of the rows, whereas I need the actual data from the column. Here's what I have so far:
tail(order(DF$column, decreasing=TRUE),5)
You need to pass the result of order back to DF:
DF <- data.frame(column = 1:10,
                 names = letters[1:10])
order(DF$column)
# 1 2 3 4 5 6 7 8 9 10
head(DF[order(DF$column),],5)
# column names
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
You're correct that order just gives the indices. You then need to pass those indices to the data frame, to pick out the rows at those indices.
Also, as mentioned in the comments, you can use head instead of tail with decreasing = TRUE if you'd like, but that's a matter of taste.
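Putting those pieces together, one way to get the five highest values (a direct combination of the above, added here for completeness) is:

head(DF[order(DF$column, decreasing = TRUE), ], 5)
#    column names
# 10     10     j
# 9       9     i
# 8       8     h
# 7       7     g
# 6       6     f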

Subset dataframe in a list by a dataframe column criteria

I have a list of dataframes. I need to subset a dataframe of this list according to a criteria in one column of the dataframe.
(all dataframes of the list have the same number and names of columns, and the same number of rows)
For example, I have:
l <- list(data.frame(x = c(2,3,4,5), y = c(4,4,4,4), z = c(2,3,4,5)),
          data.frame(x = c(1,4,7,3), y = c(7,7,7,7), z = c(2,5,7,8)),
          data.frame(x = c(2,3,1,8), y = c(1,1,1,1), z = c(6,4,1,3)))
names(l) <- c("MH1", "MH2","MH3")
output
$MH1
x y z
1 2 4 2
2 3 4 3
3 4 4 4
4 5 4 5
$MH2
x y z
1 1 7 2
2 4 7 5
3 7 7 7
4 3 7 8
$MH3
x y z
1 2 1 6
2 3 1 4
3 1 1 1
4 8 1 3
So I want to select the data frame in the list whose column "y" is closest to a given number. For example, if I say a = 3, the chosen data frame should be "MH1" (where column y = 4).
If "l" was a dataframe I will do something like:
closestDF <- subset(l, abs(l$y - a) == min(abs(l$y - a))
How can I do this with the list of dataframes?
Following the answers and comments of #David Arenburg, #akrun and #shadow, here are three possible solutions to the problem I posted:
Option 1)
library(data.table)
rbindlist(l)[abs(y - a) == min(abs(y - a))]
Option 2) (needs an R version > 3.1.2)
library(dplyr)
bind_rows(l) %>% filter(abs(y - a) == min(abs(y - a)))
Option 3) (also works perfectly, but computationally slower than the first 2 options if used within a big loop or an iterative process)
l[[which.min(sapply(l, function(df) sum(abs(df$y - a))))]]
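As a quick usage check (my own example), with a <- 3 option 3 returns the full MH1 data frame, since its y column (all 4s) is closest to 3:

a <- 3
l[[which.min(sapply(l, function(df) sum(abs(df$y - a))))]]
#   x y z
# 1 2 4 2
# 2 3 4 3
# 3 4 4 4
# 4 5 4 5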

Resources