Initialise a dataframe where a column references another column - r

I wonder if there is a way to do:
df <- data.frame(x = 1:3)
df$y = df$x + 5
yielding:
x y
1 1 6
2 2 7
3 3 8
in one line of code where the y column refers to the x column? For example:
data.frame(x = 1:3, y = self$x + 5) # doesn't work
(I won't accept answers that ignore the x column, for example, data.frame(x = 1:3, y = 6:8) :-))

This is possible using tibble() from the tibble package, which builds its columns sequentially, so a later column can refer to an earlier one. Credit to @DaveArmstrong from the comments.
library(tibble)
tibble(x = 1:3, y = x + 5)
# A tibble: 3 × 2
x y
<int> <dbl>
1 1 6
2 2 7
3 3 8

Here's a base R method that does not need an external package (e.g. tibble).
We can use outer to add 5 to each element of x (here 1:3), then cbind the result with the x column.
setNames(data.frame(cbind(1:3, outer(1:3, 5, `+`))), c("x", "y"))
# or to expand your code
setNames(cbind(data.frame(x = 1:3), outer(1:3, 5, `+`)), c("x", "y"))
x y
1 1 6
2 2 7
3 3 8
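As a further base R sketch (not taken from the answers above), transform() and within() both evaluate their expressions with the data frame's columns in scope, so y can be defined from x in a single expression. The names df_t and df_w are only illustrative.

```r
# transform() evaluates `x + 5` with the columns of the data frame in scope
df_t <- transform(data.frame(x = 1:3), y = x + 5)

# within() does the same, using an assignment expression instead
df_w <- within(data.frame(x = 1:3), y <- x + 5)

df_t
df_w
```

Both calls return a data frame with columns x and y, where y is derived from x rather than typed in by hand.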

Related

Move several chunks of columns dynamically to another position

My data is:
df <- data.frame(a = 1:2,
x = 1:2,
b = 1:2,
y = 3:4,
x_2 = 1:2,
y_2 = 3:4,
c = 1:2,
x_3 = 5:6,
y_3 = 1:2)
I now want to put together the x vars, and the y vars so that the order of columns would be:
a, x, x_2, x_3, b, y, y_2, y_3, c
I thought, I could use tidyverse's relocate function in combination with lapply or map or reduce (?), but it doesn't work out.
E.g. if I do:
move_names <- c("x", "y")
library(tidyverse)
moved_data <- lapply(as.list(move_names), function(x)
{
df <- df |>
relocate(!!!syms(paste0(x, "_", 2:3)),
.after = all_of(x))
}
)
It does the moving for x and y separately, but it returns a list of separate data frames, whereas I want just my original df with the relocated columns.
Update:
I should have been clear that my real data frame has ~500 columns where the to-be-moved columns are all over the place. So providing the full vector of desired column name order won't be feasible.
What I instead have: I have the names of my original columns, i.e. x and y, and I have the names of the to-be-moved columns, i.e. x_2, x_3, y_2, y_3.
In base R:
df[match(c('a', 'x', 'x_2', 'x_3', 'b', 'y', 'y_2', 'y_3', 'c'), names(df))]
#> a x x_2 x_3 b y y_2 y_3 c
#> 1 1 1 1 5 1 3 3 1 1
#> 2 2 2 2 6 2 4 4 2 2
Not sure if it's what you want.
Vector with order of column names
Let's say you have a vector relocate_name that contains the order of your columns:
library(tidyverse)
relocate_name <- c("a", "x", "x_2", "x_3", "b", "y", "y_2", "y_3", "c")
df %>% relocate(any_of(relocate_name))
Vector with prefix of column names
Or if you only have the prefix of the order, let's call it relocate_name2:
relocate_name2 <- c("a", "x", "b", "y", "c")
df %>% relocate(starts_with(relocate_name2))
Group x and y together
Or if you only want to "group" x and y together:
df %>%
relocate(starts_with("x"), .after = "x") %>%
relocate(starts_with("y"), .after = "y")
Output
All of the above output is the same.
a x x_2 x_3 b y y_2 y_3 c
1 1 1 1 5 1 3 3 1 1
2 2 2 2 6 2 4 4 2 2
library(rlist)
# split based on the part of the column name before _
L <- split.default(df, f = gsub("(.*)_.*", "\\1", names(df)))
# remove names with an underscore
# this is the new order, it should match the names of list L !!
neworder <- names(df)[!grepl("_", names(df))]
# [1] "a" "x" "b" "y" "c"
# cbind list elements together
ans <- rlist::list.cbind(L[neworder])
# a x.x x.x_2 x.x_3 b y.y y.y_2 y.y_3 c
# 1 1 1 1 5 1 3 3 1 1
# 2 2 2 2 6 2 4 4 2 2
# create tidy names again
names(ans) <- gsub(".*\\.(.*)", "\\1", names(ans))
# a x x_2 x_3 b y y_2 y_3 c
# 1 1 1 1 5 1 3 3 1 1
# 2 2 2 2 6 2 4 4 2 2
Ok, this is probably the worst workaround ever and I don't really understand what exactly I'm doing (especially with the <<-), but it does the trick.
My general idea after realizing the problem a bit more with the help of you guys here was to "loop" through both of my x and y names, remove these new _2 and _3 columns from the vector of column names and re-append them after their "base" x and y columns.
search_names <- c("x", "y")
df_names <- names(df)
new_names <- lapply(search_names, function(x)
{
start <- which(df_names == x)
without_new_names <- setdiff(df_names, paste0(x, "_", 2:3))
df_names <<- append(without_new_names, values = paste0(x, "_", 2:3), after = start)
})[[length(search_names)]]
df |>
relocate(any_of(new_names))
a x x_2 x_3 b y y_2 y_3 c
1 1 1 1 5 1 3 3 1 1
2 2 2 2 6 2 4 4 2 2
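The same idea can probably be written without `<<-` by looping over the base names and rebuilding the vector of column names step by step. This is a minimal base R sketch, assuming every to-be-moved column is named as a base name followed by an underscore and a suffix; group_after_base is a hypothetical helper name, not a function from any package.

```r
df <- data.frame(a = 1:2, x = 1:2, b = 1:2, y = 3:4,
                 x_2 = 1:2, y_2 = 3:4, c = 1:2, x_3 = 5:6, y_3 = 1:2)

# Hypothetical helper: for each base name, pull out the columns named
# "<base>_<suffix>" and re-insert them right after the base column.
group_after_base <- function(nms, base_names) {
  for (b in base_names) {
    extra <- nms[startsWith(nms, paste0(b, "_"))]  # e.g. x_2, x_3
    rest  <- setdiff(nms, extra)                   # everything else, order kept
    nms   <- append(rest, extra, after = match(b, rest))
  }
  nms
}

new_order <- group_after_base(names(df), c("x", "y"))
reordered <- df[new_order]
reordered
```

Because the helper only works on the character vector of names, it does not depend on how many other columns the data frame has or where the suffixed columns currently sit.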

Finding the index for 2nd Min value in a data frame

I have a data frame df1. I would like to find the index for the second smallest value from this dataframe. With the function which.min I was able to get the row index for the smallest value but is there a way to get the index for the second smallest value?
> df1
structure(list(x = c(1, 2, 3, 4, 3), y = c(2, 3, 2, 4, 6), z = c(1,
4, 2, 3, 11)), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
>df1
x y z
1 2 1
2 3 4
3 2 2
4 4 3
3 6 11
This is my desired output. For example, in x, the value 2 in row 2 is the second smallest value. Thank you.
>df2
x 2
y 2
z 3
Updated answer
You can write a function like the following, using factor:
which_min <- function(x, pos) {
sapply(x, function(y) {
which(as.numeric(factor(y, sort(unique(y)))) == pos)[1]
})
}
which_min(df1, 2)
# x y z
# 2 2 3
Testing it out with other data:
df2 <- df1
df2$new <- c(1, 1, 1, 2, 3)
which_min(df2, 2)
# x y z new
# 2 2 3 4
Original answer
Instead of sort, you can use order:
sapply(df1, function(x) order(unique(x))[2])
# x y z
# 2 2 3
Or you can make use of the index.return argument in sort:
sapply(df1, function(x) sort(unique(x), index.return = TRUE)$ix[2])
# x y z
# 2 2 3
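One caveat worth checking, sketched here on a hypothetical vector with repeated values: order(unique(x)) returns a position within unique(x), which need not coincide with the row index when the column contains duplicates. Looking the value up in the original vector with match() gives the row index directly.

```r
x <- c(1, 1, 1, 2, 3)                  # the smallest value is repeated

second_smallest <- sort(unique(x))[2]  # the value itself

pos_in_unique <- order(unique(x))[2]   # position within unique(x)
pos_in_x <- match(second_smallest, x)  # first position within x itself

pos_in_unique
pos_in_x
```

For df1 both approaches agree; the difference only shows up once duplicates appear before the second smallest value.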
You can do :
sapply(df1, function(x) which.max(x == sort(unique(x))[2]))
#x y z
#2 2 3
Or with dplyr :
library(dplyr)
df1 %>%
summarise(across(everything(), ~ which.max(. == sort(unique(.))[2])))
# x y z
# <int> <int> <int>
#1 2 2 3
Another base R version using rank
> sapply(df1, function(x) which(rank(unique(x)) == 2))
x y z
2 2 3
You could try something like:
sort(unique(unlist(df1)))[2]

R: How to insert a row in Dataframe starting at a certain column?

I have the following data frame:
df <- tibble(x = 1:3, y = 3:1, z = 4:6, a = 6:4, b = 7:9)
I now need to extract the values from the second row, third to fifth column with this command:
newrow <- df[2,3:5]
I now want to insert a new row after the second row. The problem is that I need the new row to start at column 2. If I use the following code, the row will be added at the same column positions as I extracted it from:
df%>% add_row(newrow, .before = 3)
Hope anybody can help with this, any help is much appreciated.
Your newrow data frame has the column names of columns 3:5 (z, a, b). Therefore add_row() matches newrow to these columns.
You need to rename the columns of newrow to the names of the columns where the values should land, here columns 2 to 4 (y, z, a).
df %>% add_row(setNames(newrow, names(df)[seq_len(ncol(newrow)) + 1]),
.before = 3)
I'm not sure exactly what your desired outcome is, but does this achieve what you want?
library(tibble)
library(dplyr)
df <- tibble::tibble(x = 1:3, y = 3:1, z = 4:6, a = 6:4, b = 7:9)
whatrow <- 2
whatcolumns <- 3:5
beforerow <- 3
newdf <-
slice(df, whatrow) %>%
select(all_of(whatcolumns)) %>%
setNames(., names(df)[whatcolumns - 1]) %>%
add_row(df, ., .before = beforerow)
newdf
#> # A tibble: 4 x 5
#> x y z a b
#> <int> <int> <int> <int> <int>
#> 1 1 3 4 6 7
#> 2 2 2 5 5 8
#> 3 NA 5 5 8 NA
#> 4 3 1 6 4 9
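For comparison, here is a base R sketch of the same result that avoids name matching altogether. It uses a plain data.frame instead of a tibble and swaps add_row() for rbind(), so it is an alternative technique rather than what the answers above do.

```r
df <- data.frame(x = 1:3, y = 3:1, z = 4:6, a = 6:4, b = 7:9)

vals <- unname(unlist(df[2, 3:5]))  # values from row 2, columns 3 to 5

new_row <- df[NA_integer_, ]        # one all-NA row with the same columns
new_row[1, 2:4] <- vals             # place the values starting at column 2

# Stitch the pieces together: rows 1-2, the new row, then row 3
out <- rbind(df[1:2, ], new_row, df[3, ])
rownames(out) <- NULL               # renumber the rows
out
```

The positions 2:4 are written by index, so the shift from columns 3:5 to 2:4 is explicit instead of being encoded in renamed columns.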

Why doesn't `[<-` work to reorder data frame columns?

Why doesn't this work?
df <- data.frame(x=1:2, y = 3:4, z = 5:6)
df[] <- df[c("z", "y", "x")]
df
#> x y z
#> 1 5 3 1
#> 2 6 4 2
Notice that the names are in the original order, but the data itself has changed order.
This works just fine:
df <- data.frame(x=1:2, y = 3:4, z = 5:6)
df[c("z", "y", "x")]
#> z y x
#> 1 5 3 1
#> 2 6 4 2
When a replacement such as x[i] <- value is performed, the values at the index are replaced, not the names. For example, replacing the first item below does not affect the name of the element:
x <- c(a=1, b=2)
x[1] <- 3
x
a b
3 2
In your data frame you replaced the values in the same way. The values changed but the names stayed constant. To reorder the data frame, assign the reordered result to df instead of using the replacement form df[] <- ...:
df <- df[c("z", "y", "x")]
Just don't put the [] after the df and it will do as you want...
df <- data.frame(x=1:2, y = 3:4, z = 5:6)
df <- df[c("z", "y", "x")]
df
# z y x
#1 5 3 1
#2 6 4 2
And if your question is about why, Pierre Lafortune's comment is right.
As a side note, I also like to add the comma to separate the dimensions:
df <- df[,c("z", "y", "x")]
I find it more proper.
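The behaviour discussed in this thread is also what makes df[] <- useful elsewhere: it fills the existing columns by position and keeps the names and the data frame class. A small sketch contrasting the two forms, with scaled and reordered as illustrative names:

```r
df <- data.frame(x = 1:2, y = 3:4, z = 5:6)

# Replacement form: the existing columns are filled in position order and
# the names (and the data.frame class) are kept
scaled <- df
scaled[] <- lapply(df, function(col) col * 10)

# Plain assignment: the whole object is replaced, names included
reordered <- df[c("z", "y", "x")]

names(scaled)
names(reordered)
```

Here scaled still has the names x, y, z, while reordered carries the names in the new order z, y, x.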

Bind data frames on longer identifiers R

I've got two data frames in which the unique identifiers common to both frames differ in the number of observations. I would like to create a dataframe from both in which the observations from each frame are taken if they have more observations for a common identifier. For example:
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1,1,2,3,3,3))
f2 <- data.frame(x = c("a","b", "b", "c", "c"), y = c(4,5,5,6,6))
I would like this to generate a merge based on the longer x such that it produces:
x y
a 1
a 1
b 5
b 5
c 3
c 3
c 3
Any and all thoughts would be great.
Here's a solution using split
dd<-rbind(cbind(f1, s="f1"), cbind(f2, s="f2"))
keep<-unsplit(lapply(split(dd$s, dd$x), FUN=function(x) {
y<-table(x)
x == names(y[which.max(y)])
}), dd$x)
dd <- dd[keep,]
Normally I'd prefer to use the ave function here, but because I'm changing data types from a factor to a logical, it wasn't as appropriate, so I basically copied the idea that ave uses and used split.
dplyr solution
library(dplyr)
First we combine the data with rbind() and introduce a new variable called ref to know where each observation came from:
both <- rbind( f1, f2 )
both$ref <- rep( c( "f1", "f2" ) , c( nrow(f1), nrow(f2) ) )
Then we count the observations, making another new variable, counts, that contains the number of observations for each ref and x combination:
both_with_counts <- both %>%
group_by( ref ,x ) %>%
mutate( counts = n() )
Then filter for the largest count:
both_with_counts %>% group_by( x ) %>% filter( counts==max(counts) )
note: you could also select only the x and y cols with select(x,y)...
this gives:
## Source: local data frame [7 x 4]
## Groups: x
##
## x y ref counts
## 1 a 1 f1 2
## 2 a 1 f1 2
## 3 c 3 f1 3
## 4 c 3 f1 3
## 5 c 3 f1 3
## 6 b 5 f2 2
## 7 b 5 f2 2
Altogether now...
what_I_want <-
rbind(cbind(f1,ref = "f1"),cbind(f2,ref = "f2")) %>%
group_by(ref,x) %>%
mutate(counts = n()) %>%
group_by( x ) %>%
filter( counts==max(counts) ) %>%
select( x, y )
and thus:
> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
#
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5
Not an elegant answer, but it still gives the desired result. Hope this helps.
f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x","freq")
f1new <- merge(f1,f1table)
f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x","freq")
f2new <- merge(f2,f2table)
table <- rbind(f1table, f2table)
table <- table[with(table, order(x,-freq)), ]
table <- table[!duplicated(table$x), ]
data <-rbind(f1new, f2new)
merge(data, table, by=c("x","freq"))[,c(1,3)]
x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3
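A shorter base R sketch of the same selection, going identifier by identifier and keeping the rows from whichever frame has more of them. The tie rule (ties go to f1) is an assumption of this sketch; the question does not say what should happen when both frames have the same count.

```r
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))

ids <- sort(union(f1$x, f2$x))  # every identifier seen in either frame

# For each identifier, keep the rows from the frame that has more of them
pick <- lapply(ids, function(id) {
  r1 <- f1[f1$x == id, ]
  r2 <- f2[f2$x == id, ]
  if (nrow(r1) >= nrow(r2)) r1 else r2  # assumption: ties go to f1
})

result <- do.call(rbind, pick)
rownames(result) <- NULL
result
```

An identifier that appears in only one frame is handled as well, since the other subset then has zero rows.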
