How to rearrange a tibble by rownames - r

Traditional dataframes support rearrangement of rows by rownames:
> df <- data.frame(c1 = letters[1:3], c2 = 1:3, row.names = paste0("x", 1:3))
> df
c1 c2
x1 a 1
x2 b 2
x3 c 3
#' If we want, say, row "x3" and "x1":
> df[c("x3", "x1"), ]
c1 c2
x3 c 3
x1 a 1
Tibbles drop the concept of rownames, so I wonder what the standard way is to achieve a similar result.
> tb <- as_tibble(rownames_to_column(df))
> tb
# A tibble: 3 x 3
rowname c1 c2
<chr> <fct> <int>
1 x1 a 1
2 x2 b 2
3 x3 c 3
> ?
Thanks.
Update
I can come up with the following solution:
> tb[match(c("x3", "x1"), tb[["rowname"]]), ]
# A tibble: 2 x 3
rowname c1 c2
<chr> <fct> <int>
1 x3 c 3
2 x1 a 1
But it seems clumsy. Does anyone have a better idea?
Update 2
In a more general sense, my question can be rephrased as: in tidyverse syntax, what is the neatest and quickest equivalent to
df[c("x3", "x1"), ]
that is, subsetting and rearranging rows of a dataframe.

As joran described, you can use filter to select the rows of interest. To then put the tibble in a specific, manually defined order, use arrange with a factor whose levels are listed in the desired order:
tibble(rowname = paste0("x", 1:3), c1 = letters[1:3], c2 = 1:3) %>%
filter(rowname %in% c("x3", "x1")) %>%
arrange(factor(rowname, levels = c("x3", "x1")))
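As an alternative sketch (not taken from the answer above), the subsetting and the reordering can be done in one step. It assumes dplyr and tibble are installed; the names `wanted`, `res_join` and `res_slice` are illustrative only. `left_join` keeps the row order of its left-hand table, and `slice` accepts the positions returned by `match`.

```r
library(dplyr)
library(tibble)

tb <- tibble(rowname = paste0("x", 1:3), c1 = letters[1:3], c2 = 1:3)

# Keys listed in the order we want them back (hypothetical helper table)
wanted <- tibble(rowname = c("x3", "x1"))

# Variant 1: left_join() preserves the row order of `wanted`
res_join <- left_join(wanted, tb, by = "rowname")

# Variant 2: match() gives row positions in the requested order
res_slice <- tb %>% slice(match(c("x3", "x1"), rowname))

res_join
res_slice
```

Both variants return the rows in the requested order; the join variant also makes it visible when a key is missing, because that row comes back filled with NA.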

Related

creating multiple variables in loops in r

I am quite new to R, and I do not know how to create variables in a loop. I have a dataset where each observation is uniquely defined by an id and a type. My goal is to create several datasets from the starting one, keeping in each dataset the id, the type and one specific variable, and renaming the variable type as type_variable. Please see below a reproducible example of my dataset:
dt_type <- data.frame(id = c(1,1,1,1,2,2,2,2),
type= c("b1", "b2","c1", "c2","b1", "b2","c1", "c2"),
a=rnorm(8), b=rnorm(8),c=rnorm(8),d=rnorm(8))
# id type a b c d
# 1 1 b1 -0.74733339 -1.1121249 -0.2005649 1.70320036
# 2 1 b2 -0.87290362 -0.1221949 -2.7723691 1.04158671
# 3 1 c1 -0.00878965 -0.7592988 -0.5108226 2.10755315
# 4 1 c2 0.87295622 -0.5885439 0.2606365 -0.87080649
# 5 2 b1 -0.74536372 0.1377794 -0.1382621 0.01743011
# 6 2 b2 -0.01570109 -0.3058672 -0.3146880 -0.43594081
# 7 2 c1 -0.28966205 -0.2045772 -1.1776759 -2.24223369
# 8 2 c2 -0.63680969 2.3815740 0.4462243 -0.05397941
This is how I have tried to do it, but unfortunately it does not work.
varlist <- list("a", "b", "c", "d")
for (i in 1:4) {
tmp <- dt_type %>% rename(paste("type", varlist[[i]], sep=="_") = type) %>%
arrange(id, varlist[[i]], desc(paste("type", varlist[[i]], sep=="_"))) %>%
distinct(id, varlist[[i]], .keep_all = T)
assign(paste("dt_type_", varlist[[i]]), tmp)
}
I am used to using loops in other programming languages, but if there are better ways to reach the result I want, please let me know.
Sorry for not posting the expected output, here it is:
dt_type_a
# id type value
# 1 1 b1 -1.5023199
# 2 1 b2 -0.3653626
# 3 1 c1 1.2842098
# 4 1 c2 0.2732327
# 5 2 b1 -0.7581897
# 6 2 b2 1.1627059
# 7 2 c1 -1.6644546
# 8 2 c2 1.2916819
dt_type_b
# id type value
# 1 1 b1 -0.19573684
# 2 1 b2 -1.35095843
# 3 1 c1 0.69342205
# 4 1 c2 0.47689611
# 5 2 b1 0.67058845
# 6 2 b2 0.21992074
# 7 2 c1 -0.02046201
# 8 2 c2 0.19686712
Thanks,
Vincenzo
Hum, I would just go from wide to long, but since you're asking to create variables dynamically:
library(data.table)
dt_type <- data.frame(id = c(1,1,1,1,2,2,2,2),
type= c("b1", "b2","c1", "c2","b1", "b2","c1", "c2"),
a=rnorm(8), b=rnorm(8),c=rnorm(8),d=rnorm(8))
setDT(dt_type)
dt_long <- melt(dt_type, id.vars = c("id", "type"))
varnames <- unique(dt_long$variable)
for (var in varnames) {
assign(paste0("dt_type_", var), dt_long[variable == var, .(id, type, value)])
}
hope it helps...
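A possible variation on the answer above, sketched under the assumption that tidyr is available: instead of creating separate objects with assign, the long data can be split into a named list, which is usually easier to loop over afterwards. The seed and the names `dt_long` and `dt_list` are illustrative choices, not part of the original post.

```r
library(tidyr)

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
dt_type <- data.frame(
  id   = rep(1:2, each = 4),
  type = rep(c("b1", "b2", "c1", "c2"), times = 2),
  a = rnorm(8), b = rnorm(8), c = rnorm(8), d = rnorm(8)
)

# Wide to long: one row per (id, type, variable)
dt_long <- pivot_longer(dt_type, cols = a:d,
                        names_to = "variable", values_to = "value")

# One data frame per original variable, collected in a named list
dt_list <- split(dt_long[c("id", "type", "value")], dt_long$variable)
names(dt_list) <- paste0("dt_type_", names(dt_list))

dt_list$dt_type_a
```

Each element can then be accessed by name, for example `dt_list[["dt_type_b"]]`, without cluttering the global environment.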

How can I combine/merge two columns in a dataframe and create a new column that specifies the origin of the data point?

I have a dataframe like this
x1 <- c(1,2,3)
x2 <- c(5,6,7)
df <- data.frame(x1,x2)
I would like to combine/merge those two columns x1 and x2 into one column that contains all values. Furthermore I would like to create a new column that says from which of the two columns my datapoint originates. My desired output looks like this:
x <- c(1,2,3,5,6,7)
orig <- c("x1","x1","x1","x2","x2","x2")
df_merged <- data.frame(x,orig)
I searched a lot for merging, combining, grouping of columns but apparently this is not what the process is called. Thanks for any help in advance.
You can use the reshaping function pivot_longer from tidyr:
library(tidyr)
pivot_longer(df, cols = everything())
# name value
# <chr> <dbl>
#1 x1 1
#2 x2 5
#3 x1 2
#4 x2 6
#5 x1 3
#6 x2 7
Or melt in data.table :
library(data.table)
melt(setDT(df))
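To get exactly the column names and row order asked for in the question, the pivot_longer call can be extended as in the following sketch. It assumes dplyr is available in addition to tidyr; `names_to` and `values_to` are documented arguments of pivot_longer, and the name `df_merged` is borrowed from the question.

```r
library(dplyr)
library(tidyr)

df <- data.frame(x1 = c(1, 2, 3), x2 = c(5, 6, 7))

df_merged <- df %>%
  # name the two output columns as in the desired result
  pivot_longer(cols = everything(), names_to = "orig", values_to = "x") %>%
  # group the values by their column of origin
  arrange(orig) %>%
  # put the value column first
  select(x, orig)

df_merged
```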
Try stack
> stack(df)
values ind
1 1 x1
2 2 x1
3 3 x1
4 5 x2
5 6 x2
6 7 x2
Using base R
data.frame(name = rep(names(df), each = nrow(df)), value = unlist(df))
Or use gather (superseded by pivot_longer in current tidyr):
library(tidyr)
gather(df)

Binding data frames from a list with different column types

Trying to figure out a way in purrr to bind rows over different elements of lists where the column types are not consistent. For example, my data looks a little like this...
d0 <- list(
data_frame(x1 = c(1, 2), x2 = c("a", "b")),
data_frame(x1 = c("P1"), x2 = c("c"))
)
d0
# [[1]]
# # A tibble: 2 x 2
# x1 x2
# <dbl> <chr>
# 1 1 a
# 2 2 b
#
# [[2]]
# # A tibble: 1 x 2
# x1 x2
# <chr> <chr>
# 1 P1 c
I can use a for loop and then map_df with bind_rows to get the output I want (map_df will not work if the columns are of different types)...
for(i in 1:length(d0)){
d0[[i]] <- mutate_if(d0[[i]], is.numeric, as.character)
}
map_df(d0, bind_rows)
# # A tibble: 3 x 2
# x1 x2
# <chr> <chr>
# 1 1 a
# 2 2 b
# 3 P1 c
but I think I am missing a trick somewhere that would allow me to avoid the for loop. My attempts along these lines...
d0 %>%
map(mutate_if(., is.numeric, as.character)) %>%
map_df(.,bind_rows)
# Error in UseMethod("tbl_vars") :
# no applicable method for 'tbl_vars' applied to an object of class "list"
... do not seem to work (still getting my head around purrr)
You can use rbindlist() from data.table in this case
data.table::rbindlist(d0) %>%
dplyr::as_data_frame()
# A tibble: 3 x 2
x1 x2
<chr> <chr>
1 1 a
2 2 b
3 P1 c
There may be circumstances where you will want to make sure the fill argument is TRUE.
Documentation reference:
If column i of input items do not all have the same type; e.g, a
data.table may be bound with a list or a column is factor while others
are character types, they are coerced to the highest type (SEXPTYPE).
How about this?
library(purrr)
map_df(lapply(d0, function(x) data.frame(lapply(x, as.character))), bind_rows)
Output is:
x1 x2
1 1 a
2 2 b
3 P1 c
Sample data:
d0 <- list(structure(list(x1 = c(1, 2), x2 = c("a", "b")), .Names = c("x1",
"x2"), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(x1 = "P1", x2 = "c"), .Names = c("x1", "x2"
), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"
)))
With tidyverse, the option would be
library(tidyverse)
d0 %>%
map_df(~ .x %>%
mutate_if(is.numeric, as.character))
# A tibble: 3 x 2
# x1 x2
# <chr> <chr>
#1 1 a
#2 2 b
#3 P1 c
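The attempt in the question fails because `map(mutate_if(., is.numeric, as.character))` evaluates mutate_if on the whole list before map ever runs; wrapping the call in a formula (or a function) defers it to each element. A minimal sketch of the same idea with the newer across/where helpers follows, assuming dplyr 1.0 or later; `out` is an illustrative name.

```r
library(dplyr)
library(purrr)
library(tibble)

d0 <- list(
  tibble(x1 = c(1, 2), x2 = c("a", "b")),
  tibble(x1 = "P1", x2 = "c")
)

out <- d0 %>%
  # the formula is applied to each tibble in turn, not to the list itself
  map(~ mutate(.x, across(where(is.numeric), as.character))) %>%
  # all columns are now character, so the rows can be stacked
  bind_rows()

out
```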
It's a good opportunity to use purrr::modify_depth :
library(purrr)
library(dplyr)
bind_rows(modify_depth(d0,2,as.character))
# # A tibble: 3 x 2
# x1 x2
# <chr> <chr>
# 1 1 a
# 2 2 b
# 3 P1 c

Subset tibble based on column sums, while retaining character columns

I have a feeling this is a pretty stupid issue, but I haven't been able to find the solution either.
I have a tibble where each row is a sample, the first column is a character variable containing the sample ID, and all subsequent columns are numeric variables.
For example:
id <- c("a", "b", "c", "d", "e")
x1 <- rep(1,5)
x2 <- seq(1,5,1)
x3 <- rep(2,5)
x4 <- seq(0.1, 0.5, 0.1)
tb <- tibble(id, x1, x2, x3, x4)
I want to subset this to include only the columns with a sum greater than 5, and the id column. With the old dataframe structure, I know the following worked:
df <- as.data.frame(tb)
df2 <- cbind(df$id, df[,colSums(df[,2:5])>5)
colnames(df2)[1] <- "id"
However, when I try to subset this way with a tibble, I get the error message:
Error: Length of logical index vector must be 1 or 5, got: 4
Does anyone know how to accomplish this task without converting to the old data frame format? Preferably without creating an intermediate tibble with the id variable missing, because separating my ids from my data is just asking for trouble down the road.
Thanks!
# install.packages(c("tidyverse"), dependencies = TRUE)
library(tibble)
df <- tibble(id = letters[1:5], x1 = 1, x2 = 1:5, x3 = 2, x4 = seq(.1, .5, len = 5))
### two additional examples of how to generate the Tibble data
### exploiting that its arguments are evaluated lazily and sequentially
# df <- tibble(id = letters[1:5], x1 = 1, x2 = 1:5, x3 = x1 + 1, x4 = x2/10)
# df <- tibble(x2 = 1:5, id = letters[x2], x3 = 2, x1 = x3-1, x4 = x2/10) %>%
# select(id, num_range("x", 1:4))
Base R solution, cf. HubertL's comment above:
### HubertL's base solution
df[c(TRUE,colSums(df[2:5])>5)]
#> # A tibble: 5 x 3
#> id x2 x3
#> <chr> <int> <dbl>
#> 1 a 1 2
#> 2 b 2 2
#> 3 c 3 2
#> 4 d 4 2
#> 5 e 5 2
dplyr solution, cf. David Klotz's comment:
### Klotz's dplyr solution
library(dplyr)
df %>% select_if(function(x) is.character(x) || sum(x) > 5)
#> # A tibble: 5 x 3
#> id x2 x3
#> <chr> <int> <dbl>
#> 1 a 1 2
#> 2 b 2 2
#> 3 c 3 2
#> 4 d 4 2
#> 5 e 5 2
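In dplyr 1.0 and later, select_if is superseded by select combined with where. A minimal sketch of the same selection in that style, naming the id column explicitly so it is always kept (`out` is an illustrative name):

```r
library(dplyr)
library(tibble)

tb <- tibble(id = letters[1:5], x1 = 1, x2 = 1:5, x3 = 2,
             x4 = seq(0.1, 0.5, by = 0.1))

out <- tb %>%
  # keep id, plus every numeric column whose sum exceeds 5;
  # is.numeric() is checked first so sum() never sees the character column
  select(id, where(~ is.numeric(.x) && sum(.x) > 5))

out
```

Note that x1 sums to exactly 5 and is therefore dropped, just as in the outputs shown above.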

forloop inside dplyr mutate

I would like to do a few column operations using mutate in a more elegant way, as my table has more than 200 columns that I would like to transform.
Here is an example.
Sample data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20),c2_y=rnorm(20),c3_z=rnorm(20),
c4_x=rnorm(20),c5_y=rnorm(20),c6_z=rnorm(20),
c7_x=rnorm(20),c8_y=rnorm(20),c9_z=rnorm(20),
c10_x=rnorm(20),c11_y=rnorm(20),c12_z=rnorm(20),
c_n=rnorm(20))
Sample code:
dfm<-df %>%
mutate(cx=(c1_x*c4_x/c_n+c7_x*c10_x/c_n),
cy=(c2_y*c5_y/c_n+c8_y*c11_y/c_n),
cz=(c3_z*c6_z/c_n+c9_z*c12_z/c_n))
Despite the tangent, the initial recommendation to use tidyr functions is where you need to go. This pipe of functions seems to do the job based on what you've provided.
Your data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20), c2_y=rnorm(20), c3_z=rnorm(20),
c4_x=rnorm(20), c5_y=rnorm(20), c6_z=rnorm(20),
c7_x=rnorm(20), c8_y=rnorm(20), c9_z=rnorm(20),
c10_x=rnorm(20), c11_y=rnorm(20), c12_z=rnorm(20),
c_n=rnorm(20))
library(dplyr)
library(tidyr)
This first auxiliary data.frame is used to translate your c#_[xyz] variable into a unified one. I'm sure there are other ways to handle this, but it works and is relatively easy to reproduce and extend based on your 200+ columns.
variableTransform <- data_frame(
cnum = paste0("c", 1:12),
cvar = rep(paste0("a", 1:4), each = 3)
)
head(variableTransform)
# Source: local data frame [6 x 2]
# cnum cvar
# <chr> <chr>
# 1 c1 a1
# 2 c2 a1
# 3 c3 a1
# 4 c4 a2
# 5 c5 a2
# 6 c6 a2
Here's the pipe all at once. I'll explain the steps in a sec. What you're looking for is likely a combination of the treatment, xyz, and ans columns.
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
tidyr::spread(cvar, value) %>%
mutate(
ans = a1 * (a2/c_n) + a3 * (a4/c_n)
) %>%
head
# treatment c_n xyz a1 a2 a3 a4 ans
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419 1.15801448
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979 -0.01828831
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878 -2.03197283
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493 0.15759418
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839 0.65270681
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095 0.06136036
First, we take the original data and turn all (except two) columns into two columns of "column name" and "column values" pairs:
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
# treatment c_n cnum value
# 1 a 0.20745647 c1_x -0.1250222
# 2 b 0.01015871 c1_x -0.4585088
# 3 a 1.65671028 c1_x -0.2455927
# 4 b -0.24037137 c1_x 0.6219516
# 5 a -1.16092349 c1_x -0.3716138
# 6 b 1.61191700 c1_x 1.7605452
It will be helpful to split c1_x into c1 and x in order to translate the first and preserve the latter:
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
# treatment c_n cnum xyz value
# 1 a 0.20745647 c1 x -0.1250222
# 2 b 0.01015871 c1 x -0.4585088
# 3 a 1.65671028 c1 x -0.2455927
# 4 b -0.24037137 c1 x 0.6219516
# 5 a -1.16092349 c1 x -0.3716138
# 6 b 1.61191700 c1 x 1.7605452
From here, let's translate the c1, c2, and c3 variables into a1 (repeat for other 9 variables) using variableTransform:
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
# treatment c_n xyz value cvar
# 1 a 0.20745647 x -0.1250222 a1
# 2 b 0.01015871 x -0.4585088 a1
# 3 a 1.65671028 x -0.2455927 a1
# 4 b -0.24037137 x 0.6219516 a1
# 5 a -1.16092349 x -0.3716138 a1
# 6 b 1.61191700 x 1.7605452 a1
Since we want to deal with multiple variables simultaneously (with a simple mutate), we need to bring some of the variables back into columns. (Gathering first and spreading now helps me keep things organized and well named. I'm confident somebody can come up with another way to do it.)
tidyr::spread(cvar, value) %>% head
# treatment c_n xyz a1 a2 a3 a4
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095
From here, we just need to mutate to get the right answer.
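For readers on tidyr 1.0 or later, where gather and spread are superseded, the same steps can be sketched with pivot_longer and pivot_wider. This is an illustrative variant, not the answer's own code: the lookup table is replaced by integer division, which assumes the columns are numbered consecutively in groups of three (c1 to c3, c4 to c6, and so on), and `row_id`, `res` and `direct` are hypothetical names. The last lines compare the result with the direct formula from the question.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # assumption: fixed seed so the check is reproducible
df <- data.frame(treatment = rep(letters[1:2], 10),
                 c1_x = rnorm(20), c2_y = rnorm(20), c3_z = rnorm(20),
                 c4_x = rnorm(20), c5_y = rnorm(20), c6_z = rnorm(20),
                 c7_x = rnorm(20), c8_y = rnorm(20), c9_z = rnorm(20),
                 c10_x = rnorm(20), c11_y = rnorm(20), c12_z = rnorm(20),
                 c_n = rnorm(20))

res <- df %>%
  # explicit row identifier so reshaping never relies on c_n being unique
  mutate(row_id = row_number()) %>%
  # c1_x -> cnum = "c1", xyz = "x"; c_n does not match the pattern
  pivot_longer(matches("^c[0-9]+_[xyz]$"),
               names_to = c("cnum", "xyz"), names_sep = "_") %>%
  # c1-c3 -> a1, c4-c6 -> a2, c7-c9 -> a3, c10-c12 -> a4
  mutate(cvar = paste0("a", (as.integer(sub("c", "", cnum)) - 1) %/% 3 + 1)) %>%
  select(-cnum) %>%
  pivot_wider(names_from = cvar, values_from = value) %>%
  mutate(ans = a1 * (a2 / c_n) + a3 * (a4 / c_n)) %>%
  arrange(row_id, xyz)

# Cross-check against the formula written out in the question
direct <- with(df, c1_x * c4_x / c_n + c7_x * c10_x / c_n)
all.equal(res$ans[res$xyz == "x"], direct)
```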
Similar to r2evans's answer, but with more manipulation instead of the joins (and less explanation).
library(tidyr)
library(stringr)
library(dplyr)
# get it into fully long form
gather(df, key = cc_xyz, value = value, c1_x:c12_z) %>%
# separate off the xyz and the c123
separate(col = cc_xyz, into = c("cc", "xyz")) %>%
# extract the number
mutate(num = as.numeric(str_replace(cc, pattern = "c", replacement = "")),
# group c1-c3, c4-c6, c7-c9, c10-c12 as v1..v4 and add a letter so it's a good col name
num_mod = paste0("v", ((num - 1) %/% 3) + 1)) %>%
# remove unwanted columns
select(-cc, -num) %>%
# go into a reasonable data width for calculation
spread(key = num_mod, value = value) %>%
# calculate, matching the question's c1 * c4 / c_n + c7 * c10 / c_n
mutate(result = v1 * v2 / c_n + v3 * v4 / c_n)
# the reshaped columns that feed the final mutate look like this:
# treatment c_n xyz v1 v2 v3 v4
# 1 a -1.433858289 x 1.242153708 -0.985482158 -0.0240414692 1.98710285
# 2 a -1.433858289 y -0.019255516 0.074453615 -1.6081599298 1.18228939
# 3 a -1.433858289 z -0.362785313 2.296744655 -0.0610463292 0.89797526
# 4 a -0.911463819 x -1.088308527 -0.703388193 0.6308253909 0.22685013
# 5 a -0.911463819 y 1.284513516 1.410276163 0.5066869590 -2.07263912
# 6 a -0.911463819 z 0.957778345 -1.136532104 1.3959561507 -0.50021647
# ...
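Since the question title asks about a for loop, here is a minimal base R sketch that avoids reshaping altogether. It assumes each suffix (x, y, z) has exactly four columns and that they appear in the data frame in the order used by the question's formula; the seed and the names `dfm`, `s` and `cols` are illustrative.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(treatment = rep(letters[1:2], 10),
                 c1_x = rnorm(20), c2_y = rnorm(20), c3_z = rnorm(20),
                 c4_x = rnorm(20), c5_y = rnorm(20), c6_z = rnorm(20),
                 c7_x = rnorm(20), c8_y = rnorm(20), c9_z = rnorm(20),
                 c10_x = rnorm(20), c11_y = rnorm(20), c12_z = rnorm(20),
                 c_n = rnorm(20))

dfm <- df
for (s in c("x", "y", "z")) {
  # columns for this suffix, in data frame order (e.g. c1_x, c4_x, c7_x, c10_x)
  cols <- grep(paste0("^c[0-9]+_", s, "$"), names(df), value = TRUE)
  stopifnot(length(cols) == 4)
  # first * second / c_n + third * fourth / c_n, as in the question
  dfm[[paste0("c", s)]] <-
    (df[[cols[1]]] * df[[cols[2]]] + df[[cols[3]]] * df[[cols[4]]]) / df$c_n
}

head(dfm[c("treatment", "cx", "cy", "cz")])
```

With 200+ columns the same loop applies as long as the naming pattern holds; only the vector of suffixes and the formula inside the loop would need adjusting.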
