I have this example dataset:
x <- c("hot", "cold", "warm", "hot", "hot")
y <- c("happy", "content", "happy", "sad", "annoyed")
df <- data.frame(x, y)
I want to find a quick way to convert the text to numbers; it doesn't matter which number each value gets.
So the output would be:
x y
1 1
2 2
3 1
1 3
1 4
Many Thanks
With base R (note that as.factor() numbers levels alphabetically, so the codes differ from the example output; the question says the order doesn't matter):
df[] <- lapply(df, function(x) as.numeric(as.factor(x)))
df
#> x y
#> 1 2 3
#> 2 1 2
#> 3 3 3
#> 4 2 4
#> 5 2 1
With purrr:
library(purrr)
df %>% map(as.factor) %>% map_dfc(as.numeric)
#> # A tibble: 5 x 2
#> x y
#> <dbl> <dbl>
#> 1 2 3
#> 2 1 2
#> 3 3 3
#> 4 2 4
#> 5 2 1
Keep track of the labels with labelled:
df <- df %>% map(as.factor) %>% map_dfc(labelled::to_labelled)
df
#> # A tibble: 5 x 2
#> x y
#> <dbl+lbl> <dbl+lbl>
#> 1 2 [hot] 3 [happy]
#> 2 1 [cold] 2 [content]
#> 3 3 [warm] 3 [happy]
#> 4 2 [hot] 4 [sad]
#> 5 2 [hot] 1 [annoyed]
df$x
#> <labelled<double>[5]>
#> [1] 2 1 3 2 2
#>
#> Labels:
#> value label
#> 1 cold
#> 2 hot
#> 3 warm
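If you later need the text back, to_factor() from the same package converts the coded values into a factor again:
labelled::to_factor(df$x)
#> [1] hot  cold warm hot  hot
#> Levels: cold hot warm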
Or keep the numbers next to the original values in a new column:
df[paste0(names(df), "_num")] <- lapply(df, function(x) as.numeric(as.factor(x)))
df
#> x y x_num y_num
#> 1 hot happy 2 3
#> 2 cold content 1 2
#> 3 warm happy 3 3
#> 4 hot sad 2 4
#> 5 hot annoyed 2 1
If you want to change only the character columns to numeric:
library(purrr)
df %>% map_if(is.character, as.factor) %>% map_dfc(as.numeric)
df %>% map_if(is.character, as.factor) %>% map_dfc(labelled::to_labelled)
Or choose them by name:
library(purrr)
cols <- c("x", "y")
df %>% map_at(cols, as.factor) %>% map_dfc(as.numeric)
df %>% map_at(cols, as.factor) %>% map_dfc(labelled::to_labelled)
df[paste0(cols, "_num")] <- lapply(df[cols], function(x) as.numeric(as.factor(x)))
You could use rapply() (passing as.is = FALSE so that type.convert() turns the character columns into factors; recent versions of R want that argument spelled out):
rapply(type.convert(df, as.is = FALSE), function(x) as.integer(factor(x, unique(x))), 'factor', how = 'replace')
x y
1 1 1
2 2 2
3 3 1
4 1 3
5 1 4
Maybe try this with dplyr:
library(dplyr)
#Code
newdf <- df %>% mutate(across(everything(),~as.numeric(as.factor(.))))
Output:
x y
1 2 3
2 1 2
3 3 3
4 2 4
5 2 1
To keep the original values visible alongside the codes, you can try this:
#Code 2
newdf2 <- df %>% mutate(across(everything(),~as.factor(.))) %>%
mutate(across(everything(),.fns = list(value = ~ as.numeric(.))))
Output:
x y x_value y_value
1 hot happy 2 3
2 cold content 1 2
3 warm happy 3 3
4 hot sad 2 4
5 hot annoyed 2 1
If the data frame also contains a numeric variable (say, a number column that should stay untouched), restrict across() to the relevant columns:
#Code 3
newdf <- df %>% mutate(across(x:y,~as.factor(.))) %>%
mutate(across(x:y,.fns = list(value = ~ as.numeric(.))))
Output:
x y number x_value y_value
1 hot happy 10 2 3
2 cold content 20 1 2
3 warm happy 30 3 3
4 hot sad 40 2 4
5 hot annoyed 50 2 1
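Instead of naming a range, across(where(is.character)) picks up the character columns wherever they sit (dplyr >= 1.0):
newdf <- df %>% mutate(across(where(is.character), ~ as.numeric(as.factor(.))))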
We can use match():
df[] <- lapply(df, function(x) match(x, unique(x)))
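Unlike as.numeric(as.factor(x)), which numbers levels alphabetically, match(x, unique(x)) numbers values in order of first appearance, which reproduces the exact output in the question:
df <- data.frame(x, y) # rebuild the original df, since earlier snippets overwrote it
df[] <- lapply(df, function(x) match(x, unique(x)))
df
#>   x y
#> 1 1 1
#> 2 2 2
#> 3 3 1
#> 4 1 3
#> 5 1 4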
I have a data frame ordered by an id variable ("city"), and I want to keep the second observation of each city that has more than one observation (and the single observation otherwise).
For example, here's an example data set:
city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)
Then we have:
city value
1 1 3
2 1 5
3 2 7
4 3 8
5 3 2
6 4 5
7 5 4
8 6 2
9 7 3
10 7 2
11 8 3
The ideal outcome would be:
city value
2 1 5
3 2 7
5 3 2
6 4 5
7 5 4
8 6 2
10 7 2
11 8 3
Any help is appreciated!
dplyr
The filter keeps a row when its group has only one row (n() == 1L) or when it is the second row of its group (row_number() == 2L):
library(dplyr)
mydata %>%
group_by(city) %>%
filter(n() == 1L | row_number() == 2L) %>%
ungroup()
# # A tibble: 8 x 2
# city value
# <dbl> <dbl>
# 1 1 5
# 2 2 7
# 3 3 2
# 4 4 5
# 5 5 4
# 6 6 2
# 7 7 2
# 8 8 3
Or, slightly differently, with slice(): min(n(), 2) picks row 2 when the group has at least two rows and row 1 otherwise.
mydata %>%
group_by(city) %>%
slice(min(n(), 2)) %>%
ungroup()
base R
ave() applies the function within each city and returns a logical vector aligned with the original rows, which then subsets mydata:
ind <- ave(rep(TRUE, nrow(mydata)), mydata$city,
FUN = function(z) length(z) == 1L | seq_along(z) == 2L)
ind
# [1] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
mydata[ind,]
# city value
# 2 1 5
# 3 2 7
# 5 3 2
# 6 4 5
# 7 5 4
# 8 6 2
# 10 7 2
# 11 8 3
data.table
Since you mentioned the real data is much bigger, you might consider data.table at some point for its speed and referential semantics. (And it doesn't hurt that this code is much more terse :-)
library(data.table)
DT <- as.data.table(mydata) # normally one might use setDT(mydata) instead ...
DT[, .SD[min(.N, 2),], by = city]
# city value
# <num> <num>
# 1: 1 5
# 2: 2 7
# 3: 3 2
# 4: 4 5
# 5: 5 4
# 6: 6 2
# 7: 7 2
# 8: 8 3
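Subsetting .SD per group is convenient but comparatively slow; a common data.table idiom is to compute the global row indices with .I instead and subset once (a sketch of the same logic):
DT[DT[, .I[min(.N, 2)], by = city]$V1]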
Here is logic that uses pmin() to pick index 2 or 1, depending on the length of each group's value vector:
aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] )
city value
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
aggregate() splits value into vectors by city and applies the function to each piece.
You may try
library(dplyr)
mydata %>%
group_by(city) %>%
filter(case_when(n()> 1 ~ row_number() == 2,
TRUE ~ row_number()== 1))
city value
<dbl> <dbl>
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
Another dplyr solution:
mydata %>% group_by(city) %>%
summarize(value=value[pmin(2, n())])
Or:
mydata %>% group_by(city) %>%
summarize(value=ifelse(n() >= 2, value[2], value[1]))
Both Output:
city value
<dbl> <dbl>
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
If base R is OK, try this.
EDIT (since performance really seems to be important): using `if` as a function instead of ifelse() should give a substantial speed-up in some cases. It works here because indexing past the end of a vector (x[2] on a length-1 group) returns NA:
aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]), x[2], x[1]) )
city value
1 1 5
2 2 7
3 3 2
4 4 5
5 5 4
6 6 2
7 7 2
8 8 3
Benchmarks
Here're some benchmarks because I was curious. I gathered all solutions and let them run through microbenchmark.
Bottom line: `if`(cond, yes, no) is fastest (22.3% faster than ifelse and roughly 17 times faster than the slowest), followed by ifelse and aggregate() with pmin(). Keep in mind that the data.table solution ran on a single core only, so any speed-up that package would normally gain from parallelization is missing here. No real shocker, but interesting nonetheless.
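The exact 20,000-row data set used below isn't shown; it could be generated along these lines (the number of cities and the value range are assumptions, not the original code):
set.seed(42)
n <- 20000L
mydata <- data.frame(
  city  = sort(sample.int(5000L, n, replace = TRUE)), # hypothetical: ~5000 cities of varying size
  value = sample.int(100L, n, replace = TRUE)         # integer values, no NAs
)
DT <- as.data.table(mydata) # rebuilt so the data.table timing runs on the same data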
library(microbenchmark)
lengths( mydata )
city value
20000 20000
c( class(mydata$city), class(mydata$value) )
[1] "integer" "integer"
microbenchmark("aggr_if_function" = { res <- aggregate( value ~ city, mydata, function(x) `if`(!is.na(x[2]),x[2],x[1]) )},
"aggr_ifelse" = { res <- aggregate( value ~ city, mydata, function(x) ifelse(!is.na(x[2]),x[2],x[1]) ) },
"dplyr_filter" = { res <- mydata %>% group_by(city) %>% filter(n() == 1L | row_number() == 2L) %>% ungroup() },
"dplyr_slice" = { res <- mydata %>% group_by(city) %>% slice(min(n(), 2)) %>% ungroup() },
"data.table_single_core" = { res <- DT[, .SD[min(.N, 2),], by = city] },
"aggr_pmin" = { res <- aggregate( value ~ city, mydata, function(x) x[ pmin(2, length(x))] ) },
"dplyr_filter_case_when" = { res <- mydata %>% group_by(city) %>% filter(case_when(n()> 1 ~ row_number() == 2, TRUE ~ row_number()== 1)) },
"group_split_purrr" = { res <- group_split(mydata, city) %>% map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows() }, times=50)
Unit: milliseconds
                   expr       min        lq      mean    median        uq       max neval cld
       aggr_if_function  175.5104  179.3273  184.5157  182.1778  186.8963  212.6006    50   a
            aggr_ifelse  214.5846  220.7074  229.2062  228.0688  234.1087  253.0433    50   a
           dplyr_filter  585.5275  607.7011  643.6320  632.0794  660.8184 1066.6018    50   c
            dplyr_slice  713.4047  762.9887  792.7491  780.8475  803.7191 1304.4045    50   d
 data.table_single_core 2080.3869 2164.3829 2240.8578 2229.5310 2298.9002 2702.4201    50   f
              aggr_pmin  321.5265  330.5491  343.2752  341.7866  352.2880  457.3435    50   b
 dplyr_filter_case_when 3171.4859 3337.1669 3492.6915 3500.7783 3608.1809 4195.0774    50   g
      group_split_purrr 1466.4527 1543.2597 1590.9994 1588.0186 1630.5590 1786.5310    50   e
Combining group_split and map_if:
library(tidyverse)
city <- c(1,1,2,3,3,4,5,6,7,7,8)
value <- c(3,5,7,8,2,5,4,2,3,2,3)
mydata <- data.frame(city, value)
group_split(mydata, city) %>%
map_if(~nrow(.) > 1, ~.[2, ]) %>% bind_rows()
#> # A tibble: 8 × 2
#> city value
#> <dbl> <dbl>
#> 1 1 5
#> 2 2 7
#> 3 3 2
#> 4 4 5
#> 5 5 4
#> 6 6 2
#> 7 7 2
#> 8 8 3
Created on 2021-11-30 by the reprex package (v2.0.1)
I would like to merge multiple columns. Here is what my sample dataset looks like.
df <- data.frame(
id = c(1,2,3,4,5),
cat.1 = c(3,4,NA,4,2),
cat.2 = c(3,NA,1,4,NA),
cat.3 = c(3,4,1,4,2))
> df
id cat.1 cat.2 cat.3
1 1 3 3 3
2 2 4 NA 4
3 3 NA 1 1
4 4 4 4 4
5 5 2 NA 2
I am trying to merge columns cat.1, cat.2, and cat.3. It is a little complicated for me since there are NAs.
I need a single cat variable, and even though some columns contain NAs, I need to ignore them. The desired output is below:
> df
id cat
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Any thoughts?
Another variation of Gregor's answer using dplyr::transmute:
library(dplyr)
df %>%
transmute(id = id, cat = coalesce(cat.1, cat.2, cat.3))
#> id cat
#> 1 1 3
#> 2 2 4
#> 3 3 1
#> 4 4 4
#> 5 5 2
With dplyr:
library(dplyr)
df %>%
mutate(cat = coalesce(cat.1, cat.2, cat.3)) %>%
select(-cat.1, -cat.2, -cat.3)
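If there are many cat.* columns, you can splice them into coalesce() instead of naming each one (a sketch using rlang's !!! operator, which dplyr's coalesce() accepts):
library(dplyr)
df %>%
  transmute(id, cat = coalesce(!!!select(., starts_with("cat."))))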
An option with fcoalesce from data.table
library(data.table)
setDT(df)[, .(id, cat = do.call(fcoalesce, .SD)), .SDcols = patterns('^cat')]
Output:
# id cat
#1: 1 3
#2: 2 4
#3: 3 1
#4: 4 4
#5: 5 2
Does this work?
> library(dplyr)
> df %>% rowwise() %>% mutate(cat = mean(c(cat.1, cat.2, cat.3), na.rm = T)) %>% select(-(2:4))
# A tibble: 5 x 2
# Rowwise:
id cat
<dbl> <dbl>
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Since the non-NA values within each row are identical, the row mean returns that same value; max or min would work just as well.
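A vectorized base alternative to the rowwise mean, resting on the same assumption that each row's non-NA values are identical:
df$cat <- rowMeans(df[, c("cat.1", "cat.2", "cat.3")], na.rm = TRUE)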
Here is a base R solution which uses apply():
# for each row: drop the NAs, drop the id (the first remaining element), keep the single unique value
df$cat <- apply(df, 1, function(x) unique(x[!is.na(x)][-1]))
I want to sort a data frame by the sum of a group. That is, I don't want the rows ordered by group number but by each group's total: if group 3 has the largest sum, group 3 goes on top, with its values in descending order, and so on for the remaining groups.
set.seed(123)
d <- data.frame(
x = runif(90),
grp = gl(3, 30))
Thanks!
One way using dplyr is
d %>%
group_by(grp) %>%
mutate(sum_ = sum(x)) %>%
arrange(desc(sum_), desc(x)) %>%
select(-sum_)
Basically, we create a temporary variable sum_ that indicates the sum of x by group, and then arrange according to sum_ first and x second. Afterwards, we remove sum_ since it's no longer needed.
Output
# A tibble: 90 x 2
# Groups: grp [3]
# x grp
# <dbl> <fct>
# 1 0.994 1
# 2 0.957 1
# 3 0.955 1
# 4 0.940 1
# 5 0.900 1
# 6 0.892 1
# 7 0.890 1
# 8 0.883 1
# 9 0.788 1
# 10 0.709 1
# ... with 80 more rows
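A variant that skips the temporary column by computing the per-group sums inline with base ave() (ave(x, grp, FUN = sum) returns each row's group total):
d %>%
  arrange(desc(ave(x, grp, FUN = sum)), desc(x))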
Another option is to reorder the levels of grp according to the sum of x. This can be done with tidyverse like this:
library(tidyverse)
set.seed(123)
df <- tibble(
x = runif(90),
grp = gl(3, 30)
)
df %>%
mutate(
grp = fct_reorder(grp, x, .fun = "sum", .desc = TRUE)
) %>%
arrange(grp)
#> # A tibble: 90 x 2
#> x grp
#> <dbl> <fct>
#> 1 0.288 1
#> 2 0.788 1
#> 3 0.409 1
#> 4 0.883 1
#> 5 0.940 1
#> 6 0.0456 1
#> 7 0.528 1
#> 8 0.892 1
#> 9 0.551 1
#> 10 0.457 1
#> # … with 80 more rows
df %>%
group_by(grp) %>%
summarise(tot = sum(x))
#> # A tibble: 3 x 2
#> grp tot
#> <fct> <dbl>
#> 1 1 17.2
#> 2 2 13.2
#> 3 3 15.4
Created on 2020-06-17 by the reprex package (v0.3.0)
In base R you can first order all rows by x, then split by grp and rbind the pieces in the order of their rowsum():
d <- d[order(-d$x),]
do.call(rbind, split(d, d$grp)[order(-rowsum(d$x, d$grp))])
# x grp
#1.24 0.9942697766 1
#1.11 0.9568333453 1
#1.20 0.9545036491 1
#1.5 0.9404672843 1
#1.16 0.8998249704 1
#1.8 0.8924190444 1
#1.21 0.8895393161 1
#1.4 0.8830174040 1
#1.2 0.7883051354 1
#1.26 0.7085304682 1
#1.22 0.6928034062 1
#1.13 0.6775706355 1
#1.25 0.6557057991 1
#1.23 0.6405068138 1
#1.28 0.5941420204 1
#1.14 0.5726334020 1
#1.9 0.5514350145 1
#1.27 0.5440660247 1
#1.7 0.5281054880 1
#1.10 0.4566147353 1
#1.12 0.4533341562 1
#1.3 0.4089769218 1
#1.19 0.3279207193 1
#1.29 0.2891597373 1
#1.1 0.2875775201 1
#1.17 0.2460877344 1
#1.30 0.1471136473 1
#1.15 0.1029246827 1
#1.6 0.0455564994 1
#1.18 0.0420595335 1
#3.87 0.9849569800 3
#3.88 0.8930511144 3
#3.89 0.8864690608 3
#3.65 0.8146400389 3
#3.68 0.8123895095 3
#3.67 0.8100643530 3
#3.69 0.7943423211 3
#3.84 0.7881958340 3
#3.71 0.7544751586 3
#3.73 0.7101824014 3
#3.82 0.6680555874 3
#3.61 0.6651151946 3
#3.72 0.6292211316 3
#3.78 0.6127710033 3
#3.75 0.4753165741 3
#3.66 0.4485163414 3
#3.70 0.4398316876 3
#3.86 0.4348927415 3
#3.83 0.4176467797 3
#3.63 0.3839696378 3
#3.77 0.3798165377 3
#3.79 0.3517979092 3
#3.64 0.2743836446 3
#3.81 0.2436194727 3
#3.76 0.2201188852 3
#3.90 0.1750526503 3
#3.80 0.1111354243 3
#3.85 0.1028646443 3
#3.62 0.0948406609 3
#3.74 0.0006247733 3
#2.31 0.9630242325 2
#2.32 0.9022990451 2
#2.59 0.8950453592 2
#2.50 0.8578277153 2
#2.53 0.7989248456 2
#2.34 0.7954674177 2
#2.37 0.7584595375 2
#2.58 0.7533078643 2
#2.33 0.6907052784 2
#2.55 0.5609479838 2
#2.36 0.4777959711 2
#2.48 0.4659624503 2
#2.52 0.4422000742 2
#2.42 0.4145463358 2
#2.43 0.4137243263 2
#2.60 0.3744627759 2
#2.44 0.3688454509 2
#2.39 0.3181810076 2
#2.49 0.2659726404 2
#2.47 0.2330340995 2
#2.40 0.2316257854 2
#2.38 0.2164079358 2
#2.56 0.2065313896 2
#2.45 0.1524447477 2
#2.41 0.1428000224 2
#2.46 0.1388060634 2
#2.57 0.1275316502 2
#2.54 0.1218992600 2
#2.51 0.0458311667 2
#2.35 0.0246136845 2
Another option is to reorder the levels of grp, which can then be used in order():
d$grp <- factor(d$grp, levels(d$grp)[order(-rowsum(d$x, d$grp))])
d[order(d$grp, -d$x),]
For the data frame below I want to keep the original values of VAR_X after a group_by on ID and event and a max() on quest, but I cannot get my code right. Any suggestions? By the way, in my original data frame more than one column needs to be carried along.
df <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,3,3,3),
quest = c(1,1,2,2,3,3,1,2,3,1,2,3),
event = c("A","B","A","B","A",NA,"C","D","C","D","D",NA),
VAR_X = c(2,4,3,6,3,NA,6,4,5,7,5,NA))
Code:
df %>%
group_by(ID,event) %>%
summarise(quest = max(quest))
Desired output:
ID quest event VAR_X
1 1 2 B 6
2 1 3 A 3
3 2 2 D 4
4 2 3 C 5
5 3 2 D 5
Start by omitting the NA values and finish with an inner_join back to the original data set.
df %>%
na.omit() %>%
group_by(ID, event) %>%
summarise(quest = max(quest)) %>%
inner_join(df, by = c("ID", "event", "quest"))
## A tibble: 5 x 4
## Groups: ID [3]
# ID event quest VAR_X
# <dbl> <fct> <dbl> <dbl>
#1 1 A 3 3
#2 1 B 2 6
#3 2 C 3 5
#4 2 D 2 4
#5 3 D 2 5
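Since more columns need to be carried along in the real data, dplyr's slice_max() (dplyr >= 1.0) keeps whole rows and avoids the join entirely:
library(dplyr)
df %>%
  na.omit() %>%
  group_by(ID, event) %>%
  slice_max(quest, n = 1) %>% # keeps every column of the winning row
  ungroup()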
library(dplyr)
library(tidyr) # for drop_na()
df %>%
  drop_na() %>% # remove if necessary
  group_by(ID, event) %>%
  filter(quest == max(quest)) %>%
  ungroup()
# A tibble: 5 x 4
# ID quest event VAR_X
#<dbl> <dbl> <chr> <dbl>
# 1 1 2 B 6
# 2 1 3 A 3
# 3 2 2 D 4
# 4 2 3 C 5
# 5 3 2 D 5
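For completeness, a data.table sketch of the same filter (note that setDT() converts df in place):
library(data.table)
setDT(df)[!is.na(event)][, .SD[which.max(quest)], by = .(ID, event)]
This returns the same five rows as the dplyr versions, with the ID and event columns first.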
How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3),
r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
arrange(x,y) %>%
mutate(r = if_else(y - lag(y,default=0) == 0, 0, 1)) %>%
mutate(r = cumsum(r))
But there must be a more general solution, maybe using functions like dense_rank() or row_number(); I'm struggling with it.
dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x,y,lex.order=T)))
But if you have a better solution, please share.
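For the record, that one-liner does reproduce the requested ranks:
library(dplyr)
mutate(df, r = dense_rank(interaction(x, y, lex.order = TRUE)))$r
#> [1] 1 2 3 4 5 5 5 6 7 8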
data.table
data.table has you covered with frank(). Use ties.method = 'dense' to match the requested output (with 'min', the rows after the three-way tie would get ranks 8, 9, 10 instead of 6, 7, 8):
library(data.table)
frank(df, x, y, ties.method = 'dense')
[1] 1 2 3 4 5 5 5 6 7 8
You can run df$r <- frank(df, x, y, ties.method = 'dense') to add it as a new column.
tidyr/dplyr
Another, clunkier option is to collapse the columns into one with tidyr::unite() and then apply dplyr::dense_rank().
library(tidyverse)
df %>%
# add a single column with all the info
unite(xy, x, y) %>%
cbind(df) %>%
# dense rank on that
mutate(r = dense_rank(xy)) %>%
# now drop the helper col
select(-xy)
You can use cur_group_id(), which numbers the (x, y) groups; group ids follow the sorted order of the grouping keys, so on this already-sorted data it matches a dense rank:
library(dplyr)
df %>%
group_by(x, y) %>%
mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8