library(dplyr)

HAVE = data.frame("STUDENT" = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  "CLASS" = c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'),
                  "SEMESTER" = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                  "SCORE" = c(50, 74, 78, 79, 100, 65, 61, 70, 87),
                  "TEST" = c(80, 59, 63, 96, 57, 53, 93, 89, 92))

WANT = HAVE %>%
  rowwise() %>%
  mutate(MAX = max(c(SCORE, TEST)))

WANT$WHICHCOL = c("TEST", "SCORE", "SCORE", "TEST", "SCORE", "SCORE", "TEST", "TEST", "TEST")
I am able to get the max value between SCORE and TEST, but I also wish to create the column WHICHCOL, which equals 'TEST' if TEST > SCORE and 'SCORE' if SCORE > TEST.
pmax is a built-in function that will be much more efficient than a rowwise max:
HAVE %>%
  mutate(
    MAX = pmax(SCORE, TEST),
    WHICHCOL = ifelse(SCORE > TEST, "SCORE", "TEST")
  )
# STUDENT CLASS SEMESTER SCORE TEST MAX WHICHCOL
# 1 1 A 1 50 80 80 TEST
# 2 1 A 2 74 59 74 SCORE
# 3 1 A 3 78 63 78 SCORE
# 4 2 B 1 79 96 96 TEST
# 5 2 B 2 100 57 100 SCORE
# 6 2 B 3 65 53 65 SCORE
# 7 3 C 1 61 93 93 TEST
# 8 3 C 2 70 89 89 TEST
# 9 3 C 3 87 92 92 TEST
Note that, since I use > not >=, TEST will win ties.
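If you would rather have SCORE win ties, a small variation (not shown in the output above) is to flip the comparison:
WHICHCOL = ifelse(SCORE >= TEST, "SCORE", "TEST")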
A base R solution:
df1 <- HAVE[c("SCORE", "TEST")]
x <- max.col(df1, "first")          # column index of each row's maximum; "first" breaks ties in favour of SCORE
MAX <- df1[cbind(1:nrow(df1), x)]   # matrix indexing: pick the winning value in each row
WHICHCOL <- names(df1)[x]
HAVE <- cbind(HAVE, MAX, WHICHCOL)
HAVE
#> STUDENT CLASS SEMESTER SCORE TEST MAX WHICHCOL
#> 1 1 A 1 50 80 80 TEST
#> 2 1 A 2 74 59 74 SCORE
#> 3 1 A 3 78 63 78 SCORE
#> 4 2 B 1 79 96 96 TEST
#> 5 2 B 2 100 57 100 SCORE
#> 6 2 B 3 65 53 65 SCORE
#> 7 3 C 1 61 93 93 TEST
#> 8 3 C 2 70 89 89 TEST
#> 9 3 C 3 87 92 92 TEST
Related Question:
The below works, but is there a better "R way" of achieving a similar result? I am essentially trying to distribute groups into individual line items according to a user-defined function (currently just using a loop).
Example:
df1 <- data.frame(group = c("A", "B", "C"),
                  volume = c(200L, 45L, 104L))
print(df1)
#> group volume
#> 1 A 200
#> 2 B 45
#> 3 C 104
I want the volume to be broken across multiple rows by group, so that the final result is a data frame where the new volume (vol2 below) adds up to the original volume above. In this example I'm applying integer math with a divisor of 52, so my final result should be:
print(df3)
#> group vol2
#> 1 A 52
#> 2 A 52
#> 3 A 52
#> 4 A 44
#> 21 B 45
#> 31 C 52
#> 32 C 52
This works
The code below DOES get me to the desired result shown above:
div <- 52L
df1$intgr <- df1$volume %/% div
df1$remainder <- df1$volume %% div
print(df1)
#> group volume intgr remainder
#> 1 A 200 3 44
#> 2 B 45 0 45
#> 3 C 104 2 0
df2 <- data.frame()
for (r in 1:nrow(df1)) {
  if (df1[r, "intgr"] > 0) {
    for (k in 1:as.integer(df1[r, "intgr"])) {
      df1[r, "vol2"] <- div
      df2 <- rbind(df2, df1[r, ])
    }
  }
  if (df1[r, "remainder"] > 0) {
    df1[r, "vol2"] <- as.integer(df1[r, "remainder"])
    df2 <- rbind(df2, df1[r, ])
  }
}
print(df2)
#> group volume intgr remainder vol2
#> 1 A 200 3 44 52
#> 2 A 200 3 44 52
#> 3 A 200 3 44 52
#> 4 A 200 3 44 44
#> 21 B 45 0 45 45
#> 31 C 104 2 0 52
#> 32 C 104 2 0 52
df3 <- subset(df2, select = c("group", "vol2"))
print(df3)
#> group vol2
#> 1 A 52
#> 2 A 52
#> 3 A 52
#> 4 A 44
#> 21 B 45
#> 31 C 52
#> 32 C 52
Being still relatively new to R, I'm just curious whether someone knows a better way / function / method that gets to the same place. It seems like there might be one. I could potentially have a more complex way of breaking up the rows, and I was thinking there might be a method that applies a UDF to the data frame to do something like this. I searched for "expand group/groups" but mostly found "expand.grid", which isn't what I'm doing here.
Thank you for any suggestions!
A quick function to help split each number by the modulus (the (num - 1) arithmetic keeps an exact multiple from producing an extra chunk):
fun <- function(num, mod) c(rep(mod, (num - 1) %/% mod), (num - 1) %% mod + 1)
fun(200, 52)
# [1] 52 52 52 44
fun(45, 52)
# [1] 45
fun(104, 52)
# [1] 52 52
And we can apply this a number of ways:
dplyr
library(dplyr)
df1 %>%
group_by(group) %>%
summarize(vol2 = fun(volume, 52), .groups = "drop")
# # A tibble: 7 x 2
# group vol2
# <chr> <dbl>
# 1 A 52
# 2 A 52
# 3 A 52
# 4 A 44
# 5 B 45
# 6 C 52
# 7 C 52
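With dplyr >= 1.1.0, returning more than one row per group from summarize() is deprecated in favour of reframe(); a sketch of the same call using reframe() (my addition, not part of the original answer):
df1 %>%
  group_by(group) %>%
  reframe(vol2 = fun(volume, 52))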
base R
do.call(rbind, by(df1, seq(nrow(df1)),
FUN = function(z) data.frame(group = z$group, vol2 = fun(z$volume, 52))))
data.table
library(data.table)
setDT(df1)
df1[, .(vol2 = fun(volume, 52)), by = group]
A tidyverse approach using purrr::pmap and tidyr::unnest_longer might look like this:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)
div <- 52
df1 |>
  mutate(intgr = volume %/% div, remainder = volume %% div, intgr1 = +(remainder > 0)) |>
  mutate(vol2 = purrr::pmap(list(intgr, intgr1, remainder), ~ c(rep(div, ..1), rep(..3, ..2)))) |>
  tidyr::unnest_longer(vol2) |>
  select(-intgr1)
#> # A tibble: 7 × 5
#> group volume intgr remainder vol2
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A 200 3 44 52
#> 2 A 200 3 44 52
#> 3 A 200 3 44 52
#> 4 A 200 3 44 44
#> 5 B 45 0 45 45
#> 6 C 104 2 0 52
#> 7 C 104 2 0 52
With data.table and rep:
library(data.table)
setDT(df1)[, .(vol2 = c(rep(52, volume%/%52), (volume%%52)[sign(volume%%52)])), group]
#> group vol2
#> 1: A 52
#> 2: A 52
#> 3: A 52
#> 4: A 44
#> 5: B 45
#> 6: C 52
#> 7: C 52
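Here (volume %% 52)[sign(volume %% 52)] keeps the remainder only when it is non-zero: sign() returns 0 for an exact multiple, and indexing with 0 selects nothing.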
Or
setDT(df1)[, .(vol2 = c(rep(52, volume%/%52), volume%%52)), group][vol2 != 0]
#> group vol2
#> 1: A 52
#> 2: A 52
#> 3: A 52
#> 4: A 44
#> 5: B 45
#> 6: C 52
#> 7: C 52
Vectorised and without grouping:
df1 <- data.frame(group = c("A", "B", "C"),
volume = c(200L, 45L, 104L))
n <- 52
# rows each group expands to: full chunks, plus one if there is a non-zero remainder
idx <- df1$volume %/% n + ((sel <- df1$volume %% n) != 0)
# repeat each original row idx times
out <- df1[rep(seq_len(nrow(df1)), idx),]
out$volume <- n
# the last expanded row of each group with a remainder gets that remainder instead of n
out$volume[cumsum(idx)[sel != 0]] <- sel[sel != 0]
out
## group volume
##1 A 52
##1.1 A 52
##1.2 A 52
##1.3 A 44
##2 B 45
##3 C 52
##3.1 C 52
Another base R solution using aggregate (with the same (x - 1) arithmetic so an exact multiple does not add an extra chunk):
aggregate(. ~ group, df1, \(x) c(rep(52, (x - 1) %/% 52), (x - 1) %% 52 + 1))
  group         volume
1     A 52, 52, 52, 44
2     B             45
3     C         52, 52
This results in a list column for volume (could be useful)
To transform it to a long dataframe we can either use stack:
with(
  aggregate(. ~ group, df1, \(x) c(rep(52, (x - 1) %/% 52), (x - 1) %% 52 + 1)),
  setNames(stack(setNames(volume, group))[2:1], names(df1))
)
group volume
1 A 52
2 A 52
3 A 52
4 A 44
5 B 45
6 C 52
7 C 52
Or alternatively use unnest from tidyr
library(tidyr)
aggregate(. ~ group, df1, \(x) c(rep(52, (x - 1) %/% 52), (x - 1) %% 52 + 1)) %>% unnest(volume)
# A tibble: 7 × 2
group volume
<chr> <dbl>
1 A 52
2 A 52
3 A 52
4 A 44
5 B 45
6 C 52
7 C 52
I have the following dataset df.
For each id, a1-a3 are the values of variable a recorded at time points 1-3, b1-b3 are the values of variable b recorded at time points 1-3, and c is a time-invariant variable.
Here is the code to create the dataset:
id <- c(1, 2, 3)
a1 <- c(52, 339, 83)
a2 <- c(86, 746, 35)
a3 <- c(46, 546, 45)
b1 <- c(84, 45, 83)
b2 <- c(55, 46, 35)
b3 <- c(46, 60, 45)
c <- c(30, 20, 50)
df <- cbind(id, a1, a2, a3, b1, b2, b3, c)
Here is the original dataset df:
id a1 a2 a3 b1 b2 b3 c
[1,] 1 52 86 46 84 55 46 30
[2,] 2 339 746 546 45 46 60 20
[3,] 3 83 35 45 83 35 45 50
I want to change it to the long format, i.e., into the following df2
time id a b c
[1,] 1 1 52 84 30
[2,] 2 1 86 55 30
[3,] 3 1 46 46 30
[4,] 1 2 339 45 20
[5,] 2 2 746 46 20
[6,] 3 2 546 60 20
[7,] 1 3 83 83 50
[8,] 2 3 35 35 50
[9,] 3 3 45 45 50
What is the best way to do that?
I tried the pivot_longer function (tidyr package), but it does not return what I need.
Thank you very much for the help!
Here's a way to do it with a more advanced use of pivot_longer. A little harder to learn, but much less code:
library(dplyr)
library(tidyr)

df %>%
  as.data.frame() %>%
  pivot_longer(-c(id, c), names_to = c('.value', 'time'), names_pattern = '(.)(.)') %>%
  relocate(c, .after = b)
id time a b c
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 52 84 30
2 1 2 86 55 30
3 1 3 46 46 30
4 2 1 339 45 20
5 2 2 746 46 20
6 2 3 546 60 20
7 3 1 83 83 50
8 3 2 35 35 50
9 3 3 45 45 50
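Note that names_pattern = '(.)(.)' relies on single-character variable names and single-digit time points; with longer names you would need something like names_pattern = '([a-z]+)(\\d+)'.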
Or, if you wanted to be a little more explicit about how the "time" and "c" columns are treated:
df %>%
  as.data.frame() %>%
  pivot_longer(-id, names_to = c('.value', 'time'), names_pattern = '(.)(.*)') %>%
  group_by(id) %>%
  mutate(
    time = as.numeric(time),
    c = c[!is.na(c)]
  ) %>%
  filter(!is.na(time))
id time a b c
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 52 84 30
2 1 2 86 55 30
3 1 3 46 46 30
4 2 1 339 45 20
5 2 2 746 46 20
6 2 3 546 60 20
7 3 1 83 83 50
8 3 2 35 35 50
9 3 3 45 45 50
Here is a way. After reshaping to long format, remove the digits in the name column, create a complementary id column, n, and reshape back to wide format.
id <- c(1, 2, 3)
a1 <- c(52, 339, 83)
a2 <- c(86, 746, 35)
a3 <- c(46, 546, 45)
b1 <- c(84, 45, 83)
b2 <- c(55, 46, 35)
b3 <- c(46, 60, 45)
c <- c(30, 20, 50)
df <- cbind(id, a1, a2, a3, b1, b2, b3, c)
df
#> id a1 a2 a3 b1 b2 b3 c
#> [1,] 1 52 86 46 84 55 46 30
#> [2,] 2 339 746 546 45 46 60 20
#> [3,] 3 83 35 45 83 35 45 50
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
})
df %>%
  as.data.frame() %>%
  pivot_longer(-id) %>%
  mutate(name = gsub("\\d+", "", name)) %>%
  group_by(id, name) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(id, n)) %>%
  select(-n) %>%
  mutate(c = zoo::na.locf(c))
#> # A tibble: 9 × 4
#> id a b c
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 52 84 30
#> 2 1 86 55 30
#> 3 1 46 46 30
#> 4 2 339 45 20
#> 5 2 746 46 20
#> 6 2 546 60 20
#> 7 3 83 83 50
#> 8 3 35 35 50
#> 9 3 45 45 50
Created on 2022-10-23 with reprex v2.0.2
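If you also want a time column in the result, skip the select(-n) step: the helper column n created with row_number() is exactly the time index 1-3 for each id.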
So I have this dataframe and I aim to add a new variable based on others:
Qi  Age  c_gen
 1   56     13
 2   43     15
 5   31      6
 3   67      8
I want to create a variable called c_sep such that:
if Qi == 1 or Qi == 2, c_sep takes a random number between (c_gen + 6) and Age;
if Qi == 3 or Qi == 4, c_sep takes a random number between (Age - 15) and Age;
and 0 otherwise,
so my data would look something like this:
Qi  Age  c_gen  c_sep
 1   56     13     24
 2   43     15     13
 5   31      6      0
 3   67      8     40
Any ideas, please?
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = T)
set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1, 2)] <- apply(dat[dat$Qi %in% c(1, 2), ], 1,
                                        \(row) sample((row["c_gen"] + 6):row["Age"], 1))
dat$c_sep[dat$Qi %in% c(3, 4)] <- apply(dat[dat$Qi %in% c(3, 4), ], 1,
                                        \(row) sample((row["Age"] - 15):row["Age"], 1))
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing this more than twice, you might want to put it in a function, depending on your requirements.
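For instance, a small wrapper might look like this (a sketch only; the helper assign_c_sep and its arguments are my own, not part of the answer above):
assign_c_sep <- function(dat, qi_vals, lower, upper) {
  sel <- which(dat$Qi %in% qi_vals)
  # draw one value per matching row between that row's lower and upper bound
  dat$c_sep[sel] <- vapply(sel, function(i) sample(lower[i]:upper[i], 1), numeric(1))
  dat
}
dat$c_sep <- 0
dat <- assign_c_sep(dat, c(1, 2), dat$c_gen + 6, dat$Age)
dat <- assign_c_sep(dat, c(3, 4), dat$Age - 15, dat$Age)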
Try this
df$c_sep <- ifelse(df$Qi == 1 | df$Qi == 2,
                   sapply(1:nrow(df), \(x) sample(seq(df$c_gen[x] + 6, df$Age[x]), 1)),
            ifelse(df$Qi == 3 | df$Qi == 4,
                   sapply(1:nrow(df), \(x) sample(seq(df$Age[x] - 15, df$Age[x]), 1)),
                   0))
output
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)
df <- tribble(
~Qi, ~Age, ~c_gen,
1, 56, 13,
2, 43, 15,
5, 31, 6,
3, 67, 8
)
df |>
rowwise() |>
mutate(c_sep = case_when(
Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
TRUE ~ 0
)) |>
ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)
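Note that rowwise() is doing real work here: sample() and seq() expect a single pair of bounds, so the case_when() branches need to be evaluated one row at a time.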
I'm trying to split columns into new rows keeping the data of the first two columns.
d1 <- data.frame(a=c(100,0,78),b=c(0,137,117),c.1=c(111,17,91), d.1=c(99,66,22), c.2=c(11,33,44), d.2=c(000,001,002))
d1
a b c.1 d.1 c.2 d.2
1 100 0 111 99 11 0
2 0 137 17 66 33 1
3 78 117 91 22 44 2
Expected results would be:
a b c d
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2
Multiple tries with dplyr, but it seems that is not the right approach.
If you want to stay in dplyr/tidyverse, you want tidyr::pivot_longer with a special reference to .value -- see the pivot vignette for more:
library(tidyverse)
d1 <- data.frame(
  a = c(100, 0, 78),
  b = c(0, 137, 117),
  c.1 = c(111, 17, 91),
  d.1 = c(99, 66, 22),
  c.2 = c(11, 33, 44),
  d.2 = c(000, 001, 002)
)
d1 %>%
  pivot_longer(
    cols = contains("."),
    names_to = c(".value", "group"),
    names_sep = "\\."
  )
#> # A tibble: 6 x 5
#> a b group c d
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 100 0 1 111 99
#> 2 100 0 2 11 0
#> 3 0 137 1 17 66
#> 4 0 137 2 33 1
#> 5 78 117 1 91 22
#> 6 78 117 2 44 2
Created on 2020-05-11 by the reprex package (v0.3.0)
This could solve your issue:
#Try this
a1 <- d1[,c(1:4)]
a2 <- d1[,c(1,2,5,6)]
names(a1) <- names(a2) <- c('a','b','c','d')
DF <- rbind(a1,a2)
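If you also want the rows interleaved as in the expected output (the .1 and .2 rows paired per original row), one extra step is to reorder by the original row index (my addition, not part of the answer above):
DF <- DF[order(rep(seq_len(nrow(d1)), 2)), ]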
The posted answers are good, here's my attempt:
df <- data.frame(a=c(100,0,78),b=c(0,137,117),
c.1=c(111,17,91), d.1=c(99,66,22),
c.2=c(11,33,44), d.2=c(000,001,002))
# Make 2 pivot long operations
df_c <- df %>% select(-d.1, -d.2) %>%
  pivot_longer(cols = c("c.1", "c.2"), values_to = "c") %>% select(-name)
df_d <- df %>% select(-c.1, -c.2) %>%
  pivot_longer(cols = c("d.1", "d.2"), values_to = "d") %>% select(-name)
# bind them without the "key" columns
bind_cols(df_c, select(df_d, -a, -b))
Which produces
# A tibble: 6 x 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2
I have a dataframe that contains tiers and scores. I want to rescale the scores based on the tier with tier 5 from 100-91, tier 4 from 90-81, tier 3 from 80-71 etc. A sample of the data is as follows...
Tier Score
1 95
2 85
3 90
3 87
1 90
4 88
5 90
2 90
5 75
3 80
4 72
1 86
5 70
What I have so far is
library(scales)
df$scale = ifelse(df$tier == "5", rescale(df[df$tier == "5",]$score, to = c(91, 100)), df$scale)
and the output is NA
First, create a list containing the limits for rescale. The first list element is for Tier 1, the second list element is for Tier 2 etc.
limits <- list(c(51, 60), c(61, 70), c(71, 80), c(81, 90), c(91, 100))
You can use this list in the following dplyr approach:
library(dplyr)
df %>%
  group_by(Tier) %>%
  mutate(scale = rescale(Score, to = limits[[first(Tier)]]))
The result:
# A tibble: 13 x 3
# Groups: Tier [5]
Tier Score scale
<int> <int> <dbl>
1 1 95 60
2 2 85 61
3 3 90 80
4 3 87 77.3
5 1 90 55
6 4 88 90
7 5 90 100
8 2 90 70
9 5 75 93.2
10 3 80 71
11 4 72 81
12 1 86 51
13 5 70 91