How can I extract only the top three observations with respect to some variable, e.g. a count (the n column in the example data below)? I would like to avoid arranging rows, so I thought I could use dplyr::min_rank.
ex <- structure(list(code = c("17.1", "6.2", "151.5", "78.1", "88.1",
"95.1", "45.2", "252.2"), id = c(1, 2, 3, 4, 5, 6, 7, 8), n = c(6L,
5L, 8L, 10L, 6L, 3L, 4L, 6L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))
ex %>%
filter(min_rank(desc(n)) <= 3)
But if there are ties, it can give more than 3 observations. For example, the command above returns five rows:
# A tibble: 5 x 3
code id n
<chr> <dbl> <int>
1 17.1 1 6
2 151.5 3 8
3 78.1 4 10
4 88.1 5 6
5 252.2 8 6
How can I then extract exactly 3 observations? (no matter which observation is returned in case of ties)
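We can use row_number(), which also accepts a column as its argument. Unlike min_rank(), it breaks ties by order of appearance, so it never returns more than three rows: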
ex %>%
filter(row_number(desc(n)) <= 3)
# A tibble: 3 x 3
# code id n
# <chr> <dbl> <int>
#1 17.1 1 6
#2 151.5 3 8
#3 78.1 4 10
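In base R, we can use order() and keep the last three positions (among tied rows this may pick a different one than the dplyr version):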
ex[tail(order(ex$n),3), ]
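As a further sketch, assuming dplyr 1.0.0 or later is installed, slice_max() with with_ties = FALSE expresses "exactly three rows, ties cut off arbitrarily" directly. The object name top3 is just for illustration.

```r
library(dplyr)

# Same example data as in the question
ex <- tibble(
  code = c("17.1", "6.2", "151.5", "78.1", "88.1", "95.1", "45.2", "252.2"),
  id   = c(1, 2, 3, 4, 5, 6, 7, 8),
  n    = c(6L, 5L, 8L, 10L, 6L, 3L, 4L, 6L)
)

# Keep the three rows with the largest n; with_ties = FALSE caps the
# result at exactly three rows even though three rows share n == 6
top3 <- ex %>%
  slice_max(n, n = 3, with_ties = FALSE)

top3
```

Which of the tied rows survives is not something to rely on here; if that matters, add a second ordering column to break ties explicitly.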
I have a dataset with an ID column, Patient_ID, and multiple columns relating to each baby of a birth event. There is more than one set of each column, as there have been multiple births (twins, triplets etc.), and the database decided to work in a wide format.
So, I have the columns:
Patient_ID *for the mother;
pofid_1
pof1completeddate
pof1pregendweeks
pofid_2
pof2completeddate
pof2pregendweeks
etc, etc.
pofid_1 refers to a unique identifier for each baby, and is the only variable that doesn't follow the format pofnvarname (pof = pregnancy outcome form). There are ~50 columns for each baby; I have only listed three here for demonstration. Is there a way I can pivot the whole dataset based on the number after pof, so that I have the following column names and one row for each baby born:
Patient_ID
babynumber
pofid *baby ID;
pofcompleteddate
pofpregendweeks
So, I am starting off with:
data.frame(
Patient_ID = c(1, 2, 3, 4),
pofid_1 = c(1, 2, 3, 4),
pof1completeddate = as.Date(c("2022-11-12", "2022-12-11", "2022-10-10", "2022-01-01")),
pof1pregendweeks = c(40, 39, 41, 40),
pofid_2 = c(NA, NA, 5, 6),
pof2completeddate = as.Date(c(NA, NA, "2022-10-10", "2022-01-01")),
pof2pregendweeks = c(NA, NA, 41, 40)
)
Patient_ID pofid_1 pof1completeddate pof1pregendweeks pofid_2 pof2completeddate pof2pregendweeks
1 1 1 2022-11-12 40 NA <NA> NA
2 2 2 2022-12-11 39 NA <NA> NA
3 3 3 2022-10-10 41 5 2022-10-10 41
4 4 4 2022-01-01 40 6 2022-01-01 40
And want
Patient_ID pofid babynumber pofcompleteddate pofpregendweeks
1 1 1 1 2022-11-12 40
2 2 2 1 2022-12-11 39
3 3 3 1 2022-10-10 41
4 3 5 2 2022-10-10 41
5 4 4 1 2022-01-01 40
6 4 6 2 2022-01-01 40
It's best to ensure you have consistent naming across your columns by changing pofid_1 and pofid_2 to pof1id and pof2id. You can do this in one gulp using rename_with. Then, it's just a case of pivoting to long format and filtering to retain complete cases:
library(tidyverse)
df %>%
rename_with(~gsub('pofid_(\\d+)', 'pof\\1id', .x)) %>%
pivot_longer(-Patient_ID, names_sep = '(?<=pof\\d)',
names_to = c('babynumber', '.value')) %>%
filter(complete.cases(.)) %>%
mutate(babynumber = as.numeric(gsub('\\D', '', babynumber))) %>%
rename(pofid = id)
#> # A tibble: 6 x 5
#> Patient_ID babynumber pofid completeddate pregendweeks
#> <int> <dbl> <int> <chr> <int>
#> 1 1 1 1 2022-11-12 40
#> 2 2 1 2 2022-12-11 39
#> 3 3 1 3 2022-10-10 41
#> 4 3 2 5 2022-10-10 41
#> 5 4 1 4 2022-01-01 40
#> 6 4 2 6 2022-01-01 4
Created on 2023-02-13 with reprex v2.0.2
Data in reproducible format
df <- structure(list(Patient_ID = 1:4, pofid_1 = 1:4,
pof1completeddate = c("2022-11-12",
"2022-12-11", "2022-10-10", "2022-01-01"), pof1pregendweeks = c(40L,
39L, 41L, 40L), pofid_2 = c(NA, NA, 5L, 6L), pof2completeddate = c(NA,
NA, "2022-10-10", "2022-01-01"), pof2pregendweeks = c(NA, NA,
41L, 4L)), class = "data.frame", row.names = c("1", "2", "3",
"4"))
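To see what the renaming and the name split are doing, here is a minimal base R sketch that works on the column names only. The variable names (old_names, babynumber, value_name) are illustrative, and the pattern pof(\\d+) is an assumption that also allows baby numbers with more than one digit, which the single-digit lookbehind above does not.

```r
old_names <- c("Patient_ID", "pofid_1", "pof1completeddate", "pof1pregendweeks",
               "pofid_2", "pof2completeddate", "pof2pregendweeks")

# Step 1: move the baby number in front of "id" so every name is pof<n><var>
new_names <- gsub("pofid_(\\d+)", "pof\\1id", old_names)

# Step 2: for the baby columns, separate the number from the variable name
baby_cols  <- new_names[-1]
babynumber <- as.integer(sub("^pof(\\d+).*$", "\\1", baby_cols))
value_name <- sub("^pof\\d+", "", baby_cols)

data.frame(baby_cols, babynumber, value_name)
```

The value_name column corresponds to what pivot_longer() uses for the ".value" part, which is why the final result has columns id, completeddate and pregendweeks.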
I am new to R and I am trying to write a piece of code that will enable me to pick some data in df.x and put it in df.y:
Category 2019 2020 2021 2022 2023
Apple 3 4 5 6 7
Pear 3 4 5 6 7
Banana 3 4 5 6 7
Oranges 3 4 5 6 7
I want to select the values for Oranges and put them in a new df.y, together with the year-on-year differences for Apple, like this:
Resource 2019 2020 2021 2022 2023
Orange 3 4 5 6 7
Apple 1 1 1 1
Any help is appreciated!
Thanks
This is a tidyverse approach involving wide to long transform since it's easier to calculate the year differences.
library(tidyverse)
df.x <- tibble(
Category = c("Apple", "Pear", "Banana", "Orange"),
"2019" = 3,
"2020" = 4,
"2021" = 5,
"2022" = 6,
"2023" = 7
)
df.y <- df.x %>%
filter(Category %in% c("Apple", "Orange")) %>%
pivot_longer(-Category) %>%
mutate(value = ifelse(Category == "Apple", value - lag(value, 1), value)) %>%
pivot_wider()
# A tibble: 2 x 6
# Category `2019` `2020` `2021` `2022` `2023`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Apple NA 1 1 1 1
#2 Orange 3 4 5 6 7
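A slightly more defensive variant, sketched here under the assumption that dplyr and tidyr are available, groups by Category before calling lag(). That way the first year of each category gets NA rather than borrowing the last value of the previous category, which matters as soon as differences are wanted for more than one row. The column name year is an illustrative choice.

```r
library(dplyr)
library(tidyr)

df.x <- tibble(
  Category = c("Apple", "Pear", "Banana", "Orange"),
  `2019` = 3, `2020` = 4, `2021` = 5, `2022` = 6, `2023` = 7
)

df.y <- df.x %>%
  filter(Category %in% c("Apple", "Orange")) %>%
  pivot_longer(-Category, names_to = "year") %>%
  group_by(Category) %>%                      # lag() now restarts per category
  mutate(value = if_else(Category == "Apple",
                         value - lag(value),  # year-on-year difference
                         value)) %>%          # other rows are kept as-is
  ungroup() %>%
  pivot_wider(names_from = year, values_from = value)

df.y
```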
First you need to provide your data using dput(df.x):
df.x <- structure(list(Category = c("Apple", "Pear", "Banana", "Oranges"
), X2019 = c(3L, 3L, 3L, 3L), X2020 = c(4L, 4L, 4L, 4L), X2021 = c(5L,
5L, 5L, 5L), X2022 = c(6L, 6L, 6L, 6L), X2023 = c(7L, 7L, 7L,
7L)), class = "data.frame", row.names = c(NA, -4L))
Note that your column names have changed: when reading data, R by default converts column names into syntactically valid variable names, and a name that begins with a number gets an X prepended. The process for extracting information from a data frame is covered in detail on the manual page: ?Extract. It is a bit dense, so it may be easier to begin with some introductory tutorials on R.
To extract the row for Oranges:
row1 <- df.x[df.x$Category=="Oranges", ]
row1
# Category X2019 X2020 X2021 X2022 X2023
# 4 Oranges 3 4 5 6 7
The row number indicates that this is the 4th row in df.x. Now the second row is slightly more involved. First extract the row:
row2 <- df.x[df.x$Category=="Apple", ]
row2
# Category X2019 X2020 X2021 X2022 X2023
# 1 Apple 3 4 5 6 7
Now compute the differences across the row. diff is picky about what kind of data structure it will work with, so first convert the data frame row to a simple vector with unlist. diff returns one value fewer than it was given, so insert a missing value, NA, for the first year:
row2[ , -1] <- c(X2019=NA, diff(unlist(row2[, -1])))
df.y <- rbind(row1, row2)
df.y
# Category X2019 X2020 X2021 X2022 X2023
# 4 Oranges 3 4 5 6 7
# 1 Apple NA 1 1 1 1
rownames(df.y) <- NULL
The last line just resets the row names which have carried over from df.x.
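The diff step can be checked in isolation. This small sketch uses a plain named vector (apple, steps and padded are illustrative names) to show why the NA is needed to get back to five values:

```r
# The Apple row as a plain named integer vector
apple <- c(X2019 = 3L, X2020 = 4L, X2021 = 5L, X2022 = 6L, X2023 = 7L)

# diff() returns successive differences, so one element fewer than its input
steps <- diff(apple)
length(steps)

# Pad the front with NA so there is again one value per year
padded <- c(X2019 = NA, steps)
padded
```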
x y z column_indices
6 7 1 1,2
5 4 2 3
1 3 2 1,3
The column_indices column holds, for each row, the indices of the columns whose values I would like to collect in a separate column. What I want to create is something like this:
x y z column_indices values
6 7 1 1,2 6,7
5 4 2 3 2
1 3 2 1,3 1,2
What is the simplest way to do this in R?
Thanks!
In base R, we can use apply, split the column_indices on ',', convert them to integer and get the corresponding value from the row.
df$values <- apply(df, 1, function(x) {
inds <- as.integer(strsplit(x[4], ',')[[1]])
toString(x[inds])
})
df
# x y z column_indices values
#1 6 7 1 1,2 6, 7
#2 5 4 2 3 2
#3 1 3 2 1,3 1, 2
data
df <- structure(list(x = c(6L, 5L, 1L), y = c(7L, 4L, 3L), z = c(1L,
2L, 2L), column_indices = structure(c(1L, 3L, 2L), .Label = c("1,2",
"1,3", "3"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
One solution involving dplyr and tidyr could be:
df %>%
pivot_longer(-column_indices) %>%
group_by(column_indices) %>%
mutate(values = toString(value[1:n() %in% unlist(strsplit(column_indices, ","))])) %>%
pivot_wider(names_from = "name", values_from = "value")
column_indices values x y z
<chr> <chr> <int> <int> <int>
1 1,2 6, 7 6 7 1
2 3 2 5 4 2
3 1,3 1, 2 1 3 2
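Another base R sketch, assuming column_indices is stored as character rather than as a factor, parses the indices once and then looks up the values row by row. This avoids apply() turning the whole data frame into a character matrix. The helper name idx is illustrative.

```r
df <- data.frame(
  x = c(6L, 5L, 1L),
  y = c(7L, 4L, 3L),
  z = c(1L, 2L, 2L),
  column_indices = c("1,2", "3", "1,3"),
  stringsAsFactors = FALSE
)

# One integer vector of column positions per row
idx <- lapply(strsplit(df$column_indices, ","), as.integer)

# For each row, pick the requested columns and collapse them into one string
df$values <- vapply(
  seq_len(nrow(df)),
  function(i) toString(unlist(df[i, idx[[i]]])),
  character(1)
)

df
```

If column_indices is a factor, as in the dput output above, wrap it in as.character() before calling strsplit().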
To be honest, I couldn't come up with a decent title for this.
Basically, I have a data frame:
ID Qty BasePrice Total
1 2 30 50
1 1 20 20
2 4 5 15
For each line I want to calculate the following:
Result = (Qty * BasePrice) - Total
Which is supposedly easy to do in R. However, I want to group the results by ID (sum them).
Sample Output:
ID Qty BasePrice Total Results
1 2 30 50 10
1 1 20 20 10
2 4 5 15 5
For instance, for ID=1, the values represent ((2*30)-50)+((1*20)-20)
Any idea on how can I achieve this?
Thanks!
We can group by 'ID' and, within each group, sum the difference between the product of 'Qty' and 'BasePrice' and 'Total'. Using mutate rather than summarise keeps one row per original line:
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Result = sum((Qty * BasePrice) - Total))
# A tibble: 3 x 5
# Groups: ID [2]
# ID Qty BasePrice Total Result
# <int> <int> <int> <int> <int>
#1 1 2 30 50 10
#2 1 1 20 20 10
#3 2 4 5 15 5
data
df1 <- structure(list(ID = c(1L, 1L, 2L), Qty = c(2L, 1L, 4L), BasePrice = c(30L,
20L, 5L), Total = c(50L, 20L, 15L)), class = "data.frame", row.names = c(NA,
-3L))
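For comparison, a base R sketch of the same calculation: compute the per-row result first, then let ave() replace each value with the sum over its ID group. The intermediate name row_result is illustrative.

```r
df1 <- data.frame(
  ID        = c(1L, 1L, 2L),
  Qty       = c(2L, 1L, 4L),
  BasePrice = c(30L, 20L, 5L),
  Total     = c(50L, 20L, 15L)
)

# Per-row result: (Qty * BasePrice) - Total
row_result <- df1$Qty * df1$BasePrice - df1$Total

# ave() applies FUN within each ID group and returns a vector as long as the input
df1$Result <- ave(row_result, df1$ID, FUN = sum)

df1
```

Splitting the two steps makes the arithmetic from the question visible: for ID 1 the per-row values are 10 and 0, which sum to 10.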
We have some tidy data with treatments (multiple samples and control), time points, and measured values. I want to normalize all the samples by dividing by the corresponding time point in the control variable.
I know how I would do this with each value in its own column, but can't figure out how to use a combination of gather, mutate, summarise etc. from tidyr or dplyr to do this in a straightforward way.
Here is a sample data frame definition:
structure(list(time = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 100, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -9L), class = "data.frame")
Data frame looks like this:
time value treat
1 10 c
2 20 c
3 15 c
1 100 t1
2 210 t1
3 180 t1
1 110 t2
2 180 t2
3 140 t2
Expected output: the same, but with a normvalue column containing c(1,1,1,10,10.5,12,11,9,9.333333)
I'd like to get out columns of normalized value for each treatment and time point using tidyverse procedures...
If you group by time (assuming that, as in the example, it is the grouping variable for time-point) then we can use bracket notation in a mutate statement to search only within the group. We can use that to access the control value for each group and then divide the un-normalized value by that:
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 9 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 1 100 t1 10
5 2 210 t1 10.5
6 3 180 t1 12
7 1 110 t2 11
8 2 180 t2 9
9 3 140 t2 9.33
All this does is take the value column of each row and divide it by the value for the control sample with the same time value. As you can see, it doesn't care if sample t1 is missing an observation for time == 1:
df <- structure(list(time = c(1, 2, 3, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -8L), class = "data.frame")
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 8 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 2 210 t1 10.5
5 3 180 t1 12
6 1 110 t2 11
7 2 180 t2 9
8 3 140 t2 9.33
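The grouped approach relies on every time point having exactly one control row. As a hedged base R sketch of the same idea, the control values can be looked up by time with match(), which gives NA instead of an error or silent recycling when a time point has no control. The names ctrl and value.norm are illustrative.

```r
df <- data.frame(
  time  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  value = c(10, 20, 15, 100, 210, 180, 110, 180, 140),
  treat = factor(rep(c("c", "t1", "t2"), each = 3))
)

# Control rows only: one reference value per time point
ctrl <- df[df$treat == "c", c("time", "value")]

# For each row, find the control value measured at the same time point
df$value.norm <- df$value / ctrl$value[match(df$time, ctrl$time)]

df
```

Note that match() uses the first control row it finds for each time point, so duplicated controls would need to be averaged or otherwise resolved beforehand.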