Keeping one row and discarding others in R using specific criteria?

I'm working with the data frame below, which is just part of the full data, and I need to condense the rows with duplicated values in the id column into one row. I want to keep the row with the highest sbp value, unless that value is 300 or over, in which case I want to discard it too.
So, for example, for the first three rows with id 13480, I want to keep the row with 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is as far as I've got. I've been trying to tweak it, but I'm not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)){
  one <- maxsbp[[i]]
  index <- which(one$sbp == max(one$sbp))
  select <- one[index,]
  r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp>=300),]
r1

I think a tidy solution would work quite well here. I would first filter out all values of 300 or above, since you do not want to keep anything at or beyond that threshold. Then group_by id, arrange, and keep the first.
library(dplyr)

my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
                    "sex" = c("M","M","M","M","M","M","M","M"),
                    "sbp" = c(124,306,116,124,116,120,NA,124))
my.df %>%
  filter(sbp < 300) %>%    # keep only values below 300 (NA rows are dropped too)
  group_by(id) %>%         # group by id
  arrange(desc(sbp)) %>%   # sort by sbp in descending order
  top_n(1, sbp)            # retain the largest value per group
# A tibble: 3 x 3
# Groups: id [3]
# id sex sbp
# <dbl> <chr> <dbl>
#1 13480 M 124
#2 13520 M 124
#3 13580 M 124
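
On newer dplyr versions (1.0 and above), slice_max() supersedes top_n(); a minimal sketch of the same pipeline under that assumption:
my.df %>%
  filter(sbp < 300) %>%                      # drop values at or above 300 (and NAs)
  group_by(id) %>%
  slice_max(sbp, n = 1, with_ties = FALSE)   # keep the single largest sbp per id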

In R you'll very rarely need explicit for loops for tasks like this.
There are built-in functions that perform such grouped operations for you.
For example, in base R you can use subset and ave:
subset(df, sbp == ave(sbp, id, FUN = function(x) max(x[x <= 300], na.rm = TRUE)))
# id sex visits sbp
#1 13480 M 2 124
#4 13520 M 2 124
#8 13580 M 3 124
The same can be done using dplyr, whose syntax is a little easier to read.
library(dplyr)
df %>%
  group_by(id) %>%
  filter(sbp == max(sbp[sbp <= 300], na.rm = TRUE))

slice_head can also be used:
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
                    "sex" = c("M","M","M","M","M","M","M","M"),
                    "sbp" = c(124,306,116,124,116,120,NA,124))
> my.df
id sex sbp
1 13480 M 124
2 13480 M 306
3 13480 M 116
4 13520 M 124
5 13520 M 116
6 13520 M 120
7 13580 M NA
8 13580 M 124
Proceed like this, filtering before slicing so that a group whose largest value is 300 or more still keeps its next-best row:
my.df %>%
  filter(sbp < 300) %>%
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head()
# A tibble: 3 x 3
# Groups: id, sex [3]
     id sex     sbp
  <dbl> <chr> <dbl>
1 13480 M       124
2 13520 M       124
3 13580 M       124

Related

Rebuild tibble under condition

My Tibble:
library(tibble)
df1 <- tibble(a = c("123*", "123", "124", "678*", "678", "679", "677"))
# A tibble: 7 x 1
a
<chr>
1 123*
2 123
3 124
4 678*
5 678
6 679
7 677
What it should become:
# A tibble: 3 x 2
a b
<chr> <chr>
1 123 124
2 678 679
3 678 677
The starred values refer to the following unstarred values, until a new starred value comes, and so on.
Each starred value should go into the first column; the other values (except those identical to a starred value once the star is removed) should go into the second column. If one starred value is followed by several values, they should still be linked to each other, so the values in the first column are duplicated to keep the connection.
I know how to filter and bring the values into each column, but I'm not sure how I would keep the connection.
Regards
We can use the tidyverse. Create a grouping column based on the occurrence of * in 'a', extract the numeric part with parse_number, get the distinct rows and, grouped by 'grp', create a new column with the first value of 'b':
library(dplyr)
library(stringr)

df1 %>%
  transmute(grp = cumsum(str_detect(a, fixed("*"))),
            b = readr::parse_number(a)) %>%
  distinct(b, .keep_all = TRUE) %>%
  group_by(grp) %>%
  mutate(a = first(b)) %>%
  slice(-1) %>%
  ungroup %>%
  select(a, b)
Output:
# A tibble: 3 × 2
a b
<dbl> <dbl>
1 123 124
2 678 679
3 678 677
Here is one base R option.
Using cumsum and grepl, we split the data on each occurrence of *.
In each group, we drop the values identical to the starred value (once the star is removed) and create a data frame with two columns.
Finally, we combine the list of data frames into one.
result <- do.call(rbind, lapply(split(df1,
  cumsum(grepl('*', df1$a, fixed = TRUE))), function(x) {
  a <- x[[1]]
  a[1] <- sub('*', '', a[1], fixed = TRUE)
  data.frame(a = a[1], b = a[a != a[1]])
}))
rownames(result) <- NULL
result
# a b
#1 123 124
#2 678 679
#3 678 677

How to take difference between variable and lag determined by month date per group?

Essentially, I have a dataset with variables indicating group, date and value of variable. I need to take the difference between the value and the end-of-previous year value per group. Since the data is balanced, I was trying to do that with dplyr::lag, inserting the lag given the month of the observation:
x <- x %>% group_by(g) %>% mutate(y = v - lag(v, n = month(d)))
This, however, does not work: the n argument of lag() must be a single integer, not a vector that varies by row.
The results should be as follows.
Mock dataset:
x <- data.frame('g' = c('B','B','B','C','A','A','A','A','A','A'),
                'd' = c('2018-11-30','2018-12-31','2019-01-31','2019-12-31','2016-12-31',
                        '2017-11-30','2017-12-31','2018-12-31','2019-01-31','2019-02-28'),
                'v' = c(300,200,250,100,400,150,200,500,400,500))
Desired variable:
y <- c(NA, NA, 50, NA, NA, -250, -200, 300, -100, 0)
New dataset:
cbind(x, y)
One idea via dplyr is to look for the end-of-year row within each group, get its index, use it to subtract, and then convert the leading values to NAs (note that this assumes a single December 31 row per group), i.e.
library(dplyr)
x %>%
  group_by(g) %>%
  mutate(new = which(sub('^[0-9]+-([0-9]+-[0-9]+)$', '\\1', d) == '12-31'),
         y = v - v[new],
         y = replace(y, row_number() <= new, NA)) %>%
  select(-new)
which gives,
# A tibble: 7 x 4
# Groups: g [3]
g d v y
<fct> <fct> <dbl> <dbl>
1 B 2018-11-30 300 NA
2 B 2018-12-31 200 NA
3 B 2019-01-31 250 50
4 C 2017-12-31 400 NA
5 A 2018-12-31 500 NA
6 A 2019-01-31 400 -100
7 A 2019-02-28 500 0
In the end I decided to create an auxiliary variable ('eoy') indicating, for each row, the row of the corresponding end-of-previous-year observation within the group. It requires a loop and is inefficient, but it facilitates the remaining computations that depend on it. The desired computation then becomes:
mutate(y = v - v[eoy])
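For reference, the 'eoy' index can also be built without a loop. Here is a minimal sketch (my own construction, not necessarily the poster's) that matches each row to its group's previous-year December 31:
library(dplyr)

x %>%
  group_by(g) %>%
  mutate(
    # date string of the previous year's December 31
    ref = paste0(as.integer(substr(as.character(d), 1, 4)) - 1, "-12-31"),
    # row index of that observation within the group, NA if absent
    eoy = match(ref, as.character(d)),
    y = v - v[eoy]
  ) %>%
  select(-ref)
With the mock data this reproduces y above, with NA wherever a group has no previous-year December 31 observation.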

Compute sum and relative proportion by group for any number of columns with random names using dplyr

I want to calculate the relative proportion by group for every column of a data frame except the grouping column. However, this should be programmed once and then used with different data frames, which will have different numbers of columns with different names. Because I am relying heavily on dplyr in this project, I want to achieve this with dplyr.
I have read this topic, regarding a similar but less complex problem:
Use dynamic variable names in `dplyr`
and also vignette("programming", "dplyr"), but I am still not able to set the quotation correctly. I am really stuck at this point and would like some advice from more experienced developers.
To reproduce the problem, I have set up a minimal example with a data frame with randomly created data columns and a grouping column.
library(dplyr)
library(stringi)

df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)),
               stri_rand_strings(3, 10, pattern = "[A-Za-z]"))
group <- c("group1","group2","group3")
df <- cbind(df, group)
The following function should achieve two things:
calculate the sum of every column in the data frame by group
calculate the relative proportions of every column in the data frame by group
propsum <- function(df, expr){
  expr_quo <- enquo(expr)
  sum <- paste(quo_name(expr), "sum", sep = ".")
  prop <- paste(quo_name(expr), "prop", sep = ".")
  df %>%
    group_by(., group) %>%
    mutate(., !! sum := sum(!! expr_quo),
           !! prop := expr / !! sum * 100) -> df
  return(df)
}

for(i in length(df)-1){
  propsum(df, names(df)[i]) -> df_new
}
The expected result is a data frame with the initial columns, the sums by group for every initial column, and the relative proportions for every initial column by group. So in the example, the data frame should have 10 columns (1 grouping column, 3 initial data columns, 3 columns with sums by group, 3 columns with relative proportions by group).
However, I am getting the following error:
Error in sum(~names(df)[i]) : invalid 'type' (character) of argument
In the vignette, the code example for a similar task is:
my_mutate <- function(df, expr) {
  expr <- enquo(expr)
  mean_name <- paste0("mean_", quo_name(expr))
  sum_name <- paste0("sum_", quo_name(expr))

  mutate(df,
         !! mean_name := mean(!! expr),
         !! sum_name := sum(!! expr)
  )
}
my_mutate(df, a)
#> # A tibble: 5 x 6
#> g1 g2 a b mean_a sum_a
#> <dbl> <dbl> <int> <int> <dbl> <int>
#> 1 1 1 5 4 3 15
#> 2 1 2 3 2 3 15
#> 3 2 1 4 1 3 15
#> 4 2 2 1 3 3 15
#> # … with 1 more row
I tried a lot of different things as of now, but I am not able to get the RHS to use the correct column. What am I doing wrong?
I have found a solution which I just want to share in case somebody faces a similar task. The problem was that passing names(df)[i] hands enquo() a character string rather than a bare column name, so sum() ends up being called on a character vector.
The solution is to call rlang::parse_expr() explicitly to turn the variable names into expressions.
Here is the working example:
library(dplyr)
library(stringi)

df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)),
               stri_rand_strings(3, 10, pattern = "[A-Za-z]"))
group <- c("group1","group2","group3")
df <- cbind(df, group)

gpercentage <- function(df, a_var, p_var, sum_var){
  df %>%
    group_by(., group) %>%
    mutate(., !! sum_var := sum(!! a_var),
           !! p_var := !! a_var / sum(!! a_var)) -> df
  return(df)
}

for(i in seq_len(length(df) - 1)){
  a_var <- rlang::parse_expr(names(df)[i])
  p_var <- rlang::parse_expr(paste(names(df)[i], "P", sep = "."))
  sum_var <- rlang::parse_expr(paste(names(df)[i], "SUM", sep = "."))
  df %>%
    gpercentage(., a_var, p_var, sum_var) -> df
}
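
With dplyr 1.0 or later the loop can be avoided entirely via across(); a minimal sketch under that assumption (the .names spec mirrors the X.SUM / X.P naming above):
library(dplyr)

df %>%
  group_by(group) %>%
  mutate(across(everything(),                       # all non-grouping columns
                list(SUM = sum, P = ~ .x / sum(.x)),
                .names = "{.col}.{.fn}")) %>%
  ungroup()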
We could also achieve this as follows:
propsum <- function(df, grouping_column){
  df %>%
    group_by(!!sym(grouping_column)) %>%
    summarise_all(list(sum, function(x) length(x)/nrow(.) * 100)) %>%
    tidyr::pivot_longer(cols = -1,
                        names_to = "Variable",
                        values_to = "Value") %>%
    mutate(Variable = gsub("fn1", "sum", Variable),
           Variable = gsub("fn2", "prop", Variable))
}
propsum(iris,"Species")
Using df in the question:
propsum(df,"group")
# A tibble: 18 x 3
group Variable Value
<fct> <chr> <dbl>
1 group1 dVFQteFGjs_sum 628
2 group1 wiQCPUeIvC_sum 599
3 group1 yBvktNXcfd_sum 644
4 group1 dVFQteFGjs_prop 33.3
5 group1 wiQCPUeIvC_prop 33.3
6 group1 yBvktNXcfd_prop 33.3
7 group2 dVFQteFGjs_sum 630
8 group2 wiQCPUeIvC_sum 606
9 group2 yBvktNXcfd_sum 656
10 group2 dVFQteFGjs_prop 33.3
11 group2 wiQCPUeIvC_prop 33.3
12 group2 yBvktNXcfd_prop 33.3
13 group3 dVFQteFGjs_sum 636
14 group3 wiQCPUeIvC_sum 581
15 group3 yBvktNXcfd_sum 635
16 group3 dVFQteFGjs_prop 33.3
17 group3 wiQCPUeIvC_prop 33.3
18 group3 yBvktNXcfd_prop 33.3
To get back to wide format (pivot_wider also works; I find spread quicker to use):
propsum(df, "group") %>%
  tidyr::spread(Variable, Value)
# A tibble: 3 x 7
group dVFQteFGjs_prop dVFQteFGjs_sum wiQCPUeIvC_prop wiQCPUeIvC_sum
<fct> <dbl> <dbl> <dbl> <dbl>
1 grou~ 33.3 628 33.3 599
2 grou~ 33.3 630 33.3 606
3 grou~ 33.3 636 33.3 581
# ... with 2 more variables: yBvktNXcfd_prop <dbl>,
# yBvktNXcfd_sum <dbl>

How to remove duplicate values of a data set and count how many times each value is duplicated?

I am getting a data set each month with reference IDs which contains duplicate values. I have to remove the duplicate IDs and count how many times each of them is duplicated.
name <- c("A","A","A","B","B","c","D","A")
age <- c(22,23,22,32,32,54,65,70)
sex <- c("m","f","f","m","m","f","m","f")
both <- data.frame(name,age,sex)
both
both[!duplicated(both$name),]
Desired output:
name age sex count
A    70  f   4
B    32  m   2
c    54  f   1
D    65  m   1
This combination gives your result:
library(dplyr)
both <- both %>% group_by(name) %>% mutate(Count = n())
both <- both %>% group_by(name) %>% slice(n())
We can group by 'name', get the frequency count (n()), then keep the rows having the most frequent value of 'sex', and slice the last row:
library(dplyr)
both %>%
  group_by(name) %>%
  group_by(n = n(), add = TRUE) %>%
  filter(sex == Mode(sex)) %>%
  slice(n())
# A tibble: 4 x 4
# Groups: name, n [4]
# name age sex n
# <fct> <dbl> <fct> <int>
#1 A 70 f 4
#2 B 32 m 2
#3 c 54 f 1
#4 D 65 m 1
where
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
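
A base R alternative along the lines of the duplicated() attempt in the question might look like this (a sketch, not taken from the answers above; the rows come out in a different order):
# keep the last occurrence of each name, then attach frequencies via table()
last <- both[!duplicated(both$name, fromLast = TRUE), ]
last$count <- as.vector(table(both$name)[as.character(last$name)])
last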

How to run a for loop for each group in a dataframe?

This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)
The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.
Say I had data like this (an ID/value table; see the sample data reconstructed in the answer below). How would I be able to do the same thing as the code above, but return a D value for each group of rows sharing an ID, rather than for the entire dataset? I imagine this requires a loop and a matrix to store the D values, with ID in one column and the value of D in the other, but I'm not sure.
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups: ID [3]
# ID value value.mean
# <fct> <int> <dbl>
#1 a 13 12.6
#2 a 14 12.6
#3 a 12 12.6
#4 a 13 12.6
#5 a 11 12.6
#6 b 12 15.5
#7 b 19 15.5
#8 cc4 10 10.0
Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.
library(tidyverse)
df %>%
  group_by(ID) %>%
  summarise(value.mean = mean(value))
## A tibble: 3 x 2
# ID value.mean
# <fct> <dbl>
#1 a 12.6
#2 b 15.5
#3 cc4 10.0
The same can be achieved in base R using one of tapply, ave, or by; a tapply sketch follows the sample data below. As far as I understand your problem statement there is no need for a for loop. Just apply a function (per group).
Sample data
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
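For instance, the grouped mean via tapply (my illustration, not part of the original answer):
tapply(df$value, df$ID, mean)
#    a    b  cc4
# 12.6 15.5 10.0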
Update
To conclude from the comments and chat, this should be what you're after.
# Sample data
set.seed(2017)
csvdata <- data.frame(
  microsat = rep(c("A", "B", "C"), each = 8),
  allele = sample(20, 3 * 8, replace = T))

csvdata %>%
  group_by(microsat) %>%
  summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
# microsat D
# <fct> <dbl>
#1 A 0.844
#2 B 0.812
#3 C 0.812
Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
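For instance (a quick illustration of prop.table, not part of the original answer):
prop.table(table(c("a", "a", "b")))
#         a         b
# 0.6666667 0.3333333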
A base R option would be
df1$value.mean <- with(df1, ave(value, ID))
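Likewise, the D computation itself can be done per group in base R; a tapply sketch assuming the csvdata sample above:
# D per group: 1 minus the sum of squared allele proportions
tapply(csvdata$allele, csvdata$microsat,
       function(x) 1 - sum(prop.table(table(x))^2))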
