How to use R to merge redundant information?

It's hard to describe what I mean, so here is an example. I have the following data frame:
A 1013574 1014475
A 1014005 1014475
A 1014005 1014435
I want to merge these rows into A 1013574 1014475. Is there a function that can help me achieve this?
My desired output has one row for each ID (in my case the value "A"); the second column contains the smallest value and the third column the largest value for each ID.

This is an updated answer. I think this is what you want. I added rows for a second ID, so you can see how it works with more than one group.
library(dplyr)
df <- tibble(a = c("A", "A", "A","B", "B", "B" ),
v1 = as.numeric(c(1013574,1014005,1014005, 1014005, 1014305, 1044005)),
v2 = as.numeric(c(1014475, 1014475,1014435, 1014435, 1014435, 1314435)))
df_new <-df %>% group_by(a) %>% mutate(v1 = min(v1),
v2 = max(v2)) %>%
distinct()
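As a possible alternative, summarise() collapses each group to a single row directly, so the distinct() step is not needed. This is a minimal sketch that reuses the example data from the answer above; the name df_summary is only illustrative.

```r
library(dplyr)

# Same example data as in the answer above
df <- tibble(
  a  = c("A", "A", "A", "B", "B", "B"),
  v1 = c(1013574, 1014005, 1014005, 1014005, 1014305, 1044005),
  v2 = c(1014475, 1014475, 1014435, 1014435, 1014435, 1314435)
)

# One row per ID: smallest v1 and largest v2 within each group
df_summary <- df %>%
  group_by(a) %>%
  summarise(v1 = min(v1), v2 = max(v2))
```

Unlike mutate() followed by distinct(), this keeps only the grouping column and the summarised columns, which is fine here because there are no other columns to preserve.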

Related

Summarise multiple functions at once using tidyeval in dplyr 1.0

Say we have a data frame,
library(tidyverse)
library(rlang)
df <- tibble(id = rep(c(1:2), 10),
grade = sample(c("A", "B", "C"), 20, replace = TRUE))
we would like to get the proportion of each grade, grouped by id,
df %>%
group_by(id) %>%
summarise(
n = n(),
mu_A = mean(grade == "A"),
mu_B = mean(grade == "B"),
mu_C = mean(grade == "C")
)
I am handling a case where there are multiple conditions (many grades in this case) and would like to make my code more robust. How can we simplify this using tidy evaluation in dplyr 1.0?
I am talking about the idea of generating multiple column names by passing all grades at once, without breaking the flow of piping in dplyr, something like
# how to get the mean of A, B, C all at once?
mu_{grade} := mean(grade == {grade})
I actually found the answer to my own question from a post that I wrote 2 years ago...
I am just going to post the code right below hoping to help anybody that comes across the same problem.
make_expr <- function(x) {
x %>%
map( ~ parse_expr(str_glue("mean(grade == '{.x}')")))
}
# generate multiple expressions
grades <- c("A", "B", "C")
exprs <- grades %>% make_expr() %>% set_names(paste0("mu_", grades))
# we can 'top up' something extra by adding named element
exprs <- c(n = parse_expr("n()"), exprs)
# using the big bang operator `!!!` to force expressions in data frame
df %>% group_by(id) %>% summarise(!!!exprs)
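A variation on the same idea, sketched here under the assumption that the grade values are known up front: the expressions can be built with rlang::expr() and `!!` instead of parsing strings, which avoids quoting problems if a grade label ever contains a quote character. The set.seed() call is only there to make the example reproducible.

```r
library(dplyr)
library(purrr)
library(rlang)

set.seed(1)
df <- tibble(
  id    = rep(1:2, 10),
  grade = sample(c("A", "B", "C"), 20, replace = TRUE)
)

grades <- c("A", "B", "C")

# Build one unevaluated call per grade, e.g. mean(grade == "A")
exprs <- grades %>%
  map(function(g) expr(mean(grade == !!g))) %>%
  set_names(paste0("mu_", grades))

# Splice the named expressions into summarise()
res <- df %>%
  group_by(id) %>%
  summarise(n = n(), !!!exprs)
```

The result has the columns id, n, mu_A, mu_B and mu_C, the same shape as the hand-written summarise() call in the question.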

Identifying maximum value in a row, from multiple columns, with an output including all columns in the dataset?

I have a fairly large dataset, and I need to determine the maximum value of each row, from several columns. So in the below sample data, for "II" what the highest value is, and if the highest value is in "N" or "P". I know very similar questions to this have been posted previously, however I need the output to not remove the other metadata columns in my dataset. This also means I need to specify the range of columns which should be included in the "max" query.
Thanks in advance for any guidance with this.
# tibble() replaces the deprecated data_frame()
data <- tibble(Exp = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII"),
               N = c(8.77, 1.67, 7.47, 7.58, 1.1, 8.9, 7.5, 7.7),
               P = c(1.848, 3.029, 1.925, 2.725, 1.900, 3.100,
                     2.000, 9.800))
I have tried several variations of the below code
test %>%
mutate(Max = pmax(!!! rlang::syms(names(.)[c("N", "P"),]))) %>%
group_by(data, Exp) %>%
summarise(Max = max(Max))
and receive the error:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "function"
This is my first question asked on here, so apologies for any incorrect formatting etc, any advice on this (and my question) would be much appreciated.
I am considering this in two steps:
1. Find the max value across the columns.
2. Find the label that matches the max value (assuming no ties).
If you only have two columns N and P then this is straightforward to do using case_when.
data2 = data %>%
mutate(max_val = pmax(N,P)) %>% # find max
mutate(source = case_when(max_val == N ~ "N", # find label
max_val == P ~ "P"))
However, if the number of columns, or the column names, is dynamic then this becomes harder. I have the following working:
cols = c("N", "P") # list of column names to work with
data2 = data %>%
mutate(max_val = pmax(!!!syms(cols))) %>% # find max
mutate(source = NA) # initialize blank labels
# iterate to find labels
data3 = data2
for(c in cols)
data3 = mutate(data3, source = ifelse(is.na(source) & max_val == !!sym(c), c, source))
There is probably a way to combine sym with case_when so you do not have to iterate over the labels. If someone finds it, please post an update to this answer.
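One way to avoid the loop, offered as a sketch rather than the case_when solution asked for above, is base R's max.col(), which returns the position of the largest value in each row of a matrix. Indexing the column-name vector with that position gives the label. Ties are resolved to the first matching column here, which is an assumption you may want to change.

```r
library(dplyr)

data <- tibble(
  Exp = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII"),
  N   = c(8.77, 1.67, 7.47, 7.58, 1.1, 8.9, 7.5, 7.7),
  P   = c(1.848, 3.029, 1.925, 2.725, 1.900, 3.100, 2.000, 9.800)
)

cols <- c("N", "P")  # columns to compare

# Row-wise maximum over the chosen columns; all other columns are kept
data2 <- data %>%
  mutate(max_val = pmax(!!!rlang::syms(cols)))

# Position of the row maximum, translated back to a column name
data2$source <- cols[max.col(as.matrix(data2[cols]), ties.method = "first")]
```

Because cols is an ordinary character vector, the same code works for any number of columns.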
Looking to solve the same problem, I found a different solution that is clearer to me. A few notes:
cur_data() returns the data for the current group, which under rowwise() is the current row (it is deprecated in favor of pick() as of dplyr 1.1.0).
rowwise() can take columns that act like grouping variables when used with summarise().
ungroup() is needed to revert to the default column-wise behavior.
summarise() drops the non-grouping variables.
# using names
v = c('N', 'P')
data %>% rowwise %>% mutate(m=max(cur_data()[v])) %>% ungroup
# using ranges of column positions (start and end depend on your own data)
start = 8
end = 25
data %>% rowwise %>% mutate(m=max(cur_data()[start:end])) %>% ungroup
# using summarize
data %>% rowwise(Exp) %>% summarize(m=max(cur_data()))
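The same row-wise idea can also be written with c_across(), which dplyr provides for selecting columns inside rowwise() operations. This is a sketch using the sample data from the question; all_of(v) selects the columns named in the character vector v.

```r
library(dplyr)

data <- tibble(
  Exp = c("I", "II", "III", "IV", "V", "VI", "VII", "VIII"),
  N   = c(8.77, 1.67, 7.47, 7.58, 1.1, 8.9, 7.5, 7.7),
  P   = c(1.848, 3.029, 1.925, 2.725, 1.900, 3.100, 2.000, 9.800)
)

v <- c("N", "P")

# For each row, take the maximum over the columns named in v
res <- data %>%
  rowwise() %>%
  mutate(m = max(c_across(all_of(v)))) %>%
  ungroup()
```

All original columns are kept and m is added as a new column.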

dplyr: how to ignore NA in grouping variable

Using dplyr, I'm trying to group by two variables. If there is an NA in one variable but the other variable matches, I'd still like to see those rows grouped, with the NA taking on the non-NA value. So if I have a data frame like this:
variable_A <- c("a", "a", "b", NA, "f")
variable_B <- c("c", "d", "e", "c", "c")
variable_C <- c(10, 20, 30, 40, 50)
df <- data.frame(variable_A, variable_B, variable_C)
If I group by variable_A and variable_B, rows 1 and 4 normally wouldn't be grouped together, but I'd like them to be, with the NA overridden to "a". How can I achieve this? The code below doesn't do the job.
df2 <- df %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))
You can group by B first, and then fill in the missing A values. Then proceed with what you wanted to do:
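df_filled = df %>%
  group_by(variable_B) %>%
  # coalesce() replaces only the NA entries, so existing values such as "f" are kept
  mutate(variable_A = coalesce(variable_A, first(na.omit(variable_A))))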
df_filled %>%
group_by(variable_A, variable_B) %>%
summarise(total=sum(variable_C))
You could do the missing value imputation using base R as follows:
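df_filled <- df
for (i in which(is.na(df$variable_A))) {
  # non-missing variable_A values from rows that share this row's variable_B
  candidates <- df$variable_A[df$variable_B == df$variable_B[i] & !is.na(df$variable_A)]
  if (length(candidates) > 0) df_filled$variable_A[i] <- candidates[1]
}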
Then group and summarize as planned with dplyr
df_filled %>%
group_by(variable_A, variable_B) %>%
dplyr::summarise(total=sum(variable_C))
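Another option, sketched here as an alternative rather than as the method of either answer, is tidyr::fill(), which fills missing values from neighbouring rows within each group. With .direction = "downup" an NA takes the nearest non-missing value above it in the group, or the nearest one below if nothing precedes it. Non-missing values are left untouched.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  variable_A = c("a", "a", "b", NA, "f"),
  variable_B = c("c", "d", "e", "c", "c"),
  variable_C = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(variable_B) %>%
  # fill NAs in variable_A from other rows that share the same variable_B
  fill(variable_A, .direction = "downup") %>%
  group_by(variable_A, variable_B) %>%
  summarise(total = sum(variable_C), .groups = "drop")
```

Here row 4 is filled with "a" and then grouped with row 1, while row 5 keeps its own value "f".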

Using dplyr collapse rows taking condition from another numeric column

An example df:
experiment = c("A", "A", "A", "A", "A", "B", "B", "B")
count = c(1,2,3,4,5,1,2,1)
df = cbind.data.frame(experiment, count)
Desired output:
experiment_1 = c("A", "A", "A", "B", "B")
freq = c(1,1,3,2,1) # frequency
freq_per = c(20,20,60,66.6,33.3) # frequency percent
df_1 = cbind.data.frame(experiment_1, freq, freq_per)
I want to do the following:
1. Group df using experiment
2. Calculate freq using the count column
3. Calculate freq_per
4. Calculate sum of freq_per for all observations with count >= 3
I have the following code. How do I do step 4?
freq_count = df %>%
  dplyr::group_by(experiment, count) %>%
  summarize(freq = n()) %>%
  na.omit() %>%
  mutate(freq_per = freq / sum(freq) * 100)
Thank you very much.
There may be a more concise approach but I would suggest collapsing your count in a new column using mutate() and ifelse() and then summarising:
freq_count %>%
mutate(collapsed_count = ifelse(count >= 3, 3, count)) %>%
group_by(collapsed_count, .add = TRUE) %>% # adds a 2nd grouping var (`.add` replaced `add` in dplyr 1.0.0)
summarise(freq = sum(freq), freq_per = sum(freq_per)) %>%
select(-collapsed_count) # dropped to match your df_1.
Also, just fyi, for step 2 you might consider the count() function if you're keen to save some keystrokes. Also tibble() or data.frame() are likely better options than calling the dataframe method of cbind explicitly to create a data frame.
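Putting those suggestions together, a compact end-to-end version might look like the sketch below. It collapses the counts first and then uses count() with its name argument, so the frequencies only need to be computed once. The result name res is illustrative.

```r
library(dplyr)

df <- tibble(
  experiment = c("A", "A", "A", "A", "A", "B", "B", "B"),
  count      = c(1, 2, 3, 4, 5, 1, 2, 1)
)

res <- df %>%
  # treat every count of 3 or more as a single category
  mutate(count = ifelse(count >= 3, 3, count)) %>%
  # frequency of each (experiment, count) combination
  count(experiment, count, name = "freq") %>%
  # percentage within each experiment
  group_by(experiment) %>%
  mutate(freq_per = freq / sum(freq) * 100) %>%
  ungroup()
```

This keeps the collapsed count column in the output; drop it with select(-count) if you want to match df_1 exactly.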

Select columns using a vector

I am trying to create a new data frame with 2 columns: var1 and var2, each one of them is the row sum of specific columns in data frame sampData.
library(dplyr)
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
I know that I can select columns using [] and c(), like this:
sampData[ ,c("A","B")]
but when I try to generate and use that format from my vectors like this:
d1_ <-paste(var1, collapse=",")
d2_ <-paste(var2, collapse=",")
sampData[ ,d1_]
I get this error:
Error in `[.data.frame`(sampData, , d1_) : undefined columns selected
Which I also get if I try to calculate the rowSums -- which is what I am interested in getting.
data.frame(var1 = rowSums(sampData[ , d1_])
, var2 = rowSums(sampData[ , d2_]))
I think I have managed to figure out what you are asking, but if I am wrong, let me know.
You are trying to select the columns of sampData that match the values in var1 and var2, and sum across the rows, limited to the columns that matched each.
It is always better to provide reproducible data, here is some for this case (using dplyr to build it):
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
Then, you don't need to concatenate the column indices at all -- just use the variable (or column, in your case) directly. Here, I have made the ID's letters and will match the letters. However, if your ID's are numeric, it will match that index (e.g., 3 will return the third column).
data.frame(
var1sums = rowSums(sampData[, var1])
, var2sums = rowSums(sampData[, var2])
)
Of note, cat returns NULL after printing to the screen. If you need to concatenate values, you will need to use paste (or similar), but that will not work for what you are trying to do here.
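To see why the pasted string fails, it may help to inspect what paste() actually produces. The sketch below uses the same sampData construction as above (in base R only, with an arbitrary seed) and shows that the collapsed string is a single value that matches no column name, while the original vector matches three.

```r
set.seed(42)
sampData <- setNames(data.frame(matrix(rnorm(260), ncol = 26)), LETTERS)

var1 <- c("A", "B", "C")
d1_  <- paste(var1, collapse = ",")

# paste(..., collapse = ",") yields ONE string, "A,B,C", not three names
n_pasted  <- length(d1_)
is_column <- d1_ %in% names(sampData)

# The character vector itself selects the three columns
selected <- sampData[, var1]
```

Here n_pasted is 1 and is_column is FALSE, which is exactly the "undefined columns selected" situation, whereas selected is a data frame with columns A, B and C.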
This question got me thinking about flexibility of such solutions, so here is an attempt using dplyr and tidyr, which yields effectively the same result. The difference is that this may provide more flexibility for variable selection or even downstream processing.
sampData %>%
# add column for individual
mutate(ind = 1:nrow(.)) %>%
# convert data to long format
gather("Variable", "Value", -ind) %>%
# Set to group by the individual we added above
group_by(ind) %>%
# Calculate sums as desired
summarise(
var1sums = sum(Value[Variable %in% var1])
, var2sums = sum(Value[Variable %in% var2])
)
However, the real advantage would come if you had an arbitrary number (or just a large number generally) of sets of variables whose sums you wanted. Instead of manually constructing every column you might be interested in, you can use standard evaluation (as opposed to non-standard) to automatically generate the columns based on a named list of vectors (varList below is an example list built from var1 and var2):
# example named list; the names become the output column names
varList <- list(var1sums = var1, var2sums = var2)
sampData %>%
  mutate(ind = 1:nrow(.)) %>%
  gather("Variable", "Value", -ind) %>%
  group_by(ind) %>%
  # Calculate one column for each vector in `varList`
  # (note: summarise_() is deprecated in current versions of dplyr)
  summarise_(
    .dots = lapply(varList, function(x){
      paste0("sum(Value[Variable %in% c('"
             , paste(x, collapse = "', '")
             , "')])")
    })
  )
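Since summarise_() is deprecated in current dplyr, a version that does not depend on it may be useful. The sketch below is one base R possibility, not the answerer's method: it applies rowSums() to each set of columns in a named list. The list varList and its element names are illustrative assumptions.

```r
set.seed(42)
sampData <- setNames(data.frame(matrix(rnorm(260), ncol = 26)), LETTERS)

# Hypothetical named list: each element is one set of columns to sum
varList <- list(
  var1sums = c("A", "B", "C"),
  var2sums = c("D", "E", "F", "G")
)

# One output column per list element; drop = FALSE keeps single-column sets working
sums <- as.data.frame(
  lapply(varList, function(cols) rowSums(sampData[, cols, drop = FALSE]))
)
```

Adding another set of columns only requires adding another element to varList.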
