I'd like to get, for each name under Species_a and Species_b, the cumulative sum of the corresponding records in the smaller column as two new columns. Each sum should sit in the same row but exclude the value for that row. The smaller column lists which species column has the smaller width.
Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
1 versicolor virginica 2.5 3.0 2022-05-05 a
2 versicolor virginica 2.6 2.8 2022-04-04 a
3 versicolor setosa 2.2 4.4 2021-03-03 a
4 setosa virginica 4.2 2.5 2021-02-02 b
5 virginica setosa 3.0 3.4 2020-01-01 a
Ideally the format of the data would be in the same format as it is now, and the summation would be based off of the smaller, Date, Species_a, and Species_b columns alone. I tried to create a count column but I get stuck on properly accumulating based on Date being less than the current value for that column.
My desired output would be as follows:
Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller smaller_sum_a smaller_sum_b
1 versicolor virginica 2.5 3.0 2022-05-05 a 2 2
2 versicolor virginica 2.6 2.8 2022-04-04 a 1 2
3 versicolor setosa 2.2 4.4 2021-03-03 a 0 0
4 setosa virginica 4.2 2.5 2021-02-02 b 0 1
5 virginica setosa 3.0 3.4 2020-01-01 a 0 0
Code:
library(tidyverse)
set.seed(12)
data_a <- iris[sample(1:nrow(iris)), ] %>%
head()
colnames(data_a) <- paste0(colnames(data_a), "_a")
data_b <- iris[sample(1:nrow(iris)), ] %>%
tail()
colnames(data_b) <- paste0(colnames(data_b), "_b")
data <- bind_cols(data_a, data_b) %>%
filter(Species_a != Species_b) %>%
select(Species_a,
Species_b,
Sepal.Width_a,
Sepal.Width_b) %>%
mutate(Date = c('2022-05-05', '2022-04-04', '2021-03-03', '2021-02-02', '2020-01-01'),
smaller = ifelse(Sepal.Width_a > Sepal.Width_b, 'b',
ifelse(Sepal.Width_a < Sepal.Width_b, 'a', NA)))
I don't know if this is a solution, but it might be a start.
How exactly are the new columns calculated? It looks like smaller_sum_a is the number of consecutive rows where species a has the smaller value, minus one. But I don't think the same works for smaller_sum_b? Or is it just the cumulative number of days where each species has the smaller value, minus 1, but with zero if the species isn't smaller in that row (again, this doesn't check out for smaller_sum_b though...).
As for determining if Date is less than the current value, firstly you'll want to tell R that your Date column is actually a date, rather than just a character.
Easiest way to see what format it is in is to make your data (not a good name for your data btw, preferably make it something that R or the computer wouldn't use, like my_data) a tibble rather than a data.frame. tibbles tell you what format each column is in which is handy.
data %>%
tibble
# # A tibble: 5 x 6
# Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
# <fct> <fct> <dbl> <dbl> <chr> <chr>
# 1 versicolor virginica 2.5 3 2022-05-05 a
# 2 versicolor virginica 2.6 2.8 2022-04-04 a
# 3 versicolor setosa 2.2 4.4 2021-03-03 a
# 4 setosa virginica 4.2 2.5 2021-02-02 b
# 5 virginica setosa 3 3.4 2020-01-01 a
The bits inside the < > under the column names tell you the formats, <fct> is factor, <dbl> is numeric (see here for explanation) and <chr> is character.
So, we want to make Date into a date format, which we can do with the ymd() (year-month-day) function from lubridate. Also, I rearranged the data so the rows are in chronological order (earliest at the top), because that's how things are normally arranged, and makes more sense to me, especially if you're interested in cumulative sums.
library(lubridate) # provides ymd(); not attached by older tidyverse versions
data %>%
tibble %>%
mutate(
Date = ymd(Date)
) %>%
arrange(Date) %>%
{. ->> my_data}
my_data
# # A tibble: 5 x 6
# Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
# <fct> <fct> <dbl> <dbl> <date> <chr>
# 1 virginica setosa 3 3.4 2020-01-01 a
# 2 setosa virginica 4.2 2.5 2021-02-02 b
# 3 versicolor setosa 2.2 4.4 2021-03-03 a
# 4 versicolor virginica 2.6 2.8 2022-04-04 a
# 5 versicolor virginica 2.5 3 2022-05-05 a
We can see that R now recognises that the Date column is a date, and is now in the R-recognised <date> format.
Now this is where I'm not 100% sure on exactly how you want to calculate your new columns, but for example you can use ifelse() to determine if species a is smaller, and then calculate the cumulative sum of the days where it was smaller.
my_data %>%
mutate(
s_a = ifelse(smaller == 'a', 1, 0),
smaller_sum_a = cumsum(s_a),
)
# # A tibble: 5 x 8
# Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller s_a smaller_sum_a
# <fct> <fct> <dbl> <dbl> <date> <chr> <dbl> <dbl>
# 1 virginica setosa 3 3.4 2020-01-01 a 1 1
# 2 setosa virginica 4.2 2.5 2021-02-02 b 0 1
# 3 versicolor setosa 2.2 4.4 2021-03-03 a 1 2
# 4 versicolor virginica 2.6 2.8 2022-04-04 a 1 3
# 5 versicolor virginica 2.5 3 2022-05-05 a 1 4
As long as either a) the Date column is in an R-recognised <date> format, or b) it is arranged chronologically, you can use the less than or greater than operators < & > to determine if dates are before/after a given row.
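To make the comparison concrete, here is a small sketch using base R's as.Date(). The example dates and the cutoff are made up for illustration. It also shows why the conversion is worth doing: dates kept as text in a month/day/year layout are compared character by character, not chronologically.

```r
# Dates stored as text in m/d/Y order compare character by character
chr_early <- "12/01/2020"
chr_late  <- "02/01/2021"
chr_compare <- chr_early < chr_late    # FALSE, although 2020 comes before 2021

# After conversion, < and > compare the actual dates
d_early <- as.Date(chr_early, format = "%m/%d/%Y")
d_late  <- as.Date(chr_late,  format = "%m/%d/%Y")
date_compare <- d_early < d_late       # TRUE

# Flag the rows that fall before a (made-up) cutoff date
dates  <- as.Date(c("2022-05-05", "2021-03-03", "2020-01-01"))
before <- dates < as.Date("2021-06-01")
before
```

Text dates in year-month-day order, like the ones in the question, happen to sort correctly as strings, but converting them avoids relying on that.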
This is a good resource for understanding how R treats dates and times, and is well worth a read https://r4ds.had.co.nz/dates-and-times.html
Here is my current solution. I'd like to avoid plyr if I can help it, since I heard it breaks some of dplyr's functions. I feel like there is definitely a more efficient and modern way of solving this issue but I can't seem to find it.
library(plyr)
library(lubridate)
# creating counts for smaller sums for a
data$Date <- lubridate::parse_date_time(x = data$Date, # standardizing date (outside of the reproducible example there are two date types)
orders = c("%m/%d/%Y", "%Y-%m-%d"))
A_rn <- mutate(filter(select(data,
Species_a,
Date,
smaller),
smaller == 'a'),
smaller_ct_a = 1)
# creating counts for smaller sums for b
BtoA_rn <- mutate(filter(select(data,
Species_b,
Date,
smaller),
smaller == 'b'), # calling Species_b Species_a for easier joining
Species_a = Species_b,
smaller_ct_a = 1) %>%
select(Species_a, Date, smaller, smaller_ct_a)
# cumsum for both a and b
A <- ddply(bind_rows(A_rn, BtoA_rn) %>%
arrange(Date),
.(Species_a), transform,
smaller_sum_a = lag(cumsum(replace_na(smaller_ct_a, 0)))) %>%
select(-smaller_ct_a)
# naming adjustment
B <- A %>% filter(smaller == "b") %>%
select(-smaller)
names(B) <- gsub(x = names(B), pattern = "_a", replacement = "_b")
A <- A %>% filter(smaller == "a") %>%
select(-smaller)
data <- left_join(data, A, by = c("Species_a", "Date")) %>%
left_join(B, by = c("Species_b", "Date"))
data[is.na(data)] <- 0
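For comparison, here is a plyr-free sketch in base R. It rests on one reading of the desired output, which is an assumption: each new column counts the strictly earlier rows (by Date) in which that row's species had the smaller width, whichever side it appeared on. The table is rebuilt by hand from the question with species as character, and count_prior is a hypothetical helper name.

```r
# Rebuild the example table by hand (values copied from the question)
my_data <- data.frame(
  Species_a = c("versicolor", "versicolor", "versicolor", "setosa", "virginica"),
  Species_b = c("virginica", "virginica", "setosa", "virginica", "setosa"),
  Date      = as.Date(c("2022-05-05", "2022-04-04", "2021-03-03",
                        "2021-02-02", "2020-01-01")),
  smaller   = c("a", "a", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# Species that had the smaller width in each row
winner <- ifelse(my_data$smaller == "a", my_data$Species_a, my_data$Species_b)

# Hypothetical helper: for every row, count the strictly earlier rows
# that were "won" by the species named in species_col for that row
count_prior <- function(species_col) {
  vapply(seq_len(nrow(my_data)), function(i) {
    sum(winner == species_col[i] & my_data$Date < my_data$Date[i], na.rm = TRUE)
  }, integer(1))
}

my_data$smaller_sum_a <- count_prior(my_data$Species_a)
my_data$smaller_sum_b <- count_prior(my_data$Species_b)
my_data
```

On the five example rows this matches the desired output in the question. It compares every row with every other row, so it is only a starting point for small tables; the row order does not matter because the comparison uses Date directly.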
I have the following data frame:
df<- splitstackshape::stratified(iris, group="Species", size=1)
I want to make a z-score for each species including all of the variables. I can do this manually by finding the SD and mean for each row and using the appropriate formula, but I need to do this several times over and would like to find a more efficient way.
I tried using scale(), but can't figure out how to get it to do the row-wise calculation that includes several variables and a grouping variable.
Using dplyr::group_by returns a "'x' must be numeric variable" error.
Are you sure the question is about taking one z-score for each group? A z-score is calculated for each value.
Let's say the function used to take the z-score is one of:
scale(x, center = TRUE, scale = TRUE)
Or
function_zscore = function(x){(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)}
With both, if the argument x is a vector, the result contains one z-score per element of x (scale() returns it as a one-column matrix, hence the [,1] below).
df<- splitstackshape::stratified(iris, group="Species", size=1)
df <- tidyr::pivot_longer(df, cols = c(1:4), names_to = "var.name", values_to = "value")
df %>%
group_by(Species) %>%
mutate(zscore = scale(value, center = TRUE, scale = TRUE)[,1])
## A tibble: 12 x 4
## Groups: Species [3]
# Species var.name value zscore
# <fct> <chr> <dbl> <dbl>
# 1 setosa Sepal.Length 4.9 1.22
# 2 setosa Sepal.Width 3.1 0.332
# 3 setosa Petal.Length 1.5 -0.455
# 4 setosa Petal.Width 0.2 -1.09
# 5 versicolor Sepal.Length 5.9 1.10
# 6 versicolor Sepal.Width 3.2 -0.403
# 7 versicolor Petal.Length 4.8 0.486
# 8 versicolor Petal.Width 1.8 -1.18
# 9 virginica Sepal.Length 6.5 1.14
#10 virginica Sepal.Width 3 -0.574
#11 virginica Petal.Length 5.2 0.501
#12 virginica Petal.Width 2 -1.06
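The setosa rows above can be checked by hand. This small sketch takes the four setosa values from the printed output (the sample in this answer is random, so another run may draw a different row) and confirms that the textbook formula and scale() agree.

```r
x <- c(4.9, 3.1, 1.5, 0.2)          # setosa values from the printed output

z_manual <- (x - mean(x)) / sd(x)   # z-score by the textbook formula
z_scale  <- scale(x)[, 1]           # scale() gives a one-column matrix; [, 1] makes it a vector

round(z_manual, 3)
same <- isTRUE(all.equal(z_manual, z_scale))
same
```

Because the scores are centred within the group, they always average to zero, which is why a single "z-score per group" is not meaningful.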
If we still hope to get a score for each group to describe how a sample deviates around the mean, a possible solution could be getting the coefficient of variation?
df %>%
group_by(Species) %>%
summarise(coef.var = 100*sd(value)/mean(value))
## A tibble: 3 x 2
# Species coef.var
# <fct> <dbl>
#1 setosa 83.8
#2 versicolor 45.8
#3 virginica 49.0
Let's say we want to calculate the means of sepal length based on tercile groups of sepal width.
We can use the split_quantile function from the fabricatr package and do the following:
iris %>%
group_by(split_quantile(Sepal.Width, 3)) %>%
summarise(Sepal.Length = mean(Sepal.Length))
So far so good. Now, let's say we want to group_by(Species, split_quantile(Sepal.Width, 3)) instead of just group_by(split_quantile(Sepal.Width, 3)).
However, what if we want the terciles to be calculated within each species rather than across the whole dataset?
Basically, what I'm looking for could be achieved by splitting iris into several dataframes based on Species, using split_quantile on those dataframes to calculate terciles and then joining the dataframes back together. However, I'm looking for a way to do this without splitting the dataframe.
You kinda have written the answer in your text, but you can create a new variable for tercile after grouping by species, then regroup with both Species and Tercile.
library(tidyverse)
library(fabricatr)
iris %>%
group_by(Species) %>%
mutate(Tercile = split_quantile(Sepal.Width, 3)) %>%
group_by(Species, Tercile) %>%
summarise(Sepal.Length = mean(Sepal.Length))
#> # A tibble: 9 x 3
#> # Groups: Species [3]
#> Species Tercile Sepal.Length
#> <fct> <fct> <dbl>
#> 1 setosa 1 4.69
#> 2 setosa 2 5.08
#> 3 setosa 3 5.27
#> 4 versicolor 1 5.61
#> 5 versicolor 2 6.12
#> 6 versicolor 3 6.22
#> 7 virginica 1 6.29
#> 8 virginica 2 6.73
#> 9 virginica 3 6.81
Created on 2020-05-27 by the reprex package (v0.3.0)
I'm trying to dplyr::summarise a dataset (collapse) by different summarise_at/summarise_if functions so that I have the same named variables in my output dataset. Example:
library(tidyverse)
data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris$letter <- as.factor(rep(c("A","B","C"),each=50)) ## factor column
head(iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species year color letter
1 5.1 3.5 1.4 0.2 setosa 2000 red A
2 4.9 3.0 1.4 0.2 setosa 2000 red A
3 4.7 3.2 1.3 0.2 setosa 2000 red A
The resulting dataset should look like this:
full
Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 setosa 2000 87 6.2 5.8 1.9 A red
2 setosa 3000 84.4 6.1 5.5 1.9 A red
3 versicolor 2000 69.4 33.6 7 4.9 B green
4 versicolor 3000 69.1 32.7 6.8 5.1 B green
5 virginica 2000 73.2 51.1 7.7 6.9 C blue
6 virginica 3000 75.5 50.2 7.9 6.4 C blue
I can achieve this by doing the following which is a bit repetitive:
sums <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Width")), list(sum))
max <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Length")), list(max))
last <- iris %>%
group_by(Species, year) %>%
summarise_if(is.factor, list(last))
first <- iris %>%
group_by(Species, year) %>%
summarise_if(is.character, list(first))
full <- full_join(sums, max) %>% full_join(last) %>% full_join(first)
I have found similar approaches below but can't figure out the approach I've tried here. I would prefer not to make my own function as I think something like this is cleaner by passing everything through a pipe and joining:
test <- iris %>%
#group_by(.vars = vars(Species, year)) %>% #why doesnt this work?
group_by_at(.vars = vars(Species, year)) %>% #doesnt work
{left_join(
summarise_at(., vars(matches("Width")), list(sum)),
summarise_at(., vars(matches("Length")), list(max)),
summarise_if(., is.factor, list(last)),
summarise_if(., is.character, list(first))
)
} #doesnt work
This doesn't work. Any suggestions or other approaches?
Helpful:
How can I use summarise_at to apply different functions to different columns?
Summarize different Columns with different Functions
Using dplyr summarize with different operations for multiple columns
By default, the dplyr::left_join() function only accepts two data frames. If you want to use this function with more than two data frames, you can iterate it with the Reduce function (base R function):
iris %>%
group_by(Species, year) %>%
{
Reduce(
function(x, y) left_join(x, y),
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
))
}
# Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 setosa 2000 87 6.2 5.8 1.9 A red
# 2 setosa 3000 84.4 6.1 5.5 1.9 A red
# 3 versicolor 2000 69.4 33.6 7 4.9 B green
# 4 versicolor 3000 69.1 32.7 6.8 5.1 B green
# 5 virginica 2000 73.2 51.1 7.7 6.9 C blue
# 6 virginica 3000 75.5 50.2 7.9 6.4 C blue
Furthermore, notice that I had to call the functions with their package prefix (::) to avoid name clashes with the previously created data frames of the same names (max, last, first).
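The name clash can be reproduced in isolation. In this sketch a small made-up data frame is stored under the name max, as happens in the question; when max is then passed as a value (for example inside list(max)) R finds the data frame, while the :: prefix still reaches the function.

```r
max <- data.frame(a = 1)          # a data frame now sits under the name `max`

shadowed <- is.function(max)      # FALSE: used as a value, `max` is the data frame
prefixed <- is.function(base::max)  # TRUE: the namespace prefix reaches the function

# A direct call still works, because R skips non-functions when resolving a call
largest <- max(c(1, 5))
via_namespace <- base::max(c(1, 5))

rm(max)                           # remove the shadowing object again
```

So the problem only shows up when the function is handed over by name, which is exactly what summarise_at(..., list(max)) does.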
Robbing @Ulises' idea and using purrr::reduce instead of Reduce is an alternative:
iris %>%
group_by(Species, year) %>%
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
) %>%
.[c(2:5)] %>%
reduce(left_join)
OR solution with curly brackets to suppress the first argument:
iris %>%
group_by(Species, year) %>%
{
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
)
} %>%
reduce(left_join)
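With dplyr 1.0.0 or later, the joins can probably be avoided altogether by putting several across() calls into one summarise(). This is a sketch rather than one of the answers above; it works on a copy called iris2 (a name chosen here) so the built-in iris and any objects named max, first or last are left alone.

```r
library(dplyr)

# Same setup as in the question, on a copy so the built-in iris stays untouched
iris2 <- iris
iris2$year   <- rep(c(2000, 3000), each = 25)                 # for grouping
iris2$color  <- rep(c("red", "green", "blue"), each = 50)     # character column
iris2$letter <- as.factor(rep(c("A", "B", "C"), each = 50))   # factor column

full_across <- iris2 %>%
  group_by(Species, year) %>%
  summarise(
    across(matches("Width"), sum),        # sum of the Width columns
    across(matches("Length"), max),       # maximum of the Length columns
    across(where(is.factor), last),       # last value of factor columns
    across(where(is.character), first),   # first value of character columns
    .groups = "drop"
  )
full_across
```

Grouping columns are left out of across() automatically, so Species is not touched by the is.factor step.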
I have a dataframe including lab results for individual subjects. Some of the subjects have duplicate records, only that the duplicates have certain data points missing in one record, but not in the other.
I'm trying to write a function that will "fill in" the NA data points in one line from any duplicate that may exist for that subject. Here's what I tried:
# example data with duplicate IDs, some with missing values
ir<-head(iris)
ir$unique_flower_ID<-1:6
ir<-rbind(ir, ir[c(1,3,5),])
ir[7:9, c(1,3)]<-NA
ir[c(1,3,5), c(2,4)]<-NA
ir<-ir[order(ir$unique_flower_ID),]
# function to run on a given dataframe (df) to
# replace missing values in certain variables (vars) from duplicates
# as identified by a unique ID
replaceNAs_dupl <- function(df, ID, vars) {
#identify duplicate IDs and subset the dataframe
df_dupl<-data.frame(table(df[, ID]))
df_dupl<-df[df[, ID] %in% df_dupl$Var1[which(df_dupl$Freq > 1)],]
# loop through specified columns
for(i in vars) {
#create a mini-dataframe of ID and value for each column
df_dupl_uni<-unique(df_dupl[which(!is.na(df_dupl[,i])), c(ID, i)])
# replace missing data with data from duplicate record
df[which(df[, ID] %in% df_dupl_uni[, ID]), i] <- df_dupl_uni[match(df[which(df[, ID] %in% df_dupl_uni[, ID]), ID], df_dupl_uni[, ID]), i]
return(df)
}
}
# define the columns to run the function on by name
col_names<-colnames(ir[,1:4])
# pass ir to the function
ir2<-replaceNAs_dupl(ir, "unique_flower_ID", col_names)
The output works, but only for the first column; the loop simply won't loop.
Can anyone please explain what I'm doing wrong?
Is there a better way entirely to do what I'm attempting?
As @jdobres said, your initial problem is that you are returning within the loop, before it can iterate further.
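The effect of the misplaced return() can be seen in a stripped-down sketch. Both function names here are made up for illustration.

```r
# return() inside the loop: the function exits during the first iteration
first_only <- function(v) {
  for (i in v) {
    return(i)
  }
}

# return() after the loop: every element is processed before exiting
collect_all <- function(v) {
  out <- c()
  for (i in v) {
    out <- c(out, i)
  }
  return(out)
}

res_first <- first_only(1:3)
res_all   <- collect_all(1:3)
```

In replaceNAs_dupl the same applies: moving return(df) below the closing brace of the for loop lets all columns in vars be processed.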
I offer as an alternate implementation, the following code:
library(dplyr)
ir %>%
group_by(unique_flower_ID) %>%
mutate_at(vars(Sepal.Length:Petal.Width), ~ if_else(is.na(.), na.omit(.)[1], .)) %>%
ungroup()
# # A tibble: 9 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species unique_flower_ID
# <dbl> <dbl> <dbl> <dbl> <fct> <int>
# 1 5.1 3.5 1.4 0.2 setosa 1
# 2 5.1 3.5 1.4 0.2 setosa 1
# 3 4.9 3 1.4 0.2 setosa 2
# 4 4.7 3.2 1.3 0.2 setosa 3
# 5 4.7 3.2 1.3 0.2 setosa 3
# 6 4.6 3.1 1.5 0.2 setosa 4
# 7 5 3.6 1.4 0.2 setosa 5
# 8 5 3.6 1.4 0.2 setosa 5
# 9 5.4 3.9 1.7 0.4 setosa 6
How it works:
grouping by the ID field means that the code below will be executed once for each unique id; meaning the first time the mutate_at function is called, it will see only
Sepal.Length Sepal.Width Petal.Length Petal.Width Species unique_flower_ID
1 5.1 NA 1.4 NA setosa 1
11 NA 3.5 NA 0.2 setosa 1
mutate_at does the same function on one or more columns, in this case all columns between (and including) Sepal.Length and Petal.Width;
the function called is using rlang's "tilde notation", in which the dot . will be replaced with the vector of data within each column, effectively doing each of
if_else(is.na(Sepal.Length), na.omit(Sepal.Length)[1], Sepal.Length)
if_else(is.na(Sepal.Width), na.omit(Sepal.Width)[1], Sepal.Width)
if_else(is.na(Petal.Length), na.omit(Petal.Length)[1], Petal.Length)
if_else(is.na(Petal.Width), na.omit(Petal.Width)[1], Petal.Width)
(it could just as easily have been mutate_at(..., function(a) if_else(is.na(a), na.omit(a)[1], a)), but I like the more compact ~ notation)
within that function, for each value in the vector, if it is non-NA then it is used without change; if it is NA, then it replaces the NA with the first non-NA value in the frame ("first" means first in the frame, so if there are multiple distinct values, you must control which gets priority by the order of your rows);
this safeguards against having no available data in a column by using na.omit(.)[1]: if na.omit(.) returns nothing (vector of length 0, as in na.omit(NA)), then the [1] forces it to return something, which in our case is (another) NA, so we retain a full vector. For example:
ir$Sepal.Length[1:2] <- NA
ir %>%
group_by(unique_flower_ID) %>%
mutate_at(vars(Sepal.Length:Petal.Width), ~ if_else(is.na(.), na.omit(.)[1], .)) %>%
ungroup()
# # A tibble: 9 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species unique_flower_ID
# <dbl> <dbl> <dbl> <dbl> <fct> <int>
# 1 NA 3.5 1.4 0.2 setosa 1
# 2 NA 3.5 1.4 0.2 setosa 1
# 3 4.9 3 1.4 0.2 setosa 2
# 4 4.7 3.2 1.3 0.2 setosa 3
# 5 4.7 3.2 1.3 0.2 setosa 3
# 6 4.6 3.1 1.5 0.2 setosa 4
# 7 5 3.6 1.4 0.2 setosa 5
# 8 5 3.6 1.4 0.2 setosa 5
# 9 5.4 3.9 1.7 0.4 setosa 6
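The length-zero safeguard can also be checked on its own, outside the pipeline, with a couple of throwaway vectors:

```r
# At least one non-NA value: [1] picks the first one
some_value <- na.omit(c(NA, 3.5, 4.1))[1]

# Only NA: na.omit() leaves a zero-length vector ...
n_left <- length(na.omit(NA))

# ... and indexing position 1 of an empty vector gives NA instead of failing
no_value <- na.omit(NA)[1]
```

This is why the all-NA group above keeps its NA values rather than raising an error.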
(PS: since you're new to R, I should clarify: the use of rlang's tilde notation is unique to the tidyverse packages; it is not necessarily available in other packages/functions unless explicitly provided as such. For those, one should use the more generic anonymous function (e.g., function(a) { ... }) or a named function.)
Here's a simple (but somewhat naive) solution for merging records.
library(dplyr)
ir2 <- ir %>%
group_by(unique_flower_ID) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
ungroup()
Limitations:
This merges records, meaning there will no longer be duplicates, which may not be desirable.
If there are ever two duplicate records that don't match, it takes the average value. mean could be replaced with another summary function, but it might be preferable to throw some sort of error if you ever had two records with the same ID but different values in a given column.
If all records with a given ID have NA's in a column, it returns NaN.
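The last two limitations can be illustrated with a short sketch. merge_strict is a hypothetical replacement for mean, not part of any package: it returns the single distinct non-missing value, NA when there is none, and stops when duplicates disagree.

```r
# mean() over nothing but NA values gives NaN once the NAs are removed
all_missing <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

# Hypothetical stricter summary function
merge_strict <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) > 1) {
    stop("conflicting values: ", paste(vals, collapse = ", "))
  }
  if (length(vals) == 0) NA_real_ else vals
}

filled   <- merge_strict(c(NA, 3.5))             # 3.5
empty    <- merge_strict(c(NA_real_, NA_real_))  # NA instead of NaN
conflict <- tryCatch(merge_strict(c(1, 2)), error = function(e) "error")
```

It could be passed to summarise_if(is.numeric, merge_strict) in place of mean, under the assumption that disagreeing duplicates should be treated as a data problem.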
library(tidyverse)
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
mutate(new_col = rank == 4, Sepal.Width)
table <- df %>%
filter(rank == 4) %>%
select(Species, new_col = Sepal.Width)
correct_df <- left_join(df, table, by = "Species")
df
#> # A tibble: 150 x 8
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Dim
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0.280
#> 2 4.9 3 1.4 0.2 setosa 0.280
#> 3 4.7 3.2 1.3 0.2 setosa 0.26
#> 4 4.6 3.1 1.5 0.2 setosa 0.3
#> 5 5 3.6 1.4 0.2 setosa 0.280
#> 6 5.4 3.9 1.7 0.4 setosa 0.68
#> 7 4.6 3.4 1.4 0.3 setosa 0.42
#> 8 5 3.4 1.5 0.2 setosa 0.3
#> 9 4.4 2.9 1.4 0.2 setosa 0.280
#> 10 4.9 3.1 1.5 0.1 setosa 0.15
#> # ... with 140 more rows, and 2 more variables: rank <dbl>, new_col <lgl>
I'm basically looking for new_col to show the value that corresponds with rank = 4 from the Sepal.Width column. In this case, those values would be 3.9, 3.3, and 3.8. I'm envisioning this similar to a VLookup, or Index/Match in Excel.
Whenever I think "now I need to use VLOOKUP like I did in the past in Excel", I find the left_join() function helpful. It's also part of the dplyr package. Instead of "looking up" values from one table in another table, it's easier for R to just make one bigger table, where one table remains unchanged (here the "left" one, i.e. the first one you put in the function) and the other is added using a column or columns they have in common as an index.
In your specific example, I can't entirely understand what you want new_col to have in it. If you want to do Excel-style VLOOKUP in R, then left_join() is the best starting point.
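As a minimal illustration of the lookup idea, here is a sketch with two made-up tables: orders plays the role of the sheet you are working in and prices is the lookup range.

```r
library(dplyr)

# Made-up example tables
orders <- data.frame(id = c(1, 2, 3), item = c("a", "b", "a"))
prices <- data.frame(item = c("a", "b"), price = c(10, 20))

# Every row of orders is kept; price is filled in by matching on item
looked_up <- left_join(orders, prices, by = "item")
looked_up
```

Rows of orders with no match in prices would get NA in the price column, much like a failed VLOOKUP.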
The question is not clear since it does not mention the purpose of a Vlookup or Index/Match like operation from Excel.
Also, you don't mention what value should "new_col" have if rank is not equal to 4.
Assuming the value is NA, the below solution with a simple ifelse would work:
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
ungroup() %>%
mutate(new_col = ifelse(rank == 4, Sepal.Width,NA))
df