Applying functions to each group in a dataframe in R


I have a dataframe like this:
df <- data.frame(info = c("Lucas sold $3.01", "Lucia bought 3.00", "Lucas bought $2.5",
                          "Lucas sold $3.01", "Lucia bought 3.00", "Lucas bought $2.5"),
                 number = c("1001", "1001", "1002", "1003", "1003", "1003"),
                 step = c("step 1", "step 2", "step 1", "step 1", "step 2", "step 3"),
                 status = c("ok", NA, NA, "ok", NA, NA))
I need to transform the information that I already have, using several functions, but I need to do it per group, grouping on "number".
For example, I need to group by "number" and then replace the first NA in the "status" column with "ok", for each group.
Then "status" would be c("ok","ok","ok","ok","ok",NA)
Something like which(is.na(df$status))[1] would do the trick if I could apply it to each group.
Another function that I need to apply would create a new column holding a "1" at the last row where the word "bought" appears in the "info" column.
Something like df$new[max(which(grepl("bought", df$info)))] <- "1" would do the trick if I could apply it to each group, but I am not sure how to do that.

You could make good use of dplyr's group_by syntax here after writing a couple of small helper functions for the required tasks:
library(dplyr)
# Replace the first NA element of a vector with 'ok'
replace_first_na <- function(x) {
  # Coerce to character to catch potential issues
  x <- as.character(x)
  # Get the position of the first NA (this is NA if there is none)
  first_na <- which(is.na(x))[1]
  # Replace the element in that position with 'ok'
  if (!is.na(first_na)) x[first_na] <- "ok"
  x
}
# Flag the last element containing the word 'bought'
last_bought_flag <- function(x) {
  # Prepare the output
  out <- rep(0, length(x))
  # Get the positions of the strings that contain 'bought'
  bought <- which(grepl("bought", x))
  # Put a `1` at the last such position, if the group has any match
  if (length(bought) > 0) out[max(bought)] <- 1
  # Return the output
  out
}
df %>%
  as_tibble() %>%
  # Apply grouping by `number`
  group_by(number) %>%
  # Replace the first `NA` with 'ok' in the `status` column
  mutate(status = replace_first_na(status)) %>%
  # Get a flag column indicating the last 'bought' item for each group
  mutate(last_bought = last_bought_flag(info)) %>%
  # Remove grouping
  ungroup()
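If you would rather not depend on dplyr, the same per-group logic can be sketched in base R with ave(), which applies a function to each group and returns the result aligned with the original rows. This is only a sketch: the helper names first_na_to_ok and flag_last_true are made up for illustration, and stringsAsFactors = FALSE is set so the columns are plain character vectors on any R version.

```r
df <- data.frame(
  info = c("Lucas sold $3.01", "Lucia bought 3.00", "Lucas bought $2.5",
           "Lucas sold $3.01", "Lucia bought 3.00", "Lucas bought $2.5"),
  number = c("1001", "1001", "1002", "1003", "1003", "1003"),
  status = c("ok", NA, NA, "ok", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the first NA of a vector with "ok"
first_na_to_ok <- function(x) {
  i <- which(is.na(x))[1]
  if (!is.na(i)) x[i] <- "ok"
  x
}

# Hypothetical helper: given a 0/1 vector, keep a 1 only at the last 1
flag_last_true <- function(b) {
  out <- integer(length(b))
  hits <- which(b == 1L)
  if (length(hits) > 0) out[max(hits)] <- 1L
  out
}

# ave() runs each helper once per value of `number`
df$status <- ave(df$status, df$number, FUN = first_na_to_ok)
df$last_bought <- ave(as.integer(grepl("bought", df$info)), df$number,
                      FUN = flag_last_true)
```

Both helpers return a vector of the same length as their input, which is what ave() (and mutate() on grouped data) needs.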

Related

Can I use lapply to check for outliers in comparison to values from all listed tibbles?

My data is imported into R as a list of 60 tibbles each with 13 columns and 8 rows. I want to detect outliers defined as 2*sd by comparing each value in column "2" to the mean of all values of column "2" in the same row.
I know that I am on a wrong path with these lines, as I am not comparing the single values
lapply(list, function(x){
if(x$"2">(mean(x$"2")) + (2*sd(x$"2"))||x$"2"<(mean(x$"2")) - (2*sd(x$"2"))) {}
})
Also I was hoping to replace all values that are thus identified as outliers by the corresponding mean calculated from the 60 values in the same position as the outlier while keeping everything else, but I am also quite unsure how to do that.
Thank you!

You haven't added an example of your data, so I've made a quick and simple example to demonstrate my answer. I think the logic becomes much more straightforward if you first combine the list of tibbles into a single tibble. This lets you do everything in one dplyr pipe, with outliers ultimately marked by 1's in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1,20,1), 150),
                  colB = seq(0.1,2.1,0.1),
                  id = 1:21)
tibble2 <- tibble(colA = c(seq(101,120,1), -150),
                  colB = seq(21,41,1),
                  id = 1:21)
# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one
# The 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
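The question also asks to replace each outlier with the mean of the values in the same position. A minimal base R sketch of that step follows, on made-up data (10 small data frames with hypothetical columns id and value). Note that a value can only lie more than 2 sample standard deviations from its group mean when the group has at least 6 values, so a demo with only a handful of tables will never flag anything. With 60 tibbles this is not a concern.

```r
# Made-up data: 10 tables, 3 rows each, identified by `id`
tables <- lapply(1:10, function(k) data.frame(id = 1:3, value = c(k, 10 * k, 5)))
tables[[10]]$value[1] <- 100          # plant one outlier at position id == 1

# Stack the tables and remember which table each row came from
stacked <- do.call(rbind, tables)
stacked$tbl <- rep(seq_along(tables), each = 3)

# Mean and sd of all values sharing the same position (id)
m <- ave(stacked$value, stacked$id, FUN = mean)
s <- ave(stacked$value, stacked$id, FUN = sd)

# Flag values more than 2 sd away from the position-wise mean
is_out <- abs(stacked$value - m) > 2 * s

# Replace flagged values by that mean, keeping everything else
stacked$value[is_out] <- m[is_out]

# Back to a list of tables, if that shape is needed
cleaned <- split(stacked, stacked$tbl)
```

The mean used for replacement here includes the outlier itself, which follows the wording of the question. Excluding it would need a second pass.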

I want to find a mode for each group of dataframes within the element of a list, and write the result as a new column

I have a list called "data". It consists of 10 elements (lists), each having different number of elements (lists), such as
lengths(data)
[1] 26 33 3 20 22 21 17 18 12 29
Thus, the first element of our list consists of 26 elements, the second of 33, and so on ... Each of these elements are dataframes ("tibbles"), with 6 columns (first four being integers, fifth logical, and the last character), for instance
colnames(data[[1]][[1]])
[1] "width" "height" "x" "y" "space" "text"
Although the structure of dataframes (columns) is consistent in and outside of the groups, the number of rows differs for each dataframe even within the group.
I want to find a mode "height" for the dataframes grouped within the same element.
Thus, there is common mode for 26 dataframes within the first element and so on. In other words, I want to group the data for 26 dataframes within the first element, calculate the mode, and then write result as a new column to each of the dataframes so that I could perform different operations for rows with height above, below, and equal to mode.
This is what I figured out so far, although it is not correct it should produce the same result in most of the cases:
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
# find mode height for each paper and each page
mode <- lapply(data, function(x) lapply(lapply(x, '[[', 'height'), getmode))
# find mode for each paper
mode2 <- lapply(mode, function(x) getmode(x))

Here is one option. We loop over the outer list with map and bind the inner list elements into a single data frame with bind_rows (from dplyr), which creates a column grp. We then apply getmode to the combined 'height' column to create a new column, and split the dataset again by the 'grp' column:
library(purrr)
library(dplyr)
map(data, ~ bind_rows(.x, .id = 'grp') %>%
mutate(Mode = getmode(height)) %>%
group_split(grp, .keep = FALSE))
Or loop over the outer list with lapply and over the inner list with sapply: extract 'height' from each inner data frame, apply getmode to get a vector of per-data-frame modes, and apply getmode again to that vector. Then loop over the inner list once more and add the resulting mode as a new column:
lapply(data, function(x) {
mode <- getmode(sapply(x, function(y) getmode(y$height)))
lapply(x, function(y) cbind(y, mode))
})
data
set.seed(24)
data1 <- replicate(26, head(mtcars) %>% mutate(height = rnorm(6)), simplify = FALSE)
data2 <- replicate(33, head(iris) %>% mutate(height = rnorm(6)), simplify = FALSE)
data <- list(data1, data2)
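The two options are not guaranteed to agree: the first takes the mode of all pooled heights, the second takes the mode of the per-data-frame modes. A small made-up example (the list pages is hypothetical) shows how they can differ:

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Three made-up "pages" belonging to one "paper"
pages <- list(
  data.frame(height = c(2, 2, 1)),
  data.frame(height = c(2, 2, 1)),
  data.frame(height = c(1, 1, 1, 1, 1))
)

# Mode of all heights pooled together: 1 occurs 7 times, 2 occurs 4 times
pooled_mode <- getmode(unlist(lapply(pages, `[[`, "height")))

# Mode of the per-page modes: the page modes are 2, 2, 1
mode_of_modes <- getmode(sapply(pages, function(p) getmode(p$height)))
```

Here pooled_mode is 1 while mode_of_modes is 2, so pick the variant that matches what "common mode" should mean for your data.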

Substituting or summing based on condition

I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5))
And I would like to create a cumulative sum of "Val". I know how to do the simple cumulative sum
df <- df %>% group_by(id) %>% mutate(cumval=cumsum(Val))
However, I would like my final data to look like this
final <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5),
"cumval" = c(2,5,6,11,2,7,5,10))
The basic idea is that when two "Val"s belong to the same "Group", the later one (by Year) replaces the earlier one.
For instance, in the sample dataset, observation 3 has a "cumval" of 6 rather than 8 because the "Val" from 1972 replaced the "Val" from 1970. The same applies to Beta.
I thank you in advance for your help

In my head, this requires a for loop. First we split the dataframe by the id column into a list of two. Then we create two empty lists. In the og list, we will put the row where the first unique non NA group identifier occurs. For alpha this is the first row and for Beta this is the second row. We will use this to subtract from the cumulative sum when the value gets substituted.
mylist <- split(df, f = df$id)
og <- list()
vals <- list()
df_num <- 1
We use a nested loop: the outer loop runs over each object (a dataframe in this case) in the list, and the inner loop runs over each value in the Group column.
We need to keep track of the row number, which we do with the r variable. It is set to 0 before the inner loop and incremented by 1 at the start of every iteration. First we check whether we are in the first row of the data frame, in which case the cumulative sum is simply the value in the first row of the Val column. Within that if test, another if test checks whether the Group id is NA. If it isn't, this is the first occurrence of a group number, and the current value must be substituted if that number appears again. So we save the number to the temporary variable temp, and we extract and save the row that contains the value to the og list.
After this, the loop moves on to the next iteration. We check whether the current Group value is NA. If it is, we just add the value to the cumulative sum. If it is not NA, we check whether it is equal to the value stored in temp. If so, we need to substitute: we extract the original value stored in the og list and save it as old, subtract old from the cumulative sum, and add the current value. We also replace the original value in og with the current replacement value, because if the value needs to be replaced again, we will need to subtract the current value and not the original one.
If j is not NA but is not equal to temp, then this is a new instance of Group. So we save the row with the original value to the og list and save the Group in temp. The sum continues as normal, as this is not an instance of replacing a value. Note that the variable x, which counts the elements in the og list, is only incremented when a new occurrence is added to the list. Thus, og[[x-1]] will always hold the value to be replaced.
for (my_df in mylist) {
  x <- 1
  r <- 0
  vals <- list()  # restart the running sums for each data frame
  temp <- NA      # no Group value seen yet in this data frame
  for (j in my_df$Group) {
    r <- r + 1
    if (r == 1) {
      vals[[1]] <- my_df$Val[1]
      if (!is.na(j)) {
        og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
        temp <- j
        x <- x + 1
      }
      next
    }
    if (is.na(j)) {
      # No group: plain cumulative sum
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
    } else if (isTRUE(j == temp)) {
      # Same group as before: swap the old value for the current one
      old <- og[[x-1]]
      old <- old[,2]
      vals[[r]] <- vals[[r-1]] - old + my_df$Val[r]
      og[[x-1]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
    } else {
      # New group: remember it and carry on summing
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
      og[[x]] <- my_df[r, c('Group', 'Val')]
      temp <- j
      x <- x + 1
    }
  }
  cumval <- unlist(vals) %>% as.data.frame()
  colnames(cumval) <- 'cumval'
  my_df <- cbind(my_df, cumval)
  mylist[[df_num]] <- my_df
  df_num <- df_num + 1
}
Lastly, we combine the two dataframes in the list by binding them on rows with bind_rows from the dplyr package. Then I check whether final_df is identical to your desired output with identical(), and it evaluates to TRUE:
final_df <- bind_rows(mylist)
identical(final_df, final)
[1] TRUE
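A loop-free alternative is also possible. The sketch below is not the method above but a different technique: each row contributes Val minus the previous Val of the same id and Group (or just Val when Group is NA or seen for the first time), and the per-id cumulative sum of those contributions gives cumval. It assumes the rows are already sorted by Year within each id. The names inc, key and prev are made up for illustration.

```r
df <- data.frame(id = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
                 Year = c(1970, 1971, 1972, 1973, 1974, 1977, 1978, 1990),
                 Group = c(1, NA, 1, NA, NA, 2, 2, NA),
                 Val = c(2, 3, 3, 5, 2, 5, 3, 5))

has_grp <- !is.na(df$Group)

# One key per id/Group combination, only for rows that have a Group
key <- paste(df$id[has_grp], df$Group[has_grp])

# Previous Val within the same id/Group (0 for the first occurrence)
prev <- ave(df$Val[has_grp], key, FUN = function(v) c(0, head(v, -1)))

# Net contribution of each row to the running total
inc <- df$Val
inc[has_grp] <- df$Val[has_grp] - prev

# Cumulative sum of the contributions within each id
df$cumval <- ave(inc, df$id, FUN = cumsum)
```

Because the contributions telescope, a value that is replaced several times within the same Group is still handled correctly.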

How can I create a data.frame based on variable number of tidyeval inputs

First of all apologies for the somewhat uninformative title
I have a shiny app where a user downloads one of many possible datasets and for some columns can perform a filter to produce a data.frame output
I want to standardize the code irrespective of the dataset downloaded
The problem is that the column names differ by dataset and there will be a variable number of columns I wish to filter on
As far as creating inputs are concerned, I have adapted this solution using a tidyeval approach. However, I am having difficulty with the output without having to resort to a lot of if else statements based on number of columns that can be filtered on
Here is a (non-shiny) example based on a dataset with 2 filterable columns, a Value column which is always required, and one column that is unwanted in the final output
library(tidyverse)
## desired columns
my_cols <- c("col 1", "another col")
# selected input
input_1 <- c("A","B")
input_2 <- c("Z")
l <- list(`col 1` = rep(c("A","B","C"),times=3), `another col` =
rep(c("X","Y","Z"),each=3), Value = c(1:9),`Unwanted 1`=(9:1))
df <- as_tibble(l)
# this creates the right number of correctly-named columns
for (i in seq_along(my_cols)) {
assign(paste0("col_", i), sym(my_cols[i]))
}
## This produces output but wish to adapt
## to varying number of columns
df %>%
filter(!!col_1 %in% input_1) %>%
filter(!!col_2 %in% input_2) %>%
select(!!col_1, !!col_2, Value)
# `col 1` `another col` Value
# <chr> <chr> <int>
# 1 A Z 7
# 2 B Z 8
So it is the last piece of code I wish to adapt to take account of the variable length of my_cols
TIA

You seem to store the inputs in separate variables, which suggests you know up front how many columns will be operated on (unless those are coming from dynamically generated UI). Anyway, I suggest you keep the inputs in one object as well (ideally the same length as my_cols; otherwise you can subset the input list to match the length of the my_cols vector). Then you can prepare a list of quosures and splice them into filter and select.
library(tidyverse)
library(rlang)
## desired columns
my_cols <- c("col 1", "another col")
# selected input
input_1 <- c("A","B")
input_2 <- c("Z")
input_3 <- NULL # ui handle that is not used for this dataset
l <- list(`col 1` = rep(c("A","B","C"),times=3), `another col` =
rep(c("X","Y","Z"),each=3), Value = c(1:9),`Unwanted 1`=(9:1))
df <- as_tibble(l)
# make a list of inputs
l_input <- list(input_1, input_2, input_3)[seq_along(my_cols)]
# make a list of expression quosures. Make sure you use enquos if inside function
l_expr <- mapply(function(x,y) quos(!!sym(x) %in% !!y), my_cols, l_input, USE.NAMES = F)
# splice into filter and select
df %>% filter(!!!l_expr) %>% select(!!!syms(my_cols), Value)
If you put this inside a function, remember to use enquos()
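As a possible alternative that avoids building quosures by hand, the filters can be kept in a named list and applied one column at a time through the .data pronoun. This is a sketch under a few assumptions: the list name filters is made up, the data are rebuilt here as a plain data frame, and a dplyr version that provides all_of() (1.0 or later) is installed.

```r
library(dplyr)

df <- data.frame(`col 1` = rep(c("A", "B", "C"), times = 3),
                 `another col` = rep(c("X", "Y", "Z"), each = 3),
                 Value = 1:9,
                 `Unwanted 1` = 9:1,
                 check.names = FALSE)

# Hypothetical container: one entry per filterable column,
# named after the column and holding the selected values
filters <- list(`col 1` = c("A", "B"), `another col` = "Z")

out <- df
for (col in names(filters)) {
  # .data[[col]] looks the column up by its name as a string
  out <- filter(out, .data[[col]] %in% filters[[col]])
}

# Keep the filtered columns plus the always-required Value column
out <- select(out, all_of(names(filters)), Value)
```

The loop works for any number of entries in filters, so the same code can serve datasets with different filterable columns.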

How to transpose data frame in R by group with column header

I am working with a data frame that looks like the following which I need to transpose by group based on the common Id:
testDF = data.frame(c("Id", "1", "1", "2", "2"), c("Item", 'Milk','Eggs','Bacon', "Bread"))
testDF
#>Id Item
#>1 Milk
#>1 Eggs
#>2 Bacon
#>2 Bread
newDT <- dcast(testDF, Id ~ Item, value.var = "Item")
View(newDT)
I need the output to look like the following (excluding the header row and the Id column altogether):
Milk, Eggs (Id 1)
Bacon, Bread (Id 2)
I am receiving the error 'Error: value.var (Item) not found in input'. Can you please tell me what I am doing wrong?
Thanks,
Matt

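Some good answers above, but I think this should be listed as an option as well (it needs dplyr and stringr, and assumes testDF has proper Id and Item column names):
library(dplyr)
library(stringr)
testDF %>%
  group_by(Id) %>%
  # Create string listing all items in given Id, separated by comma
  summarise(Items = str_c(Item, collapse = ', '))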
Returns:
# A tibble: 2 × 2
Id Items
<fctr> <chr>
1 Milk, Eggs
2 Bacon, Bread

testDF = data.frame(Id = c("1", "1", "2", "2"),
Item = c('Milk','Eggs','Bacon', "Bread"))
testDF
z <- aggregate(list(Item = testDF$Item), list(ID = testDF$Id),
function(x) paste(x, collapse = ','))
z
ID Item
1 1 Milk,Eggs
2 2 Bacon,Bread
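As for the error in the question: the original testDF was built with "Id" and "Item" as the first data row rather than as column names, so the data frame has no column called Item, and that is what the dcast message about value.var reports. A small sketch of repairing such a data frame in base R follows. stringsAsFactors = FALSE is added here as an assumption so the columns are plain character vectors on any R version, and the name items is made up.

```r
# The data frame as built in the question: the header sits in row 1
testDF <- data.frame(c("Id", "1", "1", "2", "2"),
                     c("Item", "Milk", "Eggs", "Bacon", "Bread"),
                     stringsAsFactors = FALSE)

# Promote the first row to column names, then drop it
names(testDF) <- unlist(testDF[1, ], use.names = FALSE)
testDF <- testDF[-1, ]

# Collapse the items of each Id into one comma-separated string
items <- tapply(testDF$Item, testDF$Id, paste, collapse = ", ")
```

After the repair, any of the grouped approaches in the other answers can be applied to testDF as well.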

With the group_by and summarise functions from dplyr you have:
library("dplyr")
testDF %>%
  group_by(Id) %>%
  summarise(Items=paste0(Item,collapse=","))
#Source: local data frame [2 x 2]
#Groups: Id [2]
#
# Id Items
# (fctr) (chr)
#1 1 Milk,Eggs
#2 2 Bacon,Bread

Since I didn't see a green checkmark, I figured I'd take a stab at it, because I wrote a function for this exact problem.
library(dplyr)
transp <- function(input,uniq_var,compare_var,transposed_column_names = 'measurement'){
if(class(input[,uniq_var]) == "factor"){
input[uniq_var] = sapply(input[uniq_var],as.character)
}
# input: the dataframe/data.table that you want to perform the operation on
# uniq_var: the variable that you are grouping by
# compare_var: the variable that is being measured in each of the groups
# transposed_column_names: an optional string used to name the new columns; it is
#   concatenated with an observation number, i.e. if you input 'distance', the
#   observations are named 'distance1', 'distance2', 'distance3', etc.
list_df <- input %>% group_by(input[,uniq_var]) %>% do(newcol = t(.[compare_var]))
# it gets us the aggregates we want, BUT all of our columns are stored in a list
# instead of in separate columns.... so we need to create a new dataframe with the dimensions
# rows = the number of unique values that we are "grouping" by, noted here by uniq_var and the number of columns will be
# the maximum number of observations that are assigned to one of those groups.
# so first we will create the skeleton of the matrix, and then use a user defined function
# to fill it with the correct values
new_df <- matrix(rep(NA,(max(count(input,input[,uniq_var])[,2])*dim(list_df)[1])),nrow = dim(list_df)[1])
new_df <- data.frame(new_df)
new_df <- cbind(list_df[,1],new_df)
# I am writing a function inside of a function because for loops can take a while
# when doing operations on multiple columns of a dataframe
func2 <- function(input,thing = new_df){
# here, we have a slightly easier case when we have the maximum number of children
# assigned to a household.
# we subtract 1 from the number of columns because the first column holds the value of the
# unique value we are looking at, so we don't count it
if(length(input[2][[1]])==dim(thing)[2]-1){
# we set the row corresponding to the specific unique value specified in our list_df of aggregated values
# equal to the de-aggregated values, so that you have a column for each value like in PROC Transpose.
thing[which(thing[,1]==input[1]),2:ncol(thing)]= input[2][[1]]
#new_df[which(new_df[,1]==input[1]),2:ncol(new_df)]= input[2][,1][[1]][[1]]
}else{
thing[which(thing[,1]==input[1]),2:(1+length(input[2][[1]]))]= input[2][[1]]
}
# if you're wondering why I have to use so many []'s it's because our list_df has 1 column
# of unique identifiers and the other column is actually a column of dataframes
# each of which only has 1 row and 1 column, and that element is a list of the transposed values
# that we want to add to our new dataframe
# so essentially the first bracket
return(thing[which(thing[,1]==input[1]),])
}
quarter_final_output <- apply(list_df,1,func2)
semi_final_output <- data.frame(matrix(unlist(quarter_final_output),nrow = length(quarter_final_output),byrow = T))
#return(apply(list_df,1,func2))
# this essentially names the columns according to the column names that a user would typically specify
# in a proc transpose.
name_trans <- function(trans_var=transposed_column_names,uniq_var = uniq_var,df){
#print(trans_var)
colnames(df)[1] = colnames(input[uniq_var])
colnames(df)[2:length(colnames(df))] = c(paste0(trans_var,seq(1,(length(colnames(df))-1),1)))
return(df)
}
final_output <- name_trans(transposed_column_names,uniq_var,semi_final_output)
return(final_output)
}
In your case, you'd apply it like this:
transp(testDF,uniq_var = 'Id',compare_var = "Item")
You can also download it from my GitHub: https://github.com/seanpili/R_PROC_TRANSPOSE
