Substituting or summing based on condition - r

I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5))
And I would like to create a cumulative sum of "Val". I know how to do the simple cumulative sum
df <- df %>% group_by(id) %>% mutate(cumval=cumsum(Val))
However, I would like my final data to look like this
final <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5),
"cumval" = c(2,5,6,11,2,7,5,10))
The basic idea is that when two "Val"s belong to the same "Group", the later one (by "Year") replaces the earlier one in the sum.
For instance, in the sample dataset, observation 3 has a "cumval" of 6 rather than 8 because the "Val" from 1972 replaces the "Val" from 1970. The same applies to Beta.
I thank you in advance for your help

In my head, this requires a for loop. First we split the data frame by the id column into a list of two data frames. Then we create two empty lists, og and vals. In the og list we store the row where the first non-NA Group identifier occurs: for Alpha this is the first row and for Beta this is the second row. We use this stored value to subtract from the cumulative sum when the value gets substituted.
mylist <- split(df, f = df$id)
og <- list()
vals <- list()
df_num <- 1
We use a nested loop: the outer loop iterates over each object (a data frame in this case) in the list, and the inner loop iterates over each value in the Group column.
We need to keep track of the row numbers, which we do with the r variable. We set it to 0 before the inner loop and add 1 at the start of each iteration. First we check if we are in the first row of the data frame, in which case the cumulative sum is simply equal to the value in the first row of the Val column. Then, within that if test, we use another if test to check if the Group id is NA. If it isn't, this is the first occurrence of a Group number, and the current value will be substituted if this number appears again. So we save the number to the temporary variable temp. We also extract and save the row that contains the value to the og list.
After this, it goes to the next iteration. We check if the current Group value is NA. If it is, we just add the value to the cumulative sum. If it isn't NA, we check if it is equal to the value stored in temp. If so, we need to substitute. We extract the original value stored in the og list and save it as old. We then subtract the old value from the cumulative sum and add the current value. We also replace the original value in og with the current replacement value, because if the value needs to be replaced again, we will need to subtract the current value and not the original one.
If j is not NA but is not equal to temp, then this is a new instance of Group. So we save the row with the original value to the og list and save the Group in temp. The sum continues as normal, since this is not an instance of replacing a value. Note that the variable x, which counts the elements in the og list, is only incremented when a new occurrence is added to the list. Thus, og[[x-1]] is always the most recently stored value for the current Group, i.e. the one to subtract on the next substitution.
for (my_df in mylist) {
  x <- 1
  r <- 0
  vals <- list()  # start a fresh running sum for each id
  temp <- NA      # no Group seen yet for this id
  for (j in my_df$Group) {
    r <- r + 1
    if (r == 1) {
      # First row: the cumulative sum is just the first Val
      vals[[1]] <- my_df$Val[1]
      if (!is.na(j)) {
        og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
        temp <- j
        x <- x + 1
      }
      next
    }
    if (is.na(j)) {
      # No Group: plain cumulative sum
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
    } else if (isTRUE(j == temp)) {
      # Same Group as the stored one: swap the old Val for the current one
      old <- og[[x-1]]$Val
      vals[[r]] <- vals[[r-1]] - old + my_df$Val[r]
      og[[x-1]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
    } else {
      # New Group: add as usual and remember this row
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
      og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
      temp <- j
      x <- x + 1
    }
  }
  cumval <- unlist(vals) %>% as.data.frame()
  colnames(cumval) <- 'cumval'
  my_df <- cbind(my_df, cumval)
  mylist[[df_num]] <- my_df
  df_num <- df_num + 1
}
Lastly, we combine the two data frames in the list by binding them on rows with bind_rows from the dplyr package. Then I check whether the resulting data frame is identical to your desired output with identical(), and it evaluates to TRUE.
final_df <- bind_rows(mylist)
identical(final_df, final)
[1] TRUE
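For comparison, the same substitution rule can probably be written without an explicit loop. The following is only a sketch in base R, not the method used above. It assumes the rows are already sorted by Year within each id, and that every repeat of a Group within an id replaces the previous value of that Group (the loop above only tracks the most recently seen Group, so the two may differ if Groups interleave). The names key, has_grp, prev and adj are made up for illustration. The idea is that replacing an earlier value is the same as adding the current Val minus the previous Val of the same Group.

```r
df <- data.frame(
  id    = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  Year  = c(1970, 1971, 1972, 1973, 1974, 1977, 1978, 1990),
  Group = c(1, NA, 1, NA, NA, 2, 2, NA),
  Val   = c(2, 3, 3, 5, 2, 5, 3, 5),
  stringsAsFactors = FALSE
)

# Rows that belong to a Group, and a key identifying each (id, Group) pair
has_grp <- !is.na(df$Group)
key <- paste(df$id, df$Group)

# Previous Val within the same (id, Group) pair; 0 when there is none
prev <- numeric(nrow(df))
prev[has_grp] <- ave(df$Val[has_grp], key[has_grp],
                     FUN = function(v) c(0, head(v, -1)))

# Add the current Val and take back the value it replaces, then sum per id
adj <- df$Val - prev
df$cumval <- ave(adj, df$id, FUN = cumsum)
```

On the sample data this gives the cumval column requested in the question.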

Related

Can I use lapply to check for outliers in comparison to values from all listed tibbles?

My data is imported into R as a list of 60 tibbles each with 13 columns and 8 rows. I want to detect outliers defined as 2*sd by comparing each value in column "2" to the mean of all values of column "2" in the same row.
I know that I am on a wrong path with these lines, as I am not comparing the single values
lapply(list, function(x){
if(x$"2">(mean(x$"2")) + (2*sd(x$"2"))||x$"2"<(mean(x$"2")) - (2*sd(x$"2"))) {}
})
Also I was hoping to replace all values that are thus identified as outliers by the corresponding mean calculated from the 60 values in the same position as the outlier while keeping everything else, but I am also quite unsure how to do that.
Thank you!
You haven't added an example of your data, so I've made a quick and simple example to demonstrate my answer. I think the logic becomes much more straightforward if you first combine the list of tibbles into a single tibble. This lets you do everything you want in a simple dplyr pipe, ultimately identifying outliers by 1s in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1,20,1), 150),
colB = seq(0.1,2.1,0.1),
id = 1:21)
tibble2 <- tibble(colA = c(seq(101,120,1), -150),
colB = seq(21,41,1),
id = 1:21)
# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one
# The 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
group_by(id) %>%
mutate(meanA = mean(colA),
sdA = sd(colA),
lowThresh = meanA - 2*sdA,
uppThresh = meanA + 2*sdA,
outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
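The question also asks to replace each outlier with the mean of the values in the same position. A minimal sketch of that step follows, written in base R so it runs without extra packages; the data and the names tbls, joined and colA_clean are hypothetical. Note that with fewer than six values per position no value can lie more than two sample standard deviations from the mean, so the two-tibble toy example above will not flag anything, which is why six small tables are used here.

```r
# Six hypothetical tables with three positions each; the last table has an
# extreme value in position 1
vals <- list(c(10, 5, 7), c(10, 6, 7), c(10, 5, 8),
             c(10, 6, 8), c(10, 5, 7), c(100, 6, 8))
tbls <- lapply(vals, function(v) data.frame(id = seq_along(v), colA = v))

# Stack all tables, keeping the position index in `id`
joined <- do.call(rbind, tbls)

# Mean and sd of colA across tables, computed per position
meanA <- ave(joined$colA, joined$id, FUN = mean)
sdA   <- ave(joined$colA, joined$id, FUN = sd)

# Flag values more than 2 sd away from the positional mean
joined$outlier <- abs(joined$colA - meanA) > 2 * sdA

# Replace flagged values by the positional mean, keep everything else
joined$colA_clean <- ifelse(joined$outlier, meanA, joined$colA)
```

The mean used for the replacement here still includes the outlier itself, which is how the question phrases it; excluding it would need a separate calculation.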

modification of lists with respect to dates

I have a list of data:
list <- list()
list$date <- structure(19297:19310, class = "Date")
list$value <- c(100,200,300,100,200,300,100,200,300,100,200,500,800)
list$temp2 <- c(1000,2000,3000,1000,2000,3000,1000,2000,3000,1000,2000,5888,9887)
I want to modify the list in such a way that:
1. Every element of list$value is multiplied by 0.5 * list$temp2 (which can be done with a simple multiplication).
2. The exception is the maximum value within days 1 to 7 of the date (the maximum of the first week): this value needs to be doubled instead (i.e., exactly one list$value is not replaced by step 1 but is doubled).
Can anyone help me with this?
Converting the list into a data.frame (or better yet a data.table) will enable column-wise operations on the data.
dl <- list()
dl$date <- structure(19297:19310, class = "Date")
dl$value <- c(100,200,300,100,200,300,100,200,300,100,200,500,800)
dl$temp2 <- c(1000,2000,3000,1000,2000,3000,1000,2000,3000,1000,2000,5888,9887)
Since the list elements are of unequal length, add an NA to the end of the shorter ones so that each element has 14 values
dl$value[[length(dl$value)+1]] <- NA
dl$temp2[[length(dl$temp2)+1]] <- NA
Convert to a data.frame
df <- as.data.frame(dl)
Create the exception criterion (the row holding the maximum value of the first 7 days; which.max picks the first one if there are ties, so exactly one row is flagged and rows outside the first week are never matched)
df$exception <- FALSE
df$exception[which.max(df$value[1:7])] <- TRUE
Create a new variable "value2": multiply by 0.5 where the exception doesn't occur, and by 2 where it does.
df$value2 <- as.numeric(NA)
df$value2[df$exception == FALSE] <- df$value[df$exception == FALSE] * 0.5
df$value2[df$exception == TRUE] <- df$value[df$exception == TRUE] * 2
The output, which can be passed back into a list object, if required
df$date
df$value2
df$temp2
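The question's wording can also be read as multiplying each value by 0.5 times the matching temp2, rather than by 0.5 alone. A short sketch under that reading is shown here; it is an assumption about what the asker meant, not a correction of the answer above. The names first_week and idx are made up for illustration, and the first week is taken to be the seven days starting at the earliest date.

```r
dl <- list(
  date  = structure(19297:19310, class = "Date"),
  value = c(100, 200, 300, 100, 200, 300, 100, 200, 300, 100, 200, 500, 800, NA),
  temp2 = c(1000, 2000, 3000, 1000, 2000, 3000, 1000, 2000, 3000, 1000, 2000,
            5888, 9887, NA)
)
df <- as.data.frame(dl)

# Rows falling in the first seven days of the series
first_week <- df$date < min(df$date) + 7

# Row index of the (first) maximum value within that week
idx <- which(first_week)[which.max(df$value[first_week])]

# Assumed rule: value * 0.5 * temp2 everywhere, except the flagged row,
# which is doubled instead
df$value2 <- df$value * 0.5 * df$temp2
df$value2[idx] <- df$value[idx] * 2
```

The padded last row stays NA in value2, since both of its inputs are missing.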

Applying functions to each group in a dataframe in R

I have a dataframe like this:
df<-data.frame(info=c("Lucas sold $3.01","Lucia bought 3.00","Lucas bought $2.5",
"Lucas sold $3.01","Lucia bought 3.00","Lucas bought $2.5"),
number=c("1001","1001","1002","1003","1003","1003"),
step=c("step 1","step 2","step 1","step 1","step 2","step 3"),
status=c("ok",NA,NA,"ok",NA,NA))
I need to transform the information that I already have using various functions, but I need to do it grouping the information by "number".
For example, I need to group by "number" and then replace the first NA in column "status" with an "ok", for each group.
Then "status" would be c("ok","ok","ok","ok","ok",NA)
last(which.na(df$status)) would do the trick if I could apply that to each group.
Another function that I need to apply would create a new column where I can place a "1" at the last row in which the word "bought" appears in the column "info".
Something like df[max(which(grepl("bought",df$info)))]<-"1" would do the trick if I could apply that to each group, but I am not sure how to do it.
You could make great use of dplyr's group_by syntax here after creating some bespoke functions to do the required tasks:
# Replace the first NA element of a vector with 'ok'
replace_first_na <- function(x) {
  # Coerce to character to catch potential issues
  x <- as.character(x)
  # Get the position of the first NA (NA if there is none)
  first_na <- which(is.na(x))[1]
  # Replace the element in that position with 'ok'
  x[first_na] <- "ok"
  x
}
# Flag the last element containing the word 'bought'
last_bought_flag <- function(x) {
  # Prepare the output
  out <- rep(0, length(x))
  # Get the position of the last string to contain 'bought'
  # (empty if no string matches, in which case nothing is flagged)
  last_bought <- tail(which(grepl("bought", x)), 1)
  # Replace the element in that position with `1`
  out[last_bought] <- 1
  # Return the output
  out
}
library(dplyr)
df %>%
  as_tibble() %>%
  # Apply grouping by `number`
  group_by(number) %>%
  # Replace the first `NA` with 'ok' in the `status` column
  mutate(status = replace_first_na(status)) %>%
  # Get a flag column indicating the last 'bought' item for each group
  mutate(last_bought = last_bought_flag(info)) %>%
  # Remove grouping
  ungroup()
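If dplyr is not available, the same per-group logic can be sketched with base R's ave(). This is only an illustrative alternative under the assumption that the rows are already in the desired order within each "number"; the two helpers are restated (in shortened form) so the block runs on its own.

```r
df <- data.frame(
  info = c("Lucas sold $3.01", "Lucia bought 3.00", "Lucas bought $2.5",
           "Lucas sold $3.01", "Lucia bought 3.00", "Lucas bought $2.5"),
  number = c("1001", "1001", "1002", "1003", "1003", "1003"),
  status = c("ok", NA, NA, "ok", NA, NA),
  stringsAsFactors = FALSE
)

# Replace the first NA of a vector with 'ok' (no change if there is no NA)
replace_first_na <- function(x) {
  x[which(is.na(x))[1]] <- "ok"
  x
}

# 1 at the last element containing 'bought', 0 elsewhere
last_bought_flag <- function(x) {
  out <- rep(0, length(x))
  out[tail(which(grepl("bought", x)), 1)] <- 1
  out
}

# Apply each helper within the groups defined by `number`
df$status <- ave(df$status, df$number, FUN = replace_first_na)
df$last_bought <- ave(seq_along(df$info), df$number,
                      FUN = function(i) last_bought_flag(df$info[i]))
```

The row indices are passed to ave() for the second column so that the result stays numeric instead of being coerced to the character type of info.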

How to extract rows of a data frame between two characters

I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values, the first columns are all string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns are either a header like a percentage change string (e.g. +10%) or just a numerical value. Since I have all this data lumped together, I want to be able to extract data for each category. So for instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in my first column. Is there a way to do this in either base R or dplyr easily?
eg.
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1,2,3,4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1,2,3,4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end){
return(df[start:(end-1),])
}
Then loop over the indices in
dividers = which(df$column1 %in% keywords == TRUE)
And save the function outputs however one would like.
lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. Messy data so I still have the annoying case where I need to keep the offset rows.
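To make the splitting concrete, here is a small self-contained sketch on made-up data. The data frame contents and the names bounds and sections are hypothetical; column1 and keywords follow the names used above. Unlike the lapply call above, which stops before the last divider, this version appends an end marker so that the final section is kept as well.

```r
# Toy data: keyword rows mark the start of each section
df <- data.frame(
  column1 = c("random", "credit", "AAA", "BBB", "FX", "EUR", "USD"),
  v1      = c("-1%", NA, "1", "2", "-1%", "3", "4"),
  stringsAsFactors = FALSE
)
keywords <- c("credit", "FX")

# Row numbers where a section starts
dividers <- which(df$column1 %in% keywords)

# Section i runs from its divider up to the row before the next divider;
# the extra bound closes the last section at the end of the data
bounds <- c(dividers, nrow(df) + 1)
sections <- lapply(seq_along(dividers), function(i) {
  df[bounds[i]:(bounds[i + 1] - 1), ]
})
names(sections) <- df$column1[dividers]
```

Each element of sections is then a data frame that starts with its keyword row, so sections$credit holds the rows from "credit" up to, but not including, "FX".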
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
library(dplyr)
set.seed(1)
df <- data.frame(
  x = sample(LETTERS[1:10]),
  y = rnorm(10),
  z = runif(10)
)
start <- c("C", "E", "F")
# Flag the rows that start a new block, then number the blocks cumulatively
df2 <- df %>%
  mutate(start = x %in% start,
         group = cumsum(start))
split(df2, df2$group)

Store output of sapply into a data frame?

How can I store the output of sapply() in a data frame where the index value is stored in the first column and its value in the corresponding second column? For illustration, I have shown only 2 elements here, but there are 110 columns in my data. "loan" is the data frame.
cols <- sapply(loan,function(x) sum(is.na(x)))
cols
id
0
member_id
7
I want output as:
var value
id 0
member_id 7
I know that sapply() returns a vector, but when I print the vector, values are printed along with its some "index" e.g., column name if applied on a data frame. So, now when I want to store it as a data frame with two columns where 1st column contains the index part and the second column contains the value, how can I do it?
I found an answer to my question. For those who actually did understand my problem, this answer might make sense:
cols <- data.frame(sapply(loan ,function(x) sum(is.na(x))))
cols <- cbind(variable = row.names(cols), cols)
I wanted the row.names to be in a column of the same data frame corresponding to the values obtained from sapply.
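The same result can be reached directly from the names of the vector that sapply() returns. A minimal sketch follows, using a made-up loan data frame (the real one has 110 columns, and the counts here are not the ones shown in the question); the name out is hypothetical.

```r
# Hypothetical stand-in for the real `loan` data frame
loan <- data.frame(id = 1:3, member_id = c(NA, 5, NA))

# Named vector: one NA count per column
cols <- sapply(loan, function(x) sum(is.na(x)))

# Names go into the first column, counts into the second
out <- data.frame(var = names(cols), value = unname(cols),
                  stringsAsFactors = FALSE)
```

Using unname() keeps the column names from also becoming row names, so the result has plain row numbers.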
We can use stack
stack(mylist)[2:1]
data
mylist <- list(df = 1, rf = 2)
Is this what you want?
Your original list:
L <- c("df",1,"rf",2)
L
[1] "df" "1" "rf" "2"
As a data frame:
N <- length(L)
df <- data.frame( var = L[seq(1,N,2)], value = L[seq(2,N,2)] )
df
var value
1 df 1
2 rf 2
