I have some noisy data containing numbers, NAs and characters, and I need to compute the percentage change between two columns.
To do that, I used a regular expression to check whether the % symbol is present in a column. If it is, I extract the numbers and subtract them to get the change in percentage points; otherwise I extract the numbers and compute the relative change.
Below is the reproducible code:
df = data.frame(Actual = c('0.10%','55.10%',NA,'20.8B'),
Previous = c('-0.50%','47.90%',NA,'16.6B'))
df
Actual Previous
1 0.10% -0.50%
2 55.10% 47.90%
3 <NA> <NA>
4 20.8B 16.6B
# if loop to calculate percentage change
if(grepl("%", df$Actual) & grepl("%", df$Previous)) {
a = as.numeric(stringr::str_extract(df$Actual,"[-\\d.][\\d]"))
p = as.numeric(stringr::str_extract(df$Previous,"[-\\d.][\\d]"))
df$Gain = a - p
} else {
a = as.numeric(stringr::str_extract(df$Actual,"[-\\d.][\\d]"))
p = as.numeric(stringr::str_extract(df$Previous,"[-\\d.][\\d]"))
df$Gain = (a - p)/p * 100
}
df
Actual Previous Gain
1 0.10% -0.50% 0.6
2 55.10% 47.90% 7.2
3 <NA> <NA> <NA>
4 20.8B 16.6B 4.2
The last value should be 25.30 instead of 4.2.
The value of the if condition is:
grepl("%", df$Actual) & grepl("%", df$Previous)
[1] TRUE TRUE FALSE FALSE
The last row should go through the else branch. Can you help me find the mistake in my code?
You could use the parse_number() function from the readr package (one of the tidyverse packages) in combination with an if_else() condition to achieve what you want.
Using:
library(readr)
library(dplyr)
df %>%
mutate(gain = (parse_number(Actual) - parse_number(Previous)) /
if_else(grepl('%', Actual), 1, parse_number(Previous)/100) )
gives:
Actual Previous gain
1 0.10% -0.50% 0.6000
2 55.10% 47.90% 7.2000
3 <NA> <NA> NA
4 20.8B 16.6B 25.3012
A non-dplyr approach could be:
df = data.frame(Actual = c('0.10%','55.10%',NA,'20.8B'),
Previous = c('-0.50%','47.90%',NA,'16.6B'), stringsAsFactors = FALSE)
df
percChange <- function(x) {
if (all(grepl("%", x))){
d <- diff(rev(as.numeric(gsub("[^-\\d{1,2}.\\d+]", "", x, perl = TRUE))))
}
else {
n <- rev(as.numeric(gsub("[^-\\d{1,2}.\\d+]", "", x, perl = TRUE)))
d <- diff(n) / n[1] * 100
}
return (d)
}
df$diff <- apply(df, 1, percChange)
df
Actual Previous diff
1 0.10% -0.50% 0.6000
2 55.10% 47.90% 7.2000
3 <NA> <NA> NA
4 20.8B 16.6B 25.3012
Also, regarding what is wrong with your if statement: running it produces the following warning (since R 4.2.0 a condition of length greater than one is an error instead):
Warning message:
In if (grepl("%", df$Actual) & grepl("%", df$Previous)) { :
the condition has length > 1 and only the first element will be used
This means that only the first element of the condition (which is TRUE because the first row has % values in both columns) is used, so the first branch is applied to every row. That is why your outcome in row 4 is 20 - 16 = 4. You have to evaluate the condition row by row (or use a vectorised ifelse()) to prevent this.
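As a minimal sketch of the vectorised alternative (base R only; the helper name to_num is made up for this illustration, and it assumes each cell holds a single number plus an optional suffix such as % or B), the condition can be evaluated for every row with ifelse():

```r
df <- data.frame(Actual = c("0.10%", "55.10%", NA, "20.8B"),
                 Previous = c("-0.50%", "47.90%", NA, "16.6B"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: drop everything except digits, dots and minus signs
to_num <- function(x) as.numeric(gsub("[^0-9.-]", "", x))

a <- to_num(df$Actual)
p <- to_num(df$Previous)

# One TRUE/FALSE per row; grepl() returns FALSE for NA
is_pct <- grepl("%", df$Actual) & grepl("%", df$Previous)

# Percentage rows: plain difference; other rows: relative change in percent
df$Gain <- ifelse(is_pct, a - p, (a - p) / p * 100)
df
```

The NA row stays NA because both branches are NA for that row, and row 4 now goes through the relative-change branch.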
I have a function that I want to iterate over only certain rows of my dataset, and then save the results in a variable in the dataset.
So for example say I have this set up:
library(tidyverse)
add_one <- function(vector, x_id){
return(vector[x_id] + 1)
}
test <- data.frame(x = c(1,2,3,4), y = c(1,2,3,4), run_on = c(TRUE,FALSE,TRUE,FALSE))
test
So the test data frame looks like:
> x y run_on
>1 1 1 TRUE
>2 2 2 FALSE
>3 3 3 TRUE
>4 4 4 FALSE
So what I want to do is iterate over the dataframe and set the y column to be the result of applying the function add_one() to the x column for just the rows where run_on is TRUE. I want the end result to look like this:
> x y run_on
>1 1 2 TRUE
>2 2 2 FALSE
>3 3 4 TRUE
>4 4 4 FALSE
I have been able to iterate the function over all of the rows using apply(). So for example:
test$y <- apply(test,1,add_one,x_id = 1)
test
> x y run_on
>1 1 2 TRUE
>2 2 3 FALSE
>3 3 4 TRUE
>4 4 5 FALSE
But this also applies the function to rows 2 and 4, which I do not want. I suspect there may be some way to do this using one of the map() functions from purrr, which is why I tagged this post as such.
In reality, I am using this kind of procedure to repeatedly iterate over a large dataset multiple times, so I need it to be done automatically and cleanly. Any help or suggestions would be very much appreciated.
UPDATE
I managed to find a solution. Some of the solutions offered here did work in my toy example but did not extend to the more complex function I was actually using. Ultimately what worked was something similar to what tmfmnk suggested. I just wrapped the original function inside another function that included an if statement to determine whether or not to apply the original function. So to extend my toy example, my solution looks like this:
add_one_if <- function(vector, x_id, y_id, run_on_id) {
  if (vector[run_on_id]) {
    return(add_one(vector, x_id))
  } else {
    # leave the existing y value untouched when run_on is FALSE
    return(vector[y_id])
  }
}
test$y <- apply(test, 1, add_one_if, x_id = 1, y_id = 2, run_on_id = 3)
It seems a little convoluted, but it worked for me and is reproducible and reliable in the way I need it to be.
You can also do:
add_one <- function(data, vector, x_id, n, is.true = c(TRUE, FALSE)) {
if (is.true) {
return(data[[vector]] + (data[[x_id]]) * n)
} else {
return(data[[vector]] + (!data[[x_id]]) * n)
}
}
add_one(test, vector = "y", x_id = "run_on", 1, is.true = TRUE)
[1] 2 2 4 4
add_one(test, vector = "y", x_id = "run_on", 5, is.true = FALSE)
[1] 1 7 3 9
It may be that your real case is more complicated than this allows, but why not just use ifelse()?
test$y <- ifelse(test$run_on, add_one(test$x, seq_along(test$x)), test$y)
Or even:
test$y[test$run_on] <- add_one(test$x, which(test$run_on))
You won't need purrr unless you are applying the same function to multiple columns. Since you want to modify only one column based on a condition, you can use mutate() + case_when().
mutate(test, y = case_when(run_on ~ add_one(y),
!run_on ~ y))
#> x y run_on
#> 1 1 2 TRUE
#> 2 2 2 FALSE
#> 3 3 4 TRUE
#> 4 4 4 FALSE
For a project I'm working on, I need a data frame that indicates whether a person was absent (1) or not (0) on a particular day.
The problem is that my data is in a format that gives the starting date of the absenteeism and then the number of days the person was absent.
Example of my dataframe:
df1 <- data.frame(Person = c(1,1,1,1,1),
StartDate = c("01-01","02-01","03-01","04-01","05-01"),
DAYS = c(3,NA,NA,NA,1))
Instead of the "Start date" and "number of days absent" per person, it should look like this instead:
df2 <- data.frame(Person = c(1,1,1,1,1),
Date = c("01-01","02-01","03-01","04-01","05-01"),
Absent = c(1,1,1,0,1))
For now I solved it with this for loop with two if-conditions:
for(i in 1:nrow(df1)){
if(!is.na(df1$DAYS[i])){
var <- df1$DAYS[i]
}
if(var > 0){
var <- var-1
df1$DAYS[i] <- 1
}
}
This works, however I have thousands of persons with a full year of dates each, meaning that I have more than 5 million rows in my dataframe. You can imagine how slow the loop is.
Does anyone know a quicker way to solve my problem?
I tried looking at the lubridate package to work with periods and dates, but I don't see a solution there.
Here is an approach based upon generating all the indices of observations that should be set to 1, and then filling in the values.
# The data
df1 <- data.frame(Person = c(1,1,1,1,1),
StartDate = c("01-01","02-01","03-01","04-01","05-01"),
DAYS = c(3,NA,NA,NA,1))
# Initialize the vector we want with zeros
df1$Absent <- 0
# we get the indices of all the non-NA day entries
inds <- which(!is.na(df1$DAYS))
# Now we are going to build a list of all the indices that should be
# set to one. These are the intervals from when absence starts to
# the number of days absent - 1
inds_to_change <- mapply(function(i,d){i:(i+d-1)}, inds, na.omit(df1$DAYS))
df1$Absent[unlist(inds_to_change)] <- 1
df1
#> Person StartDate DAYS Absent
#> 1 1 01-01 3 1
#> 2 1 02-01 NA 1
#> 3 1 03-01 NA 1
#> 4 1 04-01 NA 0
#> 5 1 05-01 1 1
Created on 2019-02-20 by the reprex package (v0.2.1)
A faster solution can be found by using built-in R functions.
The general idea:
For each person, find the positions where the number of absent days is greater than 1. Let the number of absent days be a and the position be p.
In every position defined by the sequence p:(p + a - 1), insert the value 1.
Return the redefined vector in place of the old vector.
This can all be implemented in a function and then applied across all the subgroups.
For this specific case mapply (as the previous answer suggests) works, but data.table will in general be faster for larger data sets. This is used below.
RelocateAbsentees <- function(x){
  #Find the positions in x for which the value is greater than 1
  pos <- which(x > 1)
  #Fill in the vector with the absent days, without running past the end of x
  for(i in pos){
    val <- x[i]
    x[i:min(i + val - 1, length(x))] <- 1
  }
  #return the modified vector (not the positions)
  x
}
df1 <- data.frame(Person = c(1,1,1,1,1),
StartDate = c("01-01","02-01","03-01","04-01","05-01"),
DAYS = c(3,NA,NA,NA,1))
library(data.table)
setDT(df1)
df2 <- copy(df1)[,Person := 2]
df3 <- rbind(df1,df2)
#Using data.table package (faster)
df1[, newDays := RelocateAbsentees(DAYS), by = Person]
df3[, newDays := RelocateAbsentees(DAYS), by = Person]
I found a pretty neat solution using tidyverse:
library(tidyverse)
df1 %>%
group_by(Person) %>%
mutate(Abs = map_dbl(DAYS, ~ {
if (!is.na(.x)) {
d <<- .x
+(d > 0)
} else {
d <<- d - 1
+(d > 0)
}
}))
First of all, your original approach was not so bad. Some minor improvements can make it faster than gfgm's (in my testing; I do not know your exact data structure):
improvedOP <- function(d) {
days <- d$DAYS # so we do not repeatedly change data.frames column
ii <- !is.na(days) # this can be calculated outside the loop
for (i in 1:nrow(d)) {
if (ii[i]) var <- days[i]
if (var > 0) {
var <- var - 1
days[i] <- 1
}
}
return(days)
}
I came up with this approach:
minem <- function(d) {
require(zoo)
rn <- 1:nrow(d) # row numbers
ii <- rn + d$DAYS - 1L # get row numbers which set to 1
ii <- na.locf(ii, na.rm = F) # fill NA forward
ii <- rn <= ii # if row number less or equal than interested row is 1
ii[ii == 0] <- NA # set 0 to NA to match original results
as.integer(ii)
}
all.equal(minem(d), improvedOP(d))
# TRUE
The idea is that we calculate the row number up to which each absence runs (current row + DAYS - 1), fill the NAs forward with this value, and set a row to 1 if its row number is less than or equal to that value. This should be faster than any other approach that involves creating sequences.
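The same "compare each row number with the end of the current absence" idea can also be sketched in base R without zoo. This is only an illustration under assumptions not in the original: the function name absent_flags is hypothetical, it handles the DAYS vector of a single person, it uses cummax() in place of na.locf(), and it returns 0 rather than NA for days on which the person was present.

```r
# Hypothetical helper: days is the DAYS column for one person
absent_flags <- function(days) {
  rn <- seq_along(days)           # row numbers
  end <- rn + days - 1            # last row covered by an absence starting here
  end[is.na(end)] <- 0            # rows without a start cover nothing
  as.integer(rn <= cummax(end))   # 1 while the row is still inside an absence
}

absent_flags(c(3, NA, NA, NA, 1))
# 1 1 1 0 1
```

Unlike filling forward, cummax() keeps the furthest end seen so far, so a short absence that starts inside a longer one does not cut the longer one short. For several persons it would have to be applied per group, for example with ave() or data.table's by.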
Benchmark on larger (7.3 mil rows) simulated data:
gfgm <- function(d) {
days <- rep(0, nrow(d))
inds <- which(!is.na(d$DAYS))
inds_to_change <- mapply(function(i, d) {i:(i + d - 1)}, inds, na.omit(d$DAYS))
days[unlist(inds_to_change)] <- 1
days
}
nrow(d)/1e6 # 7.3 mil
require(bench)
require(data.table)
bm <- bench::mark(minem(d), improvedOP(d), gfgm(d), iterations = 2, check = F)
as.data.table(bm[, 1:7])
# expression min mean median max itr/sec mem_alloc
# 1: minem(d) 281.34ms 302.85ms 302.85ms 324.35ms 3.3019990 408MB
# 2: improvedOP(d) 747.45ms 754.55ms 754.55ms 761.65ms 1.3252907 139MB
# 3: gfgm(d) 3.23s 3.27s 3.27s 3.31s 0.3056558 410MB
P.S. The real results probably depend on the distribution of the DAYS values.
Description
The ifelse() function lets you filter the values in a vector through a series of tests, each of which triggers a different action when it is true. For instance, let xx be a data.frame, as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx
a b
1 1
2 2
1 3
3 4
Suppose that you want to create a new column, c, from column b, but depending on the values in column a in the following way:
For each row,
if the value in column a is 1, the value in column c, is the same value in column b.
if the value in column a is 2, the value in column c, is 100 times the value in column b.
in any other case, the value in column c is the negative of the value in column b.
Using ifelse(), a solution could be:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
-xx$b))
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
Problem 1
An aesthetic problem arises when the number of tests increases, say, four tests:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
ifelse(xx$a==3, ...,
ifelse(xx$a==4, ...,
...))))
I found a partial solution to the problem on this page, which consists of defining the functions if.else_(), i_(), e_(), as follows:
library(lazyeval)
i_ <- function(if_stat, then) {
if_stat <- lazyeval::expr_text(if_stat)
then <- lazyeval::expr_text(then)
sprintf("ifelse(%s, %s, ", if_stat, then)
}
e_ <- function(else_ret) {
else_ret <- lazyeval::expr_text(else_ret)
else_ret
}
if.else_ <- function(...) {
args <- list(...)
for (i in 1:(length(args) - 1) ) {
if (substr(args[[i]], 1, 6) != "ifelse") {
stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
}
}
if (substr(args[[length(args)]], 1, 6) == "ifelse"){
stop("Last argument needs to be an else_ function.", call. = FALSE)
}
args$final <- paste(rep(')', length(args) - 1), collapse = '')
eval_string <- do.call('paste', args)
eval(parse(text = eval_string))
}
In this way, the problem given in the Description, can be rewritten as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
e_(-xx$b)
)
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
And the code for the four tests will simply be:
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
i_(xx$a==3, ...), # dots meaning actions for xx$a==3
i_(xx$a==4, ...), # dots meaning actions for xx$a==4
e_(...) # dots meaning actions for any other case
)
Problem 2 & Question
The given code apparently solves the problem. Then, I wrote the following test function:
test.ie <- function() {
dd <- data.frame(a=c(1,2,1,3), b=1:4)
if.else_(
i_(dd$a==1, dd$b),
i_(dd$a==2, dd$b*100),
e_(-dd$b)
) # it should give c(1, 200, 3, -4)
}
When I tried the test:
test.ie()
it produced the following error message:
Error in ifelse(dd$a == 1, dd$b, ifelse(dd$a == 2, dd$b * 100, -dd$b)) :
object 'dd' not found
Question
Since the if.else_() syntactic constructor is not supposed to run only from the console, is there a way for it to 'know' the variables from the function that calls it?
Note
In "Best way to replace a lengthy ifelse structure in R", a similar problem was posted. However, the given solution there focuses on building the table's new column with the given constant output values (the "then" or "else" slots of the ifelse() function), whereas my case addresses a syntactic problem in which the "then" or "else" slots can even be expressions in terms of other data.frame elements or variables.
I think you can use dplyr::case_when inside dplyr::mutate to achieve this.
library(dplyr)
df <- tibble(a=c(1,2,1,3), b=1:4)
df %>%
mutate(
foo = case_when(
.$a == 1 ~ .$b,
.$a == 2 ~ .$b * 100L,
TRUE ~ .$b * -1L
)
)
#> # A tibble: 4 x 3
#> a b foo
#> <dbl> <int> <int>
#> 1 1 1 1
#> 2 2 2 200
#> 3 1 3 3
#> 4 3 4 -4
In the upcoming release of dplyr 0.6.0 you won't need the awkward workaround of .$, and you can just use:
df %>%
mutate(
foo = case_when(
a == 1 ~ b,
a == 2 ~ b * 100L,
TRUE ~ b * -1L
)
)
Taking into account MrFlick's advice, I re-coded the if.else_() function as follows:
if.else_ <- function(...) {
args <- list(...)
for (i in 1:(length(args) - 1) ) {
if (substr(args[[i]], 1, 6) != "ifelse") {
stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
}
}
if (substr(args[[length(args)]], 1, 6) == "ifelse"){
stop("Last argument needs to be an else_ function.", call. = FALSE)
}
args$final <- paste(rep(')', length(args) - 1), collapse = '')
eval_string <- do.call('paste', args)
eval(parse(text = eval_string), envir = parent.frame())
}
Now the test.ie() function runs properly
test.ie()
[1] 1 200 3 -4
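Why the one-argument change works can be seen in a stripped-down sketch. The function and variable names below are invented for the illustration; only eval(), parse(), parent.frame() and tryCatch() are real base R functions. By default eval() evaluates in the frame of the function that calls eval(), whereas envir = parent.frame() evaluates in the frame of that function's caller:

```r
# Evaluates the text inside eval_here()'s own frame (the default)
eval_here <- function(txt) eval(parse(text = txt))

# Evaluates the text in the frame of whoever called eval_in_caller()
eval_in_caller <- function(txt) eval(parse(text = txt), envir = parent.frame())

demo_caller <- function() {
  only_visible_here <- 41   # local variable, unknown outside demo_caller()
  list(
    in_caller = eval_in_caller("only_visible_here + 1"),
    here = tryCatch(eval_here("only_visible_here + 1"),
                    error = function(e) conditionMessage(e))
  )
}

res <- demo_caller()
res$in_caller   # 42
res$here        # an "object ... not found" error message
```

This mirrors the situation in test.ie(): dd exists only in the calling function, so the generated ifelse() call has to be evaluated there.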
With full respect to the OP's remarkable effort to improve nested ifelse(), I prefer a different approach which I believe is easy to write, concise, maintainable and fast:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
library(data.table)
# coerce to data.table, and set the default first
setDT(xx)[, c:= -b]
xx[a == 1L, c := b] # 1st special case
xx[a == 2L, c := 100L*b] # 2nd special case, note use of integer 100L
# xx[a == 3L, c := ...] # other cases
# xx[a == 4L, c := ...]
#...
xx
# a b c
#1: 1 1 1
#2: 2 2 200
#3: 1 3 3
#4: 3 4 -4
Note that for the 2nd special case b is multiplied by the integer constant 100L to make sure that the right hand sides are all of type integer in order to avoid type conversion to double.
Edit 2: This can also be written in an even more concise (but still maintainable) way as a one-liner:
setDT(xx)[, c := -b][a == 1L, c := b][a == 2L, c := 100L*b][]
data.table chaining works here, because c is updated in place so that subsequent expressions are acting on all rows of xx even if the previous expression was a selective update of a subset of rows.
Edit 1: This approach can be implemented with base R as well:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
xx$c <- -xx$b
idx <- xx$a == 1L; xx$c[idx] <- xx$b[idx]
idx <- xx$a == 2L; xx$c[idx] <- 100L * xx$b[idx]
xx
# a b c
#1 1 1 1
#2 2 2 200
#3 1 3 3
#4 3 4 -4
I have an input like this:
input=c(8,-200,4,0,9,5,-900,10,8,8)
and I want to do the following:
If input<(-100)
replace input and next two values by the mean of the value before and after the values replaced
So that the result should look like this:
result=c(8,8.5,8.5,8.5,9,5,6.5,6.5,6.5,8)
df=data.frame(input, result)
I tried the following which only works if I have only one case in my df:
ind <- which(df$input<(-100))
df$input[ind:ind+2] <- sapply(ind, function(i) with(df, mean(c(input[i-1], input[i+3]))))
For more than one case I get the following warnings:
Warning messages:
1: In ind:ind : numerical expression has 2 elements: only the first used
2: In ind:ind : numerical expression has 2 elements: only the first used
3: In df$input[ind:ind + 2] <- sapply(ind, function(i) with(df, mean(c(input[i - :
number of items to replace is not a multiple of replacement length
I might also have the case where the value x+3 is another value to be replaced:
input2=c(1,1,2,-100,7,0,-200,4,5,6)
In which case I want to skip over the value again and take the next x+3 value (here: the average of 2 and 6) so that:
result2=c(1,1,2,4,4,4,4,4,4,6)
Any help would be appreciated.
Thanks!
Here is a solution (na.locf() comes from the zoo package):
library(zoo)
myfun <- function(input){
  # Replace values by NA
  ind <- which(input < -100)
  ind <- unique(c(ind, ind+1, ind+2))
  ind <- ind[ind<=length(input)]
  input[ind] <- NA
  # Replace NA by the mean of the nearest non-NA value on each side
  input[ind] <- rowMeans(cbind(na.locf(input, fromLast = TRUE, na.rm = FALSE),
                               na.locf(input, fromLast = FALSE, na.rm = FALSE)),
                         na.rm = TRUE)[ind]
  input
}
myfun(c(8,-200,4,0,9,5,-900,10,8,8))
# [1] 8.0 8.5 8.5 8.5 9.0 5.0 6.5 6.5 6.5 8.0
myfun(c(1,1,2,-200,7,0,-200,4,5,6))
# [1] 1 1 2 4 4 4 4 4 4 6
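If zoo is not available, the same replacement can be sketched in base R. This is an illustrative variant rather than the answer's method: the function name fill_outlier_runs and its threshold and width arguments are assumptions, and it looks up the nearest kept value on each side with explicit loops, so it will be slower on long vectors.

```r
# Hypothetical base R variant of the replacement described above
fill_outlier_runs <- function(input, threshold = -100, width = 3) {
  n <- length(input)
  starts <- which(input < threshold)

  # Mark each trigger value and the following width - 1 values
  bad <- rep(FALSE, n)
  for (s in starts) bad[s:min(s + width - 1, n)] <- TRUE

  good <- which(!bad)
  out <- input
  for (i in which(bad)) {
    before <- good[good < i]   # kept positions to the left
    after <- good[good > i]    # kept positions to the right
    neighbours <- c(if (length(before)) input[max(before)],
                    if (length(after)) input[min(after)])
    out[i] <- mean(neighbours) # mean of the nearest kept value on each side
  }
  out
}

fill_outlier_runs(c(8, -200, 4, 0, 9, 5, -900, 10, 8, 8))
# 8.0 8.5 8.5 8.5 9.0 5.0 6.5 6.5 6.5 8.0
fill_outlier_runs(c(1, 1, 2, -200, 7, 0, -200, 4, 5, 6))
# 1 1 2 4 4 4 4 4 4 6
```

Overlapping runs are merged automatically because they share the same nearest kept neighbours, which is the behaviour asked for in the second example.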
I have a well-balanced panel data set which contains NA observations. I will be using LOCF, and would like to know how many consecutive NAs are in each panel before carrying observations forward. LOCF is a procedure whereby missing values can be "filled in" using the "last observation carried forward". This can make sense in some time-series applications; perhaps we have weather data in 5-minute increments: a good guess at the value of a missing observation might be an observation made 5 minutes earlier.
Obviously, it makes more sense to carry an observation forward one hour within one panel than it does to carry that same observation forward to the next year in the same panel.
I am aware that you can set a "maxgap" argument using zoo::na.locf, however, I want to get a better feel for my data. Please see a simple example:
require(data.table)
set.seed(12345)
### Create a "panel" data set
data <- data.table(id = rep(1:10, each = 10),
date = seq(as.POSIXct('2012-01-01'),
as.POSIXct('2012-01-10'),
by = '1 day'),
x = runif(100))
### Randomly assign NA's to our "x" variable
na <- sample(1:100, size = 52)
data[na, x := NA]
### Calculate the max number of consecutive NA's by group...this is what I want:
### ID Consecutive NA's
# 1 1
# 2 3
# 3 3
# 4 3
# 5 4
# 6 5
# ...
# 10 2
### Count the total number of NA's by group...this is as far as I get:
data[is.na(x), .N, by = id]
All solutions are welcomed, but data.table solutions are highly preferred; the data file is large.
This will do it:
data[, max(with(rle(is.na(x)), lengths[values])), by = id]
I just ran rle to find all consecutive NA's and picked the max length.
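To make the one-liner easier to follow, here is a small sketch of what rle(is.na(x)) returns for a single made-up vector (the vector and the variable names are illustrative only; base::rle is called explicitly in case another rle() has been defined in the session):

```r
x <- c(0.3, NA, NA, 0.7, NA, NA, NA, 0.1)

runs <- base::rle(is.na(x))
runs$lengths   # 1 2 1 3 1
runs$values    # FALSE TRUE FALSE TRUE FALSE

# Keep only the lengths of the runs where is.na(x) was TRUE
na_runs <- runs$lengths[runs$values]   # 2 3

# Guard against a group without any NA, where max() would return -Inf
longest_na_run <- if (length(na_runs) > 0) max(na_runs) else 0L
longest_na_run   # 3
```

The guard in the last step is an addition for the sketch; the one-liner above would return -Inf with a warning for a group that has no NAs.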
Here's a rather convoluted answer to the comment question of recovering the date ranges for the above max:
data[, {
tmp = rle(is.na(x));
tmp$lengths[!tmp$values] = 0; # modify rle result to ignore non-NA's
n = which.max(tmp$lengths); # find the index in rle of longest NA sequence
tmp = rle(is.na(x)); # let's get back to the unmodified rle
start = sum(tmp$lengths[0:(n-1)]) + 1; # and find the start and end indices
end = sum(tmp$lengths[1:n]);
list(date[start], date[end], max(tmp$lengths[tmp$values]))
}, by = id]
You can use rle with the modification suggested here (and pasted below) to count NA values.
foo <- data[, rle(x), by=id]
foo[is.na(values), max(lengths), by=id]
# id V1
# 1: 1 1
# 2: 2 3
# 3: 3 3
# 4: 4 3
# 5: 5 4
# 6: 6 5
# 7: 7 3
# 8: 8 5
# 9: 9 2
# 10: 10 2
Amended rle function:
rle<-function (x)
{
if (!is.vector(x)&& !is.list(x))
stop("'x' must be an atomic vector")
n<- length(x)
if (n == 0L)
return(structure(list(lengths = integer(), values = x),
class = "rle"))
#### BEGIN NEW SECTION PART 1 ####
naRepFlag<-FALSE
IS_LOGIC<-FALSE # initialise here so the check in part 2 also works when x has no NAs
if(any(is.na(x))){
naRepFlag<-T
IS_LOGIC<-ifelse(typeof(x)=="logical",T,F)
if(typeof(x)=="logical"){
x<-as.integer(x)
naMaskVal<-2
}else if(typeof(x)=="character"){
naMaskVal<-paste(sample(c(letters,LETTERS,0:9),32,replace=T),collapse="")
}else{
naMaskVal<-max(0,abs(x[!is.infinite(x)]),na.rm=T)+1
}
x[which(is.na(x))]<-naMaskVal
}
#### END NEW SECTION PART 1 ####
y<- x[-1L] != x[-n]
i<- c(which(y), n)
#### BEGIN NEW SECTION PART 2 ####
if(naRepFlag)
x[which(x==naMaskVal)]<-NA
if(IS_LOGIC)
x<-as.logical(x)
#### END NEW SECTION PART 2 ####
structure(list(lengths = diff(c(0L, i)), values = x[i]),
class = "rle")
}
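For context, the amendment is needed because base R's rle() treats every NA as a run of its own, so consecutive NAs are not grouped. A small sketch (the vector v is made up; base::rle is called explicitly so that the amended version above is not picked up):

```r
v <- c(NA, NA, 5, 5)

lengths_raw <- base::rle(v)$lengths
lengths_raw     # 1 1 2: each NA is its own run

lengths_isna <- base::rle(is.na(v))$lengths
lengths_isna    # 2 2: testing is.na() first groups the NAs
```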