I'm trying to apply a function to a data frame using ddply from the plyr package, but I'm getting some results that I don't understand. I have 3 questions about the results.
Given:
mydf<- data.frame(c(12,34,9,3,22,55),c(1,2,1,1,2,2)
, c(0,1,2,1,1,2))
colnames(mydf)[1] <- 'n'
colnames(mydf)[2] <- 'x'
colnames(mydf)[3] <- 'x1'
mydf looks like this:
n x x1
1 12 1 0
2 34 2 1
3 9 1 2
4 3 1 1
5 22 2 1
6 55 2 2
Question #1
If I do:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
mydf <- ddply(mydf, c("x") , .fun = k, .inform = TRUE)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "z", value = structure(c(12, 34, 9, :
replacement has 3 rows, data has 6
Error: with piece 1:
n x x1
1 12 1 0
2 9 1 2
3 3 1 1
I get this error regardless of whether I specify the variable to split by as c("x"), "x", or .(x). I don't understand why I'm getting this error message.
Question #2
But, what I really want to do is set up an if/else function because my dataset has variables x1, x2, x3, and x4 and I want to take those variables into account as well. But when I try something simple such as:
j <- function(x) {
if(x == 1){
mydf$z <- 0
} else {
mydf$z <- mydf$n
}
return(mydf)
}
mydf <- ddply(mydf, x, .fun = j, .inform = TRUE)
I get:
Warning messages:
1: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
2: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
Question #3
I'm confused about when to use function() and when to use function(x). Using function() for either j() or k() gives me a different error:
Error in .fun(piece, ...) : unused argument (piece)
Error: with piece 1:
n x x1 z
1 12 1 0 12
2 9 1 2 9
3 3 1 1 3
4 12 1 0 12
5 9 1 2 9
6 3 1 1 3
7 12 1 0 12
8 9 1 2 9
9 3 1 1 3
10 12 1 0 12
11 9 1 2 9
12 3 1 1 3
where column z is not correct. Yet I see a lot of functions written as function().
I sincerely appreciate any comments that can help me out with this
There's a lot that needs explaining here. Let's start with the simplest case. In your first example, all you need is:
mydf$z <- with(mydf,ifelse(x == 1,0,n))
An equivalent ddply solution might look like this:
ddply(mydf,.(x),transform,z = ifelse(x == 1,0,n))
Probably your biggest source of confusion is that you seem to not understand what is being passed as arguments to functions within ddply.
Consider your first attempt:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
The way ddply works is that it splits mydf up into several smaller data frames, based on the values in the column x. That means that each time ddply calls k, the argument passed to k is a data frame. Specifically, it is a subset of your primary data frame.
So within k, x is a subset of mydf, with all the columns. You should not be trying to modify mydf from within k. Modify x, and then return the modified version. (That is, if you must write your own function; the options I showed above are better.) So we might rewrite your k like this:
k <- function(x) {
x$z <- ifelse(x$x == 1, 0, x$n)
return (x)
}
Note that you've made things confusing by using x both as the argument to k and as the name of one of your columns.
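To see what the function actually receives, here is a rough base R sketch of the split-apply-combine pattern that ddply follows. It only approximates what plyr does internally, and the names add_z, pieces and result are made up for illustration:

```r
mydf <- data.frame(n  = c(12, 34, 9, 3, 22, 55),
                   x  = c(1, 2, 1, 1, 2, 2),
                   x1 = c(0, 1, 2, 1, 1, 2))

# Hypothetical helper: it receives ONE piece (a data frame) per call,
# so it works on that piece and never touches the global mydf.
add_z <- function(piece) {
  piece$z <- ifelse(piece$x == 1, 0, piece$n)
  piece
}

pieces <- split(mydf, mydf$x)                     # one data frame per value of x
result <- do.call(rbind, lapply(pieces, add_z))   # apply, then stack the pieces
rownames(result) <- NULL
result
```

Each element of pieces has only the rows for one value of x, which is why assigning a 3-row result into the 6-row mydf failed in the original k.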
I have several variables in my dataset that need to be recoded in exactly the same way, and several other variables that need to be recoded in a different way. I tried writing a function to help me with this, but I'm having trouble.
library(dplyr)
recode_liberalSupport = function(arg1){
arg1 = recode(arg1, "1=-1;2=1;else=NA")
return(arg1)
}
liberals = c(df$var1, df$var4, df$var8)
for(i in unique(liberals)){
paste(df$liberals[i] <- sapply(liberals, FUN = recode_liberalSupport))
}
RStudio works on this for about 5 minutes, then gives me this error message:
Error in `$<-.data.frame`(`*tmp*`, liberals, value = c(NA_real_, NA_real_, :
replacement has 9 rows, data has 64600
In addition: Warning messages:
1: Unknown or uninitialised column: 'liberals'.
2: In df$liberals[i] <- sapply(liberals, FUN = recode_liberalSupport) :
number of items to replace is not a multiple of replacement length
Any help would be really appreciated! Thank you
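I think this is neater with dplyr. Using recode correctly is a good idea: dplyr's recode() takes the old = new pairs as named arguments, not as a single string such as "1=-1;2=1;else=NA" (that is the syntax of recode() in the car package). mutate_all() can be used to operate on the whole data frame, mutate_at() on just selected variables. There are lots of ways to specify variables in dplyr.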
mydata <- data.frame(arg1=c(1,2,4,5),arg2=c(1,1,2,0),arg3=c(1,1,1,1))
mydata
arg1 arg2 arg3
1 1 1 1
2 2 1 1
3 4 2 1
4 5 0 1
mydata <- mydata %>%
mutate_at(c("arg1","arg2"), funs(recode(., `1`=-1, `2`=1, .default = NaN)))
mydata
arg1 arg2 arg3
1 -1 -1 1
2 1 -1 1
3 NaN 1 1
4 NaN NaN 1
I use NaN instead of NA because it is numeric, which makes it simpler to manage within a column of other numbers.
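As far as I know, funs() is deprecated in newer versions of dplyr and mutate_at() has been superseded by across() (dplyr 1.0.0 and later). A sketch of the same recode under that assumption, starting again from the original mydata:

```r
library(dplyr)

mydata <- data.frame(arg1 = c(1, 2, 4, 5),
                     arg2 = c(1, 1, 2, 0),
                     arg3 = c(1, 1, 1, 1))

# Recode only arg1 and arg2; arg3 is left untouched.
# Assumes dplyr >= 1.0.0 for across().
mydata <- mydata %>%
  mutate(across(c(arg1, arg2),
                ~ recode(.x, `1` = -1, `2` = 1, .default = NaN)))
mydata
```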
As always there are many ways of doing this. I don't know dplyr well enough to use that function, but this seems to be what you are looking for.
mydata <- data.frame(arg1=c(1,2,4,5),arg2=c(1,1,2,0))
mydata
arg1 arg2
1 1 1
2 2 1
3 4 2
4 5 0
Function to recode using a nested ifelse()
recode_liberalSupport <- function(var = "arg1", data = mydata) {
  # use the data argument rather than reaching for the global mydata
  recoded <- ifelse(data[[var]] == 1, -1,
                    ifelse(data[[var]] == 2, 1, NA))
  return(recoded)
}
Call the function
recode_liberalSupport(var = "arg1")
[1] -1 1 NA NA
Replace the variable arg1 with recoded values.
mydata$arg1 <- recode_liberalSupport(var = "arg1")
mydata
arg1 arg2
1 -1 1
2 1 1
3 NA 2
4 NA 0
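Since the question was about recoding several variables the same way, the same nested ifelse() can be applied to a set of columns in one step. This is only a sketch: recode_vec and cols are hypothetical names, and the helper takes a vector rather than a column name:

```r
mydata <- data.frame(arg1 = c(1, 2, 4, 5),
                     arg2 = c(1, 1, 2, 0),
                     arg3 = c(1, 1, 1, 1))

# Hypothetical helper: recodes one vector (1 -> -1, 2 -> 1, anything else -> NA)
recode_vec <- function(v) ifelse(v == 1, -1, ifelse(v == 2, 1, NA))

# Columns that share this recoding; the others are left alone
cols <- c("arg1", "arg2")
mydata[cols] <- lapply(mydata[cols], recode_vec)
mydata
```

A second group of variables with a different recoding would get its own helper and its own vector of column names.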
I have a vector that I want to modify so that it contains only elements that are equal to or larger than the previous element. The vector represents a phenomenon that should only increase or stay the same (e.g. cumulative deaths by day), but reporting errors result in elements that are less than the previous element. I want to correct this by replacing elements with previous ones until the vector meets the aforementioned criteria.
raw data : 1 3 3 6 8 10 7 9 15 12
desired modified data: 1 3 3 6 6 6 7 9 9 12
library(zoo)
library(magrittr)  # provides %>%
raw <- c(1, 3, 3, 6, 8, 10, 7, 9, 15, 12)
replace.errors <- function(x){
x %>%
replace(diff(x) < 0, NA) %>%
na.locf(na.rm=FALSE)
}
replace.errors(raw)
# [1] 1 3 3 6 8 8 7 9 9 12
My function does not work if multiple sequential elements in a row need to be replaced (8 and 10), as it just pulls forward an element that is still greater than the next one.
A data.table option using nafill along with cummin
nafill(replace(raw, rev(cummin(rev(raw))) != raw, NA), type = "locf")
gives
> nafill(replace(raw, rev(cummin(rev(raw))) != raw, NA), type = "locf")
[1] 1 3 3 6 6 6 7 9 9 12
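The key step is rev(cummin(rev(raw))), which gives, for each position, the smallest value from that position to the end of the vector. Any element larger than that "suffix minimum" is followed by something smaller and so must be an error. A small base R sketch of the intermediate values (the names suffix_min and bad are mine):

```r
raw <- c(1, 3, 3, 6, 8, 10, 7, 9, 15, 12)

# Smallest value from each position to the end of the vector
suffix_min <- rev(cummin(rev(raw)))
suffix_min
# [1]  1  3  3  6  7  7  7  9 12 12

# Elements that exceed a later value are the ones to blank out
bad <- raw != suffix_min
which(bad)
# [1] 5 6 9
```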
Following the same idea, your function replace.errors can be defined as
replace.errors <- function(x){
x %>%
replace(rev(cummin(rev(.))) != (.), NA) %>%
na.locf()
}
such that
> replace.errors(raw)
[1] 1 3 3 6 6 6 7 9 9 12
Another option is to define a function like the one below
f <- function(v) {
for (k in which(c(FALSE, diff(v) < 0))) {
p <- max(v[v < v[k]])
v <- replace(v, tail(which(v == p), 1):(k - 1), p)
}
v
}
which gives
> f(raw)
[1] 1 3 3 6 6 6 7 9 9 12
Base R using @ThomasIsCoding's brilliant replace logic:
# Replace values breaching condition with NA: scrubbed => integer vector
scrubbed <- replace(raw, rev(cummin(rev(raw))) != raw, NA_integer_)
# i) Interpolate constants:
res <- na.omit(scrubbed)[cumsum(!is.na(scrubbed))]
# OR
# ii) Interpolate constants using approx()
res <- approx(scrubbed, method = "constant", n = length(scrubbed))$y
Or in one expression:
approx(
replace(raw, rev(cummin(rev(raw))) != raw, NA_integer_),
method = "constant",
n = length(raw)
)$y
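The cumsum() indexing trick in option i) may not be obvious, so here is a sketch of what each piece evaluates to. The names kept and pos are mine, and I use plain logical subsetting instead of na.omit() to keep the example minimal:

```r
raw <- c(1, 3, 3, 6, 8, 10, 7, 9, 15, 12)
scrubbed <- replace(raw, rev(cummin(rev(raw))) != raw, NA)

kept <- scrubbed[!is.na(scrubbed)]   # the values that survive
pos  <- cumsum(!is.na(scrubbed))     # index of the last surviving value so far
pos
# [1] 1 2 3 4 4 4 5 6 6 7

res <- kept[pos]                     # each NA picks up the previous kept value
res
# [1]  1  3  3  6  6  6  7  9  9 12
```

This relies on the first element never being NA, which holds here because the first element can only be flagged if something smaller follows it.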
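This smells a bit inefficient, but it may still be the best option. Note that it drops the offending elements instead of replacing them, so the result is shorter than the input: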
replace_errors <- function(raw) {
while (is.unsorted(raw)) {
raw <- raw[c(TRUE, diff(raw) >= 0)]
}
raw
}
I am trying to recreate a Stata code snippet in R and I have hit a snag.
In Stata, the lag function gives this result when applied:
A B
1 2
1 2
1 2
1 2
replace A=B if A==A[_n-1]
A B
1 2
2 2
1 2
2 2
If I try to replicate in R I get the following:
temp <- data.frame("A" = rep(1,4), "B" = rep(2,4))
temp
A B
1 2
1 2
1 2
1 2
temp <- temp %>% mutate(A = ifelse(A==lag(A,1),B,A))
temp
A B
2 2
2 2
2 2
2 2
I need it to be the same as in Stata.
lag cannot be used here because it works on the original values in A, whereas the question needs the most recently updated value at each step.
Define an Update function and apply it using accumulate2 in the purrr package. It returns a list so unlist it.
library(purrr)
Update <- function(prev, A, B) if (A == prev) B else A
transform(temp, A = unlist(accumulate2(A, B[-1], Update)))
giving:
A B
1 1 2
2 2 2
3 1 2
4 2 2
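If you would rather avoid the purrr dependency, base R's Reduce() with accumulate = TRUE follows the same pattern of carrying the previous result forward. This is a sketch rather than a drop-in replacement; step and new_A are hypothetical names, and the function walks over row indices so it can look up both A and B:

```r
temp <- data.frame(A = rep(1, 4), B = rep(2, 4))

# prev is the value already chosen for row i - 1; i is the current row
step <- function(prev, i) if (temp$A[i] == prev) temp$B[i] else temp$A[i]

# Start from A[1] and fold over rows 2..n, keeping every intermediate result
new_A <- Reduce(step, seq_len(nrow(temp))[-1], temp$A[1], accumulate = TRUE)
new_A
# [1] 1 2 1 2
```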
Another way to write this uses fn$ in gsubfn which causes formula arguments to be interpreted as functions. The function that it builds uses the free variables in the formula as the arguments in the order encountered.
library(gsubfn)
library(purrr)
transform(temp, A = unlist(fn$accumulate2(A, B[-1], ~ if (prev == A) B else A)))
Also note the comments below this answer for another variation.
Looks like we need to update after each run
for(i in 2:nrow(temp)) temp$A[i] <- if(temp$A[i] == temp$A[i-1])
temp$B[i] else temp$A[i]
temp
# A B
#1 1 2
#2 2 2
#3 1 2
#4 2 2
Or, as @G.Grothendieck mentioned in the comments, it can be made more compact with
for(i in 2:nrow(temp)) if (temp$A[i] == temp$A[i-1]) temp$A[i] <- temp$B[i]
Here's a function that will do it:
lagger <- function(x,y){
current = x[1]
out = x
for(i in 2:length(x)){
if(x[i] == current){
out[i] = y[i]
}
current = out[i]
}
out
}
lagger(temp$A, temp$B)
[1] 1 2 1 2
With a vector, say:
a <- c(1,1,1, 2,2,2,2, 1,1, 3,3,3,3)
I need to know
at which index values the element values changed
the element value before/after change
Using the above example, the output would look something like:
i before after
3 1 2
7 2 1
9 1 3
I suppose I can convert the values into a data.frame and lag/shift the column but am wondering if there is a better approach. I am also trying to avoid looping over the vector, if possible.
I also don't need the results in data.frame / spreadsheet format; feel free to propose a different output format.
Here is an option
idx <- which(a != c(a[-1], NA))
data.frame(
i = idx,
before = a[idx],
after = a[idx + 1])
# i before after
#1 3 1 2
#2 7 2 1
#3 9 1 3
You can roll everything into a function
f <- function(x) {
idx <- which(x != c(x[-1], NA))
data.frame(
i = idx,
before = x[idx],
after = x[idx + 1])
}
f(a)
giving the same output as above.
with(rle(a), data.frame(i = head(cumsum(lengths), -1),
before = head(values, -1),
after = values[-1]))
# i before after
#1 3 1 2
#2 7 2 1
#3 9 1 3
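The rle() version works because run-length encoding already stores the value of each run and how long it lasts; the change points are simply where one run ends. A short sketch of the intermediate pieces (the names r and ends are mine):

```r
a <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3, 3)

r <- rle(a)
r$lengths   # how long each run is
# [1] 3 4 2 4
r$values    # the value held during each run
# [1] 1 2 1 3

ends <- cumsum(r$lengths)   # index of the last element of each run
ends
# [1]  3  7  9 13
```

Dropping the last entry of ends (the end of the vector) leaves the indices 3, 7 and 9, with consecutive pairs of r$values giving the before and after columns.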