Dataframe create new column based on other columns - r

I have a dataframe:
df <- data.frame('a'=c(1,2,3,4,5), 'b'=c(1,20,3,4,50))
df
a b
1 1 1
2 2 20
3 3 3
4 4 4
5 5 50
and I want to create a new column based on existing columns. Something like this:
if (df[['a']] == df[['b']]) {
df[['c']] <- df[['a']] + df[['b']]
} else {
df[['c']] <- df[['b']] - df[['a']]
}
The problem is that the if condition is checked only for the first row... If I create a function from the above if statement and then use apply() (or mapply()...), the result is the same.
In Python/pandas I can use this:
df['c'] = df[['a', 'b']].apply(lambda x: x['a'] + x['b'] if (x['a'] == x['b']) \
else x['b'] - x['a'], axis=1)
I want something similar in R. So the result should look like this:
a b c
1 1 1 2
2 2 20 18
3 3 3 6
4 4 4 8
5 5 50 45

One option is ifelse, which is a vectorized version of if/else. If we evaluate this row by row, the if/else shown in the OP's pandas code can be done with either a for loop or lapply/sapply, but that would be inefficient in R.
df <- transform(df, c= ifelse(a==b, a+b, b-a))
df
# a b c
#1 1 1 2
#2 2 20 18
#3 3 3 6
#4 4 4 8
#5 5 50 45
This can be otherwise written as
df$c <- with(df, ifelse(a==b, a+b, b-a))
to create the 'c' column in the original dataset
As the OP wants a similar option in R using if/else
df$c <- apply(df, 1, FUN = function(x) if(x[1]==x[2]) x[1]+x[2] else x[2]-x[1])
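To see why the question's `if` only looks at the first row: `if` expects a single TRUE/FALSE, while `ifelse` evaluates the test element by element. The following is a minimal sketch using the question's data; the name `cond` is introduced here only for illustration. Note that in R 4.2.0 and later, passing a condition of length greater than one to `if` is an error rather than a silent use of the first element.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(1, 20, 3, 4, 50))

# The comparison is itself a vector with one element per row
cond <- df$a == df$b
cond
# [1]  TRUE FALSE  TRUE  TRUE FALSE

# ifelse() picks from the 'yes' or 'no' vector position by position
df$c <- ifelse(cond, df$a + df$b, df$b - df$a)
df$c
# [1]  2 18  6  8 45
```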

Here is a slightly more confusing algebraic method:
df$c <- with(df, b + ((-1)^((a==b)+1) * a))
df
a b c
1 1 1 2
2 2 20 18
3 3 3 6
4 4 4 8
5 5 50 45
The idea is that the "minus" operator is turned on or off based on the test a==b.
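The sign switch can be inspected on its own. This is a small sketch with the question's values; `sign_term` is a name chosen here for illustration and is not part of the original answer.

```r
a <- c(1, 2, 3, 4, 5)
b <- c(1, 20, 3, 4, 50)

# TRUE/FALSE is coerced to 1/0, so the exponent is 2 where a == b and 1 otherwise
sign_term <- (-1)^((a == b) + 1)
sign_term
# [1]  1 -1  1  1 -1

# +a where the columns match, -a where they differ
b + sign_term * a
# [1]  2 18  6  8 45
```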

If you want an apply method, another way is to create a function and apply it with mapply,
fun1 <- function(x, y) if (x == y) {x + y} else {y-x}
df$c <- mapply(fun1, df$a, df$b)
df
# a b c
#1 1 1 2
#2 2 20 18
#3 3 3 6
#4 4 4 8
#5 5 50 45

Using dplyr package:
library(dplyr)
df <- df %>%
mutate(c = if_else(a == b, a + b, b - a))
df
# a b c
# 1 1 1 2
# 2 2 20 18
# 3 3 3 6
# 4 4 4 8
# 5 5 50 45

A solution with apply
myFunction <- function(x){
a <- x[1]
b <- x[2]
#further values ignored (if there are more than 2 columns)
value <- if(a==b) a + b else b - a
#or more complicated stuff
return(value)
}
df$c <- apply(df, 1, myFunction)

Related

Looping through dataframe to filter rows

In a given dataframe, I need to filter the rows on separate columns, one at a time, using the same condition. The following formulation does not work. Any suggestions?
DF <- data.frame(A = c(1,4,99),
B = c(2,5,6),
C = c(3,99,7))
r <- c("A", "C")
for (i in r){
column = as.formula(paste0("DF$",i))
DF<- DF[column != 99,]
print(DF)
}
The desired outputs are the following two:
A B C
1 1 2 3
2 4 5 99
A B C
1 1 2 3
3 99 6 7
We may use
library(dplyr)
library(purrr)
map(r, ~ DF %>%
filter(!! rlang::sym(.x) != 99))
-output
[[1]]
A B C
1 1 2 3
2 4 5 99
[[2]]
A B C
1 1 2 3
2 99 6 7
Or in base R
lapply(r, \(x) subset(DF, DF[[x]] != 99))
[[1]]
A B C
1 1 2 3
2 4 5 99
[[2]]
A B C
1 1 2 3
3 99 6 7
If the goal is to filter, do some calculation, and then remove the result inside a loop
library(data.table)
setDT(DF)
for(nm in r) {
tmp <- DF[DF[[nm]] != 99]
# ... do some calc ...
rm(tmp)
gc()
}
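For comparison, here is a sketch of how the question's own for loop could be made to work in plain base R, assuming one separate result per column is wanted. `DF[[i]]` extracts a column by its name, and the list `results` is a hypothetical container added for this example.

```r
DF <- data.frame(A = c(1, 4, 99),
                 B = c(2, 5, 6),
                 C = c(3, 99, 7))
r <- c("A", "C")

# Collect one filtered copy per column; DF itself is never overwritten,
# so each pass starts again from the full data
results <- list()
for (i in r) {
  results[[i]] <- DF[DF[[i]] != 99, ]
  print(results[[i]])
}
```

Because the original row names are kept, the second result shows rows 1 and 3, as in the desired output.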

Replacing Value in Next Row with Output from Function Before Applying Function to Next Row

I'm looking at trying to apply a function in R to each row, while updating each row with the output of the function from the previous row. I know that's a mouthful, but here's an example. Let's say I had dataframe, df:
df<- data.frame(a=c(10,15,20,25,30), b=c(2,4,5,7,10))
And I had a function, funR, that just took the difference between column a and column b:
funR<- function(argA, argB){
c<- argA-argB
return(c)
}
Now a simplified version of what I'd be going for is let's say I apply the function to the first row and get 10 - 2 = 8. I would then want to replace the second row of column a with this output before applying the function to that row, so instead of 15 - 4 I'd be doing 8 - 4. I would then replace 20 in row 3 with 4, and so on and so on.
Edit to show expected output:
a b
1 10 2
2 8 4
3 4 5
4 -1 7
5 -8 10
Any help would be greatly appreciated!
This is really a one-liner in base R:
Method 1:
for (i in 1:(nrow(df) - 1)) df$a[i + 1] <- df$a[i] - df$b[i];
df;
# a b
#1 10 2
#2 8 4
#3 4 5
#4 -1 7
#5 -8 10
Here we implement the recursion relation a[i+1] = a[i] - b[i] in a simple for loop. The for loop will be very fast, as we directly overwrite existing entries in df.
Method 2
Or alternatively:
df$a <- df$a[1] - cumsum(c(0, df$b))[1:length(df$a)];
df;
# a b
#1 10 2
#2 8 4
#3 4 5
#4 -1 7
#5 -8 10
This is based on the expanded recursion relation, where you can see that e.g. a[4] = a[1] - (b[1] + b[2] + b[3]), and so on.
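A quick way to convince yourself that the two methods agree is to run both on copies of the column. This is only a sketch; `a_loop` and `a_cumsum` are names introduced for the comparison.

```r
df <- data.frame(a = c(10, 15, 20, 25, 30), b = c(2, 4, 5, 7, 10))

# Method 1: explicit recursion a[i + 1] = a[i] - b[i]
a_loop <- df$a
for (i in seq_len(nrow(df) - 1)) a_loop[i + 1] <- a_loop[i] - df$b[i]

# Method 2: closed form a[k] = a[1] - (b[1] + ... + b[k - 1])
a_cumsum <- df$a[1] - cumsum(c(0, df$b))[seq_len(nrow(df))]

a_loop
# [1] 10  8  4 -1 -8
all.equal(a_loop, a_cumsum)
# [1] TRUE
```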
We can also do this with accumulate from purrr
library(purrr)
library(dplyr)
df %>%
mutate(a = accumulate(b[-n()], `-`, .init = a[1]))
# a b
#1 10 2
#2 8 4
#3 4 5
#4 -1 7
#5 -8 10
Here is a faster version if you want to maintain the use of the function funR.
df<- data.frame(a=c(10,15,20,25,30), b=c(2,4,5,7,10))
funR<- function(argA, argB){
n = length(argA)
argC = c(argA[1], argB)
accumdiff <- function(x){
Reduce(function(x1,x2) x1-x2, x, accumulate=TRUE)}
argC = c(argA[1],accumdiff(argC)[c(-1)])
rev(rev(argC)[-1])
}
df$a <- funR(df$a, df$b)
df
# a b
# 1 10 2
# 2 8 4
# 3 4 5
# 4 -1 7
# 5 -8 10

understanding apply and outer function in R

Suppose I have data that looks like this
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID has changed its value of variable A over the period given by variable B (which runs from 1 to 4), it goes into data frame K, and if it hasn't, it goes into data frame L.
so in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if then else statement it can be solved like following
for ( i in 1:length(ID)){
m=0
for (j in 1: length(B)){
ifelse( A[j] == A[j+1],m,m=m+1)
}
ifelse(m=0, L=c[,df[i]], K=c[,df[i]])
}
I have read in some posts that in R nested loops can be replaced by the apply and outer functions. Can someone help me understand how they can be used in such circumstances?
You don't need a loop with conditions here. All you need to do is check whether A varies within each ID (that is, over each cycle of B): convert A to a numeric value, compute its variance per ID, and turn the result into a logical with ! (zero variance becomes TRUE). I'm assuming A is a factor in your real data set; if it is not, you can use FUN = function(x) length(unique(x)) within ave instead. Then split accordingly. With base R we can use ave for this task, for example
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather than a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
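Similarly with data.table (again converting the factor A to numeric, since var() does not accept a factor)
library(data.table)
setDT(df)[, indx := !var(as.numeric(A)), by = ID]
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(indx = !var(as.numeric(A))) %>%
split(., .$indx)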
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through all df$A grouped by df$ID, and apply the function on df$A within each groupings (i.e. the x in the embedded function): if the number of unique values is more than 1, it's "K", otherwise it's "L".
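The tapply result has one label per ID, so it still has to be mapped back to the rows before splitting. A minimal sketch of that step, with the question's data typed in by hand; `labels`, `row_label` and `parts` are names assumed here for illustration.

```r
df <- data.frame(ID = rep(1:3, each = 4),
                 A  = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
                 B  = rep(1:4, times = 3),
                 C  = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10))

# One label per ID: "K" if A changes within the ID, "L" otherwise
labels <- tapply(df$A, df$ID, function(x) if (length(unique(x)) > 1) "K" else "L")

# Look up each row's label through its ID (the names of 'labels' are the IDs)
row_label <- as.vector(labels[as.character(df$ID)])

# A list with two data frames, named "K" and "L"
parts <- split(df, row_label)
```

Here `parts$K` holds the rows for IDs 1 and 2, and `parts$L` the rows for ID 3.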
We can do this using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1 and create a column 'ind' based on that. We can then split the dataset by that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. 'ID' and 'A'. Double negation (!!) converts the counts to logical, so '0' becomes FALSE and every non-zero frequency becomes TRUE. The rowSums of that matrix give the number of distinct 'A' values per 'ID', and comparing with 1 yields the logical vector 'indx'. We then match the 'ID' column in 'df1' with the names of 'indx' to get one TRUE/FALSE per row, and split the dataset by that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
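The intermediate steps of the table approach can be looked at one by one. This sketch rebuilds only the 'ID' and 'A' columns of the example data; `tab` and `n_distinct_A` are names introduced here for illustration.

```r
df1 <- data.frame(ID = rep(1:3, each = 4),
                  A  = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"))

# Counts of each ID/A combination (rows are IDs, columns are values of A)
tab <- table(df1[1:2])

# !! turns counts into TRUE (seen at least once) or FALSE (never seen);
# summing across a row counts the distinct A values for that ID
n_distinct_A <- rowSums(!!tab)
n_distinct_A
# 1 2 3 
# 3 2 1 

# TRUE where A takes more than one value within the ID
indx <- n_distinct_A > 1
```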
If we need to get individual datasets on the global environment, change the names of the list elements to the object names we wanted and use list2env (not recommended though)
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)

multiplication in matrix R

Look, what I want to do: [In Excel is clear and easy, but in R I have a problem...:(]
Column A 1 2 3 4 5
Column B 0 9 2 1 7
That's my real "algorithm":
Column C
(first value) = mean(Column A) = 3
(second value) = ((mean(Column A)*4) + 0)/5 = 2,4
(third value) = ((second value*4) + 9)/5 = 3,72
etc.
So we have:
# A B C
# 1 1 0 3
# 2 2 9 2,4
# 3 3 2 3,72
# 4 4 1 3,37
# 5 5 7 2,90
This is my actual code with your suggestion:
a <- c(1:5)
b <- c(0,9,0,1,7,0)
matrix <- data.frame(A=a,B=b)
matrix <- c(mean(matrix$A), (cumsum(matrix$B) + (mean(matrix$A)*4))/5)
This is the result: 2.4 4.2 4.2 4.4 5.8 (WRONG !!)
Of course R gives me the error "replacement has 6 rows, data has 5", but this isn't relevant... I only want to know how I should do it.
You could use ?cumsum:
a <- 1:5
b <- c(0, 9, 2, 1, 7)
mean(a) + cumsum(b)
# [1] 3 12 14 15 22
UPDATE:
It seems you want to run a (weighted) moving average. Maybe you should have a look at the TTR package.
Please find an easy approach below:
wma <- function(b, startValue, a=4/5) {
m <- double(length(b)+1)
m[1] <- startValue
for (i in seq(along=b)) {
m[i+1] <- a * m[i] + (1-a) * b[i]
}
return(m)
}
wma(b, mean(a))
# [1] 3.00000 2.40000 3.72000 3.37600 2.90080 3.72064
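The same recursion can also be written without an explicit loop by using base R's Reduce with accumulate = TRUE. This is a sketch of an alternative, not the answer's own code; `w` and `m` are names chosen here, and `w` plays the role of the argument `a` in wma().

```r
a <- 1:5
b <- c(0, 9, 2, 1, 7)
w <- 4 / 5  # weight on the previous value

# Each step computes w * previous + (1 - w) * current element of b,
# starting from mean(a) and keeping every intermediate value
m <- Reduce(function(prev, x) w * prev + (1 - w) * x,
            b, init = mean(a), accumulate = TRUE)
m
# [1] 3.00000 2.40000 3.72000 3.37600 2.90080 3.72064

# The question's column C has one value per row, i.e. the first length(b) values
head(m, length(b))
```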
This solves your issue:
mydf<-data.frame(A=1:5, B=c(0,9,2,1,7))
mydf$C<-cumsum(mydf$B)+mean(mydf$A)
mydf
# A B C
# 1 1 0 3
# 2 2 9 12
# 3 3 2 14
# 4 4 1 15
# 5 5 7 22
Hope it helps.

How to assign a counter to a specific subset of a data.frame which is defined by a factor combination?

My question is: I have a data frame with some factor variables. I now want to assign a new vector to this data frame, which creates an index for each subset of those factor variables.
data <-data.frame(fac1=factor(rep(1:2,5)), fac2=sample(letters[1:3],10,rep=T))
Gives me something like:
fac1 fac2
1 1 a
2 2 c
3 1 b
4 2 a
5 1 c
6 2 b
7 1 a
8 2 a
9 1 b
10 2 c
And what I want is a combination counter which counts the occurrence of each factor combination. Like this
fac1 fac2 counter
1 1 a 1
2 2 c 1
3 1 b 1
4 2 a 1
5 1 c 1
6 2 b 1
7 1 a 2
8 2 a 2
9 1 b 2
10 2 c 2
So far I thought about using tapply to get the counter over all factor-combinations, which works fine
counter <-tapply(data$fac1, list(data$fac1,data$fac2), function(x) 1:length(x))
But I do not know how I can assign the counter list (e.g. unlisted) to the combinations in the data-frame without using inefficient looping :)
This is a job for the ave() function:
# Use set.seed for reproducible examples
# when random number generation is involved
set.seed(1)
myDF <- data.frame(fac1 = factor(rep(1:2, 7)),
fac2 = sample(letters[1:3], 14, replace = TRUE),
stringsAsFactors=FALSE)
myDF$counter <- ave(myDF$fac2, myDF$fac1, myDF$fac2, FUN = seq_along)
myDF
# fac1 fac2 counter
# 1 1 a 1
# 2 2 b 1
# 3 1 b 1
# 4 2 c 1
# 5 1 a 2
# 6 2 c 2
# 7 1 c 1
# 8 2 b 2
# 9 1 b 2
# 10 2 a 1
# 11 1 a 3
# 12 2 a 2
# 13 1 c 2
# 14 2 b 3
Note the use of stringsAsFactors=FALSE in the data.frame() step. If you didn't have that, you can still get the output with: myDF$counter <- ave(as.character(myDF$fac2), myDF$fac1, myDF$fac2, FUN = seq_along).
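If a numeric counter is preferred, one possible variant is to give ave() an integer vector as its first argument, since ave() returns a result of the same type as that argument. The sketch below uses the ten rows printed in the question instead of a random sample, so the result does not depend on the random number generator.

```r
data <- data.frame(fac1 = factor(rep(1:2, 5)),
                   fac2 = c("a", "c", "b", "a", "c", "b", "a", "a", "b", "c"))

# seq_along(data$fac1) is only an integer placeholder;
# ave() overwrites it group by group with 1, 2, 3, ...
data$counter <- ave(seq_along(data$fac1), data$fac1, data$fac2, FUN = seq_along)
data$counter
# [1] 1 1 1 1 1 1 2 2 2 2
```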
A data.table solution
library(data.table)
DT <- data.table(data)
DT[, counter := seq_len(.N), by = list(fac1, fac2)]
This is a base R way that avoids (explicit) looping.
data$counter <- with(data, {
inter <- as.character(interaction(fac1, fac2))
names(inter) <- seq_along(inter)
inter.ordered <- inter[order(inter)]
counter <- with(rle(inter.ordered), unlist(sapply(lengths, sequence)))
counter[match(names(inter), names(inter.ordered))]
})
Here a variant with a little looping (I have renamed your variable to "x" since "data" is being used otherwise):
x <-data.frame(fac1=rep(1:2,5), fac2=sample(letters[1:3],10,rep=T))
x$fac3 <- paste( x$fac1, x$fac2, sep="" )
x$ctr <- 1
y <- table( x$fac3 )
for( i in 1 : length( rownames( y ) ) )
x$ctr[x$fac3 == rownames(y)[i]] <- 1:length( x$ctr[x$fac3 == rownames(y)[i]] )
x <- x[-3]
No idea whether this is efficient over a large data.frame but it works!

Resources