I want to subtract 1 from the values of column A if column B is <= 20.
A = c(1,2,3,4,5)
B = c(10,20,30,40,50)
df = data.frame(A,B)
Desired output:
A B
1 0 10
2 1 20
3 3 30
4 4 40
5 5 50
My data is very large, so I would prefer not to use a loop. Is there a computationally efficient method in R?
You can do
df$A[df$B <= 20] <- df$A[df$B <= 20] - 1
# A B
#1 0 10
#2 1 20
#3 3 30
#4 4 40
#5 5 50
We can break this down step-by-step to understand how this works.
First we check which numbers in B are less than or equal to 20, which gives us a logical vector
df$B <= 20
#[1] TRUE TRUE FALSE FALSE FALSE
Using that logical vector we can select the numbers in A
df$A[df$B <= 20]
#[1] 1 2
Subtract 1 from those numbers
df$A[df$B <= 20] - 1
#[1] 0 1
and replace these values for the same indices in A.
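The steps above can be combined by computing the logical index once and reusing it, so the comparison is not evaluated twice on a large data frame. This is a minimal sketch on the example data; the name `idx` is only illustrative.

```r
# Example data from the question
df <- data.frame(A = c(1, 2, 3, 4, 5), B = c(10, 20, 30, 40, 50))

# Compute the logical index once (idx is an illustrative name)
idx <- df$B <= 20

# Subtract 1 only where the condition holds
df$A[idx] <- df$A[idx] - 1

df$A
# [1] 0 1 3 4 5
```

If B may contain NA, indexing with `which(df$B <= 20)` is likely the safer choice, since `which()` drops the NA positions.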
With dplyr we can also use case_when
library(dplyr)
df %>%
mutate(A = case_when(B <= 20 ~ A - 1,
TRUE ~ A))
Another possibility:
df$A <- ifelse(df$B <= 20, df$A - 1, df$A)
And here is a data.table solution:
library(data.table)
setDT(df)
df[B <= 20, A := A - 1]
Related
Could anyone explain how to change the negative values in the below dataframe?
We have been asked to create a data structure to get the below output.
# > df
# x y z
# 1 a -2 3
# 2 b 0 4
# 3 c 2 -5
# 4 d 4 6
Then we have to use control flow operators and/or vectorisation to multiply only the negative values by 10.
I tried many different ways but cannot get this to work. I get an error when I try to use a loop, because of the letters in column x.
Create indices of the negative values and multiply by 10, i.e.
i1 <- which(df < 0, arr.ind = TRUE)
df[i1] <- as.numeric(df[i1]) * 10
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
First find out the numeric columns of the dataframe and multiply the negative values by 10.
cols <- sapply(df, is.numeric)
#Multiply negative values by 10 and positive with 1
df[cols] <- df[cols] * ifelse(sign(df[cols]) == -1, 10, 1)
df
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
Using dplyr -
library(dplyr)
df <- df %>% mutate(across(where(is.numeric), ~. * ifelse(sign(.) == -1, 10, 1)))
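For comparison, the same idea can be sketched in base R without the sign() trick, by applying ifelse() to each numeric column. This assumes the data frame from the question (rebuilt here so the block runs on its own); it is one possible variant, not the method of the answers above.

```r
# Example data from the question
df <- data.frame(x = c("a", "b", "c", "d"),
                 y = c(-2, 0, 2, 4),
                 z = c(3, 4, -5, 6))

# Identify numeric columns so the character column is left alone
cols <- vapply(df, is.numeric, logical(1))

# Multiply negative entries by 10, column by column
df[cols] <- lapply(df[cols], function(v) ifelse(v < 0, v * 10, v))

df$y
# [1] -20   0   2   4
df$z
# [1]   3   4 -50   6
```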
What is the optimal way to get the index of all elements that are repeated # times? I want to identify the elements that are duplicated more than 2 times.
rle() and rleid() both hint to the values I need but neither method directly gives me the indices.
I came up with this code:
t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)
t2 <- dplyr::lag(t1, 1)
t2[is.na(t2)] <- 0
t3 <- ifelse(t1 - t2 == 0, 1, 0)
t4 <- rep(0, length(t3))
for (i in 2:length(t3)) t4[i] <- ifelse(t3[i] > 0, t3[i - 1] + t3[i], 0)
which(t4 > 1)
returns:
[1] 4 11 12
and those are the values I need.
Are there any R-functions that are more appropriate?
Ben
One option with data.table. No real reason to use this instead of lag/shift when n = 2, but for larger n this would save you from creating a large number of new lagged vectors.
library(data.table)
which(rowid(rleid(t1)) > 2)
# [1] 4 11 12
Explanation:
rleid will produce a unique value for each "run" of equal values, and rowid will mark how many elements "into" the run each element is. What you want is elements more than 2 "into" a run.
data.table(
t1,
rleid(t1),
rowid(rleid(t1)))
# t1 V2 V3
# 1: 1 1 1
# 2: 10 2 1
# 3: 10 2 2
# 4: 10 2 3
# 5: 14 3 1
# 6: 37 4 1
# 7: 3 5 1
# 8: 14 6 1
# 9: 8 7 1
# 10: 8 7 2
# 11: 8 7 3
# 12: 8 7 4
# 13: 39 8 1
# 14: 12 9 1
Edit: If, as in the example posed by this question, no two runs (even length-1 "runs") are of the same value (or if you don't care whether the duplicates are next to each other), you can just use which(rowid(t1) > 2) instead. (This is noted by Frank in the comments)
Hopefully this example clarifies the differences
a <- c(1, 1, 1, 2, 2, 1)
which(rowid(a) > 2)
# [1] 3 6
which(rowid(rleid(a)) > 2)
# [1] 3
You can use dplyr::lag or data.table::shift (note that the default for shift is to lag, so shift(t1, 1) is equal to shift(t1, 1, type = "lag")):
which(t1 == lag(t1, 1) & lag(t1, 1) == lag(t1, 2))
# [1] 4 11 12
# Or
which(t1 == shift(t1, 1) & shift(t1, 1) == shift(t1, 2))
# [1] 4 11 12
If you need it to scale for several duplicates you can do the following (thanks for the tip @IceCreamToucan):
n <- 2
df1 <- sapply(0:n, function(x) shift(t1, x))
which(rowMeans(df1 == df1[,1]) == 1)
# [1] 4 11 12
This is usually a case where rle is useful, i.e.
v1 <- rle(t1)
i1 <- seq_along(t1)[t1 %in% v1$values[v1$lengths > 2]]
i2 <- t1[t1 %in% v1$values[v1$lengths > 2]]
tapply(i1, i2, function(i) tail(i, -2))
#$`8`
#[1] 11 12
#$`10`
#[1] 4
You can unlist and get it as a vector,
unlist(tapply(i1, i2, function(i) tail(i, -2)))
#81 82 10
#11 12 4
There is also a function called rleid in data.table package which we can use,
unlist(lapply(Filter(function(i) length(i) > 2, split(seq_along(t1), data.table::rleid(t1))),
function(i) tail(i, -2)))
#2 71 72
#4 11 12
Another possibility involving rle() could be:
pseudo_rleid <- with(rle(t1), rep(seq_along(values), lengths))
which(ave(t1, pseudo_rleid, FUN = function(x) seq_along(x) > 2) != 0)
# [1] 4 11 12
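A shorter base R variant of the same rle() idea is sketched below: sequence() applied to the run lengths numbers each element by its position within its run, which plays the role of rowid(rleid(t1)) from the data.table answer. The name `run_pos` is only illustrative.

```r
t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)

# Position of each element within its run of consecutive equal values
run_pos <- sequence(rle(t1)$lengths)

# Indices of elements that are the third or later member of a run
which(run_pos > 2)
# [1]  4 11 12
```

Because the position is counted per run, a value that reappears later in a separate run (such as 14 here) starts again at 1.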
I have a data frame where I need to apply a formula to create new columns. The catch is, I need to calculate these numbers one row at a time. For example,
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
I now need to fill columns 'c' and 'd' as follows. The row 1 (R1) value of 'c' is fixed at 5, but from R2 onwards 'c' is calculated as c (from the previous row) - b (from the previous row). The R1 value of 'd' is fixed at 10, but from R2 onwards 'd' is calculated as 'c' from the current row - 'd' from the previous row.
I want my output to look like this:
A B C D
1 21 5 10
2 22 -16 -26
3 23 -38 -12
And so on. My actual data has over 1000 rows and 18 columns. For every row, 5 of the column values come from different columns of the previous row only (no other rows). And the rest of the column values are calculated from these newly calculated row values. I am quite at a loss in creating a formula that will apply my formulae to each row, calculate values for that row and then move to the next row. I know that I have simplified the problem a bit here, but this captures the essence of what I am attempting.
This is what I attempted:
df <- within(df, {
v1 <- shift(c)
v2 <- shift(d)
c <- v1-shift(b)
d <- c-v2
})
However, I need to apply this only from row 2 onwards and I have no idea how to do that. Because of that, I get something like this:
a b c d v2 v1
1 21 NA NA NA NA
2 22 4 -6 10 5
3 23 4 -6 10 5
I only get these values repeatedly for c, and d (4, -6, 10, 5).
Thank you for your help.
df <- data.frame(a = 1:10, b = 21:30, c = 5:-4, d = 10)
for (i in (2:nrow(df))) {
df[i, "c"] <- df[i - 1, "c"] - df[i - 1, "b"]
df[i, "d"] <- df[i, "c"] - df[i - 1, "d"]
}
df[1:3, ]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
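For the simplified recurrence in the question, the loop above can also be sketched without explicit row indexing: 'c' reduces to a cumulative sum of earlier 'b' values, and 'd' can be carried forward with Reduce(). This is a sketch for these two formulas only; the real 18-column case may still need a loop. The names `d_prev` and `c_now` are illustrative.

```r
df <- data.frame(a = 1:10, b = 21:30, c = 5, d = 10)

# c[i] = c[i-1] - b[i-1] equals c[1] minus the running total of earlier b values
df$c <- df$c[1] - cumsum(c(0, head(df$b, -1)))

# d[i] = c[i] - d[i-1]: carry the previous d forward, starting from d[1]
df$d <- Reduce(function(d_prev, c_now) c_now - d_prev,
               x = df$c[-1], init = df$d[1], accumulate = TRUE)

df[1:3, ]
#   a  b   c   d
# 1 1 21   5  10
# 2 2 22 -16 -26
# 3 3 23 -38 -12
```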
Edit: adapting to your comment
# Let's define the coefficients of the equations into a dataframe
equation1 <- c("c", 0, 0, 0, 0, 0, -1, 1, 0) # c (from previous row) - b(from previous row)
equation2 <- c("d", 0, 0, 1, 0, 0, 0, 0, -1) # d is calculated as 'c' from R2 - d from previous row
equations <- data.frame(rbind(equation1,equation2), stringsAsFactors = F)
names(equations) <- c("y","a","b","c","d","a_previous","b_previous","c_previous","d_previous")
equations
# y a b c d a_previous b_previous c_previous d_previous
# "c" 0 0 0 0 0 -1 1 0
# "d" 0 0 1 0 0 0 0 -1
# define function to multiply the rows of the dataframes
sumProd <- function(vect1, vect2) {
return(as.numeric(as.numeric(vect1) %*% as.numeric(vect2)))
}
# Apply the formulas to the original dataframe
for (i in (2:nrow(df))) {
for(e in 1:nrow(equations)) {
df[i, equations[e, 'y']] <- sumProd(equations[e, c('a','b','c','d')], df[i, c('a','b','c','d')]) +
sumProd(equations[e, paste0(c('a','b','c','d'),'_previous')], df[i - 1, c('a','b','c','d')])
}
}
df[1:3,]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
It might not be the most elegant way to do it with a for loop but it works. Your column c sounds like a simple sequence to me.
This is what I would do:
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
# Use a simple sequence for c
df$c <- seq(5,5-(dim(df)[1]-1))
# Use for loop to calculate d
for (i in 2:nrow(df))
{
df$d[i] <- df$c[i] - df$d[i-1]
}
> df
a b c d
1 1 21 5 10
2 2 22 4 -6
3 3 23 3 9
4 4 24 2 -7
5 5 25 1 8
6 6 26 0 -8
7 7 27 -1 7
8 8 28 -2 -9
9 9 29 -3 6
10 10 30 -4 -10
I have a huge data frame that is like:
df = data.frame(A = c(1,54,23,2), B=c(1,2,4,65), C=c("+","-","-","+"))
> df
A B C
1 1 1 +
2 54 2 -
3 23 4 -
4 2 65 +
I need to subtract the rows based on different conditions, and add these results in a new column:
A - B if C == +
B - A if C == -
So, my output would be:
> new_df
A B C D
1 1 1 + 0
2 54 2 - -52
3 23 4 - -19
4 2 65 + -63
This assumes that only two conditions, + and -, are in column C.
df$D <- with(df, ifelse(C %in% "+", A - B, B - A))
df
# A B C D
# 1 1 1 + 0
# 2 54 2 - -52
# 3 23 4 - -19
# 4 2 65 + -63
Better to add stringsAsFactors = FALSE when you create a data frame (this is the default from R 4.0.0 onward). Also, I don't like to use df for variable names since there is a df() function:
df1 <- data.frame(A = c(1, 54, 23, 2),
B = c(1, 2, 4, 65),
C = c("+", "-", "-", "+"),
stringsAsFactors = FALSE)
Assuming that C is only + or -, you can use dplyr::mutate() and test using ifelse():
library(dplyr)
df1 %>%
mutate(D = ifelse(C == "+", A - B, B - A))
Using dplyr:
If there are definitely only + and - in the C column you can do:
library(dplyr)
df2 <- df %>%
mutate(D = ifelse(C == '+', A - B, B - A))
I would generally do:
df2 <- df %>%
mutate(D = ifelse(C == '+', A - B,
ifelse(C == '-', B - A, NA)))
Just in case there are some that do not have + or -.
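The same fallback behaviour can be sketched in base R with a named lookup vector instead of nested ifelse() calls: each symbol maps to a multiplier, and any symbol not in the lookup yields NA. The name `sgn` is illustrative, and this is only one way to express the rule.

```r
df <- data.frame(A = c(1, 54, 23, 2),
                 B = c(1, 2, 4, 65),
                 C = c("+", "-", "-", "+"),
                 stringsAsFactors = FALSE)

# Map each symbol to a multiplier; anything else maps to NA
sgn <- unname(c("+" = 1, "-" = -1)[as.character(df$C)])

# "+" gives A - B, "-" gives B - A
df$D <- sgn * (df$A - df$B)
df$D
# [1]   0 -52 -19 -63
```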
Alternatively, if you want to evaluate the arithmetic information in column C (as in addition or subtraction), you can use eval(parse(text = ...)) (more about that here: Evaluate expression given as a string).
## Transforming into a matrix (simplifies everything into characters)
df_mat <- as.matrix(df)
## Function for evaluation the rows
eval.row <- function(row) {
eval(parse(text= paste(row[1], row[3], row[2])))
}
## For the first row
eval.row(df_mat[1,])
# [1] 2
## For the whole data frame
apply(df_mat, 1, eval.row)
# [1] 2 52 19 67
## Updating the data.frame
df$D <- apply(df_mat, 1, eval.row)
This answer should work for you
https://stackoverflow.com/a/19000310/6395612
You can use with like this:
df['D'] = with(df, ifelse(C=='+', A - B, B - A))
A base solution:
df$D = (df$B-df$A)*sign((df$C=="-")-0.5)
# A B C D
# 1 1 1 + 0
# 2 54 2 - -52
# 3 23 4 - -19
# 4 2 65 + -63
This can also be written as df <- transform(df, D = (B-A)*sign((C=="-")-0.5))
So I am using R and trying to change values in a data frame in one column by comparing two columns together. I have something like
Median MyPrice
10 0
20 18
20 20
30 35
15 NA
And I would like to say something like
if(MyPrice == 0 & MyPrice < Median){MyPrice <- 1
}else if (MyPrice == Median){MyPrice <- 2
}else if (MyPrice > Median){MyPrice <- 3
}else {MyPrice <- 4}
To come up with
Median MyPrice
10 1
20 1
20 2
30 3
15 4
But there is always an error. I have also tried something like
for(i in MyPrice){if(MyPrice == 0 & MyPrice < Median){MyPrice <- 1
}else if (MyPrice == Median){MyPrice <- 2
}else if (MyPrice > Median){MyPrice <- 3
}else {MyPrice <- 4}
}
The for loop runs but it changes all values in MyPrice to 4. I also tried the ifelse() function but it seemed to have an issue taking that many arguments at once.
I would also not be opposed to a new column being added to the end of the data frame if a solution like that is easier.
You don't necessarily have to use a for loop. Start by setting every comparison to 4.
> x$Comp=4
> x$Comp[x$Median>x$MyPrice]=1 #if Median is higher, comparison = 1
> x$Comp[x$Median==x$MyPrice]=2 #if Median is equal to MyPrice, comparison = 2
> x$Comp[x$Median<x$MyPrice]=3 #if Median is lower, comparison = 3
> x
Median MyPrice Comp
1 10 0 1
2 20 18 1
3 20 20 2
4 30 35 3
5 15 NA 4
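The three assignments above can also be collapsed into one arithmetic step, sketched here under the same interpretation as this answer (any MyPrice below Median gets 1): sign() of the difference is -1, 0 or 1, so adding 2 gives the codes 1, 2 and 3, and NA rows fall back to 4.

```r
x <- data.frame(Median  = c(10, 20, 20, 30, 15),
                MyPrice = c(0, 18, 20, 35, NA))

# sign() gives -1, 0 or 1; adding 2 turns that into 1, 2 or 3
x$Comp <- sign(x$MyPrice - x$Median) + 2

# Rows where the comparison is undefined (NA) get the fallback code 4
x$Comp[is.na(x$Comp)] <- 4

x$Comp
# [1] 1 1 2 3 4
```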
Given your first condition, MyPrice == 0 & MyPrice < Median, your 2nd row (Median: 20, MyPrice: 18) matches none of the first three conditions and should be 4 rather than 1. Here is a working nested ifelse statement, followed by an NA handler.
df <- as.data.frame(matrix(c(10,0,20,18,20,20,30,35,15,NA), byrow = T, ncol = 2))
colnames(df) <- c("Median","MyPrice")
df$NewPrice <- ifelse(df$MyPrice == 0 & df$MyPrice < df$Median, 1,
ifelse(df$MyPrice == df$Median, 2,
ifelse(df$MyPrice > df$Median, 3, 4)))
df$NewPrice[is.na(df$MyPrice)] <- 4
df
# Median MyPrice NewPrice
#1 10 0 1
#2 20 18 4
#3 20 20 2
#4 30 35 3
#5 15 NA 4
What about creating a new variable with all values set to 4 and then replacing those cases where your conditions apply?
Simple, straightforward and easy to read :-)
#(Following the example from @Evans Friedland)
df <- as.data.frame(matrix(c(10,0,20,18,20,20,30,35,15,NA), byrow = T, ncol = 2))
colnames(df) <- c("Median","MyPrice")
df <- dplyr::mutate(df, myNewPrice = 4) #set my new price to 4, then edit by following your conditions
df$myNewPrice<- replace (df$myNewPrice, df$MyPrice == 0 & df$MyPrice < df$Median, 1)
df$myNewPrice<- replace (df$myNewPrice, df$MyPrice == df$Median , 2)
df$myNewPrice<- replace (df$myNewPrice, df$MyPrice > df$Median , 3)
df$myNewPrice <- as.numeric (df$myNewPrice) #might, might not be needed.