Below is code to generate data to demonstrate the problem.
con <- textConnection('
Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40
')
tt <- read.table(con, header = TRUE)
close(con)
test <- as.data.frame(tt)
I have the following code. It assigns a value to the "Sta" column subject to a condition, and writes the difference in "Nu" between row i and row i+1 into the "Lag" column.
library(dplyr)
# to sort "Na" column and arrange "Nu" in descending order
# in order to apply the code below.
test2 <- tt %>% arrange(Na, desc(Nu))
for (i in 1:nrow(test2)) {
if (i < nrow(test2)) {
if (test2[i, ]$Nu - 2 > test2[i+1, ]$Nu) {
test2[i, 4] <- "N"
test2[i, 5] <- test2[i, ]$Nu - test2[i+1, ]$Nu
} else if (test2[i, ]$Nu - 2 <= test2[i+1, ]$Nu) {
test2[i, 4] <- "Y"
test2[i, 5] <- test2[i, ]$Nu - test2[i+1, ]$Nu
}
} else if (i == nrow(test2)) {
test2[i, 4] <- "N"
test2[i, 5] <- 0
}
}
names(test2)[names(test2) == "V4"] <- "Sta"
names(test2)[names(test2) == "V5"] <- "Lag"
test2
Running the code produces the result below:
Nu Na Vo Sta Lag
1 112 A 50 N 4
2 108 A 1 Y 1
3 107 A 1 Y 1
4 106 A 5 Y 2
5 104 A 2 Y 1
6 103 A 2 N 3
7 100 A 60 Y -20
8 120 B 30 N 11
9 109 B 2 Y 1
10 108 B 4 N 8
11 100 B 1 Y -9
12 109 C 40 N 0
The values in the "Sta" column are assigned properly, but those in the "Lag" column are not. The intention is to apply the code separately for each value/level of "Na", that is "A", "B" and "C". I don't know how to apply the code to "A", "B" and "C" separately and then combine the separate results into ONE table. The desired outcome is:
Nu Na Vo Sta Lag
1 112 A 50 N 4
2 108 A 1 Y 1
3 107 A 1 Y 1
4 106 A 5 Y 2
5 104 A 2 Y 1
6 103 A 2 N 3
7 100 A 60 Y 0 << Last row for "A". "Lag" should be "0"; "Sta" should be "N".
8 120 B 30 N 11
9 109 B 2 Y 1
10 108 B 4 N 8
11 100 B 1 Y 0 << Last row for "B". "Lag" should be "0"; "Sta" should be "N".
12 109 C 40 N 0 << Last row for "C". "Lag" should be "0"; "Sta" should be "N".
Edited
I am not sure how to apply the code to the different factors/levels of "Na": "A", "B" and "C". Is it possible to use split() or the apply family of functions? As can be seen from the result and the intent of the code above, the result should be factor/level/element dependent (I hope I'm using the proper terminology), and this affects the values in both the "Sta" and "Lag" columns. However, my code cannot distinguish between the groups. I would appreciate any help. Thanks.
An inelegant solution!
For completeness, I am posting a possible solution here. I coded it the hard way. If anyone could help simplify it, it would be very much appreciated.
con <- textConnection('
Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40
')
tt <- read.table(con, header = T)
close(con)
require(dplyr); require(data.table)
test2 <- tt %>% arrange(Na, desc(Nu))
spl <- split(test2, test2$Na)
spl
for (i in seq_along(spl)) {
for (j in 1:nrow(spl[[i]])) {
if (j < nrow(spl[[i]])) {
if (spl[[i]][j, ]$Nu - 2 > spl[[i]][j+1, ]$Nu) {
spl[[i]][j, 4] <- "N"
spl[[i]][j, 5] <- spl[[i]][j, ]$Nu - spl[[i]][j+1, ]$Nu
} else if (spl[[i]][j, ]$Nu - 2 <= spl[[i]][j+1, ]$Nu) {
spl[[i]][j, 4] <- "Y"
spl[[i]][j, 5] <- spl[[i]][j, ]$Nu - spl[[i]][j+1, ]$Nu
}
} else if (j == nrow(spl[[i]])) {
spl[[i]][j, 4] <- "N"
spl[[i]][j, 5] <- 0
}
}
}
spl <- rbindlist(spl)
setnames(spl, c("V4", "V5"), c("Sta", "Lag"))
spl
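The nested loop above can probably be condensed. As a minimal sketch, assuming the same tt data and using only base R, the per-group work can go into a small helper (add_lag is a hypothetical name, not from the original post) that is applied to each piece returned by split() and then recombined with rbind():

```r
# Same sample data as in the question
tt <- read.table(text = "Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40", header = TRUE)

# Sort by group, then by Nu in descending order
test2 <- tt[order(tt$Na, -tt$Nu), ]

# Hypothetical helper: handles one group at a time
add_lag <- function(g) {
  g$Lag <- c(-diff(g$Nu), 0)            # difference to the next row; 0 for the last row
  g$Sta <- ifelse(g$Lag > 2, "N", "Y")  # "N" when the gap exceeds 2
  g$Sta[nrow(g)] <- "N"                 # the last row of each group is always "N"
  g[c("Nu", "Na", "Vo", "Sta", "Lag")]
}

res <- do.call(rbind, lapply(split(test2, test2$Na), add_lag))
rownames(res) <- NULL
res
```

Because each group is processed on its own, the last row of "A" is never compared with the first row of "B", which is what went wrong in the single loop.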
ave to the rescue - if applied twice this will essentially do the same comparisons as your long loop code.
First, calculate the lag differences using diff for each group, and set the value for the last row in each group to 0. Then use the computed lag values to determine the "Sta" column, forcing the last row in each group's value to be assigned "N".
test2$Lag <- with(test2, ave(Nu, Na, FUN=function(x) -c(diff(x),0)) )
test2$Sta <- with(test2, ave(Lag, Na, FUN=function(x) {
out <- ifelse(x > 2, "N", "Y"); out[length(out)] <- "N"; out}))
Same result as requested:
test2[c(1:3,5,4)]
# Nu Na Vo Sta Lag
#1 112 A 50 N 4
#2 108 A 1 Y 1
#3 107 A 1 Y 1
#4 106 A 5 Y 2
#5 104 A 2 Y 1
#6 103 A 2 N 3
#7 100 A 60 N 0
#8 120 B 30 N 11
#9 109 B 2 Y 1
#10 108 B 4 N 8
#11 100 B 1 N 0
#12 109 C 40 N 0
I have created a data frame which contains strings and integers. The integers are both positive and negative.
I have to make all the integers positive without using for/if loops, using only vectorization and indexing. I have written a version with a for loop, but I am a bit stuck on the next part.
df <- data.frame(x = letters[1:5],
y = seq(-4,4,2),
z = c(3,4,-5,6,-8))
This is my loop to convert to positive.
loop_df_fn <- function(data){
for(i in names(data)){
if(is.numeric(data[[i]])){
data[[i]][data[[i]]<0] <- abs(data[[i]][data[[i]]< 0])*10
}
}
return(data)
}
print((loop_df_fn(df)))
You can use
df[] <- lapply(df , \(x) if(is.numeric(x)) abs(x)*10 else x)
Output
x y z
1 a 40 30
2 b 20 40
3 c 0 50
4 d 20 60
5 e 40 80
A tidy solution:
library(dplyr)
df1 <- df %>%
mutate(across(where(is.numeric), ~if_else(.<0, .*-10, .)))
rapply(df, \(x) (x*-10)^(x<0)*x^(x>0), 'numeric', how='replace')
x y z
1 a 40 3
2 b 20 4
3 c 1 50
4 d 2 6
5 e 4 80
rapply(df, \(x) replace(x, x<0, x[x<0]*-10), 'numeric', how='replace')
x y z
1 a 40 3
2 b 20 4
3 c 0 50
4 d 2 6
5 e 4 80
lastly:
ind <- sapply(df, is.numeric)
df[ind][df[ind]<0] <- df[ind][df[ind]<0] * -10
df
x y z
1 a 40 3
2 b 20 4
3 c 0 50
4 d 2 6
5 e 4 80
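One way to check that a vectorised version really reproduces the loop is to compare the two results directly. The sketch below assumes the df and loop_df_fn from the question; vec_df_fn is a hypothetical name introduced here for illustration.

```r
df <- data.frame(x = letters[1:5],
                 y = seq(-4, 4, 2),
                 z = c(3, 4, -5, 6, -8))

# Loop version from the question: negatives become abs(value) * 10
loop_df_fn <- function(data) {
  for (i in names(data)) {
    if (is.numeric(data[[i]])) {
      data[[i]][data[[i]] < 0] <- abs(data[[i]][data[[i]] < 0]) * 10
    }
  }
  data
}

# Hypothetical vectorised counterpart: no explicit for/if
vec_df_fn <- function(data) {
  ind <- vapply(data, is.numeric, logical(1))  # which columns are numeric
  data[ind] <- lapply(data[ind], function(v) ifelse(v < 0, -10 * v, v))
  data
}

all.equal(loop_df_fn(df), vec_df_fn(df))
```

Note that only the negative values are rescaled here, as in the loop; a version that applies abs(x) * 10 to every value also multiplies the positive values by 10.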
I have a data set as follows
Name Price
A 100
B 123
C 112
D 114
E 101
F 102
I need a way to update the value in the Price column: if a price is within ±3 of one of the values in a given vector, it should be replaced by that value. The vector may contain any number of elements.
Vector = c(100,111)
Updated dataframe
Name Price
A 100
B 123
C 111
D 111
E 100
F 100
If the vector is
Vector = c(104,122)
then the updated dataframe needs to be
Name Price
A 100
B 122
C 112
D 114
E 104
F 104
df <- data.frame('Name' = LETTERS[1:6], 'Price'= c(100,123,112,114,101,102))
transform <- function(value, conditionals){
for(cond in conditionals){
if(abs(value - cond) < 4){
return(cond)
}
}
return(value)
}
sapply(df$Price, transform, c(104,122))
This should work. It can probably be done in one line with apply, but I sometimes find that difficult to read, so this version should be easier to follow.
Here's one approach
bound <- 3
upper_bound <- Vector+bound
lower_bound <- Vector-bound
vi <- Reduce("pmax", lapply(seq_along(Vector), function(i) i*(df$Price <= upper_bound[i] & df$Price >= lower_bound[i])))
# [1] 1 0 2 2 1 1
vi_na <- replace(vi, vi == 0, NA)
# [1] 1 NA 2 2 1 1
df <- dplyr::mutate(df, Price = ifelse(is.na(Vector[vi_na]), Price, Vector[vi_na]))
# Name Price
# 1 A 100
# 2 B 123
# 3 C 111
# 4 D 111
# 5 E 100
# 6 F 100
Data
df <- read.table(text = "Name Price
A 100
B 123
C 112
D 114
E 101
F 102", header=TRUE)
Vector = c(100,111)
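A compact alternative, sketched here under the assumption that each price should snap to its nearest target when that target is within the tolerance, is to build the full price-by-target distance matrix with outer(). The function name snap_to_targets and the tol argument are hypothetical, not from the original answers.

```r
# Hypothetical helper: replace each price by the nearest target within +/- tol
snap_to_targets <- function(price, targets, tol = 3) {
  d <- abs(outer(price, targets, "-"))              # rows: prices, columns: targets
  nearest <- max.col(-d, ties.method = "first")     # column index of the closest target
  best <- d[cbind(seq_along(price), nearest)]       # distance to that target
  ifelse(best <= tol, targets[nearest], price)
}

df <- data.frame(Name = LETTERS[1:6],
                 Price = c(100, 123, 112, 114, 101, 102))

snap_to_targets(df$Price, c(100, 111))
snap_to_targets(df$Price, c(104, 122))
```

Choosing the nearest target differs from the loop-based answer, which returns the first target that matches; the two agree whenever the tolerance windows of the targets do not overlap.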
I have a data frame where I need to apply a formula to create new columns. The catch is, I need to calculate these numbers one row at a time. For eg,
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
I now need to create columns 'c' and 'd' as follows. The row 1 (R1) value of column 'c' is fixed at 5, but from R2 onwards 'c' is calculated as c (from the previous row) - b (from the previous row). The R1 value of column 'd' is fixed at 10, but from R2 onwards 'd' is calculated as 'c' from the current row - 'd' from the previous row.
I want my output to look like this:
A B C D
1 21 5 10
2 22 -16 -26
3 23 -38 -12
And so on. My actual data has over 1000 rows and 18 columns. For every row, 5 of the column values come from different columns of the previous row only (no other rows). And the rest of the column values are calculated from these newly calculated row values. I am quite at a loss in creating a formula that will apply my formulae to each row, calculate values for that row and then move to the next row. I know that I have simplified the problem a bit here, but this captures the essence of what I am attempting.
This is what I attempted:
df <- within(df, {
v1 <- shift(c)
v2 <- shift(d)
c <- v1-shift(b)
d <- c-v2
})
However, I need to apply this only from row 2 onwards and I have no idea how to do that. Because of that, I get something like this:
a b c d v2 v1
1 21 NA NA NA NA
2 22 4 -6 10 5
3 23 4 -6 10 5
I only get these values repeatedly for c, and d (4, -6, 10, 5).
Output
Thank you for your help.
df <- data.frame(a = 1:10, b = 21:30, c = 5:-4, d = 10)
for (i in (2:nrow(df))) {
df[i, "c"] <- df[i - 1, "c"] - df[i - 1, "b"]
df[i, "d"] <- df[i, "c"] - df[i - 1, "d"]
}
df[1:3, ]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
Edit: adapting to your comment
# Let's define the coefficients of the equations into a dataframe
equation1 <- c("c", 0, 0, 0, 0, 0, -1, 1, 0) # c (from previous row) - b(from previous row)
equation2 <- c("d", 0, 0, 1, 0, 0, 0, 0, -1) # d is calculated as 'c' from R2 - d from previous row
equations <- data.frame(rbind(equation1,equation2), stringsAsFactors = F)
names(equations) <- c("y","a","b","c","d","a_previous","b_previous","c_previous","d_previous")
equations
# y a b c d a_previous b_previous c_previous d_previous
# "c" 0 0 0 0 0 -1 1 0
# "d" 0 0 1 0 0 0 0 -1
# define a function to multiply the rows of the data frames
sumProd <- function(vect1, vect2) {
return(as.numeric(as.numeric(vect1) %*% as.numeric(vect2)))
}
# Apply the formulas to the original data frame
for (i in (2:nrow(df))) {
for(e in 1:nrow(equations)) {
df[i, equations[e, 'y']] <- sumProd(equations[e, c('a','b','c','d')], df[i, c('a','b','c','d')]) +
sumProd(equations[e, paste0(c('a','b','c','d'),'_previous')], df[i - 1, c('a','b','c','d')])
}
}
df[1:3,]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
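For the simplified two-column case, the recurrences can also be written without indexing into the data frame row by row, which may matter with many rows. This is only a sketch under the rules stated in the question (c starts at 5, d starts at 10); the names c_col and d_col are introduced here for illustration.

```r
a <- 1:10
b <- 21:30
n <- length(b)

# c[i] = c[i-1] - b[i-1]  =>  c[i] = 5 - (b[1] + ... + b[i-1])
c_col <- 5 - c(0, cumsum(b[-n]))

# d[i] = c[i] - d[i-1], starting from d[1] = 10.
# Reduce() carries the previous d forward and collects every intermediate value.
d_col <- Reduce(function(d_prev, c_now) c_now - d_prev,
                c_col[-1], init = 10, accumulate = TRUE)

df <- data.frame(a = a, b = b, c = c_col, d = d_col)
df[1:3, ]
```

The first three rows agree with the loop output above (c: 5, -16, -38; d: 10, -26, -12). With 18 interdependent columns a plain loop over preallocated vectors may still be the clearer choice.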
It might not be the most elegant way to do it with a for loop but it works. Your column c sounds like a simple sequence to me.
This is what I would do:
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
# Use a simple sequence for c
df$c <- seq(5,5-(dim(df)[1]-1))
# Use for loop to calculate d
for(i in 2:length(df$d))
{
df$d[i] <- df$c[i] - df$d[i-1]
}
> df
a b c d
1 1 21 5 10
2 2 22 4 -6
3 3 23 3 9
4 4 24 2 -7
5 5 25 1 8
6 6 26 0 -8
7 7 27 -1 7
8 8 28 -2 -9
9 9 29 -3 6
10 10 30 -4 -10
Although I searched for a long time for solutions, e.g.
Assign value to group based on condition in column
I am not able to solve the following problem and would greatly appreciate any help!
I have the following data frame (in reality, many more with thousands of rows):
df <- data.frame(ID1 = c(1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6,7,7),
ID2 = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
Percentage = c(0,10,NA,65,79,81,52,0,0,11,12,35,0,24,89,76,0,NA,59,16),
Group_expected_result = c(6,6,6,7,7,7,7,1,1,3,3,3,4,4,4,5,5,5,2,2))
What I want to do is to assign a group type from 1 to 7 to each group as indicated by ID1. Which group type should be assigned is dependent on the conditions of column 3, Percentage (can have values from 0-100) and is split into seven types:
Type 1 has a percentage of 0, i.e.
Type 1 = 0
Type 2 > 0 & < 10
Type 3 > 9 & < 20
Type 4 > 19 & < 30
Type 5 > 29 & < 40
Type 6 > 39 & < 50
Type 7 > 49
The combination of these types (above) defines the group type (G1-G7) below:
G1 = only T1
G2 = only T7 & T2-T6
G3 = only T2-T6
G4 = at least one T1, & one T2-T6, & one T7 (= all)
G5 = only T7 & T1
G6 = only T2-T6 & T1
G7 = only T7
The expected result is in the last column of the sample data frame, e.g.
the first group consists of types T1 and T2, therefore should be group type G6.
So, the question is how to get the expected result in the last column? I hope I made the problem clear! Thanks in advance!
Try this:
myType <- function(x) {
if (is.na(x) || x==0) {
return(1L)
} else if (x < 50) {
return(2L)
} else {
return(3L)
}
}
myGroup <- function(myDf) {
myIds <- unique(myDf$ID1)
myGs <- list(G1=1L, G2=2:3, G3=2L, G4=1:3, G5=c(1L,3L), G6=1:2, G7=3L)
assignG <- vector(mode = "integer", length=nrow(myDf))
vT <- vapply(myDf[,"Percentage"], function(x) myType(x), 1L)
for (i in myIds) {
myV <- which(myDf[,1L]==i)
testV <- sort(unique(vT[myV]))
assignG[myV] <- which(vapply(myGs, function(x) identical(x,testV), TRUE, USE.NAMES = FALSE))
}
myDf$myResult <- assignG
myDf
}
Calling it, we obtain:
myGroup(df)
ID1 ID2 Percentage Group_expected_result myResult
1 1 1 0 6 6
2 1 2 10 6 6
3 1 3 NA 6 6
4 2 4 65 7 7
5 2 5 79 7 7
6 2 6 81 7 7
7 2 7 52 7 7
8 3 8 0 1 1
9 3 9 0 1 1
10 4 10 11 3 3
11 4 11 12 3 3
12 4 12 35 3 3
13 5 13 0 4 4
14 5 14 24 4 4
15 5 15 89 4 4
16 6 16 76 5 5
17 6 17 0 5 5
18 6 18 NA 5 5
19 7 19 59 2 2
20 7 20 16 2 2
Here is a less intuitive, but more efficient solution.
myGroup2 <- function(myDf) {
myIds <- unique(myDf$ID1)
AltGs <- c(G1=2L, G2=7L, G3=3L, G4=9L, G5=6L, G6=5L, G7=4L)
assignG <- vector(mode = "integer", length=nrow(myDf))
vT <- vapply(myDf[,"Percentage"], function(x) myType(x), 1L)
for (i in myIds) {
myV <- which(myDf[,1L]==i)
testV <- unique(vT[myV])
assignG[myV] <- which(AltGs==(length(testV)+sum(testV)))
}
myDf$myResult <- assignG
myDf
}
It is about twice as fast.
library(microbenchmark)
microbenchmark(t1=myGroup(df), t2=myGroup2(df))
Unit: microseconds
expr min lq mean median uq max neval
t1 692.117 728.4470 779.6459 748.562 819.170 1018.060 100
t2 320.608 340.3115 390.7098 351.395 414.203 1781.195 100
You can obtain AltGs above by running the following:
myGs <- list(G1=1L, G2=2:3, G3=2L, G4=1:3, G5=c(1L,3L), G6=1:2, G7=3L)
AltGs <- vapply(myGs, function(x) length(x)+sum(x), 2L, USE.NAMES = FALSE)
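The same classification can also be sketched without an explicit loop over the IDs. The version below assumes the three coarse types used in the answer above (1 = zero or NA, 2 = between 0 and 50, 3 = 50 or more) and looks each group's set of types up in a named vector; the names type, key and lookup are hypothetical.

```r
df <- data.frame(ID1 = c(1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6,7,7),
                 ID2 = 1:20,
                 Percentage = c(0,10,NA,65,79,81,52,0,0,11,12,35,0,24,89,76,0,NA,59,16),
                 Group_expected_result = c(6,6,6,7,7,7,7,1,1,3,3,3,4,4,4,5,5,5,2,2))

# Coarse type per row: 1 = zero or NA, 2 = between 0 and 50, 3 = 50 or more
p <- df$Percentage
type <- ifelse(is.na(p) | p == 0, 1L, ifelse(p < 50, 2L, 3L))

# One key per ID1, e.g. "12" when a group contains types 1 and 2
key <- vapply(split(type, df$ID1),
              function(t) paste(sort(unique(t)), collapse = ""),
              character(1))

# Map each combination of types to its group number (same mapping as myGs)
lookup <- c("1" = 1L, "23" = 2L, "2" = 3L, "123" = 4L,
            "13" = 5L, "12" = 6L, "3" = 7L)

df$myResult <- unname(lookup[key[as.character(df$ID1)]])
df
```

On the sample data, myResult matches Group_expected_result for every row. Treating NA as type 1 follows the answer above and is an assumption about the intended handling of missing values.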
I have a data.frame with two columns:
category quantity
a 20
b 30
c 100
d 10
e 1
f 23
g 3
h 200
I need to write a function with two parameters, dataframe and bin_size, which runs a cumsum over the quantity column, splits a row whenever the cumsum exceeds the bin_size, and adds a running bin number as an additional column.
For example, calling
function(dataframe, 50)
on the data above should give me:
category quantity cumsum bin_nbr
a 20 20 1
b 30 50 1
c 50 50 2
c 50 50 3
d 10 10 4
e 1 11 4
f 23 34 4
g 3 37 4
h 13 50 4
h 50 50 5
h 50 50 6
h 50 50 7
h 37 37 8
Explanation:
row a + b sum up to 50 --> bin_nbr 1
row c is 100 -> split into 2 rows of 50 each -> bin_nbr 2, bin_nbr 3
row d,e,f,g sum up to 37 -> bin_nbr 4
I need another 13 from row h to fill in bin_nbr 4 to 50
The remaining quantity from h will be split into 4 bins -> bin_nbr 5, 6, 7, 8
I couldn't think of a clean way to do this with apply/data.table etc since you have an inter-row dependency and a changing size data frame. You can probably do it in an iterative/recursive manner, but I felt it would be quicker to figure out to just write the loop. One challenge is that it is difficult to know the final size of your object, so this is likely to be slow. You can mitigate the problem somewhat by switching from a df to a matrix (code should work fine, except transform bits) if performance is an issue in this application.
fun <- function(df, binsize){
df$cumsum <- cumsum(df$quantity)
df$bin <- 1
i <- 1
repeat {
if((extra <- (df[i, "cumsum"] - binsize)) > 0) { # Bin finished halfway through
top <- if(i > 1L) df[1L:(i - 1L), ] else df[0L, ]
mid <- transform(df[i, ], quantity=quantity-extra, cumsum=cumsum-extra)
bot <- transform(df[i, ], quantity=extra, cumsum=extra, bin=bin + 1L)
end <- if(i >= nrow(df)) df[0L, ] else df[(i + 1L):nrow(df), ]
end <- transform(end, cumsum=cumsum(end$quantity) + extra, bin=bin + 1L)
df <- rbind(top, mid, bot, end)
} else if (extra == 0 && nrow(df) > i) { # Bin finished cleanly
df[(i + 1L):nrow(df), ]$cumsum <- df[(i + 1L):nrow(df), ]$cumsum - binsize
df[(i + 1L):nrow(df), ]$bin <- df[(i + 1L):nrow(df), ]$bin + 1L
}
if(nrow(df) < (i <- i + 1)) break
}
rownames(df) <- seq(len=nrow(df))
df
}
fun(df, 50)
# category quantity cumsum bin
# 1 a 20 20 1
# 2 b 30 50 1
# 3 c 50 50 2
# 4 c 50 50 3
# 5 d 10 10 4
# 6 e 1 11 4
# 7 f 23 34 4
# 8 g 3 37 4
# 9 h 13 50 4
# 10 h 50 50 5
# 11 h 50 50 6
# 12 h 50 50 7
# 13 h 37 37 8
Another solution with a loop:
DF <- read.table(text="category quantity
a 20
b 30
c 100
d 10
e 1
f 23
g 3
h 200", header=TRUE)
bin_size <- 50
n_bin <- ceiling(sum(DF$quantity)/bin_size)
DF$bin <- findInterval(cumsum(DF$quantity)-1, c(0, seq_len(n_bin)*bin_size))
DF$cumsum <- cumsum(DF$quantity)
result <- lapply(seq_along(DF[,1]), function(i, df) {
if (i==1) {
d <- df[i, "bin"]
} else {
d <- df[i, "bin"]-df[i-1, "bin"]
}
if (d > 1) {
res <- data.frame(
category = df[i, "category"],
bin_nbr = df[i, "bin"]-seq_len(d+1)+1
)
res[,"quantity"] <- bin_size
if (i!=1) {
res[nrow(res),"quantity"] <- df[i-1, "bin"]*bin_size-df[i-1, "cumsum"]
} else {
res[nrow(res),"quantity"] <- 0
}
res[1,"quantity"] <- df[i, "quantity"]-sum(res[-1,"quantity"])
return(res[res$quantity > 0,])
} else {
return(data.frame(
category = df[i, "category"],
quantity = df[i, "quantity"],
bin_nbr = df[i, "bin"]
))
}
}, df=DF)
res <- do.call(rbind, result)
res <- res[order(res$category, res$bin_nbr),]
library(plyr)
res <- ddply(res, .(bin_nbr), transform, cumsum=cumsum(quantity))
res
# category quantity bin_nbr cumsum
# 1 a 20 1 20
# 2 b 30 1 50
# 3 c 50 2 50
# 4 c 50 3 50
# 5 d 10 4 10
# 6 e 1 4 11
# 7 f 23 4 34
# 8 g 3 4 37
# 9 h 13 4 50
# 10 h 50 5 50
# 11 h 50 6 50
# 12 h 50 7 50
# 13 h 37 8 37
This amounts to merging the bin boundaries with the data which gives this loop-free solution:
library(zoo)
fun <- function(DF, binsize = 50) {
bounds <- seq(0, sum(DF$quantity), binsize) + binsize
DF2 <- data.frame(cumsum = bounds, bin_nbr = seq_along(bounds))
DF.cs <- transform(DF, cumsum = cumsum(DF$quantity))
m <- na.locf(merge(DF.cs, DF2, all = TRUE), fromLast = TRUE)
m$bin_nbr <- as.numeric(m$bin_nbr)
cs <- as.numeric(m$cumsum)
m$quantity <- c(cs[1], diff(cs))
m$cumsum <- ave(m$quantity, m$bin_nbr, FUN = cumsum)
na.omit(m)[c("category", "quantity", "cumsum", "bin_nbr")]
}
giving:
> fun(DF)
category quantity cumsum bin_nbr
1 a 20 20 1
2 b 30 50 1
3 c 50 50 2
4 c 50 50 3
5 d 10 10 4
6 e 1 11 4
7 f 23 34 4
8 g 3 37 4
9 h 13 50 4
10 h 50 50 5
11 h 50 50 6
12 h 50 50 7
13 h 37 37 8
Note: For purposes of reproducing the result above this is the input we used:
Lines <- "category quantity
a 20
b 30
c 100
d 10
e 1
f 23
g 3
h 200
"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
REVISION An error in the code was corrected.
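The idea of merging the bin boundaries with the running totals can also be sketched in base R alone. The function below is an illustration, not one of the posted answers: split_bins is a hypothetical name, and it assumes non-negative quantities. Every cut point (a row's running total or a multiple of the bin size) starts a new output row.

```r
DF <- read.table(text = "category quantity
a 20
b 30
c 100
d 10
e 1
f 23
g 3
h 200", header = TRUE)

# Hypothetical helper: split rows at bin boundaries (assumes quantity >= 0)
split_bins <- function(df, bin_size) {
  end    <- cumsum(df$quantity)                     # running total after each row
  total  <- end[length(end)]
  bounds <- seq_len(floor(total / bin_size)) * bin_size
  cuts   <- sort(unique(c(0, bounds, end)))         # all points where a piece ends
  lower  <- cuts[-length(cuts)]
  upper  <- cuts[-1]
  row    <- findInterval(lower, end) + 1            # source row of each piece
  bin    <- floor(lower / bin_size) + 1             # bin the piece falls into
  data.frame(category = df$category[row],
             quantity = upper - lower,
             cumsum   = upper - (bin - 1) * bin_size,
             bin_nbr  = bin)
}

res <- split_bins(DF, 50)
res
```

On the sample input this yields the same 13 rows as the solutions above. Since no row is ever appended inside a loop, the output size does not need to be known in advance.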