Update values of a column based on predefined thresholds - r

I have a data set as follows
Name Price
A 100
B 123
C 112
D 114
E 101
F 102
I need a way to update the value in the Price column: if a price is within +3 or -3 of any value in a given vector, it should be replaced by that vector value. The vector may contain any number of elements.
Vector = c(100,111)
Updated dataframe
Name Price
A 100
B 123
C 111
D 111
E 100
F 100
If the vector is
Vector = c(104,122)
then the updated dataframe needs to be
Name Price
A 100
B 122
C 112
D 114
E 104
F 104

df <- data.frame('Name' = LETTERS[1:6], 'Price'= c(100,123,112,114,101,102))
transform <- function(value, conditionals){
for(cond in conditionals){
if(abs(value - cond) <= 3){
return(cond)
}
}
return(value)
}
df$Price <- sapply(df$Price, transform, c(104,122))
This should work. It can probably be done in one line with apply, but I sometimes find that difficult to read, so this version should be easier to follow.

Here's one approach
bound <- 3
upper_bound <- Vector+bound
lower_bound <- Vector-bound
vi <- Reduce("pmax", lapply(seq_along(Vector), function(i) i*(df$Price <= upper_bound[i] & df$Price >= lower_bound[i])))
# [1] 1 0 2 2 1 1
vi_na <- replace(vi, vi == 0, NA)
# [1] 1 NA 2 2 1 1
df <- dplyr::mutate(df, Price = ifelse(is.na(Vector[vi_na]), Price, Vector[vi_na]))
#   Name Price
# 1    A   100
# 2    B   123
# 3    C   111
# 4    D   111
# 5    E   100
# 6    F   100
Data
df <- read.table(text = "Name Price
A 100
B 123
C 112
D 114
E 101
F 102", header=TRUE)
Vector = c(100,111)
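A vectorised base R variant of the same idea, as a minimal sketch. The helper name snap_to and its tol argument are assumptions made for illustration, not part of the answers above. It compares every price with every target via outer() and, when several targets are within tolerance, picks the first one, which mirrors the loop-based answer.

```r
df <- data.frame(Name = LETTERS[1:6], Price = c(100, 123, 112, 114, 101, 102))

# snap_to is a hypothetical helper: replace each price by the first target
# that lies within +/- tol of it, and leave the price unchanged otherwise.
snap_to <- function(price, targets, tol = 3) {
  # rows correspond to prices, columns to targets
  hit <- abs(outer(price, targets, "-")) <= tol
  # index of the first matching target per price, NA when nothing matches
  idx <- apply(hit, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)
  ifelse(is.na(idx), price, targets[idx])
}

df$Price <- snap_to(df$Price, c(100, 111))
df
```

Calling snap_to(c(100, 123, 112, 114, 101, 102), c(104, 122)) on the original prices should reproduce the second expected table from the question.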

Related

In R create a data.table with X trials, Y reps, and Z plots

I'm trying to create a data.table from 3 vectors, where vector A is trial = [a,b,c,...,n], vector B is rep = [1,2,3,...,n] and vector C is plot = [r01, r02, r03,...,n], where r = "rep" (replicates).
Example:
> trial <- c("a", "b", "c")
> plot <- c(101:103,201:203,301:303)
> rep <- c(1,2,3)
> trial
[1] "a" "b" "c"
> plot
[1] 101 102 103 201 202 203 301 302 303
> rep
[1] 1 2 3
> dt <- data.table(trial,plot,rep)
> dt
trial plot rep
1: a 101 1
2: b 102 2
3: c 103 3
4: a 201 1
5: b 202 2
6: c 203 3
7: a 301 1
8: b 302 2
9: c 303 3
> dt <- data.table(trial,rep,plot)
> dt
trial rep plot
1: a 1 101
2: b 2 102
3: c 3 103
4: a 1 201
5: b 2 202
6: c 3 203
7: a 1 301
8: b 2 302
9: c 3 303
Neither of these is quite correct.
I want the plot value to be 100 x rep + plot number.
For trial (x): rep 1, plot 1 -> 101
For trial (x): rep 1, plot 2 -> 102
For trial (x): rep 2, plot 1 -> 201
For trial (x): rep 2, plot 2 -> 202
etc.
The problem seems to be applying a function to the vectors rep and plot in the correct order. outer is a good candidate for solving it.
library(data.table)
trial <- c("a", "b", "c")
plot <- 1:3
rep <- 1:3
f <- function(r, p) 100*r + p
as.vector(t(outer(rep, plot, f)))
#[1] 101 102 103 201 202 203 301 302 303
dt <- data.table(trial, rep, plot = as.vector(outer(rep, plot, f)))
dt
# trial rep plot
#1: a 1 101
#2: b 2 201
#3: c 3 301
#4: a 1 102
#5: b 2 202
#6: c 3 302
#7: a 1 103
#8: b 2 203
#9: c 3 303
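If every trial should receive every rep and plot combination (an assumption, since the question does not say so explicitly), base R's expand.grid can build the full layout before the plot label is computed. This is only a sketch, and the column name plot_id is made up for illustration.

```r
trial <- c("a", "b", "c")
n_rep <- 3   # number of replicates (assumed)
n_plot <- 3  # number of plots per replicate (assumed)

# expand.grid varies its first argument fastest, so plots cycle within reps,
# and reps cycle within trials
design <- expand.grid(plot = seq_len(n_plot),
                      rep = seq_len(n_rep),
                      trial = trial,
                      stringsAsFactors = FALSE)

# label each plot as 100 x rep + plot number
design$plot_id <- 100 * design$rep + design$plot
design <- design[c("trial", "rep", "plot", "plot_id")]
head(design)
```

The result is an ordinary data.frame; data.table::as.data.table(design) would convert it if a data.table is required.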
Thanks for all the help. The following allows me to dynamically edit all portions of the df.
I was able to work through the problem with the code provided by everyone. I don't know who to give credit to.
Thanks to all!
loc <- c("Orchard", "Roggen", "Yuma", "Walsh", "Akron", "Julesburg", "Arapahoe", "Genoa", "Burlington", "Lamar", "Brandon")
i <- loc
rep <- 3 ## num reps
j <- seq(rep)
plot <- 5 ## num plots
k <- seq(plot)
df <- data.frame(loc = i, block = rep(j, each = length(loc)), plot=rep(k, length(i)*rep))

R help - change the maximum value of each row in a certain condition

I am a novice in R. I have a dataframe with columns 1:n. Excluding columns 1 and n, I want to change the maximum value of each row if the row has a specific value in a different column, AND set the remaining values (again excluding columns 1 and n) to zero. I have about 300,000 cases and 40 columns in my real data, but the example below illustrates what I am trying to achieve:
A <- c(1,1,5,5,10)
B <- rnorm(1:5)
C <- rnorm(1:5)
D <- rnorm(1:5)
E <- c(10,15,100,100,100)
df <- data.frame(A,B,C,D,E)
df
A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
Here, if column A of each row has 1, I want to change the maximum value of each row into the value of column E, and set columns B, C and D to 0.
So, the result should be like this:
A B C D E
1 1 0 0 10 10
2 1 0 15 0 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
I tried to do this for two days. Thanks.
Try this out and see what happens :)
df <- read.table(text = "A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100", stringsAsFactors = FALSE)
# find the max in columns B,C,D
z <- apply(df[df$A == 1, 2:4], 1, max)
# substitute the maximum value of each row for columns B,C,D where A == 1
# with the value of column E. Assign 0 to the others
y <- ifelse(df[df$A == 1, 2:4] == z, df$E[df$A == 1], 0)
# Change the values in your dataframe
df[df$A == 1, 2:4] <- y
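An alternative sketch using max.col, which returns the column position of each row's maximum directly. It assumes the columns to modify are B, C and D and that at least one row has A == 1. Ties are resolved by taking the first maximum, which is a choice made here and not something stated in the question.

```r
df <- data.frame(
  A = c(1, 1, 5, 5, 10),
  B = c(0.74286670, -0.03352498, -0.17689629, 0.48329153, 0.13117277),
  C = c(0.3222136, 0.5262685, -0.8949740, 1.1574834, -0.2068736),
  D = c(0.9381296, 0.1225731, -1.4376567, -1.1116581, 0.4841806),
  E = c(10, 15, 100, 100, 100)
)

cols <- c("B", "C", "D")   # columns to modify (assumed)
rows <- which(df$A == 1)   # rows that satisfy the condition

# position of the row-wise maximum within cols
top <- max.col(as.matrix(df[rows, cols]), ties.method = "first")

# start from all zeros, then write E into the position of each row's maximum
out <- matrix(0, nrow = length(rows), ncol = length(cols),
              dimnames = list(NULL, cols))
out[cbind(seq_along(rows), top)] <- df$E[rows]

df[rows, cols] <- out
df
```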

R Function to write 3 calculated columns to a data.table

This may have already been answered, but I couldn't quite find the answer I am looking for. I am trying to write the output of a function that calculates 3 variables to a data.table.
Currently I am copying the function three times (with three different names), each time returning a different variable. This takes a lot more time, as it runs three times. I understand there may be a better way to do it, using a list or some data.table-specific command.
I would greatly appreciate any input you can provide to simplify this. Below is an example of how I am calling it, one variable at a time.
Example
fn_1 <- function(a, b, c, d){
for (i in 1:b) { col_1[i] = calculation }
for (i in 1:c) { col_2[i] = calculation }
for (i in 1:d) { col_3[i] = calculation }
return(col_1)
}
data[ ,column_1 := fn_1(a,b,c,d) ,by= .(e,f) ]
fn_2 <- function(a, b, c, d){
for (i in 1:b) { col_1[i] = calculation }
for (i in 1:c) { col_2[i] = calculation }
for (i in 1:d) { col_3[i] = calculation }
return(col_2)
}
data[ ,column_2 := fn_2(a,b,c,d) ,by= .(e,f) ]
The OP has tagged the question with data.table. docendo discimus' comment is showing the direction to follow.
Create sample data
library(data.table) # CRAN version 1.10.4 used
n <- 10L
DT <- data.table(
a = 1:n, b = (n:1)^2, c = -(1:n), d = 2 * (1:n) - n/2,
e = rep(LETTERS[1:2], length.out = n),
f = rep(LETTERS[3:4], each = n/2, length.out = n))
DT
# a b c d e f
# 1: 1 100 -1 -3 A C
# 2: 2 81 -2 -1 B C
# 3: 3 64 -3 1 A C
# 4: 4 49 -4 3 B C
# 5: 5 36 -5 5 A C
# 6: 6 25 -6 7 B D
# 7: 7 16 -7 9 A D
# 8: 8 9 -8 11 B D
# 9: 9 4 -9 13 A D
#10: 10 1 -10 15 B D
Define function
fn <- function(p, q, r, s) {
list(X1 = p + mean(q) + r + s,
Y2 = p * q + r * s,
Z3 = p * q - r * s)
}
The function takes 4 parameters and returns a list of 3 named vectors. Note that the computations inside the function don't need to use for loops in contrast to OP's approach.
Apply function to data.table
Note that the OP wants to group on columns e and f when the function is applied.
The first variant creates a new data.table. By default, the names of the list elements as defined in fn are used:
DT[, fn(a, b, c, d), .(e, f)]
# e f X1 Y2 Z3
# 1: A C 63.66667 103 97
# 2: A C 67.66667 189 195
# 3: A C 71.66667 155 205
# 4: B C 64.00000 164 160
# 5: B C 68.00000 184 208
# 6: B D 18.66667 108 192
# 7: B D 22.66667 -16 160
# 8: B D 26.66667 -140 160
# 9: A D 19.00000 49 175
#10: A D 23.00000 -81 153
The second variant updates DT by reference. The names of the new columns are explicitly stated.
DT[, c("x", "y", "z") := fn(a, b, c, d), .(e, f)]
DT
# a b c d e f x y z
# 1: 1 100 -1 -3 A C 63.66667 103 97
# 2: 2 81 -2 -1 B C 64.00000 164 160
# 3: 3 64 -3 1 A C 67.66667 189 195
# 4: 4 49 -4 3 B C 68.00000 184 208
# 5: 5 36 -5 5 A C 71.66667 155 205
# 6: 6 25 -6 7 B D 18.66667 108 192
# 7: 7 16 -7 9 A D 19.00000 49 175
# 8: 8 9 -8 11 B D 22.66667 -16 160
# 9: 9 4 -9 13 A D 23.00000 -81 153
#10: 10 1 -10 15 B D 26.66667 -140 160
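For comparison, the same "compute once, return several columns per group" idea can be sketched in base R without data.table. This is only an illustration under the sample data above. The names DF, groups, pieces and out are made up here, and the approach copies the data instead of updating by reference.

```r
n <- 10L
DF <- data.frame(
  a = 1:n, b = (n:1)^2, c = -(1:n), d = 2 * (1:n) - n / 2,
  e = rep(LETTERS[1:2], length.out = n),
  f = rep(LETTERS[3:4], each = n / 2, length.out = n)
)

# one function call returns all three results as a named list
fn <- function(p, q, r, s) {
  list(x = p + mean(q) + r + s,
       y = p * q + r * s,
       z = p * q - r * s)
}

# row indices of each (e, f) group
groups <- split(seq_len(nrow(DF)), list(DF$e, DF$f), drop = TRUE)

# evaluate fn once per group and bind its three columns to that group's rows
pieces <- lapply(groups, function(idx) {
  res <- with(DF[idx, ], fn(a, b, c, d))
  cbind(DF[idx, ], as.data.frame(res))
})

# stack the groups and restore the original row order
out <- do.call(rbind, pieces)
out <- out[order(out$a), ]
out
```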
You're in the second circle of hell. To solve the problem, pre-allocate what you want to add.
data <- data.table(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
Then, make a vectorized function to do the calculation, which returns the whole column to append.
calculation <- Vectorize(function(x) mean(c(x, 3)))
Write fn in terms of this new function, and return the whole block of columns to be added, then cbind it with data to add all the columns at once. It's extremely slow to do all the calculations every time, and then only return one part.
fn <- function(b, c, d) {
  # build all three new columns in one pass and return them together
  data.table(col_1 = calculation(b),
             col_2 = calculation(c),
             col_3 = calculation(d))
}
# pass the existing columns (named V1, V2, V3 by default), not rows
data <- cbind(data, fn(data$V1, data$V2, data$V3))
Answering my own question: based on input from @docendodiscimus and @ConCave, I solved it like this. I appreciate everyone's input!
fn_1 <- function(a, b, c, d){
for (i in 1:b) { col_1[i] = calculation }
for (i in 1:c) { col_2[i] = calculation }
for (i in 1:d) { col_3[i] = calculation }
df = data.table(col_1, col_2, col_3)
return(df)
}
data[,c("column_1","column_2","column_3"):= fn_1(a,b,c,d) ,by= .(e,f)]
Does it have to be a data.table? If not, you can just use mutate from dplyr:
library(dplyr)
a <- c(1,2,2,1,2,3,4,2)
b <- c(3,3,2,3,5,4,3,2)
c <- c(9,9,8,7,8,9,8,7)
d <- c(0,1,1,0,1,1,0,1)
have <- data.frame(a,b,c,d)
want <-
have %>%
mutate(abc = a+ b + c,
db = d * b,
aa = 2 * a)
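If adding a package is not an option, base R's transform() covers this simple case too. A minimal sketch using the same made-up columns as above:

```r
have <- data.frame(
  a = c(1, 2, 2, 1, 2, 3, 4, 2),
  b = c(3, 3, 2, 3, 5, 4, 3, 2),
  c = c(9, 9, 8, 7, 8, 9, 8, 7),
  d = c(0, 1, 1, 0, 1, 1, 0, 1)
)

# transform() evaluates each expression using the columns of `have`
# and appends the results as new columns
want <- transform(have,
                  abc = a + b + c,
                  db  = d * b,
                  aa  = 2 * a)
want
```

Unlike mutate, transform() cannot refer to a column created earlier in the same call, so dependent columns need a second call.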

R apply code to different factors or levels

Below is code to generate data to demonstrate the problem.
con <- textConnection('
Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40
')
tt <- read.table(con, header = T)
close(con)
test <- as.data.frame(tt)
I have the following code. It assigns a value to the "Sta" column subject to a specific condition and adds the difference in "Nu" between rows i and i+1 to the "Lag" column.
library(dplyr)
# to sort "Na" column and arrange "Nu" in descending order
# in order to apply the code below.
test2 <- tt %>% arrange(Na, desc(Nu))
for (i in 1:nrow(test2)) {
if (i < nrow(test2)) {
if (test2[i, ]$Nu - 2 > test2[i+1, ]$Nu) {
test2[i, 4] <- "N"
test2[i, 5] <- test2[i, ]$Nu - test2[i+1, ]$Nu
} else if (test2[i, ]$Nu - 2 <= test2[i+1, ]$Nu) {
test2[i, 4] <- "Y"
test2[i, 5] <- test2[i, ]$Nu - test2[i+1, ]$Nu
}
} else if (i == nrow(test2)) {
test2[i, 4] <- "N"
test2[i, 5] <- 0
}
}
names(test2)[names(test2) == "V4"] <- "Sta"
names(test2)[names(test2) == "V5"] <- "Lag"
test2
After running the code, it produces the result as below:
Nu Na Vo Sta Lag
1 112 A 50 N 4
2 108 A 1 Y 1
3 107 A 1 Y 1
4 106 A 5 Y 2
5 104 A 2 Y 1
6 103 A 2 N 3
7 100 A 60 Y -20
8 120 B 30 N 11
9 109 B 2 Y 1
10 108 B 4 N 8
11 100 B 1 Y -9
12 109 C 40 N 0
The values under the "Sta" column are properly assigned, but those in the "Lag" column are not. The original intention is to apply the code separately for each value/level of "Na", that is "A", "B" and "C". I don't know how to apply the code to "A", "B" and "C" separately and combine the separate results into ONE table. The desired outcome is:
Nu Na Vo Sta Lag
1 112 A 50 N 4
2 108 A 1 Y 1
3 107 A 1 Y 1
4 106 A 5 Y 2
5 104 A 2 Y 1
6 103 A 2 N 3
7 100 A 60 N 0 << Last row for "A". "Lag" should be "0"; "Sta" should be "N".
8 120 B 30 N 11
9 109 B 2 Y 1
10 108 B 4 N 8
11 100 B 1 N 0 << Last row for "B". "Lag" should be "0"; "Sta" should be "N".
12 109 C 40 N 0 << Last row for "C". "Lag" should be "0"; "Sta" should be "N".
Edited
I'm not sure how to apply the code to the different factors / levels of "Na": "A", "B" and "C". Is it possible to use split() or the apply family of functions? As can be seen from the result and the intent of the code above, the result should depend on the factor / level / element (I hope I'm using the proper terminology), and this affects the values in both the "Sta" and "Lag" columns. However, my code cannot distinguish between them. I would appreciate any help. Thanks.
An inelegant solution!
For completeness, I post herewith a possible solution. I code it the hard way. If anyone could help simplify it, it would be very much appreciated.
con <- textConnection('
Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40
')
tt <- read.table(con, header = T)
close(con)
require(dplyr); require(data.table)
test2 <- tt %>% arrange(Na, desc(Nu))
spl <- split(test2, test2$Na)
spl
for (i in seq_along(spl)) {
for (j in 1:nrow(spl[[i]])) {
if (j < nrow(spl[[i]])) {
if (spl[[i]][j, ]$Nu - 2 > spl[[i]][j+1, ]$Nu) {
spl[[i]][j, 4] <- "N"
spl[[i]][j, 5] <- spl[[i]][j, ]$Nu - spl[[i]][j+1, ]$Nu
} else if (spl[[i]][j, ]$Nu - 2 <= spl[[i]][j+1, ]$Nu) {
spl[[i]][j, 4] <- "Y"
spl[[i]][j, 5] <- spl[[i]][j, ]$Nu - spl[[i]][j+1, ]$Nu
}
} else if (j == nrow(spl[[i]])) {
spl[[i]][j, 4] <- "N"
spl[[i]][j, 5] <- 0
}
}
}
spl <- rbindlist(spl)
setnames(spl, c("V4", "V5"), c("Sta", "Lag"))
spl
ave to the rescue - if applied twice this will essentially do the same comparisons as your long loop code.
First, calculate the lag differences using diff for each group, and set the value for the last row in each group to 0. Then use the computed lag values to determine the "Sta" column, forcing the last row in each group's value to be assigned "N".
test2$Lag <- with(test2, ave(Nu, Na, FUN=function(x) -c(diff(x),0)) )
test2$Sta <- with(test2, ave(Lag, Na, FUN=function(x) {
out <- ifelse(x > 2, "N", "Y"); out[length(out)] <- "N"; out}))
Same result as requested:
test2[c(1:3,5,4)]
# Nu Na Vo Sta Lag
#1 112 A 50 N 4
#2 108 A 1 Y 1
#3 107 A 1 Y 1
#4 106 A 5 Y 2
#5 104 A 2 Y 1
#6 103 A 2 N 3
#7 100 A 60 N 0
#8 120 B 30 N 11
#9 109 B 2 Y 1
#10 108 B 4 N 8
#11 100 B 1 N 0
#12 109 C 40 N 0
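A self-contained base R sketch of the ave approach above, for readers who want to run it from scratch. It swaps dplyr's arrange for order() and marks the last row of each group with duplicated(fromLast = TRUE). Both are substitutions made here for illustration, and last_in_group is a made-up name.

```r
tt <- read.table(text = "Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40", header = TRUE)

# sort by Na, then by Nu in descending order
test2 <- tt[order(tt$Na, -tt$Nu), ]
rownames(test2) <- NULL

# within each Na group: difference to the next row, 0 for the last row
test2$Lag <- ave(test2$Nu, test2$Na, FUN = function(x) c(-diff(x), 0))

# TRUE for the last row of each Na group
last_in_group <- !duplicated(test2$Na, fromLast = TRUE)

# "N" when the gap exceeds 2 or the row is the last of its group
test2$Sta <- ifelse(test2$Lag > 2 | last_in_group, "N", "Y")

test2[c("Nu", "Na", "Vo", "Sta", "Lag")]
```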

Change the index number of a dataframe

After some manipulation of a dataframe, I got a result dataframe, but the index is not listed sequentially, as shown below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
161 AM 86 30.13
171 CM 1 104
18 CO 27 1244.81
19 US 23 1369.61
20 VK 2 245
21 VS 11 1273.82
112 fqa 78 1752.22
24 SN 78 1752.22
I would like to get the result below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
1 AM 86 30.13
2 CM 1 104
3 CO 27 1244.81
4 US 23 1369.61
5 VK 2 245
6 VS 11 1273.82
7 fqa 78 1752.22
8 SN 78 1752.22
How can I get this?
These are the rownames of your dataframe, which by default are 1:nrow(dfr). When you reordered the dataframe, the original rownames were reordered along with the rows. To have the rows of the new order numbered sequentially, just use:
rownames(dfr) <- 1:nrow(dfr)
Or, simply
rownames(dfr) <- NULL
gives what you want.
> d <- data.frame(x = LETTERS[1:5], y = letters[1:5])[sample(5, 5), ]
> d
x y
5 E e
4 D d
3 C c
2 B b
1 A a
> rownames(d) <- NULL
> d
x y
1 E e
2 D d
3 C c
4 B b
5 A a
The index is actually the data frame row names. To change them, you can do something like:
rownames(dd) = 1:dim(dd)[1]
or
rownames(dd) = 1:nrow(dd)
Personally, I never use rownames.
In your example, I suspect that you don't need to worry about them either, since you are just renaming them 1 to n. In particular, when you subset your data frame the rownames will again be incorrect. For example,
##Simple data frame
R> dd = data.frame(a = rnorm(6))
R> dd$type = c("A", "B")
R> rownames(dd) = 1:nrow(dd)
R> dd
a type
1 2.1434 A
2 -1.1067 B
3 0.7451 A
4 -0.1711 B
5 1.4348 A
6 -1.3777 B
##Basic subsetting
R> dd_sub = dd[dd$type=="A",]
##Rownames are "wrong"
R> dd_sub
a type
1 2.1434 A
3 0.7451 A
5 1.4348 A
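If sequential numbering is wanted after subsetting, the row names can be reset on the subset. A minimal sketch, with made-up data in place of the random values above:

```r
dd <- data.frame(a = 1:6, type = rep(c("A", "B"), 3))

# subsetting keeps the original row names
dd_sub <- dd[dd$type == "A", ]
rownames(dd_sub)   # "1" "3" "5"

# assigning NULL resets them to 1..n
rownames(dd_sub) <- NULL
rownames(dd_sub)   # "1" "2" "3"
```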
