How to get the output from the function in a column of a dataframe - R

Hi I have a simple function:
same_picking <- function(cena){
data_model2$price_model2 <- 0.6 + cena * data_model2$item_SKU + 0.4
}
I would like the output to be written into a column of a data.frame.
Currently the column is still filled with NAs, because the function has never successfully written to it. I would like the values in that column to be overwritten after every run of the function.
count_code sifra item_SKU price_model2
281 0421 2 NA
683 0499 5 NA
903 0654 3 NA
7390 0942 3 NA
2778 0796 5 NA
2778 0796 7 NA
7066 0907 83 NA
281 0421 2 NA
I have tried the commands data.frame and within, but they got me nowhere.
I would appreciate the help.
Andraz
Solution:
same_picking <- function(cena){
data_model2$price_model2 <<- 0.6 + cena * data_model2$item_SKU + 0.4
}
The <<- operator assigns to the object in the enclosing environment (here the global one) instead of changing a local copy inside the function. Very clean :)
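To make the difference concrete, here is a minimal sketch. The three-row data_model2 and the helper names local_assign and global_assign are made up for illustration and are not from the question.

```r
# Hypothetical miniature version of the question's data
data_model2 <- data.frame(item_SKU = c(2, 5, 3), price_model2 = NA_real_)

# `<-` modifies a copy that lives only inside the function
local_assign <- function(cena) {
  data_model2$price_model2 <- 0.6 + cena * data_model2$item_SKU + 0.4
  invisible(NULL)
}
local_assign(1)
after_local <- data_model2$price_model2   # still all NA

# `<<-` assigns to data_model2 in the enclosing (here: global) environment
global_assign <- function(cena) {
  data_model2$price_model2 <<- 0.6 + cena * data_model2$item_SKU + 0.4
  invisible(NULL)
}
global_assign(1)
after_global <- data_model2$price_model2  # now filled in
```

Keep in mind that <<- ties the function to one particular object name outside of it, so the function only works while an object called data_model2 exists there.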

The simplest way would be to return the df from function:
df <- read.table(
text = "count_code sifra item_SKU price_model2
281 0421 2 NA
683 0499 5 NA
903 0654 3 NA
7390 0942 3 NA
2778 0796 5 NA
2778 0796 7 NA
7066 0907 83 NA
281 0421 2 NA",
header = TRUE)
head(df, 2)
# count_code sifra item_SKU price_model2
# 1 281 421 2 NA
# 2 683 499 5 NA
# 1st ---------------------------------------------------------------------
same_picking_1 <- function(df, cena){
df$price_model2 <- 0.6 + cena * df$item_SKU + 0.4
return(df)
}
df2 <- same_picking_1(df, 1)
head(df2, 2)
# count_code sifra item_SKU price_model2
# 1 281 421 2 3
# 2 683 499 5 6
Other options, data.table and dplyr:
same_picking_2 <- function(cena, item_SKU){
  # use the item_SKU argument rather than reaching for the global df
  0.6 + cena * item_SKU + 0.4
}
# data.table --------------------------------------------------------------
library(data.table)
dt <- data.table(df)
dt[, price_model2 := same_picking_2(1, item_SKU)]
head(dt, 2)
# count_code sifra item_SKU price_model2
# 1: 281 421 2 3
# 2: 683 499 5 6
# dplyr -------------------------------------------------------------------
library(dplyr)
df3 <- df %>% mutate(price_model2 = same_picking_2(1, item_SKU))
head(df3, 2)
# count_code sifra item_SKU price_model2
# 1 281 421 2 3
# 2 683 499 5 6
Edit after OP comment:
You can also wrap data.table solution into a function
# data.table --------------------------------------------------------------
library(data.table)
same_picking_2_int <- function(cena, item_SKU){
  # use the item_SKU argument rather than reaching for the global df
  0.6 + cena * item_SKU + 0.4
}
same_picking_2 <- function(dt, cena){
dt[, price_model2 := same_picking_2_int(cena, item_SKU)]
}
# Use update by reference
dt <- data.table(df)
head(dt, 2)
same_picking_2(dt, 1)
head(dt, 2)
# Slightly more readable, the same output, also utilizes the update by reference of data.table (see tracemem())
dt <- data.table(df)
tracemem(dt)
head(dt, 2)
dt <- same_picking_2(dt, 1)
head(dt, 2)

Related

Replacing NA with mean using loop in R

I have to solve this problem using loop in R (I am aware that you can do it much more easily without loops, but it is for school...).
So I have a vector with NAs like this:
trades<-sample(1:500,150,T)
trades<-trades[order(trades)]
trades[sample(10:140,25)]<-NA
and I have to create a FOR loop that replaces each NA with the mean of the 2 numbers before the NA and the 2 numbers after it.
This I am able to do, with loop like this:
for (i in 1:length(trades)) {
if (is.na(trades[i])==T) {
trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]), na.rm = T)
}
}
But there is another part to the homework. If there is an NA within the 2 previous or 2 following numbers, then the NA has to be replaced with the mean of the 4 previous and 4 following numbers (I presume with the NAs removed). But I just am not able to crack it... My best results so far come from this loop:
for (i in 1:length(trades)) {
if (is.na(trades[i])==T && is.na(trades[c(i-1:2)]==T || is.na(trades[c(i+1:2)]==T))) {
trades[i] <- mean(c(trades[c(i-1:4)], trades[c(i+1:4)]), na.rm = T)
}else if (is.na(trades[i])==T){
trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]))
}
}
But it still misses some NAs.
Thank you for your help in advance.
We can use na.approx from zoo
library(zoo)
na.approx(trades)
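Note that na.approx fills gaps by linear interpolation, which is not the same as the neighbour mean the homework asks for. The following base R sketch (the toy vector x is an assumption for illustration) shows the difference using approx():

```r
# Toy vector, made up for illustration
x <- c(1, 2, NA, 10, 20)

# Linear interpolation over the positions of the observed values
ok <- !is.na(x)
interpolated <- approx(which(ok), x[ok], xout = seq_along(x))$y

# Mean of the 2 values before and the 2 values after the NA at position 3
neighbour_mean <- mean(x[c(1, 2, 4, 5)])

interpolated     # the NA becomes 6, halfway between 2 and 10
neighbour_mean   # 8.25
```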
Here is another solution using a loop. I shortened some of the code by using lead and lag from dplyr. First, two recursive functions compute the sums of the n following and n preceding values. Then conditional statements check whether any of those neighbours are missing. Finally, each NA is filled with either the mean derived from the recursive sums or the mean of the 4 preceding and 4 following values (with NAs removed). I would note that this is not how I would normally approach the problem, but I tried it with a loop as requested.
library(dplyr)
r.lag <- function(x, n){
if (n == 1) return(lag(x = x, n = 1))
else return( lag(x = x, n = n) + r.lag(x = x, n = n-1))
}
r.lead <- function(x, n){
if (n == 1) return(lead(x = x, n = 1))
else return( lead(x = x, n = n) + r.lead(x = x, n = n-1))
}
lead.vec <- r.lead(trades, 2)
lag.vec <- r.lag(trades, 2)
output <- numeric(length(trades))
for(i in seq_along(trades)){
  if(!is.na(trades[[i]])){
    # keep observed values as they are
    output[[i]] <- trades[[i]]
  } else if(!is.na(lead.vec[[i]]) && !is.na(lag.vec[[i]])){
    # both neighbours on each side are present: mean of those 4 values
    output[[i]] <- (lead.vec[[i]] + lag.vec[[i]])/4
  } else {
    # otherwise use up to 4 values on each side, clipped to the vector bounds
    idx <- c((i-4):(i-1), (i+1):(i+4))
    idx <- idx[idx >= 1 & idx <= length(trades)]
    output[[i]] <- mean(trades[idx], na.rm = TRUE)
  }
}
tibble(
original = trades,
filled = output
)
#> # A tibble: 150 x 2
#> original filled
#> <int> <dbl>
#> 1 7 7
#> 2 7 7
#> 3 12 12
#> 4 18 18
#> 5 30 30
#> 6 31 31
#> 7 36 36
#> 8 NA 40
#> 9 43 43
#> 10 50 50
#> # … with 140 more rows
So it seems that posting to StackOverflow helped me solve the problem.
trades<-sample(1:500,25,T)
trades<-trades[order(trades)]
trades[sample(1:25,5)]<-NA
which gives us:
[1] NA 20 24 30 NA 77 188 217 238 252 264 273 296 NA 326 346 362 368 NA NA 432 451 465 465 490
and if you run this loop:
for (i in 1:length(trades)) {
if (is.na(trades[i])== T) {
test1 <- c(trades[c(i+1:2)])
if (any(is.na(test1))==T) {
test2 <- c(trades[abs(c(i-1:4))], trades[c(i+1:4)])
trades[i] <- round(mean(test2, na.rm = T),0)
}else {
test3 <- c(trades[abs(c(i-1:2))], trades[c(i+1:2)])
trades[i] <- round(mean(test3, na.rm = T),0)
}
}
}
it changes the NAs to this:
[1] 22 20 24 30 80 77 188 217 238 252 264 273 296 310 326 346 362 368 387 410 432 451 465 465 490
So it works pretty much as expected.
Thank you for all your help.
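For comparison, here is a sketch of the same rule written as a function that clips the neighbour indices to the vector bounds instead of relying on abs(). The function name fill_na and the toy vector are assumptions for illustration. Unlike the loop above, it checks both sides of the NA and always reads neighbours from the original vector, so values filled earlier do not feed into later ones.

```r
# Hypothetical helper: fill each NA with the mean of its neighbours
fill_na <- function(x) {
  n <- length(x)
  out <- x
  for (i in which(is.na(x))) {
    # 2 positions before and 2 after, kept inside 1..n
    near <- c(i - 2:1, i + 1:2)
    near <- near[near >= 1 & near <= n]
    if (anyNA(x[near])) {
      # an NA among the close neighbours: widen to 4 before and 4 after
      wide <- c(i - 4:1, i + 1:4)
      wide <- wide[wide >= 1 & wide <= n]
      out[i] <- mean(x[wide], na.rm = TRUE)
    } else {
      out[i] <- mean(x[near])
    }
  }
  out
}

# Toy input, made up for illustration
x <- c(10, 20, NA, 40, 50, NA, NA, 80, 90, 100)
filled <- fill_na(x)
```

Here position 3 gets the mean of 10, 20, 40 and 50, while positions 6 and 7 fall back to the wider window because they are next to each other.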

Adding a second column as a function of first with data.table's (`:=`) [duplicate]

I want to create a new data.table or maybe just add some columns to a data.table. It is easy to specify multiple new columns but what happens if I want a third column to calculate a value based on one of the columns I am creating. I think plyr package can do something such as that. Can we perform such iterative (sequential) column creation in data.table?
I want to do as follows
dt <- data.table(shop = 1:10, income = 10:19*70)
dt[ , list(hope = income * 1.05, hopemore = income * 1.20, hopemorerealistic = hopemore - 100)]
or maybe
dt[ , `:=`(hope = income*1.05, hopemore = income*1.20, hopemorerealistic = hopemore-100)]
You can also use <- within the call to list, e.g.
DT <- data.table(a=1:5)
DT[, c('b','d') := list(b1 <- a*2, b1*3)]
DT
a b d
1: 1 2 6
2: 2 4 12
3: 3 6 18
4: 4 8 24
5: 5 10 30
Or
DT[, `:=`(hope = hope <- a+1, z = hope-1)]
DT
a b d hope z
1: 1 2 6 2 1
2: 2 4 12 3 2
3: 3 6 18 4 3
4: 4 8 24 5 4
5: 5 10 30 6 5
It is possible by using curly braces and semicolons in j.
There are multiple ways to go about it, here are two examples:
# If you simply want to output:
dt[ ,
{hope=income*1.05;
hopemore=income*1.20;
list(hope=hope, hopemore=hopemore, hopemorerealistic=hopemore-100)}
]
# if you want to save the values
dt[ , c("hope", "hopemore", "hopemorerealistic") :=
{hope=income*1.05;
hopemore=income*1.20;
list(hope, hopemore, hopemore-100)}
]
dt
# shop income hope hopemore hopemorerealistic
# 1: 1 700 735.0 840 740
# 2: 2 770 808.5 924 824
# 3: 3 840 882.0 1008 908
# 4: 4 910 955.5 1092 992
# 5: 5 980 1029.0 1176 1076
# 6: 6 1050 1102.5 1260 1160
# 7: 7 1120 1176.0 1344 1244
# 8: 8 1190 1249.5 1428 1328
# 9: 9 1260 1323.0 1512 1412
# 10: 10 1330 1396.5 1596 1496

R: sum vector by vector of conditions

I am trying to obtain a vector which contains, for each condition, the sum of the elements that fit it.
values = runif(5000)
bin = seq(0, 0.9, by = 0.1)
sum(values < bin)
I expected sum to return 10 values: one sum of the "values" elements that fit the "<" condition for each "bin" element.
However, it returns only one value.
How can I achieve the result without using a while loop?
I understand this to mean that you want, for each value in bin, the number of elements in values that are less than that value. So I think you want vapply() here:
vapply(bin, function(x) sum(values < x), 1L)
# [1] 0 497 1025 1501 1981 2461 2955 3446 3981 4526
If you want a little table for reference, you could add names
v <- vapply(bin, function(x) sum(values < x), 1L)
setNames(v, bin)
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# 0 497 1025 1501 1981 2461 2955 3446 3981 4526
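As for why the original sum(values < bin) returns a single number: the comparison recycles bin along values and produces one long logical vector, which sum() then collapses. A base R sketch with outer() shows an alternative without an explicit apply call (the seed and the names counts, counts_vapply and sums are assumptions for illustration):

```r
set.seed(1)                      # seed chosen only for reproducibility
values <- runif(5000)
bin <- seq(0, 0.9, by = 0.1)

# values < bin recycles bin along values and yields ONE logical vector
# of length 5000, so sum() collapses it to a single number
length(values < bin)

# outer() compares every value with every bin: a 5000 x 10 logical matrix,
# and colSums() counts the TRUEs per bin
counts <- colSums(outer(values, bin, "<"))

# same counts as the vapply() version
counts_vapply <- vapply(bin, function(x) sum(values < x), 1L)

# if the sum of the matching elements is wanted instead of their count
sums <- vapply(bin, function(x) sum(values[values < x]), numeric(1))
```

The outer() version builds the full comparison matrix in memory, so for very long vectors the vapply() form may be the lighter choice.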
I personally prefer data.table over tapply or vapply, and findInterval over cut.
set.seed(1)
library(data.table)
dt <- data.table(values, groups=findInterval(values, bin))
setkey(dt, groups)
dt[,.(n=.N, v=sum(values)), groups][,list(cumsum(n), cumsum(v)),]
# V1 V2
# 1: 537 26.43445
# 2: 1041 101.55686
# 3: 1537 226.12625
# 4: 2059 410.41487
# 5: 2564 637.18782
# 6: 3050 904.65876
# 7: 3473 1180.53342
# 8: 3951 1540.18559
# 9: 4464 1976.23067
#10: 5000 2485.44920
cbind(vapply(bin, function(x) sum(values < x), 1L)[-1],
cumsum(tapply( values, cut(values, bin), sum)))
# [,1] [,2]
#(0,0.1] 537 26.43445
#(0.1,0.2] 1041 101.55686
#(0.2,0.3] 1537 226.12625
#(0.3,0.4] 2059 410.41487
#(0.4,0.5] 2564 637.18782
#(0.5,0.6] 3050 904.65876
#(0.6,0.7] 3473 1180.53342
#(0.7,0.8] 3951 1540.18559
#(0.8,0.9] 4464 1976.23067
Using tapply with a cut()-constructed INDEX vector seems to deliver:
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.43052 71.06897 129.99698 167.56887 222.74620 277.16395
(0.6,0.7] (0.7,0.8] (0.8,0.9]
332.18292 368.49341 435.01104
Although I'm guessing you would want the cut-vector to extend to 1.0:
bin = seq(0, 1, by = 0.1)
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.48087 69.87902 129.37348 169.46013 224.81064 282.22455
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
335.43991 371.60885 425.66550 463.37312
I see that I understood the question differently than Richard. If you wanted his result you can use cumsum on my result.
Using dplyr:
set.seed(1)
library(dplyr)
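# df is assumed to hold the values and their bin labels from cut()
df <- data.frame(values, groups = cut(values, bin))
df %>% group_by(groups) %>%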
summarise(count = n(), sum = sum(values)) %>%
mutate(cumcount= cumsum(count), cumsum = cumsum(sum))
Output:
groups count sum cumcount cumsum
1 (0,0.1] 537 26.43445 537 26.43445
2 (0.1,0.2] 504 75.12241 1041 101.55686
3 (0.2,0.3] 496 124.56939 1537 226.12625
4 (0.3,0.4] 522 184.28862 2059 410.41487
5 (0.4,0.5] 505 226.77295 2564 637.18782
6 (0.5,0.6] 486 267.47094 3050 904.65876
7 (0.6,0.7] 423 275.87466 3473 1180.53342
8 (0.7,0.8] 478 359.65217 3951 1540.18559
9 (0.8,0.9] 513 436.04508 4464 1976.23067
10 NA 536 509.21853 5000 2485.44920

Filter rows based on values of multiple columns in R

Here is the data set, say name is DS.
Abc Def Ghi
1 41 190 67
2 36 118 72
3 12 149 74
4 18 313 62
5 NA NA 56
6 28 NA 66
7 23 299 65
8 19 99 59
9 8 19 61
10 NA 194 69
How can I get a new dataset DSS where the value of column Abc is greater than 25 and the value of column Def is greater than 100? It should also ignore any row where the value of at least one column is NA.
I have tried a few options but wasn't successful. Your help is appreciated.
There are multiple ways of doing it. I have given 5 methods, and the first 4 methods are faster than the subset function.
R Code:
# Method 1: rows where the condition is NA come back as all-NA rows; na.omit() drops them
DS_Filtered <- na.omit(DS[(DS$Abc > 25 & DS$Def > 100), ])
# Method 2: which() also ignores NA
DS_Filtered <- DS[ which( DS$Abc > 25 & DS$Def > 100) , ]
# Method 3: same as Method 1 with each condition in its own parentheses
DS_Filtered <- na.omit(DS[(DS$Abc > 25) & (DS$Def > 100), ])
# Method 4: using the dplyr package (filter() drops rows where the condition is NA)
library(dplyr)
DS_Filtered <- filter(DS, Abc > 25, Def > 100)
DS_Filtered <- DS %>% filter(Abc > 25 & Def > 100)
# Method 5: subset() ignores NA by default
DS_Filtered <- subset(DS, Abc > 25 & Def > 100)
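As a quick check, here is a base R sketch that rebuilds DS from the table in the question and combines the two conditions with complete.cases(), so that a row with an NA in any column is excluded explicitly. The name keep is an assumption for illustration.

```r
# DS rebuilt from the table in the question
DS <- data.frame(
  Abc = c(41, 36, 12, 18, NA, 28, 23, 19, 8, NA),
  Def = c(190, 118, 149, 313, NA, NA, 299, 99, 19, 194),
  Ghi = c(67, 72, 74, 62, 56, 66, 65, 59, 61, 69)
)

# complete.cases() is FALSE for any row containing an NA, and FALSE & NA is
# FALSE, so keep never contains NA
keep <- complete.cases(DS) & DS$Abc > 25 & DS$Def > 100
DSS <- DS[keep, ]
DSS   # rows 1 and 2 remain
```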

I am trying to change the sign of a number to negative if it is below a specific target and it is to remain positive if it is above.

The section of code under the "## sign trades" comment seems to be the problem. Here is the code:
## price impact analysis
rm(list=ls())
### import data from excel spreadsheets
chtr_trades <- read.csv("F:/FRE 6951 Mkt Micro Struc/CHTRTRADES.csv")
chtr_quotes <- read.csv("F:/FRE 6951 Mkt Micro Struc/CHTRQUOTES.csv")
## initialize bid ask
max_bid <- NULL
min_ask <- NULL
### cleans data
maxrm <- function(x) {
max(x, na.rm=TRUE)
}
minrm <- function(x) {
min(x, na.rm=TRUE)
}
## retrieve max bid and ask for each iteration
max_bid<- tapply(chtr_quotes[,4],chtr_quotes[,3], maxrm)
min_ask<- tapply(chtr_quotes[,5],chtr_quotes[,3], minrm)
time <- levels(chtr_quotes[,3])
## calculate previous second midpoint
midpoint <- (min_ask + max_bid)/2
askbidtime <- data.frame(midpoint,time,max_bid,min_ask)
row.names(askbidtime) <- seq(nrow(askbidtime))
askbidtime[,2] <- as.POSIXct(askbidtime[,2], format="%H:%M:%S")
ordered.askbidtime <- askbidtime[order(askbidtime$time),]
row.names(ordered.askbidtime) <- seq(nrow(ordered.askbidtime))
chtr_trades_revised <-chtr_trades[which(as.POSIXct(chtr_trades[,3],format="%H:%M:%S") %in% ordered.askbidtime[,2]),]
midpoint<-NULL
midpoint[1:5] <- NA
for(i in 6:3917) {
midpoint[i] <- as.numeric(ordered.askbidtime[which(ordered.askbidtime[,2]==as.POSIXct(chtr_trades_revised[i,3],format="%H:%M:%S"))-1,1])
}
## sign trades
chtr_trades_revised$midpoint
chtr_trades_revised$midpoint <- midpoint
for(i in 6:3917) {
if((!is.na(chtr_trades_revised$midpoint[i])) & (chtr_trades_revised$midpoint[i] > chtr_trades_revised$PRICE[i])) {
chtr_trades_revised$signed_volume <- -chtr_trades_revised$SIZE
}
if((!is.na(chtr_trades_revised$midpoint[i])) & (chtr_trades_revised$midpoint[i] < chtr_trades_revised$PRICE[i])) {
chtr_trades_revised$signed_volume <- chtr_trades_revised$SIZE
}
}
Here are the results. In the last column rows 4062 and 4054 should be positive but it makes the entire column negative:
SYMBOL DATE TIME PRICE SIZE midpoint signed_volume
4060 CHTR 20130718 2014-08-26 15:59:44 124.46 100 124.485 -100
4061 CHTR 20130718 2014-08-26 15:59:52 124.46 100 124.495 -100
4062 CHTR 20130718 2014-08-26 15:59:55 124.52 100 124.490 -100
4063 CHTR 20130718 2014-08-26 15:59:58 124.53 100 124.410 -100
4064 CHTR 20130718 2014-08-26 16:00:00 124.57 7951 124.550 -7951
4065 CHTR 20130718 2014-08-26 16:00:00 124.53 100 124.550 -100
Here's a cute way:
foo<- 1:10
threshold <- 5
foo<- foo*(-1)^(foo < threshold)
foo
[1] -1 -2 -3 -4 5 6 7 8 9 10
Another method:
foo = 1:10 ; threshold = 5
foo
[1] 1 2 3 4 5 6 7 8 9 10
foo = ifelse(foo>=threshold, foo, -foo)
foo
[1] -1 -2 -3 -4 5 6 7 8 9 10
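Applied to the question itself: the assignments inside the loop have no [i] index, so every iteration overwrites the whole signed_volume column and the last matching row decides the sign for all rows. A vectorized sketch avoids the loop altogether. The data frame trades_demo below is made up from the six rows shown in the question, and leaving the result NA when the midpoint is missing or equal to the price is an assumption.

```r
# Hypothetical data frame built from the rows printed in the question
trades_demo <- data.frame(
  PRICE    = c(124.46, 124.46, 124.52, 124.53, 124.57, 124.53),
  SIZE     = c(100, 100, 100, 100, 7951, 100),
  midpoint = c(124.485, 124.495, 124.490, 124.410, 124.550, 124.550)
)

# Negative size when the trade is below the midpoint, positive when above;
# NA when the midpoint is missing or equal to the price
trades_demo$signed_volume <- with(
  trades_demo,
  ifelse(midpoint > PRICE, -SIZE,
         ifelse(midpoint < PRICE, SIZE, NA))
)
trades_demo$signed_volume
```

If the loop is kept instead, indexing both sides of each assignment with [i] should have the same effect.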

Resources