How to cut off values in a dataframe? - R

From the following dataframe:
df1 <- data.frame(id=c(1, 2, 3, 4, 5, 6, 7),
revenue=c(34, 1000, 40, 49, 43, 55, 99))
df2 <- data.frame(id=c(1, 2, 3, 4, 5, 6, 7),
expenses=c(22, 26, 31, 40, 20, 2000, 22))
df3 <- data.frame(id=c(1, 2, 3, 4, 5, 6, 7),
profit=c(12, 10, 14, 12, 9, 15, 16))
df_list <- list(df1, df2, df3)
test <- Reduce(function(x, y) merge(x, y, all=TRUE), df_list)
rownames(test) <- test[,1]
test[,1] <- NULL
test
I would like to eliminate extreme values (e.g. 1000 and 2000), so I need to cut off everything that is greater than 100. When I check test<100 I get a matrix of TRUE and FALSE positions, but I would like to replace the values above the cutoff with NA or zeroes.

To replace every value greater than 100 in a dataframe (df) with 0, simply use: df[df > 100] <- 0
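As a minimal sketch of that indexing approach, assuming a cut-down stand-in for the question's test data frame (the name capped is introduced here purely for illustration), the comparison yields a logical matrix that selects the cells to overwrite:

```r
# Small stand-in for the merged `test` data frame from the question
test <- data.frame(revenue  = c(34, 1000, 40),
                   expenses = c(22, 26, 2000),
                   profit   = c(12, 10, 14))

capped <- test               # work on a copy so `test` itself stays intact
capped[capped > 100] <- NA   # logical-matrix indexing; assign 0 instead of NA if preferred
capped
```

Working on a copy keeps the original values available in case the cutoff has to be changed later.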

We can use replace(), which returns a modified copy and leaves test unchanged:
replace(test, test>100, NA)
revenue expenses profit
1 34 22 12
2 NA 26 10
3 40 31 14
4 49 40 12
5 43 20 9
6 55 NA 15
7 99 22 16

Related

Taking the mean of a column within a function and a for loop

I have the below function :
compute_treatment_effects <- function(dataset, outcome, baseline_outcome,
covariates,
standardize){
baseline_covariates <- c(baseline_outcome, covariates)
dataset <- dataset %>%
mutate(treat =ifelse(treatment_group == "trt", 1,
ifelse(treatment_group == "control", 0, NA))) %>%
filter(!is.na(treat))
if (standardize){
dataset[,outcome] <- (dataset[,outcome] - mean(dataset[dataset$treat==0,outcome], na.rm=TRUE))/
sd(dataset[dataset$treat==0,outcome], na.rm=TRUE)
}
}
Now the issue is that whenever it gets to the standardization step, I get an error:
"Error in is.data.frame(x) :
'list' object cannot be coerced to type 'double'
In addition: Warning message:
In mean.default(dataset[dataset$treat == 0, outcome], na.rm = TRUE)"
I am really not sure why this is the case; I don't believe the syntax is wrong anywhere.
Here is an example of a dataframe to use with the code:
dataframe <- data.frame("var1" = c(1, 2, 5, 1, 642, 5, 1, 2, 5, 9, NA, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10 ),
"Var2" = c(1, 3, 5, 1, 642, 5, NA, NA, NA, NA, NA, NA, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10 ),
"var3" = c(1, 2, 635, 9, NA, 1, 2, 5, NA, NA, 12, NA, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10),
"var4" = c(1, 21, 15, 19, NA, 1, 26656, 56,6 , NA, 512, NA, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10),
"cov1" = c(1, 22,335, 29, NA, NA, NA, 645, NA, NA, 12, NA, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10),
"cov2" = c(44251, 2322,5, 29, 45, 35, 42, 645, 55, 525, NA, NA, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10),
"cov3" = c(154, 2552,35, 53529, 5, 3, 53542, 645, 25, 2, 12, 23, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10))
dataframe <- dataframe %>%
mutate(treatment_group = ifelse(var3 == 2, "trt", ifelse(var3 == 10, "control", NA)))
dataset <- dataframe
outcome <- "Var2"
baseline_outcome <- "var1"
covariates = c("cov1", "cov2","cov3")
Thank you so much!!!
It is possible that the OP's original dataset is a tibble or a data.table. Unlike a data.frame, whose [ defaults to drop = TRUE, neither of them simplifies a single-column subset such as dataset[, outcome] to a vector (in effect drop = FALSE), so mean() and sd() receive a one-column table instead of a numeric vector.
> compute_treatment_effects(as_tibble(dataset), outcome, baseline_outcome, covariates, standardize = TRUE)
Error in is.data.frame(x) :
'list' object cannot be coerced to type 'double'
In addition: Warning message:
In mean.default(dataset[dataset$treat == 0, outcome], na.rm = TRUE) :
argument is not numeric or logical: returning NA
One fix is to convert the input to a data.frame with as.data.frame:
compute_treatment_effects(as.data.frame(dataset), outcome, baseline_outcome, covariates, standardize = TRUE)
-output
var1 Var2 var3 var4 cov1 cov2 cov3 treatment_group treat
1 2 -Inf 2 21 22 2322 2552 trt 1
2 1 NA 2 26656 NA 42 53542 trt 1
3 10 NaN 10 10 10 10 10 control 0
4 10 NaN 10 10 10 10 10 control 0
5 10 NaN 10 10 10 10 10 control 0
6 10 NaN 10 10 10 10 10 control 0
7 10 NaN 10 10 10 10 10 control 0
8 10 NaN 10 10 10 10 10 control 0
9 10 NaN 10 10 10 10 10 control 0
10 10 NaN 10 10 10 10 10 control 0
11 10 NaN 10 10 10 10 10 control 0
12 10 NaN 10 10 10 10 10 control 0
Or change the function to use [[ instead of [ when subsetting the column, i.e.
compute_treatment_effects <- function(dataset, outcome, baseline_outcome,
covariates,
standardize){
baseline_covariates <- c(baseline_outcome, covariates)
dataset <- dataset %>%
mutate(treat =ifelse(treatment_group == "trt", 1,
ifelse(treatment_group == "control", 0, NA))) %>%
filter(!is.na(treat))
if (standardize){
dataset[[outcome]] <- (dataset[[outcome]] -
mean(dataset[[outcome]][dataset$treat==0], na.rm=TRUE))/
sd(dataset[[outcome]][dataset$treat==0], na.rm=TRUE)
}
dataset
}
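The difference between the subsetting forms can be seen on a toy data frame. This is only a sketch: the names df, col, v1, v2, v3 and m2 are made up for illustration, and drop = FALSE is used on a base data.frame to imitate what a tibble does by default, so no extra package is needed.

```r
df  <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
col <- "a"

v1 <- df[, col]                 # base data.frame default (drop = TRUE): numeric vector
v2 <- df[, col, drop = FALSE]   # one-column data frame, like a tibble's default
v3 <- df[[col]]                 # always the column itself, whatever the class

is.numeric(v1)       # TRUE
is.data.frame(v2)    # TRUE
mean(v3)             # 2

# mean() cannot average a data frame: it warns and returns NA
m2 <- suppressWarnings(mean(v2))
m2                   # NA
```

Because [[ behaves the same for data.frame and tibble inputs, it is the safer choice inside a function that may receive either.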

How to generate sequence with exclusions in R

I need 4 functions, each of which generates a sequence of numbers.
The first function generates the odd numbers, each repeated n times, except 5, 15, 25, etc.
example with n=2: 1, 1, 3, 3, 7, 7, 9, 9, 11, 11, 13, 13, 17, 17,...
The second function generates the even numbers, each repeated n times, except 10, 20, 30, etc.
example with n=2: 2, 2, 4, 4, 6, 6, 8, 8, 12, 12, 14, 14, 16, 16,...
The third function generates the numbers from 5 in steps of 10, each repeated n times
example with n=2: 5, 5, 15, 15, 25, 25,...
The fourth function generates the numbers from 10 in steps of 10, each repeated n times
example with n=2: 10, 10, 20, 20, 30, 30,...
Each function has to take the vector 1:N and n as inputs.
For example,
f1(1:10, 3)
> 1, 1, 1, 3, 3, 3, 7, 7, 7, 9
f2(1:5, 10)
> 2, 2, 2, 2, 2
f3(1:15, 5)
> 5, 5, 5, 5, 5, 15, 15, 15, 15, 15, 25, 25, 25, 25, 25
f4(1:2, 1)
> 10, 20
I have a partial solution for the first two functions, but I don't know how to exclude some numbers:
f1 <- function(x) 2*((x-1) %/% 10) + 1 # goes 1, 3, 5, etc for n = 10
f2 <- function(x) 2*((x-1) %/% 10 + 1) # goes 2, 4, 6, etc for n = 10
Why not use seq and rep?
n = 25
nrep = 2 # number of repetitions
by5 <- sort(rep(seq(5, n, by = 10), nrep )) # numbers from 5 by 10
by5
by10 <- sort(rep(seq(10, n, by = 10), nrep )) # numbers from 10 by 10
by10
odd <- sort(rep(seq(1, n, by = 2), nrep )) # odd number
odd[!odd %in% by5] # remove all the by5 values
even <- sort(rep(seq(2, n, by = 2), nrep )) # Even numbers
even[!even %in% by10] # remove all the by 10 values
output
> [1] 5 5 15 15 25 25
> [1] 10 10 20 20
> [1] 1 1 3 3 7 7 9 9 11 11 13 13 17 17 19 19 21 21 23 23
> [1] 2 2 4 4 6 6 8 8 12 12 14 14 16 16 18 18 22 22 24 24
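To match the f(1:N, n) signature asked for in the question, the same filter-then-repeat idea could be wrapped in functions. The following is only a sketch: the helper make_seq and its predicate argument keep are hypothetical names, and the input x is assumed to be 1:N, so only its length is used.

```r
# Build a generator from a predicate `keep` that says which integers qualify
make_seq <- function(keep) {
  function(x, n) {
    N      <- length(x)                  # x is assumed to be 1:N
    groups <- ceiling(N / n)             # how many distinct values are needed
    cand   <- seq_len(10 * groups + 10)  # every decade holds at least one match
    vals   <- cand[keep(cand)][seq_len(groups)]
    rep(vals, each = n)[seq_len(N)]      # repeat each value n times, trim to N
  }
}

f1 <- make_seq(function(k) k %% 2 == 1 & k %% 10 != 5)  # odd, skipping 5, 15, ...
f2 <- make_seq(function(k) k %% 2 == 0 & k %% 10 != 0)  # even, skipping 10, 20, ...
f3 <- make_seq(function(k) k %% 10 == 5)                # 5, 15, 25, ...
f4 <- make_seq(function(k) k %% 10 == 0)                # 10, 20, 30, ...

f1(1:10, 3)   # 1 1 1 3 3 3 7 7 7 9
f4(1:2, 1)    # 10 20
```

Using rep(..., each = n) keeps the values grouped without the extra sort() call.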

2 columns into list and sort in R

Let's say we have two lists:
x <- c(1, 3, 4, 2, 6, 5)
y <- c(12, 14, 15, 61, 71, 21)
I want to combine them into a list so that we have 2 columns, x and y, with the values in the same order.
x <- c(1, 3, 4, 2, 6, 5)
y <- c(12, 14, 15, 61, 71, 21)
Once I have the list, I want to sort it on y so the final list looks like
x <- c(1, 3, 4, 5, 2, 6)
y <- c(12, 14, 15, 21, 61, 71)
I am really new to R.
I tried list(x,y) but it seems to make a
list(1, 3, 4, 2, 6, 5, 12, 14, 15, 61, 71, 21)
so I was wondering if someone could help me.
You need to put them in a data.frame first and then use order:
x <- c(1, 3, 4, 2, 6, 5)
y <- c(-12, 14, 15, 61, 71, 21)
DF <- data.frame(x, y)
> DF[order(DF$y),]
x y
1 1 -12
2 3 14
3 4 15
6 5 21
4 2 61
5 6 71
Keeping it as a list, using lapply:
x <- c(1, 3, 4, 2,6,5)
y <- c(12, 14,15,61,71,21)
l <- list(x = x, y = y)
## thelatemail
lapply(l, `[`, order(l$y))
# $x
# [1] 1 3 4 5 2 6
#
# $y
# [1] 12 14 15 21 61 71
A more explicit version of the short one given by @thelatemail above, although it doesn't preserve the names:
lapply(seq_along(l), function(x) l[[x]][order(l$y)])
# [[1]]
# [1] 1 3 4 5 2 6
#
# [[2]]
# [1] 12 14 15 21 61 71
or rapply:
rapply(l, function(x) x[order(l$y)], how = 'list')
# $x
# [1] 1 3 4 5 2 6
#
# $y
# [1] 12 14 15 21 61 71
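Since a data frame is itself a list of equal-length columns, the data.frame and list approaches can also be combined. A small sketch using the question's values (the names sorted and l_sorted are introduced here for illustration):

```r
x <- c(1, 3, 4, 2, 6, 5)
y <- c(12, 14, 15, 61, 71, 21)

DF     <- data.frame(x, y)
sorted <- DF[order(DF$y), ]   # reorder the rows by y, keeping the x/y pairs together
rownames(sorted) <- NULL      # drop the old row labels

l_sorted <- as.list(sorted)   # back to a plain named list of two vectors
l_sorted$x                    # 1 3 4 5 2 6
l_sorted$y                    # 12 14 15 21 61 71
```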

Calculate moving recency-weighted mean in R

I would like to calculate the moving recency-weighted mean finishing positions of a horse given the times (day) and finishing positions (pos) for a sequence of races in which the horse participated. Such statistics are useful in handicapping.
Currently, I am using a "loop-inside-a-loop" approach. Is there a faster or more elegant R-language approach to this problem?
#
# Test data
#
day <- c(0, 6, 10, 17, 21, 26, 29, 31, 34, 38, 41, 47, 48, 51, 61)
pos <- c(3, 5, 6, 1, 1, 3, 4, 1, 2, 2, 2, 6, 4, 5, 6)
testdata <- data.frame(id = 1, day = day, pos = pos, wt.pos = NA)
#
# No weight is given to observations earlier than cutoff
#
cutoff <- 30
#
# Rolling recency-weighted mean (wt.pos)
#
for(i in 2:nrow(testdata)) {
wt <- numeric(i-1)
for(j in 1:(i-1))
wt[j] <- max(0, cutoff - day[i] + day[j] + 1)
if (sum(wt) > 0)
testdata$wt.pos[i] <- sum(pos[1:j] * wt) / sum(wt)
}
> testdata
id day pos wt.pos
1 1 0 3 NA
2 1 6 5 3.000000
3 1 10 6 4.125000
4 1 17 1 4.931034
5 1 21 1 3.520548
6 1 26 3 2.632911
7 1 29 4 2.652174
8 1 31 1 2.954128
9 1 34 2 2.436975
10 1 38 2 2.226891
11 1 41 2 2.119048
12 1 47 6 2.137615
13 1 48 4 3.030534
14 1 51 5 3.303704
15 1 61 6 4.075000
I'd go for
# Calculate `wt` for all values of `i` in one go
wt <- lapply(2:nrow(testdata), function(i)
pmax(0, cutoff - day[i] + day[1:(i-1)] + 1))
# Fill in the column
testdata$wt.pos[-1] <- mapply(
function(i, w) if(sum(w) > 0) sum(pos[1:i]*w)/sum(w) else NA,
1:(nrow(testdata)-1), wt)
Note that by calculating the second argument to max for all values of j at once (with pmax), we have vectorized the inner loop, which speeds up the computation considerably.
I found no easy way to vectorize the outer loop and the if case, though (apart from rewriting it in C, which seems like overkill), but lapply, mapply and similar are usually at least as fast as for loops.
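One possible way to remove the outer loop as well, offered only as a sketch and not taken from either original answer, is to build the whole weight matrix with outer() and reduce it with a matrix product. The names W, num and den are introduced for illustration, and the matrix needs memory proportional to the square of the number of races, so this assumes a modest number of races per horse.

```r
day <- c(0, 6, 10, 17, 21, 26, 29, 31, 34, 38, 41, 47, 48, 51, 61)
pos <- c(3, 5, 6, 1, 1, 3, 4, 1, 2, 2, 2, 6, 4, 5, 6)
cutoff <- 30

# W[i, j] = weight of race j when computing the mean for race i
W <- cutoff + 1 - outer(day, day, "-")
W[W < 0] <- 0                 # races older than the cutoff get no weight
W[!lower.tri(W)] <- 0         # only strictly earlier races (j < i) count

num <- as.vector(W %*% pos)   # weighted sums of earlier finishing positions
den <- rowSums(W)             # total weight available for each race
wt.pos <- ifelse(den > 0, num / den, NA)

round(wt.pos[2:3], 3)         # 3.000 4.125
```

The first element is NA because no earlier race exists, which matches the loop-based result in the question.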
This version demonstrates how to calculate moving recency-weighted means for 1 or more variables (e.g., finishing position, speed rating, etc.) and 1 or more subjects (horses).
library(plyr)
day <- c(0, 6, 10, 17, 21, 26, 29, 31, 34, 38, 41, 47, 48, 51, 61)
pos <- c(3, 5, 6, 1, 1, 3, 4, 1, 2, 2, 2, 6, 4, 5, 6)
dis <- 100 + 0.5 * (pos - 1)
testdata1 <- data.frame(id = 1, day = day, pos = pos, dis = dis)
day <- c(0, 4, 7, 14, 22, 23, 31, 38, 42, 47, 52, 59, 68, 69, 79)
pos <- c(1, 3, 2, 6, 4, 5, 2, 1, 4, 5, 2, 1, 5, 5, 2)
dis <- 100 + 0.5 * (pos - 1)
testdata2 <- data.frame(id = 2, day = day, pos = pos, dis = dis)
testdata <- rbind(testdata1, testdata2)
# Moving recency-weighted mean
rollmean <- function(day, obs, cutoff = 90) {
obs <- as.matrix(obs)
wt <- lapply(2:nrow(obs), function(i)
pmax(0, cutoff - day[i] + day[1:(i-1)] + 1))
wt.obs <- lapply(1:(nrow(obs)-1), FUN =
function(i)
if(sum(wt[[i]]) > 0) {
apply(obs[1:i, , drop = FALSE] * wt[[i]], 2, sum) / sum(wt[[i]])
} else {
rep(NA, ncol(obs))
}
)
answer <- rbind(rep(NA, ncol(obs)), do.call(rbind, wt.obs))
if (!is.null(dimnames(answer)))
dimnames(answer)[[2]] <- paste("wt", dimnames(answer)[[2]], sep = ".")
return(answer)
}
x <- dlply(testdata, .(id), .fun =
function(DF) rollmean(DF$day, DF[, c("pos", "dis"), drop = FALSE])
)
y <- do.call(rbind, x)

Another nested loop in R

I have the following data and nested for loop:
x <- c(12, 27, 21, 16, 12, 21, 18, 16, 20, 23, 21, 10, 15, 26, 21, 22, 22, 19, 26, 26)
y <- c(8, 10, 7, 7, 9, 5, 7, 7, 10, 4, 10, 3, 9, 6, 4, 2, 4, 2, 3, 6)
a <- c(20,25)
a.sub <- c()
df <- c()
for(j in 1:length(a)){
a.sub <- which(x >= a[j])
for(i in 1:length(a.sub)){
df[i] <- y[a.sub[i]]
}
print(df)
}
I'd like the loop to return values for df as:
[1] 10 7 5 10 4 10 6 4 2 4 3 6
[1] 10 6 3 6
As I have it, however, the loop returns the right values of df for a = 20 but not for a = 25:
[1] 10 7 5 10 4 10 6 4 2 4 3 6
[1] 10 6 3 6 4 10 6 4 2 4 3 6
The inner loop
for(i in 1:length(a.sub)){
df[i] <- y[a.sub[i]]
}
can become
df <- y[a.sub]
Neither a.sub nor df needs to be predefined then, and thus:
x <- c(12, 27, 21, 16, 12, 21, 18, 16, 20, 23, 21, 10, 15, 26, 21, 22, 22, 19, 26, 26)
y <- c(8, 10, 7, 7, 9, 5, 7, 7, 10, 4, 10, 3, 9, 6, 4, 2, 4, 2, 3, 6)
a <- c(20,25)
for(j in 1:length(a)){
a.sub <- which(x >= a[j])
df <- y[a.sub]
print(df)
}
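The reason the original loop misbehaves is that assigning df[i] element by element overwrites only the first length(a.sub) entries; whatever a longer earlier pass left behind stays in place. A minimal sketch with made-up values (new_vals and fresh are illustrative names):

```r
df       <- c(10, 7, 5, 10, 4)   # left over from a longer first pass (made-up values)
new_vals <- c(10, 6)             # results of a shorter second pass

for (i in seq_along(new_vals)) {
  df[i] <- new_vals[i]           # overwrites positions 1 and 2 only
}
df                               # 10 6 5 10 4: the tail is stale

fresh <- new_vals                # whole-vector assignment replaces everything
fresh                            # 10 6
```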
It could be made shorter. df is unnecessary if you're just printing the subset of y anyway; just print it directly. The selector is short enough that putting it on the same line isn't confusing. Furthermore, why loop over an index from 1 to length(a) when you can loop through a directly? So it could be:
a <- c(20,25)
for(ax in a){
print( y[ which(x >= ax) ] )
}
Not sure if this is a simplified version of a more complex problem, but I'd probably solve this using some direct indexing and an apply function. Something like this:
z <- cbind(x,y)
sapply(c(20,25), function(x) z[z[, 1] >= x, 2])
[[1]]
[1] 10 7 5 10 4 10 6 4 2 4 3 6
[[2]]
[1] 10 6 3 6
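Because the two subsets have different lengths, sapply cannot simplify the result and falls back to a list. A sketch that asks for a list explicitly with lapply and labels each element by its threshold (the a_ prefix and the name res are arbitrary choices made here):

```r
x <- c(12, 27, 21, 16, 12, 21, 18, 16, 20, 23, 21, 10, 15, 26, 21, 22, 22, 19, 26, 26)
y <- c(8, 10, 7, 7, 9, 5, 7, 7, 10, 4, 10, 3, 9, 6, 4, 2, 4, 2, 3, 6)
a <- c(20, 25)

# Logical indexing is enough here; which() is only needed if x may contain NA
res <- lapply(a, function(ax) y[x >= ax])
names(res) <- paste0("a_", a)

res$a_20   # 10 7 5 10 4 10 6 4 2 4 3 6
res$a_25   # 10 6 3 6
```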
