Assign value to dataframe inside parent environment - r

I have a parent function (fun1) that takes data and a helper function (fun2) that operates on columns from that data. I want to assign the value from that helper function to its matching column in data in fun1. In reality there are lots of little helper functions operating on columns, and I want the value changed by fun2 to be what the other helper functions deal with.
How can I take the output from fun2 and assign it to the matching column of data in fun1? I do not want to make that alteration to data in fun1 but want to change data from within fun2.
fun1 <- function(data){
Map(fun2, data, colnames(data))
data[[1]]
}
fun2 <- function(x, colname_x){
x <- x * 2
# my attempt to assign to `data[["mpg"]]`
assign(sprintf("data[[\"%s\"]]", colname_x), value=x, pos = 1)
}
fun1(mtcars[1:3])
Desired output of fun1:
## [1] 42.0 42.0 45.6 42.8 37.4 36.2 28.6 48.8 45.6 38.4 35.6 32.8 34.6 30.4 20.8
## [16] 20.8 29.4 64.8 60.8 67.8 43.0 31.0 30.4 26.6 38.4 54.6 52.0 60.8 31.6 39.4
## [31] 30.0 42.8
EDIT: Also tried:
fun2 <- function(x, colname_x = "mpg"){
x <- x * 2
assign(sprintf("data[[\"%s\"]]", colname_x), value=x, env = parent.frame(3))
}
Edit 2: This works but seems not so clean:
fun2 <- function(x, colname_x = "mpg"){
x <- x * 2
data <- get('data', env=parent.frame(3))
data[[colname_x]] <- x
assign('data', value=data, env = parent.frame(3))
}
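The assign() attempts above fail because assign() creates a new variable whose name is literally the string data[["mpg"]]; it does not parse that string as an indexing expression. One possible alternative to counting frames with parent.frame(3) is to hand fun1's environment to the helper explicitly. The following is only a sketch: the env argument and the MoreArgs wiring are assumptions added for illustration and are not part of the original code.

```r
# Sketch: the helper receives the environment that owns `data`
# (the `env` parameter is a hypothetical addition).
fun2 <- function(x, colname_x, env) {
  # Environments have reference semantics, so this changes `data`
  # inside the environment that was passed in.
  env$data[[colname_x]] <- x * 2
  invisible(NULL)
}

fun1 <- function(data) {
  env <- environment()  # fun1's own frame, where `data` lives
  Map(fun2, data, colnames(data), MoreArgs = list(env = env))
  data[[1]]
}

result <- fun1(mtcars[1:3])
head(result)
```

Because the environment is passed as an argument, the helper keeps working if it is called from a different depth in the call stack, which is where the parent.frame(3) version is fragile.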

Related

How to calculate the cumulative mean of a block of the data frame up to each row in R

I have a simplified dataframe that looks like this:
df
volume ask1 ask2 bid1 bid2
0 38 NA 38 37.9
100 38.1 38.2 37.8 38.2
0 38.4 38.5 38.2 38.3
0 38.4 38.5 38.2 NA
200 38.3 38.1 38 38.4
250 38.4 38.2 NA 38.6
I want to have another column, which contains the mean of df[1:i, 2:5] on the ith row.
I can do this with a for loop:
df[, "midpoint"] <- NA
for (i in 1:nrow(df)) {
df$midpoint[i] <- mean(as.matrix(df[c(1:i), c(2:5)]), na.rm = TRUE)
}
But as my dataframe is actually large, the for loop takes a long time.
I have tried sapply but failed:
df[, "midpoint"] <- sapply(df, function(i) mean(as.matrix(df[c(1:i), c(2:5)]), na.rm = TRUE))
Could anyone give me some suggestions?
With sapply you can do:
mat <- as.matrix(df[, 2:5])
df$midpoint <- sapply(seq(nrow(df)), function(i) mean(mat[1:i, ], na.rm = TRUE))
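You can also take the cumulative mean of the row means, which is faster but introduces a small error, because rows containing NA values contribute fewer observations yet receive the same weight as complete rows.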
library(dplyr)
df %>%
mutate(res = rowMeans(select(., 2:5), na.rm = TRUE),
res = cummean(res))
# volume ask1 ask2 bid1 bid2 midpoint res
#1 0 38.0 NA 38.0 37.9 37.96667 37.96667
#2 100 38.1 38.2 37.8 38.2 38.02857 38.02083
#3 0 38.4 38.5 38.2 38.3 38.14545 38.13056
#4 0 38.4 38.5 38.2 NA 38.19286 38.18958
#5 200 38.3 38.1 38.0 38.4 38.19444 38.19167
#6 250 38.4 38.2 NA 38.6 38.22381 38.22639
Here midpoint is the actual answer from the for loop or sapply code and res is the answer from the above calculation.
You were close with your sapply command, but you need to iterate over the row indices and take rows 1 to x each time.
Try
sapply(1:nrow(df), function(x) mean(as.matrix(df[1:x, 2:5]), na.rm = TRUE))
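A fully vectorized alternative divides the running sum of the non-NA values by the running count of non-NA values:
cumsum(rowSums(df[2:5], na.rm = TRUE)) / cumsum(rowSums(!is.na(df[2:5])))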
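As a check on the idea of a loop-free cumulative mean, the sketch below rebuilds the sample data from the question, computes the running sum and running count of non-NA values, and compares the result with the sapply version. The names prices, running_sum, running_n, and loop_result are chosen here for illustration.

```r
# Sample data from the question
df <- data.frame(
  volume = c(0, 100, 0, 0, 200, 250),
  ask1 = c(38, 38.1, 38.4, 38.4, 38.3, 38.4),
  ask2 = c(NA, 38.2, 38.5, 38.5, 38.1, 38.2),
  bid1 = c(38, 37.8, 38.2, 38.2, 38, NA),
  bid2 = c(37.9, 38.2, 38.3, NA, 38.4, 38.6)
)

prices <- as.matrix(df[, 2:5])

# Running total of all non-NA prices up to each row
running_sum <- cumsum(rowSums(prices, na.rm = TRUE))
# Running count of non-NA prices up to each row
running_n <- cumsum(rowSums(!is.na(prices)))

df$midpoint <- running_sum / running_n

# Reference result: mean over the growing block of rows 1..i
loop_result <- sapply(seq_len(nrow(df)),
                      function(i) mean(prices[1:i, ], na.rm = TRUE))
```

Each row is visited once, so the work grows linearly with the number of rows, whereas the loop and sapply versions recompute the mean over an ever larger block.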

how to pass variable name in for loop or lapply functions in R? [duplicate]

This question already has answers here:
Dynamically select data frame columns using $ and a character value
(10 answers)
Closed 2 years ago.
I am not able to pass variable names correctly in a for loop or with lapply.
When I try this command without a loop or lapply, it works and I get values:
> boxplot.stats(df$price)$out
[1] 38.7 43.8 41.3 50.0 50.0 50.0 50.0 37.2 39.8 37.9 50.0
[12] 50.0 42.3 48.5 50.0 44.8 50.0 37.6 46.7 41.7 48.3 42.8
[23] 44.0 50.0 43.1 48.8 50.0 43.5 45.4 46.0 50.0 37.3 50.0
[34] 50.0 50.0 50.0 50.0
But when I put this inside lapply or a for loop, I get NULL. Why?
df_numeric_names <- names(select_if(df, is.numeric))
df_numeric_names
[1] "price" "resid_area" "air_qual" "room_num" "age" "dist1" "dist2" "dist3"
[9] "dist4" "teachers" "poor_prop" "n_hos_beds" "n_hot_rooms" "rainfall" "parks" "Sold"
loop
for (feature in df_numeric_names){
outlier_values <- boxplot.stats(df$feature)$out
print(outlier_values)
}
- Output:
NULL
NULL
NULL
lapply
lapply(df_numeric_names, function(x) {
boxplot.stats(df$x)$out
})
- output
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
This is a fairly simple thing, but I am not sure what I am doing wrong or how to fix it.
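The $ operator does not evaluate its argument: df$feature looks for a column literally named "feature", which does not exist, so it returns NULL. Index with the character value instead. This slight change in the loop solves the issue: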
for (feature in df_numeric_names){
outlier_values <- boxplot.stats(df[,feature])$out
print(outlier_values)
}
And a little example:
library(dplyr)
#Data
data("iris")
df <- iris
#Numeric names
df_numeric_names <- names(select_if(df, is.numeric))
#Loop
for (feature in df_numeric_names){
outlier_values <- boxplot.stats(df[,feature])$out
print(outlier_values)
}
The output:
numeric(0)
[1] 4.4 4.1 4.2 2.0
numeric(0)
numeric(0)
With lapply(), use code similar to this:
lapply(df_numeric_names, function(x) {
boxplot.stats(df[,x])$out
})
Output:
[[1]]
numeric(0)
[[2]]
[1] 4.4 4.1 4.2 2.0
[[3]]
numeric(0)
[[4]]
numeric(0)
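The difference between $ and character-based indexing can be seen directly in a short sketch on iris. It uses df[[x]] rather than df[, x]; that choice is an assumption made here because [[ ]] also returns a plain vector when df is a tibble, whereas the answers above use df[, x]. The names numeric_names and outliers are illustrative.

```r
df <- iris

# Base R equivalent of names(select_if(df, is.numeric))
numeric_names <- names(df)[sapply(df, is.numeric)]

feature <- "Sepal.Width"
is.null(df$feature)        # TRUE: there is no column called "feature"
head(df[[feature]])        # the Sepal.Width values

# Named list of outliers, one entry per numeric column
outliers <- lapply(setNames(numeric_names, numeric_names),
                   function(x) boxplot.stats(df[[x]])$out)
outliers
```

Naming the input with setNames() makes the result easier to read than the [[1]], [[2]] labels of an unnamed list.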

How to make frequency table with specific class in R

I want to make a frequency table from matrix "b", which has 100 observations.
I'm trying to cut all observations into 15 classes, so the frequency table should include the 'empty classes' that contain no observations.
However, when I use table(), only the classes (levels) that contain observations are included (12 levels).
How can I force the table to have 15 levels?
> b <- matrix(as.matrix(b),ncol=1)
> fivenum(b)
[1] 24.2 24.7 24.9 25.1 25.6
> bcut <- seq(from = 24.2, by =0.1, length.out = 16); bcut
[1] 24.2 24.3 24.4 24.5 24.6 24.7 24.8 24.9 25.0 25.1 25.2 25.3
[13] 25.4 25.5 25.6 25.7
> bgroup <- factor(cut(x = b, breaks = bcut, include.lowest = T))
> levels(bgroup)
[1] "[24.2,24.3]" "(24.3,24.4]" "(24.4,24.5]" "(24.6,24.7]"
[5] "(24.7,24.8]" "(24.8,24.9]" "(24.9,25]" "(25.1,25.2]"
[9] "(25.2,25.3]" "(25.3,25.4]" "(25.4,25.5]" "(25.6,25.7]"
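A likely cause, offered as a sketch rather than a confirmed diagnosis of the original data: cut() already returns a factor with one level per interval, and wrapping it in factor() rebuilds the factor from the values that actually occur, which drops the empty classes. The vector b below is a small hypothetical stand-in, since the original 100 observations are not shown.

```r
# Hypothetical stand-in for the original data
b <- c(24.25, 24.25, 24.75, 24.95, 25.05, 25.55)

bcut <- seq(from = 24.2, by = 0.1, length.out = 16)  # 16 breaks, 15 classes

# Keep the result of cut() as is: all 15 levels are retained
bgroup <- cut(x = b, breaks = bcut, include.lowest = TRUE)
freq <- table(bgroup)
freq

# Wrapping in factor() drops the levels with no observations
nlevels(factor(bgroup))
```

The table built from bgroup lists every class, with a count of 0 for the empty ones.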

Estimating missing values in time-series data frame based on a "rate of change"

I am trying to use a loop in R to estimate values that will replace the NAs in my data frame based on a rate of change ("rate") that multiplies my last value (ok, this is confusing, but please refer to the example below). This is something similar to my data:
l1 <- c(NA,NA,NA,27,31,0.5)
l2 <- c(NA,8,12,28,39,0.5)
l3 <- c(NA,NA,NA,NA,39,0.3)
l4 <- c(NA,NA,11,15,31,0.2)
l5 <- c(NA,NA,NA,NA,51,0.9)
data <- as.data.frame(rbind(l1,l2,l3,l4,l5))
colnames(data) <- c("dbh1","dbh2","dbh3","dbh4","dbh5","rate")
So I created a loop to identify the first non-NA value in each row, then use that value to estimate the previous values based on the "rate". For instance, in row 1, the first NA value would be replaced by "27-(0.5*3)", the second by "27-(0.5*2)" and the third by "27-(0.5*1)". This is the loop I came up with. I know the first part (the outer loop) works, but the inner one doesn't:
for (i in 1: nrow(data)) {
dbh.cols <- data3[i,c("dbh1","dbh2","dbh3","dbh4","dbh5")]
sample.year <- which(dbh.cols != "NA")
data$first.dbh[i] <- min(dbh.cols, na.rm = T)
data$first.index[i] <- min(sample.year)
for (j on 1: (min(sample.year)-1)) {
ifelse(is.na(data[i,j]), min(dbh.cols, na.rm = T) - (min(sample.year)-j)*rate[i,j], data[i,j])
}
}
I am not good at programming so probably my internal loop strategy with "ifelse" is too weird (and wrong) but I just couldn't think of anything else that would work here... Any suggestions?
1) This uses no explicit loops, just an apply. It assumes that the NAs are all leading as in the example given.
fillIn <- function(x) {
rate <- tail(x, 1)
n <- sum(is.na(x)) # no of NAs
c(x[n+1] - rate * seq(n, 1), na.omit(x))
}
replace(data, TRUE, t(apply(data, 1, fillIn)))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
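One edge case worth noting: seq(n, 1) counts down from 0 to 1 when a row has no NAs, so fillIn as written would prepend two spurious values for such a row. The variant below is a sketch that guards against this; fill_leading and the small three-row data frame are hypothetical names and data introduced here, and it still assumes that all NAs are leading.

```r
# Sketch: back-fill leading NAs, leaving complete rows untouched
fill_leading <- function(x) {
  rate <- x[length(x)]   # the last element of each row is the rate
  n <- sum(is.na(x))     # number of leading NAs
  if (n == 0) return(x)  # nothing to fill
  first <- x[n + 1]      # first non-NA value
  x[seq_len(n)] <- first - rate * rev(seq_len(n))
  x
}

# Hypothetical small example, including a row without NAs
data <- data.frame(
  dbh1 = c(NA, NA, 5),
  dbh2 = c(NA, 8, 6),
  dbh3 = c(27, 12, 7),
  rate = c(0.5, 0.5, 0.1),
  row.names = c("l1", "l2", "l3")
)

filled <- as.data.frame(t(apply(data, 1, fill_leading)))
filled
```

Row l1 becomes 26.0, 26.5, 27.0; row l2 becomes 7.5, 8.0, 12.0; row l3 is returned unchanged.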
2) Here is a second approach that uses na.approx from the zoo package. It does not require apply. Here data1 has the same content as data except that the first column is filled in. The other NAs remain. The last line uses na.approx to fill in the remaining NAs linearly.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind( data[cbind(1:nrow(data), NAs + 1)] - data$rate * NAs, data[-1] )
replace(data, TRUE, t(na.approx(t(data1))))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2a) A variation on (2) uses na.locf in the middle line to bring forward the first non-NA in each row. The first and last lines are the same.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind(na.locf(t(data), fromLast = TRUE)[1, ] - data$rate * NAs, data[-1])
replace(data, TRUE, t(na.approx(t(data1))))
You do not need multiple for loops for this. Here is some simplified code that does what you want. Working explicitly with your data, we first need to get the first non-NA value from each row.
for_estimate <- apply(data, 1, function(x) x[min(which(is.na(x) == FALSE))])
Secondly, we need to determine what integer to multiply the rate by for each row depending on how many NA values there are.
# total number of NA values per row
n_na <- apply(data, 1, function(x) sum(is.na(x)))
# make it a matrix with the NA counts in column 1 and zeros elsewhere
n_na <- matrix(c(n_na, rep(0, nrow(data) * (ncol(data) - 2))),
nrow = nrow(data), ncol = ncol(data) - 1)
# fill in the rest of the matrix: each column counts down by one
for(i in 2:ncol(n_na)){
n_na[,i] <- n_na[,i-1] - 1
}
Once we have that, we can use this code to back-fill the NA values in the way you are interested in.
for(i in (ncol(data)-1):1){
if(sum(is.na(data[,i])) > 0){
# rows that still have an NA in column i
to_fill <- which(is.na(data[,i]))
data[to_fill,i] <- for_estimate[to_fill] - data$rate[to_fill] * n_na[to_fill,i]
}
}
output
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9

Remove column values with NA in R

I have a data frame, called gen, which is represented below
A B C D E
1 NA 4.35 35.3 3.36 4.87
2 45.2 .463 34.3 NA 34.4
3 NA 34.5 35.6 .457 46.3
I would like to remove the columns where there are NA's. (I know na.omit does it for rows, but I can't seem to find one for columns). The final result would read:
B C E
1 4.35 35.3 4.87
2 .463 34.3 34.4
3 34.5 35.6 46.3
Thanks!
gen <- gen[sapply(gen, function(x) all(!is.na(x)))]
dfrm[ , sapply(dfrm, function(x){ !any(is.na(x)) } ) ]
You might want to use this variant instead:
dfrm[ , sapply(dfrm, function(x){ all(is.finite(x)) } ) ]
Inf and -Inf values are not detected by is.na, so a selection based on is.na keeps columns that contain them; is.finite rejects NA, NaN, Inf, and -Inf alike.
Just use this:
gen[colSums(is.na(gen)) == 0]
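To make the column selection concrete, here is a short sketch that rebuilds the example data frame from the question and keeps only the complete columns. The names keep and gen_complete are illustrative, and drop = FALSE is added here as a precaution so that the result stays a data frame even if only one column survives.

```r
# Example data from the question
gen <- data.frame(
  A = c(NA, 45.2, NA),
  B = c(4.35, 0.463, 34.5),
  C = c(35.3, 34.3, 35.6),
  D = c(3.36, NA, 0.457),
  E = c(4.87, 34.4, 46.3)
)

# TRUE for columns that contain no NA at all
keep <- colSums(is.na(gen)) == 0

gen_complete <- gen[, keep, drop = FALSE]
gen_complete
```

Here keep is a named logical vector, so it can also be inspected to see which columns were dropped.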
