I want to create a function that calculates the rate of missing values in a data frame's column.
Here's my code:
Pourcentage_NA = function(df, col_Name){
res=100*( length(df[col_Name])-length(na.omit(df[col_Name])) ) / length(df[col_Name])
}
When I call it like this: x = Pourcentage_NA(Data, "A")
it shows that x = 0, although I know there are some missing values. Can anyone help me, please?
I tried changing the formula, but it keeps giving the same result.
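length() of a one-column data frame is its number of columns (always 1 here), so your formula always gives 0. Count the NAs with is.na instead: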
Pourcentage_NA = function(df, col_Name){
100 * sum(is.na(df[col_Name])) / nrow(df)
}
Testing, we have:
dat <- data.frame(A = c(1, NA, 3, NA), B = c(NA, 2, 3, 4))
Pourcentage_NA(dat, "A")
#> [1] 50
Pourcentage_NA(dat, "B")
#> [1] 25
An alternative would be
Pourcentage_NA <- function(df, col_Name) 100 * colMeans(is.na(df))[col_Name]
Pourcentage_NA(dat, "A")
#> A
#> 50
Pourcentage_NA(dat, "B")
#> B
#> 25
Created on 2022-11-13 with reprex v2.0.2
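For reference, the original formula returns 0 because length() applied to a one-column data frame counts columns, not rows, and na.omit() on a data frame still returns a one-column data frame. A minimal sketch that illustrates this, together with a variant that extracts the column as a vector with [[ ]] (the name pct_na is only illustrative and does not come from the answers above):

```r
dat <- data.frame(A = c(1, NA, 3, NA), B = c(NA, 2, 3, 4))

# dat["A"] is a one-column data frame, so its length is the number of columns
n_before <- length(dat["A"])
# na.omit() drops rows, but the result is still a one-column data frame
n_after <- length(na.omit(dat["A"]))

# dat[["A"]] extracts the column as a vector, so is.na() works element-wise
pct_na <- function(df, col_name) {
  100 * mean(is.na(df[[col_name]]))
}

pct_na(dat, "A")
pct_na(dat, "B")
```

Because both lengths are 1, the numerator of the original formula is always 1 - 1 = 0.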
I want to write a sliding window function that uses a model trained on years t, t+1, and t+2 to predict the outcome of year t+3. This means that for 10 years of data, the sliding window function should create 7 train-test splits and make 7 predictions (for years t+3, t+4, t+5, t+6, t+7, t+8, and t+9).
I came up with the following code, but the result is not what I expected. Not only does the length of the resulting object differ, but when I work through the prediction task manually, the predict function generates 3 predicted values for a single year's outcome, which doesn't make sense. I would be grateful if someone could point out the source of the error.
# generate the data
set.seed(123)
df <- data.frame(year = 2000:2009, # T = 10
y = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0),
var1 = runif(10, min=0, max=1),
var2 = runif(10, min=1, max=2))
# store predicted values in a list
pred <- list()
# loop from the 1st year to the T-3 year
for(i in 2000:2007){
df_sub1 <- subset(df, year == c(i, i+1, i+2))
mod <- glm(y~var1+var2, data=df_sub1, family=binomial())
df_sub2 <- subset(df, year == i+3)
pred[[i]] <- predict(mod, data=df_sub2, type = "response")
}
# error message
Error in family$linkfun(mustart) :
Argument mu must be a nonempty numeric vector
In addition: Warning messages:
1: In year == c(i, i + 1, i + 2) :
longer object length is not a multiple of shorter object length
2: In year == c(i, i + 1, i + 2) :
longer object length is not a multiple of shorter object length
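The error and warnings come from using == when the right-hand side has length > 1; use %in% instead. Also note that predict() takes newdata, not data: with data = it ignores the argument and returns the fitted values for the 3 training rows, which explains the 3 predicted values.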
pred <- vector('list', 8)
names(pred) <- 2000:2007
for(i in 2000:2007){
df_sub1 <- subset(df, year %in% c(i, i+1, i+2))
mod <- glm(y~var1+var2, data=df_sub1, family=binomial())
df_sub2 <- subset(df, year == (i+3))
pred[[as.character(i)]] <- tryCatch(predict(mod,
newdata=df_sub2, type = "response"), error = function(e) NA_real_)
}
-output
> pred
$`2000`
4
1
$`2001`
5
1
$`2002`
6
1
$`2003`
7
2.220446e-16
$`2004`
8
0.1467543
$`2005`
9
0.001408577
$`2006`
10
2.220446e-16
$`2007`
[1] NA
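To make the two fixes concrete, here is a small sketch. It is not part of the answer above: the window starting in 2001, the toy data frame, and the use of lm instead of glm (to avoid separation warnings on three rows) are all assumptions made for illustration.

```r
year <- 2000:2009
window <- c(2001, 2002, 2003)

# == recycles `window` along `year` and compares position by position
n_eq <- sum(suppressWarnings(year == window))
# %in% tests membership, which is what a sliding window needs
n_in <- sum(year %in% window)

# predict() silently ignores an unknown `data` argument and returns the fitted
# values for the training rows; `newdata` is the argument that is actually used
set.seed(1)
toy <- data.frame(year = year, x = runif(10), y = runif(10))
train <- subset(toy, year %in% window)
test <- subset(toy, year == 2004)
mod <- lm(y ~ x, data = train)

n_fitted <- length(predict(mod, data = test))    # one value per training row
n_pred <- length(predict(mod, newdata = test))   # one value for the test row
```

With this window the recycled comparison selects no rows at all, which is the situation that leads to an empty training set in the original loop.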
Here is another way, using one of package zoo's functions to apply a function to a rolling window. The function to be applied, roll_pred, is almost a copy and paste of akrun's; only the creation of the subsets is different.
# generate the data
set.seed(123)
df <- data.frame(year = 2000:2009, # T = 10
y = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0),
var1 = runif(10, min=0, max=1),
var2 = runif(10, min=1, max=2))
library(zoo, quietly = TRUE)
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
roll_pred <- function(year, X) {
i <- match(year, X$year)
df_sub1 <- X[i, ]
mod <- glm(y ~ var1 + var2, data = df_sub1, family = binomial())
df_sub2 <- X[ i[length(year)] + 1, ]
tryCatch(predict(mod, newdata = df_sub2, type = "response"),
error = function(e) NA_real_)
}
rollapplyr(df$year, 3, roll_pred, X = df)
#> 4 5 6 7 8 9
#> 1.000000e+00 1.000000e+00 1.000000e+00 2.220446e-16 1.467543e-01 1.408577e-03
#> 10 NA
#> 2.220446e-16 NA
Created on 2022-06-05 by the reprex package (v2.0.1)
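If it helps to check the windows before fitting anything, the 7 train-test splits described in the question can also be generated in base R. This is only a sketch; make_splits and its width argument are hypothetical names, not part of either answer.

```r
# Build train/test year splits for a sliding window of `width` training years
make_splits <- function(years, width = 3) {
  n_splits <- max(0, length(years) - width)
  lapply(seq_len(n_splits), function(k) {
    list(
      train = years[k:(k + width - 1)],  # `width` consecutive training years
      test = years[k + width]            # the year right after the window
    )
  })
}

splits <- make_splits(2000:2009)
length(splits)
splits[[1]]
```

Each element can then be passed to subset() with %in% for the training years and == for the test year.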
I'm trying to modify a function so that when I pass in a dataframe, I get each row number together with the corresponding row.
These functions, taken from "Zip or enumerate in R?", are a good starting point for me:
zip <- function(...) {
mapply(list, ..., SIMPLIFY = FALSE)
}
enumerate <- function(...) {
zip(k=seq_along(..1), ...)
}
I modified enumerate to work as I want when the input is a dataframe:
enumerate2 <- function(...){
mod <- ..1
if(is.data.frame(mod)){
mod = split(mod, seq(nrow(mod)))
}
zip(k = seq_along(mod), ...)
}
So for example:
g = data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
enumerate2(v = g)
This will enumerate the rows of a dataframe, so I can do:
for(i in enumerate2(v = g)){
"rowNumber = %s, rowValues = %s" %>% sprintf(i$k, list(i$v)) %>% print
}
The problem is I get a warning:
Warning message:
In mapply(list, ..., SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
Also, I'd rather the dataframe still be a dataframe so that I can do things like i$v$b to return the value of row i$k column b from the dataframe.
How can I get rid of the warning, and how can I keep the dataframe structure after split?
edit:
example 1 - data frame input
output:
enumerate2(v = data.frame(A = c(1, 2), B = c(3, 4)))
[[1]]
[[1]]$k
[1] 1
[[1]]$v
A B
1 1 3
[[2]]
[[2]]$k
[1] 2
[[2]]$v
A B
1 2 4
example 2 - list input
output:
enumerate2(v = LETTERS[1:2])
[[1]]
[[1]]$k
[1] 1
[[1]]$v
[1] "A"
[[2]]
[[2]]$k
[1] 2
[[2]]$v
[1] "B"
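One possible direction, offered as a sketch rather than a definitive answer: the warning appears because zip(k = seq_along(mod), ...) still passes the original data frame through ..., so mapply iterates over its 2 columns while k has 3 elements. Replacing the first argument with a list of one-row data frames before calling zip avoids that and keeps each row a data frame. The helper below assumes the data frame is always the first argument.

```r
zip <- function(...) {
  mapply(list, ..., SIMPLIFY = FALSE)
}

enumerate2 <- function(...) {
  args <- list(...)
  if (is.data.frame(args[[1]])) {
    df <- args[[1]]
    # one-row data frames, so elements such as i$v$b still work
    args[[1]] <- lapply(seq_len(nrow(df)), function(i) df[i, , drop = FALSE])
  }
  # pass the modified arguments on, instead of the original `...`
  do.call(zip, c(list(k = seq_along(args[[1]])), args))
}

g <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
res <- enumerate2(v = g)
res[[2]]$k
res[[2]]$v$b
```

Non-data-frame input such as LETTERS[1:2] goes through unchanged, as in the original enumerate.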
I created a list containing sub-lists, each sub-list containing information for one task I want R to do.
df <- as.data.frame(matrix(1:6 , ncol =2 , nrow = 3))
colnames(df) <- c("Col1", "Col2")
myList <- list()
myList[["Dataset1"]] <- list()
myList[["Dataset1"]]["Function"] <- "mean"
myList[["Dataset1"]][["DataFrame"]] <- df
myList[["Dataset2"]] <- list()
myList[["Dataset2"]]["Function"] <- "lm"
myList[["Dataset2"]][["DataFrame"]] <- df*2
Now I want R to apply the function to the data frame and store the results in a new list. What is the best way to do that?
So far I have had two ideas. First, I could use lapply to run through the list, each time accessing several items from the sub-list and supplying them to a new function:
myResult <- lapply(myList, FUN = myList[["Dataset1"]]["Function"](x) , x = as.matrix(myList[["Dataset1"]][["DataFrame"]]))
But I don't know how to tell R to cycle correctly through the sub-lists.
Second, I was hoping to send the whole sub-list to a function, but I could not get this to run either:
myFunction <- function(x){
TempData <- x[["DataFrame"]]
TempFunction <- x["Function"]
TempResult <- get(TempFunction)(TempData)
return(TempResult)
}
myResult <- lapply(myList, myFunction(x))
If someone could give me an idea of how to solve this, I'd be very happy.
Many thanks in advance!
This sounds more complicated than it should be, and it's not clear how you would apply mean() or lm() to these data frames. Below is an example where you store the function itself as an object in the list:
myList[["Dataset1"]] <- list()
myList[["Dataset1"]][["Function"]] <- function(x)mean(as.matrix(x))
myList[["Dataset1"]][["DataFrame"]] <- df
myList[["Dataset2"]] <- list()
myList[["Dataset2"]][["Function"]] <- function(x){lm(Col2~Col1,data=x)}
myList[["Dataset2"]][["DataFrame"]] <- df*2
So you iterate through elements in the list and you get the result of function(dataframe) for each element:
lapply(myList,function(i)i$Function(i$DataFrame))
$Dataset1
[1] 3.5
$Dataset2
Call:
lm(formula = Col2 ~ Col1, data = x)
Coefficients:
(Intercept) Col1
6 1
A first thing you can do is use the match.fun function, which lets you recover a function from a string. Applied in a loop, it gives the following solution:
list<-list()
for (i in 1:length(myList)){
TempFunc <- match.fun(myList[[i]][["Function"]])
Results <- TempFunc(myList[[i]][["DataFrame"]])
list[i]<-Results
}
> list
[[1]]
[1] NA
[[2]]
(Intercept) Col2
-6 1
Note that the results are not meaningful here, because neither mean nor lm is appropriate when applied directly to a data frame.
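The same idea also works with lapply and a helper that receives the whole sub-list, which is close to the second attempt in the question. In this sketch the function names "colMeans" and "nrow" are swapped in as assumptions, simply because they accept a data frame directly; run_task and tasks are hypothetical names.

```r
df <- as.data.frame(matrix(1:6, ncol = 2, nrow = 3))
colnames(df) <- c("Col1", "Col2")

tasks <- list(
  Dataset1 = list(Function = "colMeans", DataFrame = df),
  Dataset2 = list(Function = "nrow", DataFrame = df * 2)
)

run_task <- function(task) {
  # [[ ]] returns the string itself; [ ] would return a one-element list
  fun <- match.fun(task[["Function"]])
  fun(task[["DataFrame"]])
}

# pass the function itself to lapply, not a call such as run_task(x)
results <- lapply(tasks, run_task)
results
```

The two changes relative to the question's attempt are using [[ ]] to pull out the function name and handing lapply the function object rather than the result of calling it.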
The function call allows you to call a function by its name given as a character string, and you can evaluate this call with eval. Note, though, that neither of your function calls makes sense when applied to a data frame:
lapply(myList, function(sublist) call(sublist$Function, sublist[["DataFrame"]]))
#> $Dataset1
#> mean(list(Col1 = 1:3, Col2 = 4:6))
#> $Dataset2
#> lm(list(Col1 = c(2, 4, 6), Col2 = c(8, 10, 12)))
So the results are kind of useless for the examples used:
lapply(myList, function(sublist) eval(call(sublist$Function, sublist[["DataFrame"]])))
#> $Dataset1
#> [1] NA
#>
#> $Dataset2
#>
#> Call:
#> lm(formula = structure(list(Col1 = c(2, 4, 6), Col2 = c(8, 10,
#> 12)), class = "data.frame", row.names = c(NA, -3L)))
#>
#> Coefficients:
#> (Intercept) Col2
#> -6 1
#>
#>
#> Warning message:
#> In mean.default(list(Col1 = 1:3, Col2 = 4:6)) :
#> argument is not numeric or logical: returning NA
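A related base R option is do.call, which also accepts a function name as a string and takes its arguments as a list. The sketch below assumes each sub-list stores its own argument list under a hypothetical Args element, so that mean and lm each receive input they can actually use.

```r
df <- as.data.frame(matrix(1:6, ncol = 2, nrow = 3))
colnames(df) <- c("Col1", "Col2")

tasks <- list(
  Dataset1 = list(Function = "mean",
                  Args = list(x = as.matrix(df))),
  Dataset2 = list(Function = "lm",
                  Args = list(formula = Col2 ~ Col1, data = df * 2))
)

# do.call looks the function up by name and spreads Args into the call
results <- lapply(tasks, function(task) do.call(task$Function, task$Args))

results$Dataset1
coef(results$Dataset2)
```

Storing the arguments next to the function name keeps each task self-describing, at the cost of a slightly more verbose list.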
This is my debut post here. So please bear with me if it doesn't live up to the high standards of clarity of more seasoned members.
I have 4 objects (representing 4 years) in my global environment that are lists consisting of 12 data-frames (one for each month in the year). They have a consistent structure, and the column names of the data-frames are all the same. I'm trying to change these column names of the data-frames in all 4 lists in one fell swoop using a function, and then overwrite all 4 objects in my global environment with new objects that have the data-frames with the new column names.
This is my function:
change.name <- function(data){
for (i in 1:length(data)){
names(data[[i]]) <- c("a", "b", "c", "d", "e")
}
assign(deparse(substitute(data)), value = data, envir = globalenv())
}
I use my function:
change.name(my_object1)
It works, except that I get this warning message:
Warning message:
In assign(deparse(substitute(data)), value = data,
envir = globalenv()) : only the first element is used as variable
name
And the object in my global environment is not overwritten. I get a new object with a name like this:
"list(Jan = structure(list(a = c(11, 34, 36, 49, 55, 68, "
I understand that this has to do with the way the function variable is stored in the new environment R creates when running a function (or something along those lines).
My question is simple: How do I remedy this?
You can fix this by using deparse(substitute(data)) before you do anything to data:
# Let's change your function just a bit
change.name <- function(data){
# call deparse(substitute()) *before* you do anything to data
object_name <- deparse(substitute(data))
for (i in 1:length(data)){
names(data[[i]]) <- c("a", "b", "c", "d", "e")
}
assign(object_name, value = data, envir = globalenv())
}
# Create sample data
my_object1 <- lapply(1:12, function(x) {
data.frame(u = 1, v = 2, x = 3, y = 4, z = 5)
})
names(my_object1) <- month.name
change.name(my_object1)
ls()
#> [1] "change.name" "my_object1"
head(my_object1, 2)
#> $January
#> a b c d e
#> 1 1 2 3 4 5
#>
#> $February
#> a b c d e
#> 1 1 2 3 4 5
Created on 2018-12-20 by the reprex package (v0.2.1)
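The order matters because, once data has been modified inside the function, it is an ordinary local variable rather than the unevaluated argument, so substitute() returns its value instead of the name that was passed in. A minimal sketch with two hypothetical helper functions:

```r
# Capture the argument's name before touching it
name_before <- function(data) {
  nm <- deparse(substitute(data))
  data[[1]] <- 0
  nm
}

# Capture it after modification: substitute() now sees the modified value
name_after <- function(data) {
  data[[1]] <- 0
  deparse(substitute(data))
}

x <- list(1, 2)
name_before(x)
name_after(x)
```

The first call returns the name of the variable, while the second returns the deparsed contents of the list, which is where the long object name in the question came from.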
A more idiomatic (and probably safer) way to approach this task might be to simply use lapply and setNames:
my_object1 <- lapply(1:12, function(x) {
data.frame(u = 1, v = 2, x = 3, y = 4, z = 5)
})
names(my_object1) <- month.name
change.name <- function(obj){
lapply(obj,function(x) setNames(x,letters[1:5]))
}
my_object1 <- change.name(my_object1)
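To cover all four yearly objects at once without assign(), one option is to keep them in a single list and apply the renaming function to each element. This is a sketch under that assumption; make_year, years, and rename_months are hypothetical names, and only two years are created to keep it short.

```r
# Hypothetical stand-in for one year's list of 12 monthly data frames
make_year <- function() {
  months <- lapply(1:12, function(i) {
    data.frame(u = 1, v = 2, x = 3, y = 4, z = 5)
  })
  names(months) <- month.name
  months
}

years <- list(year1 = make_year(), year2 = make_year())

# Rename the columns of every data frame in one year's list
rename_months <- function(obj) lapply(obj, setNames, letters[1:5])

# Apply it to every year; nothing is written to the global environment
years <- lapply(years, rename_months)
names(years$year1$January)
```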
Right you are, the problem lies in the way functions behave. Take a look at the following code; it might help:
testFun1 <- function (val) {
a <<- val
assign("b",a)
}
testFun2 <- function (val) {
a <<- val
assign("b",a, pos = 1)
}
# environment pretty much empty apart from our functions
ls()
[1] "testFun1" "testFun2"
# run
set.seed(123)
testFun1(runif(1))
# less empty
ls()
[1] "a" "testFun1" "testFun2"
# still not quite it though
testFun2(runif(1))
# now that's better
ls()
[1] "a" "b" "testFun1" "testFun2"
For more information, take a look at the documentation (?assign), especially the pos argument.
I have a list like:
mylist <- list(a = 1, b = list(A = 1, B = 2), c = list(C = 1, D = 3))
Is there a (loop-free) way to identify the positions of the elements? For example, if I want to replace the value of "C" with 5, and it does not matter where the element "C" is found, can I do something like:
Aindex <- find_index("A", mylist)
mylist[Aindex] <- 5
I have tried grepl, and in the current example, the following will work:
mylist[grepl("C", mylist)][[1]][["C"]]
but this requires an assumption of the nesting level.
The reason that I ask is that I have a deep list of parameter values, and a named vector of replacement values, and I want to do something like
replacements <- c(a = 1, C = 5)
for(i in names(replacements)){
indx <- find_index(i, mylist)
mylist[indx] <- replacements[i]
}
This is an adaptation of my previous question, "update a node (of unknown depth) using xpath in R?", using R lists instead of XML.
One method is to use unlist and relist.
mylist <- list(a = 1, b = list(A = 1, B = 2), c = list(C = 1, D = 3))
tmp <- as.relistable(mylist)
tmp <- unlist(tmp)
tmp[grep("(^|\\.)C$",names(tmp))] <- 5
tmp <- relist(tmp)
Because list names from unlist are concatenated with a ., you'll need to be careful with grep and how your parameters are named. If there is not a . in any of your list names, this should be fine. Otherwise, names like list(.C = 1) will fall into the pattern and be replaced.
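As a sketch of how this could be extended to the named vector of replacements from the question: the loop below builds one pattern per name, matching it as the last component of the flattened name. It assumes the names contain no regular-expression metacharacters and that all leaf values share one type, since unlist coerces them to a common type.

```r
mylist <- list(a = 1, b = list(A = 1, B = 2), c = list(C = 1, D = 3))
replacements <- c(a = 1, C = 5)

flat <- unlist(mylist)
names(flat)   # nested names are joined with ".", e.g. "c.C"

for (nm in names(replacements)) {
  # match `nm` at the start of the name or right after a literal "."
  hit <- grepl(paste0("(^|\\.)", nm, "$"), names(flat))
  flat[hit] <- replacements[[nm]]
}

# rebuild the nested structure, using the original list as the skeleton
updated <- relist(flat, skeleton = mylist)
updated$c$C
```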
Based on this question, you could try it recursively like this:
find_and_replace <- function(x, find, replace){
if(is.list(x)){
n <- names(x) == find
x[n] <- replace
lapply(x, find_and_replace, find=find, replace=replace)
}else{
x
}
}
Testing in a deeper mylist:
mylist <- list(a = 1, b = list(A = 1, B = 2), c = list(C = 1, D = 3, d = list(C=10, D=55)))
find_and_replace(mylist, "C", 5)
$a
[1] 1
$b
$b$A
[1] 1
$b$B
[1] 2
$c
$c$C ### it worked
[1] 5
$c$D
[1] 3
$c$d
$c$d$C ### it worked
[1] 5
$c$d$D
[1] 55
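To apply a whole named vector of replacements, as in the question, the recursive function can simply be called once per name. The sketch below repeats find_and_replace from above so that it runs on its own; the loop itself is an illustrative assumption about how one might use it.

```r
find_and_replace <- function(x, find, replace) {
  if (is.list(x)) {
    n <- names(x) == find
    x[n] <- replace
    lapply(x, find_and_replace, find = find, replace = replace)
  } else {
    x
  }
}

mylist <- list(a = 1, b = list(A = 1, B = 2),
               c = list(C = 1, D = 3, d = list(C = 10, D = 55)))
replacements <- c(a = 1, C = 5)

# one recursive pass per replacement name
for (nm in names(replacements)) {
  mylist <- find_and_replace(mylist, nm, replacements[[nm]])
}

mylist$c$C
mylist$c$d$C
```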
This can now also be done using rrapply in the rrapply-package (an extended version of base rapply). To return the position of an element in the nested list based on its name, we can use the special arguments .xpos and .xname. For instance, to look up the position of the element with name "C":
library(rrapply)
mylist <- list(a = 1, b = list(A = 1, B = 2), c = list(C = 1, D = 3))
## get position C-node
(Cindex <- rrapply(mylist, condition = function(x, .xname) .xname == "C", f = function(x, .xpos) .xpos, how = "unlist"))
#> c.C1 c.C2
#> 3 1
We could then update its value in the nested list with:
## update value C-node
mylist[[Cindex]] <- 5
The two steps can also be combined directly in the call to rrapply:
rrapply(mylist, condition = function(x, .xname) .xname == "C", f = function(x) 5, how = "replace")
#> $a
#> [1] 1
#>
#> $b
#> $b$A
#> [1] 1
#>
#> $b$B
#> [1] 2
#>
#>
#> $c
#> $c$C
#> [1] 5
#>
#> $c$D
#> [1] 3