R Functions with variables - r

I have a function in which I'm trying to compare a data frame column to a reference table of type character. I have downloaded some data on popular first names from the Norwegian central statistics office. I want to add a column to my data frame that is 1 or 0 depending on which list the name appears in (1 for a boy, 0 for a girl). I'm getting the following error with the code:
*Error in match(x, table, nomatch = 0L) : object 'x' not found*
Data frame is train.
Reference data is male_names
male_names <- read.csv("~/R/Functions_Practice/NO/BoysNames_Data.csv", sep=";",as.is = TRUE)[ ,1]
get.sex <- function(x, ref)
for (i in ref)
{
if(x %in% ref)
{return (1)}
}
# set default for column
train$sex <- 2
# Update column if it appears in the names list
train$sex <- sapply(train$sex, FUN=get.sex(x,male_names))
I would then use the function to run the second file (girls' names) against the table and set the flag to zero for each record where a match occurs.
Can anyone help?

When using sapply, you don't write arguments directly in the FUN parameter.
train$sex <- sapply(train$sex, FUN=get.sex,ref = male_names)
It is implied that train$sex is the x argument, and all other parameters are passed after that (in this case, it's just ref) and are explicitly defined.
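A minimal sketch of how sapply forwards extra arguments, assuming a hypothetical helper in_ref and made-up sample names (neither is from the question). Each element of the first argument becomes x, and ref is passed through by name:

```r
# Hypothetical helper: 1 if the name is in the reference vector, else 0
in_ref <- function(x, ref) as.integer(x %in% ref)

first_names <- c("Ola", "Kari", "Per")   # made-up sample data
male_ref    <- c("Ola", "Per", "Lars")   # made-up reference list

# sapply calls in_ref(first_names[i], ref = male_ref) for each element
flags <- sapply(first_names, FUN = in_ref, ref = male_ref)
flags
```

The result is a named integer vector with one flag per input name.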
Edit:
As joran noted, sapply isn't particularly useful in this case, and you can get the result in one line:
train$sex = (train$sex %in% male_names)*1
%in% can be used when the argument on the left is a vector, so you don't have to loop over it. Multiplying the result by one converts logical (boolean) values into integers. 1*TRUE yields 1, and 1*FALSE yields 0.
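The vectorised idea can be sketched end to end as follows. This assumes the names live in a column called name and that a second reference vector female_names exists; both are assumptions, since the question does not show the data frame's columns. The sample data is made up.

```r
# Made-up sample data; 'name' is an assumed column name
train <- data.frame(name = c("Ola", "Kari", "Per", "Alex"),
                    stringsAsFactors = FALSE)
male_names   <- c("Ola", "Per")    # stand-in for the boys' names file
female_names <- c("Kari", "Anne")  # stand-in for the girls' names file

train$sex <- 2L                                   # default: not found in either list
train$sex[train$name %in% male_names]   <- 1L     # boys
train$sex[train$name %in% female_names] <- 0L     # girls
train
```

Names that appear in neither list keep the default value 2.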

Related

When creating new data.frame column, what is the difference between `df$NewCol=` and `df[,"NewCol"]=` methods?

Using the default "iris" DataFrame in R, how come when creating a new column "NewCol"
iris[,'NewCol'] = as.POSIXlt(Sys.Date()) # throws Warning
BUT
iris$NewCol = as.POSIXlt(Sys.Date()) # is correct
This issue doesn't exist when assigning Primitive types like chr, int, float, ....
First, notice that, as @sindri_baldur pointed out, as.POSIXlt returns a list.
From R help ($<-.data.frame):
There is no data.frame method for $, so x$name uses the default method which treats x as a list (with partial matching of column names if the match is unique, see Extract). The replacement method (for $) checks value for the correct number of rows, and replicates it if necessary.
So, if you try iris[, "NewCol"] <- as.POSIXlt(Sys.Date()), you get a warning that you are trying to assign a list object to a vector, so only the first element of the list is used.
Again, from R help:
"For [ the replacement value can be a list: each element of the list is used to replace (part of) one column, recycling the list as necessary".
In your case, only one column is specified, meaning that only the first element of the list returned by as.POSIXlt will be used, and you are warned about that.
With the $ syntax, the iris data.frame is treated as a list, and the result of as.POSIXlt (a list again) is appended to it. The final result is still a data.frame, but take a look at the type of NewCol2: it's a list.
iris[, "NewCol"] <- as.POSIXlt(Sys.Date()) # warning
iris$NewCol2 <- as.POSIXlt(Sys.Date())
typeof(iris$NewCol) # double
typeof(iris$NewCol2) # list
Suggestion: maybe you wanted to use as.POSIXct()?
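A small sketch of why as.POSIXct is the easier choice for a column, assuming a made-up data frame df and a fixed timestamp (both are illustrative, not from the question). A POSIXct value is an atomic double vector rather than a list, so it behaves like any other column:

```r
stamp_lt <- as.POSIXlt("2024-01-01 12:00:00", tz = "UTC")
stamp_ct <- as.POSIXct("2024-01-01 12:00:00", tz = "UTC")

is.list(stamp_lt)   # TRUE: POSIXlt is a list of date-time components
is.list(stamp_ct)   # FALSE: POSIXct is a number of seconds with a class attribute

df <- data.frame(id = 1:3)       # made-up data frame
df$when <- stamp_ct              # recycled to all three rows
typeof(df$when)                  # "double"
class(df$when)                   # "POSIXct" "POSIXt"
```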

make function detect nonexistent column when specified as df$x

I have functions that operate on a single vector (for example, a column in a data frame). I want users to be able to use $ to specify the columns that they pass to these functions; for example, I want them to be able to write myFun(df$x), where df is a data frame. But in such cases, I want my functions to detect when x isn't in df. How may I do this?
Here is a minimal illustration of the problem:
myFun <- function (x) sum(x)
data(iris)
myFun(iris$Petal.Width) # returns 179.9
myFun(iris$XXX) # returns 0
I don't want the last line to return 0. I want it to throw an error message, as XXX isn't a column in iris. How may I do this?
One way is to run as.character(match.call()) inside the function. I could then use the parts of the resulting string to determine the name of df, and in turn, I could check for the existence of x. But this seems like a not-so-robust solution.
It won't suffice to throw an error whenever x has length 0: I want to detect whether the vector exists, not whether it has length 0.
I searched for related posts on Stack Overflow, but I didn't find any.
iris$XXX returns NULL, and NULL is then passed to sum:
sum(NULL)
#[1] 0
Note that both iris$XXX and iris[['XXX']] return NULL. If we need an error instead, either subset or dplyr::select will raise one:
iris %>%
select(XXX)
Error: Can't subset columns that don't exist.
✖ Column XXX doesn't exist.
Run rlang::last_error() to see where the error occurred.
Or with pull
iris %>%
pull(XXX)
Error: object 'XXX' not found
Run rlang::last_error() to see where the error occurred.
subset(iris, select = XXX)
Error in eval(substitute(select), nl, parent.frame()) :
object 'XXX' not found
We could make the function throw an error if NULL is passed. Because the function receives only the value of its argument, it has no information about the object that value came from.
myFun <- function (x) {
stopifnot(!is.null(x))
sum(x)
}
However, this error would be non-specific, because NULL can reach the function in other cases as well, not only through a missing column.
If we need to check whether the column is valid, then the data and the column name should be passed to the function separately:
myFun2 <- function(data, colnm) {
stopifnot(exists(colnm, data))
sum(data[[colnm]])
}
myFun2(iris, 'XXX')
#Error in myFun2(iris, "XXX") : exists(colnm, data) is not TRUE
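A variant of the same idea with a more descriptive message, sketched under the assumption that passing the data frame and column name separately is acceptable. The name sum_column and the message wording are made up for illustration:

```r
# Hypothetical variant of myFun2 that names the missing column in the error
sum_column <- function(data, colnm) {
  if (!colnm %in% names(data)) {
    stop("Column '", colnm, "' not found in data", call. = FALSE)
  }
  sum(data[[colnm]])
}

total <- sum_column(iris, "Petal.Width")

# Capture the error message instead of stopping the script
msg <- tryCatch(sum_column(iris, "XXX"),
                error = function(e) conditionMessage(e))
msg
```

Checking names(data) directly keeps the test limited to the columns of the data frame.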

Combining For and If loop in R

I wish to merge tables in R only if they exist. To do this, I have made a character vector with the names of tables that may or may not exist, and then used a "for" loop with an "if" condition to combine them. All the tables, if they exist, share a common "names" column. My code is as follows:
Designation.Attrition1<- data.frame(names)
x<- c("despivot2020new", "despivot2019new", "despivot2018new", "despivot2017new")
for( i in 1: length(x)){if (exists(x[i])){Designation.Attrition1<- merge(Designation.Attrition1, x[i] , by = "names")}}
However, I'm getting the error as "Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column".
One reason for the error may be that the merge function fails to treat the element of x as a variable name.
x[i] is still a string, not a data frame. Use get() to fetch the data frame before merging.
for (i in seq_along(x)) {
  if (exists(x[i])) {
    # get() looks up the data frame whose name is stored in x[i]
    Designation.Attrition1 <- merge(Designation.Attrition1, get(x[i]), by = 'names')
  }
}
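The same merge can be sketched without an explicit loop by collecting the tables that exist into a list and folding merge over them with Reduce. The tables below are made-up stand-ins, and base_tbl plays the role of Designation.Attrition1; this is one possible arrangement, not the only one.

```r
# Made-up stand-ins: only two of the four candidate tables exist
base_tbl        <- data.frame(names = c("a", "b", "c"))
despivot2020new <- data.frame(names = c("a", "b", "c"), y2020 = 1:3)
despivot2018new <- data.frame(names = c("a", "b", "c"), y2018 = 4:6)

x <- c("despivot2020new", "despivot2019new", "despivot2018new", "despivot2017new")

# Keep only the names that refer to existing objects, then fetch them as a list
present <- x[vapply(x, exists, logical(1), envir = environment())]
tables  <- mget(present, envir = environment())

# Merge each table into the running result, starting from base_tbl
merged <- Reduce(function(left, right) merge(left, right, by = "names"),
                 tables, base_tbl)
merged
```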

Why does using paste in for loop return error?

I have a few problems concerning the same topic.
(1) I am trying to loop over:
premium1999 <- as.data.frame(coef(summary(data1999_mod))[c(19:44), 1])
for 10 years, in which I wrote:
for (year in seq(1999,2008)) {
paste0('premium',year) <- as.data.frame(coef(summary(paste0('data',year,'_mod')))[c(19:44), 1])
}
Note: data1999_mod holds regression results, from which I want to extract some of the estimates as a data frame.
The coef(summary(data1999_mod)) looks like this:
#A matrix: ... of type dbl
              Estimate   Std. Error     t value      Pr(>|t|)
age       0.0388573570 2.196772e-03  17.6883885   3.362887e-6
age_sqr  -0.0003065876 2.790296e-05 -10.9876373  5.826926e-28
relation  0.0724525759 9.168118e-03   7.9026659  2.950318e-15
sex      -0.1348453659 8.970138e-03 -15.0326966  1.201003e-50
marital   0.0782049161 8.928773e-03   8.7587533  2.217825e-18
reg       0.1691004469 1.132230e-02  14.9351735  5.082589e-50
...
However, it returns Error: $ operator is invalid for atomic vectors, even though I did not use the $ operator here.
(2) Also,
I want to create a column 'year' containing repeated values of the associated year and am trying to loop over this:
premium1999$year <- 1999
In which I wrote:
for (i in seq(1999,2008)) {
assign(paste0('premium',i)[['year']], i)
}
In this case, it returns Error in paste0("premium", i)[["year"]]: subscript out of bounds
(3) Moreover, I'd like to repeat some rows and loop over:
premium1999 <- rbind(premium1999, premium1999[rep(1, 2),])
for 10 years again and I wrote:
for (year in seq(1999,2008)) {
paste0('premium',year) <- rbind(paste0('premium',year), paste0('premium',year)[rep(1, 2),])
}
This time it returns Error in paste0("premium", year)[rep(1, 2), ]: incorrect number of dimensions
I also tried to loop over a few other similar things but I always get Error.
Each code works fine individually.
I could not find what I did wrong. Any help or suggestions would be very highly appreciated.
The problem is that paste0() returns a character string; it does not refer to the object whose name is that string. For example, paste0('data',year,'_mod') returns a character vector of length 1, i.e., "data1999_mod", not the object data1999_mod.
To see the difference, compare "data1999_mod"["Estimate"] with data1999_mod["Estimate"]. Subsetting the result of paste0() gives the former, whereas the expected output comes only from the latter. That is why you are getting Error: $ operator is invalid for atomic vectors.
The same mistake appears in all of your code. To retrieve an object by the name that paste0() produces, wrap the call in get().
As you have not supplied a reproducible sample, I couldn't test this. However, you can try running these.
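#(1) get() fetches the model by name; assign() creates the result under the
#    generated name, since paste0(...) cannot appear on the left of <-
for (year in seq(1999, 2008)) {
  assign(paste0('premium', year),
         as.data.frame(coef(summary(get(paste0('data', year, '_mod'))))[c(19:44), 1]))
}
#(2) fetch the data frame, add the column, then assign it back
for (i in seq(1999, 2008)) {
  df <- get(paste0('premium', i))
  df[['year']] <- i
  assign(paste0('premium', i), df)
}
#(3) same pattern: fetch, modify, assign back
for (year in seq(1999, 2008)) {
  df <- get(paste0('premium', year))
  assign(paste0('premium', year), rbind(df, df[rep(1, 2), ]))
}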
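An alternative that avoids building variable names altogether is to keep the models in a named list. The sketch below is an illustration only: it uses lm fits on the built-in mtcars data as stand-ins for data1999_mod and friends, three years instead of ten, and all coefficients rather than rows 19 to 44.

```r
years <- 1999:2001

# Stand-in models; in the real case these would be the yearly regressions
mods <- lapply(years, function(y) lm(mpg ~ wt + hp, data = mtcars))
names(mods) <- years

# One data frame per year: the estimates plus a 'year' column
premiums <- lapply(names(mods), function(y) {
  est <- coef(summary(mods[[y]]))[, "Estimate"]
  data.frame(term = names(est), estimate = unname(est), year = as.integer(y))
})

# Stack the yearly data frames into one
premium_all <- do.call(rbind, premiums)
head(premium_all)
```

With a list, each step becomes a single lapply call, and no get() or assign() is needed.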

Assign a variable in R using another variable

I have to run tens of different permutations with the same structure but different base names for the output. To avoid having to keep replacing the whole character names within each formula, I was hoping to create a variable and then use the paste function to build the name of the output from it.
Example:
var<-"Patient1"
(paste0("cells_", var, sep="") <- WhichCells(object=test, expression = test > 0, idents=c("patient1","patient2"))
The expected output would be a variable called "cells_Patient1"
Then for subsequent runs, I would just copy and paste these 2 lines and change var <- "Patient1" to var <- "Patient2".
[Please note that I am oversimplifying the above step of WhichCells, as it entails ~10 steps and I would rather not have to replace "Patient1" with "Patient2" using Search and Replace.]
Unfortunately, I am unable to create the variable "cells_Patient1" using the above command. I am getting the following error:
Error in variable(paste0("cells_", var, sep = "")) <- WhichCells(object = test, :
target of assignment expands to non-language object
Browsing stackoverflow, I couldn't find a solution. My understanding of the error is that R can't assign an object to a variable that is not a constant. Is there a way to bypass this?
1) Use assign like this:
var <- "Patient1"
assign(paste0("cells_", var), 3)
cells_Patient1
## [1] 3
2) environment. This also works:
e <- .GlobalEnv
e[[ paste0("cells_", var) ]] <- 3
cells_Patient1
3) list. Alternatively, it might be better to store these variables in a list:
cells <- list()
cells[[ var ]] <- 3
cells[[ "Patient1" ]]
## [1] 3
Then we could easily iterate over all such variables. Replace sqrt with any suitable function.
lapply(cells, sqrt)
## $Patient1
## [1] 1.732051
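For the repeated runs described in the question, the list approach can be sketched as a loop over patient names. Here nchar(p) is only a placeholder for the real multi-step computation (WhichCells and the rest), and the patients vector is an assumption:

```r
patients <- c("Patient1", "Patient2")   # assumed set of base names
cells <- list()

for (p in patients) {
  # nchar(p) is a placeholder for the real per-patient computation
  cells[[paste0("cells_", p)]] <- nchar(p)
}

names(cells)
cells[["cells_Patient2"]]
```

Each result is stored under its generated name, so nothing needs to be copied and edited between runs.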
