Cannot figure out how to use IF statement - r

I want to create a categorical variable for my DB: a "Same_Region" group that includes all the people who live and work in the same Region, and a "Diff_Region" group for those who don't. I tried to use the IF statement, but I don't know how to properly say "if the variables Region of residence and Region of work are the same, return...". It's the very first time I've tried to approach R by myself, and I feel a little bit lost.
I tried to store the two variables (made of 2 letters, e.g. "BO") as characters and use the "grep" command, but that produced no results.
Then I tried converting both variables to factors, and nothing much changed.
----In R-----
extractSamepr <- function(RegionOfRes, RegionOfWo){
if(RegionOfRes== RegionOfWo){
return("SamePr")
}
else {
return("DiffPr")
}
SamePr <- NULL
for (i in 1:nrow(Data.Base)) {
SamePr <- c(SamePr, extractSamepr(Data.Base[i, "RegionOfRes", "RegionOfWo"]))
}

The ifelse way proposed in @deepseefan's comment is a standard way of solving this type of problem.
Here is another one. It uses the fact that FALSE/TRUE are coded as integers 0/1: it creates a logical vector based on equality and then adds 1 to that vector, giving a vector of 1/2 values. This result is used in the function's final instruction to index a vector with the two possible outcomes.
extractSamepr <- function(DF){
i <- 1 + (DF[["RegionOfRes"]] == DF[["RegionOfWo"]])
c("DiffPr", "SamePr")[i]
}
Data.Base$SamePr <- extractSamepr(Data.Base)
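For comparison, here is a minimal sketch of the ifelse approach mentioned above. The column names follow the question; the toy values are made up for illustration, and the columns are assumed to be character (if they are factors, converting them with as.character() first is the safer route).

```r
# Toy data; the column names follow the question, the values are made up.
Data.Base <- data.frame(
  RegionOfRes = c("BO", "MI", "RM"),
  RegionOfWo  = c("BO", "RM", "RM"),
  stringsAsFactors = FALSE
)

# ifelse() is vectorised: it compares the two columns row by row,
# so no explicit loop over the rows is needed.
Data.Base$SamePr <- ifelse(Data.Base$RegionOfRes == Data.Base$RegionOfWo,
                           "SamePr", "DiffPr")
Data.Base$SamePr
# "SamePr" "DiffPr" "SamePr"
```

Unlike a plain if, which expects a single TRUE/FALSE, ifelse works on whole vectors at once.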

Related

Is there a generalizable way to pass variable names into functions in R? If not, why? [duplicate]

This question already has answers here:
Dynamically select data frame columns using $ and a character value
It seems like one of the primary things I get stuck on when R programming is passing through variable names. I come from a Stata background, where we can easily call globals with "$" in any code or function. However, that doesn't seem to work in R. It seems like sometimes I have to use some special package or use something like df[[x]] or something like that. Instead of doing all of this ad-hoc, I was wondering if someone can walk me through the R architecture so I understand how to address this problem every time I run into it.
As a simple example, I am currently working on a code that stores a row count:
rowcount <- function(x){
all_n <- length(which(!is.na(df$x) & df$model=="Honda"))
print(all_n)
}
The function simply stores the count of rows where x is not missing and model is "Honda". I want to be able to pass the variable name into the function, then have it return this count. For instance, for variable gender, I want to be able to write rowcount(gender), and for gender to be passed into the function as df$gender. However, this doesn't happen.
Can someone explain how to fix this code, and in the process, how I can generally fix these types of problems? I know there may be more elegant ways to achieve my goal, but my intention is both to (1) get a code that fulfills a specific goal for my project, and (2) more generally understand how R treats variable names as arguments in functions.
Thanks
We can pass the column name as a string and then use [[. It is better to have the data also as an argument in the function so that it can be reused for different datasets
rowcount <- function(data, x){
# data[[x]] selects the column named by the string in x
all_n <- length(which(!is.na(data[[x]]) & data$model == "Honda"))
all_n
}
Note that print only prints the output. We need to return the object created. In R, we don't have to explicitly specify the return
In addition to the OP's method, it can also be done with sum
rowcount <- function(data, x){
sum(!is.na(data[[x]]) & data$model == "Honda")
}
Note that we don't have to create an object and then return if it is a single expression
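As a small usage sketch (the data frame df1 and its values are made up for illustration), this shows why the string version works: data[[x]] looks up the column whose name is stored in x, whereas data$x looks for a column literally named "x".

```r
# Hypothetical data frame, made up for illustration.
df1 <- data.frame(
  model  = c("Honda", "Honda", "Toyota", "Honda"),
  gender = c("F", NA, "M", "M"),
  stringsAsFactors = FALSE
)

rowcount <- function(data, x) {
  # data[[x]] uses the value of x ("gender") as the column name;
  # data$x would look for a column literally called "x".
  sum(!is.na(data[[x]]) & data$model == "Honda")
}

rowcount(df1, "gender")
# 2

# There is no column called "x", so $ returns NULL here.
is.null(df1$x)
# TRUE
```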
As an aside, the tidyverse option would be
library(dplyr)
rowcount <- function(data, x) {
x <- enquo(x)
data %>%
summarise(out = sum(!is.na(!!x) & model == "Honda")) %>%
pull(out)
}
where we can pass the column name unquoted
rowcount(df1, columnname)

function to remove all observations that contain a "prohibited" value - R

I have a large dataset looking like:
There are overall 43 different values for PID. I have identified PIDs that need to be removed and summarized them in a vector:
I want to remove all observations (rows) from my data set that contain one of the PIDs from the vector NullNK. I have tried writing a function for it, but I get an error (I have never written functions before):
for (i in length(NullNK)){
SR_DynUeber_einfam <- SR_DynUeber_einfam [-which(SR_DynUeber_einfam$PID == NullNK(i)),]
}
How can I efficiently remove the observations from my original data set that contain PIDs from the NullNK vector?
What is wrong with my function?
Thanks!
For basic operations like this, for loops are often not needed. This does what you are looking for:
SR_DynUeber_einfam[!SR_DynUeber_einfam$PID %in% NullNK,]
One mistake in your function is NullNK(i). You should subset from a vector with NullNK[i] in R.
Hope this helps!
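A minimal sketch of both routes, using made-up stand-ins for the question's objects (the real data was not shown). It also illustrates a second issue in the original loop: for (i in length(NullNK)) runs only once, with i equal to the length, whereas seq_along(NullNK) visits every index.

```r
# Made-up stand-ins for the question's data set and vector.
SR_DynUeber_einfam <- data.frame(PID   = c(1, 2, 3, 2, 4),
                                 value = c(10, 20, 30, 40, 50))
NullNK <- c(2, 4)

# Vectorised version: keep only rows whose PID is not in NullNK.
cleaned <- SR_DynUeber_einfam[!SR_DynUeber_einfam$PID %in% NullNK, ]
cleaned$PID
# 1 3

# Loop version, if a loop is preferred: seq_along() yields 1, 2, ...
# and NullNK[i] (square brackets) picks the i-th element.
looped <- SR_DynUeber_einfam
for (i in seq_along(NullNK)) {
  looped <- looped[looped$PID != NullNK[i], ]
}
looped$PID
# 1 3
```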

R to ignore NULL values

I have 2 vectors in R, but some of the values in both are marked as "NULL".
I want R to ignore the "NULL" values, but still "acknowledge" their presence because of the indexes (I'm using the intersect and which functions).
I have tried this:
for i in 1:length(vector)
if vector=="NULL"
i=i+1
else
'rest of the code'
Is this a good approach? The algorithm is running, but the vectors are very large.
You should replace "NULL" with NA, which is R's native representation for missing values. Many functions have ways of dealing with NA values, such as the na.rm argument or the na.action option. You also shouldn't call your vector 'vector', since that is the name of a base R function.
yourvector[yourvector == "NULL"] <- NA
Also you shouldn't add 1 to i in your if, just do nothing:
for (i in 1:length(yourvector)) {
if (!is.na(yourvector[i])) {
#rest of the code
}
}
Also, tell us what you want to do. You probably don't need a for loop.
This code contains several errors:
First off, a vector cannot normally contain NULL values at all. Are you maybe using a list?
if vector=="NULL"
you probably mean if (vector[i] == "NULL"). Even so, that’s wrong. You cannot filter for NULL by comparing to the character string "NULL" – those two are fundamentally different things. You need to use the function is.null instead. Or, if you’re working with an actual vector which contains NA values (not NULL, like I said, that’s not possible), something like is.na.
i=i+1
This code makes no sense – leaving it out won’t change the result because the loop is in charge of incrementing i.
Finally, don’t iterate over indices – for (i in 1 : length(x)) is bad style in R. Instead, iterate over the elements directly:
for (x in vector) {
if (! is.na(x)) {
# perform action
}
}
But even this isn't very R-like. Instead, you would do two things:
use subsetting to get rid of NA values:
vector[! is.na(vector)]
Use one of the *apply functions (for instance, sapply) instead of a loop, and put the loop body into a function:
sapply(vector[! is.na(vector)], function (x) Do something with x)
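Putting the advice from both answers together, here is a rough sketch. The vector values and the choice of toupper as the "action" are made up for illustration; the point is that which() reports positions in the original vector, so the indexes are preserved even though the missing entries are skipped.

```r
# Made-up character vector that uses the string "NULL" as a placeholder.
values <- c("a", "NULL", "b", "NULL", "c")

# Replace the placeholder with NA so R's missing-value tools apply.
values[values == "NULL"] <- NA

# which() returns positions in the original vector.
keep <- which(!is.na(values))
keep
# 1 3 5

# Apply a function to the non-missing elements only, without a loop.
result <- sapply(values[keep], toupper)
unname(result)
# "A" "B" "C"
```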

Subsetting within a function

I'm trying to subset a dataframe within a function using a mixture of fixed variables and some variables which are created within the function (I only know the variable names, but cannot vectorise them beforehand). Here is a simplified example:
a<-c(1,2,3,4)
b<-c(2,2,3,5)
c<-c(1,1,2,2)
D<-data.frame(a,b,c)
subbing<-function(Data,GroupVar,condition){
g=Data$c+3
h=Data$c+1
NewD<-data.frame(a,b,g,h)
subset(NewD,select=c(a,b,GroupVar),GroupVar%in%condition)
}
Keep in mind that in my application I cannot compute g and h outside of the function. Sometimes I'll want to make a selection according to the values of h (as above) and other times I'll want to use g. There's also the possibility I may want to use both, but even just being able to subset using 1 would be great.
subbing(D,GroupVar=h,condition=5)
This returns an error saying that the object h cannot be found. I've tried to amend subset using as.formula and all sorts of things but I've failed every single time.
Besides the ease of the function there is a further reason why I'd like to use subset.
In the function I'm actually working on I use subset twice. The first time it's the simple subset function. It's just been pointed out below that another blog explored how it's probably best to use the good old data[colnames()=="g",]. Thanks for the suggestion, I'll have a go.
There is however another issue. I also use subset (or rather a variation) in my function because I'm dealing with several complex design surveys (see package survey), so subset.survey.design allows you to get the right variance estimation for subgroups. If I selected my group using [] I would get the wrong s.e. for my parameters, so I guess this is quite an important issue.
Thank you
It's happening right as the function is trying to define GroupVar in the beginning. R is looking for the object h by itself (not within the dataframe).
The best thing to do is refer to the column names in quotes in the subset function. But of course, then you'd have to sidestep the condition part:
subbing <- function(Data, GroupVar, condition) {
....
DF <- subset(Data, select=c("a","b", GroupVar))
DF <- DF[DF[,3] %in% condition,]
DF
}
That will do the trick, although it can be annoying to have one data frame indexing inside another.
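A self-contained variant of this idea, as a sketch only: it rebuilds the question's example, passes the grouping column as a string, and uses [[ ]] rather than a positional index. It does not address the subset.survey.design case from the question.

```r
# Example data from the question.
D <- data.frame(a = c(1, 2, 3, 4),
                b = c(2, 2, 3, 5),
                c = c(1, 1, 2, 2))

subbing <- function(Data, GroupVar, condition) {
  # g and h are computed inside the function, as in the question.
  NewD <- data.frame(a = Data$a, b = Data$b,
                     g = Data$c + 3, h = Data$c + 1)
  # GroupVar is a character string, so [[ ]] can use it directly.
  rows <- NewD[[GroupVar]] %in% condition
  NewD[rows, c("a", "b", GroupVar)]
}

res <- subbing(D, GroupVar = "g", condition = 5)
res$a
# 3 4
names(res)
# "a" "b" "g"
```

Because the column is selected by name, the same function works for "h" without any change.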

How to use a value that is specified in a function call as a "variable"

I am wondering if it is possible in R to use a value that is declared in a function call as a "variable" part of the function itself, similar to the functionality that is available in SAS IML.
Given something like this:
put.together <- function(suffix, numbers) {
new.suffix <<- as.data.frame(numbers)
return(new.suffix)
}
x <- c(seq(1000,1012, 1))
put.together(part.a, x)
new.part.a ##### does not exist!!
new.suffix ##### does exist
As it is written, the function returns a dataframe called new.suffix, as it should because that is what I'm asking it to do.
I would like to get a dataframe returned that is called new.part.a.
EDIT: Additional information was requested regarding the purpose of the analysis
The purpose of the question is to produce dataframes that will be sent to another function for analysis.
There exists a data bank where elements are organized into groups by number, and other people organize the groups into a meaningful set.
Each group has an id number. I use the information supplied by others to put the groups together as they are specified.
For example, I would be given a set of id numbers like: part-1 = 102263, 102338, 202236, 302342, 902273, 102337, 402233.
So, part-1 has seven groups, each group having several elements.
I use the id numbers in a merge so that only the groups of interest are extracted from the large data bank.
The following is what I have for one set:
### all.possible.elements.bank <- .csv file from large database ###
id.part.1 <- as.data.frame(c(102263, 102338, 202236, 302342, 902273, 102337, 402233))
bank.names <- c("bank.id")
colnames(id.part.1) <- bank.names
part.sort <- matrix(seq(1,nrow(id.part.1),1))
sort.part.1 <- cbind(id.part.1, part.sort)
final.part.1 <- as.data.frame(merge(sort.part.1, all.possible.elements.bank,
by="bank.id", all.x=TRUE))
The process above is repeated many, many times.
I know that I could do this for all of the collections that I would pull together, but I thought I would be able to wrap the selection process into a function. The only things that would change would be the part numbers (part-1, part-2, etc..) and the groups that are selected out.
It is possible using the assign function (and possibly deparse and substitute), but it is strongly discouraged to do things like this. Why can't you just return the data frame and call the function like:
new.part.a <- put.together(x)
Which is the generally better approach.
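A minimal sketch of that approach, assuming a one-argument version of put.together. When many parts are needed, a named list can hold them without creating variables by name; the part names and id values below are made up.

```r
# One-argument version: build the data frame and return it.
put.together <- function(numbers) {
  as.data.frame(numbers)
}

# The caller chooses the name by ordinary assignment.
x <- seq(1000, 1012, 1)
new.part.a <- put.together(x)
nrow(new.part.a)
# 13

# For many parts, keep the results in a named list instead of using assign().
# These part names and id values are hypothetical.
ids <- list(part.a = c(1, 2, 3), part.b = c(4, 5))
parts <- lapply(ids, put.together)
names(parts)
# "part.a" "part.b"
nrow(parts$part.a)
# 3
```

Each element can then be passed to the analysis function with lapply or accessed as parts$part.a.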
If you really want to change things in the global environment then you may want a macro; see the defmacro function in the gtools package and, most importantly, read the document in the references section on the help page.
This is rarely something you should want to do... assigning to things out of the function environment can get you into all sorts of trouble.
However, you can do it using assign:
put.together <- function(suffix, numbers) {
assign(paste('new',
deparse(substitute(suffix)),
sep='.'),
as.data.frame(numbers),
envir=parent.env(environment()))
}
put.together(part.a, 1:20)
But like Greg said, it's usually not necessary, and always dangerous if used incorrectly.
