how to apply multiple "if" functions by group in R - r

I'm trying to apply some basic "if" by group over a large dataset in R.
I tried to write a function and to apply over the groups using dplyr but it's not working. What could be the problem?
#dataframe
db <- data.frame(ID=c(1,1),
type=c("a","b"),
qual=c("no","OK"))
#if (no problem)
attach(db)
if(db[type =="a","qual"]=="OK"){
db[type =="a","qual_fin"] <- "OK"
db[type =="b","qual_fin"] <- "no"
} else if ( db[type =="b","qual"]=="OK"){
db[type =="b","qual_fin"] <- "OK"
db[type =="a","qual_fin"] <- "no"
} else {db$qual_fin <- "no"
}
#dataframe with groups
db <- data.frame(ID=c(1,1,2,2),
type=c("a","b","a","b"),
qual=c("OK","OK","no","OK"))
#function
quality <- function( a,b, qual_fin_a,qual_fin_b){
if(a =="OK"){
qual_fin_a <- "OK"
qual_fin_b <- "no"
} else if ( b =="OK"){
qual_fin_b <- "OK"
qual_fin_a <- "no"
} else {qual_fin_a <- "no"
qual_fin_b <- "no"
}}
#if by group
library(dplyr)
db2 <- db %>%
group_by(ID) %>%
do(quality(a=db[db$type =="a","qual"],
b=db[db$type =="b","qual"],
qual_fin_a=db[db$type=="a","qual_fin"],
qual_fin_b=db[db$type=="b","qual_fin"]))
I expect this result:
> db
ID type qual qual_fin
1 1 a OK OK
2 1 b OK no
3 2 a no no
4 2 b OK OK
I imagine that the solution is pretty simple but I'm struggling to find it!

Related

Loop function in r to compare values of different data frames

Introduction
Hi to everyone,
for a little project, I try to get a function to compare values of a Data Frame 1 with values from a Data Frame 2. Thereafter, data frames 3 and 4 are supposed to get printed with the information of the comparison.
Data Frame 1:
ID
x1i
x2i
x3i
a
1
2
4
b
1
4
1
Data Frame 2:
Data_Frame_2 <- c(1:4)
Read x1a and compare with Data Frame 2. The value 1 is in Data Frame 2. Print value 1 and the name of the variable (x1a) in Data Frame 3 and cross out the value 1 from Data Frame 2.
Read x1b and compare with Data Frame 2. The value 1 is (not anymore) in Data Frame 2. Read x2b. The value 4 is in Data Frame 2. Print value 4 and the name of the variable (x2b) in Data Frame 3 and cross out the value 4 from Data Frame 2.
The Data Frame 3 is supposed to be something like this:
Data Frame 3:
ID
Value
Variable
a
1
x1i
b
4
x2i
Data Frame 4 (the remaining numbers of Data Frame 2):
Remaining numbers
2
3
Example in R to solve this theoretical problem
Until now, I worked out this code which does the job:
b <- as.data.frame(c(1:4)) # data frame 2
colnames(b, do.NULL = FALSE)
colnames(b) <- c("b")
View(b)
a <- as.data.frame(cbind(c("a","b"), c(3,3), c(2,1), c(1,2))) # data frame 1
colnames(a, do.NULL = FALSE)
colnames(a) <- c("ID","x1i","x2i","x3i")
View(a)
`%notin%` <- Negate(`%in%`) #got this one from <https://www.marsja.se/how-to-use-in-in-r/>
Read_Info <- function(a,b)
{
if (a[1,2] %in% b[1:4,1]) {c_1<-c(a[1,1:2],names(a)[2]); b1<-subset(b,b %notin% a[1,2])}
if (a[2,2] %in% b1[1:3,1]) {c_2<-c(a[2,1:2],names(a)[2]); b2<-subset(b,b %notin% c(a[1,2],a[2,2]))}
else if (a[2,3] %in% b1[1:3,1]) {c_2<-c(a[2,1],a[2,3],names(a)[3]); b2<-subset(b,b %notin% c(a[1,2],a[2,3]))}
if (a[3,2] %in% b2[1:2,1]) {c_3<-c(a[3,1],a[3,2],names(a)[2]); b3<-subset(b,b %notin% c(a[1,2],a[2,3],a[3,2]))}
else if (a[3,2] %notin% b2[1:2,1]) {c_3<-c(NA,NA,NA); b3<-b2}
c<-rbind(c_1,c_2,c_3)
colnames(c, do.NULL = FALSE)
colnames(c) <- c("ID","Value","Variable")
bx<-b3
colnames(bx, do.NULL = FALSE)
colnames(bx) <- c("Remaining numbers")
print(c)
print(bx)
}
Read_Info(a,b)
# In this example, c is data frame 3 and bx is data frame 4
Actual Task at hand - If, else if Loop Function in R
I do face the following obstacle: the actual data which I have is a little bit larger than the above example. Nevertheless, it follows the same structure:
b <- as.data.frame(c(1:20)) # this would be Data Frame 2 in the theoretical considerations
colnames(l, do.NULL = FALSE)
colnames(l) <- c("b")
View(l)
# This would be data frame 1 in the theoretical considerations
# Note: between "ID" and "x1i", there are now two additional variables which were not in the example above
# Although these two variables are part of the data, they are not of interest right know
a2 <- cbind(c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t"),c(0),c(1))
a1 <- data.frame(replicate(16,sample(1:20,rep=T)))
a <- cbind(a2, a1)
colnames(a, do.NULL = FALSE)
colnames(a) <- c("ID","variable1","variable2","x1i","x2i","x3i","x4i","x5i","x6i","x7i","x8i","x9i","x10i","x11i","x12i","x13i","x14i")
View(a)
I try to create an “if”, “else if” loop function utilizing "for" which is supposed to do this reading task by itself. Until now, I wrote down the following code which does not work yet.
`%notin%` <- Negate(`%in%`) # got this one from <https://www.marsja.se/how-to-use-in-in-r/>
Read_Info_Loop <- function(a,b)
{for (i in 1:20)
{ if (a[i,4] %in% b[1:(21-i),1]) {x[i]<-c(a[i,1],a[i,4],names(a)[4]); b[i]<-subset(b,b %notin% a[i,4])}
if (a[i,5] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,5],names(a)[5]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,5]))
} else if (a[i,6] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,6],names(a)[6]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,6]))
} else if (a[i,7] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,7],names(a)[7]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,7]))
} else if (a[i,8] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,8],names(a)[8]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,8]))
} else if (a[i,9] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,9],names(a)[9]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,9]))
} else if (a[i,10] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,10],names(a)[10]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,10]))
} else if (a[i,11] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,11],names(a)[11]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,11]))
} else if (a[i,12] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,12],names(a)[12]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,12]))
} else if (a[i,13] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,13],names(a)[13]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,13]))
} else if (a[i,14] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,14],names(a)[14]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,14]))
} else if (a[i,15] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,15],names(a)[15]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,15]))
} else if (a[i,16] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,16],names(a)[16]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,16]))
} else if (a[i,17] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,17],names(a)[17]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,17]))
} else if (a[i,17] %notin% b[1:(21-i),1]) {x[i]<-c(NA,NA,NA); b[i]<-c(b[i-1])}
y<-rbind(x[i[1:20]])
colnames(y, do.NULL = FALSE)
colnames(y) <- c("ID","Value","Variable")
u<-rbind(b[i=20])
colnames(u, do.NULL = FALSE)
colnames(u) <- c("Remaining numbers")
print(y)
print(u)
}
}
# y is supposed to be data frame 3 and u is supposed to be data frame 4
# in the above theoretical considerations
Errors
I now get the following errors:
Error in `[<-.data.frame`(`*tmp*`, i, value = c("a", "1", "x3i")) :
replacement has 3 rows, data has 4
Error in Read_Info_Loop(test, l) : object 'x' not found
...nevertheless, the first error, I got yesterday. Today, after restarting R, the second error occurred which seems to address internal structural problems of the function code. Additionally, I am pretty sure, that there might be further errors which are right now "hidden" behind the other errors and which will occur as soon as the two above mentioned errors are dealt with.
However, I do not want you to just solve any problems. I rather would like to ask, if you have ideas how I can solve these two specific errors, and maybe a hint to just get the function a little bit closer to work properly. So, for me the focus is clearly on learning a thing or two in general.
A few disclaimers: I have little experience in programming, so the code or my descriptions are probably rather messy. Therefore, if you have any questions for clarification, please feel free to ask. I try to respond as quickly as possible. English is not my first language, so please excuse me for any language mistakes.
I am looking forward to learning and hear your ideas about the code itself, ideas regarding the theoretical considerations or the approach to the loop function.
Kind Regards
Paul
Edits / Progression
Edit: I just realized, that the code can already be simplified with another "for". Nevertheless, I read that one should rather avoid nested "for" loops (for...for...)
`%notin%` <- Negate(`%in%`) #got this one from <https://www.marsja.se/how-to-use-in-in-r/>
Read_Info_Loop2 <- function(a,b)
{for (i in 1:20) for (k in 5:17) {
{ if (a[i,4] %in% b[1:(21-i),1]) {x[i]<-c(a[i,1],a[i,4],names(a)[4]); b[i]<-subset(b,b %notin% a[i,4])
} else if (a[i,k] %in% b[i-1][1:(21-i),1]) {x[i]<-c(a[i,1],a[i,k],names(a)[k]); b[i]<-subset(b,b %notin% c(a[1,4],a[i,k]))
} else if (a[i,k] %notin% b[1:(21-i),1]) {x[i]<-c(NA,NA,NA); b[i]<-c(b[i-1])}
}
y<-rbind(x[i[1:20]])
colnames(y, do.NULL = FALSE)
colnames(y) <- c("ID","Value","Variable")
u<-rbind(b[i=20])
colnames(u, do.NULL = FALSE)
colnames(u) <- c("Remaining numbers")
print(y)
print(u)
}
}
The same error was shown:
Error in Read_Info_Loop2(test, l) : object 'x' not found
I try to use this resource, going forward: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Repetitive-execution
I am going to give further updates.
This is a tricky one. I was able to find a solution for the underlying problem but unfortunately I wasn't able to fix OP's code as it was requested.
However, here is my solution:
library(data.table)
long <- melt(setDT(a), "ID", patterns("^x"))
df3 <- long[, {
if (any(.SD$value %in% b)) {
result <- first(.SD[value %in% b])
b <- setdiff(b, result$value)
} else {
result <- data.table(variable = NA_integer_, value = NA_integer_)
}
result
}, by = ID]
df3
ID variable value
1: a x1i 1
2: b x2i 4
# remaining values
df4 <- data.table(Remaining.numbers = setdiff(b, df3$value))
df4
Remaining.numbers
1: 2
2: 3
Explanation
In a first step, the dataset a is reshaped into long format
long
ID variable value
1: a x1i 1
2: b x1i 1
3: a x2i 2
4: b x2i 4
5: a x3i 4
6: b x3i 1
Now, variable contains the column names as data items which simplifies subsequent steps. Note that melt() has maintained the original order of rows and columns which is important for picking the correct values later on.
Now, we kind of loop through long by unique values of ID. This is achieved by grouping. As a speciality of data.table, we can use an arbitrary expression (enclosed in curly brackets) for aggregation.
For each ID, we check if there is at least one value still available in the vector of remaining values. If so, the first appearance is taken as resulting row. The corresponding value is removed from b which is then used in the next "iteration", i.e., the next group level.
Please note that b inside the expression (in curly brackets) is a local variable. The modified value of b is not available outside of the environment of the expression.
While testing with arbitrary datasets I have noticed that there might be situations where all numbers which belong to an ID already have been removed from remaining. To indicate this, a dummy result with NA values is returned.
So, for each ID group one row is returned which are then combined into one data.table object and assigned to df3.
df4 contains the Remaining.numbers and is created from building the set difference between b and the vector of picked values df3$value.
Note that I have tried to rewrite the code as a loop for demonstration purposes but I have given up because I found that the bookkeeping overhead wasn't worth it.
Data
For the first use case in OP's question:
a <- fread("ID x1i x2i x3i
a 1 2 4
b 1 4 1")
b <- 1:4
Other use cases with varying numbers of rows, columns, and lengths of b can be created using the code below. Please note that set.seed() is important because the created dataset a and the results df3 and df4 depend on it. For example, with set.seed(123) we can reproduce the situation where the list of remaining numbers for the last ID is exhausted.
# number of rows and columns to create
n_rows <- 18
n_cols <- 16
# create vector b
b <- 1:20
# create data.frame a
a2 <- data.frame(ID = letters[seq(n_rows)], variable1 = 0, variable2 = 1)
set.seed(123) # to ensure reproducible results
a1 <- as.data.frame(replicate(n_cols, sample(b, n_rows, replace = TRUE)))
colnames(a1) <- sprintf("x%ii", seq(n_cols))
a <- cbind(a2, a1)
Uwe’s Solution
Thank you very much, Uwe, for your solution and comprehensive explanation! It did not even occur to me, to combine the values into one list and to let the function run over that list. So, your solution opened a new perspective on the data. I am going to try out your solution in detail to learn as much as possible and report back here as soon as possible!
Solution regarding the original code
I was able to get to a solution for the original code which took quite some time.
test2 <- cbind(c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t"),c(0),c(1),c(1,1,1,sample(1:15),1,1),c(2,3,3,sample(1:15),2,3))
test1 <- data.frame(replicate(12,sample(1:20,rep=T)))
data.frame1 <- cbind(test2, test1)
colnames(data.frame1) <- c("ID","variable1","variable2","x1i","x2i","x3i","x4i","x5i","x6i","x7i","x8i","x9i","x10i","x11i","x12i","x13i","x14i")
data.frame2 <- as.data.frame(c(1:20))
x <- as.data.frame(matrix(NA,nrow = 3,ncol = 20))
rownames(x) <- c("ID","value","variable")
colnames(x) <- c()
View(x)
`%notin%` <- Negate(`%in%`) #got this one from <https://www.marsja.se/how-to-use-in-in-r/>
Read_Info_Loop2 <- function(a,b) {for (k in 1:20) {for (i in 4:17)
{if (a[k,i] %in% b[,1]) {x[k]<-c(a[k,1],(a[k,i]),names(a[i])); b<-subset(b,b %notin% a[k,i]);break}}
}
c<-rbind(x)
bx<-b
colnames(bx) <- c("numbers remaining")
print(c)
print(bx)
}
Read_Info_Loop2(data.frame1, data.frame2)
The only downfall with this solution is the output. It is rather in a weird form. But I don’t mind really. So now we already have two solutions which use different approaches. Very exciting. Regarding the output (see picture below of the output of some of the actual data) of data.frames 3 and 4: The last 7 columns are NAs because this data.frame1_original has just 13 rows (k=13). So for the last 7 iterations (k=14 to k=20), there is no output.
Here is the output of the random data.frame1 as described above. Here, the solution looks rather weird, since for "r" and "t" all entries are already deleted from data.frame2 which returns NAs for these rows. The two numbers, which remain are 18 and 20.

If Loop with several conditions

I have the following dataframe:
df1
Name Ch1 Val1
A a x1
B b x2
C a x3
...
And I want to add another row that gives me a solution on the loop I am trying to get:
for (i in nrow(df))
if ( (df[i,3]>=-2)==T & (df3[i,3] <=2)==T & df[i,2]=="a"){
df[i,4]<-TRUE
}else if ((df[i,3]>2)==T & df[i,2]=="b"){
df[i,4]<-TRUE
}else (df[i,4]<-FALSE)
So basically if the value in Val1 is in an interval of -2 and +2 AND Ch1 is "a" it should result in TRUE
OR if Val1 is bigger than 2 AND Ch1 is "b" then the result is TRUE
Otherwise it should always be false.
My loop seems to only return the result for the first row the rest is NA.
Any idea where the mistake is? Or another way to solve this (even though I actually have a few more ORs)
Thank you!
If I understand correctly you try to create a new column, which contains true or false. I would use dplyrfor this.
df <- df %>%
mutate(new_column = case_when(
Val1 >=-2 & Val1 <=2 & Ch1 =="a" ~ TRUE,
Val1 > 2 & Ch1 == "b" ~ TRUE,
TRUE ~ FALSE
))
Your for loop only does one iteration because it is passed a single value instead of a sequence: i takes on only the single value you specify, not each value in a sequence such as each number from 1 up to nrow(df).
For example:
df <- data.frame(a = 1:5)
for (i in nrow(df)) {
print(i)
}
results in:
5
but,
for (i in 1:nrow(df)) {
print(i)
}
results in:
1
2
3
4
5
but the answer posted by #annet is more elegant.

R: match values in dataframes

I have the following list of dataframes:
a<-data.frame(
Data0=c("Y","Y","Y","Y","Y","Y","N","N","N","N","N","N"),
Data1=c(16,18,19,20,21,50,16,18,19,20,21,50),
Data2=c(2.2291,2.0743,1.9369,1.8148,1.7064,1.6102,2.2291,2.0743,1.9369,1.8148,1.7064,1.6102)
)
b<-data.frame(
Data0=c(-2 , 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 ,10 ,11) ,
Data1=c(0.8891 ,0.8891,0.9051,1,0.8891,0.8891,0.7907,0.8891,0.9929,0.8891,0.8891,0.8891,0.8891)
)
dfl<-list(a,b)
I would like to make a function, for each of the dataframes in the list, that returns the numbers in the last column, for the matching variables in the previous columns.
For a, if I send "Y" and 16, I want 2.2291 and for b, if I send 5 I want 0.7907. The problem for a is that if I send a number smaller than 16, for either "Y" or "N", I want it to return the value as if it were 16, without manipulating the input variable. I manage to do this for b, but for a, where I have a "split" between "Y" and "N", I don't know how to solve this.
So far:
get_value<-function(vector){
if (length(vector)==1) {
from<-append(head(unlist(dfl[[2]][1],use.names=FALSE),-1),-9999,0)
to <-unlist(dfl[[2]][1],use.names=FALSE)
match_from<-vector > from
match_to<-vector <=to
pos<-which(match_from==match_to)
return(unlist(dfl[[2]][pos,2]))
}
else{
print("not sure what to do for a")
}
}
get_value(4)
get_value(-44)
The solution needs to take into account that sometimes for a, the columns Data0 and Data1 have switched places.
EDIT:
input/output table:
INPUT OUTPUT
c("Y",13) --> 2.2291
c("Y",50) --> 1.6102
c("N",20) --> 1.8148
c("N",50) --> 1.6102
c(-44)) --> 0.8891
An alternate approach using tidyverse functions could be as follows:
library(tidyverse)
library(magrittr)
get_value<-function(vector){
if (length(vector)==1)
{ df <- dfl[[2]]
k <- df %>%
arrange(Data0)%>%
filter(Data0 >= vector) %>%
select(Data1) %>%
head(1)
}
else
{df <- dfl[[1]]
k <- df %>%
arrange(Data0)%>%
filter(Data0 == vector[1]) %>%
filter(Data1 >= vector[2]) %>%
select(Data2) %>%
head(1)
}
return(k)
}
and test some outputs as:
> get_value(c(-44))
Data1
1 0.8891
> get_value(c('N',16))
Data2
1 2.2291
> get_value(c('Y',16))
Data2
1 2.2291
> get_value(c('N',12))
Data2
1 2.2291
> get_value(c('Y',11))
Data2
1 2.2291
> get_value(c('Y',18))
Data2
1 2.0743
Ok, I finally figured out a way, but it is not so elegant, so help would be appreciated.
get_value<-function(vector){
if (length(vector)==1) {
from<-append(head(unlist(dfl[[2]][1],use.names=FALSE),-1),-9999,0)
to <-unlist(dfl[[2]][1],use.names=FALSE)
match_from<-vector > from
match_to<-vector <= to
pos<-which(match_from==match_to)
return(unlist(dfl[[2]][pos,2]))
}
else{
pos_1<-which(unlist(dfl[[1]][1],use.names=FALSE)==vector[1])
from<-append(head(as.numeric(unlist(dfl[[1]][2],use.names=FALSE))[pos_1],-1),-9999,0)
to <-as.numeric(unlist(dfl[[1]][2],use.names=FALSE))[pos_1]
match_from<-as.numeric(vector[2]) > from
match_to <-as.numeric(vector[2]) <=to
pos<-pos_1[match_from == match_to]
return(unlist(dfl[[1]][pos,3]))
}
}

Populating new column in R

I'm new to R. I'm trying to set a new column in my data frame depending on what's in 3 other columns. I've looked at other queries like:
Populate a column using if statements in r
Which I thought would solve it but it looks like I can only give sapply a single vector as when I try the following code:
IHC <- c("N","N","Y","N","N")
CCD <- c("13-Nov-2009", NA, "09-Feb-2011", "10-Dec-2012", "16-Nov-2009")
IHE <- c(NA, "20-Feb-2011",NA,NA,NA)
df1 <- data.frame(IHC, CCD, IHE)
InHouse <- function(IHC,CCD,IHE) {
if(IHE == "" && CCD == NA | IHC == "N") y <- ""
if(IHE == "") y <- CCD
if(CCD > IHE) y <- IHE
else y <- CCD
return(y)
}
df1$AAA <- sapply(c(df1$IHC, df1$CCD, df1$IHE), InHouse)
I get the following error:
Error in IHE == "" : 'IHE' is missing
Any help would be great.
There are several issues.
Your conditions involve comparisons like: IHE=="". IHE is NA but never "". So I assume you want is.na(IHE)??
You are mixing the scalar form of and (&& instead of &) with the vectorized form of or (| instead of ||). Why??
The comparison CCD > IHE is meaningless if either is NA (which is always the case).
The logical operators & and | have equal precedence, so IHE == "" && CCD == NA | IHC == "N" is equivalent to (IHE == "" && CCD == NA) | IHC == "N". Is that what you want??
Most important, your condition are not mutually exclusive.
This is a way to apply the conditions without the use of any of the apply(...) functions.
df1 <- data.frame(IHC, CCD, IHE, stringsAsFactors=F)
df1$AAA <- CCD
cond <- with(df1,is.na(IHE) & is.na(CCD) | IHC == "N")
df1[cond,]$AAA <- ""
cond <- is.na(df1$IHE)
df1[cond,]$AAA <- df1[cond,]$CCD
cond <- with(df1,CCD > IHE & is.na(CCD) & is.na(IHE))
df1[cond,]$AAA <- df1[cond,]$IHE

Build Efficient R Filter

I have this dataframe called data. In the data frame I have a few columns, for simplicity I will explain the columns with a weather analogy, it is like "weather_st_louis", "weather_boston", "weather_ny"... I want to build a column "weather" and it should be done like this, "if weather in st louis exists, use this column, else if weather in boston exists, use this column, else if weather in ny exists, use this column, else NONE". I'm going to be using this logic many times, with many columns, so need a way to make this more efficient. What is the R way to do this.
Also, side question, is what I'm trying to build here called a "filter"?
if(exists("data['w_stlouis']")) {
data['w'] <- data['w_stlouis']
} else if(exists("data['w_boston]")){
data['w'] <- data['w_boston']
} else if(exists("data['w_ny']")){
data['w'] <- data['w_ny']
} else {data['w'] <- NA}
Try something like that :
example <- matrix(NA,ncol=5,nrow=5)
colnames(example) <- c("weather_1","weather_2","weather_3","weather_4","weather_5")
example[5,3] <- 1
example[3,2] <- 1
example[1,2] <- 1
example[4,4] <- 1
example[5,2] <- 1
w <- apply(example,1,function(x){
o <- which(!is.na(x))[1]
if (is.na(o)) r <- "NONE"
else r <- colnames(example)[o]
r
})
w
When you have repeating tasks to do, try to use apply/tapply/sapply functions
Here is another posibility. I'm not sure if this is what you need, but perhaps it gives you another way to handle it.
df <- data.frame(matrix(rnorm(100, 100, 20),ncol=5,nrow=5))
colnames(df) <- c("weather_1","weather_2","weather_3","weather_4","weather_5")
library(reshape2)
df <- melt(df)
df[1:10,2] <- NA
str(df)
weather_levels <- levels(df$variable)
df$case <- ifelse(is.na(df$value), 0, 1)
These two output the same result
subset(df, df$case == 1)
na.omit(df)

Resources