How can I create a new column with mutate based on an if condition?
Example:
dt <- read.table(text="
name,gender,fat_%
adam,male,32
anya,female,27
gilang,male,24
andine,female,34
",sep=',',header=TRUE)
## + > dt
## name gender fat_.
## 1 adam male 32
## 2 anya female 27
## 3 gilang male 24
## 4 andine female 34
My question:
What code do I need to write to create a new column that takes one of two values, "yes" or "no"?
The new column should look like this:
name gender fat_% obesity
adam male 32 yes
anya female 27 no
gilang male 24 no
andine female 34 yes
Note: the rule for obesity is
male and fat > 26 = yes, female and fat > 32 = yes; otherwise (male below the 26 cutoff, female below the 32 cutoff) = no
A couple of suggestions first. Gender could be a single character, M/F. You cannot use % in a column name (read.table has turned fat_% into fat_.). By 'fat', did you perhaps mean BMI?
Does this work for you?
library(dplyr)

dt %>%
  mutate(newcol = ifelse(gender == "male",
                         ifelse(fat_. > 26, TRUE, FALSE),   # male threshold
                         ifelse(fat_. > 32, TRUE, FALSE)))  # female threshold
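An alternative to nesting ifelse is to keep the two thresholds in a named vector and look them up by gender. This is only a minimal base R sketch, assuming the column names that read.table produces for the question's data (fat_% becomes fat_.); the name cutoff is made up for illustration.

```r
# Rebuild the question's data; read.table sanitises "fat_%" to "fat_."
dt <- read.table(text = "
name,gender,fat_%
adam,male,32
anya,female,27
gilang,male,24
andine,female,34
", sep = ",", header = TRUE)

# Hypothetical lookup table: one threshold per gender
cutoff <- c(male = 26, female = 32)

# Pick the threshold that applies to each row, then compare
row_cutoff <- unname(cutoff[as.character(dt$gender)])
dt$obesity <- ifelse(dt$fat_. > row_cutoff, "yes", "no")

print(dt)
```

Adding a third category would then only need another entry in cutoff rather than another level of nesting.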
Two solutions.
First, a base R solution:
df$obesity <- ifelse (df$gender == "m" & df$fat_ > 26 , "yes",
ifelse(df$gender == "f" & df$fat_ > 32, "yes", "no"))
Using mutate from dplyr, more compact code based on dplyr's if_else rather than base R's ifelse is this:
df %>%
mutate(obesity = if_else(gender=="m" & fat_ > 26|gender=="f" & fat_ > 32, "yes", "no"))
RESULT:
df
name gender fat_ obesity
1 adam m 32 yes
2 anya f 27 no
3 gilang m 24 no
4 andine f 34 yes
DATA:
df <- data.frame(
name = c("adam", "anya", "gilang", "andine"),
gender = c("m", "f", "m", "f"),
fat_ = c(32,27,24,34)
)
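The compact condition above mixes & and | without parentheses. It works because & binds more tightly than | in R, but explicit parentheses make that grouping visible. A small base R check, using the same df as in the DATA block (the names implicit and explicit are illustrative):

```r
df <- data.frame(
  name   = c("adam", "anya", "gilang", "andine"),
  gender = c("m", "f", "m", "f"),
  fat_   = c(32, 27, 24, 34)
)

# As written in the answer: relies on & having higher precedence than |
implicit <- df$gender == "m" & df$fat_ > 26 | df$gender == "f" & df$fat_ > 32

# Same logic with the grouping spelled out
explicit <- (df$gender == "m" & df$fat_ > 26) | (df$gender == "f" & df$fat_ > 32)

# Both forms select the same rows
stopifnot(identical(implicit, explicit))

df$obesity <- ifelse(explicit, "yes", "no")
print(df)
```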
One approach is to use case_when from dplyr:
library(dplyr)
df %>%
mutate(obesity = case_when(gender == "male" & fat > 26 ~ "yes",
gender == "female" & fat > 32 ~ "yes",
TRUE ~ "no"))
# name gender fat obesity
#1 adam male 32 yes
#2 anya female 27 no
#3 gilang male 24 no
#4 andine female 34 yes
Once you understand the syntax, it comes in handy quite often.
Data
structure(list(name = structure(c(1L, 3L, 4L, 2L), .Label = c("adam",
"andine", "anya", "gilang"), class = "factor"), gender = structure(c(2L,
1L, 2L, 1L), .Label = c("female", "male"), class = "factor"),
fat = c(32, 27, 24, 34)), class = "data.frame", row.names = c(NA,
-4L))
I want to change all the gender entries ('Male, Female, Woman, Man, man', etc.) to be more consistent, so that there are only 3 values (Male, Female and Non-binary). This is my current code:
# Cleaning of Specific Variable Types
removed <- removed %>%
mutate(gender=substr(toupper(gender), 1, 1))
removed <- removed %>%
mutate(gender=case_when(
gender == "M"~"Male",
gender == "F"~"Female",
gender == "N"~"Non-binary")
)
This may be the long version, but it should work:
Data from jay.sf (many thanks)
Capitalize first letter
check for unique entries in gender
create pattern for each category
apply case_when condition with str_detect and pattern:
library(dplyr)
library(stringr)

# Capitalize each value to avoid interaction of "man" and "woman" in str_detect
removed$gender <- str_to_title(removed$gender)

# check for unique elements in `gender`
unique(removed$gender)
# [1] "Male"      "Woman"     "Other"     "Mtf"       "Female"
# [6] "Man"       "Ftm"       "Androgyne"

# define pattern for each category
Male <- paste(c("Male", "Man"), collapse = "|")
Female <- paste(c("Woman", "Female"), collapse = "|")
Non_binary <- paste(c("Other", "Mtf", "Ftm", "Androgyne"), collapse = "|")

# apply category with `case_when` and pattern:
removed %>%
  mutate(gender = case_when(
    str_detect(gender, Male) ~ "Male",
    str_detect(gender, Female) ~ "Female",
    str_detect(gender, Non_binary) ~ "Non-binary"))
Output:
gender
1 Male
2 Female
3 Male
4 Non-binary
5 Non-binary
6 Female
7 Male
8 Non-binary
9 Male
10 Male
11 Male
12 Female
13 Non-binary
14 Female
15 Female
16 Non-binary
17 Male
18 Female
19 Non-binary
20 Non-binary
21 Non-binary
22 Female
23 Female
24 Female
25 Female
26 Male
27 Non-binary
28 Male
29 Female
30 Non-binary
The problem seems to be the missing default value for gender: any first letter other than "M", "F" or "N" matches no condition and becomes NA. Use TRUE as the last condition instead of matching "N".
Tested with the data in jay.sf's answer.
library(dplyr)
removed %>%
mutate(
gender = toupper(substr(gender, 1, 1)),
gender = case_when(
gender == "M" ~ "Male",
gender %in% c("F", "W") ~ "Female",
TRUE ~ "Non-binary"
))
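To see which raw values land in which branch of a first-letter recoding, it can help to print the letters side by side with the originals. A rough base R sketch, using the distinct values from the Data block in this thread (first_letter and falls_to_default are illustrative names):

```r
gender <- c("Male", "Woman", "other", "MtF", "female", "man", "FtM", "androgyne")

# The letter that the recoding step actually compares against
first_letter <- toupper(substr(gender, 1, 1))
print(data.frame(gender, first_letter))

# Values that match neither "M" nor "F"/"W" and so reach the TRUE branch
falls_to_default <- gender[!first_letter %in% c("M", "F", "W")]
print(falls_to_default)
```

Note that "MtF" and "FtM" start with "M" and "F", so a first-letter rule groups them differently from the pattern-based and key-table answers, which place both under non-binary.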
You probably have a data frame like this.
removed
# gender
# 1 Male
# 2 Woman
# 3 Male
# 4 other
# 5 MtF
# 6 female
# ...
You could now create a key table in a half-automated way like so.
key <- data.frame(x=sort(unique(tolower(removed$gender))),
y=factor(c(3, 1, 3, 2, 2, 3, 3, 1),
labels=c('female', 'male', 'non-binary')))
Then use match to replace the labels.
library(dplyr)
removed %>%
mutate(gender=key$y[match(tolower(gender), key$x)])
# gender
# 1 male
# 2 female
# 3 male
# 4 non-binary
# 5 non-binary
# 6 female
# 7 ...
Data
removed <- structure(list(gender = c("Male", "Woman", "Male", "other", "MtF",
"female", "male", "MtF", "Male", "man", "Man", "female", "other",
"Woman", "female", "MtF", "male", "Female", "other", "other",
"FtM", "female", "Woman", "Woman", "female", "male", "androgyne",
"man", "Female", "MtF")), class = "data.frame", row.names = c(NA,
-30L))
I have 4 columns, called Amplification, CNV.gain, Homozygous.Deletion.Frequency, Heterozygous.Deletion.Frequency. I want to create a new column in which, if any of the values in these 4 columns are:
greater than or equal to 5 and less than or equal to 10, it returns low
greater than 10 and less than or equal to 20, it returns medium
greater than 20, it returns high
An example of the final table (long_fused) would look like this:
CNV.Gain  Amplification  Homozygous.Deletion.Frequency  Heterozygous.Deletion.Frequency  Threshold
3         5              10                             0                                Low
0         0              11                             8                                Medium
7         16             25                             0                                High
So far, I've tried the following code. It fills in the "Threshold" column, but does so incorrectly.
library(dplyr)
long_fused <- long_fused %>%
mutate(Percent_sample_altered = case_when(
Amplification>=5 & Amplification < 10 & CNV.gain>=5 & CNV.gain < 10 | CNV.gain>=5 & CNV.gain<=10 & Homozygous.Deletion.Frequency>=5 & Homozygous.Deletion.Frequency<=10| Heterozygous.Deletion.Frequency>=5 & Heterozygous.Deletion.Frequency<=10 ~ 'Low',
Amplification>= 10 & Amplification<20 |CNV.gain>=10 & CNV.gain<20| Homozygous.Deletion.Frequency>= 10 & Homozygous.Deletion.Frequency<20 | Heterozygous.Deletion.Frequency>=10 & Heterozygous.Deletion.Frequency<20 ~ 'Medium',
Amplification>20 | CNV.gain >20 | Homozygous.Deletion.Frequency >20 | Heterozygous.Deletion.Frequency>20 ~ 'High'))
As always any help is appreciated!
Data in dput format
long_fused <-
structure(list(CNV.Gain = c(3L, 0L, 7L), Amplification = c(5L,
0L, 16L), Homozygous.Deletion.Frequency = c(10L, 11L, 25L),
Heterozygous.Deletion.Frequency = c(0L, 8L, 0L), Threshold =
c("Low", "Medium", "High")), class = "data.frame",
row.names = c(NA, -3L))
Here is a way with rowwise followed by the base function cut.
library(dplyr)
long_fused %>%
  rowwise() %>%
  mutate(new = max(c_across(-Threshold)),
         # include.lowest = TRUE closes the first interval on the left, so 5 counts as "Low"
         new = cut(new, c(5, 10, 20, Inf), labels = c("Low", "Medium", "High"), include.lowest = TRUE)) %>%
  ungroup()
Here's an alternative using case_when -
library(dplyr)
long_fused %>%
  mutate(max = do.call(pmax, select(., -Threshold)),
         #If you don't have Threshold column in your data just use .
         #mutate(max = do.call(pmax, .),
         # values below 5 match no condition and become NA
         Threshold = case_when(between(max, 5, 10) ~ 'Low',
                               max > 10 & max <= 20 ~ 'Medium',
                               max > 20 ~ 'High'))
# CNV.Gain Amplification Homozygous.Deletion.Frequency
#1 3 5 10
#2 0 0 11
#3 7 16 25
# Heterozygous.Deletion.Frequency max Threshold
#1 0 10 Low
#2 8 11 Medium
#3 0 25 High
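The interval boundaries are the easy part to get wrong here, so a quick check on the edge values may be useful. A minimal base R sketch, assuming the four numeric columns from the dput in the question; row_max and label_threshold are illustrative names:

```r
long_fused <- data.frame(
  CNV.Gain = c(3L, 0L, 7L),
  Amplification = c(5L, 0L, 16L),
  Homozygous.Deletion.Frequency = c(10L, 11L, 25L),
  Heterozygous.Deletion.Frequency = c(0L, 8L, 0L)
)

# Hypothetical helper: [5, 10] -> Low, (10, 20] -> Medium, (20, Inf) -> High
label_threshold <- function(x) {
  cut(x, breaks = c(5, 10, 20, Inf),
      labels = c("Low", "Medium", "High"),
      include.lowest = TRUE)  # makes the first interval include 5
}

# Largest value in each row across the four columns
row_max <- do.call(pmax, unname(as.list(long_fused)))
long_fused$Threshold <- label_threshold(row_max)
print(long_fused)

# Edge cases: values below 5 get NA, 5 and 10 are Low, 20 is Medium
edge <- as.character(label_threshold(c(4, 5, 10, 11, 20, 21)))
print(edge)
```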
So I've seen many pages on the generalized version of this issue but here specifically I would like to sum all values in a row after a specific column.
Let's say we have this df:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0111 boston fitz 0 0 0
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
0114 ontario wazaaa NA 11 NA
Now the dfs I work with don't always have 3 "q" variables; the number varies. Hence, I would like to compute a row sum for every row, but only over the columns that come after the column identity.
Rows with NA are to be ignored.
Eventually I would like to remove the rows which sum to 0 and end up with a df that looks like this:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
Doing this in dplyr is the preference but not required.
EDIT:
I have added below the data for which this solution is not working; apologies for the confusion.
df <- structure(list(Program = c("3002", "111", "2455", "2929", "NA",
"NA", NA), Project_ID = c("299", "11", "271", "780", "207", "222",
NA), Advance_Identifier = c(14, 24, 12, 15, NA, 11, NA), Sequence = c(6,
4, 4, 5, 2, 3, 79), Item = c("payment", "hero", "prepayment_2",
"UPS", "period", "prepayment", "yeet"), q1 = c("500", "12", "-1",
"0", NA, "0", "0"), q2 = c("500", "12", "-1", "0", NA, "0", "1"
), q3 = c("500", "12", "2", "0", NA, "0", "2"), q4 = c("500",
"13", "0", "0", NA, "0", "3")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Base R version with zero extra dependencies:
[Edit: I always forget rowSums exists]
> df1$new = rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)
> df1
id city identity q1 q2 q3 new
1 110 detroit ella 2 4 3 9
2 111 boston fitz 0 0 0 0
3 112 philly gerald 3 1 0 4
4 113 new_york doowop 8 11 2 21
If you need to convert chars to numbers, use apply with as.numeric:
df$new = apply(df[,(1+which(names(df)=="Item")):ncol(df),drop=FALSE], 1, function(row){sum(as.numeric(row))})
BUT look out if they are really factors: calling as.numeric directly on a factor returns the internal level codes rather than the values, which is why converting things that look like numbers to numbers before you do anything else is a Good Thing.
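The factor caveat is easiest to see on a plain vector. A tiny base R illustration (the values are borrowed from the q columns in the question; f is an illustrative name):

```r
f <- factor(c("500", "12", "-1"))

# Direct conversion gives the internal level codes, not the numbers
codes <- as.numeric(f)
print(codes)

# Going through character first recovers the actual values
values <- as.numeric(as.character(f))
print(values)
```

Converting via as.character first is the usual way around this, and it is a no-op for columns that are already character.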
Benchmark
In case you are worried about speed here's a benchmark test of my function against the currently accepted solution:
akrun = function(df1){df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
sample data
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE), identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
Test - note I remove the new column from the source data frame each time otherwise the code keeps adding one of those into it (although akrun doesn't modify df in place it can get run after baz has modified it by assigning it the new column in the benchmark code).
> microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
Unit: microseconds
expr min lq mean
{ df$new = NULL df2 = akrun(df) } 1300.682 1328.941 1396.63477
{ df$new = NULL df$new = baz(df) } 63.102 72.721 87.78668
median uq max neval
1376.9425 1398.5880 2075.894 100
84.3655 86.7005 685.594 100
The tidyverse version takes about 16 times as long as the base R version in this run.
We can use
out <- df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))
out
# id city identity q1 q2 q3 new
#1 110 detroit ella 2 4 3 9
#2 111 boston fitz 0 0 0 0
#3 112 philly gerald 3 1 0 4
#4 113 new_york doowop 8 11 2 21
and then filter out the rows that have 0 in 'new'
out %>%
filter(new >0)
In the OP's updated dataset, the columns are of type character. We can automatically convert them to their appropriate types with
df %>%
#type.convert %>% # base R
# or with `readr::type_convert`
type_convert %>%
...
NOTE: The OP asked for a tidyverse option in the title and in the description. It is not a question about efficiency.
Also, rowSums is a base R function. Here, we showed how to use it in a tidyverse chain. I could have written an answer the base R way earlier with the same function.
If we remove the select, it becomes plain base R, i.e.
df1$new <- rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
Benchmarks
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE),
identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
akrun = function(df1){
rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
#Unit: microseconds
# expr min lq mean median uq max neval
# { df$new = NULL df2 = akrun(df) } 69.926 73.244 112.2078 75.4335 78.7625 3539.921 100
# { df$new = NULL df$new = baz(df) } 73.670 77.945 118.3875 80.5045 83.5100 3767.812 100
data
df1 <- structure(list(id = 110:113, city = c("detroit", "boston", "philly",
"new_york"), identity = c("ella", "fitz", "gerald", "doowop"),
q1 = c(2L, 0L, 3L, 8L), q2 = c(4L, 0L, 1L, 11L), q3 = c(3L,
0L, 0L, 2L)), class = "data.frame", row.names = c(NA, -4L
))
Similar to akrun's answer, you can try
df %>%
  # funs() is deprecated; pass the function directly
  mutate_at(vars(starts_with("q")), as.numeric) %>%
  mutate(sum_new = rowSums(select(., starts_with("q")), na.rm = TRUE)) %>%
  filter(sum_new > 0)
Here I use reduce from purrr to sum across the columns, which should be fast.
library(tidyverse)
data %>% filter_at(vars(starts_with('q')),~!is.na(.)) %>%
mutate( Sum = reduce(select(., starts_with("q")), `+`)) %>%
filter(Sum > 0)
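For reference, the whole task from the question can also be written in a few lines of base R. This is only a sketch, and it assumes one reading of "rows with NA are to be ignored", namely that any row containing NA in the summed columns is dropped, which matches the expected output in the question. The names q_cols, total and result are illustrative.

```r
df <- data.frame(
  id       = c("0110", "0111", "0112", "0113", "0114"),
  city     = c("detroit", "boston", "philly", "new_york", "ontario"),
  identity = c("ella", "fitz", "gerald", "doowop", "wazaaa"),
  q1       = c(2, 0, 3, 8, NA),
  q2       = c(4, 0, 1, 11, 11),
  q3       = c(3, 0, 0, 2, NA)
)

# Every column to the right of "identity", however many there are
q_cols <- (match("identity", names(df)) + 1):ncol(df)

# Without na.rm, a row with any NA gets an NA total
total <- rowSums(df[q_cols])

# Assumption: drop rows with NA totals as well as rows that sum to 0
result <- df[!is.na(total) & total > 0, ]
print(result)
```

If rows with NA should instead be kept and summed over their non-missing values, rowSums(..., na.rm = TRUE) as in the answers above is the variant to use.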
I have made a function which increments the values in certain columns in a certain row. I did this by writing a function that subsets through my dataframe to find the row it needs (by looking at sex, then age, then deprivation, then number of partners) and then adds numbers to whichever column I need it to (depending on these risk factors), it then calculates the risk (my code is for STI testing).
However, this does not change my existing dataframe with the new values, but creates a new variable patientRow which holds these new values. I need help with how I can incorporate this into my existing dataframe. Thanks!
adaptRisk <- function(dataframe, sexNum, ageNum, deprivationNum,
partnerNum, testResult){
sexRisk = subset(dataframe, sex == sexNum)
ageRisk = subset(sexRisk, age == ageNum)
depRisk = subset(ageRisk, deprivation == deprivationNum)
patientRow = subset(depRisk, partners == partnerNum)
if (testResult == "positive") {
patientRow$tested <- patientRow$tested + 1
patientRow$infected <- patientRow$infected + 1
}
else if (testResult == "negative") {
patientRow$tested <- patientRow$tested + 1
}
patientRow <- transform(patientRow, risk = infected/tested)
return(patientRow)
}
This is the head of my dataframe to give you an idea:
sex age deprivation partners tested infected risk
1 Female 16-19 1-2 0-1 132 1 0.007575758
2 Female 16-19 1-2 2 25 1 0.040000000
3 Female 16-19 1-2 >=3 30 1 0.033333333
4 Female 16-19 3 0-1 80 2 0.025000000
5 Female 16-19 3 2 12 1 0.083333333
6 Female 16-19 3 >=3 18 1 0.055555556
The dput of my data is:
structure(list(sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label =
c("Female",
"Male"), class = "factor"), age = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("16-19", "20-24", "25-34", "35-44"), class =
"factor"),
deprivation = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1-2",
"3", "4-5"), class = "factor"), partners = structure(c(2L,
3L, 1L, 2L, 3L, 1L), .Label = c(">=3", "0-1", "2"), class = "factor"),
tested = c(132L, 25L, 30L, 80L, 12L, 18L), infected = c(1L,
1L, 1L, 2L, 1L, 1L), uninfected = c(131L, 24L, 29L, 78L,
11L, 17L), risk = c(0.00757575757575758, 0.04, 0.0333333333333333,
0.025, 0.0833333333333333, 0.0555555555555556)), .Names = c("sex",
"age", "deprivation", "partners", "tested", "infected", "uninfected",
"risk"), row.names = c(NA, 6L), class = "data.frame")
An example call to the function:
adaptRisk(data, "Female", "16-19", 3, 2, "positive")
sex age deprivation partners tested infected uninfected risk
5 Female 16-19 3 2 13 2 11 0.1538462
I have adjusted your function (see all the way below) using base R syntax. It does the job, but is not the most beautiful code.
Issue:
The subsets create a lot of extra (and unneeded) data.frames instead of replacing the values in place when the conditions match. And the function returned a different data.frame, so the existing data.frame was never updated.
I adjusted it so that the filters are applied directly to the columns that you want to change.
transform might have unintended side effects, and you were recalculating the whole risk column. Now only the affected value is recalculated.
You might want to build in some warnings / stops in case the filters return more than 1 record.
You can now use
df <- adaptRisk(df, "Female", "16-19", "3", "2", "positive")
to replace the values in the data.frame you supply to the function.
examples
# affects row 5
adaptRisk(df, "Female", "16-19", "3", "2", "positive")
sex age deprivation partners tested infected uninfected risk
1 Female 16-19 1-2 0-1 132 1 131 0.007575758
2 Female 16-19 1-2 2 25 1 24 0.040000000
3 Female 16-19 1-2 >=3 30 1 29 0.033333333
4 Female 16-19 3 0-1 80 2 78 0.025000000
5 Female 16-19 3 2 13 2 11 0.153846154
6 Female 16-19 3 >=3 18 1 17 0.055555556
# affects row 5
adaptRisk(df, "Female", "16-19", "3", "2", "negative")
sex age deprivation partners tested infected uninfected risk
1 Female 16-19 1-2 0-1 132 1 131 0.007575758
2 Female 16-19 1-2 2 25 1 24 0.040000000
3 Female 16-19 1-2 >=3 30 1 29 0.033333333
4 Female 16-19 3 0-1 80 2 78 0.025000000
5 Female 16-19 3 2 13 1 11 0.076923077
6 Female 16-19 3 >=3 18 1 17 0.055555556
function:
adaptRisk <- function(data, sexNum, ageNum, deprivationNum,
partnerNum, testResult){
if (testResult == "positive") {
data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] + 1
data$infected[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$infected[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] + 1
data$risk[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$infected[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum]/data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum]
}
else if (testResult == "negative") {
data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] + 1
data$risk[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum] <- data$infected[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum]/data$tested[data$sex == sexNum &
data$age == ageNum &
data$deprivation == deprivationNum &
data$partners == partnerNum]
}
return(data)
}
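The repeated filter in the function above can be computed once and reused as a row index. The following is a sketch of that idea rather than a drop-in replacement: the name adaptRiskIdx is hypothetical, the sample data leaves out the uninfected column, and unlike the original it stops when the filter does not match exactly one row or when testResult is not "positive" or "negative".

```r
# Hypothetical compact variant: build the row filter once
adaptRiskIdx <- function(data, sexNum, ageNum, deprivationNum,
                         partnerNum, testResult) {
  stopifnot(testResult %in% c("positive", "negative"))
  idx <- which(data$sex == sexNum &
               data$age == ageNum &
               data$deprivation == deprivationNum &
               data$partners == partnerNum)
  # Assumption: exactly one row should match the four risk factors
  if (length(idx) != 1) stop("expected exactly one matching row")

  data$tested[idx] <- data$tested[idx] + 1
  if (testResult == "positive") {
    data$infected[idx] <- data$infected[idx] + 1
  }
  # Recalculate risk only for the affected row
  data$risk[idx] <- data$infected[idx] / data$tested[idx]
  data
}

patients <- data.frame(
  sex = "Female", age = "16-19",
  deprivation = c("1-2", "1-2", "1-2", "3", "3", "3"),
  partners = c("0-1", "2", ">=3", "0-1", "2", ">=3"),
  tested = c(132, 25, 30, 80, 12, 18),
  infected = c(1, 1, 1, 2, 1, 1)
)
patients$risk <- patients$infected / patients$tested

updated <- adaptRiskIdx(patients, "Female", "16-19", "3", "2", "positive")
print(updated[5, ])
```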
The function outputs a single row which, apparently, you intend to use to replace the original row. You could replace the original row by doing something like this:
## original data frame is named patientData
patientRow <- adaptRisk(patientData, "Female", "16-19", 3, 2, "positive")
patientData[row.names(patientRow), ] <- patientRow
I have a large data frame (~4.5m rows), each row corresponds to a separate admission to hospital.
Within each admission are up to 20 diagnosis codes in columns #7 to #26. In addition, I have a field assigned as the "main diagnosis". It was my assumption that the "main diagnosis" corresponded to the first of the 20 diagnosis codes. That is incorrect - sometimes it's the 1st, others the 2nd, 3rd, etc. I'm interested in that distribution.
ID MainDiagCode Diag_1 Diag_2 Diag_3 ...
Patient1 J123 J123 R343 S753
Patient2 G456 F119 E159 G456
Patient3 T789 L292 T789 W474
I'd like to add a column to my data frame that tells me which of the 20 diagnosis codes matches to the "main" one.
ID MainDiagCode Diag_1 Diag_2 Diag_3 ... NewColumn
Patient1 J123 J123 R343 S753 1
Patient2 G456 F119 E159 G456 3
Patient3 T789 L292 T789 W474 2
I've been able to get a loop running:
for (i in seq_len(nrow(df))) {
  df$NewColumn[i] <-
    unname(which(apply(df[i, 7:26], 2, function(x)
      any(grepl(df$MainDiagCode[i], x)))))
}
I'm wondering if there's a better way to do this without using a loop, as that's very slow indeed.
Thank you in advance.
df$NewColumn = apply(df, 1, function(x) match(x["MainDiagCode"], x[-c(1,2)]))
df
ID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
1 Patient1 J123 J123 R343 S753 1
2 Patient2 G456 F119 E159 G456 3
3 Patient3 T789 L292 T789 W474 2
It's safer to return the actual column name rather than relying on the match position to be equal to the diagnosis number. For example:
# Get the names of the diagnosis columns
diag.cols = names(df)[grep("^Diag", names(df))]
Extract the column name of the matched column:
apply(df, 1, function(x) {
names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])]
})
[1] "Diag_1" "Diag_3" "Diag_2"
Extract the number at the end of the matched column name:
library(stringr)
apply(df, 1, function(x) {
as.numeric(
str_extract(
names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])], "[0-9]{1,2}$")
)
})
[1] 1 3 2
With 20 diagnoses and 4.5m patients it might be more efficient to use a simple loop over columns and search for matches:
ff = function(main, diags)
{
ans = rep_len(NA_integer_, length(main))
for(i in seq_along(diags)) ans[main == diags[[i]]] = i
return(ans)
}
ff(as.character(dat$MainDiagCode), lapply(dat[-(1:2)], as.character))
#[1] 1 3 2
If more than one diagnosis matches the main you might need adjustments to return the first and not the last (as above) diagnosis. Perhaps, it might be even more efficient to reduce the number of rows checked in each iteration depending on when a match is found.
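One possible adjustment for returning the first match instead of the last is to fill in only the positions that are still NA. A sketch under that assumption; ff_first is a hypothetical name, and the fourth patient is invented here to show a code that appears in two diagnosis columns.

```r
# Variant of ff that keeps the first matching column per row
ff_first <- function(main, diags) {
  ans <- rep_len(NA_integer_, length(main))
  for (i in seq_along(diags)) {
    # Only rows without an earlier match can be filled in;
    # which() also drops any NA comparisons
    hit <- which(is.na(ans) & main == diags[[i]])
    ans[hit] <- i
  }
  ans
}

main  <- c("J123", "G456", "T789", "A100")
diags <- list(
  Diag_1 = c("J123", "F119", "L292", "A100"),
  Diag_2 = c("R343", "E159", "T789", "A100"),
  Diag_3 = c("S753", "G456", "W474", "X999")
)

first_pos <- ff_first(main, diags)
print(first_pos)
```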
dat = structure(list(PatientID = structure(1:3, .Label = c("Patient1",
"Patient2", "Patient3"), class = "factor"), MainDiagCode = structure(c(2L,
1L, 3L), .Label = c("G456", "J123", "T789"), class = "factor"),
Diag_1 = structure(c(2L, 1L, 3L), .Label = c("F119", "J123",
"L292"), class = "factor"), Diag_2 = structure(c(2L, 1L,
3L), .Label = c("E159", "R343", "T789"), class = "factor"),
Diag_3 = structure(c(2L, 1L, 3L), .Label = c("G456", "S753",
"W474"), class = "factor")), .Names = c("PatientID", "MainDiagCode",
"Diag_1", "Diag_2", "Diag_3"), row.names = c(NA, -3L), class = "data.frame")
This does a row-by-row comparison of the three columns to the 'MainDiagCode':
apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] ) )
[1] 1 3 2
So :
dat$NewColumn <- apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] ) )
As you have a lot of rows, using data.table could improve performance
library(data.table)
DT <- data.table(PatientID = paste0("Patient", 1:3),
MainDiagCode = c("J123", "G456", "T789"),
Diag_1 = c("J123", "F119", "L292"),
Diag_2 = c("R343", "E159", "T789"),
Diag_3 = c("S753", "G456", "W474")
)
DT[, NewColumn := match(MainDiagCode, .SD[, -1, with = FALSE]), by = PatientID]
DT
#> PatientID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
#> 1: Patient1 J123 J123 R343 S753 1
#> 2: Patient2 G456 F119 E159 G456 3
#> 3: Patient3 T789 L292 T789 W474 2
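Another option that avoids per-row work is to compare the whole block of diagnosis columns against the main code at once and let max.col pick the first matching column. This is a base R sketch, not a benchmarked solution; the fourth patient is invented to show the no-match case, and diag_cols, hit and pos are illustrative names.

```r
dat <- data.frame(
  PatientID    = paste0("Patient", 1:4),
  MainDiagCode = c("J123", "G456", "T789", "Z000"),
  Diag_1       = c("J123", "F119", "L292", "A111"),
  Diag_2       = c("R343", "E159", "T789", "B222"),
  Diag_3       = c("S753", "G456", "W474", "C333")
)

diag_cols <- grep("^Diag", names(dat), value = TRUE)

# Logical matrix: TRUE where a diagnosis equals that row's main code
# (the vector is recycled down the columns, so row i is compared with code i)
hit <- as.matrix(dat[diag_cols]) == as.character(dat$MainDiagCode)

# Column index of the first TRUE in each row
pos <- max.col(hit * 1, ties.method = "first")

# Rows with no match at all would otherwise report column 1
pos[rowSums(hit) == 0] <- NA

dat$NewColumn <- pos
print(dat)
```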