I already went through different links like How to convert a factor to an integer/numeric without loss of information? but could not solve the problem.
I have a data frame
SYMBOL PVALUE1 PVALUE2
1 10-Mar 0.813027629406118 0.78820189558684
2 10-Sep 0.00167287722066533 0.00167287722066533
3 11-Mar 0.21179810441316 0.464576340307205
4 11-Sep 0.00221961024320294 0.00221961024320294
5 12-Sep 0.934667427815304 0.986884425214009
6 15-Sep 0.00167287722066533 0.00167287722066533
7 1-Dec 0.464576340307205 0.0911572830792113
8 1-Mar 0.00818426308604705 0.0252302356363697
9 1-Sep 0.60516237199519 0.570568468332992
10 2-Mar 0.0103975819620539 0.00382292568622066
11 2-Sep 0.00167287722066533 0.00167287722066533
When I try str():
str(df)
'data.frame': 20305 obs. of 3 variables:
$ SYMBOL : Factor w/ 21050 levels "","10-Mar","10-Sep",..: 2 3 4 5 6 7 8 9 10 11 ...
$ PVALUE1: Factor w/ 209 levels "0","0.000109570493049298",..: 169 22 110 24 181 22 139 39 149 44 ...
$ PVALUE2: Factor w/ 216 levels "0","0.000109570493049298",..: 172 20 141 23 201 20 90 61 150 29 ...
When I try mode():
sapply(df,mode)
SYMBOL PVALUE1 PVALUE2
"numeric" "numeric" "numeric"
When I try to assign values based on the condition below, applied to the two numeric columns (2 and 3), with
df$Score <- rowSums(ifelse(df[, -1] == 0, 0,
                    ifelse(df[, -1] <= 0.05, 2,
                    ifelse(df[, -1] >= 0.065, -2, 1))))
I get Warning messages:
1: In Ops.factor(left, right) : ‘<=’ not meaningful for factors
2: In Ops.factor(left, right) : ‘<=’ not meaningful for factors
3: In Ops.factor(left, right) : ‘>=’ not meaningful for factors
4: In Ops.factor(left, right) : ‘>=’ not meaningful for factors
and the output looks like this:
SYMBOL PVALUE1 PVALUE2 Score
1 10-Mar 0.813027629406118 0.78820189558684 NA
2 10-Sep 0.00167287722066533 0.00167287722066533 NA
3 11-Mar 0.21179810441316 0.464576340307205 NA
4 11-Sep 0.00221961024320294 0.00221961024320294 NA
5 12-Sep 0.934667427815304 0.986884425214009 NA
6 15-Sep 0.00167287722066533 0.00167287722066533 NA
If the factor is already numeric, why is the above code not working and giving NA? How should I proceed?
Edit: dput() output
structure(list(SYMBOL = structure(1:6, .Label = c("10-Mar", "10-Sep",
"11-Mar", "11-Sep", "12-Sep", "15-Sep"), class = "factor"), PVALUE1 = structure(c(4L,
1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", "0.00221961024320294",
"0.21179810441316", "0.813027629406118", "0.934667427815304"), class = "factor"),
PVALUE2 = structure(c(4L, 1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533",
"0.00221961024320294", "0.464576340307205", "0.78820189558684",
"0.986884425214009"), class = "factor")), .Names = c("SYMBOL",
"PVALUE1", "PVALUE2"), row.names = c(NA, 6L), class = "data.frame")
I tried this also:
indx <- sapply(df, is.factor)
df[indx] <- lapply(df[indx], function(x) as.numeric(levels(x))[x])
indx returns
SYMBOL PVALUE1 PVALUE2
TRUE TRUE TRUE
Warning message:
In FUN(X[[3L]], ...) : NAs introduced by coercion
Using your dput data, this works just fine:
df = structure(list(SYMBOL = structure(1:6, .Label = c("10-Mar", "10-Sep",
"11-Mar", "11-Sep", "12-Sep", "15-Sep"), class = "factor"), PVALUE1 = structure(c(4L,
1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", "0.00221961024320294",
"0.21179810441316", "0.813027629406118", "0.934667427815304"), class = "factor"),
PVALUE2 = structure(c(4L, 1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533",
"0.00221961024320294", "0.464576340307205", "0.78820189558684",
"0.986884425214009"), class = "factor")), .Names = c("SYMBOL",
"PVALUE1", "PVALUE2"), row.names = c(NA, 6L), class = "data.frame")
df$PVALUE1 = as.numeric(as.character(df$PVALUE1))
df$PVALUE2 = as.numeric(as.character(df$PVALUE2))
df
# SYMBOL PVALUE1 PVALUE2
# 1 10-Mar 0.813027629 0.788201896
# 2 10-Sep 0.001672877 0.001672877
# 3 11-Mar 0.211798104 0.464576340
# 4 11-Sep 0.002219610 0.002219610
# 5 12-Sep 0.934667428 0.986884425
# 6 15-Sep 0.001672877 0.001672877
sapply(df, class)
# SYMBOL PVALUE1 PVALUE2
# "factor" "numeric" "numeric"
If you have issues doing this to your whole data frame, it's possible you have some irregular rows. However, I also looked at the CSV you provided in the comments, and it looks just fine.
Also note that this is one of several equivalent solutions in the duplicate question that you linked.
To convert all but the first column, you could do
df[, 2:ncol(df)] = lapply(df[, -1], function(x) as.numeric(as.character(x)))
Note that you don't want to convert date columns or SYMBOL columns this way as they aren't numeric.
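For instance (a toy illustration, not from the original post), converting a date-like symbol this way produces exactly the "NAs introduced by coercion" warning seen in the question:
as.numeric(as.character(factor("10-Mar")))
# [1] NA
# Warning message: NAs introduced by coercion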
Similarly, to convert columns named, say PVALUE1 to PVALUE47, you could construct the column names and then convert them:
col_to_convert = paste0("PVALUE", 1:47)
df[, col_to_convert] = lapply(df[, col_to_convert], function(x) as.numeric(as.character(x)))
In general, best practice is to not have these columns as factors in the first place. However you get this data into R probably has a way to specify column classes, e.g., colClasses in read.table, read.csv, etc.
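For example, a minimal sketch (the file name and column set are assumptions, not taken from the original post): specifying colClasses up front keeps SYMBOL as character and the p-value columns numeric, so no factor conversion is needed later.
df <- read.csv("pvalues.csv",            # hypothetical file name
               stringsAsFactors = FALSE,
               colClasses = c(SYMBOL  = "character",
                              PVALUE1 = "numeric",
                              PVALUE2 = "numeric"))
str(df)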
An option using data.table
library(data.table)
setDT(df)[, 2:3 := lapply(.SD, function(x)
as.numeric(levels(x))[x]), .SDcols=2:3]
Or a slightly faster version would be to use set():
indx <- which(sapply(df, is.factor) & grepl('PVALUE', names(df)))
setDT(df)
for(j in indx){
  set(df, i = NULL, j = j, value = as.numeric(levels(df[[j]]))[df[[j]]])
}
I guess the reason you got the warning is that the 'indx' you created also included the first column (it is also a factor, but it is non-numeric). When non-numeric factor levels are converted to numeric, those elements are coerced to NA.
According to ?factor
To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
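To make that concrete, a small illustration with toy values (not from the question): as.numeric() called directly on a factor returns the internal level codes, which is why comparing them against 0.05 is meaningless.
f <- factor(c("0.25", "0.80", "0.25"))
as.numeric(f)                # 1 2 1  -- internal level codes, not the values
as.numeric(levels(f))[f]     # 0.25 0.80 0.25
as.numeric(as.character(f))  # same result, slightly slower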
Related
I am trying to delete rows in my dataset which contain NAs, but none of the functions work. What could be the reason?
Here is a sample of my code:
Site_cov<- read.csv("site_cov.csv")
colnames(Site_cov)<- c("Point", "Basal", "Short.Saps", "Tall.Saps")
head(Site_cov)
Point Basal Short.Saps Tall.Saps
1 DEL001 Na 2 0
2 DEL002 Na 1 6
3 DEL003 Na 0 5
4 DEL004 10 21 22
Here, I thought that upper- and lowercase Na values could be the problem, so this is what I ran:
Site_cov$Basal<-toupper(Site_cov$Basal)
Site_cov$Short.Saps<-toupper(Site_cov$Short.Saps)
Site_cov$Tall.Saps<-toupper(Site_cov$Tall.Saps)
Then, I try to delete the NAs:
Site_cov_NA <- Site_cov[complete.cases(Site_cov[ , c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
But the NAs are still there:
head(Site_cov_NA)
Point Basal Short.Saps Tall.Saps
1 DEL001 NA 2 0
2 DEL002 NA 1 6
3 DEL003 NA 0 5
4 DEL004 10 21 22
5 DEL005 60 8 17
6 DEL006 80 17 13
Obviously you have 'Na' strings that are fake NAs. Replace them with real ones and your code should work.
dat <- replace(dat, dat == 'Na', NA)
dat[complete.cases(dat[, c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
# Point Basal Short.Saps Tall.Saps
# 4 DEL004 10 21 22
Data:
dat <- structure(list(Point = c("DEL001", "DEL002", "DEL003", "DEL004"
), Basal = c("Na", "Na", "Na", "10"), Short.Saps = c(2L, 1L,
0L, 21L), Tall.Saps = c(0L, 6L, 5L, 22L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Try the complete.cases() function (https://stat.ethz.ch/R-manual/R-patched/library/stats/html/complete.cases.html)
try <- data.frame(a = c(1, 3, NA, NA), b = c(3, 5, 2, 3))
try1 <- try[complete.cases(try), ]
try1
I am trying to create a stacked bar chart showing % frequency of occurrences by group
library(dplyr)
library(ggplot2)
brfss_2013 %>%
group_by(incomeLev, mentalHealth) %>%
summarise(count_mentalHealth=n()) %>%
group_by(incomeLev) %>%
mutate(count_inc=sum(count_mentalHealth)) %>%
mutate(percent=count_mentalHealth / count_inc * 100) %>%
ungroup() %>%
ggplot(aes(x=forcats::fct_explicit_na(incomeLev),
y=count_mentalHealth,
group=mentalHealth)) +
geom_bar(aes(fill=mentalHealth),
stat="identity") +
geom_text(aes(label=sprintf("%0.1f%%", percent)),
position=position_stack(vjust=0.5))
However, this is the traceback I receive:
1. dplyr::group_by(., incomeLev, mentalHealth)
8. plyr::summarise(., count_mentalHealth = n())
9. [ base::eval(...) ] with 1 more call
11. dplyr::n()
12. dplyr:::from_context("..group_size")
13. `%||%`(...)
In addition: Warning message:
Factor `incomeLev` contains implicit NA, consider using `forcats::fct_explicit_na`
Here is a sample of my data
brfss_2013 <- structure(list(incomeLev = structure(c(2L, 3L, 3L, 2L, 2L, 3L,
NA, 2L, 3L, 1L, 3L, NA), .Label = c("$25,000-$35,000", "$50,000-$75,000",
"Over $75,000"), class = "factor"), mentalHealth = structure(c(3L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("Excellent",
"Ok", "Very Bad"), class = "factor")), row.names = c(NA, -12L
), class = "data.frame")
Update:
Output of str(brfss_2013):
'data.frame': 491775 obs. of 9 variables:
$ mentalHealth: Factor w/ 5 levels "Excellent","Good",..: 5 1 1 1 1 1 3 1 1 1 ...
$ pa1min_ : int 947 110 316 35 429 120 280 30 240 260 ...
$ bmiLev : Factor w/ 6 levels "Underweight",..: 5 1 3 2 5 5 2 3 4 3 ...
$ X_drnkmo4 : int 2 0 80 16 20 0 1 2 4 0 ...
$ X_frutsum : num 413 20 46 49 7 157 150 67 100 58 ...
$ X_vegesum : num 53 148 191 136 243 143 216 360 172 114 ...
$ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
$ X_state : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
$ incomeLev : Factor w/ 4 levels "$25,000-$35,000",..: 2 4 4 2 2 4 NA 2 4 1 ...
First of all, your code works incredibly well when you transform everything into character. So you could just do
brfss_2013[c("incomeLev", "mentalHealth")] <-
lapply(brfss_2013[c("incomeLev", "mentalHealth")], as.character)
and then just run your code as you figured it out.
But, let's do it with factors (don't run the lapply(.) line in this case!).
You want a "missing" category, which you can obtain by adding a new level "missing" for the NAs.
levels(brfss_2013$incomeLev) <- c(levels(brfss_2013$incomeLev), "missing")
brfss_2013$incomeLev[is.na(brfss_2013$incomeLev)] <- "missing"
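Equivalently, as the warning message itself suggests, forcats can do this in one step (a sketch assuming the forcats package is installed):
brfss_2013$incomeLev <- forcats::fct_explicit_na(brfss_2013$incomeLev, na_level = "missing")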
Then, your aggregation (in a base R way).
b1 <- with(brfss_2013, aggregate(list(count_mentalHealth=incomeLev),
by=list(mentalHealth=mentalHealth, incomeLev=incomeLev),
length))
b2 <- aggregate(mentalHealth ~ ., brfss_2013, length)
names(b2)[2] <- "count_inc"
brfss_2013.agg <- merge(b1, b2)
rm(b1, b2) # just to clean up
Add the "percent" column.
brfss_2013.agg$percent <- with(brfss_2013.agg, count_mentalHealth / count_inc)
Plot.
library(ggplot2)
ggplot(brfss_2013.agg, aes(x=incomeLev, y=count_mentalHealth, group=mentalHealth)) +
geom_bar(aes(fill=mentalHealth), stat="identity") +
geom_text(aes(label=sprintf("%0.1f%%", percent)),
position=position_stack(vjust=0.5))
Result
So your code actually works fine for me. It looks like it might be an issue with package versions because it seems odd that you're using the plyr summarise function.
However, here's a slightly more concise way to create that graph (and hopefully this is helpful for whatever you want to add to this plot)
brfss_2013 %>%
# Add count of income levels first (note this only adds a variable)
add_count(incomeLev) %>%
rename(count_inc = n) %>%
# Count observations per group (this transforms data)
count(incomeLev, mentalHealth, count_inc) %>%
rename(count_mentalHealth = n) %>%
mutate(percent= count_mentalHealth / count_inc) %>%
ggplot(aes(x= incomeLev,
y= count_mentalHealth,
# Technically you don't need this group here but groups can be handy
group= mentalHealth)) +
geom_bar(aes(fill=mentalHealth),
stat="identity")+
# Using the scales package does the percent formatting for you
geom_text(aes(label = scales::percent(percent)), vjust = 1)+
theme_minimal()
I have 2 data frames in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 data frames are matched by a primary key ('pid'). dfnew is a subset of dfold, so all variables in dfnew are also in dfold but with updated, imputed values (no NAs anymore). At the same time dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 data frames into dfmerge so as to update the common variables from dfnew into dfold while retaining the pre-existing variables of dfold. I have tried merge(), match(), dplyr, and the sqldf package, but either I obtain a dfmerge with only the updated 75 variables (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but pretty long (10 rows) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain such a result, if available?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the data frames:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update the NA values in dfold with the imputed values from dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[dfnew$pid == dfold$pid], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[, cols] <- mapply(function(x, y) ifelse(is.na(x), y[dfnew$pid == dfold$pid], x),
                        dfold[, cols], dfnew[, cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
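As a further option (not part of the original answer, and assuming dplyr >= 1.0.0 is available): if the goal is simply to overwrite the matching rows and common columns of dfold with the imputed values from dfnew, dplyr::rows_update() keyed by pid does this in one call.
library(dplyr)
# overwrites Age and Salary in dfold for the pids present in dfnew,
# leaving all other dfold columns and rows untouched
dfmerge <- rows_update(dfold, dfnew, by = "pid")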
I have a (sample) table like this:
df <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL Values
TP53 2 3.55
XBP1 5 4.06
TP27 1 2.53
REDD1 4 3.99
ERO1L 6 5.02
STK11 9 3.64
HIF2A 8 2.96")
I want to look up the symbols from two different genelists, given here as genelist1 and genelist2:
genelist1 <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL
P4H 10
PLK 7
TP27 1
KTD 11
ERO1L 6")
genelist2 <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL
TP53 2
XBP1 5
BHLHB 12
STK11 9
TP27 1
UPK 18")
What I want is to get a new column where I can see in which genelist(s) each of the genes in my data frame can be found, but when I run the following code it is just the symbols that are repeated in the new columns.
df_geneinfo <- df %>%
join(genelist1,by="SYMBOL") %>%
join(genelist2, by="SYMBOL")
Any suggestions of how to solve this, either to make one new column with the name of the genelists, or to make one column for each of the genelists?
Thanks in advance! :)
For the sake of completeness (and performance with large tables, perhaps), here is a data.table approach:
library(data.table)
rbindlist(list(genelist1, genelist2), idcol = "glid")[, -"Gene"][
setDT(df), on = "SYMBOL"][, .(glid = toString(glid)), by = .(Gene, SYMBOL, Values)][]
Gene SYMBOL Values glid
1: TP53 2 3.55 2
2: XBP1 5 4.06 2
3: TP27 1 2.53 1, 2
4: REDD1 4 3.99 1
5: ERO1L 6 5.02 NA
6: STK11 9 3.64 2
7: HIF2A 8 2.96 NA
rbindlist() creates a data.table from all genelists and adds a column glid to identify the origin of each row. The Gene column is ignored as the subsequent join is only on SYMBOL. Before joining, df is coerced to class data.table using setDT(). The joined result is then aggregated by SYMBOL to exhibit cases where a symbol appears in both genelists which is the case for SYMBOL == 1.
Edit
In case there are many genelists or the full name of the genelist is required instead of just a number, we can try this:
rbindlist(mget(ls(pattern = "^genelist")), idcol = "glid")[, -"Gene"][
setDT(df), on = "SYMBOL"][, .(glid = toString(glid)), by = .(Gene, SYMBOL, Values)][]
Gene SYMBOL Values glid
1: TP53 2 3.55 genelist2
2: XBP1 5 4.06 genelist2
3: TP27 1 2.53 genelist1, genelist2
4: REDD1 4 3.99 NA
5: ERO1L 6 5.02 genelist1
6: STK11 9 3.64 genelist2
7: HIF2A 8 2.96 NA
ls() is looking for objects in the environment whose names start with genelist.... mget() returns a named list of those objects, which is passed to rbindlist().
Data
As provided by the OP
df <- structure(list(Gene = c("TP53", "XBP1", "TP27", "REDD1", "ERO1L",
"STK11", "HIF2A"), SYMBOL = c(2L, 5L, 1L, 4L, 6L, 9L, 8L), Values = c(3.55,
4.06, 2.53, 3.99, 5.02, 3.64, 2.96)), .Names = c("Gene", "SYMBOL",
"Values"), class = "data.frame", row.names = c(NA, -7L))
genelist1 <- structure(list(Gene = c("P4H", "PLK", "TP27", "KTD", "ERO1L"),
SYMBOL = c(10L, 7L, 1L, 11L, 4L)), .Names = c("Gene", "SYMBOL"
), class = "data.frame", row.names = c(NA, -5L))
genelist2 <- structure(list(Gene = c("TP53", "XBP1", "BHLHB", "STK11", "TP27",
"UPK"), SYMBOL = c(2L, 5L, 12L, 9L, 1L, 18L)), .Names = c("Gene",
"SYMBOL"), class = "data.frame", row.names = c(NA, -6L))
I just wrote my own function, which replaces the column values:
replace_by_lookuptable <- function(df, col, lookup) {
  # all requested columns must exist in df
  assertthat::assert_that(all(col %in% names(df)))
  # the lookup table must have "old" and "new" columns
  assertthat::assert_that(all(c("new", "old") %in% colnames(lookup)))
  # stop if any value in the target columns has no match in lookup$old
  cond_na_exists <- is.na(unlist(lapply(df[, col], function(x) match(x, lookup$old))))
  assertthat::assert_that(!any(cond_na_exists))
  # replace each old value by its counterpart in lookup$new
  df[, col] <- unlist(lapply(df[, col], function(x) lookup$new[match(x, lookup$old)]))
  return(df)
}
df is the data.frame, col is a vector of column names whose values should be replaced using lookup, a data.frame with the columns "old" and "new".
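A hypothetical usage sketch (the values below are invented for illustration; base::match does the lookup):
lookup <- data.frame(old = c("F", "M"), new = c("female", "male"),
                     stringsAsFactors = FALSE)
dd <- data.frame(sex = c("F", "M", "F"), stringsAsFactors = FALSE)
replace_by_lookuptable(dd, col = "sex", lookup = lookup)
#      sex
# 1 female
# 2   male
# 3 female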
If you add a listid column to your genelists
genelist1$listid = 1
genelist2$listid = 2
you can then merge your df with the genelists:
merge(df,rbind(genelist1,genelist2),all.x=T, by = "SYMBOL")
Note that ERO1L is SYMBOL 6 in your df and 4 in genelist1, and HIF2A and REDD1 are missing from the genelists, but REDD1 is SYMBOL 4 in your df (which is ERO1L in genelist1)... so I'm not sure what output you're expecting in that case.
You could also merge only on Gene names:
merge(df,rbind(genelist1,genelist2),all.x=T, by.x = "Gene", by.y= "Gene")
You could put all of your genelists in a list:
gen_list <- list(genelist1 = genelist1,genelist2 = genelist2)
and compare them to your target data.frame:
cbind(df,
      do.call(cbind,
              lapply(seq_along(gen_list), function(x)
                ifelse(df$Gene %in% gen_list[[x]]$Gene, names(gen_list[x]), NA))))
I have a large data frame (~4.5m rows), each row corresponds to a separate admission to hospital.
Within each admission are up to 20 diagnosis codes in columns #7 to #26. In addition, I have a field assigned as the "main diagnosis". It was my assumption that the "main diagnosis" corresponded to the first of the 20 diagnosis codes. That is incorrect - sometimes it's the 1st, others the 2nd, 3rd, etc. I'm interested in that distribution.
ID MainDiagCode Diag_1 Diag_2 Diag_3 ...
Patient1 J123 J123 R343 S753
Patient2 G456 F119 E159 G456
Patient3 T789 L292 T789 W474
I'd like to add a column to my data frame that tells me which of the 20 diagnosis codes matches to the "main" one.
ID MainDiagCode Diag_1 Diag_2 Diag_3 ... NewColumn
Patient1 J123 J123 R343 S753 1
Patient2 G456 F119 E159 G456 3
Patient3 T789 L292 T789 W474 2
I've been able to get a loop running:
df$NewColumn[i] <-
  unname(which(apply(df[i, 7:26], 2, function(x)
    any(grepl(df$MainDiagCode[i], x)))))
I'm wondering if there's a better way to do this without using a loop, as that's very slow indeed.
Thank you in advance.
df$NewColumn = apply(df, 1, function(x) match(x["MainDiagCode"], x[-c(1,2)]))
df
ID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
1 Patient1 J123 J123 R343 S753 1
2 Patient2 G456 F119 E159 G456 3
3 Patient3 T789 L292 T789 W474 2
It's safer to return the actual column name rather than relying on the match position to be equal to the diagnosis number. For example:
# Get the names of the diagnosis columns
diag.cols = names(df)[grep("^Diag", names(df))]
Extract the column name of the matched column:
apply(df, 1, function(x) {
names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])]
})
[1] "Diag_1" "Diag_3" "Diag_2"
Extract the number at the end of the matched column name:
library(stringr)
apply(df, 1, function(x) {
as.numeric(
str_extract(
names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])], "[0-9]{1,2}$")
)
})
[1] 1 3 2
With 20 diagnoses and 4.5m patients it might be more efficient to use a simple loop over columns and search for matches:
ff = function(main, diags)
{
    ans = rep_len(NA_integer_, length(main))
    for(i in seq_along(diags)) ans[main == diags[[i]]] = i
    return(ans)
}
ff(as.character(dat$MainDiagCode), lapply(dat[-(1:2)], as.character))
#[1] 1 3 2
If more than one diagnosis matches the main one, you might need adjustments to return the first and not the last (as above) diagnosis. Perhaps it might be even more efficient to reduce the number of rows checked in each iteration, depending on when a match is found.
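A hedged sketch of that adjustment (my own variation on ff(), not from the original answer): only positions that are still NA get filled, so the first matching diagnosis wins instead of the last.
ff_first = function(main, diags)
{
    ans = rep_len(NA_integer_, length(main))
    for(i in seq_along(diags)) {
        # which() drops NAs from the comparison and skips rows already matched
        hit = which(is.na(ans) & main == diags[[i]])
        ans[hit] = i
    }
    return(ans)
}
ff_first(as.character(dat$MainDiagCode), lapply(dat[-(1:2)], as.character))
#[1] 1 3 2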
dat = structure(list(PatientID = structure(1:3, .Label = c("Patient1",
"Patient2", "Patient3"), class = "factor"), MainDiagCode = structure(c(2L,
1L, 3L), .Label = c("G456", "J123", "T789"), class = "factor"),
Diag_1 = structure(c(2L, 1L, 3L), .Label = c("F119", "J123",
"L292"), class = "factor"), Diag_2 = structure(c(2L, 1L,
3L), .Label = c("E159", "R343", "T789"), class = "factor"),
Diag_3 = structure(c(2L, 1L, 3L), .Label = c("G456", "S753",
"W474"), class = "factor")), .Names = c("PatientID", "MainDiagCode",
"Diag_1", "Diag_2", "Diag_3"), row.names = c(NA, -3L), class = "data.frame")
This does a row-by-row comparison of the three columns to the 'MainDiagCode':
apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] ) )
[1] 1 3 2
So:
dat$NewColumn <- apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] ) )
As you have a lot of rows, using data.table could improve performance
library(data.table)
DT <- data.table(PatientID = paste0("Patient", 1:3),
MainDiagCode = c("J123", "G456", "T789"),
Diag_1 = c("J123", "F119", "L292"),
Diag_2 = c("R343", "E159", "T789"),
Diag_3 = c("S753", "G456", "W474")
)
DT[, NewColumn := match(MainDiagCode, .SD[, -1, with = F]), by = PatientID]
DT
#> PatientID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
#> 1: Patient1 J123 J123 R343 S753 1
#> 2: Patient2 G456 F119 E159 G456 3
#> 3: Patient3 T789 L292 T789 W474 2