Vector of logicals based on row membership - r

thank you for your patience.
I am dealing with a large dataset detailing patients and medications.
Medications are hard to code, as they are (usually) meaningless unless matched with doses.
I have a dataframe with vectors (Drug1, Drug2..... Drug 16) where individual patients are represented by rows.
The vectors are actually factors, with 100s of possible levels (all the drugs the patient could be on).
All I want to do is produce a vector of logicals (TTTTFFFFTTT......) that I could then cbind into a dataframe, which will tell me whether a patient is or is not on a particular drug.
I could then use particularly important drugs' presence or absence as categorical covariates in a model.
I've tried grep, to search along the rows, and I can generate a vector of identifiers, but I cannot seem to generate the vector of logicals.
I realise I'm doing something simply wrong.
names(drugindex)
[1] "book.MRN" "DRUG1" "DRUG2" "DRUG3" "DRUG4" "DRUG5"
[7] "DRUG6" "DRUG7" "DRUG8" "DRUG9" "DRUG10" "DRUG11"
[13] "DRUG12" "DRUG13" "DRUG14" "DRUG15" "DRUG16"
> truvec<-drugindex$book.MRN[as.vector(unlist(apply(drugindex[,2:17], 2, grep, pattern="Lamotrigine")))]
> truvec
truvec
[1] 0024633 0008291 0008469 0030599 0027667
37 Levels: 0008291 0008469 0010188 0014217 0014439 0015822 ... 0034262
> head(drugindex)
book.MRN DRUG1 DRUG2 DRUG3 DRUG4 DRUG5
4 0008291 Venlafaxine Procyclidine Flunitrazepam Amisulpiride Clozapine
31 0008469 Venlafaxine Mirtazapine Lithium Olanzapine Metoprolol
3 0010188 Flurazepam Valproate Olanzapine Mirtazapine Esomeprazole
13 0014217 Aspirin Ramipril Zuclopenthixol Lorazepam Haloperidol
15 0014439 Zopiclone Diazepam Haloperidol Paracetamol <NA>
5 0015822 Olanzapine Venlafaxine Lithium Haloperidol Alprazolam
DRUG6 DRUG7 DRUG8 DRUG9 DRUG10 DRUG11 DRUG12
4 Lamotrigine Alprazolam Lithium Alprazolam <NA> <NA> <NA>
31 Lamotrigine Ramipril Alprazolam Zolpidem Trifluoperazine <NA> <NA>
3 Paracetamol Alprazolam Citalopram <NA> <NA> <NA> <NA>
13 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
15 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
DRUG13 DRUG14 DRUG15 DRUG16
4 <NA> <NA> <NA> <NA>
31 <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA>
13 <NA> <NA> <NA> <NA>
15 <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA>
And what I want is a vector of logicals for each drug, saying whether that patient is on it
Thank you all for your time.
Ross Dunne MRCPsych
"Te occidere possunt sed te edere ne possunt, nefas est".

You were close with your apply attempt, but MARGIN=2 applies the function over columns, not rows. Also, grep returns the positions of the matches; you want grepl, which returns a logical vector. Try this (x stands for your data frame, drugindex in the question):
apply(x[,-1], 1, function(x) any(grepl("Aspirin",x)))
You could also use %in%, which you may find more intuitive:
apply(x[,-1], 1, "%in%", x="Aspirin")
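If you would rather avoid apply altogether, a vectorised variant is possible. This is only a sketch under some assumptions: the toy data frame drugs and the helper on_drug() are made up for illustration, and the comparison is an exact match on the drug name (unlike grepl, which also matches substrings).

```r
# Toy stand-in for drugindex (hypothetical values, character columns)
drugs <- data.frame(
  MRN   = c("0008291", "0010188", "0014439"),
  DRUG1 = c("Venlafaxine", "Flurazepam", "Zopiclone"),
  DRUG2 = c("Lamotrigine", "Valproate", NA),
  stringsAsFactors = FALSE
)

# One logical per row: TRUE if any drug column equals the target exactly.
# The first column (the patient id) is dropped; NA cells count as "no match".
on_drug <- function(df, drug) {
  unname(rowSums(df[, -1, drop = FALSE] == drug, na.rm = TRUE) > 0)
}

lamotrigine <- on_drug(drugs, "Lamotrigine")
drugs <- cbind(drugs, Lamotrigine = lamotrigine)
```

The resulting column can then be used as a categorical covariate, one column per drug of interest.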

First, a comment on data structure. You have data in what some call a "wide" format, with a single row per patient and multiple columns for the drugs. It is usually the case that the "long" format, with repeated rows per patient and a single column for drugs, is more amenable to data manipulation. To reshape your data from wide to long and vice versa, take a look at the reshape package. In this case, you would have something like:
library(reshape)
dnow <- melt(drugindex, id.var='book.MRN')
subset(dnow, value=='Lamotrigine')
Much cleaner, and obvious, code, if I may say so ...
Edit: If you need the old structure back you can use cast:
cast(subset(dnow, value=='Lamotrigine'), book.MRN ~ value)
as suggested by @jonw in the comments.
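Once the data are in long form, a patient-by-drug indicator for every drug at once falls out of table(). The following is a rough sketch in base R only (so it does not rely on the reshape package); the data frame wide and its values are invented for illustration.

```r
# Toy wide data (hypothetical values)
wide <- data.frame(
  MRN   = c("0008291", "0010188", "0014439"),
  DRUG1 = c("Venlafaxine", "Flurazepam", "Zopiclone"),
  DRUG2 = c("Lamotrigine", "Valproate", NA),
  stringsAsFactors = FALSE
)

# Stack the drug columns into one long column, repeating the patient id
# once per drug column (unlist() goes column by column)
long <- data.frame(
  MRN  = rep(wide$MRN, times = ncol(wide) - 1),
  drug = unlist(wide[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)
long <- long[!is.na(long$drug), ]

# Patients x drugs: TRUE where the patient is on the drug
on <- table(long$MRN, long$drug) > 0
on[, "Lamotrigine"]
```

Each column of on is then the vector of logicals for one drug, in the row order of the table rather than of the original data frame.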

Related

converting as factor for a list of data frames

I am trying to create a custom function to give labels to modified list of data frames. For example, I have a data frame like below.
df<-data.frame(
gender = c(1,2,1,2,1,2,1,2,2,2,2,1,1,2,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,2,1,2,1,2,1,2,2,2),
country = c(3,3,1,2,5,4,4,4,4,3,3,4,3,4,2,1,4,2,3,4,4,4,3,1,2,1,5,5,4,3,1,4,5,2,3,4,5,1,4),
Q1=c(1,1,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA,1),
Q2=c(1,1,1,1,1,NA,NA,NA,NA,1,1,1,1,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,NA,NA,1,1,1,1,1,1,1,NA,NA,NA),
Q3=c(1,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,1,1,1,NA,NA,NA,1,NA,NA,1,1,1,1,1,NA,NA,1),
Q4=c(1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
Q5=c(1,2,1,1,1,2,1,2,2,1,2,NA,1,1,2,2,2,1,1,1,2,NA,2,1,1,1,2,2,2,NA,1,2,2,1,1,1,2,2,2)
)
I understand your goal to be the following: You want to take a list of data frames (ldat). For each of the dataframes in the list (df, df2) you want to take some existing columns (Q1, Q2, Q3) and replicate them with new names in the same data frame (Q1_new, Q2_new, Q3_new). This you could achieve like this:
variables = c("Q1","Q2","Q3")
new_label =c("Q1_new","Q2_new","Q3_new")
newdfs <- lapply(ldat, FUN = function(x) {
x[,new_label] = x[,variables]
return(x)})
head(newdfs$ALL)
gender country Q1 Q2 Q3 Q4 Q5 cc2 Q1_new Q2_new Q3_new
1 Male USA Yes Available Partner Depends on sales Local 1 Yes Available Partner
2 female USA Yes Available Partner <NA> Overseas NA Yes Available Partner
3 Male CAN <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
4 female EU <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
5 Male UK <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
6 female BR <NA> <NA> <NA> <NA> Overseas NA <NA> <NA> <NA>
Is this what you had in mind?
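For a version you can paste and run as is, here is a small sketch of the same idea. The list ldat below is a made-up stand-in (two tiny data frames named ALL and SUB), since the original list is not shown in full.

```r
# Hypothetical list of data frames standing in for ldat
ldat <- list(
  ALL = data.frame(Q1 = c(1, NA), Q2 = c(1, 1), Q3 = c(NA, 1)),
  SUB = data.frame(Q1 = c(NA, 1), Q2 = c(NA, 1), Q3 = c(1, 1))
)

variables <- c("Q1", "Q2", "Q3")
new_label <- paste0(variables, "_new")

# Copy the selected columns under new names in every data frame of the list
newdfs <- lapply(ldat, function(x) {
  x[new_label] <- x[variables]
  x
})

names(newdfs$ALL)
```

The original list is left untouched; lapply returns a new list with the same element names.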

How to get rid of weird NA rows in each cell of a dataframe

I have a dataset stored as a dataframe named 'data', which consists of 500 observations of 2 variables.
In fact,
dim(data)
returns
[1] 500 2
and
str(data)
returns
'data.frame': 500 obs. of 2 variables:
$ Diagnosis : chr "D1" "D2" "D3" "D4" ...
$ Type : Factor w/ 8 levels "T1","T2",..: 6 4 1 6 1 4 4 4 5 5 ...
But when I try to retrieve the value of 'Type' for a specific 'Diagnosis', say 'D4', 11 weird NA values appear in addition to the 'Type' value. It seems as if each cell of this data frame holds a vector of 12 values, 11 of which are NAs that have come out of thin air.
In turn,
data[data$Diagnosis=='D4','Type']
returns:
[1] <NA> <NA> <NA> <NA> <NA> <NA>
[7] <NA> <NA> <NA> <NA> <NA> T6
Interestingly,
data[data$Diagnosis=='D4',]
returns:
Diagnosis Type
NA <NA> <NA>
NA.1 <NA> <NA>
NA.2 <NA> <NA>
NA.3 <NA> <NA>
NA.4 <NA> <NA>
NA.5 <NA> <NA>
NA.6 <NA> <NA>
NA.7 <NA> <NA>
NA.8 <NA> <NA>
NA.9 <NA> <NA>
NA.10 <NA> <NA>
503 D4 T6
The dataframe was created in Excel and then imported into RStudio; I have made a lot of alterations to it since.
I have two questions:
1. Where did these NAs come from, and how can I delete them?
In fact, I want data[data$Diagnosis=='D4','Type']
to return:
[1] T6
and:
data[data$Diagnosis=='D4',]
to return:
Diagnosis Type
[row number] D4 T6
I cannot use na.omit(data) or complete.cases() on the whole dataframe, as I have some legitimate NAs that I don't want to remove.
2. How can I store more than one value in a cell of a data frame? Suppose person #1 has 2 concomitant diagnoses. How can I store both 'D1' and 'D2' in the 'Diagnosis' of person #1?
I think this explanation will be helpful.
As str() shows, the Type column is not a character vector; it is a factor.
Behind the scenes R treats a factor as a categorical field and stores it as integer codes that point into its levels, which is why str() prints integers for it. If you want to work with the values as plain strings, convert the Type column to character first and then do the operation:
data$Type <- as.character(data$Type)
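It is also worth checking whether Diagnosis itself contains NAs. The output in the question is what R produces when a logical index contains NA: each NA in the index yields an all-NA row. Whether that is really the case in the original data is an assumption; the sketch below uses made-up data to show the effect and two ways around it.

```r
# Hypothetical data with missing values in the column used for filtering
data <- data.frame(
  Diagnosis = c("D1", NA, "D4", NA),
  Type      = factor(c("T1", "T2", "T6", "T3")),
  stringsAsFactors = FALSE
)

# NA == "D4" is NA, and indexing with NA returns NA entries
with_na <- data[data$Diagnosis == "D4", "Type"]

# which() keeps only the positions that are TRUE, dropping the NAs
clean <- data[which(data$Diagnosis == "D4"), "Type"]

# %in% never returns NA, so it works for whole rows too
rows <- data[data$Diagnosis %in% "D4", ]
```

Neither fix removes the legitimate NAs from the data frame; they only stop the NAs in the comparison from leaking into the result.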

Match and list column terms in R from .xlsx

Hi, I am quite new to R and I need to match terms from .xlsx columns to get a list of matched data between three .xlsx files. The data in the files looks like this:
From one.xlsx:
OneID NameOne
ACR019 Acropectoral Syndrome
ACR020 Acropectorovertebral
GNT015 Genital Dwarfism
ACR023 Acral Dysostosis Dyserythropoiesis Syndrome
From two.xlsx:
TwoID TwoName
607907 DERMATOFIBROSARCOMA PROTUBERANS
304730 DERMOIDS OF CORNEA
605967 ACROPECTORAL SYNDROME
102510 ACROPECTOROVERTEBRAL
From three.xlsx:
ThreeID ThreeName
OM85203 Acropectoral syndrome
OM67092 Dermoids cornea
OM76580 Acardia
OM45632 Hypertryptophanemia
And the final result file in .xlsx must look like this:
OneID NameOne TwoID TwoName ThreeID ThreeName
ACR019 Acropectoral Syndrome 605967 ACROPECTORAL SYNDROME OM85203 Acropectoral syndrome
ACR020 Acropectorovertebral 102510 ACROPECTOROVERTEBRAL -
- 304730 DERMOIDS OF CORNEA OM67092 Dermoids cornea
Thank you very much, any suggestion or help to code this will be welcome.
What about this: since the only fields the datasets have in common are the names, we have to use them as the key to connect the various .xlsx files, after some small transformations, using the merge() function. (In general, IMHO, it's not a great idea to use descriptions as keys, but there is no alternative in this case.)
After importing the three MS Excel files, you can do:
# first your data (fake)
one <- data.frame(OneID=c('ACR019','ACR020','GNT015','ACR023'),
NameOne = c('Acropectoral Syndrome','Acropectorovertebral','Genital Dwarfism','Acral Dysostosis Dyserythropoiesis Syndrome'))
two <- data.frame(OneID=c('A607907','304730','605967','102510'),
NameTwo = c('DERMATOFIBROSARCOMA PROTUBERANS','DERMOIDS OF CORNEA','ACROPECTORAL SYNDROME','ACROPECTOROVERTEBRAL'))
three <-data.frame(OneID=c('OM85203','OM67092','OM76580','OM45632'),
NameThree = c('Acropectoral syndrome','Dermoids cornea','Acardia','Hypertryptophanemia'))
# then, to get unique keys, convert all the names to upper case to create IDs:
one$ID <- toupper(one$NameOne)
two$ID <- toupper(two$NameTwo)
three$ID <- toupper(three$NameThree)
# after that, you can merge the data frames:
merged <- merge(merge(one,two, by ='ID', all = TRUE),three, by ='ID', all = TRUE)
# lastly, give the columns the names you want
colnames(merged) <- c('ID', 'OneID','NameOne','TwoID','NameTwo','ThreeID','NameThree')
# here are the results
merged
> merged
ID OneID NameOne TwoID NameTwo
1 ACARDIA <NA> <NA> <NA> <NA>
2 ACRAL DYSOSTOSIS DYSERYTHROPOIESIS SYNDROME ACR023 Acral Dysostosis Dyserythropoiesis Syndrome <NA> <NA>
3 ACROPECTORAL SYNDROME ACR019 Acropectoral Syndrome 605967 ACROPECTORAL SYNDROME
4 ACROPECTOROVERTEBRAL ACR020 Acropectorovertebral 102510 ACROPECTOROVERTEBRAL
5 DERMATOFIBROSARCOMA PROTUBERANS <NA> <NA> A607907 DERMATOFIBROSARCOMA PROTUBERANS
6 DERMOIDS CORNEA <NA> <NA> <NA> <NA>
7 DERMOIDS OF CORNEA <NA> <NA> 304730 DERMOIDS OF CORNEA
8 GENITAL DWARFISM GNT015 Genital Dwarfism <NA> <NA>
9 HYPERTRYPTOPHANEMIA <NA> <NA> <NA> <NA>
ThreeID NameThree
1 OM76580 Acardia
2 <NA> <NA>
3 OM85203 Acropectoral syndrome
4 <NA> <NA>
5 <NA> <NA>
6 OM67092 Dermoids cornea
7 <NA> <NA>
8 <NA> <NA>
9 OM45632 Hypertryptophanemia
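If more files are added later, the nested merge() calls can be folded into a single Reduce() over a list. This is a sketch only: the data are a cut-down version of the fake data above, the column names are made distinct up front so no renaming is needed, and the helper make_key() is hypothetical. Note that it still only matches names that are identical after upper-casing and trimming, so "DERMOIDS OF CORNEA" and "Dermoids cornea" would remain separate rows.

```r
# Cut-down fake data with distinct column names per file
one <- data.frame(OneID = c("ACR019", "ACR020"),
                  NameOne = c("Acropectoral Syndrome", "Acropectorovertebral"),
                  stringsAsFactors = FALSE)
two <- data.frame(TwoID = c("605967", "304730"),
                  TwoName = c("ACROPECTORAL SYNDROME", "DERMOIDS OF CORNEA"),
                  stringsAsFactors = FALSE)
three <- data.frame(ThreeID = "OM85203",
                    ThreeName = "Acropectoral syndrome",
                    stringsAsFactors = FALSE)

# Hypothetical helper: upper-case and strip surrounding whitespace
make_key <- function(x) toupper(trimws(x))

one$ID   <- make_key(one$NameOne)
two$ID   <- make_key(two$TwoName)
three$ID <- make_key(three$ThreeName)

# Full outer join of all data frames on the shared key
merged <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE),
                 list(one, two, three))
```

Writing the result back to .xlsx needs an add-on package, which is outside the scope of this sketch.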

R, create new column that consists of 1st column or if condition is met, a value from the 2nd/3rd column

a b c d
1 boiler maker <NA> <NA>
2 clerk assistant <NA> <NA>
3 senior machine setter <NA>
4 operated <NA> <NA> <NA>
5 consultant legal <NA> <NA>
How do I create a new column that takes the value in column 'a' unless any of the other columns contain either legal or assistant in which case it takes that value?
Here is a base-R solution. We use apply and any to test every column at once.
df$col <- as.character(df$a)
df$col[apply(df == "Legal",1,any)] <- "Legal"
df$col[apply(df == "assistant",1,any)] <- "assistant"
Try this:
library("dplyr")
df %>%
  mutate(new=ifelse(b %in% "Legal" | c %in% "Legal" | d %in% "Legal", "Legal",
             ifelse(b %in% "assistant" | c %in% "assistant" | d %in% "assistant", "assistant",
                    as.character(a))))
as.character is needed if the values are factors. If not, it's unnecessary. %in% is used instead of == so that the NA cells do not turn the whole condition into NA.
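The reason NA handling matters in a condition like this can be seen on two short vectors. A minimal illustration, with made-up vectors col_b and col_c standing in for two of the columns:

```r
col_b <- c("maker", "assistant", NA, "Legal")
col_c <- c(NA_character_, NA_character_, NA_character_, NA_character_)

# With ==, a comparison against NA is NA, and FALSE | NA is NA as well
eq_test <- col_b == "Legal" | col_c == "Legal"

# %in% always returns TRUE or FALSE, so the combined test has no NAs
in_test <- col_b %in% "Legal" | col_c %in% "Legal"
```

Since ifelse() returns NA wherever its test is NA, the == version would leave NA in the new column for every row that has an empty cell and no match.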
A base R alternative to @scoa's answer (na.rm = TRUE is needed so that rows containing NA count as "no match" instead of giving NA):
indx <- apply(mydf == "Legal",1,any,na.rm=TRUE) + apply(mydf == "assistant",1,any,na.rm=TRUE)*2 + 1L
mydf$col <- c("a","Legal","assistant")[indx]
or in one go:
mydf$col <- c("a","Legal","assistant")[apply(mydf == "Legal",1,any,na.rm=TRUE) + apply(mydf == "assistant",1,any,na.rm=TRUE)*2 + 1L]
which gives (rows matching neither term get the placeholder label "a", not the value of column a):
> mydf
a b c d col
1 boiler maker <NA> <NA> a
2 clerk assistant <NA> <NA> assistant
3 senior machine setter <NA> a
4 operated <NA> <NA> <NA> a
5 consultant Legal <NA> <NA> Legal

Searching for greater/less than values with NAs

I have a dataframe for which I've calculated and added a difftime column:
name amount 1st_date 2nd_date days_out
JEAN 318.5 1971-02-16 1972-11-27 650 days
GREGORY 1518.5 <NA> <NA> NA days
JOHN 318.5 <NA> <NA> NA days
EDWARD 318.5 <NA> <NA> NA days
WALTER 518.5 1971-07-06 1975-03-14 1347 days
BARRY 1518.5 1971-11-09 1972-02-09 92 days
LARRY 518.5 1971-09-08 1972-02-09 154 days
HARRY 318.5 1971-09-16 1972-02-09 146 days
GARRY 1018.5 1971-10-26 1972-02-09 106 days
I want to break it out and take subtotals where days_out is 0-60, 61-90, 91-120, 121-180.
For some reason I can't even reliably write bracket notation. I would expect
members[members$days_out<=120, ] to show just Barry and Garry, but I get a whole lot of lines like:
NA.1095 <NA> NA <NA> <NA> NA days
NA.1096 <NA> NA <NA> <NA> NA days
NA.1097 <NA> NA <NA> <NA> NA days
Those don't exist in the original data. There's no one without a name. What am I doing wrong here?
This is standard behavior for < and other relational operators: when asked to evaluate whether NA is less than (or greater than, or equal to, or ...) some other number, they return NA, rather than TRUE or FALSE.
Here's an example that should make clear what is going on and point to a simple fix.
x <- c(1, 2, NA, 4, 5)
x[x < 3]
# [1] 1 2 NA
x[x < 3 & !is.na(x)]
# [1] 1 2
To see why all of those rows indexed by NA's have row.names like NA.1095, NA.1096, and so on, try this:
data.frame(a=1:2, b=1:2)[rep(NA, 5),]
# a b
# NA NA NA
# NA.1 NA NA
# NA.2 NA NA
# NA.3 NA NA
# NA.4 NA NA
If you are working at the console, the subset function does not have that annoying 'feature', which is actually due to the behavior of [ more than to the relational operators.
subset(members, days_out <= 120)
If you are programming, then you can use which, or Josh's conjunction with & !is.na(.), which is what which does behind the scenes:
members[ which(members$days_out <= 120), ]
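For the subtotals by range that the question asks about, cut() plus aggregate() is one option. This is a sketch under assumptions: members here is a made-up subset of the rows shown in the question, and the bucket edges 0, 60, 90, 120, 180 are read off the ranges given there. Rows with NA or with more than 180 days fall outside every bucket and are left out of the subtotals.

```r
# Hypothetical subset of the data from the question
members <- data.frame(
  name   = c("JEAN", "GREGORY", "BARRY", "LARRY", "HARRY", "GARRY"),
  amount = c(318.5, 1518.5, 1518.5, 518.5, 318.5, 1018.5),
  stringsAsFactors = FALSE
)
members$days_out <- as.difftime(c(650, NA, 92, 154, 146, 106), units = "days")

# Bin the day counts; values outside 0-180 (and NAs) get an NA bucket
members$bucket <- cut(as.numeric(members$days_out, units = "days"),
                      breaks = c(0, 60, 90, 120, 180),
                      include.lowest = TRUE)

# Sum the amounts per bucket; rows with an NA bucket are dropped
subtotals <- aggregate(amount ~ bucket, data = members, FUN = sum)

# The NA-safe row filter from the answer above
under_120 <- members[which(members$days_out <= 120), ]
```

Buckets with no members do not appear in subtotals, so only the ranges that actually occur are listed.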

Resources