Searching for greater/less than values with NAs - r

I have a dataframe for which I've calculated and added a difftime column:
name     amount  1st_date    2nd_date    days_out
JEAN      318.5  1971-02-16  1972-11-27   650 days
GREGORY  1518.5  <NA>        <NA>          NA days
JOHN      318.5  <NA>        <NA>          NA days
EDWARD    318.5  <NA>        <NA>          NA days
WALTER    518.5  1971-07-06  1975-03-14  1347 days
BARRY    1518.5  1971-11-09  1972-02-09    92 days
LARRY     518.5  1971-09-08  1972-02-09   154 days
HARRY     318.5  1971-09-16  1972-02-09   146 days
GARRY    1018.5  1971-10-26  1972-02-09   106 days
I want to break it out and take subtotals where days_out is 0-60, 61-90, 91-120, 121-180.
For some reason I can't even reliably write bracket notation. I would expect
members[members$days_out<=120, ] to show just Barry and Garry, but I get a whole lot of lines like:
NA.1095 <NA> NA <NA> <NA> NA days
NA.1096 <NA> NA <NA> <NA> NA days
NA.1097 <NA> NA <NA> <NA> NA days
Those don't exist in the original data. There's no one without a name. What am I doing wrong here?

This is standard behavior for < and other relational operators: when asked to evaluate whether NA is less than (or greater than, or equal to, or ...) some other number, they return NA, rather than TRUE or FALSE.
Here's an example that should make clear what is going on and point to a simple fix.
x <- c(1, 2, NA, 4, 5)
x[x < 3]
# [1] 1 2 NA
x[x < 3 & !is.na(x)]
# [1] 1 2
To see why all of those rows indexed by NA's have row.names like NA.1095, NA.1096, and so on, try this:
data.frame(a=1:2, b=1:2)[rep(NA, 5),]
# a b
# NA NA NA
# NA.1 NA NA
# NA.2 NA NA
# NA.3 NA NA
# NA.4 NA NA

If you are working at the console, the subset function does not have that annoying 'feature', which is actually due to the behavior of [ more than to the relational operators. subset simply drops the rows where the condition evaluates to NA:
subset(members, days_out <= 120)
If you are programming, then you can use which, or Josh's conjunction with & !is.na(.), which is what which does behind the scenes:
members[ which(members$days_out <= 120), ]
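The question also asks for subtotals per range of days_out. A minimal sketch of one way to do that, assuming a cut-down copy of the members data and bucket labels made up for illustration (the bucket column and the subtotals name are not from the original post):

```r
# Cut-down copy of the question's data; days_out is a difftime with some NAs
members <- data.frame(
  name     = c("JEAN", "GREGORY", "WALTER", "BARRY", "LARRY", "HARRY", "GARRY"),
  amount   = c(318.5, 1518.5, 518.5, 1518.5, 518.5, 318.5, 1018.5),
  days_out = as.difftime(c(650, NA, 1347, 92, 154, 146, 106), units = "days")
)

# which() keeps only the positions where the test is TRUE, so NA rows vanish
under_120 <- members[which(members$days_out <= 120), ]

# Assign each row to a range; values outside 0-180 and NAs get an NA bucket
members$bucket <- cut(
  as.numeric(members$days_out, units = "days"),
  breaks = c(0, 60, 90, 120, 180),
  labels = c("0-60", "61-90", "91-120", "121-180"),
  include.lowest = TRUE
)

# Sum amount per bucket; rows with an NA bucket are dropped by the formula method
subtotals <- aggregate(amount ~ bucket, data = members, FUN = sum)
subtotals
```

Ranges with no rows simply do not appear in subtotals; rows beyond 180 days would need an extra break if they should be counted.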

Related

How to get rid of weird NA rows in each cell of a dataframe

I have a database as a dataframe named 'data' which consists of 500 observations and 2 variables.
In fact,
dim(data)
returns
[1] 500 2
and
str(data)
returns
'data.frame': 500 obs. of 2 variables:
$ Diagnosis : chr "D1" "D2" "D3" "D4" ...
$ Type : Factor w/ 8 levels "T1","T2",..: 6 4 1 6 1 4 4 4 5 5 ...
But when I try to retrieve the value of 'Type' for a specific 'Diagnosis', say 'D4', 11 weird NA values appear in addition to the 'Type' value. It seems as if each cell of this data frame holds a vector of 12 values, 11 of which are NAs that have come out of thin air.
In turn,
data[data$Diagnosis=='D4','Type']
returns:
[1] <NA> <NA> <NA> <NA> <NA> <NA>
[7] <NA> <NA> <NA> <NA> <NA> T6
Interestingly:
data[data$Diagnosis=='D4',]
returns:
Diagnosis Type
NA <NA> <NA>
NA.1 <NA> <NA>
NA.2 <NA> <NA>
NA.3 <NA> <NA>
NA.4 <NA> <NA>
NA.5 <NA> <NA>
NA.6 <NA> <NA>
NA.7 <NA> <NA>
NA.8 <NA> <NA>
NA.9 <NA> <NA>
NA.10 <NA> <NA>
503 D4 T6
The dataframe was created in Excel and then imported into RStudio. I have made a lot of alterations to it since.
I have two questions:
Where did these NAs come from and how can I delete them?
In fact, I want data[data$Diagnosis=='D4','Type']
to return:
[1] T6
and:
data[data$Diagnosis=='D4',]
to return:
Diagnosis Type
[row number] D4 T6
I cannot use na.omit(data) or complete.cases() on the whole dataframe, as I have some legitimate NAs that I don't want to remove.
How can I store more than one value in a cell of a data frame? Let's assume that person #1 has 2 concomitant diagnoses. How can I store both 'D1' and 'D2' in the 'Diagnosis' of person #1?
I think this explanation will be helpful.
As you can see from str(), the Type column is a factor, not a character vector.
Behind the scenes R treats a factor as a categorical field: it stores integer codes plus a set of level labels, which is why str() shows the levels as integers. If you want to work with the values as plain strings, convert the Type column to character first and then do the operation:
data$Type <- as.character(data$Type)
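The NA.1, NA.2, ... rows in the question look like what logical indexing produces when the Diagnosis column itself contains NA. That is an assumption, since the full data is not shown. A small sketch with a made-up four-row stand-in for data reproduces the pattern and shows that which() removes it without touching NAs elsewhere:

```r
# Hypothetical stand-in for the question's data: two Diagnosis values are NA
data <- data.frame(
  Diagnosis = c("D1", NA, "D4", NA),
  Type = factor(c("T1", "T2", "T6", "T4")),
  stringsAsFactors = FALSE
)

# NA == "D4" is NA, and an NA index returns an NA element
with_na <- data[data$Diagnosis == "D4", "Type"]

# which() keeps only the positions where the comparison is TRUE
clean <- data[which(data$Diagnosis == "D4"), "Type"]

# How many rows have a missing Diagnosis in the first place
n_missing <- sum(is.na(data$Diagnosis))
```

Here with_na has three elements, two of them NA, while clean holds only T6. Checking sum(is.na(data$Diagnosis)) on the real data would confirm or rule out this explanation.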

In result output, Additional rows for NA is showing

I'm learning R and I'm facing an issue when running my code. I wrote the code to get the data for NY (New York), but some additional rows consisting entirely of NAs show up. Please help.
Dummy Data:
ID  Name       Industry  Inception  Employees  State  City        Revenue     Expenses
1   Over-Hex   Software  2006       25         TN     Franklin    9,684,527   1,130,700
2   Unimattax  IT        2009       36         NY     New York    14,016,543  804,035
3   Greenfax   Retail    2012       NA         SC     Greenville  9,746,272   1,044,375
4   Blacklane  IT        2011       66         NY     New York    15,359,369  4,631,808
Result output:
2   Unimattax  IT        2009       36         NY     New York    14,016,543  804,035
NA  <NA>       <NA>      NA         NA         <NA>   <NA>        NA          NA
4   Blacklane  IT        2011       66         NY     New York    15,359,369  4,631,808
NA  <NA>       <NA>      NA         NA         <NA>   <NA>        NA          NA
fin[fin$State == "NY",] # fin is the table name
It could be an issue with NA elements in 'State', which give NA rather than TRUE or FALSE when compared with ==. To avoid that, combine the comparison with !is.na() using & so that those NA elements become FALSE.
fin[fin$State == "NY" & !is.na(fin$State),]
Another option is %in%, which returns FALSE for NA:
fin[fin$State %in% "NY",]
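To see the difference between the two operators directly, here is a short sketch on a made-up version of fin in which two rows have a missing State (those NA values are an assumption; the dummy data in the question does not show them):

```r
# Hypothetical fin with two missing State values
fin <- data.frame(
  ID    = 1:5,
  Name  = c("Over-Hex", "Unimattax", "Greenfax", "Blacklane", "Mystery"),
  State = c("TN", "NY", NA, "NY", NA),
  stringsAsFactors = FALSE
)

eq_test <- fin$State == "NY"    # FALSE TRUE NA TRUE NA
in_test <- fin$State %in% "NY"  # FALSE TRUE FALSE TRUE FALSE

rows_eq <- fin[eq_test, ]                       # 4 rows, 2 of them all NA
rows_in <- fin[in_test, ]                       # 2 rows
rows_and <- fin[eq_test & !is.na(fin$State), ]  # same 2 rows
```

Both fixes return the same rows; %in% is shorter, while the explicit !is.na() makes the intent visible to the next reader.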

Split text string into column based on variable

I have a dataframe with a text column that I would like to split into multiple columns, since the text string contains multiple variables, such as location, education, distance etc.
Dataframe:
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
df = data.frame(text.string)
df
text.string
1 &location=NY&distance=30&education=University
2 &location=CA&distance=30&education=Highschool&education=University
3 &location=MN&distance=10&industry=Healthcare
4 &location=VT&distance=30&education=University&industry=IT&industry=Business
I can split this using cSplit: cSplit(df, 'text.string', sep = "&"):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1: NA location=NY distance=30 education=University NA NA
2: NA location=CA distance=30 education=Highschool education=University NA
3: NA location=MN distance=10 industry=Healthcare NA NA
4: NA location=VT distance=30 education=University industry=IT industry=Business
The problem is that a text string may contain the same variable more than once, and some strings are missing a variable altogether. With cSplit the variables end up mixed across columns. I would like to avoid this and keep each variable grouped in its own columns.
So it would look similar to this (education and industry no longer share a column):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1 NA location=NY distance=30 education=University <NA> NA
2 NA location=CA distance=30 education=Highschool education=University <NA> NA
3 NA location=MN distance=10 <NA> industry=Healthcare NA
4 NA location=VT distance=30 education=University industry=IT industry=Business NA
Taking into account @NicE's comment:
This is one way, following your example:
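library(data.table)
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
# Split every string on "&" and "=" so that keys and values alternate
clean <- strsplit(text.string, "&|=")
# For each string: drop the empty pieces, put keys in row 1 and values in row 2,
# use the keys as column names, then keep only the row of values
out <- lapply(clean, function(x) {
  ma <- data.table(matrix(x[x != ""], nrow = 2, byrow = FALSE))
  setnames(ma, as.character(ma[1, ]))
  ma[-1, ]
})
# Stack the one-row tables, filling columns that a string lacks with NA
out <- rbindlist(out, fill = TRUE)
out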
   location distance  education  education   industry industry
1:       NY       30 University         NA         NA       NA
2:       CA       30 Highschool University         NA       NA
3:       MN       10         NA         NA Healthcare       NA
4:       VT       30 University         NA         IT Business
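If repeated variables can live in a single cell instead of repeated columns, a base R alternative is possible. This is only a sketch of a different technique from the data.table answer above: the helper parse_one and the ";" separator are made up for illustration.

```r
text.string <- c("&location=NY&distance=30&education=University",
                 "&location=CA&distance=30&education=Highschool&education=University",
                 "&location=MN&distance=10&industry=Healthcare",
                 "&location=VT&distance=30&education=University&industry=IT&industry=Business")

# Hypothetical helper: turn one string into a named vector, one entry per key,
# joining repeated keys with ";"
parse_one <- function(s) {
  parts <- strsplit(sub("^&", "", s), "&", fixed = TRUE)[[1]]
  kv    <- strsplit(parts, "=", fixed = TRUE)
  keys  <- vapply(kv, `[`, character(1), 1)
  vals  <- vapply(kv, `[`, character(1), 2)
  vapply(split(vals, keys), paste, character(1), collapse = ";")
}

parsed <- lapply(text.string, parse_one)

# Union of all keys, in order of first appearance
keys <- unique(unlist(lapply(parsed, names)))

# Look every key up in every parsed string; absent keys become NA
wide <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[keys]))),
  stringsAsFactors = FALSE
)
names(wide) <- keys
wide
```

Each variable then has exactly one column, at the cost of having to split cells such as "IT;Business" again later.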

data.frame matching

I have a simple R question. I have two data frames. The first contains all of my possible years. I assign NA to the second column. The second data frame has only a subset of the possible years, but an actual value for the second column. I want to combine the two data frames. More specifically, I want to match them by year and if the second has the correct year, to replace the NA in the first with the value of the second.
Here is example code.
one <- as.data.frame(matrix(1880:1890, ncol=2, nrow=11))
one[,2] <- NA
two <- data.frame(ncol=2, nrow=3)
two[1,] <- c(1880, "a")
two[2,] <- c(1887, "b")
two[3,] <- c(1889, "c")
I want to get the first row, second column of one to have value "a," the eighth row, second column to be "b," and the tenth row, second column to be "c."
Feel free to make the above code more elegant.
One thing I tried as a preliminary step, but it gave a little weird result was:
one[,1]==two[,1] -> test
But test only contains values 1880 and 1887...
one[match(two[,1],one[,1]),2]<-two[,2]
That should give you what you are looking for:
> one
V1 V2
1 1880 a
2 1881 <NA>
3 1882 <NA>
4 1883 <NA>
5 1884 <NA>
6 1885 <NA>
7 1886 <NA>
8 1887 b
9 1888 <NA>
10 1889 c
11 1890 <NA>
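Since the question invites a tidier setup, here is a self-contained sketch of the same match() idea with explicitly named columns. The names year and value are chosen for illustration and do not appear in the original code.

```r
# All possible years, with an empty character column to fill in
one <- data.frame(year = 1880:1890, value = NA_character_,
                  stringsAsFactors = FALSE)

# The subset of years that actually have a value
two <- data.frame(year = c(1880, 1887, 1889), value = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# match() gives, for each year in two, its row position in one
idx <- match(two$year, one$year)

# Overwrite only those rows; every other row keeps its NA
one$value[idx] <- two$value
one
```

If two could contain years that are missing from one, idx would contain NA and the assignment would fail, so filtering with !is.na(idx) first would be needed in that case.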
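I like to use merge for these types of problems. It's pretty straightforward in my opinion. Check out the help page ?merge. Note that by.y is 'ncol' because that is the name the first column of two ends up with in the question's code:
three <- merge(one, two, by.x = 'V1', by.y = 'ncol', all = TRUE)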
Here's one approach (merge is another):
library(qdap)
one[, 2] <- lookup(one[, 1], two)
one
## V1 V2
## 1 1880 a
## 2 1881 <NA>
## 3 1882 <NA>
## 4 1883 <NA>
## 5 1884 <NA>
## 6 1885 <NA>
## 7 1886 <NA>
## 8 1887 b
## 9 1888 <NA>
## 10 1889 c
## 11 1890 <NA>

Vector of logicals based on row membership

thank you for your patience.
I am dealing with a large dataset detailing patients and medications.
Medications are hard to code, as they are (usually) meaningless unless matched with doses.
I have a dataframe with vectors (Drug1, Drug2..... Drug 16) where individual patients are represented by rows.
The vectors are actually factors, with 100s of possible levels (all the drugs the patient could be on).
All I want to do is produce a vector of logicals (TTTTFFFFTTT......) that I could then cbind into a dataframe, which will tell me whether a patient is or is not on a particular drug.
I could then use particularly important drugs' presence or absence as categorical covariates in a model.
I've tried grep, to search along the rows, and I can generate a vector of identifiers, but I cannot seem to generate the vector of logicals.
I realise I'm doing something simply wrong.
names(drugindex)
[1] "book.MRN" "DRUG1" "DRUG2" "DRUG3" "DRUG4" "DRUG5"
[7] "DRUG6" "DRUG7" "DRUG8" "DRUG9" "DRUG10" "DRUG11"
[13] "DRUG12" "DRUG13" "DRUG14" "DRUG15" "DRUG16"
> truvec<-drugindex$book.MRN[as.vector(unlist(apply(drugindex[,2:17], 2, grep, pattern="Lamotrigine")))]
> truvec
[1] 0024633 0008291 0008469 0030599 0027667
37 Levels: 0008291 0008469 0010188 0014217 0014439 0015822 ... 0034262
> head(drugindex)
book.MRN DRUG1 DRUG2 DRUG3 DRUG4 DRUG5
4 0008291 Venlafaxine Procyclidine Flunitrazepam Amisulpiride Clozapine
31 0008469 Venlafaxine Mirtazapine Lithium Olanzapine Metoprolol
3 0010188 Flurazepam Valproate Olanzapine Mirtazapine Esomeprazole
13 0014217 Aspirin Ramipril Zuclopenthixol Lorazepam Haloperidol
15 0014439 Zopiclone Diazepam Haloperidol Paracetamol <NA>
5 0015822 Olanzapine Venlafaxine Lithium Haloperidol Alprazolam
DRUG6 DRUG7 DRUG8 DRUG9 DRUG10 DRUG11 DRUG12
4 Lamotrigine Alprazolam Lithium Alprazolam <NA> <NA> <NA>
31 Lamotrigine Ramipril Alprazolam Zolpidem Trifluoperazine <NA> <NA>
3 Paracetamol Alprazolam Citalopram <NA> <NA> <NA> <NA>
13 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
15 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
DRUG13 DRUG14 DRUG15 DRUG16
4 <NA> <NA> <NA> <NA>
31 <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA>
13 <NA> <NA> <NA> <NA>
15 <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA>
And what I want is a vector of logicals for each drug, saying whether that patient is on it
Thank you all for your time.
Ross Dunne MRCPsych
"Te occidere possunt sed te edere ne possunt, nefas est".
You were close with your apply attempt, but MARGIN=2 applies the function over columns, not rows. Also, grep returns the positions of the matches; you want grepl, which returns a logical vector. Try this:
apply(drugindex[,-1], 1, function(x) any(grepl("Aspirin",x)))
You could also use %in%, which you may find more intuitive (it tests for an exact match rather than a pattern match):
apply(drugindex[,-1], 1, "%in%", x="Aspirin")
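As a rough end-to-end illustration, the row-wise test can be wrapped in a small function and its result attached as a new column. The three-row drugindex below is a trimmed, partly made-up stand-in for the real data, and on_drug is a hypothetical helper name.

```r
# Trimmed stand-in for drugindex: factors, with NA where no drug is recorded
drugindex <- data.frame(
  book.MRN = c("0008291", "0010188", "0014217"),
  DRUG1    = c("Venlafaxine", "Flurazepam", "Aspirin"),
  DRUG2    = c("Lamotrigine", "Valproate", NA),
  stringsAsFactors = TRUE
)

# Hypothetical helper: TRUE for each patient (row) whose drug columns
# contain an exact match for `drug`; the first column is the patient ID
on_drug <- function(df, drug) {
  apply(df[, -1], 1, function(row) drug %in% row)
}

# Attach one logical column per drug of interest
drugindex$on_lamotrigine <- on_drug(drugindex, "Lamotrigine")
drugindex$on_aspirin     <- on_drug(drugindex[, 1:3], "Aspirin")
drugindex
```

Because %in% never returns NA, the empty drug slots do not leak into the result, which makes the new columns safe to use as model covariates.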
First, a comment on data structure. You have data in what some call a "wide" format, with a single row per patient and multiple columns for the drugs. The "long" format, with repeated rows per patient and a single column for drugs, is usually more amenable to data manipulation. To reshape your data from wide to long and vice versa, take a look at the reshape package. In this case, you would have something like:
library(reshape)
dnow <- melt(drugindex, id.var='book.MRN')
subset(dnow, value=='Lamotrigine')
Much cleaner, and obvious, code, if I may say so ...
Edit: If you need the old structure back you can use cast:
cast(subset(dnow, value=='Lamotrigine'), book.MRN ~ value)
as suggested by @jonw in the comments.

Resources