Splitting factors in R - r

I have a factor with values of the form Single (w/children), Married (no children), Single (no children), etc. and would like to split these into two factors, one multi-valued factor for marital status, and binary-valued one for children.
How do I do this in R?

Some example data
df <- data.frame(status=c("Domestic partners (w/children)", "Married (no
children)", "Single (no children)"))
Get married status out of string. This assumes that marital status is the first word in the character string. If not, you could do it using grepl
df$married <- sapply(strsplit(as.character(df$status) , " \\(") , "[" , 1)
# Change to factor
df$married <- factor(df$married , levels=c("Single" , "Married",
"Domestic partners"))
Get child status out of string
df$ch <- ifelse(grepl("no children" , df$status) , 0 , 1)
A bit more info
This splits each element where there is a " (" - you need to escape the '(' with \\ as it is a special character.
s <- strsplit(as.character(df$status) , " \\(")
We then subset this by selecting the first term
sapply(s , "[" , 1)
The grepl looks for the string "no children" and return a TRUE or FALSE
grepl("no children" , df$status)
We use an ifelse to dichotomise
EDIT
Re comment: adding in some empty strings ("") to data [Note: rather than having empty strings it is generally better to have these as missing (NA). You can do this when you are reading in the data ie. in read.table you can use the na.strings argument (na.strings=c(NA,"")].
df <- data.frame(status=c("Domestic partners (w/children)", "Married
(no children)", "Single (no children)",""))
The command for married status works but the grepl and ifelse will not. As a quick fix you could add this after the ifelse.
df$ch[df$status==""] <- NA
or if you manage to set empty strings to missing
df$ch[is.na(df$status)] <- NA
Run the commands above and this gives
# status married ch
# 1 Domestic partners (w/children) Domestic partners 1
# 2 Married (no children) Married 0
# 3 Single (no children) Single 0
# 4 <NA> NA

Related

Ignore or display NA in a row if the search word is not available in a list- R

How to print or display Not Available if any of my search list in (Table_search) is not available in the list I input. In the input I have three lines and I have 3 keywords to search through these lines and tell me if the keyword is present in those lines or not. If present print that line else print Not available like I showed in the desired output.
My code just prints all the available lines but that doesn't help as I need to know where is the word is missing as well.
Table_search <- list("Table 14", "Source Data:","VERSION")
Table_match_list <- sapply(Table_search, grep, x = tablelist, value = TRUE)
Input:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry
Desired Output:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
NA
#r2evans
sapply(unlist(Table_search), grepl, x = dat)
I get a good output with this code actually, but instead of true or false I would like to print the actual data.
I think a single regex will do it:
replace(dat, !grepl(paste(unlist(Table_search), collapse="|"), dat), NA)
# [1] "Table 14.1.1.1 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.1"
# [3] NA
One problem with using sapply(., grep) is that grep returns integer indices, and if no match is made then it returns a length-0 vector. For sapply (a class-unsafe function), this means that you may or may not get a integer vector in return. Each return may be length 0 (nothing found) or length 1 (something found), and when sapply finds that each return value is not exactly the same length, it returns a list instead (ergo my "class-unsafe" verbiage above).
This doesn't change when you use value=TRUE: change my reasoning above about "0 or 1 logical" into "0 or 1 character", and it's the same exact problem.
Because of this, I suggest grepl: it should always return logical indicating found or not found.
Further, since you don't appear to need to differentiate which of the patterns is found, just "at least one of them", then we can use a single regex, joined with the regex-OR operator |. This works with an arbitrary length of your Table_search list.
If you somehow needed to know which of the patterns was found, then you might want something like:
sapply(unlist(Table_search), grepl, x = dat)
# Table 14 Source Data: VERSION
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE FALSE
and then figure out what to do with the different columns (each row indicates a string within the dat vector).
One way (that is doing the same as my first code suggestion, albeit less efficiently) is
rowSums(sapply(unlist(Table_search), grepl, x = dat)) > 0
# [1] TRUE TRUE FALSE
where the logical return value indicates if something was found. If, for instance, you want to know if two or more of the patterns were found, one might use rowSums(.) >= 2).
Data
Table_search <- list("Table 14", "Source Data:","VERSION")
dat <- c("Table 14.1.1.1 (Page 1 of 2)", "Source Data: Listing 16.2.1.1.1", "Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry")

extract part of word into a field from a long string using R

I have a single long string variable with 3 obs. I was trying to create a field prob to extract the specific string from the long string. the code and message is below.
data aa: "The probability of being a carrier is 0.0002422359 " " an BRCA1 carrier 0.0001061067 "
" an BRCA2 carrier 0.00013612 "
enter code here
aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))
Warning message: In aa$prob <- ifelse(grepl("The probability of being a carrier is", : Coercing LHS to a list
Here is my previous answer, updated to reflect a data.frame.
library(dplyr)
aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))
aa %>%
mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa),
gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))
# aa prob
# 1 ... NA
# 2 ... NA
# 3 The probability of being a carrier is 0.0002422359 0.0002422359
# 4 an BRCA1 carrier 0.0001061067 0.0001061067
# 5 an BRCA2 carrier 0.00013612 0.0001361200
# 6 ... NA
Regex walk-through:
^ and $ are beginning and end of string, respective; \\b is a word-boundary; none of these "consume" any characters, they just mark beginnings and endings
. means one character
? means "zero or one", aka optional; * means "zero or more"; + means "one or more"; all refer to the previous character/class/group
\\s is blank space, including spaces and tabs
[0-9] is a class, meaning any character between 0 and 9; similarly, [a-z] is all lowercase letters, [a-zA-Z] are all letters, [0-9A-F] are hexadecimal digits, etc
(...) is a saved group; it's not uncommon in a group to use | as an "or"; this group is used later in the replacement= part of gsub as numbered groups, so \\1 recalls the first group from the pattern
So grouped and summarized:
"^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
1 ^^^^^^^^^^^^^^^^^^
2 ^^^
3 ^^^
4 ^^^^
This is the "number" part, that allows for one or more digits, an optional decimal point, and zero or more digits. This is saved in group "1".
The word boundary guarantees that we include leading numbers (it's possible, depending on a few things, for "12.345" to be parsed as "2.345" without this.
Anything before the number-like string.
Some or no blank space after the number.
Grouped logically, in an organized way
Regex isn't unique to R, it's a parsing language that R (and most other programming languages) supports in one way or another.

How to print data frame column names containing parentheses in R [duplicate]

Let's say I have a data.frame, like so:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame("Label 1"=x,"Label 2"=rnorm(100))
head(df,3)
returns:
Label.1 Label.2
1 1 1.9825458
2 2 -0.4515584
3 3 0.6397516
How do I get R to stop automagically replacing the space with a period in the column name? ie, "Label 1" instead of "Label.1".
You may set check.names = FALSE in data.frame (as well as in read.table):
df <- data.frame("Label 1" = 1:3, "Label 2" = rnorm(3), check.names = FALSE)
returns:
Label 1 Label 2
1 1 0.2013347
2 2 1.8823111
3 3 -0.5233811
From ?data.frame:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
From ?make.names:
A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.
All invalid characters are translated to "."
Also, if you need to subset a variable with an 'invalid' name using $, you can use backticks `. For example:
df$`Label 1`
You don't.
With the space you desire the format would not satisfy the requirements for an identifier that come to play when you use df$column.1 -- that could not cope with a space. So see the make.names() function for details or an example:
> make.names(c("Foo Bar", "tic tac"))
[1] "Foo.Bar" "tic.tac"
>
Edit eleven years later: The answer still stands that R prefers column names can be valid variable names. But R is flexible: if you insist you can use the other form _but then need to require the not-otherwise-valid-within-the-language column names explicitly:
> x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
> df <- data.frame("Label 1"=x,"Label 2"=rnorm(100), check.names=FALSE)
> summary( df$`Label 2` )
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.2719 -0.7148 -0.0971 -0.0275 0.6559 2.5820
>
So by saying check.names=FALSE we override the default (and sensible) check, and by wrapping the identifier in backticks we can access the column.
You can change an existing data frames names to contain spaces ie using your example
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame("Label 1"=x,"Label 2"=rnorm(100))
colnames(df) <- c("Label 1", "Label 2")
head(df, 3)
returns
Label 1 Label 2
1 1 0.2013347
2 2 1.8823111
3 3 -0.5233811
and you can still access the columns using the $ operator, you just need to use double quotes eg
df$"Label 2"[1:3]
returns
[1] 0.2013347 1.8823111 -0.5233811
It seems rather inconsistent to me to auto-convert column names upon data.frame creation, but not to-do the same during column name alteration, but thats how R works at the moment.
names(df)<-c('Label 1','Label 2)

Extract text according to delimiters but miss out missing entries

I have some text as follows:
inputString<- “Patient Name:MRS Comfor Atest Date of Birth:23/02/1981 Hospital Number:000000 Date of Procedure:01/01/2010 Endoscopist:Dr. Sebastian Zeki: Nurses:Anthony Nurse , Medications:Medication A 50 mcg, Another drug 2.5 mg Instrument:D111 Extent of Exam:second part of duodenum Visualization:Good Tolerance: Good Complications: None Co-morbidity:None INDICATIONS FOR EXAMINATION Illness Stomach pain. PROCEDURE PERFORMED Gastroscopy (OGD) FINDINGS Things found and biopsied DIAGNOSIS Biopsy of various RECOMMENDATIONS Chase for histology. FOLLOW UP Return Home"
I want to extract parts of the test in to their own columns according to some text boundaries I have set:
myWords<-c("Patient Name","Date of Birth","Hospital Number","Date of Procedure","Endoscopist","Second Endoscopist","Trainee","Referring Physician","Nurses"."Medications")
Not all of the delimiter words are in the text (but they are always in the same order).
I have a function that should separate them out (with the column title as the start of the word boundary:
delim<-myWords
inputStringdf <- data.frame(inputString,stringsAsFactors = FALSE)
inputStringdf <- inputStringdf %>%
tidyr::separate(inputString, into = c("added_name",delim),
sep = paste(delim, collapse = "|"),
extra = "drop", fill = "right")
However, when there is no finding between two delimiters, or if the delimiters do not exist, rather than place NA in the column, it just fills it with the next text found between two delimiters. How can I make sure that the correct columns are filled with the correct text as defined by the delimiters?
Using the input shown in the Note at the end transform it into DCF format and then read it in using read.dcf which converts the input lines into a character matrix m. See ?read.dcf for more info. No packages are used.
pat <- sprintf("(%s)", paste(myWords, collapse = "|"))
g <- gsub(pat, "\n\\1", paste0(Lines, "\n"))
m <- read.dcf(textConnection(g))
Here are the first three columns:
m[, 1:3]
## Patient Name Date of Birth Hospital Number
## [1,] "MRS Comfor Atest" "23/02/1981" "000000"
## [2,] "MRS Comfor Atest" NA "000000"
Note
The input is assumed to have one record per patient like this example which has two records. We have just repeated the first patient for simplicity in synthesizing an input data set except we have omitted the Date of Birth in the second record.
Lines <- c(inputString, sub("Date of Birth:23/02/1981 ", "", inputString))

Specifying column names in a data.frame changes spaces to "."

Let's say I have a data.frame, like so:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame("Label 1"=x,"Label 2"=rnorm(100))
head(df,3)
returns:
Label.1 Label.2
1 1 1.9825458
2 2 -0.4515584
3 3 0.6397516
How do I get R to stop automagically replacing the space with a period in the column name? ie, "Label 1" instead of "Label.1".
You may set check.names = FALSE in data.frame (as well as in read.table):
df <- data.frame("Label 1" = 1:3, "Label 2" = rnorm(3), check.names = FALSE)
returns:
Label 1 Label 2
1 1 0.2013347
2 2 1.8823111
3 3 -0.5233811
From ?data.frame:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
From ?make.names:
A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.
All invalid characters are translated to "."
Also, if you need to subset a variable with an 'invalid' name using $, you can use backticks `. For example:
df$`Label 1`
You don't.
With the space you desire the format would not satisfy the requirements for an identifier that come to play when you use df$column.1 -- that could not cope with a space. So see the make.names() function for details or an example:
> make.names(c("Foo Bar", "tic tac"))
[1] "Foo.Bar" "tic.tac"
>
Edit eleven years later: The answer still stands that R prefers column names can be valid variable names. But R is flexible: if you insist you can use the other form _but then need to require the not-otherwise-valid-within-the-language column names explicitly:
> x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
> df <- data.frame("Label 1"=x,"Label 2"=rnorm(100), check.names=FALSE)
> summary( df$`Label 2` )
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.2719 -0.7148 -0.0971 -0.0275 0.6559 2.5820
>
So by saying check.names=FALSE we override the default (and sensible) check, and by wrapping the identifier in backticks we can access the column.
You can change an existing data frames names to contain spaces ie using your example
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame("Label 1"=x,"Label 2"=rnorm(100))
colnames(df) <- c("Label 1", "Label 2")
head(df, 3)
returns
Label 1 Label 2
1 1 0.2013347
2 2 1.8823111
3 3 -0.5233811
and you can still access the columns using the $ operator, you just need to use double quotes eg
df$"Label 2"[1:3]
returns
[1] 0.2013347 1.8823111 -0.5233811
It seems rather inconsistent to me to auto-convert column names upon data.frame creation, but not to-do the same during column name alteration, but thats how R works at the moment.
names(df)<-c('Label 1','Label 2)

Resources