Specifying column names in a data.frame changes spaces to "." - r

Let's say I have a data.frame, like so:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame("Label 1"=x,"Label 2"=rnorm(100))
head(df,3)
returns:
Label.1 Label.2
1 1 1.9825458
2 2 -0.4515584
3 3 0.6397516
How do I get R to stop automagically replacing the space with a period in the column name, i.e. keep "Label 1" instead of "Label.1"?

You may set check.names = FALSE in data.frame (as well as in read.table):
df <- data.frame("Label 1" = 1:3, "Label 2" = rnorm(3), check.names = FALSE)
returns:
Label 1 Label 2
1 1 0.2013347
2 2 1.8823111
3 3 -0.5233811
From ?data.frame:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
From ?make.names:
A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.
All invalid characters are translated to "."
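As a small illustration (not from the original answer) of what make.names does to names with spaces, names starting with a digit, and duplicates:
make.names(c("Label 1", "2nd label", "Label 1"), unique = TRUE)
# [1] "Label.1"    "X2nd.label" "Label.1.1"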
Also, if you need to subset a variable with an 'invalid' name using $, you can use backticks `. For example:
df$`Label 1`
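Bracket indexing with the name given as a string also works and avoids the backticks, for example:
df[["Label 1"]]
df[, "Label 2"]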

You don't.
With the space you desire, the name would not satisfy the requirements for an identifier that come into play when you use df$column.1 -- that syntax cannot cope with a space. See the make.names() function for details, or this example:
> make.names(c("Foo Bar", "tic tac"))
[1] "Foo.Bar" "tic.tac"
>
Edit eleven years later: The answer still stands that R prefers column names that are valid variable names. But R is flexible: if you insist, you can use the other form, but you then need to reference the not-otherwise-valid-within-the-language column names explicitly:
> x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
> df <- data.frame("Label 1"=x,"Label 2"=rnorm(100), check.names=FALSE)
> summary( df$`Label 2` )
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.2719 -0.7148 -0.0971 -0.0275 0.6559 2.5820
>
So by saying check.names=FALSE we override the default (and sensible) check, and by wrapping the identifier in backticks we can access the column.

You can change an existing data frame's names to contain spaces, e.g. using your example:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame("Label 1"=x,"Label 2"=rnorm(100))
colnames(df) <- c("Label 1", "Label 2")
head(df, 3)
returns
Label 1 Label 2
1 1 0.2013347
2 2 1.8823111
3 3 -0.5233811
and you can still access the columns using the $ operator; you just need to use double quotes, e.g.
df$"Label 2"[1:3]
returns
[1] 0.2013347 1.8823111 -0.5233811
It seems rather inconsistent to me to auto-convert column names upon data.frame creation but not to do the same during column name alteration, but that's how R works at the moment.

names(df) <- c('Label 1', 'Label 2')

Related

Substring match when filtering rows

I have strings in file1 that match part of the strings in file2. I want to filter out the strings from file2 that partly match those in file1. Please see my attempt below; I'm not sure how to define a substring match in this way.
file1:
V1
species1
species121
species14341
file2
V1
genus1|species1|strain1
genus1|species121|strain1
genus1|species1442|strain1
genus1|species4242|strain1
genus1|species4131|strain1
my try:
file1[!file1$V1 %in% file2$V1]
You cannot use the %in% operator in this way in R. It is used to determine whether an element of a vector is in another vector, unlike Python's in, which can also match a substring. Look at this:
"species1" %in% "genus1|species1|strain1" # FALSE
"species1" %in% c("genus1", "species1", "strain1") # TRUE
You can, however, use grepl for this (the l is for logical, i.e. it returns TRUE or FALSE).
grepl("species1", "genus1|species1|strain1") # TRUE
There's an additional complication here in that you cannot pass a vector of patterns to grepl, as it will only use the first value:
grepl(file1$V1, "genus1|species1|strain1")
[1] TRUE
Warning message:
In grepl(file1$V1, "genus1|species1|strain1") :
argument 'pattern' has length > 1 and only the first element will be used
The above simply tells you that the first element of file1$V1 is in "genus1|species1|strain1".
Furthermore, you want to compare each element in file1$V1 to an entire vector of strings, rather than just one string. That's OK but you will get a vector the same length as the second vector as an output:
grepl("species1", file2$V1)
[1] TRUE TRUE TRUE FALSE FALSE
We can just see if any() of those are a match. As you've tagged your question with tidyverse, here's a dplyr solution:
library(dplyr)
file1 |>
  rowwise() |>  # This makes sure you only pass one element at a time to `grepl`
  mutate(
    in_v2 = any(grepl(V1, file2$V1))
  ) |>
  filter(!in_v2)
# A tibble: 1 x 2
# Rowwise:
# V1 in_v2
# <chr> <lgl>
# 1 species14341 FALSE
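For comparison, a base R sketch of the same idea (assuming, as above, that file1 and file2 each have a character column V1); fixed = TRUE treats each search string as a literal rather than as a regex:
# For each string in file1$V1, check whether it occurs as a substring
# of any entry in file2$V1, then keep only the rows with no match
has_match <- vapply(file1$V1,
                    function(p) any(grepl(p, file2$V1, fixed = TRUE)),
                    logical(1))
file1[!has_match, , drop = FALSE]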
One way to get what you want is using the rm_between function from the qdapRegex package. So, you can run the following code:
# Load library
library(qdapRegex)
# Extract the names of file2$V1 you are interested in (those between | |)
v <- unlist(rm_between(file2$V1, "|", "|", extract = T))
# Which of these elements are in file1$V1?
elem.are <- which(v %in% file1$V1)
# Delete the elements in elem.are
file2$V1[-elem.are]
In v we save the names of file2$V1 we are interested in (those between | |).
Then we save in elem.are the positions of those names which appear in file1$V1.
Finally, we omit those elements using file2$V1[-elem.are].

Ignore or display NA in a row if the search word is not available in a list - R

How do I print or display Not Available if any of my search terms (Table_search) is not found in the input? In the input I have three lines and 3 keywords to search across these lines, and I want to know whether each keyword is present in those lines or not. If present, print that line; else print Not Available, as shown in the desired output.
My code just prints all the available lines, but that doesn't help as I need to know where the word is missing as well.
Table_search <- list("Table 14", "Source Data:","VERSION")
Table_match_list <- sapply(Table_search, grep, x = tablelist, value = TRUE)
Input:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry
Desired Output:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
NA
#r2evans
sapply(unlist(Table_search), grepl, x = dat)
I get a good output with this code actually, but instead of TRUE or FALSE I would like to print the actual data.
I think a single regex will do it:
replace(dat, !grepl(paste(unlist(Table_search), collapse="|"), dat), NA)
# [1] "Table 14.1.1.1 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.1"
# [3] NA
One problem with using sapply(., grep) is that grep returns integer indices, and if no match is made then it returns a length-0 vector. For sapply (a class-unsafe function), this means that you may or may not get an integer vector in return. Each return may be length 0 (nothing found) or length 1 (something found), and when sapply finds that the return values are not all the same length, it returns a list instead (ergo my "class-unsafe" verbiage above).
This doesn't change when you use value=TRUE: change my reasoning above about "0 or 1 integer" into "0 or 1 character", and it's the same exact problem.
Because of this, I suggest grepl: it should always return logical indicating found or not found.
Further, since you don't appear to need to differentiate which of the patterns is found, just "at least one of them", then we can use a single regex, joined with the regex-OR operator |. This works with an arbitrary length of your Table_search list.
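To see what that combined pattern looks like:
paste(unlist(Table_search), collapse = "|")
# [1] "Table 14|Source Data:|VERSION"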
If you somehow needed to know which of the patterns was found, then you might want something like:
sapply(unlist(Table_search), grepl, x = dat)
# Table 14 Source Data: VERSION
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE FALSE
and then figure out what to do with the different columns (each row indicates a string within the dat vector).
One way (that is doing the same as my first code suggestion, albeit less efficiently) is
rowSums(sapply(unlist(Table_search), grepl, x = dat)) > 0
# [1] TRUE TRUE FALSE
where the logical return value indicates if something was found. If, for instance, you want to know if two or more of the patterns were found, one might use rowSums(.) >= 2.
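Tying that back to the desired output, this logical vector can be fed into replace() just like the single-regex version above:
found <- rowSums(sapply(unlist(Table_search), grepl, x = dat)) > 0
replace(dat, !found, NA)
# [1] "Table 14.1.1.1 (Page 1 of 2)"    "Source Data: Listing 16.2.1.1.1"
# [3] NA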
Data
Table_search <- list("Table 14", "Source Data:","VERSION")
dat <- c("Table 14.1.1.1 (Page 1 of 2)", "Source Data: Listing 16.2.1.1.1", "Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry")


Capitalizing letters. R equivalent of excel "PROPER" function [duplicate]

Colleagues,
I'm looking at a data frame resembling the extract below:
Month Provider Items
January CofCom 25
july CofCom 331
march vobix 12
May vobix 0
I would like to capitalise first letter of each word and lower the remaining letters for each word. This would result in the data frame resembling the one below:
Month Provider Items
January Cofcom 25
July Cofcom 331
March Vobix 12
May Vobix 0
In a word, I'm looking for R's equivalent of the PROPER function available in MS Excel.
With regular expressions:
x <- c('woRd Word', 'Word', 'word words')
gsub("(?<=\\b)([a-z])", "\\U\\1", tolower(x), perl=TRUE)
# [1] "Word Word" "Word" "Word Words"
(?<=\\b)([a-z]) says look for a lowercase letter preceded by a word boundary (e.g., a space or beginning of a line). (?<=...) is called a "look-behind" assertion. \\U\\1 says replace that character with its uppercase version. \\1 is a back reference to the first group surrounded by () in the pattern. See ?regex for more details.
If you only want to capitalize the first letter of the first word, use the pattern "^([a-z])" instead.
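For example, with the same x as above:
gsub("^([a-z])", "\\U\\1", tolower(x), perl = TRUE)
# [1] "Word word"  "Word"       "Word words"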
The question is about an equivalent of Excel PROPER and the (former) accepted answer is based on:
proper=function(x) paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
It might be worth noting that:
proper("hello world")
## [1] "Hello world"
Excel PROPER would give, instead, "Hello World". For a 1:1 mapping with Excel, see @Matthew Plourde's answer.
If what you actually need is to set only the first character of a string to upper-case, you might also consider the shorter and slightly faster version:
proper <- function(s) sub("(.)", "\\U\\1", tolower(s), perl = TRUE)
Another method uses the stringi package. The stri_trans_general function appears to lower case all letters other than the initial letter.
require(stringi)
x <- c('woRd Word', 'Word', 'word words')
stri_trans_general(x, id = "Title")
[1] "Word Word" "Word" "Word Words"
I don't think there is one, but you can easily write it yourself:
(dat <- data.frame(x = c('hello', 'frIENds'),
                   y = c('rawr', 'rulZ'),
                   z = c(16, 18)))
# x y z
# 1 hello rawr 16
# 2 frIENds rulZ 18
proper <- function(x)
  paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))

(dat <- data.frame(lapply(dat, function(x)
  if (is.numeric(x)) x else proper(x)),
  stringsAsFactors = FALSE))
# x y z
# 1 Hello Rawr 16
# 2 Friends Rulz 18
str(dat)
# 'data.frame': 2 obs. of 3 variables:
# $ x: chr "Hello" "Friends"
# $ y: chr "Rawr" "Rulz"
# $ z: num 16 18

Splitting factors in R

I have a factor with values of the form Single (w/children), Married (no children), Single (no children), etc., and would like to split these into two factors: one multi-valued factor for marital status, and a binary-valued one for children.
How do I do this in R?
Some example data
df <- data.frame(status=c("Domestic partners (w/children)", "Married (no
children)", "Single (no children)"))
Get marital status out of the string. This assumes that marital status comes first in the string, before the parenthesis. If not, you could do it using grepl.
df$married <- sapply(strsplit(as.character(df$status) , " \\(") , "[" , 1)
# Change to factor
df$married <- factor(df$married, levels = c("Single", "Married", "Domestic partners"))
Get child status out of string
df$ch <- ifelse(grepl("no children" , df$status) , 0 , 1)
A bit more info
This splits each element where there is a " (" - you need to escape the '(' with \\ as it is a special character.
s <- strsplit(as.character(df$status) , " \\(")
We then subset this by selecting the first term
sapply(s , "[" , 1)
The grepl looks for the string "no children" and return a TRUE or FALSE
grepl("no children" , df$status)
We use an ifelse to dichotomise
EDIT
Re comment: adding in some empty strings ("") to the data. [Note: rather than having empty strings it is generally better to have these as missing (NA). You can do this when you are reading in the data, e.g. in read.table you can use the na.strings argument, na.strings = c(NA, "").]
df <- data.frame(status=c("Domestic partners (w/children)", "Married
(no children)", "Single (no children)",""))
The command for married status works but the grepl and ifelse will not. As a quick fix you could add this after the ifelse.
df$ch[df$status==""] <- NA
or if you manage to set empty strings to missing
df$ch[is.na(df$status)] <- NA
Run the commands above and this gives
# status married ch
# 1 Domestic partners (w/children) Domestic partners 1
# 2 Married (no children) Married 0
# 3 Single (no children) Single 0
# 4 <NA> NA
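As an aside, here is a sketch of the same split using tidyr::separate (an alternative approach, not part of the original answer), assuming df has the status column as above; the trailing ')' is stripped first so only one delimiter remains:
library(dplyr)
library(tidyr)

df |>
  mutate(status = sub("\\)$", "", status)) |>   # drop the trailing ')'
  separate(status, into = c("married", "children"),
           sep = " \\(", fill = "right")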
