Unite column names - r

I am trying to unite all columns of the dataframe df separating them with |.
However, for the name of the new column I would like to have all column names merged together separated the same way (eg S_n|S_s|S_b).
Here is what I tried; it fails with the error message Error: Must supply a symbol or a string as argument
S_n = c(2, 3, 5)
S_s = c("aa", "bb", "cc")
S_b = c(TRUE, FALSE, TRUE)
df = data.frame(S_n, S_s, S_b)
unite(df, S_n|S_s|S_b, sep="|", remove=TRUE)

unite(df, "S_n|S_s|S_b", sep="|", remove=TRUE). You need quotes around the column name because it's a non-standard name. (Standard names can't contain symbols other than . and _).
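To see why the quotes are needed, here is a minimal base R sketch that checks which names are syntactic and shows that backticks work as well as a string. The helper is_syntactic is a hypothetical name introduced for illustration, not a base R function, and the two-column df is a cut-down version of the one in the question.

```r
# Hypothetical helper: a name is syntactic if make.names() leaves it unchanged.
is_syntactic <- function(x) make.names(x) == x

is_syntactic("S_n")          # TRUE: only letters, digits, . and _
is_syntactic("S_n|S_s|S_b")  # FALSE: | is not allowed in a standard name

df <- data.frame(S_n = c(2, 3, 5), S_s = c("aa", "bb", "cc"))
new_name <- paste(names(df), collapse = "|")        # "S_n|S_s"
df[[new_name]] <- paste(df$S_n, df$S_s, sep = "|")

# A non-syntactic column has to be referenced with backticks (or a string).
df$`S_n|S_s`
```

Without quotes or backticks, R parses S_n|S_s|S_b as a logical OR expression rather than as a name.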

One idea via base R can be,
df[paste(names(df), collapse = '|')] <- do.call(paste, c(df, sep = '|'))
# S_n S_s S_b S_n|S_s|S_b
#1 2 aa TRUE 2|aa|TRUE
#2 3 bb FALSE 3|bb|FALSE
#3 5 cc TRUE 5|cc|TRUE

Related

split the superhero names with their ages in the given vector

Given this vector:
vector <- c("Superman1000", "Batman35", "Wonderwoman240")
I want to split the superhero's name and age.
df=data.frame(vector= c("Superman1000", "Batman35", "Wonderwoman240"))
library(stringr)
library(stringi)
library(dplyr)
df %>% separate(vector, c("A", "B"))
I tried this but it doesn't work.
If the data is same as shown, we can remove all the digits to get super hero name and remove all the non-digits to get their age.
df$super_hero <- gsub("\\d+", "", df$vector)
df$super_hero_age <- gsub("\\D", "", df$vector)
Or with tidyr::extract
tidyr::extract(df, vector, into = c("name", "age"),regex = "(.*\\D)(\\d+)")
# name age
#1 Superman 1000
#2 Batman 35
#3 Wonderwoman 240
As mentioned by @A5C1D2H2I1M1N2O1R2T1, we can also use strcapture
strcapture("(.*\\D)(\\d+)", df$vector,
proto = data.frame(superhero = character(), age = integer()))
We can use read.csv from base R after creating a delimiter before the numeric part with sub
read.csv(text = sub("(\\d+)", ",\\1", df$vector), header = FALSE,
stringsAsFactors = FALSE, col.names = c('name', 'age'))
# name age
#1 Superman 1000
#2 Batman 35
#3 Wonderwoman 240
Or another option is separate where we specify a regex lookaround
library(tidyr)
separate(df, vector, into = c("name", "age"), sep= "(?<=[a-z])(?=\\d)")
# name age
#1 Superman 1000
#2 Batman 35
#3 Wonderwoman 240

How to remove missing values (NA) when uniting columns?

I am trying to unite 5 columns into one new column using the unite function. However, all rows contain lots of NA values, creating values that look like
Mother|NA|NA|NA|NA
NA|NA|Father|Mother|NA
Mother|Father|NA|Stepmother|NA
I've tried to unite them using this code:
df2 <- df %>%
unite(Parent_full, Parent:Parent5, sep = "|", remove = TRUE, na.rm = TRUE)
But that gives me the following error:
Error: TRUE must evaluate to column positions or names, not a logical vector
I've also looked on the forum, and found that possibly the na.rm argument of unite is not active?
Here is some data to recreate my dataset
Name <- c('Paul', 'Edward', 'Mary')
Postalcode <- c('4732', '9045', '3476')
Parent <- c('Mother', 'NA', 'Mother')
Parent2 <- c('NA', 'NA', 'Father')
Parent3 <- c('NA', 'Father', 'NA')
Parent4 <- c('NA', 'Mother', 'Stepmother')
Parent5 <- c('NA', 'NA', 'NA')
df <- data.frame(Name, Postalcode, Parent, Parent2, Parent3, Parent4, Parent5)
Would love to know how to unite my columns without NA's.
UPDATE:
I've now updated the tidyr package and I added "na = c("", "NA")" to my read_csv command.
Now the
df2 <- df %>%
unite(Parent_full, Parent:Parent5, sep = "|", remove = TRUE, na.rm = TRUE)
command works; however, for some reason the NA at the end of the value stays. Now my column values look like this:
Mother|NA
Father|Mother|NA
Mother|Father|Stepmother|NA
Does anyone know what went wrong now?
You have a couple of problems:
1) the NAs are not real NAs (check is.na(df$Parent2))
2) your columns are factors
While constructing the data frame, use stringsAsFactors = FALSE
df <- data.frame(Name, Postalcode, Parent, Parent2, Parent3, Parent4,
Parent5, stringsAsFactors = FALSE)
and then replace NA and use unite
library(dplyr)
df %>%
na_if('NA') %>%
tidyr::unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE)
# Name Postalcode Parent_full
#1 Paul 4732 Mother
#2 Edward 9045 Father|Mother
#3 Mary 3476 Mother|Father|Stepmother
If the data is already loaded, we can change them by using mutate_if
df %>%
mutate_if(is.factor, as.character) %>%
na_if('NA') %>%
tidyr::unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE)
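For readers who cannot run the pipeline above as written (newer dplyr releases may not accept a whole data frame in na_if(); treat that as an assumption to check against your installed version), here is a base R sketch of the same two steps: turn the "NA" strings into real NA, then paste the non-missing values row by row. It is an illustration under those assumptions, not the tidyr implementation, and it uses a three-column subset of the example data.

```r
df <- data.frame(
  Name    = c("Paul", "Edward", "Mary"),
  Parent  = c("Mother", "NA", "Mother"),
  Parent2 = c("NA", "NA", "Father"),
  Parent3 = c("NA", "Father", "NA"),
  stringsAsFactors = FALSE
)
parent_cols <- c("Parent", "Parent2", "Parent3")

# Step 1: replace the literal string "NA" with a real missing value.
df[parent_cols] <- lapply(df[parent_cols], function(x) replace(x, x == "NA", NA))

# Step 2: for each row, keep the non-missing values and join them with "|".
df$Parent_full <- apply(df[parent_cols], 1,
                        function(r) paste(r[!is.na(r)], collapse = "|"))

# Drop the source columns, as remove = TRUE does in unite().
df <- df[setdiff(names(df), parent_cols)]
df
```

A row with no non-missing values ends up as an empty string here.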
Your main problem here is that you haven't updated to tidyr 1.0 yet. That error message is the best that the previous version can do with the input na.rm = TRUE, since that argument didn't exist before. It thinks you're giving it a named argument as part of the ....
Specifically, just run install.packages("tidyr") and it should work. You might need to restart R first, so tidyr isn't currently loaded.
If your missing values are "NA" strings, then, as Ronak pointed out, you need to use na_if() on them first. It's strange to me because your initial code chunk makes it look like those are proper NAs, due to the red highlighting. But then your reprex code has 'NA' values which would definitely be strings. Anyway, you say you're reading in from CSV, so, it would be cleaner and quicker to run the CSV-reading code so as to read NAs in properly with an na argument or the like.
Response to Edit: That does seem like a bug, that NAs at the end of the united string don't get properly removed. Well, anyway, the fix is easy, and probably better than anything else we could do:
df2 <- df %>%
unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE) %>%
mutate_at("Parent_full", . %>%
str_remove("(^|\\|)NA$") %>%
na_if(""))
This ensures two things: 1) that the letters "NA" at the end of a string are only removed if they're there because of the unite(), with a pipe (if anything) in front of them; and 2) if there's no non-missing values on a line here, then the value will be a proper NA rather than "NA", "", or what have you, which I assume is what you want.
Update: I've found that the bug applies to any column that contains nothing but NAs, i.e. na.rm = TRUE only removes NAs from columns that have at least one non-missing value. I've filed a bug report: https://github.com/tidyverse/tidyr/issues/765
Given this, though, the optimal solution is probably just to remove any columns that are all NA beforehand. If this is production code, though, then that gets real tricky, since you have to specify the unite() so as to not break if any or even all of the columns to be united are dropped by that prior step.
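Dropping the all-NA columns beforehand could look like the following base R sketch. The names all_missing, df_kept and parent_cols are illustrative, and the data is a cut-down version of the example; the intersect() step is one possible way to keep a later column selection from breaking when a column has been dropped.

```r
df <- data.frame(
  Name    = c("Paul", "Edward", "Mary"),
  Parent  = c("Mother", NA, "Mother"),
  Parent5 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Flag the columns in which every value is missing, then keep the rest.
all_missing <- vapply(df, function(col) all(is.na(col)), logical(1))
df_kept <- df[!all_missing]

# Only select the parent columns that survived the previous step.
parent_cols <- intersect(c("Parent", "Parent5"), names(df_kept))
parent_cols
```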
Update 2: As a response to the bug report pointed out, the issue is actually that that all-missing column is logicals. So that makes the optimal solution: read in such columns as character, or coerce them to character before uniting. Full reprex for that:
library(tidyverse)
Name <- c('Paul', 'Edward', 'Mary')
Postalcode <- c('4732', '9045', '3476')
Parent <- c('Mother', NA, 'Mother')
Parent2 <- c(NA, NA, 'Father')
Parent3 <- c(NA, 'Father', NA)
Parent4 <- c(NA, 'Mother', 'Stepmother')
Parent5 <- c(NA, NA, NA)
(df <- data.frame(Name, Postalcode, Parent, Parent2, Parent3, Parent4, Parent5))
#> Name Postalcode Parent Parent2 Parent3 Parent4 Parent5
#> 1 Paul 4732 Mother <NA> <NA> <NA> NA
#> 2 Edward 9045 <NA> <NA> Father Mother NA
#> 3 Mary 3476 Mother Father <NA> Stepmother NA
(df2 <- df %>%
mutate_at(vars(Parent:Parent5), as.character) %>%
unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE))
#> Name Postalcode Parent_full
#> 1 Paul 4732 Mother
#> 2 Edward 9045 Father|Mother
#> 3 Mary 3476 Mother|Father|Stepmother
Created on 2019-09-27 by the reprex package (v0.3.0)
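The point about the all-missing column being logical can be checked directly in base R. This is a small sketch with made-up two-row data, only meant to show the column classes before and after coercion.

```r
# A vector that contains only NA defaults to type logical.
all_missing <- c(NA, NA, NA)
class(all_missing)                # "logical"
class(as.character(all_missing))  # "character"

df <- data.frame(Parent  = c("Mother", NA),
                 Parent5 = c(NA, NA),
                 stringsAsFactors = FALSE)
classes_before <- sapply(df, class)

# Coerce every column to character, keeping the data frame structure.
df[] <- lapply(df, as.character)
classes_after <- sapply(df, class)
classes_before
classes_after
```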
unite() (and na.rm = TRUE) only works for character columns (as far as I can tell). This isn't made clear in the help docs.
For factors, it also returns the integer code rather than the factor level - something to watch out for.
Numeric: Doesn't remove NAs:
df <- data.frame("to.combine1" = c(NA, 1, 3),
"to.combine2" = c(2, NA, 3))
sapply(df, class) #not functional, just illustrative
#> to.combine1 to.combine2
#> "numeric" "numeric"
unite(df, "combined", to.combine1:to.combine2, sep="_", na.rm = TRUE)
#> combined
#> 1 NA_2
#> 2 1_NA
#> 3 3_3
Factor: Doesn't remove NAs and uses integer code rather than level:
df <- data.frame("to.combine1" = as.character(c(NA, 1, "a")),
"to.combine2" = as.character(c(2, NA, "a")),
stringsAsFactors = TRUE)
sapply(df, class) #not functional, just illustrative
#> to.combine1 to.combine2
#> "factor" "factor"
unite(df, "combined", to.combine1:to.combine2, sep="_", na.rm = TRUE)
#> combined
#>1 NA_1
#>2 1_NA
#>3 2_2
Character: Expected behaviour
df <- data.frame("to.combine1" = as.character(c(NA, 1, "a")),
"to.combine2" = as.character(c(2, NA, "a")),
stringsAsFactors = FALSE)
sapply(df, class) #not functional, just illustrative
#>to.combine1 to.combine2
#>"character" "character"
unite(df, "combined", to.combine1:to.combine2, sep="_", na.rm = TRUE)
#> combined
#> 1 2
#> 2 1
#> 3 a_a
You can remove the NAs later with something like this
df %>%
unite(Parent_full, Parent:Parent5, sep = "|", remove = TRUE) %>%
mutate(Parent_full = gsub("(?<![a-zA-Z])NA\\||\\|NA(?![a-zA-Z])|\\|NA$", '', Parent_full, perl = TRUE))
Name Postalcode Parent_full
1 Paul 4732 Mother
2 Edward 9045 Father|Mother
3 Mary 3476 Mother|Father|Stepmother
It replaces NA| not preceded by a letter or |NA not followed by a letter or |NA at the end of the string, with an empty string
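The pattern can be tried on its own, outside the pipeline, with the three example strings from the question. This is a base R sketch; strip_na is a hypothetical helper name. It also shows one limitation worth knowing: a row in which every value is NA is reduced to a single "NA" rather than to an empty string.

```r
# Hypothetical helper wrapping the same regular expression.
strip_na <- function(x) {
  gsub("(?<![a-zA-Z])NA\\||\\|NA(?![a-zA-Z])|\\|NA$", "", x, perl = TRUE)
}

united <- c("Mother|NA|NA|NA|NA",
            "NA|NA|Father|Mother|NA",
            "Mother|Father|NA|Stepmother|NA")
cleaned <- strip_na(united)
cleaned

# An all-NA row keeps one trailing "NA", because nothing is left to anchor on.
strip_na("NA|NA|NA")
```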

split string each x characters in dataframe

I know there are some answers here about splitting a string every nth character, such as this one and this one. However, these are pretty question-specific and mostly related to a single string, not to a data frame of multiple strings.
Example data
df <- data.frame(id = 1:2, seq = c('ABCDEFGHI', 'ZABCDJHIA'))
Looks like this:
id seq
1 1 ABCDEFGHI
2 2 ZABCDJHIA
Splitting on every third character
I want to split the string in each row every third character, such that the resulting data frame looks like this:
id 1 2 3
1 ABC DEF GHI
2 ZAB CDJ HIA
What I tried
I used the splitstackshape package before to split a string on a single character, like so: df %>% cSplit('seq', sep = '', stripWhite = FALSE, type.convert = FALSE). I would love to have a similar function (or perhaps it is possible with cSplit) to split on every third character.
An option would be separate
library(tidyverse)
df %>%
separate(seq, into = paste0("x", 1:3), sep = c(3, 6))
# id x1 x2 x3
#1 1 ABC DEF GHI
#2 2 ZAB CDJ HIA
If we want to create it more generic
n1 <- nchar(as.character(df$seq[1])) - 3
s1 <- seq(3, n1, by = 3)
nm1 <- paste0("x", seq_len(length(s1) +1))
df %>%
separate(seq, into = nm1, sep = s1)
Or using base R, using strsplit, split the 'seq' column for each instance of 3 characters by passing a regex lookaround into a list and then rbind the list elements
df[paste0("x", 1:3)] <- do.call(rbind,
strsplit(as.character(df$seq), "(?<=.{3})", perl = TRUE))
NOTE: It is better to avoid non-standard column names, such as names that start with a number. For that reason, 'x' is prepended to the names.
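For an arbitrary chunk width, a base R variant built on substring() is sketched below. The function split_every is a hypothetical helper written for this example, and it assumes there are at least two chunks per string so that the result is a matrix.

```r
# Hypothetical helper: cut every string in x into pieces of n characters.
split_every <- function(x, n) {
  x <- as.character(x)
  starts <- seq(1, max(nchar(x)), by = n)   # start position of each piece
  pieces <- vapply(x, function(s) substring(s, starts, starts + n - 1),
                   character(length(starts)), USE.NAMES = FALSE)
  t(pieces)                                  # one row per input string
}

df <- data.frame(id = 1:2, seq = c("ABCDEFGHI", "ZABCDJHIA"))
parts <- split_every(df$seq, 3)
df[paste0("x", seq_len(ncol(parts)))] <- parts
df$seq <- NULL
df
```

Strings shorter than the longest one are padded with empty strings in the trailing columns.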
You can split a string each x characters in base also with read.fwf (Read Fixed Width Format Files), which needs either a file or a connection.
read.fwf(file=textConnection(as.character(df$seq)), widths=c(3,3,3))
V1 V2 V3
1 ABC DEF GHI
2 ZAB CDJ HIA

colMeans not working in R

The data set I have as file Dummy.txt is as follows
A|B|C|D
1|2|1.9|5
2.5|5|53|3
4|48|49|0.4
8|94|495|B6
(please note a text character in 5th row, 4th column)
I would like to obtain the mean of each column (i.e. column A, B, C and D).
The code I am using is as follows:
mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = FALSE, row.names = NULL)
mydata_1 <- as.numeric(as.character(mydata_1))
colMeans(mydata_1, na.rm = TRUE,)
However, this doesn't seem to be working. Any suggestions please?
You need to set header = TRUE to have the A|B|C|D row be used for column names, otherwise they are included as values, and all columns are parsed as string columns.
Then, passing stringsAsFactors = FALSE prevents column D from being turned into a factor, and the value 'B6' will automatically be turned into an NA when the column is converted to a numeric type.
mydata_1 <- read.delim("Dummy.txt", skipNul = TRUE, sep = "|", header = TRUE,
row.names = NULL, stringsAsFactors = FALSE)
mydata_1[] <- lapply(mydata_1, as.numeric)
#> Warning message:
#> In lapply(mydata_1, as.numeric) : NAs introduced by coercion
colMeans(mydata_1, na.rm = TRUE)
#> A B C D
#> 3.875 37.250 149.725 2.800
The syntax mydata_1[] <- ... makes mydata_1 keep its data frame structure even though a list is being returned on the right-hand side.
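The difference between the two assignment forms can be seen in a small self-contained sketch. It reads the example data from a string with read.delim(text = ...) instead of Dummy.txt, and the variable names as_list and d are illustrative.

```r
txt <- "A|B|C|D\n1|2|1.9|5\n2.5|5|53|3\n4|48|49|0.4\n8|94|495|B6"
d <- read.delim(text = txt, sep = "|", header = TRUE, stringsAsFactors = FALSE)

# Plain assignment of the lapply() result gives a list, not a data frame.
as_list <- suppressWarnings(lapply(d, as.numeric))
class(as_list)   # "list"

# Assigning into d[] replaces the columns but keeps the data frame structure.
d[] <- suppressWarnings(lapply(d, as.numeric))
class(d)         # "data.frame"

means <- colMeans(d, na.rm = TRUE)
means
```

suppressWarnings() is used only to silence the expected coercion warning for 'B6'.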
The problem here is that as.numeric(as.character(mydata_1)) returns [1] NA NA NA NA.
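The reason for those NAs is that as.character() applied to a whole data frame turns each column into one piece of deparsed text, which as.numeric() cannot parse. A minimal sketch with made-up two-column data:

```r
d <- data.frame(A = c(1, 2.5), B = c(2, 5))

# Each column collapses to a single string such as "c(1, 2.5)".
as_text <- as.character(d)
as_text

# None of these strings is a number, so every element becomes NA.
res <- suppressWarnings(as.numeric(as_text))
res
```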
My suggestion would be to first go through all columns and coerce the types using sapply(), and then calculate the means of the columns:
library(magrittr)
mydata_1 %>%
sapply(., function(col) as.numeric(as.character(col))) %>%
colMeans(na.rm = TRUE)
This will return:
A B C D
3.875 37.250 149.725 2.800
Note: I am using magrittr to make use of the pipe (%>%) operator to chain the operations so you can check the output of every step.

How to remove '.' from column names in a dataframe?

My dataframe which I read from a csv file has column names like this
abc.def, ewf.asd.fkl, qqit.vsf.addw.coil
I want to remove the '.' from all the names and convert them to
abcdef, ewfasdfkl, qqitvsfaddwcoil.
I tried using the sub command sub(".","",colnames(dataframe)) but this command took out the first letter of each column name and the column names changed to
bc.def, wf.asd.fkl, qit.vsf.addw.coil
Does anyone know another command to do this? I can change the column names one by one, but I have a lot of files with 30 or more columns in each file.
Again, I want to remove the "." from all the colnames. I am trying to do this so I can use "sqldf" commands, which don't deal well with "."
Thank you for your help
1) sqldf can deal with names having dots in them if you quote the names:
library(sqldf)
d0 <- read.csv(text = "A.B,C.D\n1,2")
sqldf('select "A.B", "C.D" from d0')
giving:
A.B C.D
1 1 2
2) When reading the data using read.table or read.csv use the check.names=FALSE argument.
Compare:
Lines <- "A B,C D
1,2
3,4"
read.csv(text = Lines)
## A.B C.D
## 1 1 2
## 2 3 4
read.csv(text = Lines, check.names = FALSE)
## A B C D
## 1 1 2
## 2 3 4
however, in this example it still leaves a name that would have to be quoted in sqldf since the names have embedded spaces.
3) To simply remove the periods, if DF is a data frame:
names(DF) <- gsub(".", "", names(DF), fixed = TRUE)
or it might be nicer to convert the periods to underscores so that it is reversible:
names(DF) <- gsub(".", "_", names(DF), fixed = TRUE)
This last line could be alternatively done like this:
names(DF) <- chartr(".", "_", names(DF))
UPDATE dplyr 0.8.0
As of dplyr 0.8 funs() is soft deprecated, use formula notation.
a dplyr way to do this using stringr.
library(dplyr)
library(stringr)
data <- data.frame(abc.def = 1, ewf.asd.fkl = 2, qqit.vsf.addw.coil = 3)
renamed_data <- data %>%
rename_all(~str_replace_all(.,"\\.","_")) # note we have to escape the '.' character with \\
Make sure you install the packages with install.packages().
Remember that you have to escape the . character as \\. in a regex, which functions like str_replace_all use, because an unescaped . is a wildcard.
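The wildcard behaviour is easy to see with base R's sub() and gsub(), which also take a regex by default. A short sketch using two of the column names from the question:

```r
nms <- c("abc.def", "ewf.asd.fkl")

# Unescaped "." matches any character: sub() removes just the first one.
first_only <- sub(".", "", nms)

# Unescaped "." with gsub() matches every character, leaving empty strings.
everything <- gsub(".", "", nms)

# Escaping the dot, or using fixed = TRUE, removes only the literal dots.
escaped <- gsub("\\.", "", nms)
literal <- gsub(".", "", nms, fixed = TRUE)

first_only
everything
escaped
literal
```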
To replace all the dots in the names you'll need to use gsub, rather than sub, which will only replace the first occurrence.
This should work.
test <- data.frame(abc.def = NA, ewf.asd.fkl = NA, qqit.vsf.addw.coil = NA)
names(test) <- gsub( ".", "", names(test), fixed = TRUE)
test
abcdef ewfasdfkl qqitvsfaddwcoil
1 NA NA NA
You can also try:
names(df) = gsub(pattern = ".", replacement = "", x = names(df), fixed = TRUE)
