gsub not working on colnames? - r

I have a dataframe called df with column names in the following format:
"A Agarwal" "A Agrawal" "A Balachandran"
"A.Brush" "A.Casavant" "A.Chakrabarti"
They are first initial and last name. However, some of them are separated with a space, while others use a period. I need to replace the space with a period. (The first column is called author.ID, and I excluded it from the following code.)
I have tried the following code, but the resulting colnames still do not change.
colnames(df[, -1]) = gsub("\\s", "\\.", colnames(df[, -1]))
colnames(df[, -1]) = gsub(" ", ".", colnames(df[, -1]))
What am I doing wrong?
Thanks.

Note that df[, -1] returns a copy of all rows and columns except the first column, so the new names are applied to that copy rather than to df. In order to modify the column names, you should use colnames(df).
To replace the first literal space with a dot, use
colnames(df) <- sub(" ", ".", colnames(df), fixed=TRUE)
If there can be more than one whitespace character, use a regex:
colnames(df) <- sub("\\s+", ".", colnames(df))
If you need to replace all whitespace sequences in the column names with a single dot, use gsub:
colnames(df) <- gsub("\\s+", ".", colnames(df))
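For a quick check, here is a minimal sketch on a made-up data frame (check.names = FALSE keeps the spaces in the names); indexing the names vector with [-1] skips the author.ID column:
df <- data.frame(author.ID = 1:2,
                 `A Agarwal` = c(1, 2),
                 `A Agrawal` = c(3, 4),
                 check.names = FALSE)
colnames(df)[-1] <- gsub("\\s+", ".", colnames(df)[-1])
colnames(df)
[1] "author.ID" "A.Agarwal" "A.Agrawal"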

Related

How to replace commas in a non-numerical list in R?

I have a data.frame in R (which is also a list). I want to replace the "," with "." in the numbers. The data.frame is not numeric, but I think it has to be in order to change the decimal separator.
I tried a lot, but nothing works. I do not want to rearrange or otherwise manipulate my data.frame. All I want is to get rid of the "," in the decimal numbers.
df <- data.frame("row1" = c("2,3","6"), "row2" = c("56,0","56,8"), "row3" = c("1","0"))
#trials to make df numeric and change from , to .
as.numeric(str_replace_all(df,",","."))
as.numeric(unlist(df[ ,2:3]))
lapply(df, as.numeric)
as.numeric(gsub(pattern = ",",replacement = ".",df[ ,2:3]))
as.numeric(df$a)
What else can I do about this nasty problem?
I guess you read the data in incorrectly (you can specify dec = "," while reading the data).
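For example, a minimal sketch assuming the data comes from a semicolon-separated file (hypothetical file name) that uses "," as the decimal mark:
# read.csv2() defaults to sep = ";" and dec = ","; for other layouts,
# pass dec = "," to read.table() or read.delim() instead
df <- read.csv2("data.csv")
str(df)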
You can use gsub to replace the commas (,) with dots (.) and then convert the columns to numeric.
df[] <- lapply(df, function(x) as.numeric(gsub(',', '.', x)))
We can also use mutate_all
library(dplyr)
library(stringr)
df %>%
  mutate_all(~ as.numeric(str_replace(., ",", ".")))

Adding a period between characters in a column in R

species <- c("Dacut","Hhyde","Faffi","Dmelan","Jrobusta")
leg <- c(1,2,3,4,5)
df <- data.frame(species, leg)
I am trying to add a period (".") between the first and second letter of every character in the first column of a data frame.
#End Goal:
#D.acut
#H.hyde
#F.affi
#D.melan
#J.robusta
Does anyone know of any code I can use for this issue?
Using substr() to split the string at the positions:
species <- c("Dacut","Hhyde","Faffi","Dmelan","Jrobusta")
leg <- c(1,2,3,4,5)
df <- data.frame(species, leg, stringsAsFactors = FALSE)
df$species <- paste0(
  substr(df$species, 1, 1),
  ".",
  substr(df$species, 2, nchar(df$species))
)
df$species
The first substr() extracts characters 1 to 1 (the initial), the second extracts characters 2 to the last character in the string. With paste0() we can put the "." in between.
Or sub() with a back-reference:
df$species <- sub("(^.)", "\\1.", df$species)
(^.) is the first character in the string, grouped with (). sub() replaces the first match with the back-reference to the group (\\1) followed by the ".".
Using sub, we can match the zero-width lookbehind (?<=^.) and replace it with a dot. This has the effect of inserting a dot at the second position.
df$species <- sub("(?<=^.)", "\\.", df$species, perl=TRUE)
df$species
[1] "D.acut" "H.hyde" "F.affi" "D.melan" "J.robusta"
Note: if, for some reason, you only want to do this replacement when the first character of the species name is an actual capital letter, then match on the following pattern instead:
(?<=^[A-Z])
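Put together, the capital-letter-only variant would be (a sketch, starting from the original df$species):
# inserts "." only after a leading capital letter
df$species <- sub("(?<=^[A-Z])", ".", df$species, perl = TRUE)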

replace substring in subset of column names in R

I have a dataframe where I want to replace a subset of the column names with new names created by prepending an identifier to the old ones. For example, to prepend the names of columns 3:7 with the string "TEST", I tried the following.
What am I missing here?
# Make a test df
df <- data.frame(replicate(10,sample(0:1,100,rep=TRUE)))
#Subsetting works fine
colnames(df[,3:7])
#sub works fine
sub("^", "TEST.", colnames(df[,3:7]))
#replacing the subset of column names with sub does not
colnames(df[,3:7]) <- sub("^", "TEST.", colnames(df[,3:7]))
colnames(df)
#Also doesn't work
colnames(df[,3:7]) <- paste("TEST.", colnames(df[,3:7]), sep ="")
colnames(df)
You need to subset the vector of column names itself, with the indices outside the parentheses (colnames(df)[3:7], not colnames(df[, 3:7])):
colnames(df)[3:7] <- sub("^", "TEST.", colnames(df)[3:7])
You could also:
colnames(df)[3:7] <- paste0("TEST.", colnames(df)[3:7])
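A quick check with the test data frame from the question (a sketch):
df <- data.frame(replicate(10, sample(0:1, 100, rep = TRUE)))
colnames(df)[3:7] <- paste0("TEST.", colnames(df)[3:7])
colnames(df)   # X1, X2, TEST.X3, ..., TEST.X7, X8, X9, X10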

Replace special character in data frame

I have a dataframe which contains, in different cells, a special character sequence which I know. An example of the structure:
df = data.frame(col_1 = c("21 myspec^ch2 12",NA),
col_2 = c("1 myspec^ch2 4","4 myspec^ch2 212"))
The string is myspec^ch2 (with a space on each side) and I would like to replace it with -. An example of the expected output:
df = data.frame(col_1 = c("21-12",NA),
col_2 = c("1-4","4-212"))
I tried this but it is not working:
df [ df == " myspec^ch2 " ] <- "-"
To get gsub to work on the whole data frame, use apply and assign the result back into df:
df[] <- apply(df, 2, function(x) gsub(" myspec\\^ch2 ", "-", x))
You really want to do a regex-style substitution here. However, in regex, ^ is seen as the beginning of the line (rather than a literal caret). So you can do something like this (using the stringr package):
library(dplyr)
library(stringr)
fixed_df <- df %>%
  mutate_all(funs(str_replace_all(., " myspec\\^ch2 ", "-")))
Note the double backslash in front of the caret: it escapes the caret and tells the regex engine to interpret it literally, rather than as the beginning of the line.
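If you would rather sidestep the escaping question entirely, base R's gsub() with fixed = TRUE treats the pattern as a literal string (a sketch on the df from the question):
# fixed = TRUE disables regex interpretation, so the caret needs no escaping
df[] <- lapply(df, function(x) gsub(" myspec^ch2 ", "-", x, fixed = TRUE))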

Whitespace string can't be replaced with NA in R

I want to substitute whitespace with NA. A simple way could be df[df == ""] <- NA, and that works for most of the cells of my data frame... but not for all of them!
I have the following code:
library(rvest)
library(dplyr)
library(tidyr)
#Read website
htmlpage <- read_html("http://www.soccervista.com/results-Liga_MX_Apertura-2016_2017-844815.html")
#Extract table
df <- htmlpage %>% html_nodes("table") %>% html_table()
df <- as.data.frame(df)
#Set whitespaces into NA's
df[df == ""] <- NA
I figured out that some cells have a little whitespace between the quotation marks:
df[11,1]
[1] " "
So my solution was to do the following: df[df == " "] <- NA
However, the problem is still there and the little whitespace remains! I thought trimming the whitespace would work, but it didn't...
#Trim
df[,c(1:10)] <- sapply(df[,c(1:10)], trimws)
However, the problem won't go away.
Any ideas?
We need to use lapply instead of sapply, as sapply returns a matrix instead of a list, and that can cause problems when the result is assigned back.
df[1:10] <- lapply(df[1:10], trimws)
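The difference in return type is easy to see (a quick illustration):
str(sapply(df[1:2], trimws))   # a character matrix
str(lapply(df[1:2], trimws))   # a list of character vectors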
Another option, if we have spaces like " ", is to use gsub to replace those spaces with "":
df[1:10] <- lapply(df[1:10], function(x) gsub("^\\s+|\\s+$", "", x))
and then change the "" to NA
df[df == ""] <- NA
Or, instead of doing the two replacements, we can do this in one go and change the class with type.convert:
df[] <- lapply(df, function(x)
  type.convert(replace(x, grepl("^\\s*$", trimws(x)), NA), as.is = TRUE))
NOTE: We don't have to specify column indices when all the columns are looped over.
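The type.convert() step is what turns the cleaned character columns back into numeric ones, for example:
type.convert(c("1", "2", NA), as.is = TRUE)
[1]  1  2 NA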
I just spent some time trying to determine a method usable in a pipe.
Here is my method:
df <- df %>%
  dplyr::mutate_all(funs(sub("^\\s*$", NA, .)))
Hope this helps the next searcher.
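On current dplyr versions, where mutate_all() is superseded, the same blank-to-NA idea can be written with across(); this sketch uses replace() + grepl(), as in the base answer above:
library(dplyr)
df <- df %>%
  mutate(across(everything(), ~ replace(.x, grepl("^\\s*$", .x), NA)))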
