exclude part of a character string in a column in a dataframe - r

I have a df that has several ids separated with an underscore, like so:
df1:
id v1
1001 2
10002_10002 19
I want the underscore removed and anything after the underscore, like so:
df1:
id v1
1001 2
10002 19
I tried this code, but it's giving me a list, not a df. Can someone please help?
df2 <- strsplit(df1$id, split='_', fixed=TRUE)

You need to access the contents of the list, and then retain the first of two elements resulting from the split:
df1$id <- strsplit(df1$id, "_", fixed=TRUE)[[1]][1]
You could also use sub here:
df1$id <- sub("_.*$", "", df1$id)

Here a solution with the tidyverse/stringr:
library(tidyverse)
my_df <- data.frame(
stringsAsFactors = FALSE,
id = c("1001", "10002_10002"),
v1 = c(2L, 19L)
)
my_df %>%
mutate(id=str_remove(id, regex("(_.*)")))
#> id v1
#> 1 1001 2
#> 2 10002 19
Created on 2020-12-03 by the reprex package (v0.3.0)

I like to use the stringr package
require(stringr)
df1 <- data.frame(id = c("1001","1002_10002"), v1 = c(2,19))
df1$id <- str_remove(df1$id, pattern = "_.+")

We can use word from stringr
library(stringr)
library(dplyr)
df1 %>%
mutate(id = word(id, 1, sep="_"))
# id v1
#1 1001 2
#2 10002 19
Another option is trimws from base R
df1$id <- trimws(df1$id, whitespace = "_.*")
data
df1 <- structure(list(id = c("1001", "10002_10002"),
v1 = c(2L, 19L)), class = "data.frame", row.names = c(NA,
-2L))

Related

Group same value and concat the value in another column in R [duplicate]

This question already has answers here:
Collapse text by group in data frame [duplicate]
(2 answers)
Closed 2 years ago.
I'm trying to sequence a dataset, and I'm a little lost with this.
I would like to get this result and save it to a csv file:
Any idea on how to do this?
Thanks in advance!
With tidyverse you can group_by your ID and then summarise. Use collapse to indicate what you want to use a separator (in this case, semicolon) for paste.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(MovieCode = paste(MovieCode, collapse = ";"))
Or in base R, use aggregate:
aggregate(MovieCode ~ ID, df, paste, collapse = ";")
Output
ID MovieCode
1 ABC A10;A12;A13
2 BDE A11
3 CDE A14
Data
df <- structure(list(ID = c("ABC", "BDE", "ABC", "ABC", "CDE"), MovieCode = c("A10",
"A11", "A12", "A13", "A14")), class = "data.frame", row.names = c(NA,
-5L))
You can do it easily with data table:
library(data.table)
#create table
ID <- c("ABC","BDE", "ABC","ABC","CDE")
Movie_Code <- c("A10","A11","A12","A13","A14")
df <- data.frame(ID, Movie_Code)
#convert to data table
df<-data.table(df)
setkey(df, ID)
final_tab<-df[, paste(unique(Movie_Code), collapse=", "),by=ID]
#CSV: comma separated values
Output:
ID V1
1: ABC A10, A12, A13
2: BDE A11
3: CDE A14
From: Daniel Bachen
Here's a base R solution:
Illustrative data:
df <- data.frame(
v1 = c("ABA", "BCB", "ABA", "BCB", "DCD"),
v2 = letters[1:5]
)
Solution:
df1 <- data.frame(
v1_unique = unique(df$v1),
v2_paste = tapply(df$v2, df$v1, paste0, collapse = ";"), row.names = NULL)
Result:
df1
v1_unique v2_paste
1 ABA a;c
2 BCB b;d
3 DCD e

Creating another column in R

This is my current data set
I want to take the numbers after "narrow" (e.g. 20) and make another vector. Any idea how I can do that?
We can use sub to remove the substring "Narrow", followed by a , and zero or more spaces (\\s+), replace with blank ("") and convert to numeric
df1$New <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus))
You could use separate to separate the stimulus column into two vectors.
library(tidyr)
df %>%
separate(col = stimulus,
sep = ", ",
into = c("Text","Number"))
Maybe you can try the code below, using regmatches
df$new <- with(df, as.numeric(unlist(regmatches(stimulus,gregexpr("\\d+",stimulus)))))
You want separate from the tidyr package.
library(dplyr)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
#> A B
#> 1 <NA> <NA>
#> 2 a b
#> 3 a d
#> 4 b c

How to rename individual column in dataframe [duplicate]

I know if I have a data frame with more than 1 column, then I can use
colnames(x) <- c("col1","col2")
to rename the columns. How to do this if it's just one column?
Meaning a vector or data frame with only one column.
Example:
trSamp <- data.frame(sample(trainer$index, 10000))
head(trSamp )
# sample.trainer.index..10000.
# 1 5907862
# 2 2181266
# 3 7368504
# 4 1949790
# 5 3475174
# 6 6062879
ncol(trSamp)
# [1] 1
class(trSamp)
# [1] "data.frame"
class(trSamp[1])
# [1] "data.frame"
class(trSamp[,1])
# [1] "numeric"
colnames(trSamp)[2] <- "newname2"
# Error in names(x) <- value :
# 'names' attribute [2] must be the same length as the vector [1]
This is a generalized way in which you do not have to remember the exact location of the variable:
# df = dataframe
# old.var.name = The name you don't like anymore
# new.var.name = The name you want to get
names(df)[names(df) == 'old.var.name'] <- 'new.var.name'
This code pretty much does the following:
names(df) looks into all the names in the df
[names(df) == old.var.name] extracts the variable name you want to check
<- 'new.var.name' assigns the new variable name.
colnames(trSamp)[2] <- "newname2"
attempts to set the second column's name. Your object only has one column, so the command throws an error. This should be sufficient:
colnames(trSamp) <- "newname2"
colnames(df)[colnames(df) == 'oldName'] <- 'newName'
This is an old question, but it is worth noting that you can now use setnames from the data.table package.
library(data.table)
setnames(DF, "oldName", "newName")
# or since the data.frame in question is just one column:
setnames(DF, "newName")
# And for reference's sake, in general (more than once column)
nms <- c("col1.name", "col2.name", etc...)
setnames(DF, nms)
This can also be done using Hadley's plyr package, and the rename function.
library(plyr)
df <- data.frame(foo=rnorm(1000))
df <- rename(df,c('foo'='samples'))
You can rename by the name (without knowing the position) and perform multiple renames at once. After doing a merge, for example, you might end up with:
letterid id.x id.y
1 70 2 1
2 116 6 5
3 116 6 4
4 116 6 3
5 766 14 9
6 766 14 13
Which you can then rename in one step using:
letters <- rename(letters,c("id.x" = "source", "id.y" = "target"))
letterid source target
1 70 2 1
2 116 6 5
3 116 6 4
4 116 6 3
5 766 14 9
6 766 14 13
I think the best way of renaming columns is by using the dplyr package like this:
require(dplyr)
df = rename(df, new_col01 = old_col01, new_col02 = old_col02, ...)
It works the same for renaming one or many columns in any dataset.
I find that the most convenient way to rename a single column is using dplyr::rename_at :
library(dplyr)
cars %>% rename_at("speed",~"new") %>% head
cars %>% rename_at(vars(speed),~"new") %>% head
cars %>% rename_at(1,~"new") %>% head
# new dist
# 1 4 2
# 2 4 10
# 3 7 4
# 4 7 22
# 5 8 16
# 6 9 10
works well in pipe chaines
convenient when names are stored in variables
works with a name or an column index
clear and compact
I like the next style for rename dataframe column names one by one.
colnames(df)[which(colnames(df) == 'old_colname')] <- 'new_colname'
where
which(colnames(df) == 'old_colname')
returns by the index of the specific column.
Let df be the dataframe you have with col names myDays and temp.
If you want to rename "myDays" to "Date",
library(plyr)
rename(df,c("myDays" = "Date"))
or with pipe, you can
dfNew <- df %>%
plyr::rename(c("myDays" = "Date"))
Try:
colnames(x)[2] <- 'newname2'
This is likely already out there, but I was playing with renaming fields while searching out a solution and tried this on a whim. Worked for my purposes.
Table1$FieldNewName <- Table1$FieldOldName
Table1$FieldOldName <- NULL
Edit begins here....
This works as well.
df <- rename(df, c("oldColName" = "newColName"))
You can use the rename.vars in the gdata package.
library(gdata)
df <- rename.vars(df, from = "oldname", to = "newname")
This is particularly useful where you have more than one variable name to change or you want to append or pre-pend some text to the variable names, then you can do something like:
df <- rename.vars(df, from = c("old1", "old2", "old3",
to = c("new1", "new2", "new3"))
For an example of appending text to a subset of variables names see:
https://stackoverflow.com/a/28870000/180892
You could also try 'upData' from 'Hmisc' package.
library(Hmisc)
trSamp = upData(trSamp, rename=c(sample.trainer.index..10000. = 'newname2'))
If you know that your dataframe has only one column, you can use:
names(trSamp) <- "newname2"
The OP's question has been well and truly answered. However, here's a trick that may be useful in some situations: partial matching of the column name, irrespective of its position in a dataframe:
Partial matching on the name:
d <- data.frame(name1 = NA, Reported.Cases..WHO..2011. = NA, name3 = NA)
## name1 Reported.Cases..WHO..2011. name3
## 1 NA NA NA
names(d)[grepl("Reported", names(d))] <- "name2"
## name1 name2 name3
## 1 NA NA NA
Another example: partial matching on the presence of "punctuation":
d <- data.frame(name1 = NA, Reported.Cases..WHO..2011. = NA, name3 = NA)
## name1 Reported.Cases..WHO..2011. name3
## 1 NA NA NA
names(d)[grepl("[[:punct:]]", names(d))] <- "name2"
## name1 name2 name3
## 1 NA NA NA
These were examples I had to deal with today, I thought might be worth sharing.
I would simply change a column name to the dataset with the new name I want with the following code:
names(dataset)[index_value] <- "new_col_name"
I found colnames() argument easier
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/row%2Bcolnames
select some column from the data frame
df <- data.frame(df[, c( "hhid","b1005", "b1012_imp", "b3004a")])
and rename the selected column in order,
colnames(df) <- c("hhid", "income", "cost", "credit")
check the names and the values to be sure
names(df);head(df)
I would simply add a new column to the data frame with the name I want and get the data for it from the existing column. like this:
dataf$value=dataf$Article1Order
then I remove the old column! like this:
dataf$Article1Order<-NULL
This code might seem silly! But it works perfectly...
We can use rename_with to rename columns with a function (stringr functions, for example).
Consider the following data df_1:
df_1 <- data.frame(
x = replicate(n = 3, expr = rnorm(n = 3, mean = 10, sd = 1)),
y = sample(x = 1:2, size = 10, replace = TRUE)
)
names(df_1)
#[1] "x.1" "x.2" "x.3" "y"
Rename all variables with dplyr::everything():
library(tidyverse)
df_1 %>%
rename_with(.data = ., .cols = everything(.),
.fn = str_replace, pattern = '.*',
replacement = str_c('var', seq_along(.), sep = '_')) %>%
names()
#[1] "var_1" "var_2" "var_3" "var_4"
Rename by name particle with some dplyr verbs (starts_with, ends_with, contains, matches, ...).
Example with . (x variables):
df_1 %>%
rename_with(.data = ., .cols = contains('.'),
.fn = str_replace, pattern = '.*',
replacement = str_c('var', seq_along(.), sep = '_')) %>%
names()
#[1] "var_1" "var_2" "var_3" "y"
Rename by class with many functions of class test, like is.integer, is.numeric, is.factor...
Example with is.integer (y):
df_1 %>%
rename_with(.data = ., .cols = is.integer,
.fn = str_replace, pattern = '.*',
replacement = str_c('var', seq_along(.), sep = '_')) %>%
names()
#[1] "x.1" "x.2" "x.3" "var_1"
The warning:
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
longer object length is not a multiple of shorter object length
2: In names[cols] <- .fn(names[cols], ...) :
number of items to replace is not a multiple of replacement length
It is not relevant, as it is just an inconsistency of seq_along(.) with the replace function.
library(dplyr)
rename(data, de=de.y)

Removing suffix from column names using rename_all?

I have a data frame with a number of columns in a form var1.mean, var2.mean. I would like to strip the suffix ".mean" from all columns that contain it. I tried using rename_all in conjunction with regex in a pipe but could not come up with a correct syntax. Any suggestions?
If you want to use the dplyr package, I'd recommend using the rename_at function.
Dframe <- data.frame(var1.mean = rnorm(10),
var2.mean = rnorm(10),
var1.sd = runif(10))
library(dplyr)
Dframe %>%
rename_at(.vars = vars(ends_with(".mean")),
.funs = funs(sub("[.]mean$", "", .)))
Using new dplyr:
df %>% rename_with(~str_remove(., '.mean'))
We can use rename_all
df1 %>%
rename_all(.funs = funs(sub("\\..*", "", names(df1)))) %>%
head(2)
# var1 var2 var3 var1 var2 var3
#1 -0.5458808 -0.09411013 0.5266526 -1.3546636 0.08314367 0.5916817
#2 0.5365853 -0.08554095 -1.0736261 -0.9608088 2.78494703 -0.2883407
NOTE: If the column names are duplicated, it needs to be made unique with make.unique
data
set.seed(24)
df1 <- as.data.frame(matrix(rnorm(25*6), 25, 6, dimnames = list(NULL,
paste0(paste0("var", 1:3), rep(c(".mean", ".sd"), each = 3)))))
You may use gsub.
colnames(df) <- gsub('.mean','',colnames(df))
The below works for me
dat <- data.frame(var1.mean = 1, var2.mean = 2)
col_old <- colnames(dat)
col_new <- gsub(pattern = ".mean",replacement = "", x = col_old)
colnames(dat) <- col_new
You can replace this names using stringi package stri_replace_last_regex function like this:
require(stringi)
df <- data.frame(1,2,3,4,5,6)
names(df) <- stri_paste("var",1:6,c(".mean",".sd"))
df
## var1.mean var2.sd var3.mean var4.sd var5.mean var6.sd
##1 1 2 3 4 5 6
names(df) <- stri_replace_last_regex(names(df),"\\.mean$","")
df
## var1 var2.sd var3 var4.sd var5 var6.sd
##1 1 2 3 4 5 6
The regex is \\.mean$ because you need to escape dot character (it has special meaning in regex) and also you can add $ sign at the end to ensure that you replace only names that ENDS with this pattern (if the .mean text is in the middle of string then it wan't be replaced).
I would use stringsplit:
x <- as.data.frame(matrix(runif(16), ncol = 4))
colnames(x) <- c("var1.mean", "var2.mean", "var3.mean", "something.else")
colnames(x) <- strsplit(colnames(x), split = ".mean")
colnames(x)
Lot's of quick answers have been given, the most intuitive, to me would be:
Dframe <- data.frame(var1.mean = rnorm(10), #Create Example
var2.mean = rnorm(10),
var1.sd = runif(10))
names(Dframe) <- gsub("[.]mean","",names(Dframe)) #remove ".mean"

How to rename a single column in a data.frame?

I know if I have a data frame with more than 1 column, then I can use
colnames(x) <- c("col1","col2")
to rename the columns. How to do this if it's just one column?
Meaning a vector or data frame with only one column.
Example:
trSamp <- data.frame(sample(trainer$index, 10000))
head(trSamp )
# sample.trainer.index..10000.
# 1 5907862
# 2 2181266
# 3 7368504
# 4 1949790
# 5 3475174
# 6 6062879
ncol(trSamp)
# [1] 1
class(trSamp)
# [1] "data.frame"
class(trSamp[1])
# [1] "data.frame"
class(trSamp[,1])
# [1] "numeric"
colnames(trSamp)[2] <- "newname2"
# Error in names(x) <- value :
# 'names' attribute [2] must be the same length as the vector [1]
This is a generalized way in which you do not have to remember the exact location of the variable:
# df = dataframe
# old.var.name = The name you don't like anymore
# new.var.name = The name you want to get
names(df)[names(df) == 'old.var.name'] <- 'new.var.name'
This code pretty much does the following:
names(df) looks into all the names in the df
[names(df) == old.var.name] extracts the variable name you want to check
<- 'new.var.name' assigns the new variable name.
colnames(trSamp)[2] <- "newname2"
attempts to set the second column's name. Your object only has one column, so the command throws an error. This should be sufficient:
colnames(trSamp) <- "newname2"
colnames(df)[colnames(df) == 'oldName'] <- 'newName'
This is an old question, but it is worth noting that you can now use setnames from the data.table package.
library(data.table)
setnames(DF, "oldName", "newName")
# or since the data.frame in question is just one column:
setnames(DF, "newName")
# And for reference's sake, in general (more than once column)
nms <- c("col1.name", "col2.name", etc...)
setnames(DF, nms)
This can also be done using Hadley's plyr package, and the rename function.
library(plyr)
df <- data.frame(foo=rnorm(1000))
df <- rename(df,c('foo'='samples'))
You can rename by the name (without knowing the position) and perform multiple renames at once. After doing a merge, for example, you might end up with:
letterid id.x id.y
1 70 2 1
2 116 6 5
3 116 6 4
4 116 6 3
5 766 14 9
6 766 14 13
Which you can then rename in one step using:
letters <- rename(letters,c("id.x" = "source", "id.y" = "target"))
letterid source target
1 70 2 1
2 116 6 5
3 116 6 4
4 116 6 3
5 766 14 9
6 766 14 13
I think the best way of renaming columns is by using the dplyr package like this:
require(dplyr)
df = rename(df, new_col01 = old_col01, new_col02 = old_col02, ...)
It works the same for renaming one or many columns in any dataset.
I find that the most convenient way to rename a single column is using dplyr::rename_at :
library(dplyr)
cars %>% rename_at("speed",~"new") %>% head
cars %>% rename_at(vars(speed),~"new") %>% head
cars %>% rename_at(1,~"new") %>% head
# new dist
# 1 4 2
# 2 4 10
# 3 7 4
# 4 7 22
# 5 8 16
# 6 9 10
works well in pipe chaines
convenient when names are stored in variables
works with a name or an column index
clear and compact
I like the next style for rename dataframe column names one by one.
colnames(df)[which(colnames(df) == 'old_colname')] <- 'new_colname'
where
which(colnames(df) == 'old_colname')
returns by the index of the specific column.
Let df be the dataframe you have with col names myDays and temp.
If you want to rename "myDays" to "Date",
library(plyr)
rename(df,c("myDays" = "Date"))
or with pipe, you can
dfNew <- df %>%
plyr::rename(c("myDays" = "Date"))
Try:
colnames(x)[2] <- 'newname2'
This is likely already out there, but I was playing with renaming fields while searching out a solution and tried this on a whim. Worked for my purposes.
Table1$FieldNewName <- Table1$FieldOldName
Table1$FieldOldName <- NULL
Edit begins here....
This works as well.
df <- rename(df, c("oldColName" = "newColName"))
You can use the rename.vars in the gdata package.
library(gdata)
df <- rename.vars(df, from = "oldname", to = "newname")
This is particularly useful where you have more than one variable name to change or you want to append or pre-pend some text to the variable names, then you can do something like:
df <- rename.vars(df, from = c("old1", "old2", "old3",
to = c("new1", "new2", "new3"))
For an example of appending text to a subset of variables names see:
https://stackoverflow.com/a/28870000/180892
You could also try 'upData' from 'Hmisc' package.
library(Hmisc)
trSamp = upData(trSamp, rename=c(sample.trainer.index..10000. = 'newname2'))
If you know that your dataframe has only one column, you can use:
names(trSamp) <- "newname2"
The OP's question has been well and truly answered. However, here's a trick that may be useful in some situations: partial matching of the column name, irrespective of its position in a dataframe:
Partial matching on the name:
d <- data.frame(name1 = NA, Reported.Cases..WHO..2011. = NA, name3 = NA)
## name1 Reported.Cases..WHO..2011. name3
## 1 NA NA NA
names(d)[grepl("Reported", names(d))] <- "name2"
## name1 name2 name3
## 1 NA NA NA
Another example: partial matching on the presence of "punctuation":
d <- data.frame(name1 = NA, Reported.Cases..WHO..2011. = NA, name3 = NA)
## name1 Reported.Cases..WHO..2011. name3
## 1 NA NA NA
names(d)[grepl("[[:punct:]]", names(d))] <- "name2"
## name1 name2 name3
## 1 NA NA NA
These were examples I had to deal with today, I thought might be worth sharing.
I would simply change a column name to the dataset with the new name I want with the following code:
names(dataset)[index_value] <- "new_col_name"
I found colnames() argument easier
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/row%2Bcolnames
select some column from the data frame
df <- data.frame(df[, c( "hhid","b1005", "b1012_imp", "b3004a")])
and rename the selected column in order,
colnames(df) <- c("hhid", "income", "cost", "credit")
check the names and the values to be sure
names(df);head(df)
I would simply add a new column to the data frame with the name I want and get the data for it from the existing column. like this:
dataf$value=dataf$Article1Order
then I remove the old column! like this:
dataf$Article1Order<-NULL
This code might seem silly! But it works perfectly...
We can use rename_with to rename columns with a function (stringr functions, for example).
Consider the following data df_1:
df_1 <- data.frame(
x = replicate(n = 3, expr = rnorm(n = 3, mean = 10, sd = 1)),
y = sample(x = 1:2, size = 10, replace = TRUE)
)
names(df_1)
#[1] "x.1" "x.2" "x.3" "y"
Rename all variables with dplyr::everything():
library(tidyverse)
df_1 %>%
rename_with(.data = ., .cols = everything(.),
.fn = str_replace, pattern = '.*',
replacement = str_c('var', seq_along(.), sep = '_')) %>%
names()
#[1] "var_1" "var_2" "var_3" "var_4"
Rename by name particle with some dplyr verbs (starts_with, ends_with, contains, matches, ...).
Example with . (x variables):
df_1 %>%
rename_with(.data = ., .cols = contains('.'),
.fn = str_replace, pattern = '.*',
replacement = str_c('var', seq_along(.), sep = '_')) %>%
names()
#[1] "var_1" "var_2" "var_3" "y"
Rename by class with many functions of class test, like is.integer, is.numeric, is.factor...
Example with is.integer (y):
df_1 %>%
rename_with(.data = ., .cols = is.integer,
.fn = str_replace, pattern = '.*',
replacement = str_c('var', seq_along(.), sep = '_')) %>%
names()
#[1] "x.1" "x.2" "x.3" "var_1"
The warning:
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
longer object length is not a multiple of shorter object length
2: In names[cols] <- .fn(names[cols], ...) :
number of items to replace is not a multiple of replacement length
It is not relevant, as it is just an inconsistency of seq_along(.) with the replace function.
library(dplyr)
rename(data, de=de.y)

Resources