Unite two numerical columns into one in R - r

I have a df1
X Y
CA23-11 002 0033
CA67-55 011 0245
I would like to create df2
Z
CA2311-2-33
CA6755-11-245
My code to do this is
df2 <- df %>% unite(Z, X:Y, remove = "0", sep="-")
and my error is: Error in if (remove) { : argument is not interpretable as logical
Any assistance in this would be appreciated. Thanks!

We could use base R to return the expected output. Read the 'Y' column with read.table so that it is split at the whitespace into numeric columns (which drops the leading zeros), cbind the result with the 'X' column after removing its -, and format everything with sprintf
data.frame(Z = do.call(sprintf, c(fmt = '%s-%d-%d',
cbind(sub("-", "", df1$X),
read.table(text = df1$Y, header = FALSE)))))
-output
Z
1 CA2311-2-33
2 CA6755-11-245
Or using tidyverse
We separate the 'Y' by splitting at space, and convert the type to numeric
Remove the - in 'X' - str_remove
Then unite the 3 columns together
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
separate(Y, into = c("Y1", "Y2"), convert = TRUE) %>%
mutate(X = str_remove(X, '-')) %>%
unite(Z, X, Y1, Y2, sep= '-')
Z
1 CA2311-2-33
2 CA6755-11-245
data
df1 <- structure(list(X = c("CA23-11", "CA67-55"), Y = c("002 0033",
"011 0245")), class = "data.frame", row.names = c(NA, -2L))

Here is another solution using built-in functions:
df2 <- data.frame(Z = paste0(sub("-", "", df1$X), gsub("^0*| 0*", "-", df1$Y)))
df2
# Z
# 1 CA2311-2-33
# 2 CA6755-11-245
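As for the error in the question: unite() expects a logical value for remove (TRUE or FALSE, controlling whether the input columns are dropped), so remove = "0" cannot be evaluated inside an if() condition. A minimal base R sketch of the same failure, with no tidyr involved (the names msg and status are only illustrative):

```r
# if() needs a value that can be read as TRUE or FALSE; the string "0" cannot
msg <- tryCatch(
  if ("0") "dropped" else "kept",
  error = function(e) conditionMessage(e)
)
msg

# A real logical works as expected
status <- if (FALSE) "dropped" else "kept"
status
```

Stripping the leading zeros is therefore a separate step from uniting the columns, which is what the answers above do.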

Related

How do I get the column number from a dataframe which contains specific strings?

I have a data frame df with 7 columns and I have a list z containing multiple strings.
I want a dataframe containing only the columns in df which contain a string from z.
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
How do I get the column number of the z strings in df? Or how do I get a dataframe with only the columns which contains the z strings.
What I want is:
print(df)
"a_means" "c_m" "f_m"
What I tried:
match(a, names(df)
and
df[,which(colnames(df) %in% colnames(df[ ,grepl(z,names(df)])]
You can use:
df[,match(z, substring(colnames(df), 1, 3))]
With base R:
z <- paste(z, collapse = "|")
df[, grepl(z, names(df))] # you could use grep as well
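Since the question also asks for the column numbers, here is a minimal sketch of the grep variant mentioned above. It assumes a data frame whose columns are actually named a_means, b_means and so on, which is not what the data.frame() call in the question produces; pattern, idx and subset_df are illustrative names.

```r
# Example data with properly named columns (an assumption; the question's
# data.frame() call produces names like "X.a_means." instead)
df <- data.frame(a_means = 1, b_means = 2, c_means = 3, d_means = 4,
                 e_means = 5, f_means = 6, g_means = 7)
z <- c("a_m", "c_m", "f_m")

# grep() returns the positions of the matching names
pattern <- paste(z, collapse = "|")
idx <- grep(pattern, names(df))
idx

# Those positions can then be used to subset; drop = FALSE keeps a data frame
subset_df <- df[, idx, drop = FALSE]
names(subset_df)
```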
Combine the search patterns and use that as a pattern for stringr::str_detect() function.
library(dplyr)
library(stringr)
df <- data.frame(a_means = "a_means",
b_means = "b_means",
c_means = "c_means",
d_means = "d_means",
e_means = "e_means",
f_means = "f_means",
g_means = "g_means"
)
z <- c("a_m","c_m","f_m")
z <- paste(z, collapse = "|")
df %>% select_if(str_detect(names(df), z))
#> a_means c_means f_means
#> 1 a_means c_means f_means
You can simply do this:
library(dplyr)
df %>%
select(contains(z))
Check out help("starts_with"). You can also match to a starting prefix with starts_with() among other things.
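A short sketch of the starts_with() variant, assuming dplyr is installed and that the columns are named a_means, b_means, etc. (by_prefix is an illustrative name). Note that contains() and starts_with() match literally, so z should stay a character vector here rather than a collapsed "a_m|c_m|f_m" pattern.

```r
library(dplyr)

df <- data.frame(a_means = 1, b_means = 2, c_means = 3, d_means = 4,
                 e_means = 5, f_means = 6, g_means = 7)
z <- c("a_m", "c_m", "f_m")  # a plain vector of prefixes, not a regex

# starts_with() accepts several strings and keeps columns matching any of them
by_prefix <- df %>% select(starts_with(z))
names(by_prefix)
```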
You can use select and matches to subset the columns based on z
library(dplyr)
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
df %>%
select(matches(z))
#> X.a_means. X.c_means. X.f_means.
#> 1 a_means c_means f_means

Creating another column in R

This is my current data set
I want to take the numbers after "narrow" (e.g. 20) and make another vector. Any idea how I can do that?
We can use sub to remove the substring "Narrow" followed by a , and zero or more spaces (\\s*), replacing it with a blank (""), and then convert the result to numeric
df1$New <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus))
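The data in the question is not shown, so the following is only a sketch on made-up values; it assumes a character column Stimulus with entries such as "Narrow, 20".

```r
# Hypothetical stand-in for the data in the question
df1 <- data.frame(Stimulus = c("Narrow, 20", "Narrow, 35", "Narrow,5"),
                  stringsAsFactors = FALSE)

# Strip the "Narrow," prefix plus any spaces, then convert what is left
df1$New <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus))
df1$New
```

Because the pattern uses \\s*, the third value works even though it has no space after the comma.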
You could use separate to separate the stimulus column into two vectors.
library(tidyr)
df %>%
separate(col = stimulus,
sep = ", ",
into = c("Text","Number"))
Maybe you can try the code below, using regmatches
df$new <- with(df, as.numeric(unlist(regmatches(stimulus,gregexpr("\\d+",stimulus)))))
You want separate from the tidyr package.
library(tidyr)
library(dplyr)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
#> A B
#> 1 <NA> <NA>
#> 2 a b
#> 3 a d
#> 4 b c

Flipping two sides of string

I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiar with R's colnames() function)
Now, what I want is simply to flip the values in front of, and after the underscore. e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM in their name.
Some other base R options
1)
Use sub to switch the beginning and end - we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
2)
A regex-free approach that might be more efficient using chartr, dirname and paste. But we need to get the indices of the columns that contain "NORM" first
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
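Since the question is about column names, the capture-group approach from option 1 can be applied to colnames() directly. A minimal sketch, assuming a hypothetical data frame dat that carries the names from the question:

```r
# Hypothetical data frame carrying the column names from the question
dat <- data.frame(X99_NORM = 1, X101_NORM = 2, X76_110_T02_09747 = 3, X30_NORM = 4)

# \\1 is the text captured by the first (...) group, \\2 by the second;
# names that do not match the whole pattern are returned unchanged
colnames(dat) <- sub("^(X\\d+)_(NORM)$", "\\2_\\1", colnames(dat))
colnames(dat)
```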
x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
grepl("NORM", x),
sapply(strsplit(x[grepl("NORM", x)], "_"), function(x){
paste(rev(x), collapse = "_")
}))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
filter(str_detect(column, "NORM")) %>%
mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
select(column_2)
# A tibble: 3 x 1
column_2
<chr>
1 NORM_X99
2 NORM_X101
3 NORM_X30

Removing suffix from column names using rename_all?

I have a data frame with a number of columns in a form var1.mean, var2.mean. I would like to strip the suffix ".mean" from all columns that contain it. I tried using rename_all in conjunction with regex in a pipe but could not come up with a correct syntax. Any suggestions?
If you want to use the dplyr package, I'd recommend using the rename_at function.
Dframe <- data.frame(var1.mean = rnorm(10),
var2.mean = rnorm(10),
var1.sd = runif(10))
library(dplyr)
Dframe %>%
rename_at(.vars = vars(ends_with(".mean")),
.funs = ~ sub("[.]mean$", "", .))
Using new dplyr:
df %>% rename_with(~ stringr::str_remove(., "\\.mean$"))
We can use rename_all
df1 %>%
rename_all(.funs = funs(sub("\\..*", "", names(df1)))) %>%
head(2)
# var1 var2 var3 var1 var2 var3
#1 -0.5458808 -0.09411013 0.5266526 -1.3546636 0.08314367 0.5916817
#2 0.5365853 -0.08554095 -1.0736261 -0.9608088 2.78494703 -0.2883407
NOTE: If the resulting column names are duplicated, they need to be made unique with make.unique
data
set.seed(24)
df1 <- as.data.frame(matrix(rnorm(25*6), 25, 6, dimnames = list(NULL,
paste0(paste0("var", 1:3), rep(c(".mean", ".sd"), each = 3)))))
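To illustrate the note about duplicated names, here is a small base R sketch of what make.unique does to names like the ones produced above (old_names, stripped and unique_names are illustrative):

```r
# Stripping everything after the first dot leaves repeated names
old_names <- c("var1.mean", "var2.mean", "var3.mean", "var1.sd", "var2.sd", "var3.sd")
stripped <- sub("\\..*", "", old_names)
stripped

# make.unique() appends .1, .2, ... to the repeats so the names are distinct again
unique_names <- make.unique(stripped)
unique_names
```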
You may use gsub.
colnames(df) <- gsub('\\.mean$','',colnames(df))
The below works for me
dat <- data.frame(var1.mean = 1, var2.mean = 2)
col_old <- colnames(dat)
col_new <- gsub(pattern = "\\.mean$",replacement = "", x = col_old)
colnames(dat) <- col_new
You can replace these names using the stri_replace_last_regex function from the stringi package, like this:
require(stringi)
df <- data.frame(1,2,3,4,5,6)
names(df) <- stri_paste("var",1:6,c(".mean",".sd"))
df
## var1.mean var2.sd var3.mean var4.sd var5.mean var6.sd
##1 1 2 3 4 5 6
names(df) <- stri_replace_last_regex(names(df),"\\.mean$","")
df
## var1 var2.sd var3 var4.sd var5 var6.sd
##1 1 2 3 4 5 6
The regex is \\.mean$ because the dot has a special meaning in regex and must be escaped, and the $ sign at the end ensures that only names that END with this pattern are changed (if the .mean text is in the middle of the string, it won't be replaced).
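A base R sketch of the difference the escape and the $ anchor make, using made-up names chosen to trigger both problems:

```r
nms <- c("var1.mean", "demeaned.sd", "var2.mean.raw")

# Unescaped dot: "." matches any character, so "emean" inside "demeaned" is hit
loose <- sub(".mean", "", nms)
loose

# Escaped dot plus "$": only a literal ".mean" at the very end is removed
strict <- sub("\\.mean$", "", nms)
strict
```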
I would use strsplit:
x <- as.data.frame(matrix(runif(16), ncol = 4))
colnames(x) <- c("var1.mean", "var2.mean", "var3.mean", "something.else")
colnames(x) <- unlist(strsplit(colnames(x), split = ".mean", fixed = TRUE))
colnames(x)
Lots of quick answers have been given; the most intuitive, to me, would be:
Dframe <- data.frame(var1.mean = rnorm(10), #Create Example
var2.mean = rnorm(10),
var1.sd = runif(10))
names(Dframe) <- gsub("[.]mean","",names(Dframe)) #remove ".mean"

R strsplit function in a data frame

I created a data frame and now want to split its first column at the ":" so that the second part goes into a new column.
data frame:
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results:ASL|435 214.4421
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results:ASS1|445 2863.8055
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results:OTC|5009 0
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results:ASL|435 332.7522
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results:ASS1|445 3322.629
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results:OTC|5009 0
desired output:
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results ASL|435 214.4421
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results ASS1|445 2863.8055
unc.edu.0057f9f7-779b-4914-8290-abbad2a0d81e.2556919.rsem.genes.normalized_results OTC|5009 0
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results ASL|435 332.7522
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results ASS1|445 3322.629
unc.edu.050c2191-b96c-41e7-abdb-e52cbe82f268.2456235.rsem.genes.normalized_results OTC|5009 0
I have tried
strsplit(df$V1, split = "\\:")
but I get the error: Error in strsplit(t$V1, split = "\:") : non-character argument. Thank you.
The error is because we have a variable of class factor. Convert it to character and it should work
lst <- strsplit(as.character(df$V1), split = ":", fixed = TRUE)
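A minimal sketch that reproduces the error on a small factor and shows the fix. The values are shortened, made-up stand-ins for the ids in the question. Factor columns like this were the default when reading text in R versions before 4.0.0.

```r
# A factor column, standing in for df$V1
v1 <- factor(c("sampleA:ASL|435", "sampleB:OTC|5009"))

# strsplit() refuses anything that is not a character vector
err <- tryCatch(strsplit(v1, split = ":", fixed = TRUE),
                error = function(e) conditionMessage(e))
err

# After conversion the split works and returns one character vector per row
lst <- strsplit(as.character(v1), split = ":", fixed = TRUE)
lst[[1]]
```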
If we need to create two columns, one easy way is with read.table
df1 <- read.table(text = as.character(df$V1), sep=":", stringsAsFactors=FALSE)
Or using separate from tidyr
library(tidyr)
separate(df, V1, into = c("V1", "gene"), sep = ":")
tidyr::separate(data = df, col = V1, into = c('a', 'b'), sep = ':')
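For completeness, a base R sketch that builds the full three-column output without tidyr. The data is a shortened, hypothetical version of the question's, and the column names sample, gene and value are assumptions made for illustration.

```r
# Shortened, hypothetical version of the data in the question
df <- data.frame(V1 = c("sample1.results:ASL|435", "sample1.results:OTC|5009"),
                 V2 = c(214.4421, 0),
                 stringsAsFactors = FALSE)

# Split once at ":" and bind the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(as.character(df$V1), split = ":", fixed = TRUE))

# Rebuild the data frame: sample id, gene id, then the original value column
out <- data.frame(sample = parts[, 1], gene = parts[, 2], value = df$V2,
                  stringsAsFactors = FALSE)
out
```

Using fixed = TRUE treats ":" as a literal string, so the dots and hyphens in the ids are left alone.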
