I have a dataset with column names
Col_a_b1 Col_a_b2 Col_a_b3 Col_a_b4 Col_a_b5 Col_a_b6 Col_a_b7 Col_a_b8 Col_a_b9 Col_a_b10 Col_a_b11 ... Col_a_b94
How do I add 0s to the column names with single-digit suffixes (1 to 9)? Expected column names:
Col_a_b01 Col_a_b02 Col_a_b03 Col_a_b04 Col_a_b05 Col_a_b06 Col_a_b07 Col_a_b08 Col_a_b09 Col_a_b10 Col_a_b11 ... Col_a_b94
Any suggestions much appreciated. Thanks.
With a tidyverse approach:
library(tidyverse)
names <- c("Col_a_b1", "Col_a_b2", "Col_a_b3", "Col_a_b4", "Col_a_b5", "Col_a_b6", "Col_a_b7", "Col_a_b8", "Col_a_b9", "Col_a_b10", "Col_a_b11")
names %>%
  str_split("(?<=Col_a_b)(?=\\d+)") %>%   # split each name into prefix and trailing number
  map_chr(~ str_c(.x[1], str_pad(.x[2], width = 2, pad = "0")))   # pad the number to 2 digits and re-join
#> [1] "Col_a_b01" "Col_a_b02" "Col_a_b03" "Col_a_b04" "Col_a_b05" "Col_a_b06"
#> [7] "Col_a_b07" "Col_a_b08" "Col_a_b09" "Col_a_b10" "Col_a_b11"
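A more compact variant of the same idea pads only the single-digit suffixes with str_replace() (a sketch, not part of the original answer), giving the same result:
names %>%
  str_replace("(?<=Col_a_b)(\\d)$", "0\\1")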
For a given dataframe called data (parse_number() comes from readr, which is loaded with the tidyverse):
library(readr)
colnames(data) <- sprintf("Col_a_b%02d", parse_number(colnames(data)))
%02d means a decimal integer, left-padded with zeros to a width of 2 digits.
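For instance, applied directly to a few numbers:
sprintf("%02d", c(1, 9, 10, 94))
#> [1] "01" "09" "10" "94"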
Example
# Sample data
data <- structure(list(Col_a_b1 = c("Name1", "Name2"), Col_a_b94 = c(1, 2)),
                  class = "data.frame", row.names = c(NA, -2L))
> data
Col_a_b1 Col_a_b94
1 Name1 1
2 Name2 2
colnames(data) <- sprintf("Col_a_b%02d", parse_number(colnames(data)))
> data
Col_a_b01 Col_a_b94
1 Name1 1
2 Name2 2
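If you prefer to keep the renaming inside a pipe, the same idea can be written with dplyr::rename_with() (a sketch, assuming dplyr >= 1.0.0):
library(dplyr)
library(readr)
data <- data %>%
  rename_with(~ sprintf("Col_a_b%02d", parse_number(.x)))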
One way to do it is:
library(readr)
# column names
nam <- c("Col_a_b1", "Col_a_b2", "Col_a_b3")
# extract the number
num <- parse_number(nam)
# pad single-digit numbers to two digits
num <- sub("(^[0-9]$)", "0\\1", num)
# remove the numbers from the names
nam <- gsub("[0-9]+", "", nam)
# paste the padded numbers back on
mod_nam <- paste0(nam, num)
mod_nam
[1] "Col_a_b01" "Col_a_b02" "Col_a_b03"
I need your support while working with dates.
While importing an .xls file, the column of dates was correctly converted into numbers by R. Unfortunately, some dates are still there in the format dd/mm/yyyy, d/mm/yyyy, or dd/mm/yy. This probably results from different operating-system settings; I don't know. Is there a way to manage this?
Thank you in advance
library(readxl)
my_data <- read_excel("my_file.xls")
born_date
18520
30859
16/04/1972
26612
30291
24435
11/02/1964
26/09/1971
18427
23688
Original_dates
14/9/1950
26/6/1984
16/04/1972
9/11/1972
6/12/1982
24/11/1966
11/02/1964
26/09/1971
13/6/1950
Here is one way we could solve it:
First we keep only the numeric values, by excluding those containing the string "/".
Then we use the excel_numeric_to_date() function from the janitor package.
Finally, with coalesce() we combine both:
library(dplyr)
library(stringr)   # for str_detect
library(janitor)
library(lubridate)
df %>%
  mutate(x = ifelse(str_detect(born_date, '\\/'), NA_real_, born_date),
         x = excel_numeric_to_date(as.numeric(as.character(x)), date_system = "modern"),
         born_date = dmy(born_date)) %>%
  mutate(born_date = coalesce(born_date, x), .keep = "unused")
born_date
1 1950-09-14
2 1984-06-26
3 1972-04-16
4 1972-11-09
5 1982-12-06
6 1966-11-24
7 1964-02-11
8 1971-09-26
9 1950-06-13
10 1964-11-07
data:
df <- structure(list(born_date = c("18520", "30859", "16/04/1972",
"26612", "30291", "24435", "11/02/1964", "26/09/1971", "18427",
"23688")), class = "data.frame", row.names = c(NA, -10L))
1) This converts both types of dates; each conversion returns NA for the elements not of its type. Then we use coalesce to combine them. This only needs dplyr, and no warnings are produced.
library(dplyr)
my_data %>%
  mutate(born_date = coalesce(
    as.Date(born_date, "%d/%m/%Y"),
    as.Date(as.numeric(ifelse(grepl("/", born_date), NA, born_date)), origin = "1899-12-30")
  ))
## born_date
## 1 1950-09-14
## 2 1984-06-26
## 3 1972-04-16
## 4 1972-11-09
## 5 1982-12-06
## 6 1966-11-24
## 7 1964-02-11
## 8 1971-09-26
## 9 1950-06-13
## 10 1964-11-07
2) Here is a base R version.
my_data |>
  transform(born_date = pmin(na.rm = TRUE,
    as.Date(born_date, "%d/%m/%Y"),
    as.Date(as.numeric(ifelse(grepl("/", born_date), NA, born_date)), origin = "1899-12-30")))
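In both versions the "1899-12-30" origin is the offset conventionally used in R to convert serial numbers from Excel's modern (1900) date system. A quick check on the first serial number in the sample:
as.Date(18520, origin = "1899-12-30")
## [1] "1950-09-14"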
Note
The input in reproducible form.
my_data <-
structure(list(born_date = c("18520", "30859", "16/04/1972",
"26612", "30291", "24435", "11/02/1964", "26/09/1971", "18427",
"23688")), class = "data.frame", row.names = c(NA, -10L))
here is a dataframe for example:
test_df <- structure(list(plant_sp = c("AB1234", "CC0000", "ZX9998", "AA1110", "LO8880"),
                          NewName = c("ZY8765", "XX9999", "AC0001", "ZZ8889", "OL1119")),
                     row.names = c(NA, -5L), class = "data.frame")
As you can see, there is a column called "plant_sp" with a 6-character code.
I'd like to transform this code into a new code (like in the column "NewName") using this mapping:
For letters: A -> Z, B -> Y, C -> X, D -> W, E -> V, F -> U, G -> T, ...
For numbers: 0 -> 9, 1 -> 8, 2 -> 7, 3 -> 6, 4 -> 5, 5 -> 4, ...
plant_sp NewName
1 AB1234 ZY8765
2 CC0000 XX9999
3 ZX9998 AC0001
4 AA1110 ZZ8889
5 LO8880 OL1119
So each character is replaced by its opposite value (0=9, 1=8, ..., A=Z, B=Y, ...).
How can I do it? A pipe solution would be great.
Thanks a lot!
One option to achieve your desired result would be via a lookup table and stringr::str_replace_all:
library(dplyr)
library(stringr)
lt_letters <- setNames(rev(LETTERS), LETTERS)
lt_numbers <- setNames(rev(0:9),0:9)
test_df %>%
  mutate(NewName1 = str_replace_all(plant_sp, "[A-Z0-9]", function(x) c(lt_letters, lt_numbers)[x]))
#> plant_sp NewName NewName1
#> 1 AB1234 ZY8765 ZY8765
#> 2 CC0000 XX9999 XX9999
#> 3 ZX9998 AC0001 AC0001
#> 4 AA1110 ZZ8889 ZZ8889
#> 5 LO8880 OL1119 OL1119
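For reference, base R's chartr() can express the same one-to-one character mapping without a lookup table (a sketch, not part of the original answer):
old <- paste0(paste(LETTERS, collapse = ""), paste(0:9, collapse = ""))
new <- paste0(paste(rev(LETTERS), collapse = ""), paste(rev(0:9), collapse = ""))
chartr(old, new, test_df$plant_sp)
#> [1] "ZY8765" "XX9999" "AC0001" "ZZ8889" "OL1119"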
Say I have a dataframe with tens of columns, and my custom function needs each of these columns plus a number from a vector to produce the desired output. After that, I need to generate new column names based on the original column names in the dataframe. How can I accomplish this using the tidyverse, instead of for loops or other base R solutions?
MWE
structure(list(col1 = c(36.0520583373645, 37.9423749063706, 33.6806634587719,
34.031649012457, 29.5448679963449, NA, 34.7576769718877, 30.484217745574,
32.9849083643022, 27.4081694831058, 35.8624919654559, 35.0284347997991,
NA, 32.112605893241, 27.819354948082, 35.6499532124921, 35.0265642403216,
32.4006569441297, 30.3698557864842, 31.8229364456928, 34.3715903109276
), col2 = c(32.9691195198199, 35.6643664156284, 33.8748732989736,
34.5436311813644, 33.2228201914256, 38.7621696867191, 34.8399804318992,
32.9063078995457, 35.7391166214367, 32.7217251282669, 36.3039268989853,
35.9607654868559, 33.1385915196435, 34.7987649028199, 33.7100463668523,
34.7773403671057, 35.8592997980752, 33.8537127786535, 31.9106243803505,
39.3099469314882, 35.1849826815196), col3 = c(33.272278716963,
NA, 31.8594920410129, 33.1695042551974, 29.3800694974438, 35.1504378875245,
34.0771487001433, 29.0162879030415, 30.6960024888799, 29.5542117965184,
34.3726321365982, 36.0602274148362, 33.1207772548047, 31.5506876209822,
28.8649303491974, 33.4598790144265, 30.5573454464747, 31.6026723913051,
30.4716061556625, 33.009463000301, 30.846230953425)), row.names = c(NA,
-21L), class = "data.frame")
Save the above in a file, then use example <- dget(file.choose()) to read the dataframe in.
Code
y <- c(2, 1, 1.5)
customfun <- function(x, y){
  n <- log(x) * y
  print(n)
}
df <- example %>%
  dplyr::mutate(col1.log = customfun(col1, y = y[1])) %>%
  dplyr::mutate(col2.log = customfun(col2, y = y[2])) %>%
  dplyr::mutate(col3.log = customfun(col3, y = y[3]))
Question
Imagine I have tens of these columns, not only 3 as in the MWE. How can I generate the new ones dynamically using the tidyverse?
We can use map2 and bind_cols to add new columns
library(dplyr)
library(purrr)
bind_cols(example, map2_df(example, y, customfun) %>%
                     rename_all(~ paste0(., ".log")))
# col1 col2 col3 col1.log col2.log col3.log
#1 36.05206 32.96912 33.27228 7.169928 3.495571 5.257087
#2 37.94237 35.66437 NA 7.272137 3.574152 NA
#3 33.68066 33.87487 31.85949 7.033848 3.522674 5.192003
#4 34.03165 34.54363 33.16950 7.054582 3.542223 5.252446
#...
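Another option keeps everything inside a single mutate() by using across() with cur_column() to look up the matching y for each column (a sketch, assuming dplyr >= 1.0.0; y_by_col is a helper introduced here):
library(dplyr)
y_by_col <- setNames(y, names(example))   # pair each column name with its y value
example %>%
  mutate(across(everything(),
                ~ customfun(.x, y_by_col[[cur_column()]]),
                .names = "{.col}.log"))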
The tidyverse is not great for these sweep()-like operations; however, one option could be:
example %>%
  do(., sweep(., 2, FUN = customfun, y)) %>%
  rename_all(~ paste(., "log", sep = "."))
col1.log col2.log col3.log
1 7.169928 3.495571 5.257087
2 7.272137 3.574152 NA
3 7.033848 3.522674 5.192003
4 7.054582 3.542223 5.252446
5 6.771820 3.503237 5.070475
6 NA 3.657445 5.339456
7 7.096801 3.550766 5.292941
8 6.834418 3.493664 5.051786
9 6.992100 3.576246 5.136199
10 6.621682 3.488039 5.079339
I have data with sample names that need to be unpacked into new columns.
sample
P10.1
P11.2
S1.1
S3.3
Using the sample ID data, I need to make three new columns: tissue, plant, stage.
sample tissue plant stage
P10.1 P 10 1
P11.2 P 11 2
S1.1 S 1 1
S3.3 S 3 3
Is there a way to pull the data from the sample column to populate the three new columns?
Using dplyr and tidyr:
First we insert a "." into the sample code, then we separate sample into 3 columns.
library(dplyr)
library(tidyr)
df %>%
  mutate(sample = paste0(substring(sample, 1, 1), ".", substring(sample, 2))) %>%
  separate(sample, into = c("tissue", "plant", "stage"), remove = FALSE)
sample tissue plant stage
1 P.10.1 P 10 1
2 P.11.2 P 11 2
3 S.1.1 S 1 1
4 S.3.3 S 3 3
data:
df <- structure(list(sample = c("P10.1", "P11.2", "S1.1", "S3.3")),
.Names = "sample",
class = "data.frame",
row.names = c(NA, -4L))
Similar to @phiver's answer, but this uses regular expressions.
Within the pattern:
The first parentheses capture any single uppercase letter (for tissue)
The second parentheses capture any one- or two-digit number (for plant)
The third parentheses capture any one- or two-digit number (for stage)
The sub() function pulls out those capturing groups and places them in new variables.
library(magrittr)
pattern <- "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$"
df %>%
dplyr::mutate(
tissue = sub(pattern, "\\1", sample),
plant = as.integer(sub(pattern, "\\2", sample)),
stage = as.integer(sub(pattern, "\\3", sample))
)
Result (displayed with str()):
'data.frame': 4 obs. of 4 variables:
$ sample: chr "P10.1" "P11.2" "S1.1" "S3.3"
$ tissue: chr "P" "P" "S" "S"
$ plant : int 10 11 1 3
$ stage : int 1 2 1 3
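tidyr::extract() combines both ideas, pulling all three capturing groups out in one call (a sketch using the same pattern as above; the tidyr:: prefix avoids a clash with magrittr::extract()):
library(tidyr)
df %>%
  tidyr::extract(sample, into = c("tissue", "plant", "stage"),
                 regex = "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$",
                 remove = FALSE, convert = TRUE)
#   sample tissue plant stage
# 1  P10.1      P    10     1
# 2  P11.2      P    11     2
# 3   S1.1      S     1     1
# 4   S3.3      S     3     3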
This is similar to phiver's answer, but uses separate twice. Notice that we can specify the position index in the sep argument.
library(tidyr)
dat2 <- dat %>%
  separate(sample, into = c("tissue", "number"), sep = 1, remove = FALSE) %>%
  separate(number, into = c("plant", "stage"), sep = "\\.", remove = TRUE, convert = TRUE)
dat2
# sample tissue plant stage
# 1 P10.1 P 10 1
# 2 P11.2 P 11 2
# 3 S1.1 S 1 1
# 4 S3.3 S 3 3
DATA
dat <- read.table(text = "sample
P10.1
P11.2
S1.1
S3.3",
header = TRUE, stringsAsFactors = FALSE)
The target format has a fixed width: xxxx.xxxxxx (4 digits before the "." and 6 digits after it).
I have to add "0"s whenever either side of the "." has too few digits.
Using regexpr() to find the location of "[.]", combined with str_pad(), I can fix the first 4 digits,
but I don't know how to add values after a specific character up to a fixed number of digits.
(I cannot find a function that counts positions starting from a specified location.)
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am a beginner at coding and sometimes can't understand parts of a regex, like "[" etc. Some explanation of them would be super helpful.
Also, I have a combination like this:
df$Category <- ifelse(regexpr("[.]", df$Category) == 4,
                      paste("0", df$Category, sep = ""), df$Category)
df$Category <- str_pad(df$Category, 11, side = "right", pad = "0")
I'd like to know whether there is a better way to do this, especially how to count and return a location from the END of the string until a specific character appears.
Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': formatting doubles (fixed-point notation);
width = 11: 4 digits before the '.' + 1 for the '.' + 6 digits after it;
flag = '0': pads with leading zeros;
digits = 6: the desired number of digits after the decimal point (for format = "f").
The input df seems to be a character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")
We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
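Here "%011.6f" means: format as a fixed-point number (f) with 6 digits after the decimal point, zero-padded (the 0 flag) to a total width of 11 characters (4 + 1 + 6). On a single value:
sprintf("%011.6f", 300.03034)
#> [1] "0300.030340"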
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use @akrun's sprintf() approach. Since you stated you're a beginner, it might be more helpful to break the problem down into its component parts. One transparent and easy-to-follow approach, in my opinion, would be to utilize the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result
Use separate from tidyr to split Category on the decimal point. Use str_pad from stringr to add zeros at the front or back, then paste the pieces back together.
library(tidyr)    # to separate columns on the decimal
library(dplyr)    # for mutate and pipes
library(stringr)  # for str_pad
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
  separate(Category, into = c("col1", "col2")) %>%
  mutate(col1 = str_pad(col1, width = 4, side = "left", pad = "0"),
         col2 = str_pad(col2, width = 6, side = "right", pad = "0"),
         Category = paste(col1, col2, sep = ".")) %>%
  select(-col1, -col2)