I need your support while working with dates.
While importing an .xls file, the date column was mostly converted into numbers by R. Unfortunately, some dates are still there as strings in the format dd/mm/yyyy, d/mm/yyyy, or dd/mm/yy. This probably results from different settings on different operating systems, but I don't know. Is there a way to handle this?
Thank you in advance
my_data <- read_excel("my_file.xls")
born_date
18520
30859
16/04/1972
26612
30291
24435
11/02/1964
26/09/1971
18427
23688
Original_dates
14/9/1950
26/6/1984
16/04/1972
9/11/1972
6/12/1982
24/11/1966
11/02/1964
26/09/1971
13/6/1950
Here is one way we could solve it:
First we isolate the numeric values by excluding those containing the string "/".
Then we use the excel_numeric_to_date function from the janitor package.
Finally, we combine both with coalesce:
library(dplyr)
library(janitor)
library(lubridate)
library(stringr) # for str_detect()
df %>%
mutate(x = ifelse(str_detect(born_date, '\\/'), NA_real_, born_date),
x = excel_numeric_to_date(as.numeric(as.character(x)), date_system = "modern"),
born_date = dmy(born_date)) %>%
mutate(born_date = coalesce(born_date, x), .keep="unused")
born_date
1 1950-09-14
2 1984-06-26
3 1972-04-16
4 1972-11-09
5 1982-12-06
6 1966-11-24
7 1964-02-11
8 1971-09-26
9 1950-06-13
10 1964-11-07
data:
df <- structure(list(born_date = c("18520", "30859", "16/04/1972",
"26612", "30291", "24435", "11/02/1964", "26/09/1971", "18427",
"23688")), class = "data.frame", row.names = c(NA, -10L))
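The janitor conversion above hinges on Excel's "modern" (1900) date system, which counts days from an origin of 1899-12-30. As a quick sanity check, a minimal base R sketch converts the serial numbers from the sample data by hand:

```r
# Excel's "modern" date system counts days from 1899-12-30; this is
# what excel_numeric_to_date(date_system = "modern") assumes.
as.Date(18520, origin = "1899-12-30")  # "1950-09-14"
as.Date(23688, origin = "1899-12-30")  # "1964-11-07"
```

These match rows 1 and 10 of the output above.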
1) This converts the two types of dates; each conversion returns NA for elements not of that type. Then we use coalesce to combine them. This needs only dplyr, and no warnings are produced.
library(dplyr)
my_data %>%
mutate(born_date = coalesce(
as.Date(born_date, "%d/%m/%Y"),
as.Date(as.numeric(ifelse(grepl("/",born_date), NA, born_date)), "1899-12-30"))
)
## born_date
## 1 1950-09-14
## 2 1984-06-26
## 3 1972-04-16
## 4 1972-11-09
## 5 1982-12-06
## 6 1966-11-24
## 7 1964-02-11
## 8 1971-09-26
## 9 1950-06-13
## 10 1964-11-07
2) Here is a base R version.
my_data |>
transform(born_date = pmin(na.rm = TRUE,
as.Date(born_date, "%d/%m/%Y"),
as.Date(as.numeric(ifelse(grepl("/",born_date), NA, born_date)), "1899-12-30"))
)
Note
The input in reproducible form.
my_data <-
structure(list(born_date = c("18520", "30859", "16/04/1972",
"26612", "30291", "24435", "11/02/1964", "26/09/1971", "18427",
"23688")), class = "data.frame", row.names = c(NA, -10L))
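The pmin() trick in the base R version works because each element parses successfully under exactly one of the two formats, so each row has at most one non-NA date and pmin(na.rm = TRUE) simply returns it — effectively a base R coalesce. A minimal illustration:

```r
# Each position is NA in exactly one of the two vectors,
# so the parallel minimum just picks the non-NA value.
a <- as.Date(c(NA, "1972-04-16"))
b <- as.Date(c("1950-09-14", NA))
pmin(a, b, na.rm = TRUE)  # "1950-09-14" "1972-04-16"
```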
Related
Good day!
I have a dataset with values like "Invalid", "Invalid(N/A)", "Invalid(1.23456)" — lots of them, in different columns, and they differ from file to file.
The goal is to make a script file to process different CSVs.
I tried read.csv and read_csv, but faced errors with data types, or no errors but no action either.
All columns are col_character except one, which is col_double.
Tried this:
is.na(df) <- startsWith(as.character(df), "Inval")
no luck
Tried this:
is.na(df) <- startsWith(df, "Inval")
no luck, some error about non char object
Tried this:
df %>%
mutate(across(everything(), .fns = ~str_replace(., "invalid", NA_character_)))
no luck
And other google stuff - no luck, again, errors with data types or no errors, but no action either.
So R is incapable of simple find and replace in data frame, huh?
Data frame example:
Output of dput(dtype_Result[1:20, 1:4])
structure(list(Location = c("1(1,A1)", "2(1,B1)", "3(1,C1)",
"4(1,D1)", "5(1,E1)", "6(1,F1)", "7(1,G1)", "8(1,H1)", "9(1,A2)",
"10(1,B2)", "11(1,C2)", "12(1,D2)", "13(1,E2)", "14(1,F2)", "15(1,G2)",
"16(1,H2)", "17(1,A3)", "18(1,B3)", "19(1,C3)", "20(1,D3)"),
Sample = c("Background0", "Background0", "Standard1", "Standard1",
"Standard2", "Standard2", "Standard3", "Standard3", "Standard4",
"Standard4", "Standard5", "Standard5", "Standard6", "Standard6",
"Control1", "Control1", "Control2", "Control2", "Unknown1",
"Unknown1"), EGF = c(NA, NA, "6.71743640129069", "2.66183193679533",
"16.1289784536322", "16.1289784536322", "78.2706654825781",
"78.6376213069722", "382.004087907716", "447.193928257862",
"Invalid(N/A)", "1920.90297258996", "7574.57784103579", "29864.0308009592",
"167.830723655146", "109.746615928611", "868.821939675054",
"971.158518683179", "9.59119569511596", "4.95543581398464"
), `FGF-2` = c(NA, NA, "25.5436745776637", NA, "44.3280630362038",
NA, "91.991708192168", "81.9459159768959", "363.563899234418",
"425.754478700876", "Invalid(2002.97340881547)", "2027.71958119836",
"9159.40221389147", "11138.8722428849", "215.58494072476",
"70.9775438699825", "759.798876479002", "830.582605561901",
"58.7007261370257", "70.9775438699825")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The error is in the use of startsWith. The following grepl solution is simpler and works.
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))
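The reason startsWith() failed is that it expects a character vector, while df is a whole data frame; sapply() applies the pattern column by column instead. A small sketch with hypothetical data:

```r
df <- data.frame(id = 1:2,
                 x  = c("ok", "Invalid(N/A)"),
                 stringsAsFactors = FALSE)
# startsWith(df$x, "Inval") works on a single column, but
# startsWith(df, "Inval") errors because df is not a character vector.
# sapply() builds a logical matrix, one TRUE per "Invalid" cell:
is.na(df) <- sapply(df, function(col) grepl("^Invalid", col))
df$x  # "ok" NA
```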
The str_replace function edits the content of a character string, inserting a partial replacement, rather than replacing it entirely. Also, across(everything(), ...) targets all of the columns, including the numeric id. The following code works, building on the tidyverse attempt you provided.
To fix it, use where to identify the columns of interest, then use if_else to overwrite the data with NA when there is a partial string match, using str_detect to spot the target text.
Example data
library(tidyverse)
df <- tibble(
id = 1:3,
x = c("a", "invalid", "c"),
y = c("d", "e", "Invalid/NA")
)
df
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 invalid e
3 3 c Invalid/NA
Solution
df <- df %>%
mutate(
across(where(is.character),
.fns = ~if_else(str_detect(tolower(.x), "invalid"), NA_character_, .x))
)
print(df)
Result
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 NA e
3 3 c NA
Say I have a dataframe with tens of columns, and my custom function needs each of these columns plus a number from a vector to give me the desired output. After that, I need to generate new column names based on the original column names in the dataframe. How can I accomplish this using the tidyverse, instead of for loops or other base R solutions?
MWE
structure(list(col1 = c(36.0520583373645, 37.9423749063706, 33.6806634587719,
34.031649012457, 29.5448679963449, NA, 34.7576769718877, 30.484217745574,
32.9849083643022, 27.4081694831058, 35.8624919654559, 35.0284347997991,
NA, 32.112605893241, 27.819354948082, 35.6499532124921, 35.0265642403216,
32.4006569441297, 30.3698557864842, 31.8229364456928, 34.3715903109276
), col2 = c(32.9691195198199, 35.6643664156284, 33.8748732989736,
34.5436311813644, 33.2228201914256, 38.7621696867191, 34.8399804318992,
32.9063078995457, 35.7391166214367, 32.7217251282669, 36.3039268989853,
35.9607654868559, 33.1385915196435, 34.7987649028199, 33.7100463668523,
34.7773403671057, 35.8592997980752, 33.8537127786535, 31.9106243803505,
39.3099469314882, 35.1849826815196), col3 = c(33.272278716963,
NA, 31.8594920410129, 33.1695042551974, 29.3800694974438, 35.1504378875245,
34.0771487001433, 29.0162879030415, 30.6960024888799, 29.5542117965184,
34.3726321365982, 36.0602274148362, 33.1207772548047, 31.5506876209822,
28.8649303491974, 33.4598790144265, 30.5573454464747, 31.6026723913051,
30.4716061556625, 33.009463000301, 30.846230953425)), row.names = c(NA,
-21L), class = "data.frame")
Save the above in a file, then use example <- dget(file.choose()) to read the dataframe back in.
Code
y <- c(2, 1, 1.5)
customfun <- function(x, y) {
  n <- log(x) * y
  print(n)
}
df <- example %>%
dplyr::mutate(col1.log = customfun (col1, y = y[1])) %>%
dplyr::mutate(col2.log = customfun (col2, y = y[2])) %>%
dplyr::mutate(col3.log = customfun (col3, y = y[3]))
Question
Imagine I have tens of these columns not only 3 as in the MWE, how to generate the new ones dynamically using the tidyverse?
We can use map2 and bind_cols to add new columns
library(dplyr)
library(purrr)
bind_cols(example, map2_df(example, y, customfun) %>%
rename_all(~paste0(., ".log")))
# col1 col2 col3 col1.log col2.log col3.log
#1 36.05206 32.96912 33.27228 7.169928 3.495571 5.257087
#2 37.94237 35.66437 NA 7.272137 3.574152 NA
#3 33.68066 33.87487 31.85949 7.033848 3.522674 5.192003
#4 34.03165 34.54363 33.16950 7.054582 3.542223 5.252446
#...
The tidyverse is not great for these sweep()-like operations; however, one option could be:
example %>%
do(., sweep(., 2, FUN = customfun, y)) %>%
rename_all(~ paste(., "log", sep = "."))
col1.log col2.log col3.log
1 7.169928 3.495571 5.257087
2 7.272137 3.574152 NA
3 7.033848 3.522674 5.192003
4 7.054582 3.542223 5.252446
5 6.771820 3.503237 5.070475
6 NA 3.657445 5.339456
7 7.096801 3.550766 5.292941
8 6.834418 3.493664 5.051786
9 6.992100 3.576246 5.136199
10 6.621682 3.488039 5.079339
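For a fully dynamic version that avoids writing one mutate() per column, across() with cur_column() can pair each column with its own multiplier. This is a sketch, assuming dplyr >= 1.0 and a *named* y vector whose names match the columns (the naming is an assumption not in the original question):

```r
library(dplyr)

# Hypothetical data standing in for the MWE's `example` dataframe.
example <- data.frame(col1 = 36.05206, col2 = 32.96912, col3 = 33.27228)

# Named so each column can look up its own multiplier via cur_column().
y <- c(col1 = 2, col2 = 1, col3 = 1.5)

res <- example %>%
  mutate(across(everything(),
                ~ log(.x) * y[[cur_column()]],
                .names = "{.col}.log"))
res  # col1.log is approximately 7.169928, as in the outputs above
```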
I've got the following data:
ex <- structure(list(X1 = c("0", "2912.99", "922.1", "772.9100000000001",
"7112.97", "933.09", "1190.03")), row.names = c(NA, -7L), .Names = "X1", class =
c("tbl_df", "tbl", "data.frame"))
To my surprise, when I'm trying to convert them to double, decimals are rounded up to integers.
ex %>% mutate(X1 = as.double(X1))
I have tried to change the digits option with options(digits = 22), but it didn't help. What causes this problem? Is it a matter of using dplyr::mutate? How can this alternatively be converted to double?
One may choose the number of digits printed by setting the option
of the pillar package, which is used by tibble.
> options(pillar.sigfig=16)
> ex %>% mutate(X1 = as.double(X1))
# A tibble: 7 x 1
X1
<dbl>
1 0.
2 2912.990000000000
3 922.1000000000000
4 772.9100000000001
5 7112.970000000000
6 933.0900000000000
7 1190.030000000000
If you want tibbles to be printed with more than 3 significant digits by default, you need to adjust the pillar.sigfig option as well:
options(pillar.sigfig = 22)
options(digits = 22)
ex %>% mutate(X1 = as.double(X1))
X1
<dbl>
1 0.00000000000000
2 2912.98999999999978
3 922.10000000000002
4 772.91000000000008
5 7112.97000000000025
6 933.09000000000003
7 1190.02999999999997
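Note that in both answers nothing is actually rounded — the full double value is stored; only the tibble print method abbreviates it (pillar.sigfig defaults to 3 significant digits, hence the integer-looking display). A quick check that no precision is lost in the conversion:

```r
x <- as.double("2912.99")
x == 2912.99           # TRUE: the full value is stored
format(x, nsmall = 2)  # "2912.99" regardless of print options
```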
There is a basic width of xxxx.xxxxxx (4 digits before the "." and 6 digits after).
I have to add "0" padding whenever either side of the "." doesn't have enough digits.
Using regexpr to find the "[.]" location, combined with str_pad, I can
fix the first 4 digits, but I
don't know how to add padding after a specific character up to a fixed number of digits
(I cannot find a function that counts positions from a specified point).
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am a beginner at coding and sometimes can't understand regex like "[",
etc. Some explanation of them would be super helpful.
Also, I have a combination like this:
df$Category <- ifelse(regexpr("[.]", df$Category) == 4,
                      paste("0", df$Category, sep = ""), df$Category)
df$Category <- str_pad(df$Category, 11, side = "right", pad = "0")
I'd like to know whether there is a better way to do this, especially counting and
returning the location from the end of the string until a specific character appears.
Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': formats doubles in fixed notation;
width = 11: 4 digits before the "." + 1 for the "." + 6 digits after;
flag = '0': pads with leading zeros;
digits = 6: the desired number of digits after the decimal point (with format = 'f').
The input df seems to be a character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")
We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
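The format string "%011.6f" reads as: fixed-point notation (f) with 6 digits after the decimal point, a minimum total width of 11 characters (4 before the dot + the dot + 6 after), zero-padded via the leading 0 flag. For a single value:

```r
# Width 11 includes the decimal point itself.
sprintf("%011.6f", 300.03034)   # "0300.030340"
sprintf("%011.6f", 3400.04029)  # "3400.040290"
```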
There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use #akrun's sprintf() approach. But since you stated you're a beginner, it might be more helpful to break down the problem into its component parts. One transparent and easy-to-follow approach, in my opinion, is to use the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result
Use separate from tidyr to split Category on the decimal point. Use str_pad from stringr to add zeros at the front or back, then paste the pieces back together.
library(tidyr)   # to separate columns on the decimal
library(dplyr)   # for mutate and pipes
library(stringr) # for str_pad
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
separate(Category, into = c("col1", "col2")) %>%
mutate(col1 = str_pad(col1, width = 4, side= "left", pad ="0"),
col2 = str_pad(col2, width = 6, side= "right", pad ="0"),
Category = paste(col1, col2, sep = ".")) %>%
select(-col1, -col2)
I have the following data frame:
data=structure(list(Metric = c("x", "y"), Result1 = c(9,
18), Result2 = c(7, 14
), Delta = c(-170, -401)), .Names = c("Metric",
"Result_1", "Result_2", "Delta"), row.names = c(NA,
-2L), class = "data.frame")
I would like to transform it into 1 row, doubling the number of columns.
I tried as.vector(t(data)); however, that transforms everything to character and I lose the data information.
Any help?
We can split the data frame and then use bind_cols from the dplyr package, although I am not sure this is your expected output, as you provided only a description and no example.
dplyr::bind_cols(split(data, data$Metric))
Metric Result_1 Result_2 Delta Metric1 Result_11 Result_21 Delta1
1 x 9 7 -170 y 18 14 -401
In Base R
dd <- stack(data)
A <- dd$values
names(A) <- dd$ind
data.frame(as.list(A))
Metric Metric.1 Result_1 Result_1.1 Result_2 Result_2.1 Delta Delta.1
1 x y 9 18 7 14 -170 -401