Say I have a dataframe with tens of columns, and my custom function needs each one of these columns plus a number from a vector to produce the desired output. Afterwards, I need to generate new column names based on the original column names in the dataframe. How can I accomplish this using the tidyverse, instead of for loops or other base R solutions?
MWE
structure(list(col1 = c(36.0520583373645, 37.9423749063706, 33.6806634587719,
34.031649012457, 29.5448679963449, NA, 34.7576769718877, 30.484217745574,
32.9849083643022, 27.4081694831058, 35.8624919654559, 35.0284347997991,
NA, 32.112605893241, 27.819354948082, 35.6499532124921, 35.0265642403216,
32.4006569441297, 30.3698557864842, 31.8229364456928, 34.3715903109276
), col2 = c(32.9691195198199, 35.6643664156284, 33.8748732989736,
34.5436311813644, 33.2228201914256, 38.7621696867191, 34.8399804318992,
32.9063078995457, 35.7391166214367, 32.7217251282669, 36.3039268989853,
35.9607654868559, 33.1385915196435, 34.7987649028199, 33.7100463668523,
34.7773403671057, 35.8592997980752, 33.8537127786535, 31.9106243803505,
39.3099469314882, 35.1849826815196), col3 = c(33.272278716963,
NA, 31.8594920410129, 33.1695042551974, 29.3800694974438, 35.1504378875245,
34.0771487001433, 29.0162879030415, 30.6960024888799, 29.5542117965184,
34.3726321365982, 36.0602274148362, 33.1207772548047, 31.5506876209822,
28.8649303491974, 33.4598790144265, 30.5573454464747, 31.6026723913051,
30.4716061556625, 33.009463000301, 30.846230953425)), row.names = c(NA,
-21L), class = "data.frame")
Save the above in a file, then use example <- dget(file.choose()) to read the dataframe back in.
Code
y <- c(2, 1, 1.5)
customfun <- function(x, y) {
  n <- log(x) * y
  print(n)
}
df <- example %>%
  dplyr::mutate(col1.log = customfun(col1, y = y[1])) %>%
  dplyr::mutate(col2.log = customfun(col2, y = y[2])) %>%
  dplyr::mutate(col3.log = customfun(col3, y = y[3]))
Question
Imagine I have tens of these columns, not only the 3 in the MWE. How can I generate the new ones dynamically using the tidyverse?
We can use map2 and bind_cols to add new columns
library(dplyr)
library(purrr)
bind_cols(example, map2_df(example, y, customfun) %>%
rename_all(~paste0(., ".log")))
# col1 col2 col3 col1.log col2.log col3.log
#1 36.05206 32.96912 33.27228 7.169928 3.495571 5.257087
#2 37.94237 35.66437 NA 7.272137 3.574152 NA
#3 33.68066 33.87487 31.85949 7.033848 3.522674 5.192003
#4 34.03165 34.54363 33.16950 7.054582 3.542223 5.252446
#...
The tidyverse is not great for these sweep()-like operations; however, one option could be:
example %>%
  do(., sweep(., 2, FUN = customfun, y)) %>%
  rename_all(~ paste(., "log", sep = "."))
col1.log col2.log col3.log
1 7.169928 3.495571 5.257087
2 7.272137 3.574152 NA
3 7.033848 3.522674 5.192003
4 7.054582 3.542223 5.252446
5 6.771820 3.503237 5.070475
6 NA 3.657445 5.339456
7 7.096801 3.550766 5.292941
8 6.834418 3.493664 5.051786
9 6.992100 3.576246 5.136199
10 6.621682 3.488039 5.079339
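A further option (a sketch, not part of the original answers): with dplyr >= 1.0 this can be done in a single mutate() using across() and cur_column(), assuming the entries of y are in the same order as the columns of example; y_named is a helper introduced here for the lookup.
library(dplyr)
# Name the multipliers after the columns they belong to (assumes matching order),
# then look up the right y for each column by name inside across().
y_named <- setNames(y, names(example))
example %>%
  mutate(across(everything(),
                ~ customfun(.x, y = y_named[[cur_column()]]),
                .names = "{.col}.log"))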
Related
I need your support while working with dates.
When importing an .xls file, the column of dates was converted into numbers by R, but unfortunately some dates are still there as text in the format dd/mm/yyyy, d/mm/yyyy, or dd/mm/yy. This probably results from different operating-system settings; I don't know. Is there a way to manage this?
Thank you in advance
library(readxl)
my_data <- read_excel("my_file.xls")
born_date
18520
30859
16/04/1972
26612
30291
24435
11/02/1964
26/09/1971
18427
23688
Original_dates
14/9/1950
26/6/1984
16/04/1972
9/11/1972
6/12/1982
24/11/1966
11/02/1964
26/09/1971
13/6/1950
Here is one way we could solve it:
First we keep only the numeric values by excluding those containing the string /.
Then we use the excel_numeric_to_date function from the janitor package.
Finally we combine both with coalesce:
library(dplyr)
library(stringr)
library(janitor)
library(lubridate)
df %>%
  mutate(x = ifelse(str_detect(born_date, '\\/'), NA_real_, born_date),
         x = excel_numeric_to_date(as.numeric(as.character(x)), date_system = "modern"),
         born_date = dmy(born_date)) %>%
  mutate(born_date = coalesce(born_date, x), .keep = "unused")
born_date
1 1950-09-14
2 1984-06-26
3 1972-04-16
4 1972-11-09
5 1982-12-06
6 1966-11-24
7 1964-02-11
8 1971-09-26
9 1950-06-13
10 1964-11-07
data:
df <- structure(list(born_date = c("18520", "30859", "16/04/1972",
"26612", "30291", "24435", "11/02/1964", "26/09/1971", "18427",
"23688")), class = "data.frame", row.names = c(NA, -10L))
1) This parses both types of dates; each conversion returns NA for the elements not of that type. Then we use coalesce to combine them. This needs only dplyr and produces no warnings.
library(dplyr)
my_data %>%
  mutate(born_date = coalesce(
    as.Date(born_date, "%d/%m/%Y"),
    as.Date(as.numeric(ifelse(grepl("/", born_date), NA, born_date)), "1899-12-30")
  ))
## born_date
## 1 1950-09-14
## 2 1984-06-26
## 3 1972-04-16
## 4 1972-11-09
## 5 1982-12-06
## 6 1966-11-24
## 7 1964-02-11
## 8 1971-09-26
## 9 1950-06-13
## 10 1964-11-07
2) Here is a base R version. Since each element parses under exactly one of the two formats, at most one of the two results is non-NA, so pmin(..., na.rm = TRUE) plays the role of coalesce.
my_data |>
  transform(born_date = pmin(na.rm = TRUE,
    as.Date(born_date, "%d/%m/%Y"),
    as.Date(as.numeric(ifelse(grepl("/", born_date), NA, born_date)), "1899-12-30")
  ))
Note
The input in reproducible form.
my_data <-
structure(list(born_date = c("18520", "30859", "16/04/1972",
"26612", "30291", "24435", "11/02/1964", "26/09/1971", "18427",
"23688")), class = "data.frame", row.names = c(NA, -10L))
I have a dataframe that I have to sort in decreasing order of absolute row value without changing the actual values (some of which are negative).
For example, for the 1st row I would like to go from
-0.01189179 0.03687456 -0.12202753
to
-0.12202753 0.03687456 -0.01189179.
For the 2nd row, from
-0.04220260 0.04129326 -0.07178175
to
-0.07178175 -0.04220260 0.04129326, and so on.
How can I do this in R?
Many thanks!
Try this
lst <- lapply(df, \(x) order(-abs(x)))
ans <- data.frame(Map(\(x, y) x[y], df, lst))
output
a b
1 -0.01189179 -0.07178175
2 0.03687456 -0.04220260
3 -0.12202753 0.04129326
data
df <- structure(list(a = c(-0.12202753, 0.03687456, -0.01189179), b = c(-0.0422026,
0.04129326, -0.07178175)), row.names = c(NA, -3L), class = "data.frame")
Here is a simple approach (using #Mohamed Desouky's Data)
df <- df[nrow(df):1,]
> df
a b
3 -0.01189179 -0.07178175
2 0.03687456 0.04129326
1 -0.12202753 -0.04220260
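For completeness, here is a base R sketch that reorders within each row (the orientation described in the question) rather than within each column, assuming every column of df is numeric:
# Reorder each row's values by decreasing absolute value.
# After row-wise reordering the column names no longer line up with the original columns.
row_sorted <- as.data.frame(t(apply(df, 1, function(r) r[order(-abs(r))])))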
I'm trying to aggregate values over a flexible pooling variable, e.g. calculate the mean of my value x over every run of rows for which the sum of consecutive d equals a predetermined value. I think it comes down to finding the indices where those sums are reached and using them to create a grouping variable, but I don't know how to do this.
> head(dat)
x d
1 0.10000112 22.24835
2 0.11074217 22.24835
3 0.03002743 22.24835
4 0.05756194 22.24836
5 0.10906047 22.24836
6 0.05954912 25.12431
I want to calculate the mean/sum/length of x over every run of rows for which the sum of d is, e.g., ~100.
sample data:
structure(list(x = c(0.10000112377193, 0.110742170350877, 0.0300274304561404,
0.0575619395964912, 0.109060465438596, 0.0595491225614035, 0.0539270264912281,
0.0812452063859649, 0.0341699389122807, 0.0391744879122807, 0.0411787485614035,
0.0996091644385965, 0.0970479474912281, 0.0595715843684211, 0.0483489989122807,
0.0549631194561404, 0.0705080555964912, 0.080437472631579, 0.105883664631579,
0.0872411613684211, 0.103236660631579, 0.0381296894912281, 0.0465064491578947,
0.0936565184561403, 0.0410095752631579, 0.0311180032105263, 0.0257758157894737,
0.0354721928947368, 0.0584999394736842, 0.0241286060175439, 0.112053376666667,
0.0769823868596491, 0.0558137530526316, 0.0374491000701754, 0.0419279142631579,
0.0260257506842105, 0.0544360374561404, 0.107411071842105, 0.103873468,
0.0419322114035088, 0.0483912961052632, 0.0328373653157895, 0.0866868717719298,
0.063990467245614, 0.0799280314035088, 0.123490407070175, 0.145676836280702,
0.0292878782807018, 0.0432093036666667, 0.0203547443684211),
d = c(22.2483512600033, 22.2483529247042, 22.2483545865809,
22.2483562542823, 22.24835791863, 25.1243105415557, 25.1243148759953,
25.1243192107884, 25.1243235416981, 25.1243278750792, 27.2240858553058,
27.2240943134697, 27.2241027638674, 27.224111222031, 27.2241196741942,
24.5623431981188, 24.5623453409221, 24.5623474809012, 24.562349626705,
24.5623517696847, 28.1458125837154, 28.1458157376341, 28.1458188889053,
28.1458220452951, 28.1458251983314, 27.8293318542146, 27.8293366652115,
27.8293414829159, 27.829346292148, 27.8293511094993, 27.5271773325046,
27.5271834011289, 27.5271894694002, 27.5271955369655, 27.5272016048837,
28.0376097925214, 28.0376146410729, 28.0376194959786, 28.0376243427651,
28.0376291969647, 26.8766095768196, 26.8766122563318, 26.8766149309023,
26.8766176123562, 26.8766202925746, 27.8736950101666, 27.8736960528853,
27.8736971017815, 27.8736981446767, 27.8736991932199)), row.names = c(NA,
50L), class = "data.frame")
Maybe this helps
library(dplyr)
dat %>%
  mutate(rn = row_number()) %>%
  group_by(grp = (cumsum(d) - 1) %/% 100 + 1) %>%
  summarise(x = mean(x, na.rm = TRUE), start = first(rn), end = last(rn))
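To see what the grouping expression does, here is a tiny illustration with hypothetical d values (chosen only to make the arithmetic easy to follow); a row starts a new group each time the running sum of d passes a multiple of 100:
d <- c(40, 40, 40, 40, 40)     # hypothetical step sizes
cumsum(d)                      # 40  80 120 160 200
(cumsum(d) - 1) %/% 100 + 1    # 1 1 2 2 2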
Good day!
I have a dataset with values like "Invalid", "Invalid(N/A)", "Invalid(1.23456)", lots of them in different columns, and they differ from file to file.
The goal is to write a script file that can process different CSVs.
I tried read.csv and read_csv, but either ran into errors about data types or got no errors and no effect.
All columns are col_character except one, which is col_double.
Tried this:
is.na(df) <- startsWith(as.character(df), "Inval")
no luck
Tried this:
is.na(df) <- startsWith(df, "Inval")
no luck; some error about a non-character object
Tried this:
df %>%
mutate(across(everything(), .fns = ~str_replace(., "invalid", NA_character_)))
no luck
And other things found via Google; no luck again, either errors with data types or no errors but no effect.
So R is incapable of a simple find and replace in a data frame, huh?
data frame example
Output of dput(dtype_Result[1:20, 1:4])
structure(list(Location = c("1(1,A1)", "2(1,B1)", "3(1,C1)",
"4(1,D1)", "5(1,E1)", "6(1,F1)", "7(1,G1)", "8(1,H1)", "9(1,A2)",
"10(1,B2)", "11(1,C2)", "12(1,D2)", "13(1,E2)", "14(1,F2)", "15(1,G2)",
"16(1,H2)", "17(1,A3)", "18(1,B3)", "19(1,C3)", "20(1,D3)"),
Sample = c("Background0", "Background0", "Standard1", "Standard1",
"Standard2", "Standard2", "Standard3", "Standard3", "Standard4",
"Standard4", "Standard5", "Standard5", "Standard6", "Standard6",
"Control1", "Control1", "Control2", "Control2", "Unknown1",
"Unknown1"), EGF = c(NA, NA, "6.71743640129069", "2.66183193679533",
"16.1289784536322", "16.1289784536322", "78.2706654825781",
"78.6376213069722", "382.004087907716", "447.193928257862",
"Invalid(N/A)", "1920.90297258996", "7574.57784103579", "29864.0308009592",
"167.830723655146", "109.746615928611", "868.821939675054",
"971.158518683179", "9.59119569511596", "4.95543581398464"
), `FGF-2` = c(NA, NA, "25.5436745776637", NA, "44.3280630362038",
NA, "91.991708192168", "81.9459159768959", "363.563899234418",
"425.754478700876", "Invalid(2002.97340881547)", "2027.71958119836",
"9159.40221389147", "11138.8722428849", "215.58494072476",
"70.9775438699825", "759.798876479002", "830.582605561901",
"58.7007261370257", "70.9775438699825")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The error is in the use of startsWith. The following grepl solution is simpler and works.
is.na(df) <- sapply(df, function(x) grepl("^Invalid", x))
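If the end goal is numeric measurement columns (the question mentions one col_double among col_character columns), a hedged follow-up is to let type.convert() re-guess the column types once the Invalid entries are NA; this assumes the remaining values in those columns are plain numbers:
# Re-guess column types after the "Invalid..." entries have been replaced with NA;
# purely character columns such as Location and Sample stay character.
df[] <- lapply(df, type.convert, as.is = TRUE)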
The str_replace function will attempt to edit the content of a character string, inserting a partial replacement, rather than replacing it entirely. Also, the across call is targeting all of the columns, including the numeric id.
To fix it, use where to restrict it to the character columns, then use if_else to overwrite the data with NA whenever there is a partial string match, using str_detect to spot the target text. The following code, building on the tidyverse example you provided, works:
Example data
library(tidyverse)
df <- tibble(
id = 1:3,
x = c("a", "invalid", "c"),
y = c("d", "e", "Invalid/NA")
)
df
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 invalid e
3 3 c Invalid/NA
Solution
df <- df %>%
  mutate(
    across(where(is.character),
           .fns = ~ if_else(str_detect(tolower(.x), "invalid"), NA_character_, .x))
  )
print(df)
Result
# A tibble: 3 x 3
id x y
<int> <chr> <chr>
1 1 a d
2 2 NA e
3 3 c NA
I'm fairly new to R and I'm running into the following problem.
Let's say I have the following data frames:
sale_df <- data.frame("Cheese" = c("cheese-01", "cheese-02", "cheese-03"), "Number_of_sales" = c(4, 8, 23))
id_df <- data.frame("ID" = c(1, 2, 3), "Name" = c("Leerdammer", "Gouda", "Mozerella"))
What I want to do is match the numbers of the first column of id_df to the numbers in the string of the first column of sale_df.
Then I want to replace the value in sale_df by the value in the second column of id_df, i.e. I want cheese-01 to become "Leerdammer".
Does anyone have any idea how I could solve this?
With the tidyverse:
sale_df %>%
  mutate(ID = as.numeric(str_extract(Cheese, "(?<=cheese-).*"))) %>%
  inner_join(id_df, by = "ID")
# Cheese Number_of_sales ID Name
#1 cheese-01 4 1 Leerdammer
#2 cheese-02 8 2 Gouda
#3 cheese-03 23 3 Mozerella
Assuming that all entries for Cheese in sale_df will start with cheese-, here is a simple solution.
sale_df$CheeseID <- as.numeric(substring(sale_df$Cheese, 8))
merge(sale_df, id_df, by.x = "CheeseID", by.y = "ID", all.x = TRUE)
sale_df$Number_of_sales <- id_df$Name[match(as.numeric(gsub("\\D", "", sale_df$Cheese)), id_df$ID)]
> sale_df
Cheese Number_of_sales
1 cheese-01 Leerdammer
2 cheese-02 Gouda
3 cheese-03 Mozerella