I have a tibble with several columns in which numbers are stored as text:
my_tbl <- tibble(names = letters[1:5],
value1 = as.character(runif(5)),
value2 = as.character(runif(5)))
Now I'd like to change the type of these columns ("value1" and "value2") from character to numeric. The only option I've found is a for loop:
for (i in 2:ncol(my_tbl)) {
my_tbl[[i]] <- as.numeric(my_tbl[[i]])
}
Is there a possibility to do this without a loop?
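You can use mutate_if from dplyr. Grouping by names keeps that column out of the conversion, because mutate_if leaves grouping variables untouched. Assign the result if you want to keep it:
library(dplyr)
my_tbl %>%
group_by(names) %>%
mutate_if(is.character, as.numeric)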
## A tibble: 5 x 3
## Groups: names [5]
# names value1 value2
# <chr> <dbl> <dbl>
#1 a 0.427 0.0191
#2 b 0.817 0.300
#3 c 0.108 0.158
#4 d 0.394 0.643
#5 e 0.775 0.311
With purrr you could do this:
If you already know your target columns:
library(purrr)
modify_at(my_tbl,-1,as.numeric)
If you need to detect them:
modify_if(my_tbl,~is.character(.) && !any(grepl("[[:alpha:]]",.)),as.numeric)
# # A tibble: 5 x 3
# names value1 value2
# <chr> <dbl> <dbl>
# 1 a 0.715 0.943
# 2 b 0.639 0.128
# 3 c 0.471 0.0395
# 4 d 0.374 0.374
# 5 e 0.500 0.800
Using dplyr instead of purrr, these yield the same results:
library(dplyr)
mutate_at(my_tbl,-1,as.numeric)
mutate_if(my_tbl,~is.character(.) && !any(grepl("[[:alpha:]]",.)),as.numeric)
The base R translations:
my_tbl[-1] <- lapply(my_tbl[-1],as.numeric)
my_tbl[] <- lapply(my_tbl,function(x)
if (is.character(x) && !any(grepl("[[:alpha:]]",x))) as.numeric(x)
else x)
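In dplyr 1.0.0 and later, mutate_at and mutate_if are superseded by across(). The following is a minimal sketch under that assumption. The helper is_numeric_text is hypothetical (it is not part of any package); it tests whether the strings actually parse as numbers, so values in scientific notation such as "1e-05" are not rejected for containing a letter.

```r
library(dplyr)

set.seed(1)
my_tbl <- tibble(names  = letters[1:5],
                 value1 = as.character(runif(5)),
                 value2 = as.character(runif(5)))

# Hypothetical helper: TRUE when every non-missing string parses as a number
is_numeric_text <- function(x) {
  is.character(x) && !anyNA(suppressWarnings(as.numeric(x[!is.na(x)])))
}

# Known target columns: everything except `names`
by_position <- my_tbl %>% mutate(across(-names, as.numeric))

# Detected target columns: only character columns that parse cleanly
by_detection <- my_tbl %>% mutate(across(where(is_numeric_text), as.numeric))
```

Both calls return a new tibble and leave my_tbl unchanged, so assign the result back if you want to overwrite it.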
I'm trying to go from a tibble of variable names and functions like this:
N <- 100
dat <-
tibble(
variable_name = c("a", "b"),
variable_value = c("rnorm(N)", "rnorm(N)")
)
to a tibble with two variables a and b of length N
dat2 <-
tibble(
a = rnorm(N),
b = rnorm(N)
)
is there a !!! or rlang-y way to accomplish this?
We can evaluate the string
library(dplyr)
library(purrr)
library(tibble)
deframe(dat) %>%
map_dfc(~ eval(rlang::parse_expr(.x)))
-output
# A tibble: 100 x 2
a b
<dbl> <dbl>
1 0.0750 2.55
2 -1.65 -1.48
3 1.77 -0.627
4 0.766 -0.0411
5 0.832 0.200
6 -1.91 -0.533
7 -0.0208 -0.266
8 -0.409 1.08
9 -1.38 -0.181
10 0.727 0.252
# … with 90 more rows
Here is a base way with a pipe and an as_tibble call.
Map(function(x) eval(str2lang(x)), setNames(dat$variable_value, dat$variable_name)) %>%
as_tibble
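Since the question asks for a !!! approach, here is a minimal sketch of one. It assumes the strings come from a trusted source, because parsing and evaluating arbitrary text runs whatever code the text contains. The name exprs is only illustrative.

```r
library(tibble)
library(rlang)

N <- 100
dat <- tibble(
  variable_name  = c("a", "b"),
  variable_value = c("rnorm(N)", "rnorm(N)")
)

# Parse each string into an unevaluated call, named by its target column
exprs <- lapply(deframe(dat), parse_expr)

# Splice the named calls into tibble(); each one is evaluated as a column
dat2 <- tibble(!!!exprs)
```

Splicing keeps the column names attached to the expressions, so no renaming step is needed afterwards.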
I have run out of R power on this one. I appreciate any help, it is probably quite simple for someone with more experience.
I have a data frame (tibble) with some numerical columns, a group column, and some other columns with other information. I want to do operations on the numerical columns, by group, but still retain all the columns.
I've put an example below: I am replacing the NAs with the group mean, for each column. The columns to replace the NAs are specified by the df_names variable.
It basically works, except it removes all columns except the numerical ones, AND reorders everything. Which makes it hard to reassemble. I could work around this, but I have a feeling there must be a simpler way to direct group_apply to specified columns, while retaining the other columns, and keeping the order.
Can anyone help? Thanks so much in advance!
Will
library("tidyverse")
# create tibble
df <- tibble(
name=letters[1:10],
csize=c("L","S","S","L","L","S","L","S","L","S"),
v1=rnorm(10),
v2=rnorm(10),
v3=rnorm(10)
)
# introduce some missing data
df$v1[3] <- NA
df$v1[6] <- NA
df$v1[7] <- NA
df$v3[2] <- NA
# these are the cols where I want to replace the NAs
df_names <- c("v1","v2","v3")
# this is the grouping variable (has to be stored as a string, since it is an input to the function)
groupvar <- "csize"
# now I want to replace the NAs with column means, restricted to their group
# the following line works, but the problem is that it removes the name column, and reorders the rows...
df_imp <- df %>% group_by(.dots=groupvar) %>% select(df_names) %>% group_modify( ~{replace_na(.x,as.list(colMeans(.x, na.rm=TRUE)))})
group_modify is overkill in this case; mutate(across()) is your friend here:
df %>% group_by(across(all_of(groupvar))) %>%
mutate(across(all_of(df_names), ~if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)))
Result:
> df
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L -1.22 1.48 -0.628
2 b S -1.17 0.0890 -0.130
3 c S -0.422 -0.0956 -0.0271
4 d L -0.265 0.180 -0.786
5 e L -0.491 0.509 -0.359
6 f S -0.422 -0.712 0.232
7 g L -0.400 -1.13 1.13
8 h S -0.538 -0.0785 0.690
9 i L 0.373 0.308 0.252
10 j S 0.445 0.743 -1.41
Does this work?
> library(dplyr)
> library(tidyr) # for replace_na
> df %>% group_by(csize) %>% mutate(across(v1:v3, ~ replace_na(., mean(., na.rm = TRUE))))
# A tibble: 10 x 5
# Groups: csize [2]
name csize v1 v2 v3
<chr> <chr> <dbl> <dbl> <dbl>
1 a L 1.57 0.310 -1.76
2 b S -0.705 0.0655 0.577
3 c S -1.05 1.28 1.82
4 d L 0.958 -2.09 -0.371
5 e L -0.712 0.247 -1.13
6 f S -1.05 -0.516 -0.107
7 g L 0.403 1.79 0.128
8 h S -0.793 1.52 1.07
9 i L -0.206 -0.369 -1.77
10 j S -1.65 -0.992 -0.476
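Because the question mentions that the grouping variable arrives as a string input to a function, the approach can be wrapped up as below. This is a sketch assuming dplyr 1.0.0 or later and numeric (double) target columns; the function name impute_group_mean is hypothetical.

```r
library(dplyr)

# Hypothetical wrapper: replace NAs in `cols` with the group mean,
# keeping every other column and the original row order
impute_group_mean <- function(data, groupvar, cols) {
  data %>%
    group_by(across(all_of(groupvar))) %>%
    mutate(across(all_of(cols),
                  ~ if_else(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
    ungroup()
}

df <- tibble(
  name  = letters[1:4],
  csize = c("L", "S", "L", "S"),
  v1    = c(1, NA, 3, 4),
  v2    = c(NA, 2, 6, NA)
)

out <- impute_group_mean(df, "csize", c("v1", "v2"))
# out$v1 is 1, 4, 3, 4 and out$v2 is 6, 2, 6, 2
```

The name column survives because mutate only rewrites the selected columns, and grouping does not reorder the rows.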
Given some data like the following:
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))
# A tibble: 12 x 2
# class value
# <chr> <dbl>
# 1 a -1.21
# 2 a 0.277
# 3 a 1.08
# 4 a -2.35
# 5 a 0.429
# 6 a 0.506
# 7 b 0.943
# 8 b 0.945
# 9 b 0.944
#10 b 0.911
#11 b 0.952
#12 b 0.900
I'm trying to generate a new column (context) that contains the average of "value" of the X preceding and posterior rows, when possible. It would be desirable to have this by level of a factor in a different column. For example, for X=2, I would expect something like the following:
# A tibble: 12 x 3
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.7135
# 4 a -2.35 0.573
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.9377
#10 b 0.911 0.935
#11 b 0.952 NA
#12 b 0.900 NA
Note that for the first two rows it is not possible to generate the context value in this case, because they do not have X=2 preceding rows. The value -0.7135 at row 3 is the average of rows 1, 2, 4 and 5.
Similarly, rows 5 and 6 do not have a value of context, because these do not have two values afterwards belonging to the same level of the factor "class" (because row 7 is class="b" while 5 and 6 are class="a").
I do not know if this is even possible in R, I haven't found any similar questions, and I can only reach to solutions like the following one, which I think is not representative of this language.
My solution:
X <- 2
df_list <- df %>% dplyr::group_split(class)
result <- tibble()
for (i in 1:length(df_list)) {
tmp <- df_list[[i]]
context <- vector()
for (j in 1:nrow(tmp)) {
if (j<=X | j>nrow(tmp)-X) context <- c(context, NA)
else {
values <- vector()
for (k in 1:X) {
values <- c(values, tmp$value[j-k], tmp$value[j+k])
}
context <- c(context, mean(values))
}
}
tmp <- tmp %>% dplyr::mutate(context=context)
result <- result %>% dplyr::bind_rows(tmp)
}
This gives an approximate solution to the one above (differences due to rounding). But again, this approach lacks flexibility, e.g. if we want to create several columns at once for different values of X. Are there R functions developed to solve tasks like this one (e.g. vectorized functions)?
# this is your dataframe
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))
# pipes ('%>%') and grouping from the dplyr package
library(tidyverse)
# rolling mean function from the zoo package
library(zoo)
df %>% # take df
group_by(class) %>% # group it by class
mutate(context = (rollsum(value, 5, fill = NA) - value) / 4) # and calculate the rolling mean
Basically you calculate a rolling sum with a window width of 5 that is centered (the default), and you fill the remaining values with NAs. Since the value of the row itself is not to be included in the average, it is subtracted from the sum before dividing by 4.
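The same "window sum minus the centre value" idea can be sketched without zoo, using the convolution filter from the stats package. The helper name context_via_filter is hypothetical, and the sketch assumes each group has at least 2 * X + 1 values, since stats::filter refuses a filter longer than the series.

```r
# Hypothetical helper: mean of the X values before and after each position
context_via_filter <- function(v, X = 2) {
  # Centered moving sum over 2 * X + 1 values; NA where the window does not fit
  window_sum <- as.numeric(stats::filter(v, rep(1, 2 * X + 1), sides = 2))
  # Remove the centre value, then average the remaining 2 * X neighbours
  (window_sum - v) / (2 * X)
}

context_via_filter(c(1, 2, 3, 4, 5, 6))
# NA NA 3 4 NA NA
```

The stats:: prefix matters once dplyr is attached, because dplyr masks filter with its own row-filtering verb.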
One way using dplyr and purrr:
n <- 2
library(dplyr)
library(purrr) # for map_dbl
df %>%
group_by(class) %>%
mutate(context = map_dbl(row_number(), ~ if(.x <= n | .x > (n() - n))
NA else mean(value[c((.x - n):(.x - 1), (.x + 1) : (.x + n))])))
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.712
# 4 a -2.35 0.574
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.938
#10 b 0.911 0.935
#11 b 0.952 NA
#12 b 0.900 NA
Here is a base R solution using ave(), i.e.,
df <- within(df,
context <- ave(value,
class,
FUN = function(v,X=2) sapply(seq(v), function(k) ifelse(k-X < 1 | k+X >length(v),NA,mean(v[c(k-(X:1),k + (1:X))])))))
such that
> df
# A tibble: 12 x 3
class value context
<chr> <dbl> <dbl>
1 a -1.21 NA
2 a 0.277 NA
3 a 1.08 -0.712
4 a -2.35 0.574
5 a 0.429 NA
6 a 0.506 NA
7 b 0.943 NA
8 b 0.945 NA
9 b 0.944 0.938
10 b 0.911 0.935
11 b 0.952 NA
12 b 0.900 NA
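The question also asks about creating several context columns at once for different values of X. One possible sketch is to pull the per-position logic into a small vector function and call it once per X inside a grouped mutate. The helper context_mean and the simplified data are illustrative only.

```r
library(dplyr)

# Hypothetical helper: mean of the X neighbours on each side, NA near the edges
context_mean <- function(v, X) {
  n <- length(v)
  vapply(seq_len(n), function(k) {
    if (k - X < 1 || k + X > n) return(NA_real_)
    mean(v[c(k - (X:1), k + (1:X))])
  }, numeric(1))
}

# Simplified data so the results are easy to check by hand
df <- tibble(class = rep(c("a", "b"), each = 6), value = as.numeric(1:12))

result <- df %>%
  group_by(class) %>%
  mutate(context1 = context_mean(value, 1),
         context2 = context_mean(value, 2)) %>%
  ungroup()
# context2 for class "a" is NA NA 3 4 NA NA
```

Because the helper only sees the values of the current group, the window never crosses a class boundary.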
I'm trying to generate multiple new columns/variables in an R dataframe with dynamic new names taken from a vector. The new variables are computed from groups/levels of a single column.
The dataframe contains measurements (counts) of different chemical elements (element) along depth (z). The new variables are computed by dividing the counts of each element at a certain depth by the respective counts of proxy elements (proxies) at the same depth.
There is already a solution using mutate that works if I only want to create one new column/name the columns explicitly (see code below). I'm looking for a generalised solution to use in a shiny web app where proxies is not a string but a vector of strings and is dynamically changing according to user input.
# Working code for just one new column at a time (here Ti_ratio)
proxies <- "Ti"
df <- tibble(z = rep(1:10, 4), element = rep(c("Ag", "Fe", "Ca", "Ti"), each = 10), counts = rnorm(40))
df_Ti <- df %>%
group_by(z) %>%
mutate(Ti_ratio = counts/counts[element %in% proxies])
# Not working code for multiple columns at a time
proxies <- c("Ca", "Fe", "Ti")
varname <- paste(proxies, "ratio", sep = "_")
df_ratios <- df %>%
group_by(z) %>%
map(~ mutate(!!varname = .x$counts/.x$counts[element %in% proxies]))
Output of working code:
> head(df_Ti)
# A tibble: 6 x 4
# Groups: z [6]
z element counts Ti_ratio
<int> <chr> <dbl> <dbl>
1 1 Ag 2.41 4.10
2 2 Ag -1.06 -0.970
3 3 Ag -0.312 -0.458
4 4 Ag -0.186 0.570
5 5 Ag 1.12 -1.38
6 6 Ag -1.68 -2.84
Expected output of not working code:
> head(df_ratios)
# A tibble: 6 x 6
# Groups: z [6]
z element counts Ca_ratio Fe_ratio Ti_ratio
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Ag 2.41 4.78 -10.1 4.10
2 2 Ag -1.06 3.19 0.506 -0.970
3 3 Ag -0.312 -0.479 -0.621 -0.458
4 4 Ag -0.186 -0.296 -0.145 0.570
5 5 Ag 1.12 0.353 3.19 -1.38
6 6 Ag -1.68 -2.81 -0.927 -2.84
Edit:
I found a general solution to my problem with base R using two nested for-loops, similar to the answer posted by @fra (the difference being that here I loop both over the depth and the proxies):
library(tidyverse)
df <- tibble(z = rep(1:3, 4), element = rep(c("Ag", "Ca", "Fe", "Ti"), each = 3), counts = runif(12)) %>% arrange(z, element)
proxies <- c("Ca", "Fe", "Ti")
for (f in seq_along(proxies)) {
proxy <- proxies[f]
tmp2 <- NULL
for (i in unique(df$z)) {
tmp <- df[df$z == i,]
tmp <- as.data.frame(tmp$counts/tmp$counts[tmp$element %in% proxy])
names(tmp) <- paste(proxy, "ratio", sep = "_")
tmp2 <- rbind(tmp2, tmp)
}
df[, 3 + f] <- tmp2
}
And the correct output:
> head(df)
# A tibble: 6 x 6
z element counts Ca_ratio Fe_ratio Ti_ratio
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Ag 0.690 0.864 9.21 1.13
2 1 Ca 0.798 1 10.7 1.30
3 1 Fe 0.0749 0.0938 1 0.122
4 1 Ti 0.612 0.767 8.17 1
5 2 Ag 0.687 0.807 3.76 0.730
6 2 Ca 0.851 1 4.66 0.904
I made the dataframe contain less data so that it's clearly visible why this solution is correct (Ratios of elements with themselves = 1).
I'm still interested in a more elegant solution that I could use with pipes.
A tidyverse option could be to create a function, similar to your original code and then pass that through using map_dfc to create new columns.
library(tidyverse)
proxies <- c("Ca", "Fe", "Ti")
your_func <- function(x){
df %>%
group_by(z) %>%
mutate(!!paste(x, "ratio", sep = "_") := counts/counts[element %in% !!x]) %>%
ungroup() %>%
select(!!paste(x, "ratio", sep = "_") )
}
df %>%
group_modify(~map_dfc(proxies, your_func)) %>%
bind_cols(df, .) %>%
arrange(z, element)
# z element counts Ca_ratio Fe_ratio Ti_ratio
# <int> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 Ag -0.112 -0.733 -0.197 -1.51
# 2 1 Ca 0.153 1 0.269 2.06
# 3 1 Fe 0.570 3.72 1 7.66
# 4 1 Ti 0.0743 0.485 0.130 1
# 5 2 Ag 0.881 0.406 -6.52 -1.49
# 6 2 Ca 2.17 1 -16.1 -3.69
# 7 2 Fe -0.135 -0.0622 1 0.229
# 8 2 Ti -0.590 -0.271 4.37 1
# 9 3 Ag 0.398 0.837 0.166 -0.700
#10 3 Ca 0.476 1 0.198 -0.836
# ... with 30 more rows
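Using base R. Note that this relies on vector recycling, so it only gives the right ratios when the rows are sorted by element and every element has the same z values in the same order: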
proxies <- c("Ca", "Fe", "Ti")
for(f in proxies){
newDF <- as.data.frame(df$counts/df$counts[df$element %in% f])
names(newDF) <- paste(f, "ratio", sep = "_")
df <- cbind(df,newDF)
}
> df
z element counts Ca_ratio Fe_ratio Ti_ratio
1 1 Ag -0.40163072 -0.35820754 1.7375395 0.45692965
2 2 Ag -1.00880171 1.27798430 22.8520332 -2.84599471
3 3 Ag 0.72230855 -1.19506223 6.3893485 -0.73558507
4 4 Ag -1.71524002 -1.38942436 1.7564861 -3.03313134
5 5 Ag -0.30813737 1.08127226 4.1985801 -0.33008370
6 6 Ag 0.20524663 0.08910397 -0.3132916 -0.23778331
...
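For a pipe-friendly version that does not depend on row order, one option is to let a grouped mutate call a helper that returns a tibble, which dplyr (1.0.0 or later, an assumption here) unpacks into separate columns. The helper name ratio_cols is hypothetical, and the sketch assumes each proxy occurs exactly once per depth.

```r
library(dplyr)

# Hypothetical helper: one ratio column per proxy for a single depth
ratio_cols <- function(counts, element, proxies) {
  ratios <- lapply(proxies, function(p) counts / counts[element == p])
  names(ratios) <- paste(proxies, "ratio", sep = "_")
  as_tibble(ratios)
}

# Small hand-checkable data instead of random counts
df <- tibble(
  z       = rep(1:2, 4),
  element = rep(c("Ag", "Ca", "Fe", "Ti"), each = 2),
  counts  = c(1, 2, 2, 4, 4, 8, 8, 16)
)
proxies <- c("Ca", "Fe", "Ti")

df_ratios <- df %>%
  group_by(z) %>%
  mutate(ratio_cols(counts, element, proxies)) %>%
  ungroup()
```

Since proxies is an ordinary character vector, it can come straight from user input in a shiny app without any changes to the pipeline.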
I'm trying to calculate a cumulative sum with a depreciation rate.
I have a grouped data frame with a column num.
I want to add the numbers one by one, with depreciation.
If the rate is 1, the cumsum function in base R is good enough.
But if not, say with a rate of 0.5 (meaning the running total is multiplied by 0.5 before the next number is added), cumsum is not enough.
I tried to write my own function to work with dplyr, but it fails.
library(tidyverse)
# dataframe
id=sample(1:5,25,replace=TRUE)
num=rnorm(25)
df=data.frame(id,num)
# my custom function
depre=function(data){
rate=0.5
r=nrow(data)
sl=data$num
nl=data$num
for (i in 2:r){
sl[i]=sl[i-1]*rate+nl[i]
}
return(sl)
}
# work with one group
df %>% filter(id==1) %>% depre(.)
# failed to work with dplyr
df %>% group_by(id) %>% mutate(sl=depre(.))
I expect the first element of column sl to be the same as in column num.
Each following one should be the previous value multiplied by 0.5, plus the next num.
It works for one group, but fails on a data frame with multiple groups.
The error message is: "Error: Column sl must be length 6 (the group size) or one, not 25".
I have no idea why. Does anyone have a clue?
Thanks
Your function would work if you passed a vector to it instead of a data frame
depre <- function(num){
rate = 0.5
r= length(num)
sl = num
nl = num
for (i in seq_len(r)[-1]){ # unlike 2:r, this does nothing for a single-row group
sl[i]=sl[i-1]*rate+nl[i]
}
return(sl)
}
and then apply it by group.
library(dplyr)
df %>% group_by(id) %>% mutate(sl = depre(num))
We can split by 'id' and use the OP's function without any changes
library(dplyr)
library(purrr)
df %>%
group_split(id) %>%
map_df(~ tibble(id = .$id, sl = depre(.)))
# id sl
# <int> <dbl>
# 1 1 1.07
# 2 1 -0.776
# 3 1 -0.518
# 4 1 0.628
# 5 1 0.601
# 6 1 1.10
# 7 2 -0.734
# 8 2 -0.583
# 9 2 -0.437
#10 2 -3.45
# … with 15 more rows
Or an option would be accumulate from purrr, which is more compact
out <- df %>%
group_by(id) %>%
mutate(sl = accumulate(num, ~ .y + .x * 0.5))
out
# A tibble: 25 x 3
# Groups: id [5]
# id num sl
# <int> <dbl> <dbl>
# 1 3 -0.784 -0.784
# 2 2 -0.734 -0.734
# 3 2 -0.216 -0.583
# 4 3 -0.335 -0.727
# 5 5 -1.09 -1.09
# 6 4 -0.0854 -0.0854
# 7 1 1.07 1.07
# 8 2 -0.145 -0.437
# 9 3 -1.17 -1.53
#10 5 -0.819 -1.36
# … with 15 more rows
out %>%
filter(id == 1)
# A tibble: 6 x 3
# Groups: id [1]
# id num sl
# <int> <dbl> <dbl>
#1 1 1.07 1.07
#2 1 -1.31 -0.776
#3 1 -0.129 -0.518
#4 1 0.887 0.628
#5 1 0.287 0.601
#6 1 0.800 1.10
The issue in the OP's function is that its input is the whole dataset: inside mutate, the dot still refers to the full data frame piped into the chain, not to the current group, so nrow(data) returns the total number of rows (25) rather than the group size. With group_by, the dplyr convention for the group size is n(). With group_split, the data frame is split into a list of per-group data frames, so nrow gives the size of each one and the function works as written.
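For comparison, the same recurrence can be sketched in base R with Reduce, which also copes with groups that have a single row. The function name depre_reduce and the small data frame are illustrative only.

```r
library(dplyr)

# Hypothetical helper: running total where the previous total is
# multiplied by `rate` before the next number is added
depre_reduce <- function(num, rate = 0.5) {
  Reduce(function(acc, x) acc * rate + x, num, accumulate = TRUE)
}

df <- data.frame(id = c(1, 1, 1, 2), num = c(1, 2, 3, 5))

out <- df %>%
  group_by(id) %>%
  mutate(sl = depre_reduce(num)) %>%
  ungroup()
# out$sl is 1, 2.5, 4.25, 5

# The same recurrence as a recursive linear filter from the stats package
via_filter <- as.numeric(stats::filter(c(1, 2, 3), 0.5, method = "recursive"))
# via_filter is 1, 2.5, 4.25
```

Passing the rate as an argument, rather than fixing it inside the function, makes it easy to try other depreciation rates per call.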