I'm looking for some help with what may be a basic issue. Suppose I have the following:
tibble(
x= c("a","a","b","b","b","b"),
y= c(1,2,1,2,1,2)
)
# A tibble: 6 x 2
# x y
# <chr> <dbl>
# 1 a 1
# 2 a 2
# 3 b 1
# 4 b 2
# 5 b 1
# 6 b 2
I would like to transform to the following tibble:
tibble(
x= c("a","b","b"),
y.1= c("1","1","1"),
y.2= c("2","2","2")
)
# A tibble: 3 x 3
# x y.1 y.2
# <chr> <chr> <chr>
# 1 a 1 2
# 2 b 1 2
# 3 b 1 2
What's the best way to achieve this? I tried tidyr::pivot_wider, but I couldn't figure out how to keep the two separate "b" rows in the x column.
Given the current structure, one possible approach is to subset with recycled logical vectors:
library(tidyverse)
df = tibble(
x= c("a","a","b","b","b","b"),
y= c(1,2,1,2,1,2)
)
df %>%
summarise(
x = x[c(T, F)],
y.1 = y[c(T, F)],
y.2 = y[c(F, T)]
)
#> # A tibble: 3 x 3
#> x y.1 y.2
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 b 1 2
#> 3 b 1 2
Note that this breaks if the rows are not in the expected order (y alternating 1, 2 within each pair of rows).
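One way to depend less on the exact row order is to number the repeated (x, y) combinations and let tidyr::pivot_wider spread them. This is only a sketch: the helper columns pair_id and name are introduced here for illustration and are not part of the question, and it assumes every y = 1 row has a matching y = 2 row for the same x.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  x = c("a", "a", "b", "b", "b", "b"),
  y = c(1, 2, 1, 2, 1, 2)
)

res <- df %>%
  group_by(x, y) %>%
  mutate(pair_id = row_number()) %>%  # k-th occurrence of each (x, y) combination
  ungroup() %>%
  mutate(name = paste0("y.", y)) %>%  # target column names: "y.1", "y.2"
  pivot_wider(id_cols = c(x, pair_id), names_from = name, values_from = y) %>%
  select(-pair_id)                    # drop the helper id again

res
```

Because the pairing is done per (x, y) combination, the rows for a given x may appear in any order, as long as each pair is complete.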
I want to add a new column based on a given character vector.
For example, in the example below, I want to add column d defined in expr:
library(magrittr)
data <- tibble::tibble(
a = c(1, 2),
b = c(3, 4)
)
expr <- "d = a + b"
just as below:
data %>%
dplyr::mutate(d = a + b)
# # A tibble: 2 x 3
# a b d
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
However, in the code below, while the calculations themselves (i.e., the addition) work, the names of the new columns are different from what I expected.
data %>%
dplyr::mutate(!!rlang::parse_expr(expr))
# # A tibble: 2 x 3
# a b `d = a + b`
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
data %>%
dplyr::mutate(!!rlang::parse_quo(expr, env = rlang::global_env()))
# # A tibble: 2 x 3
# a b `d = a + b`
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
data %>%
dplyr::mutate(rlang::eval_tidy(rlang::parse_expr(expr)))
# # A tibble: 2 x 3
# a b `rlang::eval_tidy(rlang::parse_expr(expr))`
# <dbl> <dbl> <dbl>
# 1 1 3 4
# 2 2 4 6
How can I properly use an expression in dplyr::mutate?
My question is similar to this, but in my example, the new variable (d) and its definition (a + b) are given in a single character vector (expr).
Let's first look at what kind of input dplyr::mutate takes to create named variables: it needs a named list of expressions, where each element's name becomes the new column name and the expression defines its value.
library(tidyverse)
data <- tibble::tibble(
a = c(1, 2),
b = c(3, 4)
)
expr <- "d = a + b"
# let's rewrite the string above as named list containing an expression.
expr2 <- list(d = expr(a + b))
# this works as expected:
data %>%
mutate(!!! expr2)
#> # A tibble: 2 x 3
#> a b d
#> <dbl> <dbl> <dbl>
#> 1 1 3 4
#> 2 2 4 6
Now we simply need a function that transforms a string into a named list containing the expression of the right-hand side of the equation. The name needs to be the left-hand side of the equation. We can do this with regular string manipulations. Finally we need to transform the right-hand side of the equation from a string into an expression. We can use base R's str2lang here.
create_expr_ls <- function(str_expr) {
expr_nm <- str_extract(str_expr, "^\\w+")
expr_code <- str_replace_all(str_expr, "(^\\w+\\s?=\\s?)(.*)", "\\2")
set_names(list(str2lang(expr_code)), expr_nm)
}
expr3 <- create_expr_ls(expr)
data %>%
mutate(!!! expr3)
#> # A tibble: 2 x 3
#> a b d
#> <dbl> <dbl> <dbl>
#> 1 1 3 4
#> 2 2 4 6
Created on 2022-01-23 by the reprex package (v0.3.0)
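As a variation on the string manipulation above, the left-hand and right-hand sides can also be taken from the parsed call itself, which avoids regular expressions. This is a minimal sketch, assuming the string always has the form "name = expression"; the helper name split_assignment is made up for this example.

```r
library(dplyr)
library(rlang)

data <- tibble(a = c(1, 2), b = c(3, 4))
expr <- "d = a + b"

# Hypothetical helper: parse "name = expression" and return list(name = expression)
split_assignment <- function(str_expr) {
  parsed <- parse_expr(str_expr)  # a call to `=` with two arguments
  stopifnot(is.call(parsed), identical(parsed[[1]], as.name("=")))
  lhs <- as_string(parsed[[2]])   # "d"
  rhs <- parsed[[3]]              # the call a + b
  set_names(list(rhs), lhs)
}

result <- data %>% mutate(!!!split_assignment(expr))
result
```

The stopifnot() call makes the helper fail early on strings that are not a single assignment, instead of silently creating an oddly named column.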
Any of these work. The second is similar to the first but does not require that rlang be on the search path. The third and fourth also work if the d= part is not present in expr in which case default names are used. The last one uses only base R and is also the shortest.
data %>% mutate(within(., !!parse_expr(expr)))
data %>% mutate(within(., !!parse(text = expr)))
data %>% mutate(!!parse_expr(sprintf("tibble(%s)", expr)))
data %>% { eval_tidy(parse_expr(sprintf("mutate(., %s)", expr))) }
within(data, eval(parse(text = expr))) # base R
Note
Assume this preamble:
library(dplyr)
library(rlang)
# input
data <- tibble(a = c(1, 2), b = c(3, 4))
expr <- "d = a + b"
To get the desired name for the mutated column, you can still use the same syntax and assign the results to a column with the preferred name. To get this name you can use a regular expression to find what is before = and then remove any leading or trailing spaces that might exist.
expr <- "x = a * b"
library(stringr) # provides str_extract()
col_name <- trimws(str_extract(expr, "[^=]+"))
data %>%
dplyr::mutate(!!col_name := !!rlang::parse_expr(expr))
# A tibble: 2 × 3
a b x
<dbl> <dbl> <dbl>
1 1 3 3
2 2 4 8
data %>%
dplyr::mutate(!!col_name := !!rlang::parse_quo(expr, env = rlang::global_env()))
# A tibble: 2 × 3
a b x
<dbl> <dbl> <dbl>
1 1 3 3
2 2 4 8
data %>%
dplyr::mutate(!!col_name := rlang::eval_tidy(rlang::parse_expr(expr)))
# A tibble: 2 × 3
a b x
<dbl> <dbl> <dbl>
1 1 3 3
2 2 4 8
I'm using the sample dataset below:
mytable <- read.table(text=
"group team num ID
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8",
header = TRUE, stringsAsFactors = FALSE)
I want to create a separate data frame for each set of variables that I group by, including one that groups by two variables at once. For example, I want a separate data frame that groups the data by both team and ID. How do I do that?
library(dplyr)
lapply(c("group","team","ID",c("team","ID")), function(x){
group_by(mytable,across(c(x,num)))%>%summarise(Count = n()) %>% mutate(new=x)%>% as.data.frame()
})
See if this is what you want.
library(dplyr)
cols <- list("group","team","ID", c("team","ID"))
lapply(cols, function(x, dat = mytable){
dat2 <- dat %>%
group_by(across(all_of(x))) %>%
summarise(Count = n()) %>%
mutate(new = toString(x)) %>%
as.data.frame()
return(dat2)
})
# `summarise()` has grouped output by 'team'. You can override using the `.groups` argument.
# [[1]]
# group Count new
# 1 a 4 group
# 2 b 4 group
#
# [[2]]
# team Count new
# 1 x 4 team
# 2 y 4 team
#
# [[3]]
# ID Count new
# 1 4 2 ID
# 2 5 1 ID
# 3 7 1 ID
# 4 8 1 ID
# 5 9 3 ID
#
# [[4]]
# team ID Count new
# 1 x 4 1 team, ID
# 2 x 7 1 team, ID
# 3 x 9 2 team, ID
# 4 y 4 1 team, ID
# 5 y 5 1 team, ID
# 6 y 8 1 team, ID
# 7 y 9 1 team, ID
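A shorter variant of the same idea, as a sketch: dplyr::count() groups, counts and ungroups in one step, so no grouping message is printed. Naming the list elements with toString() is a choice made for this example, not something the question asked for.

```r
library(dplyr)

mytable <- read.table(text =
"group team num ID
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8",
header = TRUE, stringsAsFactors = FALSE)

# A list (not c()) keeps c("team", "ID") together as one grouping set
cols <- list("group", "team", "ID", c("team", "ID"))

counts <- lapply(cols, function(x) {
  count(mytable, across(all_of(x)), name = "Count")
})
names(counts) <- vapply(cols, toString, character(1))

counts[["team, ID"]]
```

Using list() rather than c() for cols is the key point: c("group", "team", "ID", c("team", "ID")) flattens into five separate strings, so the two-variable grouping is lost.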
Does this, based on tidyverse, give you what you want?
library(tidyverse)
mytable %>%
group_by(team, ID) %>%
group_split()
<list_of<
tbl_df<
group: character
team : character
num : integer
ID : integer
>
>[7]>
[[1]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 a x 2 4
[[2]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 b x 1 7
[[3]]
# A tibble: 2 × 4
group team num ID
<chr> <chr> <int> <int>
1 a x 1 9
2 b x 3 9
[[4]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 b y 4 4
[[5]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 a y 3 5
[[6]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 b y 2 8
[[7]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 a y 4 9
How can I divide a data.frame into several data.frames according to some special character?
df <- tibble(NumberA = c(5,3,2,0,"\\#",2,0,"\\#",3,1,1,3,1,0,"\\#"),
NumberB = c(5,6,2,5,"\\#",4,3,"\\#",4,3,2,1,3,9,"\\#"))
Option 1
A base one-liner :
split(df, replace(cumsum(df$NumberA == "\\#"), df$NumberA == "\\#", NA))
Option 2
A dplyr solution with group_split().
library(dplyr)
df %>%
group_by(grp = cumsum(NumberA == "\\#")) %>%
filter(NumberA != "\\#") %>%
group_split(.keep = FALSE)
Output
# [[1]]
# # A tibble: 4 x 2
# NumberA NumberB
# <chr> <chr>
# 1 5 5
# 2 3 6
# 3 2 2
# 4 0 5
#
# [[2]]
# # A tibble: 2 x 2
# NumberA NumberB
# <chr> <chr>
# 1 2 4
# 2 0 3
#
# [[3]]
# # A tibble: 6 x 2
# NumberA NumberB
# <chr> <chr>
# 1 3 4
# 2 1 3
# 3 1 2
# 4 3 1
# 5 1 3
# 6 0 9
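Both options rely on the same trick: cumsum() over the "is this a separator?" flag gives every row a block number. The sketch below only shows the intermediate vectors for the NumberA column; the names is_sep, grp and grp_no_sep are illustrative.

```r
NumberA <- c(5, 3, 2, 0, "\\#", 2, 0, "\\#", 3, 1, 1, 3, 1, 0, "\\#")

is_sep <- NumberA == "\\#"  # TRUE on the separator rows
grp <- cumsum(is_sep)       # block number, increases at each separator

# Option 1 sets the separator rows to NA so that split() drops them
grp_no_sep <- replace(grp, is_sep, NA)

grp
table(grp_no_sep)           # block sizes, separators excluded
```

Each separator row gets the number of the block that follows it, which is why Option 1 replaces those positions with NA and Option 2 filters them out before splitting.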
Update
If you want the mean of each column in each data.frame, combined into one data.frame, you can use map_dfr() from purrr.
library(purrr)
map_dfr(df_split, ~ colMeans(mutate(.x, across(everything(), as.numeric))))
# # A tibble: 3 x 2
# NumberA NumberB
# <dbl> <dbl>
# 1 2.5 4.5
# 2 1 3.5
# 3 1.5 3.67
where df_split is the split data (the list returned by either option above).
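For completeness, here is a self-contained sketch that builds df_split with Option 2 and then computes the means. It assumes, as in the example data, that every non-separator value can be converted to a number.

```r
library(dplyr)
library(purrr)

df <- tibble(NumberA = c(5, 3, 2, 0, "\\#", 2, 0, "\\#", 3, 1, 1, 3, 1, 0, "\\#"),
             NumberB = c(5, 6, 2, 5, "\\#", 4, 3, "\\#", 4, 3, 2, 1, 3, 9, "\\#"))

# Option 2 from above, with the result stored under the name used in the update
df_split <- df %>%
  group_by(grp = cumsum(NumberA == "\\#")) %>%
  filter(NumberA != "\\#") %>%
  group_split(.keep = FALSE)

# Columns are character because of the "\\#" rows, so convert before averaging
means <- map_dfr(df_split, ~ colMeans(mutate(.x, across(everything(), as.numeric))))
means
```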
A mix of base R and tidyverse would be (although @DarrenTsai's solution is already very good):
library(dplyr)
library(tidyverse)
#Data
df <- tibble(NumberA=c(5,3,2,0,"\\#",2,0,"\\#",3,1,1,3,1,0,"\\#"),
NumberB=c(5,6,2,5,"\\#",4,3,"\\#",4,3,2,1,3,9,"\\#"))
#Detect characters
index <- which(df$NumberA=='\\#')
#Assign var
df$Var <- NA
df$Var[index]<-1:length(index)
#Fill
df %>% fill(Var,.direction = 'up') -> df1
#Remove rows with character
df1 <- df1[-index,]
#Compute mean
df1 %>% mutate(NumberA=as.numeric(NumberA),NumberB=as.numeric(NumberB)) %>%
group_by(Var) %>% summarise_all(.funs = mean) %>% mutate(Var=paste0('df',Var)) -> dfmean
#Split
L1 <- split(df1,df1$Var)
#Remove var
L1 <- lapply(L1,function(x) {x$Var<-NULL; return(x)})
#Dataframes
names(L1)<-paste0('df',names(L1))
list2env(L1,envir = .GlobalEnv)
It will create:
df1
# A tibble: 4 x 2
NumberA NumberB
<chr> <chr>
1 5 5
2 3 6
3 2 2
4 0 5
df2
# A tibble: 2 x 2
NumberA NumberB
<chr> <chr>
1 2 4
2 0 3
df3
# A tibble: 6 x 2
NumberA NumberB
<chr> <chr>
1 3 4
2 1 3
3 1 2
4 3 1
5 1 3
6 0 9
And for the means the output:
# A tibble: 3 x 3
Var NumberA NumberB
<chr> <dbl> <dbl>
1 df1 2.5 4.5
2 df2 1 3.5
3 df3 1.5 3.67
I created the df again using the data.frame function, since the tibble function did not work for me.
I then created a list of the new data frames, split at our separator "\\#".
# Require packages
require(dplyr)
# Create the df
df <- data.frame(NumberA=c(5,3,2,0,"\\#",2,0,"\\#",3,1,1,3,1,0,"\\#"),
NumberB=c(5,6,2,5,"\\#",4,3,"\\#",4,3,2,1,3,9,"\\#"))
# Flag the separator rows, number the runs between them, and keep only the non-separator rows
df <- df %>% mutate(split_point = NumberA == "\\#",
block = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
filter(!split_point)
# Create an empty list to store the data frames inside a loop
list_df <- list()
# Unique blocks of df
blocks <- unique(df$block)
# Loop to create the list of data frames
for (i in seq_along(blocks)) {
list_df[[i]] <- df[df$block == blocks[i], ]
}
list_df
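The loop above can also be replaced by a single call to base R's split(). This is a sketch of that alternative using the same data; the names is_sep and block are illustrative, and the block numbers come from cumsum() instead of rle().

```r
df <- data.frame(NumberA = c(5, 3, 2, 0, "\\#", 2, 0, "\\#", 3, 1, 1, 3, 1, 0, "\\#"),
                 NumberB = c(5, 6, 2, 5, "\\#", 4, 3, "\\#", 4, 3, 2, 1, 3, 9, "\\#"))

is_sep <- df$NumberA == "\\#"  # separator rows
block <- cumsum(is_sep)        # block number for every row

# Drop the separator rows, then split the rest by block; unname() gives a plain list
list_df <- unname(split(df[!is_sep, ], block[!is_sep]))
list_df
```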
I've imported an excel data set and want to set nearly all columns (greater than 90) to numeric when they are initially characters. What is the best way to achieve this because importing and changing each to numeric one by one isn't the most efficient approach?
This should do as you wish:
# Random data frame for illustration (100 columns wide)
df <- data.frame(replicate(100, sample(0:1, 1000, replace = TRUE)))
# Check column names / positions (just in case you want to verify them)
colnames(df)
# Specify columns
cols <- seq_along(df) # covers every column, even if more are added at a later date
# Or, if you only want to specify particular column numbers:
# cols <- c(1:100)
# With the help of magrittr's assignment pipe (%<>%), change all of them to numeric
library(magrittr)
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Check our columns are numeric
str(df)
Assuming your data is already imported with all character columns, you can convert the relevant columns to numeric using mutate_at by position or name:
suppressPackageStartupMessages(library(tidyverse))
# Assume the imported excel file has 5 columns a to e
df <- tibble(a = as.character(1:3),
b = as.character(5:7),
c = as.character(8:10),
d = as.character(2:4),
e = as.character(2:4))
# select the columns by position (convert all except 'b')
df %>% mutate_at(c(1, 3:5), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# or drop the columns that shouldn't be used ('b' and 'd' should stay as chr)
df %>% mutate_at(-c(2, 4), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# select the columns by name
df %>% mutate_at(c("a", "c", "d", "e"), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
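In dplyr 1.0.0 and later, mutate_at() is superseded by across(), so the same conversions could be written as follows. This is a sketch using the same toy data; the result names by_position, by_name and by_type are only for illustration.

```r
library(dplyr)

df <- tibble(a = as.character(1:3),
             b = as.character(5:7),
             c = as.character(8:10),
             d = as.character(2:4),
             e = as.character(2:4))

# select the columns by position (convert all except 'b')
by_position <- df %>% mutate(across(c(1, 3:5), as.numeric))

# exclude columns by name ('b' and 'd' stay character)
by_name <- df %>% mutate(across(-c(b, d), as.numeric))

# convert every character column, which suits the "nearly all columns" case
by_type <- df %>% mutate(across(where(is.character), as.numeric))

sapply(by_name, class)
```

Note that as.numeric() turns values that cannot be parsed as numbers into NA with a warning, so it is worth checking the result with str() or summary() after converting.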
I have a tibble in which one column is a list column whose elements always hold two numeric values named a and b (e.g. as a result of calling purrr::map with a function that returns a list), say:
df <- tibble(x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)))
df
# A tibble: 3 × 2
x y
<int> <list>
1 1 <list [2]>
2 2 <list [2]>
3 3 <list [2]>
How do I separate the list column y into two columns a and b, and get:
df_res <- tibble(x = 1:3, a = c(1,3,5), b = c(2,4,6))
df_res
# A tibble: 3 × 3
x a b
<int> <dbl> <dbl>
1 1 1 2
2 2 3 4
3 3 5 6
Looking for something like tidyr::separate to deal with a list instead of a string.
Using dplyr (current release: 0.7.0):
bind_cols(df[1], bind_rows(df$y))
# # A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
# 1 1 1 2
# 2 2 3 4
# 3 3 5 6
edit based on OP's comment:
To embed this in a pipe and in case you have many non-list columns, we can try:
df %>% select(-y) %>% bind_cols(bind_rows(df$y))
We could also make use of map_df from purrr:
library(tidyverse)
df %>%
summarise(x = list(x), new = list(map_df(.$y, bind_rows))) %>%
unnest(cols = c(x, new))
# A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
#1 1 1 2
#2 2 3 4
#3 3 5 6
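With newer versions of tidyr (1.0.0 and later), unnest_wider() does this directly. A minimal sketch, assuming every element of y is a named list with the same names:

```r
library(tibble)
library(tidyr)

df <- tibble(x = 1:3,
             y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)))

# Each named element of the list column becomes its own column
df_wide <- unnest_wider(df, y)
df_wide
```

This also fits into a pipe (df %>% unnest_wider(y)) and leaves any other non-list columns untouched.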