Split multiple columns based on the same delimiter [duplicate] - r

I am struggling to find a more programmatic way to split multiple columns on the same delimiter.
The solution should work for n columns, all identified by a common pattern, in my example "^var[0-9]".
E.g.
library(tidyverse)
foo <- data.frame(var1 = paste0("a_",1:10), var2 = paste("a_",1:10), id = 1:10)
# Desired output
foo %>%
separate(var1, into = c("group1", "index1")) %>%
separate(var2, into = c("group2", "index2"))
#> group1 index1 group2 index2 id
#> 1 a 1 a 1 1
#> 2 a 2 a 2 2
#> 3 a 3 a 3 3
#> 4 a 4 a 4 4
#> 5 a 5 a 5 5
#> 6 a 6 a 6 6
#> 7 a 7 a 7 7
#> 8 a 8 a 8 8
#> 9 a 9 a 9 9
#> 10 a 10 a 10 10

This isn't very elegant, but it works (based on answers here: Tidy method to split multiple columns using tidyr::separate):
# all_of() makes it explicit that .x holds a column name as a string
grep("var[0-9]", names(foo), value = TRUE) %>%
  map_dfc(~ foo %>%
    select(all_of(.x)) %>%
    separate(all_of(.x),
             into = paste0(c("group", "index"), gsub("[^0-9]", "", .x)))
  ) %>%
  bind_cols(id = foo$id)
Returns:
group1 index1 group2 index2 id
1 a 1 a 1 1
2 a 2 a 2 2
3 a 3 a 3 3
4 a 4 a 4 4
5 a 5 a 5 5
6 a 6 a 6 6
7 a 7 a 7 7
8 a 8 a 8 8
9 a 9 a 9 9
10 a 10 a 10 10
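A slightly more compact variant of the same idea is sketched below. It is only a sketch: the names `cols` and `out` are illustrative and not from the posts above, and it assumes every matching column name ends in the number to use as a suffix. It folds `separate()` over the matching column names with `purrr::reduce()`, relying on the default separator of `separate()` (any run of non-alphanumeric characters), so the remaining columns such as `id` are carried along without a final `bind_cols()`.

```r
library(dplyr)
library(tidyr)
library(purrr)

foo <- data.frame(var1 = paste0("a_", 1:10), var2 = paste("a_", 1:10), id = 1:10)

# Columns to split, identified by the pattern from the question
cols <- grep("^var[0-9]", names(foo), value = TRUE)

# Apply separate() once per matching column, threading the data frame through
out <- reduce(cols, function(df, col) {
  suffix <- gsub("[^0-9]", "", col)  # "var1" -> "1"
  separate(df, all_of(col), into = paste0(c("group", "index"), suffix))
}, .init = foo)

names(out)
```

Because each call replaces the column in place, the new columns appear where the original ones were.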


Binning a discrete variable (preferably in dplyr)

I would like to "bin" a large discrete variable by combining two consecutive rows into one bin. I would also like to call the bin by the first row value.
As an example:
x<-data.frame(x=c(1,2,3,4,5,6,7,8,9,10,11,12),
y=c(1,1,3,3,5,5,7,7,9,9,11,11))
x
We may use gl to create the grouping bin
library(dplyr)
x %>%
mutate(grp = as.integer(gl(n(), 2, n())))
x y grp
1 1 1 1
2 2 1 1
3 3 3 2
4 4 3 2
5 5 5 3
6 6 5 3
7 7 7 4
8 8 7 4
9 9 9 5
10 10 9 5
11 11 11 6
12 12 11 6
Performing the steps exactly as you outlined them would look like this:
library(dplyr)
x %>%
# n() is the number of rows; this assumes an even number of rows
mutate(bins = rep(seq_len(n() / 2), each = 2)) %>%
group_by(bins) %>%
filter(row_number() == 1) %>%
ungroup()
However, this gives exactly the same result (without the bins column) in one line of code:
x[seq(1, nrow(x), by = 2), ]
Another way using seq and ceiling.
x$bin <- ceiling(seq(nrow(x))/2)
x
# x y bin
#1 1 1 1
#2 2 1 1
#3 3 3 2
#4 4 3 2
#5 5 5 3
#6 6 5 3
#7 7 7 4
#8 8 7 4
#9 9 9 5
#10 10 9 5
#11 11 11 6
#12 12 11 6
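The question also asks for each bin to be named after its first row value. A minimal sketch of that, assuming the rows are already in the order in which they should be paired (the names `dat`, `grp` and `bin` are illustrative):

```r
library(dplyr)

dat <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                  y = c(1, 1, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11))

binned <- dat %>%
  group_by(grp = ceiling(row_number() / 2)) %>%  # pair consecutive rows
  mutate(bin = first(x)) %>%                     # label the bin by its first x
  ungroup()

binned
```

For this example the resulting `bin` column coincides with `y`.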

Replicate rows of a data frame using purrr [duplicate]

I have the following data frame:
Id Value Freq
1 A 8 2
2 B 7 3
3 C 2 4
and I want to obtain a new data frame by replicating each Value according to Freq:
Id Value
1 A 8
2 A 8
3 B 7
4 B 7
5 B 7
6 C 2
7 C 2
8 C 2
9 C 2
I understand this can be done very easily with purrr (map_dfr looks like the most suitable function), but I cannot work out the most compact way to do it.
You can just use the indexing properties of data frames.
df <- data.frame(Id=c("A","B","C"),Value=c(8,7,2),Freq=c(2,3,4))
replicatedDataframe <- do.call("rbind",lapply(1:NROW(df), function(k) {
df[rep(k,df$Freq[k]),-3]
}))
This can be done more easily using the times argument of rep:
replicatedDataframe <- df[rep(1:NROW(df),times=df$Freq),-3]
Convert Freq to a list-column of index vectors, unnest it, then drop it.
library(dplyr)
library(purrr)
library(tidyr)
df %>%
mutate(Freq = map(Freq, seq_len)) %>%
unnest(Freq) %>%
select(-Freq)
#> # A tibble: 9 x 2
#> Id Value
#> <chr> <dbl>
#> 1 A 8
#> 2 A 8
#> 3 B 7
#> 4 B 7
#> 5 B 7
#> 6 C 2
#> 7 C 2
#> 8 C 2
#> 9 C 2
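Another option, not mentioned in the answers above, is `tidyr::uncount()`, which repeats each row according to a weights column and drops that column by default. A minimal sketch (the name `replicated` is illustrative):

```r
library(tidyr)

df <- data.frame(Id = c("A", "B", "C"), Value = c(8, 7, 2), Freq = c(2, 3, 4))

# Repeat each row Freq times; Freq itself is removed from the result
replicated <- uncount(df, Freq)
replicated
```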

R how to fill in NA with rules

data=data.frame(person=c(1,1,1,2,2,2,2,3,3,3,3),
t=c(3,NA,9,4,7,NA,13,3,NA,NA,12),
WANT=c(3,6,9,4,7,10,13,3,6,9,12))
I want to create a new variable 'WANT' that takes the previous value of t within each person and adds 3 to it; when there are several NA values in a row, it keeps adding 3 to the previously filled value. My attempt is:
library(dplyr)
data %>%
group_by(person) %>%
mutate(WANT_TRY = fill(t) + 3)
Here's one way -
data %>%
group_by(person) %>%
mutate(
# cs = cumsum(!is.na(t)), # creates index for reference value; uncomment if interested
w = case_when(
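# t[!is.na(t)][cumsum(!is.na(t))] is the last non-NA value seen so far;
# sequence(rle(is.na(t))$lengths) is the position within each run of NA
is.na(t) ~ t[!is.na(t)][cumsum(!is.na(t))] + 3*sequence(rle(is.na(t))$lengths),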
TRUE ~ t
)
) %>%
ungroup()
# A tibble: 11 x 4
person t WANT w
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
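Here is another way. We can do linear interpolation with the imputeTS package. Note that this reproduces WANT only because the known values on either side of each gap happen to make every interpolated step equal to 3.
library(dplyr)
library(imputeTS)
data2 <- data %>%
group_by(person) %>%
# interpolate the column with the missing values (t), not WANT
mutate(WANT2 = na_interpolation(t)) %>%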
ungroup()
data2
# # A tibble: 11 x 4
# person t WANT WANT2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3
# 2 1 NA 6 6
# 3 1 9 9 9
# 4 2 4 4 4
# 5 2 7 7 7
# 6 2 NA 10 10
# 7 2 13 13 13
# 8 3 3 3 3
# 9 3 NA 6 6
# 10 3 NA 9 9
# 11 3 12 12 12
This is harder than it seems because of the two consecutive NA values in the last group. If it weren't for that, then the following:
ifelse(is.na(data$t), c(0, data$t[-nrow(data)])+3, data$t)
...would give you what you want. The simplest way, which uses the same logic but doesn't look very clever (sorry!), would be:
.impute <- function(x) ifelse(is.na(x), c(0, x[-length(x)])+3, x)
.impute(.impute(data$t))
...which just cheats by doing it twice. Does that help?
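You can use functional programming from purrr and "NA-safe" addition from hablar. Note that the accumulating function only uses the running value, so each group is rebuilt from its first value in steps of 3; it matches WANT here because the observed values already follow that pattern: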
library(hablar)
library(dplyr)
library(purrr)
data %>%
group_by(person) %>%
mutate(WANT2 = accumulate(t, ~.x %plus_% 3))
Result
# A tibble: 11 x 4
# Groups: person [3]
person t WANT WANT2
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
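For comparison, the rule can also be written out directly in base R. This is only a sketch: the helper name `fill_plus3` and its `step` argument are hypothetical, and it assumes the first value in each group is not NA. It walks through each person's values and replaces every NA with the previous (possibly already filled) value plus 3.

```r
data <- data.frame(person = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                   t = c(3, NA, 9, 4, 7, NA, 13, 3, NA, NA, 12),
                   WANT = c(3, 6, 9, 4, 7, 10, 13, 3, 6, 9, 12))

# Hypothetical helper: fill each NA with the previous value plus `step`
fill_plus3 <- function(v, step = 3) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1] + step
  }
  v
}

# ave() applies the helper separately within each person
data$WANT_TRY <- ave(data$t, data$person, FUN = fill_plus3)
data
```

Unlike interpolation, this uses only the values before each gap, which is what the question describes.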

Create new column based on condition from other column per group using tidy evaluation

Similar to this question but I want to use tidy evaluation instead.
df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
speed = c(3,4,3,4,5,6,6,4,9))
> df
group date speed
1 1 1 3
2 1 2 4
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 6
8 3 8 4
9 3 9 9
The task is to create a new column (newValue) whose value, within each group, equals the value of the date column in the row where speed == 4. Example: group 1 has a newValue of 2 because date[speed==4] = 2.
group date speed newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
It worked without tidy evaluation
df %>%
group_by(group) %>%
mutate(newValue=date[speed==4L])
#> # A tibble: 9 x 4
#> # Groups: group [3]
#> group date speed newValue
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 2
#> 2 1 2 4 2
#> 3 1 3 3 2
#> 4 2 4 4 4
#> 5 2 5 5 4
#> 6 2 6 6 4
#> 7 3 7 6 8
#> 8 3 8 4 8
#> 9 3 9 9 8
But it fails with an error when I use tidy evaluation
my_fu <- function(df, filter_var){
filter_var <- sym(filter_var)
df <- df %>%
group_by(group) %>%
mutate(newValue=!!filter_var[speed==4L])
}
my_fu(df, "date")
#> Error in quos(..., .named = TRUE): object 'speed' not found
Thanks in advance.
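We can wrap the unquoted symbol in parentheses. Because !! has low operator precedence, !!filter_var[speed == 4L] unquotes the whole expression filter_var[speed == 4L] rather than filter_var alone, so the subsetting is evaluated outside the data frame and speed is not found.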
library(rlang)
library(dplyr)
my_fu <- function(df, filter_var){
filter_var <- sym(filter_var)
df %>%
group_by(group) %>%
mutate(newValue=(!!filter_var)[speed==4L])
}
my_fu(df, "date")
# A tibble: 9 x 4
# Groups: group [3]
# group date speed newValue
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 3 2
#2 1 2 4 2
#3 1 3 3 2
#4 2 4 4 4
#5 2 5 5 4
#6 2 6 6 4
#7 3 7 6 8
#8 3 8 4 8
#9 3 9 9 8
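When the column name arrives as a string, the `.data` pronoun is an alternative that avoids `sym()` and `!!` altogether. A minimal sketch, assuming exactly one row per group has speed == 4 (the function name `my_fu_data` is illustrative):

```r
library(dplyr)

df <- data.frame(group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 date  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 speed = c(3, 4, 3, 4, 5, 6, 6, 4, 9))

my_fu_data <- function(df, filter_var) {
  df %>%
    group_by(group) %>%
    # .data[[filter_var]] looks up the column named by the string
    mutate(newValue = .data[[filter_var]][speed == 4L]) %>%
    ungroup()
}

res <- my_fu_data(df, "date")
res
```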
You can also use sqldf: join df with a subquery that keeps only the rows where speed = 4:
library(sqldf)
df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
speed = c(3,4,3,4,5,6,6,4,9))
sqldf("SELECT df_origin.*, df4.`date` new_value FROM
df df_origin join (SELECT `group`, `date` FROM df WHERE speed = 4) df4
on (df_origin.`group` = df4.`group`)")

group cases by shared values in r [duplicate]

I have a dataset like this:
case x y
1 4 5
2 4 5
3 8 9
4 7 9
5 6 3
6 6 3
I would like to create a grouping variable.
This variable should have the same values when both x and y are the same.
I do not care what this value is, as long as it groups them: in my dataset, if x and y are the same for two cases, they are probably part of the same organization. I want to see which organizations there are.
So my preferred dataset would look like this:
case x y org
1 4 5 1
2 4 5 1
3 8 9 2
4 7 9 3
5 6 3 4
6 6 3 4
How would I have to program this in R?
Since, as you said, you do not care what this value is, you can just do the following:
dt$new=as.numeric(as.factor(paste(dt$x,dt$y)))
dt
case x y new
1 1 4 5 1
2 2 4 5 1
3 3 8 9 4
4 4 7 9 3
5 5 6 3 2
6 6 6 3 2
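A solution from dplyr using group_indices (in dplyr 1.0.0 and later, passing the grouping columns to group_indices like this is deprecated in favour of group_by followed by cur_group_id).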
library(dplyr)
dt2 <- dt %>%
mutate(org = group_indices(., x, y))
dt2
case x y org
1 1 4 5 1
2 2 4 5 1
3 3 8 9 4
4 4 7 9 3
5 5 6 3 2
6 6 6 3 2
If the group numbers need to be in order, we can use the rleid from the data.table package after we create the org column as follows.
library(dplyr)
library(data.table)
dt2 <- dt %>%
mutate(org = group_indices(., x, y)) %>%
mutate(org = rleid(org))
dt2
case x y org
1 1 4 5 1
2 2 4 5 1
3 3 8 9 2
4 4 7 9 3
5 5 6 3 4
6 6 6 3 4
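If the ids should follow order of first appearance even when identical (x, y) pairs are not on adjacent rows, a base R sketch with `match()` is one option (the `key` variable is illustrative). Unlike `rleid()`, which starts a new id for every run, this gives the same id to a pair wherever it reappears.

```r
dt <- read.table(text = " case x y
1 4 5
2 4 5
3 8 9
4 7 9
5 6 3
6 6 3",
header = TRUE)

# One key per row; ids are the position of each key among the unique keys
key <- paste(dt$x, dt$y)
dt$org <- match(key, unique(key))
dt
```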
Update
Here is how to sort the rows by a column in dplyr.
library(dplyr)
dt %>%
arrange(x)
case x y
1 1 4 5
2 2 4 5
3 5 6 3
4 6 6 3
5 4 7 9
6 3 8 9
We can also do this for more than one column, such as arrange(x, y), or use desc to reverse the order, like arrange(desc(x)).
DATA
dt <- read.table(text = " case x y
1 4 5
2 4 5
3 8 9
4 7 9
5 6 3
6 6 3",
header = TRUE)
