I have the following dataset:
library(data.table)
x <- data.table(a = c(1:3, 1), b = c('12 13', '14 15', '16 17', '18 19'))
> x
a b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19
and I would like to get a new dataset which has
> x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
so that column c contains the first element of each of column b's values. I tried to do
x[,c:=unlist(strsplit(b, " "))[[1]][1]]
but it doesn't work. Is there a way to apply such a thing in data.table?
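For what it's worth, the attempt fails because unlist() flattens every split into one long vector, so [[1]][1] always picks the very first token, which then gets recycled across all rows. A minimal illustration:

unlist(strsplit(x$b, " "))
# "12" "13" "14" "15" "16" "17" "18" "19"
unlist(strsplit(x$b, " "))[[1]][1]
# "12" -- this single value would be assigned to every row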
You can use stringr::str_split_i to take the first element of each split string:
library(stringr)
x[, c := str_split_i(b, " ", 1)]
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
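As an aside, str_split_i() (added in stringr 1.5.0) also accepts negative indices counting from the end, so the last piece would be, for example:

x[, last := str_split_i(b, " ", -1)]  # 'last' is just an illustrative column name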
Use tstrsplit from data.table:
x[, c := tstrsplit(b, " ")[1]]
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
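tstrsplit() also has a keep argument, so an arguably more direct equivalent should be:

x[, c := tstrsplit(b, " ", keep = 1L)]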
Since the first token here happens to be numeric, readr::parse_number() also works (note it returns a number rather than a character string):
x[, c := readr::parse_number(b)]
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
We can use sapply() along with strsplit() and retain the first element from each vector in the list.
x$c <- sapply(strsplit(x$b, " "), `[[`, 1)
x
a b c
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18
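If you want the base approach to be type-stable, vapply() makes the expected return type explicit; a sketch of the same idea:

x$c <- vapply(strsplit(x$b, " "), `[[`, character(1), 1)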
I need to apply a function that returns a data.frame across a (grouped) tibble.
Some data:
df <- data.frame(start=1:10,end=21:30,g=sample(LETTERS[1:2],10,replace=TRUE))
ff <- function(start, end, ...) {
  out <- data.frame(T1 = c(start, rev(start)), T2 = c(end, rev(end)))
  return(out)
}
and then I would like to do something like this
library(dplyr)
library(purrr)
df %>%
  group_by(g) %>%
  pmap_dfr(ff, .keep = TRUE)
to produce a tibble / data.frame like this:
g start end
1 A 1 21
2 A 3 23
3 A 4 24
4 A 5 25
5 A 6 26
6 A 7 27
7 A 8 28
8 A 8 28
9 A 7 27
10 A 6 26
11 A 5 25
12 A 4 24
13 A 3 23
14 A 1 21
15 B 2 22
16 B 9 29
17 B 10 30
18 B 10 30
19 B 9 29
20 B 2 22
So the output is concatenated together row-wise, and the group to which each block belongs is marked somehow.
The functions I would like to apply need to get arguments from the other columns in the original data.frame (df in the example code), so I thought pmap_dfr would be the correct function to use. But I am just confused by the output, so I must be using that function wrong.
I would appreciate all the help I could get on this.
One option is to use dplyr::group_split() and purrr::map_dfr().
How this works: group_split() will divide your data.frame df into a list of data.frames based on the grouping variables you supply (e.g., g). Next, map_dfr() can be used to apply a function to each element of that list. Because your custom function ff() returns a data.frame without your grouping variable g, you'll want to add that information back to ff()'s output; this can be accomplished with mutate() as in the example below:
library(dplyr)
library(purrr)
# set seed so that example is reproducible
set.seed(1)
# your example data and function
df <- data.frame(start=1:10,end=21:30,g=sample(LETTERS[1:2],10,replace=TRUE))
ff <- function(start, end, ...) {
  out <- data.frame(T1 = c(start, rev(start)), T2 = c(end, rev(end)))
  return(out)
}
# use group_split & map_dfr
df %>%
  # divide df into a list of data.frames based on supplied grouping variables
  group_split(g) %>%
  # for each element in the list, apply this function
  map_dfr(function(df.x) {
    with(df.x,
         # get the data.frame your function returns
         ff(start, end) %>%
           # add your grouping variables back in (stripped by ff)
           mutate(g = g[1]))
  })
# a short-hand version of the above can be written as:
df %>%
  group_split(g) %>%
  map_dfr(~ ff(.x$start, .x$end) %>% mutate(g = .x$g[1]))
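For completeness, dplyr::group_modify() (available in dplyr 0.8.1+) keeps the grouping column automatically, so the same idea can be sketched without re-adding g by hand:

library(dplyr)
df %>%
  group_by(g) %>%
  group_modify(~ ff(.x$start, .x$end))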
Using data.table and lapply, the expected result can be achieved.
df <- data.frame(start=1:10,end=21:30,g=sample(LETTERS[1:2],10,replace=TRUE))
start end g
1: 1 21 B
2: 2 22 A
3: 3 23 A
4: 4 24 A
5: 5 25 A
6: 6 26 B
7: 7 27 A
8: 8 28 B
9: 9 29 B
10: 10 30 B
library(data.table)
setDT(df)
ff <- function(x) {
  x <- c(x, rev(x))
  return(x)
}
df[, lapply(.SD, ff), .SDcols = c('start', 'end'), by = .(g)]
g start end
1: B 1 21
2: B 6 26
3: B 8 28
4: B 9 29
5: B 10 30
6: B 10 30
7: B 9 29
8: B 8 28
9: B 6 26
10: B 1 21
11: A 2 22
12: A 3 23
13: A 4 24
14: A 5 25
15: A 7 27
16: A 7 27
17: A 5 25
18: A 4 24
19: A 3 23
20: A 2 22
It is possible to use dplyr::across() like this:
library(tidyverse)
group_by(df, g) %>%
  summarise(across(all_of(c("start", "end"))) %>%
              {
                ff(.[[1]], .[[2]])
              })
#> `summarise()` has grouped output by 'g'. You can override using the `.groups` argument.
#> # A tibble: 20 × 3
#> # Groups: g [2]
#> g T1 T2
#> <chr> <int> <int>
#> 1 A 1 21
#> 2 A 3 23
#> 3 A 4 24
#> 4 A 9 29
#> 5 A 10 30
#> 6 A 10 30
#> 7 A 9 29
#> 8 A 4 24
#> 9 A 3 23
#> 10 A 1 21
#> 11 B 2 22
#> 12 B 5 25
#> 13 B 6 26
#> 14 B 7 27
#> 15 B 8 28
#> 16 B 8 28
#> 17 B 7 27
#> 18 B 6 26
#> 19 B 5 25
#> 20 B 2 22
Created on 2021-12-21 by the reprex package (v2.0.1)
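As a further note, on dplyr >= 1.1.0 the same computation can be sketched with reframe(), which allows multi-row results per group and unpacks the data frame that ff() returns:

library(dplyr)
df %>%
  group_by(g) %>%
  reframe(ff(start, end))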
In this sample data:
id<-c(2,2,2,2,2,3,3,3,3,3,3,4,4,4,4)
time<-c(3,5,7,8,9,2,8,10,12,14,18,4,6,7,9)
status<-c('mar','mar','div','c','mar','mar','div','mar','mar','c','div','mar','mar','c','mar')
myd<-data.frame(id,time,status)
id time status
1 2 3 mar
2 2 5 mar
3 2 7 div
4 2 8 c
5 2 9 mar
6 3 2 mar
7 3 8 div
8 3 10 mar
9 3 12 mar
10 3 14 c
11 3 18 div
12 4 4 mar
13 4 6 mar
14 4 7 c
15 4 9 mar
I need to know when the person married. If there are two consecutive 'mar' rows without 'div' anywhere in between, the person never divorced, so it's the same marriage and we don't need the timing of the repeated row. The same goes for a mar, c, mar sequence: since no div occurs, the marriages before and after the child are the same marriage, so the second 'mar' row can be deleted. I suspect I need min(time[status=='mar']), but this would be incorrect if a person has a mar, mar, div, mar, div, mar sequence (only the 2nd 'mar' needs deletion, not all the ones after the first).
So the new data should look something like
id time status
2 2 5 mar
3 2 7 div
4 2 8 c
5 2 9 mar
6 3 2 mar
7 3 8 div
8 3 10 mar
10 3 14 c
11 3 18 div
13 4 6 mar
14 4 7 c
This was my approach, which only worked for one row
myd2 <- myd %>%
  group_by(id) %>%
  mutate(dum1 = ifelse(status == 'mar', min(time[status == 'mar']), NA),
         dum2 = cumsum(status == 'div'),
         flag = ifelse(time > dum1 & dum2 == 0, 1, 0))
If I get rid of dum2 == 0, it deletes too many rows.
Using a quick helper function,
func <- function(x, vals = c("mar", "div")) {
  out <- rep(TRUE, length(x))
  last <- x[1]
  for (ind in seq_along(x)[-1]) {
    out[ind] <- x[ind] != last || !x[ind] %in% vals
    if (out[ind] && x[ind] %in% vals) last <- x[ind]
  }
  out
}
We can do
library(data.table)
as.data.table(myd)[, .SD[func(status),], by = .(id)]
# id time status
# <num> <num> <char>
# 1: 2 3 mar
# 2: 2 7 div
# 3: 2 8 c
# 4: 2 9 mar
# 5: 3 2 mar
# 6: 3 8 div
# 7: 3 10 mar
# 8: 3 14 c
# 9: 3 18 div
# 10: 4 4 mar
# 11: 4 7 c
If you want this in dplyr, then
library(dplyr)
myd %>%
  group_by(id) %>%
  filter(func(status))
My approach:
library(dplyr)
myd %>%
  group_by(id) %>%
  arrange(time) %>%
  filter(status != lag(status) | is.na(lag(status))) %>%
  ungroup() %>%
  arrange(id)
Returns:
# A tibble: 12 x 3
id time status
<dbl> <dbl> <chr>
1 2 3 mar
2 2 7 div
3 2 8 c
4 2 9 mar
5 3 2 mar
6 3 8 div
7 3 10 mar
8 3 14 c
9 3 18 div
10 4 4 mar
11 4 7 c
12 4 9 mar
I would delete rows in which the status is unchanged by creating a lag_status variable in grouped data:
myd %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(lag_status = lag(status)) %>%
  ungroup() %>%
  filter(is.na(lag_status) | status != lag_status) %>%
  select(-lag_status)
# A tibble: 12 x 3
id time status
<dbl> <dbl> <fct>
1 2 3 mar
2 2 7 div
3 2 8 c
4 2 9 mar
5 3 2 mar
6 3 8 div
7 3 10 mar
8 3 14 c
9 3 18 div
10 4 4 mar
11 4 7 c
12 4 9 mar
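The same consecutive-duplicate filter (matching the lag()-based answers above) can also be sketched with data.table::rleid(), assuming rows are already ordered by time within id:

library(data.table)
setDT(myd)[, head(.SD, 1), by = .(id, grp = rleid(status))][, grp := NULL][]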
I read two different questions in your post.
When the person first married
How to make a list that removes redundant status information
It seems like you have a solution for #1, but you actually want #2.
I read #2 as a desire to filter out rows where the id and status are the same as the previous row. That would look like:
myd %>%
  filter(is.na(lag(id)) | !(id == lag(id) & status == lag(status)))
I have the following issue in R. In short, I want to create multiple new columns in a data frame based on calculations of different column pairs in the data frame.
The data looks as follows:
df <- data.frame(a1 = c(1:5),
                 b1 = c(4:8),
                 c1 = c(10:14),
                 a2 = c(9:13),
                 b2 = c(3:7),
                 c2 = c(15:19))
df
a1 b1 c1 a2 b2 c2
1 4 10 9 3 15
2 5 11 10 4 16
3 6 12 11 5 17
4 7 13 12 6 18
5 8 14 13 7 19
The output is supposed to look like the following:
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 4 10 9 3 15 10 7 25
2 5 11 10 4 16 12 9 27
3 6 12 11 5 17 14 11 29
4 7 13 12 6 18 16 13 31
5 8 14 13 7 19 18 15 33
I can achieve this using dplyr doing some manual work in the following way:
df %>%
  rowwise %>%
  mutate(sum_a = sum(a1, a2),
         sum_b = sum(b1, b2),
         sum_c = sum(c1, c2)) %>%
  as.data.frame()
So what is being done is: take the columns with the letter "a" in their names, calculate the sum rowwise, and create a new column with the sum named sum_[letter]. Repeat for columns with different letters.
This works. However, if I have a large data set with, say, 300 different column pairs, the manual input would be significant, since I would have to write 300 mutate calls.
I recently stumbled upon the R package "purrr" and my guess is that this would solve my problem of doing what I want in a more automated way.
In particular, I think I should be able to use purrr::map2, to which I pass two lists of column names.
list1 = all columns with the number 1 in their names
list2 = all columns with the number 2 in their names
Then I could calculate the sum of each matching list entry, in the form of:
map2(list1, list2, ~mutate(sum))
However, I am not able to figure out how to best approach this problem using purrr. I am rather new to using purrr, so I would really appreciate any help on this issue.
Here is one option with purrr. We get the unique prefixes of the dataset's names ('nm1'), use map() (from purrr) to loop through the unique names, select the columns that match each prefix, add the selected columns together using reduce(), and then bind the result (bind_cols()) to the original dataset:
library(tidyverse)
nm1 <- names(df) %>%
  substr(1, 1) %>%
  unique()

nm1 %>%
  map(~ df %>%
        select(matches(.x)) %>%
        reduce(`+`)) %>%
  set_names(paste0("sum_", nm1)) %>%
  bind_cols(df, .)
# a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#1 1 4 10 9 3 15 10 7 25
#2 2 5 11 10 4 16 12 9 27
#3 3 6 12 11 5 17 14 11 29
#4 4 7 13 12 6 18 16 13 31
#5 5 8 14 13 7 19 18 15 33
df %>%
  mutate(sum_a = pmap_dbl(select(., starts_with("a")), sum),
         sum_b = pmap_dbl(select(., starts_with("b")), sum),
         sum_c = pmap_dbl(select(., starts_with("c")), sum))
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 3 6 12 11 5 17 14 11 29
4 4 7 13 12 6 18 16 13 31
5 5 8 14 13 7 19 18 15 33
EDIT:
In the case there are many columns, and you wish to apply it programmatically:
row_sums <- function(x) {
  transmute(df, !!paste0("sum_", quo_name(x)) := pmap_dbl(select(df, starts_with(x)), sum))
}
newdf <- map_dfc(letters[1:3], row_sums)
newdf
sum_a sum_b sum_c
1 10 7 25
2 12 9 27
3 14 11 29
4 16 13 31
5 18 15 33
And if needed you can tack on the original variables with:
bind_cols(df, newdf)
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 3 6 12 11 5 17 14 11 29
4 4 7 13 12 6 18 16 13 31
5 5 8 14 13 7 19 18 15 33
In case you like to consider a base R approach, here's how you could do it:
cbind(df, lapply(split.default(df, substr(names(df), 1, 1)), rowSums))
# a1 b1 c1 a2 b2 c2 a b c
#1 1 4 10 9 3 15 10 7 25
#2 2 5 11 10 4 16 12 9 27
#3 3 6 12 11 5 17 14 11 29
#4 4 7 13 12 6 18 16 13 31
#5 5 8 14 13 7 19 18 15 33
It splits the data column-wise into a list, based on the first letter of each column name (either a, b, or c).
If you have a large number of columns and need to differentiate between all characters except the numbers at the end of each column name, you could modify the approach to:
cbind(df, lapply(split.default(df, sub("\\d+$", "", names(df))), rowSums))
In base R, all vectorized:
nms <- names(df)
df[paste0("sum_", unique(gsub("[1-9]", "", nms)))] <-
  df[endsWith(nms, "1")] + df[endsWith(nms, "2")]
# a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
# 1 1 4 10 9 3 15 10 7 25
# 2 2 5 11 10 4 16 12 9 27
# 3 3 6 12 11 5 17 14 11 29
# 4 4 7 13 12 6 18 16 13 31
# 5 5 8 14 13 7 19 18 15 33
Here is another tidyverse approach that uses only the pipe and doesn't require creating new objects.
library(tidyverse)
df %>%
  bind_cols(
    map_dfc(.x = list("a", "b", "c"),
            .f = ~ .y %>%
              rowwise() %>%
              transmute(!!str_c("sum_", .x) := sum(c_across(starts_with(.x)))),
            .y = .)
  )
#> a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#> 1 1 4 10 9 3 15 10 7 25
#> 2 2 5 11 10 4 16 12 9 27
#> 3 3 6 12 11 5 17 14 11 29
#> 4 4 7 13 12 6 18 16 13 31
#> 5 5 8 14 13 7 19 18 15 33
Explanation
The dataframe is piped into bind_cols() which binds the original columns with the newly created columns. The new columns are created with purrr::map_dfc() which takes a list of variable prefixes (.x) and the transforming function (.f). Additionally, the piped data (.) is assigned as another argument (.y). Since rowwise operations are required, rowwise() and c_across() are used in each iteration over the prefixes. transmute is used so that the original variables are not duplicated. In order to dynamically create variable names, the bang-bang operator (!!) along with := are used inside transmute.
Note
It would be shorter to use rowSums() instead of rowwise() and c_across(), but other functions can be implemented more easily using this approach.
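A sketch of that shorter rowSums() variant (the prefix vector here is an assumption matching the example data):

library(tidyverse)
prefixes <- set_names(c("a", "b", "c"), paste0("sum_", c("a", "b", "c")))
df %>%
  bind_cols(map_dfc(prefixes, ~ rowSums(select(df, starts_with(.x)))))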
For a hackish tidy solution, check this out:
library(tibble)   # for rownames_to_column()
library(tidyr)
library(dplyr)
df %>%
  rownames_to_column(var = 'row') %>%
  gather(a1:c2, key = 'key', value = 'value') %>%
  extract(key, into = c('col.base', 'col.index'), regex = '([a-zA-Z]+)([0-9]+)') %>%
  group_by(row, col.base) %>%
  summarize(.sum = sum(value)) %>%
  spread(col.base, .sum) %>%
  bind_cols(df, .) %>%
  select(-row)
Basically, I collect all pairs of columns with their values across all rows, separate the column name into two parts, calculate the row sums for columns with the same letter, and cast it back to wide form.
Another solution splits df by the trailing numbers, then uses Reduce to calculate the sums:
library(tidyverse)
df %>%
  split.default(., substr(names(.), 2, 3)) %>%
  Reduce('+', .) %>%
  set_names(paste0("sum_", substr(names(.), 1, 1))) %>%
  cbind(df, .)
#> a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#> 1 1 4 10 9 3 15 10 7 25
#> 2 2 5 11 10 4 16 12 9 27
#> 3 3 6 12 11 5 17 14 11 29
#> 4 4 7 13 12 6 18 16 13 31
#> 5 5 8 14 13 7 19 18 15 33
Created on 2018-04-13 by the reprex package (v0.2.0).
1) dplyr/tidyr Convert to long form, summarize and convert back to wide form:
library(dplyr)
library(tidyr)
DF %>%
  mutate(Row = 1:n()) %>%
  gather(colname, value, -Row) %>%
  group_by(g = gsub("\\d", "", colname), Row) %>%
  summarize(sum = sum(value)) %>%
  ungroup %>%
  mutate(g = paste("sum", g, sep = "_")) %>%
  spread(g, sum) %>%
  arrange(Row) %>%
  cbind(DF, .) %>%
  select(-Row)
giving:
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 4 7 13 12 6 18 16 13 31
4 5 8 14 13 7 19 18 15 33
2) base using matrix multiplication
nms is a vector of the column names with the digits removed and prefaced with sum_. u is a vector of its unique elements. Form a logical matrix from these using outer(); multiplying DF by that matrix gives the sums (the logicals are converted to 0-1 when that is done). Finally, bind the result to the input.
nms <- gsub("(\\D+)\\d", "sum_\\1", names(DF))
u <- unique(nms)
sums <- as.matrix(DF) %*% outer(nms, setNames(u, u), "==")
cbind(DF, sums)
giving:
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 4 7 13 12 6 18 16 13 31
4 5 8 14 13 7 19 18 15 33
3) base with tapply
Using nms from (2) apply tapply to each row:
cbind(DF, t(apply(DF, 1, tapply, nms, sum)))
giving:
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 4 7 13 12 6 18 16 13 31
4 5 8 14 13 7 19 18 15 33
You may wish to replace nms with factor(nms, levels = unique(nms)) in the above expression if the names are not in ascending order.
A slightly different approach using base R:
cbind(df, lapply(unique(gsub("\\d+", "", colnames(df))), function(li) {
  setNames(data.frame(V = apply(df[grep(li, colnames(df), value = TRUE)], MARGIN = 1, FUN = sum)), paste0("sum_", li))
}))
# a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#1 1 4 10 9 3 15 10 7 25
#2 2 5 11 10 4 16 12 9 27
#3 3 6 12 11 5 17 14 11 29
#4 4 7 13 12 6 18 16 13 31
#5 5 8 14 13 7 19 18 15 33
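Finally, on recent dplyr (>= 1.1.0), pick() makes a vectorized rowSums() version straightforward inside mutate(); a sketch under that assumption:

library(dplyr)
df %>%
  mutate(sum_a = rowSums(pick(starts_with("a"))),
         sum_b = rowSums(pick(starts_with("b"))),
         sum_c = rowSums(pick(starts_with("c"))))

This still writes one line per prefix, but each line is vectorized over all rows rather than computed rowwise.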
Say I have the following data frame:
ID<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3, 4,4,4,4,4,5,5,5,5,5)
Score<- sample(1:20, 25, replace=TRUE)
days<-rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times=5)
t<-cbind(ID, Score, days)
I would like to reshape it so that the new columns are ID and the actual weekday names (6 columns in total), with the Score values distributed according to their ID and day name: one row per ID, one column per weekday.
I found that the reshape package might do this. I tried melt and cast, but they did not produce the result I wanted; instead I got something like in this post: Melt data for one column
A base R solution that uses the built-in reshape command.
set.seed(12345)
t <- data.frame(id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
                score = sample(x = 1:20, size = 25, replace = TRUE),
                days = rep(x = c("Mon","Tue","Wed","Thu","Fri"), times = 5))
t.wide <- reshape(data = t,
                  v.names = "score",
                  timevar = "days",
                  idvar = "id",
                  direction = "wide")
names(t.wide) <- gsub(pattern = "score.", replacement = "", x = names(t.wide), fixed = TRUE)
t.wide
id Mon Tue Wed Thu Fri
1 1 15 18 16 18 10
6 2 4 7 11 15 20
11 3 1 4 15 1 8
16 4 10 8 9 4 20
21 5 10 7 20 15 13
You can use reshape2 to do this, but you need a data.frame to do that. Using cbind produces a matrix. (And converts all your numerical variables to characters in this case, as matrices can only hold one data type).
I've changed your code to produce a dataframe, which is already in long format (one row per observation).
set.seed(123)
ID<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3, 4,4,4,4,4,5,5,5,5,5)
Score<- sample(1:20, 25, replace=TRUE)
days<-rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times=5)
dat<-data.frame(ID, Score, days)
Changing it to wide using reshape2 is then quite straightforward:
library(reshape2)
res <- dcast(ID ~ days, value.var = "Score", data = dat)
> res
ID Fri Mon Thu Tue Wed
1 1 16 3 2 12 6
2 2 19 13 12 7 19
3 3 19 19 17 8 15
4 4 15 3 8 1 20
5 5 3 11 18 8 15
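Note that dcast() orders the new columns by factor levels (alphabetically here); to get weekday order, you can set the levels first, e.g.:

dat$days <- factor(dat$days, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))
res <- dcast(ID ~ days, value.var = "Score", data = dat)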
You could also use unstack if your data are complete (same number of days per id).
Here's an example (using the data from TARehman's answer):
unstack(t, score ~ days)
# Fri Mon Thu Tue Wed
# 1 10 15 18 18 16
# 2 20 4 15 7 11
# 3 8 1 1 4 15
# 4 20 10 4 8 9
# 5 13 10 15 7 20
Here's the clean-up for the column ordering, and for adding in the ID column:
cbind(ID = unique(t$id), unstack(t, score ~ days)[c("Mon", "Tue", "Wed", "Thu", "Fri")])
## ID Mon Tue Wed Thu Fri
## 1 1 15 18 16 18 10
## 2 2 4 7 11 15 20
## 3 3 1 4 15 1 8
## 4 4 10 8 9 4 20
## 5 5 10 7 20 15 13
Rather than reshape I'd move to the newer tidyr package and also make use of dplyr like so:
library(dplyr)
library(tidyr)
tdf <- as.data.frame(t) %>%
  mutate(Score = as.numeric(Score)) %>%
  spread(days, Score, fill = NA)
glimpse(tdf)
HTH
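As a side note, spread() has since been superseded; on tidyr >= 1.0.0 the equivalent would be pivot_wider():

tdf <- as.data.frame(t) %>%
  mutate(Score = as.numeric(Score)) %>%
  pivot_wider(names_from = days, values_from = Score)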
Just another option using splitstackshape
library(splitstackshape)
data = data.frame(t)
out = setnames(cSplit(setDT(data)[, .(x = toString(Score)), by = ID],
'x', ','), c('ID', unique(days)))
#> out
# ID Mon Tue Wed Thu Fri
#1: 1 8 14 11 5 10
#2: 2 16 1 4 14 8
#3: 3 8 18 19 13 3
#4: 4 16 9 19 16 6
#5: 5 7 2 1 2 13
With the dplyr & tidyr packages, use spread to achieve the following:
library(dplyr)
library(tidyr)
t <- tbl_df(as.data.frame(t))
t %>% spread(days, Score)
and you get the following output:
ID Fri Mon Thu Tue Wed
(fctr) (fctr) (fctr) (fctr) (fctr) (fctr)
1 1 10 10 18 17 10
2 2 18 11 14 3 16
3 3 11 13 9 15 17
4 4 13 13 16 17 11
5 5 7 14 9 15 20