reshape or table data from long to wide [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 4 years ago.
I'm using R and I'm really at a loss right now. I have data like this:
df <- data.frame(
group = c(2, 2, 2, 1, 1, 0, 0, 1, 1, 0, 1, 0),
grade = c(2, 4, 3, 1, 3, 2, 5, 1, 1, 2, 3, 1)
)
I want to have it like this:
group0 group1 group2
1 1 3 0
2 2 0 1
3 0 2 1
4 0 0 1
5 1 0 0
6 0 0 0
I've been trying for hours using subset, tapply, table, for loops, and whatnot, but I can't seem to figure it out. I'd be really happy if someone could help me; I can't help thinking I'm missing something really easy and obvious.
How can I produce my target output?
Edit: Solved, see below. Thanks for finding a fitting title, by the way, you guys are the best!

You can do something like this with dplyr and tidyr:
library(dplyr)
library(tidyr)

df %>%
  count(group, grade) %>%
  mutate(group = paste0('group', group)) %>%
  spread(group, n, fill = 0)
# A tibble: 5 x 4
grade group0 group1 group2
* <int> <dbl> <dbl> <dbl>
1 1 1 3 0
2 2 2 0 1
3 3 0 2 1
4 4 0 0 1
5 5 1 0 0
If you don't want the additional 'grade' column, you can do:
df %>%
  count(group, grade) %>%
  mutate(group = paste0('group', group)) %>%
  spread(group, n, fill = 0) %>%
  select(-grade)
group0 group1 group2
* <dbl> <dbl> <dbl>
1 1 3 0
2 2 0 1
3 0 2 1
4 0 0 1
5 1 0 0
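Note that spread() has since been superseded in tidyr by pivot_wider(); a roughly equivalent sketch (assuming tidyr >= 1.0.0):
df %>%
  count(group, grade) %>%
  mutate(group = paste0('group', group)) %>%
  pivot_wider(names_from = group, values_from = n, values_fill = 0) %>%
  select(-grade)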

Alternatively, consider a base R approach using: by for grouping, aggregate for counts, setNames for group## column names, and Reduce for chain merge of dataframes:
# DATAFRAME LIST BY EACH GROUP
grp_list <- by(df, df$group, function(d) setNames(aggregate(. ~ grade, d, FUN = length),
                                                  c("grade", paste0("group", max(d$group)))))
# CHAIN MERGE (OUTER JOIN)
final_df <- Reduce(function(x,y) merge(x,y, by="grade", all=TRUE), grp_list)
# FILL NA WITH ZEROS
final_df[is.na(final_df)] <- 0
final_df
# grade group0 group1 group2
# 1 1 1 3 0
# 2 2 2 0 1
# 3 3 0 2 1
# 4 4 0 0 1
# 5 5 1 0 0
And to remove grade, use transform after chain merge or directly on final_df:
final_df <- transform(Reduce(function(x, y) merge(x, y, by = "grade", all = TRUE), grp_list),
                      grade = NULL)
final_df <- transform(final_df, grade = NULL)
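For this particular counting task, a plain contingency table is an even shorter base R route; a sketch that reproduces the same counts (tab is just an intermediate name used here):
tab <- table(df$grade, df$group)
final_df <- setNames(as.data.frame.matrix(tab), paste0("group", colnames(tab)))
final_df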

Related

How to run Excel-like formulas using dplyr?

In the reproducible R code below, I'd like to add a column "adjust" that results from a series of calculations that in Excel would use cumulative COUNTIFS, MAX, and MATCH formulas, as shown in the illustration. (To be more complete, the adjust column should really use the MATCH formula, since there could be more than one element in the list starting in row 15, but I think it's clear what I'm doing without it.) The yellow shading shows what the reproducible code generates, and the blue shading shows my series of Excel calculations that derive the desired values in the "adjust" column. Any suggestions for doing this, in dplyr if possible?
I am a long-time Excel user trying to migrate all of my work to R.
Reproducible code:
library(dplyr)
myData <- data.frame(
  Element = c("A","B","B","B","B","B","B","B"),
  Group = c(0,1,1,1,2,2,3,3)
)

myDataGroups <- myData %>%
  mutate(origOrder = row_number()) %>%
  group_by(Element) %>%
  mutate(ElementCnt = row_number()) %>%
  ungroup() %>%
  mutate(Group = factor(Group, unique(Group))) %>%
  arrange(Group) %>%
  mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
  as.data.frame()
myDataGroups
We may use rowid() to get the within-group sequence and update 'Group' with it, then create a binary indicator from 'Group', take the cumsum of 'excessOver2', and lag it:
library(dplyr)
library(data.table)
myDataGroups %>%
  mutate(Group = rowid(Element, Group),
         excessOver2 = +(Group > 2),
         adjust = lag(cumsum(excessOver2), default = 0))
-output
Element Group origOrder ElementCnt groupCt excessOver2 adjust
1 A 1 1 1 -1 0 0
2 B 1 2 1 0 0 0
3 B 2 3 2 0 0 0
4 B 3 4 3 0 1 0
5 B 1 5 4 1 0 1
6 B 2 6 5 1 0 1
7 B 1 7 6 2 0 1
8 B 2 8 7 2 0 1
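If you'd rather avoid the data.table dependency, the rowid() step can be reproduced with a grouped row_number(); a sketch (the GroupSeq name is mine, not from the answer above):
library(dplyr)
myDataGroups %>%
  group_by(Element, Group) %>%
  mutate(GroupSeq = row_number()) %>%   # same sequence that rowid(Element, Group) builds
  ungroup() %>%
  mutate(excessOver2 = +(GroupSeq > 2),
         adjust = lag(cumsum(excessOver2), default = 0))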
library(dplyr)

myData %>%
  group_by(Element, Group) %>%
  summarize(ElementCnt = row_number(), over2 = 1 * (ElementCnt > 2),
            .groups = "drop_last") %>%
  mutate(adjust = cumsum(lag(over2, default = 0))) %>%
  ungroup()
Result
# A tibble: 8 × 5
Element Group ElementCnt over2 adjust
<chr> <dbl> <int> <dbl> <dbl>
1 A 0 1 0 0
2 B 1 1 0 0
3 B 1 2 0 0
4 B 1 3 1 0
5 B 2 1 0 1
6 B 2 2 0 1
7 B 3 1 0 1
8 B 3 2 0 1

Create and fill new columns based on range information from two other columns

I have the following data:
df <- data.frame(group = c(1, 1, 1, 2, 2, 2),
start = c(2, 2, 2, 7, 7, 7),
stop = c(4, 7, 8, 7, 8, 9),
unstop = c(5, 7, 10, 7, 9, 10))
I now want to do the following:
Create new columns that have the names "week_1", "week_2" ... "week_10", "week_n".
Within each group for the FIRST ROW, I check which weeks the row was "active" in, i.e. it started in week 2, and stopped in week 4, so the row was active in week 2, 3, 4. I now want to populate the respective week columns with a 1.
Within each group for ALL OTHER EXCEPT THE LAST ROW, I do the same check, but now populate based on the unstop value of that row and the stop value of the next row.
Within each group for the LAST ROW, I do the same check, but now populate based on the range from unstop to 10 (which is the last week in my case).
I have a theoretical way. The problem is that my real-life data has 80k rows (consisting of 60k groups) and I'd need to create ~200 of such week-columns. Even filtering on 10 rows only takes ~30s for the code below.
So I'm looking for a more elegant/smarter/FASTER solution.
Expected outcome:
# A tibble: 6 × 14
# Groups: group [2]
group start stop unstop week_1 week_2 week_3 week_4 week_5 week_6 week_7 week_8 week_9 week_10
<dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 2 4 5 0 1 1 1 0 0 0 0 0 0
2 1 2 7 7 0 0 0 0 0 0 1 1 0 0
3 1 2 8 10 0 0 0 0 0 0 0 0 0 1
4 2 7 7 7 0 0 0 0 0 0 1 0 0 0
5 2 7 8 9 0 0 0 0 0 0 0 1 1 0
6 2 7 9 10 0 0 0 0 0 0 0 0 0 1
Below is how I would have approached it generally (of course without manually specifying each row_number). Apart from that, the code is also wrong and does not give the expected 0/1 values, and it throws many warnings. Finally, this code already runs for a few seconds just on this small test data; it would run for a month on my 80k-row/200-column data set.
add_weeks <- as_tibble(as.list(setNames(rep(0L, 10), paste0("week_", 1:10))))

df |>
  bind_cols(add_weeks) |>
  group_by(group) |>
  mutate(across(num_range("week_", 1:10),
                ~ if_else(row_number() == 1 & str_extract(cur_column(), "\\d+$") %in% start:stop,
                          1L, .)),
         across(num_range("week_", 1:10),
                ~ if_else(row_number() == 2 & str_extract(cur_column(), "\\d+$") %in% unstop:lead(stop),
                          1L, .)),
         across(num_range("week_", 1:10),
                ~ if_else(row_number() == 3 & str_extract(cur_column(), "\\d+$") %in% unstop:10,
                          1L, .)))
Now tested code, implementing the strategy described in the comment:
I'd make a matrix with named columns and assign to it with row and column indices. You can then either attach it as a matrix or convert it to a data frame.
Mat <- matrix(0, nrow(df), 10)   # 200 for the real case
maxwk <- 10
colnames(Mat) <- paste0("week", 1:maxwk)
# Add an extra column that marks the condition.
# If there are always exactly 3 rows per group, just rep(1:3, ngrps).
# Need to define a value for cond that identifies the three possibilities:
df$cond <- rep(1:3, length = nrow(df))   # assume all groups have exactly 3 rows
for (r in 1:nrow(df)) {
  # first row in group
  if (df$cond[r] == 1) {
    Idx <- paste0("week", df$start[r]:df$stop[r])    # start:stop
    Mat[r, Idx] <- 1; next }
  # second
  if (df$cond[r] == 2) {
    Idx <- paste0("week", df$stop[r]:df$unstop[r])   # stop:unstop
    Mat[r, Idx] <- 1; next }
  # third
  if (df$cond[r] == 3) {
    Idx <- paste0("week", df$unstop[r]:maxwk)        # unstop:max
    Mat[r, Idx] <- 1; next }
}
df
group start stop unstop cond
1 1 2 4 5 1
2 1 2 7 7 2
3 1 2 8 10 3
4 2 7 6 7 1
5 2 7 8 9 2
6 2 7 9 10 3
> Mat
week1 week2 week3 week4 week5 week6 week7 week8 week9 week10
[1,] 0 1 1 1 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 1 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 1
[4,] 0 0 0 0 0 1 1 0 0 0
[5,] 0 0 0 0 0 0 0 1 1 0
[6,] 0 0 0 0 0 0 0 0 0 1
You could cbind these.
There might be performance improvements possible. One could use switch(cond, ...) to dispatch to the correct logic rather than the if (cond == .) { .; next } approach. Either way this should be much faster than code that uses ifelse or if_else. If you want to see how that's implemented, endorse the general strategy with a checkmark and I'll spend the time to add the alternate code.
I ran a benchmark after setting up both methods for a 100-week maximum. (The warnings are from the code in the question.)
> perf_results <- microbenchmark(
+ first.method = do_first(df), sec.method=do_second(df), times=10)
There were 50 or more warnings (use warnings() to see the first 50)
> perf_results
Unit: microseconds
expr min lq mean median uq max neval
first.method 4385001.123 4416568.8 4581549.9624 4450691.5455 4615753.753 5350416.80 10
sec.method 146.432 149.6 181.6137 188.2125 193.307 243.47 10
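do_first() and do_second() are not shown in the answer; a sketch of how the second (matrix) method might have been wrapped for the benchmark, generalised to 100 week columns (the function name and exact structure are assumptions):
do_second <- function(df, maxwk = 100) {
  Mat <- matrix(0, nrow(df), maxwk,
                dimnames = list(NULL, paste0("week", 1:maxwk)))
  cond <- rep(1:3, length = nrow(df))   # assumes exactly 3 rows per group
  for (r in seq_len(nrow(df))) {
    if (cond[r] == 1) Idx <- df$start[r]:df$stop[r]
    if (cond[r] == 2) Idx <- df$stop[r]:df$unstop[r]
    if (cond[r] == 3) Idx <- df$unstop[r]:maxwk
    Mat[r, paste0("week", Idx)] <- 1
  }
  Mat
}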
I wanted to see if a switch method of selecting the proper algorithm for a row would improve performance. It did, and to a degree that surprised me. The switch function is analogous to the case statement in Pascal and many other languages. It has two forms whose behavior differs depending on whether the first argument, EXPR, is numeric or character; here the "dispatch" version is chosen because the "cond" column is numeric.
do_third <- function(df) {
  Mat <- matrix(0, nrow(df), 100)   # 200 for the real case
  maxwk <- 100
  colnames(Mat) <- paste0("week", 1:maxwk)
  df$cond <- rep(1:3, length = nrow(df))   # assume all groups have exactly 3 rows
  for (r in 1:nrow(df)) {
    switch(df[r, "cond"],
           { # first row in each group of 3
             Idx <- paste0("week", df$start[r]:df$stop[r])    # start:stop
             Mat[r, Idx] <- 1 },
           { # second row in group
             Idx <- paste0("week", df$stop[r]:df$unstop[r])   # stop:unstop
             Mat[r, Idx] <- 1 },
           { # third
             Idx <- paste0("week", df$unstop[r]:maxwk)        # unstop:max
             Mat[r, Idx] <- 1 })
  }
  Mat   # return the filled matrix
}
New microbenchmark:
perf_results
Unit: nanoseconds
expr min lq mean median uq max neval cld
first.method 4304901359 4351893534 4387626725.8 4372151785 4416247096 4543314742 10 b
sec.method 162803 173855 2588492.1 215309 216878 24081195 10 a
third.meth 34 53 610.6 877 940 963 10 a
FWIW, I'm posting my own solution. Apparently, adding 200 columns to a 60k-row data frame based on some conditions is extremely slow. So what I did instead is:
Add one chr column with the info about the weeks via str_c.
Create a smaller data set that just has the grouping var and this new info.
Then use separate_rows on this week_info to get a long format data set.
Then use pivot_wider and combine this info with the original data set.
Note that this approach works because of something I didn't mention in my initial post: I actually want to summarize the week info per group, so in the end I want one row per group. In the interest of keeping the question simple, I left that detail out.
Having said that, the solution from @IRTFM is still considerably faster, by a factor of 3.
df2 <- df |>
  group_by(group) |>
  mutate(lead_stop = lead(stop, default = 0),
         n_rows = n(),
         row_number = row_number()) |>
  ungroup() |>
  rowwise() |>
  mutate(split_weeks = case_when(n_rows == 1 & row_number == 1 ~ str_c(start:stop, collapse = ","),
                                 n_rows > 1 & row_number == 1 ~ str_c(c(start:stop, unstop:lead_stop), collapse = ","),
                                 row_number == n_rows ~ str_c(unstop:10, collapse = ","),
                                 TRUE ~ str_c(unstop:lead_stop, collapse = ",")))

df3 <- df2 |>
  group_by(group) |>
  summarize(split_weeks = unique(str_c(split_weeks, collapse = ","))) |>
  separate_rows(split_weeks, sep = ",", convert = TRUE) |>
  distinct() |>
  mutate(value = 1L) |>
  full_join(y = data.frame(split_weeks = 1:10)) |>
  pivot_wider(names_from = split_weeks,
              names_prefix = "week_",
              values_from = value,
              values_fill = 0L,
              names_expand = TRUE) |>
  filter(!is.na(group))

df4 <- df2 |>
  ungroup() |>
  select(-split_weeks, -n_rows) |>
  pivot_wider(names_from = row_number, values_from = -group) |>
  bind_cols(x = df3 |> select(-group), y = _)
library(tidyverse)
periods <- tibble(
group = c(1, 1, 1, 2, 2, 2),
start = c(2, 2, 2, 7, 7, 7),
stop = c(4, 7, 8, 7, 8, 9),
unstop = c(5, 7, 10, 7, 9, 10)
)
LAST <- 10
I think it makes sense to recode the intragroup start/stop/unstop logic to a
single start/stop for each row. Let’s call them rstart/rstop. Using your
rules, they can be created like this:
(periods <- periods %>%
group_by(group) %>%
transmute(
period = row_number(),
rstart = if_else(period == 1L, start, unstop),
rstop = if_else(period == 1L, stop, lead(stop, default = LAST))
) %>%
ungroup()
)
#> # A tibble: 6 x 4
#> group period rstart rstop
#> <dbl> <int> <dbl> <dbl>
#> 1 1 1 2 4
#> 2 1 2 7 8
#> 3 1 3 10 10
#> 4 2 1 7 7
#> 5 2 2 9 9
#> 6 2 3 10 10
Now, we can generate the active stretches by group_by -> summarise. Here we
also add an indicator column, active, to show that the given weeks are active:
(periods <- periods %>%
group_by(group, period) %>%
summarise(
weeks = rstart:rstop,
active = 1L,
.groups = "drop"
)
)
#> # A tibble: 9 x 4
#> group period weeks active
#> <dbl> <int> <int> <int>
#> 1 1 1 2 1
#> 2 1 1 3 1
#> 3 1 1 4 1
#> 4 1 2 7 1
#> 5 1 2 8 1
#> 6 1 3 10 1
#> 7 2 1 7 1
#> 8 2 2 9 1
#> 9 2 3 10 1
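As an aside, with dplyr >= 1.1.0 returning more than one row per group from summarise() is deprecated in favour of reframe(); an equivalent sketch of the step above:
periods %>%
  reframe(weeks = rstart:rstop, active = 1L, .by = c(group, period))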
To have the non-observed weeks present in the output after pivot_wider,
we can convert the weeks column to a factor and add the missing levels with
fct_expand. I've also added fct_inseq to make sure that the columns are
ordered as expected in the output. Once that is done, we can use pivot_wider
to get the wide format. Note the names_expand = TRUE argument, which gives us
the levels we added to the weeks column.
periods %>%
mutate(
weeks = as_factor(weeks) %>%
fct_expand(as.character(1:LAST)) %>%
fct_inseq()
) %>%
pivot_wider(
names_from = weeks,
names_expand = TRUE,
values_from = active,
values_fill = 0L,
names_prefix = "week"
)
#> # A tibble: 6 x 12
#> group period week1 week2 week3 week4 week5 week6 week7 week8 week9 week10
#> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 1 0 1 1 1 0 0 0 0 0 0
#> 2 1 2 0 0 0 0 0 0 1 1 0 0
#> 3 1 3 0 0 0 0 0 0 0 0 0 1
#> 4 2 1 0 0 0 0 0 0 1 0 0 0
#> 5 2 2 0 0 0 0 0 0 0 0 1 0
#> 6 2 3 0 0 0 0 0 0 0 0 0 1
Created on 2022-05-09 by the reprex package (v2.0.1)

how to find duplicated columns in row in R?

I have the data frame below and I want to find the duplicated values within each row. Please see the input and output example below. 0 is repeated 2 times in the first row, which is why column rep should be 0 (data_input[1,"rep"]=0); 2 is repeated 2 times in the second row, which is why column rep should be 2; there are no repeated values in the 3rd row, which is why column rep can be 4 (or any value other than 0, 1, 2); and 1 is repeated 3 times in the 4th row, which is why column rep should be 1.
data_input=data.frame(X1=c(0,1,2,1), X2=c(0,2,1,1),
X3=c(1,2,0,1))
data_output=data.frame(X1=c(0,1,2,1),
X2=c(0,2,1,1), X3=c(1,2,0,1), rep=c(0,2,4,1))
Here is an option with rowwise: create the rowwise attribute, then find the duplicated element in the row; if there is none, replace the NA with 4.
library(dplyr)
library(tidyr)
data_input %>%
rowwise %>%
mutate(rep = {tmp <- c_across(everything())
replace_na(tmp[duplicated(tmp)][1], 4)
}) %>%
ungroup
-output
# A tibble: 4 × 4
X1 X2 X3 rep
<dbl> <dbl> <dbl> <dbl>
1 0 0 1 0
2 1 2 2 2
3 2 1 0 4
4 1 1 1 1
The above solution doesn't consider the case where there are multiple distinct duplicated values. If there are, then either create a list column or paste the unique duplicated elements together into a single string:
data_input %>%
rowwise %>%
mutate(rep = {tmp <- c_across(everything())
tmp <- toString(sort(unique(tmp[duplicated(tmp)])))
replace(tmp, tmp == "", "4")
}) %>%
ungroup
-output
# A tibble: 4 × 4
X1 X2 X3 rep
<dbl> <dbl> <dbl> <chr>
1 0 0 1 0
2 1 2 2 2
3 2 1 0 4
4 1 1 1 1
Or using base R
data_input$rep <- apply(data_input, 1, FUN = \(x) x[anyDuplicated(x)][1])
data_input$rep[is.na(data_input$rep)] <- 4
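For clarity, anyDuplicated() returns the index of the first duplicated element (or 0 if there is none), which is why rows without repeats come out as NA before the replacement:
anyDuplicated(c(0, 0, 1))   # 2 -> x[2] is 0, the repeated value
anyDuplicated(c(2, 1, 0))   # 0 -> x[0] is numeric(0), so [1] gives NA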
Another solution, based on base R:
nCols <- ncol(data_input)
data_output <- cbind(
data_input, rep = apply(data_input, 1,
function(x) if (length(table(x)) != nCols) x[which.max(table(x))] else nCols+1))
data_output
#> X1 X2 X3 rep
#> 1 0 0 1 0
#> 2 1 2 2 2
#> 3 2 1 0 4
#> 4 1 1 1 1

Calculate sum of n previous rows

I have a quite big data frame and I'm trying to add a new variable which is the sum of the three previous rows on a running basis, grouped by ID. The first three rows per ID should be 0. Here's what it should look like:
ID Var1 VarNew
1 2 0
1 2 0
1 3 0
1 0 7
1 4 5
1 1 7
Here's an example dataframe
ID <- c(1, 1, 1, 1, 1, 1)
Var1 <- c(2, 2, 3, 0, 4, 1)
df <- data.frame(ID, Var1)
You can use any package that has a rolling-calculation function, with a window size of 3, and lag the result; for example, zoo::rollsumr:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(VarNew = lag(zoo::rollsumr(Var1, 3, fill = 0), default = 0)) %>%
ungroup
# ID Var1 VarNew
# <dbl> <dbl> <dbl>
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
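A data.table sketch of the same lag-the-rolling-sum idea, using frollsum() and shift() (assuming data.table is an option):
library(data.table)
setDT(df)[, VarNew := shift(frollsum(Var1, 3, fill = 0), fill = 0), by = ID][]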
You can use filter (from the stats package) in ave; note that the function should work on the per-group subset x so the grouping by ID is respected.
df$VarNew <- ave(df$Var1, df$ID, FUN = function(x) c(0, 0, 0,
                 stats::filter(head(x, -1), c(1, 1, 1), sides = 1)[-1:-2]))
df
# ID Var1 VarNew
#1 1 2 0
#2 1 2 0
#3 1 3 0
#4 1 0 7
#5 1 4 5
#6 1 1 7
or using cumsum in combination with head and tail.
df$VarNew <- ave(df$Var1, df$ID, FUN = function(x) {y <- c(0, cumsum(x))   # cumulative sums with a leading 0
                 c(0, 0, 0, head(tail(y, -3) - head(y, -3), -1))})          # lagged 3-row sums
The runner library also helps:
library(runner)
df %>% mutate(var_new = sum_run(Var1, k =3, na_pad = T, lag = 1))
ID Var1 var_new
1 1 2 NA
2 1 2 NA
3 1 3 NA
4 1 0 7
5 1 4 5
6 1 1 7
The NAs can easily be changed to 0 if so desired.

from column with factors to two different column with 0, 1

I have a column with group1/group2 values in a data frame.
group <- c( "group1", "group1", "group2", "group1", "group2" )
value<- c(1:5)
dat <- data.frame(value, group)
I want to make it like this:
group1 <- c(1, 1, 0, 1, 0)
group2 <- c(0, 0, 1, 0, 1)
dat<- data.frame(value, group1, group2)
I tried this, but then I have to remove the group column later:
dat<- dat %>%
mutate( group1 = ifelse(data1$group =="group1", 1, 0 ),
group2 = ifelse(data1$group =="group2", 1, 0 ) )
Is there any other nice way to do this job?
Thanks in advance for your help.
You could create a dummy column and get data in wide format.
library(dplyr)
library(tidyr)
dat %>%
mutate(n = 1) %>%
pivot_wider(names_from = group, values_from = n, values_fill = 0) -> result
# value group1 group2
# <int> <dbl> <dbl>
#1 1 1 0
#2 2 1 0
#3 3 0 1
#4 4 1 0
#5 5 0 1
Or in base R use table :
table(dat)
# group
#value group1 group2
# 1 1 0
# 2 1 0
# 3 0 1
# 4 1 0
# 5 0 1
A base R option using reshape
replace(
out <- reshape(
cbind(dat, q = 1),
direction = "wide",
idvar = "value",
timevar = "group"
),
is.na(out),
0
)
giving
value q.group1 q.group2
1 1 1 0
2 2 1 0
3 3 0 1
4 4 1 0
5 5 0 1
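Another base R idiom for 0/1 indicator columns is model.matrix(); a sketch (the resulting names come out as groupgroup1/groupgroup2 unless renamed):
cbind(dat["value"], model.matrix(~ group - 1, dat))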
We can use data.table
library(data.table)
dcast(setDT(dat), value ~ group, length)
# value group1 group2
#1: 1 1 0
#2: 2 1 0
#3: 3 0 1
#4: 4 1 0
#5: 5 0 1
Or this can be done with pivot_wider in a single step by specifying values_fn
library(dplyr)
library(tidyr)
dat %>%
pivot_wider(names_from = group, values_from = group,
values_fn = length, values_fill = 0)
# A tibble: 5 x 3
# value group1 group2
# <int> <int> <int>
#1 1 1 0
#2 2 1 0
#3 3 0 1
#4 4 1 0
#5 5 0 1
Insert %>% select(!"group") at the end of the dplyr pipe. Also remove data1$ from it - you probably meant dat, and even that isn't needed inside mutate():
dat %>%
mutate(group1 = ifelse(group =="group1", 1, 0 ),
group2 = ifelse(group =="group2", 1, 0 )) %>%
select(!"group")
value group1 group2
1 1 1 0
2 2 1 0
3 3 0 1
4 4 1 0
5 5 0 1
