Error: incompatible size when mutating in dplyr - r

I am having trouble with the mutate function in dplyr. The error says:
Error: incompatible size (0), expecting 5 (the group size) or 1
There are some previous posts and I tried some of the solutions but no luck for my case.
group-factorial-data-with-multiple-factors-error-incompatible-size-0-expe
r-dplyr-using-mutate-with-na-omit-causes-error-incompatible-size-d
grouped-operations-that-result-in-length-not-equal-to-1-or-length-of-group-in-dp
Here is what I tried,
ff <- c(seq(0,0.2,0.1),seq(0,-0.2,-0.1))
flip <- c(c(0,0,1,1,1,1),c(1,1,0,0,0,0))
df <- data.frame(ff,flip,group=gl(2,6))
> df
ff flip group
1 0.0 0 1
2 0.1 0 1
3 0.2 1 1
4 0.0 1 1
5 -0.1 1 1
6 -0.2 1 1
7 0.0 1 2
8 0.1 1 2
9 0.2 0 2
10 0.0 0 2
11 -0.1 0 2
12 -0.2 0 2
I want to add new columns called c1 and c2 based on some conditions, as follows:
dff <- df%>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
spread(direc,flip)%>%
arrange(group,group)%>%
mutate(c1=ff[head(which(forward>0),1)],c2=ff[tail(which(backward>0),1)])
Error: incompatible size (0), expecting 5 (the group size) or 1
I also tried using do():
do(data.frame(., c1=ff[head(which(.$forward>0),1)],c2=ff[tail(which(.$backward>0),1)]))
Error in data.frame(., c1 = ff[head(which(.$forward > 0), 1)], c2 = ff[tail(which(.$backward > :
arguments imply differing number of rows: 5, 1, 0
However, when I mutate only the c1 column, everything seems to work. Why?

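Just expanding on @allistaire's comment.
Your conditions are the cause of the error, specifically tail(which(backward>0),1): when no backward value in a group is greater than 0, which() returns integer(0) and the indexed result has length 0.
The code can also be simplified to get rid of the spread().
You can try: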
dff <- df%>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
arrange(group)%>%
mutate(c1=ff[head(which(direc=="forward" & flip > 0),1)])
It seems you are trying to identify, for each group, the points where the direction changes. If so, please clarify exactly how flip is related. Alternatively, if you change flip <- c(c(0,0,1,1,1,1),c(1,1,0,0,0,0)) to flip <- c(c(0,0,1,1,1,1),c(1,1,0,1,1,1)), so that flip marks the change in direction of ff, you can use:
dff <- df%>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
arrange(group)%>%
mutate(c1=ff[head(which(direc=="forward" & flip > 0),1)]) %>%
mutate(c2=ff[tail(which(direc=="backward"& flip >0),1)])
which gives:
Source: local data frame [12 x 6]
Groups: group [2]
ff flip group direc c1 c2
<dbl> <dbl> <fctr> <chr> <dbl> <dbl>
1 0.0 0 1 forward 0.2 -0.2
2 0.1 0 1 forward 0.2 -0.2
3 0.2 1 1 forward 0.2 -0.2
4 0.0 1 1 backward 0.2 -0.2
5 -0.1 1 1 backward 0.2 -0.2
6 -0.2 1 1 backward 0.2 -0.2
7 0.0 1 2 forward 0.0 -0.2
8 0.1 1 2 forward 0.0 -0.2
9 0.2 0 2 forward 0.0 -0.2
10 0.0 1 2 backward 0.0 -0.2
11 -0.1 1 2 backward 0.0 -0.2
12 -0.2 1 2 backward 0.0 -0.2

It might be informative to step through the pipe to see what is going on.
df %>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
spread(direc,flip)%>%
arrange(group,group)
# Source: local data frame [10 x 4]
# Groups: group [2]
# ff group backward forward
# <dbl> <fctr> <dbl> <dbl>
# 1 -0.2 1 1 NA
# 2 -0.1 1 1 NA
# 3 0.0 1 1 0
# 4 0.1 1 NA 0
# 5 0.2 1 NA 1
# 6 -0.2 2 0 NA
# 7 -0.1 2 0 NA
# 8 0.0 2 0 1
# 9 0.1 2 NA 1
# 10 0.2 2 NA 0
BTW: why arrange(group,group)? Repeating the ordering variable is pointless.
Looking at this output, you'll see that in group 2 none of the backward values is greater than 0. When you run something like which(FALSE) you get integer(0), so ff[tail(which(backward>0),1)] has length 0 for that group. dplyr needs the right-hand side of each mutate() expression to have either length 1 or the same length as the number of rows in the group.
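A minimal base R sketch of that failure, outside of dplyr. The two vectors are copied by hand from the group 2 rows of the output above; nothing else is assumed.

```r
# ff and backward for group 2, copied from the spread() output above
ff       <- c(-0.2, -0.1, 0.0, 0.1, 0.2)
backward <- c(0, 0, 0, NA, NA)

idx <- which(backward > 0)   # nothing is > 0, and which() drops the NA comparisons
idx                          # integer(0)

c2 <- ff[tail(idx, 1)]       # indexing with integer(0) gives a zero-length vector
length(c2)                   # 0, which cannot be recycled to the 5 rows of the group
```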
Instead of your mutate, I'll show it with a slight modification: return the number of matches found by the which call for c2:
df %>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
spread(direc,flip)%>%
arrange(group,group)%>%
mutate(
c1 = ff[head(which(forward>0),1)],
c2len = length(which(backward > 0))
)
# Source: local data frame [10 x 6]
# Groups: group [2]
# ff group backward forward c1 c2len
# <dbl> <fctr> <dbl> <dbl> <dbl> <int>
# 1 -0.2 1 1 NA 0.2 3
# 2 -0.1 1 1 NA 0.2 3
# 3 0.0 1 1 0 0.2 3
# 4 0.1 1 NA 0 0.2 3
# 5 0.2 1 NA 1 0.2 3
# 6 -0.2 2 0 NA 0.0 0
# 7 -0.1 2 0 NA 0.0 0
# 8 0.0 2 0 1 0.0 0
# 9 0.1 2 NA 1 0.0 0
# 10 0.2 2 NA 0 0.0 0
In order to meaningfully index on ff, you need something other than integer(0) in your returns.
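One possible way to get there, sketched in base R: index with which(...)[1] instead of head(which(...), 1), because subsetting an empty index vector with [1] yields NA rather than a zero-length result. The helper names first_where and last_where are hypothetical and not part of dplyr; whether NA is the right answer for a group without a match is an assumption about what the asker wants.

```r
# Hypothetical helpers: value of x at the first / last position where cond is TRUE.
# which(cond)[1] is NA when nothing matches, and x[NA_integer_] is NA,
# so the result always has length 1.
first_where <- function(x, cond) x[which(cond)[1]]
last_where  <- function(x, cond) x[rev(which(cond))[1]]

# Group 2 values copied from the spread() output above
ff       <- c(-0.2, -0.1, 0.0, 0.1, 0.2)
forward  <- c(NA, NA, 1, 1, 0)
backward <- c(0, 0, 0, NA, NA)

first_where(ff, forward > 0)    # 0
last_where(ff, backward > 0)    # NA
```

Because both helpers always return a single value, they should be usable inside the grouped mutate() in place of the head()/tail() expressions.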

Related

How to find the first column with a certain value for each row with dplyr

I have a dataset like this:
df <- data.frame(id=c(1:4), time_1=c(1, 0.9, 0.2, 0), time_2=c(0.1, 0.4, 0, 0.9), time_3=c(0,0.5,0.3,1.0))
id time_1 time_2 time_3
1 1.0 0.1 0
2 0.9 0.4 0.5
3 0.2 0 0.3
4 0 0.9 1.0
And I want to identify for each row, the first column containing a 0, and extract the corresponding number (as the last element of colname), obtaining this:
id time_1 time_2 time_3 count
1 1.0 0.1 0 3
2 0.9 0.4 0.5 NA
3 0.2 0 0.3 2
4 0 0.9 1.0 1
Do you have a tidyverse solution?
We may use max.col
v1 <- max.col(df[-1] ==0, "first")
v1[rowSums(df[-1] == 0) == 0] <- NA
df$count <- v1
-output
> df
id time_1 time_2 time_3 count
1 1 1.0 0.1 0.0 3
2 2 0.9 0.4 0.5 NA
3 3 0.2 0.0 0.3 2
4 4 0.0 0.9 1.0 1
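A short sketch of why the second line (the rowSums guard) is needed. The data frame is retyped from the question under the name d; the names is_zero, raw and has_zero are only illustrative.

```r
d <- data.frame(id = 1:4,
                time_1 = c(1, 0.9, 0.2, 0),
                time_2 = c(0.1, 0.4, 0, 0.9),
                time_3 = c(0, 0.5, 0.3, 1.0))

is_zero <- d[-1] == 0              # logical matrix, one column per time_* variable
raw <- max.col(is_zero, "first")   # position of the first TRUE in each row
raw                                # 3 1 2 1

# Row 2 contains no 0 at all, yet max.col() still reports column 1,
# because every entry ties at FALSE and "first" picks the first tie.
has_zero <- rowSums(is_zero) > 0
has_zero                           # TRUE FALSE TRUE TRUE
```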
Or using dplyr: use if_any to check whether any of the 'time' columns is 0 in each row; if so, return the index of the first 0 with max.col inside case_when. Rows without a 0 fall through case_when and get NA. (pick() requires dplyr 1.1.0 or later; on older versions, replace it with across().)
library(dplyr)
df %>%
mutate(count = case_when(if_any(starts_with("time"), ~ .x== 0) ~
max.col(pick(starts_with("time")) ==0, "first")))
-output
id time_1 time_2 time_3 count
1 1 1.0 0.1 0.0 3
2 2 0.9 0.4 0.5 NA
3 3 0.2 0.0 0.3 2
4 4 0.0 0.9 1.0 1
You can do this:
df <- df %>%
rowwise() %>%
mutate (count = which(c_across(starts_with("time")) == 0)[1])
df
id time_1 time_2 time_3 count
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.1 0 3
2 2 0.9 0.4 0.5 NA
3 3 0.2 0 0.3 2
4 4 0 0.9 1 1

Getting a value and subtract by value in a static column (Mean) over multiple columns

I have a sample dataframe from which I want to get a value and then subtract by value in a static column (Mean) over multiple columns.
For example:
I have a dataframe df:
LK Loc1 Loc2 Loc3 Mean
1 1 2 0 3
2 2 8 4 4.6
3 3 1 2 2
4 2 0 1 1.5
5 1 2 0 1.5
I want to get in a new dataframe:
LK Loc1 Loc2 Loc3
1 -2 -1 -3
2 -2.6 3.4 -0.6
3 1 -1 0
4 0.5 -1.5 -0.5
5 -0.5 0.5 -1.5
I tried something with:
df2 <- df %>%
mutate(across(-LK, ~ accumulate(., `-`)))
But I don't know how to continue..
Any help is appreciated.
Thank you in advance
I think you can use the following solution:
library(dplyr)
df %>%
mutate(across(starts_with("Loc"), ~ .x - Mean))
LK Loc1 Loc2 Loc3 Mean
1 1 -2.0 -1.0 -3.0 3.0
2 2 -2.6 3.4 -0.6 4.6
3 3 1.0 -1.0 0.0 2.0
4 4 0.5 -1.5 -0.5 1.5
5 5 -0.5 0.5 -1.5 1.5
We can also use the pmap function from the purrr package. This is a bit more complicated, but it is nice to know. pmap iterates over every row of a data frame:
Here c(...) captures all values in each row, from which I select only those whose names start with Loc, giving a vector of 3 elements.
Then we subtract the value of the Mean variable from each element of that vector; Mean is referred to as ..5 because it is the fifth variable in this data set.
The rest is just renaming and reordering the variables.
df %>%
pmap_df(~ {x <- c(...)[startsWith(names(df), "Loc")];
x - ..5}) %>%
bind_cols(df$LK) %>%
rename(LK = ...4) %>%
relocate(LK)
# A tibble: 5 x 4
LK Loc1 Loc2 Loc3
<int> <dbl> <dbl> <dbl>
1 1 -2 -1 -3
2 2 -2.6 3.4 -0.600
3 3 1 -1 0
4 4 0.5 -1.5 -0.5
5 5 -0.5 0.5 -1.5
Another way to do it:
library(tidyverse)
df <-
read_table('LK Loc1 Loc2 Loc3 Mean
1 1 2 0 3
2 2 8 4 4.6
3 3 1 2 2
4 2 0 1 1.5
5 1 2 0 1.5')
cbind( df[1],
map_dfc(select(df,starts_with('Loc')), ~ .x - df$Mean) )
#> LK Loc1 Loc2 Loc3
#> 1 1 -2.0 -1.0 -3.0
#> 2 2 -2.6 3.4 -0.6
#> 3 3 1.0 -1.0 0.0
#> 4 4 0.5 -1.5 -0.5
#> 5 5 -0.5 0.5 -1.5
Created on 2021-06-21 by the reprex package (v2.0.0)
I was able to get what you needed using mutate_at:
df %>%
mutate_at(vars(starts_with("Loc")), ~ .-Mean) %>%
select(-c(Mean))
Here, I use vars(starts_with("Loc")) to tell R that every column whose name starts with "Loc" should be included in the transformation; each such column is referenced as . after the tilde. Then I refer to the column Mean by name. I noticed that several values in the Mean column (rows 1, 4 and 5) are not means across the row, while the others look like row-wise means. I wasn't sure whether that was on purpose, but here is one option that gets you row-wise means in dplyr: rowwise() %>% mutate(Mean = mean(c(Loc1, Loc2, Loc3))). Without rowwise(), mean() would be computed over all rows at once.
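If row-wise means are indeed what is wanted, a base R sketch could look like this. The data frame is retyped from the question; loc_cols, row_mean and centered are illustrative names, and replacing the supplied Mean column is an assumption about the intent.

```r
df <- data.frame(LK   = 1:5,
                 Loc1 = c(1, 2, 3, 2, 1),
                 Loc2 = c(2, 8, 1, 0, 2),
                 Loc3 = c(0, 4, 2, 1, 0),
                 Mean = c(3, 4.6, 2, 1.5, 1.5))

loc_cols <- grep("^Loc", names(df), value = TRUE)  # "Loc1" "Loc2" "Loc3"
row_mean <- rowMeans(df[loc_cols])                 # one mean per row

# A vector with one element per row is recycled down every column
centered <- df[loc_cols] - row_mean
cbind(df["LK"], centered)
```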

pivot_wider() producing NA for previously full data

I wonder why when I try to turn my data.frame into wide format, the two columns Y1 & Y2 contain NA?
The dataset originally had no NA on its Y1 and Y2. Is there a fix?
library(tidyverse)
dat <- read.csv("https://raw.githubusercontent.com/rnorouzian/v/main/mvmm.csv")
pivot_wider(dat, names_from= DV, values_from = Response)
# School Student Treat Gender Pretest MeanPretest TXG Index1 D1 D2 TreatCAT Gendercat Y1 Y2
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 -0.5 -0.5 48.3 45.6 0.25 1 1 0 0 -0.5 29.4 NA
# 2 1 1 -0.5 -0.5 48.3 45.6 0.25 2 0 1 0 -0.5 NA 47.4
# 3 1 2 -0.5 0.5 52.1 45.6 -0.25 1 1 0 0 0.5 52.2 NA
I think several columns duplicate the information in DV. The columns DV, D2, D1 and Index1 either follow the same pattern or its exact opposite, so they should be reshaped together; otherwise pivot_wider() treats them as identifying columns and keeps a separate row for each DV value, filling the other column with NA. We can check this with dim(): the original table has 1600 rows, and if it is widened properly it should have fewer. With the code below it has 800 rows; with the OP's code it still has 1600.
library(tidyverse)
dat %>%
pivot_wider(names_from= c(DV,D2,D1,Index1), values_from = Response)
Output:
School Student Treat Gender Pretest MeanPretest TXG TreatCAT
1 1 1 -0.5 -0.5 48.34437943 45.62666702 0.25 0
2 1 2 -0.5 0.5 52.14841080 45.62666702 -0.25 0
3 1 3 -0.5 -0.5 40.56079483 45.62666702 0.25 0
4 1 4 -0.5 0.5 63.11892700 45.62666702 -0.25 0
5 1 5 -0.5 -0.5 66.79794312 45.62666702 0.25 0
6 1 6 -0.5 0.5 19.42481995 45.62666702 -0.25 0
Gendercat Y1_0_1_1 Y2_1_0_2
1 -0.5 29.36377525 47.35104752
2 0.5 52.20915985 49.77211761
3 -0.5 42.21330261 36.21236038
4 0.5 46.69318008 63.72433472
5 -0.5 48.70760345 48.04736328
6 0.5 23.40506554 11.07947922
Try this:
dat %>%
select(-c(Index1, D1, D2)) %>%
pivot_wider(names_from = DV, values_from = Response)
This happens because Index1, D1, and D2 all carry the same information as DV, the column you are pivoting by, so every row stays unique. If you drop them, it works fine.
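The effect can be reproduced on a small made-up table. The data frame toy below is hypothetical; only its column names mirror the question. The second call uses the id_cols argument of pivot_wider() as an alternative to dropping columns with select().

```r
library(tidyr)

# Hypothetical toy data: Index1 changes together with DV, as in the question
toy <- data.frame(Student  = c(1, 1, 2, 2),
                  Index1   = c(1, 2, 1, 2),
                  DV       = c("Y1", "Y2", "Y1", "Y2"),
                  Response = c(29.4, 47.4, 52.2, 49.8))

# Index1 is kept as an identifying column, so each Student still has two rows
wide_bad <- pivot_wider(toy, names_from = DV, values_from = Response)
nrow(wide_bad)   # 4, with NA in Y1 or Y2 on every row

# Only Student identifies a row; Index1 is dropped from the result
wide_ok <- pivot_wider(toy, id_cols = Student,
                       names_from = DV, values_from = Response)
nrow(wide_ok)    # 2, no NA
```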

Create grouping based on cumulative sum and another group

This question is nearly identical to:
Create new group based on cumulative sum and group
However, when I apply the accepted solution to my data, it doesn't have the expected result.
In a nutshell, I have data with two variables: domain and value. Domain is a grouping variable with multiple observations, and value is a continuous value that I would like to accumulate by domain to create a new group variable, newgroup. There are three main rules:
1. I accumulate only within each domain. If I reach the end of the domain, then the accumulation is reset.
2. If the accumulated sum is at least 1.0, then the observations whose values added up to at least 1.0 form one group in group1, and the next observation starts a new group. Please note that this rule can be satisfied by a single observation.
3. If the last group in a domain has an accumulated sum less than 1.0, then merge it with the second-to-last group within the same domain. This is reflected in the variable group2.
The data below has been simplified. The data will usually consist of 10^5 - 10^6 rows, so a vectorized solution would be ideal.
Example Data
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
domain value
1 1.0
1 0.0
1 2.0
1 2.5
1 0.1
2 0.1
2 0.5
2 0.0
2 0.2
2 0.6
2 0.0
2 0.0
2 0.1
Desired Output
cumsum_val <- c(1,0,2,2.5,0.1,0.1,0.6,0.6,0.8,1.4,0,0,0.1)
group1 <- c(1,2,2,3,4,5,5,5,5,5,6,6,6)
group2 <- c(1,2,2,3,3,4,4,4,4,4,4,4,4) #Satisfies Rule #3
df_want <- data.frame(domain,value,cumsum_val,group1,group2)
domain value cumsum_val group1 group2
1 1.0 1.0 1 1
1 0.0 0.0 2 2
1 2.0 2.0 2 2
1 2.5 2.5 3 3
1 0.1 0.1 4 3
2 0.1 0.1 5 4
2 0.5 0.6 5 4
2 0.0 0.6 5 4
2 0.2 0.8 5 4
2 0.6 1.4 5 4
2 0.0 0.0 6 4
2 0.0 0.0 6 4
2 0.1 0.1 6 4
I used the following code:
sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init=0, x, acc = TRUE)[-1] == 0), -1)
cumsum(ave(df_raw$value, df_raw$domain, FUN = is_start))
## 1 2 3 4 5 6 6 6 6 6 7 8 9
but the last line does not produce the same values as group1 above. Generating group1 output is what is mainly causing me issues. Can someone help me understand the function is_start and how that is supposed to produce the groupings?
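One way to see what is_start does is to print its intermediate steps for a single domain. The sketch below reuses sum0 from above on the domain 1 values; the names running and starts are only illustrative.

```r
sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
value <- c(1, 0, 2, 2.5, 0.1)    # domain 1 from the example data

# Running sum that is set back to 0 whenever it reaches 1.0
running <- Reduce(sum0, value, accumulate = TRUE, init = 0)[-1]
running                          # 0.0 0.0 0.0 0.0 0.1

# A row is flagged as a group start when the previous running sum is 0
starts <- head(c(TRUE, running == 0), -1)
starts                           # TRUE TRUE TRUE TRUE TRUE
cumsum(starts)                   # 1 2 3 4 5
```

As far as I can tell, the mismatch with group1 comes from the test running == 0: it cannot tell a reset apart from a sum that is still 0 because the value itself was 0. Row 2 has value 0, so row 3 is flagged as a new start even though the desired output keeps rows 2 and 3 together.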
EDIT
akrun provided some working code in the comments for the simplified example above. However, there are still some situations where it doesn't work. For example,
domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
The output is shown below, where new comes from akrun's code and group1 and group2 are the desired groupings based on rules #2 and #3. The discrepancy between new and group2 occurs mainly in the first 3 rows.
domain value new group1 group2
1 1.0 1 1 1
1 0.0 2 2 2
1 1.0 3 2 2
1 0.0 4 3 3
1 2.0 4 3 3
1 2.5 5 4 4
1 0.1 5 5 4
2 0.1 6 6 5
2 0.5 6 6 5
2 0.0 6 6 5
2 0.2 6 6 5
2 0.6 6 6 5
2 0.0 6 7 5
2 0.0 6 7 5
2 0.1 6 7 5
EDIT 2
I have updated with a working answer.
This works! It uses a combination of purrr's accumulate (similar to cumsum but more versatile) and cumsum with appropriate use of group_by to get what you're looking for. I've added comments to indicate what each part is doing. I'll note that next_group2 is a bit of a misnomer--it's more of a not_next_group2, but hopefully the rest is clear.
library(tidyverse)
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
## Modified from: https://stackoverflow.com/questions/49076769/dplyr-r-cumulative-sum-with-reset
sum_reset_at = function(val_col, threshold, include.equals = TRUE) {
if (include.equals) {
purrr::accumulate({{val_col}}, ~if_else(.x>=threshold , .y, .x+.y))
} else {
purrr::accumulate({{val_col}}, ~if_else(.x>threshold , .y, .x+.y))
}
}
df_raw %>%
group_by(domain) %>%
mutate(cumsum_val = sum_reset_at(value, 1)) %>%
mutate(next_group1 = ifelse(lag(cumsum_val) >= 1 | row_number() == 1, 1, 0)) %>% ## binary interpretation of whether there should be a new group
ungroup %>%
mutate(group1 = cumsum(next_group1)) %>% ## generate new groups
group_by(domain, group1) %>%
mutate(next_group2 = ifelse(max(cumsum_val) < 1 & row_number() == 1, 1, 0)) %>% ## similar to above, but grouped by your new group1; we ask it only to transition at the first value of the group that doesn't reach 1
ungroup %>%
mutate(group2 = cumsum(next_group1 - next_group2)) %>% ## cancel out the next_group1 binary if it meets the conditions of next_group2
select(-starts_with("next_"))
And as specified, this produces:
# A tibble: 13 x 5
domain value cumsum_val group1 group2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 1 0 0 2 2
3 1 2 2 2 2
4 1 2.5 2.5 3 3
5 1 0.1 0.1 4 3
6 2 0.1 0.1 5 4
7 2 0.5 0.6 5 4
8 2 0 0.6 5 4
9 2 0.2 0.8 5 4
10 2 0.6 1.4 5 4
11 2 0 0 6 4
12 2 0 0 6 4
13 2 0.1 0.1 6 4
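For readers who prefer not to depend on purrr, the reset-on-threshold running sum can also be sketched with base R's Reduce() and ave(). The function name reset_cumsum is hypothetical, and this covers only the cumsum_val step, not the grouping.

```r
# Hypothetical helper: running sum that restarts from the current value
# once the previous running sum has reached the threshold
reset_cumsum <- function(x, threshold = 1) {
  Reduce(function(acc, v) if (acc >= threshold) v else acc + v,
         x, accumulate = TRUE)
}

domain <- c(rep(1, 5), rep(2, 8))
value  <- c(1, 0, 2, 2.5, 0.1, 0.1, 0.5, 0, 0.2, 0.6, 0, 0, 0.1)

# ave() applies the helper separately within each domain
cumsum_val <- ave(value, domain, FUN = reset_cumsum)
cumsum_val
```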
The solution below is adapted from Group vector on conditional sum.
Helper Rcpp Function
library(Rcpp)
cppFunction('
IntegerVector CreateGroup(NumericVector x, int cutoff) {
IntegerVector groupVec (x.size());
int group = 1;
int threshid = 0;
double runSum = 0;
for (int i = 0; i < x.size(); i++) {
runSum += x[i];
groupVec[i] = group;
if (runSum >= cutoff) {
group++;
runSum = 0;
}
}
return groupVec;
}
')
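If no C++ toolchain is available, the same loop can be written in plain R. This is only a sketch of the logic above under the name create_group_r (hypothetical), and it is likely to be noticeably slower than the Rcpp version on 10^5 to 10^6 rows.

```r
# Plain R version of the loop in CreateGroup (hypothetical name)
create_group_r <- function(x, cutoff = 1) {
  group_vec <- integer(length(x))
  group <- 1L
  run_sum <- 0
  for (i in seq_along(x)) {
    run_sum <- run_sum + x[i]
    group_vec[i] <- group          # the current row belongs to the current group
    if (run_sum >= cutoff) {       # threshold reached: next row starts a new group
      group <- group + 1L
      run_sum <- 0
    }
  }
  group_vec
}

create_group_r(c(1, 0, 1, 0, 2, 2.5, 0.1))   # 1 2 2 3 3 4 5
```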
Main Function
domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
library(dplyr)
df_raw %>%
group_by(domain) %>%
mutate(group1 = CreateGroup(value,1),
group1 = ifelse(group1==max(group1) & last(value) < 1,
max(group1)-1,group1)) %>%
ungroup() %>%
mutate(group2 = data.table::rleid(group1)) # rleid() comes from the data.table package
domain value group1 group2
1 1.0 1 1
1 0.0 2 2
1 1.0 2 2
1 0.0 3 3
1 2.0 3 3
1 2.5 4 4
1 0.1 4 4
2 0.1 1 5
2 0.5 1 5
2 0.0 1 5
2 0.2 1 5
2 0.6 1 5
2 0.0 1 5
2 0.0 1 5
2 0.1 1 5

Find missing month after grouping with dplyr

I have a data frame with two columns that I am grouping by with dplyr, a column of months (as numerics, e.g. 1 through 12), and several columns with statistical data following that (values unimportant). An example:
ID_1 ID_2 month st1 st2
1 1 1 0.5 0.2
1 1 2 0.7 0.9
1 1 3 1.1 1.7
1 1 4 2.6 0.8
1 1 5 1.8 1.3
1 1 6 2.1 2.2
1 1 7 0.5 0.2
1 1 8 0.7 0.9
1 1 9 1.1 1.7
1 1 10 2.6 0.8
1 1 11 1.8 1.3
1 1 12 2.1 2.2
1 2 1 0.5 0.2
1 2 2 0.7 0.9
1 2 3 1.1 1.7
1 2 4 2.6 0.8
1 2 5 1.8 1.3
1 2 6 2.1 2.2
1 2 7 0.5 0.2
1 2 9 1.1 1.7
1 2 10 2.6 0.8
1 2 11 1.8 1.3
1 2 12 2.1 2.2
For the second grouping (ID_1 = 1 and ID_2 = 2), there is a month missing from the data (month = 8). Is there a way I can find this month and insert a row with the correct ID_1 and ID_2 values, the missing month value, and NA values for the rest of the columns? I've been playing around with this using dplyr functions and can't seem to figure it out, perhaps there is even a non-dplyr solution out there as well.
PS: If it helps, each unique grouping of ID_1 and ID_2 will have no more than 1 month missing.
Expand grid to make all combos of groups, then merge:
# make reference with all needed rows
ref <- data.frame(expand.grid(unique(df1$ID_1),
unique(df1$ID_2),
1:12))
colnames(ref) <- colnames(df1)[1:3]
# then merge with all = TRUE
res <- merge(df1, ref, all = TRUE)
# to check output, show only month = 8
res[ res$month == 8, ]
# ID_1 ID_2 month st1 st2
# 8 1 1 8 0.7 0.9
# 20 1 2 8 NA NA
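If you only need to know which months are missing before inserting anything, a base R sketch with split() and setdiff() may be enough. The df1 below is a cut-down stand-in for the question's data (ID and month columns only), and the name missing is illustrative.

```r
# Stand-in for the question's data: group (1, 2) lacks month 8
df1 <- data.frame(ID_1  = 1L,
                  ID_2  = rep(1:2, c(12, 11)),
                  month = c(1:12, setdiff(1:12, 8)))

# One vector of months per ID_1/ID_2 combination, compared against 1:12
by_group <- split(df1$month, list(df1$ID_1, df1$ID_2), drop = TRUE)
missing  <- lapply(by_group, function(m) setdiff(1:12, m))

missing[["1.1"]]   # integer(0)
missing[["1.2"]]   # 8
```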
This can be done via tidyr::complete:
library(dplyr)
library(tidyr)
dat %>%
group_by(ID_1, ID_2) %>%
complete(month = 1:12)
Tail of dataset:
Source: local data frame [6 x 5]
Groups: ID_1, ID_2 [1]
ID_1 ID_2 month st1 st2
<int> <int> <int> <dbl> <dbl>
1 1 2 7 0.5 0.2
2 1 2 8 NA NA
3 1 2 9 1.1 1.7
4 1 2 10 2.6 0.8
5 1 2 11 1.8 1.3
6 1 2 12 2.1 2.2
With tidyr there is the complete function for this. Wrap ID_1 and ID_2 in nesting() so that only the combinations of the two variables that actually occur in the data are completed:
library(tidyr)
df1 = df %>% complete(nesting(ID_1, ID_2), month)
tail(df1)
# Source: local data frame [6 x 5]
# ID_1 ID_2 month st1 st2
# <int> <int> <int> <dbl> <dbl>
# 1 1 2 7 0.5 0.2
# 2 1 2 8 NA NA
# 3 1 2 9 1.1 1.7
# 4 1 2 10 2.6 0.8
# 5 1 2 11 1.8 1.3
# 6 1 2 12 2.1 2.2
