pivot_wider() producing NA for previously full data - r

I wonder why, when I try to turn my data.frame into wide format, the two columns Y1 & Y2 contain NA?
The dataset originally had no NA in Y1 and Y2. Is there a fix?
library(tidyverse)
dat <- read.csv("https://raw.githubusercontent.com/rnorouzian/v/main/mvmm.csv")
pivot_wider(dat, names_from = DV, values_from = Response)
# School Student Treat Gender Pretest MeanPretest TXG Index1 D1 D2 TreatCAT Gendercat Y1 Y2
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 -0.5 -0.5 48.3 45.6 0.25 1 1 0 0 -0.5 29.4 NA
# 2 1 1 -0.5 -0.5 48.3 45.6 0.25 2 0 1 0 -0.5 NA 47.4
# 3 1 2 -0.5 0.5 52.1 45.6 -0.25 1 1 0 0 0.5 52.2 NA

I think you have duplicated columns. If you change your code as shown below, it should work. The columns DV, D2, D1, and Index1 contain either the same pattern or its exact contrast, so they should be reshaped together; otherwise, rows get duplicated while the data are translated to the wider form. We can check that by taking the dim of your original table: 1600 rows. If widened properly, it should have fewer records; with the code below it drops to 800, whereas with the OP's code it stayed at 1600.
library(tidyverse)
dat %>%
pivot_wider(names_from = c(DV, D2, D1, Index1), values_from = Response)
Output:
School Student Treat Gender Pretest MeanPretest TXG TreatCAT
1 1 1 -0.5 -0.5 48.34437943 45.62666702 0.25 0
2 1 2 -0.5 0.5 52.14841080 45.62666702 -0.25 0
3 1 3 -0.5 -0.5 40.56079483 45.62666702 0.25 0
4 1 4 -0.5 0.5 63.11892700 45.62666702 -0.25 0
5 1 5 -0.5 -0.5 66.79794312 45.62666702 0.25 0
6 1 6 -0.5 0.5 19.42481995 45.62666702 -0.25 0
Gendercat Y1_0_1_1 Y2_1_0_2
1 -0.5 29.36377525 47.35104752
2 0.5 52.20915985 49.77211761
3 -0.5 42.21330261 36.21236038
4 0.5 46.69318008 63.72433472
5 -0.5 48.70760345 48.04736328
6 0.5 23.40506554 11.07947922
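As a quick sanity check (my own sketch, not part of the original answer), you can confirm the row counts and that no NA appears in the reshaped response columns:
library(tidyverse)
dat <- read.csv("https://raw.githubusercontent.com/rnorouzian/v/main/mvmm.csv")
nrow(dat)        # 1600 rows in the long format
wide <- dat %>%
  pivot_wider(names_from = c(DV, D2, D1, Index1), values_from = Response)
nrow(wide)       # 800 rows once the duplicating columns are reshaped together
anyNA(select(wide, starts_with("Y")))   # expect FALSE if the reshape is clean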

Try this:
dat %>%
select(-c(Index1, D1, D2)) %>%
pivot_wider(names_from = DV, values_from = Response)
This is happening because Index1, D1, and D2 all encode the same information as DV, the column you want to pivot by, so they keep the rows from collapsing. If you get rid of them, it works fine.
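An equivalent way to express the same idea (a sketch of mine, assuming a reasonably recent tidyr, where columns outside id_cols/names_from/values_from are dropped) is to spell out the identifier columns with id_cols:
library(tidyverse)
dat %>%
  pivot_wider(
    id_cols     = c(School, Student, Treat, Gender, Pretest,
                    MeanPretest, TXG, TreatCAT, Gendercat),
    names_from  = DV,
    values_from = Response
  )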

Related

How to find the first column with a certain value for each row with dplyr

I have a dataset like this:
df <- data.frame(id=c(1:4), time_1=c(1, 0.9, 0.2, 0), time_2=c(0.1, 0.4, 0, 0.9), time_3=c(0,0.5,0.3,1.0))
id time_1 time_2 time_3
1 1.0 0.1 0
2 0.9 0.4 0.5
3 0.2 0 0.3
4 0 0.9 1.0
I want to identify, for each row, the first column containing a 0 and extract the corresponding number (the last element of the column name), obtaining this:
id time_1 time_2 time_3 count
1 1.0 0.1 0 3
2 0.9 0.4 0.5 NA
3 0.2 0 0.3 2
4 0 0.9 1.0 1
Do you have a tidyverse solution?
We may use max.col
v1 <- max.col(df[-1] ==0, "first")
v1[rowSums(df[-1] == 0) == 0] <- NA
df$count <- v1
Output:
> df
id time_1 time_2 time_3 count
1 1 1.0 0.1 0.0 3
2 2 0.9 0.4 0.5 NA
3 3 0.2 0.0 0.3 2
4 4 0.0 0.9 1.0 1
Or using dplyr: use if_any to check whether any of the 'time' columns is 0 for each row; if so, return the index of the first 0 value with max.col inside case_when (pick() requires dplyr >= 1.1.0 and can be replaced with across() on older versions):
library(dplyr)
df %>%
mutate(count = case_when(if_any(starts_with("time"), ~ .x== 0) ~
max.col(pick(starts_with("time")) ==0, "first")))
Output:
id time_1 time_2 time_3 count
1 1 1.0 0.1 0.0 3
2 2 0.9 0.4 0.5 NA
3 3 0.2 0.0 0.3 2
4 4 0.0 0.9 1.0 1
You can do this:
df <- df %>%
rowwise() %>%
mutate(count = which(c_across(starts_with("time")) == 0)[1])
df
id time_1 time_2 time_3 count
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.1 0 3
2 2 0.9 0.4 0.5 NA
3 3 0.2 0 0.3 2
4 4 0 0.9 1 1
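All three solutions return the column position, which matches the desired count here only because the columns happen to be named time_1, time_2, time_3 in that order. If the numbers in the names were not consecutive, a variant like the following (my addition, building on the max.col answer above) would parse the number out of the matched column name instead:
idx <- max.col(df[-1] == 0, "first")
idx[rowSums(df[-1] == 0) == 0] <- NA            # rows with no 0 at all
df$count <- as.numeric(sub("time_", "", names(df)[-1][idx]))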

r nested sum by groups

I am trying to implement this logic:
i j Eij
1 1 -1.5
1 2 -1.5
1 3 0.5
1 4 0.5
2 1 -0.5
2 2 -0.5
2 3 1.5
2 4 1.5
Each value in column Eij, within each i, should be multiplied by the sum of the values that come after it (for the same i).
For example, for i=1 the first value is -1.5 (i=1, j=1). It should be multiplied by the sum of -1.5, 0.5, and 0.5, the values that occur after it.
Similarly, the next value in i=1, -1.5 (i=1, j=2), should be multiplied by the sum of 0.5 and 0.5, the values that occur after it.
And so on. Altogether, the quantity I want is:
ro = -1.5*(-1.5 + 0.5 + 0.5) +
-1.5*(0.5 + 0.5) +
0.5*(0.5) +
-0.5*(-0.5 + 1.5 + 1.5) +
-0.5*(1.5 + 1.5) +
1.5*(1.5)
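For reference, the answers below assume df holds the table above; the question gives the table but not the code to build it, so this is my reconstruction:
df <- data.frame(
  i   = rep(1:2, each = 4),
  j   = rep(1:4, times = 2),
  Eij = c(-1.5, -1.5, 0.5, 0.5, -0.5, -0.5, 1.5, 1.5)
)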
Here's a solution with dplyr:
library(dplyr)
df %>%
group_by(i) %>%
arrange(desc(j), .by_group = TRUE) %>%
mutate(
multiplier = lag(cumsum(Eij), default = 0),
result = Eij * multiplier
) %>%
arrange(j, .by_group = TRUE) %>%
ungroup
# # A tibble: 8 × 5
# i j Eij multiplier result
# <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 -1.5 -0.5 0.75
# 2 1 2 -1.5 1 -1.5
# 3 1 3 0.5 0.5 0.25
# 4 1 4 0.5 0 0
# 5 2 1 -0.5 2.5 -1.25
# 6 2 2 -0.5 3 -1.5
# 7 2 3 1.5 1.5 2.25
# 8 2 4 1.5 0 0
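The result column holds the individual products; to get the single number ro from the question, just sum it up (this step is my addition, not part of the original answer):
library(dplyr)
df %>%
  group_by(i) %>%
  arrange(desc(j), .by_group = TRUE) %>%
  mutate(result = Eij * lag(cumsum(Eij), default = 0)) %>%
  pull(result) %>%
  sum()
# [1] -1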
A base R alternative: the helper f multiplies each value by the sum of the values that come after it within its group.
# total per group via aggregate(), then summed over the groups
f <- function(v) v %*% c(rev(cumsum(rev(v)))[-1], 0)
sum(aggregate(df$Eij, list(df$i), FUN = f)$x)
#[1] -1
Or, keeping one value per row with ave():
f <- function(v) v * c(rev(cumsum(rev(v)))[-1], 0)
ave(df$Eij, df$i, FUN = f)
#[1] 0.75 -1.50 0.25 0.00 -1.25 -1.50 2.25 0.00

create dataframe of difference of medians in column based on values of another column

I have a dataframe which looks like this:
data <- data.frame(id=c(1,2,6,3,7,1,5,7),
class=c('apple','boy','boy','apple','boy','apple','apple','boy'),
type=c('type1','type1','type2','type2','type3','type4','type4','type4'),
col1=c(-0.9,0.8,0.7,-0.6,-0.5,0.4,0.3,0.9), col2=c(-6.9,2.8,0.4,-1.6,-0.8,0.6,0.2,-0.1),
col3=c(6.7,0.9,0.2,-0.7,-0.8,1.6,3.2,0.1))
id class type col1 col2 col3
1 apple type1 -0.9 -6.9 6.7
2 boy type1 0.8 2.8 0.9
6 boy type2 0.7 0.4 0.2
3 apple type2 -0.6 -1.6 -0.7
7 boy type3 -0.5 -0.8 -0.8
1 apple type4 0.4 0.6 1.6
5 apple type4 0.3 0.2 3.2
7 boy type4 0.9 -0.1 0.1
I am trying to create a dataframe which has the same columns (i.e., col1, col2, col3, ...), but each value should be median((data %>% filter(class=="apple"))$col1) - median((data %>% filter(class=="boy"))$col1), and so on for each type and each column.
So, the final dataframe will look like
type col1 col2 col3
type1 -0.1 -4.1 3.7
type2 0.7 0.4 0.2
type3 -0.5 -0.8 -0.8
type4 0.4 0.6 1.6
I can do this by creating an individual dataframe for each type, calculating the difference of medians between the two classes, and appending the resulting rows with bind_rows().
But is there a better and easier way to do this?
The method you want is something like this:
data %>%
group_by(type) %>%
summarize(across(col1:col3, ~ median(.[class=="boy"] - median(.[class=="boy"]))))
# # A tibble: 4 x 4
# type col1 col2 col3
# <chr> <dbl> <dbl> <dbl>
# 1 type1 0 0 0
# 2 type2 0 0 0
# 3 type3 0 0 0
# 4 type4 0 0 0
though in this instance it will return all 0s because you only have one "boy" within each group.
Post question-edit, here's the updated code and results:
data %>%
group_by(type) %>%
summarize(across(col1:col3, ~ median(.[class=="apple"]) - median(.[class=="boy"])))
# # A tibble: 4 x 4
# type col1 col2 col3
# <chr> <dbl> <dbl> <dbl>
# 1 type1 -1.7 -9.7 5.8
# 2 type2 -1.3 -2 -0.9
# 3 type3 NA NA NA
# 4 type4 -0.55 0.5 2.3
The NAs are because type3 only has "boy", no "apple".
(At least we aren't comparing "apple" to "orange", that would have been rather cliché ;-)
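If you would rather drop the types that are missing one of the two classes instead of getting NA rows, one option (a sketch of mine, not part of the original answer) is to filter those groups out first:
library(dplyr)
data %>%
  group_by(type) %>%
  filter(all(c("apple", "boy") %in% class)) %>%
  summarize(across(col1:col3, ~ median(.[class == "apple"]) - median(.[class == "boy"])))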
Here is a way to get your solution (that was a tough one!):
library(dplyr)
data %>%
arrange(id) %>%
filter(class == "boy" | type=="type1") %>%
group_by(type) %>%
summarise(across(starts_with("col"), sum))
type col1 col2 col3
<chr> <dbl> <dbl> <dbl>
1 type1 -0.1 -4.1 7.6
2 type2 0.7 0.4 0.2
3 type3 -0.5 -0.8 -0.8
4 type4 0.9 -0.1 0.1

Getting a value and subtract by value in a static column (Mean) over multiple columns

I have a sample dataframe in which I want to subtract the value in a static column (Mean) from the values in multiple other columns.
For example:
I have a dataframe df:
LK Loc1 Loc2 Loc3 Mean
1 1 2 0 3
2 2 8 4 4.6
3 3 1 2 2
4 2 0 1 1.5
5 1 2 0 1.5
I want to get in a new dataframe:
LK Loc1 Loc2 Loc3
1 -2 -1 -3
2 -2.6 3.4 -0.6
3 1 -1 0
4 0.5 -1.5 -0.5
5 -0.5 0.5 -1.5
I tried something with:
df2 <- df %>%
mutate(across(-LK, ~ accumulate(., `-`)))
But I don't know how to continue from there.
Any help is appreciated. Thank you in advance.
I think you can use the following solution:
library(dplyr)
df %>%
mutate(across(starts_with("Loc"), ~ .x - Mean))
LK Loc1 Loc2 Loc3 Mean
1 1 -2.0 -1.0 -3.0 3.0
2 2 -2.6 3.4 -0.6 4.6
3 3 1.0 -1.0 0.0 2.0
4 4 0.5 -1.5 -0.5 1.5
5 5 -0.5 0.5 -1.5 1.5
We can also use the pmap function from the purrr package. This is a bit more involved, but it is nice to know: we use pmap to iterate over every row of the data frame.
Here we use c(...) to capture all the values in each row, but I select only those whose names start with Loc, giving a vector of 3 elements.
Then we subtract the value of the Mean variable, referred to as ..5 here because it is the fifth variable in this data set, from each element of that vector.
The rest is just renaming and rearranging the columns.
df %>%
pmap_df(~ {x <- c(...)[startsWith(names(df), "Loc")];
x - ..5}) %>%
bind_cols(df$LK) %>%
rename(LK = ...4) %>%
relocate(LK)
# A tibble: 5 x 4
LK Loc1 Loc2 Loc3
<int> <dbl> <dbl> <dbl>
1 1 -2 -1 -3
2 2 -2.6 3.4 -0.600
3 3 1 -1 0
4 4 0.5 -1.5 -0.5
5 5 -0.5 0.5 -1.5
Another way to do it:
library(tidyverse)
df <-
read_table('LK Loc1 Loc2 Loc3 Mean
1 1 2 0 3
2 2 8 4 4.6
3 3 1 2 2
4 2 0 1 1.5
5 1 2 0 1.5')
cbind( df[1],
map_dfc(select(df,starts_with('Loc')), ~ .x - df$Mean) )
#> LK Loc1 Loc2 Loc3
#> 1 1 -2.0 -1.0 -3.0
#> 2 2 -2.6 3.4 -0.6
#> 3 3 1.0 -1.0 0.0
#> 4 4 0.5 -1.5 -0.5
#> 5 5 -0.5 0.5 -1.5
Created on 2021-06-21 by the reprex package (v2.0.0)
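For completeness, the same subtraction in base R (a sketch of mine, no tidyverse needed) uses sweep() to subtract Mean from every Loc column row by row:
cbind(df["LK"],
      sweep(df[c("Loc1", "Loc2", "Loc3")], 1, df$Mean, "-"))
# same result as the outputs above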
I was able to get what you needed using mutate_at:
df %>%
mutate_at(vars(starts_with("Loc")), ~ .-Mean) %>%
select(-c(Mean))
Here, I leverage vars(starts_with("Loc")) to tell R that any column starting with "Loc" should be included in the calculation; that column is referenced as . after the tilde, and then I refer specifically to the column Mean. I noticed that the first value in the Mean column is not a mean across the rows, but the rest look like row-wise means. I wasn't sure whether that was on purpose, but if you do want row-wise means in dplyr, one option is df %>% rowwise() %>% mutate(Mean = mean(c(Loc1, Loc2, Loc3))) (the rowwise() matters; without it, mean() would collapse the whole columns into a single grand mean).
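Putting those two pieces together (a sketch of mine, using across() rather than the superseded mutate_at()): recompute Mean row-wise, then subtract it from every Loc column.
library(dplyr)
df %>%
  mutate(Mean = rowMeans(across(starts_with("Loc")))) %>%   # row-wise mean of the Loc columns
  mutate(across(starts_with("Loc"), ~ . - Mean)) %>%        # subtract it from each Loc column
  select(-Mean)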

Error: incompatible size when mutating in dplyr

I am having trouble with the mutate function in dplyr; the error says:
Error: incompatible size (0), expecting 5 (the group size) or 1
There are some previous posts about this error and I tried some of their solutions, but had no luck in my case:
group-factorial-data-with-multiple-factors-error-incompatible-size-0-expe
r-dplyr-using-mutate-with-na-omit-causes-error-incompatible-size-d
grouped-operations-that-result-in-length-not-equal-to-1-or-length-of-group-in-dp
Here is what I tried,
ff <- c(seq(0,0.2,0.1),seq(0,-0.2,-0.1))
flip <- c(c(0,0,1,1,1,1),c(1,1,0,0,0,0))
df <- data.frame(ff,flip,group=gl(2,6))
> df
ff flip group
1 0.0 0 1
2 0.1 0 1
3 0.2 1 1
4 0.0 1 1
5 -0.1 1 1
6 -0.2 1 1
7 0.0 1 2
8 0.1 1 2
9 0.2 0 2
10 0.0 0 2
11 -0.1 0 2
12 -0.2 0 2
I want to add new columns, c1 and c2, based on some conditions, as follows:
dff <- df%>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
spread(direc,flip)%>%
arrange(group,group)%>%
mutate(c1=ff[head(which(forward>0),1)],c2=ff[tail(which(backward>0),1)])
Error: incompatible size (0), expecting 5 (the group size) or 1
I also added do() and tried:
do(data.frame(., c1=ff[head(which(.$forward>0),1)],c2=ff[tail(which(.$backward>0),1)]))
Error in data.frame(., c1 = ff[head(which(.$forward > 0), 1)], c2 = ff[tail(which(.$backward > :
arguments imply differing number of rows: 5, 1, 0
but when I only mutate the c1 column, everything seems to work. Why?
Just expanding on @allistaire's comment.
Your specified conditions are the cause of the error, specifically tail(which(backward>0),1).
The given code can be optimised to get rid of the spread(). You can try:
dff <- df%>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
arrange(group)%>%
mutate(c1=ff[head(which(direc=="forward" & flip > 0),1)])
It seems like you are looking to identify inflection points where the direction changes, for each group. In this scenario, please clarify exactly how flip is related; or, if you change flip <- c(c(0,0,1,1,1,1),c(1,1,0,0,0,0)) to flip <- c(c(0,0,1,1,1,1),c(1,1,0,1,1,1)) so that flip marks a change in direction of ff, you can use:
dff <- df%>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
arrange(group)%>%
mutate(c1=ff[head(which(direc=="forward" & flip > 0),1)]) %>%
mutate(c2=ff[tail(which(direc=="backward"& flip >0),1)])
which gives:
Source: local data frame [12 x 6]
Groups: group [2]
ff flip group direc c1 c2
<dbl> <dbl> <fctr> <chr> <dbl> <dbl>
1 0.0 0 1 forward 0.2 -0.2
2 0.1 0 1 forward 0.2 -0.2
3 0.2 1 1 forward 0.2 -0.2
4 0.0 1 1 backward 0.2 -0.2
5 -0.1 1 1 backward 0.2 -0.2
6 -0.2 1 1 backward 0.2 -0.2
7 0.0 1 2 forward 0.0 -0.2
8 0.1 1 2 forward 0.0 -0.2
9 0.2 0 2 forward 0.0 -0.2
10 0.0 1 2 backward 0.0 -0.2
11 -0.1 1 2 backward 0.0 -0.2
12 -0.2 1 2 backward 0.0 -0.2
It might be informative to step through the pipe to see what is going on.
df %>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
spread(direc,flip)%>%
arrange(group,group)
# Source: local data frame [10 x 4]
# Groups: group [2]
# ff group backward forward
# <dbl> <fctr> <dbl> <dbl>
# 1 -0.2 1 1 NA
# 2 -0.1 1 1 NA
# 3 0.0 1 1 0
# 4 0.1 1 NA 0
# 5 0.2 1 NA 1
# 6 -0.2 2 0 NA
# 7 -0.1 2 0 NA
# 8 0.0 2 0 1
# 9 0.1 2 NA 1
# 10 0.2 2 NA 0
BTW: Why arrange(group,group)? Doubling the order variable is pointless.
Looking here, you'll see that one group (group 2) has no backward values greater than 0. When you run something like which(FALSE) you get integer(0). This might be a good time to realize that dplyr needs the vector on the right-hand side of a mutate to have either the same length as the number of rows in the group or length 1.
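To see the empty-match problem in isolation (a quick illustration I have added):
which(FALSE)
# integer(0)
c(0, 0.1, 0.2)[integer(0)]
# numeric(0) -- a length-0 result, which is what mutate() complains about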
Instead of your mutate, I'll show it with a slight modification: return the number of values returned by the which call for c2:
df %>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
spread(direc,flip)%>%
arrange(group,group)%>%
mutate(
c1 = ff[head(which(forward>0),1)],
c2len = length(which(backward > 0))
)
# Source: local data frame [10 x 6]
# Groups: group [2]
# ff group backward forward c1 c2len
# <dbl> <fctr> <dbl> <dbl> <dbl> <int>
# 1 -0.2 1 1 NA 0.2 3
# 2 -0.1 1 1 NA 0.2 3
# 3 0.0 1 1 0 0.2 3
# 4 0.1 1 NA 0 0.2 3
# 5 0.2 1 NA 1 0.2 3
# 6 -0.2 2 0 NA 0.0 0
# 7 -0.1 2 0 NA 0.0 0
# 8 0.0 2 0 1 0.0 0
# 9 0.1 2 NA 1 0.0 0
# 10 0.2 2 NA 0 0.0 0
In order to meaningfully index on ff, you need something other than integer(0) in your returns.
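One way to guard against that (a sketch of mine, not from the answers above) is to pad the empty index out to NA so the assigned value always has length 1 within each group:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(flip = as.numeric(flip),
         direc = ifelse(c(0, diff(ff)) < 0, "backward", "forward")) %>%
  mutate(
    c1 = ff[head(which(direc == "forward"  & flip > 0), 1)][1],  # [1] turns an empty match into NA
    c2 = ff[tail(which(direc == "backward" & flip > 0), 1)][1]
  ) %>%
  ungroup()
# with the original flip vector, group 2 gets c2 = NA instead of an error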
