I have a dataset with answers to a Likert scale and reaction times, both of which result from an experimental manipulation. Ideally I would like to copy the Likert_Answer values and align them with the experimental manipulation associated with each value.
The dataset looks like this:
x <- rep(c(NA, round(runif(5, min=0, max=100), 2)), times=3)
myDF <- data.frame(ID = rep(c(1,2,3), each=6),
                   Condition = rep(c("A","B"), each=3, times=3),
                   Type_of_Task = rep(c("Test", rep(c("Experiment"), times=2)), times=6),
                   Likert_Answer = c(5, NA, NA, 6, NA, NA, 1, NA, NA, 5, NA, NA, 5, NA, NA, 1, NA, NA),
                   Reaction_Times = x)
I find it very hard to formulate the problem I have, so here is what my expected output should look like:
myDF_Output <- data.frame(ID = rep(c(1,2,3), each=6),
                          Condition = rep(c("A","B"), each=3, times=3),
                          Type_of_Task = rep(c("Test", rep(c("Experiment"), times=2)), times=6),
                          Likert_Answer = rep(c(5, 6, 1, 5, 5, 1), each = 3),
                          Reaction_Times = x)
I have seen a feasible solution in this post:
library(dplyr)
library(tidyr)
myDF2 <- myDF %>%
  group_by(ID) %>%
  fill(Likert_Answer) %>%
  fill(Likert_Answer, .direction = "up")
The problem is that this solution only works as long as a person has answered the Likert scale. If that is not the case, I am afraid this solution would "drag" the Likert answer from the previous experimental condition. For example:
myDF_missing <- myDF
myDF_missing[4,4] = NA
myDF3 <- myDF_missing %>%
  group_by(ID) %>%
  fill(Likert_Answer) %>%
  fill(Likert_Answer, .direction = "up")
In this case, what should have been NA in Likert_Answer for all rows of condition B for ID 1 has become a 5. Any idea of how I could avoid this?
(Excuse me if the code is dirty: I am quite new to R and I am learning the hard way... But I got pretty stuck with this problem at this stage.)
If I understood your problem correctly, you are very close to a solution. I manipulated the demo df to show how the grouping works:
library(dplyr)
library(tidyr)
myDF <- data.frame(ID = rep(c(1,2,3), each=6),
                   Condition = rep(c("A","B"), each=3, times=3),
                   Type_of_Task = rep(c("Test", rep(c("Experiment"), times=5)), times=3),
                   Likert_Answer = c(5, NA, NA, 6, NA, NA, 1, NA, NA, 5, NA, NA, NA, NA, NA, 1, NA, NA),
                   Reaction_Times = x)
myDF %>%
  dplyr::group_by(ID) %>%
  tidyr::fill(Likert_Answer)
ID Condition Type_of_Task Likert_Answer Reaction_Times
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A Test 5 NA
2 1 A Experiment 5 18.4
3 1 A Experiment 5 41.1
4 1 B Experiment 6 59.8
5 1 B Experiment 6 93.4
6 1 B Experiment 6 38.5
7 2 A Test 1 NA
8 2 A Experiment 1 18.4
9 2 A Experiment 1 41.1
10 2 B Experiment 5 59.8
11 2 B Experiment 5 93.4
12 2 B Experiment 5 38.5
13 3 A Test NA NA
14 3 A Experiment NA 18.4
15 3 A Experiment NA 41.1
16 3 B Experiment 1 59.8
17 3 B Experiment 1 93.4
18 3 B Experiment 1 38.5
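If the fill should never cross a condition boundary, a minimal sketch (using the myDF_missing example from the question) is to group by both ID and Condition, so each condition only uses its own Test answer:
myDF_missing %>%
  group_by(ID, Condition) %>%                    # fill stays inside each ID/Condition block
  fill(Likert_Answer, .direction = "downup") %>% # fill down, then up, within the block
  ungroup()
With this grouping, condition B for ID 1 keeps its NAs instead of inheriting the 5 from condition A.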
I’ve got this data:
library(tibble)

tmp <- tribble(
  ~ranges, ~last,
  0, NA,
  1, NA,
  1, NA,
  1, NA,
  1, NA,
  2, NA,
  2, NA,
  2, NA,
  3, NA,
  3, NA
)
and I want to fill the last column only at the row index of the last entry of each number in the ranges column. That means it should look like this:
tribble(
  ~ranges, ~last,
  0, 0,
  1, NA,
  1, NA,
  1, NA,
  1, 1,
  2, NA,
  2, NA,
  2, 2,
  3, NA,
  3, 3
)
So far I came up with a row-wise approach:
for (r in seq.int(max(tmp$ranges))) {
  print(r)
  range <- which(tmp$ranges == r) |> max()
  tmp$last[range] <- r
}
The main issue is that it is terribly slow. I am looking for a vectorized approach to this issue. Any creative solution out there?
Here's a dplyr solution:
library(dplyr)
tmp %>%
  group_by(ranges) %>%
  mutate(
    last = case_when(row_number() == n() ~ ranges, TRUE ~ NA_real_)
  ) %>%
  ungroup()
# # A tibble: 10 × 2
# ranges last
# <dbl> <dbl>
# 1 0 0
# 2 1 NA
# 3 1 NA
# 4 1 NA
# 5 1 1
# 6 2 NA
# 7 2 NA
# 8 2 2
# 9 3 NA
# 10 3 3
Or we could do something clever with base R for the same result. Here we calculate the difference of ranges to identify when the next row is different (i.e., the last of a group). We then stick a TRUE on the end so the last row is included. This assumes your data is already sorted by ranges.
tmp$last = ifelse(c(diff(tmp$ranges) != 0, TRUE), tmp$ranges, NA)
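For the example tmp above, the marker vector evaluates as follows, which may make the logic easier to see:
# TRUE wherever the next ranges value differs, plus TRUE for the final row
c(diff(tmp$ranges) != 0, TRUE)
# [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE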
Using replace:
library(dplyr)
df %>%
  group_by(ranges) %>%
  mutate(last = replace(last, n(), ranges[n()]))
Using ifelse:
library(dplyr)
df %>%
  group_by(ranges) %>%
  mutate(last = ifelse(row_number() == n(), ranges, NA))
Using tail:
library(dplyr)
df %>%
  group_by(ranges) %>%
  mutate(last = c(last[-n()], tail(ranges, 1)))
output
ranges last
<dbl> <dbl>
1 0 0
2 1 NA
3 1 NA
4 1 NA
5 1 1
6 2 NA
7 2 NA
8 2 2
9 3 NA
10 3 3
I want to set the next row i+1 in the same column to NA if there is already an NA in row i and then do this by groups. Here is my attempt:
dfeg <- tibble(id = c(rep("A", 5), rep("B", 5)),
               x = c(1, 2, NA, NA, 3, 5, 6, NA, NA, 7))

setNextrowtoNA <- function(x){
  for (j in 1:length(x)){
    if(is.na(x[j])){x[j+1] <- NA}
  }
}
dfeg <- dfeg %>% group_by(id) %>% mutate(y = setNextrowtoNA(x))
However, my attempt doesn't create the column y that I am looking for. Can anyone help with this? Thanks!
EDIT: In my actual data I have multiple values in a row that need to be set to NA, for example my data is more like this:
dfeg <- tibble(id = c(rep("A", 6), rep("B", 6)),
               x = c(1, 2, NA, NA, 3, 4, 15, 16, NA, NA, 17, 18))
And I need to create a column like this:
y = c(1, 2, NA, NA, NA, NA, 15, 16, NA, NA, NA, NA)
Any ideas? Thanks!
EDIT 2:
I figured it out on my own, this seems to work:
dfeg <- tibble(id = c(rep("A", 6), rep("B", 6)),
               x = c(1, 2, NA, NA, 3, 4, 15, 16, NA, NA, 17, 18))

setNextrowtoNA <- function(x){
  for (j in 1:(length(x))){
    if(is.na(x[j]))
    {
      x[j+1] <- NA
    }
    lengthofx <- length(x)
    x <- x[-lengthofx]
    print(x[j])
  }
  return(x)
}

dfeg <- dfeg %>% group_by(id) %>% mutate(y = NA,
                                         y = setNextrowtoNA(x))
Use cumany:
library(dplyr)
dfeg %>%
  group_by(id) %>%
  mutate(y = ifelse(cumany(is.na(x)), NA, x))
id x y
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 2
3 A NA NA
4 A NA NA
5 A 3 NA
6 A 4 NA
7 B 15 15
8 B 16 16
9 B NA NA
10 B NA NA
11 B 17 NA
12 B 18 NA
Previous answer:
Use an ifelse statement with lag:
library(dplyr)
dfeg %>%
  group_by(id) %>%
  mutate(y = ifelse(is.na(lag(x, default = 0)), NA, x))
I have a dataframe with the following column headers:
df <- data.frame(
  ABC1_1_1DEF = c(1, 2, 3),
  ABC1_2_1DEF = c(NA, 1, 2),
  ABC1_3_1DEF = c(1, 1, NA),
  ABC1_1_2DEF = c(3, NA, NA),
  ABC1_2_2DEF = c(2, NA, NA),
  ABC1_3_2DEF = c(NA, 1, 1)
)
I want to pivot the dataframe longer such that the middle number of each column name determines which new column the values go to:
df2 <- data.frame(
  ABC1_1 = c(1, 2, 3, 3, NA, NA),
  ABC1_2 = c(3, NA, NA, 2, NA, NA),
  ABC1_3 = c(2, NA, NA, NA, 1, 1)
)
What's the best way to achieve this using R, ideally with dplyr?
To combine all the ABC1_1, ABC1_2 and ABC1_3 columns, you can use the .value sentinel in names_to, so that the part of each column name captured by names_pattern becomes the name of an output column:
tidyr::pivot_longer(df, cols = everything(),
                    names_to = '.value',
                    names_pattern = '([A-Z]+\\d+_\\d+)')
# ABC1_1 ABC1_2 ABC1_3
# <dbl> <dbl> <dbl>
#1 1 NA 1
#2 3 2 NA
#3 2 1 1
#4 NA NA 1
#5 3 2 NA
#6 NA NA 1
Some background information for better understanding
I have a very specific dataframe. It is basically a sample of answers to my survey. The v1 variables represent a multiple choice question: the respondents were to choose options from 1 to 4, and they could pick a single option or several. Every chosen option redirected the respondent to a block of follow-up questions: if 1 was chosen, the respondent was redirected to o_v1_1, c_v1_1, and f_v1_1, and so on.
The problem
I want to write a function that will change my data structure from wide to long, but I am struggling with pivot_longer because it does not produce the desired output.
Here's the sample dataframe with some initial data processing:
data <- structure(list(
  seance_id = c(1, 2, 3, 4),
  respondent = c("A", "B", "C", "D"),
  v1...3 = c(1, 1, NA, 1),
  v1...4 = c(2, NA, 2, NA),
  v1...5 = c(3, 4, 4, NA),
  v1...6 = c(4, NA, NA, NA),
  o_v1_1 = c(6, 1, NA, 4),
  c_v1_1 = c(7, 1, NA, 1),
  f_v1_1 = c(8, 1, NA, 1),
  o_v1_2 = c(10, NA, 4, NA),
  c_v1_2 = c(8, NA, 1, NA),
  f_v1_2 = c(3, NA, 3, NA),
  o_v1_3 = c(4, NA, NA, NA),
  c_v1_3 = c(1, NA, NA, NA),
  f_v1_3 = c(2, NA, NA, NA),
  o_v1_4 = c(10, 5, 4, NA),
  c_v1_4 = c(9, 6, 5, NA),
  f_v1_4 = c(9, 6, 6, NA)
), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
data <- data %>% mutate_if(is.numeric, as.character)
colnames(data) <- c("seance_id", "respondent", "v1", "v1", "v1", "v1",
                    "o_v1_1", "c_v1_1", "f_v1_1", "o_v1_2", "c_v1_2", "f_v1_2",
                    "o_v1_3", "c_v1_3", "f_v1_3", "o_v1_4", "c_v1_4", "f_v1_4")
And here's how I tried to make my table long:
long <- data %>%
  pivot_longer(cols = -`seance_id`, names_to = "v1", values_to = "answer")
And this is what I want to get:
`séance_id` respondent direction answer_dir criteria criteria_answer
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 A v1 1 o_v1_1 6
2 1 A v1 1 c_v1_1 7
3 1 A v1 1 f_v1_1 8
4 1 A v1 2 o_v1_2 10
5 1 A v1 2 c_v1_2 8
6 1 A v1 2 f_v1_2 3
I have been researching SO for 2 days already and have not resolved my problem yet. How can I use pivot_longer effectively to get the desired output? And is there any way to automate the process of reshaping my dfs to long format? I have more than 30 dfs, nested in different lists within one Excel file.
You can drop your initial v1 columns. I propose the following instead:
data %>%
  select(-starts_with('v1')) %>%
  pivot_longer(cols = contains('v1'), names_to = "v1", values_to = "criteria_answer") %>%
  separate(v1, sep = '_', into = c('w', 'direction', 'answer_dir'), remove = FALSE) %>%
  rename(criteria = v1) %>%
  select(-w)
# A tibble: 48 x 6
seance_id respondent criteria direction answer_dir criteria_answer
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 A o_v1_1 v1 1 6
2 1 A c_v1_1 v1 1 7
3 1 A f_v1_1 v1 1 8
4 1 A o_v1_2 v1 2 10
5 1 A c_v1_2 v1 2 8
6 1 A f_v1_2 v1 2 3
7 1 A o_v1_3 v1 3 4
8 1 A c_v1_3 v1 3 1
9 1 A f_v1_3 v1 3 2
10 1 A o_v1_4 v1 4 10
# ... with 38 more rows
The final select(-w) removes the w column, an artefact of separate() splitting o_v1_1, c_v1_1, etc. into three columns; w held the first character.
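As for automating this over many data frames, one possible sketch (the names make_long and all_dfs below are placeholders, not from the question) is to wrap the pipeline in a function and map it over a list of data frames:
library(dplyr)
library(tidyr)
library(purrr)

# Placeholder helper: reshapes one wide survey data frame into the long format above
make_long <- function(df) {
  df %>%
    select(-starts_with('v1')) %>%
    pivot_longer(cols = contains('v1'),
                 names_to = "criteria", values_to = "criteria_answer") %>%
    separate(criteria, sep = '_',
             into = c('w', 'direction', 'answer_dir'), remove = FALSE) %>%
    select(-w)
}

# Assuming all_dfs is a named list holding the ~30 data frames read from the Excel file:
# long_dfs <- purrr::map(all_dfs, make_long)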
I have a dataset where I have to fill NA values using the previous value plus the current value of another column. Basically, my data looks like this:
library(lubridate)
library(tidyverse)
library(zoo)
df <- tibble(
  Id = c(1, 1, 1, 1, 2, 2, 2, 2),
  Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04",
               "2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
  av = c(18, NA, NA, NA, 21, NA, NA, NA),
  Value = c(121, NA, NA, NA, 146, NA, NA, NA)
)
# A tibble: 8 x 4
Id Time av Value
<dbl> <date> <dbl> <dbl>
1 2012-09-01 18 121
1 2012-09-02 NA NA
1 2012-09-03 NA NA
1 2012-09-04 NA NA
2 2012-09-01 21 146
2 2012-09-02 NA NA
2 2012-09-03 NA NA
2 2012-09-04 NA NA
What I want to do is: where Value is NA, replace it with the sum of the previous Value and the current value of av. If av is NA, it can be replaced with the previous value. I use the na.locf function from the zoo package:
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
  mutate(av = zoo::na.locf(av))
However, filling in Value seems to be more difficult. I can do it using a for loop:
# Back up the Value column for testing
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
  df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}
This produces the result I want, but for a large dataset I believe there are better ways to do it in R. I tried the complete function from tidyr, but it adds two additional rows:
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
  mutate(av = zoo::na.locf(av)) %>%
  mutate(num_rows = n()) %>%
  complete(nesting(Id),
           Value = seq(min(Value, na.rm = TRUE),
                       (min(Value, na.rm = TRUE) + max(num_rows) * min(na.omit(av))),
                       min(na.omit(av))))
The output has two extra rows; 10 instead of 8
# A tibble: 10 x 5
# Groups: Id [2]
Id Value Time av num_rows
<dbl> <dbl> <date> <dbl> <int>
1 121 2012-09-01 18 4
1 139 NA NA NA
1 157 NA NA NA
1 175 NA NA NA
1 193 NA NA NA
2 146 2012-09-01 21 4
2 167 NA NA NA
2 188 NA NA NA
2 209 NA NA NA
2 230 NA NA NA
Any help to do it faster without loops would be greatly appreciated.
In the question, av starts with a non-NA value in each group and is followed by NAs, so if this is the general pattern then the following will work. Note that it is good form to close any group_by with ungroup; however, we did not do that below so that we could compare df2 with df1.
df2 <- df %>%
group_by(Id) %>%
mutate(Value_backup = Value,
av = first(av),
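# with av constant within the group, cumsum(av) - av is 0, av, 2*av, ...,
# so Value reproduces the running sum computed by the for loop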
Value = first(Value) + cumsum(av) - av)
identical(df1, df2)
## [1] TRUE
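If av itself varied within a group (while every Value after a group's first row is still NA), a sketch of the same idea would fill av downward first and then take its cumulative sum:
df3 <- df %>%
  arrange(Id, Time) %>%
  group_by(Id) %>%
  mutate(av = zoo::na.locf(av),                             # carry av forward over the NAs
         Value = first(Value) + cumsum(av) - first(av)) %>% # first Value plus the running sum of later av
  ungroup()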
Note
For reproducibility, first run this (taken from the question, except that we only load the packages needed):
library(dplyr)
library(tibble)
library(lubridate)
df <- tibble(
  Id = c(1, 1, 1, 1, 2, 2, 2, 2),
  Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04",
               "2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
  av = c(18, NA, NA, NA, 21, NA, NA, NA),
  Value = c(121, NA, NA, NA, 146, NA, NA, NA)
)
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
  mutate(av = zoo::na.locf(av))
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
  df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}