How to write a function that manipulates the data structure in R?

Some background information for better understanding
I have a very specific dataframe: a sample of answers to my survey. The v1 variables represent a multiple-choice question in which the respondent could choose one or several options from 1 to 4. Each chosen option redirected the respondent to a block of follow-up questions: if 1 was chosen, the respondent was redirected to o_v1_1, c_v1_1, and f_v1_1, and so on.
The problem
I want to write a function that will change my data structure from wide to long, but I am struggling with pivot_longer because it does not produce the desired output.
Here's the sample dataframe with some initial data processing:
data <- structure(list(seance_id = c(1, 2, 3, 4), respondent = c("A",
"B", "C", "D"), v1...3 = c(1, 1, NA, 1), v1...4 = c(2, NA, 2,
NA), v1...5 = c(3, 4, 4, NA), v1...6 = c(4, NA, NA, NA), o_v1_1 = c(6,
1, NA, 4), c_v1_1 = c(7, 1, NA, 1), f_v1_1 = c(8, 1, NA, 1),
o_v1_2 = c(10, NA, 4, NA), c_v1_2 = c(8, NA, 1, NA), f_v1_2 = c(3,
NA, 3, NA), o_v1_3 = c(4, NA, NA, NA), c_v1_3 = c(1, NA,
NA, NA), f_v1_3 = c(2, NA, NA, NA), o_v1_4 = c(10, 5, 4,
NA), c_v1_4 = c(9, 6, 5, NA), f_v1_4 = c(9, 6, 6, NA)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
data <- data %>% mutate_if(is.numeric, as.character)
colnames(data) <- c("seance_id", "respondent", "v1", "v1", "v1", "v1", "o_v1_1",
"c_v1_1", "f_v1_1", "o_v1_2", "c_v1_2", "f_v1_2", "o_v1_3", "c_v1_3",
"f_v1_3", "o_v1_4", "c_v1_4", "f_v1_4")
And here's how I tried to make my table long:
long <- data %>%
pivot_longer(cols = -`seance_id`, names_to = "v1", values_to = "answer")
And this is what I want to get:
`séance_id` respondent direction answer_dir criteria criteria_answer
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 A v1 1 o_v1_1 6
2 1 A v1 1 c_v1_1 7
3 1 A v1 1 f_v1_1 8
4 1 A v1 2 o_v1_2 10
5 1 A v1 2 c_v1_2 8
6 1 A v1 2 f_v1_2 3
I have been searching SO for 2 days already and have not resolved my problem yet. How can I use pivot_longer effectively to get the desired output? And is there any way to automate the process of converting my dfs to long format? I have more than 30 dfs, nested in different lists within one Excel file.

You can drop your initial v1 columns. I propose the following instead:
data %>% select(-starts_with('v1')) %>%
pivot_longer(cols = contains('v1'), names_to = "v1", values_to = "criteria_answer") %>%
separate(v1, sep='_', into=c('w','direction','answer_dir'), remove=FALSE) %>%
rename(criteria=v1) %>% select(-w)
# A tibble: 48 x 6
seance_id respondent criteria direction answer_dir criteria_answer
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 A o_v1_1 v1 1 6
2 1 A c_v1_1 v1 1 7
3 1 A f_v1_1 v1 1 8
4 1 A o_v1_2 v1 2 10
5 1 A c_v1_2 v1 2 8
6 1 A f_v1_2 v1 2 3
7 1 A o_v1_3 v1 3 4
8 1 A c_v1_3 v1 3 1
9 1 A f_v1_3 v1 3 2
10 1 A o_v1_4 v1 4 10
# ... with 38 more rows
The final select(-w) removes the w column, an artefact of separate() splitting o_v1_1, c_v1_1, etc. into three parts. Here w holds the first character (o, c, or f).
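For the automation part of the question, the same steps could be wrapped in a function and applied to every data frame in a list. The following is only a sketch: the name block_to_long, the toy data, and the assumption that every criteria column follows the <letter>_<question>_<option> naming pattern (e.g. o_v1_1) are illustrative and not taken from the original post.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not from the original answer): reshape one survey
# table from wide to long. Assumes criteria columns are named like o_v1_1.
block_to_long <- function(df) {
  df %>%
    pivot_longer(
      cols = matches("^[a-z]_v[0-9]+_[0-9]+$"),
      names_to = "criteria",
      values_to = "criteria_answer"
    ) %>%
    # Split "o_v1_1" into prefix / question / option, keeping the full name.
    separate(criteria, into = c("w", "direction", "answer_dir"),
             sep = "_", remove = FALSE) %>%
    select(-w)
}

# Toy stand-in for a list of data frames read from Excel (made-up values).
df_list <- list(
  sheet1 = tibble(seance_id = c(1, 2), respondent = c("A", "B"),
                  o_v1_1 = c(6, 1), c_v1_1 = c(7, 1), o_v1_2 = c(10, NA)),
  sheet2 = tibble(seance_id = 3, respondent = "C",
                  o_v1_1 = 4, c_v1_1 = 1, o_v1_2 = 4)
)

# Apply the helper to every element of the list.
long_list <- lapply(df_list, block_to_long)
```

Columns that do not match the pattern are kept as identifiers, so the original v1 columns would need to be dropped before calling the helper, as in the answer above.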

Related

Wide to long with pivot_longer and mix of numeric and character data

help <- data.frame(
id = c(100, 100, 101, 102, 102),
q1 = c(NA, 1, NA, NA, 3),
q2 = c(1, NA, 2, NA, NA),
q3 = c(NA, 1, NA, 4, NA),
q4 = c(NA, NA, 4, NA, 5),
group = c("a", "b", "c", "a", "c"))
help$group <- as.character(help$group)
I am trying to pivot longer so dataset looks like this:
id score group
100 NA a
100 1 b
100 NA c
...
But I get an error because q1-q4 are numeric while group is a character string.
pivot_longer(help, !id, names_to = "score",
values_to = "group", values_ptypes = list(group = 'character'))
Error: Can't convert <double> to <character>.
How can I pivot longer but also preserve the group variable? (Although q1-q4 contain many missing values, every row has both an id and a group.)
library(tidyr)
output <- pivot_longer(help, -c(id, group), names_to = "question",
values_to = "score") %>%
dplyr::select(-question) %>%
dplyr::arrange(id, group)
Output
head(output)
# A tibble: 6 × 3
id group score
<dbl> <chr> <dbl>
1 100 a NA
2 100 a 1
3 100 a NA
4 100 a NA
5 100 b 1
6 100 b NA

How to pivot longer with the grouping notation at the center of a dataframe?

I have a dataframe with the following column headers:
df <- data.frame(
ABC1_1_1DEF = c(1, 2, 3),
ABC1_2_1DEF = c(NA, 1, 2),
ABC1_3_1DEF = c(1, 1, NA),
ABC1_1_2DEF = c(3, NA, NA),
ABC1_2_2DEF = c(2, NA, NA),
ABC1_3_2DEF = c(NA, 1, 1)
)
I want to pivot the dataframe longer such that the middle number of each column is the group that contains the new columns:
df2 <- data.frame(
ABC1_1 = c(1, 2, 3, 3, NA, NA),
ABC1_2 = c(3, NA, NA, 2, NA, NA),
ABC1_3 = c(2, NA, NA, NA, 1, 1)
)
What's the best way to achieve this using R, ideally with dplyr?
To combine all the ABC1_1, ABC1_2 and ABC1_3 columns you can use:
tidyr::pivot_longer(df, cols = everything(),
names_to = '.value',
names_pattern = '([A-Z]+\\d+_\\d+)')
# ABC1_1 ABC1_2 ABC1_3
# <dbl> <dbl> <dbl>
#1 1 NA 1
#2 3 2 NA
#3 2 1 1
#4 NA NA 1
#5 3 2 NA
#6 NA NA 1
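To see why this works, it can help to preview what the capture group in names_pattern extracts from each column name, since those captured strings become the output columns when names_to is '.value'. A minimal base R sketch (the variable names cols, pattern, and captured are illustrative only):

```r
# Column names from the question.
cols <- c("ABC1_1_1DEF", "ABC1_2_1DEF", "ABC1_3_1DEF",
          "ABC1_1_2DEF", "ABC1_2_2DEF", "ABC1_3_2DEF")

# Same regular expression as in the pivot_longer() call.
pattern <- "([A-Z]+\\d+_\\d+)"

# Extract the first match in each name; this is the part that
# pivot_longer() would use as the new column name.
captured <- regmatches(cols, regexpr(pattern, cols, perl = TRUE))
print(captured)
```

Columns that share a captured name (for example ABC1_1_1DEF and ABC1_1_2DEF) end up stacked in the same output column.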

R - copy values from cells to empty cells within same columns

I have a dataset with answers to a Likert scale and reaction times that are both results of an experimental manipulation. Ideally I would like to copy the Likert_Answer values and align them to the experimental manipulation associated with that value.
The dataset looks like this:
x <- rep(c(NA, round(runif(5, min=0, max=100), 2)), times=3)
myDF <- data.frame(ID = rep(c(1,2,3), each=6),
Condition = rep(c("A","B"), each=3, times=3),
Type_of_Task = rep(c("Test", rep(c("Experiment"), times=2)), times=6),
Likert_Answer = c(5, NA, NA, 6, NA, NA, 1, NA, NA, 5, NA, NA, 5, NA, NA, 1, NA, NA),
Reaction_Times = x)
I find it very hard to formulate the problem I have, so this is what my expected output should look like:
myDF_Output <- data.frame(ID = rep(c(1,2,3), each=6),
Condition = rep(c("A","B"), each=3, times=3),
Type_of_Task = rep(c("Test", rep(c("Experiment"), times=2)), times=6),
Likert_Answer = rep(c(5, 6, 1, 5, 5, 1), each = 3),
Reaction_Times = x)
I have seen in this post a feasible solution that is the following:
library(dplyr)
library(tidyr)
myDF2 <- myDF %>%
group_by(ID) %>%
fill(Likert_Answer) %>%
fill(Likert_Answer, .direction = "up")
The problem is that this solution only works as long as a person answers the Likert scale. If they do not, I am afraid this solution would "drag" in the Likert result from the previous experimental condition. For example:
myDF_missing <- myDF
myDF_missing[4,4] = NA
myDF3 <- myDF_missing %>%
group_by(ID) %>%
fill(Likert_Answer) %>%
fill(Likert_Answer, .direction = "up")
In this case, what should have been NA in Likert_Answer for all rows in condition B for ID 1 has become 5. Any idea how I could avoid this?
(Excuse me if the code is dirty: I am quite new to R and I am learning the hard way... But I got pretty stuck with this problem at this stage.)
If I understood your problem correctly, you are very close to a solution. I modified the demo df to show how the grouping works:
library(dplyr)
library(tidyr)
myDF <- data.frame(ID = rep(c(1,2,3), each=6),
Condition = rep(c("A","B"), each=3, times=3),
Type_of_Task = rep(c("Test", rep(c("Experiment"), times=5)), times=3),
Likert_Answer = c(5, NA, NA, 6, NA, NA, 1, NA, NA, 5, NA, NA, NA, NA, NA, 1, NA, NA),
Reaction_Times = x)
myDF %>%
dplyr::group_by(ID) %>%
tidyr::fill(Likert_Answer)
ID Condition Type_of_Task Likert_Answer Reaction_Times
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A Test 5 NA
2 1 A Experiment 5 18.4
3 1 A Experiment 5 41.1
4 1 B Experiment 6 59.8
5 1 B Experiment 6 93.4
6 1 B Experiment 6 38.5
7 2 A Test 1 NA
8 2 A Experiment 1 18.4
9 2 A Experiment 1 41.1
10 2 B Experiment 5 59.8
11 2 B Experiment 5 93.4
12 2 B Experiment 5 38.5
13 3 A Test NA NA
14 3 A Experiment NA 18.4
15 3 A Experiment NA 41.1
16 3 B Experiment 1 59.8
17 3 B Experiment 1 93.4
18 3 B Experiment 1 38.5
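If the fill should also stop at condition boundaries, as in the myDF_missing case from the question, one option is to add Condition to the grouping. This is a sketch on made-up toy data (the names toy and filled are illustrative), assuming each ID and Condition pair forms one block whose first row carries the Likert answer:

```r
library(dplyr)
library(tidyr)

# Toy data: one participant, Likert answer missing for condition B.
toy <- data.frame(
  ID = rep(1, 6),
  Condition = rep(c("A", "B"), each = 3),
  Likert_Answer = c(5, NA, NA, NA, NA, NA)
)

filled <- toy %>%
  # Grouping by both columns keeps values from leaking across conditions.
  group_by(ID, Condition) %>%
  fill(Likert_Answer, .direction = "down") %>%
  ungroup()

print(filled)
```

With this grouping, condition A is filled with 5 while condition B stays NA, because fill() never looks outside the current group.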

How to replace factor NA's with the level of the cell above

I'm trying to replace NA values in a factor column with the values of the cell above. It would be great to have this in a tidyverse approach, but it doesn't matter too much if it's not.
I have data that looks like:
data <- tibble(site = as.factor(c("A", "A", NA, "B","B", NA,"C", NA, "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
And I need it to look like:
output <- tibble(site = as.factor(c("A", "A", "A", "B","B", "B","C", "C", "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
I've tried a few different approaches using lag and replace_na although they have basically amounted to trying the same thing which is:
mutate(site = as.character(site),
site = ifelse(is.na(site), "zero", site),
site = ifelse(site == "zero", lag(site), site),
site = as.factor(site))
Thanks!
Try fill() from tidyr:
library(tidyverse)
#Code
data <- data %>% fill(site)
Output:
# A tibble: 9 x 2
site value
<fct> <dbl>
1 A 1
2 A 2
3 A NA
4 B 1
5 B 2
6 B NA
7 C 1
8 C NA
9 C 2
Another option is na.locf0 from zoo:
library(zoo)
data$site <- na.locf0(data$site)
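Both answers rely on "last observation carried forward". For readers who want to see the idea without extra packages, here is a rough base R sketch; the function name locf is hypothetical and this is not how fill() or na.locf0() are implemented internally.

```r
# Hypothetical helper: carry the last non-NA value forward.
locf <- function(x) {
  # Running count of non-NA values seen so far; it identifies which
  # non-NA value each position should take.
  idx <- cumsum(!is.na(x))
  # Positions before the first non-NA value have nothing to carry forward.
  idx[idx == 0] <- NA
  x[!is.na(x)][idx]
}

site <- factor(c("A", "A", NA, "B", "B", NA, "C", NA, "C"))
site_filled <- locf(site)
print(site_filled)
```

Because the helper only uses subsetting, the result keeps the factor class and levels of the input.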

How can I compare multiple rows in R?

I would like to compare multiple values by USER.
Taking USER "A" as the reference: if the values (A, B, C, D, and E) are the same as those of USER "B", then B should be written to the newly created variable EQUAL.
Here is my data
Desired value
I am very new to R, I tried to look at the compare function but got a little overwhelmed. Would very much appreciate any help.
Here's an abridged version of the data you provided:
library(tidyverse)
df <- data.frame(
id = c(1001, 1002, 1003, 1001, 1002, 1003),
user = c('a', 'a', 'a', 'b', 'b', 'b'),
point_a = c(1, 1, NA, 1, 1, NA),
point_b = c(NA, NA, 2, NA, NA, NA),
point_c = c(3, 2, 3, 3, 2, 3),
point_d = c(2, 1, NA, 2, 1, NA),
point_e = c(4, NA, 1, 4, NA, NA)
)
df
id user point_a point_b point_c point_d point_e
1 1001 a 1 NA 3 2 4
2 1002 a 1 NA 2 1 NA
3 1003 a NA 2 3 NA 1
4 1001 b 1 NA 3 2 4
5 1002 b 1 NA 2 1 NA
6 1003 b NA NA 3 NA NA
If you inner_join on the columns you want to match, and then filter for rows where user.x is less than user.y (i.e. comes first in alphabetical order, to get rid of duplicates and rows matching to themselves), you should be left with the matches you're looking for:
df %>%
inner_join(df, by = c('point_a', 'point_b', 'point_c', 'point_d', 'point_e')) %>%
filter(user.x < user.y) %>%
rename(user = user.x,
equal = user.y)
id.x user point_a point_b point_c point_d point_e id.y equal
1 1001 a 1 NA 3 2 4 1001 b
2 1002 a 1 NA 2 1 NA 1002 b
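Note that this approach depends on dplyr treating NA as equal to NA when joining, which is its default behaviour (na_matches = "na"). A small sketch on made-up data (x, y, n_default, and n_never are illustrative names) shows the difference from na_matches = "never":

```r
library(dplyr)

# Two one-row tables that agree on q and are both NA on p.
x <- data.frame(user = "a", p = NA_real_, q = 1)
y <- data.frame(user = "b", p = NA_real_, q = 1)

# Default: NA in x$p is treated as matching NA in y$p.
n_default <- nrow(inner_join(x, y, by = c("p", "q")))

# With na_matches = "never", rows with NA in a key never match.
n_never <- nrow(inner_join(x, y, by = c("p", "q"), na_matches = "never"))

print(c(default = n_default, never = n_never))
```

If missing values should not count as agreement between users, na_matches = "never" would be the setting to consider.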
We may split the data by user, pass the pieces to mapply, and count the matching values per row with rowSums after comparing with `==`. Applying which.max to each row of the resulting matrix gives the index of the best-matching user (excluding "A"). Finally, the result is subset to user "A".
transform(dat, EQUAL=
split(dat, dat$user) |>
(\(.) mapply(\(x, y) rowSums(x == y, na.rm=TRUE),
unname(.['A']),
.[c('B', 'C')]))() |>
(\(.) sort(unique(dat$user))[-1][apply(., 1, which.max)])()
) |>
(\(.) .[.$user == 'A', ])()
# id user point_a point_b point_c point_d point_e EQUAL
# 1 1001 A 1 NA 3 2 4 B
# 2 1002 A 1 NA 2 1 NA B
# 3 1003 A NA 2 3 NA 1 C
Note: R version 4.1.2 (2021-11-01)
Data:
dat <- structure(list(id = c(1001L, 1002L, 1003L, 1001L, 1002L, 1003L,
1001L, 1002L, 1003L), user = c("A", "A", "A", "B", "B", "B",
"C", "C", "C"), point_a = c(1, 1, NA, 1, 1, NA, 4, 1, NA), point_b = c(NA,
NA, 2, NA, NA, NA, 3, NA, 2), point_c = c(3, 2, 3, 3, 2, 3, 3,
2, 3), point_d = c(2, 1, NA, 2, 1, NA, 2, 1, NA), point_e = c(4,
NA, 1, 4, NA, NA, 4, NA, 1)), class = "data.frame", row.names = c(NA,
-9L))

Resources