How can I compare multiple rows in R?

I would like to compare multiple values by USER.
Taking USER "A" as the reference, if its values (A, B, C, D, and E) are the same as those of USER "B", then B should be written to a newly created variable EQUAL.
Here is my data
Desired value
I am very new to R, I tried to look at the compare function but got a little overwhelmed. Would very much appreciate any help.

Here's an abridged version of the data you provided:
library(tidyverse)
df <- data.frame(
id = c(1001, 1002, 1003, 1001, 1002, 1003),
user = c('a', 'a', 'a', 'b', 'b', 'b'),
point_a = c(1, 1, NA, 1, 1, NA),
point_b = c(NA, NA, 2, NA, NA, NA),
point_c = c(3, 2, 3, 3, 2, 3),
point_d = c(2, 1, NA, 2, 1, NA),
point_e = c(4, NA, 1, 4, NA, NA)
)
df
id user point_a point_b point_c point_d point_e
1 1001 a 1 NA 3 2 4
2 1002 a 1 NA 2 1 NA
3 1003 a NA 2 3 NA 1
4 1001 b 1 NA 3 2 4
5 1002 b 1 NA 2 1 NA
6 1003 b NA NA 3 NA NA
If you inner_join on the columns you want to match, and then filter for rows where user.x is less than user.y (i.e. user.x comes first in alphabetical order, which gets rid of duplicates and of rows matching themselves), you should be left with the matches you're looking for:
df %>%
inner_join(df, by = c('point_a', 'point_b', 'point_c', 'point_d', 'point_e')) %>%
filter(user.x < user.y) %>%
rename(user = user.x,
equal = user.y)
id.x user point_a point_b point_c point_d point_e id.y equal
1 1001 a 1 NA 3 2 4 1001 b
2 1002 a 1 NA 2 1 NA 1002 b
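A variation on the join above, as a minimal sketch rather than part of the original answer: a left_join keeps every row of user 'a', so a row without a match shows up with NA in equal instead of disappearing. The names point_cols, ref and others are made up for illustration. The sketch relies on dplyr joins treating NA keys as equal to each other by default, which is also why rows containing NA match in the output above.

```r
library(dplyr)

df <- data.frame(
  id = c(1001, 1002, 1003, 1001, 1002, 1003),
  user = c("a", "a", "a", "b", "b", "b"),
  point_a = c(1, 1, NA, 1, 1, NA),
  point_b = c(NA, NA, 2, NA, NA, NA),
  point_c = c(3, 2, 3, 3, 2, 3),
  point_d = c(2, 1, NA, 2, 1, NA),
  point_e = c(4, NA, 1, 4, NA, NA)
)

# Columns to compare (illustrative helper name)
point_cols <- grep("^point_", names(df), value = TRUE)

# Reference rows, and the other users' rows with `user` renamed to `equal`
ref <- filter(df, user == "a")
others <- df %>%
  filter(user != "a") %>%
  select(equal = user, all_of(point_cols))

# Keep all reference rows; unmatched rows get NA in `equal`
result <- left_join(ref, others, by = point_cols)
result
```

If NA values should not count as equal, the join functions accept na_matches = "never".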

We may split the data by user, pass the pieces to mapply, and calculate the rowSums of the TRUEs after comparing with `==`. For each row of the resulting matrix, which.max tells us the best-matching column, which we use to index the users (without "A"). Finally, the result just needs to be subset to user "A".
transform(dat, EQUAL=
split(dat, dat$user) |>
(\(.) mapply(\(x, y) rowSums(x == y, na.rm=TRUE),
unname(.['A']),
.[c('B', 'C')]))() |>
(\(.) sort(unique(dat$user))[-1][apply(., 1, which.max)])()
) |>
(\(.) .[.$user == 'A', ])()
# id user point_a point_b point_c point_d point_e EQUAL
# 1 1001 A 1 NA 3 2 4 B
# 2 1002 A 1 NA 2 1 NA B
# 3 1003 A NA 2 3 NA 1 C
Note: R version 4.1.2 (2021-11-01)
Data:
dat <- structure(list(id = c(1001L, 1002L, 1003L, 1001L, 1002L, 1003L,
1001L, 1002L, 1003L), user = c("A", "A", "A", "B", "B", "B",
"C", "C", "C"), point_a = c(1, 1, NA, 1, 1, NA, 4, 1, NA), point_b = c(NA,
NA, 2, NA, NA, NA, 3, NA, 2), point_c = c(3, 2, 3, 3, 2, 3, 3,
2, 3), point_d = c(2, 1, NA, 2, 1, NA, 2, 1, NA), point_e = c(4,
NA, 1, 4, NA, NA, 4, NA, 1)), class = "data.frame", row.names = c(NA,
-9L))
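The same idea can be unrolled into named steps, which may be easier to follow. This is only a sketch under two assumptions: the rows of every user are in the same id order, and "best match" means the largest number of equal non-NA values (a pair of NAs is not counted). The names point_cols, by_user, ref, others, agreement and best are illustrative.

```r
dat <- data.frame(
  id = rep(c(1001L, 1002L, 1003L), times = 3),
  user = rep(c("A", "B", "C"), each = 3),
  point_a = c(1, 1, NA, 1, 1, NA, 4, 1, NA),
  point_b = c(NA, NA, 2, NA, NA, NA, 3, NA, 2),
  point_c = c(3, 2, 3, 3, 2, 3, 3, 2, 3),
  point_d = c(2, 1, NA, 2, 1, NA, 2, 1, NA),
  point_e = c(4, NA, 1, 4, NA, NA, 4, NA, 1)
)

point_cols <- grep("^point_", names(dat), value = TRUE)

# One data frame of point columns per user
by_user <- split(dat[point_cols], dat$user)
ref <- by_user[["A"]]
others <- by_user[setdiff(names(by_user), "A")]

# One column per other user: number of equal non-NA values in each row
agreement <- sapply(others, function(x) rowSums(ref == x, na.rm = TRUE))

# Best-matching user per row; which.max keeps the first user on ties
best <- colnames(agreement)[apply(agreement, 1, which.max)]

result <- cbind(dat[dat$user == "A", ], EQUAL = best)
result
```

Note that for id 1002 users B and C agree with A equally well, so the reported B is simply the first of the tied users.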

Related

R: Set next row to NA in group_by

I want to set the next row i+1 in the same column to NA if there is already an NA in row i and then do this by groups. Here is my attempt:
dfeg <- tibble(id = c(rep("A", 5), rep("B", 5)),
x = c(1, 2, NA, NA, 3, 5, 6, NA, NA, 7))
setNextrowtoNA <- function(x){
for (j in 1:length(x)){
if(is.na(x[j])){x[j+1] <- NA}
}
}
dfeg <- dfeg %>% group_by(id) %>% mutate(y = setNextrowtoNA(x))
However, my attempt doesn't create the column y that I am looking for. Can anyone help with this? Thanks!
EDIT: In my actual data I have multiple values in a row that need to be set to NA, for example my data is more like this:
dfeg <- tibble(id = c(rep("A", 6), rep("B", 6)),
x = c(1, 2, NA, NA, 3, 4, 15, 16, NA, NA, 17, 18))
And need to create a column like this:
y = c(1, 2, NA, NA, NA, NA, 15, 16, NA, NA, NA, NA)
Any ideas? Thanks!
EDIT 2:
I figured it out on my own, this seems to work:
dfeg <- tibble(id = c(rep("A", 6), rep("B", 6)),
x = c(1, 2, NA, NA, 3, 4, 15, 16, NA, NA, 17, 18))
setNextrowtoNA <- function(x){
for (j in 1:(length(x))){
if(is.na(x[j]))
{
x[j+1] <- NA
}
lengthofx <- length(x)
x <- x[-lengthofx]
print(x[j])
}
return(x)
}
dfeg <- dfeg %>% group_by(id) %>% mutate(y = NA,
y = setNextrowtoNA(x))
Use cumany:
library(dplyr)
dfeg %>%
group_by(id) %>%
mutate(y = ifelse(cumany(is.na(x)), NA, x))
id x y
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 2
3 A NA NA
4 A NA NA
5 A 3 NA
6 A 4 NA
7 B 15 15
8 B 16 16
9 B NA NA
10 B NA NA
11 B 17 NA
12 B 18 NA
Previous answer:
Use an ifelse statement with lag:
library(dplyr)
dfeg %>%
group_by(id) %>%
mutate(y = ifelse(is.na(lag(x, default = 0)), NA, x))
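To see how the two answers differ, here is a small side-by-side sketch on the data from the question's edit. The column names y_cumany and y_lag are made up for the comparison. The lag version only blanks the row directly after an NA, so it fits the first version of the question, while cumany blanks everything from the first NA onward within each group.

```r
library(dplyr)

dfeg <- tibble(id = c(rep("A", 6), rep("B", 6)),
               x = c(1, 2, NA, NA, 3, 4, 15, 16, NA, NA, 17, 18))

res <- dfeg %>%
  group_by(id) %>%
  mutate(
    # NA from the first missing value to the end of the group
    y_cumany = ifelse(cumany(is.na(x)), NA, x),
    # NA only where the previous value (or x itself) is missing
    y_lag = ifelse(is.na(lag(x, default = 0)), NA, x)
  ) %>%
  ungroup()

res
```

In this data the two columns differ only in the last row of each group (x = 4 and x = 18), which y_lag keeps and y_cumany sets to NA.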

How to pivot longer with the grouping notation at the center of a dataframe?

I have a dataframe with the following column headers:
df <- data.frame(
ABC1_1_1DEF = c(1, 2, 3),
ABC1_2_1DEF = c(NA, 1, 2),
ABC1_3_1DEF = c(1, 1, NA),
ABC1_1_2DEF = c(3, NA, NA),
ABC1_2_2DEF = c(2, NA, NA),
ABC1_3_2DEF = c(NA, 1, 1)
)
I want to pivot the dataframe longer such that the middle number of each column is the group that contains the new columns:
df2 <- data.frame(
ABC1_1 = c(1, 2, 3, 3, NA, NA),
ABC1_2 = c(3, NA, NA, 2, NA, NA),
ABC1_3 = c(2, NA, NA, NA, 1, 1)
)
What's the best way to achieve this using R, ideally with dplyr?
To combine all the ABC1_1, ABC1_2 and ABC1_3 columns you can use -
tidyr::pivot_longer(df, cols = everything(),
names_to = '.value',
names_pattern = '([A-Z]+\\d+_\\d+)')
# ABC1_1 ABC1_2 ABC1_3
# <dbl> <dbl> <dbl>
#1 1 NA 1
#2 3 2 NA
#3 2 1 1
#4 NA NA 1
#5 3 2 NA
#6 NA NA 1
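A short sketch of what the pattern is doing may help. The first step uses base R to show which part of each column name the unanchored pattern captures; that captured text becomes the output column via '.value'. The second step is an optional variation, not part of the answer above: a second capture group keeps the trailing number in a column, here called set (a name chosen for illustration), so each output row can be traced back to its source columns.

```r
library(tidyr)

df <- data.frame(
  ABC1_1_1DEF = c(1, 2, 3),
  ABC1_2_1DEF = c(NA, 1, 2),
  ABC1_3_1DEF = c(1, 1, NA),
  ABC1_1_2DEF = c(3, NA, NA),
  ABC1_2_2DEF = c(2, NA, NA),
  ABC1_3_2DEF = c(NA, 1, 1)
)

# Part of each name matched by the pattern used in the answer
prefixes <- regmatches(names(df), regexpr("[A-Z]+\\d+_\\d+", names(df)))
prefixes

# Variation: also keep the last number of each name in a `set` column
long <- pivot_longer(df, cols = everything(),
                     names_to = c(".value", "set"),
                     names_pattern = "([A-Z]+\\d+_\\d+)_(\\d+)DEF")
long
```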

How to write a function that manipulates the data structure in R?

Some background information for better understanding
I have a very specific dataframe. It is basically the sample of answers to my survey. The v1 variables represent a multiple-choice question; the respondent was to choose among options 1 to 4, and could choose one or several of them. Every chosen option redirected the respondent to a block: if 1 was chosen, the respondent was redirected to o_v1_1, c_v1_1, and f_v1_1, and so on.
The problem
I want to write a function that will change my data structure from wide to long. But I am struggling with pivot_longer, because it does not produce the desired output.
Here's the sample dataframe with some initial data processing:
data <- structure(list(seance_id = c(1, 2, 3, 4), respondent = c("A",
"B", "C", "D"), v1...3 = c(1, 1, NA, 1), v1...4 = c(2, NA, 2,
NA), v1...5 = c(3, 4, 4, NA), v1...6 = c(4, NA, NA, NA), o_v1_1 = c(6,
1, NA, 4), c_v1_1 = c(7, 1, NA, 1), f_v1_1 = c(8, 1, NA, 1),
o_v1_2 = c(10, NA, 4, NA), c_v1_2 = c(8, NA, 1, NA), f_v1_2 = c(3,
NA, 3, NA), o_v1_3 = c(4, NA, NA, NA), c_v1_3 = c(1, NA,
NA, NA), f_v1_3 = c(2, NA, NA, NA), o_v1_4 = c(10, 5, 4,
NA), c_v1_4 = c(9, 6, 5, NA), f_v1_4 = c(9, 6, 6, NA)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
data <- data %>% mutate_if(is.numeric, as.character)
colnames(data) <- c("seance_id", "respondent", "v1", "v1", "v1", "v1", "o_v1_1",
"c_v1_1", "f_v1_1", "o_v1_2", "c_v1_2", "f_v1_2", "o_v1_3", "c_v1_3",
"f_v1_3", "o_v1_4", "c_v1_4", "f_v1_4")
And here's how I tried to make my table long:
long <- data %>%
pivot_longer(cols = -`seance_id`, names_to = "v1", values_to = "answer")
And this is what I want to get:
`séance_id` respondent direction answer_dir criteria criteria_answer
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 A v1 1 o_v1_1 6
2 1 A v1 1 c_v1_1 7
3 1 A v1 1 f_v1_1 8
4 1 A v1 2 o_v1_2 10
5 1 A v1 2 c_v1_2 8
6 1 A v1 2 f_v1_2 3
I have been searching SO for 2 days already and have not resolved my problem yet. How can I use pivot_longer effectively to get the desired output? And is there any way to automate the process of making my dfs long? I have more than 30 dfs, nested in different lists within one Excel file.
You can drop your initial v1 columns. I instead propose the following:
data %>% select(-starts_with('v1')) %>%
pivot_longer(cols = contains('v1'), names_to = "v1", values_to = "criteria_answer") %>%
separate(v1, sep='_', into=c('w','direction','answer_dir'), remove=FALSE) %>%
rename(criteria=v1) %>% select(-w)
# A tibble: 48 x 6
seance_id respondent criteria direction answer_dir criteria_answer
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 A o_v1_1 v1 1 6
2 1 A c_v1_1 v1 1 7
3 1 A f_v1_1 v1 1 8
4 1 A o_v1_2 v1 2 10
5 1 A c_v1_2 v1 2 8
6 1 A f_v1_2 v1 2 3
7 1 A o_v1_3 v1 3 4
8 1 A c_v1_3 v1 3 1
9 1 A f_v1_3 v1 3 2
10 1 A o_v1_4 v1 4 10
# ... with 38 more rows
The final select(-w) is to remove the w-column, an artefact from separate splitting o_v1_1,c_v1_1,etc. into 3 columns. Here w was for the first character.
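On the question of automating this for many data frames, one possible approach is to wrap the pipeline in a function and apply it over a list. The following is a rough sketch, not a tested solution for the real Excel data: the function name to_long, its question argument, and the two tiny example tables are all hypothetical, and it assumes every table follows the same naming scheme (letter, question, option, separated by underscores).

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: reshape one survey table to long format
to_long <- function(d, question = "v1") {
  d %>%
    select(-starts_with(question)) %>%
    pivot_longer(cols = contains(paste0("_", question, "_")),
                 names_to = "criteria", values_to = "criteria_answer") %>%
    separate(criteria, sep = "_", into = c("w", "direction", "answer_dir"),
             remove = FALSE) %>%
    select(-w)
}

# Made-up example tables standing in for the real data frames
surveys <- list(
  first = tibble(seance_id = c(1, 2), respondent = c("A", "B"),
                 v1...3 = c(1, 1), v1...4 = c(2, NA),
                 o_v1_1 = c(6, 1), c_v1_1 = c(7, 1),
                 o_v1_2 = c(10, NA), c_v1_2 = c(8, NA)),
  second = tibble(seance_id = 3, respondent = "C",
                  v1...3 = NA, v1...4 = 2,
                  o_v1_1 = NA, c_v1_1 = NA,
                  o_v1_2 = 4, c_v1_2 = 1)
)

# Apply to every table, then stack the results with a source label
long_list <- lapply(surveys, to_long)
all_long <- bind_rows(long_list, .id = "survey")
all_long
```

Rows for blocks the respondent never reached are kept with NA answers; passing values_drop_na = TRUE to pivot_longer would drop them.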

How to replace factor NA's with the level of the cell above

I'm trying to replace NA values in a factor column with the value of the cell above. It would be great to have this in a tidyverse approach, but it doesn't matter too much if it's not.
I have data that looks like:
data <- tibble(site = as.factor(c("A", "A", NA, "B","B", NA,"C", NA, "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
And I need it to look like:
output <- tibble(site = as.factor(c("A", "A", "A", "B","B", "B","C", "C", "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
I've tried a few different approaches using lag and replace_na although they have basically amounted to trying the same thing which is:
mutate(site = as.character(site),
site = ifelse(is.na(site), "zero", site),
site = ifelse(site == "zero", lag(site), site),
site = as.factor(site))
Thanks!
Try fill() from tidyr:
library(tidyverse)
#Code
data <- data %>% fill(site)
Output:
# A tibble: 9 x 2
site value
<fct> <dbl>
1 A 1
2 A 2
3 A NA
4 B 1
5 B 2
6 B NA
7 C 1
8 C NA
9 C 2
An option with na.locf
library(zoo)
data$site <- na.locf0(data$site)
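For reference, the same "last observation carried forward" fill can be sketched in base R without extra packages. This is an illustrative alternative, not what fill() or na.locf0() do internally; the names idx and filled are made up. The idea is to record, for each position, the index of the most recent non-NA value and then index the factor with it.

```r
site <- as.factor(c("A", "A", NA, "B", "B", NA, "C", NA, "C"))

# Index of the most recent non-NA value at each position (0 if none yet)
idx <- cummax(seq_along(site) * !is.na(site))

# Leading NAs have nothing to carry forward, so keep them as NA
idx[idx == 0] <- NA

filled <- site[idx]
filled
```

Because the result comes from indexing the original factor, it stays a factor with the same levels.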

R: Shift multiple columns by different number of rows

I have a list of tibbles like the following:
list(A = structure(list(
ID = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
g1 = c(0, 1, 2, NA, NA, NA, NA, NA, NA),
g2 = c(NA, NA, NA, 3, 4, 5, NA, NA, NA),
g3 = c(NA, NA, NA, NA, NA, NA, 6, 7, 8)),
row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame")),
B = structure(list(ID = c(1, 2, 1, 2, 1, 2),
g1 = c(10, 11, NA, NA, NA, NA),
g2 = c(NA, NA, 12,13, NA, NA),
g3 = c(NA, NA, NA, NA, 14, 15)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
)
Each element looks like this:
ID g1 g2 g3
<dbl> <dbl> <dbl> <dbl>
1 0 NA NA
2 1 NA NA
3 2 NA NA
1 NA 3 NA
2 NA 4 NA
3 NA 5 NA
1 NA NA 6
2 NA NA 7
3 NA NA 8
The g* columns are created dynamically, during previous mutates, and their number can vary, but it will be the same across all list elements.
Every g* column has only certain non-NA elements (as many as the unique IDs).
I would like to shift the g* columns so that they contain the non-NA element to the top rows.
I can do it for a single column by
num.shifts<- rle(is.na(myList[[1]]$g1))$lengths[1]
shift(myList[[1]]$g2,-num.shifts)
but how can I do it for all the g* columns, for all list elements, when I don't know in advance the number of g* columns?
Ideally, I would like a tidyverse solution, but not a requirement...
Thanks!
We can loop over the list with map (from purrr), and use mutate_at to go over the columns whose names match 'g' followed by digits, ordering each one so that the non-NA elements come first
library(dplyr)
library(purrr)
map(lst1, ~
.x %>%
mutate_at(vars(matches('^g\\d+')), ~ .[order(is.na(.))]))
In base R, we can do
lapply(lst1, function(x) {i1 <- grepl("^g\\d+$", names(x))
x[i1] <- lapply(x[i1], function(y) y[order(is.na(y))])
x})
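The key step in both versions is y[order(is.na(y))]: is.na(y) is FALSE for the values to keep, FALSE sorts before TRUE, and order leaves ties in their original sequence, so the non-NA values move to the top without being reordered. The sketch below shows this on one column and then, as an optional extra step that the question did not ask for, drops the rows where every g* column ended up NA. The list is rebuilt with data.frame for brevity, and the name shifted is illustrative.

```r
lst1 <- list(
  A = data.frame(ID = rep(1:3, 3),
                 g1 = c(0, 1, 2, rep(NA, 6)),
                 g2 = c(rep(NA, 3), 3, 4, 5, rep(NA, 3)),
                 g3 = c(rep(NA, 6), 6, 7, 8)),
  B = data.frame(ID = rep(1:2, 3),
                 g1 = c(10, 11, rep(NA, 4)),
                 g2 = c(NA, NA, 12, 13, NA, NA),
                 g3 = c(rep(NA, 4), 14, 15))
)

# Positions of non-NA values come first, in their original order
y <- lst1$A$g2
ord <- order(is.na(y))
ord      # 4 5 6 1 2 3 7 8 9
y[ord]   # 3 4 5 NA NA NA NA NA NA

shifted <- lapply(lst1, function(x) {
  g <- grepl("^g\\d+$", names(x))
  x[g] <- lapply(x[g], function(col) col[order(is.na(col))])
  # Optional: keep only rows that still hold at least one g* value
  x[rowSums(!is.na(x[g])) > 0, ]
})
shifted
```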
