Excel: Merging Repeating Columns Within A Dataset - r

Forgive the very basic question. I have some output from an experiment that had 3 different versions of the same question, depending on the condition. The output file treated each version as a separate column, so my output looks like this, where the column headers repeat:
Q1,Q2,Q3,Q1,Q2,Q3,Q1,Q2,Q3
1, 0, 1, , , , , ,
 , , ,0, 1, 0, , ,
 , , , , , ,1, 1, 1
How would I be able to merge the output (preferably in Excel, since my output is currently stored in an Excel file, or alternatively in R), so that the desired output looks like this:
Q1,Q2,Q3
1, 0, 1
0, 1, 0
1, 1, 1
Thanks in advance!

An option in R, after reading the dataset with a function that reads the Excel file (read_excel, etc.), would be to loop over the unique column names of the dataset, extract the columns, unlist, and remove the NA elements (if any; assuming the blanks are read as NA):
nm1 <- unique(sub("\\.\\d+", "", names(df1)))   # unique base names: "Q1" "Q2" "Q3"
out <- sapply(nm1, function(x) na.omit(unlist(df1[grep(x, names(df1))])))   # stack each group of columns, dropping NAs
row.names(out) <- NULL
out
# Q1 Q2 Q3
#[1,] 1 0 1
#[2,] 0 1 0
#[3,] 1 1 1
Or with tidyverse, using gather/spread (note that rowid() comes from data.table):
library(tidyverse)
library(data.table)   # for rowid()
gather(df1, na.rm = TRUE) %>%
  mutate(key = str_remove(key, "\\.\\d+$"), ind = rowid(key)) %>%
  spread(key, value) %>%
  select(-ind)
# Q1 Q2 Q3
#1 1 0 1
#2 0 1 0
#3 1 1 1
Another option is to split the data into a list of data.frames with similar columns, then use coalesce to reduce each one to a single vector, which drops the NA elements and keeps the first non-NA element in each row:
split.default(df1, nm1) %>%
  map_df(reduce, coalesce)
# A tibble: 3 x 3
# Q1 Q2 Q3
# <dbl> <dbl> <dbl>
#1 1 0 1
#2 0 1 0
#3 1 1 1
data
df1 <- structure(list(Q1 = c(1, NA, NA), Q2 = c(0, NA, NA), Q3 = c(1,
NA, NA), Q1.1 = c(NA, 0, NA), Q2.1 = c(NA, 1, NA), Q3.1 = c(NA,
0, NA), Q1.2 = c(NA, NA, 1), Q2.2 = c(NA, NA, 1), Q3.2 = c(NA,
NA, 1)), class = "data.frame", row.names = c(NA, -3L))
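Since the output currently lives in a spreadsheet, here is a minimal sketch of the reading step, assuming the readxl package and a hypothetical file name; note that read_excel() repairs the duplicated headers to names like "Q1...4", so the suffix pattern passed to sub()/str_remove() above would need adjusting accordingly:
library(readxl)
df1 <- read_excel("responses.xlsx", sheet = 1)     # hypothetical file name and sheet
names(df1)                                         # e.g. "Q1" "Q2" "Q3" "Q1...4" "Q2...5" ...
nm1 <- unique(sub("\\.{3}\\d+$", "", names(df1)))  # strip readxl's "...n" suffix instead of ".n"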

Related

Iteratively dplyr::coalesce()

I have a dataset on which I need to use dplyr::coalesce(). I want to do this multiple times and am not sure what the most efficient way is (e.g. a loop, apply, etc.).
To give you a toy example, say my dataset is:
df = data.frame(
  a = c(1, NA, NA),
  a.1 = c(NA, 1, NA),
  a.2 = c(NA, NA, 1),
  b = c(2, NA, NA),
  b.1 = c(NA, 2, NA),
  b.2 = c(NA, NA, 2),
  c = c(3, NA, NA),
  c.1 = c(NA, 3, NA),
  c.2 = c(NA, NA, 3)
)
And I could do this:
new_df = df |>
  dplyr::mutate(
    a = dplyr::coalesce(a, a.1, a.2),
    b = dplyr::coalesce(b, b.1, b.2),
    c = dplyr::coalesce(c, c.1, c.2)
  ) |>
  dplyr::select(a, b, c)
Which would give me:
new_df
a b c
1 1 2 3
2 1 2 3
3 1 2 3
First, how could I do this efficiently without having to write coalesce n times? This is just a toy example; in the real dataset I would need to do this about forty times.
Also, is there a way to keep the result named a, b, and c, as I have here, rather than a.1 or the like?
If the columns follow a something / something.suffix naming pattern, you may try:
library(dplyr)
library(stringr)
library(purrr)   # for map_df()
df %>%
  split.default(str_remove(names(.), "\\..*")) %>%
  map_df(~ coalesce(!!! .x))
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 1 2 3
3 1 2 3
Here is an alternative with pivoting:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(everything()) %>%
  mutate(name = sub("\\..*", "", name)) %>%
  drop_na() %>%
  pivot_wider(names_from = name, values_from = value, values_fn = list) %>%
  unnest(cols = c(a, b, c))
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 1 2 3
3 1 2 3
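For completeness, a base R sketch of the same idea, assuming (as above) that the groups are defined by the part of each name before the first dot:
grp <- sub("\\..*", "", names(df))                 # "a" "a" "a" "b" "b" "b" "c" "c" "c"
coalesce_cols <- function(cols) Reduce(function(x, y) ifelse(is.na(x), y, x), cols)
new_df <- data.frame(lapply(split.default(df, grp), coalesce_cols))
new_df
#   a b c
# 1 1 2 3
# 2 1 2 3
# 3 1 2 3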

Get last entry of a range with identical numbers in R, vectorized

I’ve got this data:
tribble(
  ~ranges, ~last,
  0, NA,
  1, NA,
  1, NA,
  1, NA,
  1, NA,
  2, NA,
  2, NA,
  2, NA,
  3, NA,
  3, NA
)
and I want to fill the last column only at the row holding the final entry of each run of identical numbers in the ranges column. That is, it should look like this:
tribble(
  ~ranges, ~last,
  0, 0,
  1, NA,
  1, NA,
  1, NA,
  1, 1,
  2, NA,
  2, NA,
  2, 2,
  3, NA,
  3, 3
)
So far I came up with a row-wise approach:
for (r in seq.int(max(tmp$ranges))) {
  print(r)
  range <- which(tmp$ranges == r) |> max()
  tmp$last[range] <- r
}
The main issue is that it is terribly slow. I am looking for a vectorized approach. Any creative solutions out there?
Here's a dplyr solution:
library(dplyr)
tmp %>%
  group_by(ranges) %>%
  mutate(
    last = case_when(row_number() == n() ~ ranges, TRUE ~ NA_real_)
  ) %>%
  ungroup()
# # A tibble: 10 × 2
# ranges last
# <dbl> <dbl>
# 1 0 0
# 2 1 NA
# 3 1 NA
# 4 1 NA
# 5 1 1
# 6 2 NA
# 7 2 NA
# 8 2 2
# 9 3 NA
# 10 3 3
Or we could do something clever with base R for the same result. Here we calculate the difference of ranges to identify when the next row is different (i.e., the last of a group). We then stick a TRUE on the end so the last row is included. This assumes your data is already sorted by ranges.
tmp$last = ifelse(c(diff(tmp$ranges) != 0, TRUE), tmp$ranges, NA)
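Another base R option in the same spirit (a sketch, again assuming the data are already sorted by ranges) marks the last row of each run with duplicated():
last_in_run <- !duplicated(tmp$ranges, fromLast = TRUE)  # TRUE only at the final row of each run
tmp$last <- ifelse(last_in_run, tmp$ranges, NA)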
Using replace:
library(dplyr)
df %>%
  group_by(ranges) %>%
  mutate(last = replace(last, n(), ranges[n()]))
Using ifelse:
library(dplyr)
df %>%
  group_by(ranges) %>%
  mutate(last = ifelse(row_number() == n(), ranges, NA))
Using tail:
library(dplyr)
df %>%
  group_by(ranges) %>%
  mutate(last = c(last[-n()], tail(ranges, 1)))
output
ranges last
<dbl> <dbl>
1 0 0
2 1 NA
3 1 NA
4 1 NA
5 1 1
6 2 NA
7 2 NA
8 2 2
9 3 NA
10 3 3
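A data.table sketch gives the same result (assuming the data.table package, the tmp tibble from the question, and data already sorted by ranges; .N is the number of rows in each group):
library(data.table)
setDT(tmp)                        # convert the tibble to a data.table by reference
tmp[, last := NULL]               # drop the all-NA logical placeholder to avoid a type clash
tmp[, last := fifelse(seq_len(.N) == .N, ranges, NA_real_), by = ranges]
tmp[]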

How to write a function that manipulates the data structure in R?

Some background information for better understanding
I have a very specific dataframe. It is basically the sample of answers to my survey. The v1 variables represent a multiple-choice question: respondents were to choose variants from 1 to 4, and they could choose one or several options. Every chosen variant redirected the respondent to a block of follow-up questions: if 1 was chosen, the respondent was redirected to o_v1_1, c_v1_1, and f_v1_1, and so on.
The problem
I want to write a function that will change my data structure from wide to long, but I am struggling with pivot_longer because it does not produce the desired output.
Here's the sample dataframe with some initial data processing:
structure(list(seance_id = c(1, 2, 3, 4), respondent = c("A",
"B", "C", "D"), v1...3 = c(1, 1, NA, 1), v1...4 = c(2, NA, 2,
NA), v1...5 = c(3, 4, 4, NA), v1...6 = c(4, NA, NA, NA), o_v1_1 = c(6,
1, NA, 4), c_v1_1 = c(7, 1, NA, 1), f_v1_1 = c(8, 1, NA, 1),
o_v1_2 = c(10, NA, 4, NA), c_v1_2 = c(8, NA, 1, NA), f_v1_2 = c(3,
NA, 3, NA), o_v1_3 = c(4, NA, NA, NA), c_v1_3 = c(1, NA,
NA, NA), f_v1_3 = c(2, NA, NA, NA), o_v1_4 = c(10, 5, 4,
NA), c_v1_4 = c(9, 6, 5, NA), f_v1_4 = c(9, 6, 6, NA)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
data <- data %>% mutate_if(is.numeric, as.character)
colnames(data) <- c("seance_id", "respondent", "v1", "v1", "v1", "v1", "o_v1_1",
"c_v1_1", "f_v1_1", "o_v1_2", "c_v1_2", "f_v1_2", "o_v1_3", "c_v1_3",
"f_v1_3", "o_v1_4", "c_v1_4", "f_v1_4")
And here's how I tried to make my table long:
long <- data %>%
  pivot_longer(cols = -`seance_id`, names_to = "v1", values_to = "answer")
And this is what I want to get:
`séance_id` respondent direction answer_dir criteria criteria_answer
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 A v1 1 o_v1_1 6
2 1 A v1 1 c_v1_1 7
3 1 A v1 1 f_v1_1 8
4 1 A v1 2 o_v1_2 10
5 1 A v1 2 c_v1_2 8
6 1 A v1 2 f_v1_2 3
I have been researching SO for 2 days already and have not resolved my problem yet. How can I use pivot_longer effectively to get the desired output? And is there any way to automate the process of reshaping my dfs to long format? I have more than 30 dfs, nested in different lists within one Excel file.
You can drop your initial v1 columns. I propose the following instead:
library(dplyr)
library(tidyr)
data %>%
  select(-starts_with('v1')) %>%
  pivot_longer(cols = contains('v1'), names_to = "v1", values_to = "criteria_answer") %>%
  separate(v1, sep = '_', into = c('w', 'direction', 'answer_dir'), remove = FALSE) %>%
  rename(criteria = v1) %>%
  select(-w)
# A tibble: 48 x 6
seance_id respondent criteria direction answer_dir criteria_answer
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 A o_v1_1 v1 1 6
2 1 A c_v1_1 v1 1 7
3 1 A f_v1_1 v1 1 8
4 1 A o_v1_2 v1 2 10
5 1 A c_v1_2 v1 2 8
6 1 A f_v1_2 v1 2 3
7 1 A o_v1_3 v1 3 4
8 1 A c_v1_3 v1 3 1
9 1 A f_v1_3 v1 3 2
10 1 A o_v1_4 v1 4 10
# ... with 38 more rows
The final select(-w) removes the w column, an artefact of separate() splitting o_v1_1, c_v1_1, etc. into three parts; w holds the first character (o, c, or f).
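On the follow-up question about automating this over many blocks: a sketch, assuming each block sits on its own sheet of the workbook, that the readxl, purrr, dplyr and tidyr packages are available, and a hypothetical file name:
library(readxl)
library(purrr)
library(dplyr)
library(tidyr)

# wrap the reshaping pipeline from above in a function
reshape_block <- function(data) {
  data %>%
    select(-starts_with("v1")) %>%
    pivot_longer(cols = contains("v1"), names_to = "criteria",
                 values_to = "criteria_answer") %>%
    separate(criteria, sep = "_", into = c("w", "direction", "answer_dir"),
             remove = FALSE) %>%
    select(-w)
}

path <- "survey.xlsx"                                  # hypothetical file name
long_list <- map(set_names(excel_sheets(path)),
                 ~ reshape_block(read_excel(path, sheet = .x)))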

Adding multiple columns that include NA in a dataframe in R

I have a dataframe like this:
I want to create a new column that is the sum of the other columns, ignoring NA whenever there is at least one numeric value in a row. But if all values in a row are NA (like in the second row), the sum column should get NA.
As this is your first activity here on SO, you should have a look at this, which describes how a minimal and reproducible example is made. This will certainly be needed in the future if you have more questions. An image is generally not accepted as a starting point.
Fortunately your table was a small one. I turned it into a tribble and then used rowSums to calculate the numbers you seem to want.
df <- tibble::tribble(
  ~x, ~y, ~z,
  6000, NA, NA,
  NA, NA, NA,
  100, 7000, 1000,
  0, 0, NA
)
df$sum <- rowSums(df, na.rm = T)
df
#> # A tibble: 4 x 4
#> x y z sum
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6000 NA NA 6000
#> 2 NA NA NA 0
#> 3 100 7000 1000 8100
#> 4 0 0 NA 0
Created on 2020-06-15 by the reprex package (v0.3.0)
Let's say that your data frame is called df
cbind(df, apply(df, 1, function(x) {if (all(is.na(x))) NA else sum(x, na.rm = TRUE)}))
Note that if your data frame has other columns, you will need to restrict the df passed to apply to only the columns you're after.
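For instance, assuming the relevant numeric columns are named x, y and z:
df$sum <- apply(df[c("x", "y", "z")], 1,
                function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE))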
You can count the NA values in df. If in a row there is no non-NA value you can assign output as NA or calculate row-wise sum otherwise using rowSums.
ifelse(rowSums(!is.na(df)) == 0, NA, rowSums(df, na.rm = TRUE))
#[1] 6000 NA 10000 8100 0
data
df <- structure(list(x = c(6000, NA, 10000, 100, 0), y = c(NA, NA,
NA, 7000, 0), z = c(NA, NA, NA, 1000, NA)), class = "data.frame",
row.names = c(NA, -5L))
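The same check can also be written as a dplyr pipeline; a sketch assuming the dplyr package and the x, y, z columns from the data above (cbind() is used simply to collect the three columns into a matrix for rowSums()):
library(dplyr)
df %>%
  mutate(sum = ifelse(rowSums(!is.na(cbind(x, y, z))) == 0,   # all-NA row -> NA
                      NA,
                      rowSums(cbind(x, y, z), na.rm = TRUE)))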

R: What is an efficient way to recode variables? How do I prorate means?

I was wondering if anyone could point me in the direction of how I would go about recoding multiple variables with the same rules. I have the following df bhs1:
structure(list(bhs1_1 = c(NA, 1, NA, 2, 1, 2), bhs1_2 = c(NA,
2, NA, 2, 1, 1), bhs1_3 = c(NA, 1, NA, 2, 2, 2), bhs1_4 = c(NA,
2, NA, 1, 1, 1), bhs1_5 = c(NA, 1, NA, 1, 2, 2), bhs1_6 = c(NA,
1, NA, 2, 1, 2), bhs1_7 = c(NA, 1, NA, 1, 2, 1), bhs1_8 = c(NA,
2, NA, 2, 2, 2), bhs1_9 = c(NA, 1, NA, 2, 1, 1), bhs1_10 = c(NA,
2, NA, 1, 2, 2), bhs1_11 = c(NA, 2, NA, 2, 2, 1), bhs1_12 = c(NA,
2, NA, 2, 1, 1), bhs1_13 = c(NA, 1, NA, 1, 2, 2), bhs1_14 = c(NA,
2, NA, 2, 1, 1), bhs1_15 = c(NA, 1, NA, 2, 2, 2), bhs1_16 = c(NA,
2, NA, 2, 2, 2), bhs1_17 = c(NA, 2, NA, 2, 2, 1), bhs1_18 = c(NA,
1, NA, 1, 2, 1), bhs1_19 = c(NA, 1, NA, 2, 1, 2), bhs1_20 = c(NA,
2, NA, 2, 1, 1)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
There are two transformation rules, one for each half of the data set:
for bhs1_2, bhs1_4, bhs1_7, bhs1_9, bhs1_11, bhs1_12, bhs1_14, bhs1_16, bhs1_17, bhs1_18, and bhs1_20,
recode a value of 1 to 1 and anything else to 0 (i.e. if_else(x == 1, 1, 0));
and for bhs1_1, bhs1_3, bhs1_5, bhs1_6, bhs1_8, bhs1_10, bhs1_13, bhs1_15, and bhs1_19,
recode a value of 2 to 1 and anything else to 0 (i.e. if_else(x == 2, 1, 0)).
Is there an elegant way to write code to meet this use case? If so, can someone please point me in the right direction and/or provide me with a sample?
Here's a solution using dplyr
library(dplyr)
case1 <- vars(bhs1_2, bhs1_4, bhs1_7, bhs1_9, bhs1_11, bhs1_12, bhs1_14, bhs1_16, bhs1_17,
              bhs1_18, bhs1_20)
case2 <- vars(bhs1_1, bhs1_3, bhs1_5, bhs1_6, bhs1_8, bhs1_10, bhs1_13,
              bhs1_15, bhs1_19)
result <- df %>%
  mutate_at(case1, ~ (. == 1) * 1L) %>%
  mutate_at(case2, ~ (. == 2) * 1L)
Note: I skipped the ifelse statement; I'm just testing for your condition, then converting the TRUE/FALSE responses to numbers by multiplying by 1L. I'm also not sure how you want NAs to be handled, but this approach leaves them as NA.
If you aren't familiar with the pipe operator (%>%), it takes the result of the previous function, and sets it as the first argument of the next function. It's designed to improve code legibility by avoiding lots of function nesting.
We can create vectors of the column names of interest, then convert the logical comparison to binary with as.integer:
case1 <- c("bhs1_2", "bhs1_4", "bhs1_7", "bhs1_9", "bhs1_11", "bhs1_12",
           "bhs1_14", "bhs1_16", "bhs1_17", "bhs1_18", "bhs1_20")
case2 <- c("bhs1_1", "bhs1_3", "bhs1_5", "bhs1_6", "bhs1_8",
           "bhs1_10", "bhs1_13", "bhs1_15", "bhs1_19")
library(dplyr)
library(magrittr)
df1 %<>%
  mutate_at(vars(case1), ~ as.integer(. == 1)) %>%
  mutate_at(vars(case2), ~ as.integer(. == 2))
df1
# A tibble: 6 x 20
# bhs1_1 bhs1_2 bhs1_3 bhs1_4 bhs1_5 bhs1_6 bhs1_7 bhs1_8 bhs1_9 bhs1_10
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 NA NA NA NA NA NA NA NA NA NA
#2 0 0 0 0 0 0 1 1 1 1
#3 NA NA NA NA NA NA NA NA NA NA
#4 1 0 1 1 0 1 1 1 0 0
#5 0 1 1 1 1 0 0 1 1 1
#6 1 1 1 1 1 1 1 1 1 1
# ... with 10 more variables: bhs1_11 <int>, bhs1_12 <int>, bhs1_13 <int>,
# bhs1_14 <int>, bhs1_15 <int>, bhs1_16 <int>, bhs1_17 <int>, bhs1_18 <int>,
# bhs1_19 <int>, bhs1_20 <int>
Or an efficient option would be to use data.table
library(data.table)
setDT(df1)[, (case1) := lapply(.SD, function(x) as.integer(x == 1)),
             .SDcols = case1
           ][, (case2) := lapply(.SD, function(x) as.integer(x == 2)),
             .SDcols = case2][]
NOTE: This doesn't assume that the values are limited to 1 and 2, since it tests for equality rather than doing arithmetic on them.
You can use a very fast base R approach, as below; note that the arithmetic relies on the values being exactly 1 or 2 (or NA):
case1 <- c("bhs1_2", "bhs1_4", "bhs1_7", "bhs1_9", "bhs1_11", "bhs1_12",
           "bhs1_14", "bhs1_16", "bhs1_17", "bhs1_18", "bhs1_20")
case2 <- c("bhs1_1", "bhs1_3", "bhs1_5", "bhs1_6", "bhs1_8", "bhs1_10",
           "bhs1_13", "bhs1_15", "bhs1_19")
dat[case1] <- abs(dat[case1] - 2)   # 1 -> 1, 2 -> 0
dat[case2] <- dat[case2] - 1        # 2 -> 1, 1 -> 0
A simple ifelse can be helpful, considering the OP wants NA to be converted based on the specified rules:
case1 <- c("bhs1_2", "bhs1_4", "bhs1_7", "bhs1_9", "bhs1_11", "bhs1_12",
           "bhs1_14", "bhs1_16", "bhs1_17", "bhs1_18", "bhs1_20")
case2 <- c("bhs1_1", "bhs1_3", "bhs1_5", "bhs1_6", "bhs1_8", "bhs1_10",
           "bhs1_13", "bhs1_15", "bhs1_19")
df[case1] <- ifelse(!is.na(df[case1]) & df[case1] == 1, 1, 0)
df[case2] <- ifelse(!is.na(df[case2]) & df[case2] == 2, 1, 0)
#Test solution
df[1:7]
# bhs1_1 bhs1_2 bhs1_3 bhs1_4 bhs1_5 bhs1_6 bhs1_7
# 1 0 0 0 0 0 0 0
# 2 0 0 0 0 0 0 1
# 3 0 0 0 0 0 0 0
# 4 1 0 1 1 0 1 1
# 5 0 1 1 1 1 0 0
# 6 1 1 1 1 1 1 1
Updated: If NA is to be left as-is, then the solution can be:
df[case1] <- ifelse(df[case1] == 1, 1, 0)
df[case2] <- ifelse(df[case2] == 2, 1, 0)
df[1:7]
# bhs1_1 bhs1_2 bhs1_3 bhs1_4 bhs1_5 bhs1_6 bhs1_7
# 1 NA NA NA NA NA NA NA
# 2 0 0 0 0 0 0 1
# 3 NA NA NA NA NA NA NA
# 4 1 0 1 1 0 1 1
# 5 0 1 1 1 1 0 0
# 6 1 1 1 1 1 1 1
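With more recent dplyr versions (1.0+), where mutate_at() and funs() are superseded, the same recodes can be written with across(); a sketch assuming the case1 and case2 character vectors of column names defined above:
library(dplyr)
df1 %>%
  mutate(across(all_of(case1), ~ as.integer(. == 1)),
         across(all_of(case2), ~ as.integer(. == 2)))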
