R: Merge duplicate columns that have different values in one dataframe

I currently have survey data where a set of Likert-type questions appears twice in the dataset and the set of questions a participant answered depends on an initial response to a binary "check" question. My goal is to merge the sets of duplicate questions. The data looks something like this:
Check  Q1  Q2  Q3  Q1.1  Q2.1  Q3.1
1      5   5   4
1      2   5   3
2                  4     6     3
2                  4     2     1
...where Q1.1 is a duplicate of Q1, and so on for Q2 and Q3
And I'd like my final output to look like this:
Check  Q1  Q2  Q3
1      5   5   4
1      2   5   3
2      4   6   3
2      4   2   1
I've been testing out a variety of ideas using things like for-loops, sapply, paste, and cbind. I've run into walls with each of them, particularly because I need to somehow match questions (ex. Q1 gets Q1.1's value when check==2) and run this over a set of multiple columns in one dataset.
Any help on this would be greatly appreciated!

If the missing elements are NA, pivot_longer can be used:
library(tidyr)
pivot_longer(df1, cols = -Check, names_pattern = "^(Q\\d+).*",
  names_to = ".value", values_drop_na = TRUE)
Output:
# A tibble: 4 × 4
Check Q1 Q2 Q3
<int> <int> <int> <int>
1 1 5 5 4
2 1 2 5 3
3 2 4 6 3
4 2 4 2 1
data
df1 <- structure(list(Check = c(1L, 1L, 2L, 2L), Q1 = c(5L, 2L, NA,
NA), Q2 = c(5L, 5L, NA, NA), Q3 = c(4L, 3L, NA, NA), Q1.1 = c(NA,
NA, 4L, 4L), Q2.1 = c(NA, NA, 6L, 2L), Q3.1 = c(NA, NA, 3L, 1L
)), class = "data.frame", row.names = c(NA, -4L))
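If the empty cells are not reliably NA, the Check column itself can drive the merge, which is what the question describes (Q1 takes Q1.1's value when Check == 2). The following is a minimal base R sketch, assuming each duplicate column is named with a ".1" suffix and that Check == 2 marks the rows answered in the duplicate set. The names `questions` and `merged` are illustrative only.

```r
# Example data, same values as the data block above
df1 <- data.frame(
  Check = c(1L, 1L, 2L, 2L),
  Q1 = c(5L, 2L, NA, NA), Q2 = c(5L, 5L, NA, NA), Q3 = c(4L, 3L, NA, NA),
  Q1.1 = c(NA, NA, 4L, 4L), Q2.1 = c(NA, NA, 6L, 2L), Q3.1 = c(NA, NA, 3L, 1L)
)

questions <- c("Q1", "Q2", "Q3")  # base names; duplicates assumed to end in ".1"
merged <- df1["Check"]
for (q in questions) {
  # take the duplicate column where Check == 2, otherwise the original column
  merged[[q]] <- ifelse(df1$Check == 2, df1[[paste0(q, ".1")]], df1[[q]])
}
merged
```

Because the choice depends only on Check, this also works when the unused set holds placeholder values rather than NA.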

In the Check column, do the numbers represent individuals? Does each individual have 2 rows of data? Or is this example table all for a single individual who will have 4 rows of data?
If all 4 rows are for 1 person, I would structure the table like this if it's not already:
Subject Check Q1 Q2 Q3 Q1_1 Q2_1 Q3_1
1 1
1 1
1 2
1 2
There are endless ways of doing this. Based on my knowledge, I would subset the dataset into 2 tables, one for Check == 1 and one for Check == 2, rename the columns of the second table to match the first, and then use rbind to stack them on top of each other. That is what poppiytt did when they created an example dataset, but they didn't add a column for Subject.
data1 <- subset(data, Check == 1, select = c(Subject, Check, Q1, Q2, Q3))
data2 <- subset(data, Check == 2, select = c(Subject, Check, Q1_1, Q2_1, Q3_1))
names(data2) <- names(data1)  # rbind() needs matching column names
data3 <- rbind(data1, data2)
I'm sure there is a more efficient way to do this, but this will work.
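For reference, here is a self-contained sketch of the subset-and-stack idea. It assumes a hypothetical Subject column and the underscore column names from the table above; the values are taken from the question. The renaming step is included because rbind() on data frames requires the column names to match.

```r
# Hypothetical example data: one subject, four rows
data <- data.frame(
  Subject = c(1L, 1L, 1L, 1L),
  Check = c(1L, 1L, 2L, 2L),
  Q1 = c(5L, 2L, NA, NA), Q2 = c(5L, 5L, NA, NA), Q3 = c(4L, 3L, NA, NA),
  Q1_1 = c(NA, NA, 4L, 4L), Q2_1 = c(NA, NA, 6L, 2L), Q3_1 = c(NA, NA, 3L, 1L)
)

# Rows answered in the first set, keeping only the first set's columns
data1 <- subset(data, Check == 1, select = c(Subject, Check, Q1, Q2, Q3))
# Rows answered in the second set, keeping only the duplicate columns
data2 <- subset(data, Check == 2, select = c(Subject, Check, Q1_1, Q2_1, Q3_1))

names(data2) <- names(data1)   # align names so the two pieces can be stacked
data3 <- rbind(data1, data2)
data3
```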

Related

Aggregating rows across multiple values

I have a large dataframe with approximately this pattern:
Person Rate Street a    b    c    d    e    f
A      2    XYZ    1    NULL 3    4    5    NULL
A      2    XYZ    NULL 2    NULL NULL NULL NULL
A      3    XYZ    NULL NULL NULL NULL NULL 6
B      2    DEF    NULL NULL NULL NULL 5    NULL
B      2    DEF    NULL 2    3    NULL NULL 6
C      1    DEF    1    2    3    4    5    6
a, b, c, d, e, f represent about 600 columns.
I am trying to combine the rows so that each person becomes one line: columns a-f are collapsed into a single line using sum, and any conflicting rate or street information becomes a new column. So the data should look something like this:
Person Rate Rate 2 Street a    b c d    e f
A      2    3      XYZ    1    2 3 4    5 6
B      2           DEF    NULL 2 3 NULL 5 6
C      1           DEF    1    2 3 4    5 6
I keep trying to make this work with aggregate and summarize but I'm not sure that's the right approach.
Thank you very much for your help!
First we pivot all the unique rates per person and street.
library(reshape2)
tmp1=dcast(unique(df[,c("Person","Rate","Street")]),Person+Street~Rate,value.var="Rate")
colnames(tmp1)[-c(1:2)]=paste("Rate",colnames(tmp1)[-c(1:2)])
Then we aggregate and sum by person and street over columns 4 to 9 ("a" to "f"); change the column range accordingly.
tmp2=aggregate(df[,4:9],list(Person=df$Person,Street=df$Street),function(x){
ifelse(all(is.na(x)),NA,sum(x,na.rm=TRUE))
})
And finally merge the two.
merge(tmp1,tmp2,by=c("Person","Street"))
Person Street Rate 1 Rate 2 Rate 3 a b c d e f
1 A XYZ NA 2 3 1 2 3 4 5 6
2 B DEF NA 2 NA NA 2 3 NA 5 6
3 C DEF 1 NA NA 1 2 3 4 5 6
Perhaps you can do this in a two-step process:
library(dplyr)
library(tidyr)
#sum columns a-f
table1 <- df %>%
group_by(Person) %>%
summarise(across(a:f, ~sum(.x, na.rm = TRUE)))
#Remove duplicated values and get the data in separate columns
#for Rate and Street columns.
table2 <- df %>%
group_by(Person) %>%
mutate(across(c(Rate, Street), ~replace(., duplicated(.), NA))) %>%
select(Person, Rate, Street) %>%
filter(if_any(c(Rate, Street), ~!is.na(.))) %>%
mutate(col = row_number()) %>%
ungroup %>%
pivot_wider(names_from = col, values_from = c(Rate, Street)) %>%
select(where(~any(!is.na(.))))
#Join the two data to get final result
inner_join(table1, table2, by = 'Person')
# Person a b c d e f Rate_1 Rate_2 Street_1
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <chr>
#1 A 1 2 3 4 5 6 2 3 XYZ
#2 B 0 2 3 0 5 6 2 NA DEF
#3 C 1 2 3 4 5 6 1 NA DEF
data
It is helpful and easier to help when you share data in a reproducible format which can be copied directly. I have used the below data for the answer.
df <- structure(list(Person = c("A", "A", "A", "B", "B", "C"), Rate = c(2L,
2L, 3L, 2L, 2L, 1L), Street = c("XYZ", "XYZ", "XYZ", "DEF", "DEF",
"DEF"), a = c(1L, NA, NA, NA, NA, 1L), b = c(NA, 2L, NA, NA,
2L, 2L), c = c(3L, NA, NA, NA, 3L, 3L), d = c(4L, NA, NA, NA,
NA, 4L), e = c(5L, NA, NA, 5L, NA, 5L), f = c(NA, NA, 6L, NA,
6L, 6L)), row.names = c(NA, -6L), class = "data.frame")
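The two answers treat a person with no values in a column differently: the aggregate() version returns NA, while sum(..., na.rm = TRUE) returns 0 (person B, columns a and d in the outputs above). The sketch below isolates that difference. `sum_or_na` is a hypothetical helper name, not a function from any package.

```r
# Sum that stays NA when every value is missing (hypothetical helper)
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)

b_a <- c(NA, NA)   # person B, column a: no values at all
b_b <- c(NA, 2L)   # person B, column b: one value

sum(b_a, na.rm = TRUE)  # 0, the sum of an empty set
sum_or_na(b_a)          # NA, "no data" is preserved
sum_or_na(b_b)          # 2
```

If NA is the desired result for empty groups, a helper like this could be passed to the dplyr version as `across(a:f, sum_or_na)`.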

How can I use data in df.x and use functions to select and add to df.y

I am new to R and I am trying to write a piece of code that will enable me to pick some data in df.x and put it in df.y:
Category 2019 2020 2021 2022 2023
Apple 3 4 5 6 7
Pear 3 4 5 6 7
Banana 3 4 5 6 7
Oranges 3 4 5 6 7
I want to select the values for Oranges across the years and put them into a new df.y, together with the year-over-year differences for Apple, like this:
Resource 2019 2020 2021 2022 2023
Orange   3    4    5    6    7
Apple         1    1    1    1
Any help is appreciated!
Thanks
This is a tidyverse approach that uses a wide-to-long transform, since the year differences are easier to calculate in long format.
library(tidyverse)
df.x <- tibble(
Category = c("Apple", "Pear", "Banana", "Orange"),
"2019" = 3,
"2020" = 4,
"2021" = 5,
"2022" = 6,
"2023" = 7
)
df.y <- df.x %>%
filter(Category %in% c("Apple", "Orange")) %>%
pivot_longer(-Category) %>%
mutate(value = ifelse(Category == "Apple", value - lag(value, 1), value)) %>%
pivot_wider()
# A tibble: 2 x 6
# Category `2019` `2020` `2021` `2022` `2023`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Apple NA 1 1 1 1
#2 Orange 3 4 5 6 7
First you need to provide your data using dput(df.x):
df.x <- structure(list(Category = c("Apple", "Pear", "Banana", "Oranges"
), X2019 = c(3L, 3L, 3L, 3L), X2020 = c(4L, 4L, 4L, 4L), X2021 = c(5L,
5L, 5L, 5L), X2022 = c(6L, 6L, 6L, 6L), X2023 = c(7L, 7L, 7L,
7L)), class = "data.frame", row.names = c(NA, -4L))
Note that your column names have changed: by default, data.frame() and read.table() convert names to syntactically valid ones (check.names = TRUE), and a syntactic name cannot begin with a digit, so an X is prepended. The process for extracting information from a data frame is covered in detail on the manual page: ?Extract. It is a bit dense, so it may be easier to begin with some introductory tutorials on R.
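A short sketch of where the X prefix comes from, assuming the data frame was created with data.frame() (or read in) using the default check.names = TRUE. The object name `kept` is illustrative.

```r
# Default behaviour: names are made syntactic, so "2019" becomes "X2019"
names(data.frame(`2019` = 3, `2020` = 4))
# "X2019" "X2020"

# check.names = FALSE keeps the names exactly as given
kept <- data.frame(`2019` = 3, `2020` = 4, check.names = FALSE)
names(kept)
# "2019" "2020"

# Non-syntactic names then need backticks or [[ ]] when accessed
kept$`2019`
kept[["2020"]]
```

Names that start with a digit are therefore possible, just less convenient to type, which is why the X-prefixed names are used in the rest of this answer.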
To extract the row for Oranges:
row1 <- df.x[df.x$Category=="Oranges", ]
row1
# Category X2019 X2020 X2021 X2022 X2023
# 4 Oranges 3 4 5 6 7
The row number indicates that this is the 4th row in df.x. Now the second row is slightly more involved. First extract the row:
row2 <- df.x[df.x$Category=="Apple", ]
row2
# Category X2019 X2020 X2021 X2022 X2023
# 1 Apple 3 4 5 6 7
Now compute the differences across the row. diff is picky about the kind of data structure it works with, so first convert the data frame row to a simple vector with unlist, then insert a missing value, NA, for the first year:
row2[ , -1] <- c(X2019=NA, diff(unlist(row2[, -1])))
df.y <- rbind(row1, row2)
df.y
# Category X2019 X2020 X2021 X2022 X2023
# 4 Oranges 3 4 5 6 7
# 1 Apple NA 1 1 1 1
rownames(df.y) <- NULL
The last line just resets the row names which have carried over from df.x.

Fill in missing rows in data in R

Suppose I have a data frame like this:
1 8
2 12
3 2
5 -6
6 1
8 5
I want to add a row in the places where the 4 and 7 would have gone in the first column and have the second column for these new rows be 0, so adding these rows:
4 0
7 0
I have no idea how to do this in R.
In excel, I could use a vlookup inside an iferror. Is there a similar combo of functions in R to make this happen?
Edit: also, suppose that row 1 was missing and needed to be filled in similarly. Would this require another solution? What if I wanted to add rows until I reached ten rows?
Use tidyr::complete to fill in the missing sequence between min and max values.
library(tidyr)
library(rlang)
complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))
# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5
If we already know the min and max of the dataframe, we can use them directly. Let's say we want data from V1 = 1 to 10; we can do:
complete(df, V1 = 1:10, fill = list(V2 = 0))
If we don't know the column names beforehand, we can do something like:
col1 <- names(df)[1]
col2 <- names(df)[2]
complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))
data
df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L,
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))
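For readers coming from Excel, the closest base R analogue to a vlookup wrapped in iferror is a left join against the full list of keys, followed by replacing the unmatched values. This is a minimal sketch under the assumption that the key column is V1 and that ten rows are wanted; `full` and `filled` are illustrative names.

```r
df <- data.frame(V1 = c(1L, 2L, 3L, 5L, 6L, 8L),
                 V2 = c(8L, 12L, 2L, -6L, 1L, 5L))

full <- data.frame(V1 = 1:10)                        # every key that should exist
filled <- merge(full, df, by = "V1", all.x = TRUE)   # left join: unmatched keys get NA
filled$V2[is.na(filled$V2)] <- 0L                    # the "iferror" step
filled
```

Changing `1:10` to `min(df$V1):max(df$V1)` fills only the gaps inside the observed range.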

How to filter rows only when all conditions are met in R [duplicate]

I'm trying to remove rows from a table of survey response data. I want to remove rows only where all my specified conditions are met. For example, if three columns contain NA then I want to remove the whole row. But if only one or two of those same columns contains NA that is acceptable.
I haven't managed to use filter to achieve this. If I use the code below, it removes the row if any NAs exist, as opposed to only when all of the specified columns are NA:
df <- filter(df,
is.na(Q1) == FALSE &
is.na(Q2) == FALSE &
is.na(Q3) == FALSE)
So if we have a df like below I'd only want to remove row #2:
rowid Q1 Q2 Q3
1 1 3 2
2 NA NA NA
3 NA 1 0
4 1 NA 2
5 1 1 NA
An option would be to use filter_at and specify any_vars with the condition check for any non-NA elements in a row
library(dplyr)
df %>%
filter_at(vars(starts_with("Q")), any_vars(!is.na(.)))
# rowid Q1 Q2 Q3
#1 1 1 3 2
#2 3 NA 1 0
#3 4 1 NA 2
#4 5 1 1 NA
Since the OP specifically asked (in the comments) for all_vars:
df %>%
filter_at(vars(starts_with('Q')), all_vars(is.na(.))) %>%
anti_join(df, ., by = 'rowid')
Or with rowSums from base R
df[ rowSums(!is.na(df[-1])) != 0,]
data
df <- structure(list(rowid = 1:5, Q1 = c(1L, NA, NA, 1L, 1L), Q2 = c(3L,
NA, 1L, NA, 1L), Q3 = c(2L, NA, 0L, 2L, NA)), class = "data.frame",
row.names = c(NA,
-5L))
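The filter in the question fails because joining the "not NA" tests with & keeps a row only when every column is non-NA. What is wanted is the negation of "all three are NA". A minimal base R sketch of that condition, with `all_missing` and `kept` as illustrative names:

```r
df <- data.frame(rowid = 1:5,
                 Q1 = c(1L, NA, NA, 1L, 1L),
                 Q2 = c(3L, NA, 1L, NA, 1L),
                 Q3 = c(2L, NA, 0L, 2L, NA))

# TRUE only for rows where Q1, Q2 and Q3 are all NA
all_missing <- is.na(df$Q1) & is.na(df$Q2) & is.na(df$Q3)

# Keep everything except those rows
kept <- df[!all_missing, ]
kept
```

The same condition can be passed to dplyr as `filter(df, !(is.na(Q1) & is.na(Q2) & is.na(Q3)))`.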

Restructuring data using apply family of functions

I have inherited a data set of 23 attributes measured for each of 13 names (between-subjects--each participant only rated one name on all of these attributes). Right now it's structured such that the attributes are the fastest-moving factor, followed by the name. So the data look like this:
Sub# N1-item1 N1-item2 N1-item3 […] N2-item1 N2-item2 N2-item3
1 3 5 3 NA NA NA
2 NA NA NA 1 5 3
3 3 5 3 NA NA NA
4 NA NA NA 2 2 1
It needs to be restructured such that it's collapsed over name, and all of the item1 entries are in the same column (subjects don't matter for this purpose), as below (bearing in mind that there are 23 items, not 3, and 13 names, not 2):
Name item1 item2 item3
N1 3 5 3
N2 1 5 3
I can do this with loops, but I'd rather do it in a manner more natural to R, which I'm guessing would be one of the apply family of functions, but I can't quite wrap my head around it. What is the smart way to do this?
Here's an answer using dplyr and tidyr:
library(dplyr)#loads libraries
library(tidyr)
dat %>% #name of your dataframe
gather(key, val, -Sub) %>% #gathers to long data, with id as Sub
filter(!is.na(val)) %>% #removes rows with NA for the value
separate(key, c("Name", "item")) %>% #split the column key into Name and item
spread(item, val) #spreads the data into wide format, with item as the columns
Sub Name item1 item2 item3
1 1 N1 3 5 3
2 2 N2 1 5 3
3 3 N1 3 5 3
4 4 N2 2 2 1
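Since the question asks about the apply family, here is a base R sketch using lapply. It assumes the column names follow the "Name-item" pattern shown in the question (for example "N1-item1") and that each subject has values for exactly one name. The names `prefixes`, `parts` and `result` are illustrative.

```r
dat <- data.frame(
  Sub = 1:4,
  `N1-item1` = c(3L, NA, 3L, NA), `N1-item2` = c(5L, NA, 5L, NA),
  `N1-item3` = c(3L, NA, 3L, NA), `N2-item1` = c(NA, 1L, NA, 2L),
  `N2-item2` = c(NA, 5L, NA, 2L), `N2-item3` = c(NA, 3L, NA, 1L),
  check.names = FALSE
)

# Name prefixes found in the column names, e.g. "N1" and "N2"
prefixes <- unique(sub("-.*$", "", names(dat)[-1]))

parts <- lapply(prefixes, function(p) {
  cols <- grep(paste0("^", p, "-"), names(dat), value = TRUE)
  block <- dat[cols]
  names(block) <- sub("^[^-]*-", "", cols)   # strip the prefix: item1, item2, ...
  # keep only the subjects who rated this name
  block <- block[rowSums(!is.na(block)) > 0, , drop = FALSE]
  cbind(Name = p, block)
})

result <- do.call(rbind, parts)
rownames(result) <- NULL
result
```

Each list element holds the rows for one name, so do.call(rbind, ...) stacks them into one block of rows per name with a shared set of item columns.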
Spin the column names around to be itemX-NY and then let reshape sort it out:
names(dat)[-1] <- gsub("(^.+?)-(.+?$)", "\\2-\\1", names(dat)[-1])
na.omit(reshape(dat, direction="long", idvar="Sub", varying=-1, sep="-"))
# Sub time item1 item2 item3
#1.N1 1 N1 3 5 3
#3.N1 3 N1 3 5 3
#2.N2 2 N2 1 5 3
#4.N2 4 N2 2 2 1
Where the data was:
dat <- structure(list(Sub = 1:4, `N1-item1` = c(3L, NA, 3L, NA), `N1-item2` = c(5L,
NA, 5L, NA), `N1-item3` = c(3L, NA, 3L, NA), `N2-item1` = c(NA,
1L, NA, 2L), `N2-item2` = c(NA, 5L, NA, 2L), `N2-item3` = c(NA,
3L, NA, 1L)), row.names = c(NA, -4L), class = "data.frame")
