Tidying in R: how to collapse my binary columns into characters, based on vectors?

I am tidying my data in R and want to collapse multiple columns into one, using a function that iterates over the items of a vector. I was wondering whether you could help me to:
fix a semantic error,
and make my code more efficient?
My data come from a survey with 32 questions. Each question has multiple answers, and each answer is its own column containing either 1 or NA.
For one question, a section of the dataset can be reproduced as follows:
XV2_1 <- c(1,NA,NA,NA)
XV2_2 <- c(NA,1,NA,NA)
XV2_3 <- c(NA,NA,NA,1)
XV2_4 <- c(NA,NA,1,NA)
id <- c(12,13,14,15)
dat <- data.frame(id,XV2_1, XV2_2, XV2_3,XV2_4)
> dat
  id XV2_1 XV2_2 XV2_3 XV2_4
1 12     1    NA    NA    NA
2 13    NA     1    NA    NA
3 14    NA    NA    NA     1
4 15    NA    NA     1    NA
This is the data I would like to have (reproduced here by hand):
question_2_answers <- c("Yellow","Blue","Green","Orange") #this is a vector based on the answers of the questionnaire
collapsed <- c("Yellow","Blue","Green","Orange")
collapsed_dataframe <- data.frame(id, X2 = collapsed)
> collapsed_dataframe
  id     X2
1 12 Yellow
2 13   Blue
3 14  Green
4 15 Orange
So far, I tried a sequence of nested ifelse() calls combined with mutate():
library(tidyverse)
question_2_answers <- c("Yellow","Blue","Green","Orange") #this is a vector based on the answers of the questionnaire
tidy_Q2 <- function(a, b, c, d, e) {
  ifelse(b == 1, a[1],
         ifelse(c == 1, a[2],
                ifelse(d == 1, a[3], a[4])))
}
dat %>%
  mutate(
    Colour = tidy_Q2(question_2_answers, XV2_1, XV2_2, XV2_3, XV2_4)
  )
However, my output is not as expected:
  id XV2_1 XV2_2 XV2_3 XV2_4 Colour
1 12     1    NA    NA    NA Yellow
2 13    NA     1    NA    NA   <NA>
3 14    NA    NA    NA     1   <NA>
4 15    NA    NA     1    NA   <NA>
I would have liked it to be as follows:
  id XV2_1 XV2_2 XV2_3 XV2_4 Colour
1 12     1    NA    NA    NA Yellow
2 13    NA     1    NA    NA   Blue
3 14    NA    NA    NA     1  Green
4 15    NA    NA     1    NA Orange
Does anyone know a way to fix the error?
Another question I'd like to ask is whether my code can be made more efficient. I have 32 survey questions to process after this one, so I'd like to automate as much as possible. Things to keep in mind:
not all survey questions have the same number of options (e.g., some questions have only 2 options and therefore 2 columns, whilst question 10 has 8 options and 8 columns)
some values are strings instead of 1 or NA
Always happy to learn,
Best,
Maria

This is a kind of wide-to-long conversion which we can do with tidyr::gather:
First, we make the colors the column names of the appropriate rows:
# Replace column names (except for the `id` column) with color values
colnames(dat)[-1] <- c("Yellow","Blue","Orange","Green")
dat
  id Yellow Blue Orange Green
1 12      1   NA     NA    NA
2 13     NA    1     NA    NA
3 14     NA   NA     NA     1
4 15     NA   NA      1    NA
Then, we gather the non-id columns and drop the NA values:
library(tidyverse)
dat %>%
gather(X2, val, -id) %>% # Gather color cols from wide to long format
filter(!is.na(val)) %>% # Drop rows with NA values
select(-val) # Remove the unnecessary `val` column
  id     X2
1 12 Yellow
2 13   Blue
3 15 Orange
4 14  Green
This will work with any number of columns (you just need to specify all columns you don't want to gather) and keeps rows with non-NA values. If you want other conditions to exclude a row (for example, if 0 or 'unknown' should count as a non-answer, or only 'correct' counts as an answer) then you should add those conditions to the filter statement.
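For reference, gather() has since been superseded in tidyr; here is a minimal sketch of the same step with pivot_longer() (assuming tidyr 1.0 or later), which can drop the NA cells directly:
dat %>%
  pivot_longer(-id, names_to = "X2", values_to = "val",
               values_drop_na = TRUE) %>%  # wide to long, dropping NA cells
  select(-val)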

One option in base R would be max.col: find the column index of the non-NA value in each row, use that index to get the corresponding column names, and create a two-column data.frame by cbinding with the id column:
i1 <- max.col(!is.na(dat[-1]), 'first')
cbind(dat['id'], Colour = names(dat)[-1][i1])
# id Colour
#1 12 Yellow
#2 13 Blue
#3 14 Green
#4 15 Orange
data
dat <- structure(list(id = c(12, 13, 14, 15), Yellow = c(1, NA, NA,
NA), Blue = c(NA, 1, NA, NA), Orange = c(NA, NA, NA, 1), Green = c(NA,
NA, 1, NA)), class = "data.frame", row.names = c(NA, -4L))
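Not part of the original answer, but since the post mentions 32 questions, here is a hedged sketch of how the max.col idea might be wrapped in a helper and reused per question, applied to the question's original dat with the XV2_* columns (the function name, the cols argument, and the all-NA guard are my own additions; the answers vector must be in the same order as the columns):
collapse_question <- function(dat, cols, answers) {
  i1 <- max.col(!is.na(dat[cols]), "first")  # index of the first non-NA column in each row
  i1[rowSums(!is.na(dat[cols])) == 0] <- NA  # rows with no answer at all stay NA
  answers[i1]
}
dat$Colour <- collapse_question(dat,
                                cols = c("XV2_1", "XV2_2", "XV2_3", "XV2_4"),
                                answers = c("Yellow", "Blue", "Orange", "Green"))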

Related

R - Merging rows with numerous NA values to another column

I would like to ask the R community for help with finding a solution for my data, where consecutive rows with numerous NA values are combined and put into a new column.
For example:
df <- data.frame(A = c(1, 2, 3, 4, 5, 6),
                 B = c(2, NA, NA, 5, NA, NA),
                 C = c(1, 2, NA, 4, 5, NA),
                 D = c(3, NA, 5, NA, NA, NA))
  A  B  C  D
1 1  2  1  3
2 2 NA  2 NA
3 3 NA NA  5
4 4  5  4 NA
5 5 NA  5 NA
6 6 NA NA NA
Must be transformed to this:
A B C D E
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
I would like to do the following:
Identify consecutive rows that have more than 1 NA value -> combine entries from those consecutive rows into a single combined entry
Place the above combined entry in a new column "E" on the prior row
This is quite complex (for me!) and I am wondering if anyone can offer any help with this. I have searched for some similar problems, but have been unable to find one that produces a similar desired output.
Thank you very much for your thoughts--
Using tidyr and dplyr:
Concatenate values for each row.
Keep the concatenated values only for rows with more than one NA.
Group each “good” row with all following “bad” rows.
Use a grouped summarize() to concatenate “bad” row values to a single string.
df %>%
  unite("E", everything(), remove = FALSE, sep = " ") %>%
  mutate(
    E = if_else(
      rowSums(across(!E, is.na)) > 1,
      E,
      ""
    ),
    new_row = cumsum(E == "")
  ) %>%
  group_by(new_row) %>%
  summarize(
    across(A:D, first),
    E = trimws(paste(E, collapse = " "))
  ) %>%
  select(!new_row)
# A tibble: 2 × 5
      A     B     C     D E
  <dbl> <dbl> <dbl> <dbl> <chr>
1     1     2     1     3 2 NA 2 NA 3 NA NA 5
2     4     5     4    NA 5 NA 5 NA 6 NA NA NA

Complex join of longitudinal tables in R

I have ~16 .txt files that I need to turn into one, wide flat file. For each new file, time has passed, and some new variables are added. What I would like to do is append those new columns to the right side of the first table, joining by an identification variable. This gets complicated quickly, so here is an MRE:
library(dplyr)
id <- as.character(1:6)
first <- c("jeff", "jimmy", "andrew", "taj", "karl-anthony", "jamal")
last <- c("teague", "butler", "wiggins", "gibson", "towns", "crawford")
set.seed(1839)
a <- c(1:4, NA, NA)
b <- c(1:4, NA, NA)
c <- c(11:13, NA, 14, NA)
d <- c(11:13, NA, 14, NA)
e <- c(21, 22, NA, 24, NA, 26)
f <- c(21, 22, NA, 24, NA, 26)
Simulating the three different files:
df_1 <- data.frame(
  id = id[c(1:3, 5)],
  first = first[c(1:3, 5)],
  last = last[c(1:3, 5)],
  a = a[c(1:3, 5)],
  b = b[c(1:3, 5)]
)
df_2 <- data.frame(
  id = id[c(1:3, 5)],
  first = first[c(1:3, 5)],
  last = last[c(1:3, 5)],
  c = c[c(1:3, 5)],
  d = d[c(1:3, 5)]
)
df_3 <- data.frame(
  id = id[c(1, 2, 4, 6)],
  first = first[c(1, 2, 4, 6)],
  last = last[c(1, 2, 4, 6)],
  e = e[c(1, 2, 4, 6)],
  f = f[c(1, 2, 4, 6)]
)
df_goal <- data.frame(id, first, last, a, b, c, d, e, f)
df_goal is what I want, and here is what it looks like:
> df_goal
id first last a b c d e f
1 1 jeff teague 1 1 11 11 21 21
2 2 jimmy butler 2 2 12 12 22 22
3 3 andrew wiggins 3 3 13 13 NA NA
4 4 taj gibson 4 4 NA NA 24 24
5 5 karl-anthony towns NA NA 14 14 NA NA
6 6 jamal crawford NA NA NA NA 26 26
Note that these are very big files, and the columns are not always in the right order, so I cannot just say to join by keeping the first three columns.
If I do a full_join on all, I get the names repeated every time:
df_all <- df_1 %>%
full_join(df_2, by = "id") %>%
full_join(df_3, by = "id")
> df_all
id first.x last.x a b first.y last.y c d first last e f
1 1 jeff teague 1 1 jeff teague 11 11 jeff teague 21 21
2 2 jimmy butler 2 2 jimmy butler 12 12 jimmy butler 22 22
3 3 andrew wiggins 3 3 andrew wiggins 13 13 <NA> <NA> NA NA
4 5 karl-anthony towns NA NA karl-anthony towns 14 14 <NA> <NA> NA NA
5 4 <NA> <NA> NA NA <NA> <NA> NA NA taj gibson 24 24
6 6 <NA> <NA> NA NA <NA> <NA> NA NA jamal crawford 26 26
What I tried next: I wrote a for loop that, for each data frame, selected (a) the id column and (b) the columns whose names had not yet appeared in the df_all data frame, and (c) did a full_join:
dfs <- c("df_2", "df_3")
df_all1 <- df_1
for (i in dfs) {
  df_all1 <- get(i)[!names(get(i)) %in% names(df_all1)[-1]] %>%
    full_join(df_all1, .)
}
> df_all1
id first last a b c d e f
1 1 jeff teague 1 1 11 11 21 21
2 2 jimmy butler 2 2 12 12 22 22
3 3 andrew wiggins 3 3 13 13 NA NA
4 5 karl-anthony towns NA NA 14 14 NA NA
5 4 <NA> <NA> NA NA NA NA 24 24
6 6 <NA> <NA> NA NA NA NA 26 26
Note that this means the cases that did not appear in the first file are missing the names (these represent key demographic variables in my data). I also tried going row by row, doing a column join if the id was already present and a bind_rows if it was not. This code threw an error:
df_all2 <- df_1
for (i in dfs) {
  for (k in 1:nrow(get(i))) {
    if (get(i)[k, "id"] %in% df_all2$id) {
      df_all2 <- get(i)[k, !names(get(i)) %in% names(df_all2)[-1]] %>%
        left_join(df_all2, ., by = "id")
    } else {
      df_all2 <- bind_rows(
        df_all2,
        get(i)[k, !names(get(i)) %in% names(df_all2)[-1]]
      )
    }
  }
}
There has got to be a way to do a join with only select columns, but fill in missing information if necessary. Again, I am working with lots of files with lots of columns, so I cannot assume that I know the position of any columns; it has to be done by the column names.
I have also thought about just including a new variable that is the date of the file, stacking them all on top of one another ("long" format), and then using tidyr::spread and tidyr::gather, but I haven't found a solution yet.
I am not wedded to the tidyverse (base or data.table would be great, even some way to do a SQL join in R) or even R; I am open to a Python solution using pandas, as well.
Short version: How do I join new columns to an existing data set by identification number, while also filling in the not-new columns (such as the names) for cases that are new to the data set?
Possible solution, per Psidom:
df_all1 <- df_1
for (i in dfs) {
  df_all1 <- get(i) %>%
    full_join(
      df_all1, .,
      by = names(get(i))[names(get(i)) %in% names(df_all1)]
    )
}
df_all1
Maybe a more efficient way to do this, though?
Using melt once you have the full_join result df_all:
library(data.table)
df <- melt(setDT(df_all),
measure.vars = patterns("^first", "^last"))
df <- unique(df[,-c("id", "variable")])
df[!is.na(df$value1),]
a b c d e f value1 value2
1: 1 1 11 11 21 21 jeff teague
2: 2 2 12 12 22 22 jimmy butler
3: 3 3 13 13 NA NA andrew wiggins
4: NA NA 14 14 NA NA karl-anthony towns
5: NA NA NA NA 24 24 taj gibson
6: NA NA NA NA 26 26 jamal crawford
The simplest solution using dplyr is to omit the by parameter in the calls to full_join().
library(dplyr)
df_1 %>%
full_join(df_2) %>%
full_join(df_3)
Joining, by = c("id", "first", "last")
Joining, by = c("id", "first", "last")
id first last a b c d e f
1 1 jeff teague 1 1 11 11 21 21
2 2 jimmy butler 2 2 12 12 22 22
3 3 andrew wiggins 3 3 13 13 NA NA
4 5 karl-anthony towns NA NA 14 14 NA NA
5 4 taj gibson NA NA NA NA 24 24
6 6 jamal crawford NA NA NA NA 26 26
Warning messages:
1: Column id joining factors with different levels, coercing to character vector
2: Column first joining factors with different levels, coercing to character vector
3: Column last joining factors with different levels, coercing to character vector
The documentation of the by parameter in ?full_join says: If NULL, the default, *_join() will do a natural join, using all variables with common names across the two tables.
So this is equivalent to explicitly passing by = c("id", "first", "last") as proposed by Psidom.
If there are many data frames to join, the code below may save a lot of typing:
Reduce(full_join, list(df_1, df_2, df_3))
The result (including messages) is the same as above.
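Not from the original answers, but since the question also considered stacking the files "long": a hedged sketch that row-binds the data frames and then collapses to one row per id, taking the first non-missing value in each column (this assumes dplyr 1.0+ for across() and that each id has at most one non-missing value per column):
bind_rows(df_1, df_2, df_3) %>%                           # columns a file lacks are filled with NA
  group_by(id) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]),    # first non-NA value per column
            .groups = "drop")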

How to add characters to strings of differing sizes in preparation for joining data frames via left_join?

I have a base data frame titled help.a to which I am attempting to join help.b; however, when I read in help.b, its id variable is numeric and not the same length/format as the id variable in help.a. I am trying to stick with character variables, since left_join changes factors to character when their levels differ.
help.a <- data.frame(id = as.character(c("00005", "00010", "00010", "00010", "00025", "00025", "00324", "00324")),
                     var_a = c(NA, 2, 2, 2, NA, NA, NA, NA),
                     var_b = c(4, NA, NA, 4, 4, 4, NA, NA))
help.b <- data.frame(id = c(5, 10, 324),
                     var_c = c(2, 2, 2),
                     var_d = c(4, NA, 6))
My approach thus far has been to change help.b$id to character; however, the join fails because the ids do not match:
help.b$id <- as.character(help.b$id)
left_join(help.a, help.b)
id var_a var_b var_c var_d
1 00005 NA 4 NA NA
2 00010 2 NA NA NA
3 00010 2 NA NA NA
4 00010 2 4 NA NA
5 00025 NA 4 NA NA
6 00025 NA 4 NA NA
7 00324 NA NA NA NA
8 00324 NA NA NA NA
This is my desired end result:
id var_a var_b var_c var_d
1 00005 NA 4 2 4
2 00010 2 NA 2 NA
3 00010 2 NA 2 NA
4 00010 2 4 2 NA
5 00025 NA 4 NA NA
6 00025 NA 4 NA NA
7 00324 NA NA 2 6
8 00324 NA NA 2 6
And what I think I need to do is read in help.b and change id to a character and then add "0's" to each id, but all need to equal 5 characters in length... e.g., row 1 would need four "0's" and row 2 would need three "0's". That way the left_join will notice matching strings and join appropriately.
Any assistance is greatly appreciated.
It looks like you are looking for sprintf:
help.b$id <- sprintf("%05d", help.b$id)
With the d you indicate that you want to format integer numbers, with the 05 that you want the resulting number to be 5 characters wide padded with zeros.
From the comments it appears that help.b$id is a character column. In that case, depending on the platform (on Linux this doesn't work; the help file of sprintf doesn't say on which platforms it does), you can use
help.b$id <- sprintf("%05s", help.b$id)
Or,
# When help.b$id is a character use
id <- as.numeric(help.b$id)
# When help.b$id is a factor use
id <- as.numeric(as.character(help.b$id))
# Just to make sure, check that the conversion went OK; this should return an empty vector,
# and otherwise the values for which the conversion went wrong.
help.b$id[as.character(id) != help.b$id]
help.b$id <- sprintf("%05d", id)
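If help.b$id is already a character and you'd rather avoid the numeric round trip, a hedged alternative (assuming the stringr package is available) is str_pad, which left-pads with zeros on any platform; the left_join from the question then matches up:
library(dplyr)
library(stringr)
help.b$id <- str_pad(help.b$id, width = 5, side = "left", pad = "0")  # "5" -> "00005"
left_join(help.a, help.b, by = "id")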
One option here is to simply convert the help.a$id column to numeric, and then use the base R merge() function in LEFT JOIN mode (all.x = TRUE):
> help.a$id <- as.numeric(as.character(help.a$id))
> merge(help.a, help.b, by="id", all.x=TRUE)
id var_a var_b var_c var_d
1 5 NA 4 2 4
2 10 2 NA 2 NA
3 10 2 NA 2 NA
4 10 2 4 2 NA
5 25 NA 4 NA NA
6 25 NA 4 NA NA
7 324 NA NA 2 6
8 324 NA NA 2 6
Update:
If, for some reason, you want to retain the original column, then just create a copy of it in the help.a data frame, e.g.
help.a$id_orig <- help.a$id
Do this before converting help.a$id to numeric.

Conditional calculations across rows in R

First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. I want to count the number of times an observation meets a criterion across the rows of my dataset, shown below.
Gene A B C D
1 AARS_3 NA NA 4.168365 NA
2 AASDHPPT_21936 NA NA NA -3.221287
3 AATF_26432 NA NA NA NA
4 ABCC2_22 4.501518 3.17992 NA NA
5 ABCC2_26620 NA NA NA NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with :
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to evaluate !is.na and then count the number of times the value was less than zero:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
Gene A B C D NA.Count lt0 gt0
1 AARS_3 NA NA 4.168365 NA 3 0 1
2 AASDHPPT_21936 NA NA NA -3.221287 3 1 0
3 AATF_26432 NA NA NA NA 4 0 0
4 ABCC2_22 4.501518 3.17992 NA NA 2 0 2
5 ABCC2_26620 NA NA NA NA 4 0 0
Here is one way to do it taking advantage of the fact that TRUE equals 1 in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
                     A = c(NA, 3),
                     B = c(2, NA),
                     C = c(-1, -2),
                     D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1]<0, na.rm = TRUE)
# more than zero
mt0 <- rowSums(lil_df[,-1]>0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0 )
larger_df
Gene A B C D NA.count lt0 mt0
1 AAR3 NA 2 -1 NA 2 1 1
2 ABCDE 3 NA -2 NA 2 1 1
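Since the question mentions trying to get dplyr::mutate to work, here is a hedged sketch of the same rowSums idea written inside mutate() (assuming dplyr 1.0+ for across(); in dplyr 1.1+ you may prefer pick() in place of across()):
library(dplyr)
lil_df %>%
  mutate(NA.count = rowSums(is.na(across(A:D))),
         lt0 = rowSums(across(A:D) < 0, na.rm = TRUE),
         mt0 = rowSums(across(A:D) > 0, na.rm = TRUE))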

Shifting a column down by one

Say I have a data.frame that looks like this
df <- data.frame(AAA = rep(c(NA, sample(1:10, 1)), 5),
                 BBB = rep(c(NA, sample(1:10, 1)), 5),
                 CCC = rep(c(sample(1:10, 1), NA), 5))
> df
AAA BBB CCC
1 NA NA 10
2 3 7 NA
3 NA NA 10
4 3 7 NA
5 NA NA 10
6 3 7 NA
7 NA NA 10
8 3 7 NA
9 NA NA 10
10 3 7 NA
I want to shift column CCC down by one so that all the numbers align in a single row, and then delete the rows that contain no data (often every other row, but not always: the pattern might vary through the data.frame).
Using dplyr
library(dplyr)
df %>%
  mutate(CCC = lag(CCC)) %>%
  na.omit()
Or using data.table
library(data.table)
na.omit(setDT(df)[, CCC:=c(NA, CCC[-.N])])
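If a row you want to keep could itself contain an NA (the question notes the pattern might vary), here is a hedged variant of the dplyr version above that only drops rows where every column is NA (assuming dplyr 1.0.4+ for if_all()):
df %>%
  mutate(CCC = lag(CCC)) %>%
  filter(!if_all(everything(), is.na))  # keep rows with at least one non-NA value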
Use a combination of the very efficient transform and na.omit functions
df <- na.omit(transform(df, CCC = c(NA, CCC[-nrow(df)])))
You can shift everything down by one with:
df['CCC'] <- c(NA, head(df['CCC'], dim(df)[1] - 1)[[1]])
To delete rows with only NA values, do:
df <- df[apply(df, 1, function(x) !all(is.na(x))), ]
