I'm running a linear regression, but many of my observations can't be used because some of the values in the row are NA. I know that if at least one of a set of variables is entered, then an NA is actually 0. However, if all the values are NA, then the columns should not change. I will include an example because I know this might be confusing.
What I have is something that looks like this:
df <- data.frame(outcome = c(1, 0, 1, 1, 0),
Var1 = c(1, 0, 1, NA, NA),
Var2 = c(NA, 1, 0, 0, NA),
Var3 = c(0, 1, NA, 1, NA))
For Vars 1-3, the first 4 rows have an NA, but have other entries in other vars. In the last row, however, all values are NA. I know that everything in the last row is NA, but I want the NAs in those first 4 rows to be filled with 0. The desired outcome would look like this:
desired <- data.frame(outcome = c(1, 0, 1, 1, 0),
Var1 = c(1, 0, 1, 0, NA),
Var2 = c(0, 1, 0, 0, NA),
Var3 = c(0, 1, 0, 1, NA))
I know there are messy ways I could go about this, but I was wondering what would be the most streamlined process for this?
I hope this makes sense, I know the question is confusing. I can clarify anything if needed.
We can create a logical vector with rowSums that flags rows having at least one non-NA value, then use it to subset the rows before changing the NAs to 0:
i1 <- rowSums(!is.na(df[-1])) > 0
df[i1, -1][is.na(df[i1, -1])] <- 0
Checking against desired:
identical(df, desired)
#[1] TRUE
You can use apply to conditionally replace NA in certain rows:
data.frame(t(apply(df, 1, function(x) if (all(is.na(x[-1]))) x else replace(x, is.na(x), 0))))
Output
outcome Var1 Var2 Var3
1 1 1 0 0
2 0 0 1 1
3 1 1 0 0
4 1 0 0 1
5 0 NA NA NA
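One caveat with apply() is that it converts the data frame to a matrix first, so a data frame with mixed column types would be coerced to a single common type. A column-wise variant avoids that. The following is only a minimal sketch on the question's data; the names vars and has_data are my own and not part of the original answers.

```r
df <- data.frame(outcome = c(1, 0, 1, 1, 0),
                 Var1 = c(1, 0, 1, NA, NA),
                 Var2 = c(NA, 1, 0, 0, NA),
                 Var3 = c(0, 1, NA, 1, NA))

vars <- c("Var1", "Var2", "Var3")          # columns that belong to the set (assumed)
has_data <- rowSums(!is.na(df[vars])) > 0  # TRUE where at least one value was entered

# Fill NA with 0 only in rows that have at least one non-NA value;
# rows that are entirely NA are left untouched
df[vars] <- lapply(df[vars], function(col) {
  col[has_data & is.na(col)] <- 0
  col
})
df
```

Because each column is handled on its own, the outcome column and any non-numeric columns keep their original types.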
Related
I am working with a data set where I have to recode variables so that Never and Rarely =0, Sometimes and Always as 1, and Not Applicable as NA. For reference, the numbering scheme for the code is that 1=Never, 2=Rarely, 3=Sometimes, 4=Always, and 5= Not Applicable. Should I change the numeric variables before renaming them or change the character variables into numeric ones? I'm at an impasse and could use help on what code to use.
The problem
You have a vector (or a data frame column) x with values 1 through 5, eg:
x <- c(1,2,3,4,5,4,3,2,1)
You want to recode 1 and 2 to 0, 3 and 4 to 1, and 5 to NA.
Solution in base R
values <- list(`1` = 0, `2` = 0, `3` = 1, `4` = 1, `5` = NA)
x <- unname(unlist(values[x]))
[1] 0 0 1 1 NA 1 1 0 0
Solution with dplyr::recode()
values <- list(`1` = 0, `2` = 0, `3` = 1, `4` = 1, `5` = NA_real_)
x <- dplyr::recode(x, !!!values)
[1] 0 0 1 1 NA 1 1 0 0
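Since the question is about a whole data set rather than a single vector, the same idea can be applied to several columns at once. This is a sketch under assumptions: the data frame survey and its column names q1 and q2 are made up for illustration, and every column is assumed to hold only the codes 1 through 5.

```r
# Hypothetical survey data; names and values are invented for this example
survey <- data.frame(q1 = c(1, 2, 3, 4, 5),
                     q2 = c(5, 4, 3, 2, 1))

# Position i holds the new value for code i:
# 1 = Never -> 0, 2 = Rarely -> 0, 3 = Sometimes -> 1, 4 = Always -> 1, 5 = Not Applicable -> NA
lookup <- c(0, 0, 1, 1, NA)

# Recode every column by indexing into the lookup vector
survey[] <- lapply(survey, function(col) lookup[col])
survey
```

Recoding the numeric codes directly like this sidesteps the question of converting to character labels first; labels can still be attached afterwards if they are needed.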
Forgive the very basic question. I have some output from an experiment that had 3 different versions of the same question, depending on the condition. The output file treated each question as a separate column, so my output looks like this, where the headers for the columns repeat:
Q1,Q2,Q3,Q1,Q2,Q3,Q1,Q2,Q3
1, 0, 1
-----------0, 1, 0
--------------------1, 1, 1
How would I be able to merge the output (preferably in Excel - my output is currently stored in an excel file, or alternatively in R), so that the desired output looks like this:
Q1,Q2,Q3
1, 0, 1
0, 1, 0
1, 1, 1
Thanks in advance!
One option in R, after reading the dataset with a function that reads the Excel file (read_excel etc.), is to loop over the unique column names, extract the matching columns, unlist them, and remove the NA elements (assuming the blanks were read in as NA):
nm1 <- unique(sub("\\.\\d+", "", names(df1)))
out <- sapply(nm1, function(x) na.omit(unlist(df1[grep(x, names(df1))])))
row.names(out) <- NULL
out
# Q1 Q2 Q3
#[1,] 1 0 1
#[2,] 0 1 0
#[3,] 1 1 1
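One thing to watch in the loop above is that grep(x, names(df1)) does pattern matching, so a name like Q1 would also match a column called Q10 if one existed. The following is a sketch of a variant that matches the base name exactly and keeps rows aligned; the helper name base_nm is my own, and it assumes each row has at most one non-NA value per question.

```r
df1 <- data.frame(Q1 = c(1, NA, NA), Q2 = c(0, NA, NA), Q3 = c(1, NA, NA),
                  Q1.1 = c(NA, 0, NA), Q2.1 = c(NA, 1, NA), Q3.1 = c(NA, 0, NA),
                  Q1.2 = c(NA, NA, 1), Q2.2 = c(NA, NA, 1), Q3.2 = c(NA, NA, 1))

# Strip the numeric suffix added to duplicated headers, e.g. "Q1.2" -> "Q1"
base_nm <- sub("\\.\\d+$", "", names(df1))

out <- sapply(unique(base_nm), function(nm) {
  cols <- df1[base_nm == nm]  # exact match on the base name
  # Keep the first non-NA value in each row across the matching columns
  Reduce(function(a, b) ifelse(is.na(a), b, a), cols)
})
out
```

Unlike the na.omit approach, a row that is NA in every version of a question stays NA here instead of shifting the later values up.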
Or with tidyverse with gather/spread
library(tidyverse)
gather(df1, na.rm = TRUE) %>%
mutate(key = str_remove(key, "\\.\\d+$"), ind = data.table::rowid(key)) %>% # rowid() comes from data.table, not the tidyverse
spread(key, value) %>%
select(-ind)
# Q1 Q2 Q3
#1 1 0 1
#2 0 1 0
#3 1 1 1
Or another option is to split into a list of data.frames having similar columns, use coalesce to reduce it to a single vector which would remove the NA elements in the row and get the first non-NA element in that row
split.default(df1, nm1) %>%
map_df(reduce, coalesce)
# A tibble: 3 x 3
# Q1 Q2 Q3
# <dbl> <dbl> <dbl>
#1 1 0 1
#2 0 1 0
#3 1 1 1
data
df1 <- structure(list(Q1 = c(1, NA, NA), Q2 = c(0, NA, NA), Q3 = c(1,
NA, NA), Q1.1 = c(NA, 0, NA), Q2.1 = c(NA, 1, NA), Q3.1 = c(NA,
0, NA), Q1.2 = c(NA, NA, 1), Q2.2 = c(NA, NA, 1), Q3.2 = c(NA,
NA, 1)), class = "data.frame", row.names = c(NA, -3L))
What would be a good visualization to use in R to show the association of 2 binary variables?
I understand that phi coefficient would be the best statistic to use, but how can I show it graphically? Considering that if I use a scatterplot, it would be very condensed since there are only 4 possible values.
One idea would be to create a mosaic plot from the contingency table of the two binary variables.
Let's assume our data looks like this:
var1 var2
1 1 1
2 0 0
3 1 1
4 0 0
5 1 1
6 1 1
7 0 0
8 0 1
9 0 1
10 1 0
We could visualize it in the following way:
mosaicplot(table(df))
Data
df <- structure(list(var1 = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1), var2 = c(1,
0, 1, 0, 1, 1, 0, 1, 1, 0)), .Names = c("var1", "var2"), row.names = c(NA,
-10L), class = "data.frame")
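Since the question mentions the phi coefficient, it can be computed from the same contingency table and shown in the plot title. This is only a sketch on the example data; putting the value in the title and the variable names tab and phi are my own choices. For two 0/1 variables, phi equals the Pearson correlation, which gives an easy cross-check.

```r
df <- data.frame(var1 = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1),
                 var2 = c(1, 0, 1, 0, 1, 1, 0, 1, 1, 0))

tab <- table(df)  # 2 x 2 contingency table, rows = var1, columns = var2

# phi = (n11 * n00 - n10 * n01) / sqrt(product of the four marginal totals)
phi <- (tab["1", "1"] * tab["0", "0"] - tab["1", "0"] * tab["0", "1"]) /
  sqrt(prod(rowSums(tab), colSums(tab)))
phi

# Cross-check: for binary 0/1 data this matches the Pearson correlation
cor(df$var1, df$var2)

# Mosaic plot annotated with the statistic
mosaicplot(tab, main = sprintf("phi = %.2f", phi), color = TRUE)
```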
Once again data transformation is eluding me. I've tried aggregate, xtabs, the apply functions, gmodels::CrossTable, all sorts, but nothing seems to work.
I have a table with four columns eg A:D each a numeric binomial variable (0, 1).
eg:
x <- data.frame(A = c(0, 1, 1, 0, 1),
B = c(1, 1, 0, 1, 0),
C = c(0, 1, 1, 0, 1),
D = c(1, 0, 1, 0, 1))
I would like an output where the rows and columns are both the variables (A:D) and the values are the sum of intersections.
eg:
output <- data.frame(A = c(3, 1, 3, 2),
B = c(1, 3, 1, 1),
C = c(3, 1, 3, 2),
D = c(2, 1, 2, 3))
rownames(output) <- c("A", "B", "C", "D")
For example if there were 3 observations in column A then the intersection of A-A in the output would be 3. If there was 1 of the A observations also in variable B then the intersection of A-B in the output table would show 1 as would the intersection B-A.
Hope that makes sense. It's really bugging me how to do it.
You can get this from matrix algebra.
M = as.matrix(x)
t(M) %*% M
A B C D
A 3 1 3 2
B 1 3 1 1
C 3 1 3 2
D 2 1 2 3
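To see why the matrix product gives the intersections: entry (i, j) of t(M) %*% M is the sum over rows of M[, i] * M[, j], and for 0/1 data that product is 1 only when both variables are 1 in the same row. A short sketch on the question's data, using base R's crossprod() as an equivalent shortcut; the variable name co is my own.

```r
x <- data.frame(A = c(0, 1, 1, 0, 1),
                B = c(1, 1, 0, 1, 0),
                C = c(0, 1, 1, 0, 1),
                D = c(1, 0, 1, 0, 1))

M <- as.matrix(x)
co <- crossprod(M)  # equivalent to t(M) %*% M
co

# One cell checked by hand: rows where both A and D are 1
co["A", "D"]
sum(x$A == 1 & x$D == 1)
```

The diagonal holds the count of 1s in each column, and the matrix is symmetric, which matches the A-B and B-A description in the question.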
I have searched around but could not find a particular answer to my question.
Suppose I have a data frame df:
df = data.frame(id = c(10, 11, 12, 13, 14),
V1 = c('blue', 'blue', 'blue', NA, NA),
V2 = c('blue', 'yellow', NA, 'yellow', 'green'),
V3 = c('yellow', NA, NA, NA, 'blue'))
I want to use the values of V1-V3 as unique column headers and I want the occurrence frequency of each of those per row to populate the rows.
Desired output:
desired = data.frame(id = c(10, 11, 12, 13, 14),
blue = c(2, 1, 1, 0, 1),
yellow = c(1, 1, 0, 1, 0),
green = c(0, 0, 0, 0, 1))
There is probably a really cool way to do this with tidyr::spread and dplyr::summarise. However, I don't know how to spread the V* columns when the keys I want to spread by are all over the place in different columns and include NAs.
Thanks for any help!
Using melt and dcast from package reshape2:
library(reshape2)
dcast(melt(df, id="id", na.rm = TRUE), id~value)
id blue green yellow
1 10 2 0 1
2 11 1 0 1
3 12 1 0 0
4 13 0 0 1
5 14 1 1 0
As suggested by David Arenburg, it is just simpler to use recast, a wrapper for melt and dcast:
recast(df, id ~ value, id.var = "id")[,1:4] # na.rm is not possible then
id blue green yellow
1 10 2 0 1
2 11 1 0 1
3 12 1 0 0
4 13 0 0 1
5 14 1 1 0
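If adding a package is not an option, a similar count table can be built with base R's table(). This is a sketch rather than a drop-in replacement: the intermediate names long_id, long_value, counts and wide are my own, and stringsAsFactors = FALSE is set explicitly so the behaviour does not depend on the R version.

```r
df <- data.frame(id = c(10, 11, 12, 13, 14),
                 V1 = c("blue", "blue", "blue", NA, NA),
                 V2 = c("blue", "yellow", NA, "yellow", "green"),
                 V3 = c("yellow", NA, NA, NA, "blue"),
                 stringsAsFactors = FALSE)

# Stack the V* columns into one long vector, repeating id once per column
long_id <- rep(df$id, times = ncol(df) - 1)
long_value <- unlist(df[-1], use.names = FALSE)

# table() drops NA values by default and counts each id/value pair
counts <- table(id = long_id, value = long_value)

# Convert the contingency table to a wide data frame
wide <- data.frame(id = as.numeric(rownames(counts)),
                   as.data.frame.matrix(counts),
                   row.names = NULL)
wide
```

As with dcast, the value columns come out in alphabetical order (blue, green, yellow) rather than in the order shown in the desired output.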