R: Cleaning up a wide and untidy dataframe

I have a data frame that looks like:
d<-data.frame(id=(1:9),
grp_id=(c(rep(1,3), rep(2,3), rep(3,3))),
a=rep(NA, 9),
b=c("No", rep(NA, 3), "Yes", rep(NA, 4)),
c=c(rep(NA,2), "No", rep(NA,6)),
d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)),
e=c(rep(NA, 7), "No", NA),
f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
>d
id grp_id a b c d e f
1 1 1 NA No <NA> <NA> <NA> <NA>
2 2 1 NA <NA> <NA> <NA> <NA> No
3 3 1 NA <NA> No <NA> <NA> <NA>
4 4 2 NA <NA> <NA> Yes <NA> <NA>
5 5 2 NA Yes <NA> <NA> <NA> <NA>
6 6 2 NA <NA> <NA> <NA> <NA> No
7 7 3 NA <NA> <NA> No <NA> <NA>
8 8 3 NA <NA> <NA> <NA> No <NA>
9 9 3 NA <NA> <NA> <NA> <NA> No
Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.
I'd like to create a single row for each grp_id to get a data frame that looks like the following:
grp_id a b c d e f
1 NA No No <NA> <NA> No
2 NA Yes <NA> Yes <NA> No
3 NA <NA> <NA> No No No
I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be
d %>%
group_by(grp_id) %>%
summarise()
I would appreciate help with the commands within summarise, or any solution really. Thanks.

We can use summarise_at and subset the first non-NA element
library(dplyr)
d %>%
group_by(grp_id) %>%
summarise_at(2:7, funs(.[!is.na(.)][1]))
# A tibble: 3 x 7
# grp_id a b c d e f
# <dbl> <lgl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 1 NA No No <NA> <NA> No
#2 2 NA Yes <NA> Yes <NA> No
#3 3 NA <NA> <NA> No No No
In the example dataset, column 'a' is logical (all NA) and columns 'b' to 'f' are factors, some of which have only a 'No' level. If all the columns need to share the same levels, call factor() with the levels specified as c('Yes', 'No') inside summarise_at, i.e. summarise_at(2:7, funs(factor(.[!is.na(.)][1], levels = c('Yes', 'No'))))
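In current dplyr releases funs() is deprecated and summarise_at() is superseded by across(). Below is a minimal sketch of the same "first non-NA value per group" idea, assuming dplyr 1.0.0 or later. The helper name first_non_na is made up for illustration, and the data are rebuilt with stringsAsFactors = FALSE so the columns are character rather than factor.

```r
library(dplyr)

# Same data as in the question, with character columns instead of factors
d <- data.frame(id = 1:9,
                grp_id = rep(1:3, each = 3),
                a = rep(NA, 9),
                b = c("No", rep(NA, 3), "Yes", rep(NA, 4)),
                c = c(rep(NA, 2), "No", rep(NA, 6)),
                d = c(rep(NA, 3), "Yes", rep(NA, 2), "No", rep(NA, 2)),
                e = c(rep(NA, 7), "No", NA),
                f = c(NA, "No", rep(NA, 3), "No", rep(NA, 2), "No"),
                stringsAsFactors = FALSE)

# Hypothetical helper: first non-missing element, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

res <- d %>%
  group_by(grp_id) %>%
  summarise(across(a:f, first_non_na))

res
```

Selecting a:f by name rather than by position avoids having to work out how positions are counted once grp_id is a grouping column.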

We can use aggregate. No packages are used.
YN <- function(x) c(na.omit(as.character(x)), NA)[1]
aggregate(d[3:8], d["grp_id"], YN)
giving:
## grp_id a b c d e f
## 1 1 <NA> No No <NA> <NA> No
## 2 2 <NA> Yes <NA> Yes <NA> No
## 3 3 <NA> <NA> <NA> No No No
The above gives character columns. If you prefer factor columns then use this:
YNfac <- function(x) factor(YN(x), c("No", "Yes"))
aggregate(d[3:8], d["grp_id"], YNfac)
Note: Other alternate implementations of YN are:
YN <- function(x) sort(as.character(x), na.last = TRUE)[1]
YN <- function(x) if (all(is.na(x))) NA_character_ else na.omit(as.character(x))[1]
library(zoo)
YN <- function(x) na.locf0(as.character(x), fromLast = TRUE)[1]

You've received some good answers but neither of them actually uses the tidyr package. (The summarize() and summarize_at() family of functions is from dplyr.)
In fact, a solution built around tidyr is very doable (only select() and the pipe come from dplyr).
d %>%
gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
na.omit() %>%
select(-id) %>%
spread(col, value, fill=NA, drop=FALSE)
The only hard part is ensuring that you get the a column in your output. For your example data, it is entirely NA. The trick is the factor_key=TRUE argument to gather() and the drop=FALSE argument to spread(). Without those two arguments being set, the output would not have an a column, and would only have columns with at least one non-NA entry.
Here's a description of how it works:
gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
This tidies your data -- it effectively replaces columns a - f with new columns col and value, forming a long-formatted "tidy" data frame. The entries in the col column are letters a - f. And because we've used factor_key=TRUE, this column is a factor with levels, not just a character vector.
na.omit() %>%
This removes all the NA values from the long data.
select(-id) %>%
This eliminates the id column.
spread(col, value, fill=NA, drop=FALSE)
This re-widens the data, using the values in the col column to define new column names, and the values in the value column to fill in the entries of the new columns. When data is missing, a value of fill (here NA) is used instead. And the drop=FALSE means that when col is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with setting col to be a factor, is what gets a as an output column.
I personally find this approach more readable than the approaches requiring subsetting or lapply stuff. Additionally, this approach will fail if your data is not actually one-hot, whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columns a - f are not factors, but character vectors. If you need factor output you should be able to do (untested)
mutate(value = factor(value, levels=c('Yes', 'No'))) %>%
anywhere between the gather() and spread() functions to ensure factor output.
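gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(). The following is a rough sketch of the same reshape with the newer functions, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later. It is a variant, not a line-by-line translation: it keeps the NA rows and collapses each cell with a made-up helper (first_non_na), so the all-NA column a survives without the factor trick. Unlike the spread() version, it will not fail on data that is not one-hot; it silently takes the first value.

```r
library(dplyr)
library(tidyr)

d <- data.frame(id = 1:9,
                grp_id = rep(1:3, each = 3),
                a = rep(NA, 9),
                b = c("No", rep(NA, 3), "Yes", rep(NA, 4)),
                c = c(rep(NA, 2), "No", rep(NA, 6)),
                d = c(rep(NA, 3), "Yes", rep(NA, 2), "No", rep(NA, 2)),
                e = c(rep(NA, 7), "No", NA),
                f = c(NA, "No", rep(NA, 3), "No", rep(NA, 2), "No"),
                stringsAsFactors = FALSE)

# Hypothetical helper: first non-missing element, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

wide <- d %>%
  select(-id) %>%
  # make every value column character so they can be stacked together
  mutate(across(a:f, as.character)) %>%
  pivot_longer(a:f, names_to = "col", values_to = "value") %>%
  # one cell per (grp_id, col); collapse the three stacked values to one
  pivot_wider(names_from = col, values_from = value,
              values_fn = list(value = first_non_na))

wide
```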

Related

Separate columns from a data frame

I have a data frame, built by extracting data from text files, in which some columns contain more than one value.
I want to split the columns that hold more than one value into 2 columns, like this
I tried this code but it generates an error
db<-separate_rows(db,TYPE,CHRO,EX ,sep=",\\s+")
Error: All nested columns must have the same number of elements.
Note that sample data and expected output don't match; for example, there is no CHRO=c700 entry in your sample data. You also seem to be missing rows. Please check your input/expected output data.
You could use tidyr::separate_rows, e.g.
library(dplyr)
library(tidyr)
df %>%
separate_rows(TYPE, sep = ",") %>%
separate_rows(CHRO, sep = ",") %>%
separate_rows(EX, sep = ",")
# TYPE CHRO EX
#1 multiple c.211dup <NA>
#2 multiple c.3751dup <NA>
#3 multiple <NA> exon.2
#4 multiple <NA> exon.3
#5 multiple <NA> exon.7
#6 mitocondrial <NA> exon.3
#7 mitocondrial <NA> exon.7
#8 multifactorial <NA> <NA>
Or perhaps use splitstackshape
library(splitstackshape)
df %>%
cSplit(names(df), direction = "long") %>%
fill(TYPE) %>%
group_by_at(names(df)) %>%
slice(1)
# TYPE CHRO EX
# <fct> <fct> <fct>
#1 mitocondrial NA exon.7
#2 multifactorial NA NA
#3 multiple c.211dup NA
#4 multiple c.3751dup NA
#5 multiple NA exon.2
#6 multiple NA exon.3
#7 multiple NA NA
Note that results are different because the order of separating columns matters.
Sample data
df <- read.table(text =
"TYPE CHRO EX
multiple 'c.211dup, c.3751dup' NA
multiple NA exon.2
multiple,mitocondrial NA exon.3,exon.7
multifactorial NA NA", header = T)
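The error in the question ("All nested columns must have the same number of elements") most likely comes from rows where the columns split into different numbers of pieces. Here is a small base R sketch for locating such rows before calling separate_rows() on several columns at once. The data frame reproduces the sample data above as character columns, and count_parts is a name invented for this example.

```r
df <- data.frame(
  TYPE = c("multiple", "multiple", "multiple,mitocondrial", "multifactorial"),
  CHRO = c("c.211dup, c.3751dup", NA, NA, NA),
  EX   = c(NA, "exon.2", "exon.3,exon.7", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of comma-separated pieces in each cell
# (an NA cell counts as a single piece)
count_parts <- function(x) lengths(strsplit(x, ",\\s*"))

# One row per data row, one column per variable
parts <- sapply(df, count_parts)

# Rows whose columns disagree on the number of pieces
mismatch <- apply(parts, 1, function(r) length(unique(r)) > 1)
which(mismatch)
```

Rows flagged here are the ones that cannot be separated across all three columns in a single call, which is why the answers above separate one column at a time.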

Combine values in two columns together based on specific conditions in R

I have data that looks like the following:
moo <- data.frame(Farm = c("A","B",NA,NA,"A","B"),
Barn_Yard = c("A","A",NA,"A",NA,"B"),
stringsAsFactors=FALSE)
print(moo)
Farm Barn_Yard
A A
B A
<NA> <NA>
<NA> A
A <NA>
B B
I am attempting to combine the columns into one variable: if they are the same, the result is the value found in both columns; if both have data, the result is what is in the Farm column; if both are <NA>, the result is <NA>; and if only one column has a value, the result is that value. Thus, in this instance the result would be:
oink <- data.frame(Animal_House = c("A","B",NA,"A","A","B"),
stringsAsFactors = FALSE)
print(oink)
Animal_House
A
B
<NA>
A
A
B
I have tried the unite function from tidyr but it doesn't give me exactly what I want. Any thoughts? Thanks!
dplyr::coalesce does exactly that, substituting any NA values in the first vector with the value from the second:
library(dplyr)
moo <- data.frame(Farm = c("A","B",NA,NA,"A","B"),
Barn_Yard = c("A","A",NA,"A",NA,"B"),
stringsAsFactors = FALSE)
oink <- moo %>% mutate(Animal_House = coalesce(Farm, Barn_Yard))
oink
#> Farm Barn_Yard Animal_House
#> 1 A A A
#> 2 B A B
#> 3 <NA> <NA> <NA>
#> 4 <NA> A A
#> 5 A <NA> A
#> 6 B B B
If you want to discard the original columns, use transmute instead of mutate.
A less succinct option is to use a couple of ifelse() statements, but this could be useful if you wish to introduce another condition or column into the mix.
moo <- data.frame(Farm = c("A","B",NA,NA,"A","B"),
Barn_Yard = c("A","A",NA,"A",NA,"B"),
stringsAsFactors = FALSE)
moo$Animal_House = with(moo,ifelse(is.na(Farm) & is.na(Barn_Yard),NA,
ifelse(!is.na(Barn_Yard) & is.na(Farm),Barn_Yard,
Farm)))
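If more columns need to be folded in later, the nested ifelse() calls can be replaced by a two-argument helper applied across the columns with Reduce(). This is only a sketch in base R; coalesce2 is a hypothetical helper name, not a function from any package.

```r
moo <- data.frame(Farm = c("A", "B", NA, NA, "A", "B"),
                  Barn_Yard = c("A", "A", NA, "A", NA, "B"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: keep x where it is present, otherwise fall back to y
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

# Columns are listed in priority order; add more names to the vector as needed
moo$Animal_House <- Reduce(coalesce2, moo[c("Farm", "Barn_Yard")])
moo
```

Because the columns are passed in priority order, Farm wins whenever both columns hold a value, which matches the rule described in the question.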

Fuse multiple data.frame date fields by removing NA using piping

I want to fuse multiple date fields that contain NAs using piping in R. The data looks like:
dd <- data.frame(id=c("a","b","c","d"),
f1=as.Date(c(NA, "2012-03-24", NA,NA)),
f2=as.Date(c("2010-01-24", NA, NA,NA)),
f3=as.Date(c(NA, NA, "2014-11-22", NA)))
dd
id f1 f2 f3
1 a <NA> 2010-01-24 <NA>
2 b 2012-03-24 <NA> <NA>
3 c <NA> <NA> 2014-11-22
4 d <NA> <NA> <NA>
I know how to do it the base R way:
unlist(apply(dd[,c("f1","f2","f3")],1,na.omit))
f2 f1 f3
"2010-01-24" "2012-03-24" "2014-11-22"
So that is not the point of my question. I'm in the process of learning piping and dplyr so I want to pipe this function. I've tried:
library(dplyr)
dd %>% mutate(f=na.omit(c(f1,f2,f3)))
Error in mutate_impl(.data, dots) :
Column `f` must be length 4 (the number of rows) or one, not 3
It doesn't work because of the line with all NAs. Without this line, it would work:
dd[-4,] %>% mutate(f=na.omit(c(f1,f2,f3)))
id f1 f2 f3 f
1 a <NA> 2010-01-24 <NA> 2012-03-24
2 b 2012-03-24 <NA> <NA> 2010-01-24
3 c <NA> <NA> 2014-11-22 2014-11-22
Any idea how to do it properly?
BTW, my question is different from this and this as I want to use piping and because my field is a date field, I cannot use sum with na.rm=T.
Thanks
We can use coalesce to create the new column,
library(dplyr)
dd %>%
transmute(newcol = coalesce(f1, f2, f3)) #%>%
#then `filter` the rows to remove the NA elements
#and `pull` as a `vector` (if needed)
#filter(!is.na(newcol)) %>%
#pull(newcol)
# newcol
#1 2010-01-24
#2 2012-03-24
#3 2014-11-22
#4 <NA>
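It may help to see why the attempt in the question goes wrong even when it runs. c(f1, f2, f3) stacks the three columns end to end, so na.omit() returns the dates in column order, not row order. That is why the dd[-4,] output in the question pairs id a with 2012-03-24 although its only date is 2010-01-24. For comparison, here is a base R sketch of the row-wise fill that coalesce() performs, written as a loop so the Date class is kept; the variable names are illustrative only.

```r
dd <- data.frame(id = c("a", "b", "c", "d"),
                 f1 = as.Date(c(NA, "2012-03-24", NA, NA)),
                 f2 = as.Date(c("2010-01-24", NA, NA, NA)),
                 f3 = as.Date(c(NA, NA, "2014-11-22", NA)))

# Start from the first field and fill its gaps from the later fields,
# position by position, so every value stays in its own row
f <- dd$f1
for (col in c("f2", "f3")) {
  gap <- is.na(f)
  f[gap] <- dd[[col]][gap]
}
dd$f <- f
dd
```

Subset assignment is used instead of ifelse() because ifelse() drops the Date class and returns plain numbers.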

r matrix columns filling incorrectly

Trying to re-organise a data set (sdevDFC) into a matrix with my latitude (Lat) as row names and longitude (Lon) as column names, then filling in the matrix with values respective to the coordinates.
stand_dev_m <- matrix(data=sdevDFC$SDev, nrow=length(sdevDFC$Lat), ncol=length(sdevDFC$Lon), byrow=TRUE, dimnames = list(sdevDFC$Lat, sdevDFC$Lon))
The column and row names appear as they should, but my data fills in so that all values in their respective columns are identical, as shown in the image (which should not be the case, as none of my values ever repeat).
I've filled it with byrow = FALSE to see if it also occurred then (it does), and I've also used colnames and rownames instead of dimnames (changes nothing).
Would appreciate any insight into what I may be doing wrong here--also new to this platform so I apologise if I've missed a guideline or another question that's similar
Example data:
df <- data.frame(LON=1:5,
LAT=11:15,
VAL=letters[1:5],
stringsAsFactors=F)
You could try the following:
library(dplyr)
library(tidyr)
rn <- df$LON # Save what-will-be-rownames
df1 <- df %>%
spread(LAT,VAL,fill=NA) %>%
select(-LON) %>%
setNames(., df$LAT)
rownames(df1) <- rn
Output
11 12 13 14 15
1 a <NA> <NA> <NA> <NA>
2 <NA> b <NA> <NA> <NA>
3 <NA> <NA> c <NA> <NA>
4 <NA> <NA> <NA> d <NA>
5 <NA> <NA> <NA> <NA> e
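A likely explanation for the symptom in the question, assuming sdevDFC holds one row per coordinate pair: matrix() receives n values but is asked for an n-by-n matrix, so it recycles the same n values into every row, which makes each column constant. The sketch below builds the matrix in base R instead, by creating an empty matrix and filling only the cells that have data. It uses the example data above, with LAT as rows and LON as columns as the question describes.

```r
df <- data.frame(LON = 1:5,
                 LAT = 11:15,
                 VAL = letters[1:5],
                 stringsAsFactors = FALSE)

lats <- sort(unique(df$LAT))
lons <- sort(unique(df$LON))

# Empty matrix with one row per latitude and one column per longitude
m <- matrix(NA_character_, nrow = length(lats), ncol = length(lons),
            dimnames = list(lats, lons))

# Two-column index matrix: (row position, column position) for each observation
m[cbind(match(df$LAT, lats), match(df$LON, lons))] <- df$VAL
m
```

Each value lands only at its own (LAT, LON) cell, and every other cell stays NA.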

Convert various dummy/logical variables into a single categorical variable/factor from their name in R

My question has strong similarities with this one and this other one, but my dataset is a little bit different and I can't seem to make those solutions work. Please excuse me if I misunderstood something and this question is redundant.
I have a dataset such as this one:
df <- data.frame(
id = c(1:5),
conditionA = c(1, NA, NA, NA, 1),
conditionB = c(NA, 1, NA, NA, NA),
conditionC = c(NA, NA, 1, NA, NA),
conditionD = c(NA, NA, NA, 1, NA)
)
# id conditionA conditionB conditionC conditionD
# 1 1 1 NA NA NA
# 2 2 NA 1 NA NA
# 3 3 NA NA 1 NA
# 4 4 NA NA NA 1
# 5 5 1 NA NA NA
(Note that apart from these columns, I have a lot of other columns that shouldn't be affected by the current manipulation.)
So, I observe that conditionA, conditionB, conditionC and conditionD are mutually exclusive and would be better represented as a single categorical variable, i.e. a factor, that should look like this:
# id type
# 1 1 conditionA
# 2 2 conditionB
# 3 3 conditionC
# 4 4 conditionD
# 5 5 conditionA
I have investigated using gather or unite from tidyr, but it doesn't correspond to this case (with unite, we lose the information from the variable name).
I tried using kimisc::coalesce.na, as suggested in the first referred answer, but (1) I first need to set a factor value based on the name of each column, and (2) it doesn't work as expected, only including the first column:
library(kimisc)
# first, factor each condition with a specific label
df$conditionA <- df$conditionA %>%
factor(levels = 1, labels = "conditionA")
df$conditionB <- df$conditionB %>%
factor(levels = 1, labels = "conditionB")
df$conditionC <- df$conditionC %>%
factor(levels = 1, labels = "conditionC")
df$conditionD <- df$conditionD %>%
factor(levels = 1, labels = "conditionD")
# now coalesce.na to merge into a single variable
df$type <- coalesce.na(df$conditionA, df$conditionB, df$conditionC, df$conditionD)
df
# id conditionA conditionB conditionC conditionD type
# 1 1 conditionA <NA> <NA> <NA> conditionA
# 2 2 <NA> conditionB <NA> <NA> <NA>
# 3 3 <NA> <NA> conditionC <NA> <NA>
# 4 4 <NA> <NA> <NA> conditionD <NA>
# 5 5 conditionA <NA> <NA> <NA> conditionA
I tried the other suggestions from the second question, but haven't found one that would bring me the expected result...
Try:
library(dplyr)
library(tidyr)
df %>% gather(type, value, -id) %>% na.omit() %>% select(-value) %>% arrange(id)
Which gives:
# id type
#1 1 conditionA
#2 2 conditionB
#3 3 conditionC
#4 4 conditionD
#5 5 conditionA
Update
To handle the case you detailed in the comments, you could do the operation on the desired portion of the data frame and then left_join() the other columns:
df %>%
select(starts_with("condition"), id) %>%
gather(type, value, -id) %>%
na.omit() %>%
select(-value) %>%
left_join(., df %>% select(-starts_with("condition"))) %>%
arrange(id)
You can also try:
colnames(df)[2:5][max.col(!is.na(df[,2:5]))]
#[1] "conditionA" "conditionB" "conditionC" "conditionD" "conditionA"
The above works if one and only one column has a value other than NA for each row. If the values of a row can be all NAs, then you can try:
mat<-!is.na(df[,2:5])
colnames(df)[2:5][max.col(mat)*(NA^!rowSums(mat))]
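The NA^!rowSums(mat) term relies on NA^0 being 1 and NA^1 being NA in R, so rows with no value end up with an NA index. A more explicit sketch of the same idea follows, using a hypothetical sixth row with no condition set; has_value and type are names chosen for this example.

```r
df <- data.frame(
  id = 1:6,
  conditionA = c(1, NA, NA, NA, 1, NA),
  conditionB = c(NA, 1, NA, NA, NA, NA),
  conditionC = c(NA, NA, 1, NA, NA, NA),
  conditionD = c(NA, NA, NA, 1, NA, NA)
)

# TRUE where a condition is set
mat <- !is.na(df[, 2:5])

# Rows that have at least one condition set
has_value <- rowSums(mat) > 0

# Column name of the first TRUE in each row, or NA when the row is empty;
# mat * 1 converts the logical matrix to numeric for max.col()
type <- unname(ifelse(has_value,
                      colnames(mat)[max.col(mat * 1, ties.method = "first")],
                      NA_character_))
type
```

Setting ties.method = "first" makes the result deterministic; the default for max.col() breaks ties at random, which matters only for rows with more than one condition set.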
library(tidyr)
library(dplyr)
df <- df %>%
gather(type, count, -id)
df <- df[complete.cases(df),][,-3]
df[order(df$id),]
id type
1 1 conditionA
7 2 conditionB
13 3 conditionC
19 4 conditionD
5 5 conditionA
