I have a data frame that results from extracting data from text files. Some of its columns contain more than one value.
I want to split the columns with more than one value into 2 columns, like this:
I tried this code, but it generates an error:
db<-separate_rows(db,TYPE,CHRO,EX ,sep=",\\s+")
Error: All nested columns must have the same number of elements.
Note that sample data and expected output don't match; for example, there is no CHRO=c700 entry in your sample data. You also seem to be missing rows. Please check your input/expected output data.
You could use tidyr::separate_rows, e.g.
library(dplyr)
library(tidyr)
df %>%
  separate_rows(TYPE, sep = ",") %>%
  separate_rows(CHRO, sep = ",") %>%
  separate_rows(EX, sep = ",")
# TYPE CHRO EX
#1 multiple c.211dup <NA>
#2 multiple c.3751dup <NA>
#3 multiple <NA> exon.2
#4 multiple <NA> exon.3
#5 multiple <NA> exon.7
#6 mitocondrial <NA> exon.3
#7 mitocondrial <NA> exon.7
#8 multifactorial <NA> <NA>
Or perhaps use splitstackshape
library(splitstackshape)
library(dplyr)  # group_by_at(), slice()
library(tidyr)  # fill()
df %>%
  cSplit(names(df), direction = "long") %>%
  fill(TYPE) %>%
  group_by_at(names(df)) %>%
  slice(1)
# TYPE CHRO EX
# <fct> <fct> <fct>
#1 mitocondrial NA exon.7
#2 multifactorial NA NA
#3 multiple c.211dup NA
#4 multiple c.3751dup NA
#5 multiple NA exon.2
#6 multiple NA exon.3
#7 multiple NA NA
Note that results are different because the order of separating columns matters.
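To make the sequential splitting concrete without any packages, here is a minimal base R sketch. split_rows is a hypothetical helper written for this illustration (it is not a tidyr function), and the data frame is retyped from the sample data below with character columns. It also hints at why the original call failed: splitting TYPE, CHRO and EX in one call presumably requires each row to yield the same number of pieces in every column, and row 1 has one TYPE value but two CHRO values.

```r
# Hypothetical helper (not from tidyr): split one column on commas and
# repeat the remaining columns once per resulting piece.
split_rows <- function(data, col, sep = ",\\s*") {
  pieces <- strsplit(as.character(data[[col]]), sep)  # NA stays a single NA
  n <- lengths(pieces)
  out <- data[rep(seq_len(nrow(data)), n), , drop = FALSE]
  out[[col]] <- unlist(pieces)
  rownames(out) <- NULL
  out
}

df <- data.frame(
  TYPE = c("multiple", "multiple", "multiple,mitocondrial", "multifactorial"),
  CHRO = c("c.211dup, c.3751dup", NA, NA, NA),
  EX   = c(NA, "exon.2", "exon.3,exon.7", NA),
  stringsAsFactors = FALSE
)

# Split one column at a time, as in the separate_rows() chain above.
res <- split_rows(split_rows(split_rows(df, "TYPE"), "CHRO"), "EX")
print(res)
```

Splitting one column at a time means each split multiplies the rows produced by the previous one, which is why the order of the splits can change the result.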
Sample data
df <- read.table(text =
"TYPE CHRO EX
multiple 'c.211dup, c.3751dup' NA
multiple NA exon.2
multiple,mitocondrial NA exon.3,exon.7
multifactorial NA NA", header = TRUE)
I'm trying to remove rows with duplicate values in one column of a data frame. I want every existing value in that column to remain represented: a value should appear once for each distinct non-missing value it has in a second column, and only once if its values in that second column are all missing. Take for example the following data frame:
toy <- data.frame(Group = c(1,1,2,2,2,3,3,4,5,5,6,7,7), Class = c("a",NA,"a","b",NA,NA,NA,NA,"a","b","a","a","a"))
I would like to end up with this:
ideal <- data.frame(Group = c(1,2,2,3,4,5,5,6,7), Class = c("a","a","b",NA,NA,"a","b","a","a"))
I tried transforming the data frame into a data table and following the advice here, like this:
library(data.table)
toy.dt <- as.data.table(toy)
toy.dt[, .(Class = if(all(is.na(Class))) NA_character_ else na.omit(Class)), by = Group]
but duplicates weren't handled as needed: value 7 in the column 'Group' should appear only once in the resulting data.
It would be a bonus if the solution doesn't require transforming the data into a data table.
Here is one way using base R. We first drop NA rows in toy and select only unique rows. We can then left join it with unique Group values to get the rows which are NA for the group.
df1 <- unique(na.omit(toy))
merge(unique(subset(toy, select = Group)), df1, all.x = TRUE)
# Group Class
#1 1 a
#2 2 a
#3 2 b
#4 3 <NA>
#5 4 <NA>
#6 5 a
#7 5 b
#8 6 a
#9 7 a
Same logic using dplyr functions :
library(dplyr)
toy %>%
  na.omit() %>%
  distinct() %>%
  right_join(toy %>% distinct(Group), by = "Group")
If you would like to try a tidyverse approach:
library(tidyverse)
toy %>%
  group_by(Group) %>%
  filter(!(is.na(Class) & sum(!is.na(Class)) > 0)) %>%
  distinct()
Output
# A tibble: 9 x 2
# Groups: Group [7]
Group Class
<dbl> <chr>
1 1 a
2 2 a
3 2 b
4 3 NA
5 4 NA
6 5 a
7 5 b
8 6 a
9 7 a
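The filter() condition above drops a row only when its Class is NA and its group has at least one non-missing Class. Here is a base R sketch of the same rule, for readers who want to avoid the tidyverse. The names has_value and keep are chosen here for illustration only.

```r
toy <- data.frame(
  Group = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 7),
  Class = c("a", NA, "a", "b", NA, NA, NA, NA, "a", "b", "a", "a", "a"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose group contains at least one non-missing Class
has_value <- ave(!is.na(toy$Class), toy$Group, FUN = any)

# Drop NA rows only in groups that also have a real value, then de-duplicate
keep <- !(is.na(toy$Class) & has_value)
res <- unique(toy[keep, ])
rownames(res) <- NULL
print(res)
```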
I have a data frame which consists of a single column of some very messy JSON data. I would like to convert the JSON entries in that column to additional columns in the same data frame. I have a messy solution, but it will be tedious and long to apply to my actual dataset.
Here is my sample data frame:
sample.df <- data.frame(id = c(101, 102, 103, 104),
json_col = c('[{"foo_a":"bar"}]',
'[{"foo_a":"bar","foo_b":"bar"}]',
'[{"foo_a":"bar","foo_c":2}]',
'[{"foo_a":"bar","foo_b":"bar","foo_c":2,"nested_col":{"foo_d":"bar","foo_e":3}}]'),
startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14','2006-2-21')))
In reality, my data frame has over 100000 entries and multiple JSON columns to which I need to apply the solution. There are also several levels of nesting (i.e. nested lists within nested lists).
Here is my solution:
j.col <- sample.df[2]
library(jsonlite)
j.l <- apply(j.col, 1, jsonlite::fromJSON, flatten = T)
library(dplyr)
l.as.df <- bind_rows(lapply(j.l,data.frame))
new.df <- cbind(sample.df$id, l.as.df, sample.df$startdate)
My solution is a roundabout method where I extract the JSON column from the data frame, convert the JSON into a second data frame, and then combine the two data frames into a third. This will be long and tedious to do with my actual data, not to mention inelegant. How can I do this without having to create the additional data frames?
Thanks in advance for any help!
Here's another approach that will spare you the intermediate dataframes:
library(dplyr)
library(jsonlite)
new.df <- sample.df %>%
  rowwise() %>%
  do(data.frame(fromJSON(.$json_col, flatten = TRUE))) %>%
  ungroup() %>%
  bind_cols(sample.df %>% select(-json_col))
print(new.df)
# # A tibble: 4 x 7
# foo_a foo_b foo_c nested_col.foo_d nested_col.foo_e id startdate
# <chr> <chr> <int> <chr> <int> <dbl> <date>
# 1 bar <NA> NA <NA> NA 101 2010-11-01
# 2 bar bar NA <NA> NA 102 2008-03-25
# 3 bar <NA> 2 <NA> NA 103 2007-03-14
# 4 bar bar 2 bar 3 104 2006-02-21
library(dplyr)
library(tidyr)
library(purrr)
library(jsonlite)
sample.df %>%
mutate(
json_parsed = map(json_col, ~ fromJSON(., flatten=TRUE))
) %>%
unnest(json_parsed)
# id
# 1 101
# 2 102
# 3 103
# 4 104
# json_col
# 1 [{"foo_a":"bar"}]
# 2 [{"foo_a":"bar","foo_b":"bar"}]
# 3 [{"foo_a":"bar","foo_c":2}]
# 4 [{"foo_a":"bar","foo_b":"bar","foo_c":2,"nested_col":{"foo_d":"bar","foo_e":3}}]
# startdate foo_a foo_b foo_c nested_col.foo_d nested_col.foo_e
# 1 2010-11-01 bar <NA> NA <NA> NA
# 2 2008-03-25 bar bar NA <NA> NA
# 3 2007-03-14 bar <NA> 2 <NA> NA
# 4 2006-02-21 bar bar 2 bar 3
If you want fewer dependencies, you can drop purrr and use lapply instead:
...
json_parsed = lapply(.$json_col, fromJSON, flatten=TRUE)
...
I think this will work. The main idea is to convert json_col to character and paste the entries together into a single JSON string, replacing each "][" boundary with "," so that the result is one JSON array. The fromJSON function then takes care of the rest.
library(stringi)
library(jsonlite)
sample.df$json_col<- as.character(sample.df$json_col)
json_obj<- paste(sample.df$json_col, collapse = "")
json_obj<- stri_replace_all_fixed(json_obj, "][", ",")
new.df<- cbind(sample.df$id, fromJSON(json_obj), sample.df$startdate)
> new.df
# sample.df$id foo_a foo_b foo_c nested_col.foo_d nested_col.foo_e
#1 101 bar <NA> NA <NA> NA
#2 102 bar bar NA <NA> NA
#3 103 bar <NA> 2 <NA> NA
#4 104 bar bar 2 bar 3
# sample.df$startdate
#1 2010-11-01
#2 2008-03-25
#3 2007-03-14
#4 2006-02-21
Make sure that the cbind step works correctly! In this case it did, but make sure that nothing in your overall manipulations changes the row order.
I have data that looks like the following:
moo <- data.frame(Farm = c("A","B",NA,NA,"A","B"),
Barn_Yard = c("A","A",NA,"A",NA,"B"),
stringsAsFactors=FALSE)
print(moo)
Farm Barn_Yard
A A
B A
<NA> <NA>
<NA> A
A <NA>
B B
I am attempting to combine the columns into one variable with these rules: if both columns hold the same value, the result is that value; if both have data, the result is what is in the Farm column; if both are <NA>, the result is <NA>; and if only one column has a value, the result is that value. Thus, in this instance the result would be:
oink <- data.frame(Animal_House = c("A","B",NA,"A","A","B"),
stringsAsFactors = FALSE)
print(oink)
Animal_House
A
B
<NA>
A
A
B
I have tried the unite function from tidyr but it doesn't give me exactly what I want. Any thoughts? Thanks!
dplyr::coalesce does exactly that, substituting any NA values in the first vector with the value from the second:
library(dplyr)
moo <- data.frame(Farm = c("A","B",NA,NA,"A","B"),
Barn_Yard = c("A","A",NA,"A",NA,"B"),
stringsAsFactors = FALSE)
oink <- moo %>% mutate(Animal_House = coalesce(Farm, Barn_Yard))
oink
#> Farm Barn_Yard Animal_House
#> 1 A A A
#> 2 B A B
#> 3 <NA> <NA> <NA>
#> 4 <NA> A A
#> 5 A <NA> A
#> 6 B B B
If you want to discard the original columns, use transmute instead of mutate.
A less succinct option is to use a couple of ifelse() calls, but this could be useful if you wish to introduce another condition or column into the mix.
moo <- data.frame(Farm = c("A","B",NA,NA,"A","B"),
Barn_Yard = c("A","A",NA,"A",NA,"B"),
stringsAsFactors = FALSE)
moo$Animal_House = with(moo,ifelse(is.na(Farm) & is.na(Barn_Yard),NA,
ifelse(!is.na(Barn_Yard) & is.na(Farm),Barn_Yard,
Farm)))
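If more columns enter the mix, the same "first non-missing value wins" rule can be folded over any number of columns in base R. This is only a sketch: first_non_na is a helper name invented here, and the Coop column is a made-up third column added purely for illustration.

```r
# Hypothetical helper: take the first non-NA value across the vectors, left to right
first_non_na <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

moo <- data.frame(Farm = c("A", "B", NA, NA, "A", "B"),
                  Barn_Yard = c("A", "A", NA, "A", NA, "B"),
                  Coop = c(NA, NA, "C", NA, NA, NA),  # made-up extra column
                  stringsAsFactors = FALSE)

two_cols   <- first_non_na(moo$Farm, moo$Barn_Yard)
three_cols <- first_non_na(moo$Farm, moo$Barn_Yard, moo$Coop)
print(two_cols)
print(three_cols)
```

With only two columns this reduces to a single ifelse(is.na(Farm), Barn_Yard, Farm), since a row where both are NA yields NA either way.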
I have a data frame that looks like:
d<-data.frame(id=(1:9),
grp_id=(c(rep(1,3), rep(2,3), rep(3,3))),
a=rep(NA, 9),
b=c("No", rep(NA, 3), "Yes", rep(NA, 4)),
c=c(rep(NA,2), "No", rep(NA,6)),
d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)),
e=c(rep(NA, 7), "No", NA),
f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
>d
id grp_id a b c d e f
1 1 1 NA No <NA> <NA> <NA> <NA>
2 2 1 NA <NA> <NA> <NA> <NA> No
3 3 1 NA <NA> No <NA> <NA> <NA>
4 4 2 NA <NA> <NA> Yes <NA> <NA>
5 5 2 NA Yes <NA> <NA> <NA> <NA>
6 6 2 NA <NA> <NA> <NA> <NA> No
7 7 3 NA <NA> <NA> No <NA> <NA>
8 8 3 NA <NA> <NA> <NA> No <NA>
9 9 3 NA <NA> <NA> <NA> <NA> No
Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.
I'd like to create a single row for each grp_id to get a data frame that looks like the following:
grp_id a b c d e f
1 NA No No <NA> <NA> No
2 NA Yes <NA> Yes <NA> No
3 NA <NA> <NA> No No No
I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be
d %>%
group_by(grp_id) %>%
summarise()
I would appreciate help with the commands within summarise, or any solution really. Thanks.
We can use summarise_at and subset the first non-NA element
library(dplyr)
d %>%
  group_by(grp_id) %>%
  # vars(a:f) with a formula; funs() is deprecated in recent dplyr versions
  summarise_at(vars(a:f), ~ .[!is.na(.)][1])
# A tibble: 3 x 7
# grp_id a b c d e f
# <dbl> <lgl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 1 NA No No <NA> <NA> No
#2 2 NA Yes <NA> Yes <NA> No
#3 3 NA <NA> <NA> No No No
In the example dataset, columns 'b' to 'f' are factors (column 'a' is logical, since it is entirely NA), and some of them have 'No' as their only level. If all the columns should share the same levels, we can wrap the result in factor with levels specified as c('Yes', 'No') inside summarise_at, i.e. summarise_at(vars(a:f), ~ factor(.[!is.na(.)][1], levels = c('Yes', 'No')))
We can use aggregate. No packages are used.
YN <- function(x) c(na.omit(as.character(x)), NA)[1]
aggregate(d[3:8], d["grp_id"], YN)
giving:
## grp_id a b c d e f
## 1 1 <NA> No No <NA> <NA> No
## 2 2 <NA> Yes <NA> Yes <NA> No
## 3 3 <NA> <NA> <NA> No No No
The above gives character columns. If you prefer factor columns then use this:
YNfac <- function(x) factor(YN(x), c("No", "Yes"))
aggregate(d[3:8], d["grp_id"], YNfac)
Note: Alternative implementations of YN are:
YN <- function(x) sort(as.character(x), na.last = TRUE)[1]
YN <- function(x) if (all(is.na(x))) NA_character_ else na.omit(as.character(x))[1]
library(zoo)
YN <- function(x) na.locf0(as.character(x), fromLast = TRUE)[1]
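As a quick check of how the base R variants behave, the sketch below runs three of them on small vectors. The suffixed names YN1, YN2 and YN3 are used here only so the definitions can sit side by side. Note that the sort-based version returns the alphabetically smallest value rather than the first one, which is equivalent only under the question's assumption of at most one value per group and column.

```r
# The three base R variants of YN, renamed so they can coexist
YN1 <- function(x) c(na.omit(as.character(x)), NA)[1]
YN2 <- function(x) sort(as.character(x), na.last = TRUE)[1]
YN3 <- function(x) if (all(is.na(x))) NA_character_ else na.omit(as.character(x))[1]

one_value <- factor(c(NA, "No", NA))
all_missing <- c(NA, NA, NA)

# Each variant picks the single non-missing value...
res_value <- c(YN1(one_value), YN2(one_value), YN3(one_value))
# ...and falls back to NA when the group has no value at all
res_missing <- c(YN1(all_missing), YN2(all_missing), YN3(all_missing))

print(res_value)
print(res_missing)
```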
You've received some good answers but neither of them actually uses the tidyr package. (The summarize() and summarize_at() family of functions is from dplyr.)
In fact, a tidyr-only solution for your problem is very doable.
library(dplyr)  # for select()
library(tidyr)
d %>%
  gather(col, value, -id, -grp_id, factor_key = TRUE) %>%
  na.omit() %>%
  select(-id) %>%
  spread(col, value, fill = NA, drop = FALSE)
The only hard part is ensuring that you get the a column in your output. For your example data, it is entirely NA. The trick is the factor_key=TRUE argument to gather() and the drop=FALSE argument to spread(). Without those two arguments being set, the output would not have an a column, and would only have columns with at least one non-NA entry.
Here's a description of how it works:
gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
This tidies your data -- it effectively replaces columns a - f with new columns col and value, forming a long-formatted "tidy" data frame. The entries in the col column are letters a - f. And because we've used factor_key=TRUE, this column is a factor with levels, not just a character vector.
na.omit() %>%
This removes all the NA values from the long data.
select(-id) %>%
This eliminates the id column.
spread(col, value, fill=NA, drop=FALSE)
This re-widens the data, using the values in the col column to define new column names, and the values in the value column to fill in the entries of the new columns. When data is missing, a value of fill (here NA) is used instead. And the drop=FALSE means that when col is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with setting col to be a factor, is what gets a as an output column.
I personally find this approach more readable than the approaches requiring subsetting or lapply stuff. Additionally, this approach will fail if your data is not actually one-hot, whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columns a - f are not factors, but character vectors. If you need factor output you should be able to do (untested)
mutate(value = factor(value, levels=c('Yes', 'No', NA))) %>%
anywhere between the gather() and spread() functions to ensure factor output.
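On the levels in that factor() call: a small base R check suggests that listing NA among the levels makes no difference, because factor() excludes NA from the levels by default. The vector below is made up for the check.

```r
value <- c("No", NA, "Yes")  # made-up stand-in for the long-format value column

with_na    <- factor(value, levels = c("Yes", "No", NA))
without_na <- factor(value, levels = c("Yes", "No"))

print(levels(with_na))                 # NA is not kept as a level
print(identical(with_na, without_na))
```

Missing entries stay NA in the resulting factor either way.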
I have the following sample data:
df
val_str
fruit=apple,machine=crane
machine=crane
machine=roboter
fruit=apple
machine=roboter,food=samosa
df2
fruit machine food
apple crane NA
NA crane NA
NA roboter NA
apple NA NA
NA roboter samosa
How do I get from df to df2? Each unique value before the "=" should create a column and then the respective values belonging to this should be spread across the rows.
Code:
df <- data.frame(val_str = c("fruit=apple,machine=crane","machine=crane","machine=roboter", "fruit=apple", "machine=roboter,food=samosa"))
df2 <- data.frame(fruit = c("apple",NA,NA,"apple",NA),
machine = c("crane","crane","roboter",NA,"roboter"),
food = c(NA,NA,NA,NA,"samosa"))
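We can use strsplit on the 'val_str' column to split on "=" and ",", then loop over the list elements with map_df and build a one-row data.frame from each. The even-positioned elements (selected with a recycled logical index) are the values, and the odd-positioned elements become the column names.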
library(dplyr)
library(purrr)
strsplit(as.character(df$val_str), "[=,]") %>%
map_df(~ setNames(as.data.frame.list(.[c(FALSE, TRUE)]), .[c(TRUE, FALSE)]))
# fruit machine food
#1 apple crane <NA>
#2 <NA> crane <NA>
#3 <NA> roboter <NA>
#4 apple <NA> <NA>
#5 <NA> roboter samosa
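The same alternating key/value idea can be sketched in base R alone, in case dplyr and purrr are not available. The intermediate names parts, keys and rows are chosen here for illustration; the result is rebuilt as df2 with character columns.

```r
val_str <- c("fruit=apple,machine=crane", "machine=crane", "machine=roboter",
             "fruit=apple", "machine=roboter,food=samosa")

# Each element becomes key, value, key, value, ...
parts <- strsplit(val_str, "[=,]")

# All keys in order of first appearance: these become the column names
keys <- unique(unlist(lapply(parts, function(p) p[c(TRUE, FALSE)])))

# For each row, look the values up by key; keys that are absent give NA
rows <- lapply(parts, function(p) {
  vals <- setNames(p[c(FALSE, TRUE)], p[c(TRUE, FALSE)])
  unname(vals[keys])
})

df2 <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df2) <- keys
print(df2)
```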