Spread Strings across Columns - r

I have the following sample data:
df
val_str
fruit=apple,machine=crane
machine=crane
machine=roboter
fruit=apple
machine=roboter,food=samosa
df2
fruit machine food
apple crane NA
NA crane NA
NA roboter NA
apple NA NA
NA roboter samosa
How do I get from df to df2? Each unique value before the "=" should create a column and then the respective values belonging to this should be spread across the rows.
Code:
df <- data.frame(val_str = c("fruit=apple,machine=crane","machine=crane","machine=roboter", "fruit=apple", "machine=roboter,food=samosa"))
df2 <- data.frame(fruit = c("apple",NA,NA,"apple",NA),
machine = c("crane","crane","roboter",NA,"roboter"),
food = c(NA,NA,NA,NA,"samosa"))

We can call strsplit on the 'val_str' column, build a data.frame from the alternating elements (a recycled logical index picks out the keys and the values), and loop over the list elements with map_df
library(dplyr)
library(purrr)
strsplit(as.character(df$val_str), "[=,]") %>%
map_df(~ setNames(as.data.frame.list(.[c(FALSE, TRUE)]), .[c(TRUE, FALSE)]))
# fruit machine food
#1 apple crane <NA>
#2 <NA> crane <NA>
#3 <NA> roboter <NA>
#4 apple <NA> <NA>
#5 <NA> roboter samosa
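For comparison, the same key=value parsing can be sketched in base R without purrr. This is only an illustration under the assumption that every entry has the form key=value and that keys are not repeated within a row; the names pairs, parsed, keys and wide are made up for this example.

```r
# Sample data from the question
df <- data.frame(val_str = c("fruit=apple,machine=crane", "machine=crane",
                             "machine=roboter", "fruit=apple",
                             "machine=roboter,food=samosa"),
                 stringsAsFactors = FALSE)

# Split each string into "key=value" pairs
pairs <- strsplit(df$val_str, ",", fixed = TRUE)

# Turn each row into a named character vector: names are keys, elements are values
parsed <- lapply(pairs, function(p) {
  kv <- strsplit(p, "=", fixed = TRUE)
  setNames(vapply(kv, `[`, character(1), 2),
           vapply(kv, `[`, character(1), 1))
})

# Column order follows the first appearance of each key
keys <- unique(unlist(lapply(parsed, names)))

# Indexing a named vector by a missing name yields NA, which fills the gaps
wide <- as.data.frame(do.call(rbind, lapply(parsed, function(v) unname(v[keys]))),
                      stringsAsFactors = FALSE)
names(wide) <- keys
wide
```

All resulting columns are character vectors; convert them afterwards if another type is needed.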

Related

Subset dataframe in R based on a list specified in a vector (using a 'starts with' expression or equivalent)

I am trying to identify any participant taking statins in a dataset of over 1 million rows and subset based on this. I have a vector that includes all the codes for these medications (I've just made a few up for demonstration purposes), and I would next like to create a function that searches through the dataframe and identifies any case that has a medication code that "starts with" any of the strings listed in the vector.
The df looks like this:
ID readcode_1 readcode_2 generic_name
1 1001 bxd1 1146785342 Simvastatin
2 1002 <NA> <NA> <NA>
3 1003 <NA> <NA> Pravastatin
4 1004 <NA> <NA> <NA>
5 1005 bxd4 45432344 <NA>
6 1006 <NA> <NA> <NA>
7 1007 <NA> <NA> <NA>
8 1008 <NA> <NA> <NA>
9 1009 <NA> <NA> <NA>
10 1010 bxde <NA> <NA>
11 1011 <NA> <NA> <NA>
Ideally, I'd like the end product to look like this:
ID readcode_1 readcode_2 generic_name
1 1001 bxd1 1146785342 Simvastatin
3 1003 <NA> <NA> Pravastatin
5 1005 bxd4 45432344 <NA>
10 1010 bxde <NA> <NA>
Here is my code so far (doesn't currently work)
#create vector with list of medication codes of interest
medications <- c("bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin")
# look through all columns (apart from IDs in first column) and if any of them start with the codes listed in the medications vector, return a 1
df$statin_prescribed <- apply(df[, -1], 1, function(x) {
if(any(x %in% startsWith(x, medications))) {
return(1)
} else {
return(0)
}
})
# subset to include only individuals prescribed statins
df <- subset(df, statin_prescribed == 1)
The part that doesn't seem to work is startsWith(x, medications).
Please let me know if you have any suggestions and, additionally, whether there is alternative code that may be more time-efficient!
This is a solution using the dplyr package
library(dplyr)
df %>%
filter_at(vars(-ID), any_vars(grepl(paste(medications, collapse = "|"), .)))
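Small explanation: we filter all those rows where at least one variable (excluding ID) matches one of the values inside medications. Note that this pattern matches anywhere in the string; to require a match at the start, anchor it, e.g. paste0("^(", paste(medications, collapse = "|"), ")")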
Output
# ID readcode_1 readcode_2 generic_name
# 1 1001 bxd1 1146785342 Simvastatin
# 2 1003 <NA> <NA> Pravastatin
# 3 1005 bxd4 45432344 <NA>
# 4 1010 bxde <NA> <NA>
Another solution in base R with a similar rationale is the following
df[apply(df[,-1], 1, function(x) {any(grepl(paste(medications, collapse = "|"), x))}),]
Output is the same (except row index which I believe is not relevant)
# ID readcode_1 readcode_2 generic_name
# 1 1001 bxd1 1146785342 Simvastatin
# 3 1003 <NA> <NA> Pravastatin
# 5 1005 bxd4 45432344 <NA>
# 10 1010 bxde <NA> <NA>
After some benchmarking, the base R solution seems to be around 5x faster than the dplyr one (timings below), so I suggest using the base R solution if time efficiency is your main concern.
microbenchmark::microbenchmark(
df %>% filter_at(vars(-ID), any_vars(grepl(paste(medications, collapse = "|"), .))),
df[apply(df[,-1], 1, function(x) {any(grepl(paste(medications, collapse = "|"), x))}),],
times = 100
)
# Unit: microseconds
# expr min lq mean median uq max neval
# df %>% filter_at(vars(-ID), any_vars(grepl(paste(medications, collapse = "|"), .))) 1958.4 1989.55 2146.993 2041.30 2149.05 7851.1 100
# df[apply(df[, -1], 1, function(x) { any(grepl(paste(medications, collapse = "|"), x)) }), ] 341.7 352.50 405.972 380.25 401.55 2154.0 100
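Since the question asks specifically for a "starts with" test, here is a sketch that uses base R's startsWith column by column instead of an unanchored grepl. The helper name starts_with_any is hypothetical, the data frame is retyped from the printed sample, and no timing claim is made for the 1-million-row case.

```r
# Data retyped from the question's printed sample
df <- data.frame(
  ID = 1001:1011,
  readcode_1 = c("bxd1", NA, NA, NA, "bxd4", NA, NA, NA, NA, "bxde", NA),
  readcode_2 = c("1146785342", NA, NA, NA, "45432344", NA, NA, NA, NA, NA, NA),
  generic_name = c("Simvastatin", NA, "Pravastatin", NA, NA, NA, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
medications <- c("bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin")

# Hypothetical helper: TRUE where x starts with at least one prefix.
# startsWith() returns NA for NA input, so NA is treated as "no match".
starts_with_any <- function(x, prefixes) {
  x <- as.character(x)
  hit <- Reduce(`|`, lapply(prefixes, function(p) startsWith(x, p)))
  !is.na(hit) & hit
}

# Check every column except ID, then combine the per-column results row-wise
keep <- Reduce(`|`, lapply(df[-1], starts_with_any, prefixes = medications))
statins <- df[keep, ]
statins
```

Working on whole columns avoids calling a function once per row, which is the part of apply(df, 1, ...) that tends to get slow on large data.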

How to spread two column dataframe with creating a unique identifier?

I am trying to spread two-column data into a format where there will be some NA values.
Data frame:
df <- data.frame(Names = c("TXT","LSL","TXT","TXT","TXT","USL","LSL"), Values = c("apple",-2,"orange","banana","pear",10,-1),stringsAsFactors = F)
Each TXT row starts a new record, and any LSL or USL rows that follow belong to that TXT row.
For example, in the first row the name is TXT and the value is apple. The next row is LSL, so its value is apple's LSL; since no USL follows before the next TXT, apple's USL will be NA.
If a TXT is followed by another TXT, then the LSL and USL values for that row will be NA.
I am trying to create a table with one row per TXT value and TXT, LSL and USL columns.
I tried using spread with row numbers as a unique identifier, but that's not what I want:
df %>% group_by(Names) %>% mutate(row = row_number()) %>% spread(key = Names,value = Values)
I guess I need to create a full table with NAs first and then spread it, but I couldn't figure out how.
We can expand the dataset with complete after creating a grouping index based on the occurrence of 'TXT'
library(dplyr)
library(tidyr)
df %>%
group_by(grp = cumsum(Names == 'TXT')) %>%
complete(Names = unique(.$Names)) %>%
ungroup %>%
spread(Names, Values) %>%
select(TXT, LSL, USL)
# A tibble: 4 x 3
# TXT LSL USL
# <chr> <chr> <chr>
#1 apple -2 <NA>
#2 orange <NA> <NA>
#3 banana <NA> <NA>
#4 pear -1 10
In data.table, we can use dcast:
library(data.table)
dcast(setDT(df), cumsum(Names == 'TXT')~Names, value.var = 'Values')[, -1]
# LSL TXT USL
#1: -2 apple <NA>
#2: <NA> orange <NA>
#3: <NA> banana <NA>
#4: -1 pear 10
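The grouping idea used in both answers (a running count of 'TXT' rows identifies each record) can also be sketched in base R with matrix indexing. This is an illustration only; the names grp, m and wide are made up here, and it assumes each record has at most one LSL and one USL.

```r
df <- data.frame(Names = c("TXT", "LSL", "TXT", "TXT", "TXT", "USL", "LSL"),
                 Values = c("apple", -2, "orange", "banana", "pear", 10, -1),
                 stringsAsFactors = FALSE)

# Each TXT row starts a new record; cumsum gives the record number per row
grp <- cumsum(df$Names == "TXT")

# Empty result: one row per record, one column per name
m <- matrix(NA_character_, nrow = max(grp), ncol = 3,
            dimnames = list(NULL, c("TXT", "LSL", "USL")))

# A two-column index matrix addresses (row, column) pairs directly
m[cbind(grp, match(df$Names, colnames(m)))] <- df$Values

wide <- as.data.frame(m, stringsAsFactors = FALSE)
wide
```

Cells that never receive a value stay NA, which covers records without LSL or USL rows.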

Separate columns from a data frame

I have a data frame, built by extracting data from text files, in which some columns contain more than one value.
I want to split the columns with more than one value into 2 columns.
I tried this code, but it generates an error:
db<-separate_rows(db,TYPE,CHRO,EX ,sep=",\\s+")
Error: All nested columns must have the same number of elements.
Note that sample data and expected output don't match; for example, there is no CHRO=c700 entry in your sample data. You also seem to be missing rows. Please check your input/expected output data.
You could use tidyr::separate_rows, e.g.
library(tidyr)
df %>%
separate_rows(TYPE, sep = ",") %>%
separate_rows(CHRO, sep = ",") %>%
separate_rows(EX, sep = ",")
# TYPE CHRO EX
#1 multiple c.211dup <NA>
#2 multiple c.3751dup <NA>
#3 multiple <NA> exon.2
#4 multiple <NA> exon.3
#5 multiple <NA> exon.7
#6 mitocondrial <NA> exon.3
#7 mitocondrial <NA> exon.7
#8 multifactorial <NA> <NA>
Or perhaps use splitstackshape
library(splitstackshape)
library(dplyr)
library(tidyr)
df %>%
cSplit(names(df), direction = "long") %>%
fill(TYPE) %>%
group_by_at(names(df)) %>%
slice(1)
# TYPE CHRO EX
# <fct> <fct> <fct>
#1 mitocondrial NA exon.7
#2 multifactorial NA NA
#3 multiple c.211dup NA
#4 multiple c.3751dup NA
#5 multiple NA exon.2
#6 multiple NA exon.3
#7 multiple NA NA
Note that results are different because the order of separating columns matters.
Sample data
df <- read.table(text =
"TYPE CHRO EX
multiple 'c.211dup, c.3751dup' NA
multiple NA exon.2
multiple,mitocondrial NA exon.3,exon.7
multifactorial NA NA", header = T)
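To make the column-by-column splitting concrete, here is a base R sketch of what separating one column into rows amounts to. The helper split_rows is hypothetical and written only for this example; it assumes the columns are character (not factor) and that no cell is an empty string. The data frame is retyped from the sample data above.

```r
# Hypothetical helper: repeat each row once per piece of `col`
split_rows <- function(data, col, sep = ",\\s*") {
  pieces <- strsplit(as.character(data[[col]]), sep)  # NA stays a single NA piece
  out <- data[rep(seq_len(nrow(data)), lengths(pieces)), , drop = FALSE]
  out[[col]] <- unlist(pieces)
  rownames(out) <- NULL
  out
}

df <- data.frame(
  TYPE = c("multiple", "multiple", "multiple,mitocondrial", "multifactorial"),
  CHRO = c("c.211dup, c.3751dup", NA, NA, NA),
  EX   = c(NA, "exon.2", "exon.3,exon.7", NA),
  stringsAsFactors = FALSE
)

# Split one column at a time, so the columns may hold different numbers of values
long <- split_rows(split_rows(split_rows(df, "TYPE"), "CHRO"), "EX")
long
```

Because each column is expanded independently, a row with two TYPE values and two EX values produces all four combinations.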

R: Cleaning up a wide and untidy dataframe

I have a data frame that looks like:
d<-data.frame(id=(1:9),
grp_id=(c(rep(1,3), rep(2,3), rep(3,3))),
a=rep(NA, 9),
b=c("No", rep(NA, 3), "Yes", rep(NA, 4)),
c=c(rep(NA,2), "No", rep(NA,6)),
d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)),
e=c(rep(NA, 7), "No", NA),
f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
> d
id grp_id a b c d e f
1 1 1 NA No <NA> <NA> <NA> <NA>
2 2 1 NA <NA> <NA> <NA> <NA> No
3 3 1 NA <NA> No <NA> <NA> <NA>
4 4 2 NA <NA> <NA> Yes <NA> <NA>
5 5 2 NA Yes <NA> <NA> <NA> <NA>
6 6 2 NA <NA> <NA> <NA> <NA> No
7 7 3 NA <NA> <NA> No <NA> <NA>
8 8 3 NA <NA> <NA> <NA> No <NA>
9 9 3 NA <NA> <NA> <NA> <NA> No
Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.
I'd like to create a single row for each grp_id to get a data frame that looks like the following:
grp_id a b c d e f
1 NA No No <NA> <NA> No
2 NA Yes <NA> Yes <NA> No
3 NA <NA> <NA> No No No
I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be
d %>%
group_by(grp_id) %>%
summarise()
I would appreciate help with the commands within summarise, or any solution really. Thanks.
We can use summarise_at and subset the first non-NA element
library(dplyr)
d %>%
group_by(grp_id) %>%
summarise_at(2:7, funs(.[!is.na(.)][1]))
# A tibble: 3 x 7
# grp_id a b c d e f
# <dbl> <lgl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 1 NA No No <NA> <NA> No
#2 2 NA Yes <NA> Yes <NA> No
#3 3 NA <NA> <NA> No No No
In the example dataset, columns 'b' to 'f' are factors (column 'a' is logical), and some of them have only the 'No' level. To standardize the columns so that they all share the same levels, call factor with levels specified as c('Yes', 'No') inside summarise_at, i.e. summarise_at(2:7, funs(factor(.[!is.na(.)][1], levels = c('Yes', 'No'))))
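In recent dplyr versions, funs() is deprecated and summarise_at() is superseded by across(). A sketch of the same "first non-NA value per group" idea, assuming dplyr 1.0.0 or later and with stringsAsFactors = FALSE added so the result does not depend on the R version:

```r
library(dplyr)

d <- data.frame(id = 1:9,
                grp_id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                a = rep(NA, 9),
                b = c("No", rep(NA, 3), "Yes", rep(NA, 4)),
                c = c(rep(NA, 2), "No", rep(NA, 6)),
                d = c(rep(NA, 3), "Yes", rep(NA, 2), "No", rep(NA, 2)),
                e = c(rep(NA, 7), "No", NA),
                f = c(NA, "No", rep(NA, 3), "No", rep(NA, 2), "No"),
                stringsAsFactors = FALSE)

# For each group and each of a:f, keep the first non-NA value
# (indexing an empty vector with [1] gives NA when the group has none)
res <- d %>%
  group_by(grp_id) %>%
  summarise(across(a:f, ~ .x[!is.na(.x)][1]))
res
```

Selecting a:f by name rather than by position 2:7 also avoids having to work out how the grouping column shifts the positions.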
We can use aggregate. No packages are used.
YN <- function(x) c(na.omit(as.character(x)), NA)[1]
aggregate(d[3:8], d["grp_id"], YN)
giving:
## grp_id a b c d e f
## 1 1 <NA> No No <NA> <NA> No
## 2 2 <NA> Yes <NA> Yes <NA> No
## 3 3 <NA> <NA> <NA> No No No
The above gives character columns. If you prefer factor columns then use this:
YNfac <- function(x) factor(YN(x), c("No", "Yes"))
aggregate(d[3:8], d["grp_id"], YNfac)
Note: Other alternate implementations of YN are:
YN <- function(x) sort(as.character(x), na.last = TRUE)[1]
YN <- function(x) if (all(is.na(x))) NA_character_ else na.omit(as.character(x))[1]
library(zoo)
YN <- function(x) na.locf0(as.character(x), fromLast = TRUE)[1]
You've received some good answers but neither of them actually uses the tidyr package. (The summarize() and summarize_at() family of functions is from dplyr.)
In fact, a tidyr-only solution for your problem is very doable.
d %>%
gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
na.omit() %>%
select(-id) %>%
spread(col, value, fill=NA, drop=FALSE)
The only hard part is ensuring that you get the a column in your output. For your example data, it is entirely NA. The trick is the factor_key=TRUE argument to gather() and the drop=FALSE argument to spread(). Without those two arguments being set, the output would not have an a column, and would only have columns with at least one non-NA entry.
Here's a description of how it works:
gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
This tidies your data: it effectively replaces columns a - f with new columns col and value, forming a long-formatted "tidy" data frame. The entries in the col column are letters a - f. And because we've used factor_key=TRUE, this column is a factor with levels, not just a character vector.
na.omit() %>%
This removes all the NA values from the long data.
select(-id) %>%
This eliminates the id column.
spread(col, value, fill=NA, drop=FALSE)
This re-widens the data, using the values in the col column to define new column names, and the values in the value column to fill in the entries of the new columns. When data is missing, a value of fill (here NA) is used instead. And the drop=FALSE means that when col is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with setting col to be a factor, is what gets a as an output column.
I personally find this approach more readable than the approaches requiring subsetting or lapply stuff. Additionally, this approach will fail if your data is not actually one-hot, whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columns a - f are not factors, but character vectors. If you need factor output you should be able to do (untested)
mutate(value = factor(value, levels=c('Yes', 'No', NA))) %>%
anywhere between the gather() and spread() functions to ensure factor output.
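Since the mutate() line above is marked untested, here is a small base R check of how factor() treats an NA listed in levels. It is only a sketch on made-up values; by default factor() excludes NA from the levels, so listing it should make no difference.

```r
value <- c("Yes", NA, "No", "No")

# NA listed in levels, as in the untested line
f <- factor(value, levels = c("Yes", "No", NA))

# NA left out of levels
g <- factor(value, levels = c("Yes", "No"))

levels(f)      # only "Yes" and "No" are kept as levels
as.integer(f)  # the missing entry stays NA in the codes
```

To keep NA as an explicit level instead, factor() has an exclude argument (exclude = NULL).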

r matrix columns filling incorrectly

Trying to re-organise a data set (sdevDFC) into a matrix with my latitude (Lat) as row names and longitude (Lon) as column names, then filling in the matrix with values respective to the coordinates.
stand_dev_m <- matrix(data=sdevDFC$SDev, nrow=length(sdevDFC$Lat), ncol=length(sdevDFC$Lon), byrow=TRUE, dimnames = list(sdevDFC$Lat, sdevDFC$Lon))
The column and row names appear as they should, but my data fills in so that all values in their respective columns are identical, as shown in the image (which should not be the case, as none of my values ever repeat).
I've filled it with byrow = FALSE to see if it also occurred then (it does), and I've also used colnames and rownames instead of dimnames (changes nothing).
Would appreciate any insight into what I may be doing wrong here. I'm also new to this platform, so I apologise if I've missed a guideline or another question that's similar.
Example data:
df <- data.frame(LON=1:5,
LAT=11:15,
VAL=letters[1:5],
stringsAsFactors=F)
You could try the following:
library(dplyr)
library(tidyr)
rn <- df$LON # Save what-will-be-rownames
df1 <- df %>%
spread(LAT,VAL,fill=NA) %>%
select(-LON) %>%
setNames(., df$LAT)
rownames(df1) <- rn
Output
11 12 13 14 15
1 a <NA> <NA> <NA> <NA>
2 <NA> b <NA> <NA> <NA>
3 <NA> <NA> c <NA> <NA>
4 <NA> <NA> <NA> d <NA>
5 <NA> <NA> <NA> <NA> e
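A likely cause of the repeated values in the question: matrix() fills cells in order from the data vector and recycles it when nrow * ncol is larger than its length, so setting both dimensions to the number of observations repeats the whole vector. A base R sketch of placing each value at its own coordinates instead, using the example data above with latitude as rows and longitude as columns (the names lats, lons and m are made up here):

```r
df <- data.frame(LON = 1:5,
                 LAT = 11:15,
                 VAL = letters[1:5],
                 stringsAsFactors = FALSE)

# One row per distinct latitude, one column per distinct longitude
lats <- sort(unique(df$LAT))
lons <- sort(unique(df$LON))
m <- matrix(NA_character_, nrow = length(lats), ncol = length(lons),
            dimnames = list(as.character(lats), as.character(lons)))

# Place each value at its own (lat, lon) cell; all other cells stay NA
m[cbind(match(df$LAT, lats), match(df$LON, lons))] <- df$VAL
m
```

If the same coordinate pair occurs more than once, the last value assigned wins, so duplicates would need to be aggregated first.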
