How to collapse rows of data - r

I'm new to R and data.table, and I'm trying to collapse a customer data set that takes the following format, although the real data extends across 90 columns:
frame <- data.frame(
customer_id = c(123, 123, 123),
time = c(1, 2, 3),
rec_type = c('contact', 'appointment', 'sale'),
variable_1 = c('Yes', NA, "Yes"),
variable_2 = c(NA, 'No', NA),
variable_3 = c(NA, NA, 'complete'),
variable_4 = NA, stringsAsFactors = FALSE)
customer_id time rec_type variable_1 variable_2 variable_3 variable_4
123 1 contact Yes NA NA NA
123 2 appointment NA No NA NA
123 3 sale Yes NA complete NA
I asked before - What's the best way to collapse sparse data into single rows in R? - how to collapse the data for each customer into a single row and got two useful answers in data.table and dplyr.
However, those answers couldn't handle fields with multiple values, such as 'rec_type', or fields where the same value occurs more than once, such as variable_1.
I'd like to lapply a function which works across columns and returns a row vector in which each field is either the single value for that field, NA if all values in the column are missing, or 'multiple'.
In this case, my output would be:
customer_id time rec_type variable_1 variable_2 variable_3 variable_4
123 multiple multiple Yes No complete NA
I worked out how to count the unique values across columns:
unique_values <- function(x){
uniques <- dt[contact_no == x,][,lapply(.SD, uniqueN)]
uniques
}
lapply(dt$contact_no, unique_values)
But I couldn't work out how to use the results from uniques to return the results I'd like.
Can anyone suggest an approach I can use?
Is there a simpler way of tackling the problem?

Here is one data.table method.
setDT(frame)[, lapply(.SD, function(x)
{x <- unique(x[!is.na(x)])
if(length(x) == 1) as.character(x)
else if(length(x) == 0) NA_character_
else "multiple"}),
by=customer_id]
The idea is to use lapply to apply an anonymous function to all variables and to construct the function so that it returns the desired results. This function strips out NA values and duplicates and then checks the length of the resulting vector. Each result is cast to character so that the column type stays consistent across groups, since "multiple" may occur for another customer_id.
This returns
customer_id time rec_type variable_1 variable_2 variable_3 variable_4
1: 123 multiple multiple Yes No complete NA
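The same per-column rule can also be sketched in base R, which may help when checking the logic without data.table. The helper name collapse_col and the split/rbind wrapper are assumptions made for illustration, not part of the answer above.

```r
# Per-column rule: single value, NA if nothing is present, otherwise "multiple".
# collapse_col is a hypothetical helper name.
collapse_col <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 1) {
    as.character(x)
  } else if (length(x) == 0) {
    NA_character_
  } else {
    "multiple"
  }
}

frame <- data.frame(
  customer_id = c(123, 123, 123),
  time = c(1, 2, 3),
  rec_type = c("contact", "appointment", "sale"),
  variable_1 = c("Yes", NA, "Yes"),
  variable_2 = c(NA, "No", NA),
  variable_3 = c(NA, NA, "complete"),
  variable_4 = NA, stringsAsFactors = FALSE)

value_cols <- setdiff(names(frame), "customer_id")

# One collapsed row per customer, then bind the rows back together.
pieces <- lapply(split(frame, frame$customer_id), function(d) {
  data.frame(customer_id = d$customer_id[1],
             lapply(d[value_cols], collapse_col),
             stringsAsFactors = FALSE)
})
out <- do.call(rbind, pieces)
out
```

For 90 columns and many customers the data.table version is likely to be faster; this sketch is only meant to make the rule easy to test on its own.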

Related

How to replace values in specific rows and columns with NA using a reference table?

I need to set values to NA for specific columns and rows using a separate reference table and am not sure how.
I need to set the value to NA if the column name (field name) is not associated with the value in the column called event, using a reference table of event-fields mappings.
This is a simplified example data of what I've got. For the real data, I've ~900 rows and >300 columns to replace NAs in, and the columns are of different types.
df <- tibble::tribble(
~event, ~drug, ~status,
"referral", "drugA", 0,
"therapy", "drugA", 1
)
I have a reference table like below that says what fields are associated with each event.
event_fields <- tibble::tribble(
~unique_event_name, ~field_name,
"referral", "record_id",
"referral", "casetype",
"therapy", "drug",
"therapy", "status"
)
The output I'm trying to get is below. For example, drug and status are not fields associated with the referral event in the event_fields table above, so they should be set to NA.
desired_result <- tibble::tribble(
~event_name, ~drug, ~status,
"referral", NA, NA,
"therapy", "drugA", 1
)
One thing I've tried is below (based on Replace multiple values in a dataframe with NA based on conditions given in another dataframe in R, the closest question I could find) but it doesn't work. I'm not sure how to use the event value for each row e.g. "referral" and the name of the field column e.g. "drug" in the filter() or if there's a better way to do this.
library(tidyverse)
df %>% mutate(across(drug:status,
~ replace(., !cur_column() %in%
event_fields %>% filter(unique_event_name == event) %>% pull(field_name),
NA) ))
which gives the error
Error: Problem with `mutate()` input `..1`.
ℹ `..1 = across(...)`.
x no applicable method for 'filter' applied to an object of class "logical"
Any help will be highly appreciated!
For each column, look up the 'unique_event_name' whose 'field_name' in the reference table matches the column name (cur_column()), then replace the value with NA on every row where 'event' differs from that name:
library(dplyr)
df %>%
mutate(across(drug:status, ~ replace(.,
event != event_fields$unique_event_name[
event_fields$field_name == cur_column()], NA)))
-output
# A tibble: 2 × 3
event drug status
<chr> <chr> <dbl>
1 referral <NA> NA
2 therapy drugA 1
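If a field can belong to more than one event in the real reference table, comparing with != against a single looked-up name may not be enough. A minimal base-R sketch of the same lookup using %in% follows; it uses plain data frames instead of tibbles and assumes every column other than event should be checked.

```r
df <- data.frame(event = c("referral", "therapy"),
                 drug = c("drugA", "drugA"),
                 status = c(0, 1),
                 stringsAsFactors = FALSE)

event_fields <- data.frame(
  unique_event_name = c("referral", "referral", "therapy", "therapy"),
  field_name = c("record_id", "casetype", "drug", "status"),
  stringsAsFactors = FALSE)

result <- df
for (col in setdiff(names(df), "event")) {
  # All events that are allowed to carry this field (may be zero, one or many).
  allowed <- event_fields$unique_event_name[event_fields$field_name == col]
  # Blank the field on rows whose event is not in that set.
  result[[col]][!(df$event %in% allowed)] <- NA
}
result
```

Because each column is assigned separately, columns of different types keep their own type.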
You may try this solution. Although it works on the toy example this might still fail depending on your real data.
The idea is to find the non-matching fields and replace them with NA.
First find the non-matching rows, then select the corresponding columns.
desired_result <- df
desired_result[ df$event != unique(
event_fields$unique_event_name[ event_fields$field_name %in% colnames( df )]
), na.omit( match( event_fields$field_name, colnames( df ) ) ) ] <- NA
desired_result
# A tibble: 2 x 3
event drug status
<chr> <chr> <dbl>
1 referral NA NA
2 therapy drugA 1

How to reshape dataframe and transpose recurring columns to dataframe rows?

I have a dataframe that has recurring columns (the interval is 5).
[Image: my dataframe at the moment]
So this is how it looks: I have 5 types of columns, and they repeat over and over. The recurring columns have a suffix in their names, which can be removed or renamed so that they all match.
What I would like to do is to transpose these recurring columns to rows, so that I would have only 5 columns in the end (Dates, PX_LAST, PX_HIGH, PX_VOLUME, Name). Then I would be able to group the dataframe by Dates, Name etc and do many other things.
I tried some manipulations with the pipe operator %>%, but they didn't really work. Since I don't have any ideas left, I thought that maybe you could help me out.
Thanks in advance!
One option would be to split the data into a list of data.frame based on the column names and then rbind them together
nm1 <- sub("\\.\\d+", "", names(dft))
i1 <- ave(seq_along(dft), nm1, FUN = seq_along)
out <- do.call(rbind, lapply(split.default(dft, i1),
function(x) setNames(x, sub("\\.\\d+", "", names(x)))))
row.names(out) <- NULL
out
# Date Age
#1 1 21
#2 2 15
#3 1 32
#4 2 12
Or another option is to loop through the unique names, subset the data, unlist, and convert to data.frame
un1 <- unique(nm1)
setNames(data.frame(lapply(un1,
function(x) unlist(dft[grep(x, names(dft))]))), un1)
data
dft <- data.frame("Date" = 1:2, "Age" = c(21,15), "Date" = 1:2, "Age" = c(32,12))
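Applied to column names closer to those in the question, the split-and-rbind approach might look like the sketch below. The data frame wide and its values are made up for illustration, and the suffix pattern is anchored with $ on the assumption that the repeat counter always comes at the end of the name.

```r
# Hypothetical data: two repeats of the same two columns.
wide <- data.frame(Dates = 1:2, PX_LAST = c(10, 11),
                   Dates.1 = 1:2, PX_LAST.1 = c(20, 21))

# Strip the trailing ".<number>" so repeated columns share a name.
base_names <- sub("\\.\\d+$", "", names(wide))

# Number the occurrences of each name: 1 for the first block, 2 for the second.
block <- ave(seq_along(wide), base_names, FUN = seq_along)

# Split the columns by block, rename each block, and stack the blocks.
stacked <- do.call(rbind, lapply(split.default(wide, block), function(x) {
  setNames(x, sub("\\.\\d+$", "", names(x)))
}))
row.names(stacked) <- NULL
stacked
```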

Custom data-dependent recoding to logicals in R

I have two data frames, data and meta. Some, but not all, columns in data are logical values, but they are coded in many different ways. The rows in meta describe the columns in data, indicate whether they are to be interpreted as logicals, and if so, what single value codes TRUE and what single value codes FALSE.
I need a procedure that replaces all data values in conceptually logical columns with the appropriate logical values from the codes in the corresponding meta row. Any data values in a conceptually logical column that do not match a value in the corresponding meta row should become NA.
Small toy example for meta:
name type false true
-----------------------------------------
a.char.var char NA NA
a.logical.var logical NA 7
another.logical.var logical 1 0
another.char.var char NA NA
Small toy example for data:
a.char.var a.logical.var another.logical.var another.char.var
----------------------------------------------------------------
aa 7 0 ba
ab NA 1 bb
ac 7 NA bc
ad 4 3 bd
Small toy example output:
a.char.var a.logical.var another.logical.var another.char.var
----------------------------------------------------------------
aa TRUE TRUE ba
ab FALSE FALSE bb
ac TRUE NA bc
ad NA NA bd
I cannot, for the life of me, find a way to do this in idiomatic R that handles all the corner cases. The data sets are large, so an idiomatic solution would be ideal if possible. I inherited this absolutely insane data management mess and will be grateful to anybody who can help fix it. I am by no means an R guru, but this seems like a deceptively difficult problem.
First we set up the data
meta <- data.frame(name=c('a.char.var', 'a.logical.var', 'another.logical.var', 'another.char.var'),
type=c('char', 'logical', 'logical', 'char'),
false=c(NA, NA, 1, NA),
true=c(NA, 7, 0, NA), stringsAsFactors = F)
data <- data.frame(a.char.var=c('aa', 'ab', 'ac', 'ad'),
a.logical.var=c(7, NA, 7, 4),
another.logical.var=c(0,1,NA,3),
another.char.var=c('ba', 'bb', 'bc', 'bd'), stringsAsFactors = F)
Then we subset out just the logical columns. We will iterate through these, using the name column to select the relevant column in data, and change values in data_out from an initialized NA to either T or F according to matching values in data.
Note that data[,logical_meta$name[1]] is equivalent to data[,'a.logical.var'] or data$a.logical.var, if logical_meta$name is a character. If it's a factor (eg if we didn't specify stringsAsFactors=F) we need to convert to character at which point we might as well give it a name - colname below.
Having NAs to contend with means using which is advantageous: c(0, 1,NA,3)==0 returns T,F,NA,F, but which then ignores the NA and returns just the position 1. Subsetting by a logical vector that includes NAs yields NA rows or columns; using which avoids this.
logical_meta <- meta[meta$type=='logical',]
data_out <- data #initialize
for(i in seq_len(nrow(logical_meta))) { #seq_len is safe when there are no logical columns
colname <- as.character(logical_meta$name[i]) #only need as.character if factor
data_out[,colname] <- NA
#false column first
if(is.na(logical_meta$false[i])) {
data_out[is.na(data[,colname]),colname] <- FALSE
} else {
data_out[which(data[,colname]==logical_meta$false[i]),
colname] <- FALSE
}
#true column next
if(is.na(logical_meta$true[i])) {
data_out[is.na(data[,colname]),colname] <- TRUE
} else {
data_out[which(data[,colname]==logical_meta$true[i]),
colname] <- TRUE
}
}
data_out
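The point about which and NA made above can be checked in isolation with a few lines; the variable names here are only for illustration.

```r
x <- c(0, 1, NA, 3)

cmp <- x == 0            # logical vector that contains an NA
pos <- which(cmp)        # positions where the comparison is TRUE; the NA is dropped

by_logical <- x[cmp]     # the NA in cmp produces an extra NA element
by_position <- x[pos]    # only the genuine match is selected

cmp
pos
by_logical
by_position
```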
I've written a function that takes in the column index of data and tries to perform the operation you described.
The function first selects x as the column we are interested in. We then match the name of the column in data to the entries in the first column of meta, this gives our row of interest.
We then check if the column type is logical, if it isn't we just return x, nothing needed to be changed. If the column type is logical we then check whether its values match the true or false columns in meta.
convert_data <- function(colindex, dat, meta){ #no default for meta: meta = meta would refer to itself
x <- dat[,colindex] #select our data vector
#match the column name to the first column in meta
find_in_meta <- match(names(dat)[colindex],
meta[,1])
#what type of column is it
type_col <- meta[find_in_meta,2]
if(type_col != 'logical'){
return(x)
}else{
#fix if logical is NA
true_val <- ifelse(is.na(meta[find_in_meta,4]),'NA_val',
meta[find_in_meta,4])
#fix if logical is NA
false_val <- ifelse(is.na(meta[find_in_meta,3]), 'NA_val',
meta[find_in_meta, 3])
#fix if logical is NA
x <- ifelse(is.na(x), 'NA_val', x)
x <- ifelse(x == true_val, TRUE,
ifelse(x == false_val, FALSE, NA))
return(x)
}
}
We can then use lapply and a little data manipulation to get it into an acceptable form:
res <- lapply(1:ncol(df1), function(ind)
convert_data(colindex = ind, dat = df1, meta = meta))
setNames(do.call('cbind.data.frame', res), names(df1))
a.char.var a.logical.var another.logical.var another.char.var
1 aa TRUE TRUE ba
2 ab FALSE FALSE bb
3 ac TRUE NA bc
4 ad NA NA bd
data
meta <- structure(list(name = c("a.char.var", "a.logical.var", "another.logical.var",
"another.char.var"), type = c("char", "logical", "logical", "char"
), false = c(NA, NA, 1L, NA), true = c(NA, 7L, 0L, NA)), .Names = c("name",
"type", "false", "true"), class = "data.frame", row.names = c(NA,
-4L))
df1 <- structure(list(a.char.var = c("aa", "ab", "ac", "ad"), a.logical.var = c(7L,
NA, 7L, 4L), another.logical.var = c(0L, 1L, NA, 3L), another.char.var = c("ba",
"bb", "bc", "bd")), .Names = c("a.char.var", "a.logical.var",
"another.logical.var", "another.char.var"), class = "data.frame", row.names = c(NA,
-4L))
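Both answers apply the same per-column rule, so it can also be written as one small helper and applied to each logical column. This is only a sketch: recode_logical and is_code are hypothetical names, and it assumes that an NA in the true or false column of meta means "a missing value in the data codes this state", as in the toy example.

```r
meta <- data.frame(
  name = c("a.char.var", "a.logical.var", "another.logical.var", "another.char.var"),
  type = c("char", "logical", "logical", "char"),
  false = c(NA, NA, 1, NA),
  true = c(NA, 7, 0, NA), stringsAsFactors = FALSE)

data <- data.frame(
  a.char.var = c("aa", "ab", "ac", "ad"),
  a.logical.var = c(7, NA, 7, 4),
  another.logical.var = c(0, 1, NA, 3),
  another.char.var = c("ba", "bb", "bc", "bd"), stringsAsFactors = FALSE)

# TRUE where v matches the code; an NA code matches missing values.
is_code <- function(v, code) {
  if (is.na(code)) is.na(v) else !is.na(v) & v == code
}

# Start from all NA, then fill in FALSE and TRUE where the codes match.
recode_logical <- function(x, true_code, false_code) {
  out <- rep(NA, length(x))
  out[is_code(x, false_code)] <- FALSE
  out[is_code(x, true_code)] <- TRUE
  out
}

logical_rows <- meta[meta$type == "logical", ]
for (i in seq_len(nrow(logical_rows))) {
  col <- logical_rows$name[i]
  data[[col]] <- recode_logical(data[[col]],
                                logical_rows$true[i], logical_rows$false[i])
}
data
```

Each column is recoded with vectorised comparisons, so the loop runs once per logical column rather than once per row.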

How to add value to column when data frame is empty in r

I have a data frame that I have to initialize as an empty data frame.
Later, only one column becomes available, and I want to add it to the empty data frame. How can I do it? I won't know the length of the column in advance.
Example
df = data.frame(a= NA, b = NA, col1= NA)
....
nrow(col1) # Here I will know length of column, and I only have one column available here.
df$col1 <- col1
The error is as follows:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c("1", :
replacement has 5 rows, data has 1
Any help would be greatly appreciated.
Use cbind:
df = data.frame(a= NA, b = NA)
col1 <- c(1,2,3,4,5)
df <- cbind(df, col1)
# a b col1
# 1 NA NA 1
# 2 NA NA 2
# 3 NA NA 3
# 4 NA NA 4
# 5 NA NA 5
After your edits, you can still use cbind, but you'll need to drop the existing column first (or handle the duplicate columns after the cbind)
cbind(df[, 1:2], col1)
## or if you don't know the column indices
## cbind(df[, !names(df) %in% c("col1")], col1)
A little workaround with lists:
l <- list(a=NA, b=NA, col1=NA)
col1 <- c(1,2,3)
l$col1 <- col1
df <- as.data.frame(l)
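To see why the direct assignment fails and one more way around it, here is a small sketch. It assumes the placeholder data frame has a single row of NA values, as in the question, and grows it to the length of the new column before assigning.

```r
df <- data.frame(a = NA, b = NA, col1 = NA)
col1 <- c(1, 2, 3, 4, 5)

# Direct assignment fails because the data frame has 1 row and col1 has 5 values.
msg <- tryCatch({
  df$col1 <- col1
  "ok"
}, error = function(e) conditionMessage(e))
msg

# Repeat the single placeholder row to the required length, then assign.
df <- df[rep(1, length(col1)), ]
row.names(df) <- NULL
df$col1 <- col1
df
```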
I like both answers, from Symbolix and maRtin, but I ended up with my own hack:
df[1:length(a),"a"] = a
However, I am not sure how efficient this method is in terms of time. What would its big-O time complexity be?

Missing values in nested ifelse statements in R

At some point in time, I encountered this problem...and solved it. However, as it is a recurring problem and I've now forgotten the solution, hopefully this question will offer clarification to others as well as me :)
I am creating a variable that is based on answers to several questions. Each question can have three values: 1, 2, or NA. 1's and 2's are mutually exclusive for each observation.
I simply want to create a variable that is a composite of the choice coded with "1" for each person, and give it a value based on that code.
So let's say I have this df:
ID var1 var2 var3 var4
1 1 2 NA NA
2 NA NA 2 1
3 2 1 NA NA
4 2 NA 1 NA
I then try to recode based on the following statement:
df$var <-
ifelse(
as.numeric(df$var1) == 1,
"Gut instinct",
ifelse(
as.numeric(df$var2) == 1,
"Data",
ifelse(
as.numeric(df$var3) == 1,
"Science",
ifelse(
as.numeric(df$var4) == 1,
"Philosophy",
NA
)
)
)
)
However, this code only PARTIALLY codes based on the "ifelse". For example, df$var might have observation of 'Gut instinct' and 'Philosophy', but the codings for when var2 and var3==1 are still NA.
Any thoughts on why this might be happening?
An alternative that will be quicker than apply (using @MrFlick's data):
vals <- c("Gut", "Data", "Science", "Phil")
intm <- dd[-1]==1 & !is.na(dd[-1])
dd$resp <- NA
dd$resp[row(intm)[intm]] <- vals[col(intm)[intm]]
How much quicker? On 1 million rows:
#row/col assignment
user system elapsed
0.99 0.02 1.02
#apply
user system elapsed
11.98 0.04 12.30
And giving the same results when tried on identical datasets:
identical(flick$resp,latemail$resp)
#[1] TRUE
This is because ifelse (and ==) has special behavior for NA. Specifically, R doesn't want to tell you that NA is different from 1 (or anything else), because often NA is used to represent a value that could be anything, maybe even 1.
> 1 == NA
[1] NA
> ifelse(NA == 1, "yes", "no")
[1] NA
With your code, if an NA occurs before a 1 (as for ID 2), the outer ifelse returns NA for that row, and the value from the nested ifelse in the "no" branch is never used.
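A small sketch of that behaviour, using only two of the columns to keep it short. Testing with %in% instead of == is one way (an assumption here, not something the question used) to turn NA into FALSE so that the nested branch is reached.

```r
var1 <- c(1, NA, 2, 2)
var4 <- c(NA, 1, NA, NA)

# == propagates NA: row 2 has NA in var1, so the outer test is NA
# and the result for that row is NA even though var4 is 1.
naive <- ifelse(var1 == 1, "Gut instinct",
                ifelse(var4 == 1, "Philosophy", NA))

# %in% never returns NA, so row 2 falls through to the nested test.
safe <- ifelse(var1 %in% 1, "Gut instinct",
               ifelse(var4 %in% 1, "Philosophy", NA))

naive
safe
```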
Here's a way to do this without the nested ifelse statements
#your data
dd<-data.frame(ID = 1:4,
var1 = c(1, NA, 2, 2),
var2 = c(2, NA, 1, NA),
var3 = c(NA, 2, NA, 2),
var4 = c(NA, 1, NA, NA)
)
resp <- c("Gut","Data","Sci","Phil")[apply(dd[,-1]==1,1,function(x) which(x)[1])]
cbind(dd, resp)
I use apply to scan across the rows to find the first 1 and use that index to subset the response values. Using which helps to deal with the NA values.
To answer your question, it is due to the NAs in your data. This should sort your problem out:
df <- data.frame( ID=1:4, var1= c(1, NA, 2, 2), var2= c(2, NA, 1, NA),
var3=c(NA,2,NA,2), var4=c(NA, 1, NA, NA))
df$var<-ifelse(as.numeric(df$var1)==1&!is.na(df$var1),"Gut instinct",
ifelse(as.numeric(df$var2)==1&!is.na(df$var2),"Data",
ifelse(as.numeric(df$var3)==1&!is.na(df$var3),"Science",
ifelse(as.numeric(df$var4)==1&!is.na(df$var4),"Philosophy",NA))))
However, I would find it easier to reshape the data into long format rather than a wide table and work on a single value column.
data <- df
library(reshape2)
long <- melt(data, id.vars="ID")
long
This gives you a long-format data frame with one row per ID and variable. Convert the variable names to something more meaningful.
library(stringr)
long$variable <- str_replace(long$variable, "var1", "Gut Instinct")
long$variable <- str_replace(long$variable, "var2", "Data")
long$variable <- str_replace(long$variable, "var3", "Science")
long$variable <- str_replace(long$variable, "var4", "Philosophy")
And now you can decide what to do based on each result
long$var <- ifelse(long$value==1, long$variable, NA)
and convert it back to something like the original if you want it that way
reshape(data=long, timevar="ID",idvar=c("var", "variable"), v.names = "value", direction="wide")
HTH

Resources