data.table::dcast long to wide while ignoring the NA category?

I want to transform my data from long to wide after some joins, which leave a few NAs in the data.
Unfortunately, these NAs also occur in the variable on the right-hand side (RHS) of the dcast formula, which defines the newly created columns.
Consider this example:
library(data.table)
dt <- data.table(id = c(1, 2, 1, 2, 3, 4),
                 group = c("A", "A", "B", "B", NA, NA),
                 values = c(7, 8, 9, 10, NA, NA))
dt_wide <- dcast(dt,
                 id ~ group,
                 value.var = "values")
In the data, rows 5 and 6 do not have any group or associated value:
id group values
1: 1 A 7
2: 2 A 8
3: 1 B 9
4: 2 B 10
5: 3 <NA> NA
6: 4 <NA> NA
If there is an associated value, a group does exist; in other words, is.na(group) implies is.na(values).
The transformation wrongly treats NA as a group of its own, which results in the following wide data.table:
id NA A B
1: 1 NA 7 9
2: 2 NA 8 10
3: 3 NA NA NA
4: 4 NA NA NA
I would prefer not to build a possibly buggy workaround that retroactively deletes the NA column by name or by its values, as the code will have to handle different column names and contents later in production.
Is there a way to tell dcast to ignore the NAs in group and not make an extra column out of them, while preserving all rows in the transformed table?
Like this:
id A B
1: 1 7 9
2: 2 8 10
3: 3 NA NA
4: 4 NA NA

This is tricky, but seems to work:
dcast(dt,
      id ~ ifelse(is.na(group), unique(na.omit(dt$group)), group),
      value.var = "values")
Key: <id>
id A B
<num> <num> <num>
1: 1 7 9
2: 2 8 10
3: 3 NA NA
4: 4 NA NA

I don't think it's possible to prevent dcast from doing that. I'd just drop the NA column afterwards:
dt_wide[, names(dt_wide) != "NA", with = FALSE]
Output:
id A B
1: 1 7 9
2: 2 8 10
3: 3 NA NA
4: 4 NA NA
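A third angle, not shown in the answers above but worth sketching: cast only the rows with a known group, then join the result back onto the full set of ids so that ids 3 and 4 survive. A minimal sketch, assuming id is the only key that needs preserving:
# cast without the NA rows, then re-attach all ids via a join
dt_wide <- dcast(dt[!is.na(group)], id ~ group, value.var = "values")
dt_wide[unique(dt[, .(id)]), on = "id"]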

Related

Appending CSVs with different column counts and spellings

Nothing too complicated: I just want to use rbindlist on a large number of CSVs whose column names change a little over time (minor spelling changes), whose column order stays the same, and which at some point gain two additional columns (that I don't really need).
library(data.table)
csv1 <- data.table("apple" = 1:3, "orange" = 2:4, "dragonfruit" = 13:15)
csv2 <- data.table("appole" = 7:9, "orangina" = 6:8, "dragonificfruit" = 2:4, "pear" = 1:3)
l <- list(csv1, csv2)
When I run
csv_append <- rbindlist(l, fill=TRUE) #which also forces use.names=TRUE
it gives me a data.table with 7 columns
apple orange dragonfruit appole orangina dragonificfruit pear
1: 1 2 13 NA NA NA NA
2: 2 3 14 NA NA NA NA
3: 3 4 15 NA NA NA NA
4: NA NA NA 7 6 2 1
5: NA NA NA 8 7 3 2
6: NA NA NA 9 8 4 3
as opposed to what I want, which is:
V1 V2 V3 V4
1: 1 2 13 NA
2: 2 3 14 NA
3: 3 4 15 NA
4: 7 6 2 1
5: 8 7 3 2
6: 9 8 4 3
which I can use, even though I have to go through the extra step later of renaming the columns back to standard variable names.
If I instead try the default fill=FALSE and use.names=FALSE, it throws an error:
Error in rbindlist(l) :
Item 2 has 4 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.
Is there a simple way to manage this, either by forcing fill=TRUE and use.names=FALSE somehow or by omitting the additional columns in the csvs that have them by specifying a vector of columns to append?
If we only need the first 3 columns, we can drop the rest and bind as usual:
rbindlist(lapply(l, function(i) i[, 1:3]))
# apple orange dragonfruit
# 1: 1 2 13
# 2: 2 3 14
# 3: 3 4 15
# 4: 7 6 2
# 5: 8 7 3
# 6: 9 8 4
Another option, from the comments: read the files directly with fread, using select to keep only the first 3 columns, then bind:
rbindlist(lapply(filenames, fread, select = c(1:3)))
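If the column order really is stable, another workaround (a sketch, assuming the order never changes and data.table is loaded) is to rename the columns positionally before binding, which reproduces the V1..V4 output asked for above:
# rename each table's columns to V1, V2, ... so the names agree;
# copy() avoids renaming the original tables by reference
l2 <- lapply(l, function(d) setnames(copy(d), paste0("V", seq_along(d))))
rbindlist(l2, fill = TRUE)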
Here is an option with name matching using phonetic from stringdist: extract the column names from the list of data.tables ('nmlist'), unlist them, group the names by phonetic code and take the first element of each group, relist the result to the same list structure as 'nmlist', use Map to set the column names on the list of data.tables, and then apply rbindlist.
library(stringdist)
library(data.table)
nmlist <- lapply(l, names)
nm1 <- unlist(nmlist)
rbindlist(Map(setnames, l,
              relist(ave(nm1, phonetic(nm1), FUN = function(x) x[1]),
                     skeleton = nmlist)),
          fill = TRUE)
Output:
# apple orange dragonfruit pear
#1: 1 2 13 NA
#2: 2 3 14 NA
#3: 3 4 15 NA
#4: 7 6 2 1
#5: 8 7 3 2
#6: 9 8 4 3
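The grouping step works because phonetic() defaults to soundex codes, under which the spelling variants collapse to the same code; roughly:
phonetic(c("apple", "appole", "orange", "orangina"))
# e.g. "A140" "A140" "O652" "O652" -- variants share a code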

How to do a complex wide-to-long operation for network analysis

I have survey data that includes who the respondent is (iAmX), who they work with (withX), how frequently they work with each partner (freqX), and how satisfied they are with each partner (likeX). Participants can select multiple options for who they are and who they work with.
I would like to go from something like this, with one row per respondent:
df <- read.table(header=T, text='
id iAmA iAmB iAmC withA withB withC freqA freqB freqC likeA likeB likeC
1 X X NA X X NA 3 2 NA 3 2 NA
2 NA NA X X NA NA 5 NA NA 5 NA NA
')
To something like this, with one row per combination, where "from" is who the actor is and "to" is who they work with:
goal <- read.table(header=T, text='
id from to freq like
1 A A 3 3
1 B A 3 3
1 A B 2 2
1 B B 2 2
2 C A 5 5
')
I have tried some melt, gather, and reshape functions but frankly I think I'm just not up to the logic puzzle today. I would really appreciate some help!
Although I must admit I have not fully understood the OP's logic, the code below reproduces the expected goal.
The key points here are data.table's version of melt(), which can reshape multiple measure columns simultaneously, and the cross-join function CJ().
library(data.table)
# reshape multiple measure columns simultaneously
cols <- c("iAm", "with", "freq", "like")
long <- melt(setDT(df), measure.vars = patterns(cols),
             value.name = cols, variable.name = "to")[
  # rename factor levels
  , to := forcats::fct_relabel(to, function(x) LETTERS[as.integer(x)])]
# create combinations for each id
combi <- long[, CJ(from = na.omit(to[iAm == "X"]), to = na.omit(to[with == "X"])), by = id]
# join to append freq and like
result <- combi[long, on = .(id, to), nomatch = 0L][, -c("iAm", "with")]
# reorder result
setorder(result, id)
result
id from to freq like
1: 1 A A 3 3
2: 1 B A 3 3
3: 1 A B 2 2
4: 1 B B 2 2
5: 2 C A 5 5
The intermediate results are
long
id to iAm with freq like
1: 1 A X X 3 3
2: 2 A <NA> X 5 5
3: 1 B X X 2 2
4: 2 B <NA> <NA> NA NA
5: 1 C <NA> <NA> NA NA
6: 2 C X <NA> NA NA
and
combi
id from to
1: 1 A A
2: 1 A B
3: 1 B A
4: 1 B B
5: 2 C A
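A side note: if you would rather avoid the forcats dependency, the relabelling step can be done with base indexing instead (same effect here, since the melt() output levels are "1", "2", "3"):
# relabel factor levels 1, 2, 3 to A, B, C without forcats
long[, to := factor(LETTERS[as.integer(to)])]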

address cell in data.frame according to ID

I have a data.frame in R. It looks like this (but is much larger):
df <- data.frame(A = rep(NULL, 5),
                 B = rep(NULL, 5),
                 ID = c(3, 6, 8, 9, 27))
> df
A B ID
1 NULL NULL 3
2 NULL NULL 6
3 NULL NULL 8
4 NULL NULL 9
5 NULL NULL 27
In order to write the corresponding value into each cell, I need to address each cell by its column name and the ID value (column 3). In the following example, X would be addressed by column = A and ID = 8 (instead of column = A and row = 3).
> df
A B ID
1 NULL NULL 3
2 NULL NULL 6
3 X NULL 8
4 NULL NULL 9
5 NULL NULL 27
Is there a way to do so?
This question appears to be rather simple but there are different aspects to be considered in an answer.
The OP has asked how to replace values in a certain column in rows which match a given ID value.
Data types
First, you have to consider the types of the data used for replacement. So, if A is to contain numeric data and B character data, a sample data set can be created by
df <- data.frame(A = NA_real_,
                 B = NA_character_,
                 ID = c(3, 6, 8, 9, 27),
                 stringsAsFactors = FALSE)
df
A B ID
1 NA <NA> 3
2 NA <NA> 6
3 NA <NA> 8
4 NA <NA> 9
5 NA <NA> 27
str(df)
'data.frame': 5 obs. of 3 variables:
$ A : num NA NA NA NA NA
$ B : chr NA NA NA NA ...
$ ID: num 3 6 8 9 27
This will avoid costly type conversions. The parameter stringsAsFactors = FALSE will avoid errors due to wrong factor levels.
Base R
As mentioned by dshkol, which(df$ID==8) returns the row number(s). So, a complete answer would be:
df[which(df$ID == 8), "A"] <- 0.7
df
A B ID
1 NA <NA> 3
2 NA <NA> 6
3 0.7 <NA> 8
4 NA <NA> 9
5 NA <NA> 27
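By the way, the which() is not strictly needed; plain logical subsetting performs the same replacement:
df$A[df$ID == 8] <- 0.7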
Using data.table
The OP has mentioned that the production data frame is much larger. If there are many replacement operations on large data objects, it may be worthwhile to use data.table, because its replacements are done in place and touch only the affected rows rather than copying the whole column. This saves time and memory.
In addition, the syntax is more concise.
library(data.table)
dt <- as.data.table(df) # creating a copy of df for illustration
# replacement in place in rows given by row number
dt[which(ID == 8), A := 0.7][]
# replacement in place in rows given by condition
dt[ID == 8, A := 0.7][]
A B ID
1: NA NA 3
2: NA NA 6
3: 0.7 NA 8
4: NA NA 9
5: NA NA 27
Using a keyed data.table
Setting a key on a data.table further speeds up the search for a particular ID, and the code becomes even more concise:
setkey(dt, ID)
dt[.(8), A := 0.7][]
The output is the same as shown above.
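The keyed lookup also vectorises over several IDs at once, e.g., to update IDs 3 and 8 in one call:
dt[.(c(3, 8)), A := 0.7][]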
Replacing multiple values at once using a lookup table
If there are many values to be replaced in one column, it may be more efficient to store them in a lookup table and use it in an update-on-join operation:
lookupA <- data.table(ID = c(8, 3),
                      new = c(0.7, 1.2))
lookupA
ID new
1: 8 0.7
2: 3 1.2
dt[lookupA, on = "ID", A := new][]
A B ID
1: 1.2 NA 3
2: NA NA 6
3: 0.7 NA 8
4: NA NA 9
5: NA NA 27
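The same update-on-join pattern extends to several columns at once. A sketch with a hypothetical two-column lookup (the names lookup, A_new, and B_new are made up for illustration):
# hypothetical lookup carrying new values for both A and B
lookup <- data.table(ID = c(3, 8), A_new = c(1.2, 0.7), B_new = c("x", "y"))
dt[lookup, on = "ID", `:=`(A = i.A_new, B = i.B_new)][]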

If condition is true, find the max in 3 consecutive rows and report it in a new column

Reproducible example:
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)
dat1<-as.data.frame(cbind(Label, Value))
The output I am after is a new column "test" that, whenever three consecutive rows share the same value of "Label", contains the maximum of "Value" over those rows, and otherwise simply repeats "Value".
I do not mind the missing values at the beginning and at the end; they can stay.
Expected result of the column test: NA, NA, 3,3,3,1,2,3,3,3,NC,3,3,3,NA,NA
In Excel it was very easy and I coded it successfully as follows:
=IF(AND(BN6=BN5,BN6=BN4),X4,Y6)
but in R I cannot get it to work.
I tried several methods, the closest to a result is the following:
test <- c(NA, NA)
test_tot <- NULL
for (i in 3:length(dat1$Label)) {
  test_tot <- c(test_tot, test)
  if (dat1$Label[i] == dat1$Label[i+1] && dat1$Label[i] == dat1$Label[i+2]) {
    test <- max(as.numeric(c(dat1$Value[i], dat1$Value[i+1], dat1$Value[i+2])))
  }
  if (dat1$Label[i] == dat1$Label[i-1] && dat1$Label[i] == dat1$Label[i+1]) {
    test <- max(as.numeric(c(dat1$Value[i], dat1$Value[i-1], dat1$Value[i+1])))
  }
  if (dat1$Label[i] == dat1$Label[i-1] && dat1$Label[i] == dat1$Label[i-2]) {
    test <- max(as.numeric(c(dat1$Value[i], dat1$Value[i-1], dat1$Value[i-2])))
  }
  else {
    test <- dat1$Value[i]
  }
}
test_tot <- c(test_tot, NA, NA)
dat1$test <- test_tot
EDIT:
The difficulty apparently is that the column "Value" has character based values. Any solution able to deal with it is greatly appreciated.
Edit: The OP has pointed out that column Value may contain character-based values which are important to identify a specific behaviour that happened at a specific time.
Consequently, the whole vector or column is of type character in R (or factor). The code below has been amended to handle this by extracting the numeric values to a separate column, computing the maximum values per group, coercing the result back to character, and copying the character-based values into the result.
The data.table solution below
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)
Expected <- c(NA, NA, 3,3,3,1,2,3,3,3,"NC",3,3,3,NA,NA)
dat1<-data.frame(Label, Value, Expected)
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table
setDT(dat1)[
  # create temporary column with only numeric values
  , Value_num := as.numeric(as.character(Value))][
  # create temp cols for group id and group size
  , `:=`(grp = .GRP, N = .N), by = rleid(Label)][
  # for sufficiently large groups compute max values and coerce to char
  N >= 3, new := as.character(max(Value_num)), by = grp][
  # copy missing values
  is.na(new), new := as.character(Value)][
  # clean up
  , c("grp", "N", "Value_num") := NULL][]
returns the expected result
Label Value Expected new
1: 0 NA NA NA
2: 0 NA NA NA
3: 1 1 3 3
4: 1 2 3 3
5: 1 3 3 3
6: 2 1 1 1
7: 2 2 2 2
8: 3 3 3 3
9: 3 2 3 3
10: 3 1 3 3
11: 4 NC NC NC
12: 5 1 3 3
13: 5 3 3 3
14: 5 2 3 3
15: 6 1 NA 1
16: 6 NA NA NA
except for row 15, where I believe the expected result should be 1 if we follow the OP's words: otherwise just report the values of the column "Value".
The warning message:
In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
can be ignored, as converting the non-numeric entries to NA is exactly what is intended here.
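If you prefer not to see the warning at all, the first step of the chain can wrap the coercion in suppressWarnings(), e.g.:
# silence the intended coercion warning explicitly
dat1[, Value_num := suppressWarnings(as.numeric(as.character(Value)))]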
Here is a dplyr solution. NOTE: "NC" was changed to NA.
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,NA,1,3,2,1,NA)
dat1<-as.data.frame(cbind(Label, Value))
library(dplyr)
dat1 %>%
  filter(!is.na(Value)) %>%
  group_by(Label) %>%
  summarize(n = n(), max_Value = max(Value)) %>%
  mutate(test = if_else(n >= 3, max_Value, as.numeric(NA))) %>%
  right_join(dat1, by = "Label") %>%
  mutate(test = if_else(is.na(test), Value, test)) %>%
  select(Label, Value, test)
# # A tibble: 16 × 3
# Label Value test
# <dbl> <dbl> <dbl>
# 1 0 NA NA
# 2 0 NA NA
# 3 1 1 3
# 4 1 2 3
# 5 1 3 3
# 6 2 1 1
# 7 2 2 2
# 8 3 3 3
# 9 3 2 3
# 10 3 1 3
# 11 4 NA NA
# 12 5 1 3
# 13 5 3 3
# 14 5 2 3
# 15 6 1 1
# 16 6 NA NA

Reshaping data.table with cumulative sum

I want to reshape a data.table and include the historic (cumulatively summed) information for each variable. The No variable indicates the chronological order of measurements for object ID. At each measurement, additional information is found. I want to aggregate the information known at each timestamp No for each object ID.
Let me demonstrate with an example:
For the following data.table:
df <- data.table(ID = c(1, 1, 1, 2, 2, 2, 2),
                 No = c(1, 2, 3, 1, 2, 3, 4),
                 Variable = c('a', 'b', 'a', 'c', 'a', 'a', 'b'),
                 Value = c(2, 1, 3, 3, 2, 1, 5))
df
ID No Variable Value
1: 1 1 a 2
2: 1 2 b 1
3: 1 3 a 3
4: 2 1 c 3
5: 2 2 a 2
6: 2 3 a 1
7: 2 4 b 5
I want to reshape it to this:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
So the summed values of Value, per Variable by (ID, No), cumulative over No.
I can get the result without the cumulative part by doing
dcast(df, ID+No~Variable, value.var="Value")
which results in the non-cumulative variant:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 NA 1 NA
3: 1 3 3 NA NA
4: 2 1 NA NA 3
5: 2 2 2 NA NA
6: 2 3 1 NA NA
7: 2 4 NA 5 NA
Any ideas how to make this cumulative? The original data.table has over 250,000 rows, so efficiency matters.
EDIT: I just used a, b, c as an example; the original file has about 40 different levels. Furthermore, the NAs are important: there are also Value entries of 0, which mean something different from NA.
POSSIBLE SOLUTION
Okay, so I've found a working solution. It is far from efficient, since it enlarges the original table.
The idea is to duplicate each row TotalNo - No times, where TotalNo is the maximum No per ID. Then the original dcast call can be used to extract the wide table. In code:
df[, TotalNo := .N, by = ID]
df2 <- df[rep(seq(nrow(df)), (df$TotalNo - df$No + 1))] # create duplicates
df3 <- df2[order(ID, No)]
df3[, No := seq(from = No[1], to = TotalNo[1], by = 1), by = .(ID, No)]
df4 <- dcast(df3,
             formula = ID + No ~ Variable,
             value.var = "Value", fill = NA, fun.aggregate = sum)
It is not really nice, because creating the duplicates uses more memory. I think it can be optimized further, but so far it works for my purposes. In the sample code it goes from 7 rows to 16; in the original file, from 241,670 rows to a whopping 978,331. That's more than a factor of 4 larger.
SOLUTION
Eddi's answer has improved my solution's computing time on the full dataset (2.08 seconds for Eddi's versus 4.36 seconds for mine). Those are numbers I can work with! Thanks everybody!
Your solution is good, but you're adding too many rows that are unnecessary if you compute the cumsum beforehand:
# add useful columns
df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]
# do a rolling join to extend the missing values, and then dcast
dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
         on = c('ID', 'Variable', 'No'), roll = TRUE],
      ID + No ~ Variable, value.var = 'CumValue')
# ID No a b c
#1: 1 1 2 NA NA
#2: 1 2 2 1 NA
#3: 1 3 5 1 NA
#4: 2 1 NA NA 3
#5: 2 2 2 NA 3
#6: 2 3 3 NA 3
#7: 2 4 3 5 3
Here's a standard way:
library(zoo)
df[, cv := cumsum(Value), by = .(ID, Variable)]
DT = dcast(df, ID + No ~ Variable, value.var="cv")
lvls = sort(unique(df$Variable))
DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by=ID, .SDcols=lvls]
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
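A side note: on data.table 1.12.4 or newer, nafill() can take over the forward fill from zoo::na.locf for these numeric columns, dropping the zoo dependency (a sketch, same result assumed):
# same last-observation-carried-forward fill with data.table's nafill
DT[, (lvls) := lapply(.SD, nafill, type = "locf"), by = ID, .SDcols = lvls]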
One alternative way to do it is with a custom-built cumulative sum function. This is exactly the method in #David Arenburg's comment, but substitutes in a custom cumulative summary function.
EDIT: Now using #eddi's much more efficient custom cumulative sum function.
cumsum.na <- function(z) {
  Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = TRUE),
         z, accumulate = TRUE)
}
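A quick behaviour check of the helper: leading NAs survive, while an NA after a real value carries the running sum forward unchanged:
cumsum.na(c(NA, 1, NA, 2))
# [1] NA 1 1 3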
cols <- sort(unique(df$Variable))
res <- dcast(df, ID + No ~ Variable, value.var = "Value")[
  , (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
res
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
This definitely isn't the most efficient approach, but it gets the job done and gives you an admittedly slow cumulative summary function that handles NAs the way you want.
