Generate a table of values for several columns - r

Say I have a dataframe or datatable.
For example:
library(data.table)
try <- data.frame(AA=c(1,2,3,1,2,3,4,5,NA),
BB=c(1,2,2,NA,2,1,2,2,NA),
CC=c("A","B", NA, NA, "A","B", "A","C","B"))
setDT(try)
AA BB CC
1 1 A
2 2 B
3 2 NA
1 NA NA
2 2 A
3 1 B
4 2 A
5 2 C
NA NA B
I want to summarize the values in order to export them to an Excel file for further manipulation later.
I could create a table for each column but, in real life, some variables could have too many different values (such as the weight or DOB of people).
I can get the counts for a single column with:
table(try$BB, useNA ="ifany")
1 2 <NA>
2 5 2
But when I try to do it automatically for all the columns at once it doesn't work as expected:
try[,lapply(.SD,function(x) table(x,useNA="ifany")[1:6] )]
because the table() command generates a 2 rows result and only one is used to create the final summary table.
What procedure do you suggest to keep that information?
For example, I could try to convert those single-variable tables to something like
"1":2 "2":5 "NA":2
But I don't know how to do it. Maybe converting it to factors, maybe pasting the values.
I'm not even able to extract the rows of the table for further manipulation.
Any solution with base data.frame or data.table is welcome.
Or I could even order that table to get the most common values first.
PS: I want something like this:
AA "1":2 "2":2 "3":2 "4":1 "5":1 "NA": 1
BB "1":2 "2":5 "NA": 2
CC "A":3 "B":3 "C":1 "NA": 2
PS2:
I've tried this
try[,lapply(.SD, function(x) { tmp <- table(x,
useNA ="ifany") ; mapply(paste0, names( tmp ),
rep(":", length(tmp)), tmp )} )
]
But it's too long and it doesn't work well
AA BB CC
1:2 1:2 A:3
2:2 2:5 B:3
3:2 NA:2 C:1
4:1 1:2 NA:2
5:1 1:2 A:3
NA:1 2:5 B:3
It fills the last values with fake values.
Another option would be to interleave the names and the values.
In this example I should get:
AA BB CC
"1:2" "1:2" "A:3"
"2:2" "2:5" "B:3"
"3:2" "NA:2" "C:1"
"4:1" NA "NA:2"
"5:1" NA NA
"NA:1" NA NA
The problem is that the list is converted internally to a data.table by as.data.table.list(), and vectors of different lengths are recycled instead of being padded with NAs.

You can get your desired output with
library(magrittr)
tab = try %>% lapply(table, useNA = "ifany")
len = max(lengths(tab))
tab %>% lapply(
. %>%
{ paste0(names(.), ":", .) } %>%
`length<-`(len)
) %>% setDF %>% print
AA BB CC
1 1:2 1:2 A:3
2 2:2 2:5 B:3
3 3:2 NA:2 C:1
4 4:1 <NA> NA:2
5 5:1 <NA> <NA>
6 NA:1 <NA> <NA>
I haven't learned purrr, but if you like using pipes, that might offer somewhat cleaner code.

This is my data.table solution with some ideas from Frank.
siz <- 6
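try[, lapply(.SD, function(x) {
  tmp <- table(x, useNA = "ifany")
  # NA count first, then the other values from most to least common
  tmp2 <- c(tmp[is.na(names(tmp))], rev(sort(tmp[!is.na(names(tmp))])))
  tmp3 <- mapply(paste0, names(tmp2), rep(":", length(tmp2)), tmp2)
  # pad with NA (or truncate) so every column has siz entries
  length(tmp3) <- siz
  tmp3
})]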
It always places the NAs at the beginning and orders the other elements from the most common to the least common.
Maybe there are some simpler ways to summarize the information.
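As one possible simpler route, here is a minimal base R sketch that collapses each column's frequency table into a single string, most common values first, which is close to the one-row-per-column layout asked for in the question. The names `dat`, `summarise_column` and `max_values` are made up for this illustration (the data is renamed to avoid masking base::try), and the order of tied counts is not something to rely on.

```r
# Sample data from the question, renamed so it does not mask base::try().
dat <- data.frame(
  AA = c(1, 2, 3, 1, 2, 3, 4, 5, NA),
  BB = c(1, 2, 2, NA, 2, 1, 2, 2, NA),
  CC = c("A", "B", NA, NA, "A", "B", "A", "C", "B")
)

# Hypothetical helper: turn one column into a string such as "2:5 1:2 NA:2".
summarise_column <- function(x, max_values = 6) {
  counts <- table(x, useNA = "ifany")       # count every value, including NA
  counts <- sort(counts, decreasing = TRUE) # most common values first
  counts <- head(counts, max_values)        # keep at most max_values entries
  labels <- names(counts)
  labels[is.na(labels)] <- "NA"             # the NA level has a missing name
  paste0(labels, ":", as.vector(counts), collapse = " ")
}

# One row per original column, ready for export (for example with write.csv).
summary_df <- data.frame(
  column = names(dat),
  values = vapply(dat, summarise_column, character(1)),
  row.names = NULL
)
print(summary_df)
```

Because every column yields exactly one string, nothing gets recycled or padded, so columns with very different numbers of distinct values can sit in the same table.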

Related

Equivalent of row_number for columns dplyr

I am trying to apply a function to the columns of a tibble or data.frame, where the function depends on the index of each column. This has come up several times, so here is just one MWE:
library(tidyverse)
test <- data.frame(a = c(1,2,3), b = c(7,8,9), c = c(3,5,6))
test <- test %>% as_tibble() %>% mutate_all( ~lead(., 2))
This leads every column by 2 (just an example). What I want is to lead the first column by 1, the second by 2, and so on, with something like mutate_all(~lead(., col_number())).
For this little example, I know one way to do it, like:
test <- as.matrix(test)
for (i in 1:ncol(test)){ test[,i] <- lead(test[,i], i) }
There might be other ways to do it too; I haven't thought about it much. (One needs to convert to a matrix first, otherwise it doesn't produce the right result; I don't really know why.)
But I'd like to do it with mutate or apply, with general access to the column index, so that it also works for a more complex example.
Any idea?
One option is using purrr::map2_df to sequentially lead every column based on column number.
purrr::map2_df(test, seq_along(test), dplyr::lead)
# A tibble: 3 x 3
# a b c
# <dbl> <dbl> <dbl>
#1 2 9 NA
#2 3 NA NA
#3 NA NA NA
We can also use base R Map
test[] <- Map(function(x, y) c(tail(x, -y), rep(NA, y)), test, seq_along(test))
We can use data.table shift
library(data.table)
setDT(test)[, Map(shift, .SD, n = 1:3, type = 'lead')]
# a b c
#1: 2 9 NA
#2: 3 NA NA
#3: NA NA NA
Or using purrr
library(purrr)
map2_dfr(test, 1:3, ~shift(.x, n = .y, type = 'lead'))
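For staying inside mutate, one more possibility is a sketch built on across() and cur_column(), assuming dplyr 1.0 or later. cur_column() returns the name of the column being processed, and match() turns that name into a position. The helper vector `col_names` and the result name `shifted` are made up for this example.

```r
library(dplyr)

test <- tibble(a = c(1, 2, 3), b = c(7, 8, 9), c = c(3, 5, 6))

# Capture the column order up front so each name can be mapped to its index.
col_names <- names(test)

shifted <- test %>%
  mutate(across(
    everything(),
    # cur_column() is the current column's name; match() gives its position.
    ~ lead(.x, n = match(cur_column(), col_names))
  ))

print(shifted)
```

The same pattern should carry over to any function that takes the column position as an argument, since only the lambda body changes.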

Transform longitudinal table to wide format efficiently in data.table

I am working in R with a long table stored as a data.table. It contains one row per value change, for variables of both numeric and character type. When I want to run things like correlations or regressions, I have to convert the table into wide format and homogenise the timestamp frequency.
I found a way to convert the long table to wide, but I think it is not really efficient, and I would like to know if there is a better, more data.table-native approach.
In the reproducible example below, I include the two options I found for the long-to-wide transformation, and in the comments I indicate which parts I believe are not optimal.
library(zoo)
library(data.table)
dt<-data.table(time=1:6,variable=factor(letters[1:6]),numeric=c(1:3,rep(NA,3)),
character=c(rep(NA,3),letters[1:3]),key="time")
print(dt)
print(dt[,lapply(.SD,typeof)])
#option 1
casted<-dcast(dt,time~variable,value.var=c("numeric","character"))
# types are correct, but I got NA filled columns,
# is there an option like drop
# available for columns instead of rows?
print(casted)
print(casted[,lapply(.SD,typeof)])
# This drop looks ugly but I did not figure out a better way to perform it
casted[,names(casted)[unlist(casted[,lapply(lapply(.SD,is.na),all)])]:=NULL]
# I perform a LOCF, I do not know if I could benefit of
# data.table's roll option somehow and avoid
# the temporal memory copy of my dataset (this would be the second
# and minor issue)
casted<-na.locf(casted)
#option2
# taken from http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r
coalesce2 <- function(...) {
Reduce(function(x, y) {
i <- which(is.na(x))
x[i] <- y[i]
x},
list(...))
}
casted2<-dcast(dt[,coalesce2(numeric,character),by=c("time","variable")],
time~variable,value.var="V1")
# There are not NA columns but types are incorrect
# it takes more space in a real table (more observations, less variables)
print(casted2)
print(casted2[,lapply(.SD,typeof)])
# Again, I am pretty sure there is a prettier way to do this
numericvars<-names(casted2)[!unlist(casted2[,lapply(
lapply(lapply(.SD,as.numeric),is.na),all)])]
casted2[,eval(numericvars):=lapply(.SD,as.numeric),.SDcols=numericvars]
# same as option 1, is there a data.table native way to do it?
casted2<-na.locf(casted2)
Any advice/improvement in the process is welcome.
I'd maybe do the char and num tables separately and then rbind:
k = "time"
typecols = c("numeric", "character")
res = rbindlist(fill = TRUE,
lapply(typecols, function(tc){
cols = c(k, tc, "variable")
dt[!is.na(get(tc)), ..cols][, dcast(.SD, ... ~ variable, value.var=tc)]
})
)
setorderv(res, k)
res[, setdiff(names(res), k) := lapply(.SD, zoo::na.locf, na.rm = FALSE), .SDcols=!k]
which gives
time a b c d e f
1: 1 1 NA NA NA NA NA
2: 2 1 2 NA NA NA NA
3: 3 1 2 3 NA NA NA
4: 4 1 2 3 a NA NA
5: 5 1 2 3 a b NA
6: 6 1 2 3 a b c
Note that OP's final result, casted2, differs in that all of its columns are character.
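On the question of doing the LOCF step without zoo, here is a minimal base R sketch of a last-observation-carried-forward helper that handles numeric and character vectors alike. The name `locf` is made up for this example. It keeps leading NAs, which matches the na.rm = FALSE behaviour used above.

```r
# Hypothetical helper: carry the last non-NA value forward.
locf <- function(x) {
  # Position of the most recent non-NA element seen so far (0 if none yet).
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  # Nothing has been observed before the first non-NA value: keep NA there.
  idx[idx == 0L] <- NA_integer_
  x[idx]
}

num_filled <- locf(c(NA, 1, NA, 2, NA))
chr_filled <- locf(c(NA, "a", NA, "b"))

print(num_filled)
print(chr_filled)
```

Because the helper is a plain vector-to-vector function, it could be dropped into the `lapply(.SD, ...)` call above in place of zoo::na.locf. Whether that is faster on a large table is something to measure rather than assume.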

How to change values in a column of a data frame based on conditions in another column?

I would like to have an equivalent of the Excel function "if". It seems basic enough, but I could not find relevant help.
I would like to assign "NA" to specific cells if two consecutive cells in a different column are not identical. In Excel, the command would be the following (say in C1): if(A1 = A2, B1, "NA"). I would then just need to fill it down the rest of the column.
But in R, I am stuck!
Here is an equivalent of my R code so far.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"))
df
To get the following Type of each Type in another column, I found a useful function on StackOverflow that does the job.
# determines the following Type of each Type
shift <- function(x, n){
c(x[-(seq(n))], rep(6, n))
}
df$TypeFoll <- shift(df$Type, 1)
df
Now, I would like to keep TypeFoll in a specific row when the File for this row is identical to the File on the next row.
Here is what I tried. It failed!
for(i in 1:length(df$File)){
df$TypeFoll2 <- ifelse(df$File[i] == df$File[i+1], df$TypeFoll, "NA")
}
df
In the end, my data frame should look like:
aim = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"),
TypeFoll = c("2","3","4","4","5","6"),
TypeFoll2 = c("2","NA","4","4","NA","6"))
aim
Oh, and by the way, if someone would know how to easily put the columns TypeFoll and TypeFoll2 just after the column Type, it would be great!
Thanks in advance
I would do it as follows (not keeping the result from the shift function)
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"), stringsAsFactors = FALSE)
# This is your shift function
len=nrow(df)
A1 <- df$File[1:(len-1)]
A2 <- df$File[2:len]
# Why do you save the result of the shift function in the df?
Then apply the equivalent of if(A1 = A2, B1, "NA"). As akrun mentioned, ifelse is vectorised, so no loop is needed. Assigning to df$TypeFoll2 is also how you append a column to a data.frame:
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), 6) # Why 6?
Since 6 is hardcoded here, something like
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), max(as.numeric(df$Type)) + 1)
is more generic.
First off, 'for' loops are pretty slow in R, so try to think of this as vector manipulation instead.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"));
Create shifted types and files and put it in new columns:
df$TypeFoll = c(as.character(df$Type[2:nrow(df)]), "NA");
df$FileFoll = c(as.character(df$File[2:nrow(df)]), "NA");
Now, df looks like this:
> df
Type File TypeFoll FileFoll
1 1 A 2 A
2 2 A 3 B
3 3 B 4 B
4 4 B 4 B
5 4 B 5 C
6 5 C NA NA
Then, create TypeFoll2 by combining these:
df$TypeFoll2 = ifelse(df$File == df$FileFoll, df$TypeFoll, "NA");
And you should have something that looks a lot like what you want:
> df;
Type File TypeFoll FileFoll TypeFoll2
1 1 A 2 A 2
2 2 A 3 B NA
3 3 B 4 B 4
4 4 B 4 B 4
5 4 B 5 C NA
6 5 C NA NA NA
If you want to remove the FileFoll column:
df$FileFoll = NULL;
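The question also asks how to place the new columns right after Type. Here is a minimal base R sketch that builds both columns without a loop and then reorders them by name. It assumes real NA values are acceptable in place of the string "NA", and it keeps the question's placeholder "6" for the last row. The helper vector `same_file` is a made-up name.

```r
df <- data.frame(
  Type = c("1", "2", "3", "4", "4", "5"),
  File = c("A", "A", "B", "B", "B", "C"),
  stringsAsFactors = FALSE
)
n <- nrow(df)

# Type of the following row; the last row has no successor, so it gets the
# placeholder "6" used in the question.
df$TypeFoll <- c(df$Type[-1], "6")

# TRUE where the next row has the same File. The last row is treated as TRUE
# so that it keeps its placeholder, as in the question's `aim` data frame.
same_file <- c(df$File[-n] == df$File[-1], TRUE)

df$TypeFoll2 <- ifelse(same_file, df$TypeFoll, NA_character_)

# Reorder by name so the new columns sit right after Type.
df <- df[, c("Type", "TypeFoll", "TypeFoll2", "File")]
print(df)
```

Selecting columns by name in the desired order is usually the least surprising way to rearrange a data frame, since it does not depend on the current column positions.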

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks
This loops by column, setting dt1 by reference and (hopefully) should be quick.
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
nna = !is.na(dt2[[i]])
set(dt1,w[nna],i,dt2[[i]][nna])
}
dt1 = rbind(dt1,dt2[is.na(w)])
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2
This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
FUN = decider)) #replace all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading, not all columns have the same name, but I am assuming they are in the same order. This may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
colnames(LS[[i]]) <- colnames(LS[[2]])
return(LS[[i]])
}
)
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
k = K[i]
p = unlist(dt2[i,-1,with=FALSE])
p = p[!is.na(p)]
dt1[J(k),names(p):=as.list(p),with=FALSE]
}
Or, can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
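The matrix idea can be sketched in base R as follows, under the assumption that all non-key columns share one type and appear under the same names in both datasets. The variable names (`base_mat`, `upd_mat`, `cells`, `target`) are made up for this example, and the small data frames mirror the ones in the question.

```r
df1 <- data.frame(K = c("A", "B", "C"), P1 = c(1, 1, 1), P2 = c(1, 1, 1),
                  P3 = c(1, 1, 1), stringsAsFactors = FALSE)
df2 <- data.frame(K = c("B", "C", "D"), P1 = c(2, NA, 2), P2 = c(NA, 2, 2),
                  P3 = c(2, 2, 2), stringsAsFactors = FALSE)

value_cols <- setdiff(names(df1), "K")

# Matrices keyed by row name, with columns in the same order in both.
base_mat <- as.matrix(df1[value_cols]); rownames(base_mat) <- df1$K
upd_mat  <- as.matrix(df2[value_cols]); rownames(upd_mat)  <- df2$K

# Restrict the update to keys that already exist in the base data.
shared     <- intersect(rownames(upd_mat), rownames(base_mat))
upd_shared <- upd_mat[shared, , drop = FALSE]

# Row/column positions of every non-NA update cell (relative to upd_shared).
cells <- which(!is.na(upd_shared), arr.ind = TRUE)

# Translate those positions into rows of base_mat, then assign in one step
# using a 2-column index matrix (the A[B] syntax).
target <- cbind(match(shared, rownames(base_mat))[cells[, "row"]],
                cells[, "col"])
base_mat[target] <- upd_shared[cells]

# Append the rows whose keys are new ("inserts").
new_keys <- setdiff(rownames(upd_mat), rownames(base_mat))
result   <- rbind(base_mat, upd_mat[new_keys, , drop = FALSE])
print(result)
```

The single vectorised assignment replaces the per-cell loop described in the question, which is where most of the time was going there.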
The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=TRUE), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names

How does ddply handle factors as "split" variables?

I have a data.frame with 20 columns. The first two are factors, and the rest are numeric. I'd like to use the first two columns as split variables and then apply the mean() to the remaining columns.
This seems like a quick and easy job for ddply(), however, the results for the output data.frame are not what I am looking for. Here is a minimal example with just one column of data:
Aa <- c(rep(c("A", "a"), each = 20))
Bb <- c(rep(c("B", "b", "B", "b"), each = 10))
x <- runif(40)
df1 <- data.frame(Aa, Bb, x)
ddply(df1, .(Aa, Bb), mean)
The output is:
Aa Bb x
1 NA NA 0.5193275
2 NA NA 0.4491907
3 NA NA 0.4848128
4 NA NA 0.4717899
Warning messages:
1: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
The warning is repeated 8 times, presumably once for each call to mean(). I'm guessing this comes from trying to take the mean of a factor. I could write this as:
ddply(df1, .(Aa, Bb), function(df1) mean(df1$x))
or
ddply(df1, .(Aa, Bb), summarize, x = mean(x))
both of which do work (not giving NAs), but I would rather avoid writing out 18 such x = mean(x) statements, one for each of my numeric columns.
Is there a general solution? I'm not wedded to ddply if there is a better answer elsewhere.
Since you are reducing the number of rows, you need to use summarise:
> ddply(df1, .(Aa, Bb), summarise, mean_x =mean(x) )
Aa Bb mean_x
1 a b 0.3790675
2 a B 0.4242922
3 A b 0.5622329
4 A B 0.4574471
It's just as easy to use aggregate in this instance. Let's say you had two variables:
> aggregate(df1[-(1:2)], df1[1:2], mean)
Aa Bb x y
1 a b 0.4249121 0.4639192
2 A b 0.6127175 0.4639192
3 a B 0.4522292 0.4826715
4 A B 0.5201965 0.4826715
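Negative indexing works inside ddply as well, as long as the function returns the column means (mean() itself does not accept a data frame):
ddply(df1, .(Aa, Bb), function(x) colMeans(x[-(1:2)]))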
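For many numeric columns, another option within plyr is numcolwise(), which wraps a function so that it is applied to every numeric column of each piece. The sketch below uses the question's layout plus a second made-up column `y` to show that no column has to be named explicitly; set.seed() is only there to make the example reproducible.

```r
library(plyr)

set.seed(1)
df1 <- data.frame(
  Aa = rep(c("A", "a"), each = 20),
  Bb = rep(c("B", "b", "B", "b"), each = 10),
  x  = runif(40),
  y  = runif(40)  # hypothetical second numeric column
)

# numcolwise(mean) applies mean() to each numeric column within every group;
# the factor (or character) split columns are left out of the calculation.
means <- ddply(df1, .(Aa, Bb), numcolwise(mean))
print(means)
```

With 18 numeric columns the call stays exactly the same, which is the main appeal over writing one x = mean(x) term per column.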
