iterating table() results into matrix/data frame - r

This must be simple, but I've been banging my head against it for a while. Please help. I have a large data set from which I get all kinds of information via table(). I then want to store these counts, along with the row names that were counted. For a reproducible example consider
a <- c("a", "b", "c", "d", "a", "b") # one count: a and b occur twice, c and d once
b <- c("a", "c") # a completely different property from the dataset: a and c occur once
x <- table(a)
y <- table(b) # so now x and y hold the information I seek
How can I merge/bind/whatever to get from x and y to this form:
  x y
a 2 1
b 2 0
c 1 1
d 1 0
HOWEVER, I need to use the solution to work iteratively, in a loop that takes x and y and gets the requested form above, and then gets more tables added, each hopefully adding a column. One of my many failed attempts, just to show my (probably flawed) logic, is:
member <- function (data = dfm, groupvar = 'group', analysis = kc15) {
res<-matrix(NA,ncol=length(analysis$size)+1) #preparing an object for the results
res[,1]<-table(docvars(data,groupvar)) #getting names and totals of groups
for (i in 1:length(analysis$size)) { #getting a bunch of counts that I care about
r<-table(docvars(data,groupvar)[analysis$cluster==i])
res<-cbind(res,r) #here's the problem, trying to add each new count as a column.
}
res
}
So, to sum, the reproducible example above means to replicate the first column in res and an r, and I'm seeking (I think) a correct solution instead of the cbind, which would allow adding columns of different length but similar names, as in the example above.
Please help; it's embarrassing how much time I'm wasting on this.

The following may be an option, which merges on the "row names" of the data frames, converted from the frequency tables:
df <- merge(as.data.frame(x, row.names=1, responseName ="x"),
as.data.frame(y, row.names=1, responseName ="y"),
by="row.names", all=TRUE)
df[is.na(df)] <- 0; df
Row.names x y
1 a 2 1
2 b 2 0
3 c 1 1
4 d 1 0
Then, this method can be incorporated into your real data with some modification. I've made up the data since I didn't have any to work with.
set.seed(1234)
groupvar <- sample(letters[1:4], 16, TRUE)
clusters <- 1:4
cluster <- rep(clusters, each=4)
Merge the first two tables:
res <- merge(as.data.frame(table(groupvar[cluster==1]),
row.names=1, responseName=clusters[1]),
as.data.frame(table(groupvar[cluster==2]),
row.names=1, responseName=clusters[2]),
by="row.names", all=TRUE)
Then merge the others using your for loop.
for (i in 3:length(clusters)) {
r <- table(groupvar[cluster==i])
res <- merge(res, as.data.frame(r, row.names=1, responseName = clusters[i]),
by.x="Row.names", by.y="row.names", all=TRUE)
}
res[is.na(res)] <- 0
res
Row.names X1 X2 X3 X4
1 a 1 2 0 0
2 b 1 1 2 2
3 c 0 1 1 2
4 d 2 0 1 0
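If every count comes from the same grouping vector, a two-way table may avoid the loop altogether. The following is only a sketch on the made-up data above; the names tab, res2way and total are mine, not part of the original question.

```r
set.seed(1234)
groupvar <- sample(letters[1:4], 16, TRUE)
cluster <- rep(1:4, each = 4)

# cross-tabulate group by cluster; unclass() turns the table into a plain matrix
tab <- unclass(table(groupvar, cluster))

# first column holds the group totals, followed by one column of counts per cluster
res2way <- cbind(total = rowSums(tab), tab)
res2way
```

Groups that never occur in a cluster get a 0 automatically, so no NA replacement is needed afterwards.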

Merge the transposed tables and re-transpose:
res <- t(merge(t(unclass(x)), t(unclass(y)), all=TRUE))
res <- `colnames<-`(res[order(rownames(res)), 2:1], c("x", "y"))
res[is.na(res)] <- 0
res
# x y
# a 2 1
# b 2 0
# c 1 1
# d 1 0
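Another option that might suit the iterative case is to give every vector the same factor levels before calling table(), so that all tables come out with the same length and order and can be bound column by column. A minimal sketch, assuming the full set of names can be collected up front (lev and counts are illustrative names):

```r
a <- c("a", "b", "c", "d", "a", "b")
b <- c("a", "c")

# the union of all names that can occur, in a fixed order
lev <- sort(unique(c(a, b)))

# table() on a factor reports a count (possibly 0) for every level,
# so each result has the same length and sapply() can bind them as columns
counts <- sapply(list(x = a, y = b), function(v) table(factor(v, levels = lev)))
counts
```

Inside a loop the same idea applies: as long as each new table is built from factor(v, levels = lev), a plain cbind() lines up.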

Related

For loop doesn't properly filter

I want to print two dataframes, where the first one is all the rows where column a is not NA and the second is all the rows where column b is not NA.
This is my code. It prints the whole dataframe both times, without triggering the filter.
a <- cbind(rep(NA, 100), seq(0,99))
b <- cbind(seq(0,99), rep(NA, 100))
df <- as.data.frame(rbind(a,b))
names(df) <- c("a", "b")
columns <- c("a", "b")
for (j in columns){
df %>% filter(!is.na(j)) %>% print()
}
I also tried with filter(j != "") and received the same result.
As to why the downvote, I cannot know, but I can guess. You used functions that are not base R without issuing library calls for the packages that contain them, and you constructed your example dataframe in a wasteful and possibly dangerous fashion using cbind unnecessarily and as.data.frame where a single data.frame call would have been more efficient, safer and more expressive.
cbind(as.Date("1970-01-01")) # causes loss of attributes including class
# [,1]
#[1,] 0
c(factor("a"))
#[1] 1
Here's how to properly construct an example like yours:
df <- data.frame( a = c(rep(NA, 100), seq(0,99)) ,
b = c(seq(0,99), rep(NA, 100)))
And you can get a column or object whose name you have in a character vector with get (assuming that there is an appropriate
columns <- c("a", "b")
library(dplyr)
for (j in columns){
df %>% filter(!is.na( get(j) )) %>% print()
}
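The reason the original loop printed everything is that j holds the string "a" or "b", which is never NA, so !is.na(j) is always TRUE. If dplyr is not required, a base R sketch that looks the column up by name with [[ could look like this (subsets is an illustrative name):

```r
df <- data.frame(a = c(rep(NA, 100), seq(0, 99)),
                 b = c(seq(0, 99), rep(NA, 100)))
columns <- c("a", "b")

# df[[j]] fetches the column whose name is stored in j
subsets <- lapply(columns, function(j) df[!is.na(df[[j]]), ])
names(subsets) <- columns

# each element keeps only the rows where that column is not NA
sapply(subsets, nrow)
```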
Do you mean something like:
not_na_a <- data.frame(which(!is.na(df$a)))
#> head(not_na_a)
which..is.na.df.a..
1 101
2 102
3 103
4 104
5 105
6 106
not_na_b <- data.frame(which(!is.na(df$b)))
#> head(not_na_b)
which..is.na.df.b..
1 1
2 2
3 3
4 4
5 5
6 6

Create dataframe from smallest vector available

I want to create a dataframe from a list of dataframes, specifically from a certain column of those dataframes. However each dataframe contains a different number of observations, so the following code gives me an error.
diffs <- data.frame(sensor1 = sensores[[1]]$Diff,
sensor2 = sensores[[2]]$Diff,
sensor3 = sensores[[3]]$Diff,
sensor4 = sensores[[4]]$Diff,
sensor5 = sensores[[5]]$Diff)
The error:
Error in data.frame(sensor1 = sensores[[1]]$Diff, sensor2 = sensores[[2]]$Diff, :
arguments imply differing number of rows: 29, 19, 36, 26
Is there some way to force data.frame() to take the minimal number of rows available from each one of the columns, in this case 19?
Maybe there is a built-in function in R that can do this, any solution is appreciated but I'd love to get something as general and as clear as possible.
Thank you in advance.
I can think of two approaches:
Example data:
df1 <- data.frame(A = 1:3)
df2 <- data.frame(B = 1:4)
df3 <- data.frame(C = 1:5)
Compute the number of rows of the smallest dataframe:
min_rows <- min(sapply(list(df1, df2, df3), nrow))
Use subsetting when combining:
diffs <- data.frame(a = df1[1:min_rows,], b = df2[1:min_rows,], c = df3[1:min_rows,] )
diffs
a b c
1 1 1 1
2 2 2 2
3 3 3 3
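The same subsetting idea can be written for a list of any length. This is a sketch only: the sensores list below is made-up stand-in data, and the sensor1, sensor2, ... names are generated rather than taken from the real data.

```r
# hypothetical stand-in for the real list of data frames
sensores <- list(data.frame(Diff = 1:3),
                 data.frame(Diff = 1:4),
                 data.frame(Diff = 1:5))

# number of rows in the shortest data frame
n <- min(sapply(sensores, nrow))

# take the first n values of Diff from every element
cols <- lapply(sensores, function(d) d$Diff[seq_len(n)])
names(cols) <- paste0("sensor", seq_along(sensores))

diffs <- as.data.frame(cols)
diffs
```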
Alternatively, use merge:
rowmerge <- function(x,y){
# create row indicators for the merge:
x$ind <- 1:nrow(x)
y$ind <- 1:nrow(y)
out <- merge(x,y, all = T, by = "ind")
out["ind"] <- NULL
return(out)
}
Reduce(rowmerge, list(df1, df2, df3))
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 NA 4 4
5 NA NA 5
To get rid of the rows with NAs, remove the all = T argument; merge then keeps only the row indices that are present in both data frames.
For your particular case, you would probably call Reduce(rowmerge, sensores), assuming that sensores is a list of dataframes.
Note: if you already have an index somewhere (e.g. a timestamp of some sort), then it would be advisable to simply merge on that index instead of creating ind.

R converting a long list of questionnaire choices to a dataframe with one row for each questionnaire

A questionnaire was passed to teachers to check their curriculum preferences. They had to choose 20 items from about 50 options.
The resulting data is a long list of choices of the following type:
Teacher ID, Question ID
I want to format it as a table with one row for each teacher and a column for each question, with the possible values 0 (not chosen) and 1 (chosen).
In pseudo code (of a programming language)
it would probably be something like this:
iterate list {
data [teacher_id] [question_id] = 0
}
Here is a sample data and the intended result:
a <- data.frame(
Case_ID = c(1,1,2,2,4,4),
Q_ID = c(3,5,5,8,2,6)
)
intended result is
res <- data.frame(
Case_ID = c(1,2,4),
Q_1 = c(0,0,0),
Q_2 = c(0,0,1),
Q_3 = c(1,0,0),
Q_4 = c(0,0,0),
Q_5 = c(1,1,0),
Q_6 = c(0,0,1),
Q_7 = c(0,0,0),
Q_8 = c(0,1,0)
)
Any help would be greatly appreciated.
Tnx
Hed
Returning a matrix and using matrix indexing to do the work:
m <- matrix(0, nrow=3, ncol=8)
rownames(m) <- c(1,2,4)
colnames(m) <- 1:8
idx <-apply(a, 2, as.character)
m[idx] <- 1
m
## 1 2 3 4 5 6 7 8
## 1 0 0 1 0 1 0 0 0
## 2 0 0 0 0 1 0 0 1
## 4 0 1 0 0 0 1 0 0
Note that you can think of a as a list of indices, which themselves reference which cells in a "master array" are TRUE.
Then if you have a master matrix, say res of all 0's, you can then tell R: "all of the elements that are referenced in a should be 1"
This is done below
First we create the "master matrix"
# identify the unique teacher ID's
teacherIDs <- unique(a$Case_ID)
# count how many teachers there are
numbTeachers <- length(teacherIDs)
# create the column names for the questions
colNames <- c(paste0("Q_", 1:50))
# dim names for matrix. Using T_id for the row names
dnames <- list(paste0("T_", teacherIDs),
colNames)
# create the matrix
res2 <- matrix(0, ncol=50, nrow=numbTeachers, dimnames=dnames)
Next we convert a to a set of indices.
Note that the first two lines below are only needed if some teacher IDs are missing from the data, i.e. in your example, T_3 is not present.
# create index out of a
indx <- a
indx$Case_ID <- as.numeric(as.factor(indx$Case_ID))
indx <- as.matrix(indx)
# populate those in a with 1
res2[indx] <- 1
res2
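For comparison, a cross-tabulation might get to the same shape with less bookkeeping. A sketch on the sample data, assuming 8 possible questions as in the intended result (tab and wide are illustrative names):

```r
a <- data.frame(Case_ID = c(1, 1, 2, 2, 4, 4),
                Q_ID    = c(3, 5, 5, 8, 2, 6))

# fixing the factor levels makes unchosen questions show up as columns of 0
tab <- table(a$Case_ID, factor(a$Q_ID, levels = 1:8))

# as.data.frame.matrix() keeps the wide layout (one row per teacher)
wide <- as.data.frame.matrix(tab)
names(wide) <- paste0("Q_", colnames(tab))
wide <- cbind(Case_ID = as.numeric(rownames(tab)), wide)
rownames(wide) <- NULL
wide
```

This relies on each teacher choosing a question at most once; repeated pairs would give counts above 1.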

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks
This loops by column, setting dt1 by reference and (hopefully) should be quick.
library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
nna = !is.na(dt2[[i]])
set(dt1,w[nna],i,dt2[[i]][nna])
}
dt1 = rbind(dt1,dt2[is.na(w)])
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2
This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(as.character(df1$K), as.character(df2$K)) #all cells of key column (works whether K is a factor or character)
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
FUN = decider)) #replace all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading, not all columns have the same name, but I am assuming the same order. This may be a more helpful approach:
all <- c(as.character(df1$K), as.character(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
colnames(LS[[i]]) <- colnames(LS[[2]])
return(LS[[i]])
}
)
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
k = K[i]
p = unlist(dt2[i,-1,with=FALSE])
p = p[!is.na(p)]
dt1[J(k),names(p):=as.list(p),with=FALSE]
}
Or, can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
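A rough sketch of that matrix idea on the small example, assuming the key is moved into the row names and both matrices have the same columns in the same order (base, upd, hit, idx and result are illustrative names):

```r
base <- matrix(1, nrow = 3, ncol = 3,
               dimnames = list(c("A", "B", "C"), c("P1", "P2", "P3")))
upd <- matrix(c(2, NA, 2,
                NA, 2, 2,
                2, 2, 2), nrow = 3, byrow = TRUE,
              dimnames = list(c("B", "C", "D"), c("P1", "P2", "P3")))

# rows present in both matrices
common <- intersect(rownames(base), rownames(upd))
sub <- upd[common, , drop = FALSE]

# positions of the non-NA update cells, as a 2-column (row, col) matrix
hit <- which(!is.na(sub), arr.ind = TRUE)

# translate row positions in sub into row positions in base
idx <- cbind(match(common, rownames(base))[hit[, "row"]], hit[, "col"])

# one vectorised assignment using matrix indexing
base[idx] <- sub[hit]

# append the rows that exist only in the update
result <- rbind(base, upd[setdiff(rownames(upd), rownames(base)), , drop = FALSE])
result
```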
The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names

Creating a function to replace NAs from one data.frame with values from another

I regularly have situations where I need to replace missing values from a data.frame with values from some other data.frame that is at a different level of aggregation. So, for example, if I have a data.frame full of county data I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times I decided to break down and write a function to do this.
Here's what I cooked up, along with an example of how I use it:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
mergedDf <- merge(naDf, fillDf, by=mergeCols)
for (col in fillCols){
colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
k <- which( is.na( colWithNas ) )
colWithNas[k] <- colWithOutNas[k]
mergedDf[col] <- colWithNas
mergedDf[[paste(col, "x", sep=".")]] <- NULL
mergedDf[[paste(col, "y", sep=".")]] <- NULL
}
return(mergedDf)
}
## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
naDf <- data.frame( a = sample(c(1,2), 100, rep=TRUE), b = sample(c(3,4), 100, rep=TRUE), f = sample(c(0,NA), 100, rep=TRUE), g = sample(c(0,NA), 200, rep=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )
So after I got this running I had this odd feeling that someone has probably solved this problem before me and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way that eliminates the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from are named the same as the columns we are filling to, and the same applies to the merge columns.
Any guidance or refactoring would be helpful.
EDIT on Dec 2 I realized I had logic flaws in my example which I fixed.
What a great question.
Here's a data.table solution:
# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))
# Merge data.tables, based on their keys (columns a & b)
outDT <- naDT[fillDT]
# a b f g f.1 g.1
# [1,] 1 3 NA 0 100 11
# [2,] 1 3 NA NA 100 11
# [3,] 1 3 NA 0 100 11
# [4,] 1 3 0 0 100 11
# [5,] 1 3 0 NA 100 11
# First 5 rows of 200 printed.
# In outDT[i, j], on the following two lines
# -- i is a Boolean vector indicating which rows will be operated on
# -- j is an expression saying "(sub)assign from right column (e.g. f.1)
#    to left column (e.g. f)"
outDT[is.na(f), f:=f.1]
outDT[is.na(g), g:=g.1]
# Just keep the four columns ultimately needed
outDT <- outDT[,list(a,b,g,f)]
# a b g f
# [1,] 1 3 0 0
# [2,] 1 3 11 0
# [3,] 1 3 0 0
# [4,] 1 3 11 0
# [5,] 1 3 11 0
# First 5 rows of 200 printed.
Here's a slightly more concise/robust version of your approach. You could replace the for-loop with a call to lapply, but I find the loop easier to read.
This function assumes any columns not in mergeCols are fair game to have their NAs filled. I'm not really sure this helps, but I'll take my chances with the voters.
fillNaDf.ju <- function(naDf, fillDf, mergeCols) {
mergedDf <- merge(fillDf, naDf, by=mergeCols, suffixes=c(".fill",""))
dataCols <- setdiff(names(naDf),mergeCols)
# loop over all columns we didn't merge by
for(col in dataCols) {
rows <- is.na(mergedDf[,col])
# skip this column if it doesn't contain any NAs
if(!any(rows)) next
rows <- which(rows)
# replace NAs with values from fillDf
mergedDf[rows,col] <- mergedDf[rows,paste(col,"fill",sep=".")]
}
# don't return ".fill" columns
mergedDf[,names(naDf)]
}
My preference would be to pull out the code from merge that does the matching and do it myself, so that I could keep the ordering of the original data frame intact, both row-wise and column-wise. I also use matrix indexing to avoid any loops, though to do so I create a new data frame with the revised fillCols and replace the columns of the original with it. I thought I could fill it in directly, but apparently you can't use matrix indexing to replace parts of a data.frame, so I wouldn't be surprised if a loop over the names would be faster in some situations.
With matrix indexing:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
na.ind <- is.na(naDf[,fillCols])
fill.ind <- cbind(match(naB, fillB)[row(na.ind)[na.ind]], col(na.ind)[na.ind])
naX <- naDf[,fillCols]
fillX <- fillDf[,fillCols]
naX[na.ind] <- fillX[fill.ind]
naDf[,colnames(naX)] <- naX
naDf
}
With a loop:
fillNaDf2 <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
m <- match(naB, fillB)
for(col in fillCols) {
fix <- which(is.na(naDf[,col]))
naDf[fix, col] <- fillDf[m[fix],col]
}
naDf
}
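To see what the key matching does, here is a small hand-checkable sketch of the same technique with a single fill column. The toy data and the helper name key are made up for illustration.

```r
fillDf <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                     f = c(100, 200, 300, 400))
naDf <- data.frame(a = c(2, 1, 1), b = c(4, 3, 4), f = c(NA, 5, NA))

# build one string key per row from the merge columns
key <- function(d) do.call(paste, c(d[, c("a", "b"), drop = FALSE], sep = "\r"))

# for every row of naDf, the matching row number in fillDf
m <- match(key(naDf), key(fillDf))

# overwrite only the NA cells; the row order of naDf is untouched
fix <- is.na(naDf$f)
naDf$f[fix] <- fillDf$f[m[fix]]
naDf
```

Rows 1 and 3 pick up 400 and 300 from fillDf, while the existing value 5 in row 2 is left alone.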
