Populate a data.frame based on a vector of tokens - r

I have a simple data frame containing short strings, each of which has a particular class assigned:
datadb <- data.frame (
Class = c('Class1', 'Class2', 'Class3'),
Document = c('This is test', 'Yet another test', 'A last test')
)
datadb$Document <- tolower(datadb$Document)
datadb$Tokens <- strsplit(datadb$Document, " ")
From this, I want to build another data frame which contains the original Class column, but which has a new column added for each unique token, something like this:
all_tokens <- unlist(datadb$Tokens)
all_tokens <- unique(all_tokens)
number_of_columns <- length(all_tokens)
number_of_rows <- NROW(datadb)
tokenDB <- data.frame( matrix(ncol=(1 + number_of_columns), nrow=number_of_rows) )
names(tokenDB) <- c("Classification", all_tokens)
tokenDB$Classification <- datadb$Class
The tokenDB will then look like this:
Classification this is test yet another a last
1 Class1 NA NA NA NA NA NA NA
2 Class2 NA NA NA NA NA NA NA
3 Class3 NA NA NA NA NA NA NA
How can I go through the original data frame and add a value to the new tokenDB corresponding to each of the vectors already identified? The output should look like:
Classification this is test yet another a last
1 Class1 1 1 1 0 0 0 0
2 Class2 0 0 1 1 1 0 0
3 Class3 0 0 1 0 0 1 1
The output should ideally be a data.frame, but could also be a matrix.

Use the tm package, or really any other text mining package, to get the job done; I am partial to tm. What you are creating is a document-term matrix.
library(tm)
datadb <- data.frame (
Class = c('Class1', 'Class2', 'Class3'),
Document = c('This is test', 'Yet another test', 'A last test')
)
corpus <- Corpus(VectorSource(datadb$Document))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- cbind(datadb$Class, as.matrix(dtm))
colnames(dtm2) <- c("Classification", colnames(dtm))
dtm2
# Classification test this another yet last
# 1 1 1 1 0 0 0
# 2 2 1 0 1 1 0
# 3 3 1 0 0 0 1
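Note that cbind() on a matrix coerces the factor Class to its underlying integer codes, which is why the Classification column above shows 1, 2, 3 rather than the class labels. A minimal sketch of one way to keep the labels (assuming the same datadb and dtm objects from above), building a data frame instead of a matrix:
# keep the original class labels by combining the class column and the DTM in a data frame
dtm_df <- data.frame(Classification = datadb$Class,
                     as.matrix(dtm),
                     check.names = FALSE)
dtm_df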
Here is another way using only base R, starting from a list with one vector of tokens per document:
txt <- strsplit(as.character(datadb$Document), " ")  # one character vector of tokens per document
txt <- lapply(txt, function(x) data.frame(x, count = 1))
txt <- lapply(txt, function(x) data.frame(count = tapply(x$count, x$x, sum)))
tdm <- Reduce(function(...) merge(..., all=TRUE, by="x"),
lapply(txt, function(x) data.frame(x=rownames(x), count=x$count)))
rownames(tdm) <- tdm[, 1]
dtm3 <- t(tdm[, -1])
dtm3[is.na(dtm3)] <- 0
rownames(dtm3) <- paste("Doc", 1:3)
dtm3 <- cbind(Classification=datadb$Class, dtm3)
dtm3
# Classification is test This another Yet A last
# Doc 1 1 1 1 1 0 0 0 0
# Doc 2 2 0 1 0 1 1 0 0
# Doc 3 3 0 1 0 0 0 1 1

k <- lapply(datadb$Tokens, match, all_tokens)  # positions of each document's tokens within all_tokens
# set the matched token positions to 1 and all others to 0, one document (row) at a time
tokenDB[,-1] <- t(mapply(function(x,y) {y[x]<-1; y[-x]<-0; y}, k, data.frame(t(tokenDB[,-1]))))
tokenDB
Classification this is test yet another a last
1 Class1 1 1 1 0 0 0 0
2 Class2 0 0 1 1 1 0 0
3 Class3 0 0 1 0 0 1 1

Related

removing columns equal to 0 from multiple data frames in a list; lapply not actually removing columns when applying function to a list

I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above lines run, but neither actually removes the columns that sum to 0.
I am not sure why that is; any tips are greatly appreciated.
The OP found a solution by exchanging comments with me, but I want to note the following. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things: first, subset each data frame in my.list; second, return each data frame as it is. I think he assumed that each data frame was updated after the column subsetting, but the subset on the first line is computed and never used, and the second expression simply returns the original data frame, which is what lapply() then shows (so, on the surface, nothing changes). If I follow his approach, I would do something like this.
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
This creates a temporary object in the anonymous function and returns it. Alternatively, simply return the subset directly:
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, run a logical check for each column. If colSums(x) != 0 is TRUE, keep the column. Otherwise remove it. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
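A footnote on why the original attempt appeared to do nothing: a braced function body returns only its last expression, so the subset computed on the first line was evaluated and then thrown away. A minimal illustration:
f <- function(x) {x * 2; x}  # x * 2 is computed, then discarded; x is returned
f(3)
# [1] 3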

Assign value to new column if another column has exact string

I have a df that looks like this.
Date Winner
4/12 Tom
4/13 Abe
4/14 George
4/15 Tom
I would like to add new columns that assign a 1 if the name appears in the Winner column and 0 if the name does not appear, and vice versa. Ideally the df would look like this as a result:
Date Winner Tom_Win Tom_Lose Abe_Win Abe_Lose George_Win George_Lose
4/12 Tom 1 0 0 1 0 1
4/13 Abe 0 1 1 0 0 1
4/14 George 0 1 0 1 1 0
4/15 Tom 1 0 0 1 0 1
Is there an easy way to accomplish this?
This is extremely simple to do if you use the model.matrix function: it will create N dummy columns with 0 when the name does not appear and 1 when it does (exactly as you requested). The code is below:
(assuming your data is called db)
> winners <- model.matrix(~Winner - 1, data=db)
> winners
WinnerAbe WinnerGeorge WinnerTom
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
This bit computes the columns with the losing values:
winners <- as.data.frame(winners)
winners$loserAbe <- as.numeric(!winners$WinnerAbe) #naturally you have to
#do this for every column you need
WinnerAbe WinnerGeorge WinnerTom loserAbe
1 0 0 1 1
2 1 0 0 0
3 0 1 0 1
4 0 0 1 1
winners$Date <- db$Date #this last bit so you don't lose the date.
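Rather than writing one such line per player, one possible generalization (a sketch; it can replace the manual loserAbe line above):
win_cols <- grep("^Winner", names(winners), value = TRUE)
# add a loser* column for every Winner* column by flipping the 0/1 indicator
winners[sub("^Winner", "loser", win_cols)] <- 1 - winners[win_cols]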
Using mtabulate from the qdapTools package, we can do this in the following three steps:
library(qdapTools)
d1 <- mtabulate(d3$Winner)
d2 <- setNames(data.frame(sapply(d1, function(i) ifelse(i == 1, 0, 1))),
paste0(names(d1), '_Lose'))
cbind(d3$Date, d1, d2)
# d3$Date Abe George Tom Abe_Lose George_Lose Tom_Lose
#1 4/12 0 0 1 1 1 0
#2 4/13 1 0 0 0 1 1
#3 4/14 0 1 0 1 0 1
#4 4/15 0 0 1 1 1 0
DATA
str(d3)
'data.frame': 4 obs. of 2 variables:
$ Date : Factor w/ 4 levels "4/12","4/13",..: 1 2 3 4
$ Winner: Factor w/ 3 levels "Abe","George",..: 3 1 2 3
I'm sure there is a better way than this but this works in base R and it's fairly simple:
If your data looks like this:
df <- data.frame(Date = c("4/12","4/13","4/14","4/15"),Winner = c("Tom","Abe","George","Tom"))
Append the extra columns like so:
xcols <- c(paste0(unique(df$Winner), '_Win'), paste0(unique(df$Winner), '_Lose'))
df[ , xcols] <- 0
Now make a character vector with instructions to give the points for every player.
evl <- unlist(lapply(unique(df$Winner), function(x){
  paste0('df[', which(df$Winner == x), ',',
         which(names(df) == paste0(x, '_Win')), '] <- 1')
}))
And execute the code:
eval(parse(text = evl))
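At this point the _Lose columns are still all zero, since the generated instructions only set the _Win cells. A minimal sketch of one way to fill them in (same df as above):
for (p in unique(df$Winner)) {
  df[df$Winner != p, paste0(p, "_Lose")] <- 1  # mark a loss in every row the player did not win
}
df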
df <- data.frame(
  Date = c("4/12", "4/13","4/14", "4/15"),
  Winner = c("Tom", "Abe", "George", "Tom"),
  stringsAsFactors = TRUE  # needed in R >= 4.0 so that levels(df$Winner) below is not NULL
)
df2 <- do.call(cbind,
lapply(seq_along(levels(df$Winner)), function(x) {
win <- ifelse(df$Winner == levels(df$Winner)[x], 1, 0)
lose <- ifelse(df$Winner == levels(df$Winner)[x], 0, 1)
dat <- cbind(win, lose)
colnames(dat) <- c(paste(levels(df$Winner)[x], "win", sep = "_"), paste(levels(df$Winner)[x], "lose", sep = "_"))
dat
})
)
cbind(df, df2)
> cbind(df, df2)
Date Winner Abe_win Abe_lose George_win George_lose Tom_win Tom_lose
1 4/12 Tom 0 1 0 1 1 0
2 4/13 Abe 1 0 0 1 0 1
3 4/14 George 0 1 1 0 0 1
4 4/15 Tom 0 1 0 1 1 0

extract information from a data frame

I have a data frame like below
df<- structure(list(s1 = structure(1:3, .Label = c("3-4", "4-1", "5-4"
), class = "factor"), s2 = structure(1:3, .Label = c("2-4", "3-15",
"7-16"), class = "factor")), .Names = c("s1", "s2"), row.names = c(NA,
-3L), class = "data.frame")
Looks like below
> df
# s1 s2
#1 3-4 2-4
#2 4-1 3-15
#3 5-4 7-16
What I want to do is first search for and find those values that are the same after the -;
for example, here 4 is in the first row of s1, the first row of s2, and the third row of s1.
- The second column indicates how many times those values were found
- The third column shows how many of them are from the first column of df
- The fourth column shows how many of them are from the second column of df
- The fifth is which strings are from the first column
- The sixth is which strings are from the second column
The output should look like this:
Value repeated s1N s2N ss1 ss2
4 3 2 1 3,5 2
1 1 1 - 4 -
15 1 - 1 - 3
16 1 - 1 - 7
Surprisingly tough problem. It's good to break it down into several logical steps:
## 1: split into (val,ss) pairs, and capture ci (column index) association
res <- setNames(do.call(rbind,lapply(seq_along(df),function(ci)
do.call(rbind,lapply(strsplit(as.character(df[[ci]]),'-'),function(x)
data.frame(x[2L],x[1L],ci,stringsAsFactors=F)
))
)),c('val','ss','ci'));
res;
## val ss ci
## 1 4 3 1
## 2 1 4 1
## 3 4 5 1
## 4 4 2 2
## 5 15 3 2
## 6 16 7 2
## 2: aggregate ss (joining on comma) by (val,ci), and capture record count as n
res <- do.call(rbind,by(res,res[c('val','ci')],function(x)
data.frame(val=x$val[1L],ci=x$ci[1L],n=nrow(x),ss=paste(x$ss,collapse=','),stringsAsFactors=F)
));
res;
## val ci n ss
## 1 1 1 1 4
## 2 4 1 2 3,5
## 3 15 2 1 3
## 4 16 2 1 7
## 5 4 2 1 2
## 3: reshape to wide format
res <- reshape(res,idvar='val',timevar='ci',dir='w');
res;
## val n.1 ss.1 n.2 ss.2
## 1 1 1 4 NA <NA>
## 2 4 2 3,5 1 2
## 3 15 NA <NA> 1 3
## 4 16 NA <NA> 1 7
## 4: add repeated column; can be calculated by summing all n.* columns
## note: leveraging psum() from <http://stackoverflow.com/questions/12139431/add-variables-whilst-ignoring-nas-using-transform-function>
psum <- function(...,na.rm=F) { x <- list(...); rowSums(matrix(unlist(x),ncol=length(x)),na.rm=na.rm); };
res$repeated <- do.call(psum,c(res[grep('^n\\.[0-9]+$',names(res))],na.rm=T));
res;
## val n.1 ss.1 n.2 ss.2 repeated
## 1 1 1 4 NA <NA> 1
## 2 4 2 3,5 1 2 3
## 3 15 NA <NA> 1 3 1
## 4 16 NA <NA> 1 7 1
With regard to the NAs, you can fix them up afterward if you want. However, I would advise that the proper type of the n.* columns is integer, since they represent counts; therefore, the use of '-' (as in your sample output) to represent null cells is inappropriate. I would suggest zero instead. The dash is fine for the ss.* columns, since they are strings. Here's how you can do this:
n.cis <- grep('^n\\.[0-9]+$',names(res));
ss.cis <- grep('^ss\\.[0-9]+$',names(res));
res[n.cis][is.na(res[n.cis])] <- 0L;
res[ss.cis][is.na(res[ss.cis])] <- '-';
res;
## val n.1 ss.1 n.2 ss.2 repeated
## 1 1 1 4 0 - 1
## 2 4 2 3,5 1 2 3
## 3 15 0 - 1 3 1
## 4 16 0 - 1 7 1
First thing you will need to do is extract the numbers from your strings. Running:
newdfstring <- apply(df,1:2, function(v) unlist(strsplit(v,"-")))
newdf <- apply(newdfstring,1:3, as.numeric)
splits the strings in the first line and converts them to numeric values in the second. The result is a 3-dimensional array which you can use to extract your values.
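For the example df this gives a 2 x 3 x 2 array: the first index is the piece of the split string (before or after the dash), the second is the row of df, and the third is the column of df, so newdf[2,,] holds the numbers after the dash:
dim(newdf)
# [1] 2 3 2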
First create a new dataframe:
#length of the columns in the new frame = number of unique values
dflength <- length(unique(array(newdf[2,,])))
dfout <- data.frame(Value=rep(0,dflength), repeated=rep(0,dflength),
                    s1N=rep(0,dflength), s2N=rep(0,dflength),
                    ss1=rep(0,dflength), ss2=rep(0,dflength))
The most obvious way (yet maybe not the most efficient) would then be to loop and match whatever it is you need:
dfout$Value <- unique(array(newdf[2,,]))
for(i in 1:dflength){
  getID <- which(as.data.frame(table(newdf[2,,]))$Var1==dfout$Value[i])
  dfout$repeated[i] <- as.data.frame(table(newdf[2,,]))$Freq[getID]
  dfout$s1N[i] <- as.data.frame(table(newdf[2,,1]))$Freq[getID]
  if(is.na(dfout$s1N[i])){
    dfout$s1N[i] <- 0
  }
  dfout$s2N[i] <- as.data.frame(table(newdf[2,,2]))$Freq[getID]
  if(is.na(dfout$s2N[i])){
    dfout$s2N[i] <- 0
  }
  getID <- which(newdf[2,,1]==dfout$Value[i])
  if(length(getID)>0){
    dfout$ss1[i] <- toString(newdf[1,,1][getID])
  } else {
    dfout$ss1[i] <- 0
  }
  getID <- which(newdf[2,,2]==dfout$Value[i])
  if(length(getID)>0){
    dfout$ss2[i] <- toString(newdf[1,,2][getID])
  } else {
    dfout$ss2[i] <- 0
  }
}
dfout
# Value repeated s1N s2N ss1 ss2
#1 4 3 2 1 3, 5 2
#2 1 1 1 1 4 0
#3 15 1 0 1 0 3
#4 16 1 0 0 0 7
EDIT: to loop over any number of s variables
newdfstring <- apply(df,1:2, function(v) unlist(strsplit(v,"-")))
newdf <- apply(newdfstring,1:3, as.numeric)
dflength <- length(unique(array(newdf[2,,])))
#find the number of s variables
slength <- length(newdf[1,1,])
#create a matrix of appropriate size
dfout <- matrix(data=NA,nrow=dflength,ncol=(2+2*slength))
#create a (near)-empty names array, we will fill it in later
names <- c("Value","repeated",rep("",2*slength))
#fill in the Values column
dfout[,1] <- unique(array(newdf[2,,]))
#loop for every s variable
for(j in 1:slength){
  #get their names, paste N or s and add them to the names array
  names[2+j] <- paste(names(df)[j],"N",sep="")
  names[2+j+slength] <- paste("s",names(df)[j],sep="")
  #loop to get the other values
  for(i in 1:dflength){
    getID <- which(as.data.frame(table(newdf[2,,]))$Var1==dfout[i,1])
    dfout[i,2] <- as.data.frame(table(newdf[2,,]))$Freq[getID]
    dfout[i,2+j] <- as.data.frame(table(newdf[2,,j]))$Freq[getID]
    if(is.na(dfout[i,2+j])){
      dfout[i,2+j] <- 0
    }
    getID <- which(newdf[2,,j]==dfout[i,1])
    if(length(getID)>0){
      dfout[i,2+j+slength] <- toString(newdf[1,,j][getID])
    } else {
      dfout[i,2+j+slength] <- 0
    }
  }
}
colnames(dfout)<-names
as.data.frame(dfout)
# Value repeated s1N s2N ss1 ss2
#1 4 3 2 1 3, 5 2
#2 1 1 1 1 4 0
#3 15 1 0 1 0 3
#4 16 1 0 0 0 7
df <-
structure(
list(
s1 = structure(1:3, .Label = c("3-4", "4-1", "5-4"), class = "factor"),
s2 = structure(1:3, .Label = c("2-4", "3-15", "7-16"), class = "factor"
)
), .Names = c("s1", "s2"), row.names = c(NA,-3L), class = "data.frame"
)
library(tidyr)
library(dplyr)
# Split columns at "-" and add to data.frame
splitCols <- function(df) {
  new_headers <- paste("s1", c("1st", "2nd"), sep = "_")
  split_1 <- (separate(df, s1, into = new_headers, sep = "-"))[,new_headers]
  split_1$s1_1st <- as.integer(split_1$s1_1st)
  split_1$s1_2nd <- as.integer(split_1$s1_2nd)
  new_headers <- paste("s2", c("1st", "2nd"), sep = "_")
  split_2 <- (separate(df, s2, into = new_headers, sep = "-"))[,new_headers]
  split_2$s2_1st <- as.integer(split_2$s2_1st)
  split_2$s2_2nd <- as.integer(split_2$s2_2nd)
  cbind(df, split_1, split_2)
}
# given a df outputted from splitCols return final df
analyzeDF <- function(df) {
  target_vals <- unique(c(df$s1_2nd, df$s2_2nd)) # for each uniq val compute stuff
  out_df <- data.frame(Value = integer(0),
                       repeated = integer(0),
                       s1N = integer(0),
                       s2N = integer(0),
                       ss1 = character(0),
                       ss2 = character(0))
  # iterate through target_vals, create a row of output,
  # and append to out_df
  for (val in target_vals) {
    s1_match <- val == df$s1_2nd
    s2_match <- val == df$s2_2nd
    total_cnt <- sum(s1_match, s2_match)
    s1_firstcol <- paste(df$s1_1st[s1_match], collapse = ",")
    s2_firstcol <- paste(df$s2_1st[s2_match], collapse = ",")
    # coerce empty string to "-"
    if (s1_firstcol == "") s1_firstcol <- "-"
    if (s2_firstcol == "") s2_firstcol <- "-"
    row_df <- data.frame(Value = val,
                         repeated = total_cnt,
                         s1N = sum(s1_match),
                         s2N = sum(s2_match),
                         ss1 = s1_firstcol,
                         ss2 = s2_firstcol)
    out_df <- rbind(out_df, row_df)
  }
  return(out_df)
}
(df_split <- splitCols(df))
analyzeDF(df_split)
## Value repeated s1N s2N ss1 ss2
## 1 4 3 2 1 3,5 2
## 2 1 1 1 0 4 -
## 3 15 1 0 1 - 3
## 4 16 1 0 1 - 7
I have completely rewritten the code based on your real data, and I have tested it on my machine. Since it is a pretty big data frame, it takes some time to run, and the loops are not avoidable in my opinion.
# function to split the strings
myfun<-function(x){
  x<-strsplit(as.character(x), '-')
  x1<-unlist(x)
  x.new<-as.data.frame(matrix(x1, byrow = T, length(x)))
  return(x.new)
}
# this returns a list of dataframes
list.v<-lapply(df[1:dim(df)[2]], myfun)
# like this
head(list.v[[17]])
# try to combine all the dfs, produced an error of mismatching # of columns
df2<-do.call(rbind, list.v)
# some of the dfs in list.v are all NA's, they should be dropped
sum<-summary(list.v)
list.v<-list.v[-which(sum[,1] != "2")] # this excludes those all-NA dataframes in list.v
# now combine all dfs for indexing purposes
df2<-do.call(rbind, list.v)
# create "value", "repeated" column in the desired result df.
# These codes are same as my previous answer
value<-names(table(df2[,2]))
repeated<-as.vector(table(df2[,2]))
# create an empty list to store the counts columns
list.count<-vector("list", length = length(list.v))
# every df in list.v has same number of rows, get the row number
rownum<-nrow(list.v[[1]])
# use a for loop to fill out list.count
for(i in 0:(length(list.count)-1)){
  row.start<-i*rownum+1 # it is kind of tricky here
  row.end<-(i+1)*rownum # same as above
  list.count[[i + 1]]<-as.vector(table(df2[,2][row.start:row.end]))
}
# combine the vectors in list.count and assign names
count.df<-do.call(cbind, list.count)
count.df<-as.data.frame(count.df)
# create & assign column names in the format of "s_n", where "_" is filled with the corresponding original column name
names.cnt<-character()
for(i in 1:length(names(list.v))){
  names.cnt[i]<-paste("s", names(list.v)[i], "n", sep="")
}
names(count.df)<-names.cnt
# this is a very long loop to concatenate the strings and store them into a matrix, but it gets the job done here.
ss.store<-matrix(,nrow = length(value), ncol = length(list.v), byrow = FALSE)
for(i in 1:length(list.v)){
  for(j in 1:length(value)){
    ss.store[j,i]<-paste(list.v[[i]][,1][which(list.v[[i]][,2] == value[j])], collapse =",")
  }
}
# create a df for strings
string.df<-as.data.frame(ss.store, stringsAsFactors = FALSE)
# create & assign names to the df
names.str<-character()
for(i in 1:length(names(list.v))){
  names.str[i]<-paste("s", "s", names(list.v)[i], sep="")
}
names(string.df)<-names.str
# combine everything and form the new data frame
new.df<-cbind(value, repeated, count.df, string.df, stringsAsFactors = FALSE)
new.df[1:10, 1:15]
value repeated sAn sF1n sF2n sF3n sF4n sF5n sF6n sF7n sF8n sF9n sF10n sF11n sF12n
1 100 155 3 0 0 0 0 0 0 0 0 0 0 0 0
2 1005 14 1 0 0 0 0 0 0 0 0 0 0 0 0
3 1006 50 1 0 0 0 0 0 0 0 0 0 0 0 0
4 1023 1 1 0 0 0 0 0 0 0 0 0 0 0 0
5 1025 38 1 0 0 0 0 0 0 0 0 0 0 0 0
6 1030 624 1 0 1 2 0 0 0 0 0 0 1 0 0
7 1035 1 1 0 0 0 0 0 0 0 0 0 0 0 0
8 104 165 2 0 0 0 0 0 0 0 0 0 0 0 0
9 1076 186 2 0 0 0 0 0 0 0 0 0 0 0 0
10 1078 333 3 0 0 0 0 0 0 0 0 0 0 0 0

Recode Multiple Columns to Single Variable

I have some qualitative data that I have coded into various categories and I want to provide summaries for subgroups. The RQDA package is great for coding interviews, but I've struggled with creating summaries for open-ended survey responses. I've managed to export the coded file into HTML and copy/paste it into Excel. I now have 500 lines with all the categories in distinct columns; however, the same code may appear in different columns. For example, some data:
a <- c("ResponseA", "ResponseB", "ResponseC", "ResponseD", "NA")
b <- c("ResponseD", "ResponseC", "NA", "NA","NA")
c <- c("ResponseB", "ResponseA", "ResponseE", "NA", "NA")
d <- c("ResponseC", "ResponseB", "ResponseA", "NA", "NA")
df <- data.frame (a,b,c,d)
I'd like to be able to run something like
df$ResponseA <- recode (df$a | df$b | df$c, "
'ResponseA' = '1';
else='0' ")
df$ResponseB <- recode (df$a | df$b | df$c, "
'ResponseB' = '1';
else='0' ")
In short, I'd like to scan 9 columns and recode them into a single binary variable.
If I understand the question correctly, perhaps you can try something like this:
## Convert your data into a long format first
dfL <- cbind(id = sequence(nrow(df)), stack(lapply(df, as.character)))
## The next three lines are mostly cleanup
dfL$id <- factor(dfL$id, sequence(nrow(df)))
dfL$values[dfL$values == "NA"] <- NA
dfL <- dfL[complete.cases(dfL), ]
## `table` is the real workhorse here
cbind(df, (table(dfL[1:2]) > 0) * 1)
# a b c d ResponseA ResponseB ResponseC ResponseD ResponseE
# 1 ResponseA ResponseD ResponseB ResponseC 1 1 1 1 0
# 2 ResponseB ResponseC ResponseA ResponseB 1 1 1 0 0
# 3 ResponseC NA ResponseE ResponseA 1 0 1 0 1
# 4 ResponseD NA NA NA 0 0 0 1 0
# 5 NA NA NA NA 0 0 0 0 0
You can also try the following:
(table(rep(1:nrow(df), ncol(df)), unlist(df)) > 0) * 1L
#
# NA ResponseA ResponseB ResponseC ResponseD ResponseE
# 1 0 1 1 1 1 0
# 2 0 1 1 1 0 0
# 3 1 1 0 1 0 1
# 4 1 0 0 0 1 0
# 5 1 0 0 0 0 0
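In this second approach the literal "NA" strings get an indicator column of their own; if that is unwanted, one possible follow-up (a sketch, with tab as a hypothetical name for the matrix above) is to drop that column:
tab <- (table(rep(1:nrow(df), ncol(df)), unlist(df)) > 0) * 1L
tab[, colnames(tab) != "NA", drop = FALSE]  # keep all columns except the "NA" indicator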

combine tables into a data frame

How do I turn a list of tables into a data frame?
I have:
> (tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b'))))
[[1]]
a b
2 1
[[2]]
b c
1 2
[[3]]
< table of extent 0 >
[[4]]
b
2
I want:
> data.frame(a=c(2,0,0,0),b=c(1,1,0,2),c=c(0,2,0,0))
a b c
1 2 1 0
2 0 1 2
3 0 0 0
4 0 2 0
PS. Please do not assume that the tables were created by table calls! They were not!
c_names <- unique(unlist(sapply(tabs, names)))
df <- do.call(rbind, lapply(tabs, `[`, c_names))
colnames(df) <- c_names
df[is.na(df)] <- 0
This assumes the tables are one dimensional.
all.names <- unique(unlist(lapply(tabs, names)))
df <- as.data.frame(do.call(rbind,
lapply(
tabs, function(x) as.list(replace(c(x)[all.names], is.na(c(x)[all.names]), 0))
) ) )
names(df) <- all.names
df
There is probably a cleaner way to do this.
# a b c
# 1 2 1 0
# 2 0 1 2
# 3 0 0 0
# 4 0 2 0
tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b')))
dat.names <- unique(unlist(sapply(tabs, names)))
dat <- matrix(0, nrow = length(tabs), ncol = length(dat.names))
colnames(dat) <- dat.names
for (ii in 1:length(tabs)) {
dat[ii, ] <- tabs[[ii]][match(colnames(dat), names(tabs[[ii]]) )]
}
dat[is.na(dat)] <- 0
> dat
a b c
[1,] 2 1 0
[2,] 0 1 2
[3,] 0 0 0
[4,] 0 2 0
Here is a pretty clean approach:
library(reshape2)
newTabs <- melt(tabs)
newTabs
# Var1 value L1
# 1 a 2 1
# 2 b 1 1
# 3 b 1 2
# 4 c 2 2
# 5 b 2 4
newTabs$L1 <- factor(newTabs$L1, seq_along(tabs))
dcast(newTabs, L1 ~ Var1, fill = 0, drop = FALSE)
# L1 a b c
# 1 1 2 1 0
# 2 2 0 1 2
# 3 3 0 0 0
# 4 4 0 2 0
This makes use of the fact that there is a melt method for lists (see reshape2:::melt.list) which automatically adds in a variable (L1 for an unnested list) that identifies the index of the list element. Since your list has some items which are empty, they won't show up in your melted list, so you need to factor the "L1" column, specifying the levels you want. dcast takes care of restructuring your output and allows you to specify the desired fill value.
