Recode Multiple Columns to Single Variable - r

I have some qualitative data that I have coded into various categories, and I want to provide summaries for subgroups. The RQDA package is great for coding interviews, but I've struggled with creating summaries for open-ended survey responses. I've managed to export the coded file to HTML and copy/paste it into Excel. I now have 500 lines with all the categories in distinct columns; however, the same code may appear in different columns. For example, some data:
a <- c("ResponseA", "ResponseB", "ResponseC", "ResponseD", "NA")
b <- c("ResponseD", "ResponseC", "NA", "NA","NA")
c <- c("ResponseB", "ResponseA", "ResponseE", "NA", "NA")
d <- c("ResponseC", "ResponseB", "ResponseA", "NA", "NA")
df <- data.frame (a,b,c,d)
I'd like to be able to run something like
df$ResponseA <- recode (df$a | df$b | df$c, "
'ResponseA' = '1';
else='0' ")
df$ResponseB <- recode (df$a | df$b | df$c, "
'ResponseB' = '1';
else='0' ")
In short, I'd like to scan 9 columns and recode them into a single binary variable.

If I understand the question correctly, perhaps you can try something like this:
## Convert your data into a long format first
dfL <- cbind(id = sequence(nrow(df)), stack(lapply(df, as.character)))
## The next three lines are mostly cleanup
dfL$id <- factor(dfL$id, sequence(nrow(df)))
dfL$values[dfL$values == "NA"] <- NA
dfL <- dfL[complete.cases(dfL), ]
## `table` is the real workhorse here
cbind(df, (table(dfL[1:2]) > 0) * 1)
# a b c d ResponseA ResponseB ResponseC ResponseD ResponseE
# 1 ResponseA ResponseD ResponseB ResponseC 1 1 1 1 0
# 2 ResponseB ResponseC ResponseA ResponseB 1 1 1 0 0
# 3 ResponseC NA ResponseE ResponseA 1 0 1 0 1
# 4 ResponseD NA NA NA 0 0 0 1 0
# 5 NA NA NA NA 0 0 0 0 0
You can also try the following:
(table(rep(1:nrow(df), ncol(df)), unlist(df)) > 0) * 1L
#
# NA ResponseA ResponseB ResponseC ResponseD ResponseE
# 1 0 1 1 1 1 0
# 2 0 1 1 1 0 0
# 3 1 1 0 1 0 1
# 4 1 0 0 0 1 0
# 5 1 0 0 0 0 0
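For readers who prefer to see one indicator built per code, here is a minimal sketch of the same idea without reshaping. It assumes, as in the question, that missing values are stored as the literal string "NA"; the names codes and indicators are made up for this illustration.

```r
# Same toy data as in the question; "NA" is a literal string here
df <- data.frame(
  a = c("ResponseA", "ResponseB", "ResponseC", "ResponseD", "NA"),
  b = c("ResponseD", "ResponseC", "NA", "NA", "NA"),
  c = c("ResponseB", "ResponseA", "ResponseE", "NA", "NA"),
  d = c("ResponseC", "ResponseB", "ResponseA", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper vector: every distinct code except the "NA" placeholder
codes <- sort(setdiff(unique(unlist(df)), "NA"))

# For each code, flag the rows where it appears in any of the columns
indicators <- sapply(codes, function(code) as.integer(rowSums(df == code) > 0))

cbind(df, indicators)
```

With nine columns instead of four, only the data frame changes; the comparison df == code always scans every column.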

Related

Sub-setting or arrange the data in R

As I am new to R, this question may seem like a piece of cake to you.
I have data in txt format. The first column has the cluster number and the second column has the names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data in following format.
Cluster No. Org 1 Org 2 org3 org4
0 0 0 1
1 0 0 0
I could not figure out how to do it in R.
Thanks
We could use table
out <- cbind(ClusterNo = seq_len(nrow(df1)), as.data.frame.matrix(table(seq_len(nrow(df1)),
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
It is also possible that we need to use the first column to get the frequency
out1 <- as.data.frame.matrix(table(df1[[1]],
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
# as.integer() so that max() below compares numbers rather than strings
input[, 3] <- as.integer(gsub('^org(.+)\\|.*', '\\1', input[, 2]))
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
possibleOrgs %in% input[input[, 1] == x, 3], logical(length(possibleOrgs)))
We can then format this result as we like, perhaps using t to transpose it, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames and rownames to label its columns and rows:
result <- t(result) * 1
colnames(result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1
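The same presence/absence matrix can probably be obtained in one step by combining the two answers: tabulate cluster number against organism and test for counts above zero. This is only a sketch. The inline text stands in for filename.txt, and the four organism levels are taken from the first answer.

```r
# Example rows copied from the question (stand-in for the txt file)
input <- read.table(text = "
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
", stringsAsFactors = FALSE)

# Strip the gene part; fixing the levels keeps empty org2/org3 columns
org <- factor(sub("\\|.*", "", input$V2), levels = paste0("org", 1:4))

# Count per cluster and organism, then reduce counts to 0/1
result <- (table(cluster = input$V1, org) > 0) * 1
result
```

Cluster 10 appears twice in the input, so its row has a 1 under both org1 and org4.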

removing columns equal to 0 from multiple data frames in a list; lapply not actually removing columns when applying function to a list

I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above will run, but neither actually removes the columns that sum to 0.
I am not sure why that is; any tips are greatly appreciated.
The OP found a solution by exchanging comments with me, but I want to leave the following explanation. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things. The first was subsetting each data frame in my.list. The second was returning each data frame. I think the OP assumed that each data frame was updated after subsetting its columns, but the subset was never assigned to anything, so the second command simply returned each data frame as it was. (On the surface, no change was visible.) Following the OP's approach, I would do something like this.
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
That is, create a temporary object in the anonymous function and return that object. Alternatively, the following does the same thing more directly.
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, run a logical check for each column. If colSums(x) != 0 is TRUE, keep the column. Otherwise remove it. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
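To make the difference concrete, here is a small sketch on a made-up list (small.list is hypothetical, not the OP's data). It also adds drop = FALSE, which is not in the answer above; the assumption is that a data frame should come back even when only one column survives.

```r
# Hypothetical small list standing in for my.list
small.list <- list(
  data.frame(a = c(0, 1), b = c(0, 0), c = c(1, 0)),
  data.frame(g = c(1, 0), h = c(0, 0))
)

# The subset is computed and thrown away; the untouched x is the last
# expression, so that is what lapply collects
unchanged <- lapply(small.list, function(x) { x[, colSums(x) != 0]; x })

# The subset itself is the last expression, so it is returned;
# drop = FALSE keeps a data frame even with a single remaining column
cleaned <- lapply(small.list, function(x) x[, colSums(x) != 0, drop = FALSE])

sapply(unchanged, ncol)
sapply(cleaned, ncol)
```

Note that colSums needs numeric input, so with non-numeric leading columns (as the question describes for the real data) the check would have to be restricted to the numeric columns first.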

selecting all rows which has value > 1 in r

I have a data set and want to select all rows which have a value > 1 in R.
I tried
sel <- apply(data[,collist],1,function(row) "1" %in% row)
but it is not working and gives me the whole data frame.
How can I subset these data?
Thanks
The Note at the end shows the data used in the examples below. I have changed the headings as shown since the ones provided in the question are unwieldy and have removed the column of minus signs.
1) Using that data, the correct answer to the question of selecting all rows with a 1 in any column is that only the first two data rows are selected and that is, in fact, what happens:
subset(data, A == 1 | B == 1 | C == 1)
## Sym A B C
## 1 ACAP3 0 0 1
## 2 ACTRT2 0 0 1
2) This version does not make use of the headings:
has1 <- rowSums(data == 1) > 0
data[has1, ]
## Sym A B C
## 1 ACAP3 0 0 1
## 2 ACTRT2 0 0 1
3) Although the above should work, it would be a bit safer to check only the numeric columns, which for this data can be done like this:
has1 <- rowSums(data[-1] == 1) > 0
data[has1, ]
## Sym A B C
## 1 ACAP3 0 0 1
## 2 ACTRT2 0 0 1
4) or if we did not know which columns were numeric:
is.num <- sapply(data, is.numeric)
has1 <- rowSums(data[is.num] == 1) > 0
data[has1, ]
## Sym A B C
## 1 ACAP3 0 0 1
## 2 ACTRT2 0 0 1
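Since the question title asks for values > 1 while the examples test for == 1, the condition in approach 4 can be made a parameter. The following is only a sketch: rows_where_any is a hypothetical helper name, and the data frame is typed in by hand to match the Note below.

```r
# Data as reconstructed in the answer's Note
data <- data.frame(
  Sym = c("ACAP3", "ACTRT2", "AGRN", "ANKRD65", "ATAD3A"),
  A = c(0, 0, 0, 0, 0),
  B = c(0, 0, 0, 0, 0),
  C = c(1, 1, 0, 0, 0)
)

# Hypothetical helper: keep rows where any numeric column satisfies `test`
rows_where_any <- function(d, test) {
  is.num <- sapply(d, is.numeric)
  d[rowSums(test(d[is.num]), na.rm = TRUE) > 0, , drop = FALSE]
}

rows_where_any(data, function(x) x == 1)   # rows containing a 1
rows_where_any(data, function(x) x > 1)    # rows containing a value above 1
```

On this data the first call keeps the two rows shown above and the second keeps none, because no value exceeds 1.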
Note
As the question did not provide input in reproducible form, the input shown in such form is assumed to be:
Lines <- 'Hugo_Symbol "A - 3 A- A9J" "B - F2 - 7273 - 01" "C - FB - AAPP - 01"
ACAP3 0 0 - 1
ACTRT2 0 0 - 1
AGRN 0 0 - 0
ANKRD65 0 0 - 0
ATAD3A 0 0 - 0
'
data <- read.table(text = Lines, skip = 1, col.names = c("Sym", "A", "B", "X", "C"),
colClasses = c(NA, NA, NA, "NULL", NA))
The above produces this:
data
## Sym A B C
## 1 ACAP3 0 0 1
## 2 ACTRT2 0 0 1
## 3 AGRN 0 0 0
## 4 ANKRD65 0 0 0
## 5 ATAD3A 0 0 0

Populate a data.frame based on a vector of tokens

I have a simple data frame containing short strings, each of which has a particular class assigned:
datadb <- data.frame (
Class = c('Class1', 'Class2', 'Class3'),
Document = c('This is test', 'Yet another test', 'A last test')
)
datadb$Document <- tolower(datadb$Document)
datadb$Tokens <- strsplit(datadb$Document, " ")
From this, I want to build another data frame which contains the original Class column, but which has a new column added for each unique token, something like this:
all_tokens <- unlist(datadb$Tokens)
all_tokens <- unique(all_tokens)
number_of_columns <- length(all_tokens)
number_of_rows <- NROW(datadb)
tokenDB <- data.frame( matrix(ncol=(1 + number_of_columns), nrow=number_of_rows) )
names(tokenDB) <- c("Classification", all_tokens)
tokenDB$Classification <- datadb$Class
The tokenDB will then look like this:
Classification this is test yet another a last
1 Class1 NA NA NA NA NA NA NA
2 Class2 NA NA NA NA NA NA NA
3 Class3 NA NA NA NA NA NA NA
How can I go through the original data frame and add a value to the new tokenDB corresponding to each of the vectors already identified? The output should look like:
Classification this is test yet another a last
1 Class1 1 1 1 0 0 0 0
2 Class2 0 0 1 1 1 0 0
3 Class3 0 0 1 0 0 1 1
The output should ideally be a data.frame, but could also be a matrix.
Use the tm package, or really any other text mining package, to get the job done. I am partial to tm. What you are creating is a document-term matrix.
library(tm)
datadb <- data.frame (
Class = c('Class1', 'Class2', 'Class3'),
Document = c('This is test', 'Yet another test', 'A last test')
)
corpus <- Corpus(VectorSource(datadb$Document))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- cbind(datadb$Class, as.matrix(dtm))
colnames(dtm2) <- c("Classification", colnames(dtm))
dtm2
# Classification test this another yet last
# 1 1 1 1 0 0 0
# 2 2 1 0 1 1 0
# 3 3 1 0 0 0 1
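Here is another way using only base
## Assumption: `txt` is not defined in the original; the output below suggests
## it holds the documents split into words, without lowercasing
txt <- strsplit(as.character(datadb$Document), " ")
txt <- lapply(txt, function(x) data.frame(x, count = 1))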
txt <- lapply(txt, function(x) data.frame(count = tapply(x$count, x$x, sum)))
tdm <- Reduce(function(...) merge(..., all=TRUE, by="x"),
lapply(txt, function(x) data.frame(x=rownames(x), count=x$count)))
rownames(tdm) <- tdm[, 1]
dtm3 <- t(tdm[, -1])
dtm3[is.na(dtm3)] <- 0
rownames(dtm3) <- paste("Doc", 1:3)
dtm3 <- cbind(Classification=datadb$Class, dtm3)
dtm3
# Classification is test This another Yet A last
# Doc 1 1 1 1 1 0 0 0 0
# Doc 2 2 0 1 0 1 1 0 0
# Doc 3 3 0 1 0 0 0 1 1
k=lapply( datadb$Tokens,match,all_tokens)
tokenDB[,-1]=t(mapply(function(x,y) {y[x]<-1;y[-x]<-0;y}, k,data.frame(t(tokenDB[,-1]))))
tokenDB
Classification this is test yet another a last
1 Class1 1 1 1 0 0 0 0
2 Class2 0 0 1 1 1 0 0
3 Class3 0 0 1 0 0 1 1
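A more direct base R sketch of the same presence/absence table follows, for comparison. It rebuilds the question's data so that it runs on its own; the names tokens and presence are introduced here only for illustration.

```r
datadb <- data.frame(
  Class = c("Class1", "Class2", "Class3"),
  Document = c("This is test", "Yet another test", "A last test"),
  stringsAsFactors = FALSE
)

# Lowercase and split each document into its tokens
tokens <- strsplit(tolower(datadb$Document), " ")
all_tokens <- unique(unlist(tokens))

# One row per document, one column per token: 1 if the token occurs in it
presence <- t(sapply(tokens, function(tok) as.integer(all_tokens %in% tok)))
colnames(presence) <- all_tokens

tokenDB <- data.frame(Classification = datadb$Class, presence,
                      check.names = FALSE)
tokenDB
```

This records presence only. If a token can occur several times in one document and counts are wanted, table would be the more natural tool.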

Aggregating every 10 columns in a binary matrix

I am new to R.
I would like to transform a binary matrix like this:
example:
" 1874 1875 1876 1877 1878 .... 2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
Since the column names are years, I would like to aggregate them into decades and obtain something like:
"1840-1849 1850-1859 1860-1869 .... 2000-2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
I am used to Python and do not know how to do this transformation without writing loops!
Thanks, isabel
It is unclear what aggregation you want, but using the following dummy data
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
The following counts events in each 10-year period.
Get the years as a numeric variable
years <- as.numeric(names(df))
Next we need an indicator for the start of each decade
ind <- seq(from = signif(years[1], 3), to = signif(tail(years, 1), 3), by = 10)
We then apply over the indices of ind (1:(length(ind)-1)), select columns from df that are the current decade and count the 1s using rowSums.
tmp <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
}, inds = ind, data = df)
Next we cbind the resulting vectors into a data frame and fix-up the column names:
out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out
This gives:
> out
1870-1879 1880-1889 1890-1899
1 4 5 6
2 4 6 6
3 2 5 5
4 5 5 7
5 3 3 7
6 5 5 4
If you want simply a binary matrix with a 1 indicating at least 1 event happened in that decade, then you can use:
tmp2 <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)]) > 0)
}, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out2
which gives:
> out2
1870-1879 1880-1889 1890-1899
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
If you want a different aggregation, then modify the function applied in the lapply call to use something other than rowSums.
This is another option, using modular arithmetic to aggregate the columns.
# setup, borrowed from @GavinSimpson
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
result <- do.call(cbind,
by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))
# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')
# 1870-1879 1880-1889 1890-1899
# V1 4 5 6
# V2 4 6 6
# V3 2 5 5
# V4 5 5 7
# V5 3 3 7
# V6 5 5 4
If you wanted to aggregate with something other than sum, replace the call to colSums with something like function(cols) lapply(cols, f), where f is the aggregating function, e.g., max.
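For a matrix input, as in the question, the decade grouping can also be sketched with split and sapply. The values below are made up so the result is easy to check by hand. Integer division is used to find the decade because signif rounds rather than floors (signif(1876, 3) is 1880).

```r
# Small deterministic example (values are made up for illustration)
m <- matrix(c(1, 0, 0, 1,
              0, 0, 1, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("A", "B"), c("1878", "1879", "1880", "1881")))

years  <- as.numeric(colnames(m))
decade <- years %/% 10 * 10          # 1870, 1870, 1880, 1880

# Group the column positions by decade and sum each row within a group
counts <- sapply(split(seq_along(years), decade),
                 function(cols) rowSums(m[, cols, drop = FALSE]))

# Reduce the counts to 0/1 and label each column with its decade range
binary <- (counts > 0) * 1
colnames(binary) <- paste(colnames(counts),
                          as.numeric(colnames(counts)) + 9, sep = "-")
binary
```

Here counts holds the number of events per decade and binary only records whether at least one event occurred, mirroring out and out2 in the first answer.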
