I have a data frame, which looks something like this:
CASENO Var1 Var2 Resp1 Resp2
1 1 0 1 1
2 0 0 0 0
3 1 1 1 1
4 1 1 0 1
5 1 0 1 0
There are over 400 variables in the dataset. This is just an example. I need to create a simple frequency matrix in R (excluding the case numbers), but the table function doesn't work. Specifically, I'm looking to cross-tabulate a portion of the columns to create a two-mode matrix of frequencies. The table should look like this:
      Var1 Var2
Resp1    3    1
Resp2    3    2
In Stata, the command is:
gen var = 1 if Var1==1
replace var= 2 if Var2==1
gen resp = 1 if Resp1==1
replace resp = 2 if Resp2==1
tab var resp
This one should work for any number of Var & Resps:
d <- structure(list(CASENO = 1:5, Var1 = c(1L, 0L, 1L, 1L, 1L), Var2 = c(0L, 0L, 1L, 1L, 0L), Resp1 = c(1L, 0L, 1L, 0L, 1L), Resp2 = c(1L, 0L, 1L, 1L, 0L)), .Names = c("CASENO", "Var1", "Var2", "Resp1", "Resp2"), class = "data.frame", row.names = c(NA, -5L))
m <- as.matrix(d[,-1])
m2 <- t(m) %*% m
rnames <- grepl('Resp',rownames((m2)))
cnames <- grepl('Var',colnames((m2)))
m2[rnames,cnames]
[UPDATE] A more elegant version, provided in the comment by G.Grothendieck:
m <- as.matrix(d[,-1])
cn <- colnames(m);
crossprod(m[, grep("Resp", cn)], m[, grep("Var", cn)])
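To see why the matrix product gives the counts: with 0/1 columns, each cell of the cross product is the number of rows where both the Resp column and the Var column equal 1. Here is a minimal sketch that checks this against an explicit count. It assumes every column holds only 0 and 1, and the names resp, vars, by_product and by_count are made up for the illustration.

```r
d <- data.frame(
  CASENO = 1:5,
  Var1  = c(1L, 0L, 1L, 1L, 1L),
  Var2  = c(0L, 0L, 1L, 1L, 0L),
  Resp1 = c(1L, 0L, 1L, 0L, 1L),
  Resp2 = c(1L, 0L, 1L, 1L, 0L)
)

m    <- as.matrix(d[, -1])
cn   <- colnames(m)
resp <- grep("Resp", cn, value = TRUE)   # names of the response columns
vars <- grep("Var",  cn, value = TRUE)   # names of the variable columns

# Matrix-product version, as in the answer above
by_product <- crossprod(m[, resp], m[, vars])

# Explicit version: count the rows where both columns are 1
by_count <- sapply(vars, function(v) {
  sapply(resp, function(r) sum(d[[r]] == 1 & d[[v]] == 1))
})

by_product
#       Var1 Var2
# Resp1    3    1
# Resp2    3    2
all(by_product == by_count)
# TRUE
```

If a column could hold values other than 0 and 1, the product would weight rows by those values, so the two versions would no longer agree.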
I'm sure there's another way, but you could do this:
library(reshape2)
library(plyr)
df1 <- melt(d[,-1], id=1:2)  # d is the sample data frame; drop CASENO, keep Var1/Var2 as ids
ddply(df1,.(variable),summarize,
Var1 = sum(value==1&Var1==1),
Var2 = sum(value==1&Var2==1))
# variable Var1 Var2
# 1 Resp1 3 1
# 2 Resp2 3 2
Here is an approach using xtabs.
# get names of non "variables"
not_vars <- c("Resp1", "Resp2", "CASENO")
# get the values of the "variables" columns as a matrix
vars <- as.matrix(d[,!names(d) %in% not_vars])
# if you have many more than 2 response variables, this could get unwieldy
result <- rbind(
xtabs( vars ~ Resp1, data=d, exclude=0),
xtabs( vars ~ Resp2, data=d, exclude=0))
# give resulting table appropriate row names.
rownames(result) <- c("Resp1", "Resp2")
# Var1 Var2
#Resp1 3 1
#Resp2 3 2
sample data:
d <- read.table(text="
CASENO Var1 Var2 Resp1 Resp2
1 1 0 1 1
2 0 0 0 0
3 1 1 1 1
4 1 1 0 1
5 1 0 1 0", header=TRUE)
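For the "many more than 2 response variables" case, the rbind can be built in a loop instead of by hand. This is only a sketch: it swaps xtabs for a plain colSums over the rows where the response equals 1, and it assumes the columns can be picked out by the prefixes "Resp" and "Var", as in the sample data.

```r
d <- read.table(text = "
CASENO Var1 Var2 Resp1 Resp2
1 1 0 1 1
2 0 0 0 0
3 1 1 1 1
4 1 1 0 1
5 1 0 1 0", header = TRUE)

resp_cols <- grep("^Resp", names(d), value = TRUE)  # assumed naming scheme
var_cols  <- grep("^Var",  names(d), value = TRUE)  # assumed naming scheme
vars <- as.matrix(d[var_cols])

# One row per response: column sums of the Var columns
# over the cases where that response is 1
result <- do.call(rbind, lapply(resp_cols, function(r) {
  colSums(vars[d[[r]] == 1, , drop = FALSE])
}))
rownames(result) <- resp_cols

result
#       Var1 Var2
# Resp1    3    1
# Resp2    3    2
```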
I have two data frames that I want to merge, adding the values of one to the existing values of the other wherever the two share a column.
For example:
df1
No A B C D
 1 1 0 1 0
 2 0 1 2 1
 3 0 0 1 0
df2
No A B E F
 1 1 0 1 1
 2 0 1 2 1
 3 2 1 1 0
Finally, I want the output table like this.
df
No A B C D E F
 1 2 0 1 0 1 1
 2 0 2 2 1 2 1
 3 2 1 1 0 1 0
Note: I did try merge(), but in this case, it did not work.
Any help/suggestion would be appreciated.
Reproducible sample data
df1 <-
structure(list(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
C = c(1L, 2L, 1L), D = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
E = c(1L, 2L, 1L), F = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
You can also carry out this operation by left_joining these two data frames:
library(dplyr)
library(stringr)
df1 %>%
left_join(df2, by = "No") %>%
mutate(across(ends_with(".x"), ~ .x + get(str_replace(cur_column(), "\\.x", "\\.y")))) %>%
rename_with(~ str_replace(., "\\.x", ""), ends_with(".x")) %>%
select(!ends_with(".y"))
No A B C D E F
1 1 2 0 1 0 1 1
2 2 0 2 2 1 2 1
3 3 2 1 1 0 1 0
You can first row-bind the two dataframes and then compute the sum of each column while 'grouping' by the No column. This can be done like so:
library(dplyr)
bind_rows(df1, df2) %>%
group_by(No) %>%
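  # a lambda avoids passing extra arguments through across(), which newer dplyr versions deprecate
  summarise(across(c(A, B, C, D, E, `F`), ~ sum(.x, na.rm = TRUE)),
            .groups = "drop")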
If a particular column doesn't exist in one dataframe (i.e. columns E and F), values will be padded with NA. Adding the na.rm = TRUE argument (to be passed to sum()) means that these values will get treated like zeros.
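To make the NA padding concrete, here is a base R sketch of what the stacked table looks like before summing. It is not what bind_rows() does internally; pad_with_na is a made-up helper that only mimics the padding so the effect of na.rm = TRUE can be seen on one cell.

```r
df1 <- data.frame(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
                  C = c(1L, 2L, 1L), D = c(0L, 1L, 0L))
df2 <- data.frame(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
                  E = c(1L, 2L, 1L), F = c(1L, 1L, 0L))

all_cols <- union(names(df1), names(df2))

# Hypothetical helper: add the columns a data frame lacks as NA,
# then put all columns in a common order
pad_with_na <- function(x) {
  x[setdiff(all_cols, names(x))] <- NA
  x[all_cols]
}

stacked <- rbind(pad_with_na(df1), pad_with_na(df2))

# Column E for No == 1: NA from df1 (no E column there), 1 from df2
e_for_no_1 <- stacked$E[stacked$No == 1]

sum(e_for_no_1)                 # NA: one missing value poisons the sum
sum(e_for_no_1, na.rm = TRUE)   # 1: the padded NA is simply skipped
```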
Using data.table:
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), No]
# No A B C D E F
#1: 1 2 0 1 0 1 1
#2: 2 0 2 2 1 2 1
#3: 3 2 1 1 0 1 0
We can use base R (the native pipe and the \(x) lambda syntax need R 4.1.0 or later). Collect the data frames in a list ('lst1') and take the union of their column names ('nm1'). Then loop over the list, adding any column a data frame lacks (found with setdiff) as a column of 0s, rbind the results, and use aggregate to get the sum grouped by 'No'.
lst1 <- mget(ls(pattern= '^df\\d+$'))
nm1 <- lapply(lst1, names) |>
{\(x) Reduce(union, x)}()
lapply(lst1, \(x) {x[setdiff(nm1, names(x))] <- 0; x}) |>
{\(x) do.call(rbind, x)}() |>
{\(dat) aggregate(.~ No, data = dat, FUN = sum, na.rm = TRUE,
na.action = na.pass)}()
# No A B C D E F
#1 1 2 0 1 0 1 1
#2 2 0 2 2 1 2 1
#3 3 2 1 1 0 1 0
I have a data frame df
m n o p
a 1 1 2 5
b 1 2 0 4
c 3 3 3 3
I can extract column m by:
df[,"m"]
Now the problem is, the column name was generated somewhere else (multiple times, in a for loop). For example, column name m was generated by choosing a specific element in the dataframe, gen, in one loop:
> gen[i,1]
[1] m
How do I extract the column based on gen[i,1]?
Just nest the subsetting.
dat[,"m"]
# [1] 1 1 3
i <- 13
gen[i, 1]
# [1] "m"
dat[, gen[i, 1]]
# [1] 1 1 3
Or, if you don't want the column to be dropped:
dat[, gen[i, 1], drop=FALSE]
# m
# a 1
# b 1
# c 3
Data
dat <- structure(list(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L),
p = 5:3), class = "data.frame", row.names = c("a", "b", "c"
))
gen <- data.frame(letters)
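Since the question mentions doing this repeatedly in a for loop, here is a small sketch of that. The loop range 13:16 and the list name extracted are made up for the example. The as.character() call is a precaution I'm assuming may be needed: if the column of gen is a factor (the unquoted m printed in the question hints at that), indexing with it would use the integer codes rather than the labels.

```r
dat <- data.frame(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L), p = 5:3,
                  row.names = c("a", "b", "c"))
gen <- data.frame(letters)

extracted <- list()
for (i in 13:16) {                       # rows of gen holding "m", "n", "o", "p"
  col_name <- as.character(gen[i, 1])    # force a plain string, even if gen holds factors
  extracted[[col_name]] <- dat[[col_name]]   # [[ ]] returns the column as a vector
}

names(extracted)
# [1] "m" "n" "o" "p"
extracted$m
# [1] 1 1 3
```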
We can use select from dplyr
library(dplyr)
i <- 13
dat %>%
select(gen[i, 1])
# m
#a 1
#b 1
#c 3
data
dat <- structure(list(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L),
p = 5:3), class = "data.frame", row.names = c("a", "b", "c"
))
gen <- data.frame(letters)
I want to loop through a large data frame: count how many values in the first column are > 0, remove the rows that were counted, then move on to column 2, count the values > 0 among the remaining rows, remove those rows, and so on.
the data frame
taxonomy A B C
1 cat 0 2 0
2 dog 5 1 0
3 horse 3 0 0
4 mouse 0 0 4
5 frog 0 2 4
6 lion 0 0 2
can be generated with
DF1 = structure(list(taxonomy = c("cat", "dog","horse","mouse","frog", "lion"),
A = c(0L, 5L, 3L, 0L, 0L, 0L), B = c(2L, 1L, 0L, 0L, 2L, 0L), C = c(0L, 0L, 0L, 4L, 4L, 2L)),
.Names = c("taxonomy", "A", "B", "C"),
row.names = c(NA, -6L), class = "data.frame")
and I expect the outcome to be
A B C
count 2 2 2
I wrote this loop, but it does not remove the rows as it goes:
res <- data.frame(DF1[1,], row.names = c('count'))
for(n in 1:ncol(DF1)) {
res[colnames(DF1)[n]] <- sum(DF1[n])
DF1[!DF1[n]==1]
}
it gives this incorrect result
A B C
count 2 3 3
You could do ...
DF = DF1[, -1]
cond = DF != 0
p = max.col(cond, ties="first")
fp = factor(p, levels = seq_along(DF), labels = names(DF))
table(fp)
# A B C
# 2 2 2
To account for rows that are all zeros, I think this works:
fp[rowSums(cond) == 0] <- NA
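A quick sketch of why that last line matters, on made-up data where the third row is all zeros. With ties set to "first", max.col picks column 1 for such a row, so it would be counted under A unless it is set to NA first. I multiply cond by 1 only to make the 0/1 conversion explicit.

```r
# Made-up data: row 3 has no value above zero
DF <- data.frame(A = c(0L, 5L, 0L), B = c(2L, 1L, 0L), C = c(0L, 0L, 0L))

cond <- DF != 0
p  <- max.col(cond * 1, ties.method = "first")   # row 3 ties everywhere, so it gets 1
fp <- factor(p, levels = seq_along(DF), labels = names(DF))

counts_raw <- table(fp)        # A = 2, B = 1, C = 0: row 3 wrongly counted under A

fp_fixed <- fp
fp_fixed[rowSums(cond) == 0] <- NA
counts_fixed <- table(fp_fixed)   # A = 1, B = 1, C = 0: table() drops the NA
```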
We can update the dataset on each iteration. Create a temporary dataset without the 'taxonomy' column ('tmp') and initialise a named vector of counts ('n'). Then loop through the columns of 'tmp': build a logical index of the rows where the column is greater than 0 ('i1'), store the number of TRUE values in 'n' for that column, and remove those rows from 'tmp' using 'i1' as the row index.
tmp <- DF1[-1]
n <- setNames(numeric(ncol(tmp)), names(tmp))
for(i in seq_len(ncol(tmp))) {
  i1 <- tmp[[i]] > 0
  n[i] <- sum(i1)
  tmp <- tmp[!i1, ]
}
n
# A B C
# 2 2 2
It can also be done with Reduce
sapply(Reduce(function(x, y) y[!x] > 0, DF1[3:4],
init = DF1[,2] > 0, accumulate = TRUE ), sum)
#[1] 2 2 2
Or using accumulate from purrr
library(purrr)
accumulate(DF1[3:4], ~ .y[!.x] > 0, .init = DF1[[2]] > 0) %>%
map_int(sum)
#[1] 2 2 2
This is easy with Reduce and sapply:
> first <- Reduce(function(a,b) b[a==0], df[-1], accumulate=TRUE)
> first
[[1]]
[1] 0 5 3 0 0 0
[[2]]
[1] 2 0 2 0
[[3]]
[1] 0 4 2
> then <- sapply(setNames(first, names(df[-1])), function(x) length(x[x>0]))
> then
A B C
2 2 2
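One thing to watch with the Reduce-style versions above, as far as I can tell: from the third column on, they index a full-length column with a mask that has already been shortened, so R recycles the mask. On the sample data the counts still come out as 2 2 2, but that need not hold in general. Below is a sketch that keeps every mask at full length by accumulating "row already counted" with |. The function name count_first_hits and the data frame tricky are made up for the illustration.

```r
# Hypothetical helper: for each column, count rows whose first value > 0 is in it
count_first_hits <- function(cols) {
  positive <- lapply(cols, function(x) x > 0)
  # seen[[k]] is TRUE for rows with a value > 0 in any of columns 1..k
  seen <- Reduce(`|`, positive, accumulate = TRUE)
  seen_totals <- vapply(seen, sum, integer(1))
  # newly counted rows per column = increase in the running total
  setNames(diff(c(0L, seen_totals)), names(cols))
}

DF1 <- data.frame(
  taxonomy = c("cat", "dog", "horse", "mouse", "frog", "lion"),
  A = c(0L, 5L, 3L, 0L, 0L, 0L),
  B = c(2L, 1L, 0L, 0L, 2L, 0L),
  C = c(0L, 0L, 0L, 4L, 4L, 2L)
)
count_first_hits(DF1[-1])
# A B C
# 2 2 2

# Made-up data where the shortened mask goes wrong
tricky <- data.frame(A = c(1, 0, 0), B = c(0, 1, 0), C = c(5, 0, 3))
count_first_hits(tricky)
# A B C
# 1 1 1

# Same data through the shortened-mask pattern: the last count is 0, not 1
short_mask <- Reduce(function(a, b) b[a == 0], tricky, accumulate = TRUE)
short_counts <- sapply(short_mask, function(x) sum(x > 0))
short_counts
# [1] 1 1 0
```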
I have a data frame like this:
CriterionVar Var1 Var2 Var3
3 0 0 0
1 0 0 0
2 0 0 0
5 0 0 0
I want to recode the values of Var1, Var2, and Var3 based on the value of CriterionVar. In pseudocode, it would be something like this:
for each row
if (CriterionVar.value >= Var1.index) Var1 = 1
if (CriterionVar.value >= Var2.index) Var2 = 1
if (CriterionVar.value >= Var3.index) Var3 = 1
The recoded data frame would look like this:
CriterionVar Var1 Var2 Var3
3 1 1 1
1 1 0 0
2 1 1 0
5 1 1 1
Obviously, that is not the way to get it done because (1) the number of VarN columns is determined by a data value, and (2) it's just ugly.
Any help is appreciated.
For more general values of CriterionVar, you can use outer to construct a logical matrix which you can use for indexing like this:
dat[2:4][outer(dat$CriterionVar, seq_along(names(dat)[-1]), ">=")] <- 1
In this example, this returns
dat
CriterionVar Var1 Var2 Var3
1 3 1 1 1
2 1 1 0 0
3 2 1 1 0
4 5 1 1 1
A second method using col, which returns a matrix of the column index, is a tad bit more direct:
dat[2:4][dat$CriterionVar >= col(dat[-1])] <- 1
and returns the desired result.
data
dat <-
structure(list(CriterionVar = c(3L, 1L, 2L, 5L), Var1 = c(0L,
0L, 0L, 0L), Var2 = c(0L, 0L, 0L, 0L), Var3 = c(0L, 0L, 0L, 0L
)), .Names = c("CriterionVar", "Var1", "Var2", "Var3"), class = "data.frame",
row.names = c(NA, -4L))
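It may help to look at the logical matrix on its own before it is used for indexing. A minimal sketch with the criterion values from the question; the names criterion, var_index and mask are just for the illustration.

```r
criterion <- c(3L, 1L, 2L, 5L)   # the CriterionVar column
var_index <- 1:3                 # positions of Var1, Var2, Var3

# mask[i, j] is TRUE when criterion[i] >= j
mask <- outer(criterion, var_index, ">=")

mask[1, ]   # criterion 3: TRUE  TRUE  TRUE
mask[2, ]   # criterion 1: TRUE FALSE FALSE
mask[3, ]   # criterion 2: TRUE  TRUE FALSE
mask[4, ]   # criterion 5: TRUE  TRUE  TRUE

# Multiplying by 1 turns the mask into the 0/1 values wanted for Var1..Var3
recoded <- mask * 1
```

The same matrix comes out of comparing the criterion column with col() of the Var columns, because the criterion vector is recycled down each column.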
df[,-1] = lapply(2:NCOL(df), function(i) as.numeric(df[,1] >= (i-1)))
df
# CriterionVar Var1 Var2 Var3
#1 3 1 1 1
#2 1 1 0 0
#3 2 1 1 0
#4 5 1 1 1
DATA
df = structure(list(CriterionVar = c(3L, 1L, 2L, 5L), Var1 = c(1,
1, 1, 1), Var2 = c(1, 0, 1, 1), Var3 = c(1, 0, 0, 1)), .Names = c("CriterionVar",
"Var1", "Var2", "Var3"), row.names = c(NA, -4L), class = "data.frame")
I'm a big proponent of vapply: it's fast, and you know the shape of what it'll return. The only problem is the resulting matrix is usually the "sideways" version of what you want. But t() fixes that easily enough.
n_var_cols <- 3
truncated_criterion <- pmin(dat[["CriterionVar"]], n_var_cols)
row_template <- rep_len(0, n_var_cols)
replace_up_to_index <- function(index) {
replace(row_template, seq_len(index), 1)
}
over_matrix <- vapply(
X = truncated_criterion,
FUN = replace_up_to_index,
FUN.VALUE = row_template
)
over_matrix <- t(over_matrix)
dat[, -1] <- over_matrix
dat
# CriterionVar Var1 Var2 Var3
# 1 3 1 1 1
# 2 1 1 0 0
# 3 2 1 1 0
# 4 5 1 1 1
There was some bookkeeping in the first three lines, but nothing too bad. I used pmin() to restrict the criteria values to be no greater than the number of VarN columns.
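To show what the pmin() step guards against, here is a short sketch. Without the cap, a criterion of 5 makes replace() grow the 3-element template to length 5, which no longer matches FUN.VALUE, so vapply stops with an error. The vector criterion stands in for dat[["CriterionVar"]].

```r
n_var_cols   <- 3
row_template <- rep_len(0, n_var_cols)
criterion    <- c(3L, 1L, 2L, 5L)   # stands in for dat[["CriterionVar"]]

replace_up_to_index <- function(index) {
  replace(row_template, seq_len(index), 1)
}

# Index 5 writes past the end of the template and extends it
length(replace_up_to_index(5))
# [1] 5

# So the uncapped call fails the FUN.VALUE length check
uncapped <- try(vapply(criterion, replace_up_to_index, row_template), silent = TRUE)
inherits(uncapped, "try-error")
# [1] TRUE

# Capping at the number of Var columns keeps every result at length 3
capped <- pmin(criterion, n_var_cols)   # 3 1 2 3
ok <- vapply(capped, replace_up_to_index, row_template)
dim(ok)
# [1] 3 4
```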
I have a data frame in the following format
  1 2 a b c
1 a b 0 0 0
2 b   0 0 0
3 c   0 0 0
I want to fill columns a through c with a 1/0 (TRUE/FALSE) flag that says whether the column name appears in column 1 or 2:
  1 2 a b c
1 a b 1 1 0
2 b   0 1 0
3 c   0 0 1
I have a dataset of about 530,000 records, 4 description columns, and 95 output columns, so a nested for loop is not practical. I have tried code in the following format, but it was too time-consuming:
for(i in 3:5) {
  for(j in 1:3) {
    for(k in 1:2) {
      if(df[j,k]==colnames(df)[i]) df[j, i]=1
    }
  }
}
Is there an easier, more efficient way to achieve the same output?
Thanks in advance!
One option is mtabulate from qdapTools
library(qdapTools)
df1[-(1:2)] <- mtabulate(as.data.frame(t(df1[1:2])))[-3]
df1
# 1 2 a b c
#1 a b 1 1 0
#2 b 0 1 0
#3 c 0 0 1
Or we melt the dataset after converting to matrix, use table to get the frequencies, and assign the output to the columns that are numeric.
library(reshape2)
df1[-(1:2)] <- table(melt(as.matrix(df1[1:2]))[-2])[,-1]
Or we can 'paste' the first two columns and use cSplit_e to get the binary format.
library(splitstackshape)
cbind(df1[1:2], cSplit_e(as.data.table(do.call(paste, df1[1:2])),
'V1', ' ', type='character', fill=0, drop=TRUE))
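Or, without extra packages, a base R sketch that loops only over the output columns and lets the comparison run over all rows at once. It assumes the description columns are the first two and that each output column is named after the value it flags; flag_cols is a name made up for the example.

```r
df1 <- data.frame(`1` = c("a", "b", "c"), `2` = c("b", "", ""),
                  a = 0L, b = 0L, c = 0L,
                  check.names = FALSE, stringsAsFactors = FALSE)

flag_cols <- names(df1)[-(1:2)]   # "a" "b" "c"

# For each output column, flag the rows where its name
# appears in either description column
df1[flag_cols] <- lapply(flag_cols, function(nm) {
  as.integer(rowSums(df1[1:2] == nm) > 0)
})

df1$a   # 1 0 0
df1$b   # 1 1 0
df1$c   # 0 0 1
```

With 95 output columns this is 95 vectorised comparisons rather than a loop over every cell, which I'd expect to be much faster on 530,000 rows, though I have not timed it.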
data
df1 <- structure(list(`1` = c("a", "b", "c"), `2` = c("b", "", ""),
a = c(0L, 0L, 0L), b = c(0L, 0L, 0L), c = c(0L, 0L, 0L)), .Names = c("1",
"2", "a", "b", "c"), class = "data.frame", row.names = c("1",
"2", "3"))