Replace multiple columns by head string into one column - r

I want to collapse multiple columns of a data frame into a single column per group, changing the numbers as I go. Example:
A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0
I want to collapse this data frame by its headers, meaning I only want one column "A" instead of four and one column "B" instead of three. The numbers should change according to the following pattern: if an observation in column "A2" has the value 1, it should become 2; if an observation in column "A3" has the value 1, it should become 3, and so on. In the end, each row should contain the highest such number within each group (if a row has three 1s in a group, they are all replaced by the number of the highest column in that group).
If the number is 0 then nothing changes. Here is the result I'm looking for:
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
How can I replace all of these groups by a single column each? (one column for each group)
So far I've tried a lot with the function unite(data= testdata, col= "A") for example, but doing this manually would take too long. There has to be a better way, right?
Thanks in advance!

You can do:
dat <- read.table(header=TRUE, text=
"A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0")
myfu <- function(x) if (any(x)) max(which(x)) else 0
new <- data.frame(
A=apply(dat[, 1:4]==1, 1, myfu),
B=apply(dat[, 5:7]==1, 1, myfu))
new
A more general solution:
new2 <- data.frame(
A=apply(dat[, grepl("^A", names(dat))]==1, 1, myfu),
B=apply(dat[, grepl("^B", names(dat))]==1, 1, myfu))
new2
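If there are many groups, listing each prefix by hand gets tedious. Here is a minimal sketch that derives the groups from the column names instead. It assumes every column name is a letter prefix followed by digits and that the columns within a group are already in ascending order (grp and new3 are names made up for this example):

```r
dat <- read.table(header = TRUE, text =
"A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0")

# Position of the last TRUE in a logical vector, or 0 if there is none
myfu <- function(x) if (any(x)) max(which(x)) else 0

# Group label of each column: the name with its trailing digits removed
grp <- sub("\\d+$", "", names(dat))

# One result column per distinct group label
new3 <- as.data.frame(sapply(unique(grp), function(p) {
  cols <- dat[, grp == p, drop = FALSE]
  apply(cols == 1, 1, myfu)
}))
new3
```

Note that myfu returns the position of the column within its group, not the number in the column name, so this only matches the requested output while the two coincide.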

You can try the code below:
dfout <- as.data.frame(
lapply(
split.default(df, gsub("\\d+$", "", names(df))),
function(v) max.col(v, ties.method = "last") * +(rowSums(v) >= 1)
)
)
such that
> dfout
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
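To see how the one-liner works, its steps can be run separately for group A. This is only an illustrative breakdown; groups, parts, last_one and has_one are names introduced here:

```r
df <- data.frame(
  A1 = c(1L, 1L, 1L, 0L, 0L), A2 = c(1L, 0L, 1L, 0L, 0L),
  A3 = c(0L, 1L, 1L, 1L, 0L), A4 = c(1L, 1L, 1L, 0L, 0L),
  B1 = c(1L, 0L, 0L, 0L, 0L), B2 = c(0L, 1L, 1L, 0L, 1L),
  B3 = c(0L, 1L, 1L, 1L, 0L)
)

# Step 1: strip the trailing digits to get a group label per column
groups <- gsub("\\d+$", "", names(df))

# Step 2: split the data frame column-wise into one data frame per group
parts <- split.default(df, groups)

# Step 3: for each row, the position of the last maximum in that group
last_one <- max.col(parts$A, ties.method = "last")

# Step 4: a row of all zeros also has a "last maximum", so zero it out
has_one <- rowSums(parts$A) >= 1
A <- last_one * has_one
A
```

Row 5 has no 1 in group A, so max.col still reports column 4 there; multiplying by has_one is what turns that into 0.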
Data
df <- structure(list(A1 = c(1L, 1L, 1L, 0L, 0L), A2 = c(1L, 0L, 1L,
0L, 0L), A3 = c(0L, 1L, 1L, 1L, 0L), A4 = c(1L, 1L, 1L, 0L, 0L
), B1 = c(1L, 0L, 0L, 0L, 0L), B2 = c(0L, 1L, 1L, 0L, 1L), B3 = c(0L,
1L, 1L, 1L, 0L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5"))

Assuming your data is in a data.frame called df1, this works in base R:
df1 <- t(df1)*as.numeric(regmatches(colnames(df1), regexpr("\\d+$", colnames(df1))))
df1 <- split(as.data.frame(df1),sub("\\d+$","",row.names(df1)))
df1 <- sapply(df1, apply, 2, max)
output:
> df1
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
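The least obvious step here is the multiplication: after transposing, each original column is a row, and the vector of trailing numbers is recycled down the rows, so every 1 is replaced by its column number. A small sketch on a made-up two-row data frame (small, nums, weighted and res are hypothetical names):

```r
small <- data.frame(A1 = c(1, 0), A2 = c(1, 1), B1 = c(0, 1))

# Trailing number of each column name
nums <- as.numeric(regmatches(colnames(small),
                              regexpr("\\d+$", colnames(small))))

# Rows of t(small) are the original columns; row i is multiplied by nums[i]
weighted <- t(small) * nums
weighted

# Maximum per original row within each group of columns
groups <- sub("\\d+$", "", rownames(weighted))
res <- sapply(split(as.data.frame(weighted), groups), apply, 2, max)
res
```

Unlike the position-based answers, this uses the number in the column name itself, so it should also cope with groups whose numbering has gaps.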

Related

comparing a value with next value in a column

I have a data set of the following form:
Interval | Count | criteria
0 0 0
0 1 0
0 2 0
0 3 0
1 4 1
1 5 2
1 6 3
1 7 4
2 8 1
2 9 2
3 10 3
I need to compare the values in the Interval. I first need to create a new variable to store the values. If the value in the Interval is the same as the next value, then the new variable should have blanks. If the Interval value is not the same as the next value, then it should return criteria/Count. Output should be like this:
Interval | Count | criteria | N
0 0 0
0 1 0
0 2 0
0 3 0 0
1 4 1
1 5 2
1 6 3
1 7 4 0.5714
2 8 1
2 9 2 0.2222
3 10 3
Here is my code:
fid$N<-''
for (i in 1:length(fid$Interval))
{
if (fid$Interval[i] != fid$Interval[i+1])
fid$N<-fid$criteria/fid$Count
else
fid$N<-''
}
and this is the error I am getting.
Error in if (fid$Interval[i] != fid$Interval[i + 1]) fid$N <- fid$criteria/fid$Count else fid$N <- "" :
missing value where TRUE/FALSE needed
To add to that, there are no missing values in the data set.
I would appreciate it if anyone could help.
You don't necessarily need a loop for this since most of the R functions are vectorized. Here is a way to do this in base R, dplyr and data.table without using a loop.
#Base R
transform(df, N = ifelse(Interval != c(tail(Interval, -1), NA), criteria/Count, NA))
#dplyr
library(dplyr)
df %>% mutate(N = if_else(Interval != lead(Interval), criteria/Count, NA_real_))
#data.table
library(data.table)
setDT(df)[, N:= fifelse(Interval != shift(Interval, type = 'lead'), criteria/Count, NA_real_)]
All of which return :
# Interval Count criteria N
#1 0 0 0 NA
#2 0 1 0 NA
#3 0 2 0 NA
#4 0 3 0 0.0000000
#5 1 4 1 NA
#6 1 5 2 NA
#7 1 6 3 NA
#8 1 7 4 0.5714286
#9 2 8 1 NA
#10 2 9 2 0.2222222
#11 3 10 3 NA
I return NA instead of blank value because if we return blank value the entire column becomes of type character and the numbers are no longer useful. In the answer you can replace NA with '' to get a blank value.
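A quick illustration of that type change, using a made-up three-element vector (N and N_blank are just example names):

```r
N <- c(NA, 0.5714286, NA)
class(N)                      # numeric

# Replacing NA with '' forces every element to become a string
N_blank <- ifelse(is.na(N), "", N)
class(N_blank)                # character

# Arithmetic still works on the NA version
mean(N, na.rm = TRUE)
```

If blanks are only needed for display, it is probably easier to keep NA in the data and convert when printing or exporting.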
data
df <- structure(list(Interval = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L,
2L, 3L), Count = 0:10, criteria = c(0L, 0L, 0L, 0L, 1L, 2L, 3L,
4L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA, -11L))
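As for the error in the question: on the last iteration, fid$Interval[i + 1] indexes one past the end of the vector and returns NA, so the if condition is NA even though the data has no missing values. The loop also assigns the whole column on every pass instead of element i. A sketch of a corrected loop, assuming the same data stored in fid as in the question:

```r
fid <- data.frame(
  Interval = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L, 2L, 3L),
  Count    = 0:10,
  criteria = c(0L, 0L, 0L, 0L, 1L, 2L, 3L, 4L, 1L, 2L, 3L)
)
n <- nrow(fid)

# Indexing past the end gives NA, which is what broke the original if()
fid$Interval[n + 1]

fid$N <- NA_real_
# Stop at n - 1 so that i + 1 is always a valid row
for (i in seq_len(n - 1)) {
  if (fid$Interval[i] != fid$Interval[i + 1]) {
    fid$N[i] <- fid$criteria[i] / fid$Count[i]  # assign element i only
  }
}
fid
```

The vectorised versions above avoid both problems and should be preferred on large data.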

Filtering rows by different columns

In the data frame
x1 x2 x3 x4 x5
1 0 1 1 0 3
2 1 2 2 0 3
3 2 2 0 0 2
4 1 3 0 0 2
5 3 3 2 1 4
6 2 0 0 0 1
column x5 indicates where the first non-zero value in a row is. The table should be read from right (x4) to left (x1). Thus, the first non-zero value in the first row is in column x3, for example.
I want to get all rows where 1 is the first non zero entry, i.e.
x1 x2 x3 x4 x5
1 0 1 1 0 3
2 3 3 2 1 4
should be the result. I tried different versions of filter_at but I didn't manage to come up with a solution. E.g. one try was
testdf %>% filter_at(vars(
paste("x",testdf$x5, sep = "")),
any_vars(. == 1))
I want to solve that without a for loop, since the real data set has millions of rows and almost 100 columns.
You can do filtering row-wise easily with the new utility function c_across:
library(dplyr) # version 1.0.2
testdf %>% rowwise() %>% filter(c_across(x1:x4)[x5] == 1) %>% ungroup()
# A tibble: 2 x 5
x1 x2 x3 x4 x5
<int> <int> <int> <int> <int>
1 0 1 1 0 3
2 3 3 2 1 4
A vectorised base R solution would be :
result <- df[df[cbind(1:nrow(df), df$x5)] == 1, ]
result
# x1 x2 x3 x4 x5
#1 0 1 1 0 3
#5 3 3 2 1 4
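cbind(1:nrow(df), df$x5) creates a two-column matrix of (row, column) indices in which the column index for each row is taken from x5. Indexing df with this matrix extracts, for each row, the value in the column that x5 points to, and we keep the rows where that value is 1.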
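To see what the matrix index is doing, here is a small sketch on the question's data (idx and picked are names introduced for this example):

```r
df <- data.frame(
  x1 = c(0L, 1L, 2L, 1L, 3L, 2L),
  x2 = c(1L, 2L, 2L, 3L, 3L, 0L),
  x3 = c(1L, 2L, 0L, 0L, 2L, 0L),
  x4 = c(0L, 0L, 0L, 0L, 1L, 0L),
  x5 = c(3L, 3L, 2L, 2L, 4L, 1L)
)

# One (row, column) pair per row; the column comes from x5
idx <- cbind(seq_len(nrow(df)), df$x5)

# Indexing with a two-column matrix picks exactly one cell per pair
picked <- df[idx]
picked              # 1 2 2 3 1 2

# Rows whose picked cell equals 1
df[picked == 1, ]
```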
Another vectorised solution:
df[t(df)[t(col(df)==df$x5)]==1,]
We can use apply in base R
df1[apply(df1, 1, function(x) x[x[5]] == 1),]
# x1 x2 x3 x4 x5
#1 0 1 1 0 3
#5 3 3 2 1 4
data
df1 <- structure(list(x1 = c(0L, 1L, 2L, 1L, 3L, 2L), x2 = c(1L, 2L,
2L, 3L, 3L, 0L), x3 = c(1L, 2L, 0L, 0L, 2L, 0L), x4 = c(0L, 0L,
0L, 0L, 1L, 0L), x5 = c(3L, 3L, 2L, 2L, 4L, 1L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

How I can make a column with some indicator?

I have 4 indicator columns and an ID column. How can I make a single column from these indicators, like this:
ID. col1. col2. col3. col4.
1 1 0 0 0
2 0 0 0 1
3 0 0 1 0
In each row, just one of the columns is 1 and the others are 0. The new column is 1 if col1 is 1, 2 if col2 is 1, 3 if col3 is 1 and 4 if col4 is 1.
so the output is
ID. col.
1 1
2 4
3 3
An option is max.col
cbind(df1[1], `col.` = max.col(df1[-1], "first"))
# ID. col.
#1 1 1
#2 2 4
#3 3 3
If there are no 1s in the rows, create a logical condition to return that row as NA
df1[2, 5] <- 0
cbind(df1[1], `col.` = max.col(df1[-1], "first") * NA^ !rowSums(df1[-1] == 1))
data
df1 <- structure(list(ID. = 1:3, col1. = c(1L, 0L, 0L), col2. = c(0L,
0L, 0L), col3. = c(0L, 0L, 1L), col4. = c(0L, 1L, 0L)),
class = "data.frame", row.names = c(NA,
-3L))
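The NA^ expression is terse, so here is a sketch of what each piece evaluates to, on the same data with row 2 set to all zeros (has_no_one, mask and res are names made up here). It relies on R defining NA^0 as 1 and NA^1 as NA:

```r
df1 <- data.frame(ID. = 1:3,
                  col1. = c(1L, 0L, 0L), col2. = c(0L, 0L, 0L),
                  col3. = c(0L, 0L, 1L), col4. = c(0L, 0L, 0L))

# TRUE for rows that contain no 1 at all
has_no_one <- !rowSums(df1[-1] == 1)   # FALSE TRUE FALSE

# NA^FALSE is NA^0 = 1, NA^TRUE is NA^1 = NA
mask <- NA^has_no_one                  # 1 NA 1

# max.col still returns a column for the all-zero row; the mask blanks it
res <- max.col(df1[-1], "first") * mask
res                                    # 1 NA 3
```

An ifelse(has_no_one, NA, max.col(df1[-1], "first")) would express the same thing more explicitly.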

Refer to column name and row name within an apply statement in R

I have a dataframe in R which looks like the one below.
a b c d e f
0 1 1 0 0 0
1 1 1 1 0 1
0 0 0 1 0 1
1 0 0 1 0 1
1 1 1 0 0 0
The data set is big, spanning over 100 columns and 5000 rows, and contains only binary values (0s and 1s). I want to construct an overlap between each pair of columns in R, something like the one given below. This overlap data frame will be a square matrix whose number of rows and columns both equal the number of columns in the first data frame.
a b c d e f
a 3 2 2 2 0 2
b 2 3 3 1 0 1
c 2 3 3 1 0 1
d 2 1 1 3 0 3
e 0 0 0 0 0 0
f 2 1 1 3 0 3
Each cell of the second dataframe is populated by the number of cases where both row and column have 1 in the first dataframe.
I'm thinking of constructing an empty matrix like this:
df <- matrix(ncol = ncol(data), nrow = ncol(data))
colnames(df) <- names(data)
rownames(df) <- names(data)
.. and iterating over each cell of this matrix using an apply command reading the corresponding row name (say, x) and column name (say, y) and running a function like the one below.
summation <- function (x,y) (return (sum(data$x * data$y)))
The problem is that I can't find out the row name and column name while within an apply function. Any help will be appreciated.
Any more efficient way than what I'm thinking is more than welcome.
You are looking for crossprod
crossprod(as.matrix(df1))
# a b c d e f
#a 3 2 2 2 0 2
#b 2 3 3 1 0 1
#c 2 3 3 1 0 1
#d 2 1 1 3 0 3
#e 0 0 0 0 0 0
#f 2 1 1 3 0 3
data
df1 <- structure(list(a = c(0L, 1L, 0L, 1L, 1L), b = c(1L, 1L, 0L, 0L,
1L), c = c(1L, 1L, 0L, 0L, 1L), d = c(0L, 1L, 1L, 1L, 0L), e = c(0L,
0L, 0L, 0L, 0L), f = c(0L, 1L, 1L, 1L, 0L)), .Names = c("a",
"b", "c", "d", "e", "f"), class = "data.frame", row.names = c(NA,
-5L))
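Why this works: crossprod(m) computes t(m) %*% m, so entry (i, j) is the sum over rows of m[, i] * m[, j], and for 0/1 data that product is 1 only when both columns are 1. A short check on the data above (m and cp are names introduced for this sketch):

```r
df1 <- data.frame(a = c(0L, 1L, 0L, 1L, 1L), b = c(1L, 1L, 0L, 0L, 1L),
                  c = c(1L, 1L, 0L, 0L, 1L), d = c(0L, 1L, 1L, 1L, 0L),
                  e = c(0L, 0L, 0L, 0L, 0L), f = c(0L, 1L, 1L, 1L, 0L))
m <- as.matrix(df1)
cp <- crossprod(m)

# Same result as the explicit matrix product
all(cp == t(m) %*% m)

# A single cell is the summation function from the question
cp["b", "d"] == sum(df1$b * df1$d)
```

The diagonal holds each column's own count of 1s, since a column always overlaps with itself.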

R: add matrices based on row and column names

I have two matrices I want to sum based on their row and column names. The matrices will not necessarily have all rows and columns in common - some may be missing from either matrix.
For example, consider two matrices A and B:
A= B=
a b c d a c d e
v 1 1 1 0 v 0 0 0 1
w 1 1 0 1 w 0 0 1 0
x 1 0 1 1 y 0 1 0 0
y 0 1 1 1 z 1 0 0 0
Column e is missing from matrix A and column b is missing from matrix B.
Row z is missing from matrix A and row x is missing from matrix B.
The summed table I'm looking for is:
Sum=
a b c d e
v 1 1 1 0 1
w 1 1 0 2 0
x 1 0 1 1 na
y 0 1 2 1 0
z 1 na 0 0 0
The row and column ordering in the final matrix doesn't matter, as long as the matrix is complete, i.e. has all the data. Missing values don't have to be "NA", but could be "0" instead.
I'm not sure if there is a way to do this that doesn't involve for loops. Any help would be much appreciated.
My solution
I managed to do this easily by converting the matrices to dataframes, binding the dataframes by row and then casting the resulting dataframe back into a matrix. This looks like it works, but maybe someone could double check or let me know if there is a better way.
library(reshape2)
A_df=as.data.frame(as.table(A))
B_df=as.data.frame(as.table(B))
merged_df=rbind(A_df,B_df)
Summed_matrix=acast(merged_df, Var1 ~ Var2, sum)
merged_df looks like this:
Var1 Var2 Freq
1 v a 1
2 w a 1
3 x a 1
4 y a 0
5 v b 1
6 w b 1
etc...
Maybe you can try:
cAB <- union(colnames(A), colnames(B))
rAB <- union(rownames(A), rownames(B))
A1 <- matrix(0, ncol=length(cAB), nrow=length(rAB), dimnames=list(rAB, cAB))
B1 <- A1
indxA <- outer(rAB, cAB, FUN=paste) %in% outer(rownames(A), colnames(A), FUN=paste)
indxB <- outer(rAB, cAB, FUN=paste) %in% outer(rownames(B), colnames(B), FUN=paste)
A1[indxA] <- A
B1[indxB] <- B
A1+B1 #because it was mentioned to have `0` as missing values
# a b c d e
#v 1 1 1 0 1
#w 1 1 0 2 0
#x 1 0 1 1 0
#y 0 1 2 1 0
#z 1 0 0 0 0
If you want to get the NA as missing values
A1 <- matrix(NA, ncol=length(cAB), nrow=length(rAB), dimnames=list(rAB, cAB))
B1 <- A1
A1[indxA] <- A
B1[indxB] <- B
indxNA <- is.na(A1) & is.na(B1)
A1[is.na(A1)!= indxNA] <- 0
B1[is.na(B1)!= indxNA] <- 0
A1+B1
# a b c d e
#v 1 1 1 0 1
#w 1 1 0 2 0
#x 1 0 1 1 NA
#y 0 1 2 1 0
#z 1 NA 0 0 0
Or using reshape2
library(reshape2)
acast(rbind(melt(A), melt(B)), Var1~Var2, sum) #Inspired from the OP's idea
# a b c d e
#v 1 1 1 0 1
#w 1 1 0 2 0
#x 1 0 1 1 0
#y 0 1 2 1 0
#z 1 0 0 0 0
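The logical-index version above fills A1 and B1 in column-major order, which works because union() keeps each matrix's rows and columns in their original relative order. A possibly simpler variant, sketched here as an alternative rather than as the answer's method, indexes the padded matrices by name directly (S is a name made up for the result):

```r
A <- matrix(c(1, 1, 1, 0,  1, 1, 0, 1,  1, 0, 1, 1,  0, 1, 1, 1), nrow = 4,
            dimnames = list(c("v", "w", "x", "y"), c("a", "b", "c", "d")))
B <- matrix(c(0, 0, 0, 1,  0, 0, 1, 0,  0, 1, 0, 0,  1, 0, 0, 0), nrow = 4,
            dimnames = list(c("v", "w", "y", "z"), c("a", "c", "d", "e")))

rAB <- union(rownames(A), rownames(B))
cAB <- union(colnames(A), colnames(B))

# Zero-filled matrices covering every row and column name
A1 <- matrix(0, nrow = length(rAB), ncol = length(cAB),
             dimnames = list(rAB, cAB))
B1 <- A1

# Character indices place each value by name, whatever the ordering
A1[rownames(A), colnames(A)] <- A
B1[rownames(B), colnames(B)] <- B

S <- A1 + B1
S
```

Starting from matrix(NA, ...) instead of 0 would need the same NA handling as shown in the answer.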
data
A <- structure(c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L,
1L, 1L, 1L), .Dim = c(4L, 4L), .Dimnames = list(c("v", "w", "x",
"y"), c("a", "b", "c", "d")))
B <- structure(c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L), .Dim = c(4L, 4L), .Dimnames = list(c("v", "w", "y",
"z"), c("a", "c", "d", "e")))
