Filtering rows by different columns - r

In the data frame
x1 x2 x3 x4 x5
1 0 1 1 0 3
2 1 2 2 0 3
3 2 2 0 0 2
4 1 3 0 0 2
5 3 3 2 1 4
6 2 0 0 0 1
column x5 indicates where the first non-zero value in a row is. The table should be read from right (x4) to left (x1). Thus, the first non-zero value in the first row is in column x3, for example.
I want to get all rows where 1 is the first non-zero entry, i.e.
x1 x2 x3 x4 x5
1 0 1 1 0 3
2 3 3 2 1 4
should be the result. I tried different versions of filter_at but didn't manage to come up with a solution. For example, one attempt was
testdf %>% filter_at(vars(
paste("x",testdf$x5, sep = "")),
any_vars(. == 1))
I want to solve that without a for loop, since the real data set has millions of rows and almost 100 columns.

You can do filtering row-wise easily with the new utility function c_across:
library(dplyr) # version 1.0.2
testdf %>% rowwise() %>% filter(c_across(x1:x4)[x5] == 1) %>% ungroup()
# A tibble: 2 x 5
x1 x2 x3 x4 x5
<int> <int> <int> <int> <int>
1 0 1 1 0 3
2 3 3 2 1 4

A vectorised base R solution would be :
result <- df[df[cbind(1:nrow(df), df$x5)] == 1, ]
result
# x1 x2 x3 x4 x5
#1 0 1 1 0 3
#5 3 3 2 1 4
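cbind(1:nrow(df), df$x5) creates a two-column matrix of (row, column) index pairs, one per row, where the column index comes from x5 and so points at the first non-zero value of that row. Indexing df with this matrix extracts those values, and we keep the rows where the extracted value is 1.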
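To make the index-matrix idea concrete, here is a small sketch on the sample data. The names idx, m and first_nonzero are only illustrative, and converting to a matrix first is an assumption about what may help on a very large data set, not something that was benchmarked here.

```r
# Sample data from the question (called df, as in the answer above)
df <- data.frame(
  x1 = c(0L, 1L, 2L, 1L, 3L, 2L),
  x2 = c(1L, 2L, 2L, 3L, 3L, 0L),
  x3 = c(1L, 2L, 0L, 0L, 2L, 0L),
  x4 = c(0L, 0L, 0L, 0L, 1L, 0L),
  x5 = c(3L, 3L, 2L, 2L, 4L, 1L)
)

# Each row of idx is one (row, column) pair to look up
idx <- cbind(seq_len(nrow(df)), df$x5)

# Convert once to a matrix, then pull out one value per row
m <- as.matrix(df)
first_nonzero <- m[idx]
first_nonzero

# Keep the rows whose first non-zero value is 1
result <- df[first_nonzero == 1, ]
result
```

On the sample data, first_nonzero is 1 2 2 3 1 2, so rows 1 and 5 are kept.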

Another vectorised solution:
df[t(df)[t(col(df)==df$x5)]==1,]
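The one-liner above is compact, so here is a step-by-step sketch of what it appears to do. The intermediate names mask, vals and wrong_order are made up for illustration.

```r
df <- data.frame(
  x1 = c(0L, 1L, 2L, 1L, 3L, 2L),
  x2 = c(1L, 2L, 2L, 3L, 3L, 0L),
  x3 = c(1L, 2L, 0L, 0L, 2L, 0L),
  x4 = c(0L, 0L, 0L, 0L, 1L, 0L),
  x5 = c(3L, 3L, 2L, 2L, 4L, 1L)
)

# TRUE where the column number equals x5 for that row (exactly one per row)
mask <- col(df) == df$x5

# Logical indexing walks a matrix column by column, so without transposing
# the extracted values would not be in row order
wrong_order <- as.matrix(df)[mask]

# Transposing both the data and the mask makes the walk go row by row
vals <- t(df)[t(mask)]
vals

result <- df[vals == 1, ]
result
```

The transpose is what keeps vals aligned with the rows of df; wrong_order holds the same values but sorted by column.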

We can use apply in base R
df1[apply(df1, 1, function(x) x[x[5]] == 1),]
# x1 x2 x3 x4 x5
#1 0 1 1 0 3
#5 3 3 2 1 4
data
df1 <- structure(list(x1 = c(0L, 1L, 2L, 1L, 3L, 2L), x2 = c(1L, 2L,
2L, 3L, 3L, 0L), x3 = c(1L, 2L, 0L, 0L, 2L, 0L), x4 = c(0L, 0L,
0L, 0L, 1L, 0L), x5 = c(3L, 3L, 2L, 2L, 4L, 1L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Related

Is there an easy way to get the frequencies column wise?

I have a list of Likert values, the values range from 1 to 5. Each possible response may occur once, more than once or not at all per column. I have several columns and rows, each row corresponds to a participant, each column to a question. There is no NA data.
Example:
c1 c2 c3
1 1 5
2 2 5
3 3 4
3 4 3
2 5 1
1 3 1
1 5 1
The goal is to count the frequencies of the answer options column wise, to consequently compare them.
So the resulting table should look like this:
- c1 c2 c3
1 3 1 3
2 2 1 0
3 2 2 1
4 0 1 1
5 0 2 2
I know how to do this for one column, and I can look at the frequencies with apply(ds, 1, table), but I have not managed to put this into a table that I can work with further.
Thanks!
This should do it, using plyr:
count_df = setNames(data.frame(t(plyr::ldply(apply(df, 2, table), rbind)[2:6])), colnames(df))
count_df[is.na(count_df)] = 0
You may use table in sapply -
sapply(df, function(x) table(factor(x, 1:5)))
# c1 c2 c3
#1 3 1 3
#2 2 1 0
#3 2 2 1
#4 0 1 1
#5 0 2 2
This approach can also be used in dplyr if you prefer that.
library(dplyr)
df %>% summarise(across(.fns = ~table(factor(., 1:5))))
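As a sketch of why the sapply answer above wraps each column in factor(x, 1:5): without fixed levels, the per-column tables have different lengths and cannot be bound into one table. The names ragged, counts and counts_df are illustrative only.

```r
df <- data.frame(
  c1 = c(1L, 2L, 3L, 3L, 2L, 1L, 1L),
  c2 = c(1L, 2L, 3L, 4L, 5L, 3L, 5L),
  c3 = c(5L, 5L, 4L, 3L, 1L, 1L, 1L)
)

# Without fixed levels each table lists only the values that occur,
# so the lengths differ and sapply() falls back to a list
ragged <- sapply(df, table)
is.list(ragged)

# With levels 1:5 every table has five entries, so sapply() returns a matrix
counts <- sapply(df, function(x) table(factor(x, levels = 1:5)))

# Convert to a data frame to work with it further
counts_df <- as.data.frame(counts)
counts_df
```

The row names of counts_df are the answer options 1 to 5, and options that never occur in a column show up as 0 rather than being dropped.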
We may use a vectorized option in base R
table(data.frame(v1 = unlist(df1), v2 = names(df1)[col(df1)]))
v2
v1 c1 c2 c3
1 3 1 3
2 2 1 0
3 2 2 1
4 0 1 1
5 0 2 2
data
df1 <- structure(list(c1 = c(1L, 2L, 3L, 3L, 2L, 1L, 1L), c2 = c(1L,
2L, 3L, 4L, 5L, 3L, 5L), c3 = c(5L, 5L, 4L, 3L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-7L))

Replacing and swapping one value by another in R for specific rows

In the following data frame, how can I replace all 1 by 2 and all 2 by 1, and keep 3 as it is?
mydf=structure(list(x1 = c(1L, 2L, 3L, 1L), x2 = c(2L, 1L, 2L, 3L),
x3 = c(1L, 2L, 2L, 1L), x4 = c(1L, 2L, 1L, 1L)), row.names = 5:8, class = "data.frame")
mydf
x1 x2 x3 x4
5 1 2 1 1
6 2 1 2 2
7 3 2 2 1
8 1 3 1 1
Any kind of help is appreciated.
Base R :
mydf[] <- lapply(mydf, function(x) ifelse(x == 1, 2, ifelse(x == 2, 1, x)))
dplyr :
library(dplyr)
mydf %>%
mutate(across(.fns = ~case_when(. == 2 ~ 1L,. == 1 ~ 2L, TRUE ~ .)))
# x1 x2 x3 x4
#5 2 1 2 2
#6 1 2 1 1
#7 3 1 1 2
#8 2 3 2 2
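An alternative sketch uses a lookup vector instead of nested conditions. It assumes the columns hold only the integers 1, 2 and 3, so each value can be used directly as an index; lookup and swapped are illustrative names.

```r
mydf <- data.frame(
  x1 = c(1L, 2L, 3L, 1L),
  x2 = c(2L, 1L, 2L, 3L),
  x3 = c(1L, 2L, 2L, 1L),
  x4 = c(1L, 2L, 1L, 1L),
  row.names = 5:8
)

# Position i of lookup holds the replacement for value i:
# 1 becomes 2, 2 becomes 1, 3 stays 3
lookup <- c(2L, 1L, 3L)

# Assigning into swapped[] keeps the data frame structure and row names
swapped <- mydf
swapped[] <- lapply(mydf, function(x) lookup[x])
swapped
```

This keeps the columns as integers and extends naturally if more values need remapping, as long as every value that occurs has an entry in lookup.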

Replace multiple columns by head string into one column

I want to replace multiple columns of a data frame with a single column per group, and I also want to change the numbers. Example:
A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0
I want to collapse this data frame by its headers, meaning I only want one column "A" instead of the four here and one column "B" instead of the three. The numbers should change according to the following pattern: if an observation in group "A2" has the value 1, it should become 2; if an observation in group "A3" has the value 1, it should become 3. In the end, each row should contain the highest such number for each group (if a row has three 1s in a group, the replacement is the number of the highest column among them).
If the number is 0 then nothing changes. Here is the result I'm looking for:
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
How can I replace all of these groups by a single column each? (one column for each group)
So far I've tried a lot with the function unite(data= testdata, col= "A") for example, but doing this manually would take too long. There has to be a better way, right?
Thanks in advance!
You can do:
dat <- read.table(header=TRUE, text=
"A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0")
myfu <- function(x) if (any(x)) max(which(x)) else 0
new <- data.frame(
A=apply(dat[, 1:4]==1, 1, myfu),
B=apply(dat[, 5:7]==1, 1, myfu))
new
A more general solution:
new2 <- data.frame(
A=apply(dat[, grepl("^A", names(dat))]==1, 1, myfu),
B=apply(dat[, grepl("^B", names(dat))]==1, 1, myfu))
new2
You can try the code like below
dfout <- as.data.frame(
lapply(
split.default(df, gsub("\\d+$", "", names(df))),
function(v) max.col(v, ties.method = "last") * +(rowSums(v) >= 1)
)
)
such that
> dfout
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
Data
df <- structure(list(A1 = c(1L, 1L, 1L, 0L, 0L), A2 = c(1L, 0L, 1L,
0L, 0L), A3 = c(0L, 1L, 1L, 1L, 0L), A4 = c(1L, 1L, 1L, 0L, 0L
), B1 = c(1L, 0L, 0L, 0L, 0L), B2 = c(0L, 1L, 1L, 0L, 1L), B3 = c(0L,
1L, 1L, 1L, 0L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5"))
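To unpack the max.col answer above, here is a sketch that spells out its two ingredients. It assumes the columns of each group are ordered by their numeric suffix, so that a column's position within the group equals its number; last_one is a hypothetical helper name.

```r
df <- data.frame(
  A1 = c(1L, 1L, 1L, 0L, 0L), A2 = c(1L, 0L, 1L, 0L, 0L),
  A3 = c(0L, 1L, 1L, 1L, 0L), A4 = c(1L, 1L, 1L, 0L, 0L),
  B1 = c(1L, 0L, 0L, 0L, 0L), B2 = c(0L, 1L, 1L, 0L, 1L),
  B3 = c(0L, 1L, 1L, 1L, 0L)
)

# Split the columns into groups by stripping the trailing digits
groups <- split.default(df, gsub("\\d+$", "", names(df)))

last_one <- function(v) {
  m <- as.matrix(v)
  # Position of the last maximum in each row; for 0/1 data this is the
  # last 1, but an all-zero row would also report its last column
  pos <- max.col(m, ties.method = "last")
  # Multiplying by this flag resets all-zero rows to 0
  has_one <- rowSums(m) >= 1
  unname(pos * has_one)
}

out <- as.data.frame(lapply(groups, last_one))
out
```

The multiplication by has_one is what turns row 5 of group A into 0 instead of 4.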
Assuming your data is in a data frame called df1, this works in base R:
df1 <- t(df1)*as.numeric(regmatches(colnames(df1), regexpr("\\d+$", colnames(df1))))
df1 <- split(as.data.frame(df1),sub("\\d+$","",row.names(df1)))
df1 <- sapply(df1, apply, 2, max)
output:
> df1
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2

Keeping rows with a minimum number of observations per group

With the following data, I would like to remove every row that contains at most one 1 or at most one 2. The data set contains only the values 1 and 2.
mydata
X1 X2 X3 X4 X5 X6 X7
1 1 2 2 1 1 2 2
2 2 2 2 1 2 2 2
3 1 1 1 1 2 2 2
4 2 1 2 1 2 2 1
5 2 1 1 1 1 1 1
6 1 1 1 1 1 1 1
7 2 2 2 2 2 2 2
Rows 2, 5, 6 and 7 should be removed because
sum(mydata[2,]=="1") # the 2nd row contains only one 1
sum(mydata[5,]=="2") # the 5th row contains only one 2
sum(mydata[6,]=="2") # the 6th row contains no 2
sum(mydata[7,]=="1") # the 7th row contains no 1
Appreciate your help.
mydata[rowSums(mydata == 1) > 1 & rowSums(mydata == 2) > 1,]
# X1 X2 X3 X4 X5 X6 X7
#1 1 2 2 1 1 2 2
#3 1 1 1 1 2 2 2
#4 2 1 2 1 2 2 1
One option is to loop through the rows, tabulate each one, and check that the frequency of every value is greater than 1 (this extends to more than two distinct values per row by widening the levels):
mydata[apply(mydata, 1, function(x) all(table(factor(x, levels = 1:2)) >1)),]
# X1 X2 X3 X4 X5 X6 X7
#1 1 2 2 1 1 2 2
#3 1 1 1 1 2 2 2
#4 2 1 2 1 2 2 1
data
mydata <- structure(list(X1 = c(1L, 2L, 1L, 2L, 2L, 1L, 2L), X2 = c(2L,
2L, 1L, 1L, 1L, 1L, 2L), X3 = c(2L, 2L, 1L, 2L, 1L, 1L, 2L),
X4 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L), X5 = c(1L, 2L, 2L, 2L,
1L, 1L, 2L), X6 = c(2L, 2L, 2L, 2L, 1L, 1L, 2L), X7 = c(2L,
2L, 2L, 1L, 1L, 1L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))
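The rowSums filter above can be written with an explicit threshold, which may be easier to adapt. In this sketch min_count is a hypothetical parameter, set to 2 to match the "more than one" rule from the question.

```r
mydata <- data.frame(
  X1 = c(1L, 2L, 1L, 2L, 2L, 1L, 2L),
  X2 = c(2L, 2L, 1L, 1L, 1L, 1L, 2L),
  X3 = c(2L, 2L, 1L, 2L, 1L, 1L, 2L),
  X4 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L),
  X5 = c(1L, 2L, 2L, 2L, 1L, 1L, 2L),
  X6 = c(2L, 2L, 2L, 2L, 1L, 1L, 2L),
  X7 = c(2L, 2L, 2L, 1L, 1L, 1L, 2L)
)

# Hypothetical threshold: a row needs at least this many 1s and this many 2s
min_count <- 2

# rowSums() on a logical comparison counts matches per row, with no loop
n_ones <- rowSums(mydata == 1)
n_twos <- rowSums(mydata == 2)

kept <- mydata[n_ones >= min_count & n_twos >= min_count, ]
kept
```

On the sample data this keeps rows 1, 3 and 4, the same rows as the answers above.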

How to programmatically compare an entire row in R?

I have the following dataframe in R:
data=
Time X1 X2 X3
1 1 0 0
2 1 1 1
3 0 0 1
4 1 1 1
5 0 0 0
6 0 1 1
7 1 1 1
8 0 0 0
9 1 1 1
10 0 0 0
Is there a way to programmatically select those rows that are equal to (0,1,1)? I know it can be done with data[data$X1 == 0 & data$X2 == 1 & data$X3 == 1,] but, in my scenario, (0,1,1) is a list held in a variable. My ultimate goal is to determine the number of rows that are equal to (0,1,1), or to any other combination that the list variable can hold.
Thanks!
Mariano.
Here's a couple of options using a merge:
merge(list(X1=0,X2=1,X3=1), dat)
#or
merge(setNames(list(0,1,1),c("X1","X2","X3")), dat)
Or even using positional indexes based on what columns you want matched up:
L <- list(0,1,1)
merge(L, dat, by.x=seq_along(L), by.y=2:4)
All of which return:
# X1 X2 X3 Time
#1 0 1 1 6
If your matching variables are all of the same type, you could also safely do it via matrix comparison like:
dat[colSums(t(dat[c("X1","X2","X3")]) == c(0,1,1)) == 3,]
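apply(data[c("X1","X2","X3")], 1, function(x) all(x == c(0,1,1)))
This goes down each row of the frame and returns TRUE for each row that equals c(0,1,1). The Time column has to be left out of the comparison; otherwise each row has four values, which are compared against a recycled three-element pattern and give wrong results.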
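Since the question asks for the number of matching rows with the pattern held in a list variable, here is a sketch that combines the ideas above. The names target, cols and hits are illustrative, and it assumes the compared columns are all numeric.

```r
mydf <- data.frame(
  Time = 1:10,
  X1 = c(1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
  X2 = c(0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L),
  X3 = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L)
)

# The pattern lives in a list variable, as in the question
target <- list(0, 1, 1)
cols <- c("X1", "X2", "X3")

# After transposing, each column is one original row, so the pattern
# recycles down each column and lines up with X1, X2, X3
m <- as.matrix(mydf[cols])
hits <- colSums(t(m) == unlist(target)) == length(target)

sum(hits)     # number of rows equal to the pattern
mydf[hits, ]  # the matching rows themselves
```

Changing target to another combination, for example list(1, 1, 1), needs no other change to the code.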
this is your data
mydf <- structure(list(Time = 1:10, X1 = c(1L, 1L, 0L, 1L, 0L, 0L, 1L,
0L, 1L, 0L), X2 = c(0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L),
X3 = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L)), .Names = c("Time",
"X1", "X2", "X3"), class = "data.frame", row.names = c(NA, -10L
))
Using subset
subset(mydf, X1 == 0 & X2==1 & X3==1)
# Time X1 X2 X3
#6 6 0 1 1
another way
mydf[mydf$X1 ==0 & mydf$X2 ==1 & mydf$X3 ==1, ]
# Time X1 X2 X3
#6 6 0 1 1
or like this (it works only because X2 holds 0/1 values, which are coerced to logical)
mydf[mydf$X1 ==0 & mydf$X2 & mydf$X3 %in% c(1,1), ]
# Time X1 X2 X3
#6 6 0 1 1
you can also do that by
library(dplyr)
filter(mydf, X1==0 & X2==1 & X3==1)
# Time X1 X2 X3
#1 6 0 1 1
