With the following data, I would like to remove all rows that contain at most one 1 or at most one 2. The data set contains only 1s and 2s.
mydata
X1 X2 X3 X4 X5 X6 X7
1 1 2 2 1 1 2 2
2 2 2 2 1 2 2 2
3 1 1 1 1 2 2 2
4 2 1 2 1 2 2 1
5 2 1 1 1 1 1 1
6 1 1 1 1 1 1 1
7 2 2 2 2 2 2 2
Remove rows 2, 5, 6 and 7 because:
sum(mydata[2,]=="1") # 2nd row contains only one 1.
sum(mydata[5,]=="2") # 5th row contains only one 2.
sum(mydata[6,]=="2") # 6th row contains no 2s.
sum(mydata[7,]=="1") # 7th row contains no 1s.
Appreciate your help.
mydata[rowSums(mydata == 1) > 1 & rowSums(mydata == 2) > 1,]
# X1 X2 X3 X4 X5 X6 X7
#1 1 2 2 1 1 2 2
#3 1 1 1 1 2 2 2
#4 2 1 2 1 2 2 1
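To see what this condition computes (a self-contained sketch that rebuilds the data above), the two rowSums calls count the 1s and 2s in each row, and a row survives only when both counts exceed 1:

```r
mydata <- data.frame(X1 = c(1L, 2L, 1L, 2L, 2L, 1L, 2L),
                     X2 = c(2L, 2L, 1L, 1L, 1L, 1L, 2L),
                     X3 = c(2L, 2L, 1L, 2L, 1L, 1L, 2L),
                     X4 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L),
                     X5 = c(1L, 2L, 2L, 2L, 1L, 1L, 2L),
                     X6 = c(2L, 2L, 2L, 2L, 1L, 1L, 2L),
                     X7 = c(2L, 2L, 2L, 1L, 1L, 1L, 2L))
rowSums(mydata == 1)  # counts of 1s per row: 3 1 4 3 6 7 0
rowSums(mydata == 2)  # counts of 2s per row: 4 6 3 4 1 0 7
# both counts exceed 1 only for rows 1, 3 and 4
rowSums(mydata == 1) > 1 & rowSums(mydata == 2) > 1
```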
One option is to loop through the rows, get the table of each row, and check whether the frequency of every element is greater than 1 (this also covers the case where a row has more unique elements):
mydata[apply(mydata, 1, function(x) all(table(factor(x, levels = 1:2)) >1)),]
#  X1 X2 X3 X4 X5 X6 X7
#1 1 2 2 1 1 2 2
#3 1 1 1 1 2 2 2
#4 2 1 2 1 2 2 1
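To see what the row-wise check does for a single row (a sketch for row 2), factor(..., levels = 1:2) guarantees that both counts appear in the table even when one value is absent from the row:

```r
x <- c(2, 2, 2, 1, 2, 2, 2)              # row 2 of mydata
table(factor(x, levels = 1:2))            # one 1, six 2s
all(table(factor(x, levels = 1:2)) > 1)   # FALSE, so row 2 is dropped
```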
data
mydata <- structure(list(X1 = c(1L, 2L, 1L, 2L, 2L, 1L, 2L), X2 = c(2L,
2L, 1L, 1L, 1L, 1L, 2L), X3 = c(2L, 2L, 1L, 2L, 1L, 1L, 2L),
X4 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L), X5 = c(1L, 2L, 2L, 2L,
1L, 1L, 2L), X6 = c(2L, 2L, 2L, 2L, 1L, 1L, 2L), X7 = c(2L,
2L, 2L, 1L, 1L, 1L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))
I'm trying to figure out how to write a loop that tests whether a value in one of many columns is greater than or less than the values in two fixed columns of a data frame. I'd like a 1/0 output, and to drop all the tested columns. My current solution uses an embarrassing number of mutate calls to create new TRUE/FALSE columns, then a Reduce call to check whether TRUE is present in any column from a set position to the end of the data frame. Any help on this would be appreciated!
example:
library(tidyverse)
df3 = data.frame(X = sample(1:3, 15, replace = TRUE),
Y = sample(1:3, 15, replace = TRUE),
Z = sample(1:3, 15, replace = TRUE),
A = sample(1:3, 15, replace = TRUE))
df3 <- df3 %>% mutate(T1 = Z >= X & Z <= Y,
T2 = A >= X & A <= Y)
df3$check <- Reduce(`|`, lapply(df3[5:6], `==`, TRUE))
You can use if_any to test all the columns in one call:
df3 %>%
mutate(check = if_any(Z:A, function(x) {x >= X & x <= Y}))
You can compare the entire subset df3[c('A', 'Z')] at once, which should be more efficient. We are looking for rowSums greater than zero.
To understand the logic:
cols <- c('A', 'Z')
as.integer(rowSums(df3[cols] >= df3$X & df3[cols] <= df3$Y) > 0)
# [1] 1 1 0 0 1 0 0 1 0 0 1 1 0 0 1
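Step by step (a sketch that rebuilds the data): the comparison recycles X and Y down each tested column, producing a logical matrix, and rowSums counts how many of the tested columns fall in [X, Y] for each row.

```r
df3 <- data.frame(X = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 3L, 1L, 1L, 2L, 2L, 2L),
                  Y = c(3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L, 2L, 2L),
                  Z = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L),
                  A = c(2L, 2L, 2L, 2L, 3L, 2L, 1L, 2L, 3L, 2L, 2L, 2L, 1L, 1L, 1L))
cols <- c('A', 'Z')
hits <- df3[cols] >= df3$X & df3[cols] <= df3$Y  # logical matrix, one column per tested variable
rowSums(hits)       # number of qualifying columns in each row
rowSums(hits) > 0   # TRUE where at least one tested column lies in [X, Y]
```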
To create the column:
transform(df3, check=as.integer(rowSums(df3[cols] >= X & df3[cols] <= Y) > 0))
# X Y Z A check
# 1 1 3 3 2 1
# 2 1 3 3 2 1
# 3 1 1 2 2 0
# 4 1 1 2 2 0
# 5 2 3 2 3 1
# 6 2 1 2 2 0
# 7 2 3 1 1 0
# 8 1 1 1 2 1
# 9 3 1 2 3 0
# 10 3 2 2 2 0
# 11 1 3 3 2 1
# 12 1 2 3 2 1
# 13 2 1 1 1 0
# 14 2 2 1 1 0
# 15 2 2 2 1 1
Data:
df3 <- structure(list(X = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 3L,
1L, 1L, 2L, 2L, 2L), Y = c(3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L,
2L, 3L, 2L, 1L, 2L, 2L), Z = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L,
2L, 2L, 3L, 3L, 1L, 1L, 2L), A = c(2L, 2L, 2L, 2L, 3L, 2L, 1L,
2L, 3L, 2L, 2L, 2L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-15L))
In the following data frame, how can I replace all 1s with 2 and all 2s with 1, while keeping 3 as it is?
mydf=structure(list(x1 = c(1L, 2L, 3L, 1L), x2 = c(2L, 1L, 2L, 3L),
x3 = c(1L, 2L, 2L, 1L), x4 = c(1L, 2L, 1L, 1L)), row.names = 5:8, class = "data.frame")
mydf
x1 x2 x3 x4
5 1 2 1 1
6 2 1 2 2
7 3 2 2 1
8 1 3 1 1
Any kind of help is appreciated.
Base R :
mydf[] <- lapply(mydf, function(x) ifelse(x == 1, 2, ifelse(x == 2, 1, x)))
dplyr :
library(dplyr)
mydf %>%
mutate(across(everything(), ~ case_when(. == 2 ~ 1L, . == 1 ~ 2L, TRUE ~ .)))
# x1 x2 x3 x4
#5 2 1 2 2
#6 1 2 1 1
#7 3 1 1 2
#8 2 3 2 2
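Since only 1 and 2 trade places, an arithmetic shortcut also works (a sketch, not part of the original answers): for values in 1:2, 3 - x maps 1 to 2 and 2 to 1, and everything else passes through unchanged.

```r
mydf <- structure(list(x1 = c(1L, 2L, 3L, 1L), x2 = c(2L, 1L, 2L, 3L),
                       x3 = c(1L, 2L, 2L, 1L), x4 = c(1L, 2L, 1L, 1L)),
                  row.names = 5:8, class = "data.frame")
# swap 1 <-> 2 via 3 - x, leave all other values alone
mydf[] <- lapply(mydf, function(x) ifelse(x %in% 1:2, 3L - x, x))
mydf
#   x1 x2 x3 x4
# 5  2  1  2  2
# 6  1  2  1  1
# 7  3  1  1  2
# 8  2  3  2  2
```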
In the data frame
x1 x2 x3 x4 x5
1 0 1 1 0 3
2 1 2 2 0 3
3 2 2 0 0 2
4 1 3 0 0 2
5 3 3 2 1 4
6 2 0 0 0 1
column x5 indicates where the first non-zero value in a row is. The table should be read from right (x4) to left (x1). Thus, the first non-zero value in the first row is in column x3, for example.
I want to get all rows where 1 is the first non zero entry, i.e.
x1 x2 x3 x4 x5
1 0 1 1 0 3
2 3 3 2 1 4
should be the result. I tried different versions of filter_at, but I didn't manage to come up with a solution. E.g., one try was
testdf %>% filter_at(vars(
paste("x",testdf$x5, sep = "")),
any_vars(. == 1))
I want to solve that without a for loop, since the real data set has millions of rows and almost 100 columns.
You can do filtering row-wise easily with the new utility function c_across:
library(dplyr) # version 1.0.2
testdf %>% rowwise() %>% filter(c_across(x1:x4)[x5] == 1) %>% ungroup()
# A tibble: 2 x 5
x1 x2 x3 x4 x5
<int> <int> <int> <int> <int>
1 0 1 1 0 3
2 3 3 2 1 4
A vectorised base R solution would be :
result <- df1[df1[cbind(1:nrow(df1), df1$x5)] == 1, ]
result
# x1 x2 x3 x4 x5
#1 0 1 1 0 3
#5 3 3 2 1 4
cbind(1:nrow(df1), df1$x5) builds a two-column row/column index matrix; subscripting the data frame with it extracts, for each row, the value in the column named by x5. We then keep the rows where that extracted value is 1.
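Matrix indexing is easiest to see on a toy example (a sketch): each row of the index matrix is a (row, column) pair, and subscripting with it returns one element per pair.

```r
m <- matrix(1:9, nrow = 3)      # columns are 1:3, 4:6, 7:9
idx <- cbind(1:3, c(2, 1, 3))   # pairs (1,2), (2,1), (3,3)
m[idx]                          # 4 2 9
```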
Another vectorised solution:
df1[t(df1)[t(col(df1) == df1$x5)] == 1, ]
We can use apply in base R
df1[apply(df1, 1, function(x) x[x[5]] == 1),]
# x1 x2 x3 x4 x5
#1 0 1 1 0 3
#5 3 3 2 1 4
data
df1 <- structure(list(x1 = c(0L, 1L, 2L, 1L, 3L, 2L), x2 = c(1L, 2L,
2L, 3L, 3L, 0L), x3 = c(1L, 2L, 0L, 0L, 2L, 0L), x4 = c(0L, 0L,
0L, 0L, 1L, 0L), x5 = c(3L, 3L, 2L, 2L, 4L, 1L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
I have data with a status column. I want to subset my data to the rows with status 'f', together with the row immediately preceding each 'f' row.
To simplify:
df
id status time
1 n 1
1 n 2
1 f 3
1 n 4
2 f 1
2 n 2
3 n 1
3 n 2
3 f 3
3 f 4
my result should be:
id status time
1 n 2
1 f 3
2 f 1
3 n 2
3 f 3
3 f 4
How can I do this in R?
Here's a solution using dplyr -
df %>%
group_by(id) %>%
filter(status == "f" | lead(status) == "f") %>%
ungroup()
# A tibble: 6 x 3
id status time
<int> <fct> <int>
1 1 n 2
2 1 f 3
3 2 f 1
4 3 n 2
5 3 f 3
6 3 f 4
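A base R sketch of the same idea (an alternative, not part of the original answer): within each id, keep rows that are "f" or whose immediately following row is "f"; ave applies the one-step shift per group.

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                 status = c("n", "n", "f", "n", "f", "n", "n", "n", "f", "f"),
                 time = c(1, 2, 3, 4, 1, 2, 1, 2, 3, 4))
# per id: TRUE where the row is "f" or the next row in the group is "f"
keep <- with(df, ave(status == "f", id,
                     FUN = function(z) z | c(z[-1], FALSE)))
df[as.logical(keep), ]
```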
Data -
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
status = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L), .Label = c("f", "n"), class = "factor"), time = c(1L,
2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)), .Names = c("id", "status",
"time"), class = "data.frame", row.names = c(NA, -10L))
Looking to create a function. I would like to count how many times each observation occurs within a given group (e.g., a Days value of 5 occurring 2 times). Identical Days values within a Week for a Business should be collapsed into one row, with the count stored in a new column 'Total-Occurrences.' I suspect tapply or plyr fits in here somehow, but I'm stuck on a few nuances. Thanks!
14X3 matrix
Business Week Days
A **1** 3
A **1** 3
A **1** 1
A 2 4
A 2 1
A 2 1
A 2 6
A 2 1
B **1** 1
B **1** 2
B **1** 7
B 2 2
B 2 2
B 2 na
**AND BECOME**
10X4 matrix
Business Week Days Total-Occurrences
A **1** 3 2
A **1** 1 1
A 2 1 3
A 2 4 1
A 2 6 1
B **1** 1 1
B **1** 2 1
B **1** 7 1
B 3 2 2
B 2 na 0
If I understand your question correctly, you want to group your data frame by Business, Week and Days, and report the number of occurrences of each group in a new column Total-Occurrences.
df <- structure(list(Business = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
Week = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 2L, 2L, 2L), .Label = c("**1**", "2"), class = "factor"),
Days = structure(c(3L, 3L, 1L, 4L, 1L, 1L, 5L, 1L, 1L, 2L,
6L, 2L, 2L, 7L), .Label = c("1", "2", "3", "4", "6", "7",
"na"), class = "factor")), .Names = c("Business", "Week",
"Days"), class = "data.frame", row.names = c(NA, -14L))
There are certainly different ways of doing this. One way would be to use dplyr:
require(dplyr)
result <- df %>%
group_by(Business, Week, Days) %>%
summarize(Total.Occurrences = n())
result
# Business Week Days Total.Occurrences
#1 A **1** 1 1
#2 A **1** 3 2
#3 A 2 1 3
#4 A 2 4 1
#5 A 2 6 1
#6 B **1** 1 1
#7 B **1** 2 1
#8 B **1** 7 1
#9 B 2 2 2
#10 B 2 na 1
You could also use plyr:
require(plyr)
ddply(df, .(Business, Week, Days), nrow)
Note that, based on these functions, the output is slightly different from what you posted in your question. I assume this is a typo, because your original data has no Week 3 but your desired output does.
Between the two solutions, the dplyr approach is probably faster.
I guess there are also other ways of doing this (but I'm not sure about tapply).
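A compact alternative in current dplyr (a sketch, rebuilding the data above) is count, which groups and tallies in one call:

```r
library(dplyr)

df <- data.frame(
  Business = rep(c("A", "B"), c(8, 6)),
  Week = c("**1**", "**1**", "**1**", "2", "2", "2", "2", "2",
           "**1**", "**1**", "**1**", "2", "2", "2"),
  Days = c("3", "3", "1", "4", "1", "1", "6", "1",
           "1", "2", "7", "2", "2", "na")
)
# one row per Business/Week/Days combination, with its frequency
count(df, Business, Week, Days, name = "Total.Occurrences")
```

This yields the same 10 groups as the dplyr and plyr answers above.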