Conditional dataframe slicing in R

I would like to remove the rows of this dataframe in which the pattern ,2) appears in exactly one of the columns (but not in the other).
As an example, in this dataframe each column is of class character (each cell representing a vector):
A c(0,1) c(1,1)
B c(0,2) c(0,1)
C c(1,1) c(0,1)
D c(1,2) c(0,2)
I would like to subset it, removing row B, as the pattern is present in one of the columns but not in the other.
I tried to use grep, but I don't know how to specify the conditional statement.
How can I achieve this?

For a single column we would do this (calling your data d):
d[!grepl(",2)", d$column_name, fixed = TRUE), ]
But we need to check all the columns and find rows with exactly one match. For this, we convert to a matrix and use rowSums to count the matches by row:
n_occurrences = rowSums(matrix(grepl(",2)", as.matrix(d), fixed = TRUE), nrow = nrow(d)))
d[n_occurrences != 1, ]
# V1 V2 V3
# 1 A c(0,1) c(1,1)
# 3 C c(1,1) c(0,1)
# 4 D c(1,2) c(0,2)
Using this sample data:
d = read.table(text = 'A c(0,1) c(1,1)
B c(0,2) c(0,1)
C c(1,1) c(0,1)
D c(1,2) c(0,2)')
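If you prefer to test the two value columns directly, an equivalent check (a sketch assuming the read.table defaults name the columns V1 to V3, as in the output above) keeps the rows where the two columns agree:
hit2 <- grepl(",2)", d$V2, fixed = TRUE)
hit3 <- grepl(",2)", d$V3, fixed = TRUE)
d[hit2 == hit3, ]  # keep rows where the pattern is in both columns or in neither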

Not as elegant as the selected answer above, but you can also split the column into two variables at the blank space and then create separate indices.
library(dplyr)
df = data.frame(v1=c('c(0,1) c(1,1)','c(0,2) c(0,1)',
'c(1,1) c(0,1)','c(1,2) c(0,2)'))
empty_omit <- function(vec) vec[vec!='']
get_even <- function(vec) vec[seq_along(vec) %% 2 == 0]
get_odd <- function(vec) vec[seq_along(vec) %% 2 ==1]
df$v2 = strsplit(df$v1, ' ') %>% unlist() %>% empty_omit %>% get_odd()
df$v3 = strsplit(df$v1, ' ') %>% unlist() %>% empty_omit %>% get_even()
idx_v2 = grepl(",2)", df$v2)
idx_v3 = grepl(",2)", df$v3)
df[idx_v2 == idx_v3, ]  # keep rows where the pattern is in both columns or in neither

Related

Paste leading zero in columns A and B if column A meets condition

Data:
A B
"2058600192", "2058644"
"4087600101", "4087601"
"30138182591","30138011"
I am trying to add one leading 0 to columns A and B if column A is 10 characters.
This is what I have written so far:
for (i in 1:nrow(data)) {
  if (nchar(data$A[i]) == 10) {
    data$A[i] <- paste0(0, data$A)
    data$B[i] <- paste0(0, data$B)
  }
}
But I'm getting the following warning:
number of items to replace is not a multiple of replacement length
I've also tried using a dplyr solution, but I'm not sure how to mutate two columns based on one column. Any insight would be appreciated.
Your solution was already pretty good. You just made some very small mistakes. This code would give the correct output:
data <- data.frame(A = c("2058600192","4087600101","30138182591"), B = c("2058644","4087601","30138011"))
for (i in 1:nrow(data)) {
  if (nchar(data$A[i]) == 10) {
    data$A[i] <- paste0(0, data$A[i])
    data$B[i] <- paste0(0, data$B[i])
  }
}
The only difference is data$A[i] <- paste0(0, data$A[i]) instead of data$A[i] <- paste0(0, data$A). Without the [i], the right-hand side is the whole column, so you would be trying to put all of its values into a single element.
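To see where the warning comes from, compare the lengths of the two right-hand sides (a quick check against the data defined above):
length(paste0(0, data$A))     # 3: the whole column pasted, too long for one element
length(paste0(0, data$A[1]))  # 1: a single element, which fits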
You can get the index where the number of characters is equal to 10 and replace those values using lapply for multiple columns.
inds <- nchar(df$A) == 10
df[] <- lapply(df, function(x) replace(x, inds, paste0('0', x[inds])))
#If you want to replace only specific columns
#df[c('A', 'B')] <- lapply(df[c('A', 'B')], function(x)
# replace(x, inds, paste0('0', x[inds])))
df
# A B
#1 02058600192 02058644
#2 04087600101 04087601
#3 30138182591 30138011
data
df <- structure(list(A = c(2058600192, 4087600101, 30138182591), B = c(2058644L,
4087601L, 30138011L)), class = "data.frame", row.names = c(NA, -3L))
Just in case you were interested in using dplyr, here's another solution using transmute.
df %>%
# Need to transmute B first, so that nchar is evaluated on the original A column and not on the one with leading zeros
transmute(B = ifelse(nchar(A) == 10, paste0(0, B), B),
A = ifelse(nchar(A) == 10, paste0(0, A), A)) %>%
# Just change the order of the columns to the original one
select(A,B)
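A closely related variant (not from the original answer, and needing dplyr 1.0 or later for across) computes the condition once and then pads both columns with it, which should give the same padded result:
df %>%
  mutate(pad = nchar(A) == 10,
         across(c(A, B), ~ ifelse(pad, paste0("0", .x), .x))) %>%
  select(-pad)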
Another way you can try
library(dplyr)
library(stringr)
df %>%
mutate(A = ifelse(str_length(A) == 10, str_pad(A, width = 11, side = "left", pad = 0), A),
B = ifelse(grepl("^0", A), paste0("0", B), B))
# A B
# 1 02058600192 02058644
# 2 04087600101 04087601
# 3 30138182591 30138011
str_length detects the length of a string.
You can use str_pad to add leading zeros (see ?str_pad for details).
We can use grepl to detect the strings with a leading zero in column A and add a leading zero to column B.
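For reference, a minimal illustration of what str_pad does here: it pads to a fixed total width, so a 10-character string gains exactly one zero.
library(stringr)
str_pad("2058600192", width = 11, side = "left", pad = "0")
# [1] "02058600192"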
You may use the vectorized ifelse function here; compute the condition on the original column A first, so that both columns are padded based on it:
idx <- nchar(data$A) == 10
data$A <- ifelse(idx, paste0("0", data$A), data$A)
data$B <- ifelse(idx, paste0("0", data$B), data$B)
data
            A        B
1 02058600192 02058644
2 04087600101 04087601
3 30138182591 30138011

extracting observations from matrix where columns and rows match a "key"

Given a matrix m how can I create a TRUE/ FALSE or 1 / 0 matrix where the columns and rows match some "key" in a data frame?
My goal is to assign a 1 or 0 to the location in the matrix where the columns match the cols and the rows match the rows in the colsrows_df. Then essentially just extract the observations where this is true or paste them into the colsrows_df next to the correct columns.
The for loop below just creates 1's and 0's diagonally:
m <- matrix(runif(400), nrow = 20, ncol = 20)
dimnames(m) <- list(c(paste0("ID", 1:5, "_2000"), paste0("ID", 1:5, "_2001"), paste0("ID", 1:5, "_2002"), paste0("ID", 1:5, "_2003")),
c(paste0("ID", 1:5, "_2000"), paste0("ID", 1:5, "_2001"), paste0("ID", 1:5, "_2002"), paste0("ID", 1:5, "_2003")))
cols <- colnames(m)
rows <- rownames(m)
library(tidyr)
library(dplyr)
colsrows <- cbind(cols, rows)
# Here I just separate the rows/cols and then add an extra year and paste them back together
colsrows_df <- colsrows %>%
data.frame %>%
separate(cols, c("id_col", "year_col"), "_", remove = FALSE) %>%
separate(rows, c("id_row", "year_row"), "_", remove = FALSE) %>%
mutate(year_row_plus_1 = as.numeric(year_row) + 1,
rows = paste0(id_row,"_", year_row_plus_1)) %>%
select(cols, rows)
colsrows_df
for(i in 1:nrow(colsrows)){
  m[i, ] <- colnames(m) == colsrows_df$cols
  m[, i] <- rownames(m) == colsrows_df$rows
}
m
EDIT:
This seems to "solve" the problem, but I am not sure how robust it is.
library(reshape2)  # melt() is not loaded by tidyr/dplyr
ids <- colsrows_df[colsrows_df$cols %in% colnames(m) &
                     colsrows_df$rows %in% rownames(m), ]
res <- melt(m[as.matrix(colsrows_df[colsrows_df$cols %in% colnames(m) &
                                      colsrows_df$rows %in% rownames(m), ][2:1])])
cbind(ids, res)
I think you can first filter colsrows_df to the row names and column names that are actually present in m, then reverse the order of the columns, convert to a matrix, use it to subset m, and set those values to 1.
m[as.matrix(colsrows_df[colsrows_df$cols %in% colnames(m) &
                          colsrows_df$rows %in% rownames(m), ][2:1])] <- 1
Then convert the remaining values to 0:
m[m != 1] <- 0
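The subsetting above relies on R's support for indexing a matrix with a two-column character matrix of (row name, column name) pairs. A minimal standalone illustration, with made-up names rather than the question's IDs:
m2 <- matrix(0, nrow = 2, ncol = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
key <- cbind(c("r1", "r2"), c("c2", "c1"))  # first column = row names, second = column names
m2[key] <- 1
m2
#    c1 c2
# r1  0  1
# r2  1  0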

Count vectors in list with the same elements in r

I have a list of ~8000 vectors, and I would like to know how many duplicates there are among them, where two vectors count as duplicates even if their elements are in a different order.
for example:
list <- c()
list[[1]] <- c(1,2,3)
list[[2]] <- c(2,1,3)
list[[3]] <- c(3,2,1)
list[[4]] <- c(4,5)
list[[5]] <- c(5,4)
list[[6]] <- c(1,2,3,5)
should give me a count of 3 for c(1,2,3) and 2 for c(4,5) and 1 for c(1,2,3,5)
I'd like the count of each of the duplicates, not just how many are duplicated.
library(tidyverse)
library(gtools)
get_perm <- function(v) {
  m <- permutations(n = length(v), r = length(v), v = v, set = F)
  m[order(c(m))]
}
all <- map(list, get_perm)
unique <- map(list, get_perm) %>% unique()
res_vec <- c()
element <- c()
for(i in seq_along(unique)) {
  element[[i]] <- unique[[i]] %>% unique() %>% paste(collapse = ",")
  res_vec[[i]] <- all %in% unique[i] %>% sum()
}
tibble(
elements = unlist(element),
numbers = res_vec
)
Result
# A tibble: 3 x 2
elements numbers
<chr> <int>
1 1,2,3 3
2 4,5 2
3 1,2,3,5 1
elements contains the individual elements of the vectors in each group, and numbers is the number of vectors in each group.
We create a function that takes a vector as an argument ('val'), then loops through the list with sapply, checks that 'x' has the same length as 'val' and that all of 'val' are %in% 'x', and sums the logical vector (the length check stops c(1,2,3,5) from being counted as a match for c(1,2,3)):
f1 <- function(lst, val) sum(sapply(lst, function(x) length(x) == length(val) && all(val %in% x)))
f1(list, c(1, 2, 3))
#[1] 3
f1(list, c(4, 5))
#[1] 2
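If you want the counts for every distinct (order-insensitive) vector in one pass rather than querying them one at a time, a compact sketch (not taken from either answer) is to sort each vector into a canonical key and tabulate:
keys <- sapply(list, function(v) paste(sort(v), collapse = ","))
table(keys)
# keys
#   1,2,3 1,2,3,5     4,5
#       3       1       2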

reference x's column in R's apply function

I have a df like this:
a <- c(4,5,3,5,1)
b <- c(8,9,7,3,5)
c <- c(6,7,5,4,3)
df <- data.frame(rbind(a,b,c))
I want a new df, df2, containing, for rows a and b, the difference between each cell and the value in row c of the same column.
df2 would look like this:
a <- c(-2,-2,-2,1,-2)
b <- c(2,2,2,-1,2)
df2 <- data.frame(rbind(a,b))
Here is where I'm getting stuck:
df2 <- data.frame(apply(df,c(1,2),function(x) x - df[nrow(df),the col index of x]))
How do I reference the column index of x? Is there something like JavaScript's this?
We can do this easily by replicating the 3rd row to make the lengths equal before subtracting it from the first two rows:
out <- df[c("a", "b"),] - df["c",][col(df[c("a", "b"),])]
identical(df2, out)
#[1] TRUE
Or explicitly using rep
df[c("a", "b"),] - rep(unlist(df["c",]), each = 2)

join two columns in a dataframe so they do not contain the same values

Sooo
I’ve got two lists
list1 <- rep(c("john","steve","lisa","sara","anna"), c(50,0,15,25,10))
list2 <- rep(c("john","steve","lisa","sara","anna"), c(15,25,0,10,50))
I need to put them into a dataframe.
df <- as.data.frame(matrix(1, nrow = 100, ncol = 2))
df$v1 <- list1
Now the problem.
I need to put list2 into df$v2
without any row in df containing the same value in both columns.
It does not matter which values end up in which row.
I use this to test whether every row contains two different values:
all(apply(ballots, 1, function(x) length(unique(x)) == 2) == TRUE)
To clarify:
I need every value to end up in the columns; which row it lands in doesn't matter.
I need a way to randomize or reorder the second column (or the first) so that the same value never appears in both column one and column two of the same row.
The output:
V1 V2
John Steve
John Lisa
Sara John
John Lisa
Steve Anna
Currently, when I join the columns in the dataframe, there are many rows where column one and column two contain the same value.
Alright... finally found the answer after much trial and error.
If anyone has a cleaner method to do this, I would love to see it.
The following code takes list A and puts it in column A,
takes list B, randomizes it and puts it in column C; column B starts as NA.
If A and C are not the same, it switches columns B and C.
If it fails to finish all the rows, it starts over, randomizing column C again.
library(taRifx)
failed.counter <- 0
while (failed.counter <= 1) {
  list1 <- rep(c("A","B","C"), c(3,1,2))
  list2 <- sample(rep(c("A","B","C"), c(2,3,1)))
  df <- as.data.frame(matrix(NA, nrow = length(list1), ncol = 3))
  df[,1] <- list1
  df[,3] <- list2
  iteration.counter <- 0
  while (anyNA(df$V2) == TRUE && failed.counter == 0) {
    iteration.counter <- iteration.counter + 1
    df.sub <- df[is.na(df[,2]) & df[,1] != df[,3] & !is.na(df[,3]),]
    df.sub <- df.sub[,c("V1", "V3", "V2")]
    colnames(df.sub) <- c("V1", "V2", "V3")
    r.names <- rownames(df.sub)
    df[r.names,] <- df.sub
    df[,3] <- shift(df[,3], 1, Wrap=TRUE)
    if(iteration.counter >= nrow(df)+1) {failed.counter <- 1}
  }
  if(anyNA(df$V2) == FALSE) {failed.counter <- 2}
}
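Once the outer loop exits with failed.counter == 2, a quick sanity check (using the V1/V2 column names from the code above) is that no row repeats a value and that both columns still hold the original multisets; each line should return TRUE:
all(df$V1 != df$V2)
identical(sort(df$V1), sort(list1))
identical(sort(df$V2), sort(list2))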
