Check if a column in a dataframe is of the same value - r

This is a follow-up question to this one. What I would like to check is whether any column in a data frame contains the same value (numeric or string) in all rows. For example,
sample <- data.frame(col1=c(1, 1, 1), col2=c("a", "a", "a"), col3=c(12, 15, 22))
The purpose is to inspect each column in a data frame to see which columns do not have identical entries in all rows. How can I do this? In particular, the columns hold both numbers and strings.
My expected output would be a vector containing the numbers of the columns that have non-identical entries.

We can use apply column-wise (MARGIN = 2), count the unique values in each column, and select the columns whose number of unique values is not equal to 1.
which(apply(sample, 2, function(x) length(unique(x))) != 1)
#col3
# 3
The same can also be done with sapply (or lapply)
which(sapply(sample, function(x) length(unique(x))) != 1)
#col3
# 3
A dplyr version could be
library(dplyr)
sample %>%
# funs() is deprecated since dplyr 0.8.0; across() and where() need dplyr >= 1.0.0
summarise(across(everything(), n_distinct)) %>%
select(where(~ .x != 1))
# col3
#1 3

We can use Filter
names(Filter(function(x) length(unique(x)) != 1, sample))
#[1] "col3"
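As a minimal sketch, the same check can be wrapped in a small reusable function. The name non_constant_cols is hypothetical (it does not come from any package), and vapply is used instead of sapply only so that the result is guaranteed to be logical.

```r
# Hypothetical helper: positions of columns that hold more than one distinct value
non_constant_cols <- function(df) {
  which(vapply(df, function(x) length(unique(x)) > 1, logical(1)))
}

sample <- data.frame(col1 = c(1, 1, 1),
                     col2 = c("a", "a", "a"),
                     col3 = c(12, 15, 22))

# Named integer vector: names are column names, values are column positions
result <- non_constant_cols(sample)
result
```

Note that unique() counts NA as a value of its own, so a column such as c(1, NA, 1) would be reported as non-constant by this sketch.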

Related

Trying to iterate through certain columns based on column name using R but columns not in list get eliminated

I have a dataframe with 50 columns. I am trying to iterate through the columns in a dataframe that contain the word "apple". The dataframe contains 24 "apple" columns.
If apple_1 = 1, then all the other apple_x columns in that row should be set to 1; otherwise nothing should change.
This is my code so far:
I am successfully able to create a list of column names containing apple (excluding apple_1)
applelist<- df %>% select(contains("apple"))%>%select(!contains("apple_1"))
applelist<- list(colnames(applelist))
But when I try to loop through the columns in the applelist and update the values for each row it wants to delete the 'non' applelist columns (go from 50 columns to 24). I only want to update the apple columns and leave the rest untouched.
for (i in 1:ncol(applelist){
df[, i] <- ifelse(df$apple_1==1, 1, df[, i])
}
Here, applelist is a list of length 1, i.e. what we are getting is
applelist <- list(c('a', 'b', 'c'))
We just need to extract the names as a vector; if a list is really needed, use as.list (not really needed here)
library(dplyr)
nm1 <- df %>%
select(contains("apple"))%>%
select(!contains("apple_1")) %>%
names
Then use
for(nm in nm1) {df[[nm]] <- ifelse(df$apple_1 == 1, 1, df[[nm]])}
In addition, a tidyverse option would be
df <- df %>%
mutate(across(all_of(nm1), ~ ifelse(apple_1 == 1, 1, .x)))
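For a self-contained illustration, here is a base R sketch of the same loop on toy data. The column names and values below are made up for the example. One thing worth checking in the real data: contains("apple_1") matches any name that includes that substring, so it would also exclude columns such as apple_10. The sketch uses setdiff to drop only the column named exactly apple_1.

```r
# Toy data (hypothetical names and values, not the asker's data)
df <- data.frame(
  id       = 1:3,
  apple_1  = c(1, 0, 1),
  apple_2  = c(0, 0, 5),
  apple_10 = c(7, 8, 9)
)

# Names of all "apple" columns except the one called exactly apple_1
nm1 <- setdiff(grep("apple", names(df), value = TRUE), "apple_1")

# Update only those columns; every other column is left untouched
for (nm in nm1) {
  df[[nm]] <- ifelse(df$apple_1 == 1, 1, df[[nm]])
}
df
```

Because the loop indexes df by column name rather than by position, the number of columns in df does not change.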

Mutate a new column that is the list of the row_id of minimally different rows from .row

tibble(
A= c("x","x","y","y"),
B= c("y","y","y","y"),
C= c("x","y","z","y")
) %>%
mutate(
id = row_number(),
.before = "A"
) %>%
mutate(
neighs_id = list(
c("2"),
c("1,4"),
c("4"),
c("2,3")
)
) %>% View()
neighs_id should hold, for each row, the ids of the rows in which exactly one of the values in A, B, or C differs from the value in the same column of that row.
I want code to replace the second mutate, using map, whose result is a list (note that the operation is row-wise!) of all the rows that, given a selection of columns, have exactly 1 column with a value != column[.row].
In theory, I could set up a square id × id matrix, count for each pair of rows the columns where column[id] != column[.id], and then keep all the matches where the count == 1, but I think there should be a more straightforward, vectorised way to filter these "minimally different rows", given a selector of columns.
In base R:
cols = LETTERS[1:3]
tib$neighs_id <- lapply(seq(nrow(tib)),
function(i) which(sapply(seq(nrow(tib)),
function(x) sum(tib[x, cols] != tib[i, cols])) == 1))
#> pull(tib, neighs_id)
[[1]]
[1] 2
[[2]]
[1] 1 4
[[3]]
[1] 4
[[4]]
[1] 2 3
One way to speed this up is not to work with tibbles but with a matrix instead. I guess this is because tibbles (or data frames) are lists of columns so repeated extraction of rows is expensive compared to working with a matrix.
Another significant improvement can be achieved by changing the character matrix to a numeric one so that some operations can be vectorized. This way the inner sapply from Maël's answer can be replaced with subtraction and summing over matrix columns.
n.rep <- 1
tib <- tibble(
A=rep(c("x", "x", "y", "y"), n.rep),
B=rep(c("y", "y", "y", "y"), n.rep),
C=rep(c("x", "y", "z", "y"), n.rep)
)
cols <- LETTERS[1:3]
# change tibble to a matrix
tib.m <- as.matrix(tib[, cols])
# named vector used to translate values to their order
val.ord <- unique(c(tib.m))
val.ord <- setNames(seq_along(val.ord), val.ord)
# create numeric representation using the orders
tib.m[] <- val.ord[tib.m]
mode(tib.m) <- 'numeric'
tib$neighs_id <- apply(tib.m, 1, function(row)
which(colSums(t(tib.m) - row != 0) == 1))
This finishes in about a second when n.rep is 1000 (i.e., tib has 4000 rows). Scaling it up to 1M, however, might still be problematic, I'm afraid. For this, using Rcpp might help.
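The square id × id matrix mentioned in the question can also be sketched directly in base R. This is only an illustration, not one of the answers above: it builds one logical mismatch matrix per column with outer and adds them up, so memory grows with the square of the number of rows and it is unlikely to scale to very large tables.

```r
tib <- data.frame(
  A = c("x", "x", "y", "y"),
  B = c("y", "y", "y", "y"),
  C = c("x", "y", "z", "y")
)
cols <- c("A", "B", "C")

# d[i, j] = number of selected columns in which rows i and j differ
d <- Reduce(`+`, lapply(tib[cols], function(v) outer(v, v, "!=")))

# For each row, the ids of the rows that differ in exactly one column
neighs <- lapply(seq_len(nrow(d)), function(i) which(d[i, ] == 1))
neighs
```

On the four-row example this gives 2, then 1 and 4, then 4, then 2 and 3, the same neighbours as listed in the question.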

replacing cell values in dataframe for specific variables

I have thousands of rows in each column. I need to find specific values in column A based on the value of column B, and replace column A with a new value if it is greater than a specific value.
For example, if column B = 1 and the values in column A > 2, then I want to replace all the values in column A > 2 equal to 2 when column B = 1.
I've tried this code:
if(dt$B=='1'){
dt <- dt %>% mutate(A = ifelse(A > 2, 2, A))
}
But this does not work. I've tried some other methods as well, but nothing I do works. Please let me know if you can help with this! Thank you.
We can combine both conditions with & in the test argument of ifelse
library(dplyr)
dt <- dt %>%
mutate(A = ifelse(A > 2 & B == 1, 2, A))
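The original attempt fails because if() expects a single TRUE or FALSE, not one value per row. As a small sketch on made-up data, the same replacement can also be written in base R with logical subsetting:

```r
# Toy data (hypothetical values)
dt <- data.frame(A = c(1, 3, 5, 4), B = c(1, 1, 2, 1))

# Cap A at 2, but only in rows where B equals 1
dt$A[dt$B == 1 & dt$A > 2] <- 2
dt
```

Row 3 keeps its value of 5 because B is 2 there; rows 2 and 4 are capped at 2.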

Creating several new columns in a data frame using the same function

I'm sorry for the basic question. I'm just struggling with something that should be simple. Say I have the data frame "Test" that originally has three fields: Col1, Col2, Col3.
I want to create new columns based on each of the original columns. The values in each row of the new columns would specify whether the corresponding value in the matching row on the original column is above or below the initial column's median. So, for example, in the image attached, Col4 is based on Col1. Col5 is based on Col2. Col6 based on Col3.
test dataframe example:
It's quite easy to perform this function on a single column and output a single column:
# derivedFactor() comes from the mosaic package
Test <- Test %>% mutate(Col4 = derivedFactor(
"below" = Col1 < median(Test$Col1),
"at" = Col1 == median(Test$Col1),
"above" = Col1 > median(Test$Col1),
.default = NA)
)
But if I'm performing this same operation over 50 columns, writing out/copy-paste and editing the code can be tedious and inefficient. I should mention that I am hoping to add the new columns to the data frame, not create another data frame. Additionally, there are about 200 other fields in the data frame that will not have this function performed on them (so I can't just use a mutate_all). And the columns are not uniformly named (my examples above are just examples, not the actual dataset) so I'm not able to find a pattern for mutate_at. Maybe there is a way to manually pass a list of column names to the mutate command?
There must be an easy and elegant way to do this. If anyone could help, that would be amazing.
You can do the following using data.table.
First, I define a function that takes a numeric vector and returns each element's position relative to the vector's median:
med_fn = function(x){
med = median(x)
unlist(sapply(x, function(x){
if(x > med) {'Above'}
else if(x < med) {'Below'}
else {'At'}
}))
}
> med_fn(c(1,2,3))
[1] "Below" "At" "Above"
Let us examine some sample data:
library(data.table)
dt = data.table(
C1 = c(1, 2, 3),
C2 = c(2, 1, 3),
C3 = c(3, 2, 1)
)
old = c('C1', 'C2', 'C3') # Name of columns I want to perform operation on
new = paste0(old, '_medfn') # Name of new columns following operation
Using the .SD and .SDcols arguments from data.table, I apply med_fn across the columns old, in my case columns C1, C2 and C3. I call the new columns C#_medfn:
dt[, (new) := lapply(.SD, med_fn), .SDcols = old]
Result:
> dt
   C1 C2 C3 C1_medfn C2_medfn C3_medfn
1:  1  2  3    Below       At    Above
2:  2  1  2       At    Below       At
3:  3  3  1    Above    Above    Below
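If data.table is not available, the same idea of passing a vector of column names can be sketched in base R. The function name med_pos is hypothetical, and it is a vectorised variant of med_fn that uses sign() instead of an element-by-element loop; it is an alternative sketch, not the answer's own code.

```r
# Hypothetical vectorised helper: sign() gives -1, 0, 1, which is shifted to index 1, 2, 3
med_pos <- function(x) {
  c("Below", "At", "Above")[sign(x - median(x)) + 2]
}

Test <- data.frame(C1 = c(1, 2, 3), C2 = c(2, 1, 3), C3 = c(3, 2, 1))
old <- c("C1", "C2", "C3")      # columns to process
new <- paste0(old, "_medfn")    # names of the new columns

# Add the new columns to the existing data frame; other columns are untouched
Test[new] <- lapply(Test[old], med_pos)
Test
```

The vector old can hold any set of column names, so the columns do not need to follow a naming pattern.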

Using apply in R to extract rows from a dataframe

Using R, I have to extract specific rows from a data frame depending on certain conditions. The data frame is large (5.5 million rows by 251 columns) but I have given the code below to create a sample data frame.
df <- data.frame("Name" = c("Name1", "Name1", "Name1", "Name1","Name1" ), "Value"=c("X", "X", "Y", "Y", "X"))
I need to skip through the entire data frame row by row starting at the top, and while skipping, when the value of the 'Value' column changes from X to Y or Y to X, I need to extract that row and next row and append them to another data frame. For example, in the data frame above, the Value column of row 2 is X and that of row 3 is Y, and since the value has changed from X to Y, I need to extract the entire row 2 and row 3 and add them to another data frame.
The result of the operations can be seen by running the code below
dfextract <- data.frame("Name" = c("Name1", "Name1"), "Value"=c("X", "Y"))
Currently I have used a 'for' loop to skip from row to row and extract the rows when the values don't match. But it is very slow and inefficient. The code snippet is below
for (i in 1:count) {
if (df[[i+1, 2]] != df[i,2]) {
dfextract <- rbind(dfextract, df[i,])
dfextract <- rbind(dfextract, df[i+1,])
}
}
I am looking for a better and faster solution to the above situation. Perhaps using the functions belonging to the family of 'apply()' or using 'by()'. Any help would be greatly appreciated.
Thanks in advance.
Maybe the following does it. Note that there are two lapply-based loops, in order to account for changes in the values of column Name.
diffstr <- function(x) x[-1] == x[-length(x)]
res <- lapply(split(df, df$Name), function(x) {
inx <- which(c(FALSE, !diffstr(x$Value)))
do.call(rbind, lapply(inx, function(i) x[(i - 1):i, ]))
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
How it works.
First, I define a helper function diffstr. It compares all values of x but the first with all values of x but the last. Note that x[-1] is the vector x[2], x[3], ..., x[length(x)]; a negative index removes that element from the vector. Likewise, in x[-length(x)] the negative index removes the last element of x.
split(df, df$Name) splits the data frame into subsets, one for each value of Name.
I then lapply an unnamed function to these subsets. This function's argument x will be each of the sub-data frames mentioned above.
That function starts by determining where in x$Value the changes are. This is done with the call to the helper function diffstr. I have to prepend a FALSE to the negated return value because the first row has no previous row to differ from.
The next line is a tricky one. It uses lapply on the change-point indices inx and, for each one, gets a two-row segment of the data frame x. Then do.call calls rbind on those two-row data frames to reassemble them into one.
Now res is a list, with one sub-data frame for each Name (done with the split). So it needs to be put back together with another call to do.call(rbind, ...).
Final tidy up. The whole process messed up the data frame's row names. Setting them to NULL is a well-known trick that forces R to renumber the rows.
That's it. If you need more explanations, just say so.
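To make the index logic in the steps above concrete, here is a small sketch that runs diffstr on the Value column of the sample data on its own. The intermediate names same_as_prev and pairs are introduced only for this illustration.

```r
diffstr <- function(x) x[-1] == x[-length(x)]

v <- c("X", "X", "Y", "Y", "X")

# Element k is TRUE when v[k + 1] equals v[k]
same_as_prev <- diffstr(v)

# Positions whose value differs from the previous one
inx <- which(c(FALSE, !same_as_prev))

# Each change point i is paired with the row before it
pairs <- lapply(inx, function(i) c(i - 1, i))
pairs
```

Here the change points are rows 3 and 5, so the extracted pairs are rows 2 and 3, then rows 4 and 5.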
We can use dplyr. lag can shift the row by 1, so we can use Value != lag(Value) to compare if the value is different than the previous one. which(Value != lag(Value)) converts the result to row number. After that, sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))) makes sure we also got the row number of those previous rows. Finally, slice can subset the data frame based on the row number provided.
library(dplyr)
df2 <- df %>%
slice(sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))))
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
If the code is too long to read, you can also calculate the index before using the slice function as follows.
library(dplyr)
ind <- which(df$Value != lag(df$Value))
ind2 <- sort(unique(c(ind, ind - 1)))
df2 <- df %>% slice(ind2)
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
Using base R, I would probably use the row indices together with diff:
df <- data.frame(colA=c(1, 1, 1, 2, 1, 1, 1, 3, 3, 3, 1, 1),
colB=1:12)
keep <- which(diff(df$colA) != 0)
df[unique(c(keep, keep+1)), ]
   colA colB
3     1    3
4     2    4
7     1    7
10    3   10
5     1    5
8     3    8
11    1   11
There is probably a faster option though.
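diff only works on numeric columns. For a character column such as Value in the question, the same idea can be sketched by comparing the vector with itself shifted by one position. This is an illustrative variant, not part of the answer above; interleaving idx and idx + 1 with rbind keeps each pair together and in order, as the asker's loop does.

```r
df <- data.frame(Name  = rep("Name1", 5),
                 Value = c("X", "X", "Y", "Y", "X"),
                 stringsAsFactors = FALSE)
n <- nrow(df)

# idx holds every row i whose Value differs from the Value in row i + 1
idx <- which(df$Value[-1] != df$Value[-n])

# Interleave: idx[1], idx[1] + 1, idx[2], idx[2] + 1, ...
res <- df[as.vector(rbind(idx, idx + 1)), ]
res
```

When two changes are adjacent, a row appears in two pairs, just as it would with the original for loop; wrapping the index in unique() would drop those repeats.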
When you have a large dataset, speed might be the bottleneck. In this case data.table might be the best option for you.
Using the data.table-library, I would solve it like so:
library(data.table)
dt <- data.table(Name = c("Name1", "Name1", "Name1", "Name1","Name1" ),
Value = c("X", "X", "Y", "Y", "X"))
# look if Value changes to the next instance
dt[, idx := Value != shift(Value, 1, fill = dt$Value[1])]
# filter the rows where the index changes and the next value
# and deselect the variable idx
dt[idx | shift(idx, 1)][, .(Name, Value)]
#> Name Value
#> 1: Name1 Y
#> 2: Name1 Y
#> 3: Name1 X
Why does it return an odd number of rows and not an even number?
Well, that is because, in your example data, the last row is selected since its value changes, but there is no following row to select along with it.

Resources