How to find duplicated values in column in R [duplicate]

There is a similar question for PHP, but I'm working with R and am unable to translate the solution to my problem.
I have this data frame with 10 rows and 50 columns, where some of the rows are absolutely identical. If I use unique on it, I get one row per - let's say - "type", but what I actually want is to get only those rows which only appear once. Does anyone know how I can achieve this?
I can have a look at clusters and heatmaps to sort it out manually, but I have bigger data frames than the one mentioned above (with up to 100 rows) where this gets a bit tricky.

This will extract the rows which appear only once (assuming your data frame is named df):
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
How it works: The function duplicated marks each row that is an exact repeat of an earlier row, scanning from the first row. With the argument fromLast = TRUE, it scans from the last row instead, so earlier copies of later rows get marked as well.
Both boolean results are combined with | (logical 'or') into a new vector which flags all rows appearing more than once. This result is negated using !, thereby creating a boolean vector indicating the rows appearing only once.
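A minimal sketch of the mechanics on a toy vector (the values are invented; the same logic applies row-wise to a data frame):
x <- c("a", "b", "a", "c")
duplicated(x)                   # FALSE FALSE  TRUE FALSE
duplicated(x, fromLast = TRUE)  #  TRUE FALSE FALSE FALSE
x[!(duplicated(x) | duplicated(x, fromLast = TRUE))]  # "b" "c"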

A possibility involving dplyr could be:
df %>%
  group_by_all() %>%
  filter(n() == 1)
Or:
df %>%
  group_by_all() %>%
  filter(!any(row_number() > 1))
Since dplyr 1.0.0, the preferable way would be:
data %>%
  group_by(across(everything())) %>%
  filter(n() == 1)
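A quick check of the across() version on a small made-up frame:
library(dplyr)
d <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"))
d %>%
  group_by(across(everything())) %>%
  filter(n() == 1) %>%
  ungroup()
# keeps only the row (2, "b")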

Try it:
library(dplyr)
DF1 <- data.frame(Part = c(1,2,3,4,5), Age = c(23,34,23,25,24), B.P = c(87,76,75,75,78))
DF2 <- data.frame(Part = c(3,5), Age = c(23,24), B.P = c(75,78))
DF3 <- rbind(DF1,DF2)
DF3 <- DF3[!(duplicated(DF3) | duplicated(DF3, fromLast = TRUE)), ]
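Running this leaves only the rows of DF3 that occur exactly once:
  Part Age B.P
1    1  23  87
2    2  34  76
4    4  25  75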

Related

How can I get a certain value from a row in dataframe? [R]

I'm doing a prediction with a classification tree using the "rpart" library, and when I call predict, I get a table with the probabilities of each value/category the test data can take. I want to get the value/category with the highest probability. For example (once predict is done), the table I get is:
Table1
And I want to end up with this table:
Table2
Thanks in advance. I've tried a few things but haven't achieved much since I'm pretty new to R. Cheers!
One way to achieve your desired output could be:
1. Identify your target values in the vector pattern.
2. mutate across the relevant columns and use str_detect to check whether the values occur in each column; if TRUE, use cur_column() to place the column name in a new column.
3. Then do some tricks with .names and unite, and finally select.
library(dplyr)
library(tidyr)
library(stringr)
pattern <- c("0.85|0.5|0.6|0.8")
df %>%
  mutate(across(starts_with("cat"),
                ~ case_when(str_detect(., pattern) ~ cur_column()),
                .names = 'new_{col}')) %>%
  unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ') %>%
  select(index, pred_category = New_Col)
  index pred_category
  <dbl> <chr>
1     1 cat2
2     2 cat1
3     3 cat3
4     4 cat3
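For reference, a toy input in the shape this pipeline assumes (the column names index and cat1-cat3 and the probabilities are invented so that they match the output above):
library(tibble)
df <- tribble(
  ~index, ~cat1, ~cat2, ~cat3,
       1,  0.10,  0.85,  0.05,
       2,  0.50,  0.30,  0.20,
       3,  0.20,  0.20,  0.60,
       4,  0.10,  0.10,  0.80
)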
You didn't post your data, so I just put it in a .csv and accessed it from my R folder on my C: drive.
There might be an easier way to do it, but this is the method I use when I might have multiple different types (by column or row) I'd like to sort for. If you're new to R and don't have data.table or dplyr installed yet, you'll need to run the install.packages() commands at the end of this answer in the console first.
I left the values in, but that can be fixed with the last line if you don't want them.
setwd("C:/R")
library(data.table)
library(dplyr)
Table <- read.csv("Table1.csv", check.names = FALSE, fileEncoding = 'UTF-8-BOM')
#Making the data long form makes it much easier to sort as your data gets more complex.
LongForm <- melt(setDT(Table), id.vars = c("index"), variable.name = "Category")
Table1 <- as.data.table(LongForm)
#This gets you what you want.
highest <- Table1 %>% group_by(index) %>% top_n(1, value)
#Then just sort it how you wanted it to look
Table2 <- highest[order(highest$index, decreasing = FALSE), ]
View(Table2)
If you don't have the right packages yet, install them from the console:
install.packages("data.table")
install.packages("dplyr")
To get rid of the numbers
Table3 <- Table2[,1:2]
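For reference, if you'd rather skip the CSV step, here is a toy Table in the shape this answer assumes (an index column plus one probability column per category; all values invented):
Table <- data.frame(index = 1:4,
                    cat1 = c(0.10, 0.50, 0.20, 0.10),
                    cat2 = c(0.85, 0.30, 0.20, 0.10),
                    cat3 = c(0.05, 0.20, 0.60, 0.80))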

R new column (variable) that rowSums across lists with NULL values

I have a data.frame that looks like this:
UID<-c(rep(1:25, 2), rep(26:50, 2))
Group<-c(rep(5, 25), rep(20, 25), rep(-18, 25), rep(-80, 25))
Value<-sample(100:5000, 100, replace=TRUE)
df<-data.frame(UID, Group, Value)
But I need the values separated into new columns, so I run this:
library(tidyr)
df <- pivot_wider(df, names_from = Group,
                  values_from = Value,
                  values_fill = list(Value = 0))
This introduces NULL values into the dataset. (Sorry, I could not figure out a way to build a small example dataset with NULL values.) Note: df is now a tibble (tbl_df) rather than a plain data.frame.
These aren't great variable names so I run this:
colnames(df)[which(names(df) == "20")] <- "pos20"
colnames(df)[which(names(df) == "5")] <- "pos5"
colnames(df)[which(names(df) == "-18")] <- "neg18"
colnames(df)[which(names(df) == "-80")] <- "neg80"
What I want to be able to do is create a new column (variable) that rowSums across columns. So I run this:
df <- df %>%
  replace(is.na(.), 0) %>%
  mutate(rowTot = rowSums(.[2:5]))
This of course works on the example dataset, but not on the one with NULL values. I have tried converting NULL to NA using df[df == "NULL"] <- NA, but the values do not change. I have also tried converting the lists to numeric using as.numeric(as.character(unlist(df[[2]]))), but I get an error telling me I have an unequal number of rows, which I guess would be expected.
I realize there might be a better process to get my desired end result, so any suggestions to any of this is most appreciated.
EDIT: Here is a link to the actual dataset which will introduce Null values after using pivot_wider. https://drive.google.com/file/d/1YGh-Vjmpmpo8_sFAtGedxzfCiTpYnKZ3/view?usp=sharing
It is difficult to answer with confidence without an actual reproducible example where the error occurs, but I am going to take a guess.
I think your pivot_wider step produces list-columns (meaning some cells hold vectors), and that is why you are getting NULL values. Create a unique row number within each Group and then use pivot_wider. Also, rowSums has an na.rm parameter, so you don't need replace.
library(dplyr)
library(tidyr)
df %>%
  group_by(temp) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = temp, values_from = numseeds) %>%
  mutate(rowTot = rowSums(.[3:6], na.rm = TRUE))
Change the column numbers in rowSums according to your data, if needed.
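In case it helps to see that guess in action, here is a minimal sketch (toy data, names invented) where duplicated UID/Group pairs force pivot_wider into list-columns, and the missing combinations become NULL:
library(tidyr)
toy <- data.frame(UID = c(1, 1, 2), Group = c(5, 5, 7), Value = c(10, 20, 30))
pivot_wider(toy, names_from = Group, values_from = Value)
# warns that the values are not uniquely identified and returns list-cols;
# UID 2 has no Group 5 and UID 1 has no Group 7, so those cells are NULL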

In R, how can I filter based on the maximum value in each row of my data?

I have a tibble (or data frame, if you like) that is 19 columns of pure numerical data and I want to filter it down to only the rows where at least one value is above or below a threshold. I prefer a tidyverse/dplyr solution but whatever works is fine.
This is related to this question, but distinct in at least two ways that I can see:
I have no identifier column (besides the row number, I suppose)
I need to subset based on the max across the current row being evaluated, not across a column
Here are attempts I've tried:
data %>% filter(max(.) < 8)
data %>% filter(max(value) < 8)
data %>% slice(which.max(.))
Here's a way which will keep rows having at least one value above the threshold. To keep rows with values below the threshold, just reverse the inequality inside any():
data %>%
  filter(apply(., 1, function(x) any(x > threshold)))
Actually, @r2evans has a better answer in the comments:
data %>%
  filter(rowSums(. > threshold) >= 1)
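As a quick sanity check on built-in data (the numeric columns of iris, with a threshold of 7 picked arbitrarily):
library(dplyr)
iris[1:4] %>%
  filter(rowSums(. > 7) >= 1)
# returns only the rows where at least one measurement exceeds 7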
A couple more options that should scale pretty well:
library(dplyr)

# a more dplyr-y option
iris %>%
  filter_all(any_vars(. > 5))

# or taking advantage of base functions
iris %>%
  filter(do.call(pmax, as.list(.)) > 5)
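The do.call(pmax, as.list(.)) trick works because pmax accepts any number of vectors and returns their element-wise maximum, so handing it the columns yields the row-wise maximum in a single vectorised call, with no per-row loop.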
Maybe there are better and more efficient ways, but these two functions should do what you need, if I understood correctly. This solution assumes you have only numerical data.
1. Transpose the tibble (so you obtain a numerical matrix).
2. Use map to get the max or min by column (which is the max/min by row in the initial dataset).
3. Obtain the row indices you are looking for.
4. Filter your dataset.
library(dplyr)
library(purrr)
library(tibble)
# Random data -------------------------------------------------------------
data <- as_tibble(replicate(10, runif(20)), .name_repair = "unique")
# Thresholds to be used -----------------------------------------------------
max_threshold <- 0.9
min_threshold <- 0.1
# lesser_max: keep rows whose maximum is below max_threshold ----------------
lesser_max <- function(data, max_threshold = 0.9) {
  row_max <- data %>%
    t() %>%                                  # rows become columns
    as_tibble(.name_repair = "unique") %>%
    map_dbl(max) %>%                         # max of each original row
    unname()
  data[row_max < max_threshold, ]
}
# greater_min: keep rows whose minimum is above min_threshold ---------------
greater_min <- function(data, min_threshold = 0.1) {
  row_min <- data %>%
    t() %>%
    as_tibble(.name_repair = "unique") %>%
    map_dbl(min) %>%                         # min of each original row
    unname()
  data[row_min > min_threshold, ]
}
# Examples ----------------------------------------------------------------
data %>%
  lesser_max(max_threshold)
data %>%
  greater_min(min_threshold)
We can use base R methods:
data[Reduce(`|`, lapply(data, `>`, threshold)), ]
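A sketch of what each piece returns, on an invented two-column frame with a threshold of 7:
data <- data.frame(a = c(1, 9, 3), b = c(2, 2, 8))
lapply(data, `>`, 7)                       # one logical vector per column
Reduce(`|`, lapply(data, `>`, 7))          # FALSE TRUE TRUE: any column exceeds 7
data[Reduce(`|`, lapply(data, `>`, 7)), ]  # rows 2 and 3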

Dataframe not populating when I am doing subset?

I have a dataframe that has two columns: one column is the product type, and the other is characters (comments). I essentially want to break the dataframe up by the 'product' column into 12 different data frames, one for each level. So for the first level, I am running this code:
df = df %>% select('product','comments')
df['product'] = as.character(df['product'])
df['comments'] = as.character(df['comments'])
Now that the dataframe is in the structure I want it, I want to take a variety of subsets, and here is my first subset code:
df_boatstone = df[df$product == 'water',]
#df_boatstone <- subset(df, product == "boatstone", select = c('product','comments'))
I have tried both methods, and the dataframe is being created, but has nothing in it. Can anyone catch my mistake?
as.character works on a vector, while df['product'] and df['comments'] are both data frames with a single column. Use [[ to extract the column as a vector:
df[['product']] <- as.character(df[['product']])
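A toy illustration of the difference (the exact deparsed string varies with the column type, so the outputs here are only indicative):
d <- data.frame(product = factor(c("water", "boatstone")))
as.character(d['product'])    # a single deparsed string, not the values you want
as.character(d[['product']])  # "water" "boatstone"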
Or better would be
library(tidyverse)
df %>%
  select(product, comments) %>%
  mutate_all(as.character) %>%
  filter(product == 'water')
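Since dplyr 1.0.0, mutate_all() is superseded; an equivalent sketch using across():
df %>%
  select(product, comments) %>%
  mutate(across(everything(), as.character)) %>%
  filter(product == 'water')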

