Remove rows based on decimal places in column - r

I have a data frame that looks something like this:
A B C D
1 1.0 2 4
2 3.1 2 3
3 4.01 3 3
4 5.00 4 5
5 2.003 3 9
I want to delete rows where column B has numeric values other than 0 after the decimal. In the example above, this will leave me with rows 1 and 4. How do I go about this?

Try subset like below
subset(df, B == floor(B))
subset(df, B == ceiling(B))
subset(df, B == round(B))
subset(df, B == trunc(B))

We can also use
subset(df, B == as.integer(B))
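Note that all of these rely on exact floating point equality. If B comes from arithmetic rather than typed-in values, a tolerance-based check is safer; a minimal sketch, where the 1e-8 tolerance is an arbitrary choice you may need to adjust for your data:
# keep rows where B is within a small tolerance of a whole number
subset(df, abs(B - round(B)) < 1e-8)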

Assuming your dataframe is called df:
df[as.integer(df$B) == as.numeric(df$B),]

A regex solution in base R (the pattern matches a decimal part that ends in a non-zero digit):
df[-which(grepl("\\.[0-9]*[1-9]$", df$B)),]
A B C
2 2 1 2
5 5 357 3
A regex solution in dplyr:
library(dplyr)
df %>%
filter(!grepl("\\.[0-9]*[1-9]$", B))
Data:
df <- data.frame(
A = 1:5,
B = c(1.2, 1.0, 1.00004, 2.806, 357.0),
C = c(2,2,3,4,3)
)
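One caveat with the regex approaches: grepl coerces B with as.character, so they test the printed representation of the numbers. Values that print in scientific notation contain no literal "." and would not match the pattern; a quick illustration of the coercion (my example values, not from the question):
# as.character drops trailing zeros and may switch to scientific notation
as.character(c(1.0, 1.00004, 0.0000001))
# [1] "1"       "1.00004" "1e-07"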

Related

R Compare duplicate values for each row in two data sets

I want to compare whether the values in each row are the same.
In this case, the duplicated and all_equal functions are not suitable.
Reproducible Sample Data
df1 <- data.frame(a=c(1,2,3),b=c(4,5,6))
df2 <- data.frame(a=c(1,2,4),b=c(4,5,6))
> df1
a b
1 1 4
2 2 5
3 3 6
> df2
a b
1 1 4
2 2 5
3 4 6
Expected output
final <- data.frame(a=c(1,2,4),b=c(4,5,6),c=c('T','T','F'))
#c column is the result I need: whether the values in each row are the same.
>final
a b c
1 1 4 T
2 2 5 T
3 4 6 F
I tried the method below, but it is complicated.
#1. making idx of df1, df2
#2. and full_join
#3. and left_join df1
#4. and left_join df2
df1$idx1 <- 1:nrow(df1)
df2$idx2 <- 1:nrow(df2)
df3<-full_join(df1,df2,by=c('a','b'))
df3<-left_join(df3,df1,by=c('a','b'))
df3<-left_join(df3,df2,by=c('a','b')) #This may or may not work..
I think there must be a better way. help!
We could use
df2$c <- Reduce(`&`, Map(`==`, df1, df2))
Output:
> df2
a b c
1 1 4 TRUE
2 2 5 TRUE
3 4 6 FALSE
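To see what this does step by step: Map compares the two frames column by column, and Reduce then combines the per-column logical vectors with &.
Map(`==`, df1, df2)
# $a
# [1]  TRUE  TRUE FALSE
#
# $b
# [1] TRUE TRUE TRUE
Reduce(`&`, Map(`==`, df1, df2))
# [1]  TRUE  TRUE FALSE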
You can get column 'c' by:
c <- df1$a == df2$a & df1$b == df2$b
which gives TRUE TRUE FALSE. It looks like you want to then bind this to df2, so:
cbind.data.frame(df2, c)
You may use rowSums:
final <- df2
final$c <- rowSums(df1 != df2) == 0
final
# a b c
#1 1 4 TRUE
#2 2 5 TRUE
#3 4 6 FALSE
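One thing to keep in mind with this (and the Map/Reduce variant above): if either frame can contain NA, the elementwise comparison yields NA and c becomes NA for that row. A sketch of a stricter check that also requires the NA patterns to match, assuming that is the behaviour you want (same_na and same_val are my helper names):
same_na  <- rowSums(is.na(df1) != is.na(df2)) == 0  # NAs in the same cells
same_val <- rowSums(df1 != df2, na.rm = TRUE) == 0  # non-NA cells all equal
final$c  <- same_na & same_val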
In case the positions of the rows in each data.frame do not matter, you can use merge.
within(merge(df2, within(df1, c <- "T"), all.x=TRUE), c[is.na(c)] <- "F")
# a b c
#1 1 4 T
#2 2 5 T
#3 4 6 F
or using duplicated.
df2$c <- c("F", "T")[1+tail(duplicated(rbind(df1, df2)), nrow(df2))]
df2
# a b c
#1 1 4 T
#2 2 5 T
#3 4 6 F
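To unpack the duplicated trick (with df1 and df2 as defined in the question): rbind stacks df1 on top of df2, and duplicated marks a row TRUE only if an identical row appeared earlier, so the flags for df2's rows say whether each one also occurs in df1.
duplicated(rbind(df1, df2))
# [1] FALSE FALSE FALSE  TRUE  TRUE FALSE
tail(duplicated(rbind(df1, df2)), nrow(df2))
# [1]  TRUE  TRUE FALSE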

How to keep rows with the same values in two variables in r?

I have a dataset with several variables, but I want to keep the rows that are the same based on two columns. Here is an example of what I want to do:
a <- c(rep('A',3), rep('B', 3), rep('C',3))
b <- c(1,1,2,4,4,4,5,5,5)
df <- data.frame(a,b)
a b
1 A 1
2 A 1
3 A 2
4 B 4
5 B 4
6 B 4
7 C 5
8 C 5
9 C 5
I know that if I use the duplicated function I can get:
df[!duplicated(df),]
a b
1 A 1
3 A 2
4 B 4
7 C 5
But since the level 'A' on column a does not have a unique value in b, I want to drop both observations to get a new data.frame like this:
a b
4 B 4
7 C 5
I don't mind having repeated values across b, as long as every level of a has a single value of b.
Is there a way to do this? Thanks!
This one maybe?
ag <- aggregate(b~a, df, unique)
ag[lengths(ag$b)==1,]
# a b
#2 B 4
#3 C 5
Maybe something like this:
> ind <- apply(sapply(with(df, split(b,a)), diff), 2, function(x) all(x==0) )
> out <- df[!duplicated(df),]
> out[out$a %in% names(ind)[ind], ]
a b
4 B 4
7 C 5
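Note this relies on every level of a having the same number of rows, so that sapply can simplify to a matrix; with unequal group sizes, split returns a ragged list and the apply step fails. A sketch that handles that case as well (my variant, not part of the answer above):
# one logical per level of a: does the group have a single unique b?
ind <- sapply(with(df, split(b, a)), function(x) length(unique(x)) == 1)
out <- df[!duplicated(df), ]
out[out$a %in% names(ind)[ind], ]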
Here is another option with data.table
library(data.table)
setDT(df)[, if(uniqueN(b)==1) .SD[1L], by = a]
# a b
#1: B 4
#2: C 5
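For completeness, a dplyr sketch of the same idea (not from the original answers):
library(dplyr)
df %>%
  group_by(a) %>%                # one group per level of a
  filter(n_distinct(b) == 1) %>% # keep groups with a single value of b
  distinct() %>%                 # collapse the repeated rows
  ungroup()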

In R, find duplicates by column 1 and filter by non-NA column 3

I have a dataframe:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(1,NA,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
I have a dataframe with some duplicate values in column 1, but when I de-duplicate with the duplicated function, it arbitrarily chooses which row to keep:
dedup_df = df[!duplicated(df$a), ]
How can I ensure that the output returns the row that does not contain an NA in column c?
I tried the dplyr package, but the output is not what I need:
library(dplyr)
options(dplyr.print_max = Inf )
df %>% ## source dataframe
group_by(a) %>% ## grouped by variable
filter(!is.na(c)) %>% ## drop rows where c is NA
as.data.frame(dedup_df)
Your use of the duplicated function to remove duplicate observations (rows) from a data frame, using one column as the key, is correct.
But it seems you are worried that it may keep a row that contains NA in another column and drop another row that contains a non-NA value.
I'll use your example, but with a slight modification:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(NA,1,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
> df
a b c
1 A 1 NA
2 A 1 1
3 A 2 2
4 B 4 4
5 B 1 NA
6 B 1 1
7 C 2 2
8 C 2 2
In this case, your dedup_df contains an NA for the first value.
> dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
1 A 1 NA
4 B 4 4
7 C 2 2
Solution:
Reorder df by column c first and then use the same command. Ordering by column c sends all NAs to the end of the data frame, so when duplicated scans the rows it reaches the NA rows last and tags them as TRUE whenever a non-NA row with the same key appeared earlier.
df = df[order(df$c),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
2 A 1 1
6 B 1 1
7 C 2 2
You can also reorder in descending order
df = df[order(df$c,decreasing = T),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
4 B 4 4
3 A 2 2
7 C 2 2
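If you prefer to stay in dplyr, a sketch of the same sort-then-deduplicate idea (my variant, using the data above):
library(dplyr)
df %>%
  group_by(a) %>%
  arrange(is.na(c), .by_group = TRUE) %>% # non-NA rows first within each group
  slice(1) %>%                            # keep one (non-NA if available) row per group
  ungroup()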

Count of unique values across all columns in a data frame

We have a data frame as below:
raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
I need a result data frame in the following format:
result<-data.frame(v1=c("A","B","C","D"), v2=c(3,2,2,3))
I used the following code to get the count for one particular column:
count_raw<-sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This would return count of unique values across an individual column.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data frame output, wrap this in as.data.frame.table:
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
If you want a total count,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
If instead you want the number of distinct non-NA values in each row, we can use apply with MARGIN = 1:
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is for each column
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to matrix and use the table
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If you have only character values in your dataframe, as you've provided, you can unlist it and use unique; to count the frequencies, use count from plyr:
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
x freq
1 A 3
2 B 2
3 C 2
4 D 3
5 <NA> 6
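A tidyverse sketch of the same count, assuming R >= 4.0 so the columns are character; values_drop_na drops the NAs that plyr::count reports:
library(dplyr)
library(tidyr)
raw %>%
  pivot_longer(everything(), values_drop_na = TRUE) %>% # stack all columns
  count(value)
# value n
# A     3
# B     2
# C     2
# D     3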

Subsetting a Data Table using %in%

A stylized version of my data.table is
outmat <- data.table(merge(merge(1:5, 1:5, all=TRUE), 1:5, all=TRUE))
What I would like to do is select a subset of rows from this data.table based on whether the value in the 1st column is found in any of the other columns (it will be handling matrices of unknown dimension, so I can't just use something like "row1 == row2 | row1 == row3").
I wanted to do this using
outmat[row1 %in% unlist(outmat[, -1, with = FALSE]), ]
but this ends up returning TRUE if the value in row1 is found in any of the rows of row2 or row3, which is not the intended behavior. Is there some sort of vectorized version of %in% that will achieve my desired result?
To elaborate, what I want to get is the enumeration of 3-tuples from the set 1:5, drawn with replacement, such that the first value is the same as either the second or third value, something like:
1 1 1
1 1 2
1 1 3
1 1 4
1 1 5
...
2 1 2
2 2 1
...
5 5 5
What my code instead gives me is every enumeration of 3-tuples, as it checks whether the first value (say, 5) ever appears anywhere in the 2nd or 3rd columns, not simply within the same row.
One option is to construct the expression and evaluate it:
dt = data.table(a = 1:5, b = c(1,2,4,3,1), c = c(4,2,3,2,2), d = 5:1)
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
#4: 4 3 2 2
#5: 5 1 2 1
expr = paste(paste(names(dt)[-1], collapse = paste0(" == ", names(dt)[1], " | ")),
"==", names(dt)[1])
#[1] "b == a | c == a | d == a"
dt[eval(parse(text = expr))]
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
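If you would rather avoid eval(parse(...)), a parse-free sketch of the same filter (my variant, not from the answer):
# compare every column to the first, drop the self-comparison, OR the rest
dt[Reduce(`|`, lapply(dt, `==`, dt[[1]])[-1])]
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3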
Another option is to just loop through and compare the columns:
dt[rowSums(sapply(dt, '==', dt[[1]])) > 1]
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
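Here sapply builds a logical matrix comparing every column against the first; since column a always matches itself, a qualifying row needs more than one TRUE:
sapply(dt, '==', dt[[1]])
#          a     b     c     d
# [1,]  TRUE  TRUE FALSE FALSE
# [2,]  TRUE  TRUE  TRUE FALSE
# [3,]  TRUE FALSE  TRUE  TRUE
# [4,]  TRUE FALSE FALSE FALSE
# [5,]  TRUE FALSE FALSE FALSE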
library(dplyr)
library(tidyr)
dt %>%
mutate(ID = 1:n()) %>%
gather(variable, value, -a, -ID) %>% # long format: one row per non-key cell
filter(a == value) %>% # keep rows where a matches another column
distinct(ID) %>%
left_join(dt %>% mutate(ID = 1:n()), by = "ID") %>%
select(-ID)