Keep first occurrence of rows irrespective of column of each element

Is there a function that treats the elements of a row as a set and returns only the first occurrence of each unique set?
In the example below, rows 1 and 3 should be considered equal. It should be irrelevant for the function foo whether an element is in col1 or col2.
df <- data.frame(col1 = c('a', 'b', '1'), col2 = c('1', '2', 'a'))
foo(df)
> col1 col2
> 1 a 1
> 2 b 2

You could do something like this:
df[!duplicated(t(apply(df,1,sort))),]
col1 col2
1 a 1
2 b 2
It sorts each row (so that a-1 and 1-a end up the same), and then selects only those rows of df that are not duplicates.
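As a rough sketch, the same idea can be wrapped in the function the question calls foo. The function body is an assumption built from the one-liner above, not an existing R function. It assumes the data frame has at least two columns and no NA values, because sort() drops NAs and the rows would then no longer have equal length.

```r
# Hypothetical helper named after the foo() in the question.
foo <- function(df) {
  # Sort the values within each row so that element order no longer matters.
  # apply() returns one column per original row, hence the t().
  sorted_rows <- t(apply(df, 1, function(r) sort(as.character(r))))
  # Keep only the first occurrence of each sorted row.
  df[!duplicated(sorted_rows), , drop = FALSE]
}

df <- data.frame(col1 = c('a', 'b', '1'), col2 = c('1', '2', 'a'))
result <- foo(df)
print(result)
```

Because duplicated() marks only the later copies, the first occurrence of each set of values is the one that survives.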

Related

assign headers based on existing row that is not the same row and column / dataframe in R

Question. How to extract the row number that will be the header?
I think I can assign that row as a header by using index number.
header is c(a,b,c)
data
6*4 matrix
v1 v2 v3 v4 #header
d a b c #headera,b,c
d 1 1 1
d 1 1 1
a b c e #headera,b,c
2 2 2 e
2 2 2 e
output
4*3 matrix
a b c #header
1 1 1
1 1 1
2 2 2
2 2 2
my code..
str_which(df, 'a') # identify row number

Based on what you've written above, you need to filter rows that contain the letters 'a', 'b', 'c' consecutively, which implies that the sequence might start from either v1 or v2. Hence, I believe this will solve your issue:
# create an indexed data frame
library(dplyr)
df.with.index <- mutate(df, IDX = 1:n())
# filter the data frame by the condition above, and output the index
dplyr::filter(df.with.index,
(v2 == 'a' & v3 == 'b' & v4 == 'c')
| (v1 == 'a' & v2 == 'b' & v3 == 'c'))$IDX
This will result in:
[1] 1 4
If you only need to test whether a row contains the letter 'a' in any column, you might want to use this:
dplyr::filter(df.with.index, (v1 == 'a' | v2 == 'a' | v3 == 'a' | v4 == 'a'))$IDX
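If the header can start in any column, the hard-coded conditions above could be generalised. The following is a minimal base R sketch under that assumption. The helper has_header is hypothetical, and the data frame is rebuilt by hand from the 6*4 example in the question.

```r
# Example data rebuilt from the question (column names v1..v4 as in the post).
df <- data.frame(
  v1 = c('d', 'd', 'd', 'a', '2', '2'),
  v2 = c('a', '1', '1', 'b', '2', '2'),
  v3 = c('b', '1', '1', 'c', '2', '2'),
  v4 = c('c', '1', '1', 'e', 'e', 'e'),
  stringsAsFactors = FALSE
)

header <- c('a', 'b', 'c')

# Hypothetical helper: TRUE if `header` occurs as a consecutive run in `row`.
has_header <- function(row, header) {
  n <- length(header)
  # Every position at which a run of length n could start.
  starts <- seq_len(max(0, length(row) - n + 1))
  any(vapply(starts,
             function(s) all(row[s:(s + n - 1)] == header),
             logical(1)))
}

# Row numbers of the rows that hold the header sequence.
header_rows <- which(apply(df, 1, has_header, header = header))
print(header_rows)
```

For the example data this should give the same row numbers as the filter above.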

how I can add a column with all 1 to my dataframe?

I have a data frame and I want to add a new column in which every entry is 1. How can I do that?
for example
col1. col2
1. 2.
4. 5.
33. 4.
5. 3.
new column
col1. col2. col3
1. 2. 1
4. 5. 1
33. 4. 1
5. 3. 1

df1$col3 <- 1
should work. Similarly,
df1 <- data.frame(df1, col3 = 1)
could also work.

The simplest option is to assign with [ (see ?Extract)
df1['col3'] <- 1
One of the good things about using [ instead of $ is that we can pass variable identifiers as well
v1 <- 'col3'
df1[v1] <- 1
But, if we do
df1$v1 <- 1
it creates a column with name as 'v1' instead of 'col3'
Other variations without changing the initial object would be
transform(df1, col3 = 1)
cbind(df1, col3 = 1)
NOTE: All of these append the new column as the last column
Also, there is a convenient function add_column which can add a column by specifying the position. By default, it creates the column as the last one
library(tibble)
add_column(df1, col3 = 1)
# col1. col2 col3
#1 1 2 1
#2 4 5 1
#3 33 4 1
#4 5 3 1
But, if we need to change it to a specific location, there are arguments
add_column(df1, col3 = 1, .after = 1)
# col1. col3 col2
#1 1 1 2
#2 4 1 5
#3 33 1 4
#4 5 1 3
data
df1 <- structure(list(col1. = c(1, 4, 33, 5), col2 = c(2, 5, 4, 3)),
class = "data.frame", row.names = c(NA,
-4L))

How to count number of rows by 2 columns while ignoring the duplicates from a 3rd column?

I'm trying to count the number of rows in a data.table grouped by 2 of the columns (this works) while also ignoring the duplicate rows (3rd column)
Col1 Col2 Col3 Result
a x y 2
a x y 2 <- row should be ignored from count
b x y 2
a t y 1
a t i 1
I've tried dropping the columns going in, but I don't really know how to chain commands.
dt[, Result:= .N, by = .(Col2, Col3)]

We need to get the unique number of elements in 'Col1'. It can be done with uniqueN on 'Col1' after grouping by 'Col2' and 'Col3'
library(data.table)
setDT(df1)[, Result := uniqueN(Col1), .(Col2, Col3)][]
# Col1 Col2 Col3 Result
#1: a x y 2
#2: a x y 2
#3: b x y 2
#4: a t y 1
#5: a t i 1
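For readers without data.table, roughly the same count can be sketched in base R with ave(). This is a swapped-in alternative, not the method of the answer above, and the data frame is rebuilt by hand from the table in the question.

```r
# Example data rebuilt from the question.
df1 <- data.frame(
  Col1 = c('a', 'a', 'b', 'a', 'a'),
  Col2 = c('x', 'x', 'x', 't', 't'),
  Col3 = c('y', 'y', 'y', 'y', 'i'),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each Col2/Col3 group. Row indices are passed in
# so that the result stays an integer vector instead of being coerced to
# the character type of Col1.
df1$Result <- ave(
  seq_along(df1$Col1), df1$Col2, df1$Col3,
  FUN = function(i) length(unique(df1$Col1[i]))
)
print(df1)
```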

How to remove values in a column when it contains both numeric and non-numeric values?

I have a data set in which one column has got non-numeric as well as numeric values. I want to make a subset where I have values greater than 0 and I want to retain non-numeric values as well. How can I do that?

Convert the column to numeric (if it is of class character, just do as.numeric(df1$col1), but if it is a factor, then as.numeric(as.character(df1$col1))) and create a logical condition to subset the rows. Non-numeric values become NA in the conversion, so is.na() keeps them.
v1 <- as.numeric(as.character(df1$col1))
df1[v1 > 0 | is.na(v1),]
# col1 col2
#1 24 -0.5458808
#2 asd 0.5365853
#4 d1 -0.5836272
#5 2 0.8474600
data
set.seed(24)
df1 <- data.frame(col1 = c(24, 'asd', -5, 'd1', 2), col2 = rnorm(5))

Checking if a value is numerical in R

I have two dataframes, df1 and df2.
df1:
col1 <- c('30','30','30','30')
col2 <- c(3,13,18,41)
col3 <- c("heavy","light","blue","black")
df1 <- data.frame(col1,col2,col3)
>df1
col1 col2 col3
1 30 3 heavy
2 30 13 light
3 30 18 blue
4 30 41 black
df2:
col1 <- c('10',"NONE")
col2 <- c(21,"NONE")
col3 <- c("blue","NONE")
df2 <- data.frame(col1,col2,col3)
>df2
col1 col2 col3
1 10 21 blue
2 NONE NONE NONE
I wrote a bit of script that says; if a value in col3 is equal to "light", I want to remove that row and all subsequent rows in the dataframe. So df1 would look like:
col1 col2 col3
1 30 3 heavy
And there would be no changes to df2 (as it has no matches to "light" in col3).
I have shown two separate data frames above as examples, but the script below just refers to a general "df" to save me copying and pasting the same bit of code twice with df1 replaced by df2.
phrase=c("light")
start_rownum=which(grepl(phrase, df[,3]))
end_rownum=nrow(df)
end_rownum=as.numeric(end_rownum)
if(start_rownum > 0){
df=df[-c(start_rownum:end_rownum),]
}
This script works fine with df1, as the start_rownum has a numerical value. However, I get the following error with df2:
Error in start_rownum:end_rownum : argument of length 0
Instead of saying "if(start_rownum > 0)", is there some way to check if start_rownum has a numerical value? I can't find a working solution.
Thanks.

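For anyone who has a similar problem, I just solved it by changing the condition to:
if (length(start_rownum) > 0 && is.numeric(start_rownum))
When nothing matches, which() returns a zero-length vector, so the length has to be checked. && is used rather than & because if() expects a single TRUE or FALSE.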
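Put together, the guarded version of the script might look like the sketch below. The function name truncate_at and its arguments are hypothetical, chosen only for illustration. It uses the first match when the phrase occurs more than once, which is an assumption about the intended behaviour.

```r
# Hypothetical helper: drop the first row whose column `col` matches `phrase`,
# together with every row after it.
truncate_at <- function(df, phrase, col = 3) {
  start_rownum <- which(grepl(phrase, df[, col]))
  # which() returns integer(0) when nothing matches, so test the length.
  if (length(start_rownum) > 0) {
    # Use the first match in case the phrase occurs more than once.
    df <- df[-(start_rownum[1]:nrow(df)), , drop = FALSE]
  }
  df
}

df1 <- data.frame(col1 = c('30', '30', '30', '30'),
                  col2 = c(3, 13, 18, 41),
                  col3 = c("heavy", "light", "blue", "black"))
df2 <- data.frame(col1 = c('10', "NONE"),
                  col2 = c(21, "NONE"),
                  col3 = c("blue", "NONE"))

out1 <- truncate_at(df1, "light")  # only the "heavy" row should remain
out2 <- truncate_at(df2, "light")  # no match, so df2 comes back unchanged
print(out1)
print(out2)
```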
