Merge the rows with the least abundance - R

I would like to merge the rows whose abundance is lower than a specific value. For example:
ID     A B C
Apple  1 1 1
Banana 2 2 2
Cherry 3 3 3
Dates  4 4 4
For Apple, the total amount in A, B and C is 3, which is 10% of the grand total (3/30 * 100% = 10%).
I would like to merge the rows whose share of the total is 20% or lower into an "Others" row, like:
ID     A B C
Cherry 3 3 3
Dates  4 4 4
Others 3 3 3
May I know how to write a function to achieve this?
Any suggestion or idea is appreciated.

I'd do it like this:
## Your original data
df <- read.table(text = "ID A B C
Apple 1 1 1
Banana 2 2 2
Cherry 3 3 3
Dates 4 4 4", stringsAsFactors = FALSE)
names(df) <- df[1, ]                      ## use the first row as column names
df <- df[-1, ]                            ## remove the header row
df[, -1] <- lapply(df[, -1], as.numeric)  ## convert the value columns to numeric
rownames(df) <- df[, 1]                   ## use the IDs as rownames
df <- df[, -1]                            ## remove the ID column
df$tots <- apply(df, 1, sum)             ## row totals
df$proportion <- df$tots / sum(df$tots)  ## share of the grand total
df <- rbind(df[which(df$proportion >= 0.2), ],
            Others = apply(df[which(df$proportion < 0.2), ], 2, sum))
df <- subset(df, select = -c(tots, proportion))
The result:
> df
       A B C
Banana 2 2 2
Cherry 3 3 3
Dates  4 4 4
Others 1 1 1
Note that this keeps Banana, whose share is exactly 20%; to merge it as in your expected output, keep rows with > 0.2 and sum rows with <= 0.2 into "Others".

One option would be to create a logical index by dividing the rowSums of the numeric columns by the total sum and checking whether the result is less than or equal to 0.2. Then assign "Others" to the 'ID' of the indexed rows (assuming the "ID" column is of character class) and aggregate the columns by 'ID' to get the sums.
i1 <- rowSums(df1[-1]) / sum(as.matrix(df1[-1])) <= 0.2
df1$ID[i1] <- "Others"
aggregate(. ~ ID, df1, sum)
#      ID A B C
#1 Cherry 3 3 3
#2  Dates 4 4 4
#3 Others 3 3 3
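If you want this wrapped up as a reusable function (the question asks how to write one), here is a minimal base-R sketch of the same logic; the name merge_low_abundance and the 0.2 default are my own choices:
## Sketch: collapse rows whose share of the grand total is at or below
## `threshold` into a single "Others" row. Assumes the first column is a
## character ID and the remaining columns are numeric.
merge_low_abundance <- function(df, threshold = 0.2) {
  share <- rowSums(df[-1]) / sum(as.matrix(df[-1]))
  df[[1]][share <= threshold] <- "Others"
  aggregate(df[-1], by = df[1], FUN = sum)
}
merge_low_abundance(df1)  ## reproduces the expected output above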

Related

How to merge two matrices with interleaved columns?

I am facing another problem coding in RStudio. I have two data frames (with the same number of rows and columns). Now I want to merge the two into one, but the 6 columns of data frame 1 should become columns 1, 3, 5, 7, 9, 11 in the new merged data frame, while those of data frame 2 should become columns 2, 4, 6, 8, 10, 12.
I can do it with a for loop, but is there any smarter way/function to do it? Thank you in advance ;)
You can cbind them and then reorder the columns accordingly. Filling a 2-row matrix by row with the column indices and then using it as a column index (the matrix is read column by column) gives the interleaved order 1, 7, 2, 8, and so on:
df1 <- as.data.frame(sapply(LETTERS[1:6], function(x) 1:3))
df2 <- as.data.frame(sapply(letters[1:6], function(x) 1:3))
cbind(df1, df2)[, matrix(seq_len(2*ncol(df1)), 2, byrow=T)]
# A a B b C c D d E e F f
# 1 1 1 1 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3 3 3 3 3
The code below will produce your required result, and will also work if one data frame has more columns than the other (a demonstration of the unequal-width case follows the explanation below).
# create sample data
df1 <- data.frame(
  a1 = 1:10,
  a2 = 2:11,
  a3 = 3:12,
  a4 = 4:13,
  a5 = 5:14,
  a6 = 6:15
)
df2 <- data.frame(
  b1 = 11:20,
  b2 = 12:21,
  b3 = 13:22,
  b4 = 14:23,
  b5 = 15:24,
  b6 = 16:25
)
# join by interleaving columns
want <- cbind(df1, df2)[, order(c(1:length(df1), 1:length(df2)))]
Explanation:
cbind(df1,df2) combines the data frames with all the df1 columns first, then all the df2 columns.
The [,...] element re-orders these columns.
c(1:length(df1), 1:length(df2)) gives 1 2 3 4 5 6 1 2 3 4 5 6, i.e. the order of the columns in df1, followed by the order in df2.
order() of this gives 1 7 2 8 3 9 4 10 5 11 6 12, which is the required column order.
So [, order(c(1:length(df1), 1:length(df2)))] re-orders the columns so that the columns of the original data frames are interleaved as required.
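To illustrate the unequal-width claim, here is a quick sketch with a three-column and a two-column data frame (the data and the names a1..a3, b1, b2 are made up for the demonstration):
df1 <- data.frame(a1 = 1:3, a2 = 4:6, a3 = 7:9)
df2 <- data.frame(b1 = 10:12, b2 = 13:15)
cbind(df1, df2)[, order(c(seq_along(df1), seq_along(df2)))]
#   a1 b1 a2 b2 a3
# 1  1 10  4 13  7
# 2  2 11  5 14  8
# 3  3 12  6 15  9
The surplus df1 column simply ends up last.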

R Compare duplicate values for each row in two data sets

I want to compare whether the values in each row are the same.
In this case, the duplicated and all_equal functions are not suitable.
Reproducible Sample Data
df1 <- data.frame(a=c(1,2,3),b=c(4,5,6))
df2 <- data.frame(a=c(1,2,4),b=c(4,5,6))
> df1
a b
1 1 4
2 2 5
3 3 6
> df2
a b
1 1 4
2 2 5
3 4 6
Expected output
final <- data.frame(a=c(1,2,4),b=c(4,5,6),c=c('T','T','F'))
# The c column is the result I need: whether the values in each row are the same.
>final
a b c
1 1 4 T
2 2 5 T
3 4 6 F
I tried the method below, but it is complicated:
#1. making idx of df1, df2
#2. and full_join
#3. and left_join df1
#4. and left_join df2
library(dplyr)  ## for full_join / left_join
df1$idx1 <- 1:nrow(df1)
df2$idx2 <- 1:nrow(df2)
df3<-full_join(df1,df2,by=c('a','b'))
df3<-left_join(df3,df1,by=c('a','b'))
df3<-left_join(df3,df2,by=c('a','b')) #This may or may not work..
I think there must be a better way. help!
We could use Map to compare the two data frames column by column and Reduce to combine the columnwise results with &:
df2$c <- Reduce(`&`, Map(`==`, df1, df2))
-output
> df2
a b c
1 1 4 TRUE
2 2 5 TRUE
3 4 6 FALSE
You can get column 'c' by:
c <- df1$a == df2$a & df1$b == df2$b
which gives TRUE TRUE FALSE. It looks like you want to then bind this to df2, so:
cbind.data.frame(df2, c)
You may use rowSums -
final <- df2
final$c <- rowSums(df1 != df2) == 0
final
# a b c
#1 1 4 TRUE
#2 2 5 TRUE
#3 4 6 FALSE
In case the positions of the rows in each data.frame do not matter, you can use merge.
within(merge(df2, within(df1, c <- "T"), all.x=TRUE), c[is.na(c)] <- "F")
# a b c
#1 1 4 T
#2 2 5 T
#3 4 6 F
or using duplicated.
df2$c <- c("F", "T")[1+tail(duplicated(rbind(df1, df2)), nrow(df2))]
df2
# a b c
#1 1 4 T
#2 2 5 T
#3 4 6 F
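One caveat with the elementwise approaches above: == propagates NA, so rows containing NA produce NA rather than TRUE/FALSE. Here is a sketch that treats two NAs in the same position as a match, starting from the original df1 and df2 and assuming that is the semantics you want:
eq <- (df1 == df2) | (is.na(df1) & is.na(df2))  ## TRUE where equal or both NA
eq[is.na(eq)] <- FALSE                          ## NA on one side only: not equal
df2$c <- rowSums(eq) == ncol(df1)               ## TRUE when every column matches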

R Compare duplicate values for each row in two data sets + in any order

I want to compare whether the set of values in each row is the same.
In this case, the duplicated and all_equal functions are not suitable.
Reproducible Sample Data
library(dplyr)  ## for %>% and full_join below
a <- c(1,1)
b <- c(2,2)
c <- c(3,3)
d <- c(4,5)
df1 <- rbind(a, b, c)
df1 <- df1 %>% as.data.frame()
df2 <- rbind(a, d, b)
df2 <- df2 %>% as.data.frame()
> df1
V1 V2
a 1 1
b 2 2
c 3 3
> df2
V1 V2
a 1 1
d 4 5
b 2 2
Expected output
df1$idx1 <- 1:nrow(df1)
df2$idx2 <- 1:nrow(df2)
df1
df2
df3<-full_join(df1,df2,by=c('V1','V2'))
df3
df3$need <- ifelse(is.na(df3$idx2), 'only_df1',
                   ifelse(is.na(df3$idx1), 'only_df2',
                          'duplicated'))
> df3
V1 V2 idx1 idx2 need
1 1 1 1 1 duplicated
2 2 2 2 3 duplicated
3 3 3 3 NA only_df1
4 4 5 NA 2 only_df2
I tried this, but it is complicated.
I think there must be a better way. Help!
Since you are already using dplyr, you may use case_when, which is easier to understand and write, especially when you have a lot of conditions.
library(dplyr)
full_join(df1, df2, by = c('V1','V2')) %>%
  mutate(need = case_when(is.na(idx2) ~ 'only_df1',
                          is.na(idx1) ~ 'only_df2',
                          TRUE ~ 'duplicated'))
# V1 V2 idx1 idx2 need
#1 1 1 1 1 duplicated
#2 2 2 2 3 duplicated
#3 3 3 3 NA only_df1
#4 4 5 NA 2 only_df2
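For completeness, the same classification can be sketched with dplyr's join verbs instead of index columns (my own variant, not taken from the question):
only_df1 <- anti_join(df1, df2, by = c('V1','V2'))  ## rows only in df1
only_df2 <- anti_join(df2, df1, by = c('V1','V2'))  ## rows only in df2
both     <- semi_join(df1, df2, by = c('V1','V2'))  ## rows present in both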
As already mentioned in the comments, your way looks OK. In case you want to see how it could be done in base R:
a <- c(1,1)
b <- c(2,2)
c <- c(3,3) # Better not to use existing function names such as c()
d <- c(4,5)
df1 <- as.data.frame(rbind(a,b,c))
df2 <- as.data.frame(rbind(a,d,b))
df1$idx1 <- seq_len(nrow(df1)) #seq_len will also work in case nrow=0
df2$idx2 <- seq_len(nrow(df2))
df3 <- merge(df1, df2, all=TRUE)
df3$need <- ifelse(is.na(df3$idx2), "only_df1",
                   ifelse(is.na(df3$idx1), "only_df2",
                          "duplicated"))
df3
# V1 V2 idx1 idx2 need
#1 1 1 1 1 duplicated
#2 2 2 2 3 duplicated
#3 3 3 3 NA only_df1
#4 4 5 NA 2 only_df2
We can also use the arsenal package, whose comparedf() compares two data frames; summary() prints the differences:
library(arsenal)
summary(comparedf(df1, df2))

Row-wise count of zeros and NAs in R for columns

I want to find the count of rows with respect to the number of zeros and NAs in the data frame, for example
the number of rows having zeros in only 1 column, and so on.
Code for the df is below; the counts are needed for columns M1 to M5.
The desired output for zeros and NAs is at the link below:
https://imgur.com/y9qeyhV
id <- 1:9
M1 <- c(0,NA,1,0,0,NA,NA,1,7)
M2 <- c(NA,NA,0,NA,0,NA,NA,1,7)
M3 <- c(1,NA,0,0,0,1,NA,1,7)
M4 <- c(0,NA,0,3,0,NA,NA,1,7)
M5 <- c(5,0,0,NA,0,0,NA,0,NA)
data <- cbind(id,M1,M2,M3,M4,M5)
data <- as.data.frame(data)
Try this
table(rowSums(is.na(data)))
# 0 1 2 3 4 5
# 3 2 1 1 1 1
## factor(levels = 0:5) keeps counts that never occur (here 3) as zeros
table(factor(rowSums(data == 0, na.rm = TRUE), levels = 0:5))
# 0 1 2 3 4 5
# 2 3 2 0 1 1
You can also pass the code above to data.frame() or as.data.frame() to get a data.frame object like your expected output shows.
For NA:
data.frame(table(rowSums(is.na(data[startsWith(names(data),"M")]))))
Var1 Freq
1 0 3
2 1 2
3 2 1
4 3 1
5 4 1
6 5 1
For zeros
data.frame(table(factor(rowSums(data[startsWith(names(data), "M")] == 0, na.rm = TRUE), levels = 0:5)))
Var1 Freq
1 0 2
2 1 3
3 2 2
4 3 0
5 4 1
6 5 1
apply(data, 1, function(x) length(x[is.na(x)]))
This will give you a vector. Each element corresponds to a row and its value is the number of NA elements in that row.
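To get from that vector to the frequency table in the desired output, you can tabulate it in the same way as the table() answers above:
na_per_row <- apply(data, 1, function(x) length(x[is.na(x)]))
data.frame(table(na_per_row))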
My solution is kind of complicated, but it gives the desired output using apply functions:
myFun <- function(data, count, fun) {
  applyFun <- function(x) {
    length(which(
      apply(data, 1, function(y) length(which(fun(y))) == x)
    ))
  }
  sapply(count, applyFun)
}
myFun(data, 0:5, is.na)
myFun(data, 0:5, function(x) x == 0)
(You made a mistake in your example: two rows have no zeroes in any column: rows 7 and 9.)
Here is a for loop option to count NAs and zeros in each row and then use dplyr::count to summarize the frequency of each value.
library(dplyr)  ## for count()
data$CountNA <- NA
for (i in 1:nrow(data)) {
  data[i, "CountNA"] <- length(which(is.na(data[i, 1:(ncol(data) - 1)])))
}
count(data, CountNA)
data$CountZero <- NA
for (i in 1:nrow(data)) {
  data[i, "CountZero"] <- length(which(data[i, 1:(ncol(data) - 2)] == 0))
}
count(data, CountZero)

In R, find duplicates by column 1 and filter by non-NA column 3

I have a dataframe:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(1,NA,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
I have a data frame with some duplicate values in column 1, but when I de-dupe with the duplicated() function, it seems to pick arbitrarily which row is kept.
dedup_df = df[!duplicated(df$a), ]
How can I ensure that the output returns the row that does not contain an NA in column c?
I tried to use the dplyr package, but the output is not what I need:
library(dplyr)
options(dplyr.print_max = Inf )
df %>%                   ## source dataframe
  group_by(a) %>%        ## grouped by variable
  filter(!is.na(c)) %>%  ## filter by Gross value
  as.data.frame(dedup_df)
Your use of the duplicated function, with one column as the key, to remove duplicate observations (lines) from a data frame is correct.
But it seems that you are worried that it may keep a line that contains NA in another column and drop a line that contains a non-NA value.
I'll use your example, but with a slight modification:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(NA,1,2,4,NA,1,2,2)
df <-data.frame(a,b,c)
> df
a b c
1 A 1 NA
2 A 1 1
3 A 2 2
4 B 4 4
5 B 1 NA
6 B 1 1
7 C 2 2
8 C 2 2
In this case, your dedup_df contains an NA in column c for the first kept row.
> dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
1 A 1 NA
4 B 4 4
7 C 2 2
Solution:
Reorder df by column c first and then use the same command. This reordering sends all NAs to the end of the data frame (order() puts NA last by default). When duplicated() passes over the rows, the NA lines come last and are tagged TRUE whenever an earlier line for the same ID exists without NA.
df = df[order(df$c),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
2 A 1 1
6 B 1 1
7 C 2 2
You can also reorder in descending order, which keeps the row with the largest value of c per ID:
df = df[order(df$c,decreasing = T),]
dedup_df = df[!duplicated(df$a), ]
> dedup_df
a b c
4 B 4 4
3 A 2 2
7 C 2 2
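For a dplyr take on the same idea (a sketch of my own, not from the answer above): sort the non-NA rows first within each group, then keep the first row per group.
library(dplyr)
df %>%
  group_by(a) %>%
  arrange(is.na(c), .by_group = TRUE) %>%  ## non-NA rows first within each group
  slice(1) %>%
  ungroup()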
