R - Combining duplicate rows within a dataframe in R

I have a dataframe as shown below. Note that COL1 has duplicate entries:
COL1 COL2 COL3
10 hai 2
10 hai 3
10 pal 1
I want the output shown below: COL1 should contain only the unique entry (10), COL2 should contain the merged entries without duplicates (hai pal), and COL3 should contain the sum of the entries (2+3+1=6).
OUTPUT:
COL1 COL2 COL3
10 hai pal 6

We need to aggregate by group. Convert the 'data.frame' to a 'data.table' with setDT(df1), group by 'COL1', paste the unique elements of 'COL2' together, and take the sum of 'COL3'.
library(data.table)
setDT(df1)[,.(COL2 = paste(unique(COL2), collapse=" "), COL3= sum(COL3)) , by = COL1]
# COL1 COL2 COL3
#1: 10 hai pal 6
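For readers without data.table, the same grouping can be sketched in base R. This is only an illustration: the data frame is rebuilt from the question's table under the assumed name df1, and the helper names col2, col3 and result are made up for the example.

```r
# Rebuild the example data from the question (df1 is an assumed name).
df1 <- data.frame(COL1 = c(10, 10, 10),
                  COL2 = c("hai", "hai", "pal"),
                  COL3 = c(2, 3, 1),
                  stringsAsFactors = FALSE)

# Per COL1 group, paste the unique COL2 values into one string.
col2 <- aggregate(COL2 ~ COL1, data = df1,
                  FUN = function(x) paste(unique(x), collapse = " "))

# Per COL1 group, sum COL3.
col3 <- aggregate(COL3 ~ COL1, data = df1, FUN = sum)

# Join the two summaries back together on the grouping column.
result <- merge(col2, col3, by = "COL1")
result
```

Two aggregate calls are used because each column needs a different summary function; merge then lines the summaries up by COL1.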

Related

Union dataframes in some way that updates rows with same row.name

I want to take the union of two dataframes that share some row names. For rows with a common row name, I would like to keep the values from the second dataframe rather than the first. For example:
df1 <- data.frame(col1 = c(1,2), col2 = c(2,4), row.names = c("row_1", "row_2"))
df1
# col1 col2
# row_1 1 2
# row_2 2 4
df2 <- data.frame(col1 = c(3,10), col2 = c(6,99), row.names = c("row_3", "row_2"))
df2
# col1 col2
# row_3 3 6
# row_2 10 99
The result I would like to obtain would then be :
someSpecificRBind(df1,df2, takeIntoAccount=df2)
# col1 col2
# row_1 1 2
# row_2 10 99
# row_3 3 6
The function rbind doesn't do the job: instead of replacing the rows that share a name, it keeps both copies and renames the duplicated row names to make them unique.
I would conceptualize this as only adding to df2 the rows in df1 that aren't already there:
rbind(df2, df1[setdiff(rownames(df1), rownames(df2)), ])
Alternatively, bind df2 first, then use duplicated on the combined row names to drop every row whose name has already been seen:
rbind(df2, df1)[!duplicated(c(row.names(df2), row.names(df1))),]
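As a minimal sketch, the setdiff approach can be wrapped in a small helper. The name rbind_override and its arguments are hypothetical, the sorting step is an addition that only reproduces the row order shown in the question, and df2 uses values that match its printed output.

```r
df1 <- data.frame(col1 = c(1, 2), col2 = c(2, 4),
                  row.names = c("row_1", "row_2"))
df2 <- data.frame(col1 = c(3, 10), col2 = c(6, 99),
                  row.names = c("row_3", "row_2"))

# Hypothetical helper: rows of `override` win over rows of `base`
# whenever both data frames share a row name.
rbind_override <- function(base, override) {
  # Row names that exist only in `base`.
  keep <- setdiff(rownames(base), rownames(override))
  out <- rbind(override, base[keep, , drop = FALSE])
  # Sort by row name so the result is in a predictable order.
  out[order(rownames(out)), , drop = FALSE]
}

res <- rbind_override(df1, df2)
res
```

Using drop = FALSE keeps the result a data frame even when only one column or one row survives the subsetting.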

r - filtering alphanumeric factor columns by numeric values in dataframe

I have a very messy data frame consisting of factor columns with numbers and characters. I need to filter the rows with numeric values above a threshold. However, this is a problem because my columns are factors that cannot be turned to numeric, due to the presence of characters in them.
DF <- data.frame(
Col1 = c("Egg", "", "3"),
Col2 = c("", "Flour", ""),
Col3 = c("2", "", "Bread"),
Col4 = c("4", "", ""),
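Col5 = c("", "6", "8"),
stringsAsFactors = TRUE # needed in R >= 4.0.0, where strings are no longer converted to factors by default
)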
The resulting data frame looks like this:
> DF
Col1 Col2 Col3 Col4 Col5
1 Egg 2 4
2 Flour 6
3 3 Bread 8
Where each column is a factor:
> class(DF$Col1)
[1] "factor"
In this example, how do I filter rows with numeric values above, say, 5 in at least one column? The desired output in this example, looks like this:
> DF
Col1 Col2 Col3 Col4 Col5
2 Flour 6
3 3 Bread 8
You'll get some warnings from dplyr but this works as well:
library(dplyr)
DF %>%
mutate_all(as.character) %>%
filter_all(any_vars(if_else(is.na(as.numeric(.)), FALSE, as.numeric(.) > 5)))
Col1 Col2 Col3 Col4 Col5
1 Flour 6
2 3 Bread 8
Per @Frank's suggestion (a bit cleaner than above):
DF %>%
filter_all(any_vars(as.numeric(as.character(.)) > 5))
Col1 Col2 Col3 Col4 Col5
1 Flour 6
2 3 Bread 8
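Another option is to strip every non-digit character from each value with gsub and convert what remains to numeric. In base R, subset with apply then gives the solution below. Note that the pattern also strips decimal points and minus signs, so this only suits non-negative whole numbers.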
subset(DF, apply(DF, 1, function(x){
#Get only numeric values and convert to numeric
val <- as.numeric(gsub("[^[:digit:]]", "",x))
any(val[!is.na(val)] > 5)
})
)
# Col1 Col2 Col3 Col4 Col5
# 2 Flour 6
# 3 3 Bread 8
One way this can be done is:
DF[do.call(function(...) pmax(..., na.rm=TRUE), data.frame(lapply(lapply(DF, as.character), as.numeric), stringsAsFactors = FALSE)) > 5,]
To explain what this is doing, the lapply(DF, as.character) is removing the factors, then lapply(lapply(DF, as.character), as.numeric) is converting the characters to numbers (the text becomes NA), and then data.frame(lapply(lapply(DF, as.character), as.numeric), stringsAsFactors = FALSE) changes it back to a dataframe, e.g.
> data.frame(lapply(lapply(DF, as.character), as.numeric), stringsAsFactors = FALSE)
Col1 Col2 Col3 Col4 Col5
1 NA NA 2 4 NA
2 NA NA NA NA 6
3 3 NA NA NA 8
The do.call with pmax finds the row-wise maximum (credit to the related question "rowwise maximum for R"), and then we can easily filter for rows whose maximum is above 5.
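The conversion steps described in the answers can also be sketched with a logical matrix and rowSums. This is an illustrative variant rather than any answerer's code; the names num, keep and filtered are assumptions, and stringsAsFactors = TRUE is set explicitly so the columns are factors as in the question.

```r
DF <- data.frame(
  Col1 = c("Egg", "", "3"),
  Col2 = c("", "Flour", ""),
  Col3 = c("2", "", "Bread"),
  Col4 = c("4", "", ""),
  Col5 = c("", "6", "8"),
  stringsAsFactors = TRUE
)

# factor -> character -> numeric; anything that is not a number becomes NA.
# suppressWarnings hides the "NAs introduced by coercion" warnings.
num <- sapply(DF, function(col) suppressWarnings(as.numeric(as.character(col))))

# A row is kept when at least one of its numeric values exceeds 5.
keep <- rowSums(num > 5, na.rm = TRUE) > 0

filtered <- DF[keep, ]
filtered
```

Going through as.character first matters: calling as.numeric directly on a factor returns the internal level codes, not the numbers stored in the labels.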

resample each two columns together in a data frame in R

I have a very large data frame that contains 100 rows and 400000 columns.
To sample each column, I can simply do:
df <- apply(df, 2, sample)
But I want every two columns to be sampled together. For example, if originally col1 is c(1,2,3,4,5) and col2 is c(6,7,8,9,10), and after resampling col1 becomes c(1,3,2,4,5), I want col2 to become c(6,8,7,9,10), following the resampling pattern of col1. The same goes for col3 & col4, col5 & col6, etc.
I wrote a for loop to do this, which takes forever. Is there a better way? Thanks!
You might try this; split the data frame every two columns with split.default, for each sub data frame, sample the rows and then bind them together:
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15)
index <- seq_len(nrow(df))
cbind.data.frame(
  setNames(
    lapply(
      split.default(df, (seq_along(df) - 1) %/% 2),
      function(sdf) sdf[sample(index), , drop = FALSE]
    ),
    NULL
  )
)
# col1 col2 col3
#5 5 10 12
#4 4 9 11
#1 1 6 15
#2 2 7 14
#3 3 8 13
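With hundreds of thousands of columns, it may be worth trying matrix indexing instead of splitting the data frame. The sketch below assumes that all columns are numeric (so the data fits in a matrix) and that the number of columns is even; m, perm, perm2 and out are illustrative names, and the small 5 x 8 matrix stands in for the real data.

```r
set.seed(1)

# Toy data: 5 rows, 8 columns, i.e. 4 column pairs.
m <- matrix(1:40, nrow = 5)
n_pairs <- ncol(m) / 2

# One random row permutation per column pair (nrow x n_pairs matrix).
perm <- replicate(n_pairs, sample(nrow(m)))

# Repeat each permutation so both columns of a pair share it.
perm2 <- perm[, rep(seq_len(n_pairs), each = 2)]

# Index with (row, column) pairs; the values come back in column-major
# order, so they can be reshaped straight into the output matrix.
out <- matrix(m[cbind(c(perm2), c(col(m)))], nrow = nrow(m))
out
```

Because every pair shares one permutation, the rows of columns 1 and 2 stay aligned, as do those of columns 3 and 4, and so on.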

Copy cell in a row if it matches column name

I searched for a while to try to solve this, but unfortunately couldn't find an answer.
In my dataframe, the last column contains strings that match column names. I would like to create another column that, for each row, copies the value from the column named in that row.
For example, say my data is:
col1 <- c(1, 4, 6, 0, 5)
col2 <- c(4, 6, 7, 8, 6)
col3 <- c(0, 4, 2, 2, 1)
col4 <- c("col1", "col1", "col2", "col3", "col1")
df <- data.frame(col1, col2, col3, col4)
and what I want to achieve is col5 which copies relevant cells from each row:
col1 col2 col3 col4 col5
1 4 0 col1 1
4 6 4 col1 4
6 7 2 col2 7
0 8 2 col3 2
5 6 1 col1 5
Basically it looks at col4 and returns the value from the same row that matches that column name.
This is obviously a very simplified version of my data which is why I'd like to automate it.
I would really appreciate any help :)
We can use row/col indexing to extract the elements from the dataset to create the 'col5'.
df$col5 <- df[-4][cbind(1:nrow(df), match(as.character(df$col4), colnames(df)))]
df$col5
#[1] 1 4 7 2 5
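To make the row/column indexing easier to follow, here is a self-contained sketch of the same idea with the intermediate steps named. The variables value_cols, m and idx are illustrative and not part of the original answer.

```r
df <- data.frame(col1 = c(1, 4, 6, 0, 5),
                 col2 = c(4, 6, 7, 8, 6),
                 col3 = c(0, 4, 2, 2, 1),
                 col4 = c("col1", "col1", "col2", "col3", "col1"),
                 stringsAsFactors = FALSE)

# Columns that hold the values to copy from.
value_cols <- c("col1", "col2", "col3")
m <- as.matrix(df[value_cols])

# Two-column index matrix: row i is paired with the position of the
# column named in df$col4[i].
idx <- cbind(seq_len(nrow(df)), match(df$col4, value_cols))

# Indexing a matrix by a two-column matrix picks one cell per index row.
df$col5 <- m[idx]
df$col5
```

Matching against value_cols rather than all column names keeps the lookup correct even if the name column is not the last one.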

count unique values of a column for a given column in r

I have a data frame like
col1 col2 col3
A 2 b1
A 3 b2
A 2 b2
A 2 b1
A 3 b2
I want to get the count of unique values of col3 for each combination of col1 and col2, as follows:
col1 col2 count_unique
A 2 2
A 3 1
What is the best one line solution to this?
As @Frank and @akrun pointed out in their comments, there are several possible solutions to your question. Here are three of the most used ones:
in base R:
aggregate(col3~., df, function(x) length(unique(x)) )
using the data.table package (v1.9.5 and higher):
setDT(df)[, uniqueN(col3), by=.(col1,col2)]
using the dplyr package:
df %>% group_by(col1, col2) %>% summarise(col3=n_distinct(col3))
Two other options:
plyr
library(plyr)
count(unique(df), vars = c("col1", "col2"))
Output:
col1 col2 freq
1 A 2 2
2 A 3 1
sqldf
library(sqldf)
sqldf("SELECT col1, col2, COUNT(DISTINCT(col3)) n
FROM df GROUP BY col1, col2")
Output:
col1 col2 n
1 A 2 2
2 A 3 1
