Grouping low occurring levels in a dataframe in R - r

Suppose that I have a data frame that has a column called C. C has many levels that only occur once. How would I rename all of the levels that occur only once with a new level (called z)?
A B C
a a a
a b b
a a c
a b d
a b a
The above would turn into:
A B C
a a a
a b z
a a z
a b z
a b a

What about this (assuming your data is df and column C is a factor)?
levels(df[,3])[table(df[,3])==1] <- "z"
df
A B C
1 a a a
2 a b z
3 a a z
4 a b z
5 a b a

I'm sure there is a more elegant way to do this but here is one solution:
df <- read.table(text = "A B C
a a a
a b b
a a c
a b d
a b a", header = TRUE)
# Get the number of times each factor occurs:
counts <- table(df$C)
# Replace each one that only occurs once with "z"
df$C <- ifelse(df$C %in% names(counts[counts == 1]), "z", as.character(df$C))
# Since the levels changed, encode as a factor again:
df$C <- factor(df$C)
This gives:
R> df$C
[1] a z z z a
Levels: a z

Using dplyr:
library(dplyr)
df %>% group_by(C) %>%
mutate(D = as.character(ifelse(n() == 1, "z", as.character(C))))
There is some ugly stuff to deal with the ifelse in there.
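The answers above hard-code a count of exactly one. As a minimal base R sketch, the same idea can be wrapped in a helper with an adjustable threshold. The name collapse_rare and its arguments min_n and other are hypothetical, chosen only for illustration:

```r
# Hypothetical helper: collapse every value that occurs fewer than `min_n`
# times into a single level `other`, then re-encode as a factor.
collapse_rare <- function(x, min_n = 2, other = "z") {
  x <- as.character(x)                   # works for factors and characters
  counts <- table(x)                     # occurrences per value (NA excluded)
  rare <- names(counts)[counts < min_n]  # values below the threshold
  x[x %in% rare] <- other
  factor(x)
}

df <- data.frame(A = "a",
                 B = c("a", "b", "a", "b", "b"),
                 C = c("a", "b", "c", "d", "a"))
df$C <- collapse_rare(df$C)
df$C
```

With the default min_n = 2 this collapses the levels that occur once, as in the question; a larger min_n would also group levels that occur a few times.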

Related

Create a numerical df in r using factor df

I have a data frame of factors that I need to convert to numerical/dummy values. I applied as.integer to each column and then used cbind to attach the results to the original data frame. Is there a way to do all columns at once?
data <- data.frame(
x = c('a','b','c'),
y = c('d','e','f'),
z = c('g','h','i'),
stringsAsFactors = TRUE
)
x_factor <- as.integer(data$x)
y_factor <- as.integer(data$y)
z_factor <- as.integer(data$z)
data_binded <- cbind(data,x_factor,y_factor,z_factor)
Here is a dplyr solution:
library(dplyr)
data %>%
mutate(across(everything(), as.integer, .names = "{.col}_factor"))
x y z x_factor y_factor z_factor
1 a d g 1 1 1
2 b e h 2 2 2
3 c f i 3 3 3
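For comparison, a base R sketch that converts every column at once with lapply. It assumes every column of data is a factor, and the _factor suffix simply follows the naming used in the question:

```r
data <- data.frame(
  x = c('a', 'b', 'c'),
  y = c('d', 'e', 'f'),
  z = c('g', 'h', 'i'),
  stringsAsFactors = TRUE
)

# as.integer() on a factor returns its integer level codes.
# The right-hand side is evaluated first, so only the original
# three columns are converted.
data[paste0(names(data), "_factor")] <- lapply(data, as.integer)
data
```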

How can I fill NA-values in a data frame column based on the values from an other column? [duplicate]

I'd like to fill the NA values in the F2 column, based on the most common F2 value when grouped by the F1 column.
F1 F2
1 A C
2 B D
3 A NA
4 A C
5 B NA
Desired outcome:
F1 F2
1 A C
2 B D
3 A C
4 A C
5 B D
Thank you for help
Here is a base R solution. First define a function for the mode, then apply it to your data frame:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df$F2 <- with(df, ave(F2, F1, FUN = function(i) replace(i, is.na(i), Mode(i))))
df
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
Here is one way using dplyr :
library(dplyr)
df %>%
group_by(F1) %>%
mutate(F2 = replace(F2, is.na(F2),
names(sort(table(F2), decreasing = TRUE)[1])))
# F1 F2
# <chr> <chr>
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
In case of ties, preference is given to lexicographic order.
Try this:
First, df2 holds the most frequent non-missing F2 value within each F1 group. I then join it back onto the original data frame, use mutate to fill the missing F2 values from the new variable F2_fill, and finally drop F2_fill from the data frame.
library(tidyverse)
df <- tribble(
~F1, ~F2,
'A', 'C',
'B' , 'D',
'A' ,NA,
'A', 'C',
'B', NA)
df2 <- df %>%
group_by(F1) %>%
count(F2) %>%
filter(!is.na(F2), n == max(n)) %>%
select(-n) %>%
rename(F2_fill = F2)
df3 <- left_join(df,df2, by="F1") %>%
mutate(F2 = ifelse(is.na(F2), F2_fill,F2)) %>%
select(-F2_fill)
You can use ave with table and which.max, and subset with is.na. This works when F2 is a character column.
i <- is.na(x$F2)
x$F2[i] <- ave(x$F2, x$F1, FUN=function(y) names(which.max(table(y))))[i]
x
# F1 F2
#1 A C
#2 B D
#3 A C
#4 A C
#5 B D
Data:
x <- data.frame(F1 = c("A", "B", "A", "A", "B")
, F2 = c("C", "D", NA, "C", NA))
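One edge case worth checking: a mode helper that counts NA as a value (such as the Mode function in the first answer) returns NA when NA is the most frequent entry in a group, so nothing gets filled. Below is a minimal base R sketch that computes the mode from the observed values only. The helper name fill_mode is hypothetical, F2 is assumed to be a character column, and a sixth row has been added to the example data to show the case:

```r
# Hypothetical helper: fill NAs in a character vector with the most
# frequent non-missing value; leave the vector alone if all values are NA.
fill_mode <- function(v) {
  obs <- v[!is.na(v)]
  if (length(obs) == 0) return(v)   # nothing to fill from
  v[is.na(v)] <- names(which.max(table(obs)))
  v
}

# Group B now has more NAs than observed values
x <- data.frame(F1 = c("A", "B", "A", "A", "B", "B"),
                F2 = c("C", "D", NA, "C", NA, NA))
x$F2 <- ave(x$F2, x$F1, FUN = fill_mode)
x
```

As with which.max in the answer above, ties go to the value that sorts first.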

Lookup value from DF column in df col names, take value for corresponding row

My DF:
dataAB <- c("A","B","A","A","B")
dataCD <- c("C","C","D","D","C")
dataEF <- c("F","E","E","E","F")
key <- c("dataC","dataA","dataC","dataE","dataE")
df <- data.frame(dataAB,dataCD,dataEF,key)
I'd like to add a column that looks for the value in "key" in the names of the DF and takes the value in that column for the row. My result would look like this:
df$result <- c("C","B","D","E","F")
Note that the value in the "key" column only partially matches the col names of df and is not the complete names of the col names. I suspect I'll need grep or grepl somewhere. I've tried variations on the following code, but can't get anything to work, and I'm unsure how to apply grep or grepl in this case.
df$result <- mapply(function(a) {df[[as.character(a)]]}, a=df$key)
Use apply with MARGIN = 1 (row-wise). For each row, grepl detects the column whose name contains the value in "key", and the value from that column is returned.
df$result <- apply(df, 1, function(x) x[grepl(x["key"], names(x))])
df
# dataAB dataCD dataEF key result
#1 A C F dataC C
#2 B C E dataA B
#3 A D E dataC D
#4 A D E dataE E
#5 B C F dataE F
Another option with mapply would be to find out the columns from where we need to extract the values using sapply and then get the corresponding value from each row.
df$result <- mapply(function(x, y) df[x, y], 1:nrow(df),
sapply(df$key, function(x) grep(x, names(df), value = TRUE)))
df
# dataAB dataCD dataEF key result
#1 A C F dataC C
#2 B C E dataA B
#3 A D E dataC D
#4 A D E dataE E
#5 B C F dataE F
Perhaps, with 'tidyverse' :
library(tidyverse)
df <- data.frame(dataAB,dataCD,dataEF,key,stringsAsFactors=FALSE) %>% mutate(id=row_number())
df %>% gather(k,v,-key,-id) %>%
filter(str_detect(substring(k,5),substring(key,5))) %>%
select(result=v,id) %>%
inner_join(df,.,by="id")
# dataAB dataCD dataEF key id result
#1 A C F dataC 1 C
#2 B C E dataA 2 B
#3 A D E dataC 3 D
#4 A D E dataE 4 E
#5 B C F dataE 5 F
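The lookup can also be sketched without a row-wise loop over the data frame, by resolving each key to a column position and then indexing a matrix with (row, column) pairs. This sketch assumes every key is a prefix of exactly one column name; vapply stops with an error otherwise. The names value_cols and col_idx are only illustrative:

```r
dataAB <- c("A", "B", "A", "A", "B")
dataCD <- c("C", "C", "D", "D", "C")
dataEF <- c("F", "E", "E", "E", "F")
key    <- c("dataC", "dataA", "dataC", "dataE", "dataE")
df <- data.frame(dataAB, dataCD, dataEF, key, stringsAsFactors = FALSE)

# Columns that may be looked up (everything except the key itself)
value_cols <- setdiff(names(df), "key")

# For each key, the position of the one column whose name starts with it
col_idx <- vapply(df$key,
                  function(k) which(startsWith(value_cols, k)),
                  integer(1))

# Index the character matrix with a two-column (row, column) matrix
df$result <- as.matrix(df[value_cols])[cbind(seq_len(nrow(df)), unname(col_idx))]
df
```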

Function that ignores missing columns

Say I have the following two data frames:
col1 <- c("a","b","c","d","e")
col2 <- c("A","B","C","D","E")
col1a <- c("a","b","c","d","e")
col2a <- c("A","B","C","D","E")
df1 <- data.frame(col1, col2)
df2 <- data.frame(col1a, col2a)
colnames(df1) <- c("c1","c2")
colnames(df2) <- c("c1","c3")
And I have the following function to rename column headers:
library(dplyr)
col_rename <- function(x) x %>% rename(new_c1 = c1, new_c2 = c2, new_c3 = c3)
When I run this function, I get an error because the columns in the function do not match the columns in the data frame.
df1 <- col_rename(df1)
Error: `c3` contains unknown variables
How can I make the function run only on the present columns, and ignore the ones not present, without removing or changing the column names specified in the function?
EDIT:
I can see how the example was a bit confusing. I have many dataframes with many columns. These columns are shared by some dataframes but not all. However, I want to rename all columns specified by the function, regardless of what is present in the dataframe. It looks something like this:
col1 <- c(1:5)
col2 <- c(1:5)
col3 <- c(1:5)
col4 <- c(1:5)
df1 <- data.frame(col1,col2,col3,col4)
df2 <- data.frame(col1,col2,col3,col4)
colnames(df1) <- c("c1","c2","c6","c8")
colnames(df2) <- c("c1","c3","c2","c8")
AB_rename <- function(x) x %>% rename(aa=c1,bb=c2,
cc=c3,dd=c4,
ee=c5,ff=c6,
gg=c7,hh=c8)
Therefore I cannot follow the example of @Ycw, as the columns do not all follow the same rename rule. How do I make this ignore columns that are not present?
Here is a workaround that uses setNames for the col_rename function.
col_rename <- function(x) setNames(x, paste0("new_", names(x)))
col_rename(df1)
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
col_rename(df2)
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
Or use the select_all function from dplyr.
library(dplyr)
df1 %>% select_all(function(x) paste0("new_", x))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
This (~) also works for select_all
df2 %>% select_all(~paste0("new_", .))
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
rename_all also works well
library(dplyr)
df1 %>% rename_all(~paste0("new_", .))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
Update
This is an update to address OP's updated question.
We can create a named vector showing the relationship between old column names and new column names, and define a function to change the names based on the setNames function.
# Create name vector
vec <- paste0("c", 1:8)
names(vec) <- c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh")
# Create the function
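AB_rename <- function(x, name_vec){
  # Position of each existing column name in name_vec (NA if it has no new name)
  idx <- match(names(x), name_vec)
  found <- !is.na(idx)
  # Look up new names per column, so the column order in x does not matter
  new_colname <- names(x)
  new_colname[found] <- names(name_vec)[idx[found]]
  x2 <- setNames(x, new_colname)
  return(x2)
}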
AB_rename(df1, vec)
aa bb ff hh
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5

How to subset only the rows that have multiple, different values in another column in R?

I have a dataset similar to that below:
zz <- "Session Rater
1 A X
2 A X
3 A X
4 B Y
5 B Y
6 B Z
7 B Z
8 C X
9 C Y
10 C Z"
Data <- read.table(text=zz, header = TRUE)
I'd like to subset only the rows for sessions that have multiple raters, even though that data is stored in another column. Therefore, I'd like to end up with a dataset that looks like this:
zz2 <- "Session Rater
1 B Y
2 B Y
3 B Z
4 B Z
5 C X
6 C Y
7 C Z"
Data2 <- read.table(text=zz2, header = TRUE)
Where Session A rows were removed from the dataset because Session A only had one rater, "X," but Sessions B and C (and all of their rows) were retained because they had more than one rater (Y & Z for Session B, and X, Y, & Z for Session C).
I've played around with dplyr, but with no success. Many thanks.
We can use filter with n_distinct
library(dplyr)
Data %>%
group_by(Session) %>%
filter(n_distinct(Rater)>1)
# Session Rater
# <fctr> <fctr>
#1 B Y
#2 B Y
#3 B Z
#4 B Z
#5 C X
#6 C Y
#7 C Z
Or using data.table
library(data.table)
setDT(Data)[, if(uniqueN(Rater)>1) .SD, by = Session]
Or with base R
i1 <- rowSums(!!table(Data))
subset(Data, Session %in% names(i1)[i1 >1])
... or using ave() and subscripting (assuming Rater is a factor, which was the default when reading character data before R 4.0.0; in newer versions pass stringsAsFactors = TRUE to read.table)
Data[with(Data,ave(unclass(Rater),Session,
FUN = function(x)length(unique(x)))) > 1,]
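A further base R sketch that does not depend on whether Rater was read as a factor or as a character column: count the distinct raters per session with tapply, then keep the sessions with more than one. The name n_raters is only illustrative, and the data is rebuilt here so the example runs on its own:

```r
Data <- data.frame(
  Session = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C"),
  Rater   = c("X", "X", "X", "Y", "Y", "Z", "Z", "X", "Y", "Z")
)

# Number of distinct raters in each session (named by session)
n_raters <- tapply(Data$Rater, Data$Session, function(r) length(unique(r)))

# Keep all rows of the sessions that have more than one distinct rater
Data2 <- Data[Data$Session %in% names(n_raters)[n_raters > 1], ]
Data2
```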
