Data cleaning in R: grouping by number and then by name - r

A small sample of my dataset looks something like this:
x <- c(1,2,3,4,1,7,1)
y <- c("A","b","a","F","A",".A.","B")
data <- cbind(x,y)
My goal is to first group rows that have the same number together and then group rows that have the same name together (A, a and .A. count as the same name in my case).
In other words, the final output should look something like this:
xnew <- c(1,1,3,7,1,2,4)
ynew <- c("A","A","a",".A.","B","b","F")
datanew <- cbind(xnew,ynew)
Currently, I am only able to group by number in the column labelled x. I am unable to group by name yet. I would appreciate any help given.
Note: I need an automated solution as my raw dataset contains over 10,000 lines for the x and y columns.

Assuming what you have is a data frame, data <- data.frame(x, y), and not the matrix that cbind generates, you could combine different values into one level using fct_collapse and then arrange the data by this new column (z) and by x.
library(dplyr)
library(forcats)
data %>%
  mutate(z = fct_collapse(y,
                          "A" = c('A', '.A.', 'a'),
                          "B" = c('B', 'b'))) %>%
  arrange(z, x) %>%
  select(-z) -> result
result
# x y
#1 1 A
#2 1 A
#3 3 a
#4 7 .A.
#5 1 B
#6 2 b
#7 4 F
Or you can remove all punctuation from the y column, convert it to upper or lower case, and then arrange by the result.
data %>%
  mutate(z = toupper(gsub("[[:punct:]]", "", y))) %>%
  arrange(z, x) %>%
  select(-z) -> result
result

library(dplyr)
data %>%
  as.data.frame() %>%
  group_by(x, y) %>%
  summarise(records = n()) %>%
  arrange(x, y)

According to your question, it's just a matter of ordering the data (assuming data is a data frame, since $ does not work on the cbind matrix):
result <- data[order(data$x, data$y),]
or, considering that you want to collate A, a and .A.:
result <- data[order(data$x, toupper(gsub("[^A-Za-z]","",data$y))),]
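The expected output in the question is sorted by the cleaned name first and by number second. A minimal base R sketch of that ordering, assuming data is built with data.frame rather than cbind (the variable key is introduced here for illustration only):

```r
# Sample data from the question, as a data frame instead of a matrix
x <- c(1, 2, 3, 4, 1, 7, 1)
y <- c("A", "b", "a", "F", "A", ".A.", "B")
data <- data.frame(x, y, stringsAsFactors = FALSE)

# Normalised name: strip punctuation and ignore case, so A, a and .A. agree
key <- toupper(gsub("[[:punct:]]", "", data$y))

# Order by the normalised name first, then by number within each name
result <- data[order(key, data$x), ]
result
```

Swapping the two arguments of order() gives the number-first ordering instead.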

Related

Create columns based on other columns names R

I need to operate columns based on their name condition. In the following reproducible example, per each column that ends with 'x', I create a column that multiplies by 2 the respective variable:
library(dplyr)
set.seed(8)
id <- seq(1,700, by = 1)
a1_x <- runif(700, 0, 10)
a1_y <- runif(700, 0, 10)
a2_x <- runif(700, 0, 10)
df <- data.frame(id, a1_x, a1_y, a2_x)
#Create variables manually: For every column that ends with X, I need to create one column that multiplies the respective column by 2
df <- df %>%
  mutate(a1_x_new = a1_x*2,
         a2_x_new = a2_x*2)
Since I'm working with several columns, I need to automate this process. Does anybody know how to achieve this? Thanks in advance!
Try this:
df %>% mutate(
  across(ends_with("x"), ~ .x*2, .names = "{.col}_new")
)
Thanks @RicardoVillalba for the correction.
You could use transmute and across to generate the new columns for those column names ending in "x". Then, use rename_with to add the "_new" suffix and bind_cols back to the original data frame.
library(dplyr)
df <- df %>%
  transmute(across(ends_with("x"), ~ . * 2)) %>%
  rename_with(., ~ paste0(.x, "_new")) %>%
  bind_cols(df, .)
Result:
head(df)
id a1_x a1_y a2_x a1_x_new a2_x_new
1 1 4.662952 0.4152313 8.706219 9.325905 17.412438
2 2 2.078233 1.4834044 3.317145 4.156466 6.634290
3 3 7.996580 1.4035441 4.834126 15.993159 9.668252
4 4 6.518713 7.0844794 8.457379 13.037426 16.914759
5 5 3.215092 3.5578827 8.196574 6.430184 16.393149
6 6 7.189275 5.2277208 3.712805 14.378550 7.425611
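For comparison, the same idea can be sketched in base R without dplyr. This is only an illustration: the small df below uses made-up values in place of the runif data, and x_cols is a name introduced here.

```r
# Small stand-in for the question's df (values are made up for illustration)
df <- data.frame(id = 1:3,
                 a1_x = c(1, 2, 3),
                 a1_y = c(4, 5, 6),
                 a2_x = c(7, 8, 9))

# Names of the columns that end in "x"
x_cols <- grep("x$", names(df), value = TRUE)

# Multiply those columns by 2 and attach them under "<name>_new"
df[paste0(x_cols, "_new")] <- df[x_cols] * 2
df
```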

R: unique column values, combine rows of second column

From a data frame I need a list of all unique values of one column. For possible later checks we need to keep the information from a second column, combined into a single value for simplicity.
Sample data
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df
id source
1 1 x
2 3 y
3 1 z
The desired outcome is
df2
id source
1 1 x,z
2 3 y
It should be pretty easy, still I cannot find the proper function / grammar?
E.g. something like
df %>%
  group_by(id) %>%
  summarise(vlist = paste0(source, collapse = ","))
or
df %>%
  distinct(id) %>%
  summarise(vlist = paste0(source, collapse = ","))
What am I missing? Thanks for any advice!
You can use aggregate from stats to combine per group.
aggregate(source ~ id, df, paste, collapse = ",")
# id source
#1 1 x,z
#2 3 y
Using your code here is a solution:
library(dplyr)
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df %>%
  group_by(id) %>%
  summarise(vlist = paste0(source, collapse = ",")) %>%
  distinct(id, .keep_all = TRUE)
# A tibble: 2 x 2
id vlist
<dbl> <chr>
1 1 x,z
2 3 y
Your second approach doesn't work because you call distinct before you aggregate the data. Also, you need to use .keep_all = TRUE to also keep the other column.
Your first approach already aggregates to one row per id, so the distinct call at the end is only a safeguard and does not change the result.
aggregate(source ~ id, df, toString)

R Subsetting text from a comma separated column in a data-frame

I have a data.frame with a column that looks like this:
diagnosis
F.31.2,A.43.2,R.45.2,F.43.1
I want to split this column into two columns, one containing all the values with F and one containing all the other values, resulting in a df that looks like this:
F other
F.31.2,F.43.1 A.43.2,R.45.2
Thanks in advance
Try the following tidyverse approach. You can separate the rows at the commas and then create a group according to the pattern, in order to reshape to wide and obtain the expected result:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(diagnosis='F.31.2,A.43.2,R.45.2,F.43.1',stringsAsFactors = FALSE)
#Code
new <- df %>% separate_rows(diagnosis,sep = ',') %>%
  mutate(Group=ifelse(grepl('F',diagnosis),'F','Other')) %>%
  pivot_wider(values_fn = toString,names_from=Group,values_from=diagnosis)
Output:
# A tibble: 1 x 2
F Other
<chr> <chr>
1 F.31.2, F.43.1 A.43.2, R.45.2
First, use strsplit at the commas. Then, using grep find indexes of F, and select/antiselect them by multiplying by 1 or -1 and paste them.
tmp <- el(strsplit(d$diagnosis, ","))
res <- lapply(c(1, -1), function(x) paste(tmp[grep("F", tmp)*x], collapse=","))
res <- setNames(as.data.frame(res), c("F", "other"))
res
# F other
# 1 F.31.2,F.43.1 A.43.2,R.45.2
Data:
d <- setNames(read.table(text="F.31.2,A.43.2,R.45.2,F.43.1"), "diagnosis")
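Both answers above work on a single row. A hedged base R sketch for a column with several rows follows; the second row of the sample data is hypothetical, and it assumes that "F" values are those that start with the letter F:

```r
# First row from the question; the second row is a made-up example
df <- data.frame(diagnosis = c("F.31.2,A.43.2,R.45.2,F.43.1",
                               "A.10.1,F.20.0"),
                 stringsAsFactors = FALSE)

# One character vector of codes per row
parts <- strsplit(df$diagnosis, ",", fixed = TRUE)

# Per row, paste the codes starting with "F" and the remaining codes
df$F <- vapply(parts, function(p) paste(p[startsWith(p, "F")], collapse = ","),
               character(1))
df$other <- vapply(parts, function(p) paste(p[!startsWith(p, "F")], collapse = ","),
                   character(1))
df
```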

How can I store replaced values after filter() %>% mutate()?

I'm attempting to replace empty values in column z based on the values in column x.
I've used filter() to narrow down to the rows of importance, and apply mutate() afterwards, but the mutate values are not replaced in the original dataframe. I can store it as a new dataframe, but merging afterwards would be a considerable headaches as this is happening across dozens of conditionals.
# make dummy data
xx <- data.frame(x = c(1,2,3), y = c("a","","c"), z=c(5,5,""))
xx %>% filter(x == 3) %>% # filter to value of interest
  filter(z == "") %>% # filter to empty values to be replaced
  mutate(z = replace(z, z =="", 5) ) # mutate to replace the empty value
If I do:
xx <- xx %>% filter(x == 3) %>% # filter to value of interest
  filter(z == "") %>% # filter to empty values to be replaced
  mutate(z = replace(z, z =="", 5) ) # mutate to replace the empty value
then only the single row is stored...
I'm looking for a way to keep all of the other dataframe data but replace the mutated data.
It feels like it should be a quick fix, but I've been stuck on it for a while.
You can use an ifelse() statement within dplyr::mutate().
df <- data.frame(x=sample(1:10,100,TRUE),
                 y=sample(c(NA,1:5),100,TRUE))
df %>% mutate(y=ifelse(is.na(y),x,y))
x y
1 7 7
2 10 3
3 7 1
4 7 1
5 10 4
6 3 3
...
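Applied to the dummy data from the question, the same idea might look like the sketch below. It assumes z should stay a character column (c(5, 5, "") is coerced to character), so the replacement is written as "5"; stringsAsFactors = FALSE is added here so this also holds on older R versions.

```r
library(dplyr)

# Dummy data from the question; z is character because of the "" entry
xx <- data.frame(x = c(1, 2, 3), y = c("a", "", "c"), z = c(5, 5, ""),
                 stringsAsFactors = FALSE)

# Put the filter conditions inside ifelse(): rows that match get the new
# value, all other rows keep their current z, and no rows are dropped
xx <- xx %>%
  mutate(z = ifelse(x == 3 & z == "", "5", z))
xx
```

Because nothing is filtered out, there is no need to merge the result back into the original data frame.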

R: Check if all values of one column match uniquely all values of another column

I have a data set with a lot of values. The majority of x matches a value in y uniquely. However some of x match multiple ys. Is there an easy way to find which values of y map to multiple xs?
mydata <- data.frame(x = c(letters,letters), y=c(LETTERS,LETTERS))
mydata$y[c(3,5)] <- "A"
mydata$y[c(10,15)] <- "Z"
mydata %>% foo
[1] "A" "Z"
I apologize if I am missing some obvious command here.
Using dplyr, you can do:
library(dplyr)
mydata <- data.frame(x = letters, y=LETTERS, stringsAsFactors = FALSE)
mydata$y[c(3,5)] <- "A"
mydata$y[c(10,15)] <- "Z"
mydata %>% group_by(y) %>% filter(n() > 1)
If you want to extract just the y values, you can store that to a data frame like this and find unique y values:
df <- mydata %>% group_by(y) %>% filter(n() > 1)
unique(df$y)
Alternatively, the following returns the same values as a single-column data frame instead of a vector:
mydata %>% group_by(y) %>% filter(n() > 1) %>% select(y) %>% distinct()
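Note that n() counts rows, so on the question's original data (where letters is repeated twice) every y would pass n() > 1. A sketch that counts distinct x values per y instead, using dplyr's n_distinct; the names n_x and multi are introduced here for illustration:

```r
library(dplyr)

# Data as in the question: each (x, y) pair occurs twice
mydata <- data.frame(x = c(letters, letters), y = c(LETTERS, LETTERS),
                     stringsAsFactors = FALSE)
mydata$y[c(3, 5)] <- "A"
mydata$y[c(10, 15)] <- "Z"

# Keep only the y values that map to more than one distinct x
multi <- mydata %>%
  group_by(y) %>%
  summarise(n_x = n_distinct(x)) %>%
  filter(n_x > 1) %>%
  pull(y)
multi
```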
Use data.table:
library(data.table)
setDT(mydata)
mydata[,list(n=length(unique(x))), by=y][n>1,]
# y n
# 1: A 3
# 2: Z 3
If we need the corresponding unique values in 'x'
library(data.table)
setDT(mydata)[, if (uniqueN(x) > 1) toString(unique(x)), by = y]
# y V1
#1: A a, c, e
#2: Z j, o, z
