Combine duplicate rows in dataframe and create new columns - r

I am trying to aggregate rows in a data frame where some values are the same and others differ, as below:
dataframe1 <- data.frame(Company_Name = c("KFC", "KFC", "KFC", "McD", "McD"),
Company_ID = c(1, 1, 1, 2, 2),
Company_Phone = c("237389", "-", "-", "237002", "-"),
Employee_Name = c("John", "Mary", "Jane", "Joshua",
"Anne"),
Employee_ID = c(1001, 1002, 1003, 2001, 2002))
I wish to combine the rows that share the same values and create new columns for the values that differ, as below:
dataframe2 <- data.frame(Company_Name = c("KFC", "McD"),
Company_ID = c(1, 2),
Company_Phone = c("237389", "237002"),
Employee_Name1 = c("John", "Joshua" ),
Employee_ID1 = c(1001, 2001),
Employee_Name2 = c("Mary", "Anne"),
Employee_ID2 = c(1002, 2002),
Employee_Name3 = c("Jane", "na"),
Employee_ID3 = c(1003, "na"))
I have checked similar questions, such as "Combining duplicated rows in R and adding new column containing IDs of duplicates" and "R: collapse rows and then convert row into a new column", but I do not want to separate the values with commas; I want to create new columns instead.
# Company_Name Company_ID Company_Phone Employee_Name1 Employee_ID1 Employee_Name2 Employee_ID2 Employee_Name3 Employee_ID3
#1 KFC 1 237389 John 1001 Mary 1002 Jane 1003
#2 McD 2 237002 Joshua 2001 Anne 2002 na na
Thank you in advance.

A solution using tidyverse. dat is the final output.
library(tidyverse)
dat <- dataframe1 %>%
mutate_if(is.factor, as.character) %>%
mutate(Company_Phone = ifelse(Company_Phone %in% "-", NA, Company_Phone)) %>%
fill(Company_Phone) %>%
group_by(Company_ID) %>%
mutate(ID = 1:n()) %>%
gather(Info, Value, starts_with("Employee_")) %>%
unite(New_Col, Info, ID, sep = "") %>%
spread(New_Col, Value) %>%
select(c("Company_Name", "Company_ID", "Company_Phone",
paste0(rep(c("Employee_ID", "Employee_Name"), 3), rep(1:3, each = 2)))) %>%
ungroup()
# View the result
dat %>% as.data.frame(stringsAsFactors = FALSE)
# Company_Name Company_ID Company_Phone Employee_ID1 Employee_Name1 Employee_ID2 Employee_Name2 Employee_ID3 Employee_Name3
# 1 KFC 1 237389 1001 John 1002 Mary 1003 Jane
# 2 McD 2 237002 2001 Joshua 2002 Anne <NA> <NA>

We could do this with dcast from data.table, which can take multiple value.var columns. Convert the 'data.frame' to a 'data.table' (setDT(dataframe1)); grouped by 'Company_Name', replace the "-" elements of 'Company_Phone' with the first value in the group; then dcast from 'long' to 'wide', specifying 'Employee_Name' and 'Employee_ID' as the value.var columns.
library(data.table)
setDT(dataframe1)[, Company_Phone := first(Company_Phone), Company_Name]
res <- dcast(dataframe1, Company_Name + Company_ID + Company_Phone ~
rowid(Company_Name), value.var = c("Employee_Name", "Employee_ID"), sep='')
-output
res
#Company_Name Company_ID Company_Phone Employee_Name1 Employee_Name2 Employee_Name3 Employee_ID1 Employee_ID2 Employee_ID3
#1: KFC 1 237389 John Mary Jane 1001 1002 1003
#2: McD 2 237002 Joshua Anne NA 2001 2002 NA
If we need to order it
res[, c(1:3, order(as.numeric(sub("\\D+", "", names(res)[-(1:3)]))) + 3), with = FALSE]
# Company_Name Company_ID Company_Phone Employee_Name1 Employee_ID1 Employee_Name2 Employee_ID2 Employee_Name3 Employee_ID3
#1: KFC 1 237389 John 1001 Mary 1002 Jane 1003
#2: McD 2 237002 Joshua 2001 Anne 2002 NA NA

Here is another approach, combining dplyr and cSplit (from splitstackshape):
library(dplyr)
dataframe1 <- dataframe1 %>%
group_by(Company_Name, Company_ID) %>%
summarise_all(~ paste(., collapse = ","))  # formula lambda instead of the deprecated funs()
library(splitstackshape)
dataframe1 <- cSplit(dataframe1, c("Company_Phone", "Employee_Name", "Employee_ID"), ",")
dataframe1
# Company_Name Company_ID Company_Phone_1 Company_Phone_2 Company_Phone_3 Employee_Name_1 Employee_Name_2 Employee_Name_3 Employee_ID_1 Employee_ID_2 Employee_ID_3
#1: KFC 1 237389 - - John Mary Jane 1001 1002 1003
#2: McD 2 237002 - NA Joshua Anne NA 2001 2002 NA
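For comparison, the same long-to-wide step can be sketched in base R with reshape(). This is only a sketch under assumptions that the answers above do not state: "-" is treated as a missing phone number, each company has exactly one real phone number, and the helper column idx (a hypothetical name) numbers the employees within each company.

```r
# Sample data from the question (strings kept as character, not factor)
dataframe1 <- data.frame(
  Company_Name  = c("KFC", "KFC", "KFC", "McD", "McD"),
  Company_ID    = c(1, 1, 1, 2, 2),
  Company_Phone = c("237389", "-", "-", "237002", "-"),
  Employee_Name = c("John", "Mary", "Jane", "Joshua", "Anne"),
  Employee_ID   = c(1001, 1002, 1003, 2001, 2002),
  stringsAsFactors = FALSE
)

# Assumption: "-" means "no phone recorded"; copy the one real number
# to every row of the same company
dataframe1$Company_Phone <- ave(
  dataframe1$Company_Phone, dataframe1$Company_ID,
  FUN = function(x) rep(x[x != "-"][1], length(x))
)

# idx (hypothetical helper column): 1, 2, 3, ... within each company
dataframe1$idx <- ave(dataframe1$Company_ID, dataframe1$Company_ID,
                      FUN = seq_along)

# One row per company, one Employee_Name/Employee_ID pair per idx value
wide <- reshape(
  dataframe1,
  direction = "wide",
  idvar     = c("Company_Name", "Company_ID", "Company_Phone"),
  timevar   = "idx",
  v.names   = c("Employee_Name", "Employee_ID"),
  sep       = ""
)
rownames(wide) <- NULL
wide
```

reshape() already places each employee's name and ID side by side, so no column reordering is needed afterwards. Companies with fewer employees get NA in the unused columns.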

Related

How to get column mean grouped by row labels in R dataframe?

I have a dataframe that looks like this
Fruit    2021  2022
Apples     12    29
Bananas    11    31
Apples     44    55
Oranges    30    73
Oranges    19    82
Bananas    24    78
The Fruit names are not ordered so I can't group them by taking n at a time, they're listed randomly. I need to get the mean of fruits sold in 2021 & 2022 as well as mean sold for apples, oranges & bananas for each year separately.
My code is
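library(data.table)
# Names that start with a digit must be wrapped in backticks
y2021 <- c(mean(df$`2021`), sd(df$`2021`))
y2022 <- c(mean(df$`2022`), sd(df$`2022`))
measure <- c('mean', 'standard deviation')
df1 <- data.table(measure, `2021` = y2021, `2022` = y2022)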
and output looks like this:
Measure             2021  2022
mean                23.3  58
standard deviation  12.4  23.3
But I'm not sure where to start with grouping the rows by name. I need to get something that looks like this
Measure             2021  Apples  Bananas  Oranges  2022  Apples  Bananas  Oranges
mean                23.3                            58
standard deviation  12.4                            23.3
(with the appropriate numbers in the blank spaces)
I suggest this is better handled (in the long run) in a long format, and the summary below gets you started in that direction. It computes only the mean, but it is not hard to repeat for sd and combine the results (quux is the data frame from the question):
fruits <- c(NA, "Apples", "Oranges", "Bananas")
lapply(quux[,-1], function(yr) stack(sapply(fruits, function(z) mean(yr[is.na(z) | quux$Fruit %in% z])))) |>
dplyr::bind_rows(.id = "year")
# year values ind
# 1 2021 23.33333 <NA>
# 2 2021 28.00000 Apples
# 3 2021 24.50000 Oranges
# 4 2021 17.50000 Bananas
# 5 2022 58.00000 <NA>
# 6 2022 42.00000 Apples
# 7 2022 77.50000 Oranges
# 8 2022 54.50000 Bananas
where NA in ind indicates all fruits, otherwise the individual fruit labeled.
If you put your data in long form, you could use the aggregate function:
a <- aggregate(value ~ year + fruit, data=df, FUN=function(x) c(sd(x),mean(x)))
Where value is a column you could create to put the values which are now under 2021 and 2022. Then create a new column called year which has 2021 or 2022 accordingly. Long form is the way to go in R almost always.
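As a rough illustration of the long-form idea, the sketch below rebuilds the question's data, stacks the two year columns by hand, and then calls aggregate() twice. The object names long, by_fruit and by_year are hypothetical, and the lower-case column names fruit, year and value are chosen only to match the formula in the answer above.

```r
# Data from the question; check.names = FALSE keeps "2021"/"2022" as names
df <- data.frame(
  Fruit  = c("Apples", "Bananas", "Apples", "Oranges", "Oranges", "Bananas"),
  `2021` = c(12, 11, 44, 30, 19, 24),
  `2022` = c(29, 31, 55, 73, 82, 78),
  check.names = FALSE
)

# Long form: one row per fruit entry and year
long <- data.frame(
  fruit = rep(df$Fruit, times = 2),
  year  = rep(c("2021", "2022"), each = nrow(df)),
  value = c(df$`2021`, df$`2022`)
)

# Mean and sd per year and fruit; the result column `value` is a
# two-column matrix with columns "mean" and "sd"
stats <- function(x) c(mean = mean(x), sd = sd(x))
by_fruit <- aggregate(value ~ year + fruit, data = long, FUN = stats)

# Mean and sd per year across all fruits
by_year <- aggregate(value ~ year, data = long, FUN = stats)

by_fruit
by_year
```

Individual statistics can be pulled out of the matrix column with, for example, by_year$value[, "mean"].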
We may use
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df1 %>%
pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
as.data.table %>%
cube( .(Mean = mean(value), SD = sd(value)),
by = c("Fruit", "year")) %>%
filter(!if_all(Fruit:year, is.na)) %>%
unite(Fruit, Fruit, year, sep = "_", na.rm = TRUE) %>%
filter(str_detect(Fruit, "_|\\d+")) %>%
data.table::transpose(make.names = "Fruit", keep.names = "Measure")
-output
Measure Apples_2021 Apples_2022 Bananas_2021 Bananas_2022 Oranges_2021 Oranges_2022 2021 2022
1: Mean 28.00000 42.00000 17.500000 54.50000 24.500000 77.500000 23.33333 58.00000
2: SD 22.62742 18.38478 9.192388 33.23402 7.778175 6.363961 12.42041 23.57965
Or if we want the duplicate column names
df1 %>%
pivot_longer(cols = where(is.numeric), names_to = 'year') %>%
as.data.table %>%
cube( .(Mean = mean(value), SD = sd(value)), by = c("Fruit", "year")) %>%
mutate(Fruit = coalesce(Fruit, year)) %>%
drop_na(year) %>%
arrange(year, str_detect(Fruit, '\\d{4}', negate = TRUE)) %>%
select(-year) %>%
data.table::transpose(make.names = "Fruit", keep.names = "Measure")
-output
Measure 2021 Apples Bananas Oranges 2022 Apples Bananas Oranges
1: Mean 23.33333 28.00000 17.500000 24.500000 58.00000 42.00000 54.50000 77.500000
2: SD 12.42041 22.62742 9.192388 7.778175 23.57965 18.38478 33.23402 6.363961
data
df1 <- structure(list(Fruit = c("Apples", "Bananas", "Apples", "Oranges",
"Oranges", "Bananas"), `2021` = c(12L, 11L, 44L, 30L, 19L, 24L
), `2022` = c(29L, 31L, 55L, 73L, 82L, 78L)),
class = "data.frame", row.names = c(NA,
-6L))

How can I create multiple columns at once using R, preferably dplyr or data.table?

I would like to create multiple new variables based on the values within existing columns of my data frame.
Here is a simplified version of my data:
df <- structure(list(City = structure(c(5L, 4L, 4L, 3L, 1L, 2L), .Label = c("Chico",
"Lawndale", "Los Angeles", "San Francisco", "San Jose"), class = "factor"),
yq = c("20071", "20111", "20074", "20124", "20111", "20124"
), cyq_total = c(15582L, 33668L, 40848L, 89028L, 1069L, 178L
)), row.names = c(NA, -6L), class = "data.frame")
City yq cyq_total
1 San Jose 20071 15582
2 San Francisco 20111 33668
3 San Francisco 20074 40848
4 Los Angeles 20124 89028
5 Chico 20111 1069
6 Lawndale 20124 178
The variable cyq_total represents the number of job vacancies in a city in a year-quarter (yq). I would like to create new variables called "Vac20071", "Vac20111", and so on where the value is cyq_total for a given city for a given year and quarter.
This is simplified for my example, but essentially I want the column Vac20071 to display how many vacancies each city had in the year-quarter 20071. Similarly for other year-quarters.
Desired output:
City yq cyq_total Vac20071 Vac20111 Vac20074 Vac20124
<fct> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 San Jose 20071 15582 15582 0 0 0
2 San Francisco 20111 33668 0 33668 40848 0
3 San Francisco 20074 40848 0 33668 40848 0
4 Los Angeles 20124 89028 0 0 0 89028
5 Chico 20111 1069 0 1069 0 0
6 Lawndale 20124 178 0 0 0 178
The code I have to do this works, but is not efficient. I'm looking for a better way to generate the same results other than copy/pasting the same code with slight changes:
df <- df %>% group_by(City) %>% mutate(Vac20071 = max(ifelse(yq == '20071', cyq_total, 0)))
df <- df %>% group_by(City) %>% mutate(Vac20111 = max(ifelse(yq == '20111', cyq_total, 0)))
df <- df %>% group_by(City) %>% mutate(Vac20074 = max(ifelse(yq == '20074', cyq_total, 0)))
df <- df %>% group_by(City) %>% mutate(Vac20124 = max(ifelse(yq == '20124', cyq_total, 0)))
You can get the data in wide format and then join.
library(dplyr)
library(tidyr)
df %>%
pivot_wider(names_from = yq, values_from = cyq_total, names_prefix = 'Vac') %>%
left_join(df, by = 'City')
# A tibble: 6 x 7
# City Vac20071 Vac20111 Vac20074 Vac20124 yq cyq_total
# <fct> <int> <int> <int> <int> <chr> <int>
#1 San Jose 15582 NA NA NA 20071 15582
#2 San Francisco NA 33668 40848 NA 20111 33668
#3 San Francisco NA 33668 40848 NA 20074 40848
#4 Los Angeles NA NA NA 89028 20124 89028
#5 Chico NA 1069 NA NA 20111 1069
#6 Lawndale NA NA NA 178 20124 178
An equivalent approach in data.table, suggested by @chinsoon12:
library(data.table)
setDT(df)
dcast(df,City ~ paste0("Vac", yq), value.var="cyq_total", fill=0L)[df, on=.(City)]
An option using data.table with matrix numeric indexing:
cols <- paste0("Vac", unique(df$yq))
setDT(df)[, (cols) := 0L]
df[, (cols) := {
m <- as.matrix(.SD)
ix <- match(paste0("Vac", yq), cols)
m[cbind(rep(1L:.N, each=length(ix)), rep(ix, .N))] <- cyq_total
as.data.table(m)
}, City, .SDcols=cols]
df
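The pivot_wider output above leaves NA where the question asked for 0. One way to get zeros without extra packages is sketched below with xtabs() and merge(). It assumes each City/yq pair occurs at most once, because xtabs() sums the values where the question's code takes the maximum; the names tab, wide and result are hypothetical.

```r
# Data from the question, with City as character rather than factor
df <- data.frame(
  City      = c("San Jose", "San Francisco", "San Francisco",
                "Los Angeles", "Chico", "Lawndale"),
  yq        = c("20071", "20111", "20074", "20124", "20111", "20124"),
  cyq_total = c(15582L, 33668L, 40848L, 89028L, 1069L, 178L),
  stringsAsFactors = FALSE
)

# City-by-quarter totals; combinations that never occur come out as 0
tab <- xtabs(cyq_total ~ City + yq, data = df)

# Turn the contingency table into a wide data frame with Vac* columns
wide <- as.data.frame.matrix(tab)
names(wide) <- paste0("Vac", names(wide))
wide$City <- rownames(wide)

# Attach the wide columns to every original row of the same city
result <- merge(df, wide, by = "City", sort = FALSE)
result
```

The row order after merge() is not guaranteed to match the input, so sort the result explicitly if the original order matters.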

Formatting output of R dataframe

I currently have a dataframe in R and I want to export/write it to a text file using write.table().
Here's an example of the dataframe:
ID FirstName LastName Class
1000 John NA C-02
1001 Jane Wellington C-03
1002 Kate NA C-04
1003 Adam West C-05
I want to write it to a text file where for each row, if any column value is NA, then it won't include the word "NA" but proceed to the other column. The output I want:
1000 John C-02
1001 Jane Wellington C-03
1002 Kate C-04
1003 Adam West C-05
As the example shows, the first row has no last name, so the output should move straight on to the next column instead of producing something like:
1000 John NA C-02
I did the write.table() command:
write.table(df, "student_list.txt", col.names = FALSE, row.names = FALSE, quote = FALSE, sep="\t")
But the problem is that I'm getting the version with NA included, as in the second output I mentioned.
library(tidyverse)
dta <- tribble(
~ID, ~FirstName, ~LastName, ~Class,
1000, "John", NA, "C-02",
1001, "Jane", "Wellington", "C-03",
1002, "Kate", NA, "C-04",
1003, "Adam", "West", "C-05"
)
dta %>%
unite(column, everything(), sep = " ") %>%
mutate(column = str_remove_all(column, "NA ")) %>%
write.table("student_list.txt", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = "\t")
I would use apply to remove the NAs and convert rows into text lines (using paste), as follows:
data <- apply(df, 1, function(row){
paste(row[!is.na(row)], collapse="\t")
})
write.table(data, "student_list.txt", col.names = FALSE, row.names = FALSE, quote = FALSE, sep="\t")
File output would look like the following:
#1000 John C-02
#1001 Jane Wellington C-03
#1002 Kate C-04
#1003 Adam West C-05
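Since each row has already been collapsed into a single string, writeLines() is a slightly more direct way to write the result. The sketch below rebuilds the example data and writes to a temporary file; the tempfile() path is an assumption made for illustration, where the question uses "student_list.txt".

```r
# Example data from the question
df <- data.frame(
  ID        = 1000:1003,
  FirstName = c("John", "Jane", "Kate", "Adam"),
  LastName  = c(NA, "Wellington", NA, "West"),
  Class     = c("C-02", "C-03", "C-04", "C-05"),
  stringsAsFactors = FALSE
)

# Drop the NA fields in each row and join the rest with tabs
lines <- apply(df, 1, function(row) paste(row[!is.na(row)], collapse = "\t"))

# Write one line per row, then read the file back to check the content
out <- tempfile(fileext = ".txt")
writeLines(lines, out)
readLines(out)
```

Note that dropping NA fields shifts the later values to the left, so rows with missing values have fewer tab-separated fields than the others and the file is no longer a regular table.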

R: Combining several character columns into one by replacing NA-rows

I have a data frame consisting of character variables which looks like this:
V1 V2 V3 V4 V5
1 ID Date pic1 pic2 pic3
2 1 15.06.16 11:50 abc <NA> def
3 1 16.06.16 11:19 <NA> hij <NA>
4 1 17.06.16 11:41 <NA> <NA> nop
5 2 28.05.16 11:40 tuv <NA> <NA>
6 2 29.05.16 11:39 <NA> zab <NA>
7 2 30.05.16 09:07 <NA> <NA> wxy
8 3 03.06.16 07:31 lmn <NA> <NA>
9 3 04.06.16 11:01 <NA> rst <NA>
10 3 05.06.16 13:57 <NA> <NA> opq
So on each day one of the pic-variables contains a value, the rest is NA.
Now I want to combine all pic values into one variable by replacing the NAs. Sorry if this is a duplicate; I've already tried a lot of suggested solutions, but nothing has worked so far.
Thanks!
We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'ID' and 'Date', we unlist the Subset of Data.table (.SD) and omit the NA elements (na.omit).
library(data.table)
setDT(df1)[, .(pic = na.omit(unlist(.SD))), by = .(ID, Date)]
# ID Date pic
# 1: 1 15.06.16 11:50 abc
# 2: 1 15.06.16 11:50 def
# 3: 1 16.06.16 11:19 hij
# 4: 1 17.06.16 11:41 nop
# 5: 2 28.05.16 11:40 tuv
# 6: 2 29.05.16 11:39 zab
# 7: 2 30.05.16 09:07 wxy
# 8: 3 03.06.16 07:31 lmn
# 9: 3 04.06.16 11:01 rst
#10: 3 05.06.16 13:57 opq
Or another option is pmax if there is only a single non-NA per row
setDT(df1)[, pic := do.call(pmax, c(.SD, na.rm = TRUE)),
.SDcols = pic1:pic3][, paste0("pic", 1:3) := NULL][]
Or using dplyr
library(dplyr)
df1 %>%
mutate(pic = pmax(pic1, pic2, pic3, na.rm=TRUE))%>%
select(-(pic1:pic3))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), Date = c("15.06.16 11:50",
"16.06.16 11:19", "17.06.16 11:41", "28.05.16 11:40", "29.05.16 11:39",
"30.05.16 09:07", "03.06.16 07:31", "04.06.16 11:01", "05.06.16 13:57"
), pic1 = c("abc", NA, NA, "tuv", NA, NA, "lmn", NA, NA), pic2 = c(NA,
"hij", NA, NA, "zab", NA, NA, "rst", NA), pic3 = c("def", NA,
"nop", NA, NA, "wxy", NA, NA, "opq")), .Names = c("ID", "Date",
"pic1", "pic2", "pic3"), row.names = c(NA, -9L), class = "data.frame")
Assuming
on each day one of the pic-variables contains a value, the rest is NA
You can use coalesce from dplyr to get what you want:
library(dplyr)
result <- df1 %>% mutate(pic = coalesce(pic1, pic2, pic3)) %>%
select(-(pic1:pic3))
With the data supplied by akrun:
print(result)
## ID Date pic
##1 1 15.06.16 11:50 abc
##2 1 16.06.16 11:19 hij
##3 1 17.06.16 11:41 nop
##4 2 28.05.16 11:40 tuv
##5 2 29.05.16 11:39 zab
##6 2 30.05.16 09:07 wxy
##7 3 03.06.16 07:31 lmn
##8 3 04.06.16 11:01 rst
##9 3 05.06.16 13:57 opq
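The pmax and coalesce answers agree only when each row has a single non-NA value. The sample data above has both pic1 and pic3 filled in its first row, and there the two approaches differ. The base R sketch below illustrates this on three rows; pics, first_non_na and largest are hypothetical names.

```r
# Three rows in the shape of the sample data; row 1 has two non-NA values
pics <- data.frame(
  pic1 = c("abc", NA, NA),
  pic2 = c(NA, "hij", NA),
  pic3 = c("def", NA, "nop"),
  stringsAsFactors = FALSE
)

# First non-NA value per row, left to right (what coalesce returns)
first_non_na <- apply(pics, 1, function(r) unname(r[!is.na(r)][1]))

# Largest value per row in sort order (what pmax returns)
largest <- do.call(pmax, c(pics, na.rm = TRUE))

first_non_na  # "abc" "hij" "nop"
largest       # "def" "hij" "nop"
```

If rows may hold more than one value, it is worth deciding up front whether the first value, the largest value, or all values (as in the na.omit(unlist(.SD)) answer) should be kept.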

transform one long row in data-frame to individual records

I have a variable-length list of people that I get as one long row in a data frame, and I would like to reorganize these records into a more meaningful format.
My raw data looks like this,
df <- data.frame(name1 = "John Doe", email1 = "John#Doe.com", phone1 = "(444) 444-4444", name2 = "Jane Doe", email2 = "Jane#Doe.com", phone2 = "(444) 444-4445", name3 = "John Smith", email3 = "John#Smith.com", phone3 = "(444) 444-4446", name4 = NA, email4 = "Jane#Smith.com", phone4 = NA, name5 = NA, email5 = NA, phone5 = NA)
df
# name1 email1 phone1 name2 email2 phone2
# 1 John Doe John#Doe.com (444) 444-4444 Jane Doe Jane#Doe.com (444) 444-4445
# name3 email3 phone3 name4 email4 phone4 name5
# 1 John Smith John#Smith.com (444) 444-4446 NA Jane#Smith.com NA NA
# email5 phone5
# 1 NA NA
and I am trying to bend it into a format like this,
df_transform <- structure(list(name = structure(c(2L, 1L, 3L, NA, NA), .Label = c("Jane Doe",
"John Doe", "John Smith"), class = "factor"), email = structure(c(3L,
1L, 4L, 2L, NA), .Label = c("Jane#Doe.com", "Jane#Smith.com",
"John#Doe.com", "John#Smith.com"), class = "factor"), phone = structure(c(1L,
2L, 3L, NA, NA), .Label = c("(444) 444-4444", "(444) 444-4445",
"(444) 444-4446"), class = "factor")), .Names = c("name", "email",
"phone"), class = "data.frame", row.names = c(NA, -5L))
df_transform
# name email phone
# 1 John Doe John#Doe.com (444) 444-4444
# 2 Jane Doe Jane#Doe.com (444) 444-4445
# 3 John Smith John#Smith.com (444) 444-4446
# 4 <NA> Jane#Smith.com <NA>
# 5 <NA> <NA> <NA>
I should add that it's not always five records; it could be any number between 1 and 99. I tried reshape2's melt and t(), but it got way too complicated. I imagine there is some known method that I simply do not know about.
You're on the right track, try this:
library(reshape2)
# melt it down
df.melted = melt(t(df))
# get rid of the numbers at the end
df.melted$Var1 = sub('[0-9]+$', '', df.melted$Var1)
# cast it back
dcast(df.melted, (seq_len(nrow(df.melted)) - 1) %/% 3 ~ Var1)[,-1]
# email name phone
#1 John#Doe.com John Doe (444) 444-4444
#2 Jane#Doe.com Jane Doe (444) 444-4445
#3 John#Smith.com John Smith (444) 444-4446
#4 Jane#Smith.com <NA> <NA>
#5 <NA> <NA> <NA>
1) reshape() First we strip off the digits from the column names giving the reduced column names, names0. Then we split the columns into groups producing g (which has three components corresponding to the email, name and phone column groups). Then use reshape (from the base of R) to perform the wide to long transformation and select from the resulting long data frame the desired columns in order to exclude the columns that are added automatically by reshape. That selection vector, unique(names0), is such that it reorders the resulting columns in the desired way.
names0 <- sub("\\d+$", "", names(df))
g <- split(names(df), names0)
reshape(df, dir = "long", varying = g, v.names = names(g))[unique(names0)]
and the last line gives this:
name email phone
1.1 John Doe John#Doe.com (444) 444-4444
1.2 Jane Doe Jane#Doe.com (444) 444-4445
1.3 John Smith John#Smith.com (444) 444-4446
1.4 <NA> Jane#Smith.com <NA>
1.5 <NA> <NA> <NA>
2) reshape2 package Here is a solution using reshape2. We add a rowname column to df and melt it to long form. Then we split the variable column into the name portion (name, email, phone) and the numeric suffix portion which we call id. Finally we convert it back to wide form using dcast and select out the appropriate columns as we did before.
library(reshape2)
m <- melt(data.frame(rowname = 1:nrow(df), df), id = 1)
mt <- transform(m,
variable = sub("\\d+$", "", variable),
id = sub("^\\D+", "", variable)
)
dcast(mt, rowname + id ~ variable)[, unique(mt$variable)]
where the last line gives this:
name email phone
1 John Doe John#Doe.com (444) 444-4444
2 Jane Doe Jane#Doe.com (444) 444-4445
3 John Smith John#Smith.com (444) 444-4446
4 <NA> Jane#Smith.com <NA>
5 <NA> <NA> <NA>
3) Simple matrix reshaping. Remove the numeric suffixes from the column names and set cn to the unique remaining names (cn stands for column names). Then we merely reshape the df row into an n x length(cn) matrix, adding the column names.
cn <- unique(sub("\\d+$", "", names(df)))
matrix(as.matrix(df), nc = length(cn), byrow = TRUE, dimnames = list(NULL, cn))
name email phone
[1,] "John Doe" "John#Doe.com" "(444) 444-4444"
[2,] "Jane Doe" "Jane#Doe.com" "(444) 444-4445"
[3,] "John Smith" "John#Smith.com" "(444) 444-4446"
[4,] NA "Jane#Smith.com" NA
[5,] NA NA NA
4) tapply This problem can also be solved with a simple tapply. As before, names0 is the column names without the numeric suffixes, and names.suffix is just the suffixes. Now use tapply:
names0 <- sub("\\d+$", "", names(df))
names.suffix <- sub("^\\D+", "", names(df))
tapply(as.matrix(df), list(names.suffix, names0), c)[, unique(names0)]
The last line gives:
name email phone
1 "John Doe" "John#Doe.com" "(444) 444-4444"
2 "Jane Doe" "Jane#Doe.com" "(444) 444-4445"
3 "John Smith" "John#Smith.com" "(444) 444-4446"
4 NA "Jane#Smith.com" NA
5 NA NA NA
