Gathering data by paired columns - r

I'm having trouble in shaping my dataframe.
Here's an example:
id institution name1 id1 name2 id2
1 usp Miles Davis 123 Arturo Sandoval 111
2 unb Chet Baker 321 Clifford Brown 121
3 usp Wayne Shorter 222 Hermeto Pascoal 322
4 Puc-rio John Coltrane 333 Charlie Parker 112
I need to keep the id and institution columns and gather the other ones like this:
id institution name_all id_all
1 usp Miles Davis 123
1 usp Arturo Sandoval 111
2 unb Chet Baker 321
2 unb Clifford Brown 121
3 usp Wayne Shorter 222
3 usp Hermeto Pascoal 322
4 Puc-rio John Coltrane 333
4 Puc-rio Charlie Parker 112
I'm using the gather function from tidyr:
df %>%
gather(name_all, id_all, -id, -institution)
but it comes like this:
id institution name id
1 usp name1 Miles Davis
1 usp id1 123
2 unb name1 Chet Baker
2 unb id2 121
Any ideas on how to pair those values? I have more than 5 columns to do this for, and I think I'm missing an argument to specify which of them are paired. I hope I've made myself clear.

For a tidyverse solution, you can:
library(dplyr)
library(tidyr)
df %>%
  gather(ColType, ColValue, -id, -institution) %>%
  # Split e.g. "name1" into its trailing number ("1") and its stem ("name").
  # The "_all" suffix keeps the new "id_all" column from clashing with "id".
  mutate(id_number = gsub("^(\\D*)(\\d*)$", "\\2", ColType, perl = TRUE),
         ColType = paste0(gsub("^(\\D*)(\\d*)$", "\\1", ColType, perl = TRUE), "_all")
  ) %>%
  spread(ColType, ColValue) %>%
  select(-id_number)

I'm sure there is a more elegant solution, but you can try:
library(dplyr)
library(tidyr)
library(readr) # for parse_number()
df %>%
gather(var, name_all, -matches("id|institution")) %>%
gather(var2, val, -c(id, institution, var, name_all)) %>%
mutate(id_all = ifelse(parse_number(var) == parse_number(var2), val, NA)) %>%
na.omit() %>%
select(-var, -var2, -val) %>%
arrange(id)
id institution name_all id_all
1 1 usp Miles_Davis 123
2 1 usp Arturo_Sandoval 111
3 2 unb Chet_Baker 321
4 2 unb Clifford_Brown 121
5 3 usp Wayne_Shorter 222
6 3 usp Hermeto_Pascoal 322
7 4 Puc-rio John_Coltrane 333
8 4 Puc-rio Charlie_Parker 112
First, it transforms the data from wide to long, excluding the variables whose names contain "institution" or "id". Second, it performs another wide-to-long transformation so that all the numbered "id" variables and their values end up in separate rows. Third, it checks whether the "name" variable has the same number as the "id" variable. If so, it assigns the appropriate value, otherwise NA. Finally, it removes the rows with NAs, drops the redundant variables and arranges the data.
Sample data:
df <- read.table(text = "
id institution name1 id1 name2 id2
1 usp Miles_Davis 123 Arturo_Sandoval 111
2 unb Chet_Baker 321 Clifford_Brown 121
3 usp Wayne_Shorter 222 Hermeto_Pascoal 322
4 Puc-rio John_Coltrane 333 Charlie_Parker 112", header = TRUE, stringsAsFactors = FALSE)
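With tidyr 1.0.0 or later, paired columns like these can also be reshaped in one step using pivot_longer() and its special ".value" name. The following is only a minimal sketch, not taken from the answers above: the renaming step and the helper column pair are assumptions made for illustration, and the renaming is there so that the new id_all column does not collide with the existing id column.

```r
library(dplyr)
library(tidyr)

# Same sample data as above, built inline so the sketch is self-contained.
df <- data.frame(
  id = 1:4,
  institution = c("usp", "unb", "usp", "Puc-rio"),
  name1 = c("Miles_Davis", "Chet_Baker", "Wayne_Shorter", "John_Coltrane"),
  id1 = c(123, 321, 222, 333),
  name2 = c("Arturo_Sandoval", "Clifford_Brown", "Hermeto_Pascoal", "Charlie_Parker"),
  id2 = c(111, 121, 322, 112),
  stringsAsFactors = FALSE
)

# Rename "name1" -> "name_all.1", "id1" -> "id_all.1", and so on.
# Plain "id" and "institution" do not end in digits, so they are left alone.
names(df) <- sub("^(name|id)(\\d+)$", "\\1_all.\\2", names(df))

long <- df %>%
  pivot_longer(
    cols = -c(id, institution),
    names_to = c(".value", "pair"),  # part before "." becomes the column name
    names_sep = "\\."
  ) %>%
  select(-pair)                      # "pair" is a hypothetical helper column
```

Because the pairing is read from the column names, the same call should extend to more than two pairs without changes.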

Related

Move information to new column if the first value of the cell is a four-digit number

I have a column with addresses. The data is not clean and the information includes street and house number or sometimes postcode and city. I would like to move the postcode and city information to another column with R, while street and house number stay in the old place. The postcode is a 4 digit number string. I am grateful for any suggestion for a solution.
An ifelse with grepl should help -
library(dplyr)
df <- df %>%
mutate(Strasse = ifelse(grepl('^\\d{4}', Halter), '', Halter),
Ort = ifelse(Strasse == '', Halter, ''))
#  Line          Halter     Strasse             Ort
#1    1        1007 Abc                    1007 Abc
#2    2 1012 Long words             1012 Long words
#3    3     Enelbach 54 Enelbach 54
#4    4         Abcd 56     Abcd 56
#5    5      Engasse 21  Engasse 21
grepl('^\\d{4}', Halter) returns TRUE if it finds a 4-digit number at the start of the string else returns FALSE.
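To see what the pattern does on its own, here is a small sketch. The third value is a hypothetical five-digit entry added only for illustration: it shows that '^\\d{4}' also matches strings that start with more than four digits, and that a word boundary (\\b) is one way to require exactly four, assuming that distinction matters for your data.

```r
x <- c("1007 Abc", "Enelbach 54", "10071 Abc")

# TRUE when the string starts with at least four digits
starts_with_4_digits <- grepl("^\\d{4}", x, perl = TRUE)

# TRUE only when exactly four digits are followed by a non-word character
exactly_4_digits <- grepl("^\\d{4}\\b", x, perl = TRUE)

print(data.frame(x, starts_with_4_digits, exactly_4_digits))
```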
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(Line = 1:5,
Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
'Abcd 56', 'Engasse 21'))
In addition to the neat solution of @Ronak Shah, if you want to use base R:
df <- data.frame(Line = 1:5,
Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54',
'Abcd 56', 'Engasse 21'))
df$Strasse <- with(df, ifelse(grepl('^\\d{4}', Halter), '', Halter))
df$Ort <- with(df, ifelse(Strasse == '', Halter, ''))
> head(df)
  Line          Halter     Strasse             Ort
1    1        1007 Abc                    1007 Abc
2    2 1012 Long words             1012 Long words
3    3     Enelbach 54 Enelbach 54
4    4         Abcd 56     Abcd 56
5    5      Engasse 21  Engasse 21
An option is also with separate
library(dplyr)
library(tidyr)
df %>%
separate(Halter, into = c("Strasse", "Ort"), sep = "(?<=[0-9])$|^(?=[0-9]{4} )")
  Line     Strasse             Ort
1    1                    1007 Abc
2    2             1012 Long words
3    3 Enelbach 54
4    4     Abcd 56
5    5  Engasse 21
data
df <- structure(list(Line = 1:5, Halter = c("1007 Abc", "1012 Long words",
"Enelbach 54", "Abcd 56", "Engasse 21")), class = "data.frame", row.names = c(NA,
-5L))
Swiss postal codes are made up of 4 digits, so you can extract the postcode and city directly:
library(dplyr)
library(stringr)
df %>%
  mutate(Ort = str_extract(Halter, '^\\d{4}\\s.+'))
  Line          Halter             Ort
1    1        1007 Abc        1007 Abc
2    2 1012 Long words 1012 Long words
3    3     Enelbach 54            <NA>
4    4         Abcd 56            <NA>
5    5      Engasse 21            <NA>

Create Dataframe w/All Combinations of 2 Categorical Columns then Sum 3rd Column by Each Combination

I have a large, messy dataset but want to accomplish a straightforward thing. Essentially I want to fill a tibble based on every combination of two columns and sum a third column.
As a hypothetical example, say each observation has the company_name (Wendys, BK, McDonalds), the food_option (burgers, fries, frosty), and the total_spending (in $). I would like to make a 9x3 tibble with the company, food, and total as a sum of every observation. Here's my code so far:
df_table <- df %>%
group_by(company_name, food_option) %>%
summarize(total= sum(total_spending))
company_name food_option total
<chr> <chr> <dbl>
1 Wendys Burgers 757
2 Wendys Fries 140
3 Wendys Frosty 98
4 McDonalds Burgers 1044
5 McDonalds Fries 148
6 BK Burgers 669
7 BK Fries 38
The problem is that McDonalds and BK have zero observations with "Frosty" as the food_option. Consequently, I get a partial table. I'd like to fill it with rows that show:
8 McDonalds Frosty 0
9 BK Frosty 0
I know I can add the rows manually, but the actual dataset has over a hundred combinations so it will be tedious and complicated. Also, I'm constantly modifying the upstream data and I want the code to automatically fill correctly.
Thank you SO MUCH to anyone who can help. This forum has really been a godsend, really appreciate all of you.
Try:
library(dplyr)
df %>%
mutate(food_option = factor(food_option, levels = unique(food_option))) %>%
group_by(company_name, food_option, .drop = FALSE) %>%
summarise(total = sum(total_spending))
dplyr 0.8.0 and later have a .drop argument to group_by(): with .drop = FALSE, the levels of a factor with pre-defined levels are kept even when they have no observations, so the empty combinations appear with a total of 0.
You can use tidyr::expand_grid():
tidyr::expand_grid(company_name = c("Wendys", "McDonalds", "BK"),
food_option = c("Burgers", "Fries", "Frosty"))
to create all possible combinations.
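The grid by itself does not contain the totals, so it still has to be joined to the summarised table. A minimal sketch of that step follows, using made-up spending values (the numbers and the object names totals, all_combos and result are assumptions for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical raw observations; BK and McDonalds have no "Frosty" rows.
df <- data.frame(
  company_name = c("Wendys", "Wendys", "Wendys", "McDonalds", "McDonalds", "BK"),
  food_option = c("Burgers", "Fries", "Frosty", "Burgers", "Burgers", "Fries"),
  total_spending = c(10, 5, 3, 8, 7, 4),
  stringsAsFactors = FALSE
)

# Totals for the combinations that actually occur in the data.
totals <- df %>%
  group_by(company_name, food_option) %>%
  summarise(total = sum(total_spending), .groups = "drop")

# Every company/food combination, derived from the data rather than typed in.
all_combos <- expand_grid(
  company_name = unique(df$company_name),
  food_option = unique(df$food_option)
)

# Combinations missing from the totals get NA in the join, then 0.
result <- all_combos %>%
  left_join(totals, by = c("company_name", "food_option")) %>%
  mutate(total = coalesce(total, 0))
```

Deriving the grid from unique() keeps it in step with upstream changes to the data, at the cost of missing any category that never appears at all.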
library(tidyverse)
# example data
df = read.table(text = "
company_name food_option total
1 Wendys Burgers 757
2 Wendys Fries 140
3 Wendys Frosty 98
4 McDonalds Burgers 1044
5 McDonalds Fries 148
6 BK Burgers 669
7 BK Fries 38
", header=T)
df %>% complete(company_name, food_option, fill=list(total = 0))
# # A tibble: 9 x 3
# company_name food_option total
# <fct> <fct> <dbl>
# 1 BK Burgers 669
# 2 BK Fries 38
# 3 BK Frosty 0
# 4 McDonalds Burgers 1044
# 5 McDonalds Fries 148
# 6 McDonalds Frosty 0
# 7 Wendys Burgers 757
# 8 Wendys Fries 140
# 9 Wendys Frosty 98

Add row with group sum in new column at the end of group category

I have been searching this information since yesterday but so far I could not find a nice solution to my problem.
I have the following dataframe:
CODE CONCEPT P. NR. NAME DEPTO. PRICE
1 Lunch 11 John SALES 160
1 Lunch 11 John SALES 120
1 Lunch 11 John SALES 10
1 Lunch 13 Frank IT 200
2 Internet 13 Frank IT 120
and I want to add a column with the sum of rows by group, for instance, the total amount of concept: Lunch, code: 1 by name in order to get an output like this:
CODE CONCEPT P. NR. NAME DEPTO. PRICE TOTAL
1 Lunch 11 John SALES 160 NA
1 Lunch 11 John SALES 120 NA
1 Lunch 11 John SALES 10 290
1 Lunch 13 Frank IT 200 200
2 Internet 13 Frank IT 120 120
So far, I tried with:
aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
But this retrieves just the total of the concepts like this:
NAME CODE TOTAL
John 1 290
Frank 1 200
Frank 2 120
And not the table with the rest of the data as I would like to have it.
I also tried adding an extra column with NA but somehow I cannot paste the total in a specific row position.
Any suggestions? I would like to have something I can do in BaseR.
Thanks!!
In base R you can use ave() to add a new column. We insert the group sum only in the last row of each group.
df$TOTAL <- with(df, ave(PRICE, CODE, CONCEPT, PNR, NAME, FUN = function(x)
ifelse(seq_along(x) == length(x), sum(x), NA)))
df
# CODE CONCEPT PNR NAME DEPTO. PRICE TOTAL
#1 1 Lunch 11 John SALES 160 NA
#2 1 Lunch 11 John SALES 120 NA
#3 1 Lunch 11 John SALES 10 290
#4 1 Lunch 13 Frank IT 200 200
#5 2 Internet 13 Frank IT 120 120
Similar logic using dplyr
library(dplyr)
df %>%
group_by(CODE, CONCEPT, PNR, NAME) %>%
mutate(TOTAL = ifelse(row_number() == n(), sum(PRICE) ,NA))
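For a base R option, you may try merging the original data frame with the aggregate. Note that this puts the group total (in the PRICE column) on every row of the group, not only on the last one: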
df2 <- aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
out <- merge(df[ , !(names(df) %in% c("PRICE"))], df2, by=c("NAME", "CODE"))
out[with(out, order(CODE, NAME)), ]
NAME CODE CONCEPT PNR DEPT PRICE
1 Frank 1 Lunch 13 IT 200
3 John 1 Lunch 11 SALES 290
4 John 1 Lunch 11 SALES 290
5 John 1 Lunch 11 SALES 290
2 Frank 2 Internet 13 IT 120
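If the total should appear only on the last row of each group while staying in base R, one possible variation is to compute the group sum for every row and then blank out all but the last occurrence of each key. This is a sketch under the assumption that a group is defined by CODE and NAME, as in the aggregate() call in the question; group_total and is_last are names introduced here for illustration.

```r
# Data from the question, with "P. NR." written as PNR.
df <- data.frame(
  CODE = c(1, 1, 1, 1, 2),
  CONCEPT = c("Lunch", "Lunch", "Lunch", "Lunch", "Internet"),
  PNR = c(11, 11, 11, 13, 13),
  NAME = c("John", "John", "John", "Frank", "Frank"),
  PRICE = c(160, 120, 10, 200, 120),
  stringsAsFactors = FALSE
)

# Group sum repeated on every row of the group.
group_total <- ave(df$PRICE, df$CODE, df$NAME, FUN = sum)

# TRUE for the last occurrence of each CODE/NAME combination.
is_last <- !duplicated(df[c("CODE", "NAME")], fromLast = TRUE)

# Keep the total only where the row closes its group.
df$TOTAL <- ifelse(is_last, group_total, NA)
```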

Best method to Merge two Datasets (Maybe if function?)

I have two datasets I am working with, TestA and TestB (below is how to create them in R):
Instructor <- c('Mr.A','Mr.A','Mr.B', 'Mr.C', 'Mr.D')
Class <- c('French','French','English', 'Math', 'Geometry')
Section <- c('1','2','3','5','5')
Time <- c('9:00-10:00','10:00-11:00','9:00-10:00','9:00-10:00','10:00-11:00')
Date <- c('MWF','MWF','TR','TR','MWF')
Enrollment <- c('30','40','24','29','40')
TestA <- data.frame(Instructor,Class,Section,Time,Date,Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment)
Student <- c("Frances","Cass","Fern","Pat","Peter","Kory","Cole")
ID <- c('123','121','101','151','456','789','314')
Instructor <- c('','','','','','','')
Time <- c('','','','','','','')
Date <- c('','','','','','','')
Enrollment <- c('','','','','','','')
Class <- c('French','French','French','French','English', 'Math', 'Geometry')
Section <- c('1','1','2','2','3','5','5')
TestB <- data.frame(Student, ID, Instructor, Class, Section, Time, Date, Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment,ID,Student)
I would like to merge both datasets (if possible, without using merge()) so that the columns from TestA are combined with the student information in TestB, matched on Class and Section.
I tried using merge(TestA, TestB, by=c('Class','Section'), all.x=TRUE) but it adds observations to the original TestA. This is just a test but in the datasets I am using there are hundreds of observations. It worked when I did it with these smaller frames but something is happening to the bigger set. That's why I'd like to know if there is a merge alternative.
Any ideas on how to do this?
The output should look like this
Class Section Instructor Time Date Enrollment Student ID
English 3 Mr.B 9:00-10:00 TR 24 Peter 456
French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
I was once a big fan of merge() until I learned about dplyr's join functions.
Try this instead:
library(dplyr)
TestA %>%
left_join(TestB, by = c("Class", "Section")) %>% #Here, you're joining by just the "Class" and "Section" columns of TestA and TestB
select(Class,
Section,
Instructor = Instructor.x,
Time = Time.x,
Date = Date.x,
Enrollment = Enrollment.x,
Student,
ID) %>%
arrange(Class, Section) #Added to match your output.
The select statement is keeping only those columns that are specifically named and, in some cases, renaming them.
Output:
Class Section Instructor Time Date Enrollment Student ID
1 English 3 Mr.B 9:00-10:00 TR 24 Peter 456
2 French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
3 French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
4 French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
5 French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
6 Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
7 Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
The key is to drop the empty but duplicate columns from TestB before merging / joining as shown by SymbolixAU.
Here is an implementation in data.table syntax:
library(data.table)
setDT(TestB)[, .(Student, ID, Class, Section)][setDT(TestA), on = .(Class, Section)]
Student ID Class Section Instructor Time Date Enrollment
1: Frances 123 French 1 Mr.A 9:00-10:00 MWF 30
2: Cass 121 French 1 Mr.A 9:00-10:00 MWF 30
3: Fern 101 French 2 Mr.A 10:00-11:00 MWF 40
4: Pat 151 French 2 Mr.A 10:00-11:00 MWF 40
5: Peter 456 English 3 Mr.B 9:00-10:00 TR 24
6: Kory 789 Math 5 Mr.C 9:00-10:00 TR 29
7: Cole 314 Geometry 5 Mr.D 10:00-11:00 MWF 40
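Since the question asks for an alternative to merge(), here is a base R sketch built on match(). It also checks for duplicated Class/Section keys in TestA, which is a common reason for a merge returning more rows than expected. The key separator "|" and the names key_a, key_b, dup_keys, idx and result are assumptions made for this illustration, and the empty columns of TestB are left out from the start.

```r
TestA <- data.frame(
  Instructor = c("Mr.A", "Mr.A", "Mr.B", "Mr.C", "Mr.D"),
  Class = c("French", "French", "English", "Math", "Geometry"),
  Section = c("1", "2", "3", "5", "5"),
  Time = c("9:00-10:00", "10:00-11:00", "9:00-10:00", "9:00-10:00", "10:00-11:00"),
  Date = c("MWF", "MWF", "TR", "TR", "MWF"),
  Enrollment = c("30", "40", "24", "29", "40"),
  stringsAsFactors = FALSE
)
TestB <- data.frame(
  Student = c("Frances", "Cass", "Fern", "Pat", "Peter", "Kory", "Cole"),
  ID = c("123", "121", "101", "151", "456", "789", "314"),
  Class = c("French", "French", "French", "French", "English", "Math", "Geometry"),
  Section = c("1", "1", "2", "2", "3", "5", "5"),
  stringsAsFactors = FALSE
)

# One lookup key per row, built from the two matching columns.
key_a <- paste(TestA$Class, TestA$Section, sep = "|")
key_b <- paste(TestB$Class, TestB$Section, sep = "|")

# Keys that occur more than once in TestA would multiply rows in a merge.
dup_keys <- key_a[duplicated(key_a)]

# match() returns the first matching TestA row for each TestB row,
# so the result always has exactly nrow(TestB) rows.
idx <- match(key_b, key_a)
result <- cbind(TestB, TestA[idx, c("Instructor", "Time", "Date", "Enrollment")])
rownames(result) <- NULL
```

If dup_keys is not empty on the real data, the extra rows from merge() most likely come from those repeated keys.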

Aggregate/Group_by second minimum value in R

I have used either group_by() in dplyr or the aggregate() function to aggregate across columns in R. For my current problem I want to group by an individual and find the second-lowest value of one column (Number) and the lowest value of another (Year). So, if my data looks like this:
Number Individual Year Value
123 M. Smith 2010 234
435 M. Smith 2011 346
435 M. Smith 2012 356
524 M. Smith 2015 432
119 J. Jones 2010 345
119 J. Jones 2012 432
254 J. Jones 2013 453
876 J. Jones 2014 654
I want it to become:
Number Individual Year Value
435 M. Smith 2011 346
254 J. Jones 2013 453
Thank you.
We can use the dplyr package. dt2 is the final output. The idea is to filter out the minimum in the Number column, then arrange the data frame by Individual, Number, and Year. Finally, select the first row of each group.
# Load package
library(dplyr)
# Create example data frame
dt <- read.table(text = "Number Individual Year Value
123 'M. Smith' 2010 234
435 'M. Smith' 2011 346
435 'M. Smith' 2012 356
524 'M. Smith' 2015 432
119 'J. Jones' 2010 345
119 'J. Jones' 2012 432
254 'J. Jones' 2013 453
876 'J. Jones' 2014 654",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
group_by(Individual) %>%
filter(Number != min(Number)) %>%
arrange(Individual, Number, Year) %>%
slice(1)
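We can use dplyr. After dropping each group's minimum Number, the smallest remaining Number is the second lowest; sorting by Year as well makes which.min() pick the earliest Year among ties.
library(dplyr)
df1 %>%
  group_by(Individual) %>%
  arrange(Individual, Number, Year) %>%
  filter(Number != min(Number)) %>%
  slice(which.min(Number))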
# A tibble: 2 x 4
# Groups: Individual [2]
# Number Individual Year Value
# <int> <chr> <int> <int>
#1 254 J. Jones 2013 453
#2 435 M. Smith 2011 346
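The same idea can be generalised to the n-th lowest distinct value. The sketch below assumes dplyr 1.0.0 or later for slice_min(); rank_wanted and result are names introduced here for illustration.

```r
library(dplyr)

dt <- data.frame(
  Number = c(123, 435, 435, 524, 119, 119, 254, 876),
  Individual = c(rep("M. Smith", 4), rep("J. Jones", 4)),
  Year = c(2010, 2011, 2012, 2015, 2010, 2012, 2013, 2014),
  Value = c(234, 346, 356, 432, 345, 432, 453, 654),
  stringsAsFactors = FALSE
)

rank_wanted <- 2  # 2 = second-lowest distinct Number

result <- dt %>%
  group_by(Individual) %>%
  # Keep rows whose Number equals the n-th smallest distinct value;
  # groups with fewer distinct values compare against NA and drop out.
  filter(Number == sort(unique(Number))[rank_wanted]) %>%
  # Among those rows, keep the one with the lowest Year.
  slice_min(Year, n = 1, with_ties = FALSE) %>%
  ungroup()
```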
