How can I transpose data in each variable from long to wide using group_by in R?

I have a dataframe with id variable name. I'm trying to figure out a way to transpose each variable in the dataframe by name.
My current df is below:
name jobtitle companyname datesemployed empduration joblocation jobdescrip
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati…
2 David… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
3 David… Data An… NA Jan 2018 – J… 6 mos Belfast, U… Working wi…
However, I'd like a dataframe in which there is only one row for name, and every observation for name becomes its own column, like below:
name jobtitle_1 companyname_1 datesemployed_1 empduration_1 joblocation_1 jobdescrip_1 jobtitle_2 companyname_2 datesemployed_2 empduration_2 joblocation_2 jobdescrip_2
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
I have used commands like gather and melt in the past to reshape data, but in this case I'm not sure how to apply them, since every observation for the id variable will need to become its own column.

It sounds like you are looking for gather and pivot_wider.
I used my own sample data with two names:
library(dplyr)
library(tidyr)

df <- tibble(name = c('David', 'David', 'David', 'Bill', 'Bill'),
             jobtitle = c('PM', 'TPM', 'Analyst', 'Dev', 'Eng'),
             companyname = c('EOS', 'Options', NA, 'Microsoft', 'Nintendo'))
First add an index column to distinguish the different positions for each name.
indexed <- df %>%
  group_by(name) %>%
  mutate(.index = row_number())
indexed
# name jobtitle companyname .index
# <chr> <chr> <chr> <int>
# 1 David PM EOS 1
# 2 David TPM Options 2
# 3 David Analyst NA 3
# 4 Bill Dev Microsoft 1
# 5 Bill Eng Nintendo 2
Then it is possible to use gather to get a long form, with one value per row.
gathered <- indexed %>% gather('var', 'val', -c(name, .index))
gathered
# name .index var val
# <chr> <int> <chr> <chr>
# 1 David 1 jobtitle PM
# 2 David 2 jobtitle TPM
# 3 David 3 jobtitle Analyst
# 4 Bill 1 jobtitle Dev
# 5 Bill 2 jobtitle Eng
# 6 David 1 companyname EOS
# 7 David 2 companyname Options
# 8 David 3 companyname NA
# 9 Bill 1 companyname Microsoft
# 10 Bill 2 companyname Nintendo
Now pivot_wider can be used to create a column for each variable and index.
gathered %>% pivot_wider(names_from = c(var, .index), values_from = val)
# name jobtitle_1 jobtitle_2 jobtitle_3 companyname_1 companyname_2 companyname_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 David PM TPM Analyst EOS Options NA
# 2 Bill Dev Eng NA Microsoft Nintendo NA
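Note that gather() is superseded in current tidyr, so the long step above could equally be written with pivot_longer() (a sketch producing the same var/val columns):
gathered <- indexed %>%
  pivot_longer(-c(name, .index), names_to = 'var', values_to = 'val')
The only visible difference should be the row order of the long data, which changes the order (not the content) of the widened columns: positions then stay grouped together, e.g. jobtitle_1, companyname_1, jobtitle_2, ...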

Get the data in long format, create a unique column identifier and get it back to wide format.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -name, names_to = 'col') %>%
  group_by(name, col) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = c(col, row), values_from = value)
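On the sample data from the previous answer, the result should look roughly like this (one row per name; pivot_wider() creates columns in order of first appearance, so here they stay grouped by position):
# name  jobtitle_1 companyname_1 jobtitle_2 companyname_2 jobtitle_3 companyname_3
# David PM         EOS           TPM        Options       Analyst    NA
# Bill  Dev        Microsoft     Eng        Nintendo      NA         NA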

Related

How to filter for all instances of X happening only if nothing else is in the data before the associated date

I'm not sure how to word the title better - I have a list of names, dates, and services. I want to find all instances of a specific service occurring only when there were 0 other services BEFORE the date of the specific one.
Example data below.
The desired output would be ONLY returning row 5 because Bruce Wayne had a surgery with 0 services beforehand. John Doe is disqualified because there was a check-up beforehand and Jane Doe is disqualified because there was no surgery.
Extra question - Instead of checking for any occurrence beforehand, how would I check for any occurrence within 6 months instead?
Date <- c("2022-01-01","2022-04-01","2022-05-01","2022-07-01","2022-08-01","2022-08-05")
Name <- c("John Doe","John Doe","John Doe","Jane Doe","Bruce Wayne","Bruce Wayne")
Service <- c("Check-up","Surgery","Follow-up", "Check-up", "Surgery", "Follow-up")
df <- data.frame(Date,Name,Service)
df
Date Name Service
1 2022-01-01 John Doe Check-up
2 2022-04-01 John Doe Surgery
3 2022-05-01 John Doe Follow-up
4 2022-07-01 Jane Doe Check-up
5 2022-08-01 Bruce Wayne Surgery
6 2022-08-05 Bruce Wayne Follow-up
I don't always trust the ordering of the frame, so instead of relying on row position I keep the Surgery rows that fall on each name's earliest date:
df %>%
  group_by(Name) %>%
  filter(Service == "Surgery", Date == min(Date)) %>%
  ungroup()
# # A tibble: 1 × 3
# Date Name Service
# <chr> <chr> <chr>
# 1 2022-08-01 Bruce Wayne Surgery
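Since Date is stored as character here, it may be safer to parse it before taking the minimum. A sketch, assuming dplyr is attached (ISO dates happen to sort correctly as strings, but other formats will not):
df %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(Name) %>%
  filter(Service == "Surgery", Date == min(Date)) %>%
  ungroup()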
You could filter on Surgery rows and check that the surgery is the first row of the group:
library(dplyr)
df %>%
  group_by(Name) %>%
  filter(Service == "Surgery" & row_number() == 1)
#> # A tibble: 1 × 3
#> # Groups: Name [1]
#> Date Name Service
#> <chr> <chr> <chr>
#> 1 2022-08-01 Bruce Wayne Surgery
Created on 2023-01-27 with reprex v2.0.2
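For the extra question (a surgery is disqualified only if some other service happened within the 6 months before it), one sketch is to look at the most recent prior service per person: if the closest preceding date is already outside the window, every earlier one is too. This approximates 6 months as 183 days:
library(dplyr)
df %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(Name, Date) %>%
  group_by(Name) %>%
  mutate(prev_date = lag(Date)) %>%   # date of the preceding service within each name, if any
  filter(Service == "Surgery",
         is.na(prev_date) | prev_date < Date - 183) %>%
  ungroup() %>%
  select(-prev_date)
On the example data this still returns only the Bruce Wayne surgery, since John Doe's surgery has a check-up roughly three months earlier.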

Is there a way to count repeated observations using the summarize function in R?

I'm working with a data set that contains CustomerID, Sales_Rep, Product, and year columns. The problem I have with this dataset is that there is no unique Transaction Number. The data looks like this:
CustomerID Sales Rep Product Year
301978 Richard Grayson Product A 2017
302151 Maurin Thompkins Product B 2018
301962 Wallace West Product C 2019
301978 Richard Grayson Product B 2018
402152 Maurin Thompkins Product A 2017
501967 Wallace West Product B 2017
301978 Richard Grayson Product B 2018
What I'm trying to do is count how many transactions each Sales Rep made per year, counting every Customer ID that appears for that rep and year even if the customer ID is repeated, and then compile it into one data frame called "Count". I tried using the following functions in R:
Count <- Sales_Data %>%
group_by(Sales_Rep, year) %>%
summarize(count(CustomerID))
but I get this error:
Error: Problem with `summarise()` input `..1`.
i `..1 = count(CustomerID)`.
x no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
The result I want to produce is this:
Sales Rep         2017  2018  2019
Richard Grayson      1     2    NA
Maurin Thompkins     1     1    NA
Wallace West         1    NA     1
Can anybody help me?
There is no need to group and summarise; the count() function does that in one step. Then reshape to wide format.
Sales_Data <- read.table(text = "
CustomerID 'Sales Rep' Product Year
301978 'Richard Grayson' 'Product A' 2017
302151 'Maurin Thompkins' 'Product B' 2018
301962 'Wallace West' 'Product C' 2019
301978 'Richard Grayson' 'Product B' 2018
402152 'Maurin Thompkins' 'Product A' 2017
501967 'Wallace West' 'Product B' 2017
301978 'Richard Grayson' 'Product B' 2018
", header = TRUE, check.names = FALSE)
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
})
Sales_Data %>% count(CustomerID)
#> CustomerID n
#> 1 301962 1
#> 2 301978 3
#> 3 302151 1
#> 4 402152 1
#> 5 501967 1
Sales_Data %>%
  count(`Sales Rep`, Year) %>%
  pivot_wider(id_cols = `Sales Rep`, names_from = Year, values_from = n)
#> # A tibble: 3 x 4
#> `Sales Rep` `2017` `2018` `2019`
#> <chr> <int> <int> <int>
#> 1 Maurin Thompkins 1 1 NA
#> 2 Richard Grayson 1 2 NA
#> 3 Wallace West 1 NA 1
Created on 2022-04-03 by the reprex package (v2.0.1)
Edit
To have the output column 'Sales Rep' in the same order as in the input data, coerce it to a factor whose levels are set to that original order; unique() takes care of this. After pivoting, 'Sales Rep' can be coerced back to character if needed. I have omitted this final step in the code that follows.
Sales_Data %>%
  mutate(`Sales Rep` = factor(`Sales Rep`, levels = unique(`Sales Rep`))) %>%
  count(`Sales Rep`, Year) %>%
  pivot_wider(id_cols = `Sales Rep`, names_from = Year, values_from = n)
#> # A tibble: 3 x 4
#> `Sales Rep` `2017` `2018` `2019`
#> <fct> <int> <int> <int>
#> 1 Richard Grayson 1 2 NA
#> 2 Maurin Thompkins 1 1 NA
#> 3 Wallace West 1 NA 1
Created on 2022-04-05 by the reprex package (v2.0.1)
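If you prefer to keep the group_by()/summarise() structure from the question, the minimal fix is to count rows with n() rather than calling count() inside summarise(). A sketch using the column names of Sales_Data above (the wide reshape with pivot_wider() then works exactly as shown):
Count <- Sales_Data %>%
  group_by(`Sales Rep`, Year) %>%
  summarise(count = n(), .groups = "drop")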

dataframe partial merge in R

I have a data frame that looks like below:
name workplace year note1 note2 job
Ben Alpha 2011 xxxx xx director
Ben Beta 2011 xx xxx director
Ben Beta 2011 xxx xxxx vice president
Wendy Sigma 2011 xxxx x director
Wendy Sigma 2011 xx xx vice president
Wendy Sigma 2011 x xxx CEO
Alice Beta 2011 xxx x staff
Alice Beta 2012 xx xx deputy director
I want to identify and merge the duplicated rows based on the columns "name", "workplace" and "year" (ignoring "note1" and "note2"), so that the information in the "job" column is combined. The output should look like below. Note that "job" is merged based on matching "name", "workplace" and "year"; "note1" and "note2" don't need to be merged and should simply be taken from the first row of each set of matching rows:
name workplace year note1 note2 job.1 job.2 job.3
Ben Alpha 2011 xxxx xx director NA NA
Ben Beta 2011 xx xxx director vice president NA
Wendy Sigma 2011 xxxx x director vice president CEO
Alice Beta 2011 xxx x staff NA NA
Alice Beta 2012 xx xx deputy director NA NA
Another approach, without pivoting. Here you can use first(), last(), or whatever aggregate function you prefer for the notes fields. The extra and fill arguments of separate() keep the split from warning when a group has fewer jobs than columns.
library(dplyr)
library(tidyr)
library(stringr)

df %>%
  group_by(name, workplace, year) %>%
  summarise(note1 = last(note1),
            note2 = last(note2),
            job = toString(job), .groups = 'drop') %>%
  separate(job, into = paste0('Job', seq_len(max(1 + str_count(.$job, ',')))),
           sep = ', ',
           extra = "drop",
           fill = 'right')
# A tibble: 5 x 8
name workplace year note1 note2 Job1 Job2 Job3
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
1 Alice Beta 2011 xxx x staff NA NA
2 Alice Beta 2012 xx xx deputy director NA NA
3 Ben Alpha 2011 xxxx xx director NA NA
4 Ben Beta 2011 xxx xxxx director vice president NA
5 Wendy Sigma 2011 x xxx director vice president CEO
Here's an approach using dplyr and tidyr. First, I remove the notes fields to deal with those separately. Then I assign a row number to each job row within a name/workplace/year group. Then spread into columns based on those jobs. Then finally, add the note from the first row of each name/workplace/year.
library(tidyr); library(dplyr)
my_data %>%
  select(-note1, -note2) %>%
  group_by(name, workplace, year) %>%
  mutate(job_num = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = job_num, values_from = job, names_prefix = "job.") %>%
  left_join(my_data %>% distinct(name, workplace, year, .keep_all = TRUE))
Result:
# A tibble: 5 x 9
name workplace year job.1 job.2 job.3 note1 note2 job
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 Ben Alpha 2011 director NA NA xxxx xx director
2 Ben Beta 2011 director vice president NA xx xxx director
3 Wendy Sigma 2011 director vice president CEO xxxx x director
4 Alice Beta 2011 staff NA NA xxx x staff
5 Alice Beta 2012 deputy director NA NA xx xx deputy director
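If the trailing job column that the join brings back is unwanted, the chain can end with a select() that also puts the notes before the job columns, e.g. (a sketch repeating the pipeline above):
my_data %>%
  select(-note1, -note2) %>%
  group_by(name, workplace, year) %>%
  mutate(job_num = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = job_num, values_from = job, names_prefix = "job.") %>%
  left_join(my_data %>% distinct(name, workplace, year, .keep_all = TRUE)) %>%
  select(name, workplace, year, note1, note2, starts_with("job."))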

How to do a frequency table where column values are variables?

I have a DF named JOB. In that DF I have 4 columns: PERSON_ID, JOB, FT (full time or part time, with values of 1 for full time and 2 for part time) and YEAR. Every person can have only 1 full-time job per year in this DF; it is the full-time job from which they got most of their income during the year.
DF
PERSON_ID JOB FT YEAR
1 Analyst 1 2018
1 Analyst 1 2019
1 Analyst 1 2020
2 Coach 1 2018
2 Coach 1 2019
2 Analyst 1 2020
3 Gardener 1 2020
4 Coach 1 2018
4 Coach 1 2019
4 Analyst 1 2020
4 Coach 2 2019
4 Gardener 2 2019
I want to get frequencies along the lines of the following question:
What full time job changes occurred from 2019 and 2020?
I want to look only at changes where FT=1.
I want my end table to look like this
2019 2020 frequency
Analyst Analyst 1
Coach Analyst 2
NA Gardener 1
I want to look at the data so that I can say 2 people moved from their coaching job to an analyst job, 1 analyst did not change their job, and one person entered the labour market as a gardener.
I tried to fiddle around with the table function but did not even get close to what I wanted. I could not get the YEAR values into separate variables.
10 bonus points if I can do it in base R :)
Thank you for your help
Not pretty but worked:
# split df by year
df_2019 <- df[df$YEAR %in% c(2019) & df$FT == 1, ]
df_2020 <- df[df$YEAR %in% c(2020) & df$FT == 1, ]
# rename Job columns
df_2019$JOB_2019 <- df_2019$JOB
df_2020$JOB_2020 <- df_2020$JOB
# select needed columns
df_2019 <- df_2019[, c("PERSON_ID", "JOB_2019")]
df_2020 <- df_2020[, c("PERSON_ID", "JOB_2020")]
# merge dfs
df2 <- merge(df_2019, df_2020, by = "PERSON_ID", all = TRUE)
df2$frequency <- 1
df2$JOB_2019 <- addNA(df2$JOB_2019)
df2$JOB_2020 <- addNA(df2$JOB_2020)
# aggregate frequency
aggregate(frequency ~ JOB_2019 + JOB_2020, data = df2, FUN = sum, na.action=na.pass)
JOB_2019 JOB_2020 frequency
1 Analyst Analyst 1
2 Coach Analyst 2
3 <NA> Gardener 1
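For the base R bonus, once df2 from above exists (its job columns already have NA added as a factor level via addNA), a cross-tabulation gets the same counts in one call; keeping only the non-zero combinations reproduces the desired table. A sketch:
tab <- as.data.frame(table(JOB_2019 = df2$JOB_2019,
                           JOB_2020 = df2$JOB_2020))
tab[tab$Freq > 0, ]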
Not base R, but it works:
library(dplyr)
library(tidyr)
df %>%
  filter(FT == 1, YEAR %in% c(2019, 2020)) %>%
  group_by(YEAR, JOB, PERSON_ID) %>%
  tally() %>%
  pivot_wider(names_from = YEAR, values_from = JOB) %>%
  select(-PERSON_ID) %>%
  group_by(`2019`, `2020`) %>%
  summarise(n = n())
`2019` `2020` n
<chr> <chr> <int>
1 Analyst Analyst 1
2 Coach Analyst 2
3 NA Gardener 1

How to set column names to row values in R

I have this type of table in R
April Tourist
2018 123
2018 222
I want my table to look like this:
Month Year Domestic International Total
April 2018 123 222 345
I am new to R. I tried using melt and the rownames() function, but I couldn't get the output I wanted.
Based on your comment that you only have 2 rows in your data set, here's a way to do this with dplyr and tidyr:
library(dplyr)
library(tidyr)

df <- tibble(April = c(2018, 2018),
             Tourist = c(123, 222))

df %>%
  mutate(Type = c("Domestic", "International")) %>%
  gather(Month, Year, April) %>%
  spread(Type, Tourist) %>%
  mutate(
    Total = Domestic + International
  )
# A tibble: 1 x 5
Month Year Domestic International Total
<chr> <dbl> <dbl> <dbl> <dbl>
1 April 2018 123 222 345
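Since gather() and spread() are superseded in current tidyr, a roughly equivalent version with the newer pivot verbs would be (a sketch, using the same df as above):
df %>%
  mutate(Type = c("Domestic", "International")) %>%
  pivot_longer(April, names_to = "Month", values_to = "Year") %>%
  pivot_wider(names_from = Type, values_from = Tourist) %>%
  mutate(Total = Domestic + International)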
