In R, pivot duplicate row values into column values - r

My problem is similar to this one, but I am having trouble making the code work for me:
Pivot dataframe to keep column headings and sub-headings in R
My data looks like this:
prod1<-c(1000,2000,1400,1340)
prod2<-c(5000,5400,3400,5400)
partner<-c("World","World","Turkey","Turkey")
year<-c("2017","2018","2017","2018")
type<-c("credit","credit","debit","debit")
s<-as.data.frame(rbind(partner,year,type,prod1,prod2))
But I need to convert all the rows into individual variables so that my columns are:
column.names<-c("products","partner","year","type","value")
I've been trying the code below:
#fix partners
colnames(s)[seq(2, 7, 1)] <- colnames(s)[2] #seq(start,end,increment)
colnames(s)[seq(9, ncol(s), 1)] <- colnames(s)[8]
colnames(s) <-
c(s[1, 1], paste(sep = '_', colnames(s)[2:ncol(s)], as.character(unlist(s[1, 2:ncol(s)]))))
test<-s[-1,]
s <- rename(s, category=1)
test<- s %>%
slice(-1) %>%
pivot_longer(-1,
names_to = c("partner", ".value"),
names_sep = "_") %>%
arrange(partner, `Service item`) %>%
mutate(partner = as.character(partner))
But it keeps saying I can't have duplicate column names. Can someone please help? The initial data is submitted in this format so I need to get it in the right shape.

library(tidyverse)
# keep the variable names (partner, year, ...) as a regular column called rowname
s <- rownames_to_column(s)
s %>%
  pivot_longer(starts_with("V")) %>%
  pivot_wider(names_from = rowname, values_from = value) %>%
  select(-name) %>%
  pivot_longer(starts_with("prod"), names_to = "product", values_to = "value")
# A tibble: 8 × 5
partner year type product value
<chr> <chr> <chr> <chr> <chr>
1 World 2017 credit prod1 1000
2 World 2017 credit prod2 5000
3 World 2018 credit prod1 2000
4 World 2018 credit prod2 5400
5 Turkey 2017 debit prod1 1400
6 Turkey 2017 debit prod2 3400
7 Turkey 2018 debit prod1 1340
8 Turkey 2018 debit prod2 5400
Sorry, I misread the question at first. Is this what you are looking for?
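For comparison, here is a minimal alternative sketch, assuming the `s` built in the question. Because each variable is stored as a row, transposing with t() first gives one column per variable, and a single pivot_longer() then stacks the product columns. The object names `wide` and `long` are only illustrative.

```r
library(tidyr)

# Rebuild the question's data; rbind() coerces every value to character
prod1   <- c(1000, 2000, 1400, 1340)
prod2   <- c(5000, 5400, 3400, 5400)
partner <- c("World", "World", "Turkey", "Turkey")
year    <- c("2017", "2018", "2017", "2018")
type    <- c("credit", "credit", "debit", "debit")
s <- as.data.frame(rbind(partner, year, type, prod1, prod2))

# Transpose so that partner, year, type, prod1 and prod2 become columns
wide <- as.data.frame(t(s), stringsAsFactors = FALSE)

# Stack the product columns into name/value pairs
long <- pivot_longer(wide, starts_with("prod"),
                     names_to = "products", values_to = "value")

# The values are still character after rbind(), so convert them back
long$value <- as.numeric(long$value)
long
```

The as.numeric() step matters here because the values came out of rbind() as text; without it, `value` stays a character column, as in the tibble printed above.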

Related

How best to do this pivot operation in R

Below is the sample data and the desired outcome. This is a much simplified version of the actual data set, which has 20 years with 4 quarters apiece. I am looking to have each unique company listed once, with the employment data series running from beginning to end, left to right. If there is no data for Vision Inc. in 2019 quarter 3, I want a 0 returned and not an NA.
library(tidyverse)
library(dplyr)
legalname <- c("Vision Inc.","Expedia","Strong Enterprise","Vision Inc.","Expedia","Strong Enterprise")
year <- c(2019,2019,2019,2019,2019,2019)
quarter <- c(1,1,1,2,2,2)
cnty <- c(031,029,027,031,029,027)
naics <- c(345110,356110,362110,345110,356110,345110)
mnth1emp <- c (11,13,15,15,17,20)
mnth2emp <- c(12,14,15,16,18,22)
mnth3emp <-c(13,15,15,17,21,29)
employers <- data.frame(legalname,year,quarter,naics,mnth1emp,mnth2emp,mnth3emp)
Desired Outcome
legalname cnty naics 2019m1 2019m2 2019m3 2019m4 2019m5 2019m6
Vision Inc 031 345110 11 12 13 15 16 17
Expedia 029 356110 13 14 15 17 18 21
I first pivot to long form, then arrange by legalname and year (just to double-check that they are in numerical order). Then I create a unique month series for each year for each company. Next, I drop quarter, pivot back to wide form with year and month name pasted together, and finally replace NA with 0. Here, I'm assuming that you want each unique naics on its own row.
library(tidyverse)
employers %>%
  pivot_longer(starts_with("mnth")) %>%
  arrange(legalname, year) %>%
  group_by(legalname, year, naics) %>%
  mutate(name = paste0("m", 1:n())) %>%
  select(-quarter) %>%
  pivot_wider(names_from = c("year", "name"), names_sep = "", values_from = "value") %>%
  mutate(across(everything(), ~replace_na(.,0)))
Output
legalname naics `2019m1` `2019m2` `2019m3` `2019m4` `2019m5` `2019m6`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Expedia 356110 13 14 15 17 18 21
2 Strong Enterprise 362110 15 15 15 0 0 0
3 Strong Enterprise 345110 0 0 0 20 22 29
4 Vision Inc. 345110 11 12 13 15 16 17
Does this work for you?
First pivot longer to get the months and values in a quarter; and then pivot wider to get the wide format you want.
employers %>%
  filter(legalname != "Strong Enterprise") %>%
  pivot_longer(mnth1emp:mnth3emp, names_to = "mnth", values_to = "value") %>%
  mutate(month_in_quarter = as.numeric(str_extract(mnth, "\\d")),
         month = str_c("m", month_in_quarter + 3*(quarter - 1))) %>%
  select(-c(month_in_quarter, mnth)) %>%
  pivot_wider(id_cols = c(legalname, cnty, naics), names_from = c(year, month),
              values_from = value,
              values_fill = 0)
values_fill = 0 fills missing cells with 0 instead of NA. Note that this assumes cnty is a column of employers; the data.frame() call in the question leaves it out.
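As a self-contained sketch of the same idea, the version below rebuilds the sample data with cnty included. Storing cnty as character is my assumption, made because numeric literals such as 031 lose their leading zero in R. The month index is computed as month-in-quarter plus 3 * (quarter - 1), as in the answer above.

```r
library(dplyr)
library(tidyr)
library(stringr)

employers <- data.frame(
  legalname = c("Vision Inc.", "Expedia", "Strong Enterprise",
                "Vision Inc.", "Expedia", "Strong Enterprise"),
  year      = c(2019, 2019, 2019, 2019, 2019, 2019),
  quarter   = c(1, 1, 1, 2, 2, 2),
  cnty      = c("031", "029", "027", "031", "029", "027"),  # character keeps the zeros
  naics     = c(345110, 356110, 362110, 345110, 356110, 345110),
  mnth1emp  = c(11, 13, 15, 15, 17, 20),
  mnth2emp  = c(12, 14, 15, 16, 18, 22),
  mnth3emp  = c(13, 15, 15, 17, 21, 29)
)

wide <- employers %>%
  pivot_longer(starts_with("mnth"), names_to = "mnth", values_to = "value") %>%
  # month 1-3 within the quarter, shifted by 3 for every earlier quarter
  mutate(month = str_c("m", as.numeric(str_extract(mnth, "\\d")) + 3 * (quarter - 1))) %>%
  pivot_wider(id_cols = c(legalname, cnty, naics),
              names_from = c(year, month), names_sep = "",
              values_from = value, values_fill = 0)
wide
```

Because naics is part of id_cols, Strong Enterprise still appears on two rows (one per naics code), with 0 in the quarters where that code has no data.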
perhaps try this.
I found a way to get the pivot right in R. I used the library("pivottabler") with the data.frame "bhmtrains". This worked now.
library(pivottabler)
qhpvt(bhmtrains, c("=","TOC"), "TrainCategory",
c("Mean Speed"="mean(SchedSpeedMPH, na.rm=TRUE)",
"Std Dev Speed"="sd(SchedSpeedMPH, na.rm=TRUE)"),
formats=list("%.0f", "%.1f"), totals=list("", "TrainCategory"="All",
"Categories"))

Reshaping multiple long columns into wide column format in R

My sample dataset has multiple columns that I want to convert into wide format. I have tried using the dcast function, but I get an error. Below is my sample dataset:
df2 = data.frame(emp_id = c(rep(1,2), rep(2,4),rep(3,3)),
Name = c(rep("John",2), rep("Kellie",4), rep("Steve",3)),
Year = c("2018","2019","2018","2018","2019","2019","2018","2019","2019"),
Type = c(rep("Salaried",2), rep("Hourly", 2), rep("Salaried",2),"Hourly",rep("Salaried",2)),
Dept = c("Sales","IT","Sales","Sales", rep("IT",3),rep("Sales",2)),
Salary = c(100,1000,95,95,1500,1500,90,1200,1200))
I'm expecting my output to look like:
One option is the function pivot_wider() from the tidyr package:
df.wide <- tidyr::pivot_wider(df2,
                              names_from = c("Type", "Dept", "Year"),
                              values_from = "Salary",
                              values_fn = mean)
This should get you the desired result.
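To see why values_fn is needed here, a small sketch, assuming the df2 from the question: some combinations of employee, Type, Dept and Year occur more than once, so pivot_wider() has several salaries for one cell and needs a function to reduce them to a single value. The `dupes` object is just an illustrative name.

```r
library(dplyr)
library(tidyr)

df2 <- data.frame(
  emp_id = c(rep(1, 2), rep(2, 4), rep(3, 3)),
  Name   = c(rep("John", 2), rep("Kellie", 4), rep("Steve", 3)),
  Year   = c("2018", "2019", "2018", "2018", "2019", "2019", "2018", "2019", "2019"),
  Type   = c(rep("Salaried", 2), rep("Hourly", 2), rep("Salaried", 2),
             "Hourly", rep("Salaried", 2)),
  Dept   = c("Sales", "IT", "Sales", "Sales", rep("IT", 3), rep("Sales", 2)),
  Salary = c(100, 1000, 95, 95, 1500, 1500, 90, 1200, 1200)
)

# Combinations that appear more than once would collide in a single wide cell
dupes <- df2 %>%
  count(emp_id, Name, Type, Dept, Year) %>%
  filter(n > 1)
dupes

# values_fn = mean collapses each set of duplicates to one number
wide <- pivot_wider(df2,
                    names_from = c(Type, Dept, Year),
                    values_from = Salary,
                    values_fn = mean)
wide
```

If the duplicates are exact repeats, as they are in this sample, dplyr::distinct() before pivoting would be another option instead of averaging.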
What do you think about this output? It is not the expected output, but I find it easier to interpret the data this way.
library(dplyr)
df2 %>%
  group_by(Name, Year, Type, Dept) %>%
  summarise(mean = mean(Salary))
Output:
Name Year Type Dept mean
<chr> <chr> <chr> <chr> <dbl>
1 John 2018 Salaried Sales 100
2 John 2019 Salaried IT 1000
3 Kellie 2018 Hourly Sales 95
4 Kellie 2019 Salaried IT 1500
5 Steve 2018 Hourly IT 90
6 Steve 2019 Salaried Sales 1200

How can I transpose data in each variable from long to wide using group_by? R

I have a dataframe with id variable name. I'm trying to figure out a way to transpose each variable in the dataframe by name.
My current df is below:
name jobtitle companyname datesemployed empduration joblocation jobdescrip
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati…
2 David… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
3 David… Data An… NA Jan 2018 – J… 6 mos Belfast, U… Working wi…
However, I'd like a dataframe in which there is only one row for name, and every observation for name becomes its own column, like below:
name jobtitle_1 companyname_1 datesemployed_1 empduration_1 joblocation_1 jobdescrip_1 jobtitle_2 companyname_2 datesemployed_2 empduration_2 joblocation_2 jobdescrip_2
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
I have used commands like gather_by and melt in the past to reshape from long to wide, but in this case, I'm not sure how to apply it, since every observation for the id variable will need to become its own column.
It sounds like you are looking for gather and pivot_wider.
I used my own sample data with two names:
df <- tibble(name = c('David', 'David', 'David', 'Bill', 'Bill'),
jobtitle = c('PM', 'TPM', 'Analyst', 'Dev', 'Eng'),
companyname = c('EOS', 'Options', NA, 'Microsoft', 'Nintendo'))
First add an index column to distinguish the different positions for each name.
indexed <- df %>%
group_by(name) %>%
mutate(.index = row_number())
indexed
# name jobtitle companyname .index
# <chr> <chr> <chr> <int>
# 1 David PM EOS 1
# 2 David TPM Options 2
# 3 David Analyst NA 3
# 4 Bill Dev Microsoft 1
# 5 Bill Eng Nintendo 2
Then it is possible to use gather to get a long form, with one value per row.
gathered <- indexed %>% gather('var', 'val', -c(name, .index))
gathered
# name .index var val
# <chr> <int> <chr> <chr>
# 1 David 1 jobtitle PM
# 2 David 2 jobtitle TPM
# 3 David 3 jobtitle Analyst
# 4 Bill 1 jobtitle Dev
# 5 Bill 2 jobtitle Eng
# 6 David 1 companyname EOS
# 7 David 2 companyname Options
# 8 David 3 companyname NA
# 9 Bill 1 companyname Microsoft
# 10 Bill 2 companyname Nintendo
Now pivot_wider can be used to create a column for each variable and index.
gathered %>% pivot_wider(names_from = c(var, .index), values_from = val)
# name jobtitle_1 jobtitle_2 jobtitle_3 companyname_1 companyname_2 companyname_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 David PM TPM Analyst EOS Options NA
# 2 Bill Dev Eng NA Microsoft Nintendo NA
Get the data in long format, create a unique column identifier and get it back to wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name, col) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = c(col, row), values_from = value)

to set column name to row values in R

I have this type of table in R
April Tourist
2018 123
2018 222
I want my table to look like this:-
Month Year Domestic International Total
April 2018 123 222 345
I am new to R. I tried using melt and the rownames() function, but I could not get the result I want.
Based on your comment that you only have 2 rows in your data set here's a way to do this with dplyr and tidyr -
library(dplyr)
library(tidyr)
# tibble() replaces the deprecated data_frame()
df <- tibble(April = c(2018, 2018),
             Tourist = c(123, 222))
df %>%
  mutate(Type = c("Domestic", "International")) %>%
  gather(Month, Year, April) %>%
  spread(Type, Tourist) %>%
  mutate(
    Total = Domestic + International
  )
# A tibble: 1 x 5
Month Year Domestic International Total
<chr> <dbl> <dbl> <dbl> <dbl>
1 April 2018 123 222 345

Using conditions in group_by()/summarize() loop

I have a dataframe that looks something like this (I have a lot more years and variables):
Name State2014 State2015 State2016 Tuition2014 Tuition2015 Tuition2016 StateGrants2014
Jared CA CA MA 22430 23060 40650 5000
Beth CA CA CA 36400 37050 37180 4200
Steven MA MA MA 18010 18250 18720 NA
Lary MA CA MA 24080 30800 24600 6600
Tom MA OR OR 40450 15800 16040 NA
Alfred OR OR OR 23570 23680 23750 3500
Cathy OR OR OR 32070 32070 33040 4700
My objective (in this example) is to get the mean tuition for each state, and the sum of state grants for each state. My thought was to subset the data by year:
State2014 Tuition2014 StateGrants2014
CA 22430 5000
CA 36400 4200
MA 18010 NA
MA 24080 6600
MA 40450 NA
OR 23570 3500
OR 32070 4700
State2015 Tuition2015
CA 23060
CA 37050
MA 18250
CA 30800
OR 15800
OR 23680
OR 32070
State2016 Tuition2016
MA 40650
CA 37180
MA 18720
MA 24600
OR 16040
OR 23750
OR 33040
Then I would group_by state and summarize (and save each as a separate df) to get the following:
State2014 Tuition2014 StateGrants2014
CA 29415 9200
MA 27513 6600
OR 27820 8200
State2015 Tuition2015
CA 30303
MA 18250
OR 23850
State2016 Tuition2016
CA 37180
MA 27990
OR 24277
Then I would merge the by state. Here is my code:
years = c(2014,2015,2016)
for (i in seq_along(years)){
#grab the variables from a certain year and save as a new df.
df_year <- df[, grep(paste(years[[i]],"$",sep=""), colnames(df))]
#Take off the year from each variable name (to make it easier to summarize)
names(df_year) <- gsub(years[[i]], "", names(df_year), fixed = TRUE)
df_year <- df_year %>%
group_by(state) %>%
summarize(Tuition = mean(Tuition, na.rm = TRUE),
#this part of the code does not work. In this example, I only want to have this part if the year is 2016.
if (years[[i]]=='2016')
{Stategrant = mean(Stategrant, na.rm = TRUE)})
#rename df_year to df####
assign(paste("df",years[[i]],sep=''),df_year)
}
I have about 50 years of data, and a good number of variables, so I wanted to use a loop. So my question is, how do I add a conditional statement (summarize certain variables conditioned on the year) in the group_by()/summarize() function? Thanks!
*Edit: I realize that I could take the if{} out of the function, and do something like:
if (years[[i]]==2016){
df_year <- df_year %>%
group_by(state) %>%
summarize(Tuition = mean(Tuition, na.rm = TRUE),
Stategrant = mean(Stategrant, na.rm = TRUE))
#rename df_year to df####
assign(paste("df",years[[i]],sep=''),df_year)
}
else{
df_year <- df_year %>%
group_by(state) %>%
summarize(Tuition = mean(Tuition, na.rm = TRUE))
#rename df_year to df####
assign(paste("df",years[[i]],sep=''),df_year)
}
}
but there are just so many combinations of variables, that using a for loop would not be very efficient or useful.
This is so much easier with tidy data, so let me show you how to tidy up your data. See http://r4ds.had.co.nz/tidy-data.html.
library(tidyr)
library(dplyr)
df <- gather(df, key, value, -Name) %>%
# separate years from the variables
separate(key, c("var", "year"), sep = -5) %>%
# the above line splits up e.g. State2014 into State and 2014.
# It was written for an older tidyr, where sep = -5 split off the
# last four characters. In tidyr versions after 0.8.1, negative
# positions count from the far right, so use sep = -4 there.
# Please check that this works for your other variables
# in case your naming conventions are inconsistent.
spread(var, value) %>%
# turn numbers back to numeric
mutate_at(.vars = c("Tuition", "StateGrants"), as.numeric) %>%
gather(var, val, -Name, -year, -State) %>%
# group by the variables of interest. Note that `var` here
# refers to Tuition and StateGrants. If you have more variables,
# they will be included here as well. If you want to exclude more
# variables from being included here in `var`, add more "-colName"
# entries in the `gather` statement above
group_by(year, State, var) %>%
# summarize:
summarise(mean_values = mean(val))
This gives you:
Source: local data frame [18 x 4]
Groups: year, State [?]
year State var mean_values
<chr> <chr> <chr> <dbl>
1 2014 CA StateGrants 4600.00
2 2014 CA Tuition 29415.00
3 2014 MA StateGrants NA
4 2014 MA Tuition 27513.33
5 2014 OR StateGrants 4100.00
6 2014 OR Tuition 27820.00
7 2015 CA StateGrants NA
8 2015 CA Tuition 30303.33
9 2015 MA StateGrants NA
10 2015 MA Tuition 18250.00
11 2015 OR StateGrants NA
12 2015 OR Tuition 23850.00
13 2016 CA StateGrants NA
14 2016 CA Tuition 37180.00
15 2016 MA StateGrants NA
16 2016 MA Tuition 27990.00
17 2016 OR StateGrants NA
18 2016 OR Tuition 24276.67
If you don't like the shape of this, you can e.g. add an %>% spread(var, mean_values) behind the summarise statement to have the means for Tuition and StateGrants in different columns.
If you want to compute different functions for Tuition and Grants (e.g. the mean of Tuition and the sum of grants), you could do the following:
df <- gather(df, key, value, -Name) %>%
separate(key, c("var", "year"), sep = -5) %>%
spread(var, value) %>%
mutate_at(.vars = c("Tuition", "StateGrants"), as.numeric) %>%
group_by(year, State) %>%
summarise(Grant_Sum = sum(StateGrants, na.rm=T), Tuition_Mean = mean(Tuition) )
This gives you:
Source: local data frame [9 x 4]
Groups: year [?]
year State Grant_Sum Tuition_Mean
<chr> <chr> <dbl> <dbl>
1 2014 CA 9200 29415.00
2 2014 MA 6600 27513.33
3 2014 OR 8200 27820.00
4 2015 CA 0 30303.33
5 2015 MA 0 18250.00
6 2015 OR 0 23850.00
7 2016 CA 0 37180.00
8 2016 MA 0 27990.00
9 2016 OR 0 24276.67
Note that I used sum here, with na.rm = T, which returns 0 if all elements are NAs. Make sure this makes sense in your use case.
Also, just to mention it, to get your individual data.frames that you asked for, you can use filter(year == 2014) etc, as in df_2014 <- filter(df, year == 2014).
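The same tidying step can also be sketched with pivot_longer(), which avoids the position-based separate() call. This is only an illustration, not the answerer's code: it assumes every column other than Name is a run of letters followed by a four-digit year, and the data frame below is typed in from the table in the question.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Name = c("Jared", "Beth", "Steven", "Lary", "Tom", "Alfred", "Cathy"),
  State2014 = c("CA", "CA", "MA", "MA", "MA", "OR", "OR"),
  State2015 = c("CA", "CA", "MA", "CA", "OR", "OR", "OR"),
  State2016 = c("MA", "CA", "MA", "MA", "OR", "OR", "OR"),
  Tuition2014 = c(22430, 36400, 18010, 24080, 40450, 23570, 32070),
  Tuition2015 = c(23060, 37050, 18250, 30800, 15800, 23680, 32070),
  Tuition2016 = c(40650, 37180, 18720, 24600, 16040, 23750, 33040),
  StateGrants2014 = c(5000, 4200, NA, 6600, NA, 3500, 4700),
  stringsAsFactors = FALSE
)

# ".value" keeps State, Tuition and StateGrants as separate columns,
# while the four trailing digits go into a new year column
long <- df %>%
  pivot_longer(-Name,
               names_to = c(".value", "year"),
               names_pattern = "^([A-Za-z]+)(\\d{4})$")

by_state <- long %>%
  group_by(year, State) %>%
  summarise(Tuition_Mean = mean(Tuition),
            Grant_Sum = sum(StateGrants, na.rm = TRUE),
            .groups = "drop")
by_state
```

Because each variable keeps its own column type, no as.numeric() conversion is needed afterwards. Years without a StateGrants column simply get NA, which sum(..., na.rm = TRUE) turns into 0, as noted above.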
