Condense dataframe with conditions to certain columns


I've got a dataset that looks like this
df = data.frame(Site = c(rep('w',4),rep('x',5),rep('y',2),rep('z',1)),
Parent = c(rep('W Inc.',4),rep('X Inc.',5),rep('Y Inc.',2),rep('Z Inc.',1)),
Status = c(rep('Prospect',4),rep('Client',5),rep('Client',2),rep('Prospect',1)),
Country = c(rep('USA',4),rep('Canada',5),rep('Mexico',2),rep('China',1)),
ProductID = c('XP10','XP11','XP18','XP19','XP4','XP5','XP6','XP7','XP8','XP10','XP18','XP6'),
ProductName = c('10Rockets','11Rockets','18Rockets','19Rockets','4Rockets','5Rockets','6Rockets','7Rockets','8Rockets','10Rockets','18Rockets','6Rockets'),
ProductProvider= c('Provider A','Provider B','Provider A','Provider A',rep('Provider A',5),'Provider A','Provider B','Provider B'))
I'd like to condense it such that each Site is a unique row, and the last 3 columns are concatenated.
Also, I'd like to concatenate the last column such that if there are any repetitions, it takes only the unique values per Site and separates them with commas.
My attempt
library(dplyr)
output2 = df %>% group_by(Site,Parent,Status,Country) %>%
mutate(ProductID = paste(ProductID, collapse=",")) %>%
mutate(ProductName = paste(ProductName, collapse=",")) %>%
mutate(ProductProvider = unique(paste(ProductProvider, collapse=","))) %>%
distinct()
I'm almost there, but the last column seems to have repetitions of ProductProvider which I do not want.
Target Output
I'm looking for a target data set like this, with the last column concatenated and free of any repetitions. Any inputs would be appreciated.
output = data.frame(Site = c(rep('w',1),rep('x',1),rep('y',1),rep('z',1)),
Parent = c(rep('W Inc.',1),rep('X Inc.',1),rep('Y Inc.',1),rep('Z Inc.',1)),
Status = c(rep('Prospect',1),rep('Client',1),rep('Client',1),rep('Prospect',1)),
Country = c(rep('USA',1),rep('Canada',1),rep('Mexico',1),rep('China',1)),
ProductID = c('XP10,XP11,XP18,XP19','XP4,XP5,XP6,XP7,XP8','XP10,XP18','XP6'),
ProductName = c('10Rockets,11Rockets,18Rockets,19Rockets','4Rockets,5Rockets,6Rockets,7Rockets,8Rockets','10Rockets,18Rockets','6Rockets'),
ProductProvider= c('Provider A,Provider B','Provider A','Provider A,Provider B','Provider B'))

With dplyr:
library(dplyr)
result = df %>% group_by(Site, Parent, Status, Country) %>%
summarize(across(ProductProvider, ~paste(unique(.), collapse = ", ")),
across(everything(), ~paste(.x, collapse = ", ")))
result
# # A tibble: 4 x 7
# # Groups: Site, Parent, Status [4]
# Site Parent Status Country ProductProvider ProductID ProductName
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 w W Inc. Prospect USA Provider A, Provider B XP10, XP11, XP18, XP19 10Rockets, 11Rockets, 18Rockets, 19Rockets
# 2 x X Inc. Client Canada Provider A XP4, XP5, XP6, XP7, XP8 4Rockets, 5Rockets, 6Rockets, 7Rockets, 8Rockets
# 3 y Y Inc. Client Mexico Provider A, Provider B XP10, XP18 10Rockets, 18Rockets
# 4 z Z Inc. Prospect China Provider B XP6 6Rockets

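Short and sweet with aggregate. Note that the multi-valued columns come back as list columns of unique values, not as single strings.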
aggregate(. ~ Site, df, unique)
# Site Parent Status Country ProductID ProductName ProductProvider
# 1 w W Inc. Prospect USA XP10, XP11, XP18, XP19 10Rockets, 11Rockets, 18Rockets, 19Rockets Provider A, Provider B
# 2 x X Inc. Client Canada XP4, XP5, XP6, XP7, XP8 4Rockets, 5Rockets, 6Rockets, 7Rockets, 8Rockets Provider A
# 3 y Y Inc. Client Mexico XP10, XP18 10Rockets, 18Rockets Provider A, Provider B
# 4 z Z Inc. Prospect China XP6 6Rockets Provider B
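If plain comma-separated strings are wanted rather than list columns, a base R sketch could look like the following. It assumes the df from the question; the names keys, pasted, deduped and condensed are illustrative only. Only ProductProvider is de-duplicated, as the question asks.

```r
# Sample data from the question
df <- data.frame(
  Site = c(rep("w", 4), rep("x", 5), rep("y", 2), "z"),
  Parent = c(rep("W Inc.", 4), rep("X Inc.", 5), rep("Y Inc.", 2), "Z Inc."),
  Status = c(rep("Prospect", 4), rep("Client", 5), rep("Client", 2), "Prospect"),
  Country = c(rep("USA", 4), rep("Canada", 5), rep("Mexico", 2), "China"),
  ProductID = c("XP10", "XP11", "XP18", "XP19", "XP4", "XP5", "XP6", "XP7",
                "XP8", "XP10", "XP18", "XP6"),
  ProductName = c("10Rockets", "11Rockets", "18Rockets", "19Rockets", "4Rockets",
                  "5Rockets", "6Rockets", "7Rockets", "8Rockets", "10Rockets",
                  "18Rockets", "6Rockets"),
  ProductProvider = c("Provider A", "Provider B", "Provider A", "Provider A",
                      rep("Provider A", 5), "Provider A", "Provider B", "Provider B"),
  stringsAsFactors = FALSE
)

# Grouping columns (illustrative helper name)
keys <- df[c("Site", "Parent", "Status", "Country")]

# Concatenate every value for ProductID and ProductName
pasted <- aggregate(df[c("ProductID", "ProductName")], by = keys,
                    FUN = paste, collapse = ",")

# Concatenate only the unique values for ProductProvider
deduped <- aggregate(df["ProductProvider"], by = keys,
                     FUN = function(x) paste(unique(x), collapse = ","))

# Put both pieces back together, one row per Site
condensed <- merge(pasted, deduped, by = names(keys))
condensed <- condensed[order(condensed$Site), ]
condensed
```

The explicit order() call is there because aggregate() does not return the groups in the original row order.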

Related

How to pivot_wider the n unique values of variable A grouped_by variable B?

I am trying to pivot_wider() the column X of a data frame containing various persons names. Within group_by() another variable Y of the df there are always 2 of these names. I would like R to take the 2 unique X names values within each unique identifier of Y and put them in 2 new columns ex_X_Name_1 and ex_X_Name_2.
My data frame is looking like this:
df <- data.frame(Student = rep(c(17383, 16487, 17646, 2648, 3785), each = 2),
Referee = c("Paul Severe", "Cathy Nice", "Jean Exigeant", "Hilda Ehrlich", "John Rates",
"Eva Luates", "Fred Notebien", "Aldous Grading", "Hans Streng", "Anna Filaktic"),
Rating = format(round(x = sqrt(sample(15:95, 10, replace = TRUE)), digits = 3), nsmall = 3)
)
df
I would like to make the transformation of the Referee column to 2 new columns Referee_1 and Referee_2 with the 2 unique Referees assigned to each student and end with this result:
even_row_df <- as.logical(seq_len(length(df$Referee)) %% 2)
df_wanted <- tibble::tibble(
Student = unique(df$Student),
Referee_1 = df$Referee[even_row_df],
Rating_Ref_1 = df$Rating[even_row_df],
Referee_2 = df$Referee[!even_row_df],
Rating_Ref_2 = df$Rating[!even_row_df]
)
df_wanted
I guess I could achieve this by subsetting the unique rows of student/referee combinations and joining them, but is there a way to handle this in one call to pivot_wider?

You should create a row id per group first:
library(dplyr)
library(tidyr)
df %>%
group_by(Student) %>%
mutate(row_n = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = "row_n", values_from = c("Referee", "Rating"))
# A tibble: 5 × 5
Student Referee_1 Referee_2 Rating_1 Rating_2
<dbl> <chr> <chr> <chr> <chr>
1 17383 Paul Severe Cathy Nice 9.165 7.810
2 16487 Jean Exigeant Hilda Ehrlich 5.196 6.557
3 17646 John Rates Eva Luates 7.211 5.568
4 2648 Fred Notebien Aldous Grading 4.000 8.124
5 3785 Hans Streng Anna Filaktic 7.937 6.325

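Using data.table (the ratings differ from the output above because the sample() call in the question is not seeded):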
library(data.table)
setDT(df)
merge(df[, .SD[1], Student], df[, .SD[2], Student], by = "Student", suffixes = c("_1", "_2"))
# Student Referee_1 Rating_1 Referee_2 Rating_2
# 1: 2648 Fred Notebien 6.708 Aldous Grading 9.747
# 2: 3785 Hans Streng 6.245 Anna Filaktic 8.775
# 3: 16487 Jean Exigeant 7.681 Hilda Ehrlich 4.359
# 4: 17383 Paul Severe 4.583 Cathy Nice 7.616
# 5: 17646 John Rates 6.708 Eva Luates 8.246
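The same "number the rows within each group, then spread" idea can also be sketched in base R with reshape(). The Rating values below are fixed, made-up strings standing in for the question's random sample, and idx and wide are illustrative names.

```r
# Data shaped like the question's, with fixed ratings instead of sample()
df <- data.frame(
  Student = rep(c(17383, 16487, 17646, 2648, 3785), each = 2),
  Referee = c("Paul Severe", "Cathy Nice", "Jean Exigeant", "Hilda Ehrlich",
              "John Rates", "Eva Luates", "Fred Notebien", "Aldous Grading",
              "Hans Streng", "Anna Filaktic"),
  Rating = c("9.165", "7.810", "5.196", "6.557", "7.211",
             "5.568", "4.000", "8.124", "7.937", "6.325"),
  stringsAsFactors = FALSE
)

# Row id within each Student (1 for the first referee, 2 for the second)
df$idx <- ave(seq_len(nrow(df)), df$Student, FUN = seq_along)

# Spread Referee and Rating into one set of columns per row id
wide <- reshape(df, idvar = "Student", timevar = "idx",
                direction = "wide", sep = "_")
wide
```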

Combine variables AND create a data frame with a column with the name of those original variables with dplyr

I have a group of variables I created in my R environment:
x <- c("My name is Andrea and I live in Vancouver",
"I work at a university")
y <- c("My name is Andrew and I live in New York",
"I work at a hospital")
z <- c("My name is Alessia and I live in Rome",
"I work for the government")
I want to convert these "character" variables into tibbles and assign as a variable to each tibble the name of the dataset (so in this case, the names would be x, y and z).
Example of the tibble I'd like:
# A tibble: 6 × 2
value name
<chr> <chr>
1 My name is Andrea and I live in Vancouver "x"
2 I work at a university "x"
3 My name is Andrew and I live in New York "y"
4 I work at a hospital "y"
5 My name is Alessia and I live in Rome "z"
6 I work for the government "z"
Now the code here: test <- as_tibble(c(x,y,z)) %>% mutate(name = c("x", "y", "z")) doesn't work because of a mismatch in sizes.
This code does, but isn't in "dplyr" format:
xyz.list <- Hmisc::llist(x, y ,z)
df <- do.call(rbind, unname(Map(cbind, source = names(xyz.list), xyz.list)))
df <- df %>% as_tibble()
My question: is there a way to "merge" or combine different datasets AND create a column with the name of the original merged dataset using dplyr?

Another tidyverse option:
library(tidyverse)
lst(x, y, z) %>%
bind_rows() %>%
pivot_longer(everything()) %>%
arrange(name)
Or we could use map_df from purrr:
map_df(lst(x, y, z), ~ data.frame(value = .x), .id = "name")
Output
name value
<chr> <chr>
1 x My name is Andrea and I live in Vancouver
2 x I work at a university
3 y My name is Andrew and I live in New York
4 y I work at a hospital
5 z My name is Alessia and I live in Rome
6 z I work for the government

In base R:
stack(list(x = x, y = y, z= z))
values ind
1 My name is Andrea and I live in Vancouver x
2 I work at a university x
3 My name is Andrew and I live in New York y
4 I work at a hospital y
5 My name is Alessia and I live in Rome z
6 I work for the government z
In tidyverse:
library(tidyverse)
unnest(enframe(lst(x, y, z)), value)
# A tibble: 6 x 2
name value
<chr> <chr>
1 x My name is Andrea and I live in Vancouver
2 x I work at a university
3 y My name is Andrew and I live in New York
4 y I work at a hospital
5 z My name is Alessia and I live in Rome
6 z I work for the government

Compare a variable by state abbreviations

How can I compare a variable by state abbreviations?
My data set has 5 variables currently. One of them is Location, and it is written as: "Raleigh, NC"
I need to create a variable that contains the two-character state abbreviation for each observation, and afterward another to group them by state. Each observation is a college, with its classification (private/public), in-state and out-of-state tuition, and location.

This should work for you, if I understood your issue correctly.
Note: Please always share sample data using dput(your_dataset) or dput(head(your_dataset))
library(tidyverse)
d<- tibble(id = 1:3,
Location = c("Newyork, NY", "Raleigh, NC", "Delhi, IN"))
d %>% separate(Location, into = c("city", "state")) %>%
mutate_at(vars("city", "state"), str_trim)
# A tibble: 3 x 3
id city state
<int> <chr> <chr>
1 1 Newyork NY
2 2 Raleigh NC
3 3 Delhi IN
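One caveat: the default separator in separate() splits on any run of non-alphanumeric characters, so a multi-word city such as "New York, NY" would also be split at the space. A base R sketch that keeps only the text after the last comma, and then groups by it, is below. The colleges data frame and its InStateTuition values are made up for illustration.

```r
# Hypothetical sample data; the tuition figures are invented
colleges <- data.frame(
  Location = c("New York, NY", "Raleigh, NC", "Durham, NC"),
  InStateTuition = c(50000, 9000, 7000),
  stringsAsFactors = FALSE
)

# Drop everything up to and including the last comma (plus any spaces)
colleges$State <- sub("^.*,\\s*", "", colleges$Location)

# Compare a variable by state, here the mean in-state tuition
by_state <- aggregate(InStateTuition ~ State, data = colleges, FUN = mean)
by_state
```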

How do I extract part of the data from a column and paste it in another column using R?

I want to extract part of the data from a column and paste it into another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123

You can use separate() from tidyr, which is loaded as part of the tidyverse.
Note that I use NA rather than NULL inside the mobile vector, since c() silently drops NULL elements.
Also, I use the option stringsAsFactors = F when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
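Splitting on a single space assumes the place name is one word. A base R sketch that instead looks for a trailing "+digits" pattern is below; it assumes phone numbers always start with "+" and sit at the end of the string, and has_number and fill are illustrative names.

```r
data <- data.frame(
  names = c("Sia", "Ryan", "J", "Ricky"),
  country = c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi"),
  mobile = c(NA, "+3579514862", NA, "+5554848123"),
  stringsAsFactors = FALSE
)

# Rows whose country value ends in "+" followed by digits
has_number <- grepl("\\+[0-9]+$", data$country)

# Only fill mobile where it is currently missing
fill <- has_number & is.na(data$mobile)
data$mobile[fill] <- sub("^.*(\\+[0-9]+)$", "\\1", data$country[fill])

# Strip the number (and the space before it) from country
data$country <- sub("\\s*\\+[0-9]+$", "", data$country)
data
```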

Complex Data Frame calculations in R

I currently am importing two tables (in the most basic form) that appear as such
Table 1
State Month Account Value
NY Jan Expected Sales 1.04
NY Jan Expected Expenses 1.02
Table 2
State Month Account Value
NY Jan Sales 1,000
NY Jan Customers 500
NY Jan F Expenses 1,000
NY Jan V Expenses 100
And my end goal is to create a 3rd data frame that includes the values of the first two rows and calculates a 4th column based off of functions
NextYearExpenses = (t2 F Expenses + t2 V Expenses)* t1 Expected Expenses
NextYearSales = (t2 sales) * t1 Expected Sales
So my desired output is as follows
State Month New Account Value
NY Jan Sales 1,040
NY Jan Expenses 1,122
I am relatively new to R and I think ifelse statements might be my best bet. I have tried merging the tables and calculating with simple column functions but with no real progress.
Any suggestions?

You may need to do some data wrangling, but nothing out of the ordinary:
library(dplyr)
Table1<-tibble(State=c("NY","NY"), Month=c("Jan","Jan"), Account=c("Expected Sales", "Expected Expenses"), Value=c(1.04,1.02))
Table2<-tibble(State=c("NY","NY","NY","NY"), Month=c("Jan","Jan","Jan","Jan"), Account=c("Sales", "Customers", "F Expenses","V Expenses"), Value=c(1000,500,1000,100))
First I rename the accounts to a common name, i.e. Expenses, which makes it possible to join Table2 to Table1 later on:
Table2$Account[Table2$Account=="F Expenses"]<-"Expenses"
Table2$Account[Table2$Account=="V Expenses"]<-"Expenses"
Then I use group_by to group by State, Month and Account, and sum the values:
Table2 <- Table2 %>% group_by(State, Month,Account) %>%
summarise(Tot_Value=sum(Value)) %>% ungroup()
head(Table2)
## State Month Account Tot_Value
## <chr> <chr> <chr> <dbl>
## 1 NY Jan Customers 500
## 2 NY Jan Expenses 1100
## 3 NY Jan Sales 1000
Then I do a similar renaming for the accounts in Table1:
Table1$Account[Table1$Account=="Expected Sales"]<-"Sales"
Table1$Account[Table1$Account=="Expected Expenses"]<-"Expenses"
Merge into a third table, Table3, joining on the shared columns:
Table3 <- left_join(Table1, Table2, by = c("State", "Month", "Account"))
Then use mutate to do the needed operation:
Table3 <- Table3 %>% mutate(Value2=Value*Tot_Value)
head(Table3)
## # A tibble: 2 x 6
## State Month Account Value Tot_Value Value2
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 NY Jan Sales 1.04 1000 1040
## 2 NY Jan Expenses 1.02 1100 1122

Here's what I did with dplyr and tidyr.
First I combined your initial tables (called df and df2 below) with rbind into a single long-format table. Since each Account value is a unique identifier, these don't need to be separate tables. Next I group_by State and Month, assuming you'll eventually have a variety of states and months. Then I summarise based on the Account values you specified, creating two new columns. Finally, to get the long format you want, I used gather from tidyr to go from wide to long. You can run the pipeline in smaller chunks by deleting everything after a given %>% to get a better idea of what each step does.
library(dplyr)
library(tidyr)
rbind(df,df2) %>%
group_by(State,Month) %>%
summarise(Expenses = (Value[which(Account == "F Expenses")] + Value[which(Account == "V Expenses")]) * Value[which(Account == "Expected Expenses")],
Sales = Value[which(Account == "Sales")] * Value[which(Account == "Expected Sales")]) %>%
gather(New_Account,Value, c(Expenses,Sales))
# A tibble: 2 x 4
# Groups: State [1]
# State Month New_Account Value
# <chr> <chr> <chr> <dbl>
#1 NY Jan Expenses 1122
#2 NY Jan Sales 1040

I'd recommend checking out the concept of "tidy data", as there are some real challenges with working on data with the structure you currently have. E.g. creating t3 should only take 2-3 lines of code, all of this is just to work around your data architecture:
library(tidyverse)
t1 <- data.frame(State = rep("NY", 2),
Month = rep(as.Date("2018-01-01"), 2),
Account = c("Expected Sales", "Expected Expenses"),
Value = c(1.04, 1.02),
stringsAsFactors = FALSE)
t2 <- data.frame(State = rep("NY", 4),
Month = rep(as.Date("2018-01-01"), 4),
Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
Value = c(1000, 500, 1000, 100),
stringsAsFactors = FALSE)
t3 <- t2 %>%
spread(Account, Value) %>%
inner_join({
t1 %>%
spread(Account, Value)
}, by = c("State" = "State", "Month" = "Month")) %>%
mutate(NewExpenses = (`F Expenses` + `V Expenses`) * `Expected Expenses`,
NewSales = Sales * `Expected Sales`) %>%
select(State, Month, Sales = NewSales, Expenses = NewExpenses) %>%
gather(Sales, Expenses, key = `New Account`, value = Value)
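For comparison, the same calculation can be sketched in base R without reshaping to wide format. The example assumes the two tables from the question with Value already numeric; totals and the _factor/_total suffixes are illustrative names.

```r
t1 <- data.frame(State = "NY", Month = "Jan",
                 Account = c("Expected Sales", "Expected Expenses"),
                 Value = c(1.04, 1.02),
                 stringsAsFactors = FALSE)
t2 <- data.frame(State = "NY", Month = "Jan",
                 Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
                 Value = c(1000, 500, 1000, 100),
                 stringsAsFactors = FALSE)

# Fold fixed and variable expenses into one account, then total per account
t2$Account[t2$Account %in% c("F Expenses", "V Expenses")] <- "Expenses"
totals <- aggregate(Value ~ State + Month + Account, data = t2, FUN = sum)

# Strip the "Expected " prefix so the account names line up for the merge
t1$Account <- sub("^Expected ", "", t1$Account)

# Inner join drops Customers, which has no growth factor in t1
t3 <- merge(t1, totals, by = c("State", "Month", "Account"),
            suffixes = c("_factor", "_total"))
t3$Value <- t3$Value_factor * t3$Value_total
t3[c("State", "Month", "Account", "Value")]
```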
