Joining tables and applying functions to columns with the same name in R and tidyverse - r

I am looking to join tables with customer id (easy enough) but then I want to multiply the columns to get updated values.
Customer_Week_1<-data.frame(First_name=c("John","Mary","David","Paul"),
Last_name=c("Jackson","Smith","Williams", "Zimmerman"),
Factor_1=c(2,5,8,9),
Factor_2=c(.5,.5,.75,.75),
Factor_3=c(0,1,2,3))
Customer_Week_2<-data.frame(First_name=c("John","Mary","David","Paul"),
Last_name=c("Jackson","Smith","Williams", "Zimmerman"),
Factor_1=c(3,7,1,7),
Factor_2=c(.51,.65,.72,.4),
Factor_3=c(1,2,3,4))
Customer_week3<-Customer_Week_1%>%
left_join(Customer_Week_2, by = c("First_name","Last_name"))
The expected results can be computed directly with:
Customer_week3_expected<-Customer_Week_1[,3:5]*Customer_Week_2[,3:5]
And I know I can just manually type out every column. But I have dozens of columns and need to make this code as easy to follow as possible.
I also know that I can just bind the results vector to
Customer_week3<-Customer_Week_1%>%
left_join(Customer_Week_2, by = c("First_name","Last_name"))%>%
select(1:2)
But that does not look like best practice to me, and I would rather do this with a join in some way, to ensure everything lines up when I am iterating over the customers (tables).

Assuming I understand the output you're trying to get, I can think of two methods. If you know that the names are in the first two columns and are the same in both data frames (this might not be the case in real life), you can use the same multiplication operation you tried above, bound to the first two columns of either of the data frames.
cbind(Customer_Week_1[1:2], Customer_Week_1[-1:-2] * Customer_Week_2[-1:-2])
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> 1 John Jackson 6 0.255 0
#> 2 Mary Smith 35 0.325 2
#> 3 David Williams 8 0.540 6
#> 4 Paul Zimmerman 63 0.300 12
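The positional `cbind` above only works if both data frames list the customers in the same order. Here is a minimal base R sketch of a join-based alternative. It assumes every non-key column appears in both data frames under the same name. The shuffled week 2 rows and the helper names (`keys`, `factor_cols`, `week3`) are illustrative and not part of the original question.

```r
# Week 1 as in the question; week 2 deliberately shuffled to show that
# the join, not the row position, aligns the customers.
Customer_Week_1 <- data.frame(
  First_name = c("John", "Mary", "David", "Paul"),
  Last_name  = c("Jackson", "Smith", "Williams", "Zimmerman"),
  Factor_1 = c(2, 5, 8, 9),
  Factor_2 = c(.5, .5, .75, .75),
  Factor_3 = c(0, 1, 2, 3),
  stringsAsFactors = FALSE
)
Customer_Week_2 <- data.frame(
  First_name = c("Paul", "John", "David", "Mary"),
  Last_name  = c("Zimmerman", "Jackson", "Williams", "Smith"),
  Factor_1 = c(7, 3, 1, 7),
  Factor_2 = c(.4, .51, .72, .65),
  Factor_3 = c(4, 1, 3, 2),
  stringsAsFactors = FALSE
)

keys <- c("First_name", "Last_name")
factor_cols <- setdiff(names(Customer_Week_1), keys)

# Join on the keys; shared value columns get the .x / .y suffixes
joined <- merge(Customer_Week_1, Customer_Week_2, by = keys,
                suffixes = c(".x", ".y"))

# Multiply all .x columns by their .y counterparts in one step
products <- joined[paste0(factor_cols, ".x")] * joined[paste0(factor_cols, ".y")]
names(products) <- factor_cols

week3 <- cbind(joined[keys], products)
week3
```

Because the column names are derived from the data rather than typed out, the same code should carry over to dozens of factor columns.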
Or you can be more verbose but maybe more flexible, and reshape to a long data frame, then do a grouped operation to summarize products for each person and factor. Starting from the join you have above:
library(dplyr)
library(tidyr)
Customer_week3 <- Customer_Week_1 %>%
left_join(Customer_Week_2, by = c("First_name", "Last_name"))
Make long-shaped data, separate the Factor_1.x into Factor_1 and x, and make products as your summary calculation.
products <- Customer_week3 %>%
gather(key = factor, value = value, -First_name, -Last_name) %>%
separate(factor, into = c("factor", "week"), sep = "\\.") %>%
group_by(First_name, Last_name, factor) %>%
summarise(value = prod(value))
head(products)
#> # A tibble: 6 x 4
#> # Groups: First_name, Last_name [2]
#> First_name Last_name factor value
#> <fct> <fct> <chr> <dbl>
#> 1 David Williams Factor_1 8
#> 2 David Williams Factor_2 0.54
#> 3 David Williams Factor_3 6
#> 4 John Jackson Factor_1 6
#> 5 John Jackson Factor_2 0.255
#> 6 John Jackson Factor_3 0
If you need to get back to a wide format, spread back.
products %>%
spread(key = factor, value = value)
#> # A tibble: 4 x 5
#> # Groups: First_name, Last_name [16]
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 David Williams 8 0.54 6
#> 2 John Jackson 6 0.255 0
#> 3 Mary Smith 35 0.325 2
#> 4 Paul Zimmerman 63 0.3 12
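In tidyr 1.0.0 and later, `gather` and `spread` are superseded by `pivot_longer` and `pivot_wider`. A sketch of the same pipeline with the newer functions follows. It assumes the default `.x`/`.y` suffixes from `left_join`; the name `products_wide` is made up for this example.

```r
library(dplyr)
library(tidyr)

Customer_Week_1 <- data.frame(
  First_name = c("John", "Mary", "David", "Paul"),
  Last_name  = c("Jackson", "Smith", "Williams", "Zimmerman"),
  Factor_1 = c(2, 5, 8, 9),
  Factor_2 = c(.5, .5, .75, .75),
  Factor_3 = c(0, 1, 2, 3)
)
Customer_Week_2 <- data.frame(
  First_name = c("John", "Mary", "David", "Paul"),
  Last_name  = c("Jackson", "Smith", "Williams", "Zimmerman"),
  Factor_1 = c(3, 7, 1, 7),
  Factor_2 = c(.51, .65, .72, .4),
  Factor_3 = c(1, 2, 3, 4)
)

products_wide <- Customer_Week_1 %>%
  left_join(Customer_Week_2, by = c("First_name", "Last_name")) %>%
  # Split names such as "Factor_1.x" into factor = "Factor_1", week = "x"
  pivot_longer(
    cols = -c(First_name, Last_name),
    names_to = c("factor", "week"),
    names_sep = "\\."
  ) %>%
  group_by(First_name, Last_name, factor) %>%
  summarise(value = prod(value), .groups = "drop") %>%
  # Back to one column per factor
  pivot_wider(names_from = factor, values_from = value)

products_wide
```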

Similar to @camille's reshaping, but in data.table (and disregarding Customer_week3):
library(data.table)
# long format
long = rbindlist(list(Customer_Week_1, Customer_Week_2), id=TRUE)
# aggregate
long[, lapply(.SD, prod), by=.(First_name, Last_name), .SDcols=patterns("^Factor")]
First_name Last_name Factor_1 Factor_2 Factor_3
1: John Jackson 6 0.255 0
2: Mary Smith 35 0.325 2
3: David Williams 8 0.540 6
4: Paul Zimmerman 63 0.300 12
Going longer (again as seen in @camille's answer) might also make sense, so as to avoid repeatedly fiddling with names of Factor_* columns:
longer = melt(long, meas=patterns("^Factor")) # analogous to gather
longer[, .(value = prod(value)), by=.(First_name, Last_name, variable)]

Related

I need help merging two rows based on certain string character, the string is complaint

I am trying to calculate the fraction of the construction noise per zip code across NY city. The data is from NYC 311.
I am using dplyr and have grouped the data per zip.
However, I am having difficulty merging the rows for the complaint column. I need to group the data by the string "construction", which can appear anywhere in the value: at the beginning, in the middle, or at the end.
My solution so far (this is just the beginning):
comp_types <- df %>% select(complaint_type,descriptor,incident_zip) %>%
group_by(incident_zip)
Can you help me merge the rows if a unique value in descriptor contains any construction value?
Can you clarify what you mean by "merging"? I don't think you actually want to merge because you only have one dataframe. The term "merging" is used to describe the joining of two dataframes.
See ?base::merge:
Merge two data frames by common columns or row names, or do other versions of database join operations.
If I understand correctly, you want to look into the descriptor variable and see if it contains the string "construction" anywhere in the cell, so you can determine if the person's complaint was construction-related; same for "music". I don't believe you need to use complaint_type since complaint_type never contains the string "construction" or "music"; only descriptor does.
You can use a combination of ifelse and grepl to create a new variable that indicates whether the complaint was construction-related, music-related, or other.
library(tidyverse)
library(janitor)
url <- "https://data.cityofnewyork.us/api/views/p5f6-bkga/rows.csv"
df <- read.csv(url, nrows = 10000) %>%
clean_names() %>%
select(complaint_type, descriptor, incident_zip)
comp_types <- df %>%
select(complaint_type, descriptor, incident_zip) %>%
group_by(incident_zip)
head(comp_types)
#> # A tibble: 6 × 3
#> # Groups: incident_zip [6]
#> complaint_type descriptor incident_zip
#> <chr> <chr> <int>
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
table(df$complaint_type)
#>
#> Noise Noise - Commercial Noise - Helicopter
#> 555 591 145
#> Noise - House of Worship Noise - Park Noise - Residential
#> 20 72 5675
#> Noise - Street/Sidewalk Noise - Vehicle
#> 2040 902
df <- df %>%
mutate(descriptor_misc = ifelse(grepl("Construction", descriptor), "Construction",
ifelse(grepl("Music", descriptor), "Music", "Other")))
df %>%
group_by(descriptor_misc) %>%
count()
#> # A tibble: 3 × 2
#> # Groups: descriptor_misc [3]
#> descriptor_misc n
#> <chr> <int>
#> 1 Construction 328
#> 2 Music 6354
#> 3 Other 3318
head(df)
#> complaint_type descriptor incident_zip
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
#> descriptor_misc
#> 1 Other
#> 2 Music
#> 3 Other
#> 4 Music
#> 5 Music
#> 6 Construction
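The question also asks for the fraction of construction complaints per zip code, which the classification above does not compute yet. A minimal base R sketch of that last step follows. The data frame `complaints` and its rows are invented for illustration, and `ignore.case = TRUE` is an added assumption so that lower-case spellings match as well.

```r
# Toy data standing in for the NYC 311 extract (values are made up)
complaints <- data.frame(
  incident_zip = c(11364, 11222, 11222, 11238, 11238, 11238),
  descriptor = c(
    "Banging/Pounding",
    "Loud Music/Party",
    "Noise: Construction Before/After Hours (NM1)",
    "Noise: Construction Before/After Hours (NM1)",
    "Loud Music/Party",
    "construction noise at night"
  ),
  stringsAsFactors = FALSE
)

# TRUE wherever "construction" appears anywhere in the descriptor
complaints$is_construction <- grepl("construction", complaints$descriptor,
                                    ignore.case = TRUE)

# The mean of a logical vector is the share of TRUE values,
# so this gives the construction fraction per zip code
fraction_by_zip <- tapply(complaints$is_construction,
                          complaints$incident_zip, mean)
fraction_by_zip
```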

Add multiple columns with the same group and sum

I've got this dataframe and I want to add the last two columns to another dataframe by summing them and grouping them by "Full.Name"
# A tibble: 6 x 5
# Groups: authority_dic, Full.Name [6]
authority_dic Full.Name Entity `2019` `2020`
<chr> <chr> <chr> <int> <int>
1 accomplished Derek J. Leathers WERNER ENTERPRISES INC 1 0
2 accomplished Dirk Van de Put MONDELEZ INTERNATIONAL INC 0 1
3 accomplished Eileen P. Drake AEROJET ROCKETDYNE HOLDINGS 1 0
4 accomplished G. Michael Sievert T-MOBILE US INC 0 3
5 accomplished Gary C. Kelly SOUTHWEST AIRLINES 0 1
6 accomplished James C. Fish, Jr. WASTE MANAGEMENT INC 1 0
This is the dataframe I want to add the two columns to. As you can see, the "Full.Name" column acts as the grouping column.
# A tibble: 6 x 3
# Groups: Full.Name [6]
Full.Name `2019` `2020`
<chr> <int> <int>
1 A. Patrick Beharelle 5541 3269
2 Aaron P. Graft 165 200
3 Aaron P. Jagdfeld 4 5
4 Adam H. Schechter 147 421
5 Adam P. Symson 1031 752
6 Adena T. Friedman 1400 1655
I can add one column using the following piece of code, but if I want to do it with the second one, it overwrites my existing one and I am only left with one instead of two columns added.
narc_auth_total <- narc_auth %>% group_by(Full.Name) %>% summarise(`2019_words` = sum(`2019`)) %>% left_join(totaltweetsyear, ., by = "Full.Name")
The output for this command looks like this:
# A tibble: 6 x 4
# Groups: Full.Name [6]
Full.Name `2019` `2020` `2019_words`
<chr> <int> <int> <int>
1 A. Patrick Beharelle 5541 3269 88
2 Aaron P. Graft 165 200 2
3 Aaron P. Jagdfeld 4 5 0
4 Adam H. Schechter 147 421 2
5 Adam P. Symson 1031 752 15
6 Adena T. Friedman 1400 1655 21
I want to do the same thing and add the 2020_words column to the same dataframe. I just cannot do it, but it cannot be that hard to do so. It should be summarized as well, just like the 2019_words column. When I add "2020" to my command, it says object "2020" not found.
Thanks in advance.
If I have understood you well, this will solve your problem:
narc_auth_total <-
narc_auth %>%
group_by(Full.Name) %>%
summarise(
`2019_words` = sum(`2019`),
`2020_words` = sum(`2020`)
) %>%
left_join(totaltweetsyear, ., by = "Full.Name")
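The "object not found" error probably came from chaining a second `summarise()` after the first: `summarise()` keeps only the grouping columns and the newly computed summaries, so the `2020` column no longer exists at that point. With many year columns, `across()` saves writing one line per year. The sketch below uses small made-up stand-ins for `narc_auth` and `totaltweetsyear`; the intermediate name `word_totals` is hypothetical.

```r
library(dplyr)

# Made-up stand-ins for the asker's data frames
narc_auth <- tibble(
  Full.Name = c("Aaron P. Graft", "Aaron P. Graft", "Adam P. Symson"),
  Entity    = c("A", "B", "C"),
  `2019`    = c(1L, 1L, 15L),
  `2020`    = c(0L, 3L, 4L)
)
totaltweetsyear <- tibble(
  Full.Name = c("Aaron P. Graft", "Adam P. Symson"),
  `2019`    = c(165L, 1031L),
  `2020`    = c(200L, 752L)
)

# Sum both year columns in one summarise() call;
# .names appends "_words" to each original column name
word_totals <- narc_auth %>%
  group_by(Full.Name) %>%
  summarise(across(c(`2019`, `2020`), sum, .names = "{.col}_words"),
            .groups = "drop")

narc_auth_total <- left_join(totaltweetsyear, word_totals, by = "Full.Name")
narc_auth_total
```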

How can I transpose data in each variable from long to wide using group_by? R

I have a dataframe with id variable name. I'm trying to figure out a way to transpose each variable in the dataframe by name.
My current df is below:
name jobtitle companyname datesemployed empduration joblocation jobdescrip
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati…
2 David… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
3 David… Data An… NA Jan 2018 – J… 6 mos Belfast, U… Working wi…
However, I'd like a dataframe in which there is only one row for name, and every observation for name becomes its own column, like below:
name jobtitle_1 companyname_1 datesemployed_1 empduration_1 joblocation_1 jobdescrip_1 jobtitle_2 companyname_2 datesemployed_2 empduration_2 joblocation_2 jobdescrip_2
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
I have used commands like gather_by and melt in the past to reshape from long to wide, but in this case, I'm not sure how to apply it, since every observation for the id variable will need to become its own column.
It sounds like you are looking for gather and pivot_wider.
I used my own sample data with two names:
df <- tibble(name = c('David', 'David', 'David', 'Bill', 'Bill'),
jobtitle = c('PM', 'TPM', 'Analyst', 'Dev', 'Eng'),
companyname = c('EOS', 'Options', NA, 'Microsoft', 'Nintendo'))
First add an index column to distinguish the different positions for each name.
indexed <- df %>%
group_by(name) %>%
mutate(.index = row_number())
indexed
# name jobtitle companyname .index
# <chr> <chr> <chr> <int>
# 1 David PM EOS 1
# 2 David TPM Options 2
# 3 David Analyst NA 3
# 4 Bill Dev Microsoft 1
# 5 Bill Eng Nintendo 2
Then it is possible to use gather to get a long form, with one value per row.
gathered <- indexed %>% gather('var', 'val', -c(name, .index))
gathered
# name .index var val
# <chr> <int> <chr> <chr>
# 1 David 1 jobtitle PM
# 2 David 2 jobtitle TPM
# 3 David 3 jobtitle Analyst
# 4 Bill 1 jobtitle Dev
# 5 Bill 2 jobtitle Eng
# 6 David 1 companyname EOS
# 7 David 2 companyname Options
# 8 David 3 companyname NA
# 9 Bill 1 companyname Microsoft
# 10 Bill 2 companyname Nintendo
Now pivot_wider can be used to create a column for each variable and index.
gathered %>% pivot_wider(names_from = c(var, .index), values_from = val)
# name jobtitle_1 jobtitle_2 jobtitle_3 companyname_1 companyname_2 companyname_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 David PM TPM Analyst EOS Options NA
# 2 Bill Dev Eng NA Microsoft Nintendo NA
Get the data in long format, create a unique column identifier and get it back to wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name, col) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = c(col, row), values_from = value)
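The intermediate long format can also be skipped, since `pivot_wider` accepts several columns in `values_from`. A sketch using the sample data from the first answer follows. The index column name `idx` is arbitrary. One possible advantage is that the value columns never have to be stacked into a single column, so they do not need to share a type.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  name        = c("David", "David", "David", "Bill", "Bill"),
  jobtitle    = c("PM", "TPM", "Analyst", "Dev", "Eng"),
  companyname = c("EOS", "Options", NA, "Microsoft", "Nintendo")
)

wide <- df %>%
  group_by(name) %>%
  mutate(idx = row_number()) %>%   # 1, 2, 3, ... within each name
  ungroup() %>%
  # One output column per (value column, idx) pair, named like jobtitle_1
  pivot_wider(names_from = idx, values_from = c(jobtitle, companyname))

wide
```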

How to find search words from a table, in another table, and then create new columns of the results?

I'm trying to find specific words, listed in a tibble arbeit, in another tibble column rawEng$Text. If a word or words are found, I want to create (or mutate) a new data frame iDataArbeit with two new columns: one for the found words, wArbeit, and one for the sum of their tf-idf scores from arbeit$tfidf, iArbeit.
My Data:
arbeit:
X1 feature tfidf
<dbl> <chr> <dbl>
1 0 sick 0.338
2 2 contract 0.188
3 3 pay 0.175
4 4 job 0.170
5 5 boss 0.169
6 6 sozialversicherungsnummer 0.169
rawEng:
Gender Gruppe Datum Text
<chr> <chr> <dttm> <chr>
1 F Berlin Expats 2017-07-07 00:00:00 Anyone out there who's had to apply for Führung~
2 F FAB 2018-01-18 00:00:00 Dear FAB, I am in need of a Führungszeugnis no ~
3 M Free Advice ~ 2017-01-30 00:00:00 Dear Friends, i would like to ask you how can I~
4 M FAB 2018-04-12 00:00:00 "Does anyone know why the \"Standesamt Pankow (~
5 F Berlin Expats 2018-11-12 00:00:00 having trouble finding consistent information a~
6 F Toytown Berl~ 2017-06-08 00:00:00 "Hello\r\n\r\nI have a question regarding Airbn~
I've tried with dplyr::mutate, using this code:
idataEnArbeit <- mutate(rawEng, wArbeit = ifelse((str_count(rawEng$Text, arbeit$feature))>=1,
arbeit$feature, NA),
iArbeit = ifelse((str_count(rawEng$Text, arbeit$feature))>=1,
arbeit$tfidf, NA))
but all I get is one word, and its tf-idf score, in the new columns iDataArbeit$wArbeit and iDataArbeit$iArbeit:
Gender Gruppe Datum Text wArbeit iArbeit
<chr> <chr> <dttm> <chr> <chr> <dbl>
1 F Berlin | Girl ~ 2018-09-11 13:22:05 "11 septembre, 13:21 GGI ~ sick 0.338
2 F ExpatBabies Be~ 2017-10-19 16:24:23 "16:24 Babysitter needed! B~ sick 0.338
3 F Berlin | Girl ~ 2018-06-22 18:24:19 "gepostet. Leonor Valen~ sick 0.338
4 F 'Neu in Berlin' 2018-09-18 23:19:51 "Hello guys, I am working wit~ sick 0.338
5 M Free Advice Be~ 2018-04-27 08:49:24 "In need of legal advice: Wha~ sick 0.338
6 F Free Advice Be~ 2018-07-04 18:33:03 "Is there somebody I can pay ~ sick 0.338
In summary: I want all words from arbeit$feature which are found in rawEng$Text to be added to iDataArbeit$wArbeit, and the sum of their tf-idf scores to be added to iDataArbeit$iArbeit.
Since I don't have your data, I'll import the gutenbergr library and play w/ Treasure Island.
library(dplyr)
library(tidytext)
library(gutenbergr)
## Now get the dataset
Treasure_Island <- gutenberg_works(title == "Treasure Island") %>% pull(gutenberg_id) %>%
gutenberg_download(.)
## and construct a toy arbeit:
arbeit <- data.frame(feature = c("island", "treasure", "to"),
tfidf = c(0.3,0.5,0.6))
## Break the text into individual words (the head() is just to keep the example short; omit it in practice)
tidy_treasure <- unnest_tokens(Treasure_Island, feature, text, drop = FALSE) %>%
head(500)
## now bring the tfidf into tidy_treasure
df <- left_join(tidy_treasure, arbeit, by = "feature")
## and now you can sum the scores per line of text as usual.
## To get the words we have to throw out the words that don't contribute to our tfidf.
## Two options:
df %>% filter(!is.na(tfidf)) %>% group_by(text) %>% summarize(AveTFIDF = sum(tfidf, na.rm = TRUE),
Words = paste(feature, collapse = ";"))
## Or if you want to keep a row for each found word, we can't use summarize, but we can still add them all up.
df %>% filter(!is.na(tfidf)) %>% group_by(text) %>% mutate(AveTFIDF = sum(tfidf, na.rm = TRUE))
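For comparison, here is a base R sketch that works directly on data shaped like the question's, without tokenizing. The rows of `arbeit` and `rawEng` below are shortened, made-up stand-ins. Matching is done on lower-cased text with whole-word regular expressions, which is an assumption about what should count as "found". The helper name `found` is hypothetical.

```r
arbeit <- data.frame(
  feature = c("sick", "contract", "pay"),
  tfidf   = c(0.338, 0.188, 0.175),
  stringsAsFactors = FALSE
)
rawEng <- data.frame(
  Text = c("I am sick and my contract ends next month",
           "Is there somebody I can pay to help?",
           "Looking for a flat in Berlin"),
  stringsAsFactors = FALSE
)

# For each text, keep the rows of arbeit whose feature occurs as a whole word
found <- lapply(tolower(rawEng$Text), function(txt) {
  hits <- vapply(arbeit$feature,
                 function(w) grepl(paste0("\\b", w, "\\b"), txt, perl = TRUE),
                 logical(1), USE.NAMES = FALSE)
  arbeit[hits, ]
})

# All matched words joined with ";", or NA when nothing matched
rawEng$wArbeit <- vapply(found, function(hit) {
  if (nrow(hit) == 0) NA_character_ else paste(hit$feature, collapse = ";")
}, character(1))

# Sum of the matched tf-idf scores, or NA when nothing matched
rawEng$iArbeit <- vapply(found, function(hit) {
  if (nrow(hit) == 0) NA_real_ else sum(hit$tfidf)
}, numeric(1))

rawEng
```

Each feature is counted once per text here, no matter how often it occurs. The tokenizing approach above instead adds the score once per occurrence.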

In R, how can I randomly choose two out of three names 500 times, with balanced selections?

I know I can use complete_ra from the randomizr package to randomly and equally allocate to one of three "arms" (in this case "arms" are just names of people)
library(randomizr)
set.seed(100)
names <- complete_ra(N = 500, num_arms = 3)
#each "arm" is chosen ~167 times
#Now put the names in
library(plyr)
df <- transform(df,
names=revalue(names,c("T1"="Luis", "T2"="Conor","T3"="Dafydd")))
But what I need is to actually assign the 500 samples to a randomly chosen two of the three names. So I need my dataset to be:
ID# Name1 Name2
1 Conor Luis
2 Conor Dafydd
3 Luis Dafydd
...
500 Conor Luis
and at the end I need each of the 3 to still be chosen an equal amount.
A workaround: since there are 3 names, there are also 3 combinations, so I could simply replace Conor with "Conor and Luis", Luis with "Luis and Dafydd", and Dafydd with "Conor and Dafydd". But I'm sure there's a more elegant way that would allow for other combinations (like choosing 2 out of 4 names). I also don't like the workaround because currently each name can show up 8 times in a row, for example, which means we would have the exact same pair 8 times in a row. I think a more elegant method of randomly choosing 2 out of the 3 names would result in fewer "in a row" cases.
The canonical way to select n elements from a list (without replacement here) would be sample. Here is a simple way to create 500 such samples and transform the result into a data.frame:
set.seed(100)
names <- c("Luis", "Conor", "Dafydd")
samples <- lapply(1:500, function(x) sample(names, 2))
head(as.data.frame(matrix(unlist(samples), ncol = 2, byrow = TRUE)))
#> V1 V2
#> 1 Luis Dafydd
#> 2 Conor Luis
#> 3 Conor Luis
#> 4 Dafydd Luis
#> 5 Conor Luis
#> 6 Conor Dafydd
Created on 2019-03-15 by the reprex package (v0.2.1)
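Independent draws with `sample` are balanced only in expectation, so the three names will usually not come out exactly equal. If near-exact balance is required, one option is to enumerate all pairs with `combn`, repeat them as evenly as possible, and shuffle. The sketch below is one way to do this, not the randomizr approach from the question. The names `people`, `pair_index` and `assignments` are made up, and it generalizes to choosing 2 out of 4 names by extending `people`.

```r
set.seed(100)
people <- c("Luis", "Conor", "Dafydd")
n <- 500

# All unordered pairs, one per row
pairs <- t(combn(people, 2))

# Repeat the pair indices as evenly as possible, then shuffle their order
pair_index <- sample(rep(seq_len(nrow(pairs)), length.out = n))

# Randomly decide which member of each pair is listed first
swap <- sample(c(TRUE, FALSE), n, replace = TRUE)

assignments <- data.frame(
  ID    = seq_len(n),
  Name1 = ifelse(swap, pairs[pair_index, 2], pairs[pair_index, 1]),
  Name2 = ifelse(swap, pairs[pair_index, 1], pairs[pair_index, 2]),
  stringsAsFactors = FALSE
)

# How often each name was chosen overall
table(c(assignments$Name1, assignments$Name2))
```

Since 500 is not divisible by 3, the pair counts differ by at most one, and so do the per-name totals.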
Here's a fun approach with randomizr and tidyverse. It treats each person as a block of two observations, then uses pivot_wider to reshape the data.
library(tidyverse)
library(randomizr)
tibble(
person_id = rep(1:500, each = 2),
name = rep(c("Name1", "Name2"), 500),
assignment = block_ra(
blocks = person_id,
conditions = c("Luis", "Conor", "Dafydd")
)
) %>%
pivot_wider(names_from = name,
values_from = assignment)
#> # A tibble: 500 x 3
#> person_id Name1 Name2
#> <int> <fct> <fct>
#> 1 1 Luis Dafydd
#> 2 2 Conor Luis
#> 3 3 Dafydd Luis
#> 4 4 Dafydd Conor
#> 5 5 Conor Dafydd
#> 6 6 Luis Dafydd
#> 7 7 Dafydd Luis
#> 8 8 Conor Luis
#> 9 9 Conor Luis
#> 10 10 Dafydd Conor
#> # … with 490 more rows
Created on 2020-01-24 by the reprex package (v0.3.0)
