Merging rows based on unique emails with overlapping data - r

I would like to merge rows in my data frame by unique email address without losing any data. Rows with the same email address should be combined into one. If the rows being combined hold conflicting values for a column, the value from the row with fewer cells filled in should go into a new column. Please ask questions if anything is unclear.
Below is an example of what I am looking for the function to do (data made up).
| First Name | Last Name | Email | Phone | Address | Shoe Size |
|---|---|---|---|---|---|
| John | Schmitt | jschmitt#gmail.com | 914-392-1840 | address 1 | 4 |
| Paul | Johnson | pjohnson#gmail.com | 274-184-3653 | address 2 | 2 |
| Brad | Arnold | barnold#gmail.com | 157-135-3175 | address 3 | 5 |
| John | Schmitt | jschmitt#gmail.com | 914-392-1840 | | 6 |
This sheet should become:
| First Name | Last Name | Email | Phone | Address | Shoe Size | Shoe Size 2 |
|---|---|---|---|---|---|---|
| John | Schmitt | jschmitt#gmail.com | 914-392-1840 | address 1 | 4 | 6 |
| Paul | Johnson | pjohnson#gmail.com | 274-184-3653 | address 2 | 2 | |
| Brad | Arnold | barnold#gmail.com | 157-135-3175 | address 3 | 5 | |
Basically, the phone number connected to jschmitt#gmail.com stays in the "Phone" column because it is the same in both rows. The addresses differ, but only because the bottom row is blank, so the address stays the same. Finally, a new column is created for Shoe Size, because the rows being merged have two differing values. The function should decide which shoe size goes into Shoe Size 2 by counting the filled cells in each row: the shoe size from the row with more cells filled stays in the original Shoe Size column, and the shoe size from the row with fewer cells filled goes into the new Shoe Size 2 column.
Feel free to ask any questions or make any suggestions about how I could do something of this nature in an easier way. I also haven't figured out what to do if the two rows with conflicting data have the same number of cells filled...

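Update: a tidyverse-only solution using chop(), following a note from Martin Gal:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  select(-Address) %>%
  # collapse the differing Shoe Size values into one list-column per person
  chop(`Shoe Size`) %>%
  # relies on tidyr auto-naming the unnested values ...1 and ...2;
  # newer tidyr versions may require names_sep here
  unnest_wider(`Shoe Size`) %>%
  rename(`Shoe Size` = ...1, `Shoe Size 2` = ...2) %>%
  left_join(df, by = "Shoe Size") %>%
  select(-ends_with(".y")) %>%
  # strip only a literal ".x" suffix (an unescaped '.x' is a regex matching any character followed by x)
  rename_with(~ str_remove(.x, "\\.x$")) %>%
  relocate(Address, .after = Phone) %>%
  arrange(Address)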
First answer:
Here is one way to achieve the result. The logic:
1. Remove Address and assign the result to a new data frame, df1.
2. Use aggregate() to collapse the duplicated part of the rows and collect the non-duplicated part (here, Shoe Size).
3. Use unnest_wider() to unnest the list column.
4. Rename the new columns.
5. left_join() with df and clean up with select() and rename_with().
6. relocate() and arrange().
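library(dplyr)
library(tidyr)
library(stringr)
# base R: drop the Address column (column 5) and assign to df1
df1 <- df[, -5]
# aggregate Shoe Size (now column 5 of df1) over all remaining columns
# (I don't know how to do this in dplyr, therefore base R)
df1 <- aggregate(df1[5], df1[-5], unique)
# now with the tidyverse (dplyr, tidyr, stringr)
df1 %>%
  # relies on tidyr auto-naming the unnested values ...1 and ...2;
  # newer tidyr versions may require names_sep here
  unnest_wider(`Shoe Size`) %>%
  rename(`Shoe Size` = ...1, `Shoe Size 2` = ...2) %>%
  left_join(df, by = "Shoe Size") %>%
  select(-ends_with(".y")) %>%
  # strip only a literal ".x" suffix
  rename_with(~ str_remove(.x, "\\.x$")) %>%
  relocate(Address, .after = Phone) %>%
  arrange(Address)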
# A tibble: 3 x 7
  `First Name` `Last Name` Email              Phone        Address   `Shoe Size` `Shoe Size 2`
  <chr>        <chr>       <chr>              <chr>        <chr>           <dbl>         <dbl>
1 John         Schmitt     jschmitt#gmail.com 914-392-1840 address 1           4             6
2 Paul         Johnson     pjohnson#gmail.com 274-184-3653 address 2           2            NA
3 Brad         Arnold      barnold#gmail.com  157-135-3175 address 3           5            NA
Data:
df <- structure(list(`First Name` = c("John", "Paul", "Brad", "John"
), `Last Name` = c("Schmitt", "Johnson", "Arnold", "Schmitt"),
    Email = c("jschmitt#gmail.com", "pjohnson#gmail.com", "barnold#gmail.com",
    "jschmitt#gmail.com"), Phone = c("914-392-1840", "274-184-3653",
    "157-135-3175", "914-392-1840"), Address = c("address 1",
    "address 2", "address 3", NA), `Shoe Size` = c(4, 2, 5, 6
    )), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
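The rule stated in the question (the row with more filled cells keeps the original column) can also be sketched without joining on Shoe Size. This is only a sketch under some assumptions: at most two distinct shoe sizes per email, ties in the number of filled cells resolved in favor of the earlier row, and the helper names `filled` and `first_known` are made up for illustration.

```r
library(dplyr)

# Example data from the question (made-up values)
df <- tibble::tibble(
  `First Name` = c("John", "Paul", "Brad", "John"),
  `Last Name`  = c("Schmitt", "Johnson", "Arnold", "Schmitt"),
  Email        = c("jschmitt#gmail.com", "pjohnson#gmail.com",
                   "barnold#gmail.com", "jschmitt#gmail.com"),
  Phone        = c("914-392-1840", "274-184-3653",
                   "157-135-3175", "914-392-1840"),
  Address      = c("address 1", "address 2", "address 3", NA),
  `Shoe Size`  = c(4, 2, 5, 6)
)

# Number of non-missing cells per row (hypothetical helper column)
df$filled <- rowSums(!is.na(df))

# Hypothetical helper: first non-missing value, or NA if all are missing
first_known <- function(x) x[!is.na(x)][1]

merged <- df %>%
  arrange(desc(filled)) %>%   # fuller rows first; ties keep the original order
  group_by(Email) %>%
  summarise(
    across(c(`First Name`, `Last Name`, Phone, Address), first_known),
    # `Shoe Size 2` must be computed before `Shoe Size` is overwritten below
    `Shoe Size 2` = unique(`Shoe Size`)[2],
    `Shoe Size`   = unique(`Shoe Size`)[1],
    .groups = "drop"
  ) %>%
  relocate(`Shoe Size 2`, .after = `Shoe Size`) %>%
  relocate(Email, .after = `Last Name`)

merged
```

Because the grouping key is Email rather than Shoe Size, two different people who happen to share a shoe size are not mixed up. More than two conflicting values per email would need additional columns.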

Related

Move column values to new column based on several conditions in R

I am currently cleaning a dataset to be ready for analysis and need to move values around columns depending on the way one column matches to another.
In my data are participants who are assigned to various numbers of projects, and they have rated each project on different variables. Therefore, Name is the participant, Project Name is the project's name, Project Number is the number of that project for that participant, and the variables with "voice" in them are questions about the project with a value indicating a rating (1-5). Each person has rated anywhere from 1-17 projects.
An example of my data in its current state can be replicated as so:
col_names <- c("Name", "Project Name", "Project Number", "T2_voice1", "T2_voice2",
               "T2project1_voice1", "T2project1_voice2", "T2project2_voice1", "T2project2_voice2")
r1 <- c("Bob", "ProjectX", "Project1", NA, NA, 5, 2, 4, 5)
r2 <- c("Bob", "ProjectZ", "Project2", NA, NA, 5, 2, 4, 5)
r3 <- c("Amy", "ProjectQ", "Project1", NA, NA, 1, 2, 1, 1)
r4 <- c("Amy", "ProjectD", "Project2", NA, NA, 1, 2, 1, 1)
data <- rbind(r1, r2, r3, r4)
colnames(data) <- col_names
What I would like to do is put the number value from project1_voice1 in the column T2_voice1 for project1 for each participant. This would continue then for each project number for each participant. The final product would look like this once I delete the unneeded columns:
| Name | Project Name | Project Number | T2_voice1 | T2_voice2 |
|---|---|---|---|---|
| Bob | ProjectX | 1 | 5 | 2 |
| Bob | ProjectZ | 2 | 4 | 5 |
| Amy | ProjectQ | 1 | 1 | 2 |
| Amy | ProjectD | 2 | 1 | 1 |
The only way I have thought of to do this is some sort of grepl or substr match between the project1 in Project Number and the column names containing project1. Alternatively, it could be done positionally, since columns for 17 projects (2 in the example) exist for each participant; some simply hold NAs if the participant did not have that many projects.
Any guidance or ideas would be extremely appreciated!
I would split your data into separate tables of project information and ratings; tidyr::pivot_longer() the ratings table; then merge back together:
library(dplyr)
library(stringr)
library(tidyr)
# convert example data from matrix to data frame
data <- as.data.frame(data)
# project information, with the project number reduced to its digits
projects <- data %>%
  select(Name, `Project Name`, `Project Number`) %>%
  mutate(`Project Number` = str_extract(`Project Number`, "\\d+$"))
# one row of ratings per participant, reshaped to one row per project
ratings <- data %>%
  distinct(Name, across(T2project1_voice1:T2project2_voice2)) %>%
  pivot_longer(
    !Name,
    names_to = c("Project Number", ".value"),
    names_pattern = "T2project(\\d+)_(.+)"
  ) %>%
  rename_with(.cols = voice1:voice2, ~ str_c("T2_", .x))
# state the join keys explicitly instead of relying on the default
final <- full_join(projects, ratings, by = c("Name", "Project Number"))
  Name Project Name Project Number T2_voice1 T2_voice2
1  Bob     ProjectX              1         5         2
2  Bob     ProjectZ              2         4         5
3  Amy     ProjectQ              1         1         2
4  Amy     ProjectD              2         1         1
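The column-name matching idea from the question can also be sketched directly in base R: build the name of the source column for each row from its project number, then pull that row's value out of it. The data frame below is a numeric stand-in for the example (the names `ratings`, `project_number` and `pick_rating` are made up for illustration), so treat it as a sketch rather than a drop-in replacement.

```r
# Numeric stand-in for the example data (hypothetical column names)
ratings <- data.frame(
  Name              = c("Bob", "Bob", "Amy", "Amy"),
  project_number    = c("Project1", "Project2", "Project1", "Project2"),
  T2project1_voice1 = c(5, 5, 1, 1),
  T2project1_voice2 = c(2, 2, 2, 2),
  T2project2_voice1 = c(4, 4, 1, 1),
  T2project2_voice2 = c(5, 5, 1, 1),
  stringsAsFactors  = FALSE
)

# Digits of the project number for each row: "1", "2", "1", "2"
n <- sub("^Project", "", ratings$project_number)

# For each row, read the value from the column that matches its project number
pick_rating <- function(question) {
  src <- paste0("T2project", n, "_", question)
  vapply(seq_along(src), function(i) ratings[[src[i]]][i], numeric(1))
}

ratings$T2_voice1 <- pick_rating("voice1")
ratings$T2_voice2 <- pick_rating("voice2")

ratings[, c("Name", "project_number", "T2_voice1", "T2_voice2")]
```

With 17 projects the same function applies unchanged, since the source column name is computed per row rather than listed by hand.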

R Concatenate Across Rows Within Groups but Preserve Sequence

My data consists of text from many dyads that has been split into sentences, one per row. I'd like to concatenate the data by speaker within dyads, essentially converting the data to speaking turns. Here's an example data set:
dyad <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
speaker <- c("John", "John", "John", "Paul", "John", "George", "Ringo", "Ringo", "George")
text <- c("Let's play",
          "We're wasting time",
          "Let's make a record!",
          "Let's work it out first",
          "Why?",
          "It goes like this",
          "Hold on",
          "Have to tighten my snare",
          "Ready?")
dat <- data.frame(dyad, speaker, text)
And this is what I'd like the data to look like:
  dyad speaker text
1    1 John    Let's play. We're wasting time. Let's make a record!
2    1 Paul    Let's work it out first
3    1 John    Why?
4    2 George  It goes like this
5    2 Ringo   Hold on. Have to tighten my snare
6    2 George  Ready?
I've tried grouping by speaker and pasting/collapsing with dplyr, but the concatenation combines all of a speaker's text without preserving speaking-turn order. For example, John's last statement ("Why?") winds up with his other text in the output rather than coming after Paul's comment. I also tried checking whether the next speaker (using lead(speaker)) is the same as the current one and then combining, but that only handles adjacent pairs of rows, so it misses John's third comment in the example. It seems like it should be simple, but I can't make it happen. It should also be flexible enough to combine any run of consecutive rows by a given speaker.
Thanks in advance
Create another group with rleid (from data.table) and paste the rows in summarise
library(dplyr)
library(data.table)
library(stringr)
dat %>%
  group_by(dyad, grp = rleid(speaker), speaker) %>%
  summarise(text = str_c(text, collapse = ' '), .groups = 'drop') %>%
  select(-grp)
Output:
# A tibble: 6 × 3
   dyad speaker text
  <dbl> <chr>   <chr>
1     1 John    Let's play We're wasting time Let's make a record!
2     1 Paul    Let's work it out first
3     1 John    Why?
4     2 George  It goes like this
5     2 Ringo   Hold on Have to tighten my snare
6     2 George  Ready?
Not as elegant as akrun's solution, but the helper column does the same job as rleid() here without needing an additional package. Note that helper must come before speaker in group_by(); otherwise summarise() sorts the rows by speaker and the turn order is lost:
library(dplyr)
dat %>%
  # helper increases by one every time the speaker changes
  mutate(helper = (speaker != lag(speaker, 1, default = "xyz")),
         helper = cumsum(helper)) %>%
  group_by(dyad, helper, speaker) %>%
  summarise(text = paste0(text, collapse = " "), .groups = 'drop') %>%
  select(-helper)
# A tibble: 6 × 3
   dyad speaker text
  <dbl> <chr>   <chr>
1     1 John    Let's play We're wasting time Let's make a record!
2     1 Paul    Let's work it out first
3     1 John    Why?
4     2 George  It goes like this
5     2 Ringo   Hold on Have to tighten my snare
6     2 George  Ready?
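For completeness, the same run-based grouping can be sketched in base R with rle(), with no packages at all. The variable names `key`, `turn` and `turns` are made up for this sketch; the idea is only that a new speaking turn starts whenever the dyad/speaker pair differs from the previous row.

```r
dat <- data.frame(
  dyad    = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  speaker = c("John", "John", "John", "Paul", "John",
              "George", "Ringo", "Ringo", "George"),
  text    = c("Let's play", "We're wasting time", "Let's make a record!",
              "Let's work it out first", "Why?", "It goes like this",
              "Hold on", "Have to tighten my snare", "Ready?"),
  stringsAsFactors = FALSE
)

# Run-length encode the dyad/speaker pair; each run is one speaking turn
key  <- paste(dat$dyad, dat$speaker)
runs <- rle(key)
turn <- rep(seq_along(runs$lengths), runs$lengths)

# Keep dyad and speaker from the first row of each turn, paste the text per turn
first_row <- !duplicated(turn)
turns <- data.frame(
  dyad    = dat$dyad[first_row],
  speaker = dat$speaker[first_row],
  text    = as.vector(tapply(dat$text, turn, paste, collapse = " ")),
  stringsAsFactors = FALSE
)

turns
```

Because `turn` is an increasing integer, tapply() returns the pasted text in the original turn order, so no re-sorting is needed afterwards.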

How do I extract part of the data from a column and paste it into another column using R?

I want to extract part of the data from one column and paste it into another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
> data
  names country            mobile
1 Sia   London +1234567890 NULL
2 Ryan  Paris              +3579514862
3 J     Sydney +0123458796 NULL
4 Ricky Delhi              +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
  names country mobile
1 Sia   London  +1234567890
2 Ryan  Paris   +3579514862
3 J     Sydney  +0123458796
4 Ricky Delhi   +5554848123
You can use the tidyverse, whose tidyr package has a separate() function.
Note that I use NA rather than NULL inside the mobile vector, and that the phone numbers are stored as strings so the leading "+" and zeros are kept.
I also pass stringsAsFactors = FALSE when creating the data frame to avoid working with factors.
names <- c("Sia", "Ryan", "J", "Ricky")
country <- c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names, country, mobile, stringsAsFactors = FALSE)
library(tidyverse)
data %>% as_tibble() %>%
  # split at the space; rows without a number get NA in "number"
  separate(country, c("country", "number"), sep = " ", fill = "right") %>%
  # keep the existing mobile value, otherwise take the extracted number
  mutate(mobile = coalesce(mobile, number)) %>%
  select(-number)
# A tibble: 4 x 3
  names country mobile
  <chr> <chr>   <chr>
1 Sia   London  +1234567890
2 Ryan  Paris   +3579514862
3 J     Sydney  +0123458796
4 Ricky Delhi   +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
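Splitting on a single space works for the example, but it would cut a multi-word place name in two. A base R sketch that anchors on the trailing "+digits" instead is shown below; the fifth row ("Ana", "New York") is a made-up addition to illustrate that case, and the variable names `pattern`, `has_number` and `number` are hypothetical.

```r
data <- data.frame(
  names   = c("Sia", "Ryan", "J", "Ricky", "Ana"),   # "Ana" is a made-up extra row
  country = c("London +1234567890", "Paris", "Sydney +0123458796",
              "Delhi", "New York +2125550100"),
  mobile  = c(NA, "+3579514862", NA, "+5554848123", NA),
  stringsAsFactors = FALSE
)

# A phone number here means: whitespace, a plus sign, digits, end of string
pattern    <- "[[:space:]]+(\\+[0-9]+)$"
has_number <- grepl(pattern, data$country)

# Extract the number where present, NA otherwise
number <- ifelse(has_number,
                 sub(paste0("^.*", pattern), "\\1", data$country),
                 NA_character_)

# Fill mobile only where it is missing, then strip the number from country
data$mobile  <- ifelse(is.na(data$mobile), number, data$mobile)
data$country <- sub(pattern, "", data$country)

data
```

An existing mobile value is never overwritten, which mirrors what coalesce(mobile, number) does in the tidyverse version.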

Count word frequency across multiple columns in R

I have a data frame in R with multiple columns with multi-word text responses, that looks something like this:
1a        1b              1c        2a           2b              2c
student   job prospects   money     professors   students        campus
future    career          unsure    my grades    opportunities   university
success   reputation      my job    earnings     courses         unsure
I want to be able to count the frequency of words in columns 1a, 1b, and 1c combined, as well as in 2a, 2b, and 2c combined.
Currently, I'm using this code to count word frequency in each column individually.
data.frame(table(unlist(strsplit(tolower(dat$`1a`), " "))))
Ideally, I want to be able to combine the two sets of columns into just two columns and then use this same code to count word frequency, but I'm open to other options.
The combined columns would look something like this:
1               2
student         professors
future          my grades
success         earnings
job prospects   students
career          opportunities
reputation      courses
money           campus
unsure          university
my job          unsure
Here's a way using dplyr and tidyr packages. FYI, one should avoid having column names starting with a number. Naming them a1, a2... would make things easier in the long run.
library(dplyr)
library(tidyr)
df %>%
  gather(variable, value) %>%
  # keep only the leading digit of the column name ("1a" -> "1")
  mutate(variable = substr(variable, 1, 1)) %>%
  # running index within each group, used as the row key for spread()
  mutate(id = ave(variable, variable, FUN = seq_along)) %>%
  spread(variable, value)
  id             1             2
1  1       student    professors
2  2        future     my grades
3  3       success      earnings
4  4 job prospects      students
5  5        career opportunities
6  6    reputation       courses
7  7         money        campus
8  8        unsure    university
9  9        my job        unsure
Data -
df <- structure(list(`1a` = c("student", "future", "success"), `1b` = c("job prospects",
"career", "reputation"), `1c` = c("money", "unsure", "my job"
), `2a` = c("professors", "my grades", "earnings"), `2b` = c("students",
"opportunities", "courses"), `2c` = c("campus", "university",
"unsure")), .Names = c("1a", "1b", "1c", "2a", "2b", "2c"), class = "data.frame", row.names = c(NA,
-3L))
In general, you should avoid column names that start with numbers. That aside, I created a reproducible example of your problem and provided a solution using dplyr and tidyr. The substr() call inside mutate_at() assumes your column names follow the [num][char] pattern in your example.
library(dplyr)
library(tidyr)
data <- tibble::tribble(
  ~`1a`, ~`1b`, ~`1c`, ~`2a`, ~`2b`, ~`2c`,
  'student', 'job prospects', 'money', 'professors', 'students', 'campus',
  'future', 'career', 'unsure', 'my grades', 'opportunities', 'university',
  'success', 'reputation', 'my job', 'earnings', 'courses', 'unsure'
)
data %>%
  gather(key, value) %>%
  # keep only the leading digit of the column name
  mutate_at('key', substr, 1, 1) %>%
  group_by(key) %>%
  mutate(id = row_number()) %>%
  spread(key, value) %>%
  select(-id)
# A tibble: 9 x 2
  `1`           `2`
  <chr>         <chr>
1 student       professors
2 future        my grades
3 success       earnings
4 job prospects students
5 career        opportunities
6 reputation    courses
7 money         campus
8 unsure        university
9 my job        unsure
If your end purpose is to count frequency (as opposed to switching from wide to long format), you could do
ave(unlist(df[,paste0("a",1:3)]), unlist(df[,paste0("a",1:3)]), FUN = length)
which will count the frequency of the elements of columns a1,a2,a3, where df denotes the data frame (and the columns are labeled a1,a2,a3,b1,b2,b3).
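If the goal is a word-frequency table per column group, the code from the question can be applied to all columns of a group at once without reshaping first. The sketch below assumes the groups are identified by the leading digit of the column name; `count_words`, `freq_1` and `freq_2` are made-up names for illustration.

```r
df <- data.frame(
  `1a` = c("student", "future", "success"),
  `1b` = c("job prospects", "career", "reputation"),
  `1c` = c("money", "unsure", "my job"),
  `2a` = c("professors", "my grades", "earnings"),
  `2b` = c("students", "opportunities", "courses"),
  `2c` = c("campus", "university", "unsure"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Count word frequency over every column whose name starts with `prefix`
count_words <- function(prefix) {
  cols   <- grep(paste0("^", prefix), names(df), value = TRUE)
  words  <- unlist(strsplit(tolower(unlist(df[cols], use.names = FALSE)),
                            " ", fixed = TRUE))
  counts <- as.data.frame(table(word = words), responseName = "n",
                          stringsAsFactors = FALSE)
  counts[order(-counts$n, counts$word), ]   # most frequent words first
}

freq_1 <- count_words("1")
freq_2 <- count_words("2")
freq_1
```

In this example "job" is the only word that occurs twice (in "job prospects" and "my job"), so it ends up at the top of `freq_1`.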

How to convert specific rows into columns in r?

I have a df in R of only one column of food ratings from amazon.
head(food_ratings)
product.productId..B001E4KFG0
1 review/userId: A3SGXH7AUHU8GW
2 review/profileName: delmartian
3 review/helpfulness: 1/1
4 review/score: 5.0
5 review/time: 1303862400
6 review/summary: Good Quality Dog Food
The rows repeat themselves, so that rows 7 through 12 have the same information regarding another user(row 7). This pattern is repeated many times.
Therefore, I need to have every group of 6 rows distributed in one row with 6 columns, so that later I can subset, for instance, the review/summary according to their review/score.
I'm using RStudio 1.0.143
EDIT: I was asked to show the output of dput(head(food_ratings, 24)) but it was too big regardless of the number used.
Thanks a lot
I have taken your data and added 2 more fake users to it. Using tidyr and dplyr you can create new columns and collapse the data into a nice data.frame. You can use select from dplyr to drop the id column if you don't need it or to rearrange the order of the columns.
library(tidyr)
library(dplyr)
df %>%
  # split each line into the field name and its value
  separate(product.productId..B001E4KFG0, into = c("details", "data"), sep = ": ") %>%
  # drop the "review/" prefix from the field names
  mutate(details = sub("review/", "", details)) %>%
  # number the occurrences of each field; this becomes the record id
  group_by(details) %>%
  mutate(id = row_number()) %>%
  spread(details, data)
# A tibble: 3 x 7
     id helpfulness profileName score summary                 time       userId
  <int> <chr>       <chr>       <chr> <chr>                   <chr>      <chr>
1     1 1/1         delmartian  5.0   Good Quality Dog Food   1303862400 A3SGXH7AUHU8GW
2     2 1/1         martian2    1.0   Good Quality Snake Food 1303862400 123456
3     3 2/5         martian3    5.0   Good Quality Cat Food   1303862400 123654
data:
df <- structure(list(product.productId..B001E4KFG0 = c("review/userId: A3SGXH7AUHU8GW",
"review/profileName: delmartian", "review/helpfulness: 1/1",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Dog Food",
"review/userId: 123456", "review/profileName: martian2", "review/helpfulness: 1/1",
"review/score: 1.0", "review/time: 1303862400", "review/summary: Good Quality Snake Food",
"review/userId: 123654", "review/profileName: martian3", "review/helpfulness: 2/5",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Cat Food"
)), class = "data.frame", row.names = c(NA, -18L))
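Since every record is described as exactly six consecutive rows in the same order, a positional reshape in base R is another option. This sketch relies on that assumption (it would silently misalign if a record were missing a line), uses two records with the made-up second user from the answer above, and the names `lines`, `fields`, `values` and `wide` are hypothetical.

```r
food_ratings <- data.frame(
  product.productId..B001E4KFG0 = c(
    "review/userId: A3SGXH7AUHU8GW", "review/profileName: delmartian",
    "review/helpfulness: 1/1", "review/score: 5.0",
    "review/time: 1303862400", "review/summary: Good Quality Dog Food",
    "review/userId: 123456", "review/profileName: martian2",
    "review/helpfulness: 1/1", "review/score: 1.0",
    "review/time: 1303862400", "review/summary: Good Quality Snake Food"
  ),
  stringsAsFactors = FALSE
)

lines <- food_ratings[[1]]

# Field names come from the first record; values are everything after "name: "
fields <- sub("^review/([^:]+): .*$", "\\1", lines[1:6])
values <- sub("^[^:]+: ", "", lines)

# Fill a 6-column matrix row by row: one row per record
wide <- as.data.frame(matrix(values, ncol = 6, byrow = TRUE),
                      stringsAsFactors = FALSE)
names(wide) <- fields
wide$score <- as.numeric(wide$score)

# Example: summaries of all reviews with the top score
subset(wide, score == 5)$summary
```

Converting score to numeric makes the later subsetting by review/score mentioned in the question straightforward.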
