How do I de-duplicate while merging this data in R?

Goal: Merge two Excel files that will have significant overlap, but overwrite ONLY the phone numbers and record ID of one data set.
What I have been doing: Just brute-force de-duping in Excel, where I copy over the sheet with phone numbers, sort the ID column, identify/highlight duplicates, and drag "up" the phone numbers to fill in the empty space for the matching ID. The process isn't hard, but with more records it starts to get absurdly tedious. In plain text, the merged but not yet de-duplicated data look like this:
555555 | Joe | Copy | DOB | AGE | 555 Data Road | DataVille | LA | ZIP | County | (**PHONE GOES HERE**) | Male | White | Doc Name | More info
555555 | Joe | Copy | DOB | AGE | 555 Data Road | DataVille | LA | ZIP | County | 555555555 (Phone)
The phone should be added to the space between County and Gender for every record where the two IDs (the first number in each record) match.
Attempts in R:
df_final <- merge(df_noPhone, df_Phone, by = c("Record_ID"), all.x = T)
But this just duplicates the shared columns ("PatientAddress.x", etc.), and I need those to be synced up for the records to be complete.
The really tricky part, though, is that it isn't consistent this way throughout the data. Sometimes we simply don't have phone numbers for certain records, and we still want to retain them in the data.
Suggestions? I've tried merging with almost every package I can imagine, but sometimes it ends up creating more work in the direct, raw data file afterwards than it's worth.
Thanks!

You'd mentioned:
.. identify/highlight duplicates, and drag "up" the phone numbers to
fill in the empty space for the matching ID.
I suggest replacing the "drag up" with a formula, then swapping the column.
Assuming your data is filled in A2:S3, put:
=IF(M2="",1,0) in U2
=IF(U2=1,INDEX(M:M,MATCH(1,INDEX((0=U:U)*(A2=A:A),0,1),0)),"No data") in V2
and drag both downwards.
Reference: https://exceljet.net/formula/index-and-match-with-multiple-criteria
You'll notice that I use "No data" to fill in rows that already have numbers. You may use Data > Filter to remove/unselect those lines manually. (That's what I'd do, but it's up to you.)
Hope it helps..
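If you'd rather stay in R, here's a minimal sketch of the same idea: keep only the ID and phone columns from the phone file before merging, so merge() has nothing else to split into .x/.y pairs. The column names Record_ID and Phone are placeholders based on the question; adjust them to your real headers.
# keep just the key and the phone number from the phone file
phone_only <- df_Phone[, c("Record_ID", "Phone")]

# if df_noPhone already has an empty Phone column, drop it first so the
# merge doesn't create Phone.x/Phone.y:
# df_noPhone$Phone <- NULL

# left join: every record in df_noPhone is kept; records with no
# matching phone number simply get NA in the Phone column
df_final <- merge(df_noPhone, phone_only, by = "Record_ID", all.x = TRUE)

# optional: drop rows that are exact duplicates after the merge
df_final <- unique(df_final)
Because only Record_ID and Phone come from the second file, the address and demographic columns are no longer duplicated, and records without a phone number are still retained.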

Related

Using ImportHTML or ImportXML to Select Non-Consecutive Columns In Google Sheets, Remove Rows, & Add A Sort Column

I want to import the table information from https://www.pro-football-reference.com/years/2020/draft.htm into a Google Sheet. However, I'm trying to avoid pulling in null cells as well as information I already have in other sheets. Here are my questions:
The only columns I want are Round (Col1), Pick (Col2), and Player (Col4). I've tried using ImportHTML and so far, all I can do is grab the whole table.
I want to create a new column called 'Rd.Pick' which would convert the pick column into a representation of what pick in the respective round they were. So, for example, Pick 33 would display as 2.1.
Finally, I would like to be able to remove the rows that are listed in between the last pick of a round but before the first pick in the following round. I'm not sure how to do that given that the text in those rows matches the header row.
This is just to answer the question from your comment above: how to convert the sequential draft pick number to a number like 3.12, i.e. the 12th pick in the 3rd round.
This formula is a bit brute force, but it works:
={"Round-Pick";
ArrayFormula(ifna(ifs(
D2:D=1,"1."& text(E2:E,"00"),
D2:D=2,"2."& text(E2:E-max(filter(D$2:E,D$2:D=1)),"00"),
D2:D=3,"3."& text(E2:E-max(filter(D$2:E,D$2:D=2)),"00"),
D2:D=4,"4."& text(E2:E-max(filter(D$2:E,D$2:D=3)),"00"),
D2:D=5,"5."& text(E2:E-max(filter(D$2:E,D$2:D=4)),"00"),
D2:D=6,"6."& text(E2:E-max(filter(D$2:E,D$2:D=5)),"00"),
D2:D=7,"7."& text(E2:E-max(filter(D$2:E,D$2:D=6)),"00")
),""))}
If you put that in NFLDraft!F1, it should do what you want. You could then hide Column E if you like.
UPDATED: To provide the format you've requested, with leading zero.
try:
=ARRAYFORMULA(QUERY({
QUERY(IMPORTHTML("https://www.pro-football-reference.com/years/2020/draft.htm",
"table", 1), "select Col4"),
QUERY(IMPORTHTML("https://www.pro-football-reference.com/years/2020/draft.htm",
"table", 1), "select Col1")&"."&
QUERY(IMPORTHTML("https://www.pro-football-reference.com/years/2020/draft.htm",
"table", 1), "select Col2")}, "where not Col2 matches '\.'", 1))

How to match a variable to a row in a data frame and select the value from a different column in the same row?

I have a bunch of chat logs and have managed to pull email addresses from them and separate out the domains (e.g. "#bacon.edu"). I have a list of domains matched with a category name.
Basically I want to match the variable to a row in column 2 and pull the category name from column 1.
I should mention everything is currently formatted as factors, but that can change.
In this example d1 = "bacon.edu" and the name list is a data frame set up like this:
d1 = "bacon.edu"
Workplace Name Email List
Pancake #bac.edu
Test place #toe.edu
superworld #bacon.edu
monkey gym #aclu.edu
toaster oven #yoyo.edu
The goal is to find bacon in row 3 and create a variable from column 1, row 3 (so abc = "superworld"), but I struggle to find the match to begin with.
I have tried:
which(d1, namelist$Email.List)
which(namelist$Email.List == d1)
which(grep
match(d1, namelist$Email.list)
which(grepl("bacon.edu, namelist$Email.List
Sadly I don't recall all the errors or which attempts they came from, but they include:
integer(0)
object class not logical
level sets of factors are different.
I have since deleted the failed attempts. I'm sure it's simple and I feel bad asking, but any help would be appreciated!
We can use grep
namelist$`Workplace Name`[grep(d1, namelist$`Email List`)]
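As a quick, self-contained check of that idea using the sample values from the question (the stringsAsFactors = FALSE and fixed = TRUE arguments are my additions: the first avoids the factor-level errors mentioned above, and the second treats d1 as a literal string rather than a regex, since "." is a metacharacter):
# rebuild the example name list as plain character columns
namelist <- data.frame(
  `Workplace Name` = c("Pancake", "Test place", "superworld", "monkey gym", "toaster oven"),
  `Email List`     = c("#bac.edu", "#toe.edu", "#bacon.edu", "#aclu.edu", "#yoyo.edu"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

d1 <- "bacon.edu"

# grep() returns the matching row index; use it to pull the workplace name
abc <- namelist$`Workplace Name`[grep(d1, namelist$`Email List`, fixed = TRUE)]
abc
# [1] "superworld"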

Accessing specific rows and columns of a data frame from a CSV file in R

I have a CSV file containing an iPhone device roadmap: version number, name of model, release of model, price, etc. I have done the following:
I imported the data set in RStudio into a variable named iphonedetail with the following command: iphonedetail <- read.csv("iphodedata.csv")
Then I changed the attribute "name of model" to character with: iphonedetail$nameofmodel <- as.character(iphonedetail$nameofmodel)
Now I need to access the first 5 model names and store them in a vector.
I tried this to achieve it: iphonesubset <- data.frame(iphonedetail$nameofmodel)
Then on the console I typed iphonesubset, but it gave 0 columns and rows.
Could someone check whether the first 2 steps are correct, and also suggest how to fix the 3rd step?
If you want to extract the first five rows (not necessarily unique):
iphonedf1to5 <- df[1:5,]
That means you get the first 5 rows and all columns. If you then want only the unique elements among those first five, it would be:
iphonedf1to5 <- unique(df[1:5,])
Edit:
df means the data frame from the CSV you read in, iphonedetail in your case.
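For the third step, here is a minimal sketch of getting the first five model names into a plain character vector (assuming the column really is called nameofmodel, as in the question):
# stringsAsFactors = FALSE keeps the text columns as character,
# so the separate as.character() step is no longer needed
iphonedetail <- read.csv("iphodedata.csv", stringsAsFactors = FALSE)

# $ returns the column as a vector; [1:5] takes its first five entries
first_five_models <- iphonedetail$nameofmodel[1:5]

# equivalent: iphonedetail[1:5, "nameofmodel"]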

Character values stored in a data frame with double quotes when reading into R

I have a CSV file with almost 4 million records and 30+ columns.
The columns are of varied types, including numeric, alphanumeric, date, and character.
Attempt 1:
When I first read the file in R using the read.csv function, only 2 million of the records were read.
This may have happened because of some special characters in the data.
Attempt 2:
I provided the argument quote = "" in read.csv and all the records were read successfully.
However, this brings up 2 issues:
a. all the columns were prefixed with an 'x.' modifier,
e.g.: x.date, x.name
b. all the character columns were loaded into the data frame enclosed in double quotes ""
Can someone please advise how to resolve these 2 issues and get the data loaded into R successfully?
I work for a financial institution and the data is highly sensitive, hence I cannot paste a screenshot here.
I also tried to recreate the scenario at home, but all my efforts were of little or no avail.
The screenshot below was the closest I came to the exact scenario:
[data frame screenshot; not an exact copy]
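Without a sample file this is only a rough sketch, but one way to tackle both symptoms after reading with quote = "" looks like this (the file name is a placeholder, and whether this matches your data depends on where the stray quote characters actually sit):
# placeholder file name for the real 4-million-row CSV
raw <- read.csv("big_file.csv",
                quote = "",
                stringsAsFactors = FALSE,
                check.names = FALSE)  # keep header names exactly as in the file

# issue (a): with check.names = FALSE the headers keep their stray quote
# characters instead of being mangled into X.-prefixed names, so they
# can simply be stripped
names(raw) <- gsub('"', "", names(raw), fixed = TRUE)

# issue (b): strip the surrounding quotes from every character column
char_cols <- vapply(raw, is.character, logical(1))
raw[char_cols] <- lapply(raw[char_cols], function(x) gsub('"', "", x, fixed = TRUE))
Depending on how the embedded quotes and special characters are laid out, a parser such as data.table::fread or readr::read_csv may also handle the file directly, but without a sample that is only a guess.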

How to write a column to a file without duplicates in R, ordered alphabetically

I am trying to write to a CSV file with write.table, but from what I've read it has limited capabilities. I am using the following command:
write.table(s$Nomen, "table.csv", row.names=FALSE, col.names=FALSE)
which exports a datasheet consisting of a single column (as I like it). However, that column contains a lot of duplicate values. I would like to remove duplicates and order the column alphabetically.
For example, if this is s$Nomen:
Nomen
------
archer
sent
chocolate
banana
arbitrary
column
paste
paste
knowledge
zen
banana
sent
surprise
The output should be:
arbitrary
archer
banana
chocolate
column
knowledge
paste
sent
surprise
zen
I'm assuming sort.list comes in handy, but I don't know how to remove the duplicates.
Note the data in the original column s$Nomen should not be altered! So I don't want to re-order the actual column, but I want to re-order the output.
You could try either unique or duplicated from base R
sort(unique(s$Nomen))
Or
sort(s$Nomen[!duplicated(s$Nomen)])
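Either of those leaves s$Nomen itself untouched. To write the cleaned-up output, plug the result into the same write.table call from the question:
# de-duplicated, alphabetically sorted column; the original s$Nomen is not modified
write.table(sort(unique(s$Nomen)), "table.csv", row.names = FALSE, col.names = FALSE)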
