Matching data from multiple columns in r - r

I have two datasets:
Contacts2: This contains a list of ~100,000 contacts, their respective titles and a set of columns which describes the types of work contacts could be involved in. Here's an example dataset:
First<-c("George","Thomas","James","Jimmy","Howard","Herbert")
Last<-c("Washington", "Jefferson", "Madison", "Carter", "Taft", "Hoover")
Title<-c("CEO", "Accountant","Communications Specialist", "President", "Accountant", "CFO")
Finance<-NA
Executive<-NA
Communications<-NA
Contacts2<-as.data.frame(cbind(First,Last,Title,Finance,Executive,Communications))
First Last Title Finance Executive Communications
1 George Washington CEO <NA> <NA> <NA>
2 Thomas Jefferson Accountant <NA> <NA> <NA>
3 James Madison Communications Specialist <NA> <NA> <NA>
4 Jimmy Carter President <NA> <NA> <NA>
5 Howard Taft Accountant <NA> <NA> <NA>
6 Herbert Hoover CFO <NA> <NA> <NA>
Note the last three columns are numeric.
TableOfTitle: This dataset contains a list of ~1,000 unique titles and the same set of columns in which describes the type of work the contacts could be involved in. For each title I've put an 1 in the column(s) of the roles that describe that person's job.
Title<-c("CEO","Accountant", "Communications Specialist", "President", "CFO")
Finance<-c(NA,1,NA,1,1)
Executive<-c(1,NA,NA,NA,1)
Communications<-c(NA,NA,1,NA,NA)
TableOfTitle<-as.data.frame(cbind(Title,Finance,Executive,Communications))
Title Finance Executive Communications
1 CEO <NA> 1 <NA>
2 Accountant 1 <NA> <NA>
3 Communications Specialist <NA> <NA> 1
4 President 1 <NA> <NA>
5 CFO 1 1 <NA>
Note the last three columns are numeric.
I'm now trying to match the check boxes in TableOfTitle in Contacts2 based on the contact title field. For example, since TableOfTitle shows anyone with the title of CFO should have an x in the Finance and Executive field, the record for Herbert Hoover in Contacts2 should also have 1s in those columns as well.

Here's a solution that uses dplyr. It is essentially what some commenters have already recommended, except that this fulfills the request of not copying over any pre-existing data in the last 3 columns of Contacts2.
Note that ifelse() can be very slow with large datasets, but for your stated task this shouldn't really be noticeable. Algorithmically, this solution is also a bit clumsy in other ways, but I went for maximum readability here.
Contacts2 <- left_join(Contacts2, TableOfTitle, by = "Title") %>%
transmute(First = First,
Last = Last,
Title = Title,
Finance = ifelse(is.na(Finance.x), Finance.y, Finance.x),
Executive = ifelse(is.na(Executive.x), Executive.y, Executive.x),
Communications = ifelse(is.na(Communications.x), Communications.y, Communications.x))
Example output:
First Last Title Finance Executive Communications
George Washington CEO <NA> 1 <NA>
Thomas Jefferson Accountant 1 <NA> <NA>
James Madison Communications Specialist <NA> <NA> 1
Jimmy Carter President 1 <NA> <NA>
Howard Taft Accountant 1 <NA> <NA>
Herbert Hoover CFO 1 1 <NA>

Related

converting as factor for a list of data frames

I am trying to create a custom function to give labels to modified list of data frames. For example, I have a data frame like below.
df<-data.frame(
gender = c(1,2,1,2,1,2,1,2,2,2,2,1,1,2,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,2,1,2,1,2,1,2,2,2),
country = c(3,3,1,2,5,4,4,4,4,3,3,4,3,4,2,1,4,2,3,4,4,4,3,1,2,1,5,5,4,3,1,4,5,2,3,4,5,1,4),
Q1=c(1,1,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA,1),
Q2=c(1,1,1,1,1,NA,NA,NA,NA,1,1,1,1,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,NA,NA,1,1,1,1,1,1,1,NA,NA,NA),
Q3=c(1,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,1,1,1,NA,NA,NA,1,NA,NA,1,1,1,1,1,NA,NA,1),
Q4=c(1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
Q5=c(1,2,1,1,1,2,1,2,2,1,2,NA,1,1,2,2,2,1,1,1,2,NA,2,1,1,1,2,2,2,NA,1,2,2,1,1,1,2,2,2)
)
I understand your goal to be the following: You want to take a list of data frames (ldat). For each of the dataframes in the list (df, df2) you want to take some existing columns (Q1, Q2, Q3) and replicate them with new names in the same data frame (Q1_new, Q2_new, Q3_new). This you could achieve like this:
variables = c("Q1","Q2","Q3")
new_label =c("Q1_new","Q2_new","Q3_new")
newdfs <- lapply(ldat, FUN = function(x) {
x[,new_label] = x[,variables]
return(x)})
head(newdfs$ALL)
gender country Q1 Q2 Q3 Q4 Q5 cc2 Q1_new Q2_new Q3_new
1 Male USA Yes Available Partner Depends on sales Local 1 Yes Available Partner
2 female USA Yes Available Partner <NA> Overseas NA Yes Available Partner
3 Male CAN <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
4 female EU <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
5 Male UK <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
6 female BR <NA> <NA> <NA> <NA> Overseas NA <NA> <NA> <NA>
Is this what you had in mind?

Need help pulling JSON data with RSocrata from a website API

I need help drafting code that pulls public data directly from a website that is in Socrata format. Here is a link:
https://data.cityofchicago.org/Administration-Finance/Current-Employee-Names-Salaries-and-Position-Title/xzkq-xp2w
There is an API endpoint:
https://data.cityofchicago.org/resource/xzkq-xp2w.json
After the data is uploaded, null values in the "Annual Salary" should be replaced with 50000.
We can use the RSocrata package
library(RSocrata)
url <- "https://data.cityofchicago.org/resource/xzkq-xp2w.json"
data <- RSocrata::read.socrata(url)
head(data)
# name job_titles department full_or_part_time salary_or_hourly annual_salary typical_hours hourly_rate
#1 AARON, JEFFERY M SERGEANT POLICE F Salary 111444 <NA> <NA>
#2 AARON, KARINA POLICE OFFICER (ASSIGNED AS DETECTIVE) POLICE F Salary 94122 <NA> <NA>
#3 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER DAIS F Salary 118608 <NA> <NA>
#4 ABAD JR, VICENTE M CIVIL ENGINEER IV WATER MGMNT F Salary 117072 <NA> <NA>
#5 ABARCA, FRANCES J POLICE OFFICER POLICE F Salary 48078 <NA> <NA>
The following will replace the NAs in annual_salary with 50000.
data[is.na(data$annual_salary),"annual_salary"] <- 50000
However, if you'd like to do what it suggests on the city of Chicago website, you could consider multipling typical_hours with hourly_rate to estimate salary.
ind <- is.na(data$annual_salary)
data[ind,]$annual_salary <- as.numeric(data[ind,]$typical_hours) * as.numeric(data[ind,]$hourly_rate) * 52

Match and list column terms in R from .xlsx

Hi I am quite new to R and i need to match terms from .xlsx columns to get a list of matched data between three .xlsx. The data in files is like this:
From one.xlsx:
OneID NameOne
ACR019 Acropectoral Syndrome
ACR020 Acropectorovertebral
GNT015 Genital Dwarfism
ACR023 Acral Dysostosis Dyserythropoiesis Syndrome
From two.xlsx:
TwoID TwoName
607907 DERMATOFIBROSARCOMA PROTUBERANS
304730 DERMOIDS OF CORNEA
605967 ACROPECTORAL SYNDROME
102510 ACROPECTOROVERTEBRAL
From three.xlsx:
ThreeID ThreeName
OM85203 Acropectoral syndrome
OM67092 Dermoids cornea
OM76580 Acardia
OM45632 Hypertryptophanemia
And the final result file in .xlsx must look like this:
OneID NameOne TwoID TwoName ThreeID ThreeName
ACR019 Acropectoral Syndrome 605967 ACROPECTORAL SYNDROME OM85203 Acropectoral syndrome
ACR020 Acropectorovertebral 102510 ACROPECTOROVERTEBRAL -
- 304730 DERMOIDS OF CORNEA OM67092 Dermoids cornea
Thank you very much, any suggestion or help to code this will be welcome.
What about this: due your only common fields are names in various dataset, we have to use them, as key to connect the various .xlsx, after make some small transformations (generally imho it's not a great idea use descriptions as key, but we could not do different in this case), using the merge() function.
After import the three MSExcel files, you can do:
# first your data (fake)
one <- data.frame(OneID=c('ACR019','ACR020','GNT015','ACR023'),
NameOne = c('Acropectoral Syndrome','Acropectorovertebral','Genital Dwarfism','Acral Dysostosis Dyserythropoiesis Syndrome'))
two <- data.frame(OneID=c('A607907','304730','605967','102510'),
NameTwo = c('DERMATOFIBROSARCOMA PROTUBERANS','DERMOIDS OF CORNEA','ACROPECTORAL SYNDROME','ACROPECTOROVERTEBRAL'))
three <-data.frame(OneID=c('OM85203','OM67092','OM76580','OM45632'),
NameThree = c('Acropectoral syndrome','Dermoids cornea','Acardia','Hypertryptophanemia'))
# then, to have uniques keys, you can put all of them as upper cases to create ids:
one$ID <- toupper(one$NameOne)
two$ID <- toupper(two$NameTwo)
three$ID <- toupper(three$NameThree)
# after that, you can merge the dataframes:
merged <- merge(merge(one,two, by ='ID', all = TRUE),three, by ='ID', all = TRUE)
#lastly, you give them the names you want (to columns)
colnames(merged) <- c('ID', 'OneID','NameOne','TwoID','NameTwo','ThreeID','NameThree')
# here the results
merged
> merged
ID OneID NameOne TwoID NameTwo
1 ACARDIA <NA> <NA> <NA> <NA>
2 ACRAL DYSOSTOSIS DYSERYTHROPOIESIS SYNDROME ACR023 Acral Dysostosis Dyserythropoiesis Syndrome <NA> <NA>
3 ACROPECTORAL SYNDROME ACR019 Acropectoral Syndrome 605967 ACROPECTORAL SYNDROME
4 ACROPECTOROVERTEBRAL ACR020 Acropectorovertebral 102510 ACROPECTOROVERTEBRAL
5 DERMATOFIBROSARCOMA PROTUBERANS <NA> <NA> A607907 DERMATOFIBROSARCOMA PROTUBERANS
6 DERMOIDS CORNEA <NA> <NA> <NA> <NA>
7 DERMOIDS OF CORNEA <NA> <NA> 304730 DERMOIDS OF CORNEA
8 GENITAL DWARFISM GNT015 Genital Dwarfism <NA> <NA>
9 HYPERTRYPTOPHANEMIA <NA> <NA> <NA> <NA>
ThreeID NameThree
1 OM76580 Acardia
2 <NA> <NA>
3 OM85203 Acropectoral syndrome
4 <NA> <NA>
5 <NA> <NA>
6 OM67092 Dermoids cornea
7 <NA> <NA>
8 <NA> <NA>
9 OM45632 Hypertryptophanemia

Extracting parts of data.frame

I have an issue while extracting and creating a new data.frame on the basis of previous one.
So we have:
> head(data.raw)
date id contacted contacted_again region
1 2015-11-29 234 CHAT EMAIL APAC
2 2015-11-29 234 EMAIL EMAIL APAC
3 2015-11-27 257 PHONE PHONE EMEA
4 2015-11-27 278 PHONE EMAIL APAC
5 2015-11-27 293 CHAT EMAIL EMEA
6 2015-11-27 243 EMAIL EMAIL EMEA
market
1 AU/NZ
2 SE Asia (English)
3 Spain
4 China Mainland
5 DACH
6 DACH
However, one I write
data.ru <- data.raw[data.raw$market=="Russia",]
I receive the following mess:
date id contacted contacted_again region market
67 2015-11-25 334 CHAT EMAIL EMEA Russia
NA <NA> <NA> <NA> <NA> <NA> <NA>
NA.1 <NA> <NA> <NA> <NA> <NA> <NA>
NA.2 <NA> <NA> <NA> <NA> <NA> <NA>
NA.3 <NA> <NA> <NA> <NA> <NA> <NA>
NA.4 <NA> <NA> <NA> <NA> <NA> <NA>
How should I write a command to receive just a normal data.frame with all rows that $market=="Russia" without any NAs?
I would just use the subset function.
test <- data.frame(x = c("USA", "USA", "USA", "Russia", "Russia", NA), y = c("Orlando", "Boston", "Memphis", NA, "St. Petersburg", "Mexico City"))
print(test)
x y
1 USA Orlando
2 USA Boston
3 USA Memphis
4 Russia <NA>
5 Russia St. Petersburg
6 <NA> Mexico City
subset(test, x == "Russia")
x y
4 Russia <NA>
5 Russia St. Petersburg
You may want to try: data.ru <- data.raw[data.raw$market %in% "Russia",]
Explanation: I am assuming you have empty lines in your dataset, which are read as NAs (missing value). Since R cannot know if a given NA is equal to "Russia" or not, the generated data frame includes them.
Illustration in code:
# create sample dataset
example.df <- data.frame(market=c(NA, "Russia", NA), outcome = c(1,2,3))
# match market using ==
example.df$market == "Russia"
example.df[example.df$market == "Russia",]
# match market using %in%
example.df$market %in% "Russia"
example.df[example.df$market %in% "Russia",]

Vector of logicals based on row membership

thank you for your patience.
I am dealing with a large dataset detailing patients and medications.
Medications are hard to code, as they are (usually) meaningless unless matched with doses.
I have a dataframe with vectors (Drug1, Drug2..... Drug 16) where individual patients are represented by rows.
The vectors are actually factors, with 100s of possible levels (all the drugs the patient could be on).
All I want to do is produce a vector of logicals (TTTTFFFFTTT......) that I could then cbind into a dataframe which will tell me whether a patient is or is not on a particular, drug.
I could then use particularly important drugs' presence or absence as categorical covariates in a model.
I've tried grep, to search along the rows, and I can generate a vector of identifiers, but I cannot seem to generate the vector of logicals.
I realise I'm doing something simply wrong.
names(drugindex)
[1] "book.MRN" "DRUG1" "DRUG2" "DRUG3" "DRUG4" "DRUG5"
[7] "DRUG6" "DRUG7" "DRUG8" "DRUG9" "DRUG10" "DRUG11"
[13] "DRUG12" "DRUG13" "DRUG14" "DRUG15" "DRUG16"
> truvec<-drugindex$book.MRN[as.vector(unlist(apply(drugindex[,2:17], 2, grep, pattern="Lamotrigine")))]
> truvec
truvec
[1] 0024633 0008291 0008469 0030599 0027667
37 Levels: 0008291 0008469 0010188 0014217 0014439 0015822 ... 0034262
> head(drugindex)
book.MRN DRUG1 DRUG2 DRUG3 DRUG4 DRUG5
4 0008291 Venlafaxine Procyclidine Flunitrazepam Amisulpiride Clozapine
31 0008469 Venlafaxine Mirtazapine Lithium Olanzapine Metoprolol
3 0010188 Flurazepam Valproate Olanzapine Mirtazapine Esomeprazole
13 0014217 Aspirin Ramipril Zuclopenthixol Lorazepam Haloperidol
15 0014439 Zopiclone Diazepam Haloperidol Paracetamol <NA>
5 0015822 Olanzapine Venlafaxine Lithium Haloperidol Alprazolam
DRUG6 DRUG7 DRUG8 DRUG9 DRUG10 DRUG11 DRUG12
4 Lamotrigine Alprazolam Lithium Alprazolam <NA> <NA> <NA>
31 Lamotrigine Ramipril Alprazolam Zolpidem Trifluoperazine <NA> <NA>
3 Paracetamol Alprazolam Citalopram <NA> <NA> <NA> <NA>
13 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
15 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
DRUG13 DRUG14 DRUG15 DRUG16
4 <NA> <NA> <NA> <NA>
31 <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA>
13 <NA> <NA> <NA> <NA>
15 <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA>
And what I want is a vector of logicals for each drug, saying whether that patient is on it
Thank you all for your time.
Ross Dunne MRCPsych
"Te occidere possunt sed te edere ne possunt, nefas est".
You were close with your apply attempt, but MARGIN=2 applies the function over columns, not rows. Also, grep returns the locations of the matches; you want grepl, which returns a logical vector. Try this:
apply(x[,-1], 1, function(x) any(grepl("Aspirin",x)))
You could also use %in%, which you may find more intuitive:
apply(x[,-1], 1, "%in%", x="Aspirin")
First, a comment on data structure. You have data in what some call a "wide" format, with a single row per patient and multiple columns for the drugs. It is usually the case that the "long" format, with reapeated rows per patient and a single column for drugs is more amenable to data manipulation. To reshape your data from wide to long and vice versa, take a look at the reshape package. In this case, you would have something like:
library(reshape)
dnow <- melt(drugindex, id.var='book.MRN')
subset(dnow, value=='Lamotrigine')
Much cleaner, and obvious, code, if I may say so ...
Edit: If you need the old structure back you can use cast:
cast(subset(dnow, value=='Lamotrigine'), book.MRN ~ value)
as suggested by #jonw in the comments.

Resources