From a messy character list to a matrix in R

I would really appreciate your help. I have a large character vector containing 2000 strings of different lengths, which I retrieved from Web of Science. My dataset can be downloaded here.
Data structure and Outcome.
Each row of this vector has a different "length" but the same pattern. The characters within the "[]" determine the number of rows and the characters outside determine the columns. I will make an example with these three rows:
[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy
[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands
[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium
The first row has 2 groups in "[]", each with 5 columns; the second row has 3 groups, with 3, 4 and 5 columns; the third row has 3 groups, with 4, 4 and 5 columns.
The outcome will be a matrix like this:
ID | Author | Info01 | Info02 | Info03 | Info04 | Info05
1 | Sorce, A. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DIME | I-16145 Genoa | Italy
1 | Greco, A. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DIME | I-16145 Genoa | Italy
1 | Magistri, L. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DIME | I-16145 Genoa | Italy
1 | Costamagna, P. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DICCA | I-16145 Genoa | Italy
2 | Allema, Bas | Wageningen Univ | NL-6700 AP Wageningen | Netherlands | N/A | N/A
2 | Hemerik, Lia | Wageningen Univ | NL-6700 AP Wageningen | Netherlands | N/A | N/A
2 | Rossing, Walter A. H. | Wageningen Univ | NL-6700 AP Wageningen | Netherlands | N/A | N/A
2 | Allema, Bas | Wageningen Univ | Entomol Lab | NL-6700 AP Wageningen | Netherlands | N/A
2 | van Lenteren, Joop C. | Wageningen Univ | Entomol Lab | NL-6700 AP Wageningen | Netherlands | N/A
2 | van der Werf, Wopke | Wageningen Univ | Ctr Crop Syst Anal | Crop & Weed Ecol Grp | NL-6700 AP Wageningen | Netherlands
3 | Abdissa, Ketema | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Tadesse, Mulualem | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Bezabih, Mesele | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Bekele, Alemayehu | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Abebe, Gemeda | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Apers, Ludwig | Inst Trop Med | Dept Clin Sci | B-2000 Antwerp | Belgium | N/A
3 | Rigouts, Leen | Inst Trop Med | Dept Microbiol | Mycobacteriol Unit | B-2000 Antwerp | Belgium
My Approach
I separate the strings and convert the vector into a list with str_split() from the stringr package:
library(stringr)
CL1 <- str_split(CL, "\\[|\\]", n = Inf)
This generates a list of vectors with characters like this:
[[1999]]
[1] ""
[2] "Zhuo, Hongying; Li, Qingzhong; Li, Wenzuo; Cheng, Jianbo"
[3] " Yantai Univ, Sch Chem & Chem Engn, Lab Theoret & Computat Chem, Yantai 264005, Peoples R China"
[[2000]]
[1] ""
[2] "Zuo, Li; Meng, Qing-Hong; Chung, Peter Chee-Keung"
[3] " Guiyang Med Coll, Dept Immunol, Guiyang 550004, Guizhou Provinc, Peoples R China; "
[4] "Yuan, Kai-Tao"
[5] " Sun Yat Sen Univ, Affiliated Hosp 1, Dept Surg, Guangzhou 510080, Guangdong, Peoples R China; "
[6] "Yu, Li"
[7] " Guangzhou First Municipal Peoples Hosp, Dept Paediat, Guangzhou 510180, Guangdong, Peoples R China; "
[8] "Yang, Ding-Hua"
[9] " Southern Med Univ, Nan Fang Hosp, Dept Hepatobiliary Surg, Guangzhou 510515, Guangdong, Peoples R China"
As you can see, the first element of each vector in the list is blank. After that, each even-numbered element holds a group of authors and each odd-numbered element holds the columns (the affiliation) of that group.
The next step is to separate the groups so that I can assemble a matrix. For this I'm using these two commands:
CL2 <- lapply(CL1,function(x)x[2])
AF1 <- lapply(CL1,function(x)x[3])
Since some rows contain more than 50 groups, I would have to repeat this process in a loop, but I don't know how, so for now I'm doing it manually. I also don't know how to create an ID or how to merge the lists into a matrix.
Any ideas or suggestions will be welcome.

The following should do what you want to achieve:
A <- read.csv("AU.csv", stringsAsFactors = FALSE)
## One vector with all of the data in square brackets
A1 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]))
LA1 <- lengths(A1)
A1 <- gsub("\\[|\\]", "", unlist(A1))
## One vector with all of the other data
A2 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]), invert = TRUE)
LA2 <- lengths(A2) - 1
A2 <- unlist(lapply(A2, function(x) gsub("^\\s+|\\s+$|;\\s+$", "", x[-1])))
## Checking for mistakes....
all.equal(LA1, LA2)
# [1] TRUE
all.equal(sum(LA1), length(A1))
# [1] TRUE
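To make the two regmatches() calls easier to follow, here is a small base R sketch on a single made-up address string. The string x and the variable names inside, outside and cleaned are illustrative assumptions, not part of the dataset. It shows that the inverted match always starts with an empty piece (the text before the first "["), which is why the first element is dropped and why LA2 is lengths(A2) - 1.

```r
# A made-up address string with two bracketed author groups (assumption)
x <- "[Sorce, A.; Greco, A.] Univ Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Italy"

m <- gregexpr("\\[.*?\\]", x)

# Pieces that match the pattern: the bracketed author groups
inside <- regmatches(x, m)[[1]]

# Pieces between the matches: an empty leading piece, then one affiliation per group
outside <- regmatches(x, m, invert = TRUE)[[1]]

# Drop the empty leading piece and strip whitespace and trailing semicolons
cleaned <- gsub("^\\s+|\\s+$|;\\s+$", "", outside[-1])

print(inside)
print(cleaned)
```

After cleaning, each author group lines up with exactly one affiliation string, which is the property the all.equal() checks rely on.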
Now that we have the vectors, we can use cSplit from my "splitstackshape" package to get the output you want:
library(splitstackshape)
library(data.table) ## data.table() is called directly below
library(magrittr)
## Make a data.table of the two vectors and the ID column
DT <- data.table(ID = rep(A[[1]], LA1), A1, A2)
## Here's the splitting....
final <- DT %>%
  cSplit("A1", ";", "long") %>% ## The first column is split and made long
  cSplit("A2", ",") ## The second column is split and made wide
Here's the result:
final
# ID A1 A2_01 A2_02
# 1: 1 Aalten, Pauline Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 2: 1 Ramakers, Inez H. G. B. Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 3: 1 Rozendaal, Nico Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 4: 1 Verhey, Frans R. J. Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 5: 1 Biessels, Geert Jan Univ Med Ctr Utrecht Dept Neurol
# ---
# 13949: 2000 Meng, Qing-Hong Guiyang Med Coll Dept Immunol
# 13950: 2000 Chung, Peter Chee-Keung Guiyang Med Coll Dept Immunol
# 13951: 2000 Yuan, Kai-Tao Sun Yat Sen Univ Affiliated Hosp 1
# 13952: 2000 Yu, Li Guangzhou First Municipal Peoples Hosp Dept Paediat
# 13953: 2000 Yang, Ding-Hua Southern Med Univ Nan Fang Hosp
# A2_03 A2_04 A2_05 A2_06 A2_07 A2_08 A2_09 A2_10
# 1: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 2: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 3: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 4: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 5: Utrecht Netherlands NA NA NA NA NA NA
# ---
# 13949: Guiyang 550004 Guizhou Provinc Peoples R China NA NA NA NA NA
# 13950: Guiyang 550004 Guizhou Provinc Peoples R China NA NA NA NA NA
# 13951: Dept Surg Guangzhou 510080 Guangdong Peoples R China NA NA NA NA
# 13952: Guangzhou 510180 Guangdong Peoples R China NA NA NA NA NA
# 13953: Dept Hepatobiliary Surg Guangzhou 510515 Guangdong Peoples R China NA NA NA NA

You can do this with a few regular-expression manipulations, using functions from plyr and foreach to process everything. Here is an example for the first row:
library(foreach)
library(plyr)
str1 = '[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy'
##split the string into different parts
s1 = strsplit(str1,'; \\[')
s1. = llply(s1,strsplit,split = ']')[[1]]
##get list of authors
auths = llply(s1.,function(x) gsub('^ ','',strsplit(gsub('\\[','',x[1]),';')[[1]]))
##get all other attributes
other.stuff = llply(s1.,function(x) gsub('^ ','',strsplit(x[2],',')[[1]]))
results = foreach(auth = auths, other = other.stuff, .combine = 'rbind') %do%
expand.grid(auth,other[1],other[2],other[3],other[4],other[5])
The output's column names need to be changed, and you need to iterate this for each line, but that should be easy.
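One way the per-line iteration could look is sketched below in base R. This is a rough illustration, not a finished solution: the function name rows_to_table and the column names Info01, Info02, ... are assumptions, and affiliations with fewer parts are padded with NA so that every row has the same width.

```r
# Hypothetical helper: one output row per author, ID = position of the input line
rows_to_table <- function(lines) {
  pat <- "\\[.*?\\]"
  groups <- regmatches(lines, gregexpr(pat, lines))
  rest <- regmatches(lines, gregexpr(pat, lines), invert = TRUE)
  out <- list()
  for (i in seq_along(lines)) {
    # Affiliations: drop the empty leading piece, then trailing ";" and spaces
    affs <- trimws(sub(";\\s*$", "", rest[[i]][-1]))
    for (j in seq_along(groups[[i]])) {
      authors <- trimws(strsplit(gsub("\\[|\\]", "", groups[[i]][j]), ";", fixed = TRUE)[[1]])
      info <- trimws(strsplit(affs[j], ",", fixed = TRUE)[[1]])
      for (a in authors) out[[length(out) + 1]] <- c(i, a, info)
    }
  }
  # Pad every row with NA up to the widest row, then bind into a matrix
  width <- max(lengths(out))
  mat <- t(vapply(out, function(r) c(r, rep(NA_character_, width - length(r))),
                  character(width)))
  colnames(mat) <- c("ID", "Author", sprintf("Info%02d", seq_len(width - 2)))
  as.data.frame(mat, stringsAsFactors = FALSE)
}

lines <- c(
  "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy",
  "[Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium"
)
res <- rows_to_table(lines)
print(res)
```

Growing a list inside a loop is slow for very large inputs, but it keeps the sketch short and should be acceptable for a few thousand lines.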

Related

Splitting a data frame in rows according to a pattern in a column

I have a data frame like this:
mydf=data.frame(Authors=c("A","B","C"), ID=c("1","2","3"), Adresses=c("[XYZ, DEF] Ege Univ, Izmir, Turkey","[Vil, Beat; Fern, Alm; Pro-Pas, Ram; Fevfz, Jes; Saur, Mari] INIA CSIC, Dept Genet Anim, Madrid, Spain; [Penza, Carna; Housen, Rosie] Univ Edigh, Roxbn Inst, Edinburgh, Scotland","[Zeek, Umt] Kastamonu Univ, Kast, Turkey; [Kalu, Sear] Ege Univ, Fac Engn, Izmir, Turkey"))
It looks like this:
I want to split it according to the pattern in the Adresses column, like this:
The pattern is roughly: [ ] ;
However, the last record in a cell (or the only record, if the cell has just one) has no trailing semicolon, as you can see from the first picture.
I tried tidyr, dplyr and regular expressions in R, as well as the pattern in strsplit(as.character(mydf[,3]), "[[(.*)]](.*);"), but none of them worked. Any help will be appreciated.
In base R, we can split the column into a list of vectors, replicate each row of the data according to the lengths of that list, and then fill 'Adresses' with the unlisted pieces:
lst1 <- strsplit(mydf$Adresses, ";\\s*(?=\\[)", perl = TRUE)
mydf2 <- transform(mydf[rep(seq_len(nrow(mydf)), lengths(lst1)), ],
                   Adresses = unlist(lst1))
row.names(mydf2) <- NULL
Output:
> mydf2
Authors ID Adresses
1 A 1 [XYZ, DEF] Ege Univ, Izmir, Turkey
2 B 2 [Vil, Beat; Fern, Alm; Pro-Pas, Ram; Fevfz, Jes; Saur, Mari] INIA CSIC, Dept Genet Anim, Madrid, Spain
3 B 2 [Penza, Carna; Housen, Rosie] Univ Edigh, Roxbn Inst, Edinburgh, Scotland
4 C 3 [Zeek, Umt] Kastamonu Univ, Kast, Turkey
5 C 3 [Kalu, Sear] Ege Univ, Fac Engn, Izmir, Turkey
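The reason for the lookahead (?=\\[) in the split pattern can be seen on a single string. The sketch below uses a shortened, made-up address (an assumption for illustration) and compares the lookahead pattern with a plain pattern that consumes the bracket.

```r
# Shortened example address with two bracketed groups (assumption)
x <- "[Vil, Beat; Fern, Alm] INIA CSIC, Madrid, Spain; [Penza, Carna] Univ Edigh, Edinburgh, Scotland"

# Lookahead: split at ";" only when "[" follows, and leave the "[" in place
with_lookahead <- strsplit(x, ";\\s*(?=\\[)", perl = TRUE)[[1]]

# No lookahead: the "[" is part of the separator and gets removed
without_lookahead <- strsplit(x, ";\\s*\\[")[[1]]

print(with_lookahead)
print(without_lookahead)
```

The semicolon between the two authors inside the first bracket is not followed by "[", so it never triggers a split in either version.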
You could use the regex "; (?=\\[)", a lookahead that matches a semicolon and a space only when they are followed by an opening bracket.
E.g. with tidyr:
library(tidyr)
mydf |>
separate_rows(Adresses, sep = "; (?=\\[)")
Output:
# A tibble: 5 × 3
Authors ID Adresses
<chr> <chr> <chr>
1 A 1 [XYZ, DEF] Ege Univ, Izmir, Turkey
2 B 2 [Vil, Beat; Fern, Alm; Pro-Pas, Ram; Fevfz, Jes; Saur, Mari] INIA CSIC, Dept Genet Anim, Madrid, Spain
3 B 2 [Penza, Carna; Housen, Rosie] Univ Edigh, Roxbn Inst, Edinburgh, Scotland
4 C 3 [Zeek, Umt] Kastamonu Univ, Kast, Turkey
5 C 3 [Kalu, Sear] Ege Univ, Fac Engn, Izmir, Turkey
You could use separate_rows() and set the separator as ';\\s*(?=\\[)':
library(tidyr)
mydf %>%
separate_rows(Adresses, sep = ';\\s*(?=\\[)')
# # A tibble: 5 × 3
# Authors ID Adresses
# <chr> <chr> <chr>
# 1 A 1 [XYZ, DEF] Ege Univ, Izmir, Turkey
# 2 B 2 [Vil, Beat; Fern, Alm; Pro-Pas, Ram; Fevfz, Jes; Saur, Mari] INIA…
# 3 B 2 [Penza, Carna; Housen, Rosie] Univ Edigh, Roxbn Inst, Edinburgh, …
# 4 C 3 [Zeek, Umt] Kastamonu Univ, Kast, Turkey
# 5 C 3 [Kalu, Sear] Ege Univ, Fac Engn, Izmir, Turkey

R replace characters in a column based on a word in another column

I have a dataset that has multiple columns. In the Date_Received column there are certain rows that have the word Abandonment in them. I would like to search the rows in this column that have this word, and then replace the characters AP with AB in the corresponding rows of the AP column.
How can I do this?
Sample data df:
structure(list(id = 1:6, Date_Received = c("Addition 1/2/2018",
"Swimming Pool 1/8/2018", "Swimming Pool 1/8/2018", "Abandonment 1/9/2018",
"Swimming Pool 1/12/2017", "Abandonment 2/5/2018"), Date_Approved = c("1/2/2018",
"1/8/2018", "1/8/2018", "1/9/2018", "1/12/2017", "2/5/2018"),
AP= c("AP-18-001", "AP-18-002", "AP-18-003", "AP-18-004",
"AP-18-005", "AP-18-006"), Permit.. = c("06-SE-1812147",
"06-SS-1813516", "06-SS-1813699", "06-SE-1814032", "06-SE-1814924",
"06-SS-1820333"), Owner.Name.Agent = c("Tiny Tots Academy, Inc Mike Davis",
"Ernesto & Elizabeth Diaz Ensign Pools", "DSL Contruction & Investments LLC",
"BSD North Federal LLC EPOCA Plumbing Corp", "Maria Silva Parkwood Pools And Pavers LLC",
"HPA Borrower Westland Plumbing"), X = c("NA NA", "NA NA",
"NA NA", "NA NA", "NA NA", "NA NA"), Project.Address.City = c("61111 Washington Street Hollywood, 33024",
"1224 SW 170 Avenue SW Ranches, 33331", "1233 NW 6 Place Plantation, 33325",
"1231 N Federal Hwy Hollywood, 33020", "3223 Dawson Street",
"3691 SW 31 Avenue Fort Lauderdale")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
Code:
library(tidyverse)
library(dplyr)
df = df %>% grepl("Abandonement", df$Date_Received) %>% str_replace(df$AP) #.... stuck
This should do it:
df %>%
mutate(AP = ifelse(grepl("Abandonment", Date_Received, fixed = TRUE), gsub("AP", "AB", AP), AP))
Which gives:
# A tibble: 6 × 8
id Date_Received Date_Approved AP Permit.. Owner.Name.Agent X Project.Address.City
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Addition 1/2/2018 1/2/2018 AP-18-001 06-SE-1812147 Tiny Tots Academy, Inc Mike Davis NA NA 61111 Washington Street Holly…
2 2 Swimming Pool 1/8/2018 1/8/2018 AP-18-002 06-SS-1813516 Ernesto & Elizabeth Diaz Ensign Pools NA NA 1224 SW 170 Avenue SW Ranches…
3 3 Swimming Pool 1/8/2018 1/8/2018 AP-18-003 06-SS-1813699 DSL Contruction & Investments LLC NA NA 1233 NW 6 Place Plantation, 3…
4 4 Abandonment 1/9/2018 1/9/2018 AB-18-004 06-SE-1814032 BSD North Federal LLC EPOCA Plumbing Corp NA NA 1231 N Federal Hwy Hollywood,…
5 5 Swimming Pool 1/12/2017 1/12/2017 AP-18-005 06-SE-1814924 Maria Silva Parkwood Pools And Pavers LLC NA NA 3223 Dawson Street
6 6 Abandonment 2/5/2018 2/5/2018 AB-18-006 06-SS-1820333 HPA Borrower Westland Plumbing NA NA 3691 SW 31 Avenue Fort Lauder…
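For comparison, the same replacement can be sketched in base R with a logical index. The three-row data frame below is a cut-down stand-in for the sample data (an assumption to keep the example short), and sub("^AP", ...) is a slightly stricter variant that only touches a leading "AP".

```r
# Cut-down stand-in for the sample data (assumption)
df <- data.frame(
  Date_Received = c("Addition 1/2/2018", "Abandonment 1/9/2018", "Swimming Pool 1/12/2017"),
  AP = c("AP-18-001", "AP-18-004", "AP-18-005"),
  stringsAsFactors = FALSE
)

# Rows whose Date_Received contains the literal word "Abandonment"
is_abandonment <- grepl("Abandonment", df$Date_Received, fixed = TRUE)

# Replace the leading "AP" with "AB" in those rows only
df$AP[is_abandonment] <- sub("^AP", "AB", df$AP[is_abandonment])

print(df)
```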

Find city names within affiliations and add them with their corresponding countries in new columns of a dataframe

I have a dataframe ‘dfa’ of affiliations that contains city names, for which the country is sometimes missing, e.g. like rows 4 (BAGHDAD) and 7 (BERLIN):
dfa <- data.frame(affiliation=c("DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS",
"DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA.",
"DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES",
"COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.",
"DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA.",
"LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY.",
"DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN",
"INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY.",
"DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND.",
"DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN",
"DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN.",
"LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA."))
I have now a second dataframe ‘dfb’ that contains a list of cities and corresponding country, some of which are present in 'dfa':
dfb <- data.frame(city=c("AGRI","AMSTERDAM","ATHENS","AUCKLAND","BUENOS AIRES","BEIJING","BAGHDAD","BANGKOK","BERLIN","BUDAPEST"),
country=c("TURKEY","NETHERLANDS","GREECE","NEW ZEALAND","ARGENTINA","CHINA","IRAQ","THAILAND","GERMANY","HUNGARY"))
How can I add cities and corresponding countries in two new columns only for cities that are present in both ‘dfa’ and ‘dfb’ (and even when the country is missing, as for BAGHDAD and BERLIN)?
NB: the goal is to add full city names but not part of them. Below in row 7, an example of what is not wanted: the AGRI city of TURKEY is inappropriately associated with BERLIN because this row includes the 'AGRICULTURE' word.
Is there a simple way to do that, ideally using dplyr?
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN AGRI TURKEY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
Combining str_extract with either a join or a second str_extract is one way to get there.
str_extract returns the first match it finds; paste0 collapses the city names into one long alternation (OR) pattern to check against.
library(dplyr)
library(stringr)
dfa %>%
  mutate(city = str_extract(affiliation, paste0("\\b", dfb$city, "\\b", collapse = "|"))) %>%
  left_join(dfb, by = "city")
Edit: added word boundaries in the paste0 call so that only whole city names are matched and partial matches are avoided.
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN BERLIN GERMANY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
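The effect of the word boundaries can be checked in base R on the problematic row. In this sketch the helper first_match and the short cities vector are assumptions for illustration; it compares a plain alternation with one where every city is wrapped in \\b on both sides.

```r
cities <- c("AGRI", "BERLIN", "BAGHDAD")

# Plain alternation versus whole-word alternation
loose <- paste(cities, collapse = "|")
strict <- paste0("\\b", cities, "\\b", collapse = "|")

x <- "DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN"

# Hypothetical helper: first match of a pattern in one string, or NA if none
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  if (m[1] == -1) NA_character_ else regmatches(text, m)
}

loose_hit <- first_match(loose, x)    # picks up the start of "AGRICULTURE"
strict_hit <- first_match(strict, x)  # only whole words qualify

print(c(loose = loose_hit, strict = strict_hit))
```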
This approach accounts for the possibility that an affiliation could match more than one city name.
library(tidyverse)
dfa %>%
mutate(city = map(affiliation, ~ str_extract(.x, paste0("\\b", dfb$city, "\\b")))) %>%
unnest(cols = c(city)) %>%
group_by(affiliation) %>%
mutate(nmatches = sum(!is.na(city))) %>%
filter((nmatches > 0 & !is.na(city)) | (nmatches == 0 & row_number() == 1)) %>%
ungroup() %>%
left_join(dfb, by = "city") %>%
mutate(country_match = str_detect(affiliation, country))
# A tibble: 12 x 5
affiliation city nmatches country country_match
<chr> <chr> <int> <chr> <lgl>
1 DEPARTMENT OF PHARMACY,… AMSTE… 1 NETHER… TRUE
2 DEPARTMENT OF BIOCHEMIS… NA 0 NA NA
3 DEPARTMENT OF PATHOLOGY… NA 0 NA NA
4 COLLEGE OF EDUCATION FO… BAGHD… 1 IRAQ FALSE
5 DEPARTMENT OF CLINICAL … BEIJI… 1 CHINA TRUE
6 LABORATORY OF MOLECULAR… NA 0 NA NA
7 BERLIN INSTITUTE OF HEA… BERLIN 1 GERMANY FALSE
8 INSTITUTE OF LABORATORY… NA 0 NA NA
9 DEPARTMENT OF CLINICAL … BANGK… 1 THAILA… TRUE
10 DEPARTMENT OF BIOLOGY, … NA 0 NA NA
11 DEPARTMENT OF MOLECULAR… NA 0 NA NA
12 LABORATORY OF CARDIOVAS… BEIJI… 1 CHINA TRUE
You could then double-check the cases where nmatches is 1 but country_match is FALSE, and where nmatches is 2 or more you can keep the row with country_match == TRUE.

How do I change the way my data is organized?

My data is currently organized as shown in the first table below. I am only showing a portion of the data, as the full table is quite large (over 100 rows).
Row September October November December January February March April May June July
1 Chino Hills Huntington Bea~ Fountain Valley Anaheim Fountain Vall~ Arcadia Anaheim Newport Be~ Santa Ana NA NA
2 Irvine Cerritos Long Beach Chino Hills Cerritos Anaheim NA Banning Newport Beach Anaheim NA
3 Glendale NA West Covina Monterey Park Encino NA Monterey Pa~ NA Los Angeles Cerritos Beverly Hi~
4 Norco Fountain Valley NA Monterey Park NA Long Beach NA Santa Ana Huntington Be~ Fountain Val~ NA
5 Los Angeles Inglewood West Covina Glendale NA Glendale NA Granada Hi~ Chino West Covina Tarzana
I want to change the way it is organized so that it shows the following. I want to emphasize that it would show all of the cities, not just the ones I have chosen to list. This is an incomplete diagram, but it gets the idea across:
+-------------+------------------+--------+----------+
| Chino Hills | Huntington Beach | Irvine | Glendale |
+-------------+------------------+--------+----------+
| Row 1 | Row 1 | Row 2 | Row 3 |
| Row 2 | | | Row 5 |
| | | | Row 5 |
+-------------+------------------+--------+----------+
I have tried tidyr::separate_rows(dfl, col), but this only works if the cities are in one cell; however, they are in multiple cells in multiple rows. This is what happens when I try the tidyr::separate_rows(dfl, col):
Row September October November December January February March April May June July
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Chino Hills Huntington Bea~ Fountain Valley Anaheim Fountain Vall~ Arcadia Anaheim Newport Be~ Santa Ana NA NA
2 2 Irvine Cerritos Long Beach Chino Hills Cerritos Anaheim NA Banning Newport Beach Anaheim NA
3 3 Glendale NA West Covina Monterey Park Encino NA Monterey Pa~ NA Los Angeles Cerritos Beverly Hi~
4 4 Norco Fountain Valley NA Monterey Park NA Long Beach NA Santa Ana Huntington Be~ Fountain Val~ NA
5 5 Los Angeles Inglewood West Covina Glendale NA Glendale NA Granada Hi~ Chino West Covina Tarzana
As you can see, the only thing it does is add in another row of numbers which I do not need.
In summary, I need R to find all of the cities and tell me which rows they are in. A row may appear more than once if the city occurs in that row more than once. The result will have many columns, not just the single column that tidyr::separate_rows works on; the number of columns depends on the number of different cities.
We can reshape the data to long format, keep only the unique combinations of Row and value, and then reshape back to wide format with one column per city. This assumes the data frame is named df.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -Row, values_drop_na = TRUE) %>%
distinct(Row, value) %>%
group_by(value) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = value, values_from = Row)
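The same long-then-wide idea can be sketched in base R without extra packages. The small data frame below is a made-up excerpt with three months (an assumption), and the names long, rows_by_city and wide are illustrative. Like the pipeline above, it keeps each Row at most once per city.

```r
# Made-up excerpt of the data (assumption): one row per Row, one column per month
df <- data.frame(
  Row = 1:3,
  September = c("Chino Hills", "Irvine", "Glendale"),
  October = c("Huntington Beach", "Cerritos", NA),
  December = c("Anaheim", "Chino Hills", "Monterey Park"),
  stringsAsFactors = FALSE
)

month_cols <- setdiff(names(df), "Row")

# Long format: one (Row, city) pair per cell, stacked column by column
long <- data.frame(
  Row = rep(df$Row, times = length(month_cols)),
  city = unlist(df[month_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long <- unique(long[!is.na(long$city), ])

# One vector of row numbers per city
rows_by_city <- split(long$Row, long$city)

# Wide format: pad the shorter vectors with NA and bind them as columns
n <- max(lengths(rows_by_city))
wide <- do.call(cbind, lapply(rows_by_city, function(r) c(r, rep(NA, n - length(r)))))

print(wide)
```

If repeated occurrences of a city within the same row should be kept, dropping the unique() call would be the place to start.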

Add new column if range of columns contains string in R

I have a dataframe like below. I would like to add 2 columns:
ContainsANZ: Indicates if any of the columns from F0 to F3 contain 'Australia' or 'New Zealand' ignoring NA values
AllANZ: Indicates if all non NA columns contain 'Australia' or 'New Zealand'
Starting dataframe would be:
df
Col.A Col.B Col.C F0 F1 F2 F3
1 data 0 xxx Australia Singapore <NA> <NA>
2 data 1 yyy United States United States United States <NA>
3 data 0 zzz Australia Australia Australia Australia
4 data 0 ooo Hong Kong London Australia <NA>
5 data 1 xxx New Zealand <NA> <NA> <NA>
The end result should look like this:
df
Col.A Col.B Col.C F0 F1 F2 F3 ContainsANZ AllANZ
1 data 0 xxx Australia Singapore <NA> <NA> Australia undefined
2 data 1 yyy United States United States United States <NA> undefined undefined
3 data 0 zzz Australia Australia Australia Australia Australia Australia
4 data 0 ooo Hong Kong London Australia <NA> Australia undefined
5 data 1 xxx New Zealand <NA> <NA> <NA> New Zealand New Zealand
I'm using dplyr (preferred solution) and have come up with the code below, which doesn't work and is very repetitive. Is there a better way to write this so that I don't have to copy the F0|F1|F2... rules over and over? My real data set has more columns. Are the NAs interfering with the code?
df <- df %>%
mutate(ANZFlag =
ifelse(
F0 == 'Australia' |
F1 == 'Australia' |
F2 == 'Australia' |
F3 == 'Australia',
'Australia',
ifelse(
F0 == 'New Zealand' |
F1 == 'New Zealand' |
F2 == 'New Zealand' |
F3 == 'New Zealand',
'New Zealand', 'undefined'
)
)
)
Still some typing, but I think this gets at the essence you're looking for:
library(dplyr)
df <- read.table(text='Col.A,Col.B,Col.C,F0,F1,F2,F3
data,0,xxx,Australia,Singapore,NA,NA
data,1,yyy,"United States","United States","United States",NA
data,0,zzz,Australia,Australia,Australia,Australia
data,0,ooo,"Hong Kong",London,Australia,NA
data,1,xxx,"New Zealand",NA,NA,NA', header=TRUE, sep=",", stringsAsFactors=FALSE)
down_under <- function(x) {
  mtch <- c("Australia", "New Zealand")
  cols <- unlist(x)[c("F0", "F1", "F2", "F3")]
  ## AllANZ compares the non-NA values against mtch, not against themselves
  bind_cols(x, tibble(ContainsANZ=any(mtch %in% cols, na.rm=TRUE),
                      AllANZ=all(as.vector(na.omit(cols)) %in% mtch)))
}
rowwise(df) %>% do(down_under(.))
## Source: local data frame [5 x 9]
## Groups: <by row>
##
## Col.A Col.B Col.C F0 F1 F2 F3 ContainsANZ AllANZ
## (chr) (int) (chr) (chr) (chr) (chr) (chr) (lgl) (lgl)
## 1 data 0 xxx Australia Singapore NA NA TRUE FALSE
## 2 data 1 yyy United States United States United States NA FALSE FALSE
## 3 data 0 zzz Australia Australia Australia Australia TRUE TRUE
## 4 data 0 ooo Hong Kong London Australia NA TRUE FALSE
## 5 data 1 xxx New Zealand NA NA NA TRUE TRUE
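A vectorised base R alternative avoids the row-by-row function altogether. This is a sketch under the assumption that the flags should be logical, as in the answer above; the data frame repeats only the F0 to F3 columns of the sample, and the names anz, vals, is_anz and present are illustrative.

```r
# F0..F3 columns of the sample data
df <- data.frame(
  F0 = c("Australia", "United States", "Australia", "Hong Kong", "New Zealand"),
  F1 = c("Singapore", "United States", "Australia", "London", NA),
  F2 = c(NA, "United States", "Australia", "Australia", NA),
  F3 = c(NA, NA, "Australia", NA, NA),
  stringsAsFactors = FALSE
)

anz <- c("Australia", "New Zealand")
f_cols <- c("F0", "F1", "F2", "F3")

vals <- as.matrix(df[f_cols])
# %in% returns FALSE for NA, so missing cells never count as a match
is_anz <- matrix(vals %in% anz, nrow = nrow(vals))
present <- !is.na(vals)

# Any non-NA cell is Australia or New Zealand
df$ContainsANZ <- rowSums(is_anz) > 0
# Every non-NA cell is Australia or New Zealand (and there is at least one)
df$AllANZ <- rowSums(present) > 0 & rowSums(is_anz) == rowSums(present)

print(df)
```

Adding more F columns only requires extending f_cols, which addresses the repetition in the original ifelse() chain.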
