I have the following dataframe, main_df.
structure(list(Id = c(190150L, 243744L, 204796L, 139630L, 156541L,
157377L, 225627L), Name = c("Columbia University in the City of New York",
"Stanford University", "Ohio State University-Main Campus", "Emmanuel College",
"University of the Cumberlands", "Midway University", "University of the Incarnate Word"
), desired_sport = c("Archery", "Synchronized Swimming", "Synchronized Swimming",
"Archery", "Archery", "Archery", "Synchronized Swimming"), academic_strength = c("elite",
"elite", "average", "weak", "weak", "weak", "weak")), .Names = c("Id",
"Name", "desired_sport", "academic_strength"), class = "data.frame", row.names = c(1L,
258L, 1043L, 1144L, 1145L, 1146L, 1500L))
I need different desired sports and different levels of academic strength.
At a minimum, I need a dataframe with at least 3 rows for each combination of desired_sport and academic_strength.
To see how many rows were present in each combination, I created a separate dataframe:
aggregate_test_df <- aggregate(Id ~ desired_sport + academic_strength, main_df, length)
I then created a new dataframe with all the combinations I needed, thinking I could cbind the extra columns and fill in the remainder.
The new combination_test_df was created as follows:
academic_strength <- c("elite", "strong", "average", "weak")
sport_test <- c("Archery", "Synchronized Swimming")
combination_test_df <- expand.grid(sport_test, academic_strength)
i <- sapply(combination_test_df, is.factor)
combination_test_df[i] <- lapply(combination_test_df[i], as.character)
combination_test_df$count <- 3
library(splitstackshape)  # expandRows() comes from splitstackshape
combination_test_df2 <- expandRows(combination_test_df, "count")
I then got stuck: I could not merge or cbind without creating extra combinations.
The desired output is a dataframe with each desired_sport / academic_strength combination appearing 3 times. Some rows will be NA and some will be filled in, which will let me create rules to fill in the NAs in the Name and Id columns.
Output would look like a dataframe similar to this:
Id Name desired_sport academic_strength
190150 Columbia University in the City of New York Archery elite
NA NA Archery elite
NA NA Archery elite
NA NA Archery strong
NA NA Archery strong
NA NA Archery strong
NA NA Archery average
NA NA Archery average
NA NA Archery average
139630 Emmanuel College Archery weak
156541 University of the Cumberlands Archery weak
157377 Midway University Archery weak
243744 Stanford University Synchronized Swimming elite
NA NA Synchronized Swimming elite
NA NA Synchronized Swimming elite
NA NA Synchronized Swimming strong
NA NA Synchronized Swimming strong
NA NA Synchronized Swimming strong
204796 Ohio State University-Main Campus Synchronized Swimming average
NA NA Synchronized Swimming average
NA NA Synchronized Swimming average
NA NA Synchronized Swimming weak
NA NA Synchronized Swimming weak
NA NA Synchronized Swimming weak
I would then like to fill in the blanks, so the complete final dataframe would be:
Id Name desired_sport academic_strength
190150 Columbia University in the City of New York Archery elite
139630 Emmanuel College Archery elite
156541 University of the Cumberlands Archery elite
139630 Emmanuel College Archery strong
156541 University of the Cumberlands Archery strong
157377 Midway University Archery strong
139630 Emmanuel College Archery average
156541 University of the Cumberlands Archery average
157377 Midway University Archery average
139630 Emmanuel College Archery weak
156541 University of the Cumberlands Archery weak
157377 Midway University Archery weak
243744 Stanford University Synchronized Swimming elite
204796 Ohio State University-Main Campus Synchronized Swimming elite
NA NA Synchronized Swimming elite
204796 Ohio State University-Main Campus Synchronized Swimming strong
NA NA Synchronized Swimming strong
NA NA Synchronized Swimming strong
204796 Ohio State University-Main Campus Synchronized Swimming average
NA NA Synchronized Swimming average
NA NA Synchronized Swimming average
NA NA Synchronized Swimming weak
NA NA Synchronized Swimming weak
NA NA Synchronized Swimming weak
Any advice?
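One possible approach (a sketch, not from the original post, assuming dplyr is available): number the universities within each existing combination, build the full grid of combinations with three slots each, and left-join so missing slots stay NA instead of multiplying rows.
library(dplyr)
# give each row a slot number within its combination
slots <- main_df %>%
  group_by(desired_sport, academic_strength) %>%
  mutate(slot = row_number()) %>%
  ungroup()
# full grid: every sport x strength combination, three slots each
template <- expand.grid(desired_sport = c("Archery", "Synchronized Swimming"),
                        academic_strength = c("elite", "strong", "average", "weak"),
                        slot = 1:3,
                        stringsAsFactors = FALSE)
# the join keeps exactly 3 rows per combination; unmatched slots become NA
result <- template %>%
  left_join(slots, by = c("desired_sport", "academic_strength", "slot")) %>%
  arrange(desired_sport, academic_strength, slot) %>%
  select(Id, Name, desired_sport, academic_strength)
Note that combinations with more than three existing rows would lose the extras here, and the rules for filling in the remaining NAs would still need to be applied afterwards.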
I'm a beginner in R, so apologies for any errors, and thank you for helping.
I have a dataset (liver) where rows are patient ID numbers, and columns include what region the patient resides in (London, Yorkshire etc) and what unit the patient was treated in (hospital name). Some of the units are private units. I've identified 120 patients from London, of whom 100 were treated across three private units. I want to remove the 100 London patients treated in private units but I keep accidentally removing all patients treated in the private units (around 900 patients). I'd be grateful for advice on how to just remove the London patients treated privately.
I've tried various combinations of using subset and filter with different exclamation points and brackets in different places including for example:
liver <- filter(liver, region_name != "London" & unit_name!="Primrose Hospital" & unit_name != "Oak Hospital" & unit_name != "Wilson Hospital")
Thank you very much.
Your unit_name condition is zeroing out your results. Try the match function, which is more commonly seen in its infix form %in%:
liver <- filter(liver,
region_name != "London",
! unit_name %in% c("Primrose Hospital",
"Oak Hospital",
"Wilson Hospital"))
Also, within filter() you can separate logical AND conditions with commas.
Building on Pariksheet's great start (which by itself still drops private-hospital patients from outside London): here we need to use the OR operator | within the filter function. I've made an example dataframe which demonstrates how this works for your case. The example tibble contains your three private London hospitals plus one non-private hospital that we want to keep, as well as Manchester patients who attend both Manch General and one of the private hospitals, all of whom we want to keep.
EDITED: Now includes character vectors to allow generalisation of combinations to exclude.
library(dplyr)  # for filter(), %>% and tibble(), which dplyr re-exports
liver <- tibble(region_name = rep(c('London', 'Liverpool', 'Glasgow', 'Manchester'), each = 4),
unit_name = c(rep(c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital',
'State Hospital'), times = 3),
rep(c('Manch General', 'Primrose Hospital'), each = 2)))
liver
# A tibble: 16 x 2
region_name unit_name
<chr> <chr>
1 London Primrose Hospital
2 London Oak Hospital
3 London Wilson Hospital
4 London State Hospital
5 Liverpool Primrose Hospital
6 Liverpool Oak Hospital
7 Liverpool Wilson Hospital
8 Liverpool State Hospital
9 Glasgow Primrose Hospital
10 Glasgow Oak Hospital
11 Glasgow Wilson Hospital
12 Glasgow State Hospital
13 Manchester Manch General
14 Manchester Manch General
15 Manchester Primrose Hospital
16 Manchester Primrose Hospital
excl.private.regions <- c('London',
'Liverpool',
'Glasgow')
excl.private.hospitals <- c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital')
liver %>%
filter(! region_name %in% excl.private.regions |
! unit_name %in% excl.private.hospitals)
# A tibble: 7 x 2
region_name unit_name
<chr> <chr>
1 London State Hospital
2 Liverpool State Hospital
3 Glasgow State Hospital
4 Manchester Manch General
5 Manchester Manch General
6 Manchester Primrose Hospital
7 Manchester Primrose Hospital
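For the original single-region case (remove only the London patients treated at the three private hospitals and keep everyone else), the same idea can also be written as a negated AND; a sketch, assuming the column and hospital names from the question:
private_hospitals <- c("Primrose Hospital", "Oak Hospital", "Wilson Hospital")
liver <- liver %>%
  filter(!(region_name == "London" & unit_name %in% private_hospitals))
By De Morgan's law, !(A & B) is the same as !A | !B, so this is the single-region instance of the OR pattern above; it states the exclusion rule directly and drops a row only when both conditions hold.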
I have a dataset that has multiple columns. In the Date_Received column there are certain rows that have the word Abandonment in them. I would like to search the rows in this column that have this word, and then replace the characters AP with AB in the corresponding rows of the AP column.
How can I do this?
Sample data df:
structure(list(id = 1:6, Date_Received = c("Addition 1/2/2018",
"Swimming Pool 1/8/2018", "Swimming Pool 1/8/2018", "Abandonment 1/9/2018",
"Swimming Pool 1/12/2017", "Abandonment 2/5/2018"), Date_Approved = c("1/2/2018",
"1/8/2018", "1/8/2018", "1/9/2018", "1/12/2017", "2/5/2018"),
AP= c("AP-18-001", "AP-18-002", "AP-18-003", "AP-18-004",
"AP-18-005", "AP-18-006"), Permit.. = c("06-SE-1812147",
"06-SS-1813516", "06-SS-1813699", "06-SE-1814032", "06-SE-1814924",
"06-SS-1820333"), Owner.Name.Agent = c("Tiny Tots Academy, Inc Mike Davis",
"Ernesto & Elizabeth Diaz Ensign Pools", "DSL Contruction & Investments LLC",
"BSD North Federal LLC EPOCA Plumbing Corp", "Maria Silva Parkwood Pools And Pavers LLC",
"HPA Borrower Westland Plumbing"), X = c("NA NA", "NA NA",
"NA NA", "NA NA", "NA NA", "NA NA"), Project.Address.City = c("61111 Washington Street Hollywood, 33024",
"1224 SW 170 Avenue SW Ranches, 33331", "1233 NW 6 Place Plantation, 33325",
"1231 N Federal Hwy Hollywood, 33020", "3223 Dawson Street",
"3691 SW 31 Avenue Fort Lauderdale")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
Code:
library(tidyverse)
library(dplyr)
df = df %>% grepl("Abandonement", df$Date_Received) %>% str_replace(df$AP) #.... stuck
This should do it:
df %>%
mutate(AP = ifelse(grepl("Abandonment", Date_Received, fixed = TRUE), gsub("AP", "AB", AP), AP))
Which gives:
# A tibble: 6 × 8
id Date_Received Date_Approved AP Permit.. Owner.Name.Agent X Project.Address.City
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Addition 1/2/2018 1/2/2018 AP-18-001 06-SE-1812147 Tiny Tots Academy, Inc Mike Davis NA NA 61111 Washington Street Holly…
2 2 Swimming Pool 1/8/2018 1/8/2018 AP-18-002 06-SS-1813516 Ernesto & Elizabeth Diaz Ensign Pools NA NA 1224 SW 170 Avenue SW Ranches…
3 3 Swimming Pool 1/8/2018 1/8/2018 AP-18-003 06-SS-1813699 DSL Contruction & Investments LLC NA NA 1233 NW 6 Place Plantation, 3…
4 4 Abandonment 1/9/2018 1/9/2018 AB-18-004 06-SE-1814032 BSD North Federal LLC EPOCA Plumbing Corp NA NA 1231 N Federal Hwy Hollywood,…
5 5 Swimming Pool 1/12/2017 1/12/2017 AP-18-005 06-SE-1814924 Maria Silva Parkwood Pools And Pavers LLC NA NA 3223 Dawson Street
6 6 Abandonment 2/5/2018 2/5/2018 AB-18-006 06-SS-1820333 HPA Borrower Westland Plumbing NA NA 3691 SW 31 Avenue Fort Lauder…
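Since the question's attempt reached for stringr's str_replace, here is an equivalent sketch using stringr verbs (assuming the tidyverse/stringr libraries loaded in the question's code):
df <- df %>%
  mutate(AP = ifelse(str_detect(Date_Received, fixed("Abandonment")),
                     str_replace(AP, "AP", "AB"),
                     AP))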
I have a dataframe ‘dfa’ of affiliations that contains city names, for which the country is sometimes missing, e.g. like rows 4 (BAGHDAD) and 7 (BERLIN):
dfa <- data.frame(affiliation=c("DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS",
"DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA.",
"DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES",
"COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.",
"DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA.",
"LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY.",
"DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN",
"INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY.",
"DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND.",
"DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN",
"DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN.",
"LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA."))
I have now a second dataframe ‘dfb’ that contains a list of cities and corresponding country, some of which are present in 'dfa':
dfb <- data.frame(city=c("AGRI","AMSTERDAM","ATHENS","AUCKLAND","BUENOS AIRES","BEIJING","BAGHDAD","BANGKOK","BERLIN","BUDAPEST"),
country=c("TURKEY","NETHERLANDS","GREECE","NEW ZEALAND","ARGENTINA","CHINA","IRAQ","THAILAND","GERMANY","HUNGARY"))
How can I add cities and corresponding countries in two new columns only for cities that are present in both ‘dfa’ and ‘dfb’ (and even when the country is missing, as for BAGHDAD and BERLIN)?
NB: the goal is to match full city names only, not parts of them. Row 7 below shows what is not wanted: the city AGRI (in TURKEY) is wrongly matched for the BERLIN row because the affiliation contains the word 'AGRICULTURE'.
Is there a simple way to do that, ideally using dplyr?
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN AGRI TURKEY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
A combination of str_extract and either a join or another str_extract is one option to get you there.
str_extract will get the first value it encounters; paste0 collapses the cities into one long "or" (alternation) pattern to check against.
library(dplyr)
library(stringr)
dfa %>%
mutate(city = str_extract(affiliation, paste0("\\b", dfb$city, "\\b", collapse = "|"))) %>%
left_join(dfb, by = "city")
Edit: added word boundaries around each city in the paste0 so that only whole city names are matched and partial matching is avoided.
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN BERLIN GERMANY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
This approach accounts for the possibility that an affiliation could match more than one city name.
library(tidyverse)
dfa %>%
mutate(city = map(affiliation, ~ str_extract(.x, dfb$city))) %>%
unnest(cols = c(city)) %>%
group_by(affiliation) %>%
mutate(nmatches = sum(!is.na(city))) %>%
filter((nmatches > 0 & !is.na(city)) | (nmatches == 0 & row_number() == 1)) %>%
ungroup() %>%
left_join(dfb, by = "city") %>%
mutate(country_match = str_detect(affiliation, country))
# A tibble: 12 x 5
affiliation city nmatches country country_match
<chr> <chr> <int> <chr> <lgl>
1 DEPARTMENT OF PHARMACY,… AMSTE… 1 NETHER… TRUE
2 DEPARTMENT OF BIOCHEMIS… NA 0 NA NA
3 DEPARTMENT OF PATHOLOGY… NA 0 NA NA
4 COLLEGE OF EDUCATION FO… BAGHD… 1 IRAQ FALSE
5 DEPARTMENT OF CLINICAL … BEIJI… 1 CHINA TRUE
6 LABORATORY OF MOLECULAR… NA 0 NA NA
7 BERLIN INSTITUTE OF HEA… BERLIN 1 GERMANY FALSE
8 INSTITUTE OF LABORATORY… NA 0 NA NA
9 DEPARTMENT OF CLINICAL … BANGK… 1 THAILA… TRUE
10 DEPARTMENT OF BIOLOGY, … NA 0 NA NA
11 DEPARTMENT OF MOLECULAR… NA 0 NA NA
12 LABORATORY OF CARDIOVAS… BEIJI… 1 CHINA TRUE
You could then double-check cases with nmatches == 1 but country_match == FALSE, and when there are 2 or more matches you can keep the one with country_match == TRUE.
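A sketch of that follow-up step (assuming the pipeline result above is stored in a dataframe called matched, a name not used in the original answer):
matched %>%
  group_by(affiliation) %>%
  # keep all zero/single-match rows; for multi-match affiliations keep only
  # the candidate whose country also appears in the affiliation text
  filter(nmatches < 2 | country_match %in% TRUE) %>%
  ungroup()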
My data is currently organized as shown in the first table below. I am only showing a portion of the overall data, as the full table is quite large (over 100 rows).
Row September October November December January February March April May June July
1 Chino Hills Huntington Bea~ Fountain Valley Anaheim Fountain Vall~ Arcadia Anaheim Newport Be~ Santa Ana NA NA
2 Irvine Cerritos Long Beach Chino Hills Cerritos Anaheim NA Banning Newport Beach Anaheim NA
3 Glendale NA West Covina Monterey Park Encino NA Monterey Pa~ NA Los Angeles Cerritos Beverly Hi~
4 Norco Fountain Valley NA Monterey Park NA Long Beach NA Santa Ana Huntington Be~ Fountain Val~ NA
5 Los Angeles Inglewood West Covina Glendale NA Glendale NA Granada Hi~ Chino West Covina Tarzana
I want to change the way it is organized so that it shows the following. I want to emphasize that it would show all of the cities, not just the ones I have chosen to list. This is an incomplete diagram, but it gets the idea across:
+-------------+------------------+--------+----------+
| Chino Hills | Huntington Beach | Irvine | Glendale |
+-------------+------------------+--------+----------+
| Row 1 | Row 1 | Row 2 | Row 3 |
| Row 2 | | | Row 5 |
| | | | Row 5 |
+-------------+------------------+--------+----------+
I have tried tidyr::separate_rows(dfl, col), but this only works if the cities are in one cell; however, they are in multiple cells in multiple rows. This is what happens when I try the tidyr::separate_rows(dfl, col):
Row September October November December January February March April May June July
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 Chino Hills Huntington Bea~ Fountain Valley Anaheim Fountain Vall~ Arcadia Anaheim Newport Be~ Santa Ana NA NA
2 2 Irvine Cerritos Long Beach Chino Hills Cerritos Anaheim NA Banning Newport Beach Anaheim NA
3 3 Glendale NA West Covina Monterey Park Encino NA Monterey Pa~ NA Los Angeles Cerritos Beverly Hi~
4 4 Norco Fountain Valley NA Monterey Park NA Long Beach NA Santa Ana Huntington Be~ Fountain Val~ NA
5 5 Los Angeles Inglewood West Covina Glendale NA Glendale NA Granada Hi~ Chino West Covina Tarzana
As you can see, the only thing it does is add another row of numbers, which I do not need.
In summary, I need R to find all of the cities and tell me which row(s) each one appears in. A row may appear more than once if a city occurs in that row more than once. It needs to work across more than one column, not just the single column that tidyr normally handles, and the number of output columns will depend on the number of different cities.
We can get the data into long format, keep only the unique Row/value pairs, and then reshape back to wide format. Assuming df is the dataframe name:
library(dplyr)
library(tidyr)
df %>%
  # reshape to long format: one row per (Row, month, city), dropping NAs
  pivot_longer(cols = -Row, values_drop_na = TRUE) %>%
  # keep each Row/city pair only once
  distinct(Row, value) %>%
  # number the occurrences of each city to use as the id when widening
  group_by(value) %>%
  mutate(row = row_number()) %>%
  # one column per city, filled with the Row numbers it appears in
  pivot_wider(names_from = value, values_from = Row)
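To make the shape of the result concrete, here is a tiny toy input (invented data, not the asker's) run through the same pipeline:
library(tibble)
toy <- tibble(Row = 1:3,
              September = c("Chino Hills", "Irvine", "Glendale"),
              October   = c("Glendale", NA, "Irvine"))
toy %>%
  pivot_longer(cols = -Row, values_drop_na = TRUE) %>%
  distinct(Row, value) %>%
  group_by(value) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = value, values_from = Row)
Each city becomes its own column holding the Row numbers it appears in (plus the helper row index used as the id), padded with NA where a city occurs fewer times than the longest column.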
I would really appreciate your help. I have a large vector that contains 2000 character strings of different lengths, which I retrieved from Web of Science. My dataset can be downloaded here.
Data structure and Outcome.
Each row of this vector has a different length but the same pattern. The characters within the "[]" determine the number of rows and the characters outside determine the columns. Here is an example with these three rows:
[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy
[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands
[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium
The first row has 2 groups in "[]" both with 5 columns each; the second row has 2 groups, one with 3 columns and the second with 4; the third row has 3 groups, with 4, 4 and 5 columns each.
The outcome will be a matrix like this:
ID Author Info01 Info02 Info03 Info04 Info05
1 Sorce, A Univ Genoa Polytech Sch Thermochem Power Grp TPG DIME I-16145 Genoa Italy
1 Greco, A. Univ Genoa Polytech Sch Thermochem Power Grp TPG DIME I-16145 Genoa Italy
1 Magistri, L. Univ Genoa Polytech Sch Thermochem Power Grp TPG DIME I-16145 Genoa Italy
1 Costamagna, P. Univ Genoa Polytech Sch Thermochem Power Grp TPG DICCA I-16145 Genoa Italy
2 Allema Wageningen Univ NL-6700 AP Wageningen Netherlands N/A N/A
2 Bas; Hemerik Wageningen Univ NL-6700 AP Wageningen Netherlands N/A N/A
2 Lia; Rossing Wageningen Univ NL-6700 AP Wageningen Netherlands N/A N/A
2 Walter A. H. Wageningen Univ NL-6700 AP Wageningen Netherlands N/A N/A
2 Allema, Bas Wageningen Univ Entomol Lab NL-6700 AP Wageningen Netherlands N/A
2 van Lenteren, Joop C. Wageningen Univ Entomol Lab NL-6700 AP Wageningen Netherlands N/A
2 van der Werf, Wopke Wageningen Univ Ctr Crop Syst Anal Crop & Weed Ecol Grp NL-6700 AP Wageningen Netherlands
3 Abdissa, Ketema Jimma Univ Dept Med Lab Sci & Pathol Jimma Ethiopia N/A
3 Tadesse, Mulualem Jimma Univ Dept Med Lab Sci & Pathol Jimma Ethiopia N/A
3 Bezabih, Mesele Jimma Univ Dept Med Lab Sci & Pathol Jimma Ethiopia N/A
3 Bekele, Alemayehu Jimma Univ Dept Med Lab Sci & Pathol Jimma Ethiopia N/A
3 Abebe, Gemeda Jimma Univ Dept Med Lab Sci & Pathol Jimma Ethiopia N/A
3 Apers, Ludwig Inst Trop Med Dept Clin Sci B-2000 Antwerp Belgium N/A
3 Rigouts, Leen Inst Trop Med Dept Microbiol Mycobacteriol Unit B-2000 Antwerp Belgium
My Approach
Separate the strings and convert the vector into a list using this command:
CL1 <- str_split(CL, "\\[|\\]", n= Inf)
This generates a list of vectors with characters like this:
[[1999]]
[1] ""
[2] "Zhuo, Hongying; Li, Qingzhong; Li, Wenzuo; Cheng, Jianbo"
[3] " Yantai Univ, Sch Chem & Chem Engn, Lab Theoret & Computat Chem, Yantai 264005, Peoples R China"
[[2000]]
[1] ""
[2] "Zuo, Li; Meng, Qing-Hong; Chung, Peter Chee-Keung"
[3] " Guiyang Med Coll, Dept Immunol, Guiyang 550004, Guizhou Provinc, Peoples R China; "
[4] "Yuan, Kai-Tao"
[5] " Sun Yat Sen Univ, Affiliated Hosp 1, Dept Surg, Guangzhou 510080, Guangdong, Peoples R China; "
[6] "Yu, Li"
[7] " Guangzhou First Municipal Peoples Hosp, Dept Paediat, Guangzhou 510180, Guangdong, Peoples R China; "
[8] "Yang, Ding-Hua"
[9] " Southern Med Univ, Nan Fang Hosp, Dept Hepatobiliary Surg, Guangzhou 510515, Guangdong, Peoples R China"
As you can see, the first element of each vector in the list is blank. Each "even" element of the vectors contains the "groups" and each "odd" element contains the columns of that group.
The next step is to separate the groups to assemble a matrix; for this I'm using these two commands:
CL2 <- lapply(CL1,function(x)x[2])
AF1 <- lapply(CL1,function(x)x[3])
Since in some cases I have more than 50 groups in the same row, I basically have to repeat this process in a loop, but I don't know how; right now I'm doing it manually. Another problem is that I don't know how to create an ID or how to merge the lists into a matrix.
Any ideas or suggestions will be welcome.
The following should do what you want to achieve:
A <- read.csv("AU.csv", stringsAsFactors = FALSE)
## One vector with all of the data in square brackets
A1 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]))
LA1 <- lengths(A1)
A1 <- gsub("\\[|\\]", "", unlist(A1))
## One vector with all of the other data
A2 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]), invert = TRUE)
LA2 <- lengths(A2) - 1
A2 <- unlist(lapply(A2, function(x) gsub("^\\s+|\\s+$|;\\s+$", "", x[-1])))
## Checking for mistakes....
all.equal(LA1, LA2)
# [1] TRUE
all.equal(sum(LA1), length(A1))
# [1] TRUE
Now that we have the vectors, we can use cSplit from my "splitstackshape" package to get the output you want:
library(splitstackshape)
library(magrittr)
## Make a data.table of the two vectors and the ID column
DT <- data.table(ID = rep(A[[1]], LA1), A1, A2)
## Here's the splitting....
final <- DT %>%
cSplit("A1", ";", "long") %>% ## The first column is split and made long
cSplit("A2", ",") ## The second column is split and made wide
Here's the result:
final
# ID A1 A2_01 A2_02
# 1: 1 Aalten, Pauline Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 2: 1 Ramakers, Inez H. G. B. Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 3: 1 Rozendaal, Nico Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 4: 1 Verhey, Frans R. J. Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 5: 1 Biessels, Geert Jan Univ Med Ctr Utrecht Dept Neurol
# ---
# 13949: 2000 Meng, Qing-Hong Guiyang Med Coll Dept Immunol
# 13950: 2000 Chung, Peter Chee-Keung Guiyang Med Coll Dept Immunol
# 13951: 2000 Yuan, Kai-Tao Sun Yat Sen Univ Affiliated Hosp 1
# 13952: 2000 Yu, Li Guangzhou First Municipal Peoples Hosp Dept Paediat
# 13953: 2000 Yang, Ding-Hua Southern Med Univ Nan Fang Hosp
# A2_03 A2_04 A2_05 A2_06 A2_07 A2_08 A2_09 A2_10
# 1: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 2: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 3: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 4: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 5: Utrecht Netherlands NA NA NA NA NA NA
# ---
# 13949: Guiyang 550004 Guizhou Provinc Peoples R China NA NA NA NA NA
# 13950: Guiyang 550004 Guizhou Provinc Peoples R China NA NA NA NA NA
# 13951: Dept Surg Guangzhou 510080 Guangdong Peoples R China NA NA NA NA
# 13952: Guangzhou 510180 Guangdong Peoples R China NA NA NA NA NA
# 13953: Dept Hepatobiliary Surg Guangzhou 510515 Guangdong Peoples R China NA NA NA NA
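If you want the columns to match the ID / Author / InfoXX layout from the desired output, a possible follow-up (a sketch; setnames() comes from data.table, which "splitstackshape" attaches, and `final` is the data.table produced above):
setnames(final, c("ID", "Author",
                  paste0("Info", sprintf("%02d", seq_len(ncol(final) - 2)))))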
You can do various manipulations with regular expressions and use plyr and foreach functions to process everything. Here is an example with the first row:
library(foreach)
library(plyr)
str1 = '[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy'
##split the string into different parts
s1 = strsplit(str1,'; \\[')
s1. = llply(s1,strsplit,split = ']')[[1]]
##get list of authors
auths = llply(s1.,function(x) gsub('^ ','',strsplit(gsub('\\[','',x[1]),';')[[1]]))
##get all other attributes
other.stuff = llply(s1.,function(x) gsub('^ ','',strsplit(x[2],',')[[1]]))
results = foreach(auth = auths, other = other.stuff, .combine = 'rbind') %do%
expand.grid(auth,other[1],other[2],other[3],other[4],other[5])
The output's column names need to be changed, and you need to iterate this for each line, but that should be easy.
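One possible way to apply the same steps to every row and bind the results with an ID column (a sketch; CL is assumed to be the full vector of affiliation strings from the question, and the ID is just the row index):
results.all <- foreach(i = seq_along(CL), .combine = 'rbind') %do% {
  ## split the i-th string into its [authors] / affiliation parts
  s1  <- strsplit(CL[i], '; \\[')
  s1. <- llply(s1, strsplit, split = ']')[[1]]
  ## authors inside the brackets
  auths <- llply(s1., function(x) gsub('^ ', '', strsplit(gsub('\\[', '', x[1]), ';')[[1]]))
  ## everything after the brackets, split on commas
  other.stuff <- llply(s1., function(x) gsub('^ ', '', strsplit(x[2], ',')[[1]]))
  res <- foreach(auth = auths, other = other.stuff, .combine = 'rbind') %do%
    expand.grid(auth, other[1], other[2], other[3], other[4], other[5])
  cbind(ID = i, res)
}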