I have and excel with data for 2 DBs and multiple Measures (Msr) for each. There is classic ratio data Num/Denom=Ratio for each. Can anybody suggest what visualization I can use in R to graphically find big differences (let say 10%+) for each of measure between Test and X1 databases and then for each Measure.
So we compare Denom, Num, Rate between line 1 and 2.
..and then 3,4
..and then 5,6 etc
Tried to do in Excel but read that R could be much better for this purposes. But for now I can see most paired viz works for scattered display. I need something more traditional e.g. in my sample we can mark X1.SRB.Rare as low
In my example I have 3 measures, in reality could be 30. Thanks much for info.
M
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
df
db msr denom num rate
1 test BCS 11848 5255 44.35
2 x1 BCS 11049 6376 57.71
3 test CCS 35836 16908 47.18
4 x1 CCS 38458 18124 47.13
5 test SRB 54160 26253 48.47
6 x1 SRB 56387 15000 26.60
If I understood correctly, this should do what you want. I reshaped the data so you have one row per msr with separate columns for each db. I used data.table for it's performance.
library(data.table)
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
#set as a data.table
setDT(df)
#cast into one row per MSR - fill in with the "rate" variable
out <- dcast(msr ~ db, data = df, value.var = "rate")
#Compute difference
out[, test_x1_diff := test - x1]
#filter out diff >= 10
out[abs(test_x1_diff) >= 10]
#> msr test x1 test_x1_diff
#> 1: BCS 44.35 57.71 -13.36
#> 2: SRB 48.47 26.60 21.87
Created on 2019-01-11 by the reprex package (v0.2.1)
With some effort and help from the stackers, I have been able to parse a webpage and save it as a dataframe. I want to repeat the same operation on multiple xml files and rbind the list. Here is what I tried and did successfully:
library(XML)
xml.url <- "http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml"
doc <- xmlParse(xml.url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
Above code works well, now when I try to apply a function to do the same for multiple xml files :
ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
xml_url_test = as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml",
ERS_ID))
XML_parser <- function(XML_url){
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
return(x_t)
}
major_test <- sapply(xml_url_test, XML_parser)
It works, but gives me a long list that is not in the right data frame format as I generated for the single XML file.
Finally I would like to also add a column to the final dataframe that has the ERS number from the ERS_ID vector
Something like x_t$ERSid <- ERS_ID in the function
Can someone point out what am I missing in the function as well as any better ways to do the task?
Thanks!
Your main issue is using sapply over lapply() where the latter returns a list and former attempts to simplify to a vector or matrix, here being a matrix.
major_test <- lapply(xml_url_test, XML_parser)
Of course, sapply is a wrapper for lapply and can also return a list: sapply(..., simplify=FALSE):
major_test <- sapply(xml_url_test, XML_parser, simplify=FALSE)
However, a few other items came up:
At beginning, you are not concatenating your ERS_ID to the url stem with sprintf's %s operator. So right now, the same urls are repeating.
At end, you are not binding your list of data frames into a compiled final single dataframe.
Add new ERS column inside your defined function, passing in ERS_ID vector. And while creating column, also remove the ERS prefix with gsub.
R code (adjusted)
XML_parser <- function(eid) {
XML_url <- as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", eid))
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
x_t$ERSid <- gsub("ERS", "", eid) # ADD COL, REMOVE ERS
x_t <- x_t[,c(ncol(x_t),2:ncol(x_t)-1)] # MOVE NEW COL TO FIRST
return(x_t)
}
major_test <- lapply(ERS_ID, XML_parser)
# major_test <- sapply(ERS_ID, XML_parser, simplify=FALSE)
# BIND DATA FRAMES TOGETHER
finaldf <- do.call(rbind, major_test)
# RESET ROW NAMES
row.names(finaldf) <- seq(nrow(finaldf))
Using xml2 and the tidyverse you can do something like this:
require(xml2)
require(purrr)
require(tidyr)
urls <- rep("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml", 2)
identifier <- LETTERS[seq_along(urls)] # Take a unique identifier per url here
parse_attribute <- function(x){
out <- data.frame(tag = xml_text(xml_find_all(x, "./TAG")),
value = xml_text(xml_find_all(x, "./VALUE")), stringsAsFactors = FALSE)
spread(out, tag, value)
}
doc <- map(urls, read_xml)
out <- doc %>%
map(xml_find_all, "//SAMPLE_ATTRIBUTE") %>%
set_names(identifier) %>%
map_df(parse_attribute, .id="url")
Which gives you a 2x36 data.frame. To parse the column type i would suggest using readr::type_convert(out)
Out looks as follows:
url age body product body site body-mass index chimera check collection date
1 A 28 mucosa Sigmoid colon 16.95502 ChimeraSlayer; Usearch 4.1 database 2009-03-16
2 B 28 mucosa Sigmoid colon 16.95502 ChimeraSlayer; Usearch 4.1 database 2009-03-16
disease status ENA-BASE-COUNT ENA-CHECKLIST ENA-FIRST-PUBLIC ENA-LAST-UPDATE ENA-SPOT-COUNT
1 remission 627051 ERC000015 2014-12-31 2016-10-21 1668
2 remission 627051 ERC000015 2014-12-31 2016-10-21 1668
environment (biome) environment (feature) environment (material) experimental factor
1 organism-associated habitat organism-associated habitat mucus microbiome
2 organism-associated habitat organism-associated habitat mucus microbiome
gastrointestinal tract disorder geographic location (country and/or sea,region) geographic location (latitude)
1 Ulcerative Colitis India 72.82807
2 Ulcerative Colitis India 72.82807
geographic location (longitude) host subject id human gut environmental package investigation type
1 18.94084 1 human-gut metagenome
2 18.94084 1 human-gut metagenome
medication multiplex identifiers pcr primers phenotype project name
1 ASA;Steroids;Probiotics;Antibiotics TGATACGTCT 27F-338R pathological BMRP
2 ASA;Steroids;Probiotics;Antibiotics TGATACGTCT 27F-338R pathological BMRP
sample collection device or method sequence quality check sequencing method sequencing template sex target gene
1 biopsy software pyrosequencing DNA male 16S rRNA
2 biopsy software pyrosequencing DNA male 16S rRNA
target subfragment
1 V1V2
2 V1V2
purrr is really helpful here, as you can iterate over a vector of URLs or a list of XML files with map, or within nested elements with at_depth, and simplify the results with the *_df forms and flatten.
library(tidyverse)
library(xml2)
# be kind, don't call this more times than you need to
x <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762") %>%
sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", .) %>%
map(read_xml) # read each URL into a list item
df <- x %>% map(xml_find_all, '//SAMPLE_ATTRIBUTE') %>% # for each item select nodes
at_depth(2, as_list) %>% # convert each (nested) attribute to list
map_df(map_df, flatten) # flatten items, collect pages to df, then all to one df
df
## # A tibble: 175 × 3
## TAG VALUE UNITS
## <chr> <chr> <chr>
## 1 investigation type metagenome <NA>
## 2 project name BMRP <NA>
## 3 experimental factor microbiome <NA>
## 4 target gene 16S rRNA <NA>
## 5 target subfragment V1V2 <NA>
## 6 pcr primers 27F-338R <NA>
## 7 multiplex identifiers TGATACGTCT <NA>
## 8 sequencing method pyrosequencing <NA>
## 9 sequence quality check software <NA>
## 10 chimera check ChimeraSlayer; Usearch 4.1 database <NA>
## # ... with 165 more rows
You can retrieve multiple IDs with a single REST url using a comma-separated list or range like ERS445758-ERS445762 and avoid multiple queries to the ENA.
This code gets all 5 samples into a node set and then applies functions using a leading dot in the xpath string so its relative to that node.
ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
url <- paste0( "http://www.ebi.ac.uk/ena/data/view/", paste(ERS_ID, collapse=","), "&display=xml")
doc <- xmlParse(url)
samples <- getNodeSet( doc, "//SAMPLE")
## check the first node
samples[[1]]
## get the sample attribute node set and apply xmlToDataFrame to that
x <- lapply( lapply(samples, getNodeSet, ".//SAMPLE_ATTRIBUTE"), xmlToDataFrame)
# labels for bind_rows
names(x) <- sapply(samples, xpathSApply, ".//PRIMARY_ID", xmlValue)
library(dplyr)
y <- bind_rows(x, .id="sample")
z <- subset(y, TAG %in% c("age","sex","body site","body-mass index") , 1:3)
sample TAG VALUE
15 ERS445758 age 28
16 ERS445758 sex male
17 ERS445758 body site Sigmoid colon
19 ERS445758 body-mass index 16.9550173
50 ERS445759 age 58
51 ERS445759 sex male
...
library(tidyr)
z %>% spread( TAG, VALUE)
sample age body site body-mass index sex
1 ERS445758 28 Sigmoid colon 16.9550173 male
2 ERS445759 58 Sigmoid colon 23.22543185 male
3 ERS445760 26 Sigmoid colon 20.76124567 female
4 ERS445761 30 Sigmoid colon 0 male
5 ERS445762 36 Sigmoid colon 0 male
I have two datasets A and B with 8 coloumns each. Dataset A has 942 rows and Dataset B has 5079 rows. I have to compare Dataset A and Dataset B and do fuzzy matching. If there is any row is matched in Dataset B I have to mark "Matched" in dataset A in additional column.
I'm relatively new to R and not sure how to optimize r code with lapply, mapply or sapply instead of forloop.
Following is my code
##############################
# Install Necessary Packages #
##############################
#install.packages("openxlsx")
#install.packages("stringdist")
#install.packages("XLConnect")
##############################
# Load Packages #
##############################
library(openxlsx)
library(stringdist)
library(XLConnect)
cmd_newleads <- read.xlsx("Src/CMD - New Leads to Load.xlsx", sheet = "Top Leads Full Data", startRow = 1, colNames = TRUE)
cmd_newleads[c("Lead_Match","Opportunity_Match")] <- ""
c4c_leads <- read.xlsx("Src/C4C - Leads.xlsx", sheet = "Leads", startRow = 1, colNames = TRUE)
#c4c_opportunities <- read.xlsx("Src/C4C - Opportunities Data 6-24-16.xlsx", sheet = "Export 06-24-2016 04.55.46 PM", startRow = 1, colNames = TRUE)
cmd_newleads_selcols <- cmd_newleads[,c("project_name","project_address","project_city","project_state_province_region_code","project_postalcode","project_country","project_sector","project_type")]
cmd_newleads_selcols[is.na(cmd_newleads_selcols)] <- ""
#rownames(cmd_newleads_selcols)
c4cleads_selcols <- c4c_leads[,c("Lead","Address1.(Lead)","City.(Lead)","Region.(Lead)","Postal.Code.(Lead)","Country.(Lead)","Sector.(Lead)","Type.(Lead)")]
c4cleads_selcols[is.na(c4cleads_selcols)] <- ""
#cmd_c4copportunities_selcols <- c4c_opportunities[,c("project_name","project_address","project_city","project_state_province_region_code","project_postalcode","project_country","project_sector","project_type")]
rcount_cmdnewleads <- nrow(cmd_newleads)
rcount_c4cleads <- nrow(c4c_leads)
#rcount_c4copportunities <- nrow(c4c_opportunities)
for(i in 1:rcount_cmdnewleads)
{
cmd_project_name <- cmd_newleads_selcols[i,1]
cmd_project_address <- cmd_newleads_selcols[i,2]
cmd_project_city <- cmd_newleads_selcols[i,3]
cmd_project_region_code <- cmd_newleads_selcols[i,4]
cmd_project_postalcode <- cmd_newleads_selcols[i,5]
cmd_project_country <- cmd_newleads_selcols[i,6]
cmd_project_sector <- cmd_newleads_selcols[i,7]
cmd_project_type <- cmd_newleads_selcols[i,8]
for(j in 1:rcount_c4cleads)
{
c4cleads_project_name <- c4cleads_selcols[j,1]
c4cleads_project_address <- c4cleads_selcols[j,2]
c4cleads_project_city <- c4cleads_selcols[j,3]
c4cleads_project_region_code <- c4cleads_selcols[j,4]
c4cleads_project_postalcode <- c4cleads_selcols[j,5]
c4cleads_project_country <- c4cleads_selcols[j,6]
c4cleads_project_sector <- c4cleads_selcols[j,7]
c4cleads_project_type <- c4cleads_selcols[j,8]
project_percent <- stringsim(cmd_project_name,c4cleads_project_name, method="dl", p=0.1)
address_percent <- stringsim(cmd_project_address,c4cleads_project_address, method="dl", p=0.1)
city_percent <- stringsim(cmd_project_city,c4cleads_project_city, method="dl", p=0.1)
region_percent <- stringsim(cmd_project_region_code,c4cleads_project_region_code, method="dl", p=0.1)
postalcode_percent <- stringsim(cmd_project_postalcode,c4cleads_project_postalcode, method="dl", p=0.1)
country_percent <- stringsim(cmd_project_country,c4cleads_project_country, method="dl", p=0.1)
sector_percent <- stringsim(cmd_project_sector,c4cleads_project_sector, method="dl", p=0.1)
type_percent <- stringsim(cmd_project_type,c4cleads_project_type, method="dl", p=0.1)
if(project_percent > 0.833 && address_percent > 0.833 && city_percent > 0.833 && region_percent > 0.833 && postalcode_percent > 0.833 && country_percent > 0.833 && sector_percent > 0.833 && type_percent > 0.833)
{
cmd_newleads[i,51] <- c4cleads[j,c4cleads$Lead.ID]
}
else
{
cmd_newleads[i,51] <- "New Lead"
}
}
}
Sample data for cmd_newleads_selcols and c4cleads_selcols respectively
project_name project_address project_city
1 Wynn Mystic Casino & Hotel 22 Chemical Ln Everett
2 Northpoint Complex Development East Street Cambridge
3 Northpoint Complex Development East Street Cambridge
4 Northpoint Complex Development East Street Cambridge
5 Northpoint Complex Development East Street Cambridge
6 Northpoint Complex Development East Street Cambridge
project_state_province_region_code project_postalcode
1 MA 02149
2 MA 02138
3 MA 02138
4 MA 02138
5 MA 02138
6 MA 02138
project_country project_sector project_type
1 United States of America Hospitality New Building
2 United States of America Apartments New Building
3 United States of America Apartments New Building
4 United States of America Apartments New Building
5 United States of America Apartments New Building
6 United States of America Apartments New Building
Lead Address1.(Lead) City.(Lead) Region.(Lead) Postal.Code.(Lead) Country.(Lead)
1 1 Hotel Brooklyn Bridge Park Old Fulton St & Furman St Brooklyn New York 11201 United States
2 10 Trinity Square Hotel 10 Trinity Square London # EC3P United Kingdom
3 100 Stewart 1900 1st Avenue Seattle Washington 98101 United States
4 1136 S Wabash # # # # Not assigned
5 115-129 37th Street 115-129 37th Street Union CIty New Jersey # United States
6 1418 W Addison 1418 w Addison Chicago # 60613 Not assigned
Sector.(Lead) Type.(Lead)
1 Hospitality New Building
2 Hospitality Brand Conversion
3 Hospitality New Building
4 High Rise Residential New Building
5 Developer New Building
6 High Rise Residential New Building
If you are experiencing efficiency problems, it's not because you are using a for loop. The main issue is that you are doing a lot of work for every possible combination of rows in your two data sets. Using more efficient language features might speed things up a bit, but it wouldn't change the fact that you're doing a lot of unnecessary computation.
One of the best ways to increase efficiency in data matching problems is to rule out obvious non-matches to cut down on unnecessary computations. For example, you could change your inner loop to first check some key condition; if the score is low (i.e. it's obviously a non-match) you don't need to compute similarity scores for the rest of the attributes.
For example:
for(i in 1:rcount_cmdnewleads)
{
cmd_project_name <- cmd_newleads_selcols[i,1]
...
for(j in 1:rcount_c4cleads)
{
c4cleads_project_name <- c4cleads_selcols[j,1]
project_percent <- stringsim(cmd_project_name,c4cleads_project_name, method="dl", p=0.1)
if (project_percent < .83) {
# you already know that this is a non-match, so go to the next one
next
} else {
# check the rest of the values!
...
}
}
}
I'm not familiar with the R RecordLinkage package, but the Python recordlinkage package has tools for ruling out obvious non-matches early in the process to increase efficiency. Consider checking out this tutorial to learn more about speeding up record linkage by ruling out obvious non matches.
You might want to look at the package RecordLinkage, which allows you to perform phonetic matching, probabilistic record linkage and machine learning approaches.
I am trying to perform some scientometrics analysis from a Scopus csv file. The first column of the imported csv is like:
Authors,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,Page count,Cited by,DOI,Link,Document Type,Source,EID
"Kuck, L.S., Noreña, C.P.Z.","Microencapsulation of grape (Vitis labrusca var. Bordo) skin phenolic extract using gum Arabic, polydextrose, and partially hydrolyzed guar gum as encapsulating agents",2016,"Food Chemistry","194",,,"569","576",,,10.1016/j.foodchem.2015.08.066,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940212199&partnerID=40&md5=e4c36e03156570a7fe31c2937b3a170d",Article,Scopus,2-s2.0-84940212199
"Grasel, F.D.S., Ferrão, M.F., Wolf, C.R.","Development of methodology for identification the nature of the polyphenolic extracts by FTIR associated with multivariate analysis",2016,"Spectrochimica Acta - Part A: Molecular and Biomolecular Spectroscopy","153",,,"94","101",,,10.1016/j.saa.2015.08.020,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939865445&partnerID=40&md5=8239487f4eea9479d698792e6aa348de",Article,Scopus,2-s2.0-84939865445
"De Souza, D., Sbardelotto, A.F., Ziegler, D.R., Marczak, L.D.F., Tessaro, I.C.","Characterization of rice starch and protein obtained by a fast alkaline extraction method",2016,"Food Chemistry","191",, 17279,"36","44",,,10.1016/j.foodchem.2015.03.032,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84938952690&partnerID=40&md5=989cbfcc72286a87f726925732db4b49",Article,Scopus,2-s2.0-84938952690
"Filho, P.R.M., Vercelino, R., Cioato, S.G., Medeiros, L.F., de Oliveira, C., Scarabelot, V.L., Souza, A., Rozisky, J.R., Quevedo, A.S., Adachi, L.N.S., Sanches, P.R.S., Fregni, F., Caumo, W., Torres, I.L.S.","Transcranial direct current stimulation (tDCS) reverts behavioral alterations and brainstem BDNF level increase induced by neuropathic pain model: Long-lasting effect",2016,"Progress in Neuro-Psychopharmacology and Biological Psychiatry","64",,,"44","51",,,10.1016/j.pnpbp.2015.06.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84937468588&partnerID=40&md5=b03f0ccfbf66a49a438c9983cc2e8f9d",Article,Scopus,2-s2.0-84937468588
"Duarte, Á.T., Borges, A.R., Zmozinski, A.V., Dessuy, M.B., Welz, B., De Andrade, J.B., Vale, M.G.R.","Determination of lead in biomass and products of the pyrolysis process by direct solid or liquid sample analysis using HR-CS GF AAS",2016,"Talanta","146",,,"166","174",,,10.1016/j.talanta.2015.08.041,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940416990&partnerID=40&md5=55d7ddad27e955b9b6e269469e29c8c3",Article,Scopus,2-s2.0-84940416990
"Francischini, H., Paes Neto, V.D., Martinelli, A.G., Pereira, V.P., Marinho, T.S., Teixeira, V.P.A., Ferraz, M.L.F., Soares, M.B., Schultz, C.L.","Invertebrate traces in pseudo-coprolites from the upper Cretaceous Marília Formation (Bauru Group), Minas Gerais State, Brazil",2016,"Cretaceous Research","57",,,"29","39",,,10.1016/j.cretres.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939175950&partnerID=40&md5=b049de15a08ba477cc189d7e8fe7f0a3",Article,Scopus,2-s2.0-84939175950
"Bonfatti, B.R., Hartemink, A.E., Giasson, E., Tornquist, C.G., Adhikari, K.","Digital mapping of soil carbon in a viticultural region of Southern Brazil",2016,"Geoderma","261",,,"204","221",,,10.1016/j.geoderma.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939499978&partnerID=40&md5=b470166e01648dcbe8f0d43be86c84e0",Article,Scopus,2-s2.0-84939499978
"Scaramuzza dos Santos, T.C., Holanda, E.C., de Souza, V., Guerra-Sommer, M., Manfroi, J., Uhl, D., Jasper, A.","Evidence of palaeo-wildfire from the upper Lower Cretaceous (Serra do Tucano Formation, Aptian-Albian) of Roraima (North Brazil)",2016,"Cretaceous Research","57",,,"46","49",,,10.1016/j.cretres.2015.08.003,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939615367&partnerID=40&md5=e59f5130c6a2e1863f9aa77c960e6462",Article,Scopus,2-s2.0-84939615367
"da Silva, S.W., Bortolozzi, J.P., Banús, E.D., Bernardes, A.M., Ulla, M.A.","TiO<inf>2</inf> thick films supported on stainless steel foams and their photoactivity in the nonylphenol ethoxylate mineralization",2016,"Chemical Engineering Journal","283",, 14049,"1264","1272",,,10.1016/j.cej.2015.08.057,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940747062&partnerID=40&md5=aebc7357f9dedaadeebabfeda4aa3dd9",Article,Scopus,2-s2.0-84940747062
"Dalmora, A.C., Ramos, C.G., Oliveira, M.L.S., Teixeira, E.C., Kautzmann, R.M., Taffarel, S.R., de Brum, I.A.S., Silva, L.F.O.","Chemical characterization, nano-particle mineralogy and particle size distribution of basalt dust wastes",2016,"Science of the Total Environment","539",, 18331,"560","565",,,10.1016/j.scitotenv.2015.08.141,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84941754626&partnerID=40&md5=1c4ca1a3486ff55f92f238083af3eb50",Article,Scopus,2-s2.0-84941754626
"Fink, J.R., Inda, A.V., Bavaresco, J., Barrón, V., Torrent, J., Bayer, C.","Adsorption and desorption of phosphorus in subtropical soils as affected by management system and mineralogy",2016,"Soil and Tillage Research","155",,,"62","68",,,10.1016/j.still.2015.07.017,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940195225&partnerID=40&md5=2e43a874f1e36f11aa5efa057ce660b9",Article,Scopus,2-s2.0-84940195225
"Martins, A.B., Santana, R.M.C.","Effect of carboxylic acids as compatibilizer agent on mechanical properties of thermoplastic starch and polypropylene blends",2016,"Carbohydrate Polymers","135",,,"79","85",,,10.1016/j.carbpol.2015.08.074,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940781718&partnerID=40&md5=426e62c6c0de33a91bdb2f75442fbd6f",Article,Scopus,2-s2.0-84940781718
In each line, there are a variable number of authors (up to more than 20). Until now, I am doing something like:
test <- read.csv("test.csv")
test$Authors <- as.character(test$Authors)
test2 <- strsplit(as.character(test$Authors), '.,', fixed=TRUE)
Which gives me a list correctly separating each author. I tested several alternatives proposed in the list to move the list to a data frame, but the closer one was:
test3 <- str_split_fixed(test$Authors, '.,', n = 20)
Which gave me two problems:
1) I have to define the number of columns, which I do not know before analyzing the data;
2) the authors are not properly separated, but surname and abbreviated names are in different columns. Additionally, the command removed some characters from the names.
Some of the strategies suggested elsewhere gave me the correct separation of authors in columns, but the empty columns were fulfilled by repeating the initial names. Sorry if the question is too naive, I am beginning in the use of R.
Any advises and or insights would be greatly appreciated!
Here's how I would do it.
Firstly, using read.csv is causing the split between Authors last name and initial, so I'm using readLines instead.
Secondly, having "wide data" like this is not in general a good idea. It makes data more difficult to work with in subsequent analyses. For that reason, I've made it "long" instead.
n1 <- readLines(con="test.csv")
n1 <- strsplit(n1, '., ', fixed=TRUE)
n1 <- do.call(rbind, lapply(1:length(n1), function(x){data.frame(aut = n1[[x]], pub = x, order = 1:length(n1[[x]]))}))
n1$aut <- gsub("\\.$", "", n1$aut)
Here's the output:
aut pub order
1 Kuck, L.S 1 1
2 Noreña, C.P.Z 1 2
3 Grasel, F.D.S 2 1
4 Ferrão, M.F 2 2
5 Wolf, C.R 2 3
6 Abreu, M.S 3 1
7 Giacomini, A.C.V 3 2
8 Gusso, D 3 3
9 Rosa, J.G.S 3 4
NB if you really want your data in "wide format", we can easily reshape it:
library(tidyr)
spread(n1, order, aut)
pub 1 2 3 4
1 1 Kuck, L.S Noreña, C.P.Z <NA> <NA>
2 2 Grasel, F.D.S Ferrão, M.F Wolf, C.R <NA>
3 3 Abreu, M.S Giacomini, A.C.V Gusso, D Rosa, J.G.S
EDIT: for your full version, you need to use read.csv:
input <- n1 <- read.csv("test.csv")
n1$Authors <- as.character(n1$Authors)
n1$Authors <- strsplit(n1$Authors, '., ', fixed=TRUE)
n1 <- do.call(rbind, lapply(1:length(n1$Authors), function(x){data.frame(aut = n1$Authors[[x]], pub = x, order = 1:length(n1$Authors[[x]]))}))
n1$aut <- gsub("\\.$", "", n1$aut)
If you want to go back to wide with all your stuff:
library(dplyr)
library(tidyr)
input <- mutate(input, row = row_number())
n1 %>% spread(order, aut) %>%
left_join(input, by = c("pub" = "row")) %>%
select(-Authors)