Splitting complex string between symbols R

Splitting complex string between symbols R - r

I have a dataset full of IDs and qualification strings. My issue with this is two fold;
How to deal with splits between different symbols and,
how to iterate output down a dataframe whilst retaining an ID.
ID <- c(1,2,3)
Qualstring <- c("LE:Science = 45 Distinctions",
"A:Chemistry = A A:Biology = A A:Mathematics = A",
"A:Biology = A A:Chemistry = A A:Mathematics = A B:Baccalaureate Advanced Diploma = Pass"
)
s <- data.frame(ID, Qualstring)
The desired output would be:
ID Qualification Subject Grade
1 1 LE: Science 45 Distinctions
2 2 A: Chemistry A
3 2 A: Biology A
4 2 A: Mathematics A
5 3 A: Biology A
6 3 A: Chemistry A
7 3 A: Mathematics A
8 3 WB: Welsh Baccalaureate Advanced Diploma Pass
The commonality of the splits is the ":" and "=", and the codes/words around those.
Looking at the problem from my perspective, it appears complex and whether a continued fudge in excel is ultimately the way to go for this structure of data. Would love to know otherwise if there are any recommendations or direction.

A solution using data.table and stringr. The use of data.table is just for my personal convenience, you could use data.frame with do.call(rbind,.) instead of rbindlist()
library(stringr)
qual <- str_extract_all(s$Qualstring,"[A-Z]+(?=\\:)")
subject <- str_extract_all(s$Qualstring,"(?<=\\:)[\\w ]+")
grade <- str_extract_all(s$Qualstring,"(?<=\\= )[A-z0-9]+")
library(data.table)
df <- lapply(seq(s$ID),function(i){
N = length(qual[[i]])
data.table(ID = rep(s[i,"ID"],N),
Qualification = qual[[i]],
Subject = subject[[i]],
Grade = grade[[i]]
)
}) %>% rbindlist()
ID Qualification Subject Grade
1: 1 LE Science 45
2: 2 A Chemistry A
3: 2 A Biology A
4: 2 A Mathematics A
5: 3 A Biology A
6: 3 A Chemistry A
7: 3 A Mathematics A
8: 3 B Baccalaureate Advanced Diploma Pass
In short, I use positive look behind (?<=) and positive look ahead (?=). [A-Z]+ is for a group of upper letters, [\\w ]+ for a group of words and spaces, [A-z0-9]+ for letters (up and low cases) and numbers. string_extract_all gives a list with all the match on each cell of the character vector tested.

Related

In R, left join two tables whose 2 potential keys contain missing data

Background:
I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d contains information like VIN_prefix (the first two letters of a VIN number that indicates the "World Manufacturer Number"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d:
d <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
stringsAsFactors=FALSE)
Here, rows 3, 4, and 5 correspond to Japanese cars: the NA in row 3 is actually an Acura whose make is missing. See below when I get to the other dataset about why this is.
d also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.
Enter another dataset, a, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d is made by a Japanese firm. One of those is VIN_prefix; the other is jp_makes, a list of Japanese auto firms. Here's a:
a <- data.frame(
VIN_prefix = c("JH","JF","1N"),
jp_makes = c("Acura","Subaru","Nissan"),
current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"),
stringsAsFactors=FALSE)
Here, we can see that the "Acura" make, missing in the car from row 3 in d, could be identified by its VIN_prefix "JH", which in row 3 of d is not NA.
Goal:
Left join a onto d so that each of the 3 Japanese cars in d gets the relevant corresponding attributes from a - mainly, current_ceo. (Non-Japanese cars in d would have NA for columns joined from a; this is fine.)
Problem:
As you can tell, the two relevant variables in d that could be used as keys in a join - make and VIN_prefix - have missing data in d. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes or on d$VIN_prefix == a$VIN_prefix, but they'd each be wrong due to the missing data in d.
What to do?
What I've tried:
I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d wind up with their correct information from a:
try1 <- left_join(d, a, by = c("make" = "jp_makes"))
try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))
I can successfully generate an logical 'indicator' variable in d that tells me whether a car is Japanese or not:
entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix
d<- d %>%
mutate(is_jp = ifelse(d$VIN_prefix %in% entries_vin_prefix | d$make %in% entries_make, 1, 0)
%>% as.logical())
But that only gets me halfway: I still need those other columns from a to sit next to those Japanese cars in d. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.
Ideally, I'd like a dataset that looks something like this:
ideal <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
current_ceo = c("NA", "NA", "Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida", "NA"),
stringsAsFactors=FALSE)
What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!

Left join on an OR of the two conditions.
library(sqldf)
sqldf("select d.*, a.current_ceo
from d
left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")
giving:
make model_yr VIN_prefix current_ceo
1 GMC 1999 1G <NA>
2 Dodge 2004 1D <NA>
3 NA 1989 JH Toshihiro Mibe
4 Subaru 1999 JF Tomomi Nakamura
5 Nissan 2006 NA Makoto Ushida
6 Chrysler 2012 2C <NA>

Use a two pass method. First fill in the missing make (or VIN values). I'll illustrate by filling in make valuesDo notice taht "NA" is not the same as NA. The first is a character value while the latter is a true R missing value, so I'd first convert those to a true missing value. In natural language I am replacing the missing values in d (note correction of df) with values of 'jp_makes' that are taken from a on the basis of matching VIN_prefix values:
is.na( d$make) <- df$make=="NA"
d$make[is.na(df$make)] <- a$jp_makes[
match( d$VIN_prefix[is.na(d$make)], a$VIN_prefix) ]
Now you have the make values filled in on the basis of the table look up in a. It should be trivial to do the match you wanted by using by.x='make', by.y='jp_make'
merge(d, a, by.x='make', by.y='jp_makes', all.x=TRUE)
make model_yr VIN_prefix.x VIN_prefix.y current_ceo
1 Acura 1989 JH JH Toshihiro Mibe
2 Chrysler 2012 2C <NA> <NA>
3 Dodge 2004 1D <NA> <NA>
4 GMC 1999 1G <NA> <NA>
5 Nissan 2006 NA 1N Makoto Ushida
6 Subaru 1999 JF JF Tomomi Nakamura
You can then use the values in VIN_prefix.y to replace the values the =="NA" in VIN_prefix.x.

R - Finding identical rows or rows that only differ by x columns

I'm trying to use R on a large CSV file that for this example can be said to represent a list of people and forms of transportation. If a person owns that mode of transportation, this is represented by a X in the corresponding cell. Example data of this is as per below:
Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,
The below image makes it easier to see what it represents:
What I'm after is to learn which persons have identical modes of transportation, or, ideally, where the modes of transportation differs by no more than one.
The format is a bit weird but, assuming the csv file is named example.csv, I can read it into a data frame and transpose it as per below (it should be fairly obvious that I'm a complete R noob)
ex <- read.csv('example.csv')
ext <- as.data.frame(t(ex))
This post explained how to find duplicates and it seems to work
duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1]
which(duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1])
This returns the following indexes:
1 2 4 5 6 7
That does indeed correspond with what I consider to be duplicate rows. That is, Peter has the same modes of transportation as Mary and Stan (indexes 2, 4 and 6); Don and Mike likewise share the same modes of transportation, indexes 5 and 7.
Again, that seems to work ok but if the modes of transportation and number of people are significant, it becomes really difficult finding/knowing not just which rows are duplicates, but which indexes actually matched. In this case that indexes 2, 4 and 6 are identical and that 5 and 7 are identical.
Is there an easy way of getting that information so that one doesn't have to try and find the matches manually?
Also, given all of the above, is it possible to alter the code in any way so that it would consider rows to match if there was only a difference in X positions (for example a difference of one is acceptable so as long as the persons in the above example have no more than one mode of transportation that is different, it's still considered a match)?
Happy to elaborate further and very grateful for any and all help.

library(dplyr)
library(tidyr)
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,", )
ext <- tidyr::pivot_longer(ex, -Type, names_to = "person")
# head(ext)
ext <- ext %>%
group_by(person) %>%
filter(value == "X") %>%
summarise(Modalities = n(), Which = paste(Type, collapse=", ")) %>%
arrange(desc(Modalities), Which) %>%
mutate(IdenticalGrp = rle(Which)$lengths %>% {rep(seq(length(.)), .)})
ext
#> # A tibble: 6 x 4
#> person Modalities Which IdenticalGrp
#> <chr> <int> <chr> <int>
#> 1 Paul 3 Scooter, Skateboard, Boat 1
#> 2 Don 2 Car, Skateboard 2
#> 3 Mike 2 Car, Skateboard 2
#> 4 Mary 2 Scooter, Skateboard 3
#> 5 Peter 2 Scooter, Skateboard 3
#> 6 Stan 2 Scooter, Skateboard 3
To get a membership list in any particular IndenticalGrp you can just pull like this.
ext %>% filter(IdenticalGrp == 3) %>% pull(person)
#> [1] "Mary" "Peter" "Stan"

Data Cleaning in R: remove test customer names

I am handling customer data that has customer first and last name. I want to clean the names of any random keystrokes. Test accounts are jumbled in the data-set and have junk names. For example in the below data I want to remove customers 2,5,9,10,12 etc. I would appreciate your help.
Customer Id FirstName LastName
1 MARY MEYER
2 GFRTYUIO UHBVYY
3 CHARLES BEAL
4 MARNI MONTANEZ
5 GDTDTTD DTTHDTHTHTHD
6 TIFFANY BAYLESS
7 CATHRYN JONES
8 TINA CUNNINGHAM
9 FGCYFCGCGFC FGCGFCHGHG
10 ADDHJSDLG DHGAHG
11 WALTER FINN
12 GFCTFCGCFGC CG GFCGFCGFCGF
13 ASDASDASD AASDASDASD
14 TYKTYKYTKTY YTKTYKTYK
15 HFHFHF HAVE
16 REBECCA CROSSWHITE
17 GHSGHG HGASGH
18 JESSICA TREMBLEY
19 GFRTYUIO UHBVYY
20 HUBHGBUHBUH YTVYVFYVYFFV
21 HEATHER WYRICK
22 JASON SPLICHAL
23 RUSTY OWENS
24 DUSTIN WILLIAMS
25 GFCGFCFGCGFC GRCGFXFGDGF
26 QWQWQW QWQWWW
27 LIWNDVLIHWDV LIAENVLIHEAV
28 DARLENE SHORTRIDGE
29 BETH HDHDHDH
30 ROBERT SHIELDS
31 GHERDHBXFH DFHFDHDFH
32 ACE TESSSSSRT
33 ALLISON AWTREY
34 UYGUGVHGVGHVG HGHGVUYYU
35 HCJHV FHJSEFHSIEHF

The problem seems to be that you'd need a solid definition of improbable names, and that is not really related to R. Anyway, I suggest you go by the first names and remove all those names that are not plausible. As a source of plausible first names or positive list, you could use e.g. SSA Baby Name Database. This should work reasonably well to filter out English first names. If you have more location specific needs for first names, just look online for other baby name databases and try to scrape them as a positive list.
Once you have them in a vector named positiveNames, filter out all non-positive names like this:
data_new <- data_original[!data_original$firstName %in% positiveNames,]

My approach is the following:
1) Merge FirstName and LastName into a single string, strname.
Then, count the number of letters for each strname.
2) At this point, we find that for real names, like "MARNIMONTANEZ", are composed of two 'M'; two 'A'; one 'R'; one 'I'; three 'N'; one 'O'; one 'T'.
And we find that fake names, like "GFCTFCGCFGCCGGFCGFCGFCGF", are composed of six 'G'; five 'F'; 8 'C'.
3) The pattern to distinguish real names from fake names becomes clear:
real names are characterized by a more variety of letters. We can measure this by creating a variable check_real computed as: number of unique letters / total string length
fake names are characterized by few letters repeated several times. We can measure this by creating a variable check_fake computed as: average frequency of each letter
4) Finally, we just have to define a threshold to identify an anomaly for both variable. In the cases where these threshold are triggered, a flag_real and a flag_fake appears.
if flag_real == 1 & flag_fake == 0, the name is real
if flag_real == 0 & flag_fake == 1, the name is fake
In the rare cases when the two flags agrees (i.e. flag_real == 1 & flag_fake == 1), you have to investigate the record manually to optimize the threshold.

You can calculate variability strength of full name (combine FirstName and LastName) by calculating length of unique letters in full name divided by total number of characters in the full name. Then, just remove the names that has low variability strength. This means that you are removing the names that has a high frequency of same random keystrokes resulting in low variability strength.
I did this using charToRaw function because it very faster and using dplyr library, as below:
# Building Test Data
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7),
FirstName = c("MARY", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
LastName = c("MEYER", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
#test data: df
# CustomerId FirstName LastName
#1 1 MARY MEYER
#2 2 FGCYFCGCGFC FGCGFCHGHG
#3 3 GFCTFCGCFGC GFCGFCGFCGF
#4 4 ASDASDASD AASDASDASD
#5 5 GDTDTTD DTTHDTHTHTHD
#6 6 WALTER FINN
#7 7 GFCTFCGCFGC CG GFCGFCGFCGF
library(dplyr)
df %>%
## Combining FirstName and LastName
mutate(FullName = paste(FirstName, gsub(" ", "", LastName, fixed = TRUE))) %>%
group_by(FullName) %>%
## Calculating variability strength for each full name
mutate(Variability = length(unique(as.integer(charToRaw(FullName))))/nchar(FullName))%>%
## Filtering full name, I set above or equal to 0.4 (You can change this)
## Meaning we are keeping full name that has variability strength greater than or equal to 0.40
filter(Variability >= 0.40)
# A tibble: 2 x 5
# Groups: FullName [2]
# CustomerId FirstName LastName FullName Variability
# <dbl> <chr> <chr> <chr> <dbl>
#1 1 MARY MEYER MARY MEYER 0.6000000
#2 6 WALTER FINN WALTER FINN 0.9090909

I tried to combine the suggestions in the below code. Thanks everyone for the help.
# load required libraries
library(hunspell)
library(dplyr)
# read data in dataframe df
df<-data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7,8),
FirstName = c("MARY"," ALBERT SAM", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
LastName = c("MEYER","TEST", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
# Keep unique names
df<-distinct(df,FirstName, LastName, .keep_all = TRUE)
# Spell check using hunspel
df$flag <- hunspell_check(df$FirstName) | hunspell_check(as.character(df$LastName))
# remove middle names
df$FirstNameOnly<-gsub(" .*","",df$FirstName)
# SSA name data using https://www.ssa.gov/oact/babynames/names.zip
# unzip files in folder named names
files<-list.files("/names",pattern="*.txt")
ssa_names<- do.call(rbind, lapply(files, function(x) read.csv(x,
col.names = c("Name","Gender","Frequency"),stringsAsFactors = FALSE)))
# Change SSA names to uppercase
ssa_names$Name <- toupper(ssa_names$Name)
# Flad for SSA names
df$flag_SSA<-ifelse(df$FirstNameOnly %in% ssa_names$Name,TRUE,FALSE)
rm(ssa_names)
# remove spaces and concatenate first name and last name
df$strname<-gsub(" ","",paste(df$FirstName,df$LastName, sep = ""))
# Name string length
df$len<-nchar(df$strname)
# Unique string length
for(n in 1:nrow(df))
{
df$ulen[n]<-length(unique(strsplit(df$strname[n], "")[[1]]))
}
# Ratio variable for unique string length over total string length
df$ratio<-ifelse(df$len==0,0,df$ulen/df$len)
# Histogram to determine cutoff ratio
hist(df$ratio)
test<-df[df$ratio<.4 & df$flag_SSA==FALSE & df$flag==FALSE,]

Order R dataframe columns by using second dataframe as a reference.

I am working on developing a statistical program using R, this program accepts two dataFrames. The first dataFrame carries demographics information of patients and the second carries their clinical information. The key column in the demographics dataFrame is the patientID column. While in the clinical dataFrame each patientID is a column. I wish to arrange/sort my demographics dataFrame by patientID, based upon the order of patientID's(ind columns) in the clinical dataFrame. Also the ID's could numeric or alphanumeric or could just be some-alphabet sequence. I was able to write some code, but would need help/guidance to come up with a better way to sort columns irrespective of their datatype(character, factor, numeric etc).
demogr = read.csv(mydemoFile, header = T, stringsAsFactors
=TRUE,colClasses=c('factor','factor','factor','factor','factor'))
demogr=demogr[order(as.numeric(demogr$Patient_ID)),]
myClinicalFrame=fread(myInputFile,header=T,data.table=FALSE,sep=",")
rowNames=myClinicalFrame[,1]
myClinicalFrame[,1]<-NULL
rownames(myClinicalFrame)=rowNames
names(myClinicalFrame)=sort((names(myClinicalFrame)))
The above works for certain types but fails for others. eg: Patient_ID in
demoFrame is numerically sorted above, in some situations R changes patient_ID like
109999345554545465 to 1.09e+18, which doesn't match with the second dataFrame.
Thanks

Let's start by creating two example data frames:
patientID = c(123456789012345,1234,1234567890,123)
state = c("FL","NJ","CA","TX")
demog = data.frame(ID = patientID,state = state)
clinical = data.frame(col1 = c(1,2,3),
col2 = c(3,4,5),
col2 = c(1,7,9),
col2 = c(6,4,2))
colnames(clinical) = c("1234567890","123","123456789012345","1234")
This gives us:
> demog
ID state
1 1.234568e+14 FL
2 1.234000e+03 NJ
3 1.234568e+09 CA
4 1.230000e+02 TX
and
> clinical
1234567890 123 123456789012345 1234
1 1 3 1 6
2 2 4 7 4
3 3 5 9 2
As you can see the rows in demog are in a different order than the columns in clinical.
To sort the rows in demog do:
rownames(demog) = demog$ID
demog = demog[colnames(clinical),]
This works even for IDs that are factors or characters, because rownames() will convert them to character.
Result:
> demog
ID state
1234567890 1.234568e+09 CA
123 1.230000e+02 TX
123456789012345 1.234568e+14 FL
1234 1.234000e+03 NJ

R how to import a list with different number of columns to a data frame

I am trying to perform some scientometrics analysis from a Scopus csv file. The first column of the imported csv is like:
Authors,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,Page count,Cited by,DOI,Link,Document Type,Source,EID
"Kuck, L.S., Noreña, C.P.Z.","Microencapsulation of grape (Vitis labrusca var. Bordo) skin phenolic extract using gum Arabic, polydextrose, and partially hydrolyzed guar gum as encapsulating agents",2016,"Food Chemistry","194",,,"569","576",,,10.1016/j.foodchem.2015.08.066,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940212199&partnerID=40&md5=e4c36e03156570a7fe31c2937b3a170d",Article,Scopus,2-s2.0-84940212199
"Grasel, F.D.S., Ferrão, M.F., Wolf, C.R.","Development of methodology for identification the nature of the polyphenolic extracts by FTIR associated with multivariate analysis",2016,"Spectrochimica Acta - Part A: Molecular and Biomolecular Spectroscopy","153",,,"94","101",,,10.1016/j.saa.2015.08.020,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939865445&partnerID=40&md5=8239487f4eea9479d698792e6aa348de",Article,Scopus,2-s2.0-84939865445
"De Souza, D., Sbardelotto, A.F., Ziegler, D.R., Marczak, L.D.F., Tessaro, I.C.","Characterization of rice starch and protein obtained by a fast alkaline extraction method",2016,"Food Chemistry","191",, 17279,"36","44",,,10.1016/j.foodchem.2015.03.032,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84938952690&partnerID=40&md5=989cbfcc72286a87f726925732db4b49",Article,Scopus,2-s2.0-84938952690
"Filho, P.R.M., Vercelino, R., Cioato, S.G., Medeiros, L.F., de Oliveira, C., Scarabelot, V.L., Souza, A., Rozisky, J.R., Quevedo, A.S., Adachi, L.N.S., Sanches, P.R.S., Fregni, F., Caumo, W., Torres, I.L.S.","Transcranial direct current stimulation (tDCS) reverts behavioral alterations and brainstem BDNF level increase induced by neuropathic pain model: Long-lasting effect",2016,"Progress in Neuro-Psychopharmacology and Biological Psychiatry","64",,,"44","51",,,10.1016/j.pnpbp.2015.06.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84937468588&partnerID=40&md5=b03f0ccfbf66a49a438c9983cc2e8f9d",Article,Scopus,2-s2.0-84937468588
"Duarte, Á.T., Borges, A.R., Zmozinski, A.V., Dessuy, M.B., Welz, B., De Andrade, J.B., Vale, M.G.R.","Determination of lead in biomass and products of the pyrolysis process by direct solid or liquid sample analysis using HR-CS GF AAS",2016,"Talanta","146",,,"166","174",,,10.1016/j.talanta.2015.08.041,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940416990&partnerID=40&md5=55d7ddad27e955b9b6e269469e29c8c3",Article,Scopus,2-s2.0-84940416990
"Francischini, H., Paes Neto, V.D., Martinelli, A.G., Pereira, V.P., Marinho, T.S., Teixeira, V.P.A., Ferraz, M.L.F., Soares, M.B., Schultz, C.L.","Invertebrate traces in pseudo-coprolites from the upper Cretaceous Marília Formation (Bauru Group), Minas Gerais State, Brazil",2016,"Cretaceous Research","57",,,"29","39",,,10.1016/j.cretres.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939175950&partnerID=40&md5=b049de15a08ba477cc189d7e8fe7f0a3",Article,Scopus,2-s2.0-84939175950
"Bonfatti, B.R., Hartemink, A.E., Giasson, E., Tornquist, C.G., Adhikari, K.","Digital mapping of soil carbon in a viticultural region of Southern Brazil",2016,"Geoderma","261",,,"204","221",,,10.1016/j.geoderma.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939499978&partnerID=40&md5=b470166e01648dcbe8f0d43be86c84e0",Article,Scopus,2-s2.0-84939499978
"Scaramuzza dos Santos, T.C., Holanda, E.C., de Souza, V., Guerra-Sommer, M., Manfroi, J., Uhl, D., Jasper, A.","Evidence of palaeo-wildfire from the upper Lower Cretaceous (Serra do Tucano Formation, Aptian-Albian) of Roraima (North Brazil)",2016,"Cretaceous Research","57",,,"46","49",,,10.1016/j.cretres.2015.08.003,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939615367&partnerID=40&md5=e59f5130c6a2e1863f9aa77c960e6462",Article,Scopus,2-s2.0-84939615367
"da Silva, S.W., Bortolozzi, J.P., Banús, E.D., Bernardes, A.M., Ulla, M.A.","TiO<inf>2</inf> thick films supported on stainless steel foams and their photoactivity in the nonylphenol ethoxylate mineralization",2016,"Chemical Engineering Journal","283",, 14049,"1264","1272",,,10.1016/j.cej.2015.08.057,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940747062&partnerID=40&md5=aebc7357f9dedaadeebabfeda4aa3dd9",Article,Scopus,2-s2.0-84940747062
"Dalmora, A.C., Ramos, C.G., Oliveira, M.L.S., Teixeira, E.C., Kautzmann, R.M., Taffarel, S.R., de Brum, I.A.S., Silva, L.F.O.","Chemical characterization, nano-particle mineralogy and particle size distribution of basalt dust wastes",2016,"Science of the Total Environment","539",, 18331,"560","565",,,10.1016/j.scitotenv.2015.08.141,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84941754626&partnerID=40&md5=1c4ca1a3486ff55f92f238083af3eb50",Article,Scopus,2-s2.0-84941754626
"Fink, J.R., Inda, A.V., Bavaresco, J., Barrón, V., Torrent, J., Bayer, C.","Adsorption and desorption of phosphorus in subtropical soils as affected by management system and mineralogy",2016,"Soil and Tillage Research","155",,,"62","68",,,10.1016/j.still.2015.07.017,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940195225&partnerID=40&md5=2e43a874f1e36f11aa5efa057ce660b9",Article,Scopus,2-s2.0-84940195225
"Martins, A.B., Santana, R.M.C.","Effect of carboxylic acids as compatibilizer agent on mechanical properties of thermoplastic starch and polypropylene blends",2016,"Carbohydrate Polymers","135",,,"79","85",,,10.1016/j.carbpol.2015.08.074,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940781718&partnerID=40&md5=426e62c6c0de33a91bdb2f75442fbd6f",Article,Scopus,2-s2.0-84940781718
In each line, there are a variable number of authors (up to more than 20). Until now, I am doing something like:
test <- read.csv("test.csv")
test$Authors <- as.character(test$Authors)
test2 <- strsplit(as.character(test$Authors), '.,', fixed=TRUE)
Which gives me a list correctly separating each author. I tested several alternatives proposed in the list to move the list to a data frame, but the closer one was:
test3 <- str_split_fixed(test$Authors, '.,', n = 20)
Which gave me two problems:
1) I have to define the number of columns, which I do not know before analyzing the data;
2) the authors are not properly separated, but surname and abbreviated names are in different columns. Additionally, the command removed some characters from the names.
Some of the strategies suggested elsewhere gave me the correct separation of authors in columns, but the empty columns were fulfilled by repeating the initial names. Sorry if the question is too naive, I am beginning in the use of R.
Any advises and or insights would be greatly appreciated!

Here's how I would do it.
Firstly, using read.csv is causing the split between Authors last name and initial, so I'm using readLines instead.
Secondly, having "wide data" like this is not in general a good idea. It makes data more difficult to work with in subsequent analyses. For that reason, I've made it "long" instead.
n1 <- readLines(con="test.csv")
n1 <- strsplit(n1, '., ', fixed=TRUE)
n1 <- do.call(rbind, lapply(1:length(n1), function(x){data.frame(aut = n1[[x]], pub = x, order = 1:length(n1[[x]]))}))
n1$aut <- gsub("\\.$", "", n1$aut)
Here's the output:
aut pub order
1 Kuck, L.S 1 1
2 NoreÃ±a, C.P.Z 1 2
3 Grasel, F.D.S 2 1
4 FerrÃ£o, M.F 2 2
5 Wolf, C.R 2 3
6 Abreu, M.S 3 1
7 Giacomini, A.C.V 3 2
8 Gusso, D 3 3
9 Rosa, J.G.S 3 4
NB if you really want your data in "wide format", we can easily reshape it:
library(tidyr)
spread(n1, order, aut)
pub 1 2 3 4
1 1 Kuck, L.S NoreÃ±a, C.P.Z <NA> <NA>
2 2 Grasel, F.D.S FerrÃ£o, M.F Wolf, C.R <NA>
3 3 Abreu, M.S Giacomini, A.C.V Gusso, D Rosa, J.G.S
EDIT: for your full version, you need to use read.csv:
input <- n1 <- read.csv("test.csv")
n1$Authors <- as.character(n1$Authors)
n1$Authors <- strsplit(n1$Authors, '., ', fixed=TRUE)
n1 <- do.call(rbind, lapply(1:length(n1$Authors), function(x){data.frame(aut = n1$Authors[[x]], pub = x, order = 1:length(n1$Authors[[x]]))}))
n1$aut <- gsub("\\.$", "", n1$aut)
If you want to go back to wide with all your stuff:
library(dplyr)
library(tidyr)
input <- mutate(input, row = row_number())
n1 %>% spread(order, aut) %>%
left_join(input, by = c("pub" = "row")) %>%
select(-Authors)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Splitting complex string between symbols R - r

Related

In R, left join two tables whose 2 potential keys contain missing data

R - Finding identical rows or rows that only differ by x columns

Data Cleaning in R: remove test customer names

Order R dataframe columns by using second dataframe as a reference.

R how to import a list with different number of columns to a data frame

Categories

Resources