R: converting comma separated entry to columns with non-characters

I have a column of names in R that are separated by a comma.
For example:
John, Doe
Rebecca, Homes
I'd like to separate the first and last names into separate columns.
One additional problem I have is that sometimes there will be a name that doesn't have a comma. For example:
John, Doe
Rebecca, Homes
Organization LLC
I've looked at using strsplit(a, ","), but I get the following error: Error in strsplit(wn, ",") : non-character argument.
Here is a related example on Stack Overflow: Convert comma separated entry to columns
Any help regarding solving this simple problem will be greatly appreciated. Thanks.

In 2 steps:
Use read.table with fill=TRUE to read all lines (you can also use readLines).
Treat the lines without a comma separately, using a space as the separator.
The code is something like this:
aa <- read.table(text='John, Doe
Rebecca, Homes
Organization LLC',sep=',',fill=TRUE,colClasses='character',
strip.white=TRUE) ## strip.white drops the space left after each comma
## treat lines without comma, using a space as separator
## (I assume you don't have compound names)
aa[nchar(aa$V2) ==0,] <-
do.call(rbind,strsplit(aa[nchar(aa$V2) ==0,]$V1,' '))
> aa
V1 V2
1 John Doe
2 Rebecca Homes
3 Organization LLC
EDIT, a better method: I use a regular expression to replace any space (and any comma followed by a space) with a single comma, so that every line has the same separator. I assume that you don't have any compound names.
ff <- readLines(textConnection('John, Doe
Rebecca, Homes
Organization LLC'))
do.call(rbind,
strsplit(gsub('[ ]|, |,[ ]',',',ff),','))
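As for the "non-character argument" error in the question: strsplit only accepts character vectors, so the column is probably a factor (an assumption, since the question doesn't show how the data was read). A minimal sketch of an alternative that converts first and, unlike the code above, keeps a comma-less entry whole in the first column; the names a, parts and res are made up for illustration:

```r
# Hypothetical input: a factor column, as a data frame read with
# stringsAsFactors = TRUE would give
a <- factor(c("John, Doe", "Rebecca, Homes", "Organization LLC"))

# Convert to character before splitting on a comma plus optional spaces
parts <- strsplit(as.character(a), ",\\s*")

# Pad every entry to exactly two fields; entries without a comma
# get NA in the second column
res <- t(sapply(parts, function(p) c(p, NA)[1:2]))
```

Whether "Organization LLC" should stay whole or be split on the space depends on what you want to do with it afterwards.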

Related

simple Join error where some rows join and some don't

I have two dataframes which I'm trying to join. This should be straightforward, but I see some anomalous behavior.
Dataframe A
Name Sample Country Path
John S18902 UK /Home/drive/John
BOB 135671 USA /Home/drive/BOB
Tim GB12345_serum_63 UK /Home/drive/Tim
Wayne 12345_6789 UK /Home/drive/Wayne
Dataframe B
Surname Sample State FILE
Paul S18902 NJ John.csv
Gem 135671 PP BOB.csv
Love GB12345_serum_63 AP Tim.csv
Dave 12345_6789 MK Wayne.csv
I am using R markdown to do a simple join using the following command
Dataframec <- DataframeA %>%
left_join(DataframeB ,by = "Sample",all.x=T )
All rows join apart from the row where sample== GB12345_serum_63
There should be a simple fix to this but I'm out of ideas.
Thank you
If you are cutting and pasting your data directly into your question, then the join fails because your key values are technically different: they have different numbers of trailing spaces.
I copied each value from your question, from the beginning of the value to the start of the adjacent column name (so up to 'Country' in the first case and up to 'State' in the second):
DataframeA: "GB12345_serum_63"
DataframeB: "GB12345_serum_63   "
You can see that for DataframeB there are 3 space characters after the value. This can be resolved by removing leading and trailing whitespace from your key values with a regular expression: gsub("^\\s+|\\s+$", "", x)
DataframeA$Sample <- gsub("^\\s+|\\s+$", "", DataframeA$Sample)
DataframeB$Sample <- gsub("^\\s+|\\s+$", "", DataframeB$Sample)
Now your join should work
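A small sketch of the effect, assuming two cut-down data frames a and b (made-up stand-ins for DataframeA and DataframeB). It uses base R's merge and trimws rather than dplyr and gsub, purely to keep the example free of package dependencies:

```r
a <- data.frame(Name = c("John", "Tim"),
                Sample = c("S18902", "GB12345_serum_63"),
                stringsAsFactors = FALSE)
b <- data.frame(Surname = c("Paul", "Love"),
                Sample = c("S18902", "GB12345_serum_63   "),  # trailing spaces
                stringsAsFactors = FALSE)

# Left join before cleaning: the Tim row finds no partner
before <- merge(a, b, by = "Sample", all.x = TRUE)

# Strip leading/trailing whitespace from the keys, then join again
a$Sample <- trimws(a$Sample)
b$Sample <- trimws(b$Sample)
after <- merge(a, b, by = "Sample", all.x = TRUE)
```

Comparing nchar() of the key in both data frames is a quick way to spot this kind of mismatch.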

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#', I need to extract all of them and save each mention in that particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse the list in each row to form a single string separated by spaces, as described above?
Thanks in advance.
I think it would be best to use an AsIs (list) column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want one comma-separated string per tweet, you can collapse each list element with sapply and paste from base R:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
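For the space-separated form the question asks for, the same idea can be sketched in base R alone. The tweets vector below is a shortened, made-up stand-in for tweets_date$Tweet, and regmatches with gregexpr is swapped in for str_extract_all:

```r
tweets <- c("It is #DineshKarthik's birthday, captain of #KKRiders.",
            "CHAMPIONS!!",
            "A fairytale comeback as #ChennaiIPL beat #SRH")

# gregexpr locates every match; regmatches pulls them out as a
# list with one character vector per tweet
tags <- regmatches(tweets, gregexpr("#\\w+", tweets))

# Collapse each vector into one string; tweets without a hashtag
# become the empty string ""
mentions <- vapply(tags, paste, character(1), collapse = " ")
```

vapply is used instead of sapply only so that the result is guaranteed to be a character vector with one element per tweet.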
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that tweets can contain email addresses, as well as URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.

R how to get data column to rows of first and second values

Apologies, I'm a novice but I don't seem to be able to find an answer to this question.
I've scraped tabular data from a web page. After some cleaning it appears in a single unnamed column.
[1] John
[2] Smith
[3] Tina
[4] Jordan
and so on.....
I'm obviously looking for the result of:
FirstName | LastName
[1] John Smith
[2] Tina Jordan
et al.
Much of what has gotten me to this point was sourced from: http://statistics.berkeley.edu/computing/r-reading-webpages
A very helpful resource for beginners such as myself.
I would be grateful for any advice you can give me.
Thanks,
C R Eaton
We create a logical index ('i1') and build a data.frame by extracting elements from the first column of the original dataset ('dat') with it. The elements of 'i1' are recycled to the length of the column, so 'dat[i1, 1]' extracts the 1st, 3rd, 5th, etc. elements. For the last name, we negate 'i1', so that it extracts the 2nd, 4th, etc.
i1 <- c(TRUE, FALSE)
d1 <- data.frame(FirstName = dat[i1,1], LastName = dat[!i1, 1], stringsAsFactors=FALSE)
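To make this reproducible, here is the same idea on a made-up 'dat' built from the four values in the question, followed by an alternative (not part of the answer above) that fills a two-column matrix row by row. Both assume the column strictly alternates first name, last name:

```r
# Hypothetical stand-in for the scraped single-column data
dat <- data.frame(V1 = c("John", "Smith", "Tina", "Jordan"),
                  stringsAsFactors = FALSE)

# Recycled logical index: TRUE, FALSE, TRUE, FALSE, ...
i1 <- c(TRUE, FALSE)
d1 <- data.frame(FirstName = dat[i1, 1], LastName = dat[!i1, 1],
                 stringsAsFactors = FALSE)

# Alternative: fill a two-column matrix row-wise, then convert
m <- matrix(dat[, 1], ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("FirstName", "LastName")))
d2 <- as.data.frame(m, stringsAsFactors = FALSE)
```

If a stray element breaks the alternation, both versions will silently misalign every pair after it, so it is worth checking that the number of rows is even.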

Transforming character strings in R

I have to merge two data frames in R. The two data frames share a common id variable, the name of the subject. However, the names in one data frame are partly capitalized, while in the other they are in lower case. Furthermore, the names appear in reverse order. Here is a sample from the data frames:
DataFrame1$Name:
"Van Brempt Kathleen"
"Gräßle Ingeborg"
"Gauzès Jean-Paul"
"Winkler Iuliu"
DataFrame2$Name:
"Kathleen VAN BREMPT"
"Ingeborg GRÄSSLE"
"Jean-Paul GAUZÈS"
"Iuliu WINKLER"
Is there a way in R to make these two variables usable as an identifier for merging the data frames?
Best, Thomas
You can use gsub to convert the names around:
> names
[1] "Kathleen VAN BREMPT" "jean-paul GAULTIER"
> gsub("([^\\s]*)\\s(.*)","\\2 \\1",names,perl=TRUE)
[1] "VAN BREMPT Kathleen" "GAULTIER jean-paul"
>
This works by matching first anything up to the first space and then anything after that, and switching them around. Then add tolower() or toupper() if you want, and use match() for joining your data frames.
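A minimal sketch of those steps put together, assuming two made-up vectors name1 and name2 (hypothetical, and ASCII-only on purpose so that it sidesteps accented characters):

```r
name1 <- c("Van Brempt Kathleen", "Winkler Iuliu")
name2 <- c("Iuliu WINKLER", "Kathleen VAN BREMPT")

# Move everything before the first space to the end
swapped <- gsub("([^\\s]*)\\s(.*)", "\\2 \\1", name2, perl = TRUE)

# Compare case-insensitively; idx[i] is the position in name1
# that corresponds to name2[i] (NA if there is no exact match)
idx <- match(tolower(swapped), tolower(name1))
```

Any NA in idx marks a name that still needs manual or approximate matching.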
Good luck matching Grassle with Graßle though. Lots of other things will probably bite you too, such as people with two first names separated by space, or someone listed with a title!
Barry
Here's a complete solution that combines the two partial methods offered so far (and overcomes the fears expressed by Spacedman about "matching Grassle with Graßle"):
DataFrame2$revname <- gsub("([^\\s]*)\\s(.*)","\\2 \\1",DataFrame2$Name,perl=TRUE)
DataFrame2$agnum <-sapply(tolower(DataFrame2$revname), agrep, tolower(DataFrame1$Name) )
DataFrame1$num <-1:nrow(DataFrame1)
merge(DataFrame1, DataFrame2, by.x="num", by.y="agnum")
Output:
num Name.x Name.y revname
1 1 Van Brempt Kathleen Kathleen VAN BREMPT VAN BREMPT Kathleen
2 2 Gräßle Ingeborg Ingeborg GRÄSSLE GRÄSSLE Ingeborg
3 3 Gauzès Jean-Paul Jean-Paul GAUZÈS GAUZÈS Jean-Paul
4 4 Winkler Iuliu Iuliu WINKLER WINKLER Iuliu
The third step would not be necessary if DataFrame1 had rownames that were still sequentially numbered (as they would be by default). The merge statement would then be:
merge(DataFrame1, DataFrame2, by.x="row.names", by.y="agnum")
--
David.
Can you add an additional column/variable to each data frame which is a lowercase version of the original name:
DataFrame1$NameLower <- tolower(DataFrame1$Name)
DataFrame2$NameLower <- tolower(DataFrame2$Name)
Then perform a merge on this:
MergedDataFrame <- merge(DataFrame1, DataFrame2, by="NameLower")
In addition to the answer using gsub to rearrange the names, you might also want to look at the agrep function, which looks for approximate matches. You can use it with sapply to find the matching rows from one data frame in the other, e.g.:
> sapply( c('newyork', 'NEWJersey', 'Vormont'), agrep, x=state.name, ignore.case=TRUE )
newyork NEWJersey Vormont
32 30 45

Parallel gsub: how does one remove a different string in each element of a vector

I have a guest list that has a last name in one column, and in another column the first names or the full names (first space last) of each person in the family. I want the other column to contain just the first names.
gsub(guest.w$Last.Name,"",guest.w$Party.Name.s.)
That would work perfectly if I just had one row, but how do I do it for each row in the dataframe? Do I have to write a for loop? Is there a way to do it in parallel, similar to the way pmax() relates to max()?
My problem is similar in a way to a previously asked question by JD Long but that question was a piece of cake compared to mine.
Example:
Smith; Joe Smith, Kevin Smith, Jane Smith
Alter; Robert Alter, Mary Alter, Ronald Alter
Becomes
Smith; Joe, Kevin, Jane
Alter; Robert, Mary, Ronald
Using Hadley's adply:
library(plyr)
df <- data.frame(rbind(c('Smith', 'Joe Smith, Kevin Smith, Jane Smith'), c('Alter', 'Robert Alter, Mary Alter, Ronald Alter')))
names(df) <- c("last", "name")
adply(df,1,transform, name=gsub(last, '', name))
You will probably need to clean up the spaces in your new vector.
You probably need to do some "wrapping" around your expression in order to get the apply() function working:
If you're working on a data.frame you should use apply() (and not sapply()).
You must create a function for apply (with a return clause).
Working on a data.frame row as function input is a bit tricky: rows are converted into vectors and lose some properties (you can't use the $ sign to access named fields), so it's better to convert the row into a list first.
The final result looks something like this:
df <- rbind(c('Smith', 'Joe Smith, Kevin Smith, Jane Smith'), c('Alter', 'Robert Alter, Mary Alter, Ronald Alter'))
colnames(df) = c('Last.Name', 'Party.Name.s.')
apply(df,1,function(y) {y = as.list(y);return(gsub(y$Last.Name, "", y$Party.Name.s.))})
I am not sure it will work on a dataframe, but you could try one of the apply functions, passing it a function that handles one row at a time:
`y1 <- sapply(seq_len(nrow(guest.w)), function(i) gsub(guest.w$Last.Name[i], "", guest.w$Party.Name.s.[i]))`
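For the "parallel" behaviour the question asks about, mapply is probably the closest base R analogue to the pmax/max relationship: it walks several vectors in step. A hedged sketch, assuming guest.w looks like the example in the question (the data frame is rebuilt here, and first.names is a made-up name):

```r
guest.w <- data.frame(
  Last.Name = c("Smith", "Alter"),
  Party.Name.s. = c("Joe Smith, Kevin Smith, Jane Smith",
                    "Robert Alter, Mary Alter, Ronald Alter"),
  stringsAsFactors = FALSE)

# For each row, remove " <last name>" (with its leading space) from
# the party string; fixed = TRUE treats the name as literal text
first.names <- mapply(
  function(last, party) gsub(paste0(" ", last), "", party, fixed = TRUE),
  guest.w$Last.Name, guest.w$Party.Name.s.,
  USE.NAMES = FALSE)
```

Removing the leading space together with the name avoids the separate whitespace clean-up step, at the cost of missing a last name that is not preceded by a space.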

Resources