Create a column value based on a matching regular expression - r

I have the following character string in a column called "Sentences" of a data frame:
I like an apple
I would like to create a second column, called Type, whose values are determined by matching strings. For example, I would like to take the regular expression \bapple\b, match it against each sentence and, if it matches, put the value Fruit_apple in the Type column.
In the long run I'd like to do this with several other strings and types.
Is there an easy way to do this using a function?
dataset (survey_1):
structure(list(slider_8.response = c(1L, 1L, 3L, 7L, 7L, 7L,
1L, 3L, 2L, 1L, 1L, 7L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 7L,
7L, 7L, 1L, 1L, 7L, 6L, 6L, 1L, 1L, 7L, 1L, 7L, 7L, 1L, 7L, 7L,
7L, 7L, 7L, 6L, 7L, 7L, 7L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 7L, 2L
), Sentences = c("He might could do it.", "I ever see the film.",
"I may manage to come visit soon.", "She’ll never be forgotten.",
"They might find something special.", "It might not be a good buy.",
"Maybe my pain will went away.", "Stephen maybe should fix your bicycle.",
"It used to didnʼt matter if you walked in late.", "He’d could climb the stairs.",
"Only Graeme would might notice that.", "I used to cycle a lot. ",
"Your dad belongs to disagree with this. ", "We can were pleased to see her.",
"He may should take us to the city.", "I could never forgot his deep voice.",
"I should can turn this thing over to Ann.", "They must knew who they really are.",
"We used to runs down three flights.", "I don’t care what he may be up to. ",
"That’s something I ain’t know about.", "That must be quite a skill.",
"We must be able to invite Jim.", "She used to play with a trolley.",
"He is done gone. ", "You might can check this before making a decision.",
"It would have a positive effect on the team. ", "Ruth can maybe look for it later.",
"You should tag along at the dance.", "They’re finna leave town.",
"A poem should looks like that.", "I can tell you didn’t do your homework. ",
"I can driving now.", "They should be able to put a blanket over it.",
"We could scarcely see each other.", "I might says I was never good at maths.",
"The next dance will be a quickstep. ", "I might be able to find myself a seat in this place.",
"Andrew thinks we shouldn’t do it.", "Jack could give a hand.",
"She’ll be able to come to the event.", "She’d maybe keep the car the way it is.",
"Sarah used to be able to agree with this proposal.", "I’d like to see your lights working. ",
"I’d be able to get a little bit more sleep.", "John may has a second name.",
"You must can apply for this job.", "I maybe could wait till the 8 o’clock train.",
"She used to could go if she finished early.", "That would meaned something else, eh?",
"You’ll can enjoy your holiday.", "We liketa drowned that day. ",
"I must say it’s a nice feeling.", "I eaten my lunch."), construct = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 54L), class = "data.frame")
type_list:
list("DM_will_can"=c("ll can","will can"), "DM_would_could"=c("d could","would could"),
"DM_might_can"="might can","DM_might_could"="might could","DM_used_to_could"="used to could",
"DM_should_can"="should can","DM_would_might"=c("d might", "would might"),"DM_may_should"="may should",
"DM_must_can"="must can", "SP_will_be_able"=c("ll be able","will be able"),
"SP_would_be_able"=c("d be able","would be able"),"SP_might_be_able"="might be able",
"SP_maybe_could"="maybe could","SP_used_to_be_able"="used to be able","SP_should_be_able"=
"should be able","SP_would_maybe"=c("d maybe", "would maybe"), "SP_maybe_should"="maybe should",
"SP_must_be_able"="must be able", "Filler_will_a"="quickstep","Filler_will_b"="forgotten",
"Filler_would_a"="lights working","Filler_would_b"="positive effect","Filler_can_a"="homework",
"Filler_can_b"="Ruth","Filler_could_a"="scarcely","Filler_could_b"="Jack", "Filler_may_a"="may be up to",
"Filler_may_b"="visit soon", "Filler_might_a"="good buy","Filler_might_be"="something special",
"Filler_should_a"="tag along","Filler_should_b"="Andrew","Filler_used_to_a"="trolley",
"Filler_used_to_b"="cycle a lot","Filler_must_a"="quite a skill","Filler_must_b"="nice feeling",
"Dist_gram_will_went"="will went","Dist_gram_meaned"="meaned","Dist_gram_can_were"="can were",
"Dist_gram_forgot"="never forgot", "Dist_gram_may_has"="may has",
"Dist_gram_might_says"="might says","Dist_gram_used_to_runs"="used to runs",
"Dist_gram_should_looks"="should looks","Dist_gram_must_knew"="must knew","Dist_dial_liketa"="liketa",
"Dist_dial_belongs"="belongs to disagree","Dist_dial_finna"="finna","Dist_dial_used_to_didnt"="used to didn't matter",
"Dist_dial_eaten"="I eaten", "Dist_dial_can_driving"="can driving","Dist_dial_aint_know"="That's something",
"Dist_dial_ever_see"="ever see the film","Dist_dial_done_gone"="done gone")

I would normally do this with a Python dictionary, but since we're talking about R, I've more or less translated that approach. There is probably a more idiomatic way to do this in R than two for loops, but this should work:
# Define data
df <- data.frame(
  id = 1:5,
  sentences = c("I like apples", "I like dogs", "I have cats", "Dogs are cute", "I like fish")
)
# id sentences
# 1 1 I like apples
# 2 2 I like dogs
# 3 3 I have cats
# 4 4 Dogs are cute
# 5 5 I like fish
type_list <- list(
  "fruit" = c("apples", "oranges"),
  "animals" = c("dogs", "cats")
)
types <- names(type_list)
df$type <- NA
df$item <- NA
for (type in types) {
  for (item in type_list[[type]]) {
    matches <- grep(item, df$sentences, ignore.case = TRUE)
    df[matches, "type"] <- type
    df[matches, "item"] <- item
  }
}
# Output:
# id sentences type item
# 1 1 I like apples fruit apples
# 2 2 I like dogs animals dogs
# 3 3 I have cats animals cats
# 4 4 Dogs are cute animals dogs
# 5 5 I like fish <NA> <NA>
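As noted above, there is probably a more idiomatic way than two nested loops. A minimal sketch of one vectorised alternative (mine, not part of the original answer, using the toy data above and folding in the word-boundary idea from the question) could be:
# Build one regex per type, wrapping each term in word boundaries (\b)
patterns <- sapply(type_list, function(terms) paste0("\\b", terms, "\\b", collapse = "|"))
# Logical matrix: one column per type, one row per sentence
hit <- sapply(patterns, grepl, x = df$sentences, ignore.case = TRUE)
# First matching type per sentence, NA if nothing matches
df$type <- apply(hit, 1, function(r) if (any(r)) names(patterns)[which(r)[1]] else NA)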
EDIT
Added after the data was posted. If I read in your data and call it df, and read in your type list and call it type_list, the following works:
types <- names(type_list)
df$type <- NA
df$item <- NA
for (type in types) {
  for (item in type_list[[type]]) {
    matches <- grep(item, df$Sentences, ignore.case = TRUE)
    df[matches, "type"] <- type
    df[matches, "item"] <- item
  }
}
This is exactly the same as my previous code, except Sentences has an upper case S in your data frame.

Related

Comparing two apparently identical levels from factors

I have two data frame columns whose factor levels appear identical, but they are not:
levels(train$colA)
## [1] "I am currently using (least once over the last 2 weeks)"
## [2] "I have never tried nor used"
## [3] "I have tried or used at some point in time"
levels(test$colA)
## [1] "I am currently using (least once over the last 2 weeks)"
## [2] "I have never tried nor used"
## [3] "I have tried or used at some point in time"
levels(train$colA) == levels(test$colA)
## [1] FALSE TRUE TRUE
I have tried comparing both sentences and actually they are equal:
"I am currently using (least once over the last 2 weeks)" == "I am currently using (least once over the last 2 weeks)"
## [1] TRUE
I am trying to apply an xgboost model trained on the train data frame to the test data, but with no success, as I get an error saying that test has a new factor level.
Edited:
Here is the output of dput():
dput(head(train$colA))
structure(c(1L, 1L, 1L, 2L, 1L, 1L), .Label = c("I am currently using (least once over the last 2 weeks)",
"I have never tried nor used", "I have tried or used at some point in time"
), class = "factor")
dput(head(test$colA))
structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("I am currently using (least once over the last 2 weeks)",
"I have never tried nor used", "I have tried or used at some point in time"
), class = "factor")
I can see there is a difference between c(1L, 1L, 1L, 2L, 1L, 1L) and c(1L, 1L, 1L, 1L, 1L, 1L), so I guess that is where the key lies, although I don't know exactly what it means.
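The thread does not include an answer here, but a minimal diagnostic sketch (my own, assuming the mismatch comes from an invisible character such as a non-breaking space or a stray trailing space in one of the level strings) would be to compare the offending levels character by character:
# Hypothetical diagnostic (not from the thread): inspect the code points of the
# level that compares as unequal to find the invisible difference.
a <- levels(train$colA)[1]
b <- levels(test$colA)[1]
nchar(a); nchar(b)                    # do the lengths already differ?
utf8ToInt(a)                          # integer code point of every character
utf8ToInt(b)
if (nchar(a) == nchar(b)) which(utf8ToInt(a) != utf8ToInt(b))  # first mismatch position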

Multiple matching function in r

I am trying to match two datasets: schools (unique) with classes that need teachers, and teachers with their specialties. Some teachers have one specialty, some have more than one. I have been trying to use the match() and which( %in% ) base functions, but I cannot get them to search for all the possible teacher matches; it always stops after the first match. Here is some sample data:
class<-c("english","history","art","art","math","history","art")
school<-c("C.H.S.","B.H.S.","D.H.S.","A.H.S.","Z.H.S.","M.H.S.","L.H.S.")
specialty<-c("math","history","English","history","literature","art","English")
teacher<-c("Jill","Jill","Sam","Liz","Liz","Liz","Rob")
teacher.skills<-data.frame(teacher, specialty)
school.needs<-data.frame(school,class)
teacher.match<-data.frame(Jill,Sam,Rob,Liz)
The final result would look like this:
Jill<-c("No","Yes","No","No","Yes","Yes","No")
Sam<-c("Yes","No","No","No","No","No","No")
Liz<-c("No","Yes","Yes","Yes","No","Yes","Yes")
Rob<-c("Yes","No","No","No","No","No","No")
match.result<-data.frame(school.needs, teacher.match)
match.result
I have even tried working on a little function like this but still can't get the final formatting right.
source.1 <- school.needs
source.2 <- teacher.skills
dist.name <- adist(source.1$class, source.2$specialty, partial = FALSE, ignore.case = TRUE)
min.name <- apply(dist.name, 1, min)
school.teacher.match <- NULL
for (i in 1:nrow(dist.name)) {
  skills.ref <- match(min.name[i], dist.name[i, ])
  school.ref <- i
  school.teacher.match <- rbind(
    data.frame(skills.ref = skills.ref, school.ref = school.ref,
               Teacher = source.2[skills.ref, ]$teacher,
               Class = source.1[school.ref, ]$class,
               School = source.1[school.ref, ]$school,
               adist = min.name[i]),
    school.teacher.match)
  school.teacher.match <- subset(school.teacher.match, school.teacher.match$adist == 0)
}
school.teacher.match
Any help would be much appreciated, thanks!
Note that I had to modify your input data, changing "English" to "english", so that those entries would match. The data is given by:
school.needs <- structure(list(school = structure(c(3L, 2L, 4L, 1L, 7L, 6L, 5L
), .Label = c("A.H.S.", "B.H.S.", "C.H.S.", "D.H.S.", "L.H.S.",
"M.H.S.", "Z.H.S."), class = "factor"), class = structure(c(2L,
3L, 1L, 1L, 4L, 3L, 1L), .Label = c("art", "english", "history",
"math"), class = "factor")), .Names = c("school", "class"), row.names = c(NA,
-7L), class = "data.frame")
teacher.skills <- structure(list(teacher = structure(c(1L, 1L, 4L, 2L, 2L, 2L,
3L), .Label = c("Jill", "Liz", "Rob", "Sam"), class = "factor"),
specialty = structure(c(5L, 3L, 2L, 3L, 4L, 1L, 2L), .Label = c("art",
"english", "history", "literature", "math"), class = "factor")), .Names = c("teacher",
"specialty"), row.names = c(NA, -7L), class = "data.frame")
Using merge and dcast from reshape2 (or data.table):
library(reshape2)
## use merge to match needs to skills
m <- merge(school.needs,teacher.skills,by.x="class",by.y="specialty")
m$val <- "Yes" ## add a column for the "Yes"
## go to wide format for the final result filling NA with "No"
result <- dcast(m,school+class~teacher,value.var="val",fill="No")
## school class Jill Liz Rob Sam
##1 A.H.S. art No Yes No No
##2 B.H.S. history Yes Yes No No
##3 C.H.S. english No No Yes Yes
##4 D.H.S. art No Yes No No
##5 L.H.S. art No Yes No No
##6 M.H.S. history Yes Yes No No
##7 Z.H.S. math Yes No No No
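The answer mentions data.table as an alternative to reshape2; a hedged sketch of that variant (mine, not verified against the original answer) might look like this, relying on data.table's own dcast:
library(data.table)
## same merge as above, then cast with data.table's dcast
m <- merge(school.needs, teacher.skills, by.x = "class", by.y = "specialty")
setDT(m)[, val := "Yes"]
dcast(m, school + class ~ teacher, value.var = "val", fill = "No")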
Here's how I'd do it:
(data)
schools <- data.frame(
  school = c("C.H.S.", "B.H.S.", "D.H.S.", "A.H.S.", "Z.H.S.", "M.H.S.", "L.H.S."),
  class = c("english", "history", "art", "art", "math", "history", "art"),
  stringsAsFactors = FALSE)
teachers <- data.frame(
  teacher = c("Jill", "Jill", "Sam", "Liz", "Liz", "Liz", "Rob"),
  specialty = c("math", "history", "English", "history", "literature", "art", "English"),
  stringsAsFactors = FALSE)
(key concepts)
# you can get the specialties of a given teacher like this:
subset(teachers, teacher == 'Jill')$specialty
# [1] "math" "history"
# you can get the set of unique teachers like this:
unique(teachers$teacher)
# [1] "Jill" "Sam" "Liz" "Rob"
(solution)
# for each teacher, do any of their specialties match the class need of each school?
matches <-
  sapply(unique(teachers$teacher), function(this_t) {
    specs <- subset(teachers, teacher == this_t)$specialty
    schools$class %in% specs
  })
# combine with school data.frame
data.frame(schools, matches)
# school class Jill Sam Liz Rob
# 1 C.H.S. english FALSE FALSE FALSE FALSE
# 2 B.H.S. history TRUE FALSE TRUE FALSE
# 3 D.H.S. art FALSE FALSE TRUE FALSE
# 4 A.H.S. art FALSE FALSE TRUE FALSE
# 5 Z.H.S. math TRUE FALSE FALSE FALSE
# 6 M.H.S. history TRUE FALSE TRUE FALSE
# 7 L.H.S. art FALSE FALSE TRUE FALSE
Some notes:
1) It's way easier to read (and think about) when you include appropriate spacing in your code. Also, rather than create a bunch of vectors and then assemble into data.frames, do this in one step -- it's shorter, it helps show how the vectors relate to each other, and it won't clutter your global environment.
2) I'm leaving the match values as FALSE/TRUE: since this is boolean data, it makes sense to use the appropriate data type. If you really want No/Yes, though, you can change these values into factors with those labels (see the sketch after these notes).
3) The results are a little different from what you expected because 'English' == 'english' is FALSE. You might want to clean up your starting data. If you know that cases will be mixed and you want case-insensitive matching, you can coerce all values to lowercase before comparing: tolower(schools$class) %in% tolower(specs)
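A minimal sketch of the No/Yes conversion mentioned in note 2 (mine, not part of the original answer); ifelse() keeps the dimensions and column names of the logical matrix, so the result can still be bound to the school data:
# Turn the logical matrix into "No"/"Yes" labels (character columns)
matches_yn <- ifelse(matches, "Yes", "No")
data.frame(schools, matches_yn)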

R - grepl over 7 million observations - how to boost efficiency?

I've run into a bit of a dead end with some R code that I've written, and I was hoping you might know how to make this whole thing feasible by improving its efficiency.
So, what I'm trying to do is the following:
I've got a tweet dataset with ~7 million observations. Currently, I'm not interested in the text of the tweets or any of the other metadata, but only in the "Location" field, so I've extracted that data into a new data.frame, which contains the location variable (string) and a new, currently empty, "isRelevant" variable (logical). Furthermore, I've got a vector containing text information formatted as follows: "Placename(1)|Placename(2)[...]|Placename(i)" . What I'm trying to do is to grepl every line of the locations variable to see if there is a match with the Placenames vector, and if so, return a "TRUE" in the isRelevant variable and return a "FALSE" if not.
To do this, I wrote some R code, which basically boils down to this line:
locations.df$isRelevant <- sapply(locations.df$locations, function(s) grepl(grep_places, s, ignore.case = TRUE))
whereby grep_places is the list of possible matching terms, separated by "|" characters so that R can match any of the elements in the vector. I am running this on a remote high-capacity computer with over 2 TB of RAM, using RStudio (R 3.2.0), and I'm running it with pbsapply, which gives me a progress bar. As it turns out, this is taking ridiculously long: it's about 45% done to date (I started it more than a week ago) and it says it still needs over 270 hours to complete. That's obviously not a workable situation, as I'm going to have to run similar code in the future on much larger datasets. Have you got any idea how I might get this job done in a more acceptable timeframe, perhaps something like one day (keeping in mind the high-capacity machine)?
EDIT
Here's some semi-simulated data to indicate approximately what I'm working with:
print(grep_places)
> grep_places
"Acworth NH|Albany NH|Alexandria NH|Allenstown NH|Alstead NH|Alton NH|Amherst NH|Andover NH|Antrim NH|Ashland NH|Atkinson NH|Auburn NH|Barnstead NH|Barrington NH|Bartlett NH|Bath NH|Bedford NH|Belmont NH|Bennington NH|Benton NH|Berlin NH|Bethlehem NH|Boscawen NH|Bow NH|Bradford NH|Brentwood NH|Bridgewater NH|Bristol NH|Brookfield NH|Brookline NH|Campton NH|Canaan NH|Candia NH|Canterbury NH|Carroll NH|CenterHarbor NH|Charlestown NH|Chatham NH|Chester NH|Chesterfield NH|Chichester NH|Claremont NH|Clarksville NH|Colebrook NH|Columbia NH|Concord NH|Conway NH|Cornish NH|Croydon NH|Dalton NH|Danbury NH|Danville NH|Deerfield NH|Deering NH|Derry NH|Dorchester NH|Dover NH|Dublin NH|Dummer NH|Dunbarton NH|Durham NH|EastKingston NH|Easton NH|Eaton NH|Effingham NH|Ellsworth NH|Enfield NH|Epping NH|Epsom NH|Errol NH|Exeter NH|Farmington NH|Fitzwilliam NH|Francestown NH|Franconia NH|Franklin NH|Freedom NH|Fremont NH|Gilford NH|Gilmanton NH|Gilsum NH|Goffstown NH|Gorham NH|Goshen NH|Grafton NH|Grantham NH|Greenfield NH|Greenland NH|Greenville NH|Groton NH|Hampstead NH|Hampton NH|HamptonFalls NH|Hancock NH|Hanover NH|Harrisville NH|Hart'sLocation NH|Haverhill NH|Hebron NH|Henniker NH|Hill NH|Hillsborough NH|Hinsdale NH|Holderness NH|Hollis NH|Hooksett NH|Hopkinton NH|Hudson NH|Jackson NH|Jaffrey NH|Jefferson NH|Keene NH|Kensington NH|Kingston NH|Laconia NH|Lancaster NH|Landaff NH|Langdon NH|Lebanon NH|Lee NH|Lempster NH|Lincoln NH|Lisbon NH|Litchfield NH|Littleton NH|Londonderry NH|Loudon NH|Lyman NH|Lyme NH|Lyndeborough NH|Madbury NH|Madison NH|Manchester NH|Marlborough NH|Marlow NH|Mason NH|Meredith NH|Merrimack NH|Middleton NH|Milan NH|Milford NH|Milton NH|Monroe NH|MontVernon NH|Moultonborough NH|Nashua NH|Nelson NH|NewBoston NH|NewCastle NH|NewDurham NH|NewHampton NH|NewIpswich NH|NewLondon NH|Newbury NH|Newfields NH|Newington NH|Newmarket NH|Newport NH|Newton NH|NorthHampton NH|Northfield NH|Northumberland NH|Northwood NH|Nottingham NH|Orange NH|Orford NH|Ossipee NH|Pelham NH|Pembroke NH|Peterborough NH|Piermont NH|Pittsburg NH|Pittsfield NH|Plainfield NH|Plaistow NH|Plymouth NH|Portsmouth NH|Randolph NH|Raymond NH|Richmond NH|Rindge NH|Rochester NH|Rollinsford NH|Roxbury NH|Rumney NH|Rye NH|Salem NH|Salisbury NH|Sanbornton NH|Sandown NH|Sandwich NH|Seabrook NH|Sharon NH|Shelburne NH"
head(location.df, n=20)
> location isRelevant
1 London NA
2 Orleans village VT USA NA
3 The World NA
4 D M V Towson NA
5 Playa del Sol Solidaridad NA
6 Beautiful Downtown Burbank NA
7 <NA> NA
8 US NA
9 Gaithersburg Md NA
10 <NA> NA
11 California NA
12 Indy NA
13 Florida NA
14 exsnaveen com NA
15 Houston TX NA
16 Tweaking NA
17 Phoenix AZ NA
18 Malibu Ca USA NA
19 Hermosa Beach CA NA
20 California USA NA
Thanks in advance everyone, I'd seriously appreciate any help with this.
grepl is a vectorized function, so there should be no need to apply a loop over it. Have you tried:
#dput(location.df)
location.df<-structure(list(location = structure(c(12L, 14L, 17L, 5L, 16L,
2L, 1L, 19L, 8L, 1L, 3L, 11L, 7L, 6L, 10L, 18L, 15L, 13L, 9L,
4L), .Label = c("<NA>", "Beautiful Downtown Burbank", "California",
"California USA", "D M V Towson", "exsnaveen com", "Florida",
"Gaithersburg Md", "Hermosa Beach CA", "Houston TX", "Indy",
"London", "Malibu Ca USA", "Orleans village VT USA", "Phoenix AZ",
"Playa del Sol Solidaridad", "The World", "Tweaking", "US"), class = "factor"),
isRelevant = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("location",
"isRelevant"), row.names = c(NA, -20L), class = "data.frame")
#grep_places with places in the test data
grep_places<-"Gaithersburg Md|Phoenix AZ"
location.df$isRelevant[grepl(grep_places, location.df$location, ignore.case = TRUE)]<-TRUE
Or, for a slightly faster implementation, as per David Arenburg's comment:
location.df$isRelevant <- grepl(grep_places, location.df$location, ignore.case = TRUE)
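Beyond vectorising, one further hedged suggestion (mine, not from the original answer): with an alternation pattern this long, switching grepl to the PCRE engine via perl = TRUE often gives a noticeable speed-up, and it is worth benchmarking on a slice of the 7 million rows before running the full job:
# Same vectorised call as above, but using the PCRE engine
location.df$isRelevant <- grepl(grep_places, location.df$location,
                                ignore.case = TRUE, perl = TRUE)
# Rough timing comparison on a sample of rows
idx <- sample(nrow(location.df), min(nrow(location.df), 10000))
system.time(grepl(grep_places, location.df$location[idx], ignore.case = TRUE))
system.time(grepl(grep_places, location.df$location[idx], ignore.case = TRUE, perl = TRUE))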

R: Need to perform multiple matches for each row in data frame

I have a data frame where, for each Filename value, there is a set of Compound values. Some compounds have a value for IS.Name, which refers to one of the Compound values for that Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are being analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references that are used by the other two compounds. For example, Compound-1 uses IS-1, while IS-1 does not have a reference (IS). Within each sample I am trying to match the IS.Name up with the corresponding compound row, grab that row's Chrom.1.RT, and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value of the Compound with the same name as its IS.Name field (IS-1) and put it in Compound-1's IS.RT field. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need to do for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
sampleList <- unique(df1$Filename)
for (i in seq_along(sampleList)) {
  SampleRows <- which(df1$Filename == sampleList[i])
  RefRows <- subset(df1, Filename == sampleList[i])
  df1$IS.RT[SampleRows] <- RefRows$Chrom.1.RT[match(df1$IS.Name[SampleRows], RefRows$Compound)]
}
I'm definitely open to any suggestions to make this more efficient though.
First of all, I suggest that in the future you provide your example as the output of dput(df1), as that makes it a lot easier to read into R than the space-delimited table you provided.
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1=tbl_df(df1)
left_join(df1,
          left_join(df1 %>% select(-Compound),
                    df1 %>% group_by(Compound) %>% summarise(unique(Chrom.1.RT)),
                    c("IS.Name" = "Compound"))) %>%
  select(-IS.RT) %>%
  rename(IS.RT = `unique(Chrom.1.RT)`)
Unless I got it wrong, this is what you need?
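For what it's worth, a somewhat tidier sketch of the same idea (mine, not the original answer): build a lookup of Compound to Chrom.1.RT per Filename and join it back on IS.Name. The IS.RT.ref name is just illustrative, and joining factor columns with different levels may emit a coercion warning:
library(dplyr)
# Lookup: for each file, the retention time of every compound
ref <- df1 %>%
  transmute(Filename, IS.Name = Compound, IS.RT.ref = Chrom.1.RT)
# Join it back so each row picks up the Chrom.1.RT of its reference compound
df1 %>%
  left_join(ref, by = c("Filename", "IS.Name")) %>%
  mutate(IS.RT = IS.RT.ref) %>%
  select(-IS.RT.ref)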

split text to create table with answers from a survey?

I have a data frame with answers from a survey that looks like this:
df = structure(list(Part.2..Question.1..Response = c("You did not know about the existence of the course",
"The email you received was confusing and you did not know what to do",
"Other:", "You did not know about the existence of the course",
"The email you received was confusing and you did not know what to do",
"The email you received was confusing and you did not know what to do|Other:",
"You think is not worth your time", "No Answer", "You think is not worth your time",
"You think is not worth your time", "You did not know about the existence of the course",
"You did not know about the existence of the course", "You think is not worth your time|The email you received was confusing and you did not know what to do|You did not know about the existence of the course",
"You think is not worth your time", "You did not know about the existence of the course",
"You did not know about the existence of the course", "You think is not worth your time|Other:",
"You think is not worth your time", "No Answer", "You did not know about the existence of the course",
"You think is not worth your time", "You think is not worth your time",
"You did not know about the existence of the course", "You did not know about the existence of the course",
"You think is not worth your time"), group = structure(c(1L,
2L, 1L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 1L, 1L, 1L, 2L, 3L, 1L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 1L, 3L), .Label = c("control", "treatment1",
"treatment2"), class = "factor")), .Names = c("Part.2..Question.1..Response",
"group"), row.names = c(151L, 163L, 109L, 188L, 141L, 158L, 131L,
32L, 86L, 53L, 148L, 64L, 89L, 30L, 159L, 23L, 40L, 101L, 173L,
165L, 15L, 156L, 2L, 174L, 41L), class = "data.frame")
Some people select multiple answers, for example:
df$Part.2..Question.1..Response[13]
I want to create a table that shows the number of people who selected a given answer, for each "group":
control treatment1 treatment2
You think is not worth your time 0 0 0
The email you received was confusing and you did not know what to do 10 1 4
You did not know about the existence of the course 4 4 1
What is the best way of doing this?
I would first split the responses on the "|" and turn multiple responses into multiple rows. Then, after doing that, I can do a simple table().
dd <- do.call(rbind, Map(data.frame,
                         group = df$group,
                         resp = strsplit(df$Part.2..Question.1..Response, "|", fixed = TRUE)))
with(dd, table(resp, group))
You will get results like
group
resp control treatment1 treatment2
You did not know about the existence ... 6 1 3
The email you received was confusing ... 1 2 1
Other: 1 0 2
You think is not worth your time 1 5 4
No Answer 0 1 1
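An alternative hedged sketch (mine, not the original answer) using tidyr, which has a helper for exactly this split-into-rows step:
library(tidyr)
# One row per individual answer, splitting on the literal "|"
dd2 <- separate_rows(df, Part.2..Question.1..Response, sep = "\\|")
with(dd2, table(Part.2..Question.1..Response, group))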
