Collapse columns by grouping variable (in base) - r

I have a text variable and a grouping variable. I'd like to collapse the text variable into one string per row (combine) by factor. So as long as the group column says m I want to group the text together, and so on. I provided a sample data set before and after. I am writing this for a package and have thus far avoided all reliance on other packages except for wordcloud, and would like to keep it this way.
I suspect rle may be useful with cumsum but haven't been able to figure this one out.
Thank you in advance.
What the data looks like
text group
1 Computer is fun. Not too fun. m
2 No its not, its dumb. m
3 How can we be certain? f
4 There is no way. m
5 I distrust you. m
6 What are you talking about? f
7 Shall we move on? Good then. f
8 Im hungry. Lets eat. You already? m
What I'd like the data to look like
text group
1 Computer is fun. Not too fun. No its not, its dumb. m
2 How can we be certain? f
3 There is no way. I distrust you. m
4 What are you talking about? Shall we move on? Good then. f
5 Im hungry. Lets eat. You already? m
The Data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on? Good then.",
"Im hungry. Lets eat. You already?"), group = structure(c(2L,
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text",
"group"), row.names = c(NA, 8L), class = "data.frame")
EDIT: I found I can add unique column for each run of the group variable with:
x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))
Yielding:
text group new
1 Computer is fun. Not too fun. m 1
2 No its not, its dumb. m 1
3 How can we be certain? f 2
4 There is no way. m 3
5 I distrust you. m 3
6 What are you talking about? f 4
7 Shall we move on? Good then. f 4
8 Im hungry. Lets eat. You already? m 5

This makes use of rle to create an id to group the sentences on. It uses tapply along with paste to bring the output together
## Your example data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on?  Good then.",
"Im hungry.  Lets eat.  You already?"), group = structure(c(2L,
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text",
"group"), row.names = c(NA, 8L), class = "data.frame")
# Needed for later
k <- rle(as.numeric(dat$group))
# Create a grouping vector (one id per run of identical group values)
id <- rep(seq_along(k$lengths), k$lengths)
# Combine the text in the desired manner
out <- tapply(dat$text, id, paste, collapse = " ")
# Bring it together into a data frame
answer <- data.frame(text = out, group = levels(dat$group)[k$values])
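The same idea can be wrapped in a small base-R helper. This is only a sketch: the name collapse_runs and its arguments are made up for illustration and are not part of the answer above, and the demo data is a shortened stand-in for dat.

```r
# Hypothetical helper: collapse consecutive rows that share a group value.
collapse_runs <- function(text, group, sep = " ") {
  runs <- rle(as.character(group))                   # run lengths and values
  id <- rep(seq_along(runs$lengths), runs$lengths)   # one id per run
  data.frame(
    text = as.vector(tapply(as.character(text), id, paste, collapse = sep)),
    group = runs$values,
    stringsAsFactors = FALSE
  )
}

# Shortened stand-in data (assumption, not the poster's data set)
demo <- data.frame(
  text = c("a", "b", "c", "d", "e"),
  group = c("m", "m", "f", "m", "m"),
  stringsAsFactors = FALSE
)
res <- collapse_runs(demo$text, demo$group)
res
```

Because the ids come from rle, two separate runs of "m" stay separate instead of being merged into one row.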

I got the answer and came back to post but Dason beat me to it and more understandably than my own.
x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))
Paste <- function(x) paste(x, collapse=" ")
aggregate(text~new, dat, Paste)
EDIT
How I'd do it with aggregate and what I learned from your response (though tapply is a better solution):
y <- rle(as.character(dat$group))
x <- y[[1]]
dat$new <- as.factor(rep(1:length(x), x))
text <- aggregate(text~new, dat, paste, collapse = " ")[, 2]
data.frame(text, group = y[[2]])

Related

Matching strings loop over multiple columns

I have data from an open ended survey. I have a comments table and a codes table. The codes table is a set of themes or strings.
What I am trying to do:
Check to see if a word / string exists from the relevant column in the codes table is in an open ended comment. Add a new column in the comments table for the specific theme and a binary 1 or 0 to denote what records have been tagged.
There are quite a number of columns in the codes table, these are live and ever changing, column orders and number of columns subject to change.
I am currently doing this in a rather convoluted way, I am checking each column individually with multiple lines of code and I reckon there is likely a much better way of doing it.
I can't figure out how to get lapply to work with the stringi function.
Help is greatly appreciated.
Here is an example set of code so you can see what I am trying to do:
#Two tables codes and comments
#codes table
codes <- structure(
list(
Support = structure(
c(2L, 3L, NA),
.Label = c("",
"help", "questions"),
class = "factor"
),
Online = structure(
c(1L,
3L, 2L),
.Label = c("activities", "discussion board", "quiz"),
class = "factor"
),
Resources = structure(
c(3L, 2L, NA),
.Label = c("", "pdf",
"textbook"),
class = "factor"
)
),
row.names = c(NA,-3L),
class = "data.frame"
)
#comments table
comments <- structure(
list(
SurveyID = structure(
1:5,
.Label = c("ID_1", "ID_2",
"ID_3", "ID_4", "ID_5"),
class = "factor"
),
Open_comments = structure(
c(2L,
4L, 3L, 5L, 1L),
.Label = c(
"I could never get the pdf to download",
"I didn’t get the help I needed on time",
"my questions went unanswered",
"staying motivated to get through the textbook",
"there wasn’t enough engagement in the discussion board"
),
class = "factor"
)
),
class = "data.frame",
row.names = c(NA,-5L)
)
#check if any words from the columns in codes table match comments
#here I am looking for a match column by column but looking for a better way - lapply?
support = paste(codes$Support, collapse = "|")
supp_stringi = stri_detect_regex(comments$Open_comments, support)
supp_grepl = grepl(pattern = support, x = comments$Open_comments)
identical(supp_stringi, supp_grepl)
comments$Support = ifelse(supp_grepl == TRUE, 1, 0)
# What I would like to do is loop through all columns in codes rather than outlining the above code for each column in codes
Here is an approach that uses stringi::stri_detect_regex() with lapply() to create vectors of TRUE = 1, FALSE = 0 depending on whether any of the words in the Support, Online or Resources vectors are in the comments, and merges this data back with the comments.
# build data structures from OP (codes and comments) first
library(stringi)
resultsList <- lapply(seq_along(codes), function(x) {
  y <- stri_detect_regex(comments$Open_comments, paste(codes[[x]], collapse = "|"))
  ifelse(y, 1, 0)
})
results <- as.data.frame(do.call(cbind,resultsList))
colnames(results) <- colnames(codes)
mergedData <- cbind(comments,results)
mergedData
...and the results.
> mergedData
SurveyID Open_comments Support Online Resources
1 ID_1 I didn’t get the help I needed on time 1 0 0
2 ID_2 staying motivated to get through the textbook 0 0 1
3 ID_3 my questions went unanswered 1 0 0
4 ID_4 there wasn’t enough engagement in the discussion board 0 1 0
5 ID_5 I could never get the pdf to download 0 0 1
One liner using base R :
comments[names(codes)] <- lapply(codes, function(x)
+(grepl(paste0(na.omit(x), collapse = "|"), comments$Open_comments)))
comments
# SurveyID Open_comments Support Online Resources
#1 ID_1 I didn’t get the help I needed on time 1 0 0
#2 ID_2 staying motivated to get through the textbook 0 0 1
#3 ID_3 my questions went unanswered 1 0 0
#4 ID_4 there wasn’t enough engagement in the discussion board 0 1 0
#5 ID_5 I could never get the pdf to download 0 0 1
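One detail worth spelling out is why the one-liner calls na.omit() before building the pattern. The sketch below uses made-up keywords and comments (not the poster's tables) to show that pasting a vector containing NA puts the literal text "NA" into the regular expression, which can then match unrelated comments.

```r
# Made-up illustration data (assumption)
keywords <- c("help", "questions", NA)
remarks <- c("I need help", "NASA replied", "nothing here")

pattern_with_na <- paste(keywords, collapse = "|")          # contains a literal "NA" alternative
pattern_clean <- paste(na.omit(keywords), collapse = "|")   # NA dropped first

# Unary plus turns the logical result of grepl() into 0/1 integers
flag_with_na <- +grepl(pattern_with_na, remarks)
flag_clean <- +grepl(pattern_clean, remarks)

flag_with_na
flag_clean
```

With the NA left in, the second remark is flagged only because "NASA" contains "NA"; the cleaned pattern avoids that.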

split dataset by day and save it as data frame

I have a dataset with 2 months of data (Feb and March). How can I split the data into 59 subsets by day and save each one as a data frame (28 days for Feb and 31 days for Mar)? Preferably the data frames would be named according to the date, i.e. 20140201, 20140202 and so forth.
df <- structure(list(text = structure(c(4L, 6L, 5L, 2L, 8L, 1L), .Label = c(" Terpilih Jadi Maskapai dengan Pelayanan Kabin Pesawat cont",
"booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids",
"Can I change for the traveler details because i choose wrongly for the Mr or Ms part",
"cant do it with cards either", "Coming back home AK", "gotta try PNNL",
"Jadwal penerbangan medanjktsblm tangalmasi ada kah", "Me and my Tart would love to flyLoveisintheAir",
"my flight to Bangkok onhas been rescheduled I couldnt perform seat selection now",
"Pls checks his case as money is not credited to my bank acctThanks\n\nCASLTP",
"Processing fee Whatt", "Tacloban bound aboardto get them boats Boats boats boats Tacloban HeartWork",
"thanks I chatted with ask twice last week and told the same thing"
), class = "factor"), created = structure(c(1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1/2/2014", "2/2/2014", "5/2/2014", "6/2/2014"
), class = "factor")), .Names = c("text", "created"), row.names = c(NA,
6L), class = "data.frame")
You don't need to output multiple dataframes. You only need to select/subset them by year & month of the 'created' field. Here are two ways to do that; the first is simpler if you don't plan on needing any more date arithmetic.
# 1. Leave 'created' a string, just use text substitution to extract its month&date components
df$created_mthyr <- gsub( '([0-9]+/)[0-9]+/([0-9]+)', '\\1\\2', df$created )
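# 2. If you need to do arbitrary Date arithmetic, convert 'created' field to Date object
# in this case you need an explicit format-string; %m is month (%M would be minutes),
# and the sample dates appear to be day/month/year (the data cover Feb and March)
df$created <- as.Date(df$created, format = '%d/%m/%Y')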
# Now you can do either a) split
split(df, df$created_mthyr)
# specifically if you want to assign the output it creates to 3 dataframes:
df1 <- split(df, df$created_mthyr)[[1]]
df2 <- split(df, df$created_mthyr)[[2]]
df5 <- split(df, df$created_mthyr)[[3]]
# ...or else b) do a Split-Apply-Combine and perform arbitrary command on each separate subset. This is very powerful. See plyr/ddply documentation for examples.
require(plyr)
df1 <- dlply(df, .(created_mthyr))[[1]]
df2 <- dlply(df, .(created_mthyr))[[2]]
df5 <- dlply(df, .(created_mthyr))[[3]]
# output looks like this - strictly you might not want to keep 'created','created_mthyr':
> df1
# text created created_mthyr
#1 cant do it with cards either 1/2/2014 1/2014
#2 gotta try PNNL 1/2/2014 1/2014
> df2
#3
#Coming back home AK
#4 booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids
# created created_mthyr
#3 2/2/2014 2/2014
#4 2/2/2014 2/2014
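If one data frame per day with names like 20140201 is really wanted, a named list from split() is usually easier to handle than 59 separate variables. The following is a minimal sketch on made-up rows; it assumes the 'created' strings are day/month/year, which is a guess based on the question saying the data cover Feb and March.

```r
# Made-up stand-in rows (assumption); dates taken to be day/month/year
df_demo <- data.frame(
  text = c("a", "b", "c", "d"),
  created = c("1/2/2014", "1/2/2014", "2/2/2014", "5/3/2014"),
  stringsAsFactors = FALSE
)

day <- as.Date(df_demo$created, format = "%d/%m/%Y")

# One list element per calendar day, named in yyyymmdd form
by_day <- split(df_demo, format(day, "%Y%m%d"))

names(by_day)
by_day[["20140201"]]
```

Each element can then be pulled out by its date name, or processed in turn with lapply().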

melting dataframe and pasting together values in columns

I have a dataframe, dfregion, which looks as follows:
dput(dfregion)
structure(list(region = structure(c(1L, 2L, 3L, 3L, 1L), .Label = c("East",
"New England", "Southeast"), class = "factor"), words = structure(c(4L,
2L, 1L, 3L, 5L), .Label = c("buildings, tallahassee", "center, mass, visitors",
"god, instruct, estimated", "seeks, metropolis, convey", "teaching, academic, metropolis"
), class = "factor")), .Names = c("region", "words"), row.names = c(NA,
-5L), class = "data.frame")
region words
1 East seeks, metropolis, convey
3 New England center, mass, visitors
4 Southeast buildings, tallahassee
5 Southeast god, instruct, estimated
6 East teaching, academic, metropolis
I am working on "melting" or "reshaping" this dataframe by region and then would like to paste the words together.
The following code is what I have tried:
dfregionnew<-dcast(dfregion, region ~ words,fun.aggregate= function(x) paste(x) )
dfregionnew<-dcast(dfregion, region ~ words, paste)
dfregionnew <- melt(dfregion,id=c("region"),variable_name="words")
Finally, I did this- however I am not sure this is the best way to accomplish what I want
dfregionnew<-ddply(dfregion, .(region), mutate, index= paste0('words', 1:length(region)))
dfregionnew<-dcast(dfregionnew, region~ index, value.var ='words')
The result is a dataframe reshapen in the right way, yet each "word" column is separate.
Subsequently, I tried to paste these columns together and am getting various errors while doing so.
dfregionnew$new<-lapply(dfregionnew[,2:ncol(dfregionnew)], paste, sep=",")
dfregionnew$new<-ldply(apply(dfregionnew, 1, function(x) data.frame(x = paste(x[2:ncol(dfregionnew], sep=",", collapse=NULL))))
dfregionnew$new <- apply( dfregionnew[ , 2:ncol(dfregionnew) ] , 1 , paste , sep = "," )
I was able to solve that problem by doing something similar to below:
dfregionnew$new <- apply( dfregionnew[ , 2:5] , 1 , paste , collapse = "," )
I guess my real question is, would it be possible to do this in one step using melt or dcast, without having to paste together the various columns after they are output.
I am very interested in improving my skills and would love faster/ better practices in R.
Thanks in advance!
It sounds like you just want to paste the values in the "word" column together, in which case, you should be able to just use aggregate as follows:
aggregate(words ~ region, dfregion, paste)
# region words
# 1 East seeks, metropolis, convey, teaching, academic, metropolis
# 2 New England center, mass, visitors
# 3 Southeast buildings, tallahassee, god, instruct, estimated
No melting or dcasting required....
If you do want to use dcast from "reshape2", you can try something like this:
dcast(dfregion, region ~ "WORDS", value.var="words",
fun.aggregate=function(x) paste(x, collapse = ", "))
# region WORDS
# 1 East seeks, metropolis, convey, teaching, academic, metropolis
# 2 New England center, mass, visitors
# 3 Southeast buildings, tallahassee, god, instruct, estimated
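A note on the aggregate() call: with paste alone, the words column appears to hold a vector of strings per region that merely prints comma-separated. Passing collapse through to paste gives one plain string per region. A small sketch, using a character-column copy of the data (the name dfregion_demo is an assumption for illustration):

```r
# Character-column copy of the example data (name is hypothetical)
dfregion_demo <- data.frame(
  region = c("East", "New England", "Southeast", "Southeast", "East"),
  words = c("seeks, metropolis, convey", "center, mass, visitors",
            "buildings, tallahassee", "god, instruct, estimated",
            "teaching, academic, metropolis"),
  stringsAsFactors = FALSE
)

# Extra arguments after FUN are handed on to paste()
collapsed <- aggregate(words ~ region, dfregion_demo, paste, collapse = ", ")

is.character(collapsed$words)
collapsed
```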

R - ordering in boxplot

I am trying to produce a series of box plots in R that is grouped by 2 factors. I've managed to make the plot, but I cannot get the boxes to order in the correct direction.
The data frame I am using looks like this:
Nitrogen Species Treatment
2 G L
3 R M
4 G H
4 B L
2 B M
1 G H
I tried:
boxplot(mydata$Nitrogen~mydata$Species*mydata$Treatment)
this ordered the boxes alphabetically (first three were the "High" treatments, then within those three they were ordered by species name alphabetically).
I want the box plot ordered Low>Medium>High then within each of those groups G>R>B for the species.
So i tried using a factor in the formula:
f = ordered(interaction(mydata$Treatment, mydata$Species),
levels = c("L.G","L.R","L.B","M.G","M.R","M.B","H.G","H.R","H.B"))
then:
boxplot(mydata$Nitrogen~f)
however the boxes are still showing up in the same order. The labels are now different, but the boxes have not moved.
I have pulled out each set of data and plotted them all together individually:
lg = mydata[mydata$Treatment == "L" & mydata$Species == "G", "Nitrogen"]
mg = mydata[mydata$Treatment == "M" & mydata$Species == "G", "Nitrogen"]
hg = mydata[mydata$Treatment == "H" & mydata$Species == "G", "Nitrogen"]
etc ..
boxplot(lg, lr, lb, mg, mr, mb, hg, hr, hb)
This gives what i want, but I would prefer to do this in a more elegant way, so I don't have to pull each one out individually for larger data sets.
Loadable data:
mydata <-
structure(list(Nitrogen = c(2L, 3L, 4L, 4L, 2L, 1L), Species = structure(c(2L,
3L, 2L, 1L, 1L, 2L), .Label = c("B", "G", "R"), class = "factor"),
Treatment = structure(c(2L, 3L, 1L, 2L, 3L, 1L), .Label = c("H",
"L", "M"), class = "factor")), .Names = c("Nitrogen", "Species",
"Treatment"), class = "data.frame", row.names = c(NA, -6L))
The following commands will create the ordering you need by rebuilding the Treatment and Species factors, with explicit manual ordering of the levels:
mydata$Treatment = factor(mydata$Treatment,c("L","M","H"))
mydata$Species = factor(mydata$Species,c("G","R","B"))
edit 1 : oops I had set it to HML instead of LMH. fixing.
edit 2 : what factor(X,Y) does:
If you run factor(X,Y) on an existing factor, it uses the ordering of the values in Y to enumerate the values present in the factor X. Here's some examples with your data.
> mydata$Treatment
[1] L M H L M H
Levels: H L M
> as.integer(mydata$Treatment)
[1] 2 3 1 2 3 1
> factor(mydata$Treatment,c("L","M","H"))
[1] L M H L M H <-- not changed
Levels: L M H <-- changed
> as.integer(factor(mydata$Treatment,c("L","M","H")))
[1] 1 2 3 1 2 3 <-- changed
It does NOT change what the factor looks like at first glance, but it does change how the data is stored.
What's important here is that many plot functions will plot the lowest enumeration leftmost, followed by the next, etc.
If you create factors simply using factor(X) then usually the enumeration is based upon the alphabetical order of the factor levels (e.g. "H","L","M"). If your labels have a conventional ordering different from alphabetical (e.g. "H","M","L"), this can make your graphs seem strange.
At first glance, it may seem like the problem is due to the ordering of data in the data frame - i.e. if only we could place all "H" at the top and "L" at the bottom, then it would work. It doesn't. But if you want your labels to appear in the same order as the first occurrence in the data, you can use this form:
mydata$Treatment = factor(mydata$Treatment, unique(mydata$Treatment))
This earlier StackOverflow question shows how to reorder a boxplot based on a numerical value; what you need here is probably just a switch from factor to the related type ordered. But it is hard to say as we do not have your data and you didn't provide a reproducible example.
Edit Using the dataset you posted in variable md and relying on the solution I pointed to earlier, we get
R> md$Species <- ordered(md$Species, levels=c("G", "R", "B"))
R> md$Treatment <- ordered(md$Treatment, levels=c("L", "M", "H"))
R> with(md, boxplot(Nitrogen ~ Species * Treatment))
which creates the chart you were looking to create.
This is also equivalent to the other solution presented here.
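To check the box order without drawing anything, one can look at the levels of the interaction of the two factors, since the formula Nitrogen ~ Species * Treatment should lay the boxes out in that order. This is a sketch on the six example rows; the variable names are placeholders.

```r
# The six example rows, with levels set explicitly as in the answers above
species <- factor(c("G", "R", "G", "B", "B", "G"), levels = c("G", "R", "B"))
treatment <- factor(c("L", "M", "H", "L", "M", "H"), levels = c("L", "M", "H"))

# The first factor varies fastest, so species cycles within each treatment
combo <- interaction(species, treatment)
levels(combo)
```

The levels run through G, R, B for Low, then Medium, then High, which is the ordering asked for in the question.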

how to substitute a for loop in R with an optimized function (lapply?)

I have a data frame with time events on each row. One row holds an event of the sender (typeid=1) and the next an event of the receiver (typeid=2). I want to calculate the delay between sender and receiver (time difference).
My data is organized in a data.frame, as the following snapshot shows:
dd[1:10,]
timeid valid typeid
1 18,00035 1,00000 1
2 18,00528 0,00493 2
3 18,02035 2,00000 1
4 18,02116 0,00081 2
5 18,04035 3,00000 1
6 18,04116 0,00081 2
7 18,06035 4,00000 1
8 18,06116 0,00081 2
9 18,08035 5,00000 1
10 18,08116 0,00081 2
calc_DelayVIDEO <- function (dDelay ){
  pktProcess <- TRUE
  nLost <- 0
  myDelay <- data.frame(time=-1, delay=-1, jitter=-1, nLost=-1)
  myDelay <- myDelay[-1, ]
  tini <- 0
  tend <- 0
  for (itr in c(1:length(dDelay$timeid))) {
    aRec <- dDelay[itr,]
    if (aRec$typeid == 1){
      tini <- as.numeric(aRec$timeid)
      if (!pktProcess ) {
        nLost <- (nLost + 1)
        myprt(paste("Packet Lost at time ", aRec$timeid, " lost= ", nLost, sep=""))
      }
      pktProcess <- FALSE
    }else if (aRec$typeid == 2){
      tend <- as.numeric(aRec$timeid)
      dd <- tend - tini
      jit <- calc_Jitter(dant=myDelay[length(myDelay), 2], dcur=dd)
      myDelay <- rbind(myDelay, c(aRec$timeid, dd, jit, nLost))
      pktProcess <- TRUE
      #myprt(paste("time=", aRec$timeev, " delay=", dd, " Delay Var=", jit, " nLost=", nLost ))
    }
  }
  colnames(myDelay) <- c("time", "delay", "jitter", "nLost")
  return (myDelay)
}
To perform the calculations for delay I use the calc_DelayVIDEO function; nevertheless, for data frames with a high number of records (~60000) it takes a lot of time.
How can I substitute the for loop with more optimized R functions?
Can I use lapply to do such computation? If so, can you provide me an example?
Thanks in advance,
The usual solution is to think hard enough about the problem to find something vectorized.
If that fails, I sometimes resort to re-writing the loop in C++; the Rcpp package can help with the interface.
The *apply suite of functions are not optimized for loops. Further, I've worked on problems where for loops are faster than apply because apply used more memory and caused my machine to swap.
I would suggest fully initializing the myDelay object and avoid using rbind (which must re-allocate memory):
init <- rep(NA, length(dDelay$timeid))
myDelay <- data.frame(time=init, delay=init, jitter=init, nLost=init)
then replace:
myDelay <- rbind(myDelay, c(aRec$timeid, dd, jit, nLost))
with
myDelay[itr,] <- c(aRec$timeid, dd, jit, nLost)  # itr is the loop index used in the question's code
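A variant of the preallocation idea is to fill plain vectors and keep a counter of how many slots were used, so that sender rows do not leave empty entries behind. The sketch below uses four made-up numeric rows and leaves out jitter and loss counting; all names in it are placeholders.

```r
# Made-up stand-in for the typeid and (already numeric) timeid columns
typeid <- c(1L, 2L, 1L, 2L)
time <- c(18.00035, 18.00528, 18.02035, 18.02116)

n <- length(typeid)
out_time <- numeric(n)    # allocated once, up front
out_delay <- numeric(n)
filled <- 0L              # number of result slots used so far
tini <- 0

for (itr in seq_len(n)) {
  if (typeid[itr] == 1L) {
    tini <- time[itr]                     # remember the send time
  } else if (typeid[itr] == 2L) {
    filled <- filled + 1L
    out_time[filled] <- time[itr]
    out_delay[filled] <- time[itr] - tini
  }
}

# Keep only the slots that were actually filled
result <- data.frame(time = out_time[seq_len(filled)],
                     delay = out_delay[seq_len(filled)])
result
```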
As Dirk said: vectorization will help. An example of this would be to move the call to as.numeric out of the loop (since this function works with vectors).
dDelay$timeid <- as.numeric(dDelay$timeid)
Other things that may help are
Not bothering with the line aRec <- dDelay[itr,], since you can just access the row of dDelay, without creating a new variable.
Preallocating myDelay, since having it grow within the loop is likely to be a bottleneck. See Joshua's answer for more on this.
Another optimization : If I read your code right, you can easily calculate the vector nLost by using :
nLost <-cumsum(dDelay$typeid==1)
outside the loop. That one you can just add to the dataframe in the end. Saves you a lot of time already. If I use your dataframe, then :
> nLost <-cumsum(dd$typeid==1)
> nLost
[1] 1 1 2 2 3 3 4 4 5 5
Likewise the times at which the packages were lost can be calculated as:
> dd$timeid[which(dd$typeid==1)]
[1] 18,00035 18,02035 18,04035 18,06035 18,08035
in case you want to report them somewhere too.
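The delay itself can also be computed without a loop. The following is a sketch rather than a drop-in replacement: it assumes each receiver row should be paired with the most recent sender row before it (which is what tini tracks in the loop), it converts the comma decimals by hand, and it ignores jitter and loss counting. The variable names are placeholders.

```r
# First six rows of the example, typed in as character/integer vectors
timeid <- c("18,00035", "18,00528", "18,02035", "18,02116", "18,04035", "18,04116")
typeid <- c(1L, 2L, 1L, 2L, 1L, 2L)

# Decimal comma to decimal point, then to numeric
time <- as.numeric(sub(",", ".", timeid, fixed = TRUE))

# For every row, the index of the most recent sender row (0 if none yet)
send_idx <- cummax(ifelse(typeid == 1L, seq_along(typeid), 0L))

# Receiver rows that have a preceding sender row
is_recv <- typeid == 2L & send_idx > 0L

delay <- time[is_recv] - time[send_idx[is_recv]]
delay
```

On these rows the delays agree with the small values shown in the valid column of the question.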
For testing, I used :
dd <- structure(list(timeid = structure(1:10, .Label = c("18,00035",
"18,00528", "18,02035", "18,02116", "18,04035", "18,04116", "18,06035",
"18,06116", "18,08035", "18,08116"), class = "factor"), valid = structure(c(3L,
2L, 4L, 1L, 5L, 1L, 6L, 1L, 7L, 1L), .Label = c("0,00081", "0,00493",
"1,00000", "2,00000", "3,00000", "4,00000", "5,00000"), class = "factor"),
typeid = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)), .Names = c("timeid",
"valid", "typeid"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
