I have data from an open ended survey. I have a comments table and a codes table. The codes table is a set of themes or strings.
What I am trying to do:
For each column in the codes table, check whether any of its words / strings appears in an open-ended comment. Then add a new column to the comments table for that theme, holding a binary 1 or 0 to denote which records have been tagged.
There are quite a number of columns in the codes table. These are live and ever changing, so the column order and the number of columns are subject to change.
I am currently doing this in a rather convoluted way: I check each column individually with multiple lines of code, and I reckon there is likely a much better way of doing it.
I can't figure out how to get lapply to work with the stringi function.
Help is greatly appreciated.
Here is an example set of code so you can see what I am trying to do:
#Two tables codes and comments
#codes table
codes <- structure(
list(
Support = structure(
c(2L, 3L, NA),
.Label = c("",
"help", "questions"),
class = "factor"
),
Online = structure(
c(1L,
3L, 2L),
.Label = c("activities", "discussion board", "quiz"),
class = "factor"
),
Resources = structure(
c(3L, 2L, NA),
.Label = c("", "pdf",
"textbook"),
class = "factor"
)
),
row.names = c(NA,-3L),
class = "data.frame"
)
#comments table
comments <- structure(
list(
SurveyID = structure(
1:5,
.Label = c("ID_1", "ID_2",
"ID_3", "ID_4", "ID_5"),
class = "factor"
),
Open_comments = structure(
c(2L,
4L, 3L, 5L, 1L),
.Label = c(
"I could never get the pdf to download",
"I didn’t get the help I needed on time",
"my questions went unanswered",
"staying motivated to get through the textbook",
"there wasn’t enough engagement in the discussion board"
),
class = "factor"
)
),
class = "data.frame",
row.names = c(NA,-5L)
)
#check if any words from the columns in codes table match comments
#here I am looking for a match column by column but looking for a better way - lapply?
library(stringi)
support = paste(codes$Support, collapse = "|")
supp_stringi = stri_detect_regex(comments$Open_comments, support)
supp_grepl = grepl(pattern = support, x = comments$Open_comments)
identical(supp_stringi, supp_grepl)
comments$Support = ifelse(supp_grepl == TRUE, 1, 0)
# What I would like to do is loop through all columns in codes rather than outlining the above code for each column in codes
Here is an approach that uses stringi::stri_detect_regex() with lapply() to create vectors of TRUE = 1, FALSE = 0 depending on whether any of the words in the Support, Online or Resources vectors are in the comments, and merges this data back with the comments.
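# build data structures from OP first, then:
library(stringi)
resultsList <- lapply(seq_along(codes),function(x){
# drop NA entries so they are not pasted into the pattern as the literal text "NA"
y <- stri_detect_regex(comments$Open_comments,paste(na.omit(codes[[x]]),collapse = "|"))
ifelse(y == TRUE,1,0)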
})
results <- as.data.frame(do.call(cbind,resultsList))
colnames(results) <- colnames(codes)
mergedData <- cbind(comments,results)
mergedData
...and the results.
> mergedData
  SurveyID                                          Open_comments Support Online Resources
1     ID_1                 I didn’t get the help I needed on time       1      0         0
2     ID_2          staying motivated to get through the textbook       0      0         1
3     ID_3                           my questions went unanswered       1      0         0
4     ID_4 there wasn’t enough engagement in the discussion board       0      1         0
5     ID_5                  I could never get the pdf to download       0      0         1
A one-liner using base R:
comments[names(codes)] <- lapply(codes, function(x)
+(grepl(paste0(na.omit(x), collapse = "|"), comments$Open_comments)))
comments
# SurveyID Open_comments Support Online Resources
#1 ID_1 I didn’t get the help I needed on time 1 0 0
#2 ID_2 staying motivated to get through the textbook 0 0 1
#3 ID_3 my questions went unanswered 1 0 0
#4 ID_4 there wasn’t enough engagement in the discussion board 0 1 0
#5 ID_5 I could never get the pdf to download 0 0 1
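The na.omit() call in the one-liner matters because the codes columns are padded with NA. A minimal sketch of the difference, using a hypothetical cut-down codes column and made-up comments (these are assumptions for illustration, not the OP's data):

```r
# Hypothetical miniature of one codes column and a few comments
support_codes <- c("help", "questions", NA)
example_comments <- c("I needed help", "the response was NA", "all good")

# Without na.omit(), paste() turns NA into the literal text "NA"
pattern_raw <- paste(support_codes, collapse = "|")
pattern_raw
# [1] "help|questions|NA"

# The second comment is flagged only because it contains "NA"
flag_raw <- +grepl(pattern_raw, example_comments)
flag_raw
# [1] 1 1 0

# Dropping NA first gives the intended pattern; unary + turns TRUE/FALSE into 1/0
pattern_clean <- paste(na.omit(support_codes), collapse = "|")
flag_clean <- +grepl(pattern_clean, example_comments)
flag_clean
# [1] 1 0 0
```

With the sample data in the question the two patterns happen to give the same flags, so the problem only shows up once a comment contains the letters "NA".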
I am new to programming in R and Python but have some of the basics. I have a technical question about a computation: I would like to know whether there is a function for subtracting one particular value (row) from all features (rows) in the same data list. I would like to obtain Output_value1 as shown in the file linked below, and then multiply it by -1 to obtain Output_value2.
data file link: https://www.dropbox.com/s/m5rsi6ru419f5bf/Template_matrixfile.xlsx?dl=0
Please let me know if you need more details.
I have tried performing the same operation in MS Excel, but it is very tedious and time consuming.
I have many large datasets with several hundred rows and columns, which makes doing this manually in MS Excel even more complex. Hence, I would prefer to write code to obtain the desired outputs.
Here is the example data. The inputs are the Feature and Value columns, and the outputs are the Output_value1 and Output_value2 columns.
| Feature | Value | Output_value1 | Output_value2 |
| Gene_1 | 14.25633934 | 0.80100922 | -0.80100922 |
| Gene_2 | 16.88394578 | 3.42861566 | -3.42861566 |
| Gene_3 | 16.01 | 2.55466988 | -2.55466988 |
| Gene_4 | 13.82329514 | 0.36796502 | -0.36796502 |
| Gene_5 | 12.96382949 | -0.49150063 | 0.49150063 |
| Normalizer | 13.45533012 | 0 | 0 |
dput(head(Exampledata))
structure(list(Feature = structure(1:6, .Label = c("Gene_1", "Gene_2",
"Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"), Value =
c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012), Output_value1 = c(0.80100922, 3.42861566, 2.55466988,
0.36796502, -0.49150063, 0), Output_value2 = c(-0.80100922,
-3.42861566, -2.55466988, -0.36796502, 0.49150063, 0)), row.names = c(NA, 6L), class = "data.frame")
Assuming you'll only have one row where Feature == "Normalizer", in R you can get the Value of that row and subtract it from every row.
Exampledata$Output_value1 <- Exampledata$Value -
Exampledata$Value[Exampledata$Feature == "Normalizer"]
Exampledata$Output_value2 <- Exampledata$Output_value1 * -1
Exampledata
# Feature Value Output_value1 Output_value2
#1 Gene_1 14.25634 0.8010092 -0.8010092
#2 Gene_2 16.88395 3.4286157 -3.4286157
#3 Gene_3 16.01000 2.5546699 -2.5546699
#4 Gene_4 13.82330 0.3679650 -0.3679650
#5 Gene_5 12.96383 -0.4915006 0.4915006
#6 Normalizer 13.45533 0.0000000 0.0000000
EDIT
For multiple such columns, we can do
cols <- grep("^Value", names(data))
inds <- which(data$Feature == "Normalizer")
data[paste0("Output", seq_along(cols))] <- data[cols] - data[rep(inds, nrow(data)),cols]
data[paste0("Output_inverted", seq_along(cols))] <- data[grep("Output", names(data))] * -1
data
Exampledata <- structure(list(Feature = structure(1:6, .Label = c("Gene_1",
"Gene_2", "Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"),
Value = c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012)), row.names = c(NA, 6L), class = "data.frame")
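For the multi-column case, here is a small self-contained sketch that can be run as-is. It uses sweep() rather than the rep()-based row indexing above, and the data frame with columns Value1 and Value2 is a hypothetical stand-in, not the OP's file:

```r
# Hypothetical data with two value columns; the names Value1/Value2 are assumptions
data <- data.frame(
  Feature = c("Gene_1", "Gene_2", "Normalizer"),
  Value1  = c(14, 16, 13),
  Value2  = c(10, 9, 8)
)

cols <- grep("^Value", names(data))

# Values of the Normalizer row, one per value column (assumes exactly one such row)
norm_vals <- unlist(data[data$Feature == "Normalizer", cols])

# sweep() subtracts norm_vals[j] from every entry of column j ("-" is the default FUN)
out <- sweep(as.matrix(data[cols]), 2, norm_vals)

data[paste0("Output", seq_along(cols))] <- as.data.frame(out)
data[paste0("Output_inverted", seq_along(cols))] <- as.data.frame(-out)
data
```

Computing the inverted columns from -out avoids the grep("Output", ...) lookup, which would also pick up the inverted columns if the code were run a second time.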
I am at the beginner stage of R programming; please help me with the issue below.
I have different desc values assigned to the same sol value in different rows. I want to combine all desc values for each sol value into a single row, as shown below.
My data is as follows:
sol desc
1 fry, toast
1 frt,grt,gty
1 ytr,uyt,ytr
6 hyt, ytr,oiu
4 hyg,hyu,loi
4 opu,yut,yut
I want the output as follows :
sol desc
1 fry,toast,frt,grt,gty,ytr,uyt,ytr
6 hyt, ytr,oiu
4 hyg,hyu,loi,opu,yut,yut
Note: you can input any values in desc as per your convenience.
aggregate() is what you are looking for. Try this:
aggregate(desc ~ sol, data = df, paste, collapse = ",")
sol desc
1 1 fry, toast,frt,grt,gty,ytr,uyt,ytr
2 4 hyg,hyu,loi,opu,yut,yut
3 6 hyt, ytr,oiu
Data
df <- structure(list(sol = c(1L, 1L, 1L, 6L, 4L, 4L), desc = c("fry, toast",
"frt,grt,gty", "ytr,uyt,ytr", "hyt, ytr,oiu", "hyg,hyu,loi",
"opu,yut,yut")), .Names = c("sol", "desc"), class = "data.frame", row.names = c(NA,
-6L))
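The aggregate() result keeps the stray space in "fry, toast", whereas the desired output in the question shows "fry,toast". If that matters, one possible variation (a sketch; tidy_paste is a hypothetical helper name) splits each string on commas, trims the pieces and pastes them back together:

```r
df <- data.frame(
  sol  = c(1L, 1L, 1L, 6L, 4L, 4L),
  desc = c("fry, toast", "frt,grt,gty", "ytr,uyt,ytr",
           "hyt, ytr,oiu", "hyg,hyu,loi", "opu,yut,yut"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on commas, trim whitespace, re-join with commas
tidy_paste <- function(x) {
  paste(trimws(unlist(strsplit(x, ","))), collapse = ",")
}

res <- aggregate(desc ~ sol, data = df, FUN = tidy_paste)
res
```

As with the plain paste version, aggregate() returns the groups sorted by sol (1, 4, 6) rather than in order of first appearance.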
I've been trying to solve this issue with mapply, but I believe I will have to use several nested applies to make this work, and it has gotten real confusing.
The problem is as follows:
Dataframe one contains around 400 keywords. These fall into roughly 15 categories.
Dataframe two contains a string description field, and 15 additional columns, each named to correspond to the categories mentioned in dataframe one. This has millions of rows.
If a keyword from dataframe 1 exists in the string field in dataframe 2, the category in which the keyword exists should be flagged in dataframe 2.
What I want should look something like this:
#Dataframe1 df1
keyword category
cat A
dog A
pig A
crow B
pigeon B
hawk B
catfish C
carp C
...

#Dataframe2 df2
description A B C ....
false cat 1 0 0 ....
smiling pig 1 0 0 ....
shady pigeon 0 1 0 ....
dogged dog 2 0 0 ....
sad catfish 0 0 1 ....
hawkward carp 0 1 1 ....
....
I tried to use mapply to get this to work but it fails, giving me the error "longer argument not a multiple of length of shorter". It also computes this only for the first string in df2. I haven't proceeded beyond this stage, i.e. attempting to get category flags.
> mapply(grepl, pattern = df1$keyword, x = df2$description)
Could anyone help? Thank you very much. I am new to R, so it would also help if someone could mention some rules of thumb for turning loops into apply functions. I cannot afford to use loops to solve this, as it would take far too much time.
There might be a more elegant way to do this but this is what I came up with:
## Your sample data:
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
## Load packages:
library(stringr)
library(dplyr)
library(tidyr)
## For each entry in df2$description count how many times each keyword
## is contained in it:
outList <- lapply(df2$description, function(description){
outDf <- data.frame(description = description,
value = vapply(stringr::str_extract_all(description, df1$keyword),
length, numeric(1)), category = df1$category)
})
## Combine to one long data frame and aggregate by category:
outLongDf<- do.call('rbind', outList) %>%
group_by(description, category) %>%
dplyr::summarise(value = sum(value))
## Reshape from long to wide format:
outWideDf <- tidyr::spread(data = outLongDf, key = category,
value = value)
outWideDf
# Source: local data frame [6 x 4]
# Groups: description [6]
#
# description A B C
# * <fctr> <dbl> <dbl> <dbl>
# 1 dogged dog 2 0 0
# 2 false cat 1 0 0
# 3 hawkward carp 0 1 1
# 4 sad catfish 1 0 1
# 5 shady pigeon 1 1 0
# 6 smiling pig 1 0 0
This approach, however, also catches the "pig" in "pigeon" and the "cat" in "catfish". I don't know whether this is what you want, though.
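If substring hits like these are not wanted, one option is to anchor each keyword at word boundaries with \\b. The following is a base R sketch under that assumption, not the stringr approach above; it records whether a keyword is present rather than how often it occurs:

```r
df1 <- data.frame(
  keyword  = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
  category = c("A", "A", "A", "B", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)
descriptions <- c("false cat", "smiling pig", "shady pigeon",
                  "dogged dog", "sad catfish", "hawkward carp")

# One logical column per keyword; \\b restricts matches to whole words
hits <- vapply(df1$keyword,
               function(k) grepl(paste0("\\b", k, "\\b"), descriptions, perl = TRUE),
               logical(length(descriptions)))

# For each category, add up the columns of the keywords that belong to it
counts <- sapply(split(seq_len(nrow(df1)), df1$category),
                 function(idx) rowSums(hits[, idx, drop = FALSE]))
rownames(counts) <- descriptions
counts
```

With whole-word matching, "shady pigeon" counts only towards B and "sad catfish" only towards C. Note that "hawkward carp" then no longer counts towards B, and "dogged dog" gets 1 rather than 2, so whether this matches the intended output depends on how strictly keywords should be matched.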
What you are looking for is a so-called document-term matrix (dtm for short), a concept from NLP (Natural Language Processing). There are many options available. I prefer text2vec. This package is blazingly fast (I wouldn't be surprised if it outperformed the other solutions here by a large margin), especially in combination with tokenizers.
In your case the code would look something like this:
# Create the data
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
# load the libraries
library(text2vec) # to create the dtm
library(tokenizers) # to help creating the dtm
library(reshape2) # to reshape the data from wide to long
# 1. create the vocabulary from the keywords
vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$keyword)))
# 2. create the dtm
dtm <- create_dtm(itoken(as.character(df2$description)), vocabulary)
# 3. convert the sparse-matrix to a data.frame
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$description <- df2$description
# 4. melt to long format
df_result <- melt(dtm_df, id.vars = "description", variable.name = "keyword")
df_result <- df_result[df_result$value == 1, ]
# 5. combine the data, i.e., add category
df_final <- merge(df_result, df1, by = "keyword")
# keyword description value category
# 1 carp hawkward carp 1 C
# 2 cat false cat 1 A
# 3 catfish sad catfish 1 C
# 4 dog dogged dog 1 A
# 5 pig smiling pig 1 A
# 6 pigeon shady pigeon 1 B
Whatever the implementation, counting the number of matches per category needs k x d comparisons, where k is the number of keywords and d the number of descriptions.
There are a few tricks to solve this problem fast and without a lot of memory:
Use vectorized operations. These can be performed a lot quicker than for loops. Note that lapply, mapply and vapply are just shorthand for for loops. I parallelize (see next) over the keywords so that the vectorization can run over the descriptions, which is the largest dimension.
Use parallelization. Making optimal use of your multiple cores speeds up the process at the cost of an increase in memory (since every core needs its own copy).
Example:
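library(stringr)  # str_detect
library(parallel) # mclapply (forking; mc.cores > 1 is not supported on Windows)
keywords <- stringi::stri_rand_strings(400, 2)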
categories <- letters[1:15]
keyword_categories <- sample(categories, 400, TRUE)
descriptions <- stringi::stri_rand_strings(3e6, 20)
keyword_occurance <- function(word, list_of_descriptions) {
description_keywords <- str_detect(list_of_descriptions, word)
}
category_occurance <- function(category, mat) {
rowSums(mat[,keyword_categories == category])
}
list_keywords <- mclapply(keywords, keyword_occurance, descriptions, mc.cores = 8)
df_keywords <- do.call(cbind, list_keywords)
list_categories <- mclapply(categories, category_occurance, df_keywords, mc.cores = 8)
df_categories <- do.call(cbind, list_categories)
With my computer this takes 140 seconds and 14GB RAM to match 400 keywords in 15 categories to 3 million descriptions.
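The per-category summing step can also be written as a single matrix product. The sketch below shows the idea on a tiny made-up example; it swaps in base grepl(fixed = TRUE) for str_detect and runs serially, so it illustrates the aggregation only and makes no claim about speed:

```r
# Tiny made-up inputs, standing in for the 400 keywords and 3 million descriptions
keywords <- c("ab", "cd", "ef", "gh")
keyword_categories <- c("a", "a", "b", "b")
categories <- unique(keyword_categories)
descriptions <- c("xxabxx", "cdab", "gh", "zzz")

# descriptions x keywords: TRUE where the keyword occurs as a literal substring
hits <- vapply(keywords,
               function(k) grepl(k, descriptions, fixed = TRUE),
               logical(length(descriptions)))

# keywords x categories: TRUE where the keyword belongs to the category
membership <- outer(keyword_categories, categories, "==")
colnames(membership) <- categories

# descriptions x categories: number of matching keywords per category
counts <- (hits + 0) %*% (membership + 0)
counts
```

Adding 0 converts the logical matrices to numeric before multiplying. For millions of rows the dense hits matrix is still the memory bottleneck, exactly as in the mclapply version.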
I have a dataset with 2 months of data (February and March). How can I split the data into 59 subsets by day (28 days for Feb and 31 days for Mar) and save each as a data frame? Preferably, each data frame would be named according to its date, i.e. 20140201, 20140202 and so forth.
df <- structure(list(text = structure(c(4L, 6L, 5L, 2L, 8L, 1L), .Label = c(" Terpilih Jadi Maskapai dengan Pelayanan Kabin Pesawat cont",
"booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids",
"Can I change for the traveler details because i choose wrongly for the Mr or Ms part",
"cant do it with cards either", "Coming back home AK", "gotta try PNNL",
"Jadwal penerbangan medanjktsblm tangalmasi ada kah", "Me and my Tart would love to flyLoveisintheAir",
"my flight to Bangkok onhas been rescheduled I couldnt perform seat selection now",
"Pls checks his case as money is not credited to my bank acctThanks\n\nCASLTP",
"Processing fee Whatt", "Tacloban bound aboardto get them boats Boats boats boats Tacloban HeartWork",
"thanks I chatted with ask twice last week and told the same thing"
), class = "factor"), created = structure(c(1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1/2/2014", "2/2/2014", "5/2/2014", "6/2/2014"
), class = "factor")), .Names = c("text", "created"), row.names = c(NA,
6L), class = "data.frame")
You don't need to output multiple dataframes. You only need to select/subset them by the year and month of the 'created' field. Here are two ways to do that; the first is simpler if you don't plan on needing any more date arithmetic.
# 1. Leave 'created' a string, just use text substitution to extract its month and year components
df$created_mthyr <- gsub( '([0-9]+/)[0-9]+/([0-9]+)', '\\1\\2', df$created )
# 2. If you need to do arbitrary Date arithmetic, convert 'created' field to Date object
# in this case you need an explicit format-string; note that month is lowercase '%m' ('%M' means minutes).
# '%m/%d/%Y' treats the first field as the month, as in approach 1; use '%d/%m/%Y' if the dates are day/month/year
df$created <- as.Date(df$created, '%m/%d/%Y')
# Now you can do either a) split
split(df, df$created_mthyr)
# specifically if you want to assign the output it creates to 3 dataframes:
df1 <- split(df, df$created_mthyr)[[1]]
df2 <- split(df, df$created_mthyr)[[2]]
df5 <- split(df, df$created_mthyr)[[3]]
# ...or else b) do a Split-Apply-Combine and perform arbitrary command on each separate subset. This is very powerful. See plyr/ddply documentation for examples.
require(plyr)
df1 <- dlply(df, .(created_mthyr))[[1]]
df2 <- dlply(df, .(created_mthyr))[[2]]
df5 <- dlply(df, .(created_mthyr))[[3]]
# output looks like this - strictly you might not want to keep 'created','created_mthyr':
> df1
# text created created_mthyr
#1 cant do it with cards either 1/2/2014 1/2014
#2 gotta try PNNL 1/2/2014 1/2014
> df2
#3
#Coming back home AK
#4 booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids
# created created_mthyr
#3 2/2/2014 2/2014
#4 2/2/2014 2/2014
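If one subset per day named like 20140201 is what is needed, a sketch along these lines may help. It assumes the 'created' strings are day/month/year (which the February/March description suggests) and uses a cut-down stand-in for the OP's data:

```r
# Stand-in data; the text values are placeholders
df <- data.frame(
  text    = c("a", "b", "c", "d", "e", "f"),
  created = c("1/2/2014", "1/2/2014", "2/2/2014", "2/2/2014", "5/2/2014", "5/2/2014"),
  stringsAsFactors = FALSE
)

# Assumption: dates are day/month/year
dates <- as.Date(df$created, format = "%d/%m/%Y")

# One data frame per day, in a named list keyed by YYYYMMDD
by_day <- split(df, format(dates, "%Y%m%d"))

names(by_day)
# [1] "20140201" "20140202" "20140205"

by_day[["20140201"]]
```

Keeping the subsets in a named list is usually easier to work with than 59 separate objects, since they can be looped over with lapply() and still looked up by date string.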
I have a text variable and a grouping variable. I'd like to collapse the text variable into one string per row, combining by factor. So as long as the group column says m, I want to group the text together, and so on. I have provided a sample data set, before and after. I am writing this for a package and have thus far avoided all reliance on other packages except for wordcloud, and would like to keep it this way.
I suspect rle may be useful with cumsum but haven't been able to figure this one out.
Thank you in advance.
What the data looks like
text group
1 Computer is fun. Not too fun. m
2 No its not, its dumb. m
3 How can we be certain? f
4 There is no way. m
5 I distrust you. m
6 What are you talking about? f
7 Shall we move on? Good then. f
8 Im hungry. Lets eat. You already? m
What I'd like the data to look like
text group
1 Computer is fun. Not too fun. No its not, its dumb. m
2 How can we be certain? f
3 There is no way. I distrust you. m
4 What are you talking about? Shall we move on? Good then. f
5 Im hungry. Lets eat. You already? m
The Data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on? Good then.",
"Im hungry. Lets eat. You already?"), group = structure(c(2L,
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text",
"group"), row.names = c(NA, 8L), class = "data.frame")
EDIT: I found I can add unique column for each run of the group variable with:
x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))
Yielding:
text group new
1 Computer is fun. Not too fun. m 1
2 No its not, its dumb. m 1
3 How can we be certain? f 2
4 There is no way. m 3
5 I distrust you. m 3
6 What are you talking about? f 4
7 Shall we move on? Good then. f 4
8 Im hungry. Lets eat. You already? m 5
This makes use of rle to create an id to group the sentences on. It then uses tapply along with paste to bring the output together.
## Your example data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on? Good then.",
"Im hungry. Lets eat. You already?"), group = structure(c(2L,
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text",
"group"), row.names = c(NA, 8L), class = "data.frame")
# Needed for later
k <- rle(as.numeric(dat$group))
# Create a grouping vector
id <- rep(seq_along(k$lengths), k$lengths)
# Combine the text in the desired manner
out <- tapply(dat$text, id, paste, collapse = " ")
# Bring it together into a data frame
answer <- data.frame(text = out, group = levels(dat$group)[k$values])
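To see what the rle step produces, here is a small sketch on a bare group vector with placeholder text (both made up for illustration). It also shows a cumsum-based way of building the same run id, as hinted at in the question:

```r
group <- c("m", "m", "f", "m", "m", "f", "f", "m")
text  <- c("a", "b", "c", "d", "e", "f", "g", "h")  # placeholder sentences

runs <- rle(group)
runs$lengths
# [1] 2 1 2 2 1
runs$values
# [1] "m" "f" "m" "f" "m"

# One id per run: repeat each run's index by the run's length
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# [1] 1 1 2 3 3 4 4 5

# Same ids without rle: start a new run whenever the value changes
run_id_cumsum <- cumsum(c(TRUE, group[-1] != group[-length(group)]))

# Collapse the text within each run
collapsed <- tapply(text, run_id, paste, collapse = " ")
data.frame(text = as.vector(collapsed), group = runs$values)
```

Both constructions give the same ids; the rle version has the advantage that runs$values directly supplies the group label for each collapsed row.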
I got the answer and came back to post but Dason beat me to it and more understandably than my own.
x <- rle(as.character(dat$group))[[1]]
dat$new <- as.factor(rep(1:length(x), x))
Paste <- function(x) paste(x, collapse=" ")
aggregate(text~new, dat, Paste)
EDIT
How I'd do it with aggregate and what I learned from your response (though tapply is a better solution):
y <- rle(as.character(dat$group))
x <- y[[1]]
dat$new <- as.factor(rep(1:length(x), x))
text <- aggregate(text~new, dat, paste, collapse = " ")[, 2]
data.frame(text, group = y[[2]])