Data from multiple-response survey items often arrive without enough structure to make tidying easy. Specifically, I have a survey question in which respondents pick one or more of 8 categorical options. The resulting column holds up to 8 comma-separated strings; a given cell might contain two, four, or none of the options. The eighth option is "Other" and may be populated with custom text.
Incidentally, this is a typical format for GoogleForms multiple response data.
Below are example data. The third and last rows include a unique response for the eighth "other" option:
structure(list(actvTypes = c(NA, NA, "Data collection, Results / findings / learnings, ate ants and milkweed",
NA, "Discussion of our research question, Planning for data collection",
"Data analysis, Collected data, apples are yummy")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I'd like to make a set of 8 new columns into which the responses are recorded as either 0 or 1. How can this be done efficiently?
I have a solution but it is cumbersome. I started by creating new columns for each of the response options:
atypes <- paste0("atype", 1:8)
log[atypes] <- NA
Next, I wrote eight ifelse statements; the format for the first seven is shown below:
log$atype7 <- ifelse(str_detect(log$actvTypes, fixed("Met with non-DASA team member (not data collection)")), 1, 0)  # fixed() so the parentheses are matched literally, not as regex
For the "other" response option, I used a list of strings and a sapply solution:
alloptions <- c("Discussion of our research question", "Planning for data collection",
                "Data analysis", "Discussion of results | findings | learnings",
                "Mid-course corrections to our project", "Collected data",
                "Met with non-DASA team member (not data collection)")
log$atype8 <- sapply(log$actvTypes, function(x)
  ifelse(any(sapply(alloptions, str_detect, string = x)), 1, 0))
How might this coding scheme be more elegant? Perhaps sapply and using an index?
Depending on what you're ultimately trying to do, the following could be helpful:
library(tidyverse)

df %>%
  rownames_to_column(var = "responder") %>%
  separate_rows(actvTypes, sep = ",") %>%
  mutate(actvTypes = fct_explicit_na(actvTypes)) %>%
  count(actvTypes)
# # A tibble: 9 x 2
# actvTypes n
# <fct> <int>
# 1 " apples are yummy" 1
# 2 " ate ants and milkweed" 1
# 3 " Collected data" 1
# 4 " Planning for data collection" 1
# 5 " Results / findings / learnings" 1
# 6 Data analysis 1
# 7 Data collection 1
# 8 Discussion of our research question 1
# 9 (Missing) 3
Take note of what the data look like right before the call to count(): if you know the "non-other" categories beforehand, grouping everything else into an "Other" category should be trivial. It is also worth looking at what the data look like right after the call to separate_rows().
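If the goal is specifically the eight 0/1 indicator columns, here is a minimal sketch of one way to finish the job, reusing the df object from above. It assumes the seven known options match the strings in actvTypes exactly (the lists in the question are not fully consistent, so adjust the known vector as needed); fixed() keeps the parentheses and slashes from being read as regex:

library(tidyverse)

known <- c("Discussion of our research question",
           "Planning for data collection",
           "Data analysis",
           "Results / findings / learnings",
           "Mid-course corrections to our project",
           "Collected data",
           "Met with non-DASA team member (not data collection)")

txt <- coalesce(df$actvTypes, "")  # treat NA as "no responses selected"
for (i in seq_along(known)) {
  df[[paste0("atype", i)]] <- as.integer(str_detect(txt, fixed(known[i])))
}

# "Other" = any text left over once the known options are stripped out
leftover <- reduce(known, function(s, opt) str_remove(s, fixed(opt)), .init = txt)
df$atype8 <- as.integer(str_detect(leftover, "[[:alpha:]]"))

On the example data this should flag rows 3 and 6 as "Other", matching the two free-text responses.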
I am analyzing a data set of feedback from teachers. Each row in the data frame is a teacher and each of their answers is a variable. However, I've run into a problem entering the year level for each teacher, as many of the teachers teach multiple grades, e.g.:
Teacher Year
a 1
b 3
c 1/2
d 7
e 3/4
How can I enter this data into an Excel sheet, read it into R, and analyse it usefully? I've never dealt with a variable that contains multiple options in the same row.
Suppose you already have this data in R in an object called teacher_data. I will show you the approach to such responses that I have seen most commonly employed: create additional columns so that each answer gets its own cell, via the convenient tidyr function separate().
library(tidyr)
separate(teacher_data, col = "Year", into = paste0("Year", 1:2), sep = "/")
Here's the result:
Teacher Year1 Year2
1 a 1 <NA>
2 b 3 <NA>
3 c 1 2
4 d 7 <NA>
5 e 3 4
How you then use those columns depends on what sort of question you're trying to answer with the data. That part of your question is probably best asked at the sister site Cross Validated (the Stack Exchange for statistics).
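If the analysis calls for one row per teacher-grade pair instead (for example, to count how many teachers teach each grade), a quick sketch using tidyr::separate_rows():

library(tidyr)
library(dplyr)

teacher_long <- separate_rows(teacher_data, Year, sep = "/")
count(teacher_long, Year)  # number of teachers teaching each grade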
As far as Excel goes, I would not even bother with it as an intermediate step; it's unnecessary. When you're done, write the data out to a CSV, which Excel reads just fine:
write.csv(teacher_data, file = "teacher_data.csv", row.names = FALSE)
Also, just so you know, I put your data into R via the following:
teacher_data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Teacher Year
a 1
b 3
c 1/2
d 7
e 3/4")
Disclaimer: totally inexperienced with R, so please bear with me!
Context: I have a series of .csv files in a directory. These files contain 7 columns and approx 100 rows. I've compiled some scripts that will read in all of the files, loop over each one adding some new columns based on different factors (e.g. if a specific column makes reference to a "box set" then it creates a new column called "box_set" with "yes" or "no" for each row), and write out over the original files. The only thing that I can't quite figure out (and yes, I've Googled high and low) is how to split one of the columns into two, based on a particular string. The string always begins with ": Series" but can end with different numbers or ranges of numbers. E.g. "Poldark: Series 4", "The Musketeers: Series 1-3".
I want to be able to split that column (currently named Programme_Title) into two columns (one called Programme_Title and one called Series_Details). Programme_Title would just contain everything before the ":" whilst Series_Details would contain everything from the "S" onwards.
To further complicate matters, the Programme_Title column contains a number of different strings, not all of which follow the examples above. Some don't contain ": Series", some will include the ":" but will not be followed by "Series".
Because I'm terrible at explaining these things, here's a sample of what it currently looks like:
Programme_Title
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo: Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur: Series 1-2
Poldark: Series 4
The Musketeers: Series 1-3
War and Peace
And here's what I want it to look like:
Programme_Title Series_Details
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur Series 1-2
Poldark Series 4
The Musketeers Series 1-3
War and Peace
As I said, I'm a total R novice so imagine that you're speaking to a 5 yr old. If you need more info to be able to answer this then please let me know.
Here's the code that I'm using to do everything else (I'm sure it's a bit messy but I cobbled it together from different sources, and it works!)
library(stringr)
library(data.table)

### Read in files ###
filenames <- dir(pattern = "\\.csv$")

### Loop through all files, add various columns, then save ###
for (i in seq_along(filenames)) {
  tmp <- read.csv(filenames[i], stringsAsFactors = FALSE)
  ### Add date part of filename to column labelled "date" ###
  tmp$date <- str_sub(filenames[i], start = 13L, end = -5L)
  ### Create new column labelled "Series" ###
  tmp$Series <- ifelse(grepl(": Series", tmp$Programme_Title), "yes", "no")
  ### Create "rank" for Programme_Category ###
  tmp$rank <- sequence(rle(as.character(tmp$Programme_Category))$lengths)
  ### Create new column called "row" to assign a numerical label to each group ###
  DT <- data.table(tmp)
  tmp <- DT[, row := .GRP, by = .(Programme_Category)][]
  ### Identify box sets and create new column with "yes" / "no" ###
  tmp$Box_Set <- ifelse(grepl("Box Set", tmp$Programme_Synopsis), "yes", "no")
  ### Remove the data.table which we no longer need ###
  rm(DT)
  ### Write out the new file ###
  write.csv(tmp, filenames[[i]], row.names = FALSE)  # row.names = FALSE avoids adding an extra column on each re-write
}
I don't have your exact data structure, but I created an example for you that should work:
library(tidyr)
movieName <- c("This is a test", "This is another test: Series 1-5", "This is yet another test")
df <- data.frame(movieName)
df
movieName
1 This is a test
2 This is another test: Series 1-5
3 This is yet another test
df <- df %>% separate(movieName, c("Title", "Series"), sep = ": Series")
for (row in 1:nrow(df)) {
  df$Series[row] <- ifelse(is.na(df$Series[row]), "",
                           paste("Series", df$Series[row], sep = ""))
}
df
Title Series
1 This is a test
2 This is another test Series 1-5
3 This is yet another test
I tried to capture all the examples you might encounter, but you can easily add things to capture variants not covered in the examples I provided.
Edit: I added a test case that did not include ":" or "Series". It will just produce an NA for Series_Details.
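To apply the same idea to the asker's actual columns inside the existing loop, a base R sketch (assuming, per the examples, that ": Series" only ever introduces the series details):

# compute Series_Details first, then strip it from Programme_Title
tmp$Series_Details <- ifelse(grepl(": Series", tmp$Programme_Title),
                             sub(".*: (Series.*)$", "\\1", tmp$Programme_Title),
                             "")
tmp$Programme_Title <- sub(":\\s*Series.*$", "", tmp$Programme_Title)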
## load libraries: the main ones used here are stringr, dplyr, tidyr, and tibble from the tidyverse, but I would recommend just installing the whole tidyverse
library(tidyverse)
## example of your data, hard to know all the unique types of data, but this will get you in the right direction
data <- tibble(title = c("X:Series 1-6",
                         "Y: Series 1-2",
                         "Z : Series 1-10",
                         "The Z and Z: 1-3",
                         "XX Series 1-3",
                         "AA AA"))
## Example of the data we want to format, see the different cases covered
print(data)
title
<chr>
1 X:Series 1-6
2 Y: Series 1-2
3 Z : Series 1-10
4 The Z and Z: 1-3
5 XX Series 1-3
6 AA AA
## These %>% are called pipes, used to feed data through a pipeline; very handy and useful.
data_formatted <- data %>%
  ## Fix cases where you have "Series" but no ":", or vice versa, so everything is consistent.
  ## It sounds like you will always have ":", "Series", or ":Series". If this is different,
  ## you can easily change/update this to capture other cases.
  mutate(title = case_when(
    str_detect(title, 'Series') & !str_detect(title, ':') ~ str_replace(title, 'Series', ':Series'),
    !str_detect(title, 'Series') & str_detect(title, ':') ~ str_replace(title, ':', ':Series'),
    TRUE ~ title)) %>%
  ## First separate the columns based on ":"
  separate(col = title, into = c("Programme_Title", "Series_Details"), sep = ':') %>%
  ## Then trim the whitespace at the ends to clean it up
  mutate(Programme_Title = str_trim(Programme_Title),
         Series_Details = str_trim(Series_Details))
## Output of the data to see how it was formatted
print(data_formatted)
Programme_Title Series_Details
<chr> <chr>
1 X Series 1-6
2 Y Series 1-2
3 Z Series 1-10
4 The Z and Z Series 1-3
5 XX Series 1-3
6 AA AA NA
Thanks to lawyeR for recommending the tidytext package. Here is some code based on that package that seems to work pretty well on my sample data. It doesn't work quite so well, though, when the value of the text column is blank. (There are times when this will happen, and it will make sense to keep the blank rather than filtering it out.) To illustrate, I've set the first observation for TVAR to a blank; the code drops this observation. How can I get R to keep the observation and set the frequencies for each word to zero? I tried some ifelse statements, with and without the pipe, but it's not working so well. The trouble seems to center on the unnest_tokens function from the tidytext package.
library(dplyr)
library(tidyr)
library(tidytext)

sampletxt$TVAR[1] <- ""

chunk_words <- sampletxt %>%
  group_by(PTNO, DATE, TYPE) %>%
  unnest_tokens(word, TVAR, to_lower = FALSE) %>%
  count(word) %>%
  spread(word, n, 0)
I have an R data frame. I want to use it to create a document term matrix. Presumably I would want to use the tm package to do that but there might be other ways. I then want to convert that matrix back to a data frame. I want the final data frame to contain identifying variables from the original data frame.
Question is, how do I do that? I found some answers to a similar question, but that was for a data frame with text and a single ID variable. My data are likely to have about half a dozen variables that identify a given text record. So far, attempts to scale up the solution for a single ID variable haven't proven all that successful.
Below are some sample data. I created these for another task which I did manage to solve.
How can I get a version of this data frame that has an additional frequency column for each word in the text entries and that retains variables like PTNO, DATE, and TYPE?
sampletxt <-
structure(
list(
PTNO = c(1, 2, 2, 3, 3),
DATE = structure(c(16801, 16436, 16436, 16832, 16845), class = "Date"),
TYPE = c(
"Progress note",
"Progress note",
"CAT scan",
"Progress note",
"Progress note"
),
TVAR = c(
"This sentence contains the word metastatic and the word Breast plus the phrase denies any symptoms referable to.",
"This sentence contains tnm code T-1, N-O, M-0. This sentence contains contains tnm code T-1, N-O, M-1. This sentence contains tnm code T1N0M0. This sentence contains contains tnm code T1NOM1. This sentence is a sentence!?!?",
"This sentence contains Dr. Seuss and no target words. This sentence contains Ms. Mary J. blige and no target words.",
"This sentence contains the term stageIV and the word Breast. This sentence contains no target words.",
"This sentence contains the word breast and the term metastatic. This sentence contains the word breast and the term stage IV."
)), .Names = c("PTNO", "DATE", "TYPE", "TVAR"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
The quanteda package is faster and more straightforward than tm, and works nicely with tidytext as well. Here's how to do it:
These operations create a corpus from your object, create a document-feature matrix, and then return a data.frame that combines the variables with the feature counts. (Additional options are available when creating the dfm, see ?dfm).
library("quanteda")
samplecorp <- corpus(sampletxt, text_field = "TVAR")
sampledfm <- dfm(samplecorp)
result <- cbind(docvars(sampledfm), as.data.frame(sampledfm))
You can then group by the variables to get your result. (Here I am showing just the first 6 columns.)
dplyr::group_by(result[, 1:6], PTNO, DATE, TYPE)
# # A tibble: 5 x 6
# # Groups: PTNO, DATE, TYPE [5]
# PTNO DATE TYPE this sentence contains
# * <dbl> <date> <chr> <dbl> <dbl> <dbl>
# 1 1 2016-01-01 Progress note 1 1 1
# 2 2 2015-01-01 Progress note 5 6 6
# 3 2 2015-01-01 CAT scan 2 2 2
# 4 3 2016-02-01 Progress note 2 2 2
# 5 3 2016-02-14 Progress note 2 2 2
packageVersion("quanteda")
# [1] ‘0.99.6’
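On the blank-text issue raised in the question: with this approach, a document whose text is "" is not dropped; it simply becomes an all-zero row in the dfm (worth verifying on your quanteda version). A quick check, reusing the objects above:

sampletxt$TVAR[1] <- ""
samplecorp <- corpus(sampletxt, text_field = "TVAR")
sampledfm <- dfm(samplecorp)
ndoc(sampledfm)  # still 5 documents; the first row of the dfm is all zeros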
The ngram_tokenize() function from the "SentimentAnalysis" package is the easiest way to do this, especially if you are trying to convert a dataframe column to a DTM (though it also works on a txt file):
library("SentimentAnalysis")
corpus <- VCorpus(VectorSource(df$column_or_txt))
tdm <- TermDocumentMatrix(corpus,
control=list(wordLengths=c(1,Inf),
tokenize=function(x) ngram_tokenize(x, char=FALSE,
ngmin=1, ngmax=2)))
It is simple and works like a charm every time, with both Chinese and English, for those doing text mining in Chinese.
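To get from the term-document matrix back to a data frame with the identifying variables attached, a sketch (assuming df also holds the PTNO, DATE, and TYPE columns; note that TermDocumentMatrix is terms-by-documents, hence the transpose):

dtm_df <- as.data.frame(t(as.matrix(tdm)))  # one row per document
result <- cbind(df[, c("PTNO", "DATE", "TYPE")], dtm_df)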
Quick question - I have a dataframe (severity) that looks like,
industryType relfreq relsev
1 Consumer Products 2.032520 0.419048
2 Biotech/Pharma 0.650407 3.771429
3 Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices 1.463415 3.028571
6 Software 0.758808 1.314286
7 Business/Consumer Services 0.623306 0.723810
8 Telecommunications 0.650407 4.247619
If I wanted to pull the relfreq value for Medical Devices (row 5), how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user would specify the industry i.e.
example <- function(industrytype) {
  weight <- relfreq of industrytype parameter  # pseudocode
  thing2 <- thing1 * weight
  return(thing2)
}
So if I do subset by an index, is there a way R would know which index corresponds to the industry type specified in the function parameter? Or is it easier/a way to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the two columns you requested (industryType and relfreq).
There is a great set of packages, the tidyverse, that lets you do this intuitively:

library(tidyverse)

data_want <- severity %>%
  subset(industryType == "Medical Devices") %>%
  select(industryType, relfreq)
You read from left to right here, with each %>% passing the result on to the next step, as if the calls were nested.
I think it is better to select the whole row first, then choose the column you would like to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
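Plugged into the function from the question, either approach might look like this sketch (thing1 and severity are assumed to exist in the calling environment):

example <- function(industrytype) {
  weight <- severity$relfreq[severity$industryType == industrytype]
  thing2 <- thing1 * weight
  return(thing2)
}

example("Medical Devices")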
Can someone help me find the most frequently used two- and three-word phrases in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs\ but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
data_frame(text = text) %>%
  unnest_tokens(word, text) %>%  # split into words
  anti_join(stop_words) %>%      # take out "a", "an", "the", etc.
  count(word, sort = TRUE)       # count occurrences
# Source: local data frame [73 x 2]
#
# word n
# (chr) (int)
# 1 phrase 8
# 2 sentence 6
# 3 words 4
# 4 called 3
# 5 common 3
# 6 grammatical 3
# 7 meaning 3
# 8 alex 2
# 9 bird 2
# 10 complete 2
# .. ... ...
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>%  # tokenize bigrams and trigrams
  as_data_frame() %>%                                           # structure
  count(value, sort = TRUE)                                     # count
# Source: local data frame [531 x 2]
#
# value n
# (fctr) (int)
# 1 of the 5
# 2 a phrase 4
# 3 the sentence 4
# 4 as a 3
# 5 in the 3
# 6 may be 3
# 7 a complete 2
# 8 a phrase is 2
# 9 a sentence 2
# 10 a white 2
# .. ... ...
In Natural Language Processing, 2-word phrases are referred to as "bi-gram", and 3-word phrases are referred to as "tri-gram", and so forth. Generally, a given combination of n-words is called an "n-gram".
First, we install the ngram package (available on CRAN)
# Install package "ngram"
install.packages("ngram")
Then, we will find the most frequent two-word and three-word phrases
library(ngram)
# To find all two-word phrases in "text":
ng2 <- ngram(text, n = 2)

# To find all three-word phrases in "text":
ng3 <- ngram(text, n = 3)
Finally, we can print the n-gram objects using various methods:
print(ng2, output = "truncated")
print(ng3, output = "full")
get.phrasetable(ng2)
ngram::ngram_asweka(text, min = 2, max = 3)
We can also use Markov Chains to babble new sequences:
# if we are using ng2 (bi-gram)
lnth = 2
babble(ng = ng2, genlen = lnth)
# if we are using ng3 (tri-gram)
lnth = 3
babble(ng = ng3, genlen = lnth)
We can split the words and use table to summarize the frequency:
words <- strsplit(text, "[ ,.\\(\\)\"]")
sort(table(words, exclude = ""), decreasing = T)
Simplest?
require(quanteda)
# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
## of_the a_phrase the_sentence may_be as_a in_the in_common phrase_is
## 5 4 4 3 3 3 2 2
## is_usually group_of
## 2 2
# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
## a_phrase_is group_of_words of_a_sentence of_the_sentence for_example_in example_in_the
## 2 2 2 2 2 2
## in_the_sentence an_orange_bird orange_bird_with bird_with_a
## 2 2 2 2
Here's a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)
# a the of in phrase
# 21 18 12 10 8
What it returns is an integer vector of frequency counts, and the names of the vector correspond to the words that were counted. Step by step:
gsub("[[:punct:]]", "", text) to remove punctuation since you don't want to count that, I guess
strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces
table() to count unique elements' frequency
sort(..., decreasing = TRUE) to sort them in decreasing order
head(..., 5) to select only the top 5 most frequent words
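Since the question also asks for two- and three-word phrases, the same base R idea extends to n-grams with a little pasting; a sketch:

w <- strsplit(gsub("[[:punct:]]", "", text), " ")[[1]]
w <- w[w != ""]                             # drop empty tokens
bigrams <- paste(head(w, -1), tail(w, -1))  # adjacent word pairs
head(sort(table(bigrams), decreasing = TRUE), 5)

trigrams <- paste(head(w, -2), head(tail(w, -1), -1), tail(w, -2))
head(sort(table(trigrams), decreasing = TRUE), 5)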