Matching phrases (word-search), noting finding in different columns - r

I'm working with feedback from teachers, collected as multiple-choice selections in a Google Form. All selections for a question end up in a single column.
There was also an option to type in a response, so some words/responses appear only once. There are about 50 responses, with between 1 and 7 words (roughly) per entry. For the MRE I've included 5 as a vector.
eg_responses <- c("Difficult, Challenging, Fair", "Necessary, Useful", "Cruel", "Easy, Challenging", "School's shouldn't have to do that")
words_example <- c("Difficult","Easy","Fair","Challenging","Necessary","Useful")
While I can get a summary of the results, I need to isolate the individual responses so that I can select them and compare them with other variables later on.
What I would like to be able to do:
Have a column for each word, plus an extra column for other comments.
Look through each row for each word that was in the drop-down:
while there are rows left to check, look for word x;
if the word is present, mark it in column x;
if the word is not present, mark NA in column x;
move on to the next column and the next word when the rows run out.
Any other words/phrases left at the end go in the last column;
if no other words/phrases are left, mark the last column with NA.
I'm sure I've been overcomplicating things, but I'm an absolute beginner. I'm working in R.
I have tried separating the words, turning things into factors, then turning them back into character. Searching for exact complete phrases, or searching through words, would be really helpful.
There are 6 other sections with the same problem, and other sections with other variables.
I would like to have something like this at the end
(apologies, I'm just going to make it up; it's a bit of a chicken-and-egg situation):
respcolnames <- c("Challenging", "Fair", "Unfair", "Other")
# use real NA values rather than the string "NA"
row1 <- c(NA, "Fair", NA, NA)
row2 <- c("Challenging", NA, "Unfair", NA)
row3 <- c(NA, NA, NA, "School's shouldn't have to do that")
# rbind() stacks the vectors as rows; colnames() labels the columns of the resulting matrix
teacher_responses <- rbind(row1, row2, row3)
colnames(teacher_responses) <- respcolnames
so that the corresponding response is in the correct column. I can then strip away the "Other" responses for a more social science analysis, and use the drop down response selections for some graphics.
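One possible way to get that layout in base R is sketched below, using the example vectors from the question. It assumes that selections are always separated by commas and that a typed-in comment contains no commas of its own; the names `parts`, `known` and `other` are made up for illustration.

```r
eg_responses <- c("Difficult, Challenging, Fair", "Necessary, Useful", "Cruel",
                  "Easy, Challenging", "School's shouldn't have to do that")
words_example <- c("Difficult", "Easy", "Fair", "Challenging", "Necessary", "Useful")

# Split every response on commas and trim the surrounding spaces
parts <- lapply(strsplit(eg_responses, ","), trimws)

# One column per drop-down word: the word itself if selected, NA otherwise
known <- sapply(words_example, function(w) {
  vapply(parts, function(p) if (w %in% p) w else NA_character_, character(1))
})

# Anything that is not a drop-down word goes into the "Other" column
other <- vapply(parts, function(p) {
  rest <- setdiff(p, words_example)
  if (length(rest) > 0) paste(rest, collapse = ", ") else NA_character_
}, character(1))

teacher_responses <- data.frame(known, Other = other, stringsAsFactors = FALSE)
```

Each row of `teacher_responses` then corresponds to one teacher, so the "Other" column can be split off for qualitative analysis while the word columns feed the graphics.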

Related

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient has a different letter (-A to -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, look up the M/F value, and create a new row with the M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't know exactly what your data looks like, so I made up some based on your description. You should be able to adapt this answer to your needs and the structure of your dataset:
library(data.table)  # only needed for fread() below
# made-up gender file: one row per patient
genderfile <- data.frame("ID" = c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"), "Gender" = c("F", "M", "M", "F", "M"))
# you can read your own gender file into R with the line below
# genderfile <- read.csv("~/gender file.csv")
View(genderfile)
# made-up expression matrix: 3 genes x 15 samples
df <- matrix(rnorm(45, mean = 10, sd = 5), nrow = 3)
colnames(df) <- c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C", "BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C", "BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C", "BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df <- cbind(Gene = 1:3, df)
df <- as.data.frame(df)
# you can read your own main file into R with the line below; fread() keeps the dashes in the column names instead of turning them into periods (the data.table package must be installed and loaded)
# df <- fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column by df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
The last step is to add the Gender vector defined above to df as a row (note that this turns every column into character, because a column can only hold one type):
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
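The substr() call above assumes that every sample suffix is exactly one letter after the dash. As a hedged alternative, the sketch below strips everything from the last dash onwards with sub(), and keeps the gender in a separate per-sample annotation table rather than as a row of the expression data, so the expression values stay numeric. The sample IDs, the small gender table and the name `annotation` are made up for illustration.

```r
# Hypothetical sample IDs and gender lookup table
sample_ids <- c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY12-C", "BIOPSY200-Z")
genderfile <- data.frame(ID = c("BIOPSY1", "BIOPSY12", "BIOPSY200"),
                         Gender = c("F", "M", "F"),
                         stringsAsFactors = FALSE)

# Drop the last dash and everything after it to get the patient ID
patient <- sub("-[^-]*$", "", sample_ids)

# match() gives, for each sample, the row of genderfile with the same patient ID
annotation <- data.frame(Sample = sample_ids,
                         Patient = patient,
                         Gender = genderfile$Gender[match(patient, genderfile$ID)],
                         stringsAsFactors = FALSE)
```

With a table like this, the columns for one gender can be selected by name, for example with `annotation$Sample[annotation$Gender == "F"]`.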

R - Filter any rows and show all columns

I would like an output that shows the names of the columns that have rows containing a given string value. Assume the following...
Animals       Sex
I like Dogs   Male
I like Cats   Male
I like Dogs   Female
I like Dogs   Female
Data Missing  Male
Data Missing  Male
I found an SO thread here where David Arenburg provided an answer that works very well, but I was wondering if it is possible to get an output that doesn't show all the rows. So if I want to find the string "Data Missing", the output I would like to see is
Animals
Data Missing
or
Animals
TRUE
instead of
Animals       Sex
Data Missing  Male
Data Missing  Male
I have also found that filtering on a single column such as df$columnName works, but I have a big file with a large number of columns, so typing the column names would be tedious. Assume the string "Data Missing" also appears in other columns, and that there could be other strings to search for. That is why I like David Arenburg's answer. Bear in mind that my real data has more than the two columns in the sample above.
Cheers
One thing you could do is grep for "Data Missing" in every column like this:
x <- lapply(data, grep, pattern = "Data Missing")
sapply(x, length) > 0
This gives one TRUE/FALSE per column, TRUE wherever the string occurs at least once:
Animals     Sex
   TRUE   FALSE
which is the kind of result you're after. It's also good because it checks all columns, which you mentioned was something you wanted.
If we want only the first row where it matches, use match
data[match("Data Missing", data$Animals), "Animals", drop = FALSE]
# Animals
#5 Data Missing
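For the first output shape asked for in the question (the column name followed by the matching value), a minimal sketch is shown below. It rebuilds the sample data from the question; the name `matches` is made up, and `fixed = TRUE` assumes the search string should be treated as literal text rather than as a regular expression.

```r
data <- data.frame(
  Animals = c("I like Dogs", "I like Cats", "I like Dogs",
              "I like Dogs", "Data Missing", "Data Missing"),
  Sex = c("Male", "Male", "Female", "Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# For every column, collect the distinct values that contain the search string
matches <- lapply(data, function(col) {
  unique(col[grepl("Data Missing", col, fixed = TRUE)])
})

# Keep only the columns with at least one hit
matches <- Filter(length, matches)
```

`matches` is a named list, so `names(matches)` gives the columns that contain the string and each element holds the matching values without repeating rows.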

Combining Two GREPL Searches with 'OR' and adding a 'NOT'

I have this dataframe:
ID  Description
1   Tree fell on car
2   Tree was uprooted
3   While cutting tree, it came down
4   Tree came down
I am trying to search a column in a dataframe for weather words. I am doing this by using multiple grepl calls separated by an 'OR'. However, I want to combine two grepl calls to say "If the description has THIS WORD and THIS WORD, but not THIS WORD, it is weather". If you look at the dataframe above, one can assume that "Tree came down" would be classified as weather, but "While cutting tree, it came down" is not weather related.
The code that I tried, based on other Stack Overflow answers, is:
Data$Type<-ifelse(grepl(' Tree|^Tree|-Tree|:Tree',Data$DESCRIPTION,ignore.case=TRUE)&
grepl('^[^Cutting]*[Feel|Fell|Fall|Up Rooted|Uprooted|Came Down| Down|Knocked Onto|Caused Damage] [^Cutting]*$',Data$DESCRIPTION,ignore.case=TRUE)), "weather", "Not Classified")
But this is not working. I tried:
Data$Type<-ifelse(grepl(' Tree|^Tree|-Tree|:Tree',Data$DESCRIPTION,ignore.case=TRUE)&
grepl('Feel|Fell|Fall|Up Rooted|Uprooted|Came Down| Down|Knocked Onto|Caused Damage',Data$DESCRIPTION,ignore.case=TRUE) &
!grepl('Cutting',Data$DESCRIPTION,ignore.case=TRUE)), "Weather", "Not Classified")
I am expecting this outcome:
ID  Description                       Type
1   Tree fell on car                  "Weather"
2   Tree was uprooted                 "Weather"
3   While cutting tree, it came down  "Non-Weather"
4   Tree came down                    "Weather"
But these do not work. Thank you
Since you have only two cases (Weather and Non-Weather), I think it would be sufficient to use a single grepl:
df$Type <- ifelse(grepl(pattern = 'Tree|fell|^cutting', x = df$Description), 'Weather', 'Non-Weather')
df$Type
#[1] "Weather" "Weather" "Non-Weather" "Weather"
Note that the pattern is case sensitive: row 3 ends up as Non-Weather only because its lowercase "tree" does not match 'Tree'.
I ended up just doing things like this to make sure that "Ice" counts as a weather word, but not when "Maker" also appears:
ifelse(grepl('Ice$| Ice |,Ice |^Ice | Ice,',Data$DESCRIPTION,ignore.case=TRUE) &
!grepl('Maker',Data$DESCRIPTION,ignore.case=TRUE), "Weather", "Not Classified")
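The "THIS WORD and THIS WORD, but not THIS WORD" rule from the question can also be written as three separate logical vectors combined with & and !. The sketch below uses the four example rows; the word list is taken from the question's own pattern, the vector names are made up, and the `\\b` word boundaries (with `perl = TRUE`) are one way to stand in for the hand-written space, comma and start/end alternatives.

```r
Data <- data.frame(
  ID = 1:4,
  Description = c("Tree fell on car", "Tree was uprooted",
                  "While cutting tree, it came down", "Tree came down"),
  stringsAsFactors = FALSE
)

# One logical vector per condition; \\b matches at word boundaries
has_tree <- grepl("\\btree\\b", Data$Description, ignore.case = TRUE, perl = TRUE)
has_event <- grepl("\\b(fell|fall|uprooted|up rooted|came down|knocked onto|caused damage)\\b",
                   Data$Description, ignore.case = TRUE, perl = TRUE)
has_cutting <- grepl("\\bcutting\\b", Data$Description, ignore.case = TRUE, perl = TRUE)

# Weather = tree AND a weather event AND NOT cutting
Data$Type <- ifelse(has_tree & has_event & !has_cutting, "Weather", "Non-Weather")
```

Keeping each condition in its own vector makes it easy to inspect which rule a row failed, and more exclusion words can be added without touching the other patterns.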

Count how many times specific words are used

I want to perform text mining on several bank account descriptions. My first step would be to get a ranking of the words that are used the most in the descriptions.
So let's say I have a dataframe that looks like this:
  a b
1 1 House expenses
2 2 Office furniture bought
3 3 Office supplies ordered
Then I want to create a ranking of how often each word is used. Like this:
   Name       Times
1. Office     2
2. Furniture  1
Etc...
Any thoughts on how I can quickly get an overview of the words that are used most in the description?
Another way to do this is with the tm package.
You can create a corpus from the description column and build a document-term matrix:
require(tm)
# VectorSource() takes the text column directly; recent versions of tm expect DataframeSource() input to have doc_id and text columns
corpus <- Corpus(VectorSource(data$b))
dtm <- DocumentTermMatrix(corpus)
# as.matrix() converts the sparse matrix; inspect() only prints a preview
dtmDataFrame <- as.data.frame(as.matrix(dtm))
By default the matrix holds plain term frequencies (the "weightTf" weighting). I converted the document-term matrix into a data frame.
You now have one row per document and one column per term, and each value is the frequency of that term in that document. You can create the ranking by summing each column:
colSums(dtmDataFrame)
You can then sort the result, for example with sort(colSums(dtmDataFrame), decreasing = TRUE). The advantage of tm is that you can easily filter words out and preprocess the text: removing stop words and punctuation, stemming, and dropping sparse terms if you need to.
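# Base R alternative: split the descriptions into words and count each one
d <- data.frame(a = c(1, 2, 3), b = c("House expenses", "Office furniture bought", "Office supplies ordered"), stringsAsFactors = FALSE)
e <- unlist(strsplit(d$b, " "))            # all words as one vector
f <- e[!e %in% c("")]                      # drop empty strings
g <- sapply(f, function(x) sum(f %in% x))  # how often each word occurs
h <- data.frame(Name = names(g), Times = g)
h[!duplicated(h), ]                        # keep one row per distinct word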
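The same ranking can probably be had more directly with table(), which counts every distinct value in one pass. This is a minimal sketch on the example data; lower-casing the text and splitting on anything that is not a letter are assumptions about how words should be delimited, and the names `words`, `counts` and `ranking` are made up.

```r
d <- data.frame(a = 1:3,
                b = c("House expenses", "Office furniture bought",
                      "Office supplies ordered"),
                stringsAsFactors = FALSE)

# Lower-case the text, split on runs of non-letters, and drop empty pieces
words <- unlist(strsplit(tolower(d$b), "[^a-z]+"))
words <- words[nzchar(words)]

# table() counts each distinct word; sort() puts the most frequent first
counts <- sort(table(words), decreasing = TRUE)
ranking <- data.frame(Name = names(counts), Times = as.integer(counts),
                      stringsAsFactors = FALSE)
```

`head(ranking, 10)` would then show the ten most frequent words.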

Pasting (or merging) two elements of a column together

I have two sources of clinical procedure billing information that I have added together (with rbind). In each row there is a CPT field and a CPT.description field that supplies a brief explanation. However, the descriptions are slightly different between the two sources. I want to be able to combine them. That way, if different words or abbreviations are used, I can just do a string search to find what I am looking for.
So let's make up a simplified representation of a data table that I was able to generate.
cpt <- c(23456,23456,10000,44555,44555)
description <- c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy")
cpt.desc <- data.frame(cpt,description)
And here is what I want to get to.
cpt.wanted <- c(23456,10000,44555)
description.wanted <- c("tonsillectomy; tonsillectomy in >12 year old","brain transplant","castration; orchidectomy")
cpt.desc.wanted <- data.frame(cpt.wanted,description.wanted)
I have tried using functions such as unstack and then lapply(list,paste), but that does not paste together the elements of each list. I also tried reshape, but there was no categorical variable to differentiate the first or second version of the description, or even in some cases a third. The really annoying part is that I had a similar problem a few months or years ago and someone helped me either on Stack Overflow or on r-help, and for the life of me I cannot find it.
So the underlying problem is this: imagine that I have a spreadsheet in front of me. I need to do a vertical merge (paste) of two or maybe even three description cells that have the same CPT code in the adjacent column.
What buzzwords should I have been using to search for a solution to this problem?
Thank you so much for your help.
sapply( sapply(unique(cpt), function(x) which(cpt == x), simplify = FALSE ),
# creates sets of index vectors as a list (which() matches codes exactly; grep() would also match partial codes)
function(x) paste(description[x], collapse=";") )
# ... and this pastes each set of selected items from "description" vector
#[1] "tonsillectomy;tonsillectomy in >12 year old"
#[2] "brain transplant"
#[3] "castration;orchidectomy"
Here is an approach that uses plyr.
library("plyr")
cpt.desc.wanted <- ddply(cpt.desc, .(cpt), summarise,
description.wanted = paste(unique(description), collapse="; "))
which gives
> cpt.desc.wanted
    cpt                           description.wanted
1 10000                             brain transplant
2 23456 tonsillectomy; tonsillectomy in >12 year old
3 44555                     castration; orchidectomy
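If adding a package is not an option, the same grouping can be sketched in base R with aggregate(). This uses the example data from the question; the only assumption is that the descriptions are stored as character rather than factor.

```r
cpt <- c(23456, 23456, 10000, 44555, 44555)
description <- c("tonsillectomy", "tonsillectomy in >12 year old",
                 "brain transplant", "castration", "orchidectomy")
cpt.desc <- data.frame(cpt, description, stringsAsFactors = FALSE)

# Group the rows by cpt and paste the distinct descriptions of each group together
cpt.desc.wanted <- aggregate(description ~ cpt, data = cpt.desc,
                             FUN = function(x) paste(unique(x), collapse = "; "))
```

The result has one row per CPT code, sorted by code, with the combined text in the `description` column. Search terms that tend to lead to this kind of solution include "collapse rows by group" and "concatenate strings by group".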

Resources