Transpose R data frame with text data

I have been getting more familiar with R and learning about long and wide data frames. I am getting decent at using dcast (and ddply), but as far as I can tell, they rely on my data being numerical. In the following example, I have:
data.frame(color=c("red","orange","blue","white"),safe=c("N","N","Y","Y"))
The example reflects the old assumption that insurance companies penalized "risky colors" of cars as being less safe. I'd like a command to turn this into a wide table. Is there a flavor or syntax of dcast I'm missing that would turn the above table into
red | orange | blue | white
N | N | Y | Y
Thanks for any help.

Maybe the transpose function, t()?
a <- data.frame(color=c("red","orange","blue","white"),safe=c("N","N","Y","Y"))
# Transpose; t() returns a character matrix, so convert it back to a data frame.
new_a <- data.frame(t(a), stringsAsFactors=FALSE)
# Use the first row (the colors) as the column names.
names(new_a) <- unlist(new_a[1,])
# Drop the first row, which is now redundant.
new_a <- new_a[-1,]
> new_a
red orange blue white
safe N N Y Y
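A slightly shorter variant of the same idea, as a minimal sketch in base R: transpose only the value column and take the names from the other column. The object name `wide` is an assumption made for illustration, and the sketch assumes each color appears exactly once.

```r
a <- data.frame(color = c("red", "orange", "blue", "white"),
                safe  = c("N", "N", "Y", "Y"),
                stringsAsFactors = FALSE)

# t() turns the character vector into a 1 x 4 matrix;
# as.data.frame() then gives one column per color.
wide <- as.data.frame(t(a$safe), stringsAsFactors = FALSE)

# Label the columns with the colors and the single row with the variable name.
names(wide) <- a$color
rownames(wide) <- "safe"

print(wide)
```

Because nothing is aggregated, the text values pass through unchanged, which is why this works where a numeric summary would not.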

Related

R - Change column variable from categorical value to nominal

I have a CSV dataset where a column X holds values from 1 to 4, which I would like to replace with ["Low","Medium Low","Medium High","High"] according to the value. So dataset$X would become a vector of those categories instead of a vector of numbers.
I've checked this example, but it seems like a complicated version of what I'm trying to do (since the mapping is from fixed values to fixed categories, there should be an easier and cleaner way). Any suggestions on how to do it?
PS: I first tried "levels" and "cut", but since each value is a single fixed number and not a range, that didn't work properly.
You can use X as an index into the vector of category labels.
dataset$X <- c("Low","Medium Low","Medium High","High")[dataset$X]
dataset
# X
#1 Low
#2 Medium Low
#3 Medium High
#4 High
Data:
dataset <- data.frame(X=1:4)
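If a factor is preferred over a plain character vector, a minimal sketch with factor() might look like the following. The sample values in `dataset` and the name `lvl_labels` are assumptions made for illustration, and the sketch assumes X contains only the values 1 to 4.

```r
# Hypothetical sample data: X holds the codes 1 to 4 in arbitrary order.
dataset <- data.frame(X = c(3, 1, 4, 2, 1))

lvl_labels <- c("Low", "Medium Low", "Medium High", "High")

# Map code 1 to "Low", 2 to "Medium Low", and so on; the level order is preserved.
dataset$X <- factor(dataset$X, levels = 1:4, labels = lvl_labels)

print(dataset)
```

Unlike plain indexing, this keeps the category order, which can matter for sorting and plotting.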

Summarizing R corpus with doc ID

I've created a DocumentTermMatrix similar to the one in this post:
Keep document ID with R corpus
I've maintained the doc_id so I can join the data back to a larger data set.
My issue is that I can't figure out how to summarize the words and word counts while keeping the doc_id. I'd like to be able to join this data to an existing data set using only 3 columns (doc_id, word, freq).
Without the doc_id this is straightforward, and I use this code to get my end result.
df_source=DataframeSource(df)
df_corpus=VCorpus(df_source)
tdm=TermDocumentMatrix(df_corpus)
tdm_m=as.matrix(tdm)
word_freqs=sort(rowSums(tdm_m), decreasing = TRUE)
tdm_sorted=data.frame(word = names(word_freqs), freq = word_freqs)
I've tried several different approaches and cannot get any of them to work. Currently I use this code:
tdm_m=cbind("doc.id" =rownames(tdm_m),tdm_m)
to move the doc_id into a column in the matrix, but I cannot get the numeric columns to sum while keeping the doc_id associated.
Any help is greatly appreciated, thanks!
Expected result:
doc.id | word | frequency
1 | Apple | 2
2 | Apple | 1
3 | Banana | 4
3 | Orange | 1
4 | Pear | 3
Judging by your expected output, you don't need the line word_freqs=sort(rowSums(tdm_m), decreasing = TRUE). It sums each word across all documents, giving Apple = 3 instead of 2 and 1 in separate documents.
To get the output you want, DocumentTermMatrix is slightly easier to use than TermDocumentMatrix, because there is no need to switch rows and columns around. Here are two ways to get the result: one with melt from the reshape2 package and one with the tidy function from the tidytext package.
# example 1
dtm <- DocumentTermMatrix(df_corpus)
dtm_df <- reshape2::melt(as.matrix(dtm))
# remove 0 values and order the data.frame
dtm_df <- dtm_df[dtm_df$value > 0, ]
dtm_df <- dtm_df[order(dtm_df$value, decreasing = TRUE), ]
Or use tidytext::tidy to get the data into a tidy format. There is no need to remove the 0 values, as tidytext doesn't transform the data into a matrix before casting it into a data.frame.
# example 2
dtm_tidy <- tidytext::tidy(dtm)
# order the data.frame or start using dplyr syntax if needed
dtm_tidy <- dtm_tidy[order(dtm_tidy$count, decreasing = TRUE), ]
In my tests tidytext is a lot faster and uses less memory as there is no need to first create a dense matrix.
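To illustrate only the reshaping step without depending on tm, here is a minimal base R sketch. The hand-built count matrix `m` is a stand-in for as.matrix(dtm); its contents, the dimension names `Docs` and `Terms`, and the column name `freq` are assumptions made for illustration.

```r
# Hypothetical document-term counts: rows are documents, columns are terms.
m <- matrix(c(2, 0, 0,
              1, 0, 0,
              0, 4, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(Docs  = c("1", "2", "3"),
                            Terms = c("apple", "banana", "orange")))

# as.table() + as.data.frame() gives one row per (document, term) pair.
long <- as.data.frame(as.table(m), responseName = "freq",
                      stringsAsFactors = FALSE)

# Keep only the terms that actually occur in a document, ordered by document.
long <- long[long$freq > 0, ]
long <- long[order(long$Docs, long$Terms), ]

print(long)
```

The result has the three columns needed for a join (document id, word, frequency). On a large corpus the dense matrix can become expensive, which is the memory concern raised above.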

Writing a while loop for two sets of data for R

This is probably simple, but I'm new to R and it doesn't work like GrADS, so I've been searching high and low for examples, to no avail.
I have two sets of data. Data A (1997) and Data B (2000)
Data A has 35 headings (apples, orange, grape etc). 200 observations.
Data B has 35 headings (apples, orange, grape, etc). 200 observations.
The only difference between the two datasets is the year.
I would like to correlate the two datasets column by column, i.e. the 200 values under Apples (1997) vs the 200 values under Apples (2000), so each heading should give me exactly one value.
I've converted all the header names to V1,V2,V3...
So now I need to do this:
x<-1
while(x<35) {
new(x)=cor(1997$V(x),2000$V(x))
print(new(x))
}
and then I get this error:
Error in pptn26$V(x) : attempt to apply non-function.
Any advice is highly appreciated!
Your error comes directly from using parentheses where R isn't expecting them. You'll get the same type of error if you do 1(x): 1 is not a function, so putting it directly before parentheses is an attempt to apply a non-function.
I'm also a bit surprised at how you are managing to get all the way to that error, before running into several others, but I suppose that has something to do with when R evaluates what...
Here's how to get the behavior you're looking for:
mapply(cor, A, B)
# provided A is the name of your 1997 data frame and B the 2000
Here's an example with simulated data:
set.seed(123)
A <- data.frame(x = 1:10, y = sample(10), z = rnorm(10))
B <- data.frame(x = 4:13, y = sample(10), z = rnorm(10))
mapply(cor, A, B)
# x y z
# 1.0000000 0.1393939 -0.2402058
In its typical usage, mapply takes an n-ary function and n objects that provide the n arguments for that function. Here the function is cor, which takes two arguments, and the objects are A and B, each a data frame. A data frame is structured as a list of vectors (its columns), so mapply loops along the columns for you, making 35 calls to cor, each time with the next column of both A and B.
If you have managed to figure out how to name your data frames 1997 and 2000, kudos. It's not easy to do that. It's also going to cause you headaches. You'll want to have a syntactically valid name for your data frame(s). That means they should start with a letter (or a dot, but really a letter). See the R FAQ for the details.
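For comparison, the loop from the question can also be written out explicitly. This is only a sketch on simulated data: the data frame names `d1997` and `d2000` and the vector `result` are hypothetical, and three columns stand in for the 35. Compared with the original loop, columns are selected with `[[x]]` instead of `$V(x)`, the counter is incremented inside the loop, and the condition uses `<=` so the last column is not skipped.

```r
set.seed(42)
# Simulated stand-ins for the two years of data (syntactically valid names).
d1997 <- data.frame(V1 = rnorm(20), V2 = rnorm(20), V3 = rnorm(20))
d2000 <- data.frame(V1 = rnorm(20), V2 = rnorm(20), V3 = rnorm(20))

# Pre-allocate one correlation per column.
result <- numeric(ncol(d1997))
names(result) <- names(d1997)

x <- 1
while (x <= ncol(d1997)) {
  # [[x]] extracts the x-th column as a plain vector.
  result[x] <- cor(d1997[[x]], d2000[[x]])
  x <- x + 1  # without this the loop never terminates
}

print(result)
```

The values should match those from mapply(cor, d1997, d2000), which remains the more idiomatic form.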

combining dataframe columns into two dimensional matrix

Is there a simple way to transform this data frame into the form below? I thought I could just take the desired column and cast it to a matrix, but that didn't work.
set.seed(1)
data1 <- data.frame(dv=rep(c("low","high"),3), iv1=rep(c("A","B","C"),2), freq=runif(6))
as.matrix(data1[,3],ncol=3) # this didn't work
GOAL:
# A B C
#high .28 .32 .39
#low .31 .36 .31
We can try
xtabs(freq~dv+iv1, data1)
Or
library(reshape2)
acast(data1, dv~iv1, value.var='freq')
Or
with(data1, tapply(freq, list(dv, iv1), FUN=I))
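All three calls place each freq value in the cell named by its dv and iv1 labels, which as.matrix cannot do because it never looks at those columns. A minimal sketch that makes this placement explicit with base R matrix indexing follows; the object names `m` and `idx` are assumptions made for illustration, and the sketch assumes each dv/iv1 combination occurs exactly once.

```r
set.seed(1)
data1 <- data.frame(dv   = rep(c("low", "high"), 3),
                    iv1  = rep(c("A", "B", "C"), 2),
                    freq = runif(6),
                    stringsAsFactors = FALSE)

# Empty 2 x 3 matrix whose dimnames are the distinct labels.
m <- matrix(NA_real_, nrow = 2, ncol = 3,
            dimnames = list(sort(unique(data1$dv)),
                            sort(unique(data1$iv1))))

# A two-column character matrix indexes cells by (row name, column name).
idx <- cbind(as.character(data1$dv), as.character(data1$iv1))
m[idx] <- data1$freq

print(m)
```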

Plot a tree in R given pairs of leaves and heights where they merge

I have a list of leaves in a tree and the height at which I'd like them to merge, i.e. the height of their most recent common ancestor. All leaves are assumed to be at height 0. A toy example might look like:
as.data.frame(rbind(c("a","b",1),c("c","d",2),c("a","d",4)))
V1 V2 V3
1 a b 1
2 c d 2
3 a d 4
I want to plot a tree representing this data. I know that R can plot trees coming from hclust. How do I get my data into the format returned by hclust or into some other format that is easily plotted?
Edited to add diagram:
The tree for the above dataset looks like this:
__|___
| |
| _|_
_|_ | |
| | | |
a b c d
What you have is a hierarchical clustering that is already specified (in your own data format), and you would like to use R's plotting facilities on it. This does not seem to be easy. The only way I can see to achieve it is to create an object like the one returned by hclust. Its components "merge", "height", "order", "labels", "method", "call" and "dist.method" are all fairly easy to understand. Someone already tried this: https://stat.ethz.ch/pipermail/r-help/2006-February/089170.html but apparently still had issues. Alternatively, you could fill in a distance matrix with dummy values that are consistent with your clustering and then pass it to hclust, e.g.
a <- matrix(ncol=4,nrow=4, c(0,1,4,4,1,0,4,4,4,4,0,2,4,4,2,0))
b <- hclust(as.dist(a), method="single")
plot(b, hang=-1)
This could perhaps be useful.
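The dummy distance matrix can also be filled in from the pair list instead of by hand. The following is a sketch, not a general solution: the names `merges`, `cluster`, `d` and `tree` are hypothetical, and it assumes the pairs describe a consistent tree in which every row joins two different existing groups.

```r
# The toy data from the question, with numeric heights.
merges <- data.frame(V1 = c("a", "c", "a"),
                     V2 = c("b", "d", "d"),
                     V3 = c(1, 2, 4),
                     stringsAsFactors = FALSE)

leaves <- sort(unique(c(merges$V1, merges$V2)))
d <- matrix(0, length(leaves), length(leaves),
            dimnames = list(leaves, leaves))

# Each leaf starts in its own group; process merges from lowest to highest.
cluster <- setNames(seq_along(leaves), leaves)
merges <- merges[order(merges$V3), ]

for (i in seq_len(nrow(merges))) {
  left  <- names(cluster)[cluster == cluster[[merges$V1[i]]]]
  right <- names(cluster)[cluster == cluster[[merges$V2[i]]]]
  # Every leaf on the left is at the merge height from every leaf on the right.
  d[left, right] <- merges$V3[i]
  d[right, left] <- merges$V3[i]
  cluster[right] <- cluster[[merges$V1[i]]]
}

tree <- hclust(as.dist(d), method = "single")
plot(tree, hang = -1)
```

For the toy data this reproduces the hand-written matrix above, so single-linkage clustering merges at heights 1, 2 and 4.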
