Comparing columns and extracting information from them - r

I have 2 columns of data that I pulled from a dataset using this code:
ID <- matrix(c(df[[2]], df[[19]]), nrow = 737, ncol = 2)
I have uploaded a small example of this table here http://imgur.com/aGQ02It
The first column contains codes that relate to a location. The most important part of each code is the first 4 digits, which identify the town, e.g. 6011 = Town A.
The second column is a key coded from 1 to 6 that tells me which of 6 species was found in this town.
I was hoping to find a way for R to run through these columns to produce a matrix that will tell me which species occurred in which town? So I guess the table would look something like this...
|Town| Species 1| Species 2| Species 3|
|6011| 21| 23| 15|
|6013| 21| 23| 15|
So somehow I need to go through the matrix, grouping the town column by its first 4 digits while at the same time counting the number of each species in each town.
I have used substr function in the past to extract information from a matrix to use, but I'm not sure how to do something as complex as this.
I would really appreciate any help!
Thank you.

You can do this by:
1. creating a data.frame from (i) the substr result (see ?substr) on the first column and (ii) the second column of your matrix;
2. using table on it.
Your example is not reproducible, so here is a matrix, m, that looks like yours:
m <- matrix(c(
"6011-0001", "1",
"6011-0002", "2",
"6011-0003", "2",
"6012-0001", "1",
"6012-0002", "2",
"6012-0003", "2",
"6012-0004", "4"), ncol=2, byrow=TRUE)
Then:
table(data.frame(town=substr(m[, 1], 1, 4), sp=m[, 2]))
Using a data.frame rather than a matrix would make subsequent operations easier.
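As a minimal sketch of the data.frame route, assuming the same toy data as m above: the object name obs and the column names id, sp and town are made up for illustration and are not from the original dataset.

```r
# Toy data mirroring the matrix m above (names are illustrative only)
obs <- data.frame(
  id = c("6011-0001", "6011-0002", "6011-0003",
         "6012-0001", "6012-0002", "6012-0003", "6012-0004"),
  sp = c("1", "2", "2", "1", "2", "2", "4"),
  stringsAsFactors = FALSE
)

# Keep only the first 4 characters of the code: the town
obs$town <- substr(obs$id, 1, 4)

# Cross-tabulate town against species: one row per town, one column per species
counts <- table(town = obs$town, species = obs$sp)

# Convert the contingency table to a wide data.frame for further work
counts_df <- as.data.frame.matrix(counts)
print(counts_df)
```

Species that never occur in a town show up as 0 in that town's row, so the result has the towns-by-species shape asked for in the question.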


Summarizing R corpus with doc ID

I've created a DocumentTermMatrix similar to the one in this post:
Keep document ID with R corpus
Where I've maintained the doc_id so I can join the data back to a larger data set.
My issue is that I can't figure out how to summarize the words and word count and keep the doc_id. I'd like to be able to join this data to an existing data set using only 3 columns (doc_id, word, freq).
Without needing the doc_id, this is straightforward, and I use this code to get my end result.
df_source=DataframeSource(df)
df_corpus=VCorpus(df_source)
tdm=TermDocumentMatrix(df_corpus)
tdm_m=as.matrix(tdm)
word_freqs=sort(rowSums(tdm_m), decreasing = TRUE)
tdm_sorted=data.frame(word = names(word_freqs), freq = word_freqs)
I've tried several different approaches to this and just cannot get it to work. This is where I am now (image). I've used this code:
tdm_m=cbind("doc.id" =rownames(tdm_m),tdm_m)
to move the doc_id into a column in the matrix, but cannot get the numeric columns to sum and keep the doc_id associated.
Any help, greatly appreciated, thanks!
Expected result:
doc.id | word | frequency
1 | Apple | 2
2 | Apple | 1
3 | Banana | 4
3 | Orange | 1
4 | Pear | 3
Judging by your expected output, you don't need the line word_freqs=sort(rowSums(tdm_m), decreasing = TRUE), because it computes the total count of each word across all documents (e.g. Apple = 3) rather than the per-document counts (2 and 1).
To get the output you want, using DocumentTermMatrix instead of TermDocumentMatrix is slightly easier, since there is no need to switch the columns around. Here are two ways to get the result: one with melt from the reshape2 package and one with the tidy function from the tidytext package.
# example 1
dtm <- DocumentTermMatrix(df_corpus)
dtm_df <- reshape2::melt(as.matrix(dtm))
# remove 0 values and order the data.frame
dtm_df <- dtm_df[dtm_df$value > 0, ]
dtm_df <- dtm_df[order(dtm_df$value, decreasing = TRUE), ]
Or use tidytext::tidy to get the data into a tidy format. There is no need to remove the 0 values, because tidytext does not convert the data into a dense matrix before casting it into a data.frame.
# example 2
dtm_tidy <- tidytext::tidy(dtm)
# order the data.frame or start using dplyr syntax if needed
dtm_tidy <- dtm_tidy[order(dtm_tidy$count, decreasing = TRUE), ]
In my tests tidytext is a lot faster and uses less memory as there is no need to first create a dense matrix.
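To illustrate what the wide-to-long step does without needing tm, reshape2 or tidytext, here is a base R sketch. It assumes a small dense document-by-term matrix like the one as.matrix(dtm) would return; the matrix m and its values are hypothetical and simply mirror the expected result in the question.

```r
# Hypothetical dense document-term matrix (rows = documents, columns = terms)
m <- matrix(c(2, 0, 0, 0,
              1, 0, 0, 0,
              0, 4, 1, 0,
              0, 0, 0, 3),
            nrow = 4, byrow = TRUE,
            dimnames = list(Docs  = as.character(1:4),
                            Terms = c("Apple", "Banana", "Orange", "Pear")))

# Wide -> long: one row per (document, term) pair
long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(long) <- c("doc.id", "word", "freq")

# Drop the zero counts and order by document, then word
long <- long[long$freq > 0, ]
long <- long[order(long$doc.id, long$word), ]
print(long)
```

The result keeps the three columns (doc.id, word, freq) needed to join back to the larger data set. Like the melt approach, it goes through a dense matrix, so the memory caveat above applies here too.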

Sparklyr : separate rows on 2 columns

I am using sparklyr for a project. I have a Spark DataFrame with lists in some of the columns and I'd like to separate them into multiple rows, i.e. have one value in each row, exactly like separate_rows does in tidyr.
So basically my dataframe is like this
| x | y
1| [a,b] | [c,d]
And I'd like to have something like this in the end :
| x | y
1| a | c
2| b | d
As suggested in this post, explode is a good start, but it can only handle one column at a time, and if I use it twice, I will end up with 4 rows here instead of the 2 I want. In this very simple example, I could find a way to keep only the rows that I want, but things can get messier if there are more than two elements in the lists.
Something I thought about would be to do :
1) Merge the columns x and y into a single column which would contain [[a,c], [b,d]]
2) Then use explode to get [a,c] and then [b,d]
3) Then explode again, but into columns (rather than into rows).
Only I don't know how to do 1) and 3).
Thank you for the help!
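The three-step plan above can be sketched on local data in base R. This is not sparklyr code and has not been run against Spark; it only illustrates the zip, explode, split-into-columns logic on ordinary R lists, and the names x, y, zipped, pairs and result are made up for the example.

```r
# Local stand-in for one row of the table: x = [a, b], y = [c, d]
x <- list(c("a", "b"))
y <- list(c("c", "d"))

# Step 1: zip x and y element-wise within each row -> [[a, c], [b, d]]
zipped <- Map(function(xs, ys) Map(c, xs, ys), x, y)

# Step 2: "explode" the zipped column -> one (x, y) pair per output row
pairs <- unlist(zipped, recursive = FALSE, use.names = FALSE)

# Step 3: split each pair into separate columns
pair_matrix <- do.call(rbind, pairs)
result <- data.frame(x = pair_matrix[, 1], y = pair_matrix[, 2],
                     stringsAsFactors = FALSE)
print(result)
```

Because the two lists are zipped before exploding, the output has one row per position in the lists (2 here) rather than the 4 rows produced by exploding each column separately.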
Here is a reproducible example obtained with collect and dput :
structure(list(ref_amount = list(list(967.66, 1592.56), list(
967.66, 1592.56)), ref_theta = list(list(5.26977034898459,
5.16119062369122), list(5.26977034898459, 5.16119062369122))), .Names = c("ref_amount",
"ref_theta"), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))

Creating a dictionary for tabular data in Julia

I have tabular data like:
+---+----+----+
| | a | b |
+---+----+----+
| P | 1 | 2 |
| Q | 10 | 20 |
+---+----+----+
and I want to represent this using a Dict.
With the column and row names:
x = ["a", "b"]
y = ["P", "Q"]
and data
data = [ 1 2 ;
10 20 ]
how may I create a dictionary object d, so that d["a", "P"] = 1 and so on? Is there a way like
d = Dict(zip(x,y,data))
?
Your code works with a minor change to use Iterators.product:
d = Dict(zip(Iterators.product(x, y), data.'))
To do this you need to add the line "using Iterators" to your project, and might need to run Pkg.add("Iterators") first. Because Julia matrices are column-major (elements are stored in order within columns, and columns are stored in order within the matrix), we need to transpose the data matrix using the transpose operator .'. (In Julia 1.0 and later, Iterators.product is available from Base without installing a package, and the .' operator has been removed, so permutedims(data) takes the place of data.'.)
This is a literal answer to your question, but I don't recommend doing it this way. If you have tabular data, it's probably better to use a DataFrame. DataFrames are not indexed by name in both dimensions (rows have no names), but that can be worked around by adding an additional column and using select.

Transpose R data frame with text data

I have been getting more familiar with R and learning about long and wide data frames. I am getting decent at using dcast (and ddply), but as far as I can tell, they rely on my data being numerical. In the following example, I have:
data.frame(color=c("red","orange","blue","white"),safe=c("N","N","Y","Y"))
The example reflects the old assumption that insurance companies penalize "risky colors" of cars as being less safe. I'd like a command to turn this into a wide table. Is there a flavor or syntax of dcast I'm missing that would turn the above table into
red | orange | blue | white
N | N | Y | Y
Thanks for any help.
Maybe the transpose function?
a <- data.frame(color=c("red","orange","blue","white"),safe=c("N","N","Y","Y"))
# transpose and make it a dataframe.
new_a <- data.frame(t(a), stringsAsFactors=FALSE)
# use the first row (the colors) as the column names of the new data frame
names(new_a) <- unlist(new_a[1, ])
# now you can get rid of the first row.
new_a <- new_a[-1,]
> new_a
red orange blue white
safe N N Y Y
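A shorter base R variant of the same idea, offered as a sketch rather than a replacement: name the safe values by color, then transpose the named vector into a one-row data frame. The names safe_by_color and wide are illustrative only.

```r
a <- data.frame(color = c("red", "orange", "blue", "white"),
                safe  = c("N", "N", "Y", "Y"),
                stringsAsFactors = FALSE)

# Name each safe value by its color
safe_by_color <- setNames(a$safe, a$color)

# t() turns the named vector into a 1 x 4 matrix whose column names are the colors
wide <- as.data.frame(t(safe_by_color), stringsAsFactors = FALSE)
print(wide)
```

This avoids the extra step of dropping the first row, since the colors become column names directly.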

Plot a tree in R given pairs of leaves and heights where they merge

I have a list of leaves in a tree and the height at which I'd like them to merge, i.e. the height of their most recent common ancestor. All leaves are assumed to be at height 0. A toy example might look like:
as.data.frame(rbind(c("a","b",1),c("c","d",2),c("a","d",4)))
V1 V2 V3
1 a b 1
2 c d 2
3 a d 4
I want to plot a tree representing this data. I know that R can plot trees coming from hclust. How do I get my data into the format returned by hclust or into some other format that is easily plotted?
Edited to add diagram:
The tree for the above dataset looks like this:
__|___
| |
| _|_
_|_ | |
| | | |
a b c d
What you have is a hierarchical clustering that is already specified (in your own data format), and you would like to use R's plotting facilities on it. This does not seem to be easy. The only way I can see to achieve it directly is to create an object like the one returned by hclust. It has the components "merge", "height", "order", "labels", "method", "call" and "dist.method", which are all fairly easy to understand. Someone already tried this: https://stat.ethz.ch/pipermail/r-help/2006-February/089170.html but apparently still had issues. Alternatively, you could fill in a distance matrix with dummy values that are consistent with your clustering (the distance between two leaves is the height at which they merge), then pass it to hclust. E.g.
a <- matrix(ncol=4,nrow=4, c(0,1,4,4,1,0,4,4,4,4,0,2,4,4,2,0))
b <- hclust(as.dist(a), method="single")
plot(b, hang=-1)
This could perhaps be useful.
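The distance matrix above was filled in by hand. As a sketch of how it could be derived from the question's table of pairs, the loop below processes the merges in order of height and sets the distance between every leaf on one side and every leaf on the other to the merge height. It assumes the heights are numeric and that each row joins two clusters that are not yet joined; the names merges, d and cluster are made up for the example.

```r
# The question's toy data, with numeric heights (an assumption: the original V3 is character)
merges <- data.frame(V1 = c("a", "c", "a"),
                     V2 = c("b", "d", "d"),
                     V3 = c(1, 2, 4),
                     stringsAsFactors = FALSE)

leaves <- sort(unique(c(merges$V1, merges$V2)))
d <- matrix(0, length(leaves), length(leaves), dimnames = list(leaves, leaves))

# Cluster id of each leaf; every leaf starts in its own cluster
cluster <- setNames(seq_along(leaves), leaves)

for (i in order(merges$V3)) {
  id_left  <- cluster[[merges$V1[i]]]
  id_right <- cluster[[merges$V2[i]]]
  if (id_left == id_right) next            # already merged, nothing to do
  left  <- names(cluster)[cluster == id_left]
  right <- names(cluster)[cluster == id_right]
  d[left, right] <- merges$V3[i]           # distance between leaves = merge height
  d[right, left] <- merges$V3[i]
  cluster[right] <- id_left                # the two clusters are now one
}

tree <- hclust(as.dist(d), method = "single")
plot(tree, hang = -1)
```

For the toy data this reproduces the hand-written matrix a, so the resulting plot should match the one from the snippet above.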
