I have this data frame with date, Mkt, Rf and then 237 variables which have numbered names. I want to subtract the variable Rf from all 237 numbered variables. I have tried
df[,4:240] = df[,4:240] - df[,3]
but it doesn't seem to work. I'm assuming I would have to create a loop for this type of subtraction but I don't know how I would add the Rf column to subtract inside the loop.
| |Date |Mkt |Rf |10094|10098|10115|...
|:-|:---------|:----|:----|:----|:----|:----|...
|1 |01-01-1997|0.056|0.006|0.002|0.034|0.564|...
|2 |01-02-1997|0.653|0.009|0.009|0.052|0.445|...
You could use this simple for loop:
for(column in 4:240){
df[,column]=df[,column]-df[,3]
}
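If you would rather avoid the explicit loop, the same subtraction can be written with lapply. This is only a sketch on a two-row toy frame built from the values shown in the question; the names asset_cols, loop_result and vec_result are made up for illustration, and on the real data asset_cols would be 4:240.

```r
# Toy data frame using the first rows shown in the question
df <- data.frame(Date = c("01-01-1997", "01-02-1997"),
                 Mkt = c(0.056, 0.653),
                 Rf = c(0.006, 0.009),
                 `10094` = c(0.002, 0.009),
                 `10098` = c(0.034, 0.052),
                 check.names = FALSE)  # keep the numeric column names as-is

asset_cols <- 4:ncol(df)  # every column after Rf (hypothetical helper name)

# Loop version, as in the answer above
loop_result <- df
for (column in asset_cols) {
  loop_result[, column] <- loop_result[, column] - loop_result[, 3]
}

# Loop-free version: subtract Rf from each selected column
rf <- df$Rf
vec_result <- df
vec_result[asset_cols] <- lapply(df[asset_cols], function(x) x - rf)

# Both approaches should agree
all.equal(loop_result, vec_result)
```

Selecting the columns by position relative to Rf, rather than hard-coding 4:240, keeps the code working if the number of numbered columns changes.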
I've created a DocumentTermMatrix similar to the one in this post:
Keep document ID with R corpus
Where I've maintained the doc_id so I can join the data back to a larger data set.
My issue is that I can't figure out how to summarize the words and word count and keep the doc_id. I'd like to be able to join this data to an existing data set using only 3 columns (doc_id, word, freq).
Without needing the doc_id, this is straightforward and I use this code to get my end result.
df_source=DataframeSource(df)
df_corpus=VCorpus(df_source)
tdm=TermDocumentMatrix(df_corpus)
tdm_m=as.matrix(tdm)
word_freqs=sort(rowSums(tdm_m), decreasing = TRUE)
tdm_sorted=data.frame(word = names(word_freqs), freq = word_freqs)
I've tried several different approaches to this and just cannot get it to work. This is where I am now (image). I've used this code:
tdm_m=cbind("doc.id" =rownames(tdm_m),tdm_m)
to move the doc_id into a column in the matrix, but cannot get the numeric columns to sum and keep the doc_id associated.
Any help, greatly appreciated, thanks!
Expected result:
doc.id | word | frequency
1 | Apple | 2
2 | Apple | 1
3 | Banana | 4
3 | Orange | 1
4 | Pear | 3
Looking at your expected output, you don't need the line word_freqs=sort(rowSums(tdm_m), decreasing = TRUE), because it sums each word across all documents: you get Apple = 3 rather than 2 and 1 in separate documents.
To get the output you want, DocumentTermMatrix is slightly easier than TermDocumentMatrix, since there is no need to switch rows and columns around. Here are two ways to get the result: one with melt from the reshape2 package and one with the tidy function from the tidytext package.
# example 1
dtm <- DocumentTermMatrix(df_corpus)
dtm_df <- reshape2::melt(as.matrix(dtm))
# remove 0 values and order the data.frame
dtm_df <- dtm_df[dtm_df$value > 0, ]
dtm_df <- dtm_df[order(dtm_df$value, decreasing = TRUE), ]
Or use tidytext::tidy to get the data into a tidy format. There is no need to remove the 0 values, as tidytext doesn't convert the data into a dense matrix before casting it into a data.frame.
# example 2
dtm_tidy <- tidytext::tidy(dtm)
# order the data.frame or start using dplyr syntax if needed
dtm_tidy <- dtm_tidy[order(dtm_tidy$count, decreasing = TRUE), ]
In my tests tidytext is a lot faster and uses less memory as there is no need to first create a dense matrix.
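If it helps to see the reshaping step on its own, here is a minimal base R sketch that does not need tm installed. The matrix m is hand-made to mirror the expected output in the question and only stands in for as.matrix(dtm); the names m and long are assumptions for illustration.

```r
# Hand-made document-term counts standing in for as.matrix(dtm)
m <- matrix(c(2, 0, 0, 0,
              1, 0, 0, 0,
              0, 4, 1, 0,
              0, 0, 0, 3),
            nrow = 4, byrow = TRUE,
            dimnames = list(Docs = c("1", "2", "3", "4"),
                            Terms = c("apple", "banana", "orange", "pear")))

# Wide matrix -> long data frame with one row per (document, term) pair
long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(long) <- c("doc_id", "word", "freq")

# Drop the zero counts and sort by document, then word
long <- long[long$freq > 0, ]
long <- long[order(long$doc_id, long$word), ]
rownames(long) <- NULL
long
```

The resulting three columns (doc_id, word, freq) are in the shape needed for joining back to the larger data set by doc_id.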
I have an array object containing 3 columns and a ton of rows.
Example nonsensical data to show the format:
Name | Owner | Price
chair | roger | 50
table | roger | 150
sofa | bill | 500
I want to use the lm function to get stats about the price column. My problem is that my formula needs to compare the current value to the previous value, skipping the very first row completely.
Right now I have
lm(My_Function(Price, 5)~., data=myArray)
This allows me to do whatever logic I need with the price values. But I need to get the Price, and also the price of the previous row, in My_Function, to allow for some comparison logic.
How could I do that?
My code should look sort of like this
lm(My_Function(Price, previousPrice, 5)~., data=myArray)
So I need two things:
1. How to get the previous value (or any other arbitrary index's value during the logic, in relation to the current one)
2. How to skip the very first row, without losing its data of course (since it will be the "previous" data for the next row)
Here's code which implements Robert Tan's suggestion:
# Make example data
X = data.frame("Price" = rnorm(10),
               "Owner" = sample(c("roger", "bill"), 10, replace = TRUE))
# Lag the price variable
library(Hmisc)
X$previousPrice = Lag(X$Price, shift = 1) #shift gives number of lags
X #Note first value for previousPrice is NA
# Run linear model. Note the first row will be ignored from the model as the "lagging" generates an NA
f = lm(Price ~ previousPrice, data = X)
summary(f)
Note that this approach will solve both of your questions: (1) is addressed by the lag function; (2) happens automatically because lm() will omit the first row because X$previousPrice has an NA for the first value.
If the above approach doesn't solve your problems and you still need to explicitly call My_Function() on an object with the first row removed, you could do the following:
My_Function = function(x1, x2) {x1 - x2} #Just for illustration
X2 = X[complete.cases(X), ] #make copy of X with first row removed (NB you could use `X[-1, ]` but complete.cases() will remove *all* rows with NAs)
lm(My_Function(X2$Price, X2$previousPrice) ~ ., data = X2)
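If you would rather not load Hmisc just for the lag, a one-step lag can also be built in base R. This is a sketch under the same assumptions as the example above (random prices, a column named previousPrice), not a replacement for Lag(), which handles other shifts as well.

```r
set.seed(1)
X <- data.frame(Price = rnorm(10))

# Shift Price down by one row: NA first, then every price except the last
X$previousPrice <- c(NA, head(X$Price, -1))

# lm() drops rows with NA by default, so the first row is left out of the fit
f <- lm(Price ~ previousPrice, data = X)
nobs(f)  # number of rows actually used by the model
```

The first row is still in X, so its Price serves as previousPrice for the second row; it is only excluded from the fit itself.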
I have a tabular data like:
+---+----+----+
| | a | b |
+---+----+----+
| P | 1 | 2 |
| Q | 10 | 20 |
+---+----+----+
and I want to represent this using a Dict.
With the column and row names:
x = ["a", "b"]
y = ["P", "Q"]
and data
data = [ 1 2 ;
10 20 ]
how may I create a dictionary object d, so that d["a", "P"] = 1 and so on? Is there a way like
d = Dict(zip(x,y,data))
?
Your code works with a minor change to use Iterators.product:
d = Dict(zip(Iterators.product(x, y), data.'))
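To do this you need to add a using Iterators line to your project, and might need to run Pkg.add("Iterators"). Because Julia matrices are column-major (elements are stored in order within columns, and columns are stored in order within the matrix), we need to transpose the data matrix using the transpose operator .'. On Julia 1.0 and later, Iterators.product ships with Base, so no package is needed, and the .' operator has been removed; permutedims(data) does the same job there.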
This is a literal answer to your question. I don't recommend doing that. If you have tabular data, it's probably better to use a DataFrame. These are not two dimensional (rows have no names) but that can be fixed by adding an additional column, and using select.
I have two data frames. The first has 165 columns (POINTID plus species names) and almost 193,000 rows; each cell holds a number from 0 to 1, which is the probability that the species is present at that point.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (the same species names as in the first data frame) and a single row holding the threshold for each species: above it we assume the species is present, and below it we assume the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want is a new data frame that, for every species in the presence probabilities (my.data), keeps the probability if it is above the threshold (thres) and holds zero if it is below the threshold.
I know this probably needs a for loop and an if statement, but I am new to R and I don't know how to do this.
Please help me.
I think you want something like this:
(Make up small reproducible example)
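set.seed(101)
speciesdat <- data.frame(pointID=1:10,
                         matrix(runif(100),ncol=10,
                                dimnames=list(NULL,LETTERS[1:10])))
## one-row data frame of thresholds, one column per species
threshdat <- as.data.frame(rbind(seq(0.1,1,by=0.1)))
names(threshdat) <- LETTERS[1:10]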
Now process:
thresh <- unlist(threshdat) ## make data frame into a vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
It's simpler to have the same number of columns (with the same meanings of course).
frame2 = data.frame(POINTID=0, frame2)
R works with vectors so a row of frame1 can be directly compared to frame2
frame1[1,] < frame2
Could use an explicit loop for every row of frame1 but it's common to use the implicit loop of "apply"
answer = apply(frame1, 1, function(x) x < frame2)
This was all a rather sloppy solution (especially changing frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).
sweep produces a logical matrix which can be used for assignment with "[<-" (assuming the multi-row data frame is named "cols" and the named vector is "vec"):
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch of the number of columns with the length of the vector, but presumably you can adjust the length of the vector to be the correct number of entries.
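One way to avoid the length mismatch is to line the thresholds up with the species columns by name before calling sweep. The sketch below uses a few values copied from the question; the frame names my.data and thres come from the question, while species, vec, vals and result are made-up names, and it assumes the shared species columns carry identical names in both frames.

```r
# Small stand-ins for the two data frames in the question
my.data <- data.frame(POINTID = c(2, 3),
                      Abie_Xbor = c(0.0279037, 0.0294101),
                      Acer_Camp = c(0.604687, 0.674846))
thres <- data.frame(Abie_Xbor = 0.3155, Acer_Camp = 0.2816, Acta_Spic = 0.3514)

# Keep only the species present in both frames, in my.data's column order
species <- intersect(names(my.data)[-1], names(thres))
vec <- unlist(thres[species])  # named numeric vector of thresholds

# Zero out every value that falls below its species threshold
vals <- as.matrix(my.data[species])
vals[sweep(vals, 2, vec, "<")] <- 0

# Put the thresholded values back next to POINTID
result <- my.data
result[species] <- vals
result
```

Here Acta_Spic has a threshold but no data column, so it is simply ignored instead of triggering a warning.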
I have a list of leaves in a tree and the height at which I'd like them to merge, i.e. the height of their most recent common ancestor. All leaves are assumed to be at height 0. A toy example might look like:
as.data.frame(rbind(c("a","b",1),c("c","d",2),c("a","d",4)))
V1 V2 V3
1 a b 1
2 c d 2
3 a d 4
I want to plot a tree representing this data. I know that R can plot trees coming from hclust. How do I get my data into the format returned by hclust or into some other format that is easily plotted?
Edited to add diagram:
The tree for the above dataset looks like this:
__|___
| |
| _|_
_|_ | |
| | | |
a b c d
What you have is a hierarchical clustering that is already specified (in your own data format convention), and you would like to use R's plotting facilities. This does not seem to be easy. The only way I can see to achieve this is to create an object like the one returned by hclust. It has the components "merge", "height", "order", "labels", "method", "call" and "dist.method", which are all fairly easy to understand. Someone already tried this: https://stat.ethz.ch/pipermail/r-help/2006-February/089170.html but apparently still had issues. What you could also try is to fill in a distance matrix with dummy values that are consistent with your clustering, then submit this to hclust. E.g.
a <- matrix(ncol=4,nrow=4, c(0,1,4,4,1,0,4,4,4,4,0,2,4,4,2,0))
b <- hclust(as.dist(a), method="single")
plot(b, hang=-1)
This could perhaps be useful.
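To check that the dummy distance matrix really reproduces the requested merge heights, you can compare it with the cophenetic distances of the fitted tree. This is a sketch only: the leaf labels are added here so the plot shows a to d, and the name coph is made up for illustration.

```r
leaves <- c("a", "b", "c", "d")

# Pairwise merge heights from the question: a-b at 1, c-d at 2, everything else at 4
a <- matrix(c(0, 1, 4, 4,
              1, 0, 4, 4,
              4, 4, 0, 2,
              4, 4, 2, 0),
            nrow = 4, dimnames = list(leaves, leaves))

b <- hclust(as.dist(a), method = "single")

# Heights at which the tree merges, in merge order
b$height

# Cophenetic distance = height of the most recent common ancestor of two leaves
coph <- as.matrix(cophenetic(b))
all.equal(coph, a, check.attributes = FALSE)
```

Calling plot(b, hang = -1) afterwards draws the tree with all leaves at height 0, as in the diagram in the question.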