I have a problem with the stem-and-leaf plot function.
One example:
I want to stem the correlation coefficients of my meta-analysis. Here I have just two correlation coefficients (0.056 and -0.022).
I tried the following function:
y<-c(0.056, -0.022)
stem(y)
and I get the following result:
-2 | 2
-0 |
0 |
2 |
4 | 6
but that's not the right result; it should be:
0 | 6
-0 | 2
So I don't understand which function I have to use to get the right result.
I would be really thankful if somebody could help me!
Check out help(stem) and change the scale parameter to control the length of stem plot:
R > stem(y, scale = 2)
The decimal point is 2 digit(s) to the left of the |
-2 | 2
-1 |
-0 |
0 |
1 |
2 |
3 |
4 |
5 | 6
Does that make more sense?
The closest I get to your output is:
stem(y, scale=0.5, atom=0.1)
But it has the negatives at the top instead of the bottom.
The first output that you show is a correct answer (the 0.04 and 0.05 stems are grouped together), even if it is not the desired one.
I'm quite confused.
I have 50 clusters, each with a different size, and I have two variables, "Year" and "Income level".
The data set I have right now has 10,000 rows where each row represents a single individual.
What I want to do is form a new dataset from this data frame where each row represents one of the 50 clusters and the columns are the two variables plus the cluster variable. The problem is that these two variables (which we call the study-level covariates) do not have a unique value per cluster.
How would I put them in one cell for each cluster then?
X1<-c(1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4) #Clusters
X2<-c(1,2,3,1,1,1,1,1,1,2,3,3,1,1,2,2) #Covariate1
X3<-c(1991,2001,2002,1998,2014,2015,1990,
2002,2004,2006,2006,2006,2005,2003,2003,2000) #Covariate2
data<-data.frame(X1,X2,X3)
My desired output should be something like this:
|Clusters|Covariate1|Covariate2|
|--------|---------|----------|
|1 | ? |? |
|2 | ? |? |
|3 | ? |? |
|4 | ? |? |
Meaning that instead of a data frame with 16 rows, I want a data frame with 4 rows.
Here is how to aggregate the data using the average of the covariate per cluster:
df <- data.frame(X1 = c(1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4),
X2 = c(1,2,3,1,1,1,1,1,1,2,3,3,1,1,2,2),
X3 = c(1991,2001,2002,1998,2014,2015,1990,2002,2004,2006,2006,2006,2005,2003,2003,2000)
)
library(tidyverse)
df %>% group_by(X1) %>% summarise(mean_cov1 = mean(X2))
# A tibble: 4 x 2
X1 mean_cov1
* <dbl> <dbl>
1 1 2
2 2 1
3 3 1.5
4 4 2
For the case you are working on, you have to decide what the most relevant aggregation is. You can probably also create multiple at once, as in the sketch below.
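For example, a minimal sketch that aggregates both covariates in one call; the choice of mean() for X2 and max() for X3 is only an illustration, you have to pick what makes sense for your study:
df %>%
  group_by(X1) %>%
  summarise(mean_cov1 = mean(X2),   # average of Covariate1 per cluster
            max_cov2 = max(X3))     # latest year of Covariate2 per cluster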
I have a data frame like the following:
group | amount_food | amount_finance | amount_clothes
A | 30 | 40 | 50
B | 34 | 43 | 53
C | 50 | 86 | 90
I would like to colour the contents of the cells depending on the value (a gradient of sorts where e.g. red would indicate higher and blue would indicate lower values, etc.), similar to conditional formatting in Excel. Ideally I would like this done on a column-by-column basis, so I know which group has the highest amount_food, etc.
How can I achieve this in R?
df <- read.csv("shopspend.csv")
I'm new to R, so any pointers are helpful.
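One possible starting point (a sketch only, not a full answer): reshape the data to long format and draw a heatmap with ggplot2, rescaling within each column so every column gets its own blue-to-red gradient. The column names below just mirror the example table and are assumptions about your CSV:
library(dplyr)
library(tidyr)
library(ggplot2)
# hypothetical data reproducing the example table above
spend <- data.frame(group = c("A", "B", "C"),
                    amount_food = c(30, 34, 50),
                    amount_finance = c(40, 43, 86),
                    amount_clothes = c(50, 53, 90))
spend %>%
  pivot_longer(-group, names_to = "category", values_to = "amount") %>%
  group_by(category) %>%
  mutate(scaled = (amount - min(amount)) / (max(amount) - min(amount))) %>%  # rescale per column
  ggplot(aes(x = category, y = group, fill = scaled)) +
  geom_tile() +
  geom_text(aes(label = amount)) +
  scale_fill_gradient(low = "blue", high = "red")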
I'm trying to group my data by a very specific condition. Consider below data.frame:
from <- c("a", "b", "a", "b")
to <- c("b", "a", "b", "a")
give <- c("x", "y", "y", "x")
take <- c("y", "x", "x", "y")
amount <- c(1, 2, 3, 4)
df <- data.frame(from, to, give, take, amount)
which creates something like:
| from | to | give | take | amount
---------------------------------------
1 | a | b | x | y | 1
2 | b | a | y | x | 2
3 | a | b | y | x | 3
4 | b | a | x | y | 4
To provide some background: consider some user in the 'from' column giving something (in column 'give') to the user in the 'to' column and taking something in return (in column 'take'). As you might see, rows 1 & 2 are the same in that sense, because they describe the same scenario, just from another perspective. Therefore, I want these to belong to the same group. (You could also consider them duplicates, which involves the same task, i.e. identifying them as similar.) The same holds for rows 3 & 4. The amount is some value to be summed up per group, to make the example clear.
My desired result for grouping them is as follows.
| user1 | user2 | given_by_user1 | taken_by_user1 | amount
-----------------------------------------------------------
| a | b | x | y | 3 # contains former rows 1&2
| a | b | y | x | 7 # contains former rows 3&4
Note that both from & to and give & take need to be inverted, i.e. taking the values from two columns, sorting them and considering rows equal on that basis alone is not what I need. That would lead to all four rows in the above example being considered equal. That kind of solution was proposed in similar posts, e.g.:
Remove duplicates where values are swapped across 2 columns in R
I've read many similar solutions and found one which actually does the trick:
match two columns with two other columns
However, the proposed solution creates an outer product of two columns, which is not feasible in my case, because my data has millions of rows and at least thousands of unique values within each column.
(Any solution that either groups the rows directly, or gets the indices of rows belonging to the same group would be great!)
Many thanks for any suggestions!
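In case it is useful, here is a minimal dplyr sketch of one possible approach (my own assumption, not taken from the linked posts): pick a canonical ordering of the two users per row and swap give/take whenever 'from' is not the canonical first user. Everything is vectorized, so no outer product is built:
library(dplyr)
df <- data.frame(from = c("a", "b", "a", "b"),
                 to = c("b", "a", "b", "a"),
                 give = c("x", "y", "y", "x"),
                 take = c("y", "x", "x", "y"),
                 amount = c(1, 2, 3, 4),
                 stringsAsFactors = FALSE)
df %>%
  mutate(user1 = pmin(from, to),   # canonical first user (lexicographic order)
         user2 = pmax(from, to),
         given_by_user1 = ifelse(from == user1, give, take),
         taken_by_user1 = ifelse(from == user1, take, give)) %>%
  group_by(user1, user2, given_by_user1, taken_by_user1) %>%
  summarise(amount = sum(amount), .groups = "drop")
On the example data this reproduces the desired two-row result (amounts 3 and 7).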
I have a dataset with items but with no user ratings.
Items have features (~400 features).
I want to measure the similarity between items based on features (Row similarity).
I convert the item-feature data into a binary matrix like the following:
itemID | feature1 | feature2 | feature3 | feature4 ....
1 | 0 | 1 | 1 | 0
2 | 1 | 0 | 0 | 1
3 | 1 | 1 | 1 | 0
4 | 0 | 0 | 1 | 1
I don't know what to use (and how to use it) to measure the row similarity.
I want, for Item X, to get the top k similar items.
Sample code would be very much appreciated.
What you are looking for is termed a similarity measure. A quick Google/SO search will reveal various methods to compute the similarity between two vectors. Here is some sample code in Python 2 for cosine similarity:
from math import *

def square_rooted(x):
    return round(sqrt(sum([a*a for a in x])), 3)

def cosine_similarity(x, y):
    numerator = sum(a*b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)

print cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15])
taken from: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
I noticed that you want the top k similar items for every item. The best way to do that is with a k-nearest-neighbour (kNN) implementation. What you can do is create a kNN graph and return the top k similar items from the graph for a query.
A great library for this is nmslib. Here is some sample code for a kNN query from the library, using the HNSW method with cosine similarity (you can use one of the several available methods; HNSW is particularly efficient for your high-dimensional data):
import nmslib
import numpy
# create a random matrix to index
data = numpy.random.randn(10000, 100).astype(numpy.float32)
# initialize a new index, using a HNSW index on Cosine Similarity
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)
index.createIndex({'post': 2}, print_progress=True)
# query for the nearest neighbours of the first datapoint
ids, distances = index.knnQuery(data[0], k=10)
# get all nearest neighbours for all the datapoints
# using a pool of 4 threads to compute
neighbours = index.knnQueryBatch(data, k=10, num_threads=4)
At the end of the code, the k top neighbours for every data point will be stored in the neighbours variable. You can use that for your purposes.
I have a large panel data set in the form:
ID | Time | X-VALUE
---|------|--------
1  | 1    | x
1  | 2    | x
1  | 3    | x
2  | 1    | x
2  | 2    | x
2  | 3    | x
3  | 1    | x
3  | 2    | x
3  | 3    | x
.  | .    | .
.  | .    | .
More specifically, I have dataset of a large set of individual stock returns over a period of 30 years. I would like to calculate the "stock-specific" first (lag 1) autocorrelation in returns for all stocks individually.
I suspect that by applying acf(pdata$return, lag.max = 1, plot = FALSE) I'll only get some kind of "average" autocorrelation value; is that correct?
Thank you
You can split the data frame and do the acf on each subset. There are tons of ways to do this in R. For example
by(pdata$return, pdata$ID, function(i) { acf(i, lag.max = 1, plot = FALSE) })
You may need to change variable and data frame names to match your own data.
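If you only want the numeric lag-1 coefficient per stock (rather than a list of acf objects), a small variation on the same idea (again assuming columns ID and return) is:
# returns a named vector with one lag-1 autocorrelation per ID;
# acf()$acf[2] is the lag-1 value (element 1 is lag 0, which is always 1)
sapply(split(pdata$return, pdata$ID),
       function(r) acf(r, lag.max = 1, plot = FALSE)$acf[2])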
This is not exactly what was requested, but a real autocorrelation function for panel data in R is collapse::psacf. It works by first standardizing the data in each group and then computing the autocovariance on the group-standardized panel series using proper panel lagging. The implementation is in C++ and very fast.
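A rough sketch of such a call, assuming the pdata columns used above; please verify the argument names against the collapse documentation before relying on this:
# assumed interface: psacf(x, g, t, lag.max, ...) with g = group variable, t = time variable
library(collapse)
psacf(pdata$return, g = pdata$ID, t = pdata$Time, lag.max = 1)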