I'm quite confused.
I have 50 clusters each with a different size, and I have two variables "Year" and "Income level".
The data set I have right now has 10,000 rows where each row represents a single individual.
What I want to do is build a new dataset from this data frame in which each row represents one cluster (50 rows in total) and the columns are the two variables plus the cluster variable. The problem is that these two variables (which we call the study-level covariates) do not take a single value within each cluster.
How would I put them in one cell for each cluster then?
X1<-c(1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4) #Clusters
X2<-c(1,2,3,1,1,1,1,1,1,2,3,3,1,1,2,2) #Covariate1
X3<-c(1991,2001,2002,1998,2014,2015,1990,
2002,2004,2006,2006,2006,2005,2003,2003,2000) #Covariate2
data<-data.frame(X1,X2,X3)
My desired output should be something like this:
|Clusters|Covariate1|Covariate2|
|--------|---------|----------|
|1 | ? |? |
|2 | ? |? |
|3 | ? |? |
|4 | ? |? |
Meaning that instead of a data frame with 16 rows, I would get a data frame with 4 rows.
Here is how to aggregate the data using the average of the covariate per cluster:
df <- data.frame(X1 = c(1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4),
X2 = c(1,2,3,1,1,1,1,1,1,2,3,3,1,1,2,2),
X3 = c(1991,2001,2002,1998,2014,2015,1990,2002,2004,2006,2006,2006,2005,2003,2003,2000)
)
library(tidyverse)
df %>% group_by(X1) %>% summarise(mean_cov1 = mean(X2))
# A tibble: 4 x 2
X1 mean_cov1
* <dbl> <dbl>
1 1 2
2 2 1
3 3 1.5
4 4 2
For the case you are working on, you have to decide which aggregation is most relevant (mean, median, minimum, most frequent value, and so on). You can also compute several summaries at once in a single summarise() call.
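As a rough sketch of computing several summaries at once, the example below reuses the same toy data. The summary column names (n, mean_cov1, min_year, max_year) and the choice of statistics are illustrative assumptions, not a recommendation for the real study.

```r
library(dplyr)

# Same toy data as above: X1 = cluster, X2 = covariate 1, X3 = year
df <- data.frame(
  X1 = c(1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4),
  X2 = c(1,2,3,1,1,1,1,1,1,2,3,3,1,1,2,2),
  X3 = c(1991,2001,2002,1998,2014,2015,1990,2002,
         2004,2006,2006,2006,2005,2003,2003,2000)
)

# One row per cluster, several summaries per covariate
# (the summary names below are hypothetical, chosen for illustration)
cluster_level <- df %>%
  group_by(X1) %>%
  summarise(
    n         = n(),       # cluster size
    mean_cov1 = mean(X2),  # average of covariate 1
    min_year  = min(X3),   # earliest year in the cluster
    max_year  = max(X3)    # latest year in the cluster
  )

cluster_level
```

Keeping the cluster size n alongside the summaries can be handy if the clusters are later weighted by size.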
I have a data frame like the following:
group | amount_food | amount_finance | amount_clothes
A | 30 | 40 | 50
B | 34 | 43 | 53
C | 50 | 86 | 90
I would like to colour the contents of the cells depending on their value (a gradient of sorts where, e.g., red indicates higher values and blue indicates lower values), similar to conditional formatting in Excel. Ideally this would be done column by column, so I know which group has the highest amount_food, etc.
How can I achieve this in R?
df <- read.csv("shopspend.csv")
I'm new to R, so any pointers would be helpful.
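One possible starting point, sketched in base R only: compute a blue-to-red colour for every numeric cell, scaled within its own column. The helper name value_to_colour is hypothetical, and the data frame is typed in by hand from the table above because shopspend.csv is not available here. The resulting hex codes would still need to be passed to whatever table-rendering tool is used.

```r
# Toy data copied from the table in the question
df <- data.frame(
  group          = c("A", "B", "C"),
  amount_food    = c(30, 34, 50),
  amount_finance = c(40, 43, 86),
  amount_clothes = c(50, 53, 90)
)

# Hypothetical helper: map a numeric vector to hex colours,
# blue for the column minimum and red for the column maximum
value_to_colour <- function(v) {
  rng <- range(v)
  # Rescale to [0, 1]; a constant column gets the midpoint colour
  scaled <- if (diff(rng) == 0) rep(0.5, length(v)) else (v - rng[1]) / diff(rng)
  ramp <- colorRamp(c("blue", "red"))  # returns an RGB matrix (0-255)
  rgb(ramp(scaled), maxColorValue = 255)
}

# Apply the helper column by column to the numeric columns only
num_cols <- names(df)[sapply(df, is.numeric)]
cell_colours <- as.data.frame(lapply(df[num_cols], value_to_colour))

cell_colours
```

Because the scaling is done per column, the reddest cell in each column marks the group with the highest value for that column.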
I am using sparklyr for a project. I have a Spark Dataframe with lists in some of the columns and I'd like to separate them into multiple rows, i.e. have one value in each row, exactly like separate_rows does in dplyr.
So basically my dataframe is like this
| x | y
1| [a,b] | [c,d]
And I'd like to have something like this in the end :
| x | y
1| a | c
2| b | d
As suggested in this post, explode is a good start, but it handles only one column at a time; if I use it twice, I end up with 4 rows here instead of the 2 I want. In this very simple example I could filter down to the rows I want, but things get messier if the lists have more than two elements...
Something I thought about would be to:
1) Merge the columns x and y into a single column which would contain [[a,c], [b,d]]
2) Then use explode to get [a,c] and then [b,d]
3) Then explode again, but into columns (rather than into rows).
Only I don't know how to do 1) and 3).
Thank you for the help !
Here is a reproducible example obtained with collect and dput :
structure(list(ref_amount = list(list(967.66, 1592.56), list(
967.66, 1592.56)), ref_theta = list(list(5.26977034898459,
5.16119062369122), list(5.26977034898459, 5.16119062369122))), .Names = c("ref_amount",
"ref_theta"), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))
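To make the intended result concrete, here is a minimal sketch of the "zip, then explode" idea on the collected (local) version of the example, in plain base R. It is not sparklyr code and does not run on the cluster. The theta values are shortened and the source_row column is a hypothetical addition for illustration. On the Spark side, the SQL function arrays_zip (Spark 2.4 and later) followed by explode may be one route for steps 1) and 2), but that is an untested assumption here.

```r
# Local copy of the example, built as a plain data.frame with list columns
local_df <- data.frame(id = 1:2)
local_df$ref_amount <- list(list(967.66, 1592.56), list(967.66, 1592.56))
local_df$ref_theta  <- list(list(5.2698, 5.1612), list(5.2698, 5.1612))

# For each original row, pair up the i-th elements of both lists
# (the "zip"), then stack the pairs as separate rows (the "explode")
exploded <- do.call(rbind, lapply(seq_len(nrow(local_df)), function(i) {
  data.frame(
    source_row = i,  # hypothetical column: which original row this came from
    ref_amount = unlist(local_df$ref_amount[[i]]),
    ref_theta  = unlist(local_df$ref_theta[[i]])
  )
}))

exploded
```

Each original row with two-element lists yields two output rows, so the two rows here become four, rather than the eight that exploding each column independently would give.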
I have a large panel data set in the form:
ID | Time| X-VALUE
---| ----|-----
1 | 1 |x
1 | 2 |x
1 | 3 |x
2 | 1 |x
2 | 2 |x
2 | 3 |x
3 | 1 |x
3 | 2 |x
3 | 3 |x
. | . |.
. | . |.
More specifically, I have a dataset of individual stock returns for a large set of stocks over a period of 30 years. I would like to calculate the "stock-specific" first-order (lag 1) autocorrelation in returns for each stock individually.
I suspect that by applying the code: acf(pdata$return, lag.max = 1, plot = FALSE) I'll only get some kind of "average" autocorrelation value across all stocks. Is that correct?
Thank you
You can split the data frame and do the acf on each subset. There are tons of ways to do this in R. For example
by(pdata$return, pdata$ID, function(i) { acf(i, lag.max = 1, plot = FALSE) })
You may need to change variable and data frame names to match your own data.
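If only the lag-1 coefficient per stock is needed, rather than the full acf objects, a variant along the following lines might be convenient. The small pdata below is made-up toy data for illustration only, and the rows are assumed to be already sorted by time within each ID.

```r
# Made-up toy panel: two IDs with short return series
pdata <- data.frame(
  ID     = rep(c(1, 2), times = c(5, 6)),
  return = c(1, 2, 3, 4, 5,  1, -1, 1, -1, 1, -1)
)

# Split returns by ID and keep only the lag-1 autocorrelation.
# acf()$acf holds lag 0 in position 1 and lag 1 in position 2.
lag1_acf <- sapply(split(pdata$return, pdata$ID), function(r) {
  acf(r, lag.max = 1, plot = FALSE)$acf[2]
})

lag1_acf
```

The result is a named numeric vector with one value per ID, which is easy to turn into a data frame or merge back onto the panel.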
This is not exactly what was requested, but a real autocorrelation function for panel data in R is collapse::psacf. It works by first standardizing the data within each group and then computing the autocovariance on the group-standardized panel series using proper panel lagging. The implementation is in C++ and very fast.
I have tabular data like:
+---+----+----+
| | a | b |
+---+----+----+
| P | 1 | 2 |
| Q | 10 | 20 |
+---+----+----+
and I want to represent this using a Dict.
With the column and row names:
x = ["a", "b"]
y = ["P", "Q"]
and data
data = [ 1 2 ;
10 20 ]
How can I create a dictionary object d, so that d["a", "P"] = 1 and so on? Is there a way like
d = Dict(zip(x,y,data))
?
Your code works with a minor change to use Iterators.product:
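d = Dict(zip(Iterators.product(x, y), permutedims(data)))
In current Julia versions, Iterators.product is available from Base without installing anything (older versions needed a using Iterators line in the project, and possibly Pkg.add("Iterators")). Because Julia matrices are column-major (elements are stored in order within columns, and columns are stored in order within the matrix), the data matrix has to be transposed so that its elements line up with the (column name, row name) pairs; permutedims does that here, where older Julia versions used the transpose operator .'.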
This is a literal answer to your question, but I don't recommend doing it this way. If you have tabular data, it's probably better to use a DataFrame. A DataFrame is not indexed by name in two dimensions (rows have no names), but that can be fixed by adding an additional column holding the row labels and using select.