How to count how many conditions an observation meets using R?

I have a data set with lots of binary variables, all with values 0/1. I want to create a new column that counts how many of the binary variables equal 1 for each observation: 1 if exactly one variable is 1, 2 if two variables are 1, and so on.
Such as:
x1 x2 x3 x4 x5
1 1 1 0 1
0 0 1 0 0
0 0 0 0 0
I want to have
x1 x2 x3 x4 x5 count
1 1 1 0 1 4
0 0 1 0 0 1
0 0 0 0 0 0

If your dataset contains only the binary variables you are interested in, you can use
df$count <- rowSums(df)
Otherwise, please provide a more detailed description of your data.
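If the data frame also holds other columns, one way to adapt the line above is to restrict rowSums to the binary columns. This is only a sketch: the id column and the x1 to x5 naming scheme are assumptions made for illustration, not something stated in the question.

```r
# Example data from the question, plus a hypothetical non-binary id column
df <- data.frame(
  id = c("a", "b", "c"),
  x1 = c(1, 0, 0),
  x2 = c(1, 0, 0),
  x3 = c(1, 1, 0),
  x4 = c(0, 0, 0),
  x5 = c(1, 0, 0)
)

# Names of the binary columns (assumed naming scheme)
bin_cols <- paste0("x", 1:5)

# Count the 1s per row; na.rm = TRUE treats a missing value as "condition not met"
df$count <- rowSums(df[bin_cols] == 1, na.rm = TRUE)
df$count
```

Comparing with == 1 before summing means the count stays correct even if a column happens to hold values other than 0 and 1.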

Another option is Reduce with +
df$count <- Reduce(`+`, df)

Related

Create a new variable based on any 2 conditions being true

I have a dataframe in R with 4 variables and would like to create a new variable based on any 2 conditions being true on those variables.
I have attempted to create it via if/else statements; however, that would require enumerating every combination of conditions being true. I would also need to scale this so that I can create a new variable based on any 3 conditions being true. Is there a more efficient method than using if/else statements?
My example:
I have a dataframe X with following column variables
x1 = c(1,0,1,0)
X2 = c(0,0,0,0)
X3 = c(1,1,0,0)
X4 = c(0,0,1,0)
I would like to create a new variable X5 that is 1 if any 2 of the variables are true (i.e. == 1)
The new variable based on the above dataframe would be X5 = (1,0,1,0)
This can easily be done by using the apply function:
x1 = c(1,0,1,0)
x2 = c(0,0,0,0)
x3 = c(1,1,0,0)
x4 = c(0,0,1,0)
df <- data.frame(x1,x2,x3,x4)
df$x5 <- apply(df,1,function(row) ifelse(sum(row != 0) == 2, 1, 0))
x1 x2 x3 x4 x5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
apply with MARGIN = 1 (the second argument) means: apply this function to every row. To scale this up to 3...N true values, just change the number in the ifelse statement.
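The same idea can be wrapped in a small function so that the threshold becomes a parameter. The helper name flag_k_true and its at_least switch are hypothetical, introduced here only to illustrate the scaling; it uses rowSums instead of apply, which gives the same per-row counts.

```r
# Hypothetical helper: 1 if a row has exactly k (or at least k) non-zero values
flag_k_true <- function(data, k, at_least = FALSE) {
  n_true <- rowSums(data != 0, na.rm = TRUE)   # non-zero values per row
  hit <- if (at_least) n_true >= k else n_true == k
  as.integer(hit)
}

df <- data.frame(
  x1 = c(1, 0, 1, 0),
  x2 = c(0, 0, 0, 0),
  x3 = c(1, 1, 0, 0),
  x4 = c(0, 0, 1, 0)
)

flag_k_true(df, 2)                   # exactly two conditions true
flag_k_true(df, 1, at_least = TRUE)  # one or more conditions true
```

Whether "any 2 conditions" should mean exactly two or at least two depends on the use case, which is why the sketch exposes both.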
You can try this:
#Data
df <- data.frame(x1,X2,X3,X4)
#Code
df$X5 <- ifelse(rowSums(df, na.rm = TRUE) == 2, 1, 0)
x1 X2 X3 X4 X5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
You can use:
df$X5 <- 1*(apply(df == 1, 1, sum) == 2)
or
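# mapply(sum, df) alone would sum each column; do.call passes the columns
# as parallel arguments so that sum is applied row by row
df$X5 <- 1*(do.call(mapply, c(list(FUN = sum), df)) == 2)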
Output
> df
X1 X2 X3 X4 X5
1 0 1 0 1
0 0 1 0 0
1 0 0 1 1
0 0 0 0 0
Data
df <- data.frame(X1,X2,X3,X4)

How to generate binary variables without duplicated ID rows

I have this data frame :
>df
ID X1 X2 X3
IX 0 0 1
IX 1 1 0
IY 0 0 1
IZ 1 0 0
IZ 0 1 0
I need to create a data frame without duplicated rows, with one row per unique ID, whose values take all binary elements for that ID into account.
In other words, the result should be:
ID X1 X2 X3
IX 1 1 1
IY 0 0 1
IZ 1 1 0
I tried to use the duplicated function, but it just deletes rows with repeated IDs without taking the binary values into account, so it doesn't give the needed result.
What should I do?
aggregate(df[2:4], df[1], sum)
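Summing works for the example because no ID has a 1 in the same column twice. If that could happen, the sum would exceed 1, so a possible variant is to aggregate with max instead, which keeps the result binary. A minimal sketch using the data from the question:

```r
# Data from the question
df <- data.frame(
  ID = c("IX", "IX", "IY", "IZ", "IZ"),
  X1 = c(0, 1, 0, 1, 0),
  X2 = c(0, 1, 0, 0, 1),
  X3 = c(1, 0, 1, 0, 0)
)

# One row per ID; a column is 1 if any row of that ID has a 1 in it
res <- aggregate(df[2:4], df[1], max)
res
```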

Convert long table of linked observations to wide adjacency matrix [duplicate]

I am facing a challenge that I cannot manage to solve. I have a list of observations x_i (the dimension is large, around 30k) and a list of observations y_j (also large). x_i and y_j are IDs of the same kind of unit (say, firms).
I have a data frame with two columns that links x_i and y_j: if they appear on the same line, they are connected. I would like to convert this network into a large square matrix M with length(unique(union(x, y))) rows and columns, which takes the value 1 if the two firms are connected and 0 otherwise.
Here is an example in small dimensions:
x1 x2
x3 x6
x4 x5
x1 x5
What I would like is a matrix:
0 1 0 0 1 0
0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 1 0
0 0 0 0 0 0
0 0 0 0 0 0
Right now, the only solution I could think of is a double loop combined with a search in the initial dataframe:
# All firm IDs that appear in either column, sorted
list_firm <- sort(unique(c(df$col1, df$col2)))
# Square matrix with firm IDs as row and column names
M <- matrix(0, nrow = length(list_firm), ncol = length(list_firm),
            dimnames = list(list_firm, list_firm))
for (i in list_firm) {
  for (j in list_firm) {
    # 1 if the pair (i, j) appears on some line of df
    M[i, j] <- as.integer(any(df$col1 == i & df$col2 == j))
  }
}
Here df is the two-column data frame. This is obviously far too slow to run.
Any suggestions would be very welcome.
We convert the columns to factor with levels specified as the unique elements of both columns and get the frequency with table
lvls <- sort(unique(unlist(df)))
df[] <- lapply(df, factor, levels = lvls)
table(df)
# col2
#col1 x1 x2 x3 x4 x5 x6
# x1 0 1 0 0 1 0
# x2 0 0 0 0 0 0
# x3 0 0 0 0 0 1
# x4 0 0 0 0 1 0
# x5 0 0 0 0 0 0
# x6 0 0 0 0 0 0
data
df <- structure(list(col1 = c("x1", "x3", "x4", "x1"), col2 = c("x2",
"x6", "x5", "x5")), class = "data.frame", row.names = c(NA, -4L
))
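For comparison, the same matrix can be filled without factors by using matrix indexing: a two-column matrix of (row, column) positions sets all linked cells in one assignment. This is only a sketch in base R; the name edges is chosen here to avoid overwriting df.

```r
# Edge list from the example (kept as character, not factor)
edges <- data.frame(
  col1 = c("x1", "x3", "x4", "x1"),
  col2 = c("x2", "x6", "x5", "x5"),
  stringsAsFactors = FALSE
)

# All IDs that occur in either column, in sorted order
lvls <- sort(unique(c(edges$col1, edges$col2)))

# Start from an all-zero square matrix with named rows and columns
M <- matrix(0L, nrow = length(lvls), ncol = length(lvls),
            dimnames = list(lvls, lvls))

# Each row of idx is one (row, column) position to set to 1
idx <- cbind(match(edges$col1, lvls), match(edges$col2, lvls))
M[idx] <- 1L
M
```

With around 30k IDs a dense integer matrix of this size needs roughly 3.6 GB, so a sparse representation (for instance Matrix::sparseMatrix built from the same index pairs) may be the more practical choice.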
The answer provided by @akrun in the comments is a good one. However, this is a good scenario to take advantage of a different data structure than data frames. Basically, what you're looking for is an adjacency matrix, which is a common data structure in social network analysis. To achieve this, we can use the igraph package in R.
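library(igraph)
df <- data.frame(source = c('x1', 'x3', 'x4', 'x1'), target = c('x2', 'x6', 'x5', 'x5'))
# directed = FALSE treats each link as two-way, so the matrix is symmetric
g <- graph_from_data_frame(df, directed = FALSE)
# as_adjacency_matrix() replaces the deprecated get.adjacency()
output <- as.matrix(as_adjacency_matrix(g))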
x1 x3 x4 x2 x6 x5
x1 0 0 0 1 0 1
x3 0 0 0 0 1 0
x4 0 0 0 0 0 1
x2 1 0 0 0 0 0
x6 0 1 0 0 0 0
x5 1 0 1 0 0 0
The rows and columns aren't in the same order as in your example, but they can be reordered by name if needed.
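Reordering can be done by indexing the matrix with the sorted names. In the sketch below the matrix is typed in by hand from the output above so that it runs without igraph; with the real object, only the last two lines would be needed.

```r
# Adjacency matrix copied by hand from the igraph output above
nodes <- c("x1", "x3", "x4", "x2", "x6", "x5")
output <- matrix(
  c(0, 0, 0, 1, 0, 1,
    0, 0, 0, 0, 1, 0,
    0, 0, 0, 0, 0, 1,
    1, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 0, 0,
    1, 0, 1, 0, 0, 0),
  nrow = 6, byrow = TRUE, dimnames = list(nodes, nodes)
)

# Reorder rows and columns alphabetically by node name
ord <- sort(rownames(output))
reordered <- output[ord, ord]
reordered
```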

PortfolioAnalytics Error row names not Dates

I am getting the following error from the portfolio analytics package.
Error in checkData(R) :
The data cannot be converted into a time series. If you are trying to pass in names from a data object with one column, you should use the form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as '1985-03-15'.
The data set I am using is simulated data
> df
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
[1,] 0 1 0 1 0 0 0 1 1 0
[2,] 0 1 0 0 1 1 1 1 1 1
[3,] 1 0 0 0 0 0 0 1 1 0
[4,] 1 0 1 1 1 0 0 1 0 1
[5,] 0 0 1 0 1 0 1 1 1 0
[6,] 0 1 0 1 0 1 1 0 1 1
[7,] 1 0 0 0 0 1 1 1 1 0
[8,] 0 0 1 1 0 0 1 1 0 1
[9,] 1 0 0 0 0 1 1 1 1 0
[10,] 0 1 1 0 0 1 0 1 0 0
I set the portfolio constraints to be
returns = as.matrix(df)
> funds = colnames(df)
> init.portfolio <- portfolio.spec(assets = funds)
> init.portfolio <- add.constraint(portfolio = init.portfolio, type = "full_investment")
> init.portfolio <- add.constraint(portfolio = init.portfolio, type = "long_only")
> minSD.portfolio <- add.objective(portfolio=init.portfolio,
+ type="risk",
+ name="StdDev")
> minSD.opt <- optimize.portfolio(R = df, portfolio = minSD.portfolio,
+ optimize_method = "ROI", trace = TRUE)
Error in checkData(R) :
The data cannot be converted into a time series. If you are trying to pass in names from a data object with one column, you should use the form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as '1985-03-15'.
How can I fix this error? df is a simulation of single-period returns, so they are all either 100% or 0%, and for the same period. I can add a date variable as the row names if I need to, but I do not know how. I tried
> rownames(df) = as.Date(c("Jan", rep(nrow(df))))
Error in charToDate(x) :
character string is not in a standard unambiguous format
Can someone help me with this error? Thanks
You'll need to add date data to df. Assuming that the data is monthly returns beginning at the start of this year, you can either add rownames using
rownames(df) <- as.character(seq(as.Date("2015-01-01"), length.out=nrow(df), by = "month"))
or convert df to an xts time series by
library(xts)
df <- xts(df, order.by = seq(as.Date("2015-01-01"),
length.out = nrow(df), by = "month"))
xts is a commonly used format for financial time series and works well with PortfolioAnalytics so you might consider that.
Once you've done that and run optimize.portfolio, you won't get a solution. It appears that the covariance matrix of df is not positive definite, so you'll have to adjust the values in df.
Also, I don't quite understand your comment that single-period returns require that returns be 0 or 1. That's not true in general.
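One way to check the positive-definiteness point is to look at the eigenvalues of the sample covariance matrix. With 10 observations of 10 assets, the sample covariance has rank at most 9, so its smallest eigenvalue is zero up to rounding. The sketch below uses freshly simulated 0/1 data as a stand-in for df; the seed and the name R are arbitrary choices, not the asker's actual values.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated stand-in for df: 10 periods (rows) of 0/1 returns for 10 assets
R <- matrix(rbinom(100, size = 1, prob = 0.5), nrow = 10, ncol = 10,
            dimnames = list(NULL, paste0("X", 1:10)))

# Sample covariance matrix of the assets
S <- cov(R)

# Eigenvalues; a positive definite matrix has all of them strictly positive
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
min(ev)
```

A smallest eigenvalue that is numerically zero means the matrix is singular, which would be consistent with the optimizer failing; using more observations than assets is one way this can be avoided.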

R subsetting data according to non zeros in row

I have a dataset like this:
1 2 3 4 5
1 1 0 0 0 0
2 1 0 0 2 0
3 0 0 0 1 5
4 2 0 0 0 0
I want to subset the rows with more than one non-zero column (i.e. rows 2 and 3).
I know that it has to be something like dataset[... dataset ...] but I did not find out how to access rows as such without using a for-loop
You just need rowSums really. Assuming your dataset is called "mydf", try:
> mydf[rowSums(mydf != 0) > 1, ]
X1 X2 X3 X4 X5
2 1 0 0 2 0
3 0 0 0 1 5
Here, rowSums(mydf != 0) returns a vector giving how many values in each row are not zero. Adding the condition > 1 then creates a logical vector which can be used to subset the rows we want.
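The two steps can be seen separately in a reproducible sketch. The data frame below is rebuilt from the question; the intermediate names nonzero_per_row and keep are introduced here for illustration.

```r
# Data from the question
mydf <- data.frame(
  X1 = c(1, 1, 0, 2),
  X2 = c(0, 0, 0, 0),
  X3 = c(0, 0, 0, 0),
  X4 = c(0, 2, 1, 0),
  X5 = c(0, 0, 5, 0)
)

# Step 1: number of non-zero values in each row
nonzero_per_row <- rowSums(mydf != 0)

# Step 2: logical vector marking rows with more than one non-zero value
keep <- nonzero_per_row > 1

# Subset the rows
result <- mydf[keep, ]
result
```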
