Counting the number of specific element pairs between two columns in R

I have the following matrix:
x=c(0,0,0,1,1,1,2,2,2,0,1,2,0,1,2,0,1,2)
M=matrix(x,9,2)
The matrix M is:
> M
     [,1] [,2]
[1,]    0    0
[2,]    0    1
[3,]    0    2
[4,]    1    0
[5,]    1    1
[6,]    1    2
[7,]    2    0
[8,]    2    1
[9,]    2    2
How do I check that the number of occurrences of each pair (0,0), (0,1), (0,2), ... (that is, the first row, the second, the third and so on) across all rows equals 1?

If we need to get the frequency, use table():
tbl <- table(paste(M[,1], M[,2], sep="_"))
This can be converted to a 3-column data.frame by splitting the names of 'tbl' into two columns and cbind-ing the values of 'tbl':
cbind(read.table(text=names(tbl), sep="_", header = FALSE), value = as.vector(tbl))
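As an alternative sketch (my own variation, not part of the original answer), table() can also be called on the two columns directly and converted with as.data.frame(); the names V1 and V2 below are simply labels passed to table():
# cross-tabulate the two columns directly
tbl2 <- table(V1 = M[,1], V2 = M[,2])
as.data.frame(tbl2)   # columns V1, V2, Freq -- each pair appears exactly once here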

If you want to check whether every row appears only once, you can use
duplicated(data.frame(M))
If any of the resulting values is TRUE, then you know some rows appear more than once (and you know where they are).
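For example, a small sketch building on that call: wrapping it in any() gives a single yes/no answer, and which() locates the repeats:
dups <- duplicated(data.frame(M))
any(dups)     # TRUE if some row appears more than once
which(dups)   # row indices of the repeated rows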

Related

How to apply a function in different ranges of a vector in R?

I have the following matrix:
x=matrix(c(1,2,2,1,10,10,20,21,30,31,40,
1,3,2,3,10,11,20,20,32,31,40,
0,1,0,1,0,1,0,1,1,0,0),11,3)
I would like to find, for each unique value in the first column of x, the maximum value of the third column of x across all rows having that value in the first column.
I have created the following code:
v1 <- sequence(rle(x[,1])$lengths)
A <- split(seq_along(v1), cumsum(v1 == 1))
A_diff <- rep(0, length(A))
for (i in 1:length(A)) {
  A_diff[i] <- max(x[A[[i]], 3])
}
However, this code works only when equal elements are consecutive in the first column (because I use rle), and it relies on a for loop.
So how can I make it work in general, and without the for loop, i.e. using a function?
If I understand correctly
> tapply(x[,3],x[,1],max)
1 2 10 20 21 30 31 40
1 1 1 0 1 1 0 0
For grouping by more than one variable I would use aggregate(). Note that matrices are cumbersome for this purpose, so I would suggest you transform it to a data frame; nonetheless:
> aggregate(x[,3],list(x[,1],x[,2]),max)
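For instance, a minimal sketch of the data-frame route (V1, V2, V3 are just the default names as.data.frame() gives an unnamed matrix):
df <- as.data.frame(x)
aggregate(V3 ~ V1 + V2, data = df, FUN = max)   # max of column 3 per (column 1, column 2) pair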

Procedural way to generate signal combinations and their output in R

I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) and several columns that I could generate a signal off of, and each value in those columns can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns then see what each of the composite signals tell me about the returns in a time series. For this question the scope is only generating the combinations.
So for example, one composite signal would be when all four cells in the vectors = 0. I could generate a new column that reads TRUE when that case is true and false in each other case, then go on to figure out how that effects the returns from the rest of the data frame.
The thing is I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and so on, which is quite a few (4^4 = 256 in total). With the extent of my knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn this to dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
                   sig2 = c(1,1,0,0,0,1,1),
                   sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1, sig2, sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
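If the dummies package is not available, a rough base-R sketch of the same idea (my own variation, not something from the answer above) uses model.matrix() on the pasted key treated as a factor:
# one 0/1 indicator column per observed combination of sig_tot
ind <- model.matrix(~ factor(sig_tot) + 0, data)
data <- cbind(data, ind == 1)   # attach as logical columns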

How to determine the uniqueness of each column value in its own dynamic range?

Assuming my dataframe has one column, I wish to add another column to indicate if my ith element is unique within the first i elements. The results I want is:
c1 c2
1 1
2 1
3 1
2 0
1 0
For example, 1 is unique in {1}, 2 is unique in {1,2}, 3 is unique in {1,2,3}, 2 is not unique in {1,2,3,2}, 1 is not unique in {1,2,3,2,1}.
Here is my code, but it runs extremely slowly given that I have nearly 1 million rows.
for (i in 1:nrow(df)) {
  k <- sum(df$c1[1:i] == df$c1[i])
  if (k > 1) {
    df[i, "c2"] <- 0
  } else {
    df[i, "c2"] <- 1
  }
}
Is there a quicker way of achieving this?
The following works:
x$c2 = as.numeric(! duplicated(x$c1))
Or, if you prefer more explicit code (I do, but it’s slower in this case):
x$c2 = ifelse(duplicated(x$c1), 0, 1)
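For example, checking the faster version against the sample column from the question reproduces the desired c2:
x <- data.frame(c1 = c(1, 2, 3, 2, 1))
x$c2 <- as.numeric(!duplicated(x$c1))
x
#   c1 c2
# 1  1  1
# 2  2  1
# 3  3  1
# 4  2  0
# 5  1  0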

Mutate Cumsum with Previous Row Value

I am trying to run a cumsum over two separate columns of a data frame. They are essentially tabulations of events for two different variables; only one variable can have an event recorded per row. The way I attacked the problem was to create a new variable holding the value 1, and to create two new columns to sum the variables' totals. This works fine, and I can get the correct total number of occurrences, but the problem I am having is that with my current ifelse statement, if the event recorded is for variable "A", then variable "B" is assigned 0. Instead, for every row, I want the other variable to carry forward the previous row's value, so that I don't end up with gaps where it goes from 1 to 2, to 0, to 3.
I don't want to run summarize on this either, I would prefer to keep each recorded instance and run new columns through mutate.
CURRENT DF:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 0 1
4 1 A 3 0
DESIRED RESULT:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 2 1
4 1 A 3 1
Thanks!
You can use the property of booleans that they can be summed as ones and zeroes. Therefore, you can use the cumsum() function:
DF$Total.A <- cumsum(DF$Variable == "A")
Or as a more general approach, provided by #Frank you can do:
uv = unique(as.character(DF$Variable))
DF[, paste0("Total.", uv)] <- lapply(uv, function(x) cumsum(DF$Variable == x))
If you have many levels to your factor, you can get this in one line by dummy coding and then cumsuming the matrix.
X <- model.matrix(~Variable+0, DF)
apply(X, 2, cumsum)
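As a quick check, a sketch built from the sample data in the question shows the cumsum approach reproducing the desired totals, including the carried-forward values:
DF <- data.frame(Event = 1:4, Value = 1,
                 Variable = c("A", "A", "B", "A"))
DF$Total.A <- cumsum(DF$Variable == "A")   # 1 2 2 3
DF$Total.B <- cumsum(DF$Variable == "B")   # 0 0 1 1
DF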

Input Values on Rows Set by a Quantile Threshold

I have been working on this project and I am stuck in the following:
I have 7 columns on which over 30% of the rows are NAs.
All my columns are numeric, by the way.
On these high-missing-value columns I want to create 4 new columns based on the values of these columns' quantiles.
1st column - input 1 in rows which contain data; 0 otherwise.
2nd column - input 1 in rows below the first quantile; 0 otherwise.
3rd column - input 1 in rows that are in the 2nd quantile range; 0 otherwise.
4th column - input 1 in rows that are above the 3rd quantile; 0 otherwise.
I got the first column, but the rest, based on the quantiles' threshold values, has been a challenge.
Here is what I have so far...
My next 3 columns are based on just 3 quantiles: 33.33333%, 66.66667% and 100%:
quantile(High_NAS_set1$EFX, prob=c(33/99,66/99,99/99),na.rm=TRUE)
#1st column: assign 1 for a row that contains data; 0 otherwise
New.EFX_<-High_NAS_set1$EFX #creating a new column
New.EFX[!is.na(New.EFX)]<-1
New.EFX[is.na(New.EFX)]<-0
#2nd Column:assign 1 in rows below the first quantile; 0 otherwise
New.EFX2_<-High_NAS_set1$EFX #creating a new column
quant<-quantile(New2.EFX_Emp,probs=33/99,na.rm=TRUE)
which(New2.EFX_Emp_Total<=quant)<-1 # assign 1 for rows which indexes are below quant
which(New2.EFX_Emp_Total!=quant)<-0
The last 2 lines are giving me an error:
Error in which(New2.EFX_Emp_Total <= quant) <- 1 :
could not find function "which<-"
Any help will be really appreciated.
Thanks,
Jean
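For reference, the error occurs because which() only returns indices and there is no which<- replacement function, so it cannot appear on the left of an assignment. A minimal sketch of how the second indicator column could be built instead, reusing the column names from the question (treating NAs as 0 is my own assumption):
# flag rows at or below the first (33.33333%) quantile of EFX
quant <- quantile(High_NAS_set1$EFX, probs = 33/99, na.rm = TRUE)
New.EFX2 <- as.integer(High_NAS_set1$EFX <= quant)
New.EFX2[is.na(New.EFX2)] <- 0   # assumption: missing values get 0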
