data.matrix converts zero-values - r

I am trying to analyse some data in R, but when I convert a data.frame (which contains zero values) into a time series, the values change.
data <- input-dataset
Qua1 Qua2 Qua3 Qua4
A 0 1 1 3
B 0 1 0 0
C 0 2 0 0
D 1 1 3 0
I need to transpose these data and set the column names before further analysis:
data = t(data)
colnames(data) <- data[1,]
data = data.frame(data[2:5,])
ts(data)
The data.frame keeps the input values, but when I apply ts() (which converts the data.frame to a numeric matrix via data.matrix), all zeros change to 1, the existing 1s change to 2, and the rest remain the same. I want to keep my zeros for later analysis. How can I avoid this change in values?
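The shift from 0 to 1 and from 1 to 2 is consistent with data.matrix replacing character or factor columns by their integer level codes: t() on a data frame that has a label column yields a character matrix, so the rebuilt data.frame holds text rather than numbers. A minimal sketch of one way around it, assuming the labels A to D sit in a first column (called id here, a hypothetical name), is to drop that column before transposing so everything stays numeric:

```r
# Why the values shift: text "0", "1", "3" becomes level codes 1, 2, 3
codes <- data.matrix(data.frame(x = c("0", "1", "3")))[, "x"]

# Stand-in for the input data; the column name `id` is an assumption
raw <- data.frame(id   = c("A", "B", "C", "D"),
                  Qua1 = c(0, 0, 0, 1),
                  Qua2 = c(1, 1, 2, 1),
                  Qua3 = c(1, 0, 0, 3),
                  Qua4 = c(3, 0, 0, 0))

# Transpose only the numeric columns, then attach the labels as column names
m <- t(as.matrix(raw[, -1]))
colnames(m) <- raw$id

# The matrix is already numeric, so ts() has nothing to recode
series <- ts(m)
```

Because the label column never enters the matrix, the zeros survive unchanged.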

Related

Problem with data frame transformation using dplyr package

Problem
Let's consider two data frames: one containing only 1s and 0s, and a second one with data:
set.seed(20)
df<-data.frame(sample(0:1,5,T),sample(0:1,5,T),sample(0:1,5,T))
#zero_one data frame
sample.0.1..5..T. sample.0.1..5..T..1 sample.0.1..5..T..2
1 0 1 0
2 1 0 0
3 1 1 1
4 0 0 0
5 1 0 1
df1<-data.frame(append(rnorm(4),10),append(runif(4),-5),append(rexp(4),20))
#with data
append.rnorm.4...10. append.runif.4....5. append.rexp.4...20.
1 0.08609139 0.2374272 0.3341095
2 -0.63778176 0.2297862 0.7537732
3 0.22642990 0.9447793 1.3011998
4 -0.05418293 0.8448115 1.2097271
5 10.00000000 -5.0000000 20.0000000
Now I want to replace the values in the second data frame at the positions where the first data frame is 0 with the mean of the values in the same column at the positions where the first data frame is 1.
Example
In the first column I want to replace 0.08609139 and -0.05418293 (the values for which the first column of the first data frame is 0) with mean(c(-0.63778176, 0.22642990, 10.00000000)) (the values for which the first column of the first data frame is 1).
I want to do it using the mutate_all() function from the dplyr package.
My work so far
df1<-df1 %>% mutate_all(
function(x) ifelse(df[x]==0, mean(x[df==1],na.rm=T,x)))
I know that the condition df[x] is meaningless, but I have no idea what I should put there. Could you please help me with that?
You could follow @deschen's suggestion and multiply the two data frames together.
Here is another approach to consider, using mapply. For each column, identify the positions (indices) in df where the value is zero.
Then replace the values at those positions in the corresponding df1 column with the mean of the other values in that column. y[-idx] is the set of values in the df1 column excluding those positions.
Note that my set.seed is different: when I used yours of 20 I got different values, and a column with all zeroes. Please let me know if you are able to reproduce.
set.seed(12)
df<-data.frame(sample(0:1,5,T),sample(0:1,5,T),sample(0:1,5,T))
df1<-data.frame(append(rnorm(4),10),append(runif(4),-5),append(rexp(4),20))
my_fun <- function(x, y) {
idx <- which(x == 0)
y[idx] <- mean(y[-idx])
return(y)
}
mapply(my_fun, df, df1)
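mapply returns a matrix here. If a data frame is wanted, the same column-by-column idea can be sketched with Map and logical masks. The small data frames below (flags, values) and the helper replace_flagged are hypothetical stand-ins for df, df1 and my_fun, chosen so the result does not depend on the random seed:

```r
# Deterministic stand-ins: `flags` plays the role of df, `values` of df1
flags  <- data.frame(a = c(0, 1, 1, 0, 1), b = c(1, 0, 0, 0, 1))
values <- data.frame(a = c(1, 2, 4, 8, 6), b = c(10, 20, 30, 40, 50))

# Replace entries flagged 0 with the mean of the entries flagged 1
replace_flagged <- function(flag, value) {
  is_zero <- flag == 0
  value[is_zero] <- mean(value[!is_zero], na.rm = TRUE)
  value
}

# Map pairs the columns by position; as.data.frame restores the table shape
result <- as.data.frame(Map(replace_flagged, flags, values))
```

A logical mask avoids the negative-index step; a column whose flags are all zero ends up filled with NaN, since there is nothing to average.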

Procedural way to generate signal combinations and their output in R

I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) with several columns that I could generate a signal from, and each value in the vectors can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns, then see what each of the composite signals tells me about the returns in a time series. For this question the scope is only generating the combinations.
So for example, one composite signal would be when all four cells in the vectors = 0. I could generate a new column that reads TRUE when that case is true and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is, I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and so on, which is quite a few. With the extent of my knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn this to dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
sig2 = c(1,1,0,0,0,1,1),
sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1,sig2,sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
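The same paste-then-indicator idea can also be sketched in base R, without the dummies package. This is only an illustration on the sample data above; the names signals, combo, indicators and all_combos are made up for the example:

```r
# Sample data as in the answer above
signals <- data.frame(sig1 = c(1, 1, 1, 1, 0, 0, 0),
                      sig2 = c(1, 1, 0, 0, 0, 1, 1),
                      sig3 = c(2, 2, 0, 1, 1, 2, 1))

# Collapse each row into one key such as "112"
signals$combo <- do.call(paste0, signals[c("sig1", "sig2", "sig3")])

# One logical column per combination that actually occurs in the data
indicators <- sapply(unique(signals$combo), function(key) signals$combo == key)

# Every possible combination of three signals taking values 0 to 3
all_combos <- do.call(paste0, expand.grid(rep(list(0:3), 3)))
```

Building indicators only for observed keys keeps the result small; all_combos is there in case the unobserved combinations need to be listed as well.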

Running an interaction matrix between many variables

I have a data set with 70 column variables, each a 0-1 dummy variable, and 3500 observations. I want to see how often observations with a 'success' in one variable also have a success in another variable. In other words, if observation 1 has a success dummy in variable 1, how often does it also have a success in variable 2, and so on for all the variables. I have found how to create a table showing interactions when only two columns are involved, but I can't find anything involving many columns. Ideally I'd like to present this as an interaction matrix with 70 variables across and 70 down. Here is an idea of the data set:
Dat A B C D
XX 1 1 1 1
XY 0 1 0 1
XZ 0 0 1 1
The output I'm hoping for would be:
Out  A  B  C  D
A    0  1  1  1
B       0  1  2
C          0  2
D             0
This shows the number of times that (A,B) is a pairing, (B,C) is a pairing, and so on.
I have tried using the table() command as well as as.matrix, but these seem to require data organized as two columns and cannot handle data spread across many column variables. I am fairly new to R, so I apologize if my question isn't clear or is possibly quite simple.
Any help is appreciated. Thanks
Here's how to create a correlation matrix of arbitrary size. First create a reproducible example of your dataset...
dat <- matrix(sample(0:1, size = 700, replace = TRUE), ncol = 70)
dat <- data.frame(dat)
Then calculate the correlation...
dat <- cor(dat)
And then plot the correlation visually...
library(corrplot)
corrplot(dat, method = "square")
You can also plot the correlation using numbers instead of colors...
corrplot(dat, method = "number")
Obviously you'll want to finesse these charts before using them in a publication. corrplot offers tons of options for chart appearance.
You can try:
res <- apply(combn(2:ncol(df), 2), 2, function(x, y) sum(rowSums(y[, x]) == 2), df)
m <- diag(x=0, ncol(df)-1)
m[upper.tri(m)] <- res
m[lower.tri(m)] <- NA
dimnames(m) <- list(colnames(df)[-1], colnames(df)[-1])
A B C D
A 0 1 1 1
B NA 0 1 2
C NA NA 0 2
D NA NA NA 0
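For 0/1 data, the pair counts can also be obtained with a single matrix product, since entry (i, j) of t(x) %*% x counts the rows where both column i and column j are 1. A minimal sketch on the three-row example from the question (the object names obs, x and co are illustrative):

```r
# Example data from the question; the first column holds the row labels
obs <- data.frame(Dat = c("XX", "XY", "XZ"),
                  A = c(1, 0, 0),
                  B = c(1, 1, 0),
                  C = c(1, 0, 1),
                  D = c(1, 1, 1))

# Drop the label column and compute t(x) %*% x
x  <- as.matrix(obs[, -1])
co <- crossprod(x)

# Match the requested layout: zero diagonal, lower triangle blanked out
diag(co) <- 0
co[lower.tri(co)] <- NA
```

Before the diagonal is zeroed it holds the number of successes per variable, which may be worth keeping in a separate object.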

table() function does not convert data frame correctly

I am trying to convert a data.frame to a table without packages. I took the cookbook as a reference and tried it with a data frame and with both named and unnamed vectors. The data set is the Stack Overflow survey from Kaggle.
moreThan1000 is a data.frame that stores the countries with more than 1000 Stack Overflow users, sorted by the Number column, as shown below:
moreThan1000 <- subset(users, users$Number >1000)
moreThan1000 <- moreThan1000[order(moreThan1000$Number),]
When I try to convert it to a table like
tbl <- table(moreThan1000)
tbl <- table(moreThan1000$Country, moreThan1000$Number)
tbl <- table(moreThan1000$Country, moreThan1000$Number, dnn = c("Country","Number"))
after each attempt my conversion looks like this:
Why does the moreThan1000 data.frame send all countries to the table instead of just the relevant ones? It seems to me the result looks like a matrix.
I believe that this is because the countries do not relate to each other. Each country corresponds to a number, and another country corresponds to an unrelated number. So the best way to reflect this is the original data.frame, not a table that will have just one 1 per row (unless two countries have the very same number of Stack Overflow users). I haven't downloaded the dataset you're using, but look at what happens with a fake dataset, ordered by number just like your moreThan1000.
dat <- data.frame(A = letters[1:5], X = 21:25)
table(dat$A, dat$X)
21 22 23 24 25
a 1 0 0 0 0
b 0 1 0 0 0
c 0 0 1 0 0
d 0 0 0 1 0
e 0 0 0 0 1
Why would you expect anything different from your dataset?
The function "table" is used to tabulate your data.
So it will count how often every value occurs (in the "Number" column!). In your case, every number occurs only once, so don't use this function here. It's working correctly, but it's not what you need.
Your data is already a tabulation, so there is no need to count frequencies again.
You can check whether there is an object conversion function; I guess you are looking for as.table rather than table.
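As for why every country shows up: one likely cause, assuming Country is stored as a factor, is that subset() keeps the unused factor levels and table() reports all levels. The sketch below uses invented data (the country names and numbers are made up) to show droplevels() removing them, and xtabs() reusing the existing counts instead of re-counting:

```r
# Invented stand-in for `users`; Country is assumed to be a factor
users <- data.frame(Country = factor(c("Chile", "India", "Kenya", "Spain")),
                    Number  = c(500, 1500, 800, 3000))

more_than_1000 <- subset(users, Number > 1000)

# The filtered factor still remembers all four countries
levels_before <- nlevels(more_than_1000$Country)

# Drop the levels that no longer occur in the data
more_than_1000 <- droplevels(more_than_1000)
levels_after <- nlevels(more_than_1000$Country)

# Use Number as the cell value rather than counting occurrences
counts <- xtabs(Number ~ Country, data = more_than_1000)
```

Here counts is a one-dimensional table indexed by country, which is closer to what the conversion was meant to produce.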

R: How to turn a table of transaction-item pairs into a matrix of transactions x items?

I have a data frame in R like this:
user1,A
user1,B
user2,A
user2,C
user2,C
user3,A
user4,C
How can I transform it into a table like this?
user1,1,1,0
user2,1,0,2
user3,1,0,0
user4,0,0,1
Actually my data has a time interval as the second column, which is the time passed between the user's first and current purchase. I want to do a matrix plot in which each line in the matrix is a line in the image (And each element a pixel).
(I can (kinda) do it with the arules package, exporting the original dataframe and importing it as transactions, but I think there must be a direct way to do this, without needing such a hack.)
Thanks!
You can just use table()
> table(df1)
# V2
#V1 A B C
# user1 1 1 0
# user2 1 0 2
# user3 1 0 0
# user4 0 0 1
If you want to store this output as a new dataframe df2, this is one possibility:
df2 <- as.data.frame.matrix(table(df1))
data
df1 <- read.table(text="user1,A
user1,B
user2,A
user2,C
user2,C
user3,A
user4,C", header=F, sep=",")
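For the matrix plot mentioned in the question, the table can be turned into a plain matrix and reoriented for image(), which draws the rows of its input along the x-axis starting from the bottom-left corner. A sketch under the assumption that the two columns are named user and item (hypothetical names):

```r
# Same pairs as above, with assumed column names
purchases <- data.frame(user = c("user1", "user1", "user2", "user2",
                                 "user2", "user3", "user4"),
                        item = c("A", "B", "A", "C", "C", "A", "C"))

# Count each user-item pair, then strip the table class
m <- unclass(table(purchases$user, purchases$item))

# Reverse the rows and transpose so that image(z) shows the matrix
# the way it is printed: user1 on the top line, items left to right
z <- t(m[nrow(m):1, ])

# image(z, axes = FALSE)  # uncomment to draw one line per user
```

Without the reorientation, image() would show the matrix rotated relative to its printed form.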
