I have to pre-process a big matrix. To make my example easier to understand I will use the following matrix:
Raw data
Where col = people and row = skills
In R my matrix is:
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
Aim
In my case I need to process row by row, so there are 3 steps. For each row, and for each pair of people (i, j), I have to:
Put 0 if i = j (so every diagonal element equals zero)
Put 0 if either of the two values is 0
Otherwise add the two values
I will show the 3 steps to be more clear.
Step 1 (row1)
The data are the row 1
The result is:
Step 2 (row2)
The data are the row 2
The result is:
Step 3 (row3)
The data are the row 3
The result is:
Create a maximum matrix
Then the maximum matches are:
So my final matrix should be:
Question
Can someone tell me how to achieve this in R?
And of course the same process should work if my matrix has more rows and columns.
Thanks a lot :)
Here is my implementation in R. The code doesn't execute the steps exactly in the way you specified them. I focused on your final matrix and assumed that this is the main result you're interested in.
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
rownames(test) <- paste("Skill", 1:dim(test)[1], sep="")
colnames(test) <- paste("People", 1:dim(test)[2], sep="")
test
# Pairwise combinations
comb.mat <- combn(1:dim(test)[2], 2)
pairwise.mat <- data.frame(matrix(t(comb.mat), ncol=2))
pairwise.mat$max.score <- 0
names(pairwise.mat) <- c("Person1", "Person2", "Max.Score")
for ( i in 1:dim(comb.mat)[2] ) { # Loop over the pairs (columns of comb.mat)
  first.person <- comb.mat[1,i]
  second.person <- comb.mat[2,i]
  temp.mat <- test[, c(first.person, second.person)]
  temp.mat[temp.mat == 0] <- NA # Treat zeros as missing so the row sum becomes NA
  temp.rowSums <- rowSums(temp.mat, na.rm=FALSE)
  temp.rowSums[is.na(temp.rowSums)] <- 0
  max.sum <- max(temp.rowSums)
  previous.val <- pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person]
  pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person] <- max.sum*(max.sum > previous.val)
}
pairwise.mat
  Person1 Person2 Max.Score
1       1       2        25
2       1       3        32
3       2       3         0
person.mat <- matrix(NA, nrow=dim(test)[2], ncol=dim(test)[2])
rownames(person.mat) <- colnames(person.mat) <- paste("People", 1:dim(test)[2], sep="")
diag(person.mat) <- 0
person.mat[cbind(pairwise.mat[,1], pairwise.mat[,2])] <- pairwise.mat$Max.Score
person.mat[lower.tri(person.mat, diag=F)] <- t(person.mat)[lower.tri(person.mat, diag=F)]
person.mat
People1 People2 People3
People1 0 25 32
People2 25 0 0
People3 32 0 0
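For comparison, the three per-row steps described in the question can also be sketched directly, without looping over pairs. This is only a sketch under the assumption that the rules mean "sum the two people's values for a skill unless either value is 0". The names pair_scores, per_skill and final are made up for illustration.

```r
test <- matrix(c(18,12,15,0,13,0,14,0,12), ncol = 3, nrow = 3)

# Hypothetical helper: pairwise score matrix for one skill (one row of test)
pair_scores <- function(skill_row) {
  s <- outer(skill_row, skill_row, "+")                      # value_i + value_j
  s[outer(skill_row == 0, skill_row == 0, "|")] <- 0         # 0 if either value is 0
  diag(s) <- 0                                               # 0 on the diagonal
  s
}

# One matrix per skill, then the element-wise maximum across all skills
per_skill <- lapply(seq_len(nrow(test)), function(r) pair_scores(test[r, ]))
final <- Reduce(pmax, per_skill)
final
```

Because the helper works on one row at a time, the same code should apply unchanged to a matrix with more rows and columns.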
Related
I found this online and used this with my data:
df <- data.frame(seasons = c("Season1","Season2","Season3","Season4"))
for(i in unique(df$seasons)) {
df[[paste0(i)]] <- ifelse(df$seasons==i,1,0)
}
The only challenge is where there is a 0 in the resultant cell, I want to fill in a meaningful value from a data frame that has data arranged like so:
S1       S2       Value
Season1  Season2  3
Season3  Season1  5
Season2  Season3  4
Note how a season in a pair could pop up at S1 or S2.
I'll need to fill for example,{row Season1; col Season 2} as well as {col Season 1 and row Season 2} in my matrix as 3.
Is there any way for me to do this? I tried a few things but decided to give a shoutout to the community in case there is something simple out there I'm missing!
Thanks a bunch!
There are three steps. I decided to rebuild the original matrix and call it S (here df is the three-column data frame holding S1, S2 and Value):
# Make square matrix with 1s on the diagonal and 0s elsewhere
rc <- length(unique(df[[1]]) ) # going to assume that number of unique values is same in both cols
S <- diag(1, rc,rc)
# Label rows and cols
dimnames(S) <- list( sort(unique(df[[1]])), sort( unique(df[[2]])) )
# Assign value to matrix positions based on values of df[[3]]
S[ data.matrix( df[1:2]) ] <- # using 2 col matrix indexing
df[[3]]
# -------
> S
        Season1 Season2 Season3
Season1       1       3       0
Season2       0       1       4
Season3       5       0       1
It's now a real matrix rather than a dataframe.
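The question also asks for each value to appear in both {row S1, col S2} and {row S2, col S1}. A minimal sketch of one way to get that is to assign twice with the index columns swapped. The data frame name pairs is an assumption, since the question does not name it, and character matrix indexing is used so the labels are matched by name.

```r
# Assumed name for the lookup table from the question
pairs <- data.frame(S1 = c("Season1", "Season3", "Season2"),
                    S2 = c("Season2", "Season1", "Season3"),
                    Value = c(3, 5, 4),
                    stringsAsFactors = FALSE)

# Square matrix with 1s on the diagonal, labelled by every season seen
seasons <- sort(unique(c(pairs$S1, pairs$S2)))
S <- diag(1, length(seasons))
dimnames(S) <- list(seasons, seasons)

# Fill both orientations using two-column character indexing
S[cbind(pairs$S1, pairs$S2)] <- pairs$Value
S[cbind(pairs$S2, pairs$S1)] <- pairs$Value
S
```

Taking the labels from both columns together avoids relying on each column containing every season.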
My dataframe contains a column with various touch points, numbered 1 to 18. I want to know which touch point leads to touch point 10. Therefore I want to create a new column which shows the touch point that occurred just before touch point 10 per customer journey (PurchaseId). If touch point 10 doesn't occur in a customer journey the value can be NULL or 0.
So for example:
dd <- read.table(text="
PurchaseId TouchPoint DesiredOutcome
1 8 6
1 6 6
1 10 6
2 12 0
2 8 0
3 17 4
3 3 4
3 4 4
3 10 4", header=TRUE)
The complete dataset contains 2.500.000 observations. Does anyone know how to solve my problem? Thanks in advance.
Firstly, it is better to give a complete reproducible sample code. I suggest you look at the data.table library which is nice for handling large datasets.
library(data.table)
mdata <- matrix(sample(x = c(1:20, 21), size = 15*10, replace = TRUE), ncol = 10)
mdata[mdata==21] <- NA
mdata <- data.frame(mdata)
names(mdata) <- paste0("cj", 1:10)
df_touch <- data.table(mdata)
# -- using for
res <- rep(0, ncol(df_touch)) # one result per column cj1..cj10
for( i in 1:10){
  cat(i, "\n")
  res[i] <- i*df_touch[, (10 %in% get(paste0("cj", i)))]
  cat(res[i], "\n")
}
# -- using lapply
dfun <- function(x, k = 10){ return( k %in% x ) }
df_touch[, lapply(.SD, dfun)]
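The code above only checks whether touch point 10 occurs. For the long-format data in the question, one possible base R sketch of the "touch point before 10" column is shown here. It assumes rows are already in journey order within each PurchaseId and uses the first occurrence of 10. The names prev_before_10 and Outcome are invented for this example.

```r
dd <- data.frame(PurchaseId = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 TouchPoint = c(8, 6, 10, 12, 8, 17, 3, 4, 10))

# Hypothetical helper: touch point just before the first 10, or 0 if there is none
prev_before_10 <- function(tp) {
  pos <- which(tp == 10)[1]            # NA when 10 never occurs
  if (is.na(pos) || pos == 1) 0 else tp[pos - 1]
}

# ave() applies the helper per PurchaseId and repeats the result on every row
dd$Outcome <- ave(dd$TouchPoint, dd$PurchaseId,
                  FUN = function(tp) rep(prev_before_10(tp), length(tp)))
dd
```

With 2.5 million rows the same helper could presumably be applied per group in data.table, but that variant is not shown here.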
I would like to select a sample from a dataset twice. Actually, I don't want to select it, but to create a new variable sampleNo that indicates to which sample (one or two) a case belongs to.
Let's suppose I have a dataset containing 40 cases:
data <- data.frame(var1=seq(1:40), var2=seq(40,1))
The first sample (n=10) I drew like this:
data$sampleNo <- 0
idx <- sample(seq(1,nrow(data)), size=10, replace=F)
data[idx,]$sampleNo <- 1
Now, (and here my problems start) I'd like to draw a second sample (n=10). But this sample should be drawn from the cases that don't belong to the first sample only. Additionally, "var1" should be an even number.
So sampleNo should be 0 for cases that were not drawn at all, 1 for cases that belong to the first sample and 2 for cases belonging to the second sample (= sampleNo equals 0 and var1 is even).
I was trying to solve it like this:
idx2<-data$var1%%2 & data$sampleNo==0
sample(data[idx2,], size=10, replace=F)
But how can I set sampleNo to 2?
We can use the setdiff function as follows:
sample(setdiff(1:nrow(data), idx), 3, replace = F)
setdiff(x, y) will select the elements of x that are not in y:
setdiff(x = 1:20, y = seq(2,20,2))
[1] 1 3 5 7 9 11 13 15 17 19
So to include in the above example:
data$sampleNo2 <- 0
idx2 <- sample(setdiff(1:nrow(data), idx), 3, replace = F)
data[idx2,]$sampleNo2 <- 1
Here is a complete solution to your problem more along the line of your original idea. The code can be shortened but for now I tried to make it as transparent as I could.
# Data
data <- data.frame(var1 = 1:40, var2 = 40:1)
# Add SampleNo column
data$sampleNo <- 0L
# Randomly select 10 rows as sample 1
pool_idx1 <- 1:nrow(data)
idx1 <- sample(pool_idx1, size = 10)
data[idx1, ]$sampleNo <- 1L
# Draw a second sample from cases where sampleNo != 1 & var1 is even
pool_idx2 <- pool_idx1[data$var1 %% 2 == 0 & data$sampleNo != 1]
idx2 <- sample(pool_idx2, size = 10)
data[idx2, ]$sampleNo <- 2L
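One caveat when reusing this pattern: if the pool passed to sample() is a single number x >= 1, sample() draws from 1:x rather than from that one element. The sketch below sidesteps this by sampling positions with sample.int(). The helper name draw_from is hypothetical, and the seed is set only so the sketch is reproducible.

```r
set.seed(1)  # assumption: fixed seed purely for reproducibility
data <- data.frame(var1 = 1:40, var2 = 40:1)
data$sampleNo <- 0L

# Hypothetical helper: pick `size` elements of `pool` by position
draw_from <- function(pool, size) pool[sample.int(length(pool), size)]

# Sample 1: any 10 rows
idx1 <- draw_from(seq_len(nrow(data)), 10)
data$sampleNo[idx1] <- 1L

# Sample 2: 10 rows that are not in sample 1 and have an even var1
pool2 <- which(data$var1 %% 2 == 0 & data$sampleNo == 0L)
idx2 <- draw_from(pool2, 10)
data$sampleNo[idx2] <- 2L

table(data$sampleNo)
```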
I would like to use the vector:
time.int<-c(1,2,3,4,5) #vector to be use as a "guide"
and the database:
time<-c(1,1,1,1,5,5,5)
value<-c("s","s","s","t","d","d","d")
dat1<- as.data.frame(cbind(time,value))
to create the following vector, which I can then add to the first vector "time.int" into a second database.
freq<-c(4,0,0,0,3) #wished result
This vector is the sum of the events that belong to each time interval, there are four 1 in "time" so the first value gets a four and so on.
Potentially I would like to generalize it so that I can decide the interval, for example saying sum in a new vector the events in "times" each 3 numbers of time.int.
EDIT for generalization
time.int<-c(1,2,3,4,5,6)
time<-c(1,1,1,2,5,5,5,6)
value<-c("s","s","s","t", "t","d","d","d")
dat1<- data.frame(time,value)
let's say I want it every 2 seconds (every 2 time.int)
freq<-c(4,0,4) #wished result
or every 3
freq<-c(4,4) #wished result
I know how to do that in excel, with a pivot table.
Sorry if this is a duplicate. I could not find a fitting question on this website, and I do not even know how to phrase this or where to start.
The following will produce vector freq.
freq <- sapply(time.int, function(x) sum(x == time))
freq
[1] 4 0 0 0 3
BTW, don't use the construct as.data.frame(cbind(.)). Use instead
dat1 <- data.frame(time,value)
In order to generalize the code above to segments of time.int of any length, I believe the following function will do it. Note that since you've changed the data the output for n == 1 is not the same as above.
fun <- function(x, y, n){
  inx <- lapply(seq_len(length(x) %/% n), function(m) seq_len(n) + n*(m - 1))
  sapply(inx, function(i) sum(y %in% x[i]))
}
freq1 <- fun(time.int, time, 1)
freq1
[1] 3 1 0 0 3 1
freq2 <- fun(time.int, time, 2)
freq2
[1] 4 0 4
freq3 <- fun(time.int, time, 3)
freq3
[1] 4 4
We can use the table function to count the event number and use merge to create a data frame summarizing the information. event_dat is the final output.
# Create example data
time.int <- c(1,2,3,4,5)
time <- c(1,1,1,1,5,5,5)
# Count the event using table and convert to a data frame
event <- as.data.frame(table(time))
# Convert the time.int to a data frame
time_dat <- data.frame(time = time.int)
# Merge the data
event_dat <- merge(time_dat, event, by = "time", all = TRUE)
# Replace NA with 0
event_dat[is.na(event_dat)] <- 0
# See the result
event_dat
time Freq
1 1 4
2 2 0
3 3 0
4 4 0
5 5 3
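The table-based count can be extended to blocks of n consecutive time.int values, as asked in the edit to the question. This is a sketch rather than a definitive version. The name count_by_block is hypothetical, and it assumes a trailing incomplete block should still be reported.

```r
time.int <- c(1, 2, 3, 4, 5, 6)
time <- c(1, 1, 1, 2, 5, 5, 5, 6)

# Hypothetical helper: events per block of n consecutive time.int values
count_by_block <- function(time.int, time, n = 1) {
  # Count per value; factor levels keep values that have no events
  per_value <- as.vector(table(factor(time, levels = time.int)))
  # Block number of each time.int position: 1,1,2,2,... for n = 2
  block <- ceiling(seq_along(time.int) / n)
  as.vector(tapply(per_value, block, sum))
}

count_by_block(time.int, time, 2)
count_by_block(time.int, time, 3)
```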
A questionnaire was passed to teachers to check their curriculum preferences. They had to choose 20 items from about 50 options.
The resulting data is a long list of choices of the following type:
Teacher ID, Question ID
I want to reshape it into a table with one row for each teacher and one column for each question, with the possible values 0 (not chosen) and 1 (chosen).
In pseudo code (of a programming language)
it would probably be something like this:
iterate list {
    data [teacher_id] [question_id] = 1
}
Here is a sample data and the intended result:
a <- data.frame(
Case_ID = c(1,1,2,2,4,4),
Q_ID = c(3,5,5,8,2,6)
)
intended result is
res <- data.frame(
Case_ID = c(1,2,4),
Q_1 = c(0,0,0),
Q_2 = c(0,0,1),
Q_3 = c(1,0,0),
Q_4 = c(0,0,0),
Q_5 = c(1,1,0),
Q_6 = c(0,0,1),
Q_7 = c(0,0,0),
Q_8 = c(0,1,0)
)
Any help would be greatly appreciated.
Tnx
Hed
Returning a matrix and using matrix indexing to do the work:
m <- matrix(0, nrow=3, ncol=8)
rownames(m) <- c(1,2,4)
colnames(m) <- 1:8
idx <-apply(a, 2, as.character)
m[idx] <- 1
m
## 1 2 3 4 5 6 7 8
## 1 0 0 1 0 1 0 0 0
## 2 0 0 0 0 1 0 0 1
## 4 0 1 0 0 0 1 0 0
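If a data frame shaped like the intended result is preferred over a matrix, a cross-tabulation is one possible alternative. The sketch below assumes 8 questions, as in the sample data, and that each teacher/question pair occurs at most once (table() counts occurrences, so a repeated pair would give 2).

```r
a <- data.frame(
  Case_ID = c(1, 1, 2, 2, 4, 4),
  Q_ID = c(3, 5, 5, 8, 2, 6)
)

# Cross-tabulate; the factor levels force a column for every question 1..8
tab <- table(a$Case_ID, factor(a$Q_ID, levels = 1:8))

# Convert to a data frame with Q_1..Q_8 columns and a Case_ID column
res <- as.data.frame.matrix(tab)
names(res) <- paste0("Q_", colnames(tab))
res <- cbind(Case_ID = as.numeric(rownames(tab)), res)
rownames(res) <- NULL
res
```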
Note that you can think of a as a list of indices that reference which cells in a "master array" are TRUE.
Then, if you have a master matrix of all 0's (called res2 below), you can tell R: "all of the elements that are referenced in a should be 1".
This is done below.
First we create the "master matrix":
# identify the unique teacher ID's
teacherIDs <- unique(a$Case_ID)
# count how many teachers there are
numbTeachers <- length(teacherIDs)
# create the column names for the questions
colNames <- c(paste0("Q_", 1:50))
# dim names for matrix. Using T_id for the row names
dnames <- list(paste0("T_", teacherIDs),
colNames)
# create the matrix
res2 <- matrix(0, ncol=50, nrow=numbTeachers, dimnames=dnames)
Next we convert a to a set of indices.
Note that the first two lines below are only needed if some Teacher IDs are not present, i.e. in your example T_3 is not present.
# create index out of a
indx <- a
indx$Case_ID <- match(indx$Case_ID, teacherIDs) # row position of each teacher in res2
indx <- as.matrix(indx)
# populate those in a with 1
res2[indx] <- 1
res2