Creating a data frame for confusionMatrix() input, but getting an unusual data frame - r

I'm trying to understand the results of a prediction object via caret's confusionMatrix() function, which requires table input according to http://artax.karlin.mff.cuni.cz/r-help/library/caret/html/confusionMatrix.html. My table() call creates results that I understand, but they are not friendly to the confusionMatrix() function.
Here is the relevant code snippet:
#MODEL CREATION
#convert categorical A to E values , into numeric 1 to 5 in order to be regression friendly
training_data_subset_numeric <- training_data_subset;
testing_data_subset_numeric <- testing_data_subset;
training_data_subset_numeric$classe <- as.numeric(training_data_subset$classe)
testing_data_subset_numeric$classe <- as.numeric(testing_data_subset$classe)
#model
exercise.model <- glm(formula = classe ~ ., data = training_data_subset_numeric)
#MODEL EVALUATION
exercise.prediction <- predict(exercise.model,newdata = testing_data_subset_numeric)
eval_table <- table(exercise.prediction,testing_data_subset$classe)
tail(eval_table)
exercise.prediction A B C D E
4.35504232913594 1 0 0 0 0
4.47219097065568 1 0 0 0 0
4.50838854075835 1 0 0 0 0
4.6173551930011 0 1 0 0 0
4.69261223447305 0 1 0 0 0
4.73297946213265 0 1 0 0 0
Basically, I need to convert the above output to a data frame with one column holding the prediction value, following this rule:
If column A is 1, then the predicted value is 1
If column B is 1, then the predicted value is 2
If column C is 1, then the predicted value is 3
If column D is 1, then the predicted value is 4
If column E is 1, then the predicted value is 5
I therefore wrote this function to get the job done:
getPredictResults<- function(x)
{
# create 1 column & n row data frame
num <- data.frame(matrix(0, ncol = 1, nrow = nrow(x)));
for (r in 1:nrow(x) ) {
for (c in 1:ncol(x) ) {
#if column A has value 1 than num[1,r] <- 1
if (x[r,'A']== 1)
{
num[1,r] <- 1;
}
#if column B has value 1 than num[1,r] <- 2
else if (x[r,'B']== 1)
{
num[1,r] <- 2;
}
#if column C has value 1 than num[1,r] <- 3
else if (x[r,'C']== 1)
{
num[1,r] <- 3;
}
#if column D has value 1 than num[1,r] <- 4
else if (x[r,'D']== 1)
{
num[1,r] <- 4;
}
#if column E has value 1 than num[1,r] <- 5
else if (x[r,'E']== 1)
{
num[1,r] <- 5;
}
else
{
}
}#end inner for
}#end outer for
return (num);
}#end function
exercise.prediction_df <- getPredictResults(eval_table)
However, when typing:
head(exercise.prediction_df)
I'm getting unusual output; here is the bottom snippet:
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4840 V4841 V4842 V4843 V4844 V4845 V4846 V4847 V4848 V4849 V4850 V4851 V4852 V4853 V4854 V4855 V4856 V4857
1 5 1 4 5 2 2 5 5 1 2 5 4 5 5 1 5 5 4
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4858 V4859 V4860 V4861 V4862 V4863 V4864 V4865 V4866 V4867 V4868 V4869 V4870 V4871 V4872 V4873 V4874 V4875
1 4 2 1 2 5 1 4 5 2 1 4 5 2 4 2 4 4 2
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4876 V4877 V4878 V4879 V4880 V4881 V4882 V4883 V4884 V4885 V4886 V4887 V4888 V4889 V4890 V4891 V4892 V4893
1 5 1 1 4 1 2 2 1 1 5 1 4 1 1 1 1 1 1
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4894 V4895 V4896 V4897 V4898 V4899 V4900 V4901 V4902 V4903 V4904
1 1 1 1 1 1 1 1 1 2 2 2
2 NA NA NA NA NA NA NA NA NA NA NA
[ reached getOption("max.print") -- omitted 4 rows ]
Further investigation shows:
> ncol(exercise.prediction_df)
[1] 4904
> nrow(exercise.prediction_df)
[1] 4904
ncol() should return only 1, while nrow() can obviously be any integer value.
How can I fix this function so that it creates the right data frame as input to the confusionMatrix() function?
Thanks.

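# reproducible stand-in for the question's data: five classes labelled A to E
classe <- cut(runif(100), seq(0, 1, length.out = 6))
levels(classe) <- c("A", "B", "C", "D", "E")
exercise.prediction <- rnorm(100)
eval_table <- table(exercise.prediction, classe)
eval_matrix <- as.matrix(eval_table)
# each row has a single 1; weighting the columns by 1:5 recovers the class index
transform <- apply(eval_matrix, 1, function(x) sum(x * c(1:5)))
head(as.data.frame(transform))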
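As a complement, here is a minimal sketch of how the question's own loop could be repaired. The likely cause of the 4904 x 4904 result is that num[1,r] indexes row 1 and column r, so each iteration adds a column; indexing as num[r, 1] keeps a single column. The toy data and the name get_predict_results are assumptions made for illustration, not part of the original post.

```r
# Toy stand-in for the question's eval_table: one row per distinct prediction,
# one column per class, with a single 1 marking the observed class.
set.seed(42)
classe <- factor(sample(LETTERS[1:5], 20, replace = TRUE), levels = LETTERS[1:5])
exercise.prediction <- rnorm(20)
eval_table <- table(exercise.prediction, classe)

# Hypothetical corrected helper: rows are indexed first, columns second.
get_predict_results <- function(x) {
  num <- data.frame(value = numeric(nrow(x)))
  for (r in seq_len(nrow(x))) {
    # position (1 to 5) of the first column in row r that equals 1
    num[r, 1] <- which(x[r, ] == 1)[1]
  }
  num
}

result <- get_predict_results(eval_table)
dim(result)
```

The inner loop over columns from the question is not needed here, because which() already scans the whole row.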

Related

Making the rows of a data frame to NAs using R

I have a data frame as follows,
aid=c(1:10)
x1_var=rnorm(10,0,1)
x2_var=rnorm(10,0,1)
x3_var=rbinom(10,1,0.5)
data=data.frame(aid,x1_var,x2_var,x3_var)
head(data)
aid x1_var x2_var x3_var
1 1 -0.99759448 -0.2882535 1
2 2 -0.12755695 -1.3706875 0
3 3 1.04709366 0.8977596 1
4 4 0.48883458 -0.1965846 1
5 5 -0.40264114 0.2925659 1
6 6 -0.08409966 -1.3489460 1
I want to set all the values in a row of this data frame to NA if x3_var==1 (without setting the aid column to NA).
I tried the following code.
> data[which(data$x3_var==1),]=NA
> data
aid x1_var x2_var x3_var
1 NA NA NA NA
2 2 -0.12755695 -1.3706875 0
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
7 NA NA NA NA
8 8 -1.78160459 -1.8677633 0
9 9 -1.65895704 -0.8086148 0
10 10 -0.06281384 1.8888726 0
But this code has also set the values of the aid column to NA. Can anybody help me fix this?
Also are there any methods that do the same thing?
Thank you
Your code would work if you remove aid column from it.
data[which(data$x3_var==1),-1]=NA
You can also do this without which :
data[data$x3_var==1, -1]=NA
In the above two cases I am assuming that you know the position of the aid column, i.e. 1. If in reality you don't know the position of the column, you can use match to get its position.
data[data$x3_var==1, -match('aid', names(data))] = NA
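The same idea can be sketched by selecting the columns by name rather than by position. This is an illustration under assumptions: the small data frame below is made up, and it includes an NA in x3_var to show one reason to keep which(), since which() drops NA entries from the logical condition before the assignment.

```r
# Made-up example data; x3_var contains one NA on purpose.
data <- data.frame(aid    = 1:5,
                   x1_var = c(0.5, -1.2, 0.3, 2.0, -0.7),
                   x2_var = c(1.1, 0.4, -0.9, 0.0, 0.8),
                   x3_var = c(1, 0, NA, 1, 0))

# Every column except aid, chosen by name instead of by position.
cols <- setdiff(names(data), "aid")

# which() keeps only the rows where the condition is TRUE (NA is dropped).
rows <- which(data$x3_var == 1)

data[rows, cols] <- NA
data
```

Rows 1 and 4 end up as NA in every column except aid, and the row whose x3_var was already NA is left untouched.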
A dplyr solution. Assuming the columns to be altered begin with "x" as in the example data.
library(dplyr)
set.seed(1001)
df1 <- data.frame(aid = 1:10,
x1_var = rnorm(10,0,1),
x2_var = rnorm(10,0,1),
x3_var = rbinom(10,1,0.5))
df1 %>%
mutate(across(starts_with("x"), ~ifelse(x3_var == 1, NA, .x)))
aid x1_var x2_var x3_var
1 1 2.1886481 0.3026445 0
2 2 -0.1775473 1.6343924 0
3 3 NA NA NA
4 4 -2.5065362 0.4671611 0
5 5 NA NA NA
6 6 -0.1435595 0.1102652 0
7 7 NA NA NA
8 8 -0.6229437 -1.0302508 0
9 9 NA NA NA
10 10 NA NA NA

need help creating a function to delete one wave of data if it's a duplicate of the previous wave for a retrospective measure

I'm working with a longitudinal dataset that has a retrospective measure of trauma that provides a yes/no endorsement of a question and the age of onset if the answer was "yes". If a question was endorsed at the first wave of data and then again at the second wave with the same age, it needs to be converted to a "no" and the age to NA. My data looks like this:
df <- as.data.frame(cbind(Aw1 = c(0,0,1,0,0),
Aagew1 = c(NA,NA,23,NA,NA),
Aw2 = c(1,0,1,0,0),
Aagew2 = c(29,NA,23,NA,NA),
Bw1 = c(1,0,0,0,1),
Bagew1 = c(20,NA,NA,NA,23),
Bw2 = c(1,0,1,0,1),
Bagew2 = c(20,NA,28,NA,23)))
print(df)
Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
1 0 NA 1 29 1 20 1 20
2 0 NA 0 NA 0 NA 0 NA
3 1 23 1 23 0 NA 1 28
4 0 NA 0 NA 0 NA 0 NA
5 0 NA 0 NA 1 23 1 23
Using the following data.table syntax I'm able to recode what I want conditionally so that the "A" question at wave two, that is the same incident recorded at wave 1, is no longer present
library(data.table)
dt <- as.data.table(df)
dt[Aagew1 == Aagew2, ':=' (Aw2 = 0, Aagew2 = NA)]
print(dt)
Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
1: 0 NA 1 29 1 20 1 20
2: 0 NA 0 NA 0 NA 0 NA
3: 1 23 0 NA 0 NA 1 28
4: 0 NA 0 NA 0 NA 0 NA
5: 0 NA 0 NA 1 23 1 23
I'd like to automate this syntax and put it into a function so that every question that repeats itself at wave 2 is deleted. Here's a function I made that doesn't work:
rm.duplicate <- function(x){
y <- as.data.table(x)
for(i in LETTERS[1:2]){
y[paste0(i,"age","w1") == paste0(i,"age","w2"), ':=' (paste0(i,"w2") = 0, paste0(i,"age","w2") = NA)]
}
return(as.data.frame(y))
}
The desired outcome is (so that the duplicates at wave 2 are deleted for all the unique questions):
Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
1: 0 NA 1 29 1 20 0 NA
2: 0 NA 0 NA 0 NA 0 NA
3: 1 23 0 NA 0 NA 1 28
4: 0 NA 0 NA 0 NA 0 NA
5: 0 NA 0 NA 1 23 0 NA
Thank you for helping me out!
The OP's function can be modified slightly by adding get, which returns the values of the columns named by the pasted strings:
rm.duplicate <- function(x){
y <- as.data.table(x)
for(i in LETTERS[1:2]){
y[get(paste0(i,"age","w1")) == get(paste0(i,"age","w2")),
paste0(i,c("", "age"), "w2") := .(0, NA)]
}
return(as.data.frame(y))
}
rm.duplicate(df)
# Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
#1 0 NA 1 29 1 20 0 NA
#2 0 NA 0 NA 0 NA 0 NA
#3 1 23 0 NA 0 NA 1 28
#4 0 NA 0 NA 0 NA 0 NA
#5 0 NA 0 NA 1 23 0 NA
Or another option is set
dt <- as.data.table(df)
for(i in LETTERS[1:2]){
w1 <- paste0(i,"age","w1")
w2 <- paste0(i,"age","w2")
i1 <- which(dt[[w1]] == dt[[w2]])
nm1 <- paste0(i,c("", "age"), "w2")
set(dt, i = i1, j = nm1[1], value = 0)
set(dt, i = i1, j = nm1[2], value = NA)
}
dt
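For comparison, the same recoding can be sketched in base R without data.table. This is only an illustration under assumptions: the function name rm_duplicate_base and its prefixes argument are hypothetical, and the column naming scheme (prefix, then "w1"/"w2" or "agew1"/"agew2") is taken from the example data in the question.

```r
df <- data.frame(Aw1 = c(0, 0, 1, 0, 0), Aagew1 = c(NA, NA, 23, NA, NA),
                 Aw2 = c(1, 0, 1, 0, 0), Aagew2 = c(29, NA, 23, NA, NA),
                 Bw1 = c(1, 0, 0, 0, 1), Bagew1 = c(20, NA, NA, NA, 23),
                 Bw2 = c(1, 0, 1, 0, 1), Bagew2 = c(20, NA, 28, NA, 23))

# Hypothetical helper: for each question prefix, reset wave 2 when its age
# of onset repeats the age reported at wave 1.
rm_duplicate_base <- function(x, prefixes = c("A", "B")) {
  for (p in prefixes) {
    age1 <- x[[paste0(p, "agew1")]]
    age2 <- x[[paste0(p, "agew2")]]
    dup  <- which(age1 == age2)      # which() ignores comparisons that are NA
    x[dup, paste0(p, "w2")]    <- 0
    x[dup, paste0(p, "agew2")] <- NA
  }
  x
}

res <- rm_duplicate_base(df)
res
```

Using [[ with a pasted column name plays the role that get plays inside the data.table call.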

Split numbers from letters in a mixed string of characters and put it into columns using Regex in R

I have a string of basketball player stats like in the example below:
stats <- c("40pt 2rb 1as 2st 2to 4trey 11-20fg 14-14ft",
"7pt 5rb 1as 2st 1bl 3to 3-5fg 1-4ft",
"0pt 1rb 1as 0-2fg")
Ideally I would like to transform this string into tabular format:
This is the key for each column:
pt=points
rb=rebounds
as=assists
st=steals
bl=blocks
to=turnovers
trey=3 pointers made
fg=field goals made-attempted
ft=free throws made-attempted
We split the string at the boundary between digit and letter (and at whitespace) to create the list ('lst'). We then loop through the list and change each element to a data.frame with column names taken from the alternate split values, rbind the elements with rbindlist, split the elements containing - into multiple columns with cSplit, and convert the NA values to 0.
library(data.table)
library(splitstackshape)
lst <- strsplit(stats, "(?<=[0-9])(?=[a-z])|\\s+", perl = TRUE)
lst1 <- lapply(lst, function(x)
as.data.frame.list(setNames(x[c(TRUE, FALSE)], x[c(FALSE, TRUE)])))
res <- cSplit(rbindlist(lst1, fill = TRUE), c('fg', 'ft'), '-')
for(nm in seq_along(res)){
set(res, i = NULL, j = nm, value = as.numeric(as.character(res[[nm]])))
set(res, i = which(is.na(res[[nm]])), j = nm, value = 0)
}
res
# pt rb as st to trey bl fg_1 fg_2 ft_1 ft_2
#1: 40 2 1 2 2 4 0 11 20 14 14
#2: 7 5 1 2 3 0 1 3 5 1 4
#3: 0 1 1 0 0 0 0 0 2 0 0
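The value/label separation described above can also be sketched without a lookaround split, by first extracting each "number plus label" token and then stripping one side of it. This is a base R illustration under the assumption that every token is digits (optionally with a hyphen) followed by lowercase letters, as in the example; parse_one is a hypothetical helper name.

```r
stats <- c("40pt 2rb 1as 2st 2to 4trey 11-20fg 14-14ft",
           "7pt 5rb 1as 2st 1bl 3to 3-5fg 1-4ft",
           "0pt 1rb 1as 0-2fg")

# One character vector of tokens per player, e.g. "40pt", "11-20fg".
tokens <- regmatches(stats, gregexpr("[0-9-]+[a-z]+", stats))

# Hypothetical helper: split each token into its value and its label.
parse_one <- function(tok) {
  values <- sub("[a-z]+$", "", tok)   # drop the trailing letters
  keys   <- sub("^[0-9-]+", "", tok)  # drop the leading digits and hyphen
  setNames(values, keys)
}

parsed <- lapply(tokens, parse_one)
parsed[[3]]
```

Each element of parsed is a named character vector, so made-attempted pairs such as "11-20" still need a further split on the hyphen, as cSplit does in the answer.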
Use dcast from the reshape2 package:
library(reshape2)
m=gsub("(\\d+)-(\\d+)(\\w+)","\\1\\3_m \\2\\3_a",stats)
n=gsub("(\\d+)(\\S*)","\\1 \\2",gsub("\\s","\n",m))
o=cbind(read.table(text=n),group=rep(1:length(n),lengths(strsplit(n,"\n"))))
dcast(o,group~V2,value.var="V1")
group as bl fg_a fg_m ft_a ft_m pt rb st to trey
1 1 1 NA 20 11 14 14 40 2 2 2 4
2 2 1 1 5 3 4 1 7 5 2 3 NA
3 3 1 NA 2 0 NA NA 0 1 NA NA NA
Using base R
> m=gsub("(\\d+)-(\\d+)(\\w+)","\\1\\3_m \\2\\3_a",stats)
> n=gsub("(\\d+)(\\S*)","\\1 \\2",gsub("\\s","\n",m))
> o=lapply(n,function(x)rev(read.table(text=x)))
> p=Reduce(function(x,y)merge(x,y,by="V2",all=T),o)
> read.table(text=do.call(paste,data.frame(t(p))),h=T)
as fg_a fg_m ft_a ft_m pt rb st to trey bl
1 1 20 11 14 14 40 2 2 2 4 NA
2 1 5 3 4 1 7 5 2 3 NA 1
3 1 2 0 NA NA 0 1 NA NA NA NA

Generate a Preference Matrix in R?

I'm using R to analyze an undirected network of individuals with ethnicities as attributes. I want to create a tie accounts table, or "preference matrix": a square matrix where values of ethnicity are arrayed on both dimensions, and each cell tells you how many ties correspond to that type of relationship. (From this you can calculate the probability of one group throwing ties to another group, but I just want to use it as an argument in igraph's preference.game function.) Here's what I tried:
# I create a variable for ethnicity by assigning the names of my vertices to their corresponding ethnicities
eth <- atts$Ethnicity[match(V(mahmudNet)$name,atts$Actor)]
# I create an adjacency matrix from my network data
mat <- as.matrix(get.adjacency(mahmudNet))
# I create the dimensions for my preference matrix from the Ethnicity values
eth.value <- unique(sort(eth))
# I create an empty matrix using these dimensions
eth.mat <- array(NA,dim=c(length(eth.value),length(eth.value)))
# I create a function that will populate the empty cells of the matrix
for (i in eth.value){
for (j in eth.value){
eth.mat[i,j] <- sum(mat[eth==i,eth==j])
}
}
My problem is at the end, I think. I need to figure out an expression that tells R how to populate the cells. The expression I put doesn't seem to work, but I want it so that potentially I could write
a <- sum(mat[eth=="White", eth=="Black"])
And then "a" would return the sum of all the cells in the adjacency matrix that correspond to a White-Black relationship.
Here's a sample of my data:
# data frame with Ethnicity attributes:
Actor Ethnicity
1 Sultan Mahmud of Siak 2
2 Daeng Kemboja 1
3 Raja Kecik of Trengganu 1
4 Raja Alam 2
5 Tun Dalam 2
6 Raja Haji 1
7 The Suliwatang 1
8 Punggawa Miskin 1
9 Tengku Selangor 1
10 Tengku Raja Said 1
11 Datuk Bendahara 2
12 VOC 3
13 King of Selangor 1
14 Dutch at Batavia 3
15 Punggawa Tua 2
16 Raja Tua Encik Andak 1
17 Raja Indera Bungsu 2
18 Sultan of Jambi 2
19 David Boelen 3
20 Datuk Temenggong 2
21 Punggawa Opu Nasti 1
# adjacency matrix with relations
Daeng Kemboja Punggawa Opu Nasti Raja Haji Daeng Cellak
Daeng Kemboja 0 1 1 1
Punggawa Opu Nasti 1 0 1 0
Raja Haji 1 1 0 0
Daeng Cellak 1 0 0 0
Daeng Kecik 1 0 0 0
Daeng Kecik
Daeng Kemboja 1
Punggawa Opu Nasti 0
Raja Haji 0
Daeng Cellak 0
Daeng Kecik 0
This is a simple job for table, once you have your data in the right shape.
First a sample dataset:
# fake ethnicity data by actor
actor_eth <- data.frame(actor = letters[1:10],
eth = sample(1:3, 10, replace=T))
# fake adjacency matrix
adj_mat <- matrix(rbinom(100, 1, .5), ncol=10)
dimnames(adj_mat) <- list(letters[1:10], letters[1:10])
# blank out lower triangle & diagonal,
# so random data is not asymmetric & no self-ties
adj_mat[lower.tri(adj_mat)] <- NA
diag(adj_mat) <- NA
Here's our fake adjacency matrix:
a b c d e f g h i j
a NA 1 1 1 0 0 1 1 0 1
b NA NA 0 1 0 1 0 0 1 0
c NA NA NA 1 1 0 0 1 0 0
d NA NA NA NA 1 0 0 1 1 0
e NA NA NA NA NA 0 0 1 0 1
f NA NA NA NA NA NA 1 1 0 1
g NA NA NA NA NA NA NA 1 1 0
h NA NA NA NA NA NA NA NA 0 0
i NA NA NA NA NA NA NA NA NA 1
j NA NA NA NA NA NA NA NA NA NA
Here's our fake eth table:
actor eth
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 3
7 g 3
8 h 3
9 i 1
10 j 2
So what you want to do is 1) put this in long format, so you have a bunch of rows with a source actor and a target actor, each representing a tie. Then 2) replace the actor name with ethnicity, so you have ties with source/target ethnicity. Then 3) you can just use table to make a cross tab.
# use `melt` to put this in long form, omitting rows showing "non connections"
library(reshape2)
actor_ties <- subset(melt(adj_mat), value==1)
# now replace the actor names with their ethnicities to create a data.frame
# of ties by ethnicity
eth_ties <-
data.frame(source_eth = with(actor_eth, eth[match(actor_ties$Var1, actor)]),
target_eth = with(actor_eth, eth[match(actor_ties$Var2, actor)]))
# now here's your cross tab
table(eth_ties)
Result:
target_eth
source_eth 1 2 3
1 0 2 1
2 2 0 1
3 3 5 9
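The loop from the question can also be made to work directly on a full symmetric adjacency matrix. A minimal sketch, assuming the trouble came from indexing the result by ethnicity value rather than by position (which fails when the values are not exactly 1..k, or are factors or strings); the six-actor network below is made up for illustration.

```r
# Made-up undirected network: 6 actors, 5 ties, stored symmetrically.
eth <- c(1, 1, 2, 3, 2, 1)
mat <- matrix(0, 6, 6)
edges <- rbind(c(1, 2), c(1, 3), c(2, 4), c(3, 5), c(5, 6))
mat[edges] <- 1
mat[edges[, 2:1]] <- 1

eth.value <- sort(unique(eth))
k <- length(eth.value)
eth.mat <- matrix(0, k, k, dimnames = list(eth.value, eth.value))

# Loop over positions, and look the ethnicity value up inside the loop.
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    eth.mat[i, j] <- sum(mat[eth == eth.value[i], eth == eth.value[j], drop = FALSE])
  }
}

# Equivalent matrix form: indicator matrix of group membership.
ind <- 1 * outer(eth, eth.value, "==")
eth.mat2 <- t(ind) %*% mat %*% ind

eth.mat
```

Because the matrix is symmetric, ties within a group are counted twice on the diagonal; halving the diagonal would give the number of distinct within-group ties.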

Combine similar rows across two data frames

Still getting the gist of R. I have two data frames where the rows are named with different coordinates (e.g. x_1013y_41403; see below). The coordinates form sets of five, each set makes a cross if plotted onto a grid. The center coordinate is in one data frame, and the four peripheral coordinates are in the other.
Center A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA
Peripheral A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0
To form a set, the peripheral coordinates would have the same x or y location as the center coordinate. For example, the center coordinate x_723y_6363 would be associated with x_723y_6100 and x_723y_6627 (same x location), as well as x_459y_6363 and x_987y_6363 (same y location).
I would like to combine the coordinates into their respective sets, and name the set with the center coordinate. For the case above, I would end up with two rows, where each row is the summation of a set.
A B C D E F
x_723y_6363.txt 554 5 604 1 645 8
x_749y_41403.txt 14 4 6 0 13 0
I am not sure at all how this can be done. I have thought about creating regular expressions to pick out the x and y coordinates individually and then doing a comparison across the two data frames. Any help would be greatly appreciated!
I hope someone else comes up with a better answer, as this is ugly. I would first split the .txt names into x and y values, then loop over each of the variables that is NA in center and sum all values that share an x or y value with that center. Edit: Changed the sapply to make it slightly nicer.
center <- read.table(textConnection("
A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA"),
header = TRUE)
peripheral <- read.table(textConnection("
A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0"),
header = TRUE)
xpat <- "^([^y]+).*"
ypat <- ".*(y_[0-9]+)\\.txt"
center$x <- gsub(xpat, "\\1", rownames(center))
center$y <- gsub(ypat, "\\1", rownames(center))
peripheral$x <- gsub(xpat, "\\1", rownames(peripheral))
peripheral$y <- gsub(ypat, "\\1", rownames(peripheral))
vars <- c("B", "D", "F")
center[vars] <- sapply(peripheral[vars], function(col)
apply(center, 1, function(row) sum(col[peripheral$x %in% row["x"] | peripheral$y %in% row["y"]]) )
)
R> center
A B C D E F x y
x_723y_6363.txt 554 5 604 1 645 8 x_723 y_6363
x_749y_41403.txt 14 4 6 0 13 0 x_749 y_41403
Another option:
# function to split coordinates x and y:
f <- function(DF) structure(
t(sapply(strsplit(row.names(DF), "[_y.]"), `[`, c(2,4))),
dimnames=list(NULL, c("x", "y")))
# get x and y for peripheral data:
P <- cbind(Peripheral, f(Peripheral))
# get x and y for centers, and mark ids:
C <- cbind(Center, f(Center), id=1:nrow(Center))
# matching:
Q <- merge(merge(P, C[,c("x","id")], all=TRUE), C[,c("y","id")], by="y", all=TRUE)
# prepare for union:
R <- within(Q, {id <- ifelse(is.na(id.y), id.x, id.y); id.x <- NULL; id.y <- NULL})
# join everything and aggregate:
S <- rbind(R, C)
aggregate(S[,3:8], by=list(id=S$id), FUN=sum, na.rm=TRUE)
Result:
id A B C D E F
1 1 554 5 604 1 645 8
2 2 14 4 6 0 13 0
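The matching rule (same x or same y as the center) can also be written as a membership matrix followed by a matrix product. This is a sketch under assumptions: the helper coord is hypothetical, and NA cells are treated as 0, which fits the example because each column is filled in only one of the two data frames.

```r
center <- read.table(text = "
A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA", header = TRUE)
peripheral <- read.table(text = "
A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0", header = TRUE)

# Hypothetical helper: pull the number that follows "x_" or "y_" in a row name.
coord <- function(nm, axis) {
  as.numeric(sub(paste0(".*", axis, "_([0-9]+).*"), "\\1", nm))
}

cen <- data.matrix(center);     cen[is.na(cen)] <- 0
per <- data.matrix(peripheral); per[is.na(per)] <- 0

# member[i, j] is 1 when peripheral j shares x or y with center i.
member <- 1 * (outer(coord(rownames(cen), "x"), coord(rownames(per), "x"), "==") |
               outer(coord(rownames(cen), "y"), coord(rownames(per), "y"), "=="))

combined <- cen + member %*% per
combined
```

Each row of combined keeps the center's row name and holds the center's own values plus the column sums of its matching peripheral rows.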
