I have a vector of actual values (a factor variable): a a a b b c c b c b c c ...
and a vector of predicted values: a b b a ...
I create a confusion matrix with t <- table(actual, predict).
If I print t, it looks like this:
a b c
a 4 3 1
b 1 5 2
c 3 1 8
However, I want to print it as
b c a
b 5 2 1
c 1 8 3
a 3 1 4
(i.e., I want to change the order of the rows and columns but keep it as a confusion matrix.)
How can I do that in R?
We could convert the vectors to factors with the levels specified in the desired order:
actual <- factor(actual, levels = c('b', 'c', 'a'))
predict <- factor(predict, levels = c('b', 'c', 'a'))
table(actual, predict)
# predict
#actual b c a
# b 0 1 4
# c 3 2 2
# a 2 3 3
Or we can use row/column indexing
table(actual, predict)[c('b','c','a'), c('b', 'c', 'a')]
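Note that the indexing version assumes every level actually occurs in both vectors; if a level is missing from predict, the table will not have that column and the [ subsetting fails with "subscript out of bounds". Building the table from factors with explicit levels (as above) guarantees all rows and columns exist before reordering. A small sketch combining the two ideas:
lvls <- c('b', 'c', 'a')
table(factor(actual, levels = lvls), factor(predict, levels = lvls))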
data
set.seed(24)
actual <- sample(letters[1:3], 20, replace=TRUE)
predict <- sample(letters[1:3], 20, replace=TRUE)
I have a large data frame that looks like this:
df
X1 X2
1 A B
2 A C
And another that looks like this:
df2
Type Group
1 Train A
2 Boat B
3 Car A
4 Hangar C
I want to insert df2 into df and copy the entire row every time I insert, so I end up with this:
X1 X2 X3
1 A B Train
2 A B Car
3 A B Boat
4 A C Train
5 A C Car
6 A C Hangar
What is the best way to do this in R? I can't figure this out.
I am not sure if I understand your aim correctly, but below is my base R attempt:
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      1:nrow(df2),
      function(k) {
        # rows of df whose X1 or X2 contains this Group value, plus the matching Type
        cbind(
          df[which(df == df2$Group[k], arr.ind = TRUE)[, "row"], ],
          X3 = df2$Type[k]
        )
      }
    )
  )
)
which gives:
X1 X2 X3
1 A B Train
2 A C Train
3 A B Boat
4 A B Car
5 A C Car
6 A C Hangar
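For comparison, here is an alternative base R sketch that avoids comparing a whole data frame against a single value: stack X1 and X2 into one long key column, merge that against df2, and keep the unique rows (the names long, hits and out are only illustrative, and the row order may differ from the output above):
# one row per (df row, key value) pair, so each Group can match either column
long <- data.frame(row = rep(seq_len(nrow(df)), 2),
                   Group = c(as.character(df$X1), as.character(df$X2)))
hits <- merge(long, df2, by = "Group")
out  <- unique(cbind(df[hits$row, ], X3 = hits$Type))
out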
I ran across a behaviour of merge() in R that I can't understand. It seems that it either merges or rbinds data frames depending on whether a column has one or more unique values in it.
a1 <- data.frame(A = c(1, 1))
a2 <- data.frame(A = c(1, 2))
# > merge (a1, a1)
# A
# 1 1
# 2 1
# 3 1
# 4 1
# > merge (a2, a2)
# A
# 1 1
# 2 2
The latter is the result that I would expect, and want, in both cases. I also tried having more than one column, as well as characters instead of numerals, and the results are the same: multiple values result in merging, one unique value results in rbinding.
In the first case each row matches two rows, so there are 2 x 2 = 4 rows in the output; in the second case each row matches exactly one row, so there are 2 rows in the output.
To match on row number use this:
merge(a1, a1, by = 0)
## Row.names A.x A.y
## 1 1 1 1
## 2 2 1 1
or match on row number and only return the left instance:
library(sqldf)
sqldf("select x.* from a1 x left join a1 y on x.rowid = y.rowid")
## A
## 1 1
## 2 1
or match on row number and return both instances:
sqldf("select x.A A1, y.A A2 from a1 x left join a1 y on x.rowid = y.rowid")
## A1 A2
## 1 1 1
## 2 1 1
The behaviour is detailed in the documentation, but basically merge() will, by default, give you a data frame with columns taken from both of the original data frames, merging rows of the two by the values of all common columns.
df1 <- data.frame(a = 1:3, b = letters[1:3])
df2 <- data.frame(a = 1:5, c = LETTERS[1:5])
df1
a b
1 1 a
2 2 b
3 3 c
df2
a c
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
merge(df1, df2)
a b c
1 1 a A
2 2 b B
3 3 c C
What's happening in your first example is that merge() wants to combine the rows of your two data frames by the A column, but because all the rows in both data frames hold the same value, it can't tell which row should be merged with which, so it creates all possible combinations.
In your second example you don't have this problem, so merging is unambiguous: the 1 rows get merged together, as do the 2 rows.
The scenarios are more apparent when you have multiple columns in your dfs:
Case 1:
> df1 <- data.frame(a = c(1, 1), b = letters[1:2])
> df2 <- data.frame(a = c(1, 1), c = LETTERS[1:2])
> df1
a b
1 1 a
2 1 b
> df2
a c
1 1 A
2 1 B
> merge(df1, df2)
a b c
1 1 a A
2 1 a B
3 1 b A
4 1 b B
Case 2:
> df1 <- data.frame(a = c(1, 2), b = letters[1:2])
> df2 <- data.frame(a = c(1, 2), c = LETTERS[1:2])
> df1
a b
1 1 a
2 2 b
> df2
a c
1 1 A
2 2 B
> merge(df1, df2)
a b c
1 1 a A
2 2 b B
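If the result you actually want is simply each row matched back to itself, one option (similar in spirit to the by = 0 approach in the other answer) is to add an explicit id column so that every row matches exactly one row; a small sketch:
# each id value occurs once in each data frame, so the merge is one-to-one
a1$id <- seq_len(nrow(a1))
merge(a1, a1, by = "id")
#   id A.x A.y
# 1  1   1   1
# 2  2   1   1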
I have a function that works when applied to the data frames in the global environment, but I'm trying to get it to apply to a list. This question refers back to my previous question here. The function extracts information from the data frame names in the global environment and makes a new column based on that info; I would like it to work on a list of data frames instead. Here's some mock data and the function:
pend4P_17k <- data.frame(x = c(1, 2, 3, 4, 5),
                         var1 = c('a', 'b', 'c', 'd', 'e'),
                         var2 = c(1, 1, 0, 0, 1))
pend5P_17k <- data.frame(x = c(1, 2, 3, 4, 5),
                         var1 = c('a', 'b', 'c', 'd', 'e'),
                         var2 = c(1, 1, 0, 0, 1))
pend10P_17k <- data.frame(x = c(1, 2, 3, 4, 5),
                          var1 = c('a', 'b', 'c', 'd', 'e'),
                          var2 = c(1, 1, 0, 0, 1))
list_pend <- list(pend4P_17k = pend4P_17k, pend5P_17k = pend5P_17k, pend10P_17k = pend10P_17k)
add_name_cols <- function(df){
  my_global <- ls(envir = globalenv())
  for(i in my_global)
    if(class(get(i)) == "data.frame" & grepl("pend", i))
    {
      df <- get(i)
      df$Pendant_ID <- gsub("^pend(.{2,3})_.*$", "\\1", i)
      assign(i, df, envir = globalenv())
    }
  return(df)
}
list_pend <- lapply(list_pend, add_name_cols)
It applies the function to the list, but every data frame gets the same Pendant_ID column, when it should match the ID in the data frame's name (i.e. the pend4P_17k data frame should have a Pendant_ID column equal to "4P").
Using R version 3.5.1, Mac OS X 10.13.6
You can modify your function so that it runs on a list as opposed to an environment:
list_pend <- list(pend4P_17k=pend4P_17k, pend5P_17k=pend5P_17k, pend10P_17k=pend10P_17k)
add_name_cols <- function(l){
  for(i in seq_along(l)){
    l[[i]]$Pendant_ID <- gsub("^pend(.{2,3})_.*$", "\\1", names(l)[i])
  }
  return(l)
}
list_pend <- add_name_cols(list_pend)
Output
> add_name_cols(list_pend)
$pend4P_17k
x var1 var2 Pendant_ID
1 1 a 1 4P
2 2 b 1 4P
3 3 c 0 4P
4 4 d 0 4P
5 5 e 1 4P
$pend5P_17k
x var1 var2 Pendant_ID
1 1 a 1 5P
2 2 b 1 5P
3 3 c 0 5P
4 4 d 0 5P
5 5 e 1 5P
$pend10P_17k
x var1 var2 Pendant_ID
1 1 a 1 10P
2 2 b 1 10P
3 3 c 0 10P
4 4 d 0 10P
5 5 e 1 10P
A few things:
In an if statement, use &&, not &. (Rationale: & is vectorised and can return a logical vector of any length, whereas if() requires a condition of length exactly 1; && also short-circuits, which is often nice to have.)
Don't use == when looking at an object's class: many objects return a vector of length 2 or more from class(). It's usually better to use inherits() (or one of the is.* functions, such as is.data.frame()); see the short illustration after this list.
lapply doesn't pass the name of a list element, just its value, so we'll use Map instead.
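A quick illustration of the class() point (just a sketch using a plain matrix, whose class vector has two elements on R >= 4.0):
# class(x) == "something" returns a logical vector as long as class(x),
# which if() cannot handle; inherits() always returns a single TRUE/FALSE.
class(matrix(1:4, 2))               # "matrix" "array"
inherits(matrix(1:4, 2), "matrix")  # TRUE
With that in mind, the function can be rewritten to take both the data frame and its name: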
add_name_cols <- function(df, nm) {
  if (inherits(df, "data.frame") && grepl("pend", nm)) {
    df$Pendant_ID <- gsub("^pend(.{2,3})_.*$", "\\1", nm)
  }
  df
}
Map(add_name_cols, list_pend, names(list_pend))
# $pend4P_17k
# x var1 var2 Pendant_ID
# 1 1 a 1 4P
# 2 2 b 1 4P
# 3 3 c 0 4P
# 4 4 d 0 4P
# 5 5 e 1 4P
# $pend5P_17k
# x var1 var2 Pendant_ID
# 1 1 a 1 5P
# 2 2 b 1 5P
# 3 3 c 0 5P
# 4 4 d 0 5P
# 5 5 e 1 5P
# $pend10P_17k
# x var1 var2 Pendant_ID
# 1 1 a 1 10P
# 2 2 b 1 10P
# 3 3 c 0 10P
# 4 4 d 0 10P
# 5 5 e 1 10P
If you have purrr installed (part of the tidyverse), you can also use
purrr::imap(list_pend, add_name_cols)
I have a matrix. The entries of the matrix are counts for the combination of the dimension levels. For example:
(m0 <- matrix(1:4, nrow=2, dimnames=list(c("A","B"),c("A","B"))))
A B
A 1 3
B 2 4
I can change it to a long format:
library("reshape")
(m1 <- melt(m0))
X1 X2 value
1 A A 1
2 B A 2
3 A B 3
4 B B 4
But I would like to have multiple entries according to value:
m2 <- m1
for (i in 1:nrow(m1)) {
  j <- m1[i, "value"]
  k <- 2
  while (k <= j) {
    m2 <- rbind(m2, m1[i, ])
    k <- k + 1
  }
}
> m2 <- subset(m2, select = -value)
> m2[order(m2$X1), ]
X1 X2
1 A A
3 A B
31 A B
32 A B
2 B A
4 B B
21 B A
41 B B
42 B B
43 B B
Is there a parameter in melt that replicates the entries according to value? Or is there another library that can do this?
We could do this with base R. We convert the dimnames of 'm0' to a 'data.frame' with two columns using expand.grid, then replicate each row according to the corresponding count in 'm0', order the rows, and reset the row names to NULL (if necessary).
d1 <- expand.grid(dimnames(m0))
d2 <- d1[rep(1:nrow(d1), c(m0)),]
res <- d2[order(d2$Var1),]
row.names(res) <- NULL
res
# Var1 Var2
#1 A A
#2 A B
#3 A B
#4 A B
#5 B A
#6 B A
#7 B B
#8 B B
#9 B B
#10 B B
Or with melt, we convert the 'm0' to 'long' format and then replicate the rows as before.
library(reshape2)
dM <- melt(m0)
dM[rep(1:nrow(dM), dM$value),1:2]
As #Frank mentioned, we can also use table with as.data.frame to create 'dM'
dM <- as.data.frame(as.table(m0))
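Note that as.data.frame(as.table(m0)) names the count column Freq rather than value, so the replication step becomes:
dM <- as.data.frame(as.table(m0))
dM[rep(1:nrow(dM), dM$Freq), 1:2]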
I have 8 columns of variables, and I must keep columns 1 to 3. For columns 4 to 8 I need to keep only those with 3 levels and drop the ones that do not meet that condition.
I tried the following command
data3 <- data2[, sapply(data2, function(col) length(unique(col))) == 3]
It managed to retain the variables with 3 levels, but deleted my first 3 columns.
You could do it as a two-step process:
data4 <- data2[1:3]
# Your answer for the second part here:
data3 <- data2[, sapply(data2, function(col) length(unique(col))) == 3]
merge(data3, data4)
Depending on what you would like your expected output to be, you could also try the option all = TRUE inside merge().
I would suggest another approach:
x = 1:3
cbind(data2[x], Filter(function(i) length(unique(i))==3, data2[-x]))
# 1 2 3 5
#1 a 1 3 b
#2 b 2 4 b
#3 c 3 5 b
#4 d 4 6 a
#5 e 5 7 c
#6 f 6 8 c
#7 g 7 9 c
#8 h 8 10 a
#9 i 9 11 c
#10 j 10 12 b
Data:
data2 = setNames(
  data.frame(letters[1:10],
             1:10,
             3:12,
             sample(letters[1:10], 10, replace = TRUE),
             sample(letters[1:3], 10, replace = TRUE)),
  1:5)
Assuming that the columns 4:8 are of class factor, we can also use nlevels to filter the columns. We create 'toKeep' as the numeric index of the columns to keep and 'toFilter' as the numeric index of the columns to filter. We then subset the dataset in two parts: 1) using 'toKeep' as the index (data2[toKeep]); 2) using 'toFilter', we loop over the remaining columns with sapply to get the number of levels (nlevels), build a logical index (== 3) to select the columns, and cbind the result with the first subset.
toKeep <- 1:3
toFilter <- setdiff(seq_len(ncol(data2)), toKeep)
cbind(data2[toKeep], data2[toFilter][sapply(data2[toFilter], nlevels) == 3])
# V1 V2 V3 V4 V6
#1 B B D C B
#2 B D D A B
#3 D E B A B
#4 C B E C A
#5 D D A D E
#6 E B A A B
data
set.seed(24)
data2 <- as.data.frame(matrix(sample(LETTERS[1:5], 8*6, replace=TRUE), ncol=8))
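Note that on R >= 4.0.0, data.frame() and as.data.frame() no longer convert character columns to factors by default, so the nlevels() filter above would see 0 levels in every column. A sketch of the same data with explicit factor columns on a current R:
set.seed(24)
data2 <- as.data.frame(matrix(sample(LETTERS[1:5], 8*6, replace = TRUE), ncol = 8),
                       stringsAsFactors = TRUE)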