I am quite new to R programming, and am having some difficulty with ANOTHER step of my project. I am not even sure at this point if I am asking the question correctly. I have a dataframe of actual and predicted values:
actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a
The issue that I am having is that I need to create a vector of mismatches between the actual value and each of the four predicted values. This should result in a single vector: c(2,1,2,4)
I am trying to use a boolean mask to sum over the TRUE values...but something is not working right. I need to do this sum for each of the four predicted values to actual value comparisons.
discordant_sums(df[,seq(1,ncol(df),2)]!=,df[,seq(2,ncol(df),2)])
Any suggestions would be greatly appreciated.
You can use apply to compare the values in the first column with the values in each of the other columns.
apply(df[-1], 2, function(x)sum(df[1]!=x))
# predicted.1 predicted.2 predicted.3 predicted.4
# 2 1 2 4
Data:
df <- read.table(text =
"actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a",
header = TRUE, stringsAsFactors = FALSE)
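As a hedged alternative sketch (not part of the answer above), vapply can do the same column-by-column comparison while guaranteeing an integer result. The data frame is rebuilt inline so the snippet runs on its own, and the name mismatches is only illustrative.

```r
# Same data as in the question, built directly so the snippet is self-contained
df <- data.frame(
  actual      = c("a", "a", "b", "b", "c", "c", "d", "d"),
  predicted.1 = c("a", "a", "b", "a", "c", "d", "d", "d"),
  predicted.2 = c("a", "a", "a", "b", "c", "c", "d", "d"),
  predicted.3 = c("a", "b", "b", "b", "c", "c", "c", "d"),
  predicted.4 = c("a", "b", "b", "c", "c", "d", "d", "a"),
  stringsAsFactors = FALSE
)

# For each predicted column, count the rows that differ from the actual column
mismatches <- vapply(df[-1], function(p) sum(p != df$actual), integer(1))

# Drop the column names to get the plain vector asked for
unname(mismatches)
# [1] 2 1 2 4
```

This assumes the columns are character vectors. If they were factors with different level sets, converting them with as.character first would probably be needed.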
We can replicate the first column to make the lengths equal between the comparison objects and do the colSums
as.vector(colSums(df[,1][row(df[-1])] != df[-1]))
#[1] 2 1 2 4
data
df <- structure(list(actual = c("a", "a", "b", "b", "c", "c", "d",
"d"), predicted.1 = c("a", "a", "b", "a", "c", "d", "d", "d"),
predicted.2 = c("a", "a", "a", "b", "c", "c", "d", "d"),
predicted.3 = c("a", "b", "b", "b", "c", "c", "c", "d"),
predicted.4 = c("a", "b", "b", "c", "c", "d", "d", "a")),
.Names = c("actual",
"predicted.1", "predicted.2", "predicted.3", "predicted.4"),
class = "data.frame", row.names = c(NA,
-8L))
From a single dataset I created two datasets by filtering on the target variable. Now I'd like to compare all the features across the two datasets using a chi-square test. The problem is that one of the two datasets is much smaller than the other, so some features have values that are not present in the smaller one, and when I try to apply the chi-square test I get this error: "all arguments must have the same length".
How can I add the missing values to the smaller dataset so that I can use the chi-square test?
Example:
I want to use chi square on the same feature in the two datasets:
chisq.test(table(df1$var1, df2$var1))
but I get the error "all arguments must have the same length" because table(df1$var1) is:
a b c d
2 5 7 18
while table(df2$var1) is:
a b c
8 1 12
So what I would like to do is add the value d to df2 with a count of 0, so that I can use the chi-square test.
The table output of df2 can be modified if we convert to factor with levels specified
table(factor(df2$var1, levels = letters[1:4]))
a b c d
8 1 12 0
However, when table is given two inputs, they must have the same length. To get around this, we can bind the datasets and then call table on the result:
library(dplyr)
table(bind_rows(df1, df2, .id = 'grp'))
var1
grp a b c d
1 2 5 7 18
2 8 1 12 0
Or in base R
table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))),
col2 = c(df1$var1, df2$var1)))
col2
col1 a b c d
1 2 5 7 18
2 8 1 12 0
data
df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d",
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame",
row.names = c(NA,
-32L))
df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
"a",
"b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))
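A minimal sketch of the remaining step, passing the combined contingency table to chisq.test, might look like the following. It assumes var1 is a character column in both data frames; the names tab and res are illustrative, and the data is rebuilt with rep so the snippet runs on its own.

```r
# Rebuild the example data with the same counts as in the question
df1 <- data.frame(var1 = rep(c("a", "b", "c", "d"), times = c(2, 5, 7, 18)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(var1 = rep(c("a", "b", "c"), times = c(8, 1, 12)),
                  stringsAsFactors = FALSE)

# One row per dataset, one column per observed value; "d" gets a 0 for df2
tab <- table(grp  = rep(1:2, times = c(nrow(df1), nrow(df2))),
             var1 = c(df1$var1, df2$var1))

# Chi-square test of independence between dataset and value.
# Some expected counts here are below 5, so R warns that the
# approximation may be inaccurate.
res <- chisq.test(tab)
res$p.value
```

Because of the small expected counts, chisq.test(tab, simulate.p.value = TRUE) or fisher.test(tab) may be worth considering instead.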
I have a dataframe like this one:
A <- c("a", "a", "a", "a")
B <- c("b", "b", "b", "b")
C <- c("c", "a", "c", "c")
D <- c("d", "b", "a", "d")
E <- c("a", "a", "b", "e")
F <- c("b", "b", "c", "f")
G <- c("c", "a", "a", "g")
df <- data.frame(A, B, C, D, E, F, G)
I need to merge all values from the columns A to G, remove duplicates, and store a resulting list in a new column. So, the final result should look like this:
Try this one
> df$new <- apply(df,1,unique)
> df
A B C D E F G new
1 a b c d a b c a, b, c, d
2 a b a b a b a a, b
3 a b c a b c a a, b, c
4 a b c d e f g a, b, c, d, e, f, g
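One caveat with the line above: apply returns a list here only because the rows have different numbers of unique values. If every row had the same number, it would return a matrix instead. A hedged variant that always yields one string per row is sketched below; value_cols is an illustrative name, and restricting to those columns also makes it safe to re-run after new has been added.

```r
df <- data.frame(
  A = c("a", "a", "a", "a"),
  B = c("b", "b", "b", "b"),
  C = c("c", "a", "c", "c"),
  D = c("d", "b", "a", "d"),
  E = c("a", "a", "b", "e"),
  F = c("b", "b", "c", "f"),
  G = c("c", "a", "a", "g"),
  stringsAsFactors = FALSE
)

# Only look at the original columns A..G, not at any column added later
value_cols <- LETTERS[1:7]

# Collapse the unique values of each row into a single comma-separated string
df$new <- apply(df[value_cols], 1, function(r) paste(unique(r), collapse = ", "))
df$new
```

If a true list column is wanted rather than a string, wrapping the result with lapply over the row indices is one option, at the cost of less readable printing.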
A possible solution:
library(tidyverse)
A <- c("a", "a", "a", "a")
B <- c("b", "b", "b", "b")
C <- c("c", "a", "c", "c")
D <- c("d", "b", "a", "d")
E <- c("a", "a", "b", "e")
F <- c("b", "b", "c", "f")
G <- c("c", "a", "a", "g")
df <- data.frame(A, B, C, D, E, F, G)
df %>%
rowwise %>%
mutate(new = c_across(everything()) %>% unique %>% str_c(collapse = ",")) %>%
ungroup
#> # A tibble: 4 × 8
#> A B C D E F G new
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a b c d a b c a,b,c,d
#> 2 a b a b a b a a,b
#> 3 a b c a b c a a,b,c
#> 4 a b c d e f g a,b,c,d,e,f,g
this is sort of a silly way of doing it, but does this address your issue?
list(unique(t(df)[,1]),
unique(t(df)[,2]),
unique(t(df)[,3]),
unique(t(df)[,4]))
So, if I have two lists, one being a "master list" without repeats, and the other being a subset with possible repeats, I would like to be able to check how many of each element are in the secondary subset list.
So if I have these lists:
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
I'd like to determine how many times each element of list a appears in list b. My ideal output would be an R table that looks like:
a b c d e f g
2 0 3 2 0 2 1
I've been trying to think through it with %in% and table()
You can use table and match - but first make the vectors factors so levels not present are included in the output:
a <- factor(c("a", "b", "c", "d", "e", "f", "g"))
b <- factor(c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c"))
table(a[match(b, a)])
a b c d e f g
2 0 3 2 0 2 1
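A slightly shorter sketch of the same idea, assuming plain character vectors, is to make only b a factor and use a as its levels. The name counts is illustrative.

```r
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")

# Using a as the levels forces a zero count for values of a that are absent from b
counts <- table(factor(b, levels = a))
counts
#
# a b c d e f g
# 2 0 3 2 0 2 1
```

Note that any element of b that is not in a becomes NA under this approach and is silently left out of the table.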
If for some reason you want a tidyverse solution, this method preserves the original data type of the lists.
library(tidyverse)
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
tibble(letters = a, count = unlist(map(a, function(x) sum(b %in% x))))
# A tibble: 7 x 2
letters count
<chr> <int>
1 a 2
2 b 0
3 c 3
4 d 2
5 e 0
6 f 2
7 g 1
I would like to sort a data frame on one of its columns, based on a vector which contains all possible elements of the column, but without duplicates. For example a table like this:
A a
B b
C b
D b
E a
F a
G c
H b
And a vector like this: c("b", "c", "a")
So that sorting the table on column 2 based on this vector would produce this table:
B b
C b
D b
H b
G c
A a
E a
F a
We can use match with order
df1[order(match(df1$v2, vec1)),]
# v1 v2
#2 B b
#3 C b
#4 D b
#8 H b
#7 G c
#1 A a
#5 E a
#6 F a
data
vec1 <- c("b", "c", "a")
df1 <- structure(list(v1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
v2 = c("a", "b", "b", "b", "a", "a", "c", "b")), .Names = c("v1",
"v2"), class = "data.frame", row.names = c(NA, -8L))
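An equivalent sketch, offered as an alternative rather than as the answer's method, converts the column to a factor whose levels follow the desired order and sorts on that. The name sorted is illustrative, and the data is rebuilt inline so the snippet runs on its own.

```r
vec1 <- c("b", "c", "a")
df1 <- data.frame(
  v1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
  v2 = c("a", "b", "b", "b", "a", "a", "c", "b"),
  stringsAsFactors = FALSE
)

# A factor sorts by its level order, so the levels act as the custom ordering.
# order() keeps ties in their original order, so rows within a group stay put.
sorted <- df1[order(factor(df1$v2, levels = vec1)), ]
sorted$v1
# [1] "B" "C" "D" "H" "G" "A" "E" "F"
```

As with match, values of v2 that are missing from vec1 become NA and are placed last by order.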
I have a seemingly simple problem but have not yet been able to find a solution that is efficient in time and resources. This is a problem in R.
My data is of format:
INPUT
col1 col2
A q
C w
B e
A r
A t
A y
C q
B w
C e
C r
B t
C y
DESIRED OUTPUT
unit1 unit2 same_col2_freq
A B 1
A C 3
B A 1
B C 2
C A 3
C B 2
That is, in the input, A occurs in col1 with q, r, t, y in col2. Of these, B shares only t, so the A-B combination has count 1.
B occurs in col1 with e, w, t in col2. Of these, C shares e and w, so the B-C combination has count 2.
.... and so on for all combinations in col1.
I have done it using a for loop but it is slow. I am picking unique elements from col1 and then, all the data is iterated for each element of col1. Then I am combining the results using rbind. This is slow and resource costly.
I am looking for an efficient method. Maybe a library, function etc. exists that I am unaware of. I tried using co-occurrence matrix but the number of elements in col1 is of order of ~10,000 and it does not solve my purpose.
Any help is greatly appreciated.
Thanks!
Use merge to join the dataframe with itself and then use aggregate to count within groups. demo:
d = data.frame(col1=c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"), col2=c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"))
dm = merge(d, d, by="col2")
dm = dm[dm[,'col1.x']!=dm[,'col1.y'],]
aggregate(col2 ~ col1.x + col1.y, data=dm, length)
# col1.x col1.y col2
# 1 B A 1
# 2 C A 3
# 3 A B 1
# 4 C B 2
# 5 A C 3
# 6 B C 2
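To cross-check these counts against the definition in the question, a small sketch using split and intersect can compute the overlap for any single pair. This is only meant to illustrate what is being counted, not to be an efficient solution for ~10,000 units; sets and shared are hypothetical helper names.

```r
d <- data.frame(
  col1 = c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"),
  col2 = c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"),
  stringsAsFactors = FALSE
)

# One character vector of col2 values per unit in col1
sets <- split(d$col2, d$col1)

# Number of col2 values that two units have in common
shared <- function(u1, u2) length(intersect(sets[[u1]], sets[[u2]]))

shared("A", "B")  # 1 (only t)
shared("A", "C")  # 3 (q, r, y)
shared("B", "C")  # 2 (e, w)
```

The merge and aggregate approach gives the same numbers for every ordered pair at once, which is why it scales better than looping over pairs like this.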
Here is a similar approach (as shown by @cogitovita), but using data.table. Convert the "data.frame" to a "data.table" using setDT, then cross join (CJ) the unique elements of "col1", grouped by "col2". Keep only the rows where the two output columns differ (V1!=V2), get the count (.N) grouped by the new columns (.(V1, V2)), and finally order the rows (order(V1,V2)).
library(data.table)
setDT(df)[,CJ(unique(col1), unique(col1)), col2][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
# V1 V2 N
#1: A B 1
#2: A C 3
#3: B A 1
#4: B C 2
#5: C A 3
#6: C B 2
data
df <- structure(list(col1 = c("A", "C", "B", "A", "A", "A", "C", "B",
"C", "C", "B", "C"), col2 = c("q", "w", "e", "r", "t", "y", "q",
"w", "e", "r", "t", "y")), .Names = c("col1", "col2"), class =
"data.frame", row.names = c(NA, -12L))