I have a large data frame and want to create a new column named Class based on the values in a particular column.
Is it possible to do this with a loop, or in some other way?
The example data frame is as follows:
dat <- data.frame(
Function = c("A", "B", "C", "D", "E", "F", "G", "H", "I")
)
and the output should look like this:
dat <- data.frame(
Function = c("A", "C", "F", "D", "E", "I", "G", "H", "B"),
Class= c("Class1","Class1","Class1","Class2","Class2","Class2","Class3","Class3","Class3"))
Create an assignment dictionary A, using read.table for instance, and merge the data frames.
A <- read.table(text='
A Class1
B Class3
C Class1
D Class2
E Class2
F Class1
G Class3
H Class3
I Class2
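', col.names=c('Function', 'Class'))  # name the columns so merge has a shared key
merge(dat, A)  # joins on the common column, Function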
# Function Class
# 1 A Class1
# 2 B Class3
# 3 C Class1
# 4 D Class2
# 5 E Class2
# 6 F Class1
# 7 G Class3
# 8 H Class3
# 9 I Class2
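Note that merge sorts the result by the key column, so the original row order of dat is not kept. If the order matters, a lookup with match is one option. The following is only a minimal sketch; the name lookup is made up for illustration and plays the role of the dictionary A above.

```r
# Data frame in the order shown in the question's desired output
dat <- data.frame(Function = c("A", "C", "F", "D", "E", "I", "G", "H", "B"))

# Hypothetical lookup table (same content as the dictionary A above)
lookup <- data.frame(
  Function = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  Class    = c("Class1", "Class3", "Class1", "Class2", "Class2",
               "Class1", "Class3", "Class3", "Class2")
)

# match() returns, for each dat$Function, its row position in lookup,
# so the rows of dat stay in their original order
dat$Class <- lookup$Class[match(dat$Function, lookup$Function)]
dat
```

Values of Function that are missing from the lookup table would get NA in Class.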
The other way round you could write a list with classes and the assigned functions,
lst <- list(Class1=c("A", "C", "F"), Class2=c("D", "E", "I"), Class3=c("B", "G", "H"))
and Vectorize a small function so it loops over the functions as well as list elements.
class_assign <- Vectorize(\(x, lst) names(lst)[sapply(lst, \(a) any(x %in% a))], vectorize.args='x')
dat$Class <- class_assign(dat$Function, lst)
# Function Class
# 1 A Class1
# 2 C Class1
# 3 F Class1
# 4 D Class2
# 5 E Class2
# 6 I Class2
# 7 G Class3
# 8 H Class3
# 9 B Class3
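As a possible alternative to Vectorize, the same list can be flattened into a two-column lookup table with stack and then indexed with match. This is a sketch, not a drop-in replacement; lookup is an illustrative name.

```r
dat <- data.frame(Function = c("A", "C", "F", "D", "E", "I", "G", "H", "B"))
lst <- list(Class1 = c("A", "C", "F"),
            Class2 = c("D", "E", "I"),
            Class3 = c("B", "G", "H"))

# stack() turns the named list into a data frame with the columns
# 'values' (the functions) and 'ind' (the list names, as a factor)
lookup <- stack(lst)

# Look up each function's class; as.character() drops the factor
dat$Class <- as.character(lookup$ind[match(dat$Function, lookup$values)])
dat
```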
Data:
dat <- structure(list(Function = c("A", "C", "F", "D", "E", "I", "G",
"H", "B")), class = "data.frame", row.names = c(NA, -9L))
I have a dataset that looks like this:
data <- data.frame(Name1 = c("A", "B", "D", "E", "H"),
Name2 = c("B", "C", "E", "G", "I"))
I would like to add an ID column to help me trace groups of names, i.e. who references who? So with the example data, the groups would be:
Name1 Name2 GroupID
A B 1
B C 1
D E 2
E G 2
H I 3
Please note that my original data is not ordered as this example is. Thanks in advance for any help!
You can use the igraph package to make a network from your data set and determine clusters:
data <- data.frame(Name1 = c("A", "B", "D", "E", "H"),
Name2 = c("B", "C", "E", "G", "I"))
library(igraph)
graph <- graph_from_data_frame(data, directed = FALSE)
clusters <- components(graph)
#data$GroupId <- sapply(data$Name1, function(x) clusters$membership[which(names(clusters$membership) == x)])
# Simpler version
data$GroupId <- clusters$membership[data$Name1]
That gives:
> data
Name1 Name2 GroupId
1 A B 1
2 B C 1
3 D E 2
4 E G 2
5 H I 3
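If installing igraph is not an option, the grouping can also be approximated in base R with a simple label-propagation loop. This is only a sketch of a different technique, not what components does internally, and it will be slow on large data. The names nodes, label and changed are made up for this example.

```r
data <- data.frame(Name1 = c("A", "B", "D", "E", "H"),
                   Name2 = c("B", "C", "E", "G", "I"))
n1 <- as.character(data$Name1)
n2 <- as.character(data$Name2)

# Start with one distinct label per name
nodes <- unique(c(n1, n2))
label <- setNames(seq_along(nodes), nodes)

# Repeatedly give both ends of every edge the smaller of their two labels
# until a full pass changes nothing
repeat {
  changed <- FALSE
  for (i in seq_along(n1)) {
    m <- min(label[[n1[i]]], label[[n2[i]]])
    if (label[[n1[i]]] != m || label[[n2[i]]] != m) {
      label[c(n1[i], n2[i])] <- m
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Renumber the surviving labels as 1, 2, 3, ... in order of appearance
data$GroupID <- match(label[n1], unique(label))
data
```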
So, if I have two lists, one being a "master list" without repeats, and the other being a subset with possible repeats, I would like to be able to check how many of each element are in the secondary subset list.
So if I have these lists:
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
I'd like to determine how many times each element of a appears in b. My ideal output would be an R table that looks like:
a b c d e f g
2 0 3 2 0 2 1
I've been trying to think through it with %in% and table()
You can use table and match - but first make the vectors factors so levels not present are included in the output:
a <- factor(c("a", "b", "c", "d", "e", "f", "g"))
b <- factor(c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c"))
table(a[match(b, a)])
a b c d e f g
2 0 3 2 0 2 1
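A slightly shorter variant, assuming a and b are plain character vectors, is to turn only b into a factor whose levels are taken from a. The name counts is illustrative.

```r
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")

# Using a as the levels keeps zero counts for elements absent from b
counts <- table(factor(b, levels = a))
counts
```

Any element of b that does not occur in a would become NA and be left out of the counts.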
If for some reason you want a tidyverse solution, this method preserves the original data type of the vectors:
library(tidyverse)
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
tibble(letters = a, count = unlist(map(a, function(x) sum(b %in% x))))
# A tibble: 7 x 2
letters count
<chr> <int>
1 a 2
2 b 0
3 c 3
4 d 2
5 e 0
6 f 2
7 g 1
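For comparison, a similar two-column result can be sketched in base R with tabulate, without loading any package. The names res, letter and count are made up here.

```r
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")

# match() gives the position in a of every element of b;
# tabulate() counts how often each position 1..length(a) occurs
res <- data.frame(letter = a,
                  count  = tabulate(match(b, a), nbins = length(a)))
res
```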
The following code:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1,6)
)
Results in the following dataframe:
letter score
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
I want to get the scores for a sequence of letters, for example the scores of c("f", "a", "d", "e"). It should result in c(6, 1, 4, 5).
What's more, I want to get the scores for c("c", "o", "f", "f", "e", "e"). Now the o is not in the letter column so it should return NA, resulting in c(3, NA, 6, 6, 5, 5).
What is the best way to achieve this? Can I use dplyr for this?
We can use match to create an index and extract the corresponding 'score'. If there is no match, it returns NA by default.
df$score[match(v1, df$letter)]
#[1] 3 NA 6 6 5 5
df$score[match(v2, df$letter)]
#[1] 6 1 4 5
data
v1 <- c("c", "o", "f", "f", "e", "e")
v2 <- c("f", "a", "d", "e")
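The same lookup can also be written with a named vector, which some people find easier to read. This is just a sketch; the name scores is not from the question.

```r
df <- data.frame(letter = c("a", "b", "c", "d", "e", "f"),
                 score  = 1:6)
v1 <- c("c", "o", "f", "f", "e", "e")
v2 <- c("f", "a", "d", "e")

# Named vector: the names are the letters, the values are the scores
scores <- setNames(df$score, df$letter)

# Indexing by name returns NA for names that are not present ("o")
res1 <- unname(scores[v1])
res2 <- unname(scores[v2])
res1
res2
```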
If you want to use dplyr I would use a join:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1, 6)
)
library(dplyr)
df2 <- data.frame(letter = c("c", "o", "f", "f", "e", "e"))
left_join(df2, df, by = "letter")
letter score
1 c 3
2 o NA
3 f 6
4 f 6
5 e 5
6 e 5
I would like to sort a data frame on one of its columns, based on a vector which contains all possible elements of the column, but without duplicates. For example a table like this:
A a
B b
C b
D b
E a
F a
G c
H b
And a vector like this: c("b", "c", "a")
So that sorting the table on column 2 based on this vector would produce this table:
B b
C b
D b
H b
G c
A a
E a
F a
We can use match with order:
df1[order(match(df1$v2, vec1)),]
# v1 v2
#2 B b
#3 C b
#4 D b
#8 H b
#7 G c
#1 A a
#5 E a
#6 F a
data
vec1 <- c("b", "c", "a")
df1 <- structure(list(v1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
v2 = c("a", "b", "b", "b", "a", "a", "c", "b")), .Names = c("v1",
"v2"), class = "data.frame", row.names = c(NA, -8L))
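Another way to express the same idea, sketched here as an alternative, is to convert the column to a factor whose levels follow the desired order and sort on that. The name sorted is illustrative.

```r
vec1 <- c("b", "c", "a")
df1 <- data.frame(v1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
                  v2 = c("a", "b", "b", "b", "a", "a", "c", "b"))

# A factor sorts by the position of its levels, so levels = vec1
# imposes the custom order; ties keep their original row order
sorted <- df1[order(factor(df1$v2, levels = vec1)), ]
sorted
```

Values of v2 that are not in vec1 would become NA and be placed at the end.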
I am quite new to R programming, and am having some difficulty with ANOTHER step of my project. I am not even sure at this point if I am asking the question correctly. I have a dataframe of actual and predicted values:
actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a
The issue that I am having is that I need to create a vector counting the mismatches between the actual value and each of the four predicted values. This should result in a single vector: c(2,1,2,4)
I am trying to use a boolean mask to sum over the TRUE values...but something is not working right. I need to do this sum for each of the four predicted values to actual value comparisons.
discordant_sums(df[,seq(1,ncol(df),2)]!=,df[,seq(2,ncol(df),2)])
Any suggestions would be greatly appreciated.
You can use apply to compare the values in the first column with the values in each of the other columns.
apply(df[-1], 2, function(x)sum(df[1]!=x))
# predicted.1 predicted.2 predicted.3 predicted.4
# 2 1 2 4
Data:
df <- read.table(text =
"actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a",
header = TRUE, stringsAsFactors = FALSE)
We can replicate the first column so that both sides of the comparison have the same length, and then take the colSums:
as.vector(colSums(df[,1][row(df[-1])] != df[-1]))
#[1] 2 1 2 4
data
df <- structure(list(actual = c("a", "a", "b", "b", "c", "c", "d",
"d"), predicted.1 = c("a", "a", "b", "a", "c", "d", "d", "d"),
predicted.2 = c("a", "a", "a", "b", "c", "c", "d", "d"),
predicted.3 = c("a", "b", "b", "b", "c", "c", "c", "d"),
predicted.4 = c("a", "b", "b", "c", "c", "d", "d", "a")),
.Names = c("actual",
"predicted.1", "predicted.2", "predicted.3", "predicted.4"),
class = "data.frame", row.names = c(NA,
-8L))
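The same counts can also be obtained with sapply over the prediction columns, which avoids the matrix conversion that apply performs. This is a minimal sketch; mismatches is an illustrative name.

```r
df <- data.frame(
  actual      = c("a", "a", "b", "b", "c", "c", "d", "d"),
  predicted.1 = c("a", "a", "b", "a", "c", "d", "d", "d"),
  predicted.2 = c("a", "a", "a", "b", "c", "c", "d", "d"),
  predicted.3 = c("a", "b", "b", "b", "c", "c", "c", "d"),
  predicted.4 = c("a", "b", "b", "c", "c", "d", "d", "a")
)

# For each prediction column, count the rows that differ from 'actual'
mismatches <- sapply(df[-1], function(p) sum(p != df$actual))
mismatches
```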