Finding the Frequency of Values Across Character Strings - r

I have three different character vectors of different lengths. Some have overlapping values, others have unique values. These values appear a different number of times in each vector. For example,
A <- c("A", "A", "B")
B <- c("A", "B", "C", "D")
C <- c("B", "A", "C", "E", "F")
I want to know:
How many unique values there are in total.
What those values are.
The frequency of each value across all lists, and I want to be able to filter it (e.g. values that appear less than or equal to two times across all lists).
Edit to clarify the above point: I want to know how many times a value comes up across all lists. For example, I want to know that the value A comes up 4 times and the value F only once.
How do I go about doing this? I can't find a stringr command to do this and I am new to working with strings.

#Unique items
> unique(A)
[1] "A" "B"
#count of unique items
> length(unique(A))
[1] 2
#frequency of each unique value
df_A <- data.frame(A =A) #data frame prepared
> dplyr::mutate(dplyr::group_by(df_A, A), freq = dplyr::n()) # n() is qualified because dplyr is not attached
# A tibble: 3 x 2
# Groups: A [2]
A freq
<chr> <int>
1 A 2
2 A 2
3 B 1
#filter
df_A <- dplyr::mutate(dplyr::group_by(df_A, A), freq = dplyr::n())
> df_A$A[df_A$freq < 2]
[1] "B"
EDIT
#unique items across all lists
> unique(c(A, B, C))
[1] "A" "B" "C" "D" "E" "F"
#Freq across all lists
tabulate(as.factor(c(A,B,C)))
[1] 4 3 2 1 1 1
#OR
> table(c(A, B, C))
A B C D E F
4 3 2 1 1 1
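The question also asks for a way to filter by frequency across all three vectors. A minimal sketch, reusing the vectors from the question; the names counts, rare and n_unique are illustrative and not part of the original answer:

```r
A <- c("A", "A", "B")
B <- c("A", "B", "C", "D")
C <- c("B", "A", "C", "E", "F")

# Frequency of every value across all three vectors
counts <- table(c(A, B, C))

# Keep only the values that appear two times or fewer
rare <- names(counts[counts <= 2])
rare
# [1] "C" "D" "E" "F"

# Number of distinct values overall
n_unique <- length(counts)
n_unique
# [1] 6
```

Because table() keeps the values as names, the same logical subsetting works for any threshold, for example counts[counts >= 3].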

You can use the following steps.
To find the unique elements:
uq <- unique(A)
To count the unique elements:
length(uq)
Optionally, recode the values as numbers with recode() from the car package:
library(car)
A1 <- recode(A, "'A' = 1; 'B' = 2")
# Frequency of each element, sorted in ascending order of frequency
tab <- sort(table(A1))
# The most frequent element(s)
names(which(tab == max(tab)))

Related

Assigning complex values to character elements of data frame in R

There are three columns in my data frame which are characters, "A", "B", and "C" (this order can vary for different data frames). I want to assign values to them: A = 1+0i, B = 2+3i and C = 3+2i. I use as.complex(factor(col1)) and the same thing for columns two and three, but it makes all three columns equal to 1+0i!
col1 <- c("A","A", "A")
col2 <- c("B", "B","B")
col3 <- c("C","C","C")
df <- data.frame(col1,col2,col3)
print(df)
A= 1+0i
B=2+3i
C=3+2i
df2<- transform(df, col1=as.complex(as.factor(col1)),col2=as.complex(as.factor(col2)),col3=as.complex(as.factor(col3)))
sapply(df2,class)
View(df2)
So this is a weird thing you're doing. You have a column of strings, letters like "A" and "B". Then you have objects with the same names, A = 1 + 0i, etc. Normally we don't treat object names as "data", but you're sort of mixing the two here. The solution I'd propose is to make everything data: combine your A, B, and C values into a vector, and give the vector names accordingly. Then we can replace the values in the data frame with the corresponding values from our named vector:
vec = c(A, B, C)
names(vec) = c("A", "B", "C")
df[] = lapply(df, \(x) vec[x])
df
# col1 col2 col3
# 1 1+0i 2+3i 3+2i
# 2 1+0i 2+3i 3+2i
# 3 1+0i 2+3i 3+2i
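The named-vector lookup can be tried on its own, outside a data frame. The sketch below uses hypothetical names (lookup, keys, keys_f) and also shows a pitfall that likely explains the 1+0i result in the question: indexing with a factor uses its integer codes, not its labels, so converting to character first is the safer habit.

```r
# Named vector acting as a lookup table
lookup <- c(A = 1+0i, B = 2+3i, C = 3+2i)

# Character keys are matched against the names
keys <- c("B", "A", "C", "A")
values <- unname(lookup[keys])
values
# [1] 2+3i 1+0i 3+2i 1+0i

# A factor is indexed by its integer codes, so convert it to character first
keys_f <- factor(c("C", "B"), levels = c("C", "B"))
from_codes  <- unname(lookup[keys_f])                # codes 1, 2 -> entries A, B
from_labels <- unname(lookup[as.character(keys_f)])  # labels "C", "B"
from_labels
# [1] 3+2i 2+3i
```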

Row name of the smallest element

I have the following dataframe:
d <- data.frame(a=c(1,2,3,4), b=c(20,19,18,17))
row.names(d) <- c("A", "B", "C", "D")
I want another data.frame, with the same columns and 2 rows, which contain the row names of the 2 smallest elements in that column.
In the example the expected result would be:
# Expected results
exp <- data.frame(a=c("A", "B"), b=c("C","D"))
We loop over the columns with lapply, order the values, use that index to subset the n corresponding row.names of 'd', and wrap with data.frame
n <- 2
data.frame(lapply(d, function(x) sort(head(row.names(d)[order(x)], n))))
-output
# a b
#1 A C
#2 B D
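To see what each step contributes, here is a small sketch on a hypothetical data frame d2 whose columns are not already sorted (d2 and its row names are made up for illustration):

```r
d2 <- data.frame(a = c(3, 1, 2), b = c(10, 30, 20),
                 row.names = c("X", "Y", "Z"))

# order() returns the row positions from smallest to largest value
ord_a <- order(d2$a)
ord_a
# [1] 2 3 1

# Indexing the row names with it lists them from smallest to largest
row.names(d2)[ord_a]
# [1] "Y" "Z" "X"

# Same pipeline as above: take the first n names per column, then sort them
n <- 2
smallest <- lapply(d2, function(x) sort(head(row.names(d2)[order(x)], n)))
data.frame(smallest)
#   a b
# 1 Y X
# 2 Z Z
```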
With R 4.1.0 or later, we can also chain the functions with the native pipe |> (applied in reading order, which is easier to follow) and use \(x) as shorthand for an anonymous function in base R
# // ordered the column values
# // get corresponding row names
lapply(d, \(x) row.names(d)[order(x)] |>
head(n) |> # // get the first n values
sort()) |> # // sort them
data.frame() # // convert the list to data.frame
# a b
#1 A C
#2 B D
Or using dplyr
library(dplyr)
d %>%
summarise(across(everything(),
~ sort(head(row.names(d)[order(.)], n))))
# a b
#1 A C
#2 B D
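Using sapply in base R (rank() gives each value's position in sorted order, so this selects the rows holding the two smallest values) -
rn <- rownames(d)
sapply(d, function(x) rn[rank(x, ties.method = "first") %in% 1:2])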
# a b
#[1,] "A" "C"
#[2,] "B" "D"

Split a column of character vectors and return a list

I have the following dataframe:
df <- data.frame(Sl.No = c(1:6),
Variable = c('a', 'a,b', 'a,b,c', 'b', 'c', 'b,c'))
Sl.No Variable
1 a
2 a,b
3 a,b,c
4 b
5 c
6 b,c
I want to separate the unique values in the Variable column into a single vector:
myList <- c("a", "b", "c")
I have tried the following code:
separator <- function(x) strsplit(x, ",")[[1]][[1]]
unique(sapply(df$Variable, separator))
This however gives me the following output:
"a"
I request some help. I have searched but seem unable to find an answer to this.
We can split the Variable column at "," and get all the values and select only the unique ones.
unique(unlist(strsplit(df$Variable, ",")))
#[1] "a" "b" "c"
If the Variable column is a factor, convert it to character before using strsplit.
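A minimal sketch of the factor case, assuming the column was read in as a factor (as data.frame() did by default before R 4.0.0); the names variable and parts are illustrative:

```r
# Column stored as a factor rather than character
variable <- factor(c("a", "a,b", "a,b,c", "b", "c", "b,c"))

# strsplit() needs character input, so convert first
parts <- strsplit(as.character(variable), ",", fixed = TRUE)

# Flatten the list of pieces and drop duplicates
result <- unique(unlist(parts))
result
# [1] "a" "b" "c"
```

Here fixed = TRUE treats the comma as a literal string rather than a regular expression, which makes no difference for "," but matters for separators such as "." or "|".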

Using row-wise column indices in a vector to extract values from data frame

Using a vector of column positional indexes such as:
> i <- c(3,1,2)
How can I use the index to extract the 3rd value from the first row of a data frame, the 1st value from the second row, the 2nd value from the third row, etc.?
For example, using the above index and:
> dframe <- data.frame(x=c("a","b","c"), y=c("d","e","f"), z=c("g","h","i"))
> dframe
x y z
1 a d g
2 b e h
3 c f i
I would like to return:
> [1] "g", "b", "f"
Just use matrix indexing, like this:
dframe[cbind(seq_along(i), i)]
# [1] "g" "b" "f"
The cbind(seq_along(i), i) part creates a two column matrix of the relevant row and column that you want to extract.
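To make the index matrix visible, here is a small sketch that prints it and applies it to a plain character matrix holding the same values (m and idx are hypothetical names):

```r
i <- c(3, 1, 2)

# Same values as dframe, stored as a character matrix (filled column by column)
m <- matrix(c("a", "b", "c", "d", "e", "f", "g", "h", "i"), nrow = 3,
            dimnames = list(NULL, c("x", "y", "z")))

# Each row of idx is one (row, column) pair to extract
idx <- cbind(seq_along(i), i)
idx
#        i
# [1,] 1 3
# [2,] 2 1
# [3,] 3 2

picked <- m[idx]
picked
# [1] "g" "b" "f"
```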
How about this:
Df <- data.frame(
x=c("a","b","c"),
y=c("d","e","f"),
z=c("g","h","i"))
##
i <- c(3,1,2)
##
index2D <- function(v = i, DF = Df){
sapply(1:length(v), function(X){
DF[X,v[X]]
})
}
##
> index2D()
[1] "g" "b" "f"

Keeping value as a factor when extracting most common factor in R

I can get the most frequent level or name of a factor in a table using table() and levels() or names() as explained here, but how can I get a factor itself?
> a <- ordered (c("a", "b", "c", "b", "c", "b", "a", "c", "c"))
> tt <- table(a)
> m = names(which.max(tt)) # what do I put here?
> is.factor(m)
[1] FALSE # I want this to be TRUE and for m to be identical to a[3]
This is just an example, of course. What I'm really trying to do is a lot of manipulation and aggregation of factors and I just want to keep the factors consistent across all the variables. I don't want them to change levels or order or drop levels because there is no data.
It's not clear exactly what you do want. If you want a factor vector of length 4 with the same levels as a:
m = a[ a %in% names(which.max(tt)) ]
For a length one vector, do the same as above and just take the first one:
m = a[ a %in% names(which.max(tt)) ][1]
m
#--------
[1] c
Levels: a < b < c
> m == a[3]
[1] TRUE
If you want a vector of the same length, then:
m <- a
is.na(m) <- ! m %in% names(which.max(tt))
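An alternative that is not part of the answer above: rebuild a length-one factor directly from the winning name, passing the original levels and ordering explicitly. This is only a sketch, but it avoids subsetting a and keeps every level even when some have no data.

```r
a <- ordered(c("a", "b", "c", "b", "c", "b", "a", "c", "c"))
tt <- table(a)

# Build the factor from the most frequent name, reusing the levels of a
m <- factor(names(which.max(tt)), levels = levels(a), ordered = is.ordered(a))

is.factor(m)
# [1] TRUE
levels(m)
# [1] "a" "b" "c"
m == a[3]
# [1] TRUE
```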
