Unique characters from a column of concatenated strings


I have a data.frame with a string column 'city' which consists of concatenated letters separated by ;
dt = data.frame(id = letters[1:6],
city = c("A;B","B;D","A;D;G","A;C","F;G","C;D"))
dt
# id city
# 1 a A;B
# 2 b B;D
# 3 c A;D;G
# 4 d A;C
# 5 e F;G
# 6 f C;D
I hope to get the unique individual letters from the 'city' column:
city = c("A","B","C","D","F","G")
How to do this?

A cleaner solution would be:
dt= data.frame(id=letters[1:6],city = c("A;B","B;D","A;D;G","A;C","F;G","C;D"))
city=strsplit(as.character(dt$city), ";")
city=sort(unique(unlist(city)))
[1] "A" "B" "C" "D" "F" "G"

The data:
dt= data.frame(id=letters[1:6],city = c("A;B","B;D","A;D;G","A;C","F;G","C;D"))
> dt
id city
1 a A;B
2 b B;D
3 c A;D;G
4 d A;C
5 e F;G
6 f C;D
Split the column city, using as.character to convert to strings:
city <- unlist(strsplit(as.character(dt$city), ";", fixed = TRUE))
> city
[1] "A" "B" "B" "D" "A" "D" "G" "A" "C" "F" "G" "C" "D"
Now use unique and order to get the output:
city <- unique(city)
> city
[1] "A" "B" "D" "G" "C" "F"
city <- city[order(city)]
> city
[1] "A" "B" "C" "D" "F" "G"
> dput(city)
c("A", "B", "C", "D", "F", "G")
Edit: Updated with the OP's new data.
Edit 2: Updated to omit the sapply, since strsplit is vectorized. Thanks @Cris!
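The split, unlist, unique and sort steps above can be wrapped in one small function. This is only a sketch: the name unique_tokens and its sep argument are made up for illustration, and trimming whitespace and dropping empty or NA tokens are extra assumptions that the original answers do not make.

```r
# Hypothetical helper (not from the original answers): split a delimited
# column and return the sorted unique tokens.
unique_tokens <- function(x, sep = ";") {
  # as.character() guards against factor columns (the default before R 4.0)
  parts <- strsplit(as.character(x), sep, fixed = TRUE)
  # Assumption: surrounding whitespace is not meaningful
  tokens <- trimws(unlist(parts))
  # Assumption: NA and empty tokens should be dropped
  tokens <- tokens[!is.na(tokens) & nzchar(tokens)]
  sort(unique(tokens))
}

dt <- data.frame(id = letters[1:6],
                 city = c("A;B", "B;D", "A;D;G", "A;C", "F;G", "C;D"))
city <- unique_tokens(dt$city)
print(city)
```

Passing fixed = TRUE makes strsplit treat the separator as a literal string rather than a regular expression, which matters for separators such as "." or "|".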

Related

merging factor variables of equal length but different levels ignoring NA

I have survey data from various sources. Most are factor variables with different levels. After merging, there are several variables of the same length; each contains information in some rows and NA in the others. In the merged variable, each row should hold the one available value, disregarding the NAs and keeping the same length.
I have tried the forcats package, as it contains functions to manipulate differing factor levels, but I have not found a solution that removes the NAs while merging the different factors with their corresponding levels.
v1 <- as.factor(c("a","b","c","x","x",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
v2<- as.factor(c(NA,NA,NA,NA,NA,"c","c","c","b","a",NA,NA,NA,NA,NA))
v3<- as.factor(c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"f","c","c","b","a"))
df<- data.frame(v1,v2,v3)
A merged variable should look like a factor that contains:
("a","b","c","x","x","c","c","c","b","a","f","c","c","b","a")

library(magrittr)
lapply(df, function(x){
x[!is.na(x)] %>%
t %>%
as.character
}) %>%
unlist %>%
as.factor %>%
`names<-`(NULL)
[1] a b c x x c c c b a f c c b a
Levels: a b c f x

library(tidyverse)
map(df, ~na.omit(.x)) %>% unlist %>% unname
[1] a b c x x c c c b a f c c b a
Levels: a b c x f

In base R, we can use unlist and then Filter to omit NA values.
Filter(function(x) !is.na(x) , unlist(df, use.names = FALSE))
#[1] a b c x x c c c b a f c c b a
#Levels: a b c x f

We can use coalesce
library(dplyr)
df %>%
transmute(v = coalesce(!!! .)) %>%
pull(v)
#[1] "a" "b" "c" "x" "x" "c" "c" "c" "b" "a" "f" "c" "c" "b" "a"
Or more compactly
library(purrr)
reduce(df, coalesce)
#[1] "a" "b" "c" "x" "x" "c" "c" "c" "b" "a" "f" "c" "c" "b" "a"
Or in base R
do.call(pmin, c(lapply(df, as.character), na.rm = TRUE))
#[1] "a" "b" "c" "x" "x" "c" "c" "c" "b" "a" "f" "c" "c" "b" "a"
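The unlist-based answers above rely on the non-NA blocks appearing in column order, while coalesce and pmin work row by row. A base R sketch of the row-wise idea that also returns a factor might look like this; first_non_na is a hypothetical helper name, and taking the first non-NA value per row is an assumption about what should happen if a row ever has more than one value.

```r
v1 <- as.factor(c("a","b","c","x","x",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
v2 <- as.factor(c(NA,NA,NA,NA,NA,"c","c","c","b","a",NA,NA,NA,NA,NA))
v3 <- as.factor(c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"f","c","c","b","a"))
df <- data.frame(v1, v2, v3)

# Convert to a character matrix so the differing level sets do not matter
m <- sapply(df, as.character)

# Hypothetical helper: first non-NA value in a row, or NA if the row is empty
first_non_na <- function(row) {
  hit <- row[!is.na(row)]
  if (length(hit) > 0) hit[1] else NA_character_
}

# Rebuild a factor; levels are the sorted unique values actually present
merged <- factor(unname(apply(m, 1, first_non_na)))
print(merged)
```

Unlike the pmin version, which picks the alphabetically smallest value when a row has several, this keeps the value from the leftmost column.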

Append vector not giving names

In RStudio, I am looking to create a vector of country names. They are in column 1 of my data set. countryvec gives the factor names
"Australia Australia ..."
x just gives the names of Russia, country 36, country ends up being
1,1,...,2,2,...,4,4.. etc.
They are also not in order, 3 ends up between 42 and 43. How do I make the numbers the factors?
gdppc=read.xlsx("H:/dissertation/ALL/YAS.xlsx",sheetIndex = 1,startRow = 1)
countryvec=gdppc[,1]
country=c()
for (j in 1:43){
x=rep(countryvec[j],25)
country=append(country,x)
}

You need to retrieve the levels attribute
set.seed(7)
v <- factor(letters[rbinom(20, 10, .5)])
> c(v)
[1] 6 4 2 2 3 5 3 6 2 4 2 3 5 2 4 2 4 1 6 3
> levels(v)[v]
[1] "h" "e" "c" "c" "d" "f" "d" "h" "c" "e" "c" "d" "f" "c" "e" "c" "e" "a" "h" "d"
You'll probably need to modify the code inside the loop to:
x <- rep(levels(countryvec)[countryvec][j], 25)
Or convert the vector prior to the loop:
countryvec <- levels(countryvec)[countryvec]
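As a sketch of how the whole loop could be avoided: as.character() gives the same labels as levels(countryvec)[countryvec], and rep() with each repeats every element in place. The three-country factor below is stand-in data, and each = 2 is used instead of the question's 25 only to keep the output short.

```r
# Stand-in for the first column of the spreadsheet (hypothetical data)
countryvec <- factor(c("Russia", "Australia", "Brazil"))

# as.character() recovers the labels, same as levels(countryvec)[countryvec]
labels <- as.character(countryvec)
stopifnot(identical(labels, levels(countryvec)[countryvec]))

# rep(..., each = n) repeats every element n times in order, so no loop
country <- rep(labels, each = 2)
print(country)
```

Because the result is a character vector, the original row order is kept and no integer codes appear.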

how to deal with missing value in if else statement?

I have a dataframe, mydata, constructed as follows:
col1<-c(8.20e+07, 1.75e+08, NA, 4.80e+07,
3.40e+07, NA, 5.60e+07, 3.00e+06 )
col2<-c(1960,1960,1965,1986,1960
,1969,1960,1993)
col3<-c ( NA,2.190,NA,NA, 5.000, NA,
1.700,4.220)
mydata<-data.frame(col1,col2,col3)
mydata
# col1 col2 col3
# 1 8.20e+07 1960 NA
# 2 1.75e+08 1960 2.19
# 3 NA 1965 NA
# 4 4.80e+07 1986 NA
# 5 3.40e+07 1960 5.00
# 6 NA 1969 NA
# 7 5.60e+07 1960 1.70
# 8 3.00e+06 1993 4.22
I want to create a col4 that has the values "a", "b" and "c",
if col1 is smaller than 4.00e+07, then col4=="a"; if col1 is not less than 4.00e+07, then col4=="b", else col4=="c"
Here is my code:
col4 <-ifelse(col1<4.00e+07, "a",
ifelse(col1 >=4.00e+07, "b",
ifelse(is.na(col1 =4.00e+07), "b", "c" )))
but this evaluates to:
# [1] "b" "b" NA "b" "a" NA "b" "a"
It doesn't change the NA value in col1 as "c".
The outcome should be:
# [1] "b" "b" "c" "b" "a" "c" "b" "a"
What is the problem in my code? Any suggestion would be appreciated!

You have to check is.na first, because NA < 4.00e+07 results in NA. If the first argument of ifelse() is NA, the result will be NA as well:
ifelse(c(NA, TRUE, FALSE), "T", "F")
## [1] NA "T" "F"
As you can see, for the first vector element the result is indeed NA. Even if the other arguments of ifelse() have special code that would take care of this situation, it won't help because that code is never taken into account.
For your example, checking for NA first gives you the desired result:
col4 <- ifelse(is.na(col1), "c",
ifelse(col1 < 4.00e+07, "a","b"))
col4
## [1] "b" "b" "c" "b" "a" "c" "b" "a"
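An alternative sketch of the same idea without nesting ifelse(): fill the result with the label for missing values first, then overwrite the positions where col1 is known. Combining each condition with !is.na(col1) turns the NA comparisons into FALSE, so the index never contains NA.

```r
col1 <- c(8.20e+07, 1.75e+08, NA, 4.80e+07,
          3.40e+07, NA, 5.60e+07, 3.00e+06)

# A comparison with NA is itself NA, not FALSE
print(NA < 4.00e+07)

# Start with the label for missing values everywhere
col4 <- rep("c", length(col1))

# FALSE & NA is FALSE, so NA rows are simply skipped here
col4[!is.na(col1) & col1 < 4.00e+07] <- "a"
col4[!is.na(col1) & col1 >= 4.00e+07] <- "b"
print(col4)
```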

This can be also done with cut
v1 <- with(mydata, as.character(cut(col1,
breaks=c(-Inf, 4.00e+07, Inf), labels=c("a", "b"))))
v1[is.na(v1)] <- "c"
v1
#[1] "b" "b" "c" "b" "a" "c" "b" "a"

Matching values from two vectors in R

I have two vectors:
A <- c(1,3,5,6,4,3,2,3,3,3,3,3,4,6,7,7,5,4,4,3) # 7 unique values
B <- c("a","b","c","d","e","f","g") # 7 different values
I would like to match the values of B to A such that the smallest value in A gets the first value from B and continued on to the largest.
The above example would be:
A: 1 3 5 6 4 3 2 3 3 3 3 3 4 6 7 7 5 4 4 3
assigned: a c e f d c b c c c c c d f g g e d d c

Try this:
A <- c(1,3,5,6,4,3,2,3,3,3,3,3,4,6,7,7,5,4,4,3)
B <- letters[1:7]
B[match(A, sort(unique(A)))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d" "f" "g"
# [16] "g" "e" "d" "d" "c"
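To see why this works even when the values in A are not 1, 2, 3, ..., here is a small sketch with made-up data. sort(unique(A)) lists the distinct values in increasing order, and match() returns each element's position in that list, which is its rank among the distinct values.

```r
# Made-up data: four distinct values with gaps between them
A <- c(10, 30, 50, 30, 20)
B <- c("a", "b", "c", "d")

# Distinct values in increasing order: 10 20 30 50
distinct <- sort(unique(A))

# Position of each element of A in that ordered list: 1 3 4 3 2
ranks <- match(A, distinct)

# Use the ranks as an index into B
assigned <- B[ranks]
print(assigned)
```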

Another option that handles the general case that @JoshO'Brien addresses would be
B[as.numeric(factor(A))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d"
# [14] "f" "g" "g" "e" "d" "d" "c"
A2<-ifelse(A > 4, A + 1, A)
# [1] 1 3 6 7 4 3 2 3 3 3 3 3 4 7 8 8 6 4 4 3
B[as.numeric(factor(A2))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d"
# [14] "f" "g" "g" "e" "d" "d" "c"
However, the following benchmark shows that this method is slower than @JoshO'Brien's.
library(microbenchmark)
B <- make.unique(rep(letters, length.out=1000))
A <- sample(seq_along(B), replace=TRUE)
unique_sort_match <- function() B[match(A, sort(unique(A)))]
factor_as.numeric <- function() B[as.numeric(factor(A))]
bm<-microbenchmark(unique_sort_match(), factor_as.numeric(), times=1000L)
plot(bm)

To elaborate on the comments in @Josh's answer:
If A does in fact represent a permutation of the elements of B (i.e., a 1 in A represents the first element of B, a 4 in A represents the 4th element of B, etc.), then, as @Matthew Plourde points out, you would want to simply use A as your index to B:
B[A]
If A does not represent a permutation of B, then you should use the method suggested by @Josh.

How to extract unique levels from 2 columns in a data frame in r

I have the data.frame
df<-data.frame("Site.1" = c("A", "B", "C"),
"Site.2" = c("D", "B", "B"),
"Tsim" = c(2, 4, 7),
"Jaccard" = c(5, 7, 1))
# Site.1 Site.2 Tsim Jaccard
# 1 A D 2 5
# 2 B B 4 7
# 3 C B 7 1
I can get the unique levels for each column using
top.x<-unique(df[1:2,c("Site.1")])
top.x
# [1] A B
# Levels: A B C
top.y<-unique(df[1:2,c("Site.2")])
top.y
# [1] D B
# Levels: B D
How do I get the unique levels for both columns and turn them into a vector i.e:
v <- c("A", "B", "D")
v
# [1] "A" "B" "D"

top.xy <- unique(unlist(df[1:2, c("Site.1", "Site.2")]))
top.xy
[1] A B D
Levels: A B C D

Try union:
union(top.x, top.y)
# [1] "A" "B" "D"
union(unique(df[1:2, c("Site.1")]),
unique(df[1:2, c("Site.2")]))
# [1] "A" "B" "D"
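If a plain character vector is wanted whether the site columns are factors or strings, one possible sketch is to convert each column with as.character() before combining. stringsAsFactors = TRUE is set explicitly here only to reproduce the factor columns shown in the question; the variable names cols and v are arbitrary.

```r
df <- data.frame(Site.1 = c("A", "B", "C"),
                 Site.2 = c("D", "B", "B"),
                 Tsim = c(2, 4, 7),
                 Jaccard = c(5, 7, 1),
                 stringsAsFactors = TRUE)

cols <- c("Site.1", "Site.2")

# Convert each selected column to character, flatten, then deduplicate
v <- unique(unlist(lapply(df[1:2, cols], as.character), use.names = FALSE))
print(v)
```

Restricting to the two site columns matters: including the numeric columns would mix numbers into the result.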

You can get the unique levels for the first two columns:
de<- apply(df[,1:2],2,unique)
de
# $Site.1
# [1] "A" "B" "C"
# $Site.2
# [1] "D" "B"
Then you can take the symmetric difference of the two sets:
union(setdiff(de$Site.1,de$Site.2), setdiff(de$Site.2,de$Site.1))
# [1] "A" "C" "D"
If you're interested in just the first two rows (as in your example):
de<- apply(df[1:2,1:2],2,unique)
de
# Site.1 Site.2
# [1,] "A" "D"
# [2,] "B" "B"
union(de[,1],de[,2])
# [1] "A" "B" "D"
