Append vector not giving names - r

In RStudio, I am trying to create a vector of country names. They are in column 1 of my data set. countryvec gives the factor names
"Australia Australia ..."
x just gives the name of Russia, country 36, and country ends up being
1,1,...,2,2,...,4,4.. etc.
The numbers are also not in order: 3 ends up between 42 and 43. How do I get the factor names instead of the numbers?
gdppc=read.xlsx("H:/dissertation/ALL/YAS.xlsx",sheetIndex = 1,startRow = 1)
countryvec=gdppc[,1]
country=c()
for (j in 1:43){
x=rep(countryvec[j],25)
country=append(country,x)
}

You need to retrieve the levels attribute
set.seed(7)
v <- factor(letters[rbinom(20, 10, .5)])
> c(v)
[1] 6 4 2 2 3 5 3 6 2 4 2 3 5 2 4 2 4 1 6 3
> levels(v)[v]
[1] "h" "e" "c" "c" "d" "f" "d" "h" "c" "e" "c" "d" "f" "c" "e" "c" "e" "a" "h" "d"
You'll probably need to modify the code inside the loop:
x <- rep(levels(countryvec)[countryvec][j], 25)
Or convert the vector prior to the loop:
countryvec <- levels(countryvec)[countryvec]
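As a minimal sketch of the same idea without the loop: as.character() returns the labels of a factor, and rep(..., each = n) repeats every element n times in order. The three-country factor below is a made-up stand-in for column 1 of the spreadsheet, and 2 repeats are used instead of 25 to keep the output short.

```r
# Hypothetical stand-in for gdppc[, 1]; the real data has 43 countries
countryvec <- factor(c("Russia", "Australia", "Brazil"))

# as.character() gives the labels, not the underlying integer codes
labels <- as.character(countryvec)

# Repeat each label in order (use each = 25 for the original data)
country <- rep(labels, each = 2)
country
# [1] "Russia"    "Russia"    "Australia" "Australia" "Brazil"    "Brazil"
```

This keeps the countries in the order they appear in the column rather than in the order of the factor levels.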

Related

In R, assign numeric variable according to segments of another variable? [duplicate]

Some simple sample data:
test <- c(rep('B', 10), rep('A', 7), rep('C', 10), rep('A', 3))
# [1] "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "A" "A" "A" "A" "A" "A" "A" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "A" "A" "A"
I would like to assign a numeric variable to this where the first block of 'B' gets a 1, the first block of 'A' gets a 2, the first block of 'C' gets a 3, and the next block of 'A' gets a 4. I tried:
test <- factor(test, levels = unique(test))
as.integer(test)
but that assigns the second block of 'A' a 2. How can I get it to assign unique, sequential numbers to each block? The actual data is drug combinations, and I need the assigned numeric variable to start at 1.
I guess you need rle
> with(rle(test), rep(seq_along(lengths), lengths))
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4
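To see why this works, it may help to look at the pieces rle returns. This is only an unpacked version of the one-liner above; the names runs and block_id are just for illustration.

```r
test <- c(rep("B", 10), rep("A", 7), rep("C", 10), rep("A", 3))

# rle() stores each run of equal consecutive values once, with its length
runs <- rle(test)
runs$values   # "B" "A" "C" "A"
runs$lengths  # 10  7 10  3

# Number the runs 1, 2, 3, ... and stretch each number to its run length
block_id <- rep(seq_along(runs$lengths), runs$lengths)
block_id
```

Because the second run of 'A' is a separate entry in runs$values, it gets its own number (4) rather than reusing 2.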

Is there some way to split a list into its elements in R?

I am doing an analysis of tweets from different accounts using get_timeline from rtweet. It returns a df with 90 variables, which is great. However, one of them, the variable hashtags, gives me either NA (no hashtags used in the tweet), one hashtag, or a list of all the hashtags. I want to create a separate variable for each of the hashtags so that I can save the tweets to a CSV, load it into Power BI and make some graphs.
Therefore, my question is: can you split all the elements of the list into different variables containing a single word each?
As I understand your problem, you do not need to split the list in order to get all single or unique list entries; use a combination of unlist and unique instead.
Let's assume you have a list of hashtags (just letters in this example) with different lengths, l_hashtags.
Some hashtags are repeated.
Unlisting the list gives you a vector with all hashtags, including all repetitions.
Applying unique to the unlisted l_hashtags gives you the unique members of the original list.
l_hashtags <- list(c(LETTERS[1:2]), rep(NA,5), LETTERS[5:15], c('A', 'N', 'N', 'J', 'K'))
l_hashtags
#> [[1]]
#> [1] "A" "B"
#>
#> [[2]]
#> [1] NA NA NA NA NA
#>
#> [[3]]
#> [1] "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
#>
#> [[4]]
#> [1] "A" "N" "N" "J" "K"
table(unlist(l_hashtags))
#>
#> A B E F G H I J K L M N O
#> 2 1 1 1 1 1 1 2 2 1 1 3 1
l_hashtags_unlisted <- unlist(l_hashtags)
unique(l_hashtags_unlisted)
#> [1] "A" "B" NA "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
You can of course put all this into one single line:
unique(unlist(l_hashtags))
# [1] "A" "B" NA "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
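If the goal is mainly to get the list column into a CSV, one possible approach (not part of the answer above, just a sketch) is to collapse each tweet's hashtags into a single string. The small list below is made up for the example, and hashtags_flat is a hypothetical name.

```r
# Hypothetical hashtag list: two tags, none (NA), three tags
l_hashtags <- list(c("A", "B"), NA, c("E", "F", "G"))

# One string per tweet; tweets without hashtags stay NA
hashtags_flat <- vapply(
  l_hashtags,
  function(h) if (all(is.na(h))) NA_character_ else paste(h, collapse = " "),
  character(1)
)
hashtags_flat
# [1] "A B"   NA      "E F G"
```

The result is an ordinary character vector, so it can replace the list column before calling write.csv.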

How to plot frequency of all elements in one list appearing in another list

I have a long list of sequences as follows
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC etc.
I also have a shorter list, and I would like to see how many times each element of the short list appears in the long list and plot the counts as a histogram. I suppose it's like a VLOOKUP function. How can I do this in R?
Try:
longlist = c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC", "AAGCTACGCATGCATCGACT",
"AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT",
"AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC"
)
shortlist = c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT")
longlist
[1] "AAAAAACGTTATGATCGATC" "AAAATTCGCGCTTAGAGATC" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
[6] "AAGCTACGCATGCATCGACT" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
shortlist
[1] "AAAAAACGTTATGATCGATC" "AAGCTACGCATGCATCGACT"
outdf = data.frame(var=character(), freq=numeric(), stringsAsFactors=FALSE)
for(i in seq_along(shortlist)) {outdf[i,]=c(shortlist[i], sum(longlist==shortlist[i]))}
outdf
var freq
1 AAAAAACGTTATGATCGATC 5
2 AAGCTACGCATGCATCGACT 3
outdf$freq = as.numeric(outdf$freq)
barplot(outdf$freq, names.arg=outdf$var)
You can also use the following to see the frequencies and a barplot of the full longlist:
table(longlist)
longlist
AAAAAACGTTATGATCGATC AAAATTCGCGCTTAGAGATC AAGCTACGCATGCATCGACT
5 1 3
barplot(table(longlist))
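A loop-free sketch of the same count: turning longlist into a factor whose levels are the shortlist makes table() count only those sequences, and report 0 for any that never occur. The data here is a shortened, made-up version of the vectors above, with "TTTT" added as a sequence that does not appear.

```r
longlist <- c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC",
              "AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC")
shortlist <- c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT", "TTTT")

# Values outside the levels become NA and are dropped by table()
counts <- table(factor(longlist, levels = shortlist))
counts
```

barplot(counts) then plots one bar per shortlist entry, in shortlist order.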
match and table should work for your character vectors. Here's an example with just random letters:
set.seed(1492)
dat <- sample(c(letters, LETTERS), 100, replace=TRUE)
dat
## [1] "o" "l" "j" "f" "c" "a" "S" "A" "u" "N" "H" "H" "k" "B" "B" "P" "g"
## [18] "r" "I" "V" "H" "t" "g" "F" "e" "W" "E" "D" "r" "Y" "h" "Z" "R" "l"
## [35] "Z" "K" "v" "f" "b" "q" "M" "P" "i" "u" "w" "m" "S" "g" "f" "g" "G"
## [52] "h" "q" "T" "J" "M" "K" "m" "X" "Q" "f" "x" "t" "B" "k" "z" "I" "Y"
## [69] "z" "g" "z" "u" "O" "k" "G" "L" "n" "B" "A" "A" "J" "p" "U" "F" "E"
## [86] "X" "R" "J" "G" "L" "H" "o" "z" "r" "d" "r" "V" "H" "S" "I"
matches <- match(dat, LETTERS)
match_counts <- table(matches[!is.na(matches)])
match_counts
##
## 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
names(match_counts) <- LETTERS[as.numeric(names(match_counts))]
match_counts
## A B D E F G H I J K L M N O P Q R S T U V W X Y Z
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
barplot(sort(match_counts), col="#649388")
Assuming that the sequences are strings.
lines <- readLines(n=6)
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
shortlist <- readLines(n=1)
AGTD
Here, I am treating each element of the shortlist as individual characters, as the question is not clear on this.
pat1 <- gsub("(?<=[A-Za-z])(?=[A-Za-z])", "|", shortlist, perl=TRUE)
pat1
#[1] "A|G|T|D"
library(stringr)
lvls <- unique(str_extract_all(shortlist, "[A-Za-z]")[[1]])
t1 <- table(factor(unlist(regmatches(lines,gregexpr(pat1, lines))), levels=lvls))
t1
#
# A G T D
#47 21 29 0
barplot(t1, col="#649388")
Update
If your shortlist is like the one below and you want the frequencies of each whole string instead of the characters in the string:
shortlist1 <- readLines(n=4)
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATCT
AAAAAACG
# Group the alternatives so that ^ and $ apply to every string, not just the first and last
pat2 <- paste0("^(", paste(shortlist1, collapse="|"), ")$")
lvls1 <- unique(shortlist1)
t2 <- table(factor(unlist(regmatches(lines,gregexpr(pat2, lines))), levels=lvls1))
t2
#AAGCTACGCATGCATCGACT AAAAAACGTTATGATCGATC AAAAAACGTTATCT
# 1 3 0
# AAAAAACG
# 0
barplot(t2, col="#649388")

Matching values from two vectors in R

I have two vectors:
A <- c(1,3,5,6,4,3,2,3,3,3,3,3,4,6,7,7,5,4,4,3) # 7 unique values
B <- c("a","b","c","d","e","f","g") # 7 different values
I would like to match the values of B to A such that the smallest value in A gets the first value from B and continued on to the largest.
The above example would be:
A: 1 3 5 6 4 3 2 3 3 3 3 3 4 6 7 7 5 4 4 3
assigned: a c e f d c b c c c c c d f g g e d d c
Try this:
A <- c(1,3,5,6,4,3,2,3,3,3,3,3,4,6,7,7,5,4,4,3)
B <- letters[1:7]
B[match(A, sort(unique(A)))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d" "f" "g"
# [16] "g" "e" "d" "d" "c"
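To make the steps visible, here is the same idea unpacked on a small made-up vector whose values are not 1..n, so that plain indexing with A would not work. The names ranks and idx are only for illustration.

```r
A <- c(10, 30, 20, 30)
B <- c("a", "b", "c")

# Distinct values of A from smallest to largest
ranks <- sort(unique(A))   # 10 20 30

# Position of each element of A among those distinct values
idx <- match(A, ranks)     # 1 3 2 3

B[idx]
# [1] "a" "c" "b" "c"
```

Here B[A] would return only NA, because 10, 20 and 30 are not valid positions in B.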
Another option that handles the general case that @JoshO'Brien addresses would be
B[as.numeric(factor(A))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d"
# [14] "f" "g" "g" "e" "d" "d" "c"
A2<-ifelse(A > 4, A + 1, A)
# [1] 1 3 6 7 4 3 2 3 3 3 3 3 4 7 8 8 6 4 4 3
B[as.numeric(factor(A2))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d"
# [14] "f" "g" "g" "e" "d" "d" "c"
However, the following benchmark shows that this method is slower than @JoshO'Brien's.
library(microbenchmark)
B <- make.unique(rep(letters, length.out=1000))
A <- sample(seq_along(B), replace=TRUE)
unique_sort_match <- function() B[match(A, sort(unique(A)))]
factor_as.numeric <- function() B[as.numeric(factor(A))]
bm<-microbenchmark(unique_sort_match(), factor_as.numeric(), times=1000L)
plot(bm)
To elaborate on the comments in @Josh's answer:
If A does in fact represent a permutation of the elements of B (i.e., where a 1 in A represents the first element of B, a 4 in A represents the 4th element of B, etc.), then, as @Matthew Plourde points out, you would want to simply use A as your index to B:
B[A]
If A does not represent a permutation of B, then you should use the method suggested by @Josh.

match() to list of vectors - of possibly different lengths

The match(x, y) function is perfect for searching for the elements of the vector x among the elements of the vector y. But what is an efficient and easy way to do a similar job when y is a list of vectors, possibly of different lengths?
I mean the result should be a vector of the same length as x, and the i-th element should be the first member of y that contains the i-th element of x, or NA.
To find the element of y in which each element of x (first) occurs, try this:
## First, a reproducible example
set.seed(44)
x <- letters[1:25]
y <- replicate(4, list(sample(letters, 8)))
y
# [[1]]
# [1] "t" "h" "m" "n" "a" "d" "i" "b"
#
# [[2]]
# [1] "c" "l" "z" "a" "s" "d" "i" "u"
#
# [[3]]
# [1] "b" "k" "e" "g" "o" "i" "h" "j"
#
# [[4]]
# [1] "g" "i" "f" "r" "h" "w" "l" "o"
## Find the element of y first containing each of the letters a-y
breaks <- c(0, cumsum(sapply(y, length))) + 1
findInterval(match(x, unlist(y)), breaks)
# [1] 1 1 2 1 3 4 3 1 1 3 3 2 1 1 3 NA NA 4 2 1 2 NA 4 NA NA
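An alternative sketch that avoids findInterval: build a vector that records, for every position in unlist(y), which element of y it came from, and index that with the result of match. The small x and y below are made up for the example, and owner is a hypothetical name.

```r
x <- c("a", "i", "z", "q")
y <- list(c("t", "a"), c("c", "i", "a"), c("i", "z"))

# For each position in unlist(y), the index of the list element it belongs to
owner <- rep(seq_along(y), lengths(y))   # 1 1 2 2 2 3 3

# match() finds the first position of each x; owner maps it back to y
first_in <- owner[match(x, unlist(y))]
first_in
# [1]  1  2  3 NA
```

Since match returns the first position only, a letter that occurs in several elements of y is assigned to the earliest one, and letters that occur nowhere give NA.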
