Sort Rows in R by certain characteristic - r

I've read a few similar questions that have been answered, but for the life of me, I can't figure out how to apply them to my problem. I have a data frame like this (some entries are blank; the number of columns is determined by the pronunciation guide of a word)
5 SUPERNATURALISM S UW2 P ER0 N AE1 CH ER0 AH0 L IH2 Z AH0 M
6 ESTABLISH IH0 S T AE1 B L IH0 SH
7 FABRIC F AE1 B R IH0 K
1504 AMTRAK AE1 M T R AE0 K
I'd basically like to be able to sort the table according to the number of syllables, which is easily visible as the number of times a digit (0, 1, or 2) appears in each row. The digit indicates the stress placed on a vowel; consonants such as S and P don't come with a digit attached.
For example, Row 7 and 1504 both have 2 syllables because there are 2 numbers in each row (although in different columns), so I'd like them to be together.
In the end, I'd just like to sort it so out of the 4 rows I showed, the order should be 7,1504,6,5
Does anyone know the code I should write?

You can extract the numbers from each line and order the lines by how many numbers each one contains (the length of each list element, after dropping the leading row number):
library(stringr)
## read lines
ll <- readLines(textConnection('5 SUPERNATURALISM S UW2 P ER0 N AE1 CH ER0 AH0 L IH2 Z AH0 M
6 ESTABLISH IH0 S T AE1 B L IH0 SH
7 FABRIC F AE1 B R IH0 K
1504 AMTRAK AE1 M T R AE0 K '))
## extract digit , remove the first one and order
ord <- order(sapply(str_extract_all(ll,'\\d+'),function(x)length(x[-1])))
## get the result
ll[ord]
[1] "7 FABRIC F AE1 B R IH0 K "
[2] "1504 AMTRAK AE1 M T R AE0 K "
[3] "6 ESTABLISH IH0 S T AE1 B L IH0 SH "
[4] "5 SUPERNATURALISM S UW2 P ER0 N AE1 CH ER0 AH0 L IH2 Z AH0 M"

Here's a solution using base R and a positive look-behind regex (using @agstudy's data):
len = unlist(lapply(ll, function(x)
length(gregexpr("(?<=[A-Za-z])[0-2]", x, perl=TRUE)[[1L]])))
ll[order(len)]
The regex (?<=[A-Za-z])[0-2] matches a digit from 0 to 2 only when the character immediately before it is a letter. perl=TRUE is required because look-ahead and look-behind assertions need PCRE.
We loop through ll using this regex. gregexpr returns a list whose first element holds the positions of the matches, so length(.) gives the number of matches, and order then rearranges ll. Note that gregexpr returns -1 when there is no match, so a line without any stress digit would be counted as 1 rather than 0.
HTH
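As a hedged variation on the base R approach above, here is a minimal sketch that also handles lines with no stress digit at all. It pairs gregexpr with regmatches so that a non-match yields an empty vector instead of -1. The helper name count_syllables and the second sample line are made up for illustration.

```r
# Hypothetical helper: count stress digits (0-2) that directly follow a letter.
count_syllables <- function(x) {
  m <- gregexpr("(?<=[A-Za-z])[0-2]", x, perl = TRUE)
  # regmatches() returns character(0) for lines without a match,
  # so lengths() reports 0 for them rather than 1.
  lengths(regmatches(x, m))
}

# Illustrative input: the second line is invented and has no vowel at all.
sample_lines <- c("7 FABRIC F AE1 B R IH0 K", "8 HMM HH M")
syllable_counts <- count_syllables(sample_lines)
syllable_counts
```

With this in place, ll[order(count_syllables(ll))] should give the same ordering as before for the data in the question.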

You would have to process the data first, probably regex is the way to go. Here is a solution (I will use the package stringr):
## your data
df <- read.table(text=
"5 SUPERNATURALISM S UW2 P ER0 N AE1 CH ER0 AH0 L IH2 Z AH0 M
6 ESTABLISH IH0 S T AE1 B L IH0 SH
7 FABRIC F AE1 B R IH0 K
1504 AMTRAK AE1 M T R AE0 K ",
fill=TRUE,
stringsAsFactors=FALSE)
Manipulating to extract numbers and count:
library(stringr)
text <- apply(df[,3:16], 1,paste, collapse="")
numbers <- str_extract_all(text, "\\d+")
df$syllables <- sapply(numbers, length)
Order:
df <- df[order(df$syllables),]
df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 syllables
3 7 FABRIC F AE1 B R IH0 K 2
4 1504 AMTRAK AE1 M T R AE0 K 2
2 6 ESTABLISH IH0 S T AE1 B L IH0 SH 3
1 5 SUPERNATURALISM S UW2 P ER0 N AE1 CH ER0 AH0 L IH2 Z AH0 M 7
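If you would rather not hard-code the column range 3:16 or depend on stringr, the same count can be sketched in base R. This is only one possible variant, and it assumes the first two columns are always the row id and the word, with every later column holding at most one phoneme.

```r
df <- read.table(text =
"5 SUPERNATURALISM S UW2 P ER0 N AE1 CH ER0 AH0 L IH2 Z AH0 M
6 ESTABLISH IH0 S T AE1 B L IH0 SH
7 FABRIC F AE1 B R IH0 K
1504 AMTRAK AE1 M T R AE0 K",
  fill = TRUE, stringsAsFactors = FALSE)

# Everything after the id and the word is a phoneme column (assumption).
phonemes <- df[, -(1:2)]

# TRUE wherever a cell ends in a stress digit, i.e. wherever it is a vowel.
is_vowel <- sapply(phonemes, grepl, pattern = "[0-2]$")

# One syllable per vowel, then sort; order() keeps ties in their original order.
df$syllables <- rowSums(is_vowel)
df_sorted <- df[order(df$syllables), ]
df_sorted$V1
```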

Related

Assign a vector element to another element in the same vector that is not itself (Secret Santa algorithm)

I have an office Secret Santa. Each person needs to be assigned someone to buy a gift for. For simplicity, let's refer to these people by letters.
First, I create the vector of people called peeps
# People
peeps <- letters[1:20]
Next, I set the seed for the RNG.
# Set RNG seed
set.seed(43)
Then, I sample the people and assign a giver to a receiver.
# Create data frame of givers and receiver
data.frame(giver = peeps, receiver = sample(peeps, length(peeps)))
# giver receiver
# 1 a l
# 2 b h
# 3 c d
# 4 d b
# 5 e e <- Problem!!!
# 6 f j
# 7 g s
# 8 h n
# 9 i g
# 10 j t
# 11 k q
# 12 l f
# 13 m k
# 14 n i
# 15 o a
# 16 p m
# 17 q p
# 18 r c
# 19 s r
# 20 t o
Notice row #5. e has been assigned themselves, which doesn't work. So, this time I loop through the people and I prevent self assignment.
# Set RNG seed
set.seed(7)
# Result data frame
res <- data.frame(giver = peeps, receiver = character(length(peeps)), stringsAsFactors = FALSE)
# Loop through people
for(i in 1:nrow(res)){
# Get sample that is not self
res[i, "receiver"] <- sample(peeps[peeps != res$giver[i]], 1)
# Remove from vector
peeps <- peeps[peeps != res[i, "receiver"]]
}
This particular case throws the following error.
# Error in sample.int(length(x), size, replace, prob) :
# invalid first argument
If we look at the resulting data frame you'll see it worked for all except the final person.
res
# giver receiver
# 1 a k
# 2 b h
# 3 c b
# 4 d s
# 5 e n
# 6 f i
# 7 g m
# 8 h l
# 9 i d
# 10 j q
# 11 k p
# 12 l j
# 13 m r
# 14 n e
# 15 o c
# 16 p o
# 17 q g
# 18 r f
# 19 s a
# 20 t
t is yet to be assigned a receiver, but the only receiver left is... t!
# Check which letters are left
peeps
#[1] "t" <- Same as final letter!!!
Question: what is an efficient way of assigning a member of peeps to another member of peeps that is not themself?
You seem to be talking about derangements (fixed-point-free permutations). By a classic probability result, the probability that a randomly chosen permutation is a derangement is approximately 1/e (about 0.37), and it is nearly independent of the size of the set once the set has more than a handful of elements. Just use a naive hit-and-miss approach: generate random permutations using sample() until you get one that works. The expected number of trials is about e (roughly 2.7), so on average about 3 trials should yield a success.
derangement <- function(v){
while(TRUE){
p <- sample(v)
if(all(p != v)) return(p)
}
}
peeps <- letters[1:20]
set.seed(43)
print(derangement(peeps))
Output:
[1] "j" "r" "b" "l" "f" "i" "t" "g" "c" "n" "d" "s" "p" "o" "q"
[16] "k" "a" "e" "h" "m"
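To make the 1/e claim concrete, here is a small sketch that evaluates the standard inclusion-exclusion formula for the derangement probability, the sum of (-1)^k / k! for k from 0 to n. The function name derangement_prob is made up for this example.

```r
# Probability that a uniformly random permutation of n items has no fixed point.
derangement_prob <- function(n) {
  k <- 0:n
  sum((-1)^k / factorial(k))
}

p20 <- derangement_prob(20)
p20          # very close to exp(-1)
1 / p20      # expected number of sample() calls in the rejection loop
```

For 20 people the value agrees with exp(-1) to many decimal places, which is why the rejection loop is cheap in practice.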
You could just shuffle them and then pair them off with the next one...
peeps <- letters[1:20]
giver <- sample(peeps) #random order
receiver <- giver[c(2:length(giver), 1)]
df <- data.frame(giver, receiver)
df <- df[order(df$giver), ] #restore original order
df
giver receiver
19 a o
16 b t
8 c f
11 d i
10 e d
9 f e
7 g c
15 h b
12 i k
1 j m
13 k s
4 l r
2 m p
18 n a
20 o j
3 p l
6 q g
5 r q
14 s h
17 t n
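The shuffle-and-shift idea can be wrapped in a function, sketched below under the assumption that names are unique. The name secret_santa_cycle is invented here. Note that this always produces one big gift-giving cycle, so it never generates derangements made of several smaller cycles (for example, two people swapping with each other).

```r
# Hypothetical wrapper around the shuffle-and-shift approach.
secret_santa_cycle <- function(people) {
  # With fewer than two people nobody can avoid drawing themselves.
  stopifnot(length(people) >= 2, !anyDuplicated(people))
  giver <- sample(people)                  # random order
  receiver <- c(giver[-1], giver[1])       # each person gives to the next one
  out <- data.frame(giver, receiver, stringsAsFactors = FALSE)
  out[order(out$giver), ]                  # restore original order
}

santa <- secret_santa_cycle(letters[1:20])
any(santa$giver == santa$receiver)         # FALSE: no self-assignment
```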

Multiple String Matching in R

Consider A,B,C,D .... as words.
I have two DFs.
df1:
ColA
A B
B C
C D
E F
G H
A M
M
df2:
ColB
A B C D X Y Z
C D M N F K L
S H A F R M T U
Operation:
I want to search for every element of df1 in df2, then either append all the matching values in a new column or create multiple rows.
Output 1:
ColB COlB
A B C D X Y Z A,A B,B C,C D
C D M N F K L C D,M
S H A F R M T U A,A M
Output2:
ColB Output
A B C D X Y Z A
A B C D X Y Z A B
A B C D X Y Z B C
A B C D X Y Z C D
C D M N F K L C D
C D M N F K L M
S H A F R M T U A
S H A F R M T U A M
I think this will do it, although it differs a bit from your expected answer, which I think is wrong.
First set up the input data frames:
# set up the data
df1 <- data.frame(ColA = c("A B",
"B C",
"C D",
"E F",
"G H",
"A M",
"M"),
stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z",
"C D M N F K L",
"S H A F R M T"),
stringsAsFactors = FALSE)
Next we will form all the pairwise combinations of the things to search with the things to be searched:
# create a vector of patterns and items to search
intermediate <- as.vector(outer(df2$ColB, df1$ColA, paste, sep = "|"))
# split it into a list
intermediate <- strsplit(intermediate, "|", fixed = TRUE)
Then we can create a function to match the elements for each row of this full combination dataset. The core is foundMatch, a logical vector indicating whether all elements in ColA were present in ColB. In your examples, order does not matter, so here we split the elements and check that all of the second (the pattern) are in the first (the string being searched).
# set up the output data.frame
Output2 <- data.frame(do.call(rbind, intermediate))
names(Output2) <- c("ColB", "Output")
# here is the core, which does the element matching
foundMatch <- apply(Output2, 1, function(x) {
tokens <- strsplit(x, " ", fixed = TRUE)
all(tokens[[2]] %in% tokens[[1]])
})
# filter out the ones with the match
Output2 <- Output2[foundMatch, ]
Output2
## ColB Output
## 1 A B C D X Y Z A B
## 4 A B C D X Y Z B C
## 7 A B C D X Y Z C D
## 8 C D M N F K L C D
## 18 S H A F R M T A M
## 20 C D M N F K L M
## 21 S H A F R M T M
Not exactly what you have above but I think it's correct.
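For the first requested layout (all matches collapsed into one new column), a rough sketch along the same lines could look like this. The helper contains_all and the column name Matches are invented for the example, and the matching rule is the same order-insensitive "all tokens present" test used above.

```r
df1 <- data.frame(ColA = c("A B", "B C", "C D", "E F", "G H", "A M", "M"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColB = c("A B C D X Y Z", "C D M N F K L", "S H A F R M T"),
                  stringsAsFactors = FALSE)

# TRUE if every space-separated token of needle occurs in haystack.
contains_all <- function(needle, haystack) {
  all(strsplit(needle, " ", fixed = TRUE)[[1]] %in%
        strsplit(haystack, " ", fixed = TRUE)[[1]])
}

# For each ColB string, paste together every ColA entry it fully contains.
df2$Matches <- sapply(df2$ColB, function(b) {
  hit <- sapply(df1$ColA, contains_all, haystack = b)
  paste(df1$ColA[hit], collapse = ",")
}, USE.NAMES = FALSE)
df2
```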
It is not obvious to me how your data frames df1 and df2 are built, but you can try to vectorise your data and match both sets.
d1 <- sort(as.character(unlist(df1)))
d2 <- sort(as.character(unlist(df2)))
# get the intersection/difference without duplicates
intersect(d1,d2)
setdiff(d1,d2)
# get all values matching with the first or with the second dataset, respectively
d1[ d1 %in% d2 ]
d2[ d2 %in% d1 ]

Whole dataset shows up, although a subset has been selected and newly defined

I have a data frame which I have subsetted using normal indexing. Code below.
dframe <- dframe[1:10, c(-3,-7:-10)]
But when I write dframe$Symbol I get the output.
BABA ORCL LFC TSM ACT ABBV MA ABEV KMI UPS
3285 Levels: A AA AA^B AAC AAN AAP AAT AAV AB ABB ABBV ABC ABEV ABG ABM ABR ABR^A ABR^B ABR^C ABRN ABT ABX ACC ACCO ACE ACG ACH ACI ACM ACN ACP ACRE ACT ACT^A ACW ADC ADM ADPT ADS ADT ADX AEB AEC AED AEE AEG AEH AEK AEL AEM AEO AEP AER AES AES^C AET AF AF^C ... ZX
I'm wondering what is happening here. Does the dframe dataframe only contain 10 rows or still all rows, but only outputs 10 rows?
Thanks
That's just the way factors work. When you subset a factor, it preserves all levels, even those that are no longer represented in the subset. For example:
f1 <- factor(letters);
f1;
## [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
f2 <- f1[1:10];
f2;
## [1] a b c d e f g h i j
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
To answer your question, it's actually slightly tricky to append all missing levels to a factor. You have to combine the existing factor data with all missing indexes (here I'm referring to the integer indexes that the factor class internally uses to map the actual factor data to its levels vector, which is stored as an attribute on the factor object), and then rebuild a factor (using the original levels) from that combined data. Below I demonstrate this, now randomizing the subset taken from f1 to demonstrate that order does not matter:
set.seed(1); f3 <- sample(f1,10);
f3;
## [1] g j n u e s w m l b
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
factor(c(f3,setdiff(1:nlevels(f3),as.integer(f3))),labels=levels(f3));
## [1] g j n u e s w m l b a c d f h i k o p q r t v x y z
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
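If the goal is simply to get rid of the unused levels after subsetting, base R's droplevels() does that for a single factor or for every factor column of a data frame. The sketch below uses a small made-up data frame (the column names Symbol and Price and their values are only placeholders) to show that the subset really contains just the selected rows, and that only the levels attribute was carrying the old values.

```r
f1 <- factor(letters)
f2 <- f1[1:10]
nlevels(f2)                 # still 26 after subsetting
nlevels(droplevels(f2))     # 10 once unused levels are dropped

# Made-up stand-in for the asker's data frame.
dframe <- data.frame(Symbol = factor(c("BABA", "ORCL", "LFC")),
                     Price  = c(1, 2, 3))
sub_frame <- droplevels(dframe[1:2, ])
nrow(sub_frame)             # 2 rows, so the subset itself is small
levels(sub_frame$Symbol)    # only the symbols that remain
```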

combining rows in sequence file

I have a data frame in which each individual has two rows, and I want to combine these two rows into one row.
Code lines:
dat <- read.table("cbin.csv",sep="\t", row.names=1)
dat
V2 V3 V4 V5
1_1 A B C D
1_2 a b c d
2_1 E F G H
2_2 e f g h
3_1 J K L M
3_2 j k l m
d <- apply( dat[ , colnames(dat) ] , 2 , paste , collapse = " " )
d
V2 V3 V4 V5
"A a E e J j" "B b F f K k" "C c G g L l" "D d H h M m"
But I want to combine each two rows like this
1 A a B b C c D d
2 E e F f G g H h
3 J j K k L l M m
How can I do this?
This will get you more or less the data.frame you want. I just pull out the even rows and cbind them next to the odd rows.
dat2 <- cbind(dat[seq(1, nrow(dat), by = 2), ],
dat[seq(2, nrow(dat), by = 2), ])
I'll leave reordering the columns (or pasting them together, if you want to combine them into individual strings) as an exercise for the reader.
Here are a couple of options:
Option 1: Use stack to get a long data.frame, then use paste within aggregate to get the output you want.
Here's how you make your "long" data.frame.
Long <- cbind(rn = rownames(dat), stack(dat))
head(Long)
# rn values ind
# 1 1_1 A V2
# 2 1_2 a V2
# 3 2_1 E V2
# 4 2_2 e V2
# 5 3_1 J V2
# 6 3_2 j V2
If the values in "dat" are factors, you might need to do:
Long <- cbind(rn = rownames(dat), stack(lapply(dat, as.character)))
Once your data are in a long form, use aggregate along with substr (among other choices) to get the values you need to paste together.
aggregate(values ~ substr(rn, 1, 1), Long, paste, collapse = " ")
# substr(rn, 1, 1) values
# 1 1 A a B b C c D d
# 2 2 E e F f G g H h
# 3 3 J j K k L l M m
Option 2: An alternative is a similar approach to what @Gregor is suggesting. This is basically another way of taking every alternate row and binding it, but it goes the extra step to reorder and paste the values together.
do.call(paste,
cbind(dat[c(TRUE, FALSE), ],
dat[c(FALSE, TRUE), ])[order(rep(names(dat), 2))])
# [1] "A a B b C c D d" "E e F f G g H h" "J j K k L l M m"
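One caveat with substr(rn, 1, 1) is that it only looks at the first character, so it would lump individuals 1, 10 and 11 together. As a hedged sketch, the grouping key can instead be taken as everything before the underscore; this assumes row names always look like id_rownumber, as in the question.

```r
dat <- read.table(text =
"1_1 A B C D
1_2 a b c d
2_1 E F G H
2_2 e f g h
3_1 J K L M
3_2 j k l m",
  row.names = 1, stringsAsFactors = FALSE)

# Individual id = row name with the "_1" / "_2" suffix removed.
id <- sub("_.*$", "", rownames(dat))

# unlist() walks each two-row group column by column: A a B b C c D d.
combined <- sapply(split(dat, id), function(g) paste(unlist(g), collapse = " "))
combined
```

Groups come back sorted by id as text, so with ids above 9 you may want to reorder the result numerically afterwards.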

Remove variables with factor level 1

I am using the program gs in the bnlearn package for my data frame EMGbin. The dataframe EMGbin contains all factors, ranging from A to Z. EMGbin has 600000 columns and 130 rows. Here is a sample of EMGbin:
V101 V102 V103 V104 V105 V106
1 L M D S O O
2 L M C P A O
3 J M C O O O
4 L N D R A O
5 K M D O A O
6 K M C P O O
7 K N D Q O O
8 L N D R O O
9 L M D O O O
10 K M D S A O
When I run the program gs(EMGbin), I get the error:
Error in check.data(x) : all factors must have at least two levels.
When I run sapply(EMGbin, nlevels), I see the levels of factors each of the 600,000 variables has, and I see some of them are listed as 1 level. Would removing the variables with 1 factor level help? So far, the only way I know how to do this is x[, sapply(x, fun) != 1], but I don't know what to substitute in for fun.
Use this:
x[, sapply(x, nlevels) > 1]
You can check the number of levels in a factor with the nlevels function.
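A slightly more defensive version of the same idea is sketched below on a tiny made-up data frame (only V101 and V106 from the sample, three rows). vapply guarantees an integer result per column, and drop = FALSE keeps the result a data frame even if only one column survives. If some factors still carry unused levels from earlier subsetting, calling droplevels() first would make nlevels reflect the values actually present.

```r
# Tiny illustrative stand-in for EMGbin.
x <- data.frame(V101 = factor(c("L", "L", "J")),
                V106 = factor(c("O", "O", "O")))

# Keep only the columns whose factor has at least two levels.
keep <- vapply(x, nlevels, integer(1)) > 1
x_clean <- x[, keep, drop = FALSE]
names(x_clean)
```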
