Creating a new variable column based on data from another column - r

I'm pretty new to R, and programming in general, and I'm wondering the best way to loop through a column so I can add a column to the data frame further describing the observations I looped through.
I currently have a list of amino acids and their positions on a protein that looks like this:
Residue Position
H 1
R 2
K 3
D 4
E 5
H 6
R 7
K 8
D 9
E 10
I'd like something that looks like this (where H, R, and K are basic amino acids, and D and E are acidic amino acids):
Residue Position Properties
H 1 Basic
R 2 Basic
K 3 Basic
D 4 Acidic
E 5 Acidic
H 6 Basic
R 7 Basic
K 8 Basic
D 9 Acidic
E 10 Acidic
I'm really not sure where to start, and I'm having difficulty finding a good resource for this kind of situation in R.
I started by trying to subset the data, but then I realized that wouldn't do the trick:
# Basic
h.dat <- subset(all, all$Residue == "H")
r.dat <- subset(all, all$Residue == "R")
k.dat <- subset(all, all$Residue == "K")
# Acidic
d.dat <- subset(all, all$Residue == "D")
e.dat <- subset(all, all$Residue == "E")
Thanks!
Note:
H = Histidine (Basic amino acid)
R = Arginine (Basic)
K = Lysine (Basic)
E = Glutamic Acid (Acidic)
D = Aspartic Acid (Acidic)

You can use ifelse. If df is the name of your original data,
df$Property <- ifelse(df$Residue %in% c("H", "R", "K"), "Basic", "Acidic")
df
# Residue Position Property
# 1 H 1 Basic
# 2 R 2 Basic
# 3 K 3 Basic
# 4 D 4 Acidic
# 5 E 5 Acidic
# 6 H 6 Basic
# 7 R 7 Basic
# 8 K 8 Basic
# 9 D 9 Acidic
# 10 E 10 Acidic
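One caveat with a two-way ifelse is that any residue other than H, R or K ends up labelled "Acidic". A minimal sketch of an alternative, assuming a named lookup vector (called `props` here, a name that is not in the original), leaves unknown residues as NA instead:

```r
# Rebuild the example data from the question
df <- data.frame(Residue = c("H", "R", "K", "D", "E", "H", "R", "K", "D", "E"),
                 Position = 1:10,
                 stringsAsFactors = FALSE)

# Hypothetical lookup vector: names are residues, values are properties
props <- c(H = "Basic", R = "Basic", K = "Basic", D = "Acidic", E = "Acidic")

# Index the lookup by residue; residues missing from props become NA
df$Property <- unname(props[df$Residue])
df
```

Adding another class of amino acid then only means adding entries to `props`.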

Try:
> df1
Residue Position
1 H 1
2 R 2
3 K 3
4 D 4
5 E 5
6 H 6
7 R 7
8 K 8
9 D 9
10 E 10
Create a reference table:
> df2
Residue Property
1 H Basic
2 R Basic
3 K Basic
4 D Acidic
5 E Acidic
Then merge:
> merge(df1, df2)
Residue Position Property
1 D 9 Acidic
2 D 4 Acidic
3 E 5 Acidic
4 E 10 Acidic
5 H 1 Basic
6 H 6 Basic
7 K 8 Basic
8 K 3 Basic
9 R 7 Basic
10 R 2 Basic
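Note that merge() sorts the result by the key column, which is why the rows above are no longer in Position order. A small sketch of how the two tables could be built and the original order restored (the construction of df1 and df2 is assumed from the printed output):

```r
# Data and reference table as printed above
df1 <- data.frame(Residue = c("H", "R", "K", "D", "E", "H", "R", "K", "D", "E"),
                  Position = 1:10,
                  stringsAsFactors = FALSE)
df2 <- data.frame(Residue = c("H", "R", "K", "D", "E"),
                  Property = c("Basic", "Basic", "Basic", "Acidic", "Acidic"),
                  stringsAsFactors = FALSE)

# all.x = TRUE keeps residues that are missing from df2 (Property is NA)
merged <- merge(df1, df2, by = "Residue", all.x = TRUE)

# merge() sorted by Residue; put the rows back in Position order
merged <- merged[order(merged$Position), ]
rownames(merged) <- NULL
merged
```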

I think you might want to allow for non-polar amino acids as well:
c(rep("Basic",3),rep("Acidic",2),"Non-Polar")[ # those are the choices
match(dat$Residue, c("H","R","K","E","D"), nomatch=6) ] #select indices
So I added an 11th residue named "Z" and tested:
> dat$Property <- c(rep("Basic",3),rep("Acidic",2),"Non-Polar")[
match(dat$Residue, c("H","R","K","E","D"), nomatch=6) ]
> dat
Residue Position Property
1 H 1 Basic
2 R 2 Basic
3 K 3 Basic
4 D 4 Acidic
5 E 5 Acidic
6 H 6 Basic
7 R 7 Basic
8 K 8 Basic
9 D 9 Acidic
10 E 10 Acidic
11 Z 11 Non-Polar

Related

quanteda::dfm_lookup(): capture found term

I would like to run quanteda's amazing dfm_lookup() on a dictionary but also retrieve the matched terms.
Consider the following example:
dict_ex <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
opposition = c("Opposition", "reject", "notincorpus"),
taxglob = "tax*",
taxregex = "tax.+$",
country = c("United_States", "Sweden")))
dfmat_ex <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?")),
remove = stopwords("english"))
dfmat_ex
dfm_lookup(dfmat_ex, dict_ex)
This gives me:
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
features
docs christmas opposition taxglob taxregex country
text1 1 1 1 0 0
text2 0 0 1 0 2
However, since each dictionary key has multiple entries, I would like to know which token produced the match. (My real dictionary is rather long, so the example might seem trivial, but for the real use case it is not.)
I would like to achieve a result like this:
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
features
docs christmas christmas.match opposition opposition.match taxglob taxglob.match taxregex taxregex.match country country.match
text1 1 Christmas 1 Opposition 1 tax 0 NA 0 NA
text2 0 NA 0 NA 1 taxation 0 NA 2 United_States, Sweden
Can someone help me with this? Many thanks in advance! :)
That's not really possible for two reasons.
First, a matrix(-like) object (dfm or otherwise) cannot mix element modes, here a mixture of counts and character values. This would be possible with a data.frame, but then you lose the advantages of sparsity, and you would end up with a data.frame of dimensions n x 2*V (where V is the number of features).
Second, "christmas.match" could have more than one feature/token matching it, so the character value would require a list, straining the object class even further.
A better way would be to use kwic() to match the tokens to the patterns formed by the dictionary. You can do this for the keys by supplying the dictionary as the pattern argument, or for each value by unlisting the dictionary first.
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))
toks <- tokens(c(d1 = "a b c d e f g and another"))
# where the dictionary keys are the patterns matched
kwic(toks, dict) %>%
as.data.frame()
## docname from to pre keyword post pattern
## 1 d1 1 1 a b c d e f one
## 2 d1 2 2 a b c d e f g one
## 3 d1 5 5 a b c d e f g and another two
## 4 d1 6 6 a b c d e f g and another two
## 5 d1 8 8 c d e f g and another one
## 6 d1 9 9 d e f g and another one
# where the dictionary values are the patterns matched
kwic(toks, unlist(dict)) %>%
as.data.frame()
## docname from to pre keyword post pattern
## 1 d1 1 1 a b c d e f a*
## 2 d1 2 2 a b c d e f g b
## 3 d1 5 5 a b c d e f g and another e
## 4 d1 6 6 a b c d e f g and another f
## 5 d1 8 8 c d e f g and another a*
## 6 d1 9 9 d e f g and another a*
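To get closer to the "one cell of matches per key" layout asked for in the question, the kwic data.frame can be collapsed by document and pattern. The sketch below avoids depending on quanteda by hand-copying three columns (docname, keyword, pattern) from the first output above into a data.frame named `kw` (a hypothetical name); with real data you would start from as.data.frame(kwic(toks, dict)) instead.

```r
# Columns copied by hand from the first kwic() output above
kw <- data.frame(docname = rep("d1", 6),
                 keyword = c("a", "b", "e", "f", "and", "another"),
                 pattern = c("one", "one", "two", "two", "one", "one"),
                 stringsAsFactors = FALSE)

# One row per document and dictionary key, matched tokens pasted together
matched <- aggregate(keyword ~ docname + pattern, data = kw,
                     FUN = function(x) paste(x, collapse = ", "))
matched
```

This keeps the matches in a separate long-format table rather than inside the dfm, in line with the two objections above.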

R - split list every x items

I have data to analyse that is presented in the form of a list (just one row and MANY columns).
A B C D E F G H I
1 2 3 4 5 6 7 8 9
Is there a way to tell R to split this list every x items and get something as seen below (the columns C D E F G H I are virtually the same as A B)?
A B
1 2
3 4
5 6
7 8
9
If the number of columns is a multiple of 'x', then we unlist the dataset, and use matrix to create the expected output.
as.data.frame(matrix(unlist(df1), ncol=2, dimnames=list(NULL, c("A", "B")) , byrow=TRUE))
If the number of columns is not a multiple of 'x', then
x <- 2
gr <- as.numeric(gl(ncol(df1), x, ncol(df1)))
lst <- split(unlist(df1), gr)
do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
# A B
# 1 1 2
# 2 3 4
# 3 5 6
# 4 7 8
# 5 9 NA
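A compact variant that handles both cases is to pad the vector with NA up to the next multiple of x and then fill a matrix row by row. This is only a sketch: df1 is rebuilt from the question, and the column names are assumed to be the first x letters.

```r
# One-row data frame as in the question
df1 <- data.frame(A = 1, B = 2, C = 3, D = 4, E = 5, F = 6, G = 7, H = 8, I = 9)

x <- 2
v <- unlist(df1, use.names = FALSE)

# Extending the length pads the vector with NA
length(v) <- x * ceiling(length(v) / x)

out <- as.data.frame(matrix(v, ncol = x, byrow = TRUE,
                            dimnames = list(NULL, LETTERS[seq_len(x)])))
out
```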

Assigning rows of data.frame to another data.frame in R based on frequency of element's occurrence

I have a data.frame df
> df
V1 V2
1 a b
2 a e
3 a f
4 b c
5 b e
6 b f
7 c d
8 c g
9 c h
10 d g
11 d h
12 e f
13 f g
14 g h
I found the frequency of each element's occurrence considering column V1 and sorted the Freq column in ascending order
>dfFreq <- as.data.frame(table(df$V1))
Var1 Freq
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 1
7 g 1
>dfFreqSorted <- dfFreq[order(dfFreq$Freq),]
Var1 Freq
5 e 1
6 f 1
7 g 1
4 d 2
1 a 3
2 b 3
3 c 3
Now what I want to do is create a new data.frame from the original one such that each "Var1" item in "dfFreqSorted" is used as many times as its Freq, but only once per pass from the top of "dfFreqSorted" to the bottom.
Consider the first Var1 item, "e": the first matching row for "e" in the V1 column of df is (e,f), which would be the first row of the new data.frame.
I figured that this can be done using:
>subset(df, V1==dfFreqSorted$Var[1])[1,]
V1 V2
12 e f
So if I used a for loop over all the elements in the Var1 column of dfFreqSorted, applied the subset command above, and rbind-ed each returned row onto another data.frame, I would get something like this:
V1 V2
12 e f
13 f g
14 g h
10 d g
1 a b
4 b c
7 c d
This result shows each Var1 item once. I need the remaining rows as well: after the first pass over all Var1 items, the loop should go back to the beginning, consider every Var1 item whose frequency is greater than 1, and fetch the next unused row from df for that element. The remaining rows, appended to the same data.frame, should be:
11 d h
2 a e
5 b e
8 c g
3 a f
6 b f
9 c h
As you can see, all elements of Var1 are used once in the first pass, including those whose frequency is 1. In the second pass only those whose frequency is at least 2 are used, and in the third pass only those whose frequency is 3. Each time, the next unused row for that element is fetched from df.
In short, all the rows of df are arranged in a new data.frame such that elements are visited in ascending order of their frequencies, once per pass, for as many passes as their frequency allows.
I am not asking for the whole code, just a few guidelines on how I can achieve this. Thanks in advance.
Hello @akrun, I am a beginner, so this may be a beginner-level approach, but it solved my problem perfectly fine.
> a<-read.table("isnodes.txt")
> a
V1 V2
1 a b
2 a e
3 a f
4 b c
5 b e
6 b f
7 c d
8 c g
9 c h
10 d g
11 d h
12 e f
13 f g
14 g h
> aF<-as.data.frame(table(a$V1))
> aF
Var1 Freq
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 1
7 g 1
> aFsorted <- aF[order(aF$Freq),]
> aFsorted
Var1 Freq
5 e 1
6 f 1
7 g 1
4 d 2
1 a 3
2 b 3
3 c 3
> sortedEdgeList <- a[-c(1:nrow(a)),]
> sortedEdgeList
[1] V1 V2
<0 rows> (or 0-length row.names)
> aFsorted <- cbind(aFsorted, Used=0)
> aFsorted
Var1 Freq Used
5 e 1 0
6 f 1 0
7 g 1 0
4 d 2 0
1 a 3 0
2 b 3 0
3 c 3 0
> maxFreq <- max(aFsorted$Freq)
> maxFreq
[1] 3
> for(i in 1:maxFreq){
+ rows<-nrow(aFsorted)
+ for(j in 1:rows){
+ Var1Value<-aFsorted$Var[j]
+ Var1Edge<-a[match(aFsorted$Var1[j],a$V1),]
+ sortedEdgeList<-rbind(sortedEdgeList,Var1Edge)
+ a<-a[!(a$V1==Var1Edge$V1 & a$V2==Var1Edge$V2),]
+ aFsorted$Used[j]=aFsorted$Used[j]+1
+ }
+ aFsorted<-aFsorted[aFsorted$Used<aFsorted$Freq,]
+ }
> sortedEdgeList
V1 V2
12 e f
13 f g
14 g h
10 d g
5 a b
4 b c
7 c d
11 d h
2 a e
51 b e
8 c g
3 a f
6 b f
9 c h
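The same round-robin ordering can also be sketched without an explicit loop. The idea, which is an alternative technique and not the method used above, is to compute two helper vectors with ave(): how often each V1 value occurs (`freq`) and which occurrence each row is (`pass`), then sort by pass first and frequency second. Both helper names are illustrative.

```r
# Data from the question
df <- data.frame(
  V1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "e", "f", "g"),
  V2 = c("b", "e", "f", "c", "e", "f", "d", "g", "h", "g", "h", "f", "g", "h"),
  stringsAsFactors = FALSE
)

idx  <- seq_len(nrow(df))
freq <- ave(idx, df$V1, FUN = length)     # frequency of each row's V1 value
pass <- ave(idx, df$V1, FUN = seq_along)  # 1st, 2nd, 3rd occurrence of that value

# First all first occurrences (rarest V1 first), then all second ones, ...
result <- df[order(pass, freq, df$V1), ]
result
```

The third sort key, df$V1, only breaks ties between elements that share a frequency.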
I'm not sure this is what you want, but it might be close. It helps conceptually to keep the frequencies in the original data frame.
library("plyr")
set.seed(3)
df <- data.frame(V1 = sample(letters[1:10], 20, replace = TRUE),
V2 = sample(letters[1:10], 20, replace = TRUE),
stringsAsFactors = FALSE)
df$freqV1 <- NA_integer_
for (i in 1:nrow(df)) {
df$freqV1[i] <- length(grep(pattern = df$V1[i], x = df$V1))
}
df2 <- arrange(df, freqV1, V2) # you may want just arrange(df, freqV1)
which gives:
V1 V2 freqV1
1 h c 1
2 d a 2
3 d b 2
4 c c 2
5 c j 2
6 b c 3
7 g c 3
8 b f 3
9 g h 3
10 g h 3
11 b i 3
12 i a 4
13 i c 4
14 i d 4
15 i f 4
16 f b 5
17 f d 5
18 f d 5
19 f e 5
20 f f 5

Vertex Labels in igraph with R

I have an issue adding edge labels to a weighted graph with igraph in R.
The data frame of the graph is:
df <- read.table(text=
"From, To, Weight
A,B,1
B,C,2
B,F,3
C,D,5
B,F,4
C,D,6
D,E,7
E,B,8
E,B,9
E,C,10
E,F,11", sep=',',header=TRUE)
# From To Weight
# 1 A B 1
# 2 B C 2
# 3 B F 3
# 4 C D 5
# 5 B F 4
# 6 C D 6
# 7 D E 7
# 8 E B 8
# 9 E B 9
# 10 E C 10
# 11 E F 11
and I use :
g<-graph.data.frame(df,directed = TRUE)
plot(g)
to plot the following graph :
One can see that the labels on the edges from E to B (for example) are superimposed.
(The same problem appears for the edges C-D and B-F.)
I'd like to know how to separate these labels so that each
weight is shown on its own edge.
Try the qgraph package. qgraph builds on igraph and does a lot of the work for you in the background.
install.packages('qgraph')
library(qgraph)
qgraph(df,edge.labels=TRUE)
Hope this helps.

R - extract submatrix with column name

Is there any way to extract multiple columns from a matrix by column name?
For example, from the matrix below:
A B C D E
A 1 3 5 7 9
B 2 4 6 8 10
extract the submatrix with columns C, D and E, like:
C D E
A 5 7 9
B 6 8 10
thanks.
As long as the matrix has column names (returned by colnames(m)) you can use them to index the columns you'd like to extract.
m[, c("C", "D", "E")]
# C D E
# A 5 7 9
# B 6 8 10
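For completeness, here is a self-contained sketch that rebuilds the example matrix (its construction is assumed from the printed values) and also shows drop = FALSE, which keeps the result a matrix when only one column is selected:

```r
# Rebuild the 2 x 5 matrix from the question (filled column by column)
m <- matrix(1:10, nrow = 2,
            dimnames = list(c("A", "B"), c("A", "B", "C", "D", "E")))

# Several columns by name: the result is still a matrix
sub <- m[, c("C", "D", "E")]

# A single column would drop to a plain vector unless drop = FALSE is set
one <- m[, "C", drop = FALSE]

sub
one
```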
