Recoding values by row based on values of other columns


I have a data set containing the genetic information of two parents and 300+ offspring. I'm trying to change the row values of the offspring based on the values of the parents in that row such that:
P1 P2 o1 o2 o3
1 A T A T AT
2 C A CA A C
3 G C G G C
4 T C C TC CT
becomes:
P1 P2 o1 o2 o3
1 A T a b h
2 C A b b a
3 G C a a b
4 T C b b h
where 'a' in the offspring indicates that it's like P1, 'b' for P2, and 'h' for having both. I've split the parent columns from the offspring for ease (Parents and Test, respectively), but my loop doesn't work or changes the entire row to NA. I've just been trying to tackle recoding to 'a' and 'b' for now with the following code:
for (i in 1:nrow(Test)) {
if (Parents[i, 1] == "A") {
Test[Test[i, ] == "A"] <- "a"
} else
if (Parents[i, 2] =="A") {
Test[Test[i, ] == "A"] <-"b"
}
}
I'd appreciate any help, I'm desperately trying to avoid doing this by hand.

I wonder if your expected output is inconsistent with your rules; assuming that it is, try this:
dat
# P1 P2 o1 o2 o3
# 1 A T A T AT
# 2 C A CA A C
# 3 G C G G C
# 4 T C C TC CT
vecgrepl <- Vectorize(grepl)
dat[,3:5] <- lapply(dat[,3:5], function(Z)
sapply(paste0(+(vecgrepl(dat$P1, Z)), +(vecgrepl(dat$P2, Z))),
switch, "01"="b", "10"="a", "11"="h", "-")
)
dat
# P1 P2 o1 o2 o3
# 1 A T a b h
# 2 C A h b a
# 3 G C a a b
# 4 T C b h h
Breakdown:
grepl accepts a pattern of length 1 only, so we need to Vectorize it. There are other ways to do this with equivalent results.
vecgrepl(dat$P1, Z) should return (for each column Z) whether P1's letter is found in its value.
+(.) is a shortcut for converting FALSE/TRUE to 0/1, which builds the keys used by switch below; admittedly we could paste the logicals directly and use "FALSETRUE", "TRUEFALSE", "TRUETRUE" as keys, but I thought this might appear cleaner.
switch is an easy way to emulate multiple if/then conditionals, looking at the three combinations. The trailing "-" is a default value, if none of "01", "10", or "11" is seen (effectively "00" here). (This could also be emulated with dplyr::case_when or data.table::fcase with little adjustment.)
Because switch is also length-1 only, I vectorize it using sapply(..., switch, ...) where the second ... are arguments sent to switch. Equivalent to sapply(paste0(...), function(x) switch(x, "01"="b", ..)).
dat[,3:5] <- lapply(dat[,3:5], ..) only does this for three of the columns; I could also have used dat[,-(1:2)] to do all columns except the first two.
Data
dat <- structure(list(P1 = c("A", "C", "G", "T"), P2 = c("T", "A", "C", "C"), o1 = c("A", "CA", "G", "C"), o2 = c("T", "A", "G", "TC"), o3 = c("AT", "C", "C", "CT")), class = "data.frame", row.names = c("1", "2", "3", "4"))
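The same recoding can also be sketched in base R without switch, which may be easier to read. This is only an illustration under assumptions: recode_col is a hypothetical helper name, and fixed = TRUE is used on the assumption that the parent values are literal letters rather than regular expressions.

```r
# Same data as in the answer above
dat <- data.frame(P1 = c("A", "C", "G", "T"), P2 = c("T", "A", "C", "C"),
                  o1 = c("A", "CA", "G", "C"), o2 = c("T", "A", "G", "TC"),
                  o3 = c("AT", "C", "C", "CT"), stringsAsFactors = FALSE)

# recode_col is a hypothetical helper: z is one offspring column,
# p1 and p2 are the parent columns (all the same length)
recode_col <- function(z, p1, p2) {
  # row-wise "is the parent's letter contained in the offspring value?"
  has1 <- mapply(grepl, p1, z, MoreArgs = list(fixed = TRUE))
  has2 <- mapply(grepl, p2, z, MoreArgs = list(fixed = TRUE))
  # "h" = both parents, "a" = P1 only, "b" = P2 only, "-" = neither
  unname(ifelse(has1 & has2, "h",
         ifelse(has1, "a",
         ifelse(has2, "b", "-"))))
}

# apply to every column except the two parent columns
dat[-(1:2)] <- lapply(dat[-(1:2)], recode_col, p1 = dat$P1, p2 = dat$P2)
dat
```

For the sample data this gives the same codes as the switch version (o1 is a, h, a, b).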

Related

How to Create Column with Binary Variable that Indicates if Another Column has Certain Factors?

I'm looking to create a binary variable column that simply indicates whether or not an existing column is equal to "R" or "P". If it is, I would like the new column to read "1", and if the observation is blank I would like it to read "0".
I would like this:
Person Play Key
A 1 R
B 2 P
C 3
D 4 R
E 5
To become this:
Person Play Key Indicator
A 1 R 1
B 2 P 1
C 3 0
D 4 R 1
E 5 0
I have tried:
df$Indicator <- (df$Key == 'R' | 'P')
But that doesn't work. I get the error: Error in df$Indicator <- (df$Key == 'R' | 'P') : operations are possible only for numeric, logical or complex types
Besides I'm not sure that would provide the binary indicator I'm looking for.

Try any of these approaches. You were close with df$Indicator <- (df$Key == 'R' | 'P'), but the proper form is df$Indicator <- df$Key == 'R' | df$Key == 'P': each side of | has to be a complete comparison. That produces TRUE/FALSE values, so you can use as.numeric() to turn them into 0/1. Here is the code:
#Code 1
df$Indicator <- as.numeric(df$Key %in% c('R','P'))
#Code 2
df$Indicator <- as.numeric(df$Key == 'R' | df$Key== 'P')
Output:
Person Play Key Indicator
1 A 1 R 1
2 B 2 P 1
3 C 3 0
4 D 4 R 1
5 E 5 0
Some data used:
#Data
df <- structure(list(Person = c("A", "B", "C", "D", "E"), Play = 1:5,
Key = c("R", "P", "", "R", "")), row.names = c(NA, -5L), class = "data.frame")
Another option would be (all credit to @ChuckP):
#Code 3
df$Indicator <- ifelse(df$Key == 'R' | df$Key == 'P', 1, 0)
Which will produce the same output.
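One difference between the approaches is worth checking if the real data can contain missing values. The sketch below uses a made-up vector key (not the asker's data) to show that == propagates NA, while %in% treats NA as "not a match".

```r
# key is a hypothetical example vector with one blank and one missing value
key <- c("R", "P", "", NA)

by_equals <- as.numeric(key == "R" | key == "P")  # NA | NA stays NA
by_in     <- as.numeric(key %in% c("R", "P"))     # NA is simply not in the set

by_equals  # 1 1 0 NA
by_in      # 1 1 0 0
```

If blanks in the real data are stored as NA rather than "", the %in% form is probably the one that gives the 0 the question asks for.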

expl <- data.frame(Person = LETTERS[1:5], Play = 1:5, Key = c("R", "R"," ", "P", " "))
expl$Indicator <- expl$Key == 'R' | expl$Key =='P'
print(expl)
expl$Indicator2 <- as.numeric(expl$Key == 'R' | expl$Key =='P')
print(expl)

looping not working in R

I am trying to write a function to solve a small problem. I have two lists, and each list comprises n samples. Each sample has a variable number of bacterial identifiers (letters in the example; in my real problem, identifiers like OTU1-OTUn; in both cases they are of type "character"). One list comprises samples from diet, and the other list samples from gut contents. For each sample in the gut list, I want to know how many bacteria from diet are in the gut and how many bacteria in the gut do not come from diet. This was easily done with phyloseq, where diet and gut are both phyloseq objects with n samples each:
Bacteria_from_diet <- length(intersect(taxa_names(gut), taxa_names(diet)))
Bacteria_not_diet <- length(taxa_names(diet)) - Bacteria_from_diet
However, this summarizes the result over all n samples of gut and diet, as if I had collapsed the data across samples, and I need some measure of variation.
I have tried the following code in R:
diet<-list(DL1=c("A","B","C"),DL2=c("A","C","D"),DL3=c("B","D","E"),DL4=c("B","D","E"))
gut<-list(DL5=c("A","F","G"),DL6=c("B","F","H"),DL7=c("D","H","J"),DL8=c("A","G","F"))
gut_vs_diet <- function(a,b) ## a is diet and b is gut
{
xx<-10
gut = numeric(xx)
diet = numeric(xx)
all<-unlist(lapply(b,length)) ### get the number of elements of each element of list b
for(i in seq_along(b)){ #### loop over b (gut) to get:
diet<-length(intersect(b[[i]],a[[i]])) ### the number of elements of diet are present in gut
gut = all-diet ## the number of elements of gut that not come from diet
}
gutvsdiet = data.frame(all,gut,diet)
return(gutvsdiet)
}
When running the function I obtain this result, which is not correct:
gut_vs_diet(diet,gut)
all gut diet
DL5 3 3 0
DL6 3 3 0
DL7 3 3 0
DL8 3 3 0
In some cases I was able to get some value in the diet column, but the function seemed to choose the diet sample at random.
I do not know where the mistake could be. In any case, I would like to do this iteratively, that is, get the values for each gut sample compared with all diet samples. Alternatively, I could run replicate(10, gut_vs_diet(sample(diet), sample(gut))) to get random comparisons and avoid some kind of bias.
Thank you very much for your help
Manuel

Here is my version of your code:
diet <- list(DL1=c("A","B","C"), DL2=c("A","C","D"), DL3=c("B","D","E"), DL4=c("B","D","E"))
gut <- list(DL5=c("A","F","G"), DL6=c("B","F","H"), DL7=c("D","H","J"), DL8=c("A","G","F"))
gut_vs_diet <- function(a, b) ## a is diet and b is gut
{
all <- lengths(b) ### get the number of elements of each element of list b
diet <- mapply(function(ai, bi) length(intersect(ai, bi)), a, b)
# diet <- lengths(mapply(intersect, a, b)) ## a variant
data.frame(all, gut=all-diet, diet)
}
gut_vs_diet(diet,gut)
# > gut_vs_diet(diet,gut)
# all gut diet
# DL5 3 2 1
# DL6 3 3 0
# DL7 3 2 1
# DL8 3 3 0

As @jogo suggested in a comment, you can use mapply instead of your for-loop:
FOO <- function(x, y){
all <- lengths(y)
diet <- mapply(function(a, b){
length(intersect(b, a))
}, x, y)
gut <- all - diet
return(data.frame(all, gut, diet))
}
> FOO(diet, gut)
all gut diet
DL5 3 2 1
DL6 3 3 0
DL7 3 2 1
DL8 3 3 0

Just for completeness, here is how it would look with a for loop. Note that you need to subtract per element (all[[i]] - diet) and accumulate the result rows inside the loop; otherwise only the result of the last iteration survives, which here is data.frame(all = c(3,3,3,3), gut = 3, diet = 0).
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"), DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"), DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))
gut_vs_diet <- function(a, b)
{
all <- lengths(b)
gutvsdiet <- NULL
for (i in seq_along(b)) {
diet <- length(intersect(b[[i]], a[[i]]))
gut <- all[[i]] - diet
resultForThisListElement <- c(all[[i]], gut, diet)
gutvsdiet <- rbind(gutvsdiet, resultForThisListElement)
}
colnames(gutvsdiet) <- c("all", "gut", "diet")
return(gutvsdiet)
}
gut_vs_diet(diet, gut)
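The question also asks about comparing each gut sample with all diet samples, which the pairwise versions above do not cover. A minimal sketch of one way to do that, assuming the same diet and gut lists; the names shared, all_diet and from_any_diet are made up for illustration:

```r
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"),
             DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut  <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"),
             DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))

# Matrix of shared identifiers: one row per gut sample, one column per diet sample
shared <- sapply(diet, function(d) {
  sapply(gut, function(g) length(intersect(g, d)))
})
shared

# Alternatively, pool all diet samples and count, per gut sample,
# how many of its identifiers occur in any diet sample
all_diet <- unique(unlist(diet))
from_any_diet <- sapply(gut, function(g) length(intersect(g, all_diet)))
data.frame(all = lengths(gut),
           gut = lengths(gut) - from_any_diet,
           diet = from_any_diet)
```

The rows of shared can then be summarized (for example with apply(shared, 1, mean) or apply(shared, 1, sd)) to get a measure of variation per gut sample.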

How to create table with both forward and backward information for edges with igraph

I have created a table with igraph listing the data as follows :
where a,b,c,d,e are the edges.
a and b are mutual edges,
with the weight values of 1 for a->b, 2 for b->a (There is no self-loop).
By the way I used the following code to create the above table:
library(igraph)
library(dplyr)
g <- data.frame(from = c("a", "b", "c", "d", "e"),
to = c("b", "a", "a", "b", "a"), weight = c(1:5)) %>%
igraph::graph_from_data_frame()
Now I hope to create another table listing both the forward and backward information between the edges, as well as the weight values like:
Does anyone know how to do this with igraph?

First you could get a list of the pairs of nodes that share an edge, regardless of direction:
simplified <- as.undirected(g, mode="collapse")
pairs <- ends(simplified, E(simplified))
Then we can write a helper function that returns the weight of the edge from one node to another, or NA if that edge doesn't exist:
get_edge_weight<- Vectorize(function(a, b) {
e <- E(g)[a %->% b]
if(length(e)==1) {
e$weight
} else {
NA
}
})
Then you can build your desired data.frame with
data.frame(from=pairs[,1], to=pairs[,2],
fwd=get_edge_weight(pairs[,1], pairs[,2]),
back=get_edge_weight(pairs[,2], pairs[,1])
)
# from to fwd back
# b a b 1 2
# c a c NA 3
# d b d NA 4
# e a e NA 5
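If the edge list is still available as a data frame, the reverse weights can also be looked up without igraph. This is a different technique from the answer above, sketched under the assumptions that each directed edge occurs at most once and that node names contain no spaces (the lookup key is built with paste). The names edges, key, rkey and back are made up for illustration.

```r
# Same edge list as in the question, kept as a plain data frame
edges <- data.frame(from = c("a", "b", "c", "d", "e"),
                    to = c("b", "a", "a", "b", "a"),
                    weight = 1:5, stringsAsFactors = FALSE)

key  <- paste(edges$from, edges$to)   # "a b", "b a", ...
rkey <- paste(edges$to, edges$from)   # the same edge in the opposite direction

# For each edge, find the row holding the reverse edge (NA if there is none)
edges$back <- edges$weight[match(rkey, key)]
edges
```

Unlike the igraph version, this keeps one row per directed edge, so the mutual pair a/b appears twice (once in each direction).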

How to use an if then statement in R to create a new variable based on another

I have a data frame with Column1, which can take the value of any letter of the alphabet. I want to create a second column that spells out the number corresponding to that letter. I am trying to do this with an if then statement... But keep getting an error. Sorry this is a simple question but I have tried the R for dummies website http://www.dummies.com/how-to/content/how-to-use-if-statements-in-r.html with no luck!
x$Column2 <- NULL
if (x$Column1 == "A") then[x$Column2 <- "One"]

The best way to do this is create a reference table:
>Reference = data.frame(Number = c("One", "Two", "Three", "Four"), Letter = c("A", "B", "C", "D"))
> Reference
Number Letter
1 One A
2 Two B
3 Three C
4 Four D
> Data = data.frame(Letter = c("B", "B", "C", "A", "D"))
> Data
Letter
1 B
2 B
3 C
4 A
5 D
Then you can find the indices:
> Indices = sapply(Data$Letter, function(x) which(x == Reference$Letter))
> Indices
[1] 2 2 3 1 4
And use them to create the column
> Data$Number = Reference[Indices,]$Number
> Data
Letter Number
1 B Two
2 B Two
3 C Three
4 A One
5 D Four
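The lookup step above can probably be shortened with match() or a named vector, both of which are vectorized. A minimal sketch using the same letters and numbers; lookup is a hypothetical name:

```r
Reference <- data.frame(Number = c("One", "Two", "Three", "Four"),
                        Letter = c("A", "B", "C", "D"),
                        stringsAsFactors = FALSE)
Data <- data.frame(Letter = c("B", "B", "C", "A", "D"),
                   stringsAsFactors = FALSE)

# Option 1: match() returns the position of each letter in the reference table
Data$Number <- Reference$Number[match(Data$Letter, Reference$Letter)]

# Option 2: a named character vector used as a lookup table
lookup <- c(A = "One", B = "Two", C = "Three", D = "Four")
Data$Number2 <- unname(lookup[Data$Letter])

Data
```

Letters that are missing from the reference table come out as NA in both variants, which makes gaps easy to spot.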

As I understand it, what you want to do here is similar to creating a dummy variable. Try
> x$dummy <- as.numeric(x$Column1 != "A")
and you should get 0 for all A's and 1 for other values.
Look at Generate a dummy-variable for further information.

trouble understanding count.multiple and simplify in igraph

I am working with network data and have come across an odd (or at least I didn't expect it) behavior with count.multiple in the igraph package in R.
library(igraph)
library(plyr)
df <- data.frame( sender = c( "a", "a", "a", "b", "b", "c","c","d" ),
receiver = c( "b", "b", "b", "c", "a", "d", "d", "a" ) )
What I want is to count up all of the edges and use the multiples as a weight.
when I do ddply(df, .(sender, receiver), "nrow") my results are:
sender receiver nrow
1 a b 3
2 b a 1
3 b c 1
4 c d 2
5 d a 1
Which is what I would expect.
However, I cannot reproduce this using igraph's count.multiple, which is what I expected would do this within igraph:
df.graph <- graph.edgelist(as.matrix(df))
E(df.graph)$weight <- count.multiple(df.graph)
E(df.graph)$weight produces:
3 3 3 1 1 2 2 1
I then used the simplify command:
df.graph <- simplify(df.graph)
which produces
9 1 1 4 1
I get what is going on here, simplify is just adding the weights, but I don't understand why/when this would be used as opposed to what ddply is doing..?
Any thoughts?
Thanks!

The default behaviour of simplify is to add the weights of multiple edges.
To avoid double counting, you can set the initial weights to 1
g <- graph.edgelist(as.matrix(df))
E(g)$weight <- 1
g <- simplify( g )
E(g)$weight
or change the way they are aggregated.
g <- graph.edgelist(as.matrix(df))
E(g)$weight <- count.multiple(g)
g <- simplify( g, edge.attr.comb = list(weight=max, name="concat", "ignore") )
E(g)$weight
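The arithmetic behind the 9 can be reproduced without igraph. The sketch below only imitates the two steps on the data frame (it does not call count.multiple or simplify): every duplicate row first gets its group count as a weight, and the duplicates are then collapsed either by summing or by taking the maximum. The names summed and maxed are made up for illustration.

```r
df <- data.frame(sender   = c("a", "a", "a", "b", "b", "c", "c", "d"),
                 receiver = c("b", "b", "b", "c", "a", "d", "d", "a"),
                 stringsAsFactors = FALSE)

# Imitates count.multiple: each row gets the number of rows with the same pair
df$weight <- ave(rep(1, nrow(df)), df$sender, df$receiver, FUN = sum)
df$weight  # 3 3 3 1 1 2 2 1

# Collapsing duplicates by summing counts each multiple edge n * n times
summed <- aggregate(weight ~ sender + receiver, data = df, FUN = sum)

# Collapsing with max keeps the plain count, matching the ddply result
maxed <- aggregate(weight ~ sender + receiver, data = df, FUN = max)

summed
maxed
```

Here a -> b has weight 3 on each of its three copies, so summing gives 9, whereas max gives 3.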
