How to create a vector with for loops and ifelse in R

I'm having a problem with nested for loops and ifelse statements. This is my dataframe abund:
Species Total C1 C2 C3 C4
1 Blue 223 73 30 70 50
2 Black 221 17 50 56 98
3 Yellow 227 29 99 74 25
4 Green 236 41 97 68 30
5 Red 224 82 55 21 66
6 Orange 284 69 48 73 94
7 Black 154 9 63 20 62
8 Red 171 70 58 13 30
9 Blue 177 57 27 8 85
10 Orange 197 88 61 18 30
11 Orange 112 60 8 31 13
I would like to add together some of abund’s columns but only if they match the correct species I’ve specified in the vector colors.
colors <- c("Black", "Red", "Blue")
So, if the Species in abund matches a species in colors, then add columns C2 through C4 together in a new vector minus. If the Species in abund does not match any species in colors, then add a 0 to the new vector minus.
I'm having trouble with my code and hope it's just a small matter of defining a range, but I'm not sure. This is my code so far:
# Use for loop to create vector of sums for select species or 0 for species not selected
for( i in abund$Species)
{
for( j in colors)
{
minus <- ifelse(i == j, sum(abund[abund$Species == i,
"C2"]:abund[abund$Species == i, "C4"]), 0)
}
}
Which returns this: There were 12 warnings (use warnings() to see them)
and this "vector": minus [1] 0
This is my target:
minus
[1] 150 204 0 0 142 0 145 101 120 0 0
Thank you for your time and help with this.

This is probably better done without any loops.
# Create the vector
minus <- rep(0, nrow(abund))
# Identify the "colors" cases
inColors <- abund[["Species"]] %in% colors
# Set the values
minus[inColors] <- rowSums(abund[inColors, c("C2","C3","C4")])
Also, for what it is worth, there are quite a few problems with your original code. First, your first for loop isn't doing what you think. In each round, the value of i is set to the next value in abund$Species, so first it is Blue, then Black, then Yellow, etc. As a result, when you index using abund[abund$Species == i, ], you may get multiple rows back (e.g. Blue will give you rows 1 and 9, since both of those rows have Species == "Blue").
Second, when you write abund[abund$Species == i, "C2"]:abund[abund$Species == i, "C4"], you are not indexing the columns C2, C3 and C4; you are making a sequence starting at the value in C2 and ending at the value in C4. For example, when i == "Yellow" it returns 99:25, or 99, 98, 97, ... , 26, 25. The reason you were getting those warnings was a combination of this problem and the previous one. For example, when i == "Blue", you were trying to make a sequence starting at both 30 and 27 and ending at both 50 and 85. The warning was saying that it was just using the first number in your start and finish and giving you 30:50.
Finally, you were constantly overwriting your value of minus rather than adding to it. You need to first create minus as above and then index into it for the assignment, like this: minus[i] <- newValue (with i running over row numbers rather than species names).
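If a loop is still preferred, a minimal sketch of a corrected version might look like the following. It assumes the abund data from the question (only the columns needed here are re-created) and loops over row numbers rather than species names, so each iteration touches exactly one row.

```r
# Re-create the relevant columns of abund from the question
abund <- data.frame(
  Species = c("Blue", "Black", "Yellow", "Green", "Red", "Orange",
              "Black", "Red", "Blue", "Orange", "Orange"),
  C2 = c(30, 50, 99, 97, 55, 48, 63, 58, 27, 61, 8),
  C3 = c(70, 56, 74, 68, 21, 73, 20, 13, 8, 18, 31),
  C4 = c(50, 98, 25, 30, 66, 94, 62, 30, 85, 30, 13)
)
colors <- c("Black", "Red", "Blue")

# Pre-allocate the result so each iteration fills one slot
minus <- numeric(nrow(abund))

for (i in seq_len(nrow(abund))) {
  # %in% replaces the inner loop over colors
  if (abund$Species[i] %in% colors) {
    # Select the columns by name; ":" would build a numeric sequence instead
    minus[i] <- sum(abund[i, c("C2", "C3", "C4")])
  }
}

minus
# [1] 150 204   0   0 142   0 145 101 120   0   0
```

The vectorized version above remains the more idiomatic choice; the loop is shown only to make the three fixes concrete.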

Note that ifelse is vectorized so you usually don't need any for loops when using it.
I like Barker's answer best, but if you wanted to do this with ifelse this is the way:
abund$minus = with(abund, ifelse(
Species %in% colors, # if the species matches
C2 + C3 + C4, # add the columns
0 # otherwise 0
))
Even though this is just one line and Barker's is 3, on large data it will be slightly more efficient to avoid ifelse.
However, ifelse statements can be nested and are often easier to work with when conditions get complicated - so there are definitely good times to use them. On small to medium sized data the speed difference will be negligible so just use whichever you think of first.
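As a rough illustration of nesting, the sketch below adds a second, purely hypothetical rule (Yellow rows keep only their C2 value) that is not part of the original question. The four values per vector are taken from the Blue, Yellow, Red and Green rows of abund.

```r
# Small vectors copied from four rows of abund
species <- c("Blue", "Yellow", "Red", "Green")
C2 <- c(30, 99, 55, 97)
C3 <- c(70, 74, 21, 68)
C4 <- c(50, 25, 66, 30)
colors <- c("Black", "Red", "Blue")

# Outer test: species listed in colors get the full sum.
# Inner test (hypothetical rule for illustration): Yellow keeps C2 only.
# Everything else gets 0.
minus <- ifelse(species %in% colors,
                C2 + C3 + C4,
                ifelse(species == "Yellow", C2, 0))

minus
# [1] 150  99 142   0
```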

# Create a column called minus with the length of the number of existing rows.
# The default value is zero.
abund$minus <- integer(nrow(abund))
# Perform sum of C2 to C4 only in those rows where Species is in the colors vector
abund$minus[abund$Species %in% colors] <- rowSums(abund[abund$Species %in% colors, c("C2","C3","C4")])

Related

How can I extract a part of a vector to another vector (including positions)

I have a vector with different values (positive and negative), and I want to select only the 10 lowest odd values and the 10 lowest even values. Help me, please!
This is a way to do it using base R.
# vector with odd and even numbers
x <- sample(-100:100, 30)
The modulus operator in R helps to get the job done. You can use it this way:
c(
# Extract the 10 lowest even numbers
head(sort(x[x %% 2 == 0]), 10),
# Extract the 10 lowest odd numbers
head(sort(x[x %% 2 == 1]), 10)
)
Given vector v as your input vector, you can obtain the desired output (including positions) via the following code:
names(v) <- seq_along(v)
# lowest 10 odd numbers
low_odd <- sort(v[v%%2==1])[1:10]
# positions of those odd numbers in v
low_odd_pos <- as.numeric(names(low_odd))
# lowest 10 even numbers
low_even <- sort(v[v%%2==0])[1:10]
# positions of those even numbers in v
low_even_pos <- as.numeric(names(low_even))
Example
set.seed(1)
v <- sample(-50:50)
then
> low_odd
43 101 39 95 85 72 7 73 45 29
-49 -47 -45 -43 -41 -39 -37 -35 -33 -31
> low_odd_pos
[1] 43 101 39 95 85 72 7 73 45 29
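A possible alternative that avoids storing positions as names is to combine which() and order(). This is only a sketch on a small made-up vector, with n = 3 instead of 10 so the result is easy to check by hand; the variable names are illustrative.

```r
v <- c(4, -7, 10, -3, 8, -12, 5, 1)
n <- 3  # number of lowest values wanted from each group

# Positions of odd values, reordered so the smallest value comes first
odd_pos <- which(v %% 2 == 1)
low_odd_pos <- head(odd_pos[order(v[odd_pos])], n)
low_odd <- v[low_odd_pos]

# Same idea for even values
even_pos <- which(v %% 2 == 0)
low_even_pos <- head(even_pos[order(v[even_pos])], n)
low_even <- v[low_even_pos]

low_odd_pos   # 2 4 8
low_odd       # -7 -3 1
low_even_pos  # 6 1 5
low_even      # -12 4 8
```

Note that in R, %% with a positive divisor never returns a negative result, so negative odd numbers such as -7 still satisfy v %% 2 == 1.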

What can I do to find and remove semi-duplicate rows in a matrix?

Assume I have this matrix
set.seed(123)
x <- matrix(rnorm(410),205,2)
x[8,] <- c(0.13152348, -0.05235148) #similar to x[5,]
x[16,] <- c(1.21846582, 1.695452178) #similar to x[11,]
The values are very similar to the rows specified above, and in the context of the whole data, they are semi-duplicates. What could I do to find and remove them? My original data is an array that contains many such matrices, but the position of the semi duplicates is the same across all matrices.
I know of agrep but the function operates on vectors as far as I understand.
You will need to set a threshold, but you can just compute the distance between each pair of rows using dist and find the points that are sufficiently close together. Of course, each point is near itself, so you need to ignore the diagonal of the distance matrix.
DM = as.matrix(dist(x))
diag(DM) = 1 ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
row col
8 8 5
5 5 8
16 16 11
11 11 16
48 48 20
20 20 48
168 168 71
91 91 73
73 73 91
71 71 168
This finds the "close" points that you created and a few others that got generated at random.
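To go from finding to removing, one option is to keep the first row of every close pair and drop the later one. The sketch below assumes the same 0.025 threshold and uses upper.tri() so each pair is reported only once; the names close, drop and x_clean are illustrative.

```r
set.seed(123)
x <- matrix(rnorm(410), 205, 2)
x[8, ]  <- c(0.13152348, -0.05235148)  # similar to x[5, ]
x[16, ] <- c(1.21846582, 1.695452178)  # similar to x[11, ]

threshold <- 0.025
DM <- as.matrix(dist(x))

# Restrict to the upper triangle: each pair appears once, with row < col
close <- which(DM < threshold & upper.tri(DM), arr.ind = TRUE)

# Drop the later row of every close pair
drop <- unique(close[, "col"])

# setdiff() keeps this safe even when nothing needs to be dropped
keep <- setdiff(seq_len(nrow(x)), drop)
x_clean <- x[keep, , drop = FALSE]

nrow(x) - nrow(x_clean)  # number of rows removed
```

Since the positions of the semi-duplicates are said to be the same across all matrices, the keep index computed from one matrix could presumably be reused to subset the others.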

Summing values after every third position in data frame in R

I am new to R. I have a data frame like following
df <- data.frame(
  Id = c("Entry_1","Entry_1","Entry_1","Entry_2","Entry_2","Entry_2","Entry_3","Entry_4","Entry_4","Entry_4","Entry_4"),
  Start = c(20,20,20,37,37,37,68,10,10,10,10),
  End = c(50,50,50,78,78,78,200,94,94,94,94),
  Pos = c(14,34,21,50,18,70,101,35,2,56,67),
  Hits = c(12,34,17,89,45,87,1,5,6,3,26)
)
Id Start End Pos Hits
Entry_1 20 50 14 12
Entry_1 20 50 34 34
Entry_1 20 50 21 17
Entry_2 37 78 50 89
Entry_2 37 78 18 45
Entry_2 37 78 70 87
Entry_3 68 200 101 1
Entry_4 10 94 35 5
Entry_4 10 94 2 6
Entry_4 10 94 56 3
Entry_4 10 94 67 26
For each entry I would like to iterate over the data.frame in 3 different modes. For example, for Entry_1, mode_1 = seq(20,50,3), mode_2 = seq(21,50,3) and mode_3 = seq(22,50,3). I would like to sum all the values in column "Hits" whose corresponding values in column "Pos" fall in mode_1, mode_2 or mode_3, and generate a data.frame like the following:
Id Mode_1 Mode_2 Mode_3
Entry_1 0 17 34
Entry_2 87 89 0
Entry_3 1 0 0
Entry_4 26 8 0
I tried the following code:
mode_1=0
mode_2=0
mode_3=0
mode_1_sum=0
mode_2_sum=0
mode_3_sum=0
for(i in dim(df)[1])
{
if(df$Pos[i] %in% seq(df$Start[i],df$End[i],3))
{
mode_1_sum=mode_1_sum+df$Hits[i]
print(mode_1_sum)
}
mode_1=mode_1_sum+counts
print(mode_1)
ifelse(df$Pos[i] %in% seq(df$Start[i]+1,df$End[i],3))
{
mode_2_sum=mode_2_sum+df$Hits[i]
print(mode_2_sum)
}
mode_2_sum=mode_2_sum+counts
print(mode_2)
ifelse(df$Pos[i] %in% seq(df$Start[i]+2,df$End[i],3))
{
mode_3_sum=mode_3_sum+df$Hits[i]
print(mode_3_sum)
}
mode_3_sum=mode_3_sum+counts
print(mode_3_sum)
}
But the above code only prints 26. Can any one guide me how to generate my desired output, please. I can provide much more details if needed. Thanks in advance.
It's not an elegant solution, but it works.
m <- 3 # Number of modes you want
foo <- ((df$Pos - df$Start)%%m + 1) * (df$Start <= df$Pos) * (df$End >= df$Pos)
tab <- matrix(0,nrow(df),m)
for(i in 1:m) tab[foo==i,i] <- df$Hits[foo==i]
aggregate(tab,list(df$Id),FUN=sum)
# Group.1 V1 V2 V3
# 1 Entry_1 0 17 34
# 2 Entry_2 87 89 0
# 3 Entry_3 1 0 0
# 4 Entry_4 26 8 0
-- EXPLANATION --
First, we find the entries of df$Pos that lie between df$Start and df$End. The bounds are inclusive, because seq(Start, End, 3) contains Start itself, so a Pos equal to Start belongs to mode 1. In the multiplication these comparisons count as 1 if TRUE and 0 if FALSE. Next, we take the difference between df$Pos and df$Start, take it mod 3 (which gives a vector of 0s, 1s and 2s), and then add 1 to get the right mode. We multiply these things together, so that the values that fall within the interval retain the right mode, and the values that fall outside the interval become 0.
Next, we create a matrix of zeros that will contain the values. Then, we use a for loop to fill in the matrix. Finally, we aggregate the matrix by Id.
I tried looking for a quicker solution, but the main problem I cannot work around is the varying intervals for each row.
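A loop-free variant of the same idea could use xtabs() to do the filling and the aggregation in one step. This is a sketch under the assumption that both interval bounds are inclusive; in_range, mode, hits_in and res are illustrative names, not part of the original answer.

```r
df <- data.frame(
  Id = c("Entry_1","Entry_1","Entry_1","Entry_2","Entry_2","Entry_2",
         "Entry_3","Entry_4","Entry_4","Entry_4","Entry_4"),
  Start = c(20,20,20,37,37,37,68,10,10,10,10),
  End = c(50,50,50,78,78,78,200,94,94,94,94),
  Pos = c(14,34,21,50,18,70,101,35,2,56,67),
  Hits = c(12,34,17,89,45,87,1,5,6,3,26)
)

# TRUE where Pos lies inside [Start, End]
in_range <- df$Pos >= df$Start & df$Pos <= df$End

# Offset from Start modulo 3 decides the mode (1, 2 or 3)
mode <- factor((df$Pos - df$Start) %% 3 + 1,
               levels = 1:3,
               labels = c("Mode_1", "Mode_2", "Mode_3"))

# Rows outside the interval contribute 0 hits
hits_in <- ifelse(in_range, df$Hits, 0)

# Sum the hits for every Id/mode combination
res <- xtabs(hits_in ~ df$Id + mode)
res
```

Setting out-of-range hits to 0 rather than dropping those rows keeps every Id in the result, even one whose positions all fall outside its interval.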

R : how to Detect Pattern in Matrix By Row

I have a big matrix with 4 columns, containing normalized values (by column, mean ~ 0 and standard deviation = 1)
I would like to see if there is a pattern in the matrix, and if yes I would like to cluster rows by pattern, by pattern I mean values in a given row example
for row N
if value in column 1 < column 2 < column 3 < column 4 then it is let's say a pattern 1
Basically there is 4^4 = 256 possible patterns (in theory)
Is there a way in R to do this ?
Thanks in advance
Rad
Yes. (Although the number of distinct permutations is only 24 = 4*3*2. After one value is chosen, there are only three possible second values, and after the second is specified there are only two more orderings left.) The order function applied to each row should give the desired 1,2,3, 4 permutations:
mtx <- matrix(rnorm(10000), ncol=4)
res <- apply(mtx, 1, function(x) paste( order(x), collapse=".") )
table(res)
res
1.2.3.4 1.2.4.3 1.3.2.4 1.3.4.2 1.4.2.3 1.4.3.2
98 112 95 120 114 118
2.1.3.4 2.1.4.3 2.3.1.4 2.3.4.1 2.4.1.3 2.4.3.1
101 114 105 102 104 122
3.1.2.4 3.1.4.2 3.2.1.4 3.2.4.1 3.4.1.2 3.4.2.1
105 82 107 90 97 86
4.1.2.3 4.1.3.2 4.2.1.3 4.2.3.1 4.3.1.2 4.3.2.1
99 93 100 108 118 110
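To see how a single label is built, here is a small sketch on one made-up row. order() returns the column indices from the smallest value to the largest, so the label reads as "which column is lowest, next lowest, ..." rather than as the rank of each column.

```r
# One hypothetical row with 4 values
x <- c(0.3, -1.2, 2.0, 0.1)

# Column 2 is the smallest, then column 4, then 1, then 3
order(x)
# [1] 2 4 1 3

pattern <- paste(order(x), collapse = ".")
pattern
# [1] "2.4.1.3"

# Number of possible orderings of 4 distinct values
n_patterns <- factorial(length(x))
n_patterns
# [1] 24
```

Under this reading, the pattern column 1 < column 2 < column 3 < column 4 from the question corresponds to the label 1.2.3.4.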

How can I get column data to be added based on a group designation using R?

The data set that I'm working with is similar to the one below (although the example is much smaller; the data I'm working with has tens of thousands of rows) and I haven't been able to figure out how to get R to add up column data based on the group number. Essentially I want to be able to get the number of green(s), blue(s), and red(s) added up for all of group 81 and 66 separately and then be able to use that information to calculate percentages.
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), sep = " ", header = TRUE)
I've spent a good deal of time trying to figure out how to use some of the functions on my own hoping I would stumble across a proper way to do it, but since I'm such a new basic user I feel like I have hit a wall that I cannot progress past without help.
One way is via aggregate, using the dat object from the question:
aggregate(. ~ Group, data=dat, FUN=sum)
# Group Green Blue Red Total
# 1 66 46 48 13 107
# 2 81 71 30 33 134
Both of the answers above are perfect examples of how to address this type of problem. Two other options exist within reshape and plyr
library(reshape)
cast(melt(dat, "Group"), Group ~ ..., sum)
library(plyr)
ddply(dat, "Group", function(x) colSums(x[, -1]))
I would suggest that @Joshua's answer is neater, but two functions you should learn are apply and tapply. If a is your data set, then:
## apply calculates the sum of each row
> total = apply(a[,2:4], 1, sum)
## tapply calculates the sum based on each group
> tapply(total, a$Group, sum)
66 81
107 134
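The question also asks for percentages. One way to get them, sketched here on the example data, is to divide the per-group colour sums by the per-group total; sums and pct are illustrative names.

```r
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(text = txt, header = TRUE)

# Per-group sums of every column
sums <- aggregate(. ~ Group, data = dat, FUN = sum)

# Share of each colour within its group, in percent
cols <- c("Green", "Blue", "Red")
pct <- sums
pct[cols] <- 100 * sums[cols] / sums$Total

pct
```

For group 66, for instance, Green comes out as 100 * 46 / 107, roughly 43 percent.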
