This is my data frame:
a <- data.frame(x1 = 1:3, x2 = 0, GF = c("Pelagic", "Demersal", "Cephalopod"), Pelagic = 6, Demersal = 7, Cephalopod = 8)
I have a character vector like this:
GF_list <- c("Pelagic", "Demersal", "Cephalopod")
I want to assign to the x2 column the value from the column named by the GF of that row. So I do this:
for (i in 1 : nrow(a)) {
  for (j in 1 : length(GF_list)) {
    if (a$GF[i] == GF_list[j]) {
      a$x2[i] <- a[i, (ncol(a) + (- length(GF_list) + j))]
    }
  }
}
But it takes a very long time (I have a large data frame).
Is there a faster way to do this assignment? I am thinking of an approach that eliminates the first loop: "for (i in 1 : nrow(a))"
Thank you
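So you want to select a different column from each row? You can do complicated extractions from a data.frame by indexing with a two-column numeric matrix. Here's how it might work:
a$x2 <- as.numeric(a[cbind(1:nrow(a), match(a$GF, names(a)))])
The matrix has one column of row numbers and one of column numbers. We use match() to look up, for each row, the position of the column named in GF. Because a also contains a character column, the extraction returns character values, hence the as.numeric().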
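One way is to use apply row-wise and, for each row, select the value from the column named in that row's GF. Note that apply() converts the data frame to a character matrix first, so x2 comes back as character.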
a$x2 <- apply(a, 1, function(x) x[x[["GF"]]])
a
# x1 x2 GF Pelagic Demersal Cephalopod
#1 1 6 Pelagic 6 7 8
#2 2 7 Demersal 6 7 8
#3 3 8 Cephalopod 6 7 8
Here is a solution which gives a numeric result:
i <- 1:nrow(a)
j <- match(a$GF, GF_list)
as.matrix(a[,(ncol(a)-length(GF_list)+1):ncol(a)])[cbind(i,j)]
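To make the matrix-indexing idea concrete, here is a minimal sketch on toy data where the GF values are deliberately not in the same order as GF_list. The data values are made up for illustration, and selecting the value columns by name with a[GF_list] is my assumption about the layout (it does not rely on them being the last columns).

```r
# Toy data (hypothetical values): GF is not in the same order as GF_list
a <- data.frame(x1 = 1:4, x2 = 0,
                GF = c("Demersal", "Cephalopod", "Pelagic", "Demersal"),
                Pelagic = c(1, 2, 3, 4),
                Demersal = c(10, 20, 30, 40),
                Cephalopod = c(100, 200, 300, 400))
GF_list <- c("Pelagic", "Demersal", "Cephalopod")

# Numeric matrix holding only the value columns, in GF_list order
vals <- as.matrix(a[GF_list])

# One (row, column) pair per row of a: row i, column named by a$GF[i]
idx <- cbind(seq_len(nrow(a)), match(a$GF, GF_list))

a$x2 <- vals[idx]
a$x2
```

Because vals contains only numeric columns, x2 stays numeric, and no loop over rows is needed.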
Does anyone know how to have a row in R that is calculated from another row automatically? I.e.,
let's say in Excel I want to make a row C, which is made up of (B2/B1),
e.g. C1 = B2/B1
C2 = B3/B2
...
Cn = Bn+1/Bn
But in Excel we only need to do one calculation and then drag it down. How do we do it in R?
In R you work with columns as vectors so the operations are vectorized. The calculations as described could be implemented by the following commands, given a data.frame df (i.e. a table) and the respective column names as mentioned:
df["C1"] <- df["B2"]/df["B1"]
df["C2"] <- df["B3"]/df["B2"]
In R you usually would name the columns according to the content they hold. With that, you refer to the columns by their name, although you can also address the first column as df[, 1], the first row as df[1, ] and so on.
EDIT 1:
There are multiple ways - and certainly some more elegant ways to get it done - but for understanding I kept it in simple base R:
Example dataset for demonstration:
df <- data.frame("B1" = c(1, 2, 3),
"B2" = c(2, 4, 6),
"B3" = c(4, 8, 12))
Column calculation:
for (i in 1:(ncol(df) - 1)) {  # parentheses needed: 1:ncol(df)-1 would start at 0
col_name <- paste0("C", i)
df[col_name] <- df[, i+1]/df[, i]
}
Output:
B1 B2 B3 C1 C2
1 1 2 4 2 2
2 2 4 8 2 2
3 3 6 12 2 2
So you iterate through the available columns B1/B2/B3. Dynamically create a column name in every iteration, based on the number of the current iteration, and then calculate the respective column contents.
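The same column ratios can also be computed without a loop, by dividing the data frame without its first column by the data frame without its last column. This is only a sketch on the example data above; the names n and ratios are introduced here for illustration.

```r
df <- data.frame(B1 = c(1, 2, 3),
                 B2 = c(2, 4, 6),
                 B3 = c(4, 8, 12))

n <- ncol(df)

# Columns 2..n divided element-wise by columns 1..(n-1)
ratios <- df[-1] / df[-n]
names(ratios) <- paste0("C", seq_len(n - 1))

df <- cbind(df, ratios)
df
```

This relies on arithmetic between two data frames of equal dimensions being element-wise, so it assumes all columns are numeric.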
EDIT 2:
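Row-wise, which is apparently what you actually meant, works similarly:
a <- c(10,15,20, 1)
df <- data.frame(a)
df$b <- NA_real_  # create the column first so it can be filled element by element
for (i in 1:nrow(df)) {
df$b[i] <- df$a[i+1]/df$a[i]  # for the last row, df$a[i+1] is NA
}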
Output:
a b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4 1 NA
You can do this just using vectors, without a for loop.
a <- c(10,15,20, 1)
df <- data.frame(a)
df$b <- c(df$a[-1], 0) / df$a
print(df)
a b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4 1 0.000000
Explanation:
In the example data, df$a is the vector 10 15 20 1.
df$a[-1] is the same vector with its first element removed, 15 20 1.
We then use c() to add a new element to the end so that the vector has the same length as before:
c(df$a[-1],0) which is 15 20 1 0
What we want for column b is this vector divided by the original df$a.
So:
df$b <- c(df$a[-1], 0) / df$a
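If you would rather mark the last row as undefined than report 0 (there is no following value to divide), a small variation pads with NA instead. This is a sketch of that variation, not a change to the answer above.

```r
a <- c(10, 15, 20, 1)
df <- data.frame(a)

# Shift a up by one position and pad the end with NA instead of 0
df$b <- c(df$a[-1], NA) / df$a
df
```

The first three values of b are the same as before; only the last one becomes NA, matching what the loop version produces.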
I have a matrix of integers
m <- rbind(c(1,2),
c(3,6),
c(5,1),
c(2,1),
c(6,3))
and I am looking for a function that takes this matrix as input and outputs a vector flag with length(flag) == nrow(m) that assigns the rows that contain the same set of integers the same unique (let's say integer) value.
For the above example, the desired output would be:
flag <- c(1, 2, 3, 1, 2)
So rows 1 and 4 in m get the same flag 1, because they both contain the same set of integers, in this case {1, 2}. Similarly, rows 2 and 5 get the same flag.
The solution should work for any number of columns.
The only thing I could come up with is the following approach ...
FlagSymmetric <- function(x) {
vec_sim <- rep(NA, nrow(x)) # object containing flags
ind_ord <- ncol(x)
counter <- 1
for(i in 1:nrow(x)) {
if(is.na(vec_sim[i])) { # if that row is not flagged yet, proceed ...
vec_sim[i] <- counter # ... and give the next free flag
for(j in (i+1):nrow(x)) {
if( (i+1) > nrow(x) ) next # in case of tiny matrices
ind <- x[j, ] %in% x[i, ]
if(sum(ind)==ind_ord) vec_sim[j] <- counter # if the same, assign flag
}
counter <- counter + 1
}
}
return(vec_sim)
}
... which does what I want:
> FlagSymmetric(m)
[1] 1 2 3 1 2
If n = nrow(m) this needs 1/2 n^2 operations. Of course, I could make it much quicker by writing this in C++, but this only alleviates my problem to some extent, because I am working with matrices with a potentially huge number of rows.
I guess there must be a smarter way of doing this.
EDIT:
Additional, more general example (sorting row and pasting to character string not possible):
m2 <- rbind(c(1,112),
c(11,12),
c(12,11),
c(112,1),
c(6,3))
flag2 <- c(1, 2, 2, 1, 3) # desired output
FlagSymmetric(m2) # works
[1] 1 2 2 1 3
Assuming you only have numeric data in your matrix, first convert the matrix to a data frame:
m <- data.frame(m)
We can sort every row and paste its values together, then convert the resulting strings to a factor and then to numeric to get a unique number for every combination:
m$flag <- as.numeric(factor(apply(m, 1, function(x) paste0(sort(x), collapse = ""))))
m
# X1 X2 flag
#1 1 2 1
#2 3 6 3
#3 5 1 2
#4 2 1 1
#5 6 3 3
EDIT
The above solution does not work for every combination, as explained in the new example. To keep the individual numbers distinguishable, as @d.b commented, we can use any non-empty collapse argument. For the updated example:
as.numeric(factor(apply(m2, 1, function(x) paste0(sort(x), collapse = "-"))))
#[1] 1 2 2 1 3
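A variation on the same idea: if the flags should be numbered in order of first appearance (as in the desired output of the question) rather than in the sort order of the factor levels, match() against the unique keys can be used instead of factor(). The helper name flag_rows is hypothetical.

```r
# Hypothetical helper: flag rows that contain the same sorted set of values
flag_rows <- function(x) {
  # Build one key per row; the "-" separator keeps 1,112 distinct from 11,12
  keys <- apply(x, 1, function(r) paste(sort(r), collapse = "-"))
  # Position of each key among the unique keys = flag in order of first appearance
  match(keys, unique(keys))
}

m <- rbind(c(1, 2), c(3, 6), c(5, 1), c(2, 1), c(6, 3))
m2 <- rbind(c(1, 112), c(11, 12), c(12, 11), c(112, 1), c(6, 3))

flag_rows(m)
flag_rows(m2)
```

Like the factor() version, this makes a single pass over the rows plus a lookup, rather than comparing every pair of rows.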
I'm new to R, so apologies in advance for bad form in my code.
I'm trying to figure out the best way to go through a dataframe, row by row, and modify a value based on logic that references other columns within that row or an entirely different dataframe. The issue is that the logic I'm using necessitates creating and subsetting a dataframe for each row to retrieve a minimum value. My real data set is 47000 rows and 15 columns, so creating 47,000 subsets is taking a long time.
Here are sample datasets to help describe what I'm talking about.
df1 <- data.frame('A' = c(rep("Beer", 2), rep("Chip", 2)), 'B' = c(NA, 3,
NA,9), 'C' = 5:8, 'D' = NA)
df2 <- data.frame('Q' = c(rep("Beer", 2), rep("Chip", 2)), 'R' = 6:9, 'S' =
c(12, 15, 4, 18), 'T' = c(23, 45, 75, 34))
df1:
A B C D
Beer NA 5 NA
Beer 3 6 NA
Chip NA 7 NA
Chip 9 8 NA
df2:
Q R S T
Beer 6 12 23
Beer 7 15 45
Chip 8 4 75
Chip 9 18 34
This loop does what I want: it checks whether the value in column B is NA. If it isn't, that value is used for column D; if it is NA, the minimum value is retrieved from a filtered subset of df2. In the real use case I have other filtering conditions.
library(dplyr)
for (i in 1:nrow(df1)) {
  if (!(is.na(df1$B[i]))) {
    df1$D[i] <- df1$B[i]
  } else {
    x <- filter(df2, df1$A[i] == df2$Q)
    x <- min(x$S)
    df1$D[i] <- x
  }
}
Everyone says to avoid loops in R, so I created this function using apply which also works (although is a little more difficult to follow):
FUNC <- function(x) {
  apply(x, 1, function(y) {
    if (!(is.na(y[2]))) {
      y[4] <- y[2]
    } else {
      z <- filter(df2, y[1] == df2$Q)
      z <- min(z$S)
      y[4] <- z
    }
  })
}
df1$D <- as.numeric(FUNC(df1))
Output:
A B C D
Beer NA 5 12
Beer 3 6 3
Chip NA 7 4
Chip 9 8 9
Aside question: is there a way to reference items in vector y by name instead of by index position?
So is there a better way to do this? Right now both methods take about 5-8 minutes to run through 47,000+ rows which seems long to me.
df1$D <- df2 %>%
rename(A=Q) %>%
group_by(A) %>%
summarise(D=min(S)) %>%
right_join(df1, by="A") %>%
mutate(D=ifelse(is.na(B), D.x, B)) %>%
`[[`("D")
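The same idea (compute the per-group minimum once, then look it up per row) can also be sketched in base R without a join, so the result is aligned with the rows of df1 by construction. The name mins is introduced here for illustration, and this covers only the single filtering condition shown in the sample data.

```r
df1 <- data.frame(A = c(rep("Beer", 2), rep("Chip", 2)),
                  B = c(NA, 3, NA, 9), C = 5:8, D = NA)
df2 <- data.frame(Q = c(rep("Beer", 2), rep("Chip", 2)),
                  R = 6:9, S = c(12, 15, 4, 18), T = c(23, 45, 75, 34))

# Minimum of S for each group in Q, computed once (named by group)
mins <- tapply(df2$S, df2$Q, min)

# Use B where it is present, otherwise the group minimum looked up by name
df1$D <- ifelse(is.na(df1$B), mins[as.character(df1$A)], df1$B)
df1
```

The speed-up comes from computing each group's minimum once instead of filtering df2 again for every row.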
I'm new to R and can't seem to get to grips with how to call a previous value of "self", in this case previous "b" b[-1].
b <- ( ( 1 / 14 ) * MyData$High + (( 13 / 14 )*b[-1]))
Obviously I need a NA somewhere in there for the first calculation, but I just couldn't figure this out on my own.
Adding example of what the sought after result should be (A=MyData$High):
A b
1 5 NA
2 10 0.7142...
3 15 3.0393...
4 20 4.6079...
1) for loop Normally one would just use a simple loop for this:
MyData <- data.frame(A = c(5, 10, 15, 20))
MyData$b <- 0
n <- nrow(MyData)
if (n > 1) for(i in 2:n) MyData$b[i] <- ( MyData$A[i] + 13 * MyData$b[i-1] )/ 14
MyData$b[1] <- NA
giving:
> MyData
A b
1 5 NA
2 10 0.7142857
3 15 1.7346939
4 20 3.0393586
2) Reduce It would also be possible to use Reduce. One first defines a function f that carries out the body of the loop and then we have Reduce invoke it repeatedly like this:
f <- function(b, A) (A + 13 * b) / 14
MyData$b <- Reduce(f, MyData$A[-1], 0, accumulate = TRUE)
MyData$b[1] <- NA
giving the same result.
This gives the appearance of being vectorized but in fact if you look at the source of Reduce it does a for loop itself.
3) filter Noting that the form of the problem is a recursive filter with coefficient 13/14 operating on A/14 (but with A[1] replaced with 0) we can write the following. Since filter returns a time series we use c(...) to convert it back to an ordinary vector. This approach actually is vectorized as the filter operation is performed in C. The function is written as stats::filter here so that it still works if dplyr, which has its own filter, is loaded.
MyData$b <- c(stats::filter(replace(MyData$A, 1, 0)/14, 13/14, method = "recursive"))
MyData$b[1] <- NA
again giving the same result.
Note: All solutions assume that MyData has at least 1 row.
There are a couple of ways you could do this.
The first method is a simple loop
df <- data.frame(A = seq(5, 25, 5))
df$b <- 0
for(i in 2:nrow(df)){
df$b[i] <- (1/14)*df$A[i]+(13/14)*df$b[i-1]
}
df
A b
1 5 0.0000000
2 10 0.7142857
3 15 1.7346939
4 20 3.0393586
5 25 4.6079758
This doesn't give the exact values given in the expected answer, but it's close enough that I've assumed you made a transcription mistake. Note that we have to assume that we can take the NA in df$b[1] as being zero or we get NA all the way down.
If you have heaps of data or need to do this a bunch of times, the speed could be improved by implementing the code in C++ and calling it from R.
The second method uses the R function sapply.
The form you present the problem in,
$b_i = \frac{1}{14} A_i + \frac{13}{14} b_{i-1}$,
is recursive, which makes it hard to vectorise directly; however, we can do some maths and find that (taking the value before the first element to be zero) it is equivalent to
$b_i = \frac{1}{14} \sum_{k=1}^{i} \left(\frac{13}{14}\right)^{i-k} A_k$
We can then write a function which calculates b_i and use sapply to calculate each element:
calc_b <- function(n,A){
(1/14)*sum((13/14)^(n-1:n)*A[1:n])
}
df2 <- data.frame(A = seq(10,25,5))
df2$b <- sapply(seq_along(df2$A), calc_b, df2$A)
df2
A b
1 10 0.7142857
2 15 1.7346939
3 20 3.0393586
4 25 4.6079758
Note: We need to drop the first row (where A = 5) in order for the calculation to perform correctly.
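As a sanity check, the loop, the closed-form sum and a recursive filter can be compared on the same input. This is only a sketch: it assumes, as above, that the first value of b is taken as 0, which is done here by replacing the first element of A with 0 (the name A0 is introduced for illustration).

```r
A <- c(5, 10, 15, 20, 25)
A0 <- replace(A, 1, 0)   # treat the first row as contributing nothing

# 1) explicit loop: b[i] = A[i]/14 + (13/14) * b[i-1], with b[1] = 0
b_loop <- numeric(length(A))
for (i in seq_along(A)[-1]) {
  b_loop[i] <- A[i] / 14 + 13 / 14 * b_loop[i - 1]
}

# 2) closed form: b[n] = (1/14) * sum over k of (13/14)^(n-k) * A0[k]
b_closed <- sapply(seq_along(A0), function(n) {
  k <- seq_len(n)
  sum((13 / 14)^(n - k) * A0[k]) / 14
})

# 3) recursive filter from the stats package
b_filter <- as.numeric(stats::filter(A0 / 14, 13 / 14, method = "recursive"))

all.equal(b_loop, b_closed)
all.equal(b_loop, b_filter)
```

Note that the closed form recomputes a sum of length n for every element, so it does quadratic work overall, while the loop and the filter each make a single pass.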
In R with a matrix:
one two three four
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 11 18
[4,] 4 9 11 19
[5,] 5 10 15 20
I want to extract the submatrix whose rows have column three = 11. That is:
one two three four
[1,] 1 6 11 16
[3,] 3 8 11 18
[4,] 4 9 11 19
I want to do this without looping. I am new to R so this is probably very obvious but the
documentation is often somewhat terse.
This is easier to do if you convert your matrix to a data frame using as.data.frame(). In that case the previous answers (using subset or m$three) will work, otherwise they will not.
To perform the operation on a matrix, you can define a column by name:
m[m[, "three"] == 11,]
Or by number:
m[m[,3] == 11,]
Note that if only one row matches, the result is an integer vector, not a matrix.
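If you need the result to stay a matrix even when a single row matches, drop = FALSE can be added to the subsetting call. A minimal sketch, with the example matrix rebuilt from the values in the question:

```r
m <- cbind(one = 1:5, two = 6:10,
           three = c(11, 12, 11, 11, 15), four = 16:20)

# Three matching rows: the result is a matrix either way
m[m[, "three"] == 11, ]

# One matching row: without drop = FALSE this collapses to a plain vector
one_row_vec <- m[m[, "three"] == 12, ]
one_row_mat <- m[m[, "three"] == 12, , drop = FALSE]

is.matrix(one_row_vec)
is.matrix(one_row_mat)
```

Keeping the matrix shape avoids special-casing the single-match situation in later code that uses nrow() or column indexing.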
A simple approach uses the dplyr package.
If the data frame is called data:
library(dplyr)
result <- filter(data, three == 11)
m <- matrix(1:20, ncol = 4)
colnames(m) <- letters[1:4]
The following command will select the first row of the matrix above.
subset(m, m[,4] == 16)
And this will select the last three.
subset(m, m[,4] > 17)
The result will be a matrix in both cases.
If you want to use column names to select columns then you would be best off converting it to a dataframe with
mf <- data.frame(m)
Then you can select with
mf[ mf$d == 16, ]
Or, you could use the subset command.
subset is a very slow function, and I personally find it useless.
I assume you have a data.frame, array or matrix called Mat with A, B, C as column names; then all you need to do is:
In the case of one condition on one column, let's say column A:
Mat[which(Mat[,'A'] == 10), ]
In the case of multiple conditions on different column, you can create a dummy variable. Suppose the conditions are A = 10, B = 5, and C > 2, then we have:
aux = which(Mat[,'A'] == 10)
aux = aux[which(Mat[aux,'B'] == 5)]
aux = aux[which(Mat[aux,'C'] > 2)]
Mat[aux, ]
When I tested the speed with system.time, the which method was 10x faster than the subset method.
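The three conditions can also be combined into one logical vector with &, which gives the same rows as the step-by-step narrowing. A sketch on made-up data (the values in Mat are hypothetical):

```r
# Hypothetical example matrix with named columns A, B, C
Mat <- cbind(A = c(10, 10, 3, 10),
             B = c(5, 5, 5, 1),
             C = c(1, 4, 9, 7))

# Single combined condition; which() drops any NA results
keep <- Mat[, "A"] == 10 & Mat[, "B"] == 5 & Mat[, "C"] > 2
res <- Mat[which(keep), , drop = FALSE]
res

# Step-by-step version from above, for comparison
aux <- which(Mat[, "A"] == 10)
aux <- aux[which(Mat[aux, "B"] == 5)]
aux <- aux[which(Mat[aux, "C"] > 2)]
identical(which(keep), aux)
```

The combined form evaluates every condition on all rows, whereas the step-by-step form only tests the rows that survived the previous condition; which one is faster will depend on the data.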
If your data is a data frame called m (for a matrix, $ does not work; use m[, "three"] instead), just use:
R> m[m$three == 11, ]
If the dataset is a data frame called data, then all the rows where the value of column 'pm2.5' is greater than 300 can be retrieved with:
data[data[['pm2.5']] > 300, ]