Apply() function to a data frame - r

I have a data frame object which has 24 columns and each one has a different length. I would like to multiply every column by a vector of 24 values. I am thinking about using the apply function since I do not have a matrix. My data frame looks like this:
trans_temp:
Ta.f Ta.f Ta.f Ta.f
1995-10-13 04:00:00 13.6 13.6 13.6 13.6
1995-10-13 05:00:00 13.6 13.6 13.6 13.6
1995-10-13 06:00:00 13.6 13.6 13.6 13.6
1995-10-13 07:00:00 13.5 13.5 13.5 13.5
1995-10-13 08:00:00 13.5 13.5 13.5 13.5
and my vector is
x <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
So I want the first column multiplied by 1, the second by 2, the third by 3 and so on. I cannot multiply directly because it is a data.frame object. My guess is something like:
apply(trans_temp,x,MARGIN=2,fun)
Any help?

You can convert the data to a matrix and multiply it by a matrix of column indices (this works here because your multipliers are simply 1 to 24):
as.matrix(trans_temp) * col(trans_temp)
Benchmarking against eddi's approach:
m <- as.data.frame(matrix(runif(1e7), ncol=1000))
x <- seq_len(1000)
system.time(tt1 <- as.matrix(m) * col(m)) # 0.335 seconds
system.time(tt2 <- t(x*t(m))) # 0.505 seconds
identical(tt1, tt2) # TRUE

You are on the right track, but I don't understand how your columns have different lengths, unless you mean some contain, e.g. NA in them. Use MARGIN = 1 to apply across rows.
x <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
t( apply(trans_temp , MARGIN = 1 , function(y) x * y ) )
You could even shorten the call like so:
t( apply(trans_temp , 1 , `*` , x ) )
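The t() in both calls is there because apply() over rows hands back each row's result as a column. A tiny sketch with made-up numbers (m and x below are illustrative, not the asker's data):

```r
m <- matrix(1:6, nrow = 2)        # 2 rows, 3 columns
x <- c(10, 100, 1000)             # one multiplier per column

# Each row y is multiplied elementwise by x; results are stored as columns
raw <- apply(m, 1, function(y) x * y)
dim(raw)                          # 3 2, i.e. transposed relative to m

fixed <- t(raw)                   # back to 2 rows, 3 columns
fixed[1, ]                        # 10 300 5000
```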

Here's another approach without using apply, that relies on R's recycling behavior:
t(x*t(trans_temp))
This will probably be much faster than the other two approaches.
^^^ Not anymore after Arun's edits :) What this has going for it now is that you can have an arbitrary x (and if you want an arbitrary operation in addition to arbitrary x, then you'd go with Simon's answer).
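For an arbitrary x, two other base R options are probably worth knowing. This is only a sketch on a small made-up data frame (the column names a, b, c and the values in x are assumptions for illustration): sweep() works on the matrix, while Map() keeps everything as a data frame.

```r
# Toy stand-in for trans_temp with three columns
trans_temp <- data.frame(a = c(13.6, 13.5),
                         b = c(13.6, 13.5),
                         c = c(13.6, 13.5))
x <- c(2, 0.5, 10)                # arbitrary multipliers, one per column

# sweep() applies x along the columns (MARGIN = 2) of a matrix
by_sweep <- sweep(as.matrix(trans_temp), MARGIN = 2, STATS = x, FUN = "*")

# Map() pairs each column with its multiplier and returns a named list
by_map <- as.data.frame(Map(`*`, trans_temp, x))
```

Both should agree with t(x * t(trans_temp)); which one reads best is mostly a matter of taste.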

Related

How to fit a function for different groups in a data set using R

Please, how can I fit a function for different groups in a data set (Soil) using R. the first column is the group i.e. Plot and the second column is the observed variable i.e. Depth
Plot Depth
1 12.5
1 14.5
1 15.8
1 16.1
1 18.9
1 21.2
1 23.4
1 25.7
2 13.1
2 15.0
2 15.8
2 16.3
2 17.4
2 18.6
2 22.6
2 24.1
2 25.6
3 11.5
3 12.2
3 13.9
3 14.7
3 18.9
3 20.5
3 21.6
3 22.6
3 24.1
3 25.8
4 10.2
4 21.5
4 15.1
4 12.3
4 10.0
4 13.5
4 16.5
4 19.2
4 17.6
4 14.1
4 19.7
I used the 'for' statement but only saw output for Plot 1.
This was how I applied the 'for' statement:
After importing my data in R, I saved it as: SNq,
for (i in 1:SNq$Plot[i]) {
dp <- SNq$Depth[SNq$Plot==SNq$Plot[i]]
fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
fit1
}
I think this should work. Just make one change in your code: loop over unique(SNq$Plot) instead of 1:SNq$Plot[i].
Why would it work?
Because unique() returns the distinct values in the Plot column (1, 2, 3, 4), which are exactly the groups. With each of those values we can subset the data using SNq$Depth[SNq$Plot==i] to get the depths for that group.
for (i in unique(SNq$Plot)) { # <- here
dp <- SNq$Depth[SNq$Plot==i]
fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
plot(fit1)
}
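If you want to keep the fits rather than just plot them, one option is to split the depths by plot and collect the results in a list. The sketch below is an assumption-laden illustration: it uses a shortened copy of the data, and fit_one is a hypothetical stand-in (a simple method-of-moments gamma estimate) so that it runs without fitdistrplus. In your case you would call fitdist(d, "gamma") there instead.

```r
# Shortened copy of the Soil data, for illustration only
SNq <- data.frame(
  Plot  = c(1, 1, 1, 2, 2, 2),
  Depth = c(12.5, 14.5, 15.8, 13.1, 15.0, 15.8)
)

# Hypothetical stand-in for fitdist(d, "gamma"):
# method-of-moments estimates of the gamma shape and rate
fit_one <- function(d) {
  c(shape = mean(d)^2 / var(d), rate = mean(d) / var(d))
}

# split() gives one Depth vector per Plot; lapply() fits each one
fits <- lapply(split(SNq$Depth, SNq$Plot), fit_one)
names(fits)        # "1" "2"
fits[["1"]]        # estimates for Plot 1
```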
A tidyverse suggestion:
library("tidyverse")
library("fitdistrplus")
fits <- SNq %>%
group_by(Plot) %>%
nest() %>%
mutate(fits = map(data, ~ fitdist(data = .$Depth, distr = "gamma")),
summaries = map(fits, summary))
You could continue with print(fits$fits) and print(fits$summaries) to access the different fits and their summary. Alternatively you can use a syntax like fits$fits[[1]] and fits$summaries[[1]] to access them.
Try:
for (i in 1:nrow(SNq)) {
dp <- SNq$Depth[SNq$Plot==SNq$Plot[i]]
fit1 = fitdist(dp, "gamma")
print(fit1) # autoprinting is off inside a for loop, so print explicitly
}

Estimating missing values in time-series data frame based on a "rate of change"

I am trying to use a loop in R to estimate values that will replace the NAs in my data frame based on a rate of change ("rate") that multiplies my last value (ok, this is confusing, but please refer to the example below). This is something similar to my data:
l1 <- c(NA,NA,NA,27,31,0.5)
l2 <- c(NA,8,12,28,39,0.5)
l3 <- c(NA,NA,NA,NA,39,0.3)
l4 <- c(NA,NA,11,15,31,0.2)
l5 <- c(NA,NA,NA,NA,51,0.9)
data <- as.data.frame(rbind(l1,l2,l3,l4,l5))
colnames(data) <- c("dbh1","dbh2","dbh3","dbh4","dbh5","rate")
So I created a loop to identify the first non-NA value in each row, then use that value to estimate the preceding values based on the "rate". For instance, in row 1, the first NA value would be replaced by "27-(0.5*3)", the second one by "27-(0.5*2)" and the third one by "27-(0.5*1)". This is the loop I came up with. I know the first part (the outside loop) works but the inside one doesn't:
for (i in 1: nrow(data)) {
dbh.cols <- data3[i,c("dbh1","dbh2","dbh3","dbh4","dbh5")]
sample.year <- which(dbh.cols != "NA")
data$first.dbh[i] <- min(dbh.cols, na.rm = T)
data$first.index[i] <- min(sample.year)
for (j on 1: (min(sample.year)-1)) {
ifelse(is.na(data[i,j]), min(dbh.cols, na.rm = T) - (min(sample.year)-j)*rate[i,j], data[i,j])
}
}
I am not good at programming so probably my internal loop strategy with "ifelse" is too weird (and wrong) but I just couldn't think of anything else that would work here... Any suggestions?
1) This uses no explicit loops, just an apply. It assumes that the NAs are all leading as in the example given.
fillIn <- function(x) {
rate <- tail(x, 1)
n <- sum(is.na(x)) # no of NAs
c(x[n+1] - rate * seq(n, 1), na.omit(x))
}
replace(data, TRUE, t(apply(data, 1, fillIn)))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
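One caveat with fillIn as written: for a row with no NAs, seq(n, 1) becomes seq(0, 1) and two extra values get prepended. A possible guarded variant is sketched below (fill_leading is a hypothetical name; it still assumes the NAs are all leading and that the rate is the last element):

```r
fill_leading <- function(x) {
  rate <- x[length(x)]            # rate is assumed to be the last element
  n <- sum(is.na(x))              # number of (leading) NAs
  if (n == 0) return(x)           # nothing to fill in
  # n steps back from the first observed value, then n - 1, ..., 1
  x[seq_len(n)] <- x[n + 1] - rate * rev(seq_len(n))
  x
}

fill_leading(c(NA, NA, NA, 27, 31, 0.5))   # 25.5 26.0 26.5 27.0 31.0 0.5
fill_leading(c(1, 2, 3, 4, 5, 0.5))        # returned unchanged
```

It can be dropped into the same replace(data, TRUE, t(apply(data, 1, fill_leading))) call.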
2) Here is a second approach that uses na.approx from the zoo package. It does not require apply. Here data1 has the same content as data except that the first column is filled in. The other NAs remain. The last line uses na.approx to fill in the remaining NAs linearly.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind( data[cbind(1:nrow(data), NAs + 1)] - data$rate * NAs, data[-1] )
replace(data, TRUE, t(na.approx(t(data1))))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2a) A variation on (2) uses na.locf in the middle line to bring forward the first non-NA in each row. The first and last lines are the same.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind(na.locf(t(data), fromLast = TRUE)[1, ] - data$rate * NAs, data[-1])
replace(data, TRUE, t(na.approx(t(data1))))
You do not need nested for loops for this. Here is some simplified code that does what you want. Working explicitly with your data, we first need to get the first non-NA value from each row.
for_estimate <- apply(data, 1, function(x) x[min(which(is.na(x) == FALSE))])
Secondly, we need to determine what integer to multiply the rate by for each row depending on how many NA values there are.
# total number of NA values per row
n_na <- apply(data,1, function(x) sum(is.na(x)) )
# make it a matrix with 0's appended on (one column per dbh column)
n_na <- matrix(c(n_na, rep(0, nrow(data) * (ncol(data)-2))),
nrow = nrow(data), ncol = ncol(data)-1)
# fill in the rest of the matrix
for(i in 2:ncol(n_na)){
n_na[,i] <- n_na[,i-1] -1
}
Once we have that, we can use this code to back-fill the NA values in the way you are interested in.
for(i in (ncol(data)-1):1){
if(sum(is.na(data[,i]))>0){
to_fill <- which(is.na(data[,i])==TRUE)
data[to_fill,i] <- for_estimate[to_fill] - data$rate[to_fill]*n_na[to_fill,i]
}
}
output
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
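The same back-fill can probably be done without any loop by working on the dbh columns as a matrix. This is a sketch under the same assumption as the other answers (NAs only at the start of each row); the names dbh, first_idx, first_val and steps are mine, not from the question:

```r
data <- as.data.frame(rbind(
  l1 = c(NA, NA, NA, 27, 31, 0.5),
  l2 = c(NA, 8, 12, 28, 39, 0.5),
  l3 = c(NA, NA, NA, NA, 39, 0.3),
  l4 = c(NA, NA, 11, 15, 31, 0.2),
  l5 = c(NA, NA, NA, NA, 51, 0.9)
))
colnames(data) <- c("dbh1", "dbh2", "dbh3", "dbh4", "dbh5", "rate")

dbh <- as.matrix(data[, 1:5])

# Column position of the first non-NA value in each row
first_idx <- max.col((!is.na(dbh)) * 1, ties.method = "first")
# The value found at that position
first_val <- dbh[cbind(seq_len(nrow(dbh)), first_idx)]

# Number of steps each cell lies before the first observed value
steps <- first_idx - col(dbh)
estimate <- first_val - data$rate * steps

# Only overwrite the cells that were missing
dbh[is.na(dbh)] <- estimate[is.na(dbh)]
data[, 1:5] <- dbh
```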

Remove column values with NA in R

I have a data frame, called gen, which is represented below
A B C D E
1 NA 4.35 35.3 3.36 4.87
2 45.2 .463 34.3 NA 34.4
3 NA 34.5 35.6 .457 46.3
I would like to remove the columns where there are NA's. (I know na.omit does it for rows, but I can't seem to find one for columns). The final result would read:
B C E
1 4.35 35.3 4.87
2 .463 34.3 34.4
3 34.5 35.6 46.3
Thanks!
gen <- gen[sapply(gen, function(x) all(!is.na(x)))]
dfrm[ , sapply(dfrm, function(x){ !any(is.na(x)) } ) ]
You might want to use instead this variant:
dfrm[ , sapply(dfrm, function(x){ all(is.finite(x)) } ) ]
If you have Inf or -Inf values in a vector they are not removed or identified with selection based on is.na.
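To see the difference between the two tests, here is a small made-up example (the data frame and its columns a, b, c are purely illustrative). drop = FALSE is added so the result stays a data frame even when only one column survives:

```r
dfrm <- data.frame(a = c(1, 2, 3),
                   b = c(1, NA, 3),
                   c = c(1, Inf, 3))

# is.na() does not flag Inf, so column c is kept
na_free <- dfrm[, sapply(dfrm, function(x) !any(is.na(x))), drop = FALSE]
names(na_free)       # "a" "c"

# is.finite() is FALSE for NA, NaN, Inf and -Inf, so only a is kept
finite_only <- dfrm[, sapply(dfrm, function(x) all(is.finite(x))), drop = FALSE]
names(finite_only)   # "a"
```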
Just use this:
gen[colSums(is.na(gen)) == 0]

Compute mean across intervals of rows

I have a long matrix, of which I need to compute the mean for a certain interval of rows. At the moment I am doing this manually like this:
values <- cbind(1:50,1)
meanqual10 <- mean(values[1:10,1])
meanqual10
[1] 5.5
meanqual15 <- mean(values[11:20,1])
meanqual15
[1] 15.5
meanqual20 <- mean(values[21:30,1])
meanqual20
[1] 25.5
meanqual25 <- mean(values[31:40,1])
meanqual25
[1] 35.5
meanqual30 <- mean(values[41:50,1])
meanqual30
[1] 45.5
There must be a nicer way of doing this. Can anybody help please?
Señor O's answer is nice if you have regular intervals. Another approach, if you want to select arbitrary rows, could be something like:
l <- list(1:10,11:20,21:30,31:40,41:50) # vectors of any length or ordering
sapply(l, function(x) mean(values[x,1]))
Which gives:
[1] 5.5 15.5 25.5 35.5 45.5
Of course, by() can handle arbitrary row groupings, too; this is just a slightly different approach.
by(values[,1], ceiling(1:50 / 10), mean)
ceiling(1:50/10) creates a vector of length 50 with a new integer every 10 numbers.
by will then take the mean for each group of the same integer.
Result:
ceiling(1:50/10): 1
[1] 5.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 2
[1] 15.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 3
[1] 25.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 4
[1] 35.5
------------------------------------------------------------------------------------------------------
ceiling(1:50/10): 5
[1] 45.5
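If you would rather get a plain vector back than the printed by() object, the same grouping idea should work with tapply(). For equal-sized blocks, reshaping into a matrix and taking colMeans() is another option. A sketch using the values from the question (groups and the result names are mine):

```r
values <- cbind(1:50, 1)

# Group label for every row: 1 for rows 1-10, 2 for rows 11-20, ...
groups <- ceiling(seq_len(nrow(values)) / 10)

# One mean per group, returned as a named vector
block_means <- tapply(values[, 1], groups, mean)

# Equal-sized blocks only: each matrix column holds one block of 10 rows
block_means2 <- colMeans(matrix(values[, 1], nrow = 10))
```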

Apply formula for between species comparison

I have a data frame laid out in the following manner:
Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
I would like to compute, for each species combination, (Xa) - (Xb), where X is the trait value, the letter is the species, and Xa > Xb. That is, the larger value of each species combination has to come first, and this is calculated for every trait.
Would this be a multi-step process?
An example output could be
Combination Trait.p Trait.y Trait.z
a/b 0.3 1.1 1.1
I assumed you choose the largest value but David brings up a good point. I doubt this is the best approach but I think it gives you what you're after. Note I added a species c as I'm sure your problem is a bit more complex than just a and b:
dat <- read.table(text="Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
c 14.2 3.8 11.9", header=T)
li <- lapply(split(dat, dat$Species), function(x) apply(x[, -1], 2, max))
com <- expand.grid(names(li), names(li))
inds <- com[com[, 1] != com[, 2], ]
inds <- t(apply(inds, 1, sort))
inds <- inds[!duplicated(inds), ]
ans <- lapply(1:nrow(inds), function(i) {
abs(li[[inds[i, 1]]]-li[[inds[i, 2]]])
})
cbind(Combination = paste(inds[, 1], inds[, 2], sep="/"),
as.data.frame(do.call(rbind, ans)))
This gives us:
Combination Trait.p Trait.y Trait.z
1 a/b 0.3 1.1 1.1
2 a/c 5.9 3.4 2.2
3 b/c 6.2 4.5 3.3
Sorry for the lack of annotation but I'm heading to class.
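A somewhat shorter route to the same table might be aggregate() for the per-species maxima and combn() for the unique pairs, which avoids the expand.grid/sort/duplicated steps. This is a sketch under the same assumption as above (the largest value per species is the one compared); maxes, traits and pairs are names I made up:

```r
dat <- read.table(text = "Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
c 14.2 3.8 11.9", header = TRUE)

# Largest value of every trait within each species
maxes <- aggregate(. ~ Species, data = dat, FUN = max)
traits <- as.matrix(maxes[-1])
rownames(traits) <- as.character(maxes$Species)

# Every unordered pair of species, one pair per column
pairs <- combn(rownames(traits), 2)

# abs() puts the larger value first, whichever species it belongs to
diffs <- abs(traits[pairs[1, ], , drop = FALSE] -
             traits[pairs[2, ], , drop = FALSE])
rownames(diffs) <- NULL

result <- data.frame(Combination = paste(pairs[1, ], pairs[2, ], sep = "/"),
                     diffs)
```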
