How to extract diagonal elements from dataframe and store in a variable? - r

I have a simple 9 element dataframe.
A B C
1 8 21 1
2 40 25 32
3 10 15 49
I want to extract the diagonal elements and store it in a variable. Is there an easier way to do this other than taking one number out at a time to store to a variable?

In this case as they are all numeric you can use:
df <- data.frame(a=c(4,8,10), b = c(25,24,15), c = c(1,32,49))
df
df
a b c
1 4 25 1
2 8 24 32
3 10 15 49
Where this takes the diagonal.
diag(as.matrix(df))
[1] 4 24 49

You can use the diag function which extracts the diagonal of a matrix:
Data <- data.frame(a = c(1,2,3), b= c(11,12,13), c = c(111,112,113))
Data2 <- as.matrix(Data)
Result <- diag(Data2)
Result #Returns 1 12 113

Related

Will head() and tail() functions in R change the order of output?

I know head() and tail() function will return the first or last parts of a dataset, but I wanna know if the two functions are gonna order the output, or just return without ordering them? Thanks many in advance!
As you can see below, they do keep the original order:
df <- data.frame(number = 1:26, letter = letters[1:26])
> head(df)
number letter
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
> tail(df)
number letter
21 21 u
22 22 v
23 23 w
24 24 x
25 25 y
26 26 z

R - replace all values smaller than a specific value in a column with the nearest bigger value

I have a data frame like this one:
df <- data.frame(c(1,2,3,4,5,6,7), c(0,23,55,0,1,40,21))
names(df) <- c("a", "b")
a b
1 0
2 23
3 55
4 0
5 1
6 40
7 21
Now I want to replace all values smaller than 22 in column b with the nearest bigger value. Of course it is possible to use loops, but since I have quite big datasets this is way too slow.
The solution should look somewhat like this:
a b
1 23
2 23
3 55
4 55
5 40
6 40
7 40
Here is a tidyverse possibility (but note #phiver's comment on replacement ambiguities)
library(tidyverse);
df %>%
mutate(b = ifelse(b < 22, NA, b)) %>%
fill(b) %>%
fill(b, .direction = "up");
# a b
#1 1 23
#2 2 23
#3 3 55
#4 4 55
#5 5 55
#6 6 40
#7 7 40
Explanation: Replace values b < 22 with NA and then use fill to fill NAs with previous/following non-NA entries.
Sample data
df <- data.frame(a = c(1,2,3,4,5,6,7), b = c(0,23,55,0,1,40,21))
You can use zoo::rollapply :
library(zoo)
df$b <- rollapply(df$b,3,function(x)
if (x[2] < 22) min(x[x>22]) else x[2],
partial =T)
# df
# a b
# 1 1 23
# 2 2 23
# 3 3 55
# 4 4 55
# 5 5 40
# 6 6 40
# 7 7 40
In base R you could do this for the same output:
transform(df, b = sapply(seq_along(b),function(i)
if (b[i] < 22) {
bi <- c(b,Inf)[seq(i-1,i+1)]
min(bi[bi>=22])
} else b[i]))

lag/lead entire dataframe in R

I am having a very hard time leading or lagging an entire dataframe. What I am able to do is shifting individual columns with the following attempts but not the whole thing:
require('DataCombine')
df_l <- slide(df, Var = var1, slideBy = -1)
using colnames(x_ret_mon) as Var does not work, I am told the variable names are not found in the dataframe.
This attempt shifts the columns right but not down:
df_l<- dplyr::lag(df)
This only creates new variables for the lagged variables but then I do not know how to effectively delete the old non lagged values:
df_l<-shift(df, n=1L, fill=NA, type=c("lead"), give.names=FALSE)
Use dplyr::mutate_all to apply lags or leads to all columns.
df = data.frame(a = 1:10, b = 21:30)
dplyr::mutate_all(df, lag)
a b
1 NA NA
2 1 21
3 2 22
4 3 23
5 4 24
6 5 25
7 6 26
8 7 27
9 8 28
10 9 29
I don't see the point in lagging all columns in a data.frame. Wouldn't that just correspond to rbinding an NA row to your original data.frame (minus its last row)?
df = data.frame(a = 1:10, b = 21:30)
rbind(NA, df[-nrow(df), ]);
# a b
#1 NA NA
#2 1 21
#3 2 22
#4 3 23
#5 4 24
#6 5 25
#7 6 26
#8 7 27
#9 8 28
#10 9 29
And similarly for leading all columns.
A couple more options
data.frame(lapply(df, lag))
require(purrr)
map_df(df, lag)
If your data is a data.table you can do
require(data.table)
as.data.table(shift(df))
Or, if you're overwriting df
df[] <- lapply(df, lag) # Thanks Moody
require(magrittr)
df %<>% map_df(lag)

randomly select rows based on limited random numbers

Seems simple but I can't figure it out.
I have a bunch of animal location data (217 individuals) as a single dataframe. I'm trying to randomly select X locations per individual for further analysis with the caveat that X is within the range of 6-156.
So I'm trying to set up a loop that first randomly selects a value within the range of 6-156 then use that value (say 56) to randomly extract 56 locations from the first individual animal and so on.
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
sub2<-i([sample(nrow(i),sub),])
}
This approach didn't seem to work so I tried tweaking it...
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
rand<-i[sample(1:nrow(i),sub,replace=FALSE),]
}
This did not work either.. Any suggestions or previous postings would be helpful!
Head of the datafile...ANIMALS is the name of the df, ID indicates unique individuals
> FID X Y MONTH DAY YEAR HOUR MINUTE SECOND ELKYR SOURCE ID animalid
1 0 510313 4813290 9 5 2008 22 30 0 342008 FG 1 1
2 1 510382 4813296 9 6 2008 1 30 0 342008 FG 1 1
3 2 510385 4813311 9 6 2008 2 0 0 342008 FG 1 1
4 3 510385 4813394 9 6 2008 3 30 0 342008 FG 1 1
5 4 510386 4813292 9 6 2008 2 30 0 342008 FG 1 1
6 5 510386 4813431 9 6 2008 4 1 0 342008 FG 1 1
Here's one way using mapply. This function takes two lists (or something that can be coerced into a list) and applies function FUN to corresponding elements.
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
# calculate number of samples for individual animal
num.samples.per.animal <- sample(3:6, length(unique(xy$animal)), replace = TRUE)
num.samples.per.animal
[1] 6 3 4 4 6 3 3 6 3 5
# subset random x number of rows from each animal
result <- do.call("rbind",
mapply(num.samples.per.animal, split(xy, f = xy$animal), FUN = function(x, y) {
y[sample(1:nrow(y), x),]
}, SIMPLIFY = FALSE)
)
result
animal loc
7 1 0.99483999
1 1 0.50951321
10 1 0.36505294
6 1 0.34058842
8 1 0.26489107
9 1 0.47418823
13 2 0.27213396
12 2 0.28087775
15 2 0.22130069
23 3 0.33646632
21 3 0.02395097
28 3 0.53079981
29 3 0.85287600
35 4 0.84534073
33 4 0.87370167
31 4 0.85646813
34 4 0.11642335
46 5 0.59624723
48 5 0.15379729
45 5 0.57046122
42 5 0.88799675
44 5 0.62171858
49 5 0.75014593
60 6 0.86915983
54 6 0.03152932
56 6 0.66128549
64 7 0.85420774
70 7 0.89262455
68 7 0.40829671
78 8 0.19073661
72 8 0.20648832
80 8 0.71778913
73 8 0.77883677
75 8 0.37647108
74 8 0.65339300
82 9 0.39957202
85 9 0.31188471
88 9 0.10900795
100 10 0.55282999
95 10 0.10145296
96 10 0.09713218
93 10 0.64900866
94 10 0.76099256
EDIT
Here is another (more straightforward) approach that also handles cases when number of rows is less than the number of samples that should be allocated.
set.seed(357)
result <- do.call("rbind",
by(xy, INDICES = xy$animal, FUN = function(x) {
avail.obs <- nrow(x)
num.rows <- sample(3:15, 1)
while (num.rows > avail.obs) {
message("Sample to be larger than available data points, repeating sampling.")
num.rows <- sample(3:15, 1)
}
x[sample(1:avail.obs, num.rows), ]
}))
result
I like Stackoverflow because I learn so much. #RomanLustrik provided a simple solution; mine is straight-froward as well:
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
newVec <- NULL #Create a blank dataFrame
for(i in unique(xy$animal)){
#Sample a number between 1 and 10 (or 6 and 156, if you need)
samp <- sample(1:10, 1)
#Determine which rows of dataFrame xy correspond with unique(xy$animal)[i]
rows <- which(xy$animal == unique(xy$animal)[i])
#From xy, sample samp times from the rows associated with unique(xy$animal)[i]
newVec1 <- xy[sample(rows, samp, replace = TRUE), ]
#append everything to the same new dataFrame
newVec <- rbind(newVec, newVec1)
}

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y<-qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x))
For example, if I wanted to run this function on "column w" to get the output column appended to dataframe "m" then:
m$w_n<-qnorm((rank(m$w,na.last="keep")-0.5)/sum(!is.na(m$w))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
Probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
df_final
id w z w_n z_n
1 1 39 40 -0.2104284 -1.3829941
2 2 20 26 1.3829941 1.3829941
3 3 43 11 0.2104284 0.6744898
4 4 4 37 -1.3829941 0.2104284
5 5 36 24 0.6744898 -0.6744898
6 6 27 14 -0.6744898 -0.2104284

Resources