I have a data frame that looks like this:
ID val1 val2 val3
A07 -0.01 -0.03 0.01
A08 0.05 -0.07 0.02
B01 0.02 0.03 -0.01
For each row, I'd like to identify the largest absolute value in columns val1, val2, and val3, and then write the signed value (i.e. the value as originally formatted, sign included) of that largest absolute value to a new column. The result would look like this:
ID val1 val2 val3 val.new
A07 -0.01 -0.03 0.01 -0.03
A08 0.05 -0.07 0.02 -0.07
B01 0.02 0.03 -0.01 0.03
I am currently using apply to identify the maximum absolute value in each row across the desired columns and print to a new column, like this:
df[,"val.new"] = apply(abs(df[,2:4]), 1, max)
But this of course returns the max absolute value, without the sign:
ID val1 val2 val3 val.new
A07 -0.01 -0.03 0.01 0.03
A08 0.05 -0.07 0.02 0.07
B01 0.02 0.03 -0.01 0.03
I can't figure out how to return the signed value that was used to identify the max. How do I fix that?
Thanks!
You can do:
df$val.new <- apply(df[-1], 1, function(x) x[which.max(abs(x))])
df
#> ID val1 val2 val3 val.new
#> 1 A07 -0.01 -0.03 0.01 -0.03
#> 2 A08 0.05 -0.07 0.02 -0.07
#> 3 B01 0.02 0.03 -0.01 0.03
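A note on edge cases: which.max ignores NAs and, on ties, returns the first index, so when two columns tie on absolute value the leftmost one wins. A quick illustration:
which.max(abs(c(-0.5, 0.5)))
#> [1] 1
which.max(c(0.1, NA, 0.3))
#> [1] 3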
Data used
df <- structure(list(ID = structure(1:3, .Label = c("A07", "A08", "B01"
), class = "factor"), val1 = c(-0.01, 0.05, 0.02), val2 = c(-0.03,
-0.07, 0.03), val3 = c(0.01, 0.02, -0.01)), row.names = c(NA,
-3L), class = "data.frame")
df
#> ID val1 val2 val3
#> 1 A07 -0.01 -0.03 0.01
#> 2 A08 0.05 -0.07 0.02
#> 3 B01 0.02 0.03 -0.01
We can use a vectorized row/column index in base R:
df$val.new <- df[-1][cbind(seq_len(nrow(df)), max.col(abs(df[-1]), 'first'))]
Output
df
# ID val1 val2 val3 val.new
#1 A07 -0.01 -0.03 0.01 -0.03
#2 A08 0.05 -0.07 0.02 -0.07
#3 B01 0.02 0.03 -0.01 0.03
data
df <- structure(list(ID = structure(1:3, .Label = c("A07", "A08", "B01"
), class = "factor"), val1 = c(-0.01, 0.05, 0.02), val2 = c(-0.03,
-0.07, 0.03), val3 = c(0.01, 0.02, -0.01)), row.names = c(NA,
-3L), class = "data.frame")
My data set is about forest fires and NDVI values (a value ranging from 0 to 1 that indicates how green the surface is). It has an initial column saying when the forest fire of that row took place, and subsequent columns giving the NDVI value on different dates, before and after the fire happened. NDVI values before the fire are substantially higher than values after the fire. Something like:
data1989 <- data.frame("date_fire" = c("1987-01-01", "1987-07-03", "1988-01-01"),
"1986-01-01" = c(0.5, 0.589, 0.66),
"1986-06-03" = c(0.56, 0.447, 0.75),
"1986-10-19" = c(0.8, NA, 0.83),
"1987-01-19" = c(0.75, 0.65,0.75),
"1987-06-19" = c(0.1, 0.55,0.811),
"1987-10-19" = c(0.15, 0.12, 0.780),
"1988-01-19" = c(0.2, 0.22,0.32),
"1988-06-19" = c(0.18, 0.21,0.23),
"1988-10-19" = c(0.21, 0.24, 0.250),
stringsAsFactors = FALSE)
> data1989
date_fire X1986.01.01 X1986.06.03 X1986.10.19 X1987.01.19 X1987.06.19 X1987.10.19 X1988.01.19 X1988.06.19 X1988.10.19
1 1987-01-01 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18 0.21
2 1987-07-03 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21 0.24
3 1988-01-01 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23 0.25
I would like to compute, in a new column, the average of the NDVI values PRIOR to the forest fire. In row one, that would be the average of columns 2, 3, 4 and 5.
What I need to get is:
date_fire X1986.01.01 X1986.06.03 X1986.10.19 X1987.01.19 X1987.06.19 X1987.10.19 X1988.01.19 X1988.06.19 X1988.10.19 meanPreFire
1 1987-01-01 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18 0.21 0.653
2 1987-07-03 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21 0.24 0.559
3 1988-01-01 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23 0.25 0.764
Thanks!
EDIT: SOLUTION
How to adapt the code when there is more than one leading column to exclude:
data1989 <- data.frame("date_fire" = c("1987-02-01", "1987-07-03", "1988-01-01"),
"type" = c("oak", "pine", "oak"),
"meanRainfall" = c(600, 300, 450),
"1986.01.01" = c(0.5, 0.589, 0.66),
"1986.06.03" = c(0.56, 0.447, 0.75),
"1986.10.19" = c(0.8, NA, 0.83),
"1987.01.19" = c(0.75, 0.65,0.75),
"1987.06.19" = c(0.1, 0.55,0.811),
"1987.10.19" = c(0.15, 0.12, 0.780),
"1988.01.19" = c(0.2, 0.22,0.32),
"1988.06.19" = c(0.18, 0.21,0.23),
"1988.10.19" = c(0.21, 0.24, 0.250),
check.names = FALSE,
stringsAsFactors = FALSE)
Using:
j1 <- findInterval(as.Date(data1989$date_fire), as.Date(names(data1989)[-(1:3)],format="%Y.%m.%d"))
m1 <- cbind(rep(seq_len(nrow(data1989)), j1), sequence(j1))
data1989$meanPreFire <- tapply(data1989[-(1:3)][m1], m1[,1], FUN = mean, na.rm = TRUE)
> data1989
date_fire type meanRainfall 1986.01.01 1986.06.03 1986.10.19 1987.01.19 1987.06.19 1987.10.19 1988.01.19 1988.06.19 1988.10.19 meanPreFire
1 1987-02-01 oak 600 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18 0.21 0.6525
2 1987-07-03 pine 300 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21 0.24 0.5590
3 1988-01-01 oak 450 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23 0.25 0.7635
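The index matrix m1 is built from two vectorized helpers: rep() repeats each row number once per pre-fire column, and sequence() enumerates those column positions. For example, if the first row had 3 pre-fire dates and the second had 2:
j1 <- c(3, 2)
cbind(rep(seq_len(2), j1), sequence(j1))
#      [,1] [,2]
# [1,]    1    1
# [2,]    1    2
# [3,]    1    3
# [4,]    2    1
# [5,]    2    2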
Reshape data to the long form and filter dates prior to the forest fire.
library(tidyverse)
data1989 %>%
  pivot_longer(-date_fire, names_to = "date") %>%
  mutate(date_fire = as.Date(date_fire),
         date = as.Date(date, "X%Y.%m.%d")) %>%
  filter(date < date_fire) %>%
  group_by(date_fire) %>%
  summarise(meanPreFire = mean(value, na.rm = TRUE))
# # A tibble: 3 x 2
# date_fire meanPreFire
# <date> <dbl>
# 1 1987-01-01 0.62
# 2 1987-07-03 0.559
# 3 1988-01-01 0.764
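To reproduce the asker's desired wide layout, this summary can be joined back onto the original data. A sketch, assuming the result of the pipeline above was saved as pre_means (a name introduced here for illustration):
data1989 %>%
  mutate(date_fire = as.Date(date_fire)) %>%
  left_join(pre_means, by = "date_fire")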
The solution would be much more concise if we kept the data in long(er) form... but this reproduces the desired output:
library(dplyr)
library(tidyr)
data1989 %>%
  pivot_longer(-date_fire, names_to = "date_NDVI", values_to = "value", names_prefix = "^X") %>%
  mutate(date_fire = as.Date(date_fire, "%Y-%m-%d"),
         date_NDVI = as.Date(date_NDVI, "%Y.%m.%d")) %>%
  group_by(date_fire) %>%
  mutate(period = ifelse(date_NDVI < date_fire, "before_fire", "after_fire")) %>%
  group_by(date_fire, period) %>%
  mutate(average_NDVI = mean(value, na.rm = TRUE)) %>%
  pivot_wider(names_from = date_NDVI, names_prefix = "X", values_from = value) %>%
  pivot_wider(names_from = period, values_from = average_NDVI) %>%
  group_by(date_fire) %>%
  summarise_all(~ sum(., na.rm = TRUE))
Returns:
# A tibble: 3 x 12
date_fire `X1986-01-01` `X1986-06-03` `X1986-10-19` `X1987-01-19` `X1987-06-19` `X1987-10-19` `X1988-01-19` `X1988-06-19` `X1988-10-19` before_fire after_fire
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1987-01-01 0.5 0.56 0.8 0.75 0.1 0.15 0.2 0.18 0.21 0.62 0.265
2 1987-07-03 0.589 0.447 0 0.65 0.55 0.12 0.22 0.21 0.24 0.559 0.198
3 1988-01-01 0.66 0.75 0.83 0.75 0.811 0.78 0.32 0.23 0.25 0.764 0.267
Edit:
If we stop the expression right after calculating the averages, we can use the data in this structure to easily calculate the variance or account for a variable number of observations. I think it's OK to keep date_fire as its own column, but I'd suggest leaving the other dates in a single column (because they correspond to observations), especially if we want to do more analysis with the data using ggplot2 and other tidyverse functions.
We can use base R by creating a row/column index. The column index can be obtained from findInterval with the column names and the 'date_fire':
j1 <- findInterval(as.Date(data1989$date_fire), as.Date(names(data1989)[-1]))
l1 <- lapply(j1+1, `:`, ncol(data1989)-1)
m1 <- cbind(rep(seq_len(nrow(data1989)), j1), sequence(j1))
m2 <- cbind(rep(seq_len(nrow(data1989)), lengths(l1)), unlist(l1))
data1989$meanPreFire <- tapply(data1989[-1][m1], m1[,1], FUN = mean, na.rm = TRUE)
data1989$meanPostFire <- tapply(data1989[-1][m2], m2[,1], FUN = mean, na.rm = TRUE)
data1989
# date_fire 1986-01-01 1986-06-03 1986-10-19 1987-01-19 1987-06-19 1987-10-19 1988-01-19 1988-06-19 1988-10-19
#1 1987-01-01 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18 0.21
#2 1987-07-03 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21 0.24
#3 1988-01-01 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23 0.25
# meanPreFire meanPostFire
#1 0.6200 0.2650000
#2 0.5590 0.1975000
#3 0.7635 0.2666667
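To see what findInterval contributes here: for each fire date it counts how many observation dates fall at or before it, which is exactly the number of pre-fire columns. For instance:
findInterval(as.Date("1987-07-03"),
             as.Date(c("1986-01-01", "1986-06-03", "1986-10-19", "1987-01-19", "1987-06-19")))
#[1] 5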
Or using melt/dcast from data.table
library(data.table)
dcast(melt(setDT(data1989), id.var = 'date_fire')[,
.(value = mean(value, na.rm = TRUE)),
.(date_fire, grp = c('postFire', 'preFire')[1 + (as.IDate(variable) < as.IDate(date_fire))]) ], date_fire ~ grp)[data1989, on = .(date_fire)]
# date_fire postFire preFire 1986-01-01 1986-06-03 1986-10-19 1987-01-19 1987-06-19 1987-10-19 1988-01-19 1988-06-19
#1: 1987-01-01 0.2650000 0.6200 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18
#2: 1987-07-03 0.1975000 0.5590 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21
#3: 1988-01-01 0.2666667 0.7635 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23
# 1988-10-19
#1: 0.21
#2: 0.24
#3: 0.25
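The grp expression relies on indexing a two-element vector with 1 + logical: TRUE coerces to 1, so observation dates before the fire select 'preFire' and the rest select 'postFire'. Standalone:
c('postFire', 'preFire')[1 + c(TRUE, FALSE)]
#[1] "preFire"  "postFire"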
data
data1989 <- data.frame("date_fire" = c("1987-01-01", "1987-07-03", "1988-01-01"),
"1986-01-01" = c(0.5, 0.589, 0.66),
"1986-06-03" = c(0.56, 0.447, 0.75),
"1986-10-19" = c(0.8, NA, 0.83),
"1987-01-19" = c(0.75, 0.65,0.75),
"1987-06-19" = c(0.1, 0.55,0.811),
"1987-10-19" = c(0.15, 0.12, 0.780),
"1988-01-19" = c(0.2, 0.22,0.32),
"1988-06-19" = c(0.18, 0.21,0.23),
"1988-10-19" = c(0.21, 0.24, 0.250), check.names = FALSE,
stringsAsFactors = FALSE)
I have a data set with 3 features, as below:
V1 V2 V3
0.268 0.917 0.191
0.975 0.467 0.447
0.345 0.898 0.984
0.901 0.043 0.456
0.243 0.453 0.964
0.001 0.464 0.953
0.998 0.976 0.978
0.954 0.932 0.923
How can I plot this data in a 3D graphic, giving a different colour to the points that satisfy each of the following conditions?
(v1>=0.90 && v3>=0.90 && v3>=0.90) || (v1>=0.90 && v3< 0.50 && v3< 0.50) || (v1 < 0.50 && v3>=0.90 && v3< 0.50)|| (v1< 0.50 && v3< 0.50 && v3>=0.90)
I assumed the second statement in each condition is referring to V2, which makes more sense. To color the points according to which condition is met, you first need to create a column with that value:
df = data.frame(
"V1" = c(0.268,0.975,0.345,0.901,0.243,0.001,0.998,0.954),
"V2" = c(0.917,0.467,0.898,0.043,0.453,0.464,0.976,0.932),
"V3" = c(0.191,0.447,0.984,0.456,0.964,0.953,0.978,0.923)
)
library(dplyr)
df = df %>%
  mutate(
    group = case_when(
      V1 >= 0.9 & V2 >= 0.9 & V3 >= 0.9 ~ "1",
      V1 >= 0.9 & V2 < 0.5 & V3 < 0.5 ~ "2",
      V1 < 0.5 & V2 >= 0.9 & V3 < 0.5 ~ "3",
      V1 < 0.5 & V2 < 0.5 & V3 >= 0.9 ~ "4",
      TRUE ~ "5"
    ))
Then we can use the plotly or scatterplot3d packages to build the graph:
library(scatterplot3d)
scatterplot3d(x = df$V1, y = df$V2, z = df$V3, color = df$group)
library(plotly)
plot_ly(x = df$V1, y = df$V2, z = df$V3, color = df$group)
You can start by creating a logical vector using the vectorized & and | operators:
# Create the logical vector
ind <- (mat$v1>=0.90 & mat$v3>=0.90 & mat$v3>=0.90) | (mat$v1>=0.90 & mat$v3< 0.50 & mat$v3< 0.50) |
(mat$v1 < 0.50 & mat$v3>=0.90 & mat$v3< 0.50) | (mat$v1< 0.50 & mat$v3< 0.50 & mat$v3>=0.90)
And now one can plot it, e.g. using plotly:
# plot
plotly::plot_ly(x = mat$v1[ind], y = mat$v2[ind], z = mat$v3[ind])
With the data
mat = structure(list(v1 = c(0.268, 0.975, 0.345, 0.901, 0.243, 0.001,
0.998, 0.954), v2 = c(0.917, 0.467, 0.898, 0.043, 0.453, 0.464,
0.976, 0.932), v3 = c(0.191, 0.447, 0.984, 0.456, 0.964, 0.953,
0.978, 0.923)), class = "data.frame", row.names = c(NA, -8L))
When using aggregate with a compound function, the resulting data.frame has matrices inside some of its columns.
ta = aggregate(cbind(precision, result, prPo) ~ rstx + qx + laplace, t0,
               function(x) c(x = mean(x), m = min(x), M = max(x)))
ta=head(ta)
dput(ta)
structure(list(rstx = c(3, 3, 2, 3, 2, 3), qx = c(0.2, 0.25,
0.3, 0.3, 0.33, 0.33), laplace = c(0, 0, 0, 0, 0, 0), precision = structure(c(0.174583333333333,
0.186833333333333, 0.3035, 0.19175, 0.30675, 0.193666666666667,
0.106, 0.117, 0.213, 0.101, 0.22, 0.109, 0.212, 0.235, 0.339,
0.232, 0.344, 0.232), .Dim = c(6L, 3L), .Dimnames = list(NULL,
c("x", "m", "M"))), result = structure(c(-142.333333333333,
-108.316666666667, -69.1, -85.7, -59.1666666666667, -68.5666666666667,
-268.8, -198.2, -164, -151.6, -138.2, -144.8, -30.8, -12.2, -14.2,
-3.8, -12.6, -3.4), .Dim = c(6L, 3L), .Dimnames = list(NULL,
c("x", "m", "M"))), prPo = structure(c(3.68416666666667,
3.045, 2.235, 2.53916666666667, 2.0775, 2.23666666666667, 1.6,
1, 1.02, 0.54, 0.87, 0.31, 5.04, 4.02, 2.77, 3.53, 2.63, 3.25
), .Dim = c(6L, 3L), .Dimnames = list(NULL, c("x", "m", "M")))), .Names = c("rstx",
"qx", "laplace", "precision", "result", "prPo"), row.names = c(NA,
6L), class = "data.frame")
Is there a function that transforms a data.frame's matrix-columns into ordinary columns?
Manually, for each matrix-column, a column bind plus a column delete works:
colnames(ta)
[1] "rstx" "qx" "laplace" "precision" "result" "prPo"
ta[,"precision"] # ta[,4]
x m M
[1,] 0.1745833 0.106 0.212
[2,] 0.1868333 0.117 0.235
[3,] 0.3035000 0.213 0.339
[4,] 0.1917500 0.101 0.232
[5,] 0.3067500 0.220 0.344
[6,] 0.1936667 0.109 0.232
#column bind + column delete
ta=cbind(ta,precision=ta[,4])
ta=ta[,-4]
colnames(ta)
[1] "rstx" "qx" "laplace" "result" "prPo" "precision.x" "precision.m"
[8] "precision.M"
ta
rstx qx laplace result.x result.m result.M prPo.x prPo.m prPo.M precision.x precision.m
1 3 0.20 0 -142.33333 -268.80000 -30.80000 3.684167 1.600000 5.040000 0.1745833 0.106
2 3 0.25 0 -108.31667 -198.20000 -12.20000 3.045000 1.000000 4.020000 0.1868333 0.117
3 2 0.30 0 -69.10000 -164.00000 -14.20000 2.235000 1.020000 2.770000 0.3035000 0.213
4 3 0.30 0 -85.70000 -151.60000 -3.80000 2.539167 0.540000 3.530000 0.1917500 0.101
5 2 0.33 0 -59.16667 -138.20000 -12.60000 2.077500 0.870000 2.630000 0.3067500 0.220
6 3 0.33 0 -68.56667 -144.80000 -3.40000 2.236667 0.310000 3.250000 0.1936667 0.109
precision.M
1 0.212
2 0.235
3 0.339
4 0.232
5 0.344
6 0.232
A matrix doesn't support matrix-columns, so as.matrix() transforms the data.frame into a plain matrix, breaking up each matrix-column.
Here is my idea:
library(tidyverse)
ta2 <- ta %>%
as.matrix() %>%
as.data.frame()
Somewhere on Stack Overflow I found a very simple solution:
cbind(ta[-ncol(ta)],ta[[ncol(ta)]])
rstx qx laplace precision.x precision.m precision.M result.x result.m result.M x m
1 3 0.20 0 0.1745833 0.1060000 0.2120000 -142.33333 -268.80000 -30.80000 3.684167 1.60
2 3 0.25 0 0.1868333 0.1170000 0.2350000 -108.31667 -198.20000 -12.20000 3.045000 1.00
3 2 0.30 0 0.3035000 0.2130000 0.3390000 -69.10000 -164.00000 -14.20000 2.235000 1.02
4 3 0.30 0 0.1917500 0.1010000 0.2320000 -85.70000 -151.60000 -3.80000 2.539167 0.54
5 2 0.33 0 0.3067500 0.2200000 0.3440000 -59.16667 -138.20000 -12.60000 2.077500 0.87
6 3 0.33 0 0.1936667 0.1090000 0.2320000 -68.56667 -144.80000 -3.40000 2.236667 0.31
M
1 5.04
2 4.02
3 2.77
4 3.53
5 2.63
6 3.25
Just that!
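Note the cbind trick only flattens the last column. A sketch of a more general base R idiom (not from the original thread) is to rebuild the frame from its columns, since data.frame() splits each matrix argument into one ordinary column apiece:
# flattens every matrix-column, naming the results precision.x, precision.m, ...
ta_flat <- do.call(data.frame, ta)
colnames(ta_flat)
# "rstx" "qx" "laplace" "precision.x" "precision.m" "precision.M"
# "result.x" "result.m" "result.M" "prPo.x" "prPo.m" "prPo.M"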
So my test data looks like this:
structure(list(day = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L
), Left = c(0.25, 0.33, 0, 0, 0.25, 0.33, 0.5, 0.33, 0.5, 0),
Left1 = c(NA, NA, 0, 0.5, 0.25, 0.33, 0.1, 0.33, 0.5, 0),
Middle = c(0, 0, 0.3, 0, 0.25, 0, 0.3, 0.33, 0, 0), Right = c(0.25,
0.33, 0.3, 0.5, 0.25, 0.33, 0.1, 0, 0, 0.25), Right1 = c(0.5,
0.33, 0.3, 0, 0, 0, 0, 0, 0, 0.75), Side = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("L", "R"), class = "factor")), .Names = c("day",
"Left", "Left1", "Middle", "Right", "Right1", "Side"), class = "data.frame", row.names = c(NA,
-10L))
or this:
day Left Left1 Middle Right Right1 Side
1 0.25 NA 0.00 0.25 0.50 R
1 0.33 NA 0.00 0.33 0.33 R
2 0.00 0.00 0.30 0.30 0.30 R
2 0.00 0.50 0.00 0.50 0.00 R
2 0.25 0.25 0.25 0.25 0.00 L
3 0.33 0.33 0.00 0.33 0.00 L
I would like to write a loop to find the standard error and average value for each day on the chosen side.
OK, so far I have this code:
td<-read.csv('test data.csv')
IDs<-unique(td$day)
se<-function(x) sqrt(var(x)/length(x))
for (i in 1:length(IDs)) {
  day.i <- which(td$day == IDs[i])
  td.i <- td[day.i, ]
  if (td$Side == 'L') {
    side <- cbind(td.i$Left + td.i$Left1)
  } else {
    side <- cbind(td.i$Right + td.i$Right1)
  }
  mean(side)
  se(side)
  print(mean)
  print(se)
}
But I am getting error messages like this:
Error: unexpected '}' in "}"
Obviously, I am also not getting the printout of means for each day. Does anyone know why?
I'm also working on this here: http://www.talkstats.com/showthread.php/27187-Writing-a-mean-loop..-(literally)
Convert your data into a list and work with that instead:
First, split up your data into a list according to Side, subsetting the relevant columns along the way.
td = split(td, td$Side)
NAMES = names(td)
td = lapply(1:length(td),
            function(x) td[[x]][c(1, grep(NAMES[x], names(td[[x]])))])
names(td) = NAMES
td
# $L
# day Left Left1
# 5 2 0.25 0.25
# 6 3 0.33 0.33
# 7 3 0.50 0.10
# 8 4 0.33 0.33
# 9 4 0.50 0.50
#
# $R
# day Right Right1
# 1 1 0.25 0.50
# 2 1 0.33 0.33
# 3 2 0.30 0.30
# 4 2 0.50 0.00
# 10 4 0.25 0.75
Then, use lapply and aggregate to apply whatever functions you want to your data.
lapply(1:length(td),
       function(x) aggregate(list(td[[x]][-1]),
                             list(day = td[[x]]$day), mean))
# [[1]]
# day Left Left1
# 1 2 0.250 0.250
# 2 3 0.415 0.215
# 3 4 0.415 0.415
#
# [[2]]
# day Right Right1
# 1 1 0.29 0.415
# 2 2 0.40 0.150
# 3 4 0.25 0.750
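To get the standard error alongside the mean with the same approach, you could pass a function that returns both; a sketch, reusing the asker's se() definition:
se <- function(x) sqrt(var(x) / length(x))
lapply(1:length(td),
       function(x) aggregate(list(td[[x]][-1]),
                             list(day = td[[x]]$day),
                             function(v) c(mean = mean(v), se = se(v))))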
Still not entirely sure if I understand (that is, whether you want the mean and SE for both Left and Left1, or some sort of combination like their sum). This is how I interpreted your question:
FUN <- function(dat, side = "L") {
DF <- split(dat, dat$Side)[[side]]
ind <- if(side=="L") 2:3 else 5:6
stderr <- function(x) sqrt(var(x)/length(x))
meanNse <- function(x) c(mean=mean(x), se=stderr(x))
OUT <- aggregate(DF[, ind], list(DF[, 1]), meanNse)
names(OUT)[1] <- "day"
return(OUT)
}
#test it
FUN(td)
FUN(td, "R")
Which yields:
> FUN(td)
day Left.mean Left.se Left1.mean Left1.se
1 2 0.250 NA 0.250 NA
2 3 0.415 0.085 0.215 0.115
3 4 0.415 0.085 0.415 0.085
> FUN(td, "R")
day Right.mean Right.se Right1.mean Right1.se
1 1 0.29 0.04 0.415 0.085
2 2 0.40 0.10 0.150 0.150
3 4 0.25 NA 0.750 NA
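One caveat: because meanNse returns two values, aggregate stores each result as a matrix-column (the same situation as in the aggregate question above). If downstream code needs ordinary columns, rebuilding the frame flattens them; a sketch:
out <- FUN(td)
out <- do.call(data.frame, out)  # splits Left etc. into Left.mean / Left.se columns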