I'm working with daily rainfall data that spans several years. I want to sum the rainfall over consecutive rainy days to get a total for each rainfall event. It would also be nice to get a start and stop date and a rainfall intensity per event. I'm thinking I could hack something together with aggregate, but what I have in my head seems very bulky. Is there a quick and elegant solution, possibly with dplyr, tidyr, or data.table?
Data
structure(list(Time = structure(c(1353398400, 1353484800, 1353571200,
1353657600, 1353744000, 1353830400, 1353916800, 1354003200, 1354089600,
1354176000, 1354262400, 1354348800, 1354435200, 1354521600, 1354608000,
1354694400, 1354780800, 1354867200, 1354953600, 1355040000, 1355126400,
1355212800, 1355299200, 1355385600, 1355472000, 1355558400, 1355644800,
1355731200, 1355817600, 1355904000, 1355990400, 1356076800, 1356163200,
1356249600, 1356336000, 1356422400, 1356508800, 1356595200, 1356681600,
1356768000, 1356854400, 1356940800, 1357027200, 1357113600, 1357200000,
1357286400, 1357372800, 1357459200, 1357545600, 1357632000, 1357718400
), class = c("POSIXct", "POSIXt"), tzone = ""), inc = c(NA, NA,
NA, NA, NA, NA, NA, 0.11, NA, 0.62, 0.0899999999999999, 0.39,
NA, NA, 0.03, NA, NA, NA, NA, NA, NA, 0.34, NA, NA, NA, NA, 0.0600000000000001,
0.02, NA, NA, NA, 0.29, 0.35, 0.02, 0.27, 0.17, 0.0600000000000001,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.47, NA, NA, NA, 0.0300000000000002
)), .Names = c("Time", "inc"), row.names = 50:100, class = "data.frame")
Desired output
Begin End Days Total Intensity
11/27/2012 11/27/2012 1 0.11 0.11
11/29/2012 12/1/2012 3 1.1 0.366666667
12/4/2012 12/4/2012 1 0.03 0.03
12/11/2012 12/11/2012 1 0.34 0.34
12/16/2012 12/17/2012 2 0.08 0.04
12/21/2012 12/26/2012 6 0.29 0.048333333
1/5/2013 1/5/2013 1 0.47 0.47
1/9/2013 1/9/2013 1 0.03 0.03
data.table::rleid is a convenient function for dealing with runs of consecutive values. Assuming your data frame is named df and has been sorted by the Time variable beforehand:
library(data.table)
setDT(df)
na.omit(df[, .(Begin = as.Date(first(Time)),
               End = as.Date(last(Time)),
               Days = as.Date(last(Time)) - as.Date(first(Time)) + 1,
               Total = sum(inc), Intensity = mean(inc)),
           by = .(id = rleid(is.na(inc)))])
# id Begin End Days Total Intensity
#1: 2 2012-11-27 2012-11-27 1 days 0.11 0.1100000
#2: 4 2012-11-29 2012-12-01 3 days 1.10 0.3666667
#3: 6 2012-12-04 2012-12-04 1 days 0.03 0.0300000
#4: 8 2012-12-11 2012-12-11 1 days 0.34 0.3400000
#5: 10 2012-12-16 2012-12-17 2 days 0.08 0.0400000
#6: 12 2012-12-21 2012-12-26 6 days 1.16 0.1933333 # I think there is a miscalculation in your desired output here
#7: 14 2013-01-05 2013-01-05 1 days 0.47 0.4700000
#8: 16 2013-01-09 2013-01-09 1 days 0.03 0.0300000
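If you also want the m/d/Y formatting from the desired output, a possible follow-up on the result above (a sketch, assuming you assign the result to res):
res[, c("Begin", "End") := .(format(Begin, "%m/%d/%Y"), format(End, "%m/%d/%Y"))]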
Here is an approach that uses dplyr.
First, some preliminary cleanup: a date variable is needed, not a POSIXct:
library(dplyr)
df2 <- df %>%
mutate(date = as.Date(Time)) %>%
select(-Time)
This computes a data frame with an explicit variable for rain_event:
df3 <- df2 %>%
filter(!is.na(inc)) %>%
mutate(
day_lag = as.numeric(difftime(date, lag(date), units = "days")),
# special case: first rain event
day_lag = ifelse(is.na(day_lag), 1, day_lag),
rain_event = 1 + cumsum(day_lag > 1)
)
> df3
inc date day_lag rain_event
1 0.11 2012-11-27 1 1
2 0.62 2012-11-29 2 2
3 0.09 2012-11-30 1 2
4 0.39 2012-12-01 1 2
5 0.03 2012-12-04 3 3
6 0.34 2012-12-11 7 4
7 0.06 2012-12-16 5 5
8 0.02 2012-12-17 1 5
9 0.29 2012-12-21 4 6
10 0.35 2012-12-22 1 6
11 0.02 2012-12-23 1 6
12 0.27 2012-12-24 1 6
13 0.17 2012-12-25 1 6
14 0.06 2012-12-26 1 6
15 0.47 2013-01-05 10 7
16 0.03 2013-01-09 4 8
Now, summarise by each rain event, computing the metrics you care about:
df3 %>%
group_by(rain_event) %>%
summarise(
begin = min(date),
end = max(date),
days = n(),
total = sum(inc),
intensity = mean(inc)
)
# A tibble: 8 × 6
rain_event begin end days total intensity
<dbl> <date> <date> <int> <dbl> <dbl>
1 1 2012-11-27 2012-11-27 1 0.11 0.1100000
2 2 2012-11-29 2012-12-01 3 1.10 0.3666667
3 3 2012-12-04 2012-12-04 1 0.03 0.0300000
4 4 2012-12-11 2012-12-11 1 0.34 0.3400000
5 5 2012-12-16 2012-12-17 2 0.08 0.0400000
6 6 2012-12-21 2012-12-26 6 1.16 0.1933333
7 7 2013-01-05 2013-01-05 1 0.47 0.4700000
8 8 2013-01-09 2013-01-09 1 0.03 0.0300000
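The two steps can also be collapsed by computing the event id inside group_by; a sketch of the same logic, where the inline cumsum replaces the explicit day_lag column:
df2 %>%
  filter(!is.na(inc)) %>%
  group_by(rain_event = 1 + cumsum(c(1, diff(date)) > 1)) %>%
  summarise(begin = min(date), end = max(date), days = n(),
            total = sum(inc), intensity = mean(inc))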
You can append a new column that groups rows belonging to the same continuous rainy period, then get the statistics you want using dplyr. Assuming your data frame is called df:
library(dplyr)
rain_period <- rep(NA, nrow(df))  # initialize group vector
group <- 1                        # initialize group number
for (i in 1:nrow(df)) {
  if (is.na(df$inc[i])) group <- group + 1
  else rain_period[i] <- group
}
df$group = rain_period
result = dplyr::group_by(df,group)
result = dplyr::summarise(result,
Begin = min(Time),
End = max(Time),
Days = n(),
Total = sum(inc),
Intensity = mean(inc))
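Note that rows where inc is NA keep group = NA, so the summary contains one all-NA group; you may want to drop it, e.g.:
result <- dplyr::filter(result, !is.na(group))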
Only base packages here, basically using the aggregate function. I know it is not the nicest option around. The only wrinkle is the date format: the data frame columns must be assembled one by one to keep the desired date format, otherwise the dates get converted to integers:
data1 <- df  # same data as in the question (identical dput omitted)
rainruns <- function(datas = data1) {
  incs <- c(NA, datas$inc)                                       # inc values shifted down by one (leading NA)
  event <- cumsum(is.na(incs[-length(incs)]) & !is.na(incs[-1])) # counter that increments at the start of each rain event
  datas <- cbind(datas, event)                                   # add event column
  datas2 <- datas[!is.na(datas$inc), ]                           # delete NA rows
  summarydata1 <- aggregate(datas2$inc, by = list(datas2$event), # summarize rain data by event
                            FUN = function(x) c(length(x), sum(x), mean(x)))[[2]]
  summarydata2 <- aggregate(as.Date(datas2$Time), by = list(datas2$event), # summarize dates by event
                            FUN = function(x) c(min(x), max(x)))[[2]]
  summarydata <- data.frame(format(as.Date(summarydata2[, 1],   # combine both, correcting date formats
                                           origin = "1970-01-01"), "%m/%d/%Y"),
                            format(as.Date(summarydata2[, 2],
                                           origin = "1970-01-01"), "%m/%d/%Y"), summarydata1)
  names(summarydata) <- c("Begin", "End", "Days", "Total", "Intensity") # update column names
  return(summarydata)
}
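Calling the function on the example data then returns one row per rain event:
rainruns()  # equivalently rainruns(data1)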
Related
I want to find a way to extract the n rows that contain the top results (min and max) in a data frame. The catch is that this data frame contains strings, NAs, and groups. Also, if several top results fall in the same row, I still need exactly n rows, so values in the same row count as just one result.
V01_Code V01_Corr V01_Lag V02_Code V02_Corr V02_Lag V03_Code V03_Corr V03_Lag V04_Code V04_Corr V04_Lag Group
1 AMI 0.63 L7 <NA> NA <NA> <NA> NA <NA> <NA> NA <NA> B
2 CII -0.61 L7 CMI -0.53 L7 <NA> NA <NA> <NA> NA <NA> A
3 AFI 0.51 L7 <NA> NA <NA> <NA> NA <NA> <NA> NA <NA> A
4 AII 0.52 L7 BII 0.62 L4 BMI 0.60 L7 III 0.58 L4 B
5 BII 0.52 L7 IIA 0.74 L6 III 0.51 L7 IMA 0.75 L6 A
6 AII 0.58 L6/L7 BII 0.69 L4 BMI 0.70 L7 IIA 0.57 L4 A
7 IIA 0.58 L6 IMA 0.59 L6 IMI 0.52 L6 <NA> NA <NA> B
8 IMU 0.52 L6 <NA> NA <NA> <NA> NA <NA> <NA> NA <NA> A
I tried several versions like this:
aggregate(. ~ Group, df, function(x) max(head(sort(x),2),na.rm=T))
But it doesn't seem to work. As output I want a data frame of the rows (for example, 2 rows here) that contain the highest and lowest values. So in this case 0.75 in row 5 is the highest value; the 2nd highest is in the same row, so it doesn't count again. The 2nd highest in any other row would be 0.7 in row 6. So for my top-2 result of max values I want:
V01_Code V01_Corr V01_Lag V02_Code V02_Corr V02_Lag V03_Code V03_Corr V03_Lag V04_Code V04_Corr V04_Lag Group
5 BII 0.52 L7 IIA 0.74 L6 III 0.51 L7 IMA 0.75 L6 A
6 AII 0.58 L6/L7 BII 0.69 L4 BMI 0.70 L7 IIA 0.57 L4 A
1 AMI 0.63 L7 <NA> NA <NA> <NA> NA <NA> <NA> NA <NA> B
4 AII 0.52 L7 BII 0.62 L4 BMI 0.60 L7 III 0.58 L4 B
n in this case would be 2, so the 2 rows that contain the maximum values for each group.
Here is my dataframe
structure(list(V01_Code = c("AMI", "CII", "AFI", "AII", "BII",
"AII", "IIA", "IMU"), V01_Corr = c(0.63, -0.61, 0.51, 0.52, 0.52,
0.58, 0.58, 0.52), V01_Lag = c("L7", "L7", "L7", "L7", "L7",
"L6/L7", "L6", "L6"), V02_Code = c(NA, "CMI", NA, "BII", "IIA",
"BII", "IMA", NA), V02_Corr = c(NA, -0.53, NA, 0.62, 0.74, 0.69,
0.59, NA), V02_Lag = c(NA, "L7", NA, "L4", "L6", "L4", "L6",
NA), V03_Code = c(NA, NA, NA, "BMI", "III", "BMI", "IMI", NA),
V03_Corr = c(NA, NA, NA, 0.6, 0.51, 0.7, 0.52, NA), V03_Lag = c(NA,
NA, NA, "L7", "L7", "L7", "L6", NA), V04_Code = c(NA, NA,
NA, "III", "IMA", "IIA", NA, NA), V04_Corr = c(NA, NA, NA,
0.58, 0.75, 0.57, NA, NA), V04_Lag = c(NA, NA, NA, "L4",
"L6", "L4", NA, NA), Group = c("B", "A", "A", "B", "A", "A",
"B", "A")), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8"), class = "data.frame")
Here is an option with reshaping: create a row-sequence column ('rn') with row_number, reshape from wide to long with pivot_longer, arrange the rows by 'Group' and the 'value' column in descending order, filter on the first 'n' unique values of 'rn', ungroup, and reshape back to 'wide' format with pivot_wider.
library(dplyr)
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = ends_with("Corr"), names_to = 'Corr') %>%
arrange(Group, desc(value)) %>%
group_by(Group) %>%
filter(rn %in% head(unique(rn), 2)) %>%
ungroup %>%
select(-rn) %>%
pivot_wider(names_from = Corr, values_from = value)
Output:
# A tibble: 4 x 13
V01_Code V01_Lag V02_Code V02_Lag V03_Code V03_Lag V04_Code V04_Lag Group V04_Corr V02_Corr V03_Corr V01_Corr
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 BII L7 IIA L6 III L7 IMA L6 A 0.75 0.74 0.51 0.52
2 AII L6/L7 BII L4 BMI L7 IIA L4 A 0.57 0.69 0.7 0.58
3 AMI L7 <NA> <NA> <NA> <NA> <NA> <NA> B NA NA NA 0.63
4 AII L7 BII L4 BMI L7 III L4 B 0.58 0.62 0.6 0.52
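A reshape-free sketch of the same idea, ranking rows within each group by their row-wise maximum correlation (the helper column row_max is mine, not part of the original answer):
library(dplyr)
df1 %>%
  mutate(row_max = do.call(pmax, c(across(ends_with("Corr")), na.rm = TRUE))) %>%
  group_by(Group) %>%
  slice_max(row_max, n = 2) %>%
  ungroup() %>%
  select(-row_max)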
I have a dataset with hundreds of thousands of measurements taken from several subjects. However, the measurements are only partially available, i.e., there may be large stretches of NA. I need to establish up front for which timespan valid (non-NA) data are available for each subject.
Data:
df
timestamp C B A starttime_ms
1 00:00:00.033 NA NA NA 33
2 00:00:00.064 NA NA NA 64
3 00:00:00.066 NA 0.346 NA 66
4 00:00:00.080 47.876 0.346 22.231 80
5 00:00:00.097 47.876 0.346 22.231 97
6 00:00:00.099 47.876 0.346 NA 99
7 00:00:00.114 47.876 0.346 NA 114
8 00:00:00.130 47.876 0.346 NA 130
9 00:00:00.133 NA 0.346 NA 133
10 00:00:00.147 NA 0.346 NA 147
My (humble) solution so far is to pick out the timestamp values that are not NA and to select the first and last such timestamp for each subject individually. Here's the code for subject C:
NotNA_C <- df$timestamp[which(!is.na(df$C))]
range_C <- paste(NotNA_C[1], NotNA_C[length(NotNA_C)], sep = " - ")
range_C
[1] "00:00:00.080 - 00:00:00.130"
That doesn't look elegant and, what's more, it needs to be repeated for all other subjects. Is there a more efficient way to establish the range of time for which non-NA values are available for all subjects in one go?
EDIT
I've found a base R solution:
sapply(df[,2:4], function(x)
paste(df$timestamp[which(!is.na(x))][1],
df$timestamp[which(!is.na(x))][length(df$timestamp[which(!is.na(x))])], sep = " - "))
C B A
"00:00:00.080 - 00:00:00.130" "00:00:00.066 - 00:00:00.147" "00:00:00.080 - 00:00:00.097"
but would be interested in other solutions as well!
Reproducible data:
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
dplyr solution
library(tidyverse)
df %>%
pivot_longer(-c(timestamp, starttime_ms)) %>%
group_by(name) %>%
drop_na() %>%
summarise(min = timestamp %>% min(),
max = timestamp %>% max())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> name min max
#> <chr> <chr> <chr>
#> 1 A 00:00:00.080 00:00:00.097
#> 2 B 00:00:00.066 00:00:00.147
#> 3 C 00:00:00.080 00:00:00.130
Created on 2021-02-15 by the reprex package (v0.3.0)
You could take the cumulative sum of the differences of the non-NA indicator, coerce it to logical, and subset the first and last matching timestamps.
lapply(data.frame(apply(rbind(0, diff(!sapply(df[c("C", "B", "A")], is.na))), 2, cumsum)),
function(x) c(df$timestamp[as.logical(x)][1], rev(df$timestamp[as.logical(x)])[1]))
# $C
# [1] "00:00:00.080" "00:00:00.130"
#
# $B
# [1] "00:00:00.066" "00:00:00.147"
#
# $A
# [1] "00:00:00.080" "00:00:00.097"
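A shorter base R sketch of the same result: since the timestamps are fixed-width strings, lexicographic min/max coincide with chronological order, so range() on the non-NA subset suffices:
sapply(df[c("C", "B", "A")], function(x) range(df$timestamp[!is.na(x)]))
#      C              B              A
# [1,] "00:00:00.080" "00:00:00.066" "00:00:00.080"
# [2,] "00:00:00.130" "00:00:00.147" "00:00:00.097"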
I am working on a dataset for a welfare wage subsidy program, where wages per worker are structured as follows:
df <- structure(list(wage_1990 = c(13451.67, 45000, 10301.67, NA, NA,
8726.67, 11952.5, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA,
NA, 9881.67, 5483.33, 12868.33, 9321.67), wage_1991 = c(13451.67,
45000, 10301.67, NA, NA, 8750, 11952.5, NA, NA, 7140, NA, NA,
10301.67, 7303.33, NA, NA, 9881.67, 5483.33, 12868.33, 9321.67
), wage_1992 = c(13451.67, 49500, 10301.67, NA, NA, 8750, 11952.5,
NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67, NA,
12868.33, 9321.67), wage_1993 = c(NA, NA, 10301.67, NA, NA, 8750,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1994 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1995 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1996 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7291.67, NA, NA, 10301.67, 7303.33, NA, NA,
9881.67, NA, NA, 9321.67)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L))
I have tried one proposed solution, which is running this code after the one above:
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
But I keep getting this error:
Error in dim(X) <- c(n, length(X)/n) : dims [product 60000] do not match the length of object [65051]
I want to do the following: create a variable showing the annual growth rate of the wage for each worker, or the lack thereof.
The practical issue I am facing is that each observation is in one row, and while the first worker joined the program in 1990, others might have joined in, say, 1992 or 1993. Therefore, is there a way to compute the growth rate for each worker over the specific years they worked, rather than applying a general growth formula to all observations?
My expected output would be a new column with the average wage growth rate for each worker, e.g.:
row   average wage growth rate
1     15%
2     9%
3     12%
After running the following code to see descriptive statistics of my variable of interest:
skim(df$average_growth_rate)
I get the following result:
Variable contains Inf or -Inf value(s) that were converted to NA.
── Data Summary ────────────────────────
                           Values
Name                       gosi_beneficiary_growth$a...
Number of rows             3671
Number of columns          1
_______________________
Column type frequency:
  numeric                  1
________________________
Group variables            None
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 data               1348         0.633  Inf Inf -1 -0.450 0 0.0568
I am not sure why my mean and standard deviation values are Inf.
Here is one approach:
library(tidyverse)
growth <- df %>%
rowid_to_column() %>%
gather(key, value, -rowid) %>%
drop_na() %>%
arrange(rowid, key) %>%
group_by(rowid) %>%
mutate(yoy = value / lag(value)-1) %>%
summarise(average_growth_rate = mean(yoy, na.rm=T))
# A tibble: 12 x 2
rowid average_growth_rate
<int> <dbl>
1 1 0
2 2 0.05
3 3 0
4 6 0.00422
5 7 0.0000813
6 10 0.00354
7 13 0
8 14 0
9 17 0
10 18 0
11 19 0
12 20 0
And just to highlight that all these 0s are expected, here is the data frame:
> head(df)
# A tibble: 6 x 7
wage_1990 wage_1991 wage_1992 wage_1993 wage_1994 wage_1995 wage_1996
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13452. 13452. 13452. NA NA NA NA
2 45000 45000 49500 NA NA NA NA
3 10302. 10302. 10302. 10302. 10302. 10302. 10302.
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 8727. 8750 8750 8750 8948. 8948. 8948.
where you see that, e.g., for the first row there was neither growth nor decline. In the second row there was a slight increase between the second and third year, but no change between the first and second. For the third row, again absolutely no change. Etc.
Also, finally, to add these results to the initial data frame, you would do e.g.
df %>%
  rowid_to_column() %>%
  left_join(growth, by = "rowid")
And just to answer the performance question, here is a benchmark (where I changed akrun's data.frame call to a tibble call to make sure there is no difference coming from this). All functions below correspond to creating the growth rates, not merging back to the original data frame.
library(microbenchmark)
microbenchmark(cj(), akrun(), akrun2())
Unit: microseconds
expr min lq mean median uq max neval cld
cj() 5577.301 5820.501 6122.076 5988.551 6244.301 10646.9 100 c
akrun() 998.301 1097.252 1559.144 1160.450 1212.552 28704.5 100 a
akrun2() 2033.801 2157.101 2653.018 2258.052 2340.702 34143.0 100 b
base R is the clear winner in terms of performance.
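For reference, the benchmarked wrappers were presumably thin functions around each solution, along these lines (a sketch; the exact bodies were not shown, but they should correspond to the dplyr pipeline above and the two base R variants below):
cj <- function() df %>%
  rowid_to_column() %>%
  gather(key, value, -rowid) %>%
  drop_na() %>%
  arrange(rowid, key) %>%
  group_by(rowid) %>%
  mutate(yoy = value / lag(value) - 1) %>%
  summarise(average_growth_rate = mean(yoy, na.rm = TRUE))
akrun <- function() apply(df, 1, function(x) {
  x1 <- x[!is.na(x)]
  mean(x1[-1] / x1[-length(x1)] - 1)
})
akrun2 <- function() na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
  mean(tail(na.omit(x), -1) / head(na.omit(x), -1) - 1))))[2:1]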
We can use base R with apply: loop over the rows with MARGIN = 1, remove the NA elements ('x1'), and take the mean of the ratios of each element to the previous one.
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
# rowid average_growth_rate
#1 1 0.00000000000
#2 2 0.05000000000
#3 3 0.00000000000
#6 6 0.00422328325
#7 7 0.00008129401
#10 10 0.00354038282
#13 13 0.00000000000
#14 14 0.00000000000
#17 17 0.00000000000
#18 18 0.00000000000
#19 19 0.00000000000
#20 20 0.00000000000
Or using tapply/stack
na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
  mean(tail(na.omit(x), -1)/head(na.omit(x), -1) - 1))))[2:1]
Here is my data frame:
structure(list(Q = c(NA, 346.86, 166.95, 162.57, NA, NA, NA,
266.7), L = c(18.93, NA, 15.72, 39.51, NA, NA, NA, NA), C = c(NA,
23.8, NA, 8.47, 20.89, 18.72, 14.94, NA), X = c(40.56, NA, 26.05,
3.08, 23.77, 59.37, NA, NA), W = c(29.47, NA, NA, NA, 36.08,
NA, 27.34, 28.19), S = c(NA, 7.47, NA, NA, 18.64, NA, 25.34,
NA), Y = c(NA, 2.81, 0, NA, NA, 21.18, 10.83, 12.19), H = c(0,
NA, NA, NA, NA, 0, NA, 0)), class = "data.frame", row.names = c(NA,
-8L), .Names = c("Q", "L", "C", "X", "W", "S", "Y", "H"))
Each row has 4 variables that are NA; now I want to do the same operations to every row:
Drop the 4 variables that are NA
Calculate diversity for the remaining 4 variables (it's just some computation involving them; here I use diversity() from vegan)
Append the output to a new data frame
But the problem is:
How do I drop the NA variables using dplyr? I don't know whether select() can do it.
How do I apply operations to every row of a data frame?
It seems that drop_na() removes the entire row for my dataset; any suggestions?
With tidyverse it may be better to gather into 'long' format and then spread it back. Assuming that we have exactly 4 non-NA elements per row: create a row index with rownames_to_column (from tibble), gather (from tidyr) into 'long' format, remove the NA elements, group by the row number ('rn'), change the 'key' values to a common set, and then spread back to 'wide' format.
library(tibble)
library(tidyr)
library(dplyr)
res <- rownames_to_column(df1, 'rn') %>%
gather(key, val, -rn) %>%
filter(!is.na(val)) %>%
group_by(rn) %>%
mutate(key = LETTERS[1:4]) %>%
spread(key, val) %>%
ungroup %>%
select(-rn)
res
# A tibble: 8 x 4
# A B C D
#* <dbl> <dbl> <dbl> <dbl>
#1 18.9 40.6 29.5 0
#2 347 23.8 7.47 2.81
#3 167 15.7 26.0 0
#4 163 39.5 8.47 3.08
#5 20.9 23.8 36.1 18.6
#6 18.7 59.4 21.2 0
#7 14.9 27.3 25.3 10.8
#8 267 28.2 12.2 0
diversity(res)
# 1 2 3 4 5 6 7 8
#1.0533711 0.3718959 0.6331070 0.7090783 1.3517680 0.9516232 1.3215712 0.4697572
Regarding the diversity calculation, we can replace NA with 0 and apply it to the whole dataset, i.e.
library(vegan)
diversity(replace(df1, is.na(df1), 0))
#[1] 1.0533711 0.3718959 0.6331070 0.7090783
#[5] 1.3517680 0.9516232 1.3215712 0.4697572
as we get the same output as in the first solution
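Alternatively, a base R sketch that computes diversity row by row after dropping each row's NAs (equivalent here, since zeros contribute nothing to Shannon diversity):
library(vegan)
apply(df1, 1, function(x) diversity(x[!is.na(x)]))
#         1         2         3         4         5         6         7         8
# 1.0533711 0.3718959 0.6331070 0.7090783 1.3517680 0.9516232 1.3215712 0.4697572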
I have a data frame Depth which consists of LON and LAT with corresponding depth temperature data. For each coordinate (LON and LAT) I would like to pull out the last recorded depth value into a new data frame,
> Depth<-read.csv('depthdata.csv')
> head(Depth)
LAT LON X150 X175 X200 X225 X250 X275 X300 X325 X350 X375 X400 X425 X450
1 -78.375 -163.875 -1.167 -1.0 NA NA NA NA NA NA NA NA NA NA NA
2 -78.125 -168.875 -1.379 -1.3 -1.259 -1.6 -1.476 -1.374 -1.507 NA NA NA NA NA NA
3 -78.125 -167.625 -1.700 -1.7 -1.700 -1.7 NA NA NA NA NA NA NA NA NA
4 -78.125 -167.375 -2.100 -2.2 -2.400 -2.3 -2.200 NA NA NA NA NA NA NA NA
5 -78.125 -167.125 -1.600 -1.6 -1.600 -1.6 NA NA NA NA NA NA NA NA NA
6 -78.125 -166.875 NA NA NA NA NA NA NA NA NA NA NA NA NA
so that I will have this:
LAT LON
-78.375 -163.875 -1
-78.125 -168.875 -1.507
-78.125 -167.625 -1.7
-78.125 -167.375 -2.2
-78.125 -167.125 -1.6
-78.125 -166.875 NA
I tried the tail() function but didn't get the desired result.
As I understand it, you want the last non-NA value in each row, for all columns except the first two.
We can use max.col() along with is.na() with our relevant columns to get us the column number for the last non-NA value. 2 is added (shown by + 2L) to compensate for the removal of the first two columns (shown by [-(1:2)]).
idx <- max.col(!is.na(Depth[-(1:2)]), ties.method = "last") + 2L
We can use idx in cbind() to create an index matrix for retrieving the values.
Depth[cbind(seq_len(nrow(Depth)), idx)]
# [1] -1.000 -1.507 -1.700 -2.200 -1.600 NA
Bind this together with the first two columns of the original data with cbind() and we're done.
cbind(Depth[1:2], LAST = Depth[cbind(seq_len(nrow(Depth)), idx)])
# LAT LON LAST
# 1 -78.375 -163.875 -1.000
# 2 -78.125 -168.875 -1.507
# 3 -78.125 -167.625 -1.700
# 4 -78.125 -167.375 -2.200
# 5 -78.125 -167.125 -1.600
# 6 -78.125 -166.875 NA
Data:
Depth <- structure(list(LAT = c(-78.375, -78.125, -78.125, -78.125, -78.125,
-78.125), LON = c(-163.875, -168.875, -167.625, -167.375, -167.125,
-166.875), X150 = c(-1.167, -1.379, -1.7, -2.1, -1.6, NA), X175 = c(-1,
-1.3, -1.7, -2.2, -1.6, NA), X200 = c(NA, -1.259, -1.7, -2.4,
-1.6, NA), X225 = c(NA, -1.6, -1.7, -2.3, -1.6, NA), X250 = c(NA,
-1.476, NA, -2.2, NA, NA), X275 = c(NA, -1.374, NA, NA, NA, NA
), X300 = c(NA, -1.507, NA, NA, NA, NA), X325 = c(NA, NA, NA,
NA, NA, NA), X350 = c(NA, NA, NA, NA, NA, NA), X375 = c(NA, NA,
NA, NA, NA, NA), X400 = c(NA, NA, NA, NA, NA, NA), X425 = c(NA,
NA, NA, NA, NA, NA), X450 = c(NA, NA, NA, NA, NA, NA)), .Names = c("LAT",
"LON", "X150", "X175", "X200", "X225", "X250", "X275", "X300",
"X325", "X350", "X375", "X400", "X425", "X450"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))