Computing Growth Rates - r

I am working on a dataset for a welfare wage subsidy program, where wages per worker are structured as follows:
df <- structure(list(wage_1990 = c(13451.67, 45000, 10301.67, NA, NA,
8726.67, 11952.5, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA,
NA, 9881.67, 5483.33, 12868.33, 9321.67), wage_1991 = c(13451.67,
45000, 10301.67, NA, NA, 8750, 11952.5, NA, NA, 7140, NA, NA,
10301.67, 7303.33, NA, NA, 9881.67, 5483.33, 12868.33, 9321.67
), wage_1992 = c(13451.67, 49500, 10301.67, NA, NA, 8750, 11952.5,
NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67, NA,
12868.33, 9321.67), wage_1993 = c(NA, NA, 10301.67, NA, NA, 8750,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1994 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1995 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1996 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7291.67, NA, NA, 10301.67, 7303.33, NA, NA,
9881.67, NA, NA, 9321.67)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L))
I have tried one proposed solution, which is running this code after the one above:
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
But I keep getting this error:
Error in dim(X) <- c(n, length(X)/n) : dims [product 60000] do not match the length of object [65051]
I want to create a variable showing the annual growth rate of wages for each worker, or the lack thereof.
The practical issue is that each worker occupies one row, and while the first worker joined the program in 1990, others might have joined in, say, 1992 or 1993. Is there a way to compute the growth rate for each worker based on the specific years they worked, rather than applying one general growth formula to all observations?
My expected output would be a new column holding one value per row:
average wage growth rate
1- 15%
2- 9%
3- 12%
After running the following code to see descriptive statistics of my variable of interest:
skim(df$average_growth_rate)
I get the following result:
"Variable contains Inf or -Inf value(s) that were converted to NA.── Data Summary ────────────────────────
Values
Name gosi_beneficiary_growth$a...
Number of rows 3671
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 data 1348 0.633 Inf Inf -1 -0.450 0 0.0568
"
I am not sure why my mean and standard deviation values are Inf.

Here is one approach:
library(tidyverse)
growth <- df %>%
rowid_to_column() %>%
gather(key, value, -rowid) %>%
drop_na() %>%
arrange(rowid, key) %>%
group_by(rowid) %>%
mutate(yoy = value / lag(value)-1) %>%
summarise(average_growth_rate = mean(yoy, na.rm=T))
# A tibble: 12 x 2
rowid average_growth_rate
<int> <dbl>
1 1 0
2 2 0.05
3 3 0
4 6 0.00422
5 7 0.0000813
6 10 0.00354
7 13 0
8 14 0
9 17 0
10 18 0
11 19 0
12 20 0
And just to highlight that all these 0s are expected, here is the dataframe:
> head(df)
# A tibble: 6 x 7
wage_1990 wage_1991 wage_1992 wage_1993 wage_1994 wage_1995 wage_1996
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13452. 13452. 13452. NA NA NA NA
2 45000 45000 49500 NA NA NA NA
3 10302. 10302. 10302. 10302. 10302. 10302. 10302.
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 8727. 8750 8750 8750 8948. 8948. 8948.
For example, the first row shows neither growth nor decline. In the second row, the wage rose by 10% between the second and third year but did not change between the first and second, which averages to 5%. The third row again shows no change at all, and so on.
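The arithmetic for the second row can be checked by hand in base R. This is only a sketch; the names `wages` and `yoy` are illustrative and the three values are the non-NA wages of row 2.

```r
# Non-NA wages of the second worker (1990, 1991, 1992)
wages <- c(45000, 45000, 49500)

# Year-over-year growth: each value divided by the previous one, minus 1
yoy <- wages[-1] / wages[-length(wages)] - 1
yoy        # growth 1990->1991 and 1991->1992

# Average growth rate, as reported for rowid 2 above
mean(yoy)
```

Note that this is the arithmetic mean of the yearly rates, which is what both answers here compute.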
Finally, to add these results to the initial dataframe, you could do, e.g.:
df %>%
rowid_to_column() %>%
left_join(growth, by = "rowid")
And just to answer the performance question, here is a benchmark (where I changed akrun's data.frame call to a tibble call to make sure no difference comes from that). All functions below only create the growth rates; they do not merge them back into the original dataframe.
library(microbenchmark)
microbenchmark(cj(), akrun(), akrun2())
Unit: microseconds
expr min lq mean median uq max neval cld
cj() 5577.301 5820.501 6122.076 5988.551 6244.301 10646.9 100 c
akrun() 998.301 1097.252 1559.144 1160.450 1212.552 28704.5 100 a
akrun2() 2033.801 2157.101 2653.018 2258.052 2340.702 34143.0 100 b
Base R is the clear winner in terms of performance.

We can use base R with apply. Loop over the rows with MARGIN = 1, remove the NA elements ('x1'), and take the mean of the ratio of each element to the previous one, minus 1:
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
# rowid average_growth_rate
#1 1 0.00000000000
#2 2 0.05000000000
#3 3 0.00000000000
#6 6 0.00422328325
#7 7 0.00008129401
#10 10 0.00354038282
#13 13 0.00000000000
#14 14 0.00000000000
#17 17 0.00000000000
#18 18 0.00000000000
#19 19 0.00000000000
#20 20 0.00000000000
Or using tapply/stack (each later value divided by the one before it):
na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
mean(tail(na.omit(x), -1)/head(na.omit(x), -1) -1))))[2:1]
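On the Inf mean and standard deviation reported by skim(): a probable cause, assuming some wages in the full dataset are recorded as 0, is division by zero. A minimum growth of -1 (the p0 in the skim output) means a wage dropped to 0, and the following year's ratio then divides by that 0. The sketch below uses a made-up wage history, not data from the question.

```r
# Hypothetical wage history that contains a zero (not from the question's data)
wages <- c(4000, 0, 5000, 5500)

growth <- wages[-1] / wages[-length(wages)] - 1
growth                 # the second element is Inf because 5000 / 0 is Inf

mean(growth)           # a single Inf makes the mean Inf

# One possible way to handle it: average only the finite growth rates
mean(growth[is.finite(growth)])
```

Whether dropping those years is appropriate depends on what a zero wage means in the program; treating zeros as NA before computing the ratios is another option.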

Related

Reading in xlsx files that are unstructured in R

I am having trouble finding an efficient way to read multiple unstructured .xlsx files into R. This requires a bit of explaining, so that anyone trying to assist can understand exactly what I am trying to do.
Following a suggestion, I have used dput to make my dataset reproducible. The structure can be replicated with the code below:
x <- structure(list(...1 = c("Company Name", "Contact",
"Name", "Phone #", "Scope of Work", NA,
"Trees", "36\" Box Southern Live Oak (1.5\" Caliper)", "36\" Box Thornless Chilean Mesquite (1.5\" Caliper)",
NA, "DG",
"Desert Gold", "Pink Coral"
), ...2 = c("To:", "Date:", "Job Name:", "Plan Date:", "Install All Trees, Shrubs, Irrigation and Landscape Material to Meet all Landscape Plans and Specs",
NA, NA, NA, NA, NA, NA, NA, NA), ...3 = c("Contractor",
"DATE ID", "Job ID", "DATE ID", NA, NA, NA, NA, NA, NA, NA,
NA, NA), ...4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA), ...5 = c(NA, NA, NA, NA, NA, NA, "Quantity", "20", "38",
NA, "Quantity", "26", "32"), ...6 = c(NA, NA, NA, NA, NA, NA, NA,
10, 10, NA, NA, 10, 10), ...7 = c(NA, NA, NA, NA, NA, NA, NA,
200, 380, NA, NA, 260, 320)), row.names = c(NA, -13L), class = c("tbl_df",
"tbl", "data.frame"))
The tibble will look like this, if you use the code above:
...1 ...2 ...3 ...4 ...5 ...6 ...7
<chr> <chr> <chr> <lgl> <chr> <dbl> <dbl>
1 "Company Name" To: Cont~ NA NA NA NA
2 "Contact" Date: DATE~ NA NA NA NA
3 "Name" Job ~ Job ~ NA NA NA NA
4 "Phone #" Plan~ DATE~ NA NA NA NA
5 "Scope of Work" Inst~ NA NA NA NA NA
6 NA NA NA NA NA NA NA
7 "Trees" NA NA NA Quan~ NA NA
8 "36\" Box Southern Live Oak (1.~ NA NA NA 20 10 200
9 "36\" Box Thornless Chilean Mes~ NA NA NA 38 10 380
10 NA NA NA NA NA NA NA
11 "DG" NA NA NA Quan~ NA NA
12 "Desert Gold" NA NA NA 26 10 260
13 "Pink Coral" NA NA NA 32 10 320
TL;DR: These files are landscape bid forms sent to contractors. The subset x[1:5,1:3] holds information about the job, such as the job name, the date, the contractor's name, the landscape company's name, etc. Every one of the .xlsx files has exactly the same format for that subset. I would like to keep the job name, but, for the purpose of this question, I will not make it the focus.
Under the x[1:5,1:3] subset, starting on x[7,1], there is a header named Trees, which is bolded on the .xlsx files. The next header is DG, which is also bolded. The values are right under the headers, so the first value for Trees is "36\" Box Southern Live Oak (1.5\" Caliper)" and the first value for DG is "Desert Gold". These values are not bolded.
It is important to stress that there are about 10-15 different headers throughout hundreds of files and the amount of values for each header can range from 1 to 100+ rows. These headers and values are always in x[,1].
I am trying to figure out how to partition the sections (DG, TREES...,) and read them into R as their own dataframes. I think the most ideal way to do this is by reading the files into a list and then separating the sections into their own dataframes into a nested list.
Lastly, if you notice, in x[,5], there are headers named Quantity, which are also bolded, and then there are integers under each of the Quantity's that are not bolded. x[,6] is the price of each of those quantities and x[,7] is those 2 columns multiplied together. I am trying to preserve these numbers as well.
In the end, I am trying to have multiple tables or dataframes in R that look like so:
df1
Trees Quantity Price Totals
1 36"... 20 10 200
2 36"... 38 10 380
df2
DG Quantity Price Totals
1 Desert Gold 26 10 260
2 Pink Coral 32 10 320
I am trying to create some way to efficiently do that over hundreds of .xlsx datasets.
So far, I have created a list that has each of the excel files in it. There are 248 files in a folder that I have on my local PC. I read in each of the files into a list like so:
excel_list <- vector(mode = "list", length = 248)
for(i in 1:length(list.files("."))){
excel_list[[i]] <- read_excel(list.files(".")[i], col_names = F)
}
To achieve your desired result you first have to identify the rows containing the section headers, which according to your example data can be done by finding the rows that contain "Quantity" in the fifth column. After that, some additional data wrangling steps are needed to convert your data into a tidy format. Finally, we can split the data by section:
library(janitor)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
tidy_data <- function(x) {
x %>%
remove_empty() %>%
mutate(is_header_row = grepl("^Quan", `...5`),
section = ifelse(is_header_row, `...1`, NA_character_)) %>%
fill(section) %>%
filter(!is.na(section), !is_header_row) %>%
select(-is_header_row) %>%
remove_empty() %>%
rename(Item = 1, Quantity = 2, Price = 3, Totals = 4)
}
xx <- tidy_data(x)
xx
#> # A tibble: 4 × 5
#> Item Quantity Price Totals section
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 "36\" Box Southern Live Oak (1.5\" Caliper)" 20 10 200 Trees
#> 2 "36\" Box Thornless Chilean Mesquite (1.5\" Ca… 38 10 380 Trees
#> 3 "Desert Gold" 26 10 260 DG
#> 4 "Pink Coral" 32 10 320 DG
xx %>%
split(., .$section) %>%
purrr::imap(function(x, y) { x %>% select(-section) %>% rename("{y}" := 1) })
#> $DG
#> # A tibble: 2 × 4
#> DG Quantity Price Totals
#> <chr> <chr> <dbl> <dbl>
#> 1 Desert Gold 26 10 260
#> 2 Pink Coral 32 10 320
#>
#> $Trees
#> # A tibble: 2 × 4
#> Trees Quantity Price Totals
#> <chr> <chr> <dbl> <dbl>
#> 1 "36\" Box Southern Live Oak (1.5\" Caliper)" 20 10 200
#> 2 "36\" Box Thornless Chilean Mesquite (1.5\" Caliper)" 38 10 380
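The header detection and fill-down steps inside tidy_data() can also be sketched in base R, which may help when adapting the logic to sheets with a different layout. The data frame `sheet` and its column names `item` and `qty` are hypothetical stand-ins for columns 1 and 5 of a real sheet.

```r
# Tiny stand-in for one sheet: hypothetical values, same layout idea as the question
sheet <- data.frame(
  item = c("Job info", NA, "Trees", "Oak", "Mesquite", NA, "DG", "Desert Gold"),
  qty  = c(NA, NA, "Quantity", "20", "38", NA, "Quantity", "26"),
  stringsAsFactors = FALSE
)

# A section header is any row whose quantity cell reads "Quantity"
is_header <- !is.na(sheet$qty) & sheet$qty == "Quantity"

# Carry each header's label down to the rows beneath it
section_id <- cumsum(is_header)                      # 0 before the first header
section <- c(NA, sheet$item[is_header])[section_id + 1]

# Keep item rows only: inside a section, not a header, not blank
keep <- !is_header & !is.na(sheet$item) & !is.na(section)
sections <- split(sheet[keep, ], section[keep])
sections
```

Rows above the first header (the job information) get no section label and are dropped, which mirrors the filter(!is.na(section), !is_header) step above.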

Determine range of time where measurements are not NA

I have a dataset with hundreds of thousands of measurements taken from several subjects. However, the measurements are only partially available, i.e., there may be large stretches with NA. I need to establish up front, for which timespan positive data are available for each subject.
Data:
df
timestamp C B A starttime_ms
1 00:00:00.033 NA NA NA 33
2 00:00:00.064 NA NA NA 64
3 00:00:00.066 NA 0.346 NA 66
4 00:00:00.080 47.876 0.346 22.231 80
5 00:00:00.097 47.876 0.346 22.231 97
6 00:00:00.099 47.876 0.346 NA 99
7 00:00:00.114 47.876 0.346 NA 114
8 00:00:00.130 47.876 0.346 NA 130
9 00:00:00.133 NA 0.346 NA 133
10 00:00:00.147 NA 0.346 NA 147
My (humble) solution so far is (i) to pick out the timestamp values for which a subject's measurement is not NA and (ii) to select the first and last such timestamp, for each subject individually. Here's the code for subject C:
NotNA_C <- df$timestamp[which(!is.na(df$C))]
range_C <- paste(NotNA_C[1], NotNA_C[length(NotNA_C)], sep = " - ")
range_C
[1] "00:00:00.080 - 00:00:00.130"
That doesn't look elegant and, what's more, it needs to be repeated for all other subjects. Is there a more efficient way to establish the range of time for which non-NA values are available for all subjects in one go?
EDIT
I've found a base R solution:
sapply(df[,2:4], function(x)
paste(df$timestamp[which(!is.na(x))][1],
df$timestamp[which(!is.na(x))][length(df$timestamp[which(!is.na(x))])], sep = " - "))
C B A
"00:00:00.080 - 00:00:00.130" "00:00:00.066 - 00:00:00.147" "00:00:00.080 - 00:00:00.097"
but would be interested in other solutions as well!
Reproducible data:
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
dplyr solution
library(tidyverse)
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
df %>%
pivot_longer(-c(timestamp, starttime_ms)) %>%
group_by(name) %>%
drop_na() %>%
summarise(min = timestamp %>% min(),
max = timestamp %>% max())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> name min max
#> <chr> <chr> <chr>
#> 1 A 00:00:00.080 00:00:00.097
#> 2 B 00:00:00.066 00:00:00.147
#> 3 C 00:00:00.080 00:00:00.130
Created on 2021-02-15 by the reprex package (v0.3.0)
You could look at the cumsum of differences where there's no NA, coerce them to logical and subset first and last element.
lapply(data.frame(apply(rbind(0, diff(!sapply(df[c("C", "B", "A")], is.na))), 2, cumsum)),
function(x) c(df$timestamp[as.logical(x)][1], rev(df$timestamp[as.logical(x)])[1]))
# $C
# [1] "00:00:00.080" "00:00:00.130"
#
# $B
# [1] "00:00:00.066" "00:00:00.147"
#
# $A
# [1] "00:00:00.080" "00:00:00.097"
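A shorter base R variant is sketched below. It assumes the timestamps are zero-padded strings, as in the question, so that their alphabetical order matches chronological order; the reduced data frame is only there to keep the example self-contained.

```r
# Reduced copy of the question's data (two subjects, five rows)
df <- data.frame(
  timestamp = c("00:00:00.033", "00:00:00.066", "00:00:00.080",
                "00:00:00.097", "00:00:00.133"),
  C = c(NA, NA, 47.876, 47.876, NA),
  B = c(NA, 0.346, 0.346, 0.346, 0.346),
  stringsAsFactors = FALSE
)

# For each subject, take the smallest and largest timestamp with a non-NA value
ranges <- sapply(df[c("C", "B")], function(x) range(df$timestamp[!is.na(x)]))
ranges
```

The result is a character matrix with one column per subject, first row the start and second row the end of the available data. A subject with no measurements at all would make range() fail, so such columns need to be filtered out first.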

Fill in NA column values with the last value that was not NA (na.locf by column) [duplicate]

I am cleaning my data of which the dput looks as follows.
DF <- structure(list(toberevised = c("[Money amounts are in thousands of dollars]",
NA, NA, NA, "Item", NA, NA, NA, NA, "Number of returns", "Number of joint returns",
"Number with paid preparer's signature", "Number of exemptions",
"Adjusted gross income (AGI) [3]", "Salaries and wages in AGI: [4] Number",
"Salaries and wages in AGI: Amount", "Taxable interest: Number",
"Taxable interest: Amount", "Ordinary dividends: Number", "Ordinary dividends: Amount"
), ...2 = c("UNITED STATES [2]", NA, NA, NA, "All returns", NA,
NA, "1", NA, "135257620", "52607676", "80455243", "273738434",
"7364640131", "114060887", "5161583318", "59553985", "161324824",
"31158675", "164247298"), ...3 = c(NA, NA, NA, NA, "Under", "$50,000 [1]",
NA, "2", NA, "92150166", "20743943", "53622647", "159649737",
"1797097083", "75422766", "1541276272", "28527550", "39043002",
"13174923", "23867893"), ...4 = c(NA, NA, "Size of adjusted gross income",
NA, "50000", "under", "75000", "3", NA, "18221115", "11329459",
"11025624", "44189517", "1119634632", "16299827", "896339313",
"10891905", "16353293", "5255958", "12810282"), ...5 = c(NA,
NA, NA, NA, "75000", "under", "100000", "4", NA, "10499106",
"8296546", "6260725", "28555195", "905336768", "9520214", "721137490",
"7636612", "12852148", "4095938", "11524298"), ...6 = c(NA, NA,
NA, NA, "100000", "under", "200000", "5", NA, "10797979", "9193700",
"6678965", "30919226", "1429575727", "9782173", "1083175205",
"9092673", "23160862", "5824522", "25842394"), ...7 = c(NA, NA,
NA, NA, "200000", "or more", NA, "6", NA, "3589254", "3044028",
"2867282", "10424759", "2112995921", "3035907", "919655038",
"3405245", "69915518", "2807334", "90202431")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
In the first and third rows, I would like to use something like na.locf from zoo, but applied along each row (across the columns) rather than down each column, so that the result is the same as doing:
DF[1,3:7] <- "UNITED STATES [2]"
DF[3,5:7] <- "Size of adjusted gross income"
Apply na.locf row-wise:
DF[] <- t(apply(DF, 1, zoo::na.locf, na.rm = FALSE))
DF
# A tibble: 20 x 7
# toberevised ...2 ...3 ...4 ...5 ...6 ...7
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 [Money amounts are in th… UNITED ST… UNITED ST… UNITED STATES … UNITED STATES … UNITED STATES … UNITED STATES…
# 2 NA NA NA NA NA NA NA
# 3 NA NA NA Size of adjust… Size of adjust… Size of adjust… Size of adjus…
# 4 NA NA NA NA NA NA NA
# 5 Item All retur… Under 50000 75000 100000 200000
# 6 NA NA $50,000 [… under under under or more
# 7 NA NA NA 75000 100000 200000 200000
# 8 NA 1 2 3 4 5 6
# 9 NA NA NA NA NA NA NA
#10 Number of returns 135257620 92150166 18221115 10499106 10797979 3589254
#11 Number of joint returns 52607676 20743943 11329459 8296546 9193700 3044028
#12 Number with paid prepare… 80455243 53622647 11025624 6260725 6678965 2867282
#13 Number of exemptions 273738434 159649737 44189517 28555195 30919226 10424759
#14 Adjusted gross income (A… 7364640131 1797097083 1119634632 905336768 1429575727 2112995921
#15 Salaries and wages in AG… 114060887 75422766 16299827 9520214 9782173 3035907
#16 Salaries and wages in AG… 5161583318 1541276272 896339313 721137490 1083175205 919655038
#17 Taxable interest: Number 59553985 28527550 10891905 7636612 9092673 3405245
#18 Taxable interest: Amount 161324824 39043002 16353293 12852148 23160862 69915518
#19 Ordinary dividends: Num… 31158675 13174923 5255958 4095938 5824522 2807334
#20 Ordinary dividends: Amou… 164247298 23867893 12810282 11524298 25842394 90202431
As suggested by @G. Grothendieck, na.locf0 is a better candidate here:
DF[] <- t(apply(DF, 1, zoo::na.locf0))
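To make clear what the row-wise fill does, here is a sketch of last-observation-carried-forward without zoo. The helper `locf()` and the small matrix `m` are hypothetical, written only for illustration; zoo::na.locf0 remains the practical choice.

```r
# Hypothetical helper: carry the last non-NA value forward along a vector
locf <- function(v) {
  idx <- cumsum(!is.na(v))          # how many non-NA values seen so far
  c(NA, v[!is.na(v)])[idx + 1]      # position 1 (NA) covers leading NAs
}

# Two rows shaped like the header rows in the question
m <- rbind(c("UNITED STATES", NA, NA),
           c(NA, "Size of AGI", NA))

# apply() returns one column per input row, so transpose back
filled <- t(apply(m, 1, locf))
filled
```

Leading NAs stay NA because there is nothing to carry forward yet, which is the same behavior as na.locf with na.rm = FALSE.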

Filling NA in timeseries data with different interpolation techniques

df <- data.frame(
Time = c("7/16/2017 18:46", "7/16/2017 21:52",
"7/16/2017 23:16", "7/17/2017 4:03", "7/17/2017 5:13", "7/17/2017 5:27",
"7/17/2017 18:57", "7/17/2017 19:25", "7/17/2017 23:58", "7/18/2017 2:59",
"7/18/2017 3:27", "7/18/2017 3:59"),
Flux = c(NA, NA, 4.51263406,
NA, NA, 2.291454049, NA, 4.568703192, NA, NA, 3.392520428, NA),
int = c(403.5413091, 421.5796345, NA, 410.0796897, NA, NA,
363.5271212, NA, NA, 398.9564539, NA, NA),
corr = c(422.745436,
447.6726631, NA, 420.4392183, NA, NA, 408.7056493, NA, NA, 421.8799971,
NA, NA),
dat = c(NA, NA, NA, NA, 2.316481462, NA, NA, NA, 7.11779784,
NA, NA, 2.953349661),
stringsAsFactors = FALSE)
# Parse the timestamp strings into date-times
df$Time <- as.POSIXct(strptime(df$Time, format="%m/%d/%Y %H:%M"))
which will look like...
Time Flux int corr dat
7/16/2017 18:46 NA 403.5413091 422.745436 NA
7/16/2017 21:52 NA 421.5796345 447.6726631 NA
7/16/2017 23:16 4.51263406 NA NA NA
7/17/2017 4:03 NA 410.0796897 420.4392183 NA
7/17/2017 5:13 NA NA NA 2.316481462
7/17/2017 5:27 2.291454049 NA NA NA
7/17/2017 18:57 NA 363.5271212 408.7056493 NA
7/17/2017 19:25 4.568703192 NA NA NA
7/17/2017 23:58 NA NA NA 7.11779784
7/18/2017 2:59 NA 398.9564539 421.8799971 NA
7/18/2017 3:27 3.392520428 NA NA NA
7/18/2017 3:59 NA NA NA 2.953349661
I have five columns (one time column and four continuous columns). I have many NA values in each data column. I want to interpolate and fill the NA values for all columns. Since I don't know which interpolation method I need, I would like to try several interpolation methods (linear, spline, etc.). I tried na.approx but it didn't work.
Any help?
If you want to try and compare several interpolation methods as stated, you can use the na_interpolation() function from the imputeTS package (it was called na.interpolation() in imputeTS versions before 3.0).
For linear interpolation:
library("imputeTS")
na_interpolation(df, option = "linear")
For spline interpolation:
library("imputeTS")
na_interpolation(df, option = "spline")
For Stineman interpolation:
library("imputeTS")
na_interpolation(df, option = "stine")
So as you can see, you just have to adapt the option parameter.
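For comparison, linear interpolation can also be sketched with base R's approx(), which shows what the "linear" option amounts to. The vector `flux` is made up for illustration, and interpolating over the row index assumes evenly spaced observations; with irregular timestamps like those in the question, as.numeric(df$Time) could be used as the x values instead.

```r
# Made-up series with leading, interior and trailing NAs
flux <- c(NA, 2, NA, NA, 8, NA)

idx <- seq_along(flux)
ok  <- !is.na(flux)

# Interpolate linearly between known points; rule = 2 repeats the nearest
# known value at the ends instead of leaving NAs there
filled <- approx(x = idx[ok], y = flux[ok], xout = idx, rule = 2)$y
filled
```

The rule argument matters for series that start or end with NA: with the default rule = 1, approx() leaves NA wherever there is no known value on both sides.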
Another option is fill() from tidyr:
df <- tidyr::fill(df, dplyr::everything())
But I don't know which technique it uses to fill the NA values.

Construct new column from last non-NA values for each row [duplicate]

I have a data frame Depth which consists of LON and LAT coordinates with the corresponding temperature data at each depth. For each coordinate pair (LON and LAT) I would like to pull the last recorded (deepest non-NA) value into a new data frame.
> Depth<-read.csv('depthdata.csv')
> head(Depth)
LAT LON X150 X175 X200 X225 X250 X275 X300 X325 X350 X375 X400 X425 X450
1 -78.375 -163.875 -1.167 -1.0 NA NA NA NA NA NA NA NA NA NA NA
2 -78.125 -168.875 -1.379 -1.3 -1.259 -1.6 -1.476 -1.374 -1.507 NA NA NA NA NA NA
3 -78.125 -167.625 -1.700 -1.7 -1.700 -1.7 NA NA NA NA NA NA NA NA NA
4 -78.125 -167.375 -2.100 -2.2 -2.400 -2.3 -2.200 NA NA NA NA NA NA NA NA
5 -78.125 -167.125 -1.600 -1.6 -1.600 -1.6 NA NA NA NA NA NA NA NA NA
6 -78.125 -166.875 NA NA NA NA NA NA NA NA NA NA NA NA NA
so that I end up with this:
LAT LON
-78.375 -163.875 -1
-78.125 -168.875 -1.507
-78.125 -167.625 -1.7
-78.125 -167.375 -2.2
-78.125 -167.125 -1.6
-78.125 -166.875 NA
I tried the tail() function but did not get the desired result.
As I understand it, you want the last non-NA value in each row, for all columns except the first two.
We can use max.col() along with is.na() with our relevant columns to get us the column number for the last non-NA value. 2 is added (shown by + 2L) to compensate for the removal of the first two columns (shown by [-(1:2)]).
idx <- max.col(!is.na(Depth[-(1:2)]), ties.method = "last") + 2L
We can use idx in cbind() to create an index matrix for retrieving the values.
Depth[cbind(seq_len(nrow(Depth)), idx)]
# [1] -1.000 -1.507 -1.700 -2.200 -1.600 NA
Bind this together with the first two columns of the original data with cbind() and we're done.
cbind(Depth[1:2], LAST = Depth[cbind(seq_len(nrow(Depth)), idx)])
# LAT LON LAST
# 1 -78.375 -163.875 -1.000
# 2 -78.125 -168.875 -1.507
# 3 -78.125 -167.625 -1.700
# 4 -78.125 -167.375 -2.200
# 5 -78.125 -167.125 -1.600
# 6 -78.125 -166.875 NA
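The role of ties.method = "last" is easiest to see on a tiny example. The matrix `m` below is made up; it has one row with two values followed by NAs and one row that is entirely NA.

```r
# Made-up depth readings: row 1 has two values, row 2 has none
m <- rbind(c(-1.2, -1.0, NA, NA),
           c(NA,   NA,   NA, NA))

# TRUE marks a non-NA cell; TRUE counts as the row maximum
idx <- max.col(!is.na(m), ties.method = "last")
idx   # last TRUE in row 1; last column in row 2, where every cell ties

# Matrix indexing with (row, column) pairs pulls one value per row
last_vals <- m[cbind(seq_len(nrow(m)), idx)]
last_vals
```

For the all-NA row every cell ties at FALSE, so the last column is picked and the extracted value is NA, which is why row 6 of the result above is NA without any special handling.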
Data:
Depth <- structure(list(LAT = c(-78.375, -78.125, -78.125, -78.125, -78.125,
-78.125), LON = c(-163.875, -168.875, -167.625, -167.375, -167.125,
-166.875), X150 = c(-1.167, -1.379, -1.7, -2.1, -1.6, NA), X175 = c(-1,
-1.3, -1.7, -2.2, -1.6, NA), X200 = c(NA, -1.259, -1.7, -2.4,
-1.6, NA), X225 = c(NA, -1.6, -1.7, -2.3, -1.6, NA), X250 = c(NA,
-1.476, NA, -2.2, NA, NA), X275 = c(NA, -1.374, NA, NA, NA, NA
), X300 = c(NA, -1.507, NA, NA, NA, NA), X325 = c(NA, NA, NA,
NA, NA, NA), X350 = c(NA, NA, NA, NA, NA, NA), X375 = c(NA, NA,
NA, NA, NA, NA), X400 = c(NA, NA, NA, NA, NA, NA), X425 = c(NA,
NA, NA, NA, NA, NA), X450 = c(NA, NA, NA, NA, NA, NA)), .Names = c("LAT",
"LON", "X150", "X175", "X200", "X225", "X250", "X275", "X300",
"X325", "X350", "X375", "X400", "X425", "X450"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Resources