I have a data frame (a table with 100 rows, one per country, and 28 monthly columns between 2020 and 2022). I used na_kalman() from the imputeTS package to replace my many NA values with estimated values, and that works fine. However, when I then try to plot with ggplot_na_imputations() or ggplot_na_distribution(), I get the error "Input x_with_na is not numeric". I think the solution is to convert my data frame into a time series ('ts') object. Any suggestions?
This is what I have:
total_tests_imp <- na_kalman(total_tests_md)
ggplot_na_imputations(x_with_na = total_tests_md, x_with_imputations = total_tests_imp)
ggplot_na_distribution(total_tests_md)
P.S. When I run class(total_tests_md), it returns:
[1] "tbl_df" "tbl" "data.frame"
When I run head(total_tests_md):
# A tibble: 6 x 29
countries jan_20 fev_20 mar_20 abr_20 mai_20 jun_20 jul_20 ago_20 set_20 out_20 nov_20 dez_20 jan_21 fev_21 mar_21 abr_21
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 Albania NA 0.009 0.54 2.83 5.08 8.19 12.9 20.3 29.1 42.0 61.7 86.2 119. 155. 187. 214.
3 Algeria NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
4 Andorra NA NA NA NA NA NA NA NA 691. 1033. 1405. 1613. 1819. 2003. 2175. 2335.
5 Angola NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 Argentina 0.013 0.015 0.162 1.55 4.44 9.91 19.7 34.3 52.3 74.3 92.3 112. 143. 172. 204. 257.
# ... with 12 more variables: mai_21 <dbl>, jun_21 <dbl>, jul_21 <dbl>, ago_21 <dbl>, set_21 <dbl>, out_21 <dbl>,
# nov_21 <dbl>, dez_21 <dbl>, jan_22 <dbl>, fev_22 <dbl>, mar_22 <dbl>, abr_22 <dbl>
dput(head(total_tests_md))
structure(list(countries = c("Afghanistan", "Albania", "Algeria",
"Andorra", "Angola", "Argentina"), jan_20 = c(NA, NA, NA, NA,
NA, 0.013), fev_20 = c(NA, 0.009, NA, NA, NA, 0.015), mar_20 = c(NA,
0.54, NA, NA, NA, 0.162), abr_20 = c(NA, 2.831, NA, NA, NA, 1.546
), mai_20 = c(NA, 5.083, NA, NA, NA, 4.445), jun_20 = c(NA, 8.192,
NA, NA, NA, 9.913), jul_20 = c(NA, 12.852, NA, NA, NA, 19.719
), ago_20 = c(NA, 20.317, NA, NA, NA, 34.32), set_20 = c(NA,
29.089, NA, 691.095, NA, 52.255), out_20 = c(NA, 42.031, NA,
1033.495, NA, 74.307), nov_20 = c(NA, 61.658, NA, 1404.711, NA,
92.271), dez_20 = c(NA, 86.158, NA, 1613.414, NA, 112.404), jan_21 = c(NA,
119.428, NA, 1819.053, NA, 143.415), fev_21 = c(NA, 154.702,
NA, 2003.284, NA, 171.576), mar_21 = c(NA, 186.772, NA, 2174.988,
NA, 203.784), abr_21 = c(NA, 214.329, NA, 2335.148, NA, 257.398
), mai_21 = c(NA, 243.676, NA, 2480.234, NA, 317.92), jun_21 = c(NA,
271.086, NA, 2543.915, NA, 375.2), jul_21 = c(NA, 299.727, NA,
2621.83, NA, 433.25), ago_21 = c(NA, 352.728, NA, 2709.918, NA,
492.053), set_21 = c(NA, 404.621, NA, 2767.717, NA, 528.764),
out_21 = c(NA, 439.925, NA, 2850.247, NA, 556.29), nov_21 = c(NA,
467.614, NA, 3006.839, NA, 580.944), dez_21 = c(NA, 495.44,
NA, 3449.208, NA, 627.339), jan_22 = c(21.413, 543.967, NA,
3840.758, 40.321, 730.777), fev_22 = c(22.328, 552.997, NA,
3882.243, 41.965, 756.948), mar_22 = c(22.695, 556.666, 5.167,
NA, 43.944, 777.078), abr_22 = c(NA, 558.412, NA, NA, 44.198,
783.816)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
ggplot_na_imputations() and ggplot_na_distribution() expect a one-dimensional input, either a numeric vector or a ts object, as stated in the function documentation:
https://www.rdocumentation.org/packages/imputeTS/versions/3.2/topics/ggplot_na_imputations
So you need to extract each country's row from the data frame as a numeric vector and plot one country at a time. To convert such a vector into a time series, see:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ts.html
Your data
total_tests_md <- structure(list(countries = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Argentina"),
jan_20 = c(NA, NA, NA, NA, NA, 0.013),
fev_20 = c(NA, 0.009, NA, NA, NA, 0.015),
mar_20 = c(NA, 0.54, NA, NA, NA, 0.162),
abr_20 = c(NA, 2.831, NA, 0.3, NA, 1.546)),
row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Import your libraries
library(zoo)
library(imputeTS)
Convert your data.frame into a vector
# remove country name
Albania <- total_tests_md[2,-1]
Albania <- as.numeric(Albania)
# create month vector
month <- seq(as.Date("2020-01-01"), as.Date("2020-04-01"), by = "month")
Using a time series (zoo object)
# working with a time series
Albaniats <- zoo(Albania, month)
# stand-in for the imputed series: fill the first (missing) value by hand
AlbaniatsInput <- Albaniats
AlbaniatsInput[1] <- 0.5
ggplot_na_imputations(x_with_na = Albaniats,
                      x_with_imputations = AlbaniatsInput,
                      x_axis_labels = index(Albaniats))
ggplot_na_distribution(Albaniats,
                       x_axis_labels = index(Albaniats))
Using a plain numeric vector
# working with a numeric vector
# stand-in for the imputed vector: fill the first (missing) value by hand
AlbaniaInput <- Albania
AlbaniaInput[1] <- 0.5
ggplot_na_imputations(x_with_na = Albania,
                      x_with_imputations = AlbaniaInput,
                      x_axis_labels = month)
ggplot_na_distribution(Albania,
                       x_axis_labels = month)
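The answer links to ts() but demonstrates zoo, so here is a minimal base R sketch of the ts route. It assumes the data are monthly and start in January 2020, as in the question; row_to_ts is a hypothetical helper name, and the small data frame only reuses a few values from the question.

```r
# Cut-down stand-in for total_tests_md (values taken from the question)
total_tests_md <- data.frame(
  countries = c("Albania", "Argentina"),
  jan_20 = c(NA, 0.013),
  fev_20 = c(0.009, 0.015),
  mar_20 = c(0.54, 0.162),
  abr_20 = c(2.831, 1.546)
)

# Hypothetical helper: turn row i (without the country column) into a monthly ts.
# start = c(year, month) and frequency = 12 are assumptions about the data layout.
row_to_ts <- function(df, i, start = c(2020, 1)) {
  values <- as.numeric(unlist(df[i, -1]))
  ts(values, start = start, frequency = 12)
}

albania_ts <- row_to_ts(total_tests_md, 1)
albania_ts
```

An object built this way is one-dimensional and numeric, so it should be usable as x_with_na in the plotting calls above.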
I am working on a dataset for a welfare wage subsidy program, where wages per worker are structured as follows:
df <- structure(list(wage_1990 = c(13451.67, 45000, 10301.67, NA, NA,
8726.67, 11952.5, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA,
NA, 9881.67, 5483.33, 12868.33, 9321.67), wage_1991 = c(13451.67,
45000, 10301.67, NA, NA, 8750, 11952.5, NA, NA, 7140, NA, NA,
10301.67, 7303.33, NA, NA, 9881.67, 5483.33, 12868.33, 9321.67
), wage_1992 = c(13451.67, 49500, 10301.67, NA, NA, 8750, 11952.5,
NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67, NA,
12868.33, 9321.67), wage_1993 = c(NA, NA, 10301.67, NA, NA, 8750,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1994 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1995 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1996 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7291.67, NA, NA, 10301.67, 7303.33, NA, NA,
9881.67, NA, NA, 9321.67)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L))
I have tried one proposed solution, which is running this code after the one above:
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
But I keep getting this error:
Error in dim(X) <- c(n, length(X)/n) : dims [product 60000] do not match the length of object [65051]
I want to create a variable showing the average annual wage growth rate for each worker (or the lack thereof).
The practical issue is that each worker is one row, and while the first worker joined the program in 1990, others might have joined in, say, 1992 or 1993. Is there a way to compute the growth rate for each worker based on the specific years they worked, rather than applying one general growth formula to all observations?
My expected output is a new column with one value per row:
average wage growth rate
1- 15%
2- 9%
3- 12%
After running the following code to see descriptive statistics of my variable of interest:
skim(df$average_growth_rate)
I get the following result:
"Variable contains Inf or -Inf value(s) that were converted to NA.── Data Summary ────────────────────────
Values
Name gosi_beneficiary_growth$a...
Number of rows 3671
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 data 1348 0.633 Inf Inf -1 -0.450 0 0.0568
"
I am not sure why my mean and standard deviation values are Inf.
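One likely cause, assuming some workers have a recorded wage of 0: dividing a later wage by 0 gives Inf in R, and a single Inf makes the mean Inf. The p0 of -1 in the summary above points the same way, since a rate of -1 means a wage dropped to 0. The wages below are made up for illustration.

```r
# Hypothetical worker whose first recorded wage is 0
wages <- c(0, 1200, 1300)

# Same growth formula as in the question: current / previous - 1
growth <- wages[-1] / wages[-length(wages)] - 1
growth        # the first element is Inf because 1200 / 0 is Inf
mean(growth)  # Inf

# One option: average over the finite rates only
finite_mean <- mean(growth[is.finite(growth)])
finite_mean
```

Whether dropping the infinite rates is appropriate depends on what a wage of 0 means in the data; treating those years as missing before computing growth is another option.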
Here is one approach:
library(tidyverse)
growth <- df %>%
  rowid_to_column() %>%
  gather(key, value, -rowid) %>%   # wide to long: one row per worker and year
  drop_na() %>%                    # keep only the years with a recorded wage
  arrange(rowid, key) %>%
  group_by(rowid) %>%
  mutate(yoy = value / lag(value) - 1) %>%   # year-over-year growth
  summarise(average_growth_rate = mean(yoy, na.rm = TRUE))
# A tibble: 12 x 2
rowid average_growth_rate
<int> <dbl>
1 1 0
2 2 0.05
3 3 0
4 6 0.00422
5 7 0.0000813
6 10 0.00354
7 13 0
8 14 0
9 17 0
10 18 0
11 19 0
12 20 0
To show that all these 0s are expected, here is the data frame:
> head(df)
# A tibble: 6 x 7
wage_1990 wage_1991 wage_1992 wage_1993 wage_1994 wage_1995 wage_1996
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13452. 13452. 13452. NA NA NA NA
2 45000 45000 49500 NA NA NA NA
3 10302. 10302. 10302. 10302. 10302. 10302. 10302.
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 8727. 8750 8750 8750 8948. 8948. 8948.
Here you can see that, for example, the first row shows neither growth nor decline. In the second row, the wage rose by 10% between the second and third year but did not change between the first and second, so the average is 5%. The third row again shows no change at all, and so on.
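As a quick check of that arithmetic, the second row can be worked through by hand. This is only a sketch using the three non-missing wages of row 2.

```r
# Non-missing wages of row 2 (1990 to 1992)
row2 <- c(45000, 45000, 49500)

# Year-over-year growth: each wage divided by the previous one, minus 1
yoy <- row2[-1] / row2[-length(row2)] - 1
yoy           # no change in the first step, 10% growth in the second

avg_growth <- mean(yoy)
avg_growth    # average of the two yearly rates
```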
Finally, to add these results to the initial data frame, you could do, e.g.:
df %>%
  rowid_to_column() %>%
  left_join(growth, by = "rowid")
To answer the performance question, here is a benchmark (I changed akrun's data.frame call to a tibble call to make sure no difference comes from that). All functions below only compute the growth rates; they do not merge the result back into the original data frame.
library(microbenchmark)
microbenchmark(cj(), akrun(), akrun2())
Unit: microseconds
expr min lq mean median uq max neval cld
cj() 5577.301 5820.501 6122.076 5988.551 6244.301 10646.9 100 c
akrun() 998.301 1097.252 1559.144 1160.450 1212.552 28704.5 100 a
akrun2() 2033.801 2157.101 2653.018 2258.052 2340.702 34143.0 100 b
Base R is the clear winner in terms of performance.
We can use base R with apply. Loop over the rows with MARGIN = 1, remove the NA elements ('x1'), and take the mean of the ratio of each element to the previous one, minus 1.
average_growth_rate <- apply(df, 1, function(x) {
x1 <- x[!is.na(x)]
mean(x1[-1]/x1[-length(x1)]-1)})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate),]
# rowid average_growth_rate
#1 1 0.00000000000
#2 2 0.05000000000
#3 3 0.00000000000
#6 6 0.00422328325
#7 7 0.00008129401
#10 10 0.00354038282
#13 13 0.00000000000
#14 14 0.00000000000
#17 17 0.00000000000
#18 18 0.00000000000
#19 19 0.00000000000
#20 20 0.00000000000
Or using tapply/stack
# tail(., -1) / head(., -1) is current / previous, matching the apply() version
na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
  mean(tail(na.omit(x), -1)/head(na.omit(x), -1) - 1))))[2:1]
From the documentation I understand that invisible() returns a (temporarily) invisible copy of an object. However, when I use invisible(), I always need to call the object twice before it is actually printed.
I use data.table and would like my function to return an invisible copy of the object when a certain condition is met (i.e. an early exit from the function).
I've noticed that this "needs two calls" behaviour also applies when the invisibly returned object is used inside another function, which makes it seemingly unusable. What causes this behaviour? Am I doing something wrong? How do I get the function to return invisibly and still have the object print on the first call afterwards?
Please see sample code below:
example <- function(DT) {
  if (!(1 %in% DT$RSI.verticalBottom) | !(1 %in% DT$RSI.top)) {
    # abort if there is no buy or sell signal
    DT[, `:=`(pos = NA,
              return = NA
    )]
    return(invisible(DT))
  }
  # ... (rest of the function not shown in the original post)
}
> example(sample.data)
> sample.data
> sample.data
conm tic datadate cshoq gind year month yearmon fdateq pdateq fyr fyearq fqtr
1: NS GROUP INC NSS.1 2000-01-31 NA 101010 2000 1 2000_1 NA <NA> NA NA NA
2: NS GROUP INC NSS.1 2000-02-29 NA 101010 2000 2 2000_2 NA <NA> NA NA NA
3: NS GROUP INC NSS.1 2000-03-31 21.533 101010 2000 3 2000_3 NA <NA> 9 2000 2
4: NS GROUP INC NSS.1 2000-04-30 NA 101010 2000 4 2000_4 NA <NA> NA NA NA
5: NS GROUP INC NSS.1 2000-05-31 NA 101010 2000 5 2000_5 NA <NA> NA NA NA
6: NS GROUP INC NSS.1 2000-06-30 22.008 101010 2000 6 2000_6 NA <NA> 9 2000 3
req epspiq epspxq ajexq saleq saley ivncfy gsubind dpq ibmiiq ibq iby oiadpq
1: NA NA NA NA NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA NA NA NA NA
3: -58.396 -0.38 -0.38 1 100.107 186.733 10.77 10101020 5.517 NA -8.231 -21.165 -5.617
4: NA NA NA NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA NA NA NA
6: -63.168 -0.19 -0.23 1 73.652 260.385 20.90 10101020 NA NA -5.048 -26.213 NA
oiadpy oibdpq oibdpy xiq xoprq cogsy dlcchy wcapchy QEBIT.adep YEBIT.adep QEBIT.bdep
1: NA NA NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA NA NA
3: -16.924 -0.1 -5.57 0 100.207 177.826 -0.394 NA -0.05610996 -0.09063208 -0.0009989311
4: NA NA NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA NA NA
6: NA NA NA 0 NA NA -0.394 NA NA NA NA
YEBIT.bdep QEBT YEBT f_id I.QSales IWA.QEBIT IWA.QEBT I.YSales IWA.YEBIT
1: NA NA NA NA NA NA NA NA NA
2: NA NA NA NA NA NA NA NA NA
3: -0.000535524 -0.08222202 -0.1133437 2000Q2 19344.53 0.08160277 0.03577741 196223.7 0.08329726
4: NA NA NA NA NA NA NA NA NA
5: NA NA NA NA NA NA NA NA NA
6: NA -0.06853853 -0.1006702 2000Q3 19798.64 0.10680607 0.06096211 196223.7 0.08329726
IWA.YEBT QSales.pc YSales.pc RSI_QEBIT RSI_QEBT RSI_IWA.QEBIT RSI_IWA.QEBT adj.factor
1: NA NA NA NA NA NA NA 1
2: NA NA NA NA NA NA NA 1
3: 0.03875869 0.005174952 0.0009516334 41.45963 32.93934 29.96487 18.23527 1
4: NA NA NA NA NA NA NA 1
5: NA NA NA NA NA NA NA 1
6: 0.03875869 0.003720053 0.0013269806 49.83110 34.64800 37.58678 24.75847 1
dvpsxm cshtrm curcdm close high low trfm trt1m close.unAdj mktcap close.div
1: NA 4557500 USD 8.8750 10.1250 6.7500 1.0409 16.3934 8.8750 NA 8.8750
2: NA 4506100 USD 11.6875 12.1250 8.0625 1.0409 31.6901 11.6875 NA 11.6875
3: NA 4146200 USD 16.3125 16.8125 11.3750 1.0409 39.5722 16.3125 351.2571 16.3125
4: NA 3215400 USD 15.8750 16.3750 12.8750 1.0409 -2.6820 15.8750 NA 15.8750
5: NA 2948800 USD 18.3125 19.3750 16.0625 1.0409 15.3543 18.3125 NA 18.3125
6: NA 4296100 USD 20.9375 21.0000 17.7500 1.0409 14.3345 20.9375 460.7925 20.9375
RSI_close RSI.verticalBottom RSI.top return pos
1: NA NA NA NA NA
2: NA NA NA NA NA
3: NA NA NA NA NA
4: NA NA NA NA NA
5: NA NA NA NA NA
6: NA NA NA NA NA
Sample data
> dput(sample.data)
structure(list(conm = c("NS GROUP INC", "NS GROUP INC", "NS GROUP INC",
"NS GROUP INC", "NS GROUP INC", "NS GROUP INC"), tic = c("NSS.1",
"NSS.1", "NSS.1", "NSS.1", "NSS.1", "NSS.1"), datadate = structure(c(10987,
11016, 11047, 11077, 11108, 11138), class = "Date"), cshoq = c(NA,
NA, 21.533, NA, NA, 22.008), gind = c(101010L, 101010L, 101010L,
101010L, 101010L, 101010L), year = c(2000, 2000, 2000, 2000,
2000, 2000), month = c(1, 2, 3, 4, 5, 6), yearmon = c("2000_1",
"2000_2", "2000_3", "2000_4", "2000_5", "2000_6"), fdateq = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), pdateq = structure(c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), class = "Date"), fyr = c(NA, NA, 9L, NA,
NA, 9L), fyearq = c(NA, NA, 2000L, NA, NA, 2000L), fqtr = c(NA,
NA, 2L, NA, NA, 3L), req = c(NA, NA, -58.396, NA, NA, -63.168
), epspiq = c(NA, NA, -0.38, NA, NA, -0.19), epspxq = c(NA, NA,
-0.38, NA, NA, -0.23), ajexq = c(NA, NA, 1, NA, NA, 1), saleq = c(NA,
NA, 100.107, NA, NA, 73.652), saley = c(NA, NA, 186.733, NA,
NA, 260.385), ivncfy = c(NA, NA, 10.77, NA, NA, 20.9), gsubind = c(NA,
NA, 10101020L, NA, NA, 10101020L), dpq = c(NA, NA, 5.517, NA,
NA, NA), ibmiiq = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), ibq = c(NA, NA, -8.231, NA, NA, -5.048), iby = c(NA,
NA, -21.165, NA, NA, -26.213), oiadpq = c(NA, NA, -5.617, NA,
NA, NA), oiadpy = c(NA, NA, -16.924, NA, NA, NA), oibdpq = c(NA,
NA, -0.1, NA, NA, NA), oibdpy = c(NA, NA, -5.57, NA, NA, NA),
xiq = c(NA, NA, 0, NA, NA, 0), xoprq = c(NA, NA, 100.207,
NA, NA, NA), cogsy = c(NA, NA, 177.826, NA, NA, NA), dlcchy = c(NA,
NA, -0.394, NA, NA, -0.394), wcapchy = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), QEBIT.adep = c(NA,
NA, -0.0561099623402959, NA, NA, NA), YEBIT.adep = c(NA,
NA, -0.0906320789576561, NA, NA, NA), QEBIT.bdep = c(NA,
NA, -0.000998931143676266, NA, NA, NA), YEBIT.bdep = c(NA,
NA, -0.000535523983441598, NA, NA, NA), QEBT = c(NA, NA,
-0.0822220224359935, NA, NA, -0.0685385325585184), YEBT = c(NA,
NA, -0.113343651095414, NA, NA, -0.100670161491637), f_id = c(NA,
NA, "2000Q2", NA, NA, "2000Q3"), I.QSales = c(NA, NA, 19344.526,
NA, NA, 19798.641), IWA.QEBIT = c(NA, NA, 0.0816027748625115,
NA, NA, 0.10680606815387), IWA.QEBT = c(NA, NA, 0.0357774080378087,
NA, NA, 0.0609621135107203), I.YSales = c(NA, NA, 196223.665,
NA, NA, 196223.665), IWA.YEBIT = c(NA, NA, 0.0832972567299668,
NA, NA, 0.0832972567299668), IWA.YEBT = c(NA, NA, 0.0387586889685299,
NA, NA, 0.0387586889685299), QSales.pc = c(NA, NA, 0.00517495233535316,
NA, NA, 0.00372005331072976), YSales.pc = c(NA, NA, 0.000951633433204909,
NA, NA, 0.00132698061673652), RSI_QEBIT = c(NA, NA, 41.4596290506163,
NA, NA, 49.8310957229999), RSI_QEBT = c(NA, NA, 32.939339100869,
NA, NA, 34.6480049470139), RSI_IWA.QEBIT = c(NA, NA, 29.9648696052066,
NA, NA, 37.5867809473848), RSI_IWA.QEBT = c(NA, NA, 18.2352737965041,
NA, NA, 24.7584711404174), adj.factor = c(1, 1, 1, 1, 1,
1), dvpsxm = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), cshtrm = c(4557500, 4506100, 4146200, 3215400,
2948800, 4296100), curcdm = c("USD", "USD", "USD", "USD",
"USD", "USD"), close = c(8.875, 11.6875, 16.3125, 15.875,
18.3125, 20.9375), high = c(10.125, 12.125, 16.8125, 16.375,
19.375, 21), low = c(6.75, 8.0625, 11.375, 12.875, 16.0625,
17.75), trfm = c(1.0409, 1.0409, 1.0409, 1.0409, 1.0409,
1.0409), trt1m = c(16.3934, 31.6901, 39.5722, -2.682, 15.3543,
14.3345), close.unAdj = c(8.875, 11.6875, 16.3125, 15.875,
18.3125, 20.9375), mktcap = c(NA, NA, 351.2570625, NA, NA,
460.7925), close.div = c(8.875, 11.6875, 16.3125, 15.875,
18.3125, 20.9375), RSI_close = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), RSI.verticalBottom = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), RSI.top = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), return = c(NA,
NA, NA, NA, NA, NA), pos = c(NA, NA, NA, NA, NA, NA)), .Names = c("conm",
"tic", "datadate", "cshoq", "gind", "year", "month", "yearmon",
"fdateq", "pdateq", "fyr", "fyearq", "fqtr", "req", "epspiq",
"epspxq", "ajexq", "saleq", "saley", "ivncfy", "gsubind", "dpq",
"ibmiiq", "ibq", "iby", "oiadpq", "oiadpy", "oibdpq", "oibdpy",
"xiq", "xoprq", "cogsy", "dlcchy", "wcapchy", "QEBIT.adep", "YEBIT.adep",
"QEBIT.bdep", "YEBIT.bdep", "QEBT", "YEBT", "f_id", "I.QSales",
"IWA.QEBIT", "IWA.QEBT", "I.YSales", "IWA.YEBIT", "IWA.YEBT",
"QSales.pc", "YSales.pc", "RSI_QEBIT", "RSI_QEBT", "RSI_IWA.QEBIT",
"RSI_IWA.QEBT", "adj.factor", "dvpsxm", "cshtrm", "curcdm", "close",
"high", "low", "trfm", "trt1m", "close.unAdj", "mktcap", "close.div",
"RSI_close", "RSI.verticalBottom", "RSI.top", "return", "pos"
), sorted = c("conm", "tic", "datadate", "cshoq", "gind", "year",
"month", "yearmon"), class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x102806978>)
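A likely explanation, as far as I know from the data.table FAQ: using := suppresses the next print of that data.table, and when := runs inside a function that suppression carries over to the first print at the console. The usual workaround is to call DT[] after the last := in the function. The sketch below assumes that behaviour; example_fixed and the two-row sample.data are hypothetical stand-ins for the objects in the question.

```r
library(data.table)

# Hypothetical cut-down version of example(); column names follow the question
example_fixed <- function(DT) {
  if (!(1 %in% DT$RSI.verticalBottom) || !(1 %in% DT$RSI.top)) {
    # abort if there is no buy or sell signal
    DT[, `:=`(pos = NA, return = NA)]
    # DT[] is meant to reset the "suppress next print" state left by :=
    return(invisible(DT[]))
  }
  invisible(DT)
}

sample.data <- data.table(RSI.verticalBottom = c(NA, NA), RSI.top = c(NA, NA))
res <- example_fixed(sample.data)  # nothing is printed: the result is invisible
sample.data                        # intended to print on the first call
```

Because := modifies by reference, sample.data gains the pos and return columns even though the function's result is returned invisibly.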
I want to (metaphorically) push up the values in the data frame in order to get rid of the gaps (NA values).
My Data:
> dput(df1)
structure(list(ID = c("CN1-1", "CN1-1", "CN1-1", "CN1-10", "CN1-10",
"CN1-10", "CN1-11", "CN1-11", "CN1-11", "CN1-12", "CN1-12", "CN1-12",
"CN1-13", "CN1-13", "CN1-13"), v1 = c(0.37673, NA, NA, 1.019972,
NA, NA, 0.515152, NA, NA, 0.375139, NA, NA, 0.508125, NA, NA),
v2 = c(NA, 0.732, NA, NA, 0, NA, NA, 0.748, NA, NA, 0.466,
NA, NA, 0.57, NA), v2 = c(NA, NA, 0.357, NA, NA, 0.816, NA,
NA, 0.519, NA, NA, 0.206, NA, NA, 0.464)), .Names = c("ID",
"v1", "v2", "v2"), row.names = c(NA, 15L), class = "data.frame")
>
Looks like:
ID v1 v2 v2
1 CN1-1 0.376730 NA NA
2 CN1-1 NA 0.732 NA
3 CN1-1 NA NA 0.357
4 CN1-10 1.019972 NA NA
5 CN1-10 NA 0.000 NA
6 CN1-10 NA NA 0.816
7 CN1-11 0.515152 NA NA
8 CN1-11 NA 0.748 NA
9 CN1-11 NA NA 0.519
10 CN1-12 0.375139 NA NA
11 CN1-12 NA 0.466 NA
12 CN1-12 NA NA 0.206
13 CN1-13 0.508125 NA NA
14 CN1-13 NA 0.570 NA
15 CN1-13 NA NA 0.464
Please note: I'm not sure the pattern is consistent across all rows. It is also possible that one or more variables are present two or more times per ID group.
Desired output:
ID v1 v2 v2
1 CN1-1 0.376730 0.732 0.357
2 CN1-10 1.019972 0.000 0.816
...
My idea was to melt, then drop all NA values, and then dcast. Is there a better approach?
EDIT:
A duplicated value could look like this:
16 CN1-x 0.508125 NA NA
17 CN1-x NA 0.570 NA
18 CN1-x NA NA 0.464
19 CN1-x NA NA 0.134
do.call(rbind,
lapply(split(df1, df1$ID), function(a)
data.frame(ID = a$ID[1], lapply(a[-1], sum, na.rm = TRUE))))
# ID v1 v2 v2.1
#CN1-1 CN1-1 0.376730 0.732 0.357
#CN1-10 CN1-10 1.019972 0.000 0.816
#CN1-11 CN1-11 0.515152 0.748 0.519
#CN1-12 CN1-12 0.375139 0.466 0.206
#CN1-13 CN1-13 0.508125 0.570 0.464
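Note that sum(..., na.rm = TRUE) adds up duplicated values within an ID (the CN1-x case from the EDIT) and returns 0 for a column that is entirely NA. If the first non-NA value per column is wanted instead, a minimal base R sketch could look like the following; first_non_na is a hypothetical helper, and v3 stands in for the second v2 column.

```r
# Small stand-in for df1; "CN1-x" has two values in the last column
df1 <- data.frame(
  ID = c("CN1-1", "CN1-1", "CN1-1", "CN1-x", "CN1-x", "CN1-x", "CN1-x"),
  v1 = c(0.37673, NA, NA, 0.508125, NA, NA, NA),
  v2 = c(NA, 0.732, NA, NA, 0.570, NA, NA),
  v3 = c(NA, NA, 0.357, NA, NA, 0.464, 0.134)
)

# Hypothetical helper: first non-NA value of a vector, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# One row per ID, keeping the first non-NA value in every column
collapsed <- aggregate(df1[-1], by = list(ID = df1$ID), FUN = first_non_na)
collapsed
```

Which duplicate to keep (first, last, mean) is a design choice the question leaves open; swapping first_non_na for another summary function changes that behaviour.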
I have a data frame Depth which consists of LAT and LON columns plus temperature readings at a series of depths. For each coordinate pair (LAT and LON) I would like to pull out the last recorded value across the depth columns into a new data frame:
> Depth<-read.csv('depthdata.csv')
> head(Depth)
LAT LON X150 X175 X200 X225 X250 X275 X300 X325 X350 X375 X400 X425 X450
1 -78.375 -163.875 -1.167 -1.0 NA NA NA NA NA NA NA NA NA NA NA
2 -78.125 -168.875 -1.379 -1.3 -1.259 -1.6 -1.476 -1.374 -1.507 NA NA NA NA NA NA
3 -78.125 -167.625 -1.700 -1.7 -1.700 -1.7 NA NA NA NA NA NA NA NA NA
4 -78.125 -167.375 -2.100 -2.2 -2.400 -2.3 -2.200 NA NA NA NA NA NA NA NA
5 -78.125 -167.125 -1.600 -1.6 -1.600 -1.6 NA NA NA NA NA NA NA NA NA
6 -78.125 -166.875 NA NA NA NA NA NA NA NA NA NA NA NA NA
so that I get this:
LAT LON
-78.375 -163.875 -1
-78.125 -168.875 -1.507
-78.125 -167.625 -1.7
-78.125 -167.375 -2.2
-78.125 -167.125 -1.6
-78.125 -166.875 NA
I tried the tail() function but did not get the desired result.
As I understand it, you want the last non-NA value in each row, for all columns except the first two.
We can use max.col() along with is.na() on the relevant columns to get the column number of the last non-NA value in each row: !is.na() marks the non-missing cells as TRUE, and ties.method = "last" makes max.col() pick the last TRUE. 2 is added (the + 2L) to compensate for the removal of the first two columns (the [-(1:2)]).
idx <- max.col(!is.na(Depth[-(1:2)]), ties.method = "last") + 2L
We can use idx in cbind() to create an index matrix for retrieving the values.
Depth[cbind(seq_len(nrow(Depth)), idx)]
# [1] -1.000 -1.507 -1.700 -2.200 -1.600 NA
Bind this together with the first two columns of the original data with cbind() and we're done.
cbind(Depth[1:2], LAST = Depth[cbind(seq_len(nrow(Depth)), idx)])
# LAT LON LAST
# 1 -78.375 -163.875 -1.000
# 2 -78.125 -168.875 -1.507
# 3 -78.125 -167.625 -1.700
# 4 -78.125 -167.375 -2.200
# 5 -78.125 -167.125 -1.600
# 6 -78.125 -166.875 NA
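To see why this index lands on the last non-NA column, it can help to run the same steps on a tiny matrix. This is only an illustrative sketch; the matrix m is made up.

```r
# Three rows: values then an NA, no NA at all, and all NA
m <- matrix(c(1,  2,  NA,
              1,  2,  3,
              NA, NA, NA), nrow = 3, byrow = TRUE)

# TRUE where a value is present; ties.method = "last" picks the last TRUE per row
idx <- max.col(!is.na(m), ties.method = "last")
idx

# Index matrix of (row, column) pairs pulls one value per row
last <- m[cbind(seq_len(nrow(m)), idx)]
last
```

For the all-NA row every cell ties at FALSE, so the last column is chosen and the extracted value is NA, which is the behaviour shown for row 6 above.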
Data:
Depth <- structure(list(LAT = c(-78.375, -78.125, -78.125, -78.125, -78.125,
-78.125), LON = c(-163.875, -168.875, -167.625, -167.375, -167.125,
-166.875), X150 = c(-1.167, -1.379, -1.7, -2.1, -1.6, NA), X175 = c(-1,
-1.3, -1.7, -2.2, -1.6, NA), X200 = c(NA, -1.259, -1.7, -2.4,
-1.6, NA), X225 = c(NA, -1.6, -1.7, -2.3, -1.6, NA), X250 = c(NA,
-1.476, NA, -2.2, NA, NA), X275 = c(NA, -1.374, NA, NA, NA, NA
), X300 = c(NA, -1.507, NA, NA, NA, NA), X325 = c(NA, NA, NA,
NA, NA, NA), X350 = c(NA, NA, NA, NA, NA, NA), X375 = c(NA, NA,
NA, NA, NA, NA), X400 = c(NA, NA, NA, NA, NA, NA), X425 = c(NA,
NA, NA, NA, NA, NA), X450 = c(NA, NA, NA, NA, NA, NA)), .Names = c("LAT",
"LON", "X150", "X175", "X200", "X225", "X250", "X275", "X300",
"X325", "X350", "X375", "X400", "X425", "X450"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))