I have a table test whose NA values I would like to approximate based on linear interpolation between values that do exist.
For example, the second row plotted looks like this:
v1 <- unlist(test[2,])
plot(v1[!is.na(v1)], names(v1)[!is.na(v1)], type="l", add = TRUE)
How would one go about interpolating/approximating the NA values along the x-axis in this case? Any suggestions in base R or dplyr would be helpful
test
variable 26500 30000 30100 30700 31600 33700 33800 33900 34000 34600 34800 35100 35200 35300
1 -20 NA 0 NA NA 10 20 NA NA NA 30 NA NA NA NA
2 -10 NA 0 NA NA NA 10 NA NA NA 20 NA NA NA 30
3 0 0 NA NA NA NA NA 10 NA NA NA 20 NA NA NA
4 24 NA NA NA 0 NA NA NA NA 10 NA NA NA 20 NA
5 40 NA NA 0 NA NA NA NA 10 NA NA NA 20 NA NA
6 55 NA NA 0 NA NA NA NA 10 NA NA NA 20 NA NA
35400 35600 35800 35900 36200 36300 36400 36700 36900 37000 37200 37800 37900 38000 38200
1 40 NA NA NA 50 NA NA NA NA NA 60 NA NA NA 70
2 NA NA NA 40 NA NA NA 50 NA NA NA 60 NA NA NA
3 NA 30 NA NA 40 NA NA NA 50 NA NA NA 60 NA NA
4 NA NA 30 NA NA 40 NA NA NA 50 NA NA NA 60 NA
5 NA NA 30 NA NA 40 NA NA NA 50 NA NA NA NA 60
6 NA NA NA 30 NA NA 40 NA NA 50 NA NA NA NA 60
38800 39000 39100 39200 39700 39800 39900 40000 40200 40600 40700 40800 41700 41800
1 NA NA NA 80 NA NA NA NA 90 NA NA NA 100 NA
2 70 NA NA NA 80 NA NA NA NA 90 NA NA 100 NA
3 70 NA NA NA NA 80 NA NA NA NA 90 NA 100 NA
4 NA 70 NA NA NA NA NA 80 NA NA NA 90 100 NA
5 NA NA 70 NA NA NA NA 80 NA NA NA 90 NA 100
6 NA 70 NA NA NA NA 80 NA NA NA NA 90 100 NA
Here is the sample data:
dput(test)
structure(list(variable = c(-20, -10, 0, 24, 40, 55), `26500` = c(NA,
NA, 0L, NA, NA, NA), `30000` = c(0L, 0L, NA, NA, NA, NA), `30100` = c(NA,
NA, NA, NA, 0L, 0L), `30700` = c(NA, NA, NA, 0L, NA, NA), `31600` = c(10L,
NA, NA, NA, NA, NA), `33700` = c(20L, 10L, NA, NA, NA, NA), `33800` = c(NA,
NA, 10L, NA, NA, NA), `33900` = c(NA, NA, NA, NA, 10L, 10L),
`34000` = c(NA, NA, NA, 10L, NA, NA), `34600` = c(30L, 20L,
NA, NA, NA, NA), `34800` = c(NA, NA, 20L, NA, NA, NA), `35100` = c(NA,
NA, NA, NA, 20L, 20L), `35200` = c(NA, NA, NA, 20L, NA, NA
), `35300` = c(NA, 30L, NA, NA, NA, NA), `35400` = c(40L,
NA, NA, NA, NA, NA), `35600` = c(NA, NA, 30L, NA, NA, NA),
`35800` = c(NA, NA, NA, 30L, 30L, NA), `35900` = c(NA, 40L,
NA, NA, NA, 30L), `36200` = c(50L, NA, 40L, NA, NA, NA),
`36300` = c(NA, NA, NA, 40L, 40L, NA), `36400` = c(NA, NA,
NA, NA, NA, 40L), `36700` = c(NA, 50L, NA, NA, NA, NA), `36900` = c(NA,
NA, 50L, NA, NA, NA), `37000` = c(NA, NA, NA, 50L, 50L, 50L
), `37200` = c(60L, NA, NA, NA, NA, NA), `37800` = c(NA,
60L, NA, NA, NA, NA), `37900` = c(NA, NA, 60L, NA, NA, NA
), `38000` = c(NA, NA, NA, 60L, NA, NA), `38200` = c(70L,
NA, NA, NA, 60L, 60L), `38800` = c(NA, 70L, 70L, NA, NA,
NA), `39000` = c(NA, NA, NA, 70L, NA, 70L), `39100` = c(NA,
NA, NA, NA, 70L, NA), `39200` = c(80L, NA, NA, NA, NA, NA
), `39700` = c(NA, 80L, NA, NA, NA, NA), `39800` = c(NA,
NA, 80L, NA, NA, NA), `39900` = c(NA, NA, NA, NA, NA, 80L
), `40000` = c(NA, NA, NA, 80L, 80L, NA), `40200` = c(90L,
NA, NA, NA, NA, NA), `40600` = c(NA, 90L, NA, NA, NA, NA),
`40700` = c(NA, NA, 90L, NA, NA, NA), `40800` = c(NA, NA,
NA, 90L, 90L, 90L), `41700` = c(100L, 100L, 100L, 100L, NA,
100L), `41800` = c(NA, NA, NA, NA, 100L, NA)), row.names = c(NA,
-6L), class = "data.frame")
We could use na.interp from forecast
library(forecast)
test[-1] <- t(apply(test[-1], 1, na.interp))
Or with na.approx
test[-1] <- t(apply(test[-1], 1, na.approx, na.rm = FALSE))
then do the plotting
v1 <- unlist(test[2, -1])
plot(v1, names(v1), type = 'l')
If you want to switch easily between different interpolation methods (or time series imputation methods in general) you can also use the imputeTS package.
For the requested solution this would be:
library("imputeTS")
test[-1] <- t(apply(test[-1], 1, na_interpolation, option = "linear"))
Switching to Spline interpolation would look like this:
test[-1] <- t(apply(test[-1], 1, na_interpolation, option = "stine"))
Another option could be Stineman interpolation:
test[-1] <- t(apply(test[-1], 1, na_interpolation, option = "spline"))
Other imputation methods like na_ma (moving average imputation), na_kalman (Kalman smoothing on structural time series models) would be also possible, if you replace the na_interpolation with the specific function (see also GitHub package Readme for a imputation function overview).
Related
Problem statement:
I actually want to eliminate from further analysis columns that have identical values in all cells. In order to do this, I want to find the columns that have identical values.
I wrote the following code which seems to be working for the dataframe test but not for the real dataframe stpo
library("dplyr")
library("purrr")
test_unique <- function(x)
{
return(length(unique(x)))
}
test <-data.frame(c1 = c("a", "a"), c2 = c(NA, NA), c3 = c(1,2), c4=c(NA, 4))
# What I want to find out the columns that have the same value throughout
res <- map(test[,c(names(test))], test_unique)
res
# But when I try to apply the same thing to the dataset below, it does not work.
# Not sure what the reason is. Is there a better way to do this? Perhaps using data.table? What am I doing wrong?
res2 <- map(stpo[,c(names(stpo))], test_unique)
res2
I am not exactly sure how to put the result of dput. I am putting this below (this is the dataframe stpo)
structure(list(stlnr = c(1L, 2L, 3L, 3L, 3L, 3L, 4L), stlkn = c(1L,
1L, 1L, 2L, 3L, 4L, 5L), stpoz = c(2L, 2L, 2L, 4L, 6L, 8L, 10L
), aennr = c(NA, NA, NA, NA, NA, NA, NA), vgknt = c(0L, 0L, 0L,
0L, 0L, 0L, 0L), idnrk = c("test_1", "test_1", "test_2", "test_3",
"test_3", "test_1", "test_2"), pswrk = c(NA, NA, NA, NA, NA,
NA, NA), meins = c("EA", "EA", "EA", "EA", "EA", "EA", "EA"),
menge = c(1, 14, 4, 4, 2, 2, 1), fmeng = c(NA, NA, NA, NA,
NA, NA, NA), ausch = c(0, 0, 0, 0, 0, 0, 0), avoau = c(0,
0, 0, 0, 0, 0, 0), netau = c(NA, NA, NA, NA, NA, NA, NA),
erskz = c(NA, NA, NA, NA, NA, NA, NA), rekri = c(NA, NA,
NA, NA, NA, NA, NA), rekrs = c(NA, NA, NA, NA, NA, NA, NA
), nlfzt = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), verti = c(NA, NA,
NA, NA, NA, NA, NA), alpos = c(NA, NA, NA, NA, NA, NA, NA
), ewahr = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), ekgrp = c(NA, NA,
NA, NA, NA, NA, NA), lifzt = c(0L, 0L, 0L, 0L, 0L, 0L, 0L
), lifnr = c(NA, NA, NA, NA, NA, NA, NA), roms1 = c(0, 0,
0, 0, 0, 0, 0), roms2 = c(0, 0, 0, 0, 0, 0, 0), roms3 = c(0,
0, 0, 0, 0, 0, 0), romen = c(0, 0, 0, 0, 0, 0, 0), rform = c(NA,
NA, NA, NA, NA, NA, NA), upskz = c(NA, NA, NA, NA, NA, NA,
NA), valkz = c(NA, NA, NA, NA, NA, NA, NA), matkl = c(NA,
NA, NA, NA, NA, NA, NA), webaz = c(0L, 0L, 0L, 0L, 0L, 0L,
0L), clobk = c(NA, NA, NA, NA, NA, NA, NA), lgort = c(NA,
NA, NA, NA, NA, NA, 14L), kzkup = c(NA, NA, NA, NA, NA, NA,
NA), dvnam = c(NA, NA, NA, NA, NA, NA, NA), dspst = c(NA,
NA, NA, NA, NA, NA, NA), alpst = c(NA, NA, NA, NA, NA, NA,
NA), alprf = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), alpgr = c(NA,
NA, NA, NA, NA, NA, NA), kstty = c(NA, NA, NA, NA, NA, NA,
NA), kstnr = c(NA, NA, NA, NA, NA, NA, NA), nlfzv = c(0L,
0L, 0L, 0L, 0L, 0L, 0L), nlfmv = c(NA, NA, NA, NA, NA, NA,
NA), idhis = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), idvar = c(NA,
NA, NA, NA, NA, NA, NA), itsob = c(NA, NA, NA, NA, NA, NA,
NA), cufactor = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), funcid = c(NA,
NA, NA, NA, NA, NA, NA)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000022534c51ef0>)
The issue is that we are subsetting on a data.table, rather than a data.frame. Here, we need with = FALSE (as mentioned in ?data.table
j - When with=TRUE (default), j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables.
stpo[,c(names(stpo))]
[1] "stlnr" "stlkn" "stpoz" "aennr" "vgknt" "idnrk" "pswrk" "meins" "menge" "fmeng" "ausch" "avoau" "netau" "erskz"
[15] "rekri" "rekrs" "nlfzt" "verti" "alpos" "ewahr" "ekgrp" "lifzt" "lifnr" "roms1" "roms2" "roms3" "romen" "rform"
[29] "upskz" "valkz" "matkl" "webaz" "clobk" "lgort" "kzkup" "dvnam" "dspst" "alpst" "alprf" "alpgr" "kstty" "kstnr"
[43] "nlfzv" "nlfmv" "idhis" "idvar" "itsob" "cufactor" "funcid"
Now, check the output of
stpo[,c(names(stpo)), with = FALSE]
stlnr stlkn stpoz aennr vgknt idnrk pswrk meins menge fmeng ausch avoau netau erskz rekri rekrs nlfzt verti alpos ewahr ekgrp lifzt lifnr roms1 roms2
1: 1 1 2 NA 0 test_1 NA EA 1 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
2: 2 1 2 NA 0 test_1 NA EA 14 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
3: 3 1 2 NA 0 test_2 NA EA 4 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
4: 3 2 4 NA 0 test_3 NA EA 4 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
5: 3 3 6 NA 0 test_3 NA EA 2 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
6: 3 4 8 NA 0 test_1 NA EA 2 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
7: 4 5 10 NA 0 test_2 NA EA 1 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
roms3 romen rform upskz valkz matkl webaz clobk lgort kzkup dvnam dspst alpst alprf alpgr kstty kstnr nlfzv nlfmv idhis idvar itsob cufactor funcid
1: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
2: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
3: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
4: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
5: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
6: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
7: 0 0 NA NA NA NA 0 NA 14 NA NA NA NA 0 NA NA NA 0 NA 0 NA
Also, there is no need to do any subsetting if the whole columns are used, i.e. simply do
purrr::map(stpo, test_unique)
-output
$stlnr
[1] 4
$stlkn
[1] 5
$stpoz
[1] 5
...
...
Regarding the use of
stpo[,1:length(names(stpo))]
It seems to be a bug or a hackish way of dealing things instead of the standard option
If we want to eliminate columns having a single value, use var (assuming all numeric columns)
Filter(var, stpo)
stlnr stlkn stpoz menge
1: 1 1 2 1
2: 2 1 2 14
3: 3 1 2 4
4: 3 2 4 4
5: 3 3 6 2
6: 3 4 8 2
7: 4 5 10 1
Or change the function to return a logical output (it will also check for other type columns)
f1 <- function(x) length(unique(x)) > 1
Filter(f1, stpo)
-output
stlnr stlkn stpoz idnrk menge lgort
1: 1 1 2 test_1 1 NA
2: 2 1 2 test_1 14 NA
3: 3 1 2 test_2 4 NA
4: 3 2 4 test_3 4 NA
5: 3 3 6 test_3 2 NA
6: 3 4 8 test_1 2 NA
7: 4 5 10 test_2 1 14
Or use the data.table way of subsetting the columns
stpo[, .SD, .SDcols = f1]
stlnr stlkn stpoz idnrk menge lgort
1: 1 1 2 test_1 1 NA
2: 2 1 2 test_1 14 NA
3: 3 1 2 test_2 4 NA
4: 3 2 4 test_3 4 NA
5: 3 3 6 test_3 2 NA
6: 3 4 8 test_1 2 NA
7: 4 5 10 test_2 1 14
Looks like I have taken a cue from what Arun wrote and modified the code like so:
res2 <- map(stpo[,1:length(names(stpo))], test_unique)
I would like to copy the last two columns from each month to the beginning of the next month. I did it as follows (below), but the data contains NA and when I change it to character, the program breaks down. How do I copy columns to keep their type?
My code:
library(readxl)
library(tibble)
df<- read_excel("C:/Users/Rezerwa/Documents/Database.xlsx")
df=add_column(df, Feb1 = as.character(do.call(paste0, df["January...4"])), .after = "January...5")
df=add_column(df, Feb2 = as.numeric(do.call(paste0, df["January...5"])), .after = "Feb1")
My data:
df
# A tibble: 10 x 13
Product January...2 January...3 January...4 January...5 February...6 February...7 February...8 February...9 March...10 March...11 March...12 March...13
<chr> <lgl> <lgl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 a NA NA 754.00 4 754.00 4 754.00 4 754.00 4 754.00 4
2 b NA NA 706.00 3 706.00 3 706.00 3 706.00 3 706.00 3
3 c NA NA 517.00 3 517.00 3 517.00 3 517.00 3 517.00 3
4 d NA NA 1466.00 9 1466.00 9 1466.00 9 1466.00 9 1466.00 9
5 e NA NA 543.00 8 543.00 8 543.00 8 543.00 8 543.00 8
6 f NA NA NA NA NA NA NA NA NA NA NA NA
7 g NA NA NA NA NA NA NA NA NA NA NA NA
8 h NA NA NA NA NA NA NA NA NA NA NA NA
9 i NA NA 1466.00 8 NA NA NA NA NA NA NA NA
10 j NA NA NA NA 543.00 3 NA NA NA NA NA NA
My error:
> df=add_column(df, Feb1 = as.character(do.call(paste0, df["January...4"])), .after = "January...5")
> df=add_column(df, Feb2 = as.numeric(do.call(paste0, df["January...5"])), .after = "Feb1")
Warning message:
In eval_tidy(xs[[i]], unique_output) : NAs introduced by coercion
Using base R we can split the columns based on the prefix of their names, select last two columns from each group and cbind to original df.
df1 <- cbind(df, do.call(cbind, lapply(split.default(df[-1],
sub("\\..*", "", names(df)[-1])), function(x) {n <- ncol(x);x[, c(n-1, n)]})))
To get data in order, we can do
cbind(df1[1], df1[-1][order(match(sub("\\..*", "", names(df1)[-1]), month.name))])
data
df <- structure(list(Product = structure(1:10, .Label = c("a", "b",
"c", "d", "e", "f", "g", "h", "i", "j"), class = "factor"), January...2 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), January...3 = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), January...4 = c(754, 706, 517,
1466, 543, NA, NA, NA, 1466, NA), January...5 = c(4L, 3L, 3L,
9L, 8L, NA, NA, NA, 8L, NA), February...6 = c(754, 706, 517,
1466, 543, NA, NA, NA, NA, 543), February...7 = c(4L, 3L, 3L,
9L, 8L, NA, NA, NA, NA, 3L), February...8 = c(754, 706, 517,
1466, 543, NA, NA, NA, NA, NA), February...9 = c(4L, 3L, 3L,
9L, 8L, NA, NA, NA, NA, NA), March...10 = c(754, 706, 517, 1466,
543, NA, NA, NA, NA, NA), March...11 = c(4L, 3L, 3L, 9L, 8L,
NA, NA, NA, NA, NA), March...12 = c(754, 706, 517, 1466, 543,
NA, NA, NA, NA, NA), March...13 = c(4L, 3L, 3L, 9L, 8L, NA, NA,
NA, NA, NA)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"))
The way I have extracted my results somehow kept them as diagonal elements in a data frame. I would like to reduce the data down, keeping the row names and col names. I.e. merge the row names and col names.
1750:10-K:2006 1800:10-K:2006 1923:10-K:2006 2488:10-K:2006
1750:10-K:2005 0.9291217 NA NA NA
1800:10-K:2005 NA 0.9690067 NA NA
1923:10-K:2005 NA NA 0.8584429 NA
2488:10-K:2005 NA NA NA 0.956372
2969:10-K:2005 NA NA NA NA
3133:10-K:2005 NA NA NA NA
3197:10-K:2005 NA NA NA NA
3333:10-K:2005 NA NA NA NA
3370:10-K:2005 NA NA NA NA
3673:10-K:2005 NA NA NA NA
2969:10-K:2006 3133:10-K:2006 3197:10-K:2006 3333:10-K:2006
1750:10-K:2005 NA NA NA NA
1800:10-K:2005 NA NA NA NA
1923:10-K:2005 NA NA NA NA
2488:10-K:2005 NA NA NA NA
2969:10-K:2005 0.861327 NA NA NA
3133:10-K:2005 NA 0.9375159 NA NA
3197:10-K:2005 NA NA 0.9633629 NA
3333:10-K:2005 NA NA NA 0.9752259
3370:10-K:2005 NA NA NA NA
3673:10-K:2005 NA NA NA NA
3370:10-K:2006 3673:10-K:2006
1750:10-K:2005 NA NA
1800:10-K:2005 NA NA
1923:10-K:2005 NA NA
2488:10-K:2005 NA NA
2969:10-K:2005 NA NA
3133:10-K:2005 NA NA
3197:10-K:2005 NA NA
3333:10-K:2005 NA NA
3370:10-K:2005 0.941602 NA
3673:10-K:2005 NA 0.9745789
Expected output:
1750:10-K:2005_1750:10-K:2006 0.9291217
1800:10-K:2005_1800:10-K:2006 0.9690067
1923:10-K:2005_1923:10-K:2006 0.8584429
2488:10-K:2005_2488:10-K:2006 0.956372
Data:
structure(list(`1750:10-K:2006` = c(0.929121725727165, NA, NA,
NA, NA, NA, NA, NA, NA, NA), `1800:10-K:2006` = c(NA, 0.96900670959669,
NA, NA, NA, NA, NA, NA, NA, NA), `1923:10-K:2006` = c(NA, NA,
0.858442889654398, NA, NA, NA, NA, NA, NA, NA), `2488:10-K:2006` = c(NA,
NA, NA, 0.956371967288172, NA, NA, NA, NA, NA, NA), `2969:10-K:2006` = c(NA,
NA, NA, NA, 0.861326963904054, NA, NA, NA, NA, NA), `3133:10-K:2006` = c(NA,
NA, NA, NA, NA, 0.93751593784196, NA, NA, NA, NA), `3197:10-K:2006` = c(NA,
NA, NA, NA, NA, NA, 0.963362873672737, NA, NA, NA), `3333:10-K:2006` = c(NA,
NA, NA, NA, NA, NA, NA, 0.975225879729218, NA, NA), `3370:10-K:2006` = c(NA,
NA, NA, NA, NA, NA, NA, NA, 0.941602039119482, NA), `3673:10-K:2006` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 0.974578948898938)), row.names = c("1750:10-K:2005",
"1800:10-K:2005", "1923:10-K:2005", "2488:10-K:2005", "2969:10-K:2005",
"3133:10-K:2005", "3197:10-K:2005", "3333:10-K:2005", "3370:10-K:2005",
"3673:10-K:2005"), class = "data.frame")
You can try diag but you have to convert to matrix first, i.e.
data.frame(v1 = rownames(df), v2 = diag(as.matrix(df)))
# v1 v2
#1 1750:10-K:2005 0.9291217
#2 1800:10-K:2005 0.9690067
#3 1923:10-K:2005 0.8584429
#4 2488:10-K:2005 0.9563720
#5 2969:10-K:2005 0.8613270
#6 3133:10-K:2005 0.9375159
#7 3197:10-K:2005 0.9633629
#8 3333:10-K:2005 0.9752259
#9 3370:10-K:2005 0.9416020
#10 3673:10-K:2005 0.9745789
Here is a solution with dplyr:
library(dplyr)
df %>%
rownames_to_column() %>%
gather(KPI,Value,-rowname) %>%
mutate(KPI = paste0(rowname,KPI,sep="_")) %>%
drop_na() %>%
select(-rowname)
I have a matrix and my objective is to find the maximum of each column and then to divide that number by the sum of all values in the row which contains the max of that column. In other words
max(y) / sum of values in the row where y is the max
How would apply this formula to every column in R ?
> the_matrix
Source: local data frame [20 x 10]
type 100 100F 100I 100X 101 102 1028P 103 103D
(fctr) (int) (int) (int) (int) (int) (int) (int) (int) (int)
1 0 NA NA NA NA NA NA NA NA NA
2 0A 2 NA NA NA NA NA NA NA NA
3 0B NA NA NA NA NA NA NA NA NA
4 0C NA NA NA NA NA NA NA NA NA
5 0E NA NA NA NA NA NA NA NA NA
6 0G NA NA NA NA NA NA NA NA NA
7 0O NA NA NA NA NA NA NA NA NA
8 0Z NA NA NA NA NA NA NA NA NA
9 1 2 NA NA NA NA NA NA NA NA
10 1A 3968 NA 214 26 4 289 8 56030 7484
11 1B 172 NA 107 NA NA 2 NA 372 3829
12 1C 584 NA 19 NA NA 1 NA 72951 363
13 1D 27 NA NA NA NA NA NA 365 22
14 1E 27944 16 68 NA NA NA 1 62 12
15 1F 1 NA 1 NA NA 1 NA 368 27
16 1G 4 NA NA NA NA NA NA 7 NA
17 1H 65 NA 6 21 1 6 3 714 59
18 1M NA NA NA NA NA NA NA 1 NA
19 1N NA NA NA NA NA NA NA NA NA
20 1Q NA NA NA NA NA NA NA NA NA
> dput(the_matrix)
structure(list(type = structure(1:20, .Label = c("0", "0A", "0B",
"0C", "0E", "0G", "0O", "0Z", "1", "1A", "1B", "1C", "1D", "1E",
"1F", "1G", "1H", "1M", "1N", "1Q", "1S", "1X", "1Z", "2", "2A",
"2B", "2C", "2D", "2E", "2F", "2G", "2H", "2I", "2J", "2M", "2S",
"2T", "2X", "2Z", "3", "3B", "3C", "3E", "4B", "5H", "8Z", "0H",
"1I", "1R", "2N", "3H", "5D", "0D", "1K", "1P", "1T", "1U", "1V",
"1W", "1Y", "2U", "3A", "4A", "5C", "7H", "9", "0F", "0T", "1J",
"2L", "0W", "2Q", "3G"), class = "factor"), `100` = c(NA, 2L,
NA, NA, NA, NA, NA, NA, 2L, 3968L, 172L, 584L, 27L, 27944L, 1L,
4L, 65L, NA, NA, NA), `100F` = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 16L, NA, NA, NA, NA, NA, NA), `100I` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 214L, 107L, 19L, NA, 68L, 1L,
NA, 6L, NA, NA, NA), `100X` = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, 26L, NA, NA, NA, NA, NA, NA, 21L, NA, NA, NA), `101` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 4L, NA, NA, NA, NA, NA, NA, 1L,
NA, NA, NA), `102` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 289L,
2L, 1L, NA, NA, 1L, NA, 6L, NA, NA, NA), `1028P` = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, 8L, NA, NA, NA, 1L, NA, NA, 3L, NA,
NA, NA), `103` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 56030L,
372L, 72951L, 365L, 62L, 368L, 7L, 714L, 1L, NA, NA), `103D` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 7484L, 3829L, 363L, 22L, 12L,
27L, NA, 59L, NA, NA, NA)), .Names = c("type", "100", "100F",
"100I", "100X", "101", "102", "1028P", "103", "103D"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -20L))
Going step-by-step:
# let's not call a data frame a matrix
real_matrix = as.matrix(the_matrix[, -1])
# max of each column
col_max = apply(real_matrix, 2, max, na.rm = T)
# which row contains the max
col_which_max = apply(real_matrix, 2, which.max)
# row totals
row_total = rowSums(real_matrix, na.rm = T)
# col max divided by row total for corresponding row
col_max / row_total[col_which_max]
Rounded to 3 decimals, this yields the following:
100 100F 100I 100X 101 102 1028P 103 103D
0.994 0.001 0.003 0.000 0.000 0.004 0.000 0.987 0.110
I have a data.frame called RawHM and want, for each row, to evaluate sets of columns defined by the entries in the list AllList, in order to see if there is enough non-NA observations (not less than 2) to keep the column set of entries for that row. If not, the column set entries should be substituted with NA's.
AllList:
> dput(AllList)
structure(list(EGI = c("OO", "PP", "QQ"), Ref = c("RR", "SS",
"TT")), .Names = c("EGI", "Ref"))
RawHM:
> dput(head(RawHM,10))
structure(list(OO = c(2.26128283268031, NA, NA, NA, 3.1189673217816,
2.68131772865193, 1.50542478607416, NA, NA, NA), PP = c(NA, 2.86537733048028,
2.02969026818987, NA, 2.54112005565494, 3.01623803266379, 1.73909499803785,
2.49712237003491, NA, 1.67635525591635), QQ = c(NA, NA, 1.91968060122123,
NA, NA, 2.63463138625395, NA, NA, NA, NA), RR = c(NA, NA, NA,
NA, NA, 1.01488582084669, 1.01944283768403, NA, 1.06329113924051,
NA), SS = c(0.950310559006211, 0.924124326404927, 1.07886334610473,
0.951793999929161, 0.847931452310888, 0.879173290937997, 0.882126364182319,
NA, NA, 0.713085668766746), TT = c(NA, NA, 1.09812749411644,
NA, 0.9994646420402, 1.21090641120118, 1.25090285854196, NA,
NA, NA)), .Names = c("OO", "PP", "QQ", "RR", "SS", "TT"), row.names = c(1L,
2L, 15L, 16L, 23L, 24L, 25L, 30L, 36L, 40L), class = "data.frame")
I have tried by making a function:
func<-function(x)unlist(lapply(AllList,function(y)if(length(na.omit(x[unlist(y)]))<2){rep(NA,length(unlist(y)))} else{x[unlist(y)]}))
And then:
output<-t(apply(RawHM,1,func))
Which works in priciple but doesnt preserve the colnames, which i want to be the same as in the RawHM dataframe. I would prefer to avoid renaming the columns afterwards..
> dput(head(output,10))
structure(c(NA, NA, NA, NA, 3.1189673217816, 2.68131772865193,
1.50542478607416, NA, NA, NA, NA, NA, 2.02969026818987, NA, 2.54112005565494,
3.01623803266379, 1.73909499803785, NA, NA, NA, NA, NA, 1.91968060122123,
NA, NA, 2.63463138625395, NA, NA, NA, NA, NA, NA, NA, NA, NA,
1.01488582084669, 1.01944283768403, NA, NA, NA, NA, NA, 1.07886334610473,
NA, 0.847931452310888, 0.879173290937997, 0.882126364182319,
NA, NA, NA, NA, NA, 1.09812749411644, NA, 0.9994646420402, 1.21090641120118,
1.25090285854196, NA, NA, NA), .Dim = c(10L, 6L), .Dimnames = list(
c("1", "2", "15", "16", "23", "24", "25", "30", "36", "40"
), NULL))
Any help would be very welcome :-)
Regards
Mads
func is a very strange function... funky even!
When you use apply your data gets converted to a matrix from a data.frame. Your function seems to operate differently if it is a data.frame rather than a matrix:
func(RawHM[1,])
EGI.OO EGI.PP EGI.QQ Ref.RR Ref.SS Ref.TT
2.2612828 NA NA NA 0.9503106 NA
func(as.matrix(RawHM)[1,])
EGI1 EGI2 EGI3 Ref1 Ref2 Ref3
NA NA NA NA NA NA
Note that you get different results, and different names!
In any case, the names issue arises from the fact that when you produce the NAs, there are no names, so the result give inconsistent output for apply. To fix this, here is a modification:
func2 <- function(x)unlist(lapply(AllList,function(y)if(length(na.omit(x[unlist(y)]))<2){sapply(y,function(z) NA)} else{x[unlist(y)]}))
t(apply(RawHM,1,func2))
EGI.OO EGI.PP EGI.QQ Ref.RR Ref.SS Ref.TT
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
15 NA 2.029690 1.919681 NA 1.0788633 1.0981275
16 NA NA NA NA NA NA
23 3.118967 2.541120 NA NA 0.8479315 0.9994646
24 2.681318 3.016238 2.634631 1.014886 0.8791733 1.2109064
25 1.505425 1.739095 NA 1.019443 0.8821264 1.2509029
30 NA NA NA NA NA NA
36 NA NA NA NA NA NA
40 NA NA NA NA NA NA