I want to calculate the mean of a row if at least three of the six observations in the row are not NA. If four or more NAs are present, the mean should be NA.
Here is an example that gives me the mean, ignoring the NAs:
require(dplyr)
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
df <- df %>% mutate(mean = rowMeans(df[,1:6], na.rm=TRUE))
I thought about using case_when, but I'm not sure how to use it correctly:
df <- df %>% mutate(mean = case_when( ~ rowMeans(df[,1:6], na.rm=TRUE), TRUE ~ NA))
You can try a base R solution that stores the number of non-NA values in a new variable and then uses ifelse() for the mean:
#Data
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
#Code
#Count number of non NA
df$count <- rowSums( !is.na( df [,1:6]))
#Compute mean
df$Mean <- ifelse(df$count >= 3, rowMeans(df[, 1:6], na.rm = TRUE), NA)
Output:
a b c d e f count Mean
1 1 11 21 31 41 51 6 26.00000
2 NA 12 NA NA 42 NA 2 NA
3 3 13 23 33 43 53 6 28.00000
4 4 14 24 34 44 54 6 29.00000
5 NA 15 25 NA 45 NA 3 28.33333
6 6 16 26 36 46 56 6 31.00000
7 7 17 27 37 47 57 6 32.00000
8 NA NA 28 38 NA NA 2 NA
9 9 19 29 39 49 59 6 34.00000
10 10 20 30 40 50 60 6 35.00000
You could do:
library(dplyr)
df %>%
rowwise %>%
mutate(
mean = case_when(
sum(is.na(c_across())) < 4 ~ mean(c_across(), na.rm = TRUE),
TRUE ~ NA_real_)
) %>% ungroup()
Output:
# A tibble: 10 x 7
a b c d e f mean
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 21 31 41 51 26
2 NA 12 NA NA 42 NA NA
3 3 13 23 33 43 53 28
4 4 14 24 34 44 54 29
5 NA 15 25 NA 45 NA 28.3
6 6 16 26 36 46 56 31
7 7 17 27 37 47 57 32
8 NA NA 28 38 NA NA NA
9 9 19 29 39 49 59 34
10 10 20 30 40 50 60 35
This leverages rowwise and c_across, which make the computation operate on one row at a time, so you can use summary functions such as sum and mean in their usual way (also within case_when).
c_across also has a cols argument where you can specify which columns you want to take into account. For example, if you'd like to take into account columns 1:6, you can specify this as:
df %>%
rowwise %>%
mutate(
mean = case_when(
sum(is.na(c_across(1:6))) < 4 ~ mean(c_across(1:6), na.rm = TRUE),
TRUE ~ NA_real_)
) %>% ungroup()
Alternatively, if you'd e.g. like to take into account all columns except column number 2, you would do c_across(-2). You can also use column names, e.g. for the first example c_across(a:f) (all columns) or for the second c_across(-b) (all columns except b).
This is implemented internally in dplyr, but you could also do usual vector subsetting with taking the whole c_across() (which defaults to all columns, i.e. everything()) and do e.g. c_across()[1:6] or c_across()[-2].
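For larger data, row-wise evaluation can be slow. As a rough sketch (not part of the answer above), the same rule can also be written in vectorised form with case_when, closer to the attempt in the question. The names cols and res are illustrative assumptions:

```r
library(dplyr)

# Rebuild the example data from the question
df <- data.frame(a = 1:10, b = 11:20, c = 21:30, d = 31:40, e = 41:50, f = 51:60)
df[2, c(1, 3, 4, 6)] <- NA
df[5, c(1, 4, 6)] <- NA
df[8, c(1, 2, 5, 6)] <- NA

cols <- c("a", "b", "c", "d", "e", "f")  # columns entering the mean (assumed)

res <- df %>%
  mutate(mean = case_when(
    # keep the mean only when at least three values are present
    rowSums(!is.na(df[cols])) >= 3 ~ rowMeans(df[cols], na.rm = TRUE),
    # otherwise return a missing value of type double
    TRUE ~ NA_real_
  ))
res
```

Here rowSums and rowMeans work on the whole block of columns at once, so no rowwise() grouping is needed.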
We can create an index first and then do the assignment based on the index
i1 <- rowSums(!is.na(df)) >=3
df$Mean[i1] <- rowMeans(df[i1,], na.rm = TRUE)
df
# a b c d e f Mean
#1 1 11 21 31 41 51 26.00000
#2 NA 12 NA NA 42 NA NA
#3 3 13 23 33 43 53 28.00000
#4 4 14 24 34 44 54 29.00000
#5 NA 15 25 NA 45 NA 28.33333
#6 6 16 26 36 46 56 31.00000
#7 7 17 27 37 47 57 32.00000
#8 NA NA 28 38 NA NA NA
#9 9 19 29 39 49 59 34.00000
#10 10 20 30 40 50 60 35.00000
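If this is needed in several places, the index approach could be wrapped in a small helper. This is only a sketch: the function name row_mean_min_obs and its arguments cols and min_obs are hypothetical. It starts from a vector of NA so the result always has one entry per row:

```r
# Hypothetical helper: row means over `cols`, NA when fewer than
# `min_obs` non-missing values are present in a row
row_mean_min_obs <- function(data, cols, min_obs = 3) {
  block <- data[cols]
  out <- rep(NA_real_, nrow(data))            # default: NA for every row
  keep <- rowSums(!is.na(block)) >= min_obs   # rows with enough observations
  out[keep] <- rowMeans(block[keep, , drop = FALSE], na.rm = TRUE)
  out
}

# Example data from the question
df <- data.frame(a = 1:10, b = 11:20, c = 21:30, d = 31:40, e = 41:50, f = 51:60)
df[2, c(1, 3, 4, 6)] <- NA
df[5, c(1, 4, 6)] <- NA
df[8, c(1, 2, 5, 6)] <- NA

df$Mean <- row_mean_min_obs(df, cols = 1:6, min_obs = 3)
df
```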
Related
I want to create a "row" containing the first non-NA value that appears in a data frame. So for example, given this test data frame:
test.df <- data.frame(a=c(11,12,13,14,15,16),b=c(NA,NA,23,24,25,26), c=c(31,32,33,34,35,36), d=c(NA,NA,NA,NA,45,46))
test.df
a b c d
1 11 NA 31 NA
2 12 NA 32 NA
3 13 23 33 NA
4 14 24 34 NA
5 15 25 35 45
6 16 26 36 46
I know that I can detect the first appearance of a non-NA like this:
first.appearance <- as.numeric(sapply(test.df, function(col) min(which(!is.na(col)))))
first.appearance
[1] 1 3 1 5
This tells me that the first element in column 1 is not NA, the third element in column 2 is not NA, the first element in column 3 is not NA, and the fifth element in column 4 is not NA. But when I put the pieces together, it yields this (which is logical, but not what I want):
> test.df[first.appearance,]
a b c d
1 11 NA 31 NA
3 13 23 33 NA
1.1 11 NA 31 NA
5 15 25 35 45
I would like the output to be the first non-NA in each column. What is a base or dplyr way to do this? I am not seeing it. Thanks in advance.
a b c d
1 11 23 31 45
We can use
library(dplyr)
test.df %>%
slice(first.appearance) %>%
summarise_all(~ first(.[!is.na(.)]))
# a b c d
#1 11 23 31 45
Or, because every column here is sorted in increasing order, the minimum also gives the first non-NA value:
test.df %>%
summarise_all(~ min(na.omit(.)))
# a b c d
#1 11 23 31 45
Or with colMins
library(matrixStats)
colMins(as.matrix(test.df), na.rm = TRUE)
#[1] 11 23 31 45
You can use :
library(tidyverse)
test.df %>% fill(everything(), .direction = "up") %>% head(1)
a b c d
<dbl> <dbl> <dbl> <dbl>
1 11 23 31 45
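Since the question also asks for a base R way, here is a minimal sketch that uses only sapply and logical indexing. The names first_non_na and unsorted are made up for illustration:

```r
test.df <- data.frame(a = c(11, 12, 13, 14, 15, 16),
                      b = c(NA, NA, 23, 24, 25, 26),
                      c = c(31, 32, 33, 34, 35, 36),
                      d = c(NA, NA, NA, NA, 45, 46))

# For each column, drop the NAs and keep the first remaining value
first_non_na <- sapply(test.df, function(col) col[!is.na(col)][1])
first_non_na

# Unlike min(), this does not depend on the column being sorted
unsorted <- c(NA, 9, 2, 5)
unsorted[!is.na(unsorted)][1]   # first non-NA value
min(unsorted, na.rm = TRUE)     # smallest value, which differs here
```

A column that is entirely NA yields NA with this approach.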
I have a data frame for which I want to create columns of row means. Each row-mean column should be computed for a group of related columns in the data. I can differentiate between the groups of columns using dplyr's starts_with(). Since I have several groups of columns to calculate row means for, I'd like to build a function to do it. For some reason, I can't get it to work.
Data
df <- data.frame("europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA),
prob = c(0.70, 0.30),
size = length(cc),
replace = TRUE) ]))
df
europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 NA NA NA 41 51
2 NA 12 22 NA 42 52
3 3 13 23 33 43 NA
4 NA 14 NA NA 44 54
5 NA 15 25 35 45 55
6 6 NA NA 36 46 56
7 7 17 27 NA 47 57
8 NA 18 28 38 48 NA
9 9 19 29 39 49 NA
10 10 NA 30 40 NA 60
I want to create a new column for the row means of each continent, across cities. One column for Asia cities, and one for Europe. Each run of the function will be fed by the name of a continent, to guide which columns to pick.
My attempt to build the function
This attempt is based on this answer.
continent_mean <-
function(continent) {
df %>%
select(starts_with(as.character(continent))) %>%
mutate(., (!!as.name(continent)) == rowMeans(., na.rm = TRUE))
}
However, running this code results in a weird behavior, as it seemingly returns the same dataset, with just the selected columns according to starts_with(), but it doesn't generate a new column for row means.
continent_mean("asia")
asia_bangkok asia_tokyo asia_kathmandu
1 31 41 51
2 32 42 52
3 33 43 53
4 34 44 54
5 35 45 55
6 36 46 56
7 37 47 57
8 38 48 58
9 39 49 59
10 40 50 60
What am I missing here? I thought this could be due to the == rather than = in mutate(), but a single = throws an error, so it seems not to be the solution either.
Thanks!
We can use quo_name to assign column names
library(dplyr)
library(rlang)
continent_mean <- function(df, continent) {
df %>%
select(starts_with(continent)) %>%
mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
continent_mean(df, "asia")
# asia_bangkok asia_tokyo asia_kathmandu asia
#1 NA 41 51 46
#2 NA 42 52 47
#3 33 43 NA 38
#4 NA 44 54 49
#5 35 45 55 45
#6 36 46 56 46
#7 NA 47 57 52
#8 38 48 NA 43
#9 39 49 NA 44
#10 40 NA 60 50
Using base R, we can do similar thing by
continent_mean <- function(df, continent) {
df1 <- df[startsWith(names(df), continent)]
df1[continent] <- rowMeans(df1, na.rm = TRUE)
df1
}
If we want rowMeans of all the continents together we can use split.default
sapply(split.default(df, sub("_.*", "", names(df))), rowMeans, na.rm = TRUE)
# asia europe
# [1,] 46 1
# [2,] 47 17
# [3,] 38 13
# [4,] 49 14
# [5,] 45 20
# [6,] 46 6
# [7,] 52 17
# [8,] 43 23
# [9,] 44 19
#[10,] 50 20
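To append one mean column per continent to the data itself, rather than return a separate matrix, one possible sketch is shown here. The small data frame and the names continents, means and df_with_means are made up for illustration:

```r
# Small made-up data in the same "continent_city" naming scheme
df <- data.frame(europe_paris  = c(1, NA, 3),
                 europe_london = c(NA, 12, 13),
                 asia_tokyo    = c(41, 42, NA),
                 asia_bangkok  = c(NA, NA, 33))

# Continent names are the part of each column name before the underscore
continents <- unique(sub("_.*", "", names(df)))

# One column of row means per continent, named after the continent
means <- sapply(continents, function(cn) {
  rowMeans(df[startsWith(names(df), cn)], na.rm = TRUE)
})

df_with_means <- cbind(df, means)
df_with_means
```

Note that rowMeans(..., na.rm = TRUE) returns NaN for a row in which every selected value is NA.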
Hi, I have a data frame and want to replace the NA values in Val_1 based on the nearest value in Val_2.
For example, Val_1 is missing at ID 4, where Val_2 is 33.3. The nearest previous value in Val_2 is 33 (ID 3), so the NA should be replaced with that row's Val_1, i.e. 45. Likewise, ID 8 should get 33, because the nearest value to 44.6 in Val_2 is 44.5 (ID 6).
ID Date Val_1 Val_2
1 01-02-2014 NA 22
2 02-02-2014 23 NA
3 03-02-2014 45 33
4 04-02-2014 NA 33.3
5 05-02-2014 45 46
6 06-02-2014 33 44.5
7 07-02-2014 56 48
8 08-02-2014 NA 44.6
9 09-02-2014 10 43
10 10-02-2014 14 56
11 11-02-2014 NA NA
12 12-02-2014 22 22
We can replace NA values with
library(zoo)
na.locf(na.locf(DF$Val_1), fromLast = TRUE)
but the code above fills in the previous value from the same column.
Desired output:
ID Date Val_1 Val_2
1 01-02-2014 NA 22
2 02-02-2014 23 NA
3 03-02-2014 45 33
4 04-02-2014 45 33.3
5 05-02-2014 45 46
6 06-02-2014 33 44.5
7 07-02-2014 56 48
8 08-02-2014 33 44.6
9 09-02-2014 10 43
10 10-02-2014 14 56
11 11-02-2014 NA NA
12 12-02-2014 22 22
Thanks
Sorry but I couldn't think of any simpler way:
# To use pipes
library(dplyr)
# Give a threshold. Nearest values must have a difference below this threshold
diff.threshold <- 0.5
# Create a vector with IDs that must have Val_1 updated
IDtoReplace <- DF %>% filter(is.na(Val_1), !is.na(Val_2)) %>%
select(ID) %>%
unlist()
for (id in IDtoReplace){
# Get Val_2 from current id
curVal2 <- DF %>% filter(ID==id) %>% select(Val_2) %>% unlist()
# Get value to be input
valuetoinput <- DF %>% filter(!is.na(Val_1),!is.na(Val_2),ID < id) %>% # Filter out all NA values and keep only previous ID
mutate(diff = abs(Val_2-curVal2)) %>% # Calculate all the differences
filter(diff==min(diff),diff<=diff.threshold) %>% # Keep row with minimum difference (it has to be below the threshold)
select(Val_1) %>% # Select Val_1
unlist()
# If any value is found, replace it in the data frame
if(length(valuetoinput)>0)
DF[which(DF$ID==id),"Val_1"] <- valuetoinput
}
And as result:
> DF
ID Date Val_1 Val_2
1 1 01-02-2014 NA 22.0
2 2 02-02-2014 23 NA
3 3 03-02-2014 45 33.0
4 4 04-02-2014 45 33.3
5 5 05-02-2014 45 46.0
6 6 06-02-2014 33 44.5
7 7 07-02-2014 56 48.0
8 8 08-02-2014 33 44.6
9 9 09-02-2014 10 43.0
10 10 10-02-2014 14 56.0
11 11 11-02-2014 NA NA
12 12 12-02-2014 22 22.0
Will you use something similar very often? If so, I suggest rewriting the for loop as a function.
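As a rough illustration of what such a function might look like, here is a base R sketch. The function name fill_from_nearest and its threshold argument are hypothetical, the Date column is left out of the example data, and ties in distance are resolved by taking the earliest row, which differs slightly from the loop above:

```r
# Hypothetical helper: fill NA in Val_1 from the earlier row whose Val_2
# is closest, provided the distance does not exceed `threshold`
fill_from_nearest <- function(data, threshold = 0.5) {
  for (i in which(is.na(data$Val_1) & !is.na(data$Val_2))) {
    # Candidate donors: earlier rows with both values present
    cand <- which(!is.na(data$Val_1) & !is.na(data$Val_2) &
                    seq_len(nrow(data)) < i)
    if (length(cand) == 0) next
    d <- abs(data$Val_2[cand] - data$Val_2[i])
    if (min(d) <= threshold) {
      data$Val_1[i] <- data$Val_1[cand[which.min(d)]]
    }
  }
  data
}

DF <- data.frame(
  ID    = 1:12,
  Val_1 = c(NA, 23, 45, NA, 45, 33, 56, NA, 10, 14, NA, 22),
  Val_2 = c(22, NA, 33, 33.3, 46, 44.5, 48, 44.6, 43, 56, NA, 22)
)

filled <- fill_from_nearest(DF, threshold = 0.5)
filled
```

Rows without an earlier donor (ID 1) or without a Val_2 value (ID 11) are left as NA.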
Rprof revealed that the following operation I perform is rather slow:
stockHistory[.(p), stock:=stockHistory[.(p), stock] - (backorderedDemands[.(p-1),backlog] - backorderedDemands[.(p),backlog])]
I suppose this is because of the subtraction
backorderedDemands[.(p-1),backlog] - backorderedDemands[.(p),backlog]
Is there any way to speed up this operation?
.(p) subsets the data.table for a period p, and .(p-1) subsets the previous period (see example data below). Would it maybe be faster to apply some kind of diff() here? I do not know how to do this, though.
Example data:
backorderedDemands<-CJ(period=1:1000, articleID=letters[1:10], backlog=0)[,backlog:=round(runif(10000)*42,0)]
setkey(backorderedDemands,period, articleID)
stockHistory<-CJ(period=1:1000, articleID=letters[1:10], stock=0)[,stock:=round(runif(10000)*42+66,0)]
setkey(stockHistory,period, articleID)
You can first calculate a difference column in backorderedDemands.
backorderedDemands[, diff := c(NA, -diff(backlog)), by=articleID]
Also it is not necessary to use stockHistory[.(p), stock]. It's enough to just use stock.
stockHistory[.(p), stock:=stock - backorderedDemands[.(p), diff]]
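If the backlogs for all periods are already known before the update (an assumption, since the question may describe a period-by-period simulation), the loop over p could perhaps be replaced by a single update join. A minimal sketch on smaller data, where the column names change and stock_adj are made up for illustration:

```r
library(data.table)

set.seed(1)
backorderedDemands <- CJ(period = 1:5, articleID = letters[1:3])
backorderedDemands[, backlog := round(runif(.N) * 42)]
stockHistory <- CJ(period = 1:5, articleID = letters[1:3])
stockHistory[, stock := round(runif(.N) * 42 + 66)]

# Per article: backlog of the previous period minus the current backlog
# (NA in the first period, where no previous period exists)
backorderedDemands[, change := shift(backlog) - backlog, by = articleID]

# Update join: look up each row's change and subtract it from the stock
stockHistory[backorderedDemands, on = .(period, articleID),
             stock_adj := stock - i.change]
stockHistory
```

The result is written to a new column so that the original stock values remain available for comparison.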
If you want to compute first differences of your data, you can do it as shown below. It is fast, and I have included the step-by-step computation.
library(data.table)
library(dplyr)
Data
set.seed(1)
backorderedDemands <-
CJ(period = 1:1000,
articleID = letters[1:10],
backlog = 0)[,backlog:= round(runif(10000) * 42, 0)]
stockHistory <-
CJ(period = 1:1000,
articleID = letters[1:10],
stock = 0)[, stock:= round(runif(10000) * 42 + 66, 0)]
Solution
merge(stockHistory, backorderedDemands,
by = c("period", "articleID")) %>%
group_by(articleID) %>%
mutate(lag_backlog = lag(backlog, 1),
my_backlog_diff = backlog - lag_backlog,
my_diff = stock + my_backlog_diff) %>%
as.data.frame(.) %>%
head(., 20)
period articleID stock backlog lag_backlog my_backlog_diff my_diff
1 1 a 69 11 NA NA NA
2 1 b 94 16 NA NA NA
3 1 c 97 24 NA NA NA
4 1 d 71 38 NA NA NA
5 1 e 68 8 NA NA NA
6 1 f 71 38 NA NA NA
7 1 g 103 40 NA NA NA
8 1 h 101 28 NA NA NA
9 1 i 102 26 NA NA NA
10 1 j 67 3 NA NA NA
11 2 a 71 9 11 -2 69
12 2 b 89 7 16 -9 80
13 2 c 71 29 24 5 76
14 2 d 96 16 38 -22 74
15 2 e 96 32 8 24 120
16 2 f 99 21 38 -17 82
17 2 g 92 30 40 -10 82
18 2 h 87 42 28 14 101
19 2 i 85 16 26 -10 75
20 2 j 67 33 3 30 97
I have a data frame with 150000 lines in long format with multiple occurrences of the same id variable. I'm using reshape (from stats, rather than from the reshape or reshape2 packages) to convert this to wide format. I am generating a variable to count each occurrence of a given level of id to use as an index.
I've got this working with a small dataframe using plyr, but it is far too slow for my full df. Can I programme this more efficiently?
I've struggled doing this with the reshape package as I have around 30 other variables. It may be best to reshape only what I'm looking at (rather than the whole df) for each individual analysis.
> # u=id variable with three value variables
> u<-c(rep("a",4), rep("b", 3),rep("c", 6), rep("d", 5))
> u<-factor(u)
> v<-1:18
> w<-20:37
> x<-40:57
> df<-data.frame(u,v,w,x)
> df
u v w x
1 a 1 20 40
2 a 2 21 41
3 a 3 22 42
4 a 4 23 43
5 b 5 24 44
6 b 6 25 45
7 b 7 26 46
8 c 8 27 47
9 c 9 28 48
10 c 10 29 49
11 c 11 30 50
12 c 12 31 51
13 c 13 32 52
14 d 14 33 53
15 d 15 34 54
16 d 16 35 55
17 d 17 36 56
18 d 18 37 57
>
> library(plyr)
> df2<-ddply(df, .(u), transform, count=rank(u, ties.method="first"))
> df2
u v w x count
1 a 1 20 40 1
2 a 2 21 41 2
3 a 3 22 42 3
4 a 4 23 43 4
5 b 5 24 44 1
6 b 6 25 45 2
7 b 7 26 46 3
8 c 8 27 47 1
9 c 9 28 48 2
10 c 10 29 49 3
11 c 11 30 50 4
12 c 12 31 51 5
13 c 13 32 52 6
14 d 14 33 53 1
15 d 15 34 54 2
16 d 16 35 55 3
17 d 17 36 56 4
18 d 18 37 57 5
> reshape(df2, idvar="u", timevar="count", direction="wide")
u v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
1 a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
5 b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
8 c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
14 d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA
I still can't quite figure out why you would want to ultimately convert your dataset from long to wide, because to me, that seems like it would be an extremely unwieldy dataset to work with.
If you're looking to speed up the enumeration of your factor levels, you can consider using ave() in base R, or .N from the "data.table" package. Considering that you are working with a lot of rows, you might want to consider the latter.
First, let's make up some data:
set.seed(1)
df <- data.frame(u = sample(letters[1:6], 150000, replace = TRUE),
v = runif(150000, 0, 10),
w = runif(150000, 0, 100),
x = runif(150000, 0, 1000))
list(head(df), tail(df))
# [[1]]
# u v w x
# 1 b 6.368412 10.52822 223.6556
# 2 c 6.579344 75.28534 450.7643
# 3 d 6.573822 36.87630 283.3083
# 4 f 9.711164 66.99525 681.0157
# 5 b 5.337487 54.30291 137.0383
# 6 f 9.587560 44.81581 831.4087
#
# [[2]]
# u v w x
# 149995 b 4.614894 52.77121 509.0054
# 149996 f 5.104273 87.43799 391.6819
# 149997 f 2.425936 60.06982 160.2324
# 149998 a 1.592130 66.76113 118.4327
# 149999 b 5.157081 36.90400 511.6446
# 150000 a 3.565323 92.33530 252.4982
table(df$u)
#
# a b c d e f
# 25332 24691 24993 24975 25114 24895
Load our required packages:
library(plyr)
library(data.table)
Create a "data.table" version of our dataset
DT <- data.table(df, key = "u")
DT # Notice that the data are now automatically sorted
# u v w x
# 1: a 6.2378578 96.098294 643.2433
# 2: a 5.0322400 46.806132 544.6883
# 3: a 9.6289786 87.915303 334.6726
# 4: a 4.3393403 1.994383 753.0628
# 5: a 6.2300123 72.810359 579.7548
# ---
# 149996: f 0.6268414 15.608049 669.3838
# 149997: f 2.3588955 40.380824 658.8667
# 149998: f 1.6383619 77.210309 250.7117
# 149999: f 5.1042725 87.437989 391.6819
# 150000: f 2.4259363 60.069820 160.2324
DT[, .N, by = key(DT)] # Like "table"
# u N
# 1: a 25332
# 2: b 24691
# 3: c 24993
# 4: d 24975
# 5: e 25114
# 6: f 24895
Now let's run a few basic tests. The results from ave() aren't sorted, but they are in "data.table" and "plyr", so we should also test the timing for sorting when using ave().
system.time(AVE <- within(df, {
count <- ave(as.numeric(u), u, FUN = seq_along)
}))
# user system elapsed
# 0.024 0.000 0.027
# Now time the sorting
system.time(AVE2 <- AVE[order(AVE$u, AVE$count), ])
# user system elapsed
# 0.264 0.000 0.262
system.time(DDPLY <- ddply(df, .(u), transform,
count=rank(u, ties.method="first")))
# user system elapsed
# 0.944 0.000 0.984
system.time(DT[, count := 1:.N, by = key(DT)])
# user system elapsed
# 0.008 0.000 0.004
all(DDPLY == AVE2)
# [1] TRUE
all(data.frame(DT) == AVE2)
# [1] TRUE
That syntax for "data.table" sure is compact, and its speed is blazing!
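If the wide format is still needed, "data.table" can probably do the counting and the reshaping in one step. The sketch below assumes a reasonably recent version of the package, in which rowid() is available and dcast() accepts several value.var columns. It uses the small example from the question, and the name wide is made up:

```r
library(data.table)

# Small example data from the question
DT <- data.table(u = c(rep("a", 4), rep("b", 3), rep("c", 6), rep("d", 5)),
                 v = 1:18, w = 20:37, x = 40:57)

# rowid(u) numbers the occurrences of each id, playing the role of "count";
# dcast then spreads v, w and x across those occurrence numbers
wide <- dcast(DT, u ~ rowid(u), value.var = c("v", "w", "x"))
dim(wide)
wide
```

The columns come out grouped by variable (all v columns, then w, then x) rather than interleaved as in the reshape() output.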
Using base R to create an empty matrix and then fill it in appropriately can often be significantly faster. In the code below I suspect the slow part would be converting the data frame to a matrix and transposing, as in the first two lines; if so, that could perhaps be avoided if it could be stored differently to start with.
g <- df$u # the id column is named u in the question's data
x <- t(as.matrix(df[,-1]))
k <- split(seq_along(g), g)
n <- max(sapply(k, length))
out <- matrix(ncol=n*nrow(x), nrow=length(k))
for(idx in seq_along(k)) {
out[idx, seq_len(length(k[[idx]])*nrow(x))] <- x[,k[[idx]]]
}
rownames(out) <- names(k)
colnames(out) <- paste(rep(rownames(x), n), rep(seq_len(n), each=nrow(x)), sep=".")
out
# v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
# a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
# b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
# c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
# d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA