I'm working on a data set where the source name is specified by a 2-letter abbreviation in front of the variable. So all variables from source AA start with AA_var1, and source bb has bb_variable_name_2. There are actually a lot of sources, and a lot of variable names, but I leave only 2 as a minimal example.
I want to create a mean variable for any row where the number of sources, that is, where the number of unique prefixes for which the data on that row is not NA, is greater than 1. If there's only one source, I want that total variable to be NA.
So, for example, my data looks like this:
> head(df)
AA_var1 AA_var2 myid bb_meow bb_A_v1
1 NA NA 123456 10 12
2 NA 10 194200 12 NA
3 12 10 132200 NA NA
4 12 NA 132201 NA 12
5 NA NA 132202 NA NA
6 12 13 132203 14 NA
And I want the following:
> head(df)
AA_var1 AA_var2 myid bb_meow bb_A_v1 rowMeanIfDiverseData
1 NA NA 123456 10 12 NA #has only bb
2 NA 10 194200 12 NA 11 #has AA and bb
3 12 10 132200 NA NA NA #has only AA
4 12 NA 132201 NA 12 12 #has AA and bb
5 NA NA 132202 NA NA NA #has neither
6 12 13 132203 14 NA 13 #has AA and bb
Normally, I just use rowMeans() for this kind of thing. But the additional subsetting of selecting only rows whose variable names follow a convention /at the row level/ has caught me confused between the item-level and the general apply-level statements I'm used to.
I can get the prefixes at the dataframe level:
mynames <- names(df[!names(df) %in% c("myid")])
tmp <- str_extract(mynames, perl("[A-Za-z]{2}(?=_)"))
uniq <- unique(tmp[!is.na(tmp)])
So,
> uniq
[1] "AA" "bb"
So, I can make this a function I can apply to df like so:
multiSource <- function(x){
nm = names(x[!names(x) %in% badnames]) # exclude c("myid")
tmp <- str_extract(nm, perl("[A-Za-z]{2}(?=_)")) # get prefixes
uniq <- unique(tmp[!is.na(tmp)]) # ensure unique and not NA
if (length(uniq) > 1){
return(T)
} else {
return(F)
}
}
But this is clearly confused, and still getting data-set level, ie:
> lapply(df,multiSource)
$AA_var1
[1] FALSE
$AA_var2
[1] FALSE
$bb_meow
[1] FALSE
$bb_A_v1
[1] FALSE
And...
> apply(df,MARGIN=1,FUN=multiSource)
Gives TRUE for all.
I'd otherwise like to be saying...
df$rowMean <- rowMeans(df, na.rm=T)
# so, in this case
rowMeansIfTest <- function(X,test) {
# is this row multiSource TRUE?
# if yes, return(rowMeans(X))
# else return(NA)
}
df$rowMeanIfDiverseData <- rowMeansIfTest(df, test=multiSource)
But it is unclear to me how to do this without some kind of for loop.
The strategy here is to split the data frame by columns into variable groups and, for each row, identify whether each group has any non-NA values. We then check with rowSums that at least two groups have non-NA values for a row, and if so, add the mean of those values with cbind.
This will generalize to any number of columns so long as they are named in the AA_varXXX format, and so long as the only column not in that format is myid. Easy enough to fix if this isn't strictly the case, but these are the limitations on the code as written now.
df.dat <- df[!names(df) == "myid"]
diverse.rows <- rowSums(
sapply(
split.default(df.dat, gsub("^([A-Z]{2})_var.*", "\\1", names(df.dat))),
function(x) apply(x, 1, function(y) any(!is.na(y)))
) ) > 1
cbind(df, div.mean=ifelse(diverse.rows, rowMeans(df.dat, na.rm=T), NA))
Produces:
AA_var1 AA_var2 myid BB_var3 BB_var4 div.mean
1 NA NA 123456 10 12 NA
2 NA 10 194200 12 NA 11
3 12 10 132200 NA NA NA
4 12 NA 132201 NA 12 12
5 NA NA 132202 NA NA NA
6 12 13 132203 14 NA 13
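A note on the prefix regex: because it keys on the literal _var part, it leaves names like bb_meow or bb_A_v1 (the ones in the question) unchanged, so each of those columns would end up as its own group. Below is a minimal sketch of the same strategy that takes everything before the first underscore as the source prefix. It assumes myid is the only non-source column and that every other column name contains an underscore; the data is retyped from the question.

```r
# Data retyped from the question, with the original column names
df <- data.frame(
  AA_var1 = c(NA, NA, 12, 12, NA, 12),
  AA_var2 = c(NA, 10, 10, NA, NA, 13),
  myid    = c(123456, 194200, 132200, 132201, 132202, 132203),
  bb_meow = c(10, 12, NA, NA, NA, 14),
  bb_A_v1 = c(12, NA, NA, 12, NA, NA)
)

vals   <- df[names(df) != "myid"]        # value columns only
prefix <- sub("_.*$", "", names(vals))   # text before the first underscore

# One logical column per source: does the row have any non-NA value there?
has_source <- sapply(split.default(vals, prefix),
                     function(g) rowSums(!is.na(g)) > 0)

# Keep the row mean only where more than one source contributes
diverse <- rowSums(has_source) > 1
df$rowMeanIfDiverseData <- ifelse(diverse, rowMeans(vals, na.rm = TRUE), NA)
df
```

Rows with a single source (or none) come out as NA directly, so no NaN cleanup step is needed.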
This solution seems a little convoluted to me, so there's probably a better way, but it should work for you.
# Here's your data:
df <- data.frame(AA_var1 = c(NA,NA,12,12,NA,12),
AA_var2 = c(NA,10,10,NA,NA,13),
BB_var3 = c(10,12,NA,NA,NA,14),
BB_var4 = c(12,NA,NA,12,NA,NA))
# calculate rowMeans for each subset of variables
a <- rowMeans(df[,grepl('AA',names(df))], na.rm=TRUE)
b <- rowMeans(df[,grepl('BB',names(df))], na.rm=TRUE)
# count non-missing values for each subset of variables
a2 <- rowSums(!is.na(df[,grepl('AA',names(df))]), na.rm=TRUE)
b2 <- rowSums(!is.na(df[,grepl('BB',names(df))]), na.rm=TRUE)
# calculate means:
rowSums(cbind(a*a2,b*b2)) /
rowSums(!is.na(df[,grepl('AA|BB',names(df))]), na.rm=TRUE)
Result:
> df$rowMeanIfDiverseData <- rowSums(cbind(a*a2,b*b2)) /
+ rowSums(!is.na(df[,grepl('AA|BB',names(df))]), na.rm=TRUE)
> df
AA_var1 AA_var2 BB_var3 BB_var4 rowMeanIfDiverseData
1 NA NA 10 12 NaN
2 NA 10 12 NA 11
3 12 10 NA NA NaN
4 12 NA NA 12 12
5 NA NA NA NA NaN
6 12 13 14 NA 13
And a little cleanup to exactly match your intended output:
> df$rowMeanIfDiverseData[is.nan(df$rowMeanIfDiverseData)] <- NA
> df
AA_var1 AA_var2 BB_var3 BB_var4 rowMeanIfDiverseData
1 NA NA 10 12 NA
2 NA 10 12 NA 11
3 12 10 NA NA NA
4 12 NA NA 12 12
5 NA NA NA NA NA
6 12 13 14 NA 13
My attempt, somewhat longwinded.....
dat<-data.frame(AA_var1=c(NA,NA,12,12,NA,12),
AA_var2=c(NA,10,10,NA,NA,13),
myid=1:6,
BB_var3=c(10,12,NA,NA,NA,14),
BB_var4=c(12,NA,NA,12,NA,NA))
# which columns hold the variables used in our mean
varcols<-grep("_var[0-9]+$",names(dat),value=T)
# which rows have the required diversity of non-NA sources
# I assume these column names start with capitals followed by an underscore
meanrow<-apply(!is.na(dat[,varcols]),1,function(x){n<-varcols[x]
1<length(unique(regmatches(n,regexpr("[A-Z]+_",n))))
})
#do the row mean for all
dat$meanval<-rowMeans(dat[,varcols],na.rm=T)
#null out for those without diversification (i.e. !meanrow)
dat[!meanrow,"meanval"]<-NA
I think some of the answers are making this seem more complicated than it is. This will do it:
df$means = ifelse(rowSums(!is.na(df[, grep('AA_var', names(df))])) &
rowSums(!is.na(df[, grep('BB_var', names(df))])),
rowMeans(df[, grep('_var', names(df))], na.rm = T), NA)
# AA_var1 AA_var2 myid BB_var3 BB_var4 means
#1 NA NA 123456 10 12 NA
#2 NA 10 194200 12 NA 11
#3 12 10 132200 NA NA NA
#4 12 NA 132201 NA 12 12
#5 NA NA 132202 NA NA NA
#6 12 13 132203 14 NA 13
Here's a generalization of the above, given the comment, assuming unique id's (if they're not, create a unique index instead):
library(data.table)
library(reshape2)
dt = data.table(df)
setkey(dt, myid) # not strictly necessary, but makes life easier
# find the conditional
cond = melt(dt, id.var = 'myid')[,
sum(!is.na(value)), by = list(myid, sub('_var.*', '', variable))][,
all(V1 != 0), keyby = myid]$V1
# fill in the means (could also do a join, but will rely on ordering instead)
dt[cond, means := rowMeans(.SD, na.rm = T), .SDcols = grep('_var', names(dt))]
dt
# AA_var1 AA_var2 myid BB_var3 BB_var4 means
#1: NA NA 123456 10 12 NA
#2: 12 10 132200 NA NA NA
#3: 12 NA 132201 NA 12 12
#4: NA NA 132202 NA NA NA
#5: 12 13 132203 14 NA 13
#6: NA 10 194200 12 NA 11
fun <- function(x) {
MEAN <- mean(c(x[1], x[2], x[4], x[5]), na.rm=TRUE)
CHECK <- sum(!is.na(c(x[1], x[2]))) > 0 & sum(!is.na(c(x[4], x[5]))) > 0
MEAN * ifelse(CHECK, 1, NaN)
}
df$rowMeanIfDiverseData <- apply(df, 1, fun)
df
AA_var1 AA_var2 myid BB_var3 BB_var4 rowMeanIfDiverseData
1 NA NA 123456 10 12 NaN
2 NA 10 194200 12 NA 11
3 12 10 132200 NA NA NaN
4 12 NA 132201 NA 12 12
5 NA NA 132202 NA NA NaN
6 12 13 132203 14 NA 13
Related
This is a followup question to a previous post of mine about building a function for calculating row means.
I want to use any function of the apply family to iterate over my dataset and each time compute the row mean (which is what the function does) for a group of columns I specify. Unfortunately, I miss something critical in the way I should tweak apply(), because I get an error that I can't troubleshoot.
Example Data
capital_cities_df <-
data.frame("europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df,
function(cc) cc[ sample(c(TRUE, NA),
prob = c(0.70, 0.30),
size = length(cc),
replace = TRUE) ]))
> capital_cities_df
europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 NA NA NA 41 NA
2 NA 12 22 NA 42 52
3 3 NA 23 33 43 NA
4 NA 14 NA NA NA NA
5 NA 15 25 35 45 NA
6 6 NA NA 36 NA 56
7 NA 17 NA NA NA 57
8 NA 18 NA 38 48 NA
9 NA 19 NA 39 49 NA
10 10 NA 30 40 NA 60
Custom Function
library(dplyr)
library(rlang)
continent_mean <- function(df, continent) {
df %>%
select(starts_with(continent)) %>%
dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
## works for a single case:
continent_mean(capital_cities_df, "europe")
europe_paris europe_london europe_rome europe
1 1 NA 21 11
2 2 12 22 12
3 3 NA 23 13
4 4 14 NA 9
5 NA 15 25 20
6 6 16 26 16
7 NA 17 NA 17
8 NA 18 NA 18
9 NA 19 NA 19
10 10 20 30 20
Trying to apply the function over the data, unsuccessfully
apply(
capital_cities_df,
MARGIN = 2,
FUN = continent_mean(capital_cities_df, continent = "europe")
)
Error in match.fun(FUN) :
'continent_mean(capital_cities_df, continent = "europe")' is not a function, character or symbol
Any other combination of the arguments in apply() didn't work either, nor did sapply. This unsuccessful attempt of using apply is only for one type of columns I wish to get the mean for ("europe"). However, my ultimate goal is to be able to pass c("europe", "asia", etc.) with apply, so I could get the custom function to create row means columns for all groups of columns I specify, in one hit.
What is wrong with my code?
Thanks!
EDIT 19-AUG-2019
I was trying the solution suggested by A. Suliman (see below). It did work for the example data I posted here, but not when trying to scale it up to my real dataset, where I need to subset additional columns (rather than the "continent" batch only). More specifically, in my real data I have an ID column which I want to get outputted along the other data, when I apply my custom-made function.
Example data including "ID" column
capital_cities_df <- data.frame(
"europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df, function(cc) cc[ sample(c(TRUE, NA),
prob = c(0.70, 0.30),
size = length(cc),
replace = TRUE) ]))
id <- 1:10
capital_cities_df <- cbind(id, capital_cities_df)
> capital_cities_df
id europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 1 NA NA NA 41 NA
2 2 NA 12 22 NA 42 52
3 3 3 NA 23 33 43 NA
4 4 NA 14 NA NA NA NA
5 5 NA 15 25 35 45 NA
6 6 6 NA NA 36 NA 56
7 7 NA 17 NA NA NA 57
8 8 NA 18 NA 38 48 NA
9 9 NA 19 NA 39 49 NA
10 10 10 NA 30 40 NA 60
My function (edited to select id as well)
continent_mean <- function(df, continent) {
df %>%
select(., id, starts_with(continent)) %>%
dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
> continent_mean(capital_cities_df, "europe") ## works in a single run
id europe_paris europe_london europe_rome europe
1 1 1 NA NA 1.000000
2 2 NA 12 22 12.000000
3 3 3 NA 23 9.666667
4 4 NA 14 NA 9.000000
5 5 NA 15 25 15.000000
6 6 6 NA NA 6.000000
7 7 NA 17 NA 12.000000
8 8 NA 18 NA 13.000000
9 9 NA 19 NA 14.000000
10 10 10 NA 30 16.666667
Trying to apply the function beyond the single use (based on A. Suliman's method) -- unsuccessfully
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
## or:
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
In either case I get a variety of error messages:
Error in inds_combine(.vars, ind_list) : Position must be between 0 and n
At other times:
Error: invalid column index : NA for variable: 'NA' = 'NA'
All I wanted was a simple function to let me calculate row means per specification of which columns to run over, but this gets nasty for some reason. Even though I'm eager to figure out what's wrong with my code, if anybody has a better overarching solution for the entire process I'd be thankful too.
Thanks!
Use lapply to loop through continents then use grep to select columns with the current continent
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
#To a dataframe not a list
do.call(cbind, lst)
Using map_dfc from purrr we can get the result in one step
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
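As for the original error: apply() and friends expect FUN to be a function, but FUN = continent_mean(capital_cities_df, continent = "europe") passes the result of calling the function (a data frame), which is why match.fun complains. A minimal sketch of the difference, using a made-up toy function square_plus (not from the question):

```r
# square_plus() is a hypothetical toy function used only for illustration
square_plus <- function(x, k) x^2 + k

# Wrong: the call is evaluated first, so FUN receives a numeric vector
res <- tryCatch(
  sapply(1:3, FUN = square_plus(1:3, k = 1)),
  error = function(e) conditionMessage(e)
)

# Right: pass a function and let sapply() do the calling
ok  <- sapply(1:3, function(x) square_plus(x, k = 1))

# Equivalent: pass the function itself, extra arguments go after it
ok2 <- sapply(1:3, square_plus, k = 1)
```

Here res holds the captured error message, while ok and ok2 hold the same numeric results.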
Update:
#grep will return column positions when they match with "europe" or "asia", e.g
> grep("europe", names(capital_cities_df))
[1] 2 3 4
#If we need the column names then we add value=TRUE to grep
> grep("europe", names(capital_cities_df), value = TRUE)
[1] "europe_paris" "europe_london" "europe_rome"
So to add a new column we can just use the c() function and call the function as usual
#NOTE: Here I'm using the old function without select
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, c('id',grep(x, names(capital_cities_df), value = TRUE))], continent=x))
do.call(cbind, lst)
id europe_paris europe_london europe_rome europe id asia_bangkok asia_tokyo asia_kathmandu asia
1 1 1 NA NA 1.00000 1 NA 41 51 31.00000
2 2 NA 12 22 12.00000 2 NA 42 52 32.00000
3 3 3 13 23 10.50000 3 33 43 NA 26.33333
4 4 NA 14 NA 9.00000 4 NA 44 54 34.00000
5 5 NA 15 25 15.00000 5 35 45 55 35.00000
6 6 6 NA NA 6.00000 6 36 46 56 36.00000
7 7 7 17 27 14.50000 7 NA 47 57 37.00000
8 8 NA 18 28 18.00000 8 38 48 NA 31.33333
9 9 9 19 29 16.50000 9 39 49 NA 32.33333
10 10 10 NA 30 16.66667 10 40 NA 60 36.66667
#We have one problem, id column gets duplicated, map_dfc with select will solve this issue
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, c('id',grep(.x, names(capital_cities_df), value = TRUE))], continent=.x)) %>%
#Don't select any column name ends with id followed by one digit
select(-matches('id\\d'))
If you'd like to use the new function with select then just pass capital_cities_df without grep, e.g using map_dfc
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df, continent=.x)) %>%
select(-matches('id\\d'))
Correction: in continent_mean
continent_mean <- function(df, continent) {
df %>%
select(., id, starts_with(continent)) %>%
#Exclude id from the rowMeans calculation
dplyr::mutate(!!quo_name(continent) := rowMeans(.[grep(continent, names(.))], na.rm = TRUE))
}
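If the goal is just one row-mean column per continent next to the original data, a base R sketch like the following avoids the duplicated id column altogether. It uses a small made-up data frame (not the question's sampled data) and assumes every group of columns shares a "continent_" name prefix:

```r
# Small illustrative data; values are invented for this sketch
capital_cities_df <- data.frame(
  id            = 1:4,
  europe_paris  = c(1, NA, 3, NA),
  europe_london = c(NA, 12, NA, 14),
  asia_tokyo    = c(41, 42, 43, NA),
  asia_bangkok  = c(NA, NA, 33, NA)
)
continents <- c("europe", "asia")

# One column of row means per continent; id is never part of the calculation
means <- sapply(continents, function(cont) {
  cols <- startsWith(names(capital_cities_df), paste0(cont, "_"))
  rowMeans(capital_cities_df[cols], na.rm = TRUE)
})

# Attach the new columns ("europe", "asia") to the untouched original data
result <- cbind(capital_cities_df, means)
result
```

Rows where a continent has no values at all come out as NaN (row 4 for asia here); replace them with NA afterwards if that is what you prefer.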
Here is the code I am using:
for(i in 1:(max_length - nrow(dat2006))){
datDummy[nrow(datDummy + i), ] <- NA
}
Can someone please suggest a better approach
#akrun: Example
df <- data.frame("var1" = c(1,2,3))
df
Original Data Frame
var1
1 1
2 2
3 3
For loop (Indexing is a problem, I wanted to add 5 more rows)
for(i in 1:5){
df[nrow(df) + i, ] <- NA
}
Output of For
df
var1
1 1
2 2
3 3
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
11 NA
12 NA
13 NA
14 NA
15 NA
16 NA
17 NA
18 NA
I was able to achieve what I wanted using the following code:
row2007 <- as.integer(max_length - nrow(dat2007))
for(i in row2007){
dat2007[(nrow(dat2007) + i), ] <- NA
}
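The reason the first loop overshoots: assigning to row nrow(df) + i creates every missing row up to that index, so iteration i adds i rows (1 + 2 + 3 + 4 + 5 = 15, hence 18 rows in total). That also means no loop is needed. A minimal sketch, where pad_rows is a hypothetical helper name of my own:

```r
# pad_rows() is a hypothetical helper, not from the original post
pad_rows <- function(d, n) {
  if (n > 0) {
    # Assigning past the last row extends the data frame with NA rows
    d[nrow(d) + seq_len(n), ] <- NA
  }
  d
}

df <- data.frame(var1 = c(1, 2, 3))
padded <- pad_rows(df, 5)
nrow(padded)  # 8
```

With the data from the question this would be called as pad_rows(dat2007, max_length - nrow(dat2007)).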
Hi, I have a df. How do I calculate a rolling mean over the previous N rows? In my case N is 3.
df <- data.frame(X=c(24,NA,NA,45,NA,20,24,10,40,20,20), Y=c(10,NA,14,14,14,10,NA,10,14,14,14))
I tried the following to get the output (o/p) below:
library(data.table)
library(zoo)
df[, Rolling.Average := rollmeanr(X, 3, fill = NA), by = Y]
o/p:
X Y Rolling.Average
24 10 NA
NA NA NA
NA 14 NA
45 14 NA
NA 14 45
20 10 NA
24 NA NA
10 10 20
40 14 NA
20 14 40
20 14 30
Error: k <= n is not TRUE
Please suggest any modifications to the above code.
Thanks
The error is thrown because you have fewer than 3 rows in the "NA" group, which data.table seems to treat as a group of its own.
So your data needs at least three entries for each by-group. Then the code works (I have added a row to your example below):
df <- data.frame(X=c(24,NA,NA,45,NA,20,24,10,40,20,20,20),
Y=c(10,NA,14,14,14,10,NA,10,14,14,14,NA))
library(data.table)
library(zoo)
df <- as.data.table(df)
df[, Rolling.Average := rollmeanr(X, k = 3, fill = NA), by = Y]
print(df)
# X Y Rolling.Average
# 1: 24 10 NA
# 2: NA NA NA
# 3: NA 14 NA
# 4: 45 14 NA
# 5: NA 14 NA
# 6: 20 10 NA
# 7: 24 NA NA
# 8: 10 10 18
# 9: 40 14 NA
#10: 20 14 NA
#11: 20 14 NA
#12: 20 NA NA
Or, you can exclude the NAs of the by-clause by something like:
df[!is.na(Y), Rolling.Average := rollmeanr(X, k = 3, fill = NA), by = Y]
Which yields the same output. It does not match your expected output, but I do not really understand how one should arrive at that.
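If the many NAs in the result are the issue, one option is a rolling mean that skips NAs inside each window and returns NA for groups that are too short instead of raising an error. Here is a base R sketch under those assumptions; roll_mean_right is a hypothetical helper of mine, and this does not reproduce the expected output in the question either:

```r
# roll_mean_right() is a hypothetical helper: right-aligned window of size k,
# NAs inside a window are ignored, incomplete windows give NA
roll_mean_right <- function(x, k) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n < k) return(out)  # group too short: all NA rather than an error
  for (i in k:n) {
    window <- x[(i - k + 1):i]
    if (any(!is.na(window))) out[i] <- mean(window, na.rm = TRUE)
  }
  out
}

df <- data.frame(X = c(24, NA, NA, 45, NA, 20, 24, 10, 40, 20, 20),
                 Y = c(10, NA, 14, 14, 14, 10, NA, 10, 14, 14, 14))

# Apply per group of Y, leaving rows with NA in Y untouched
df$Rolling.Average <- NA_real_
for (g in unique(df$Y[!is.na(df$Y)])) {
  idx <- which(df$Y == g)
  df$Rolling.Average[idx] <- roll_mean_right(df$X[idx], 3)
}
df
```

The loop over groups is only there to keep the sketch free of package dependencies; the helper can also be called inside a data.table by-group expression.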
Say I have a data frame as follows
rsi5 rsi10
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 44.96650 NA
7 39.68831 NA
8 28.35625 NA
9 37.77910 NA
10 53.54822 NA
11 52.05308 46.01867
12 80.44368 66.09973
13 60.88418 56.04507
14 53.59851 52.10633
15 46.45874 48.23648
I wish to simply add 1 (i.e. 9 becomes 10) to each non-NA element of this data frame. There is probably a very simple solution to this but simple arithmetics on dataframes do not seem to work in R giving very strange results.
Just use + 1 as you would expect. Below is a mock example, as it wasn't worth copying your data for this.
Step One: Create a data.frame
R> df <- data.frame(A=c(NA, 1, 2, 3), B=c(NA, NA, 12, 13))
R> df
A B
1 NA NA
2 1 NA
3 2 12
4 3 13
R>
Step Two: Add one
R> df + 1
A B
1 NA NA
2 2 NA
3 3 13
4 4 14
R>
I have positive, negative and NA values in a Table, I need to replace negative values by NA values. Positive and NA values should remain as they are in Table. My Data set is similar to the one below:
NO. q
1 NA
2 NA
3 -133.6105198
4 -119.6991209
5 28.84460104
6 66.05345087
7 84.7058947
8 -134.4522694
9 NA
10 NA
11 73.20465643
12 -69.90723514
13 NA
14 69.70833003
15 65.27859906
I tried this:
if (q>0) {
q=NA
} else {
q=q
}
Or use replace:
> df$q2 <- replace(df$q, which(df$q < 0), NA)
> df
NO. q q2
1 1 NA NA
2 2 NA NA
3 3 -133.61052 NA
4 4 -119.69912 NA
5 5 28.84460 28.84460
6 6 66.05345 66.05345
7 7 84.70589 84.70589
8 8 -134.45227 NA
9 9 NA NA
10 10 NA NA
11 11 73.20466 73.20466
12 12 -69.90724 NA
13 13 NA NA
14 14 69.70833 69.70833
15 15 65.27860 65.27860
Or with data.table:
library(data.table)
setDT(df)[q < 0, q := NA]
Or with replace in a dplyr pipe:
library(dplyr)
df %>% mutate(q = replace(q, which(q<0), NA))
You could try this:
sample <- c(1, -2, NA)
sample[sample < 0] <- NA
sample
[1] 1 NA NA
Or if you're using a data.frame (suppose it's called df):
df$q[df$q < 0] <- NA
You could try
df1$q1 <- NA^(df1$q <0) * df1$q
df1
# NO. q q1
#1 1 NA NA
#2 2 NA NA
#3 3 -133.61052 NA
#4 4 -119.69912 NA
#5 5 28.84460 28.84460
#6 6 66.05345 66.05345
#7 7 84.70589 84.70589
#8 8 -134.45227 NA
#9 9 NA NA
#10 10 NA NA
#11 11 73.20466 73.20466
#12 12 -69.90724 NA
#13 13 NA NA
#14 14 69.70833 69.70833
#15 15 65.27860 65.27860
Or use ifelse
with(df1, ifelse(q < 0, NA, q))
Or
is.na(df1$q) <- df1$q < 0
Another way of accomplishing the same thing is (now I see this is ALMOST the same as another answer by akrun, sorry for that)
daf$q = ifelse(daf$q < 0, NA_real_, daf$q)
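For what it's worth, the if attempt in the question cannot work because if () tests a single condition, not one per element, so a vectorised form like the ones above is needed. A quick sketch checking that three of the suggested variants agree, using the first few values from the question (the names v1, v2 and v3 are mine):

```r
# First few values from the question
q <- c(NA, NA, -133.6105198, -119.6991209, 28.84460104, 66.05345087)

# Variant 1: replace() with the positions of the negative values
v1 <- replace(q, which(q < 0), NA)

# Variant 2: logical indexing on a copy
v2 <- q
v2[v2 < 0] <- NA

# Variant 3: vectorised ifelse(), keeping the result numeric
v3 <- ifelse(q < 0, NA_real_, q)

identical(v1, v2) && identical(v1, v3)
```

All three leave the positive values and the existing NAs alone and only blank out the negatives.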