Using mutate and last functions with NAs - r

With the last function from the dplyr package, you can take the last element of a vector, excluding NAs, by wrapping the vector in na.omit.
library(dplyr)
x <- c(1:10,NA)
last(x)
# [1] NA
last(na.omit(x))
# [1] 10
I would like to replace every value of var1 within each id by the last non-NA element of var1 for that id. The following is an example of the data frame used.
id<-rep(c(1,2,3),c(3,2,2))
var1<-c(5,1,4,2,NA,NA,NA)
df<-data.frame(id,var1)
df
# id var1
# 1 1 5
# 2 1 1
# 3 1 4
# 4 2 2
# 5 2 NA
# 6 3 NA
# 7 3 NA
Notice that id=1 contains only numeric values for var1, id=2 contains one numeric value and one NA, while id=3 contains only NAs and no numeric values.
I would like to obtain the following:
df
# id var1
# 1 1 4
# 2 1 4
# 3 1 4
# 4 2 2
# 5 2 2
# 6 3 NA
# 7 3 NA
Here is what I did to achieve what I wanted, but I got an error.
mutate(var1=ifelse(length(na.omit(var1))==0,NA,last(na.omit(var1))))
# Error: Unsupported vector type language
EDIT1: Based on the comments, the above code works well for dplyr 0.4.3, and apparently not for dplyr 0.5.0 (in my case). Additionally, I want to impute using the last element not the element with the maximum value. Thus, I have changed my data frame to make it more general.
EDIT2: I have used a data frame that lists all possible cases: (1) all numeric, (2) numeric + NAs and (3) all NAs.

I was asked to explain my solution, but I actually don't fully understand why OP's solution doesn't work. Initially I thought it was due to the class of the object returned by na.omit:
> na.omit(var1)
[1] 1 2 3 4
attr(,"na.action")
[1] 5
attr(,"class")
[1] "omit"
But then I noticed that nth (and I think last is just a wrapper for it) works fine:
df %>%
group_by(id) %>%
mutate(var1=nth(na.omit(var1),-1L))
An alternative is to use tail rather than last:
df %>%
group_by(id) %>%
mutate(var1=tail(na.omit(var1),1))
Or to create a new function, as I initially did:
aa <- function(x) last(na.omit(x))
df %>% group_by(id) %>% mutate(var1=aa(var1))
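A note on the all-NA group (id 3): tail(na.omit(x), 1) returns a zero-length vector when x contains only NAs, so it may be worth handling that case explicitly. The following is a minimal base R sketch, not taken from the answers above; last_non_na is a hypothetical helper name, and ave() is used here in place of group_by() plus mutate().

```r
# Hypothetical helper: last non-missing value of x, or NA if there is none
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0) NA else keep[length(keep)]
}

# Same example data as in the question
id   <- rep(c(1, 2, 3), c(3, 2, 2))
var1 <- c(5, 1, 4, 2, NA, NA, NA)
df   <- data.frame(id, var1)

# ave() applies FUN within each id and returns a vector aligned with the rows;
# the single value returned per group is recycled over that group's rows
df$var1 <- ave(df$var1, df$id, FUN = last_non_na)
df
```

The same helper could presumably be used as mutate(var1 = last_non_na(var1)) after group_by(id).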
I was curious about any differences in performance, so I benchmarked them; I would say they are equivalent:
Unit: microseconds
expr min lq mean median uq max neval
mutate(var1 = nth(na.omit(var1), -1L)) 795.270 830.4880 1022.196 897.6375 1026.795 4437.483 1000
mutate(var1 = tail(na.omit(var1))) 791.035 825.6165 1011.288 892.6270 1037.463 3406.842 1000
mutate(var1 = aa(var1)) 788.085 825.5180 1108.872 888.9945 1036.664 102915.926 1000

Using the dplyr package, we can group by id, take the maximum of var1 within each group, and write it back to var1:
library(dplyr)
df <- df %>%
group_by(id) %>%
mutate(var1 = max(var1, na.rm = TRUE))
df
id var1
<dbl> <int>
1 1 3
2 1 3
3 1 3
4 2 4
5 2 4

I had a similar issue. This worked for me:
df %>%
group_by(id) %>%
mutate(missing = is.na(var1)) %>%
mutate(var1 = ifelse(any(!missing), var1[!missing][length(var1[!missing])], NA))

Related

Appending a column to each data frame within a list

I have a list of data frames and want to append a new column to each, but I keep getting various error messages. Can anybody explain why the code below doesn't work for me? I'd be happy if rowid_to_column works, as the data in my actual set is already ordered correctly; otherwise I'd like a new column going from 1:length(data$data).
##dataset
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
nest_by(Location)
###map + rowid_to_column
attempt1<- data%>%
map(.,rowid_to_column(.,var = "hour"))
##mutate
attempt2<-data %>%
map(., mutate("Hours" = 1:6))
###add column
attempt3<- data%>%
map(.$data,add_column(.data,hours = 1:6))
newcolumn<- 1:6
###lapply
attempt4<- lapply(data,cbind(data$data,newcolumn))
Many thanks,
Stuart
You were nearly there with your base R attempt, but you want to iterate over data$data, which is a list of data frames.
data$data <- lapply(data$data, function(x) {
hour <- seq_len(nrow(x))
cbind(x, hour)
})
data$data
# [[1]]
# Day Average Amplitude hour
# 1 1 6.070539 1.123182 1
# 2 2 3.638313 8.218556 2
# 3 3 11.220683 2.049816 3
# 4 4 12.832782 14.858611 4
# 5 5 12.485757 7.806147 5
# 6 6 19.250489 6.181270 6
Edit: Updated, as I realised it was iterating over columns rather than rows. This approach also works if the data frames have different numbers of rows, which the methods that hard-code the vector 1:6 do not.
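To illustrate the point about differing row counts, here is a small self-contained sketch. The list frames and its values are made up for the example; only the seq_len(nrow(x)) idea comes from the answer above.

```r
# Hypothetical list of data frames with different numbers of rows
frames <- list(
  London  = data.frame(Day = 1:3, Average = c(6.1, 3.6, 11.2)),
  Glasgow = data.frame(Day = 1:5, Average = c(12.8, 12.5, 19.3, 4.0, 7.7))
)

# Add a running index whose length follows each data frame's own row count
frames <- lapply(frames, function(x) {
  x$hour <- seq_len(nrow(x))
  x
})

frames$Glasgow
```

A fixed vector such as 1:6 would not fit either of these data frames, whereas seq_len(nrow(x)) adapts to each one.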
A data.table approach:
library(data.table)
setDT(data)
data[, data := lapply(data, function(x) cbind(x, new_col = 1:6))]
data$data
# [[1]]
# Day Average Amplitude test new_col
# 1 1 11.139917 0.3690539 1 1
# 2 2 5.350847 7.0925508 2 2
# 3 3 9.602104 6.1782818 3 3
# 4 4 14.866074 13.7356913 4 4
# 5 5 1.114201 1.1007080 5 5
# 6 6 2.447236 5.9944926 6 6
#
# [[2]]
# Day Average Amplitude test new_col
# 1 1 17.230213 13.966576 1 1
# .....
A purrr approach:
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
group_split(Location) %>%
purrr::map_dfr(~.x %>% mutate(Hours = c(1:6)))
If you want to use your approach and preserve the same data structure, here is a way to do it, again using purrr (you need to ungroup first, otherwise it will not work because of the rowwise grouping):
data %>% ungroup() %>%
mutate_at("data", .f = ~map(.x, ~.x %>% mutate(Hours = c(1:6))) )

Remove first 10 and last 10 values

I have a file that contains multiple individuals and multiple values for the same individual.
I need to remove the first 10 and last 10 values of each individual, putting all the leftover values in a new table.
This is what my data kinda looks like:
Cow Data
NL123456 123
NL123456 456
I tried a for-loop that counts how many values there are per individual, but I think I already got stuck there because I am not using the right command. (The Cow column is a factor.)
I figured removing the first and last had to be something like this:
data1[c(11: n-10),]
If you know that you always have more than 20 data points per cow, you can do the following, illustrated on the iris dataset:
library(dplyr)
dim(iris)
# [1] 150 5
iris_trimmed <-
iris %>%
group_by(Species) %>%
slice(11:(n()-10)) %>%
ungroup()
dim(iris_trimmed)
# [1] 90 5
On your data:
res <-
your_data %>%
group_by(Cow) %>%
slice(11:(n()-10)) %>%
ungroup()
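The parentheses in slice(11:(n() - 10)) matter. In R the : operator binds more tightly than binary minus, so an expression like 11:n - 10 (as in the question's attempt) is read as (11:n) - 10. A small illustration, with an assumed group size of 30:

```r
n <- 30  # assumed number of rows in one group, for illustration only

wrong <- 11:n - 10    # parsed as (11:n) - 10, i.e. the indices 1 to 20
right <- 11:(n - 10)  # the intended indices 11 to 20

wrong
right
```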
In base R you can do:
iris_trimmed <- do.call(
rbind,
lapply(split(iris, iris$Species),
function(x) head(tail(x,-10),-10)))
dim(iris_trimmed)
# [1] 90 5
Using data.table:
library(data.table)
idt <- as.data.table(iris)
idt[, .SD[11:(.N-10)], Species]
Same logic in base R:
do.call(
rbind,
lapply(
split(iris, iris[["Species"]]),
function(x) x[11:(nrow(x)-10), ]
)
)
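One difference between the two base R variants shows up when a group has 20 rows or fewer: 11:(nrow(x) - 10) then counts downwards, while head(tail(x, -10), -10) simply returns zero rows. A minimal sketch, where trim_ends and the 15-row data frame small are hypothetical names made up for the example:

```r
# Hypothetical helper: drop the first k and the last k rows of a data frame
trim_ends <- function(x, k = 10) head(tail(x, -k), -k)

# A group with only 15 rows (made-up data)
small <- data.frame(Cow = "NL123456", Data = 1:15)

nrow(trim_ends(small))   # no rows are left, and no error is raised
11:(nrow(small) - 10)    # a descending sequence from 11 down to 5, not an empty index

# With enough rows both approaches select the same rows
big <- data.frame(Cow = "NL123456", Data = 1:25)
trim_ends(big)$Data
```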
Here is a solution with dplyr.
In my example I remove only the first and the last value of each group (to drop the first and last k values, replace 2 in filter with k + 1).
The idea is to group by id, number the rows of each group from the top (n) and from the bottom (n1), and then filter on those two counters.
library(dplyr)
data %>%
group_by(id) %>%
mutate(n=1:n(),
n1 = n():1) %>% # n and n1 are the row numbers
filter(n >= 2, n1 >= 2) %>% # use 11 instead of 2 to drop the first and last 10 rows
# filter() keeps only the rows that you want
select(-n, -n1) %>%
ungroup()
# # A tibble: 4 x 2
# id value
# <dbl> <int>
# 1 1 6
# 2 1 8
# 3 2 1
# 4 2 2
Data:
set.seed(123)
data <- data.frame(id = c(rep(1,4), rep(2,4)), value=sample(8))
data
# id value
# 1 1 3
# 2 1 6
# 3 1 8
# 4 1 5
# 5 2 4
# 6 2 1
# 7 2 2
# 8 2 7

r: Summarise for rowSums after group_by

I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get an error saying that my lengths don't match. I've tried variations using mutate to create the column or summarise_all, but neither of these seems to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 0 0 3
Then the result would be:
Month Average
1 8/3
You can try:
library(tidyverse)
airquality %>%
select(Month, target_vars) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want. It's regular R. The sapply function keeps the months separated by "name". The sum function applied to each dataframe will not keep the column sums separate. (Correction # 2: used only target_vars):
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the per number of variable results, then you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581
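The same "row sums first, then the mean within each month" computation can also be sketched in base R with rowSums and tapply. This is an illustrative alternative rather than one of the posted answers; the names row_totals, monthly_avg and result are made up here.

```r
target_vars <- c("Ozone", "Temp", "Solar.R")

# Sum the selected columns row by row, ignoring NAs
row_totals <- rowSums(airquality[target_vars], na.rm = TRUE)

# Average those row totals within each month
monthly_avg <- tapply(row_totals, airquality$Month, mean)

# One row per month
result <- data.frame(Month   = as.integer(names(monthly_avg)),
                     Average = as.vector(monthly_avg))
result
```

For month 5 this should agree with the mean5 value of 243.2581 from the validation above.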

How to repeat empty rows so that each split has the same number

My goal is to get the same number of rows for each split (based on column Initials). I am trying to pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt failed completely. Does anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr,function(x){
toadd<-maxrows-dim(x)[1]
replicate(toadd,x<-rbind(x,rep(NA,1)))#colnames -1 because col 1 should be the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(tabulate(df$Initials)))
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
my_rows <- seq.int(max(tabulate(df$Initials)))
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by initials, add row_numbers, ungroup, complete row numbers/Initials combinations, then remove our row numbers:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
mutate(row = row_number()) %>%
ungroup() %>%
complete(Initials, row) %>%
select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate how many extra rows each initial needs, combine those initials with NA values, and then rbind the result to the data frame.
max(table(df$Initials)) gives the number of rows of the most frequent initial, in this case 2 (for a). Subtracting table(df$Initials) from that maximum gives a vector with the number of rows to add for each initial. An added bonus of using table is that this vector is automatically named.
We use the names of the new vector to know which initials to repeat, and its values to know how many times each should be repeated.
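Because the added row is a character vector, rbind turns the data column into character. To restore the class, store the result (say, as newdf) and add newdf$data <- as.numeric(newdf$data).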
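A variation on the same idea that keeps data numeric from the start is to build the padding rows as a data frame instead of a character vector. This is only a sketch; to_add, pad and padded are names chosen for the example.

```r
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4))

# Number of rows each initial is missing relative to the largest group
counts <- table(df$Initials)
to_add <- as.integer(max(counts) - counts)
names(to_add) <- names(counts)

# Padding rows: the initial repeated as often as needed, with a numeric NA
pad <- data.frame(Initials = rep(names(to_add), to_add),
                  data     = rep(NA_real_, sum(to_add)))

# Append the padding and sort so that each initial's rows stay together
padded <- rbind(df, pad)
padded <- padded[order(padded$Initials), ]
padded
```

Since both data frames hold data as numeric, no conversion back with as.numeric is needed afterwards.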

Count occurrence across multiple columns using R & dplyr

This should be a simple solution... I just can't wrap my head around this. I'd like to count the occurrences of a factor across multiple columns of a data frame. There are 13 columns, ranging from abx.1 to abx.13, and a huge number of rows.
Sample data frame:
library(dplyr)
abx.1 <- c('Amoxil', 'Cipro', 'Moxiflox', 'Pip-tazo')
start.1 <- c('2012-01-01', '2012-02-01', '2013-01-01', '2014-01-01')
abx.2 <- c('Pip-tazo', 'Ampicillin', 'Amoxil', NA)
start.2 <- c('2012-01-01', '2012-02-01', '2013-01-01', NA)
abx.3 <- c('Ampicillin', 'Amoxil', NA, NA)
start.3 <- c('2012-01-01', '2012-02-01', NA,NA)
worksheet <-data.frame (abx.1, start.1, abx.2, start.2, abx.3, start.3)
Result I'd like:
name count
Amoxil 3
Ampicillin 2
Pip-tazo 2
Cipro 1
Moxiflox 1
I've tried :
worksheet %>% group_by (abx.1, abx.2, abx.3) %>% summarise(count = n())
This doesn't give me my desired output. Any thoughts would be greatly appreciated.
If you want a dplyr solution, I'd suggest combining it with tidyr in order to convert your data to a long format first:
library(tidyr)
worksheet %>%
select(starts_with("abx")) %>%
gather(key, value, na.rm = TRUE) %>%
count(value)
# Source: local data frame [5 x 2]
#
# value n
# 1 Amoxil 3
# 2 Ampicillin 2
# 3 Cipro 1
# 4 Moxiflox 1
# 5 Pip-tazo 2
Alternatively, with base R, it's just
as.data.frame(table(unlist(worksheet[grep("^abx", names(worksheet))])))
# Var1 Freq
# 1 Amoxil 3
# 2 Cipro 1
# 3 Moxiflox 1
# 4 Pip-tazo 2
# 5 Ampicillin 2
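Neither output above is ordered by frequency as in the desired result. As a hedged sketch building on the base R answer, the table can be sorted before it is converted; the column names name and count follow the question, while abx_cols, counts and result are names made up here.

```r
# Sample data from the question (abx columns only)
worksheet <- data.frame(
  abx.1 = c("Amoxil", "Cipro", "Moxiflox", "Pip-tazo"),
  abx.2 = c("Pip-tazo", "Ampicillin", "Amoxil", NA),
  abx.3 = c("Ampicillin", "Amoxil", NA, NA)
)

# Pick the abx columns and pool their values as plain character strings
abx_cols <- grep("^abx", names(worksheet), value = TRUE)
counts   <- table(unlist(lapply(worksheet[abx_cols], as.character)))

# Sort by frequency (highest first) and convert to a two-column data frame
result <- as.data.frame(sort(counts, decreasing = TRUE))
names(result) <- c("name", "count")
result
```

table drops NA values by default, so the missing entries are not counted.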
