How do I delete rows with NAs and those that follow the NAs? - r

I have some data where I want to remove the NAs and the rows that follow the NAs, within each level of a factor.
Removing the NAs is easy:
df <- data.frame(a=c("A","A","A","B","B","B","C","C","C","D","D","D"), b=c(0,1,0,0,0,0,0,1,0,0,0,1) ,c=c(4,5,3,2,1,5,NA,5,1,6,NA,2))
df
newdf<-df[complete.cases(df),];newdf
The final result should remove all of the rows for C and the final two rows of D.
Hope you can help.

We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), then, grouped by 'a', take the cumulative sum of the logical vector of NA elements in 'c' and keep only the rows where that sum is still less than 1:
library(data.table)
setDT(df)[, .SD[cumsum(is.na(c))<1], by= a]
Or, a faster option: use .I to return the row indices that satisfy the condition and subset the rows with them.
setDT(df)[df[, .I[cumsum(is.na(c)) < 1], by = a]$V1]
# a b c
#1: A 0 4
#2: A 1 5
#3: A 0 3
#4: B 0 2
#5: B 0 1
#6: B 0 5
#7: D 0 6

A classic split-apply-combine in base R:
do.call(rbind,lapply(split(df, df$a),function(x)x[cumsum(is.na(x$c))<1,]))
Here it is again, but in several lines:
split_df <- split(df, df$a)
apply_df <- lapply(split_df, function(x)x[cumsum(is.na(x$c))<1,])
combine_df <- do.call(rbind, apply_df)
The result:
> do.call(rbind,lapply(split(df, df$a),function(x)x[cumsum(is.na(x$c))<1,]))
# a b c
#A.1 A 0 4
#A.2 A 1 5
#A.3 A 0 3
#B.4 B 0 2
#B.5 B 0 1
#B.6 B 0 5
#D D 0 6

A similar solution in dplyr would be
library(dplyr)
df %>% group_by(a) %>% filter(!is.na(cumsum(c)))
Output:
Source: local data frame [7 x 3]
Groups: a [3]
a b c
<fctr> <dbl> <dbl>
1 A 0 4
2 A 1 5
3 A 0 3
4 B 0 2
5 B 0 1
6 B 0 5
7 D 0 6
If we take the cumulative sum of variable 'c', every value after the first NA becomes NA. Doing this at the group level lets us drop those rows and get the desired output.
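For intuition, here is a quick standalone illustration of how cumsum propagates NA, using the 'c' values of group D from the example data:
cumsum(c(6, NA, 2))
#[1]  6 NA NA
!is.na(cumsum(c(6, NA, 2)))
#[1]  TRUE FALSE FALSE
Every value from the first NA onwards becomes NA, so filtering on !is.na(cumsum(c)) keeps only the rows before the first NA in each group.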

Related

Multiply columns in different dataframes

I am writing code to analyze a dataset with dplyr.
Here is how my table_1 looks:
1 A B C
2 5 2 3
3 9 4 1
4 6 3 8
5 3 7 3
And my table_2 looks like this:
1 D E F
2 2 9 3
Based on column "A" of table_1: if A > 6, I would like to create a column "G" in table_1 equal to C*D + C*E, with D and E taken from table_2.
Basically, it's like using table_2 as a set of scaling factors...
Is there any way I can do it?
I can apply a filter to column "A" and multiply column "C" by fixed numbers instead of the values from table_2:
table_1_New <- mutate(table_1, G = if_else(A > 6, C*2 + C*9, 0))
You could try
#Initialize G column with 0
df1$G <- 0
#Get index where A value is greater than 6
inds <- df1$A > 6
#Multiply those values with D and E from df2
df1$G[inds] <- df1$C[inds] * df2$D + df1$C[inds] * df2$E
df1
# A B C G
#2 5 2 3 0
#3 9 4 1 11
#4 6 3 8 0
#5 3 7 3 0
Using dplyr, we can do
df1 %>% mutate(G = ifelse(A > 6, C*df2$D + C*df2$E, 0))
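For a self-contained run, here is one way the data frames used above could be constructed from the tables in the question (df1/df2 are assumed stand-ins for table_1/table_2):
df1 <- data.frame(A=c(5,9,6,3), B=c(2,4,3,7), C=c(3,1,8,3))
df2 <- data.frame(D=2, E=9, F=3)
library(dplyr)
df1 %>% mutate(G = ifelse(A > 6, C*df2$D + C*df2$E, 0))
#  A B C  G
#1 5 2 3  0
#2 9 4 1 11
#3 6 3 8  0
#4 3 7 3  0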

R--aggregate like function that ensures interactions of all factor levels

I'm wondering how I can ensure that all interactions of the factor levels are included when using aggregate, even if some combinations don't appear in the given dataset.
dff <- data.frame(a=as.factor(c(rep(1,3), rep(2,4), rep(3,3))),
                  b=as.factor(c(rep("A", 4), rep("B", 6))),
                  c=sample(100, 10))
levels(dff$b) <- c(levels(dff$b), "C")
levels(dff$a) <- c(levels(dff$a), 10)
dff$b
#[1] A A A A B B B B B B
#Levels: A B C
dff$a
#[1] 1 1 1 2 2 2 2 3 3 3
#Levels: 1 2 3 10
aggregate(c~a+b, dff, sum)
# a b c
#1 1 A 233
#2 2 A 78
#3 2 B 212
#4 3 B 73
what I want is
a b c
1 1 A 233
2 1 B 0
3 1 C 0
4 2 A 78
5 2 B 212
6 2 C 0
7 3 A 0
8 3 B 73
9 3 C 0
10 10 A 0
11 10 B 0
12 10 C 0
NA is fine too.
The reason I want it in this format is that I need to interact dff$c with results from other datasets, and they may be of different lengths if not all factor levels are accounted for. I'm trying to avoid merge and instead use vectorised calculation.
Thank you in advance.
If your aggregation function is just going to be sum, you can simply use xtabs, which creates an object that inherits from class "table". You can then call data.frame on it, which dispatches to the corresponding method and produces a "long" data.frame.
data.frame(xtabs(c ~ b + a, dff))
# b a Freq
# 1 A 1 121
# 2 B 1 0
# 3 C 1 0
# 4 A 2 89
# 5 B 2 203
# 6 C 2 0
# 7 A 3 0
# 8 B 3 126
# 9 C 3 0
# 10 A 10 0
# 11 B 10 0
# 12 C 10 0
This is similar to #nicola's suggestion to use as.data.frame.table, which explicitly calls that method on something that is not of class "table" but can be treated as one.
One advantage of this approach (and all the others that follow) is that you can use different functions other than sum.
as.data.frame.table(tapply(dff$c, dff[c("a","b")], sum))
If merge is OK, you can continue with your aggregate step. In this case, we use expand.grid on the levels of your factor vectors:
merge(expand.grid(lapply(dff[c(1, 2)], levels)),
      aggregate(c~a+b, dff, sum, drop = FALSE), all = TRUE)
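To see what the expand.grid step contributes, it builds every combination of the factor levels (including the unused "C" and "10"), and the aggregate result is then merged onto that grid:
expand.grid(lapply(dff[c(1, 2)], levels))
#    a b
#1   1 A
#2   2 A
#3   3 A
#4  10 A
#5   1 B
#6   2 B
#7   3 B
#8  10 B
#9   1 C
#10  2 C
#11  3 C
#12 10 C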
A similar approach can be taken in "data.table":
library(data.table)
as.data.table(dff)[, sum(c), by = .(a, b)][do.call(CJ, lapply(dff[c(1, 2)], levels)), on = c("a", "b")]
Or using "dplyr" + "tidyr" (which essentially hides the merge, but ultimately uses left_join to create the missing combinations):
library(dplyr)
library(tidyr)
dff %>%
  group_by(a, b) %>%
  summarise(c = sum(c)) %>%
  complete(a, b, fill = list(c = 0))

Mutate with dplyr using multiple conditions

I have a data frame (df), shown below, and I want to use dplyr to add an additional column, result, that takes the value 1 where z == "gone" and x is the maximum value for its group y.
y x z
1 a 3 gone
2 a 5 gone
3 a 8 gone
4 a 9 gone
5 a 10 gone
6 b 1
7 b 2
8 b 4
9 b 6
10 b 7
If I were to simply select the maximum for each group it would be:
df %>%
  group_by(y) %>%
  slice(which.max(x))
which will return:
y x z
1 a 10 gone
2 b 7
This is not what I want. I need to use the maximum value of x for each group in y while also checking whether z == "gone"; if both conditions hold, the result should be 1, otherwise 0. It would look like:
y x z result
1 a 3 gone 0
2 a 5 gone 0
3 a 8 gone 0
4 a 9 gone 0
5 a 10 gone 1
6 b 1 0
7 b 2 0
8 b 4 0
9 b 6 0
10 b 7 0
I'm assuming I would use a conditional statement within mutate() but I cannot seem to find an example. Please advise.
With dplyr you can use:
df %>% group_by(y) %>% mutate(result = +(x == max(x) & z == 'gone'))
The +(..) notation is shorthand for as.integer, coercing the logical output to 1s and 0s. Some don't like it, so it's a matter of shorter code versus readability; any efficiency gains depend on the circumstances.
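A tiny standalone illustration of the coercion (not specific to this data):
+(c(TRUE, FALSE, TRUE))
#[1] 1 0 1
as.integer(c(TRUE, FALSE, TRUE))
#[1] 1 0 1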
Also, to appreciate what data.table and dplyr have done for data manipulation in R, let's do the same thing in the old-fashioned "split-apply-combine" way:
#split data.frame by group
split.df <- split(df, df$y)
#apply required function to each group
lst <- lapply(split.df, function(dfx) {
  dfx$result <- +(dfx$x == max(dfx$x) & dfx$z == "gone")
  dfx
})
#combine result in new data.frame
newdf <- do.call(rbind, lst)
We can do this with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'y', create the logical condition combining the maximum value of 'x' with the 'gone' element in 'z', coerce it to 'integer' (as.integer), and assign (:=) the output to the new column ('result').
library(data.table)
setDT(df)[, result := as.integer(x==max(x) & z=='gone') , by = y]
df
# y x z result
# 1: a 3 gone 0
# 2: a 5 gone 0
# 3: a 8 gone 0
# 4: a 9 gone 0
# 5: a 10 gone 1
# 6: b 1 0
# 7: b 2 0
# 8: b 4 0
# 9: b 6 0
#10: b 7 0
Or we can use ave from base R
df$result <- with(df, +(ave(x, y, FUN=max)==x & z=='gone' ))
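To see why this works, ave(x, y, FUN=max) returns a vector the same length as x in which each element is replaced by its group maximum, so comparing it to x flags the per-group maxima (assuming df holds the example data shown in the question):
with(df, ave(x, y, FUN=max))
# [1] 10 10 10 10 10  7  7  7  7  7
with(df, ave(x, y, FUN=max) == x)
# [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE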

How to combine two data frames using dplyr or other packages?

I have two data frames:
df1 = data.frame(index=c(0,3,4),n1=c(1,2,3))
df1
# index n1
# 1 0 1
# 2 3 2
# 3 4 3
df2 = data.frame(index=c(1,2,3),n2=c(4,5,6))
df2
# index n2
# 1 1 4
# 2 2 5
# 3 3 6
I want to combine these into:
index n
1 0 1
2 1 4
3 2 5
4 3 8 (index 3 appears in both data frames, so add 2 and 6)
5 4 3
6 5 0 (index 5 does not exist in either df, so set to 0)
7 6 0 (index 6 does not exist in either df, so set to 0)
The given data frames are just part of large dataset. Can I do it using dplyr or other packages in R?
Using data.table (this would be efficient for bigger datasets). I am not changing the column names, as rbindlist uses the names of the first dataset, i.e. in this case n from the second column (I don't know if this is a feature or a bug). Once you have joined the datasets with rbindlist, group by the index column (by=index) and take the sum of the n column (list(n=sum(n))).
library(data.table)
rbindlist(list(data.frame(index=0:6,n=0), df1,df2))[,list(n=sum(n)), by=index]
index n
#1: 0 1
#2: 1 4
#3: 2 5
#4: 3 8
#5: 4 3
#6: 5 0
#7: 6 0
Or using dplyr. Here the column names of all the datasets should be the same, so I am changing them before binding the datasets with rbind_list (if the names differ, a separate column is created for each name). After joining the datasets, group by index and then use summarise to take the sum of column n.
library(dplyr)
nm1 <- c("index", "n")
colnames(df1) <- colnames(df2) <- nm1
rbind_list(df1, df2, data.frame(index=0:6, n=0)) %>%
  group_by(index) %>%
  summarise(n=sum(n))
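Note that rbind_list has since been deprecated in dplyr in favour of bind_rows; with a current dplyr version the same idea would be (a sketch, assuming the column names were renamed as above):
bind_rows(df1, df2, data.frame(index=0:6, n=0)) %>%
  group_by(index) %>%
  summarise(n=sum(n))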
This is something you could do with the base functions aggregate and rbind
df1 = data.frame(index=c(0,3,4),n=c(1,2,3))
df2 = data.frame(index=c(1,2,3),n=c(4,5,6))
aggregate(n~index, rbind(df1, df2, data.frame(index=0:6, n=0)), sum)
which returns
index n
1 0 1
2 1 4
3 2 5
4 3 8
5 4 3
6 5 0
7 6 0
How about
names(df1) <- c("index", "n") # set colnames of df1 to target
df3 <- rbind(df1, setNames(df2, names(df1))) # set colnames of df2 and join
df <- df3 %>% dplyr::arrange(index) # sort by index
Cheers.

Removing rows after a certain value in R

I have a data frame in R,
df <- data.frame(a=c(1,1,1,2,2,5,5,5,5,5,6,6), b=c(0,1,0,0,0,0,0,1,0,0,0,1))
I want to remove the rows where the variable b equals 0 when they occur after a row where b equals 1, within each run of duplicated values of the variable a.
So the output I am looking for is,
df.out <- data.frame(a=c(1,1,2,2,5,5,5,6,6), b=c(0,1,0,0,0,0,1,0,1))
Is there a way to do this in R?
This should do the trick?
ind = intersect(which(df$b==0), which(df$b==1)+1)
df.out = df[-ind,]
which(df$b==1) returns the indices of df where b==1. Add one to these and intersect them with the indices where b==0.
How about
df[ ave(df$b, df$a, FUN=function(x) x>=cummax(x))==1, ]
# a b
# 1 1 0
# 2 1 1
# 4 2 0
# 5 2 0
# 6 5 0
# 7 5 0
# 8 5 1
# 11 6 0
# 12 6 1
Here we use ave to look within each level of a, and we test whether we've seen a 1 yet with cummax.
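As a quick illustration with the b values of group a == 5 from the example:
x <- c(0, 0, 1, 0, 0)
cummax(x)
#[1] 0 0 1 1 1
x >= cummax(x)
#[1]  TRUE  TRUE  TRUE FALSE FALSE
Once a 1 has been seen, cummax stays at 1, so any later 0 fails the test and is dropped.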
