R: Row-wise expansion of data frame with consecutive integers

I have measured some positions pos, e.g.:
library(dplyr)
set.seed(8)
data <-
data.frame(id=LETTERS[1:5],
pos=c(0,round(runif(4, 1, 10),0))) %>%
arrange(pos)
> data
id pos
1 A 0
2 C 3
3 B 5
4 E 7
5 D 8
How can I expand a data frame like data so that it contains every possible pos (0, 1, 2, ..., n), where n is max(data$pos) (i.e. 8 in this example)? I would like to get something like:
id pos
1 A 0
2 NA 1
3 NA 2
4 C 3
5 NA 4
6 B 5
7 NA 6
8 E 7
9 D 8

You can do this a number of ways, but one way, in base R, is by using merge:
merge(data.frame(pos = 0:8), data, all.x = TRUE)
Or, using dplyr, it's:
data.frame(pos = 0:8) %>% left_join(data)
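Both calls above hardcode the range 0:8. A minimal sketch that derives the upper bound from the data instead, assuming pos holds non-negative whole numbers (the names full and expanded are only illustrative, and data is rebuilt here with the values printed in the question):

```r
# Values copied from the printed example in the question
data <- data.frame(id = c("A", "C", "B", "E", "D"),
                   pos = c(0, 3, 5, 7, 8))

# Every position from 0 up to the largest observed one
full <- data.frame(pos = seq(0, max(data$pos)))

# Left join: keep all rows of `full`, fill unmatched ids with NA
expanded <- merge(full, data, by = "pos", all.x = TRUE)

# Restore the original column order
expanded <- expanded[c("id", "pos")]
expanded
```

Naming the key explicitly with by = "pos" avoids relying on merge picking the common column on its own.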

We can try
library(data.table)
setDT(data)[data.table(pos=0:8), on='pos']
# id pos
#1: A 0
#2: NA 1
#3: NA 2
#4: C 3
#5: NA 4
#6: B 5
#7: NA 6
#8: E 7
#9: D 8

Related

Conditionally replace NAs in Certain Columns Based on Row Values

For a dataframe like I have below, I am trying to selectively replace the NAs in columns a, b, and c with a 0 using R, but only when there is at least one missing value in those columns for that row.
For example, I would want to replace the NAs in rows 1,2, and 5, but leave row 4 alone, and not replace the NA in column d
sample data
df <- data.frame(a = c(1,NA,2,NA,3,4),
b = c(NA,5,6,NA,7,8),
c = c(9,NA,10,NA,NA,11),
d = c("Alpha","Beta","Charlie","Delta",NA,"Foxtrot"))
> df
a b c d
1 1 NA 9 Alpha
2 NA 5 NA Beta
3 2 6 10 Charlie
4 NA NA NA Delta
5 3 7 NA <NA>
6 4 8 11 Foxtrot
Desired outcome
> df_naReplaced
a b c d
1 1 0 9 Alpha
2 0 5 0 Beta
3 2 6 10 Charlie
4 NA NA NA Delta
5 3 7 0 <NA>
6 4 8 11 Foxtrot
The solutions that I have found so far only work on conditions by column, but not by row, or would require actively removing those columns from their context (in this example separating it from d).
I have tried using ifelse and an if statement like below but was unable to get it to work as selectively as I would like, as it replaces all NA in that column.
if(df %>% select(a:c) %>% any(!is.na(.))){
df<- df %>% replace_na(list(a= 0,
b= 0,
c= 0)
)
}
Thank you for whatever help you are able to offer!
Here's a base R solution:
> df[,-4][(is.na(df[, -4]) & rowSums(is.na(df[, -4])) < 3)] <- 0
> df
a b c d
1 1 0 9 Alpha
2 0 5 0 Beta
3 2 6 10 Charlie
4 NA NA NA Delta
5 3 7 0 <NA>
6 4 8 11 Foxtrot
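The one-liner above refers to column d by position (-4) and hardcodes the number of target columns (3). A sketch of the same idea with the target columns named explicitly, assuming the columns of interest are a, b and c (cols, na_mat and partial are illustrative names):

```r
df <- data.frame(a = c(1, NA, 2, NA, 3, 4),
                 b = c(NA, 5, 6, NA, 7, 8),
                 c = c(9, NA, 10, NA, NA, 11),
                 d = c("Alpha", "Beta", "Charlie", "Delta", NA, "Foxtrot"))

cols <- c("a", "b", "c")

# Logical matrix marking the NA cells in the target columns
na_mat <- is.na(df[cols])

# TRUE for rows that are not entirely NA across the target columns
partial <- rowSums(na_mat) < length(cols)

# `partial` is recycled down each column, so it acts as a per-row filter
df[cols][na_mat & partial] <- 0
df
```

Because the row condition is compared against length(cols), the same lines keep working if the set of target columns changes.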

Insert NA-rows in data frame according to rownames of other data frame

I have 2 data frames with different rownames, e.g.:
df1 <- data.frame(A = c(1,3,7,1,5), B = c(5,2,9,5,5), C = c(1,1,3,4,5))
df2 <- data.frame(A = c(4,3,2), B = c(4,4,9), C = c(3,9,3))
rownames(df2) <- c(1, 3, 6)
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
> df2
A B C
1 4 4 3
3 3 4 9
6 2 9 3
I need to insert NA rows in both data frames for each row that exists in only one of the data frames. In the given example:
> df1
A B C
1 1 5 1
2 3 2 1
3 7 9 3
4 1 5 4
5 5 5 5
6 NA NA NA
> df2
A B C
1 4 4 3
2 NA NA NA
3 3 4 9
4 NA NA NA
5 NA NA NA
6 2 9 3
I will have to perform this operation many times with different data frames, so I need an automated way to do this. I tried to solve the issue with various if/else constructs, but I am sure there must be a more automated way.
We can use union to collect the row names that occur in either data frame (%in% or intersect help with the matching), create an all-NA data frame with one row per name, and then fill in the rows whose names match those of each original data frame:
un1 <- union(rownames(df1), rownames(df2))
d1 <- as.data.frame(matrix(NA, ncol = ncol(df1),
nrow = length(un1), dimnames = list(un1, names(df1))))
d2 <- d1
d1[rownames(d1) %in% rownames(df1),] <- df1
d2[rownames(d2) %in% rownames(df2),] <- df2
d2
# A B C
#1 4 4 3
#2 NA NA NA
#3 3 4 9
#4 NA NA NA
#5 NA NA NA
#6 2 9 3
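Since the operation has to be repeated for many data frames, the same idea could be wrapped in a small helper. This is only a sketch: pad_rows is a hypothetical name, and it uses match so that row names are compared exactly rather than partially. Rows appear in the order returned by union, which may need sorting for other inputs.

```r
df1 <- data.frame(A = c(1, 3, 7, 1, 5), B = c(5, 2, 9, 5, 5), C = c(1, 1, 3, 4, 5))
df2 <- data.frame(A = c(4, 3, 2), B = c(4, 4, 9), C = c(3, 9, 3))
rownames(df2) <- c(1, 3, 6)

# Hypothetical helper: return `df` with one row per entry of `all_names`,
# inserting all-NA rows for names that `df` does not have
pad_rows <- function(df, all_names) {
  idx <- match(all_names, rownames(df))  # NA where the row is absent
  out <- df[idx, , drop = FALSE]         # an NA index yields an all-NA row
  rownames(out) <- all_names
  out
}

all_names <- union(rownames(df1), rownames(df2))
p1 <- pad_rows(df1, all_names)
p2 <- pad_rows(df2, all_names)
p2
```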

Filter data frame matching all values of a vector

I want to filter data frame x by including IDs that contain rows for Hour that match all values of testVector.
ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
x <- data.frame(ID, Hour)
x
ID Hour
1 A 0
2 A 2
3 A 5
4 A 6
5 A 9
6 B 0
7 B 2
8 B 5
9 B 6
10 C 0
11 C 2
testVector <- c('0','2','5')
The solution should yield the following data frame:
x
ID Hour
1 A 0
2 A 2
3 A 5
4 A 6
5 A 9
6 B 0
7 B 2
8 B 5
9 B 6
All values of ID C were dropped because it was missing Hour 5. Note that I want to keep all values of Hour for IDs that match testVector.
A dplyr solution would be ideal, but any solution is welcome.
Based on other related questions on SO, I'm guessing I need some combination of %in% and all, but I can't quite figure it out.
Your combination of %in% and all sounds promising. In base R you could use those to your advantage as follows:
to_keep = sapply(lapply(split(x,x$ID),function(x) {unique(x$Hour)}),
function(x) {all(testVector %in% x)})
x = x[x$ID %in% names(to_keep)[to_keep],]
Or similarly, but skipping an unnecessary lapply and more efficient, as per d.b. in the comments:
temp = sapply(split(x, x$ID), function(a) all(testVector %in% a$Hour))
x[temp[match(x$ID, names(temp))],]
Output:
ID Hour
1 A 0
2 A 2
3 A 5
4 A 6
5 A 9
6 B 0
7 B 2
8 B 5
9 B 6
Hope this helps!
Here's another dplyr solution without ever leaving the pipe:
ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
x <- data.frame(ID, Hour)
testVector <- c('0','2','5')
x %>%
group_by(ID) %>%
mutate(contains = Hour %in% testVector) %>%
summarise(all = sum(contains)) %>%
filter(all > 2) %>%
select(-all) %>%
inner_join(x)
## ID Hour
## <fctr> <fctr>
## 1 A 0
## 2 A 2
## 3 A 5
## 4 A 6
## 5 A 9
## 6 B 0
## 7 B 2
## 8 B 5
## 9 B 6
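The pipeline above hardcodes the threshold (all > 2 works only because testVector has three elements) and counts rows, so a repeated Hour within an ID could inflate the count. A shorter sketch that applies the %in% and all combination inside a grouped filter, assuming dplyr is installed (kept is an illustrative name):

```r
library(dplyr)

ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
x <- data.frame(ID, Hour, stringsAsFactors = FALSE)
testVector <- c('0','2','5')

kept <- x %>%
  group_by(ID) %>%
  # the condition is a single TRUE/FALSE per group, so whole groups are kept or dropped
  filter(all(testVector %in% Hour)) %>%
  ungroup()
kept
```

The condition does not depend on the length of testVector, so it should carry over unchanged to other test vectors.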
Here is an option using table from base R
i1 <- !rowSums(table(x)[, testVector]==0)
subset(x, ID %in% names(i1)[i1])
# ID Hour
#1 A 0
#2 A 2
#3 A 5
#4 A 6
#5 A 9
#6 B 0
#7 B 2
#8 B 5
#9 B 6
Or this can be done with data.table
library(data.table)
setDT(x)[, .SD[all(testVector %in% Hour)], ID]
# ID Hour
#1: A 0
#2: A 2
#3: A 5
#4: A 6
#5: A 9
#6: B 0
#7: B 2
#8: B 5
#9: B 6

Remove semi duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on columns a and c) so that the rows with values in column b are kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', 'b', and then the logical vector based on 'b', so that the NA elements sort last within each group of 'a'. Then apply duplicated to 'a' and 'c' and keep only the non-duplicated rows:
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
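The result above comes back sorted by 'a' and 'b' rather than in the original row order. A sketch of a variant that preserves the original order, assuming a row with NA in b should only survive when no non-NA row shares its a and c values (ord, keep and result are illustrative names):

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 2), "D")
b <- c(NA, 1, 2, 4, 1, NA, 2, NA, NA)
c <- c(1, 1, 2, 4, 1, 1, 2, 2, 2)
d <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
df <- data.frame(a, b, c, d)

# Row indices with non-NA b first; ties keep their original order
ord <- order(is.na(df$b))

# Among rows visited in that order, keep the first of each (a, c) pair
keep <- ord[!duplicated(df[ord, c("a", "c")])]

# Put the surviving rows back in their original order
result <- df[sort(keep), ]
result
```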
First create two datasets, one with the values of column a that occur more than once and one with the values that occur only once:
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
Now use na.omit on dataset x to delete the rows with NA, and then rbind x and x1 to get the final dataset.
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9
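You can use dplyr to do this. Note that distinct keeps the first row for each combination of a and c, so the row with NA in b is kept for A unless the data are sorted first.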
df %>% distinct(a, c, .keep_all = TRUE)
Output
a b c d
1 A NA 1 1
2 A 2 2 3
3 B 4 4 4
4 B 1 1 5
5 C 2 2 7
6 D NA 2 9
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr

How to merge two datasets by the different values in R?

I have two datasets and want to merge them. How can I add to the first dataset only the rows of the second that are not in the first?
A row should be added to the final dataset only if its id does not already exist in the other dataset. An example dataset:
x = data.frame(id = c("a","c","d","g"),
value = c(1,3,4,7))
y = data.frame(id = c("b","c","d","e","f"),
value = c(5,6,8,9,7))
The merged dataset should look like (the order is not important):
a 1
b 5
c 3
d 4
e 9
f 7
g 7
Using !, %in% and rbind:
rbind(x[!x$id %in% y$id,], y)
id value
1 a 1
4 g 7
3 b 2
41 c 3
5 d 4
6 e 5
7 f 6
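The call above keeps all of y and adds only the rows of x whose id is missing from y, so for ids present in both (c and d) the values come from y. To match the desired output in the question, where the values of x win, the roles can be swapped. A minimal sketch, with merged as an illustrative name and stringsAsFactors = FALSE assumed so that id is plain character:

```r
x <- data.frame(id = c("a", "c", "d", "g"),
                value = c(1, 3, 4, 7),
                stringsAsFactors = FALSE)
y <- data.frame(id = c("b", "c", "d", "e", "f"),
                value = c(5, 6, 8, 9, 7),
                stringsAsFactors = FALSE)

# Keep all of x, then append only the rows of y whose id is not in x
merged <- rbind(x, y[!y$id %in% x$id, ])

# Optional: sort by id and reset the row names
merged <- merged[order(merged$id), ]
rownames(merged) <- NULL
merged
```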
For your example to work, you first need to ensure that the id columns in the two data.frames are directly comparable. Since they're factors, you need to ensure they have the same levels/labels, or you can just convert them to character.
# convert factors to character
x$id <- as.character(x$id)
y$id <- as.character(y$id)
# merge
z <- merge(x,y,by="id",all=TRUE)
# keep first value, if it exists
z$value <- ifelse(is.na(z$value.x),z$value.y,z$value.x)
# keep desired columns
z <- z[,c("id","value")]
z
# id value
# 1 a 1
# 2 b 5
# 3 c 3
# 4 d 4
# 5 e 9
# 6 f 7
# 7 g 7
You already answered your own question, but just didn't realize it right away. :)
> merge(x,y,all=TRUE)
id value
1 a 1
2 c 3
3 c 6
4 d 4
5 d 8
6 g 7
7 b 5
8 e 9
9 f 7
EDIT
I'm a bit dense here and I'm not sure what you're getting at, so I'll provide a shotgun approach. I merged the data.frames by id and copied values from x to y if y was missing. Take whichever column you need.
> x = data.frame(id = c("a","c","d","g"),
+ value = c(1,3,4,7))
> y = data.frame(id = c("b","c","d","e","f"),
+ value = c(5,6,8,9,7))
> xy <- merge(x, y, by = "id", all = TRUE)
> xy
id value.x value.y
1 a 1 NA
2 c 3 6
3 d 4 8
4 g 7 NA
5 b NA 5
6 e NA 9
7 f NA 7
> find.na <- is.na(xy[, "value.y"])
> xy$new.col <- xy[, "value.y"]
> xy[find.na, "new.col"] <- xy[find.na, "value.x"]
> xy
id value.x value.y new.col
1 a 1 NA 1
2 c 3 6 6
3 d 4 8 8
4 g 7 NA 7
5 b NA 5 5
6 e NA 9 9
7 f NA 7 7
> xy[order(as.character(xy$id)), ]
id value.x value.y new.col
1 a 1 NA 1
5 b NA 5 5
2 c 3 6 6
3 d 4 8 8
6 e NA 9 9
7 f NA 7 7
4 g 7 NA 7
