This question already has answers here:
Collapsing rows where some are all NA, others are disjoint with some NAs
(5 answers)
Closed 6 years ago.
I have a situation like this:
df<-data.frame(A=c(1, NA), B=c(NA, 2), C=c(3, NA), D=c(4, NA), E=c(NA, 5))
df
A B C D E
1 1 NA 3 4 NA
2 NA 2 NA NA 5
What I want is, given that each column contains exactly one non-NA value (i.e. sum(!is.na(df$*)) == 1 for every column), to reduce df to:
df
A B C D E
1 1 2 3 4 5
As long as every column yields the same number of non-NA values (so the resulting columns are of equal length), you can use:
dfNew <- do.call(data.frame, lapply(df, function(i) i[!is.na(i)]))
which results in
dfNew
A B C D E
1 1 2 3 4 5
This question already has answers here:
How to implement coalesce efficiently in R
(9 answers)
Closed 1 year ago.
I have data like this:
df <- data.frame(id=c(1, 2, 3, 4), A=c(6, NA, NA, 4), B=c(3, 2, NA, NA), C=c(4, 3, 5, NA), D=c(4, 3, 1, 2))
id A B C D
1 1 6 3 4 4
2 2 NA 2 3 3
3 3 NA NA 5 1
4 4 4 NA NA 2
For each row: if the row has a non-NA value in column "A", I want that value entered into a new column 'E'. If it doesn't, I want to move on to column "B" and enter that value into E, and so on. Thus, the new column would be E = c(6, 2, 5, 4).
I wanted to use the ifelse function, but I am not quite sure how to do this.
tidyverse
library(dplyr)
mutate(df, E = coalesce(A, B, C, D))
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4
coalesce effectively means "return the first non-NA value at each position across the given vectors"; it is R's equivalent of SQL's COALESCE.
base R
df$E <- apply(df[,-1], 1, function(z) na.omit(z)[1])
df
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4
na.omit removes all of the NA values, and [1] makes sure we always return just the first remaining value. The advantage of [1] over (say) head(., 1) is that head returns a zero-length vector if there are no non-NA elements, whereas [1] always returns at least an NA (indicating to you that nothing else was available).
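The difference only matters in the all-NA edge case; a small sketch with a hypothetical all-NA row:

```r
x <- c(NA, NA)
na.omit(x)[1]        # NA         -- length 1, keeps the row's slot
head(na.omit(x), 1)  # logical(0) -- zero length, would drop the slot
```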
This question already has answers here:
MATCH function in r [duplicate]
(1 answer)
New column in dataframe based on match between two columns [duplicate]
(1 answer)
Closed 4 years ago.
What's an elegant way (without additional packages) to "expand" a given data.frame according to one of its columns?
Given:
df <- data.frame(values = 1:5, strings = c("e", "g", "h", "b", "c"))
more.strings <- letters[c(3, 5, 7, 1, 4, 8, 6)]
Desired outcome: A data.frame containing:
5 c
1 e
2 g
NA a
NA d
3 h
NA f
So those values of df$strings appearing in more.strings should be used to fill the new data.frame (otherwise NA).
You can do a join. In base R you could do:
merge(df, more.strings, by.y="y",by.x="strings", all.y=TRUE)
strings values
1 c 5
2 e 1
3 g 2
4 h 3
5 a NA
6 d NA
7 f NA
or even, as given by @thelatemail in the comments:
merge(df, list(strings=more.strings),by="strings", all.y=TRUE)
Using the tidyverse:
library(tidyverse)
right_join(df,data.frame(strings=more.strings),by="strings")
values strings
1 5 c
2 1 e
3 2 g
4 NA a
5 NA d
6 3 h
7 NA f
We can do this without using any library, i.e. using only base R:
data.frame(value = with(df, match(more.strings, strings)),
strings = more.strings)
# value strings
#1 5 c
#2 1 e
#3 2 g
#4 NA a
#5 NA d
#6 3 h
#7 NA f
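One caveat worth noting: match() returns row positions, not values, so the line above works only because df$values happens to equal the row index 1:5. With arbitrary values you would index through the match result; a sketch of the general form, using the question's data:

```r
df <- data.frame(values = 1:5, strings = c("e", "g", "h", "b", "c"))
more.strings <- letters[c(3, 5, 7, 1, 4, 8, 6)]

# Look up positions first, then pull the corresponding values
idx <- match(more.strings, df$strings)      # 5 1 2 NA NA 3 NA
out <- data.frame(values = df$values[idx], strings = more.strings)
out
```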
Or we can use complete
library(tidyverse)
complete(df, strings = more.strings) %>%
arrange(match(strings, more.strings)) %>%
select(names(df))
# A tibble: 7 x 2
# values strings
# <int> <chr>
#1 5 c
#2 1 e
#3 2 g
#4 NA a
#5 NA d
#6 3 h
#7 NA f
This question already has answers here:
Unique combination of all elements from two (or more) vectors
(6 answers)
Closed 5 years ago.
I just migrated from Python to R and I would like to know if there is any function in R which is similar to pandas.MultiIndex.from_product?
Example:
letters <- c('a', 'b')
numbers <- c(1, 2, 3)
df <- somefunction(letters, numbers)
df
letters numbers
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
Yes:
> letters <- c('a', 'b')
> numbers <- c(1, 2, 3)
> expand.grid(letters=letters, numbers=numbers)
letters numbers
1 a 1
2 b 1
3 a 2
4 b 2
5 a 3
6 b 3
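Note that expand.grid varies its first argument fastest, so the row order above differs from the question's desired output. One way to match it (a sketch) is to pass the arguments in reverse and then reorder the columns; also note that expand.grid returns character input as a factor by default, so pass stringsAsFactors = FALSE if you need plain characters:

```r
letters2 <- c('a', 'b')   # renamed to avoid masking base R's letters
numbers <- c(1, 2, 3)

# numbers varies fastest; reorder columns back to letters/numbers
df <- expand.grid(numbers = numbers, letters = letters2,
                  stringsAsFactors = FALSE)[, c("letters", "numbers")]
df
#   letters numbers
# 1       a       1
# 2       a       2
# 3       a       3
# 4       b       1
# 5       b       2
# 6       b       3
```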
You can also use CJ from the data.table package. It is faster, but the result is a data.table rather than an ordinary data frame:
> library(data.table)
> CJ(letters=letters, numbers=numbers)
letters numbers
1: a 1
2: a 2
3: a 3
4: b 1
5: b 2
6: b 3
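If downstream code expects a plain data frame, the CJ result can be converted; data.table::setDF does this in place (a sketch):

```r
library(data.table)

letters2 <- c('a', 'b')
numbers <- c(1, 2, 3)

dt <- CJ(letters = letters2, numbers = numbers)
setDF(dt)   # strip the data.table class in place
class(dt)   # "data.frame"
```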
This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 7 years ago.
I have a data.frame and a vector. I want to output only the rows from the data frame that have values in a column in common with the vector v.
For example:
v = (1,2,3,4,5)
df =
A B
1 a 2
2 b 6
3 c 4
4 d 1
5 e 8
What I want is: keep row i whenever df$B[i] is in v, and remove it otherwise, for i = 1:nrow(df).
output should be
A B
1 a 2
2 c 4
3 d 1
since 2, 4, and 1 are in v.
You should make use of the %in% operator.
v <- c(1, 2, 3, 4, 5)
df <- read.table(text =
" A B
1 a 2
2 b 6
3 c 4
4 d 1
5 e 8", header = TRUE)
out <- df[df$B %in% v, ]
This gives:
A B
1 a 2
3 c 4
4 d 1
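An advantage of %in% over == here is that it never returns NA: a missing value in df$B simply fails the test instead of producing an all-NA row. The rows to drop are just the negation; a sketch using the same data:

```r
v <- c(1, 2, 3, 4, 5)
df <- data.frame(A = c("a", "b", "c", "d", "e"), B = c(2, 6, 4, 1, 8))

kept    <- df[df$B %in% v, ]     # rows whose B is in v
dropped <- df[!(df$B %in% v), ]  # the complement
stopifnot(nrow(kept) + nrow(dropped) == nrow(df))
```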
This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 5 years ago.
I have a dataframe with 2500 rows. A few of the rows have NAs (an excessive number of NAs), and I want to remove those rows.
I've searched the SO archives, and come up with this as the most likely solution:
df2 <- df[df[, 12] != NA,]
But when I run it and look at df2, all I see is a screen full of NAs (and <NA>s).
Any suggestions?
Depending on what you're looking for, one of the following should help you on your way:
Some sample data to start with:
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
E = c(NA, 2, 3, 4))
mydf
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
If you wanted to remove rows according to just a few specific columns, you can use complete.cases or the solution suggested by @SimonO101 in the comments. Here, I'm removing rows that have an NA in the first column.
mydf[complete.cases(mydf$A), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
mydf[!is.na(mydf[, 1]), ]
# A B C D E
# 1 1 1 1 NA NA
# 2 2 NA NA 2 2
# 4 4 4 4 4 4
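As an aside, this is why the original attempt df[df[, 12] != NA, ] fails: any comparison with NA yields NA, so the row filter is all-NA, and indexing with NA produces rows of NAs. A minimal sketch:

```r
c(1, NA, 3) != NA    # NA NA NA          -- never TRUE or FALSE
is.na(c(1, NA, 3))   # FALSE TRUE FALSE  -- use this instead
```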
If, instead, you wanted to set a threshold, as in "keep only the rows that have fewer than 2 NA values" (without caring which columns the NA values are in), you can try something like this:
mydf[rowSums(is.na(mydf)) < 2, ]
# A B C D E
# 3 NA 3 3 3 3
# 4 4 4 4 4 4
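A proportional variant of the same idea uses rowMeans instead of rowSums; here is a sketch with a hypothetical 20% threshold, which selects the same rows on this data:

```r
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
                   C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
                   E = c(NA, 2, 3, 4))

# Keep rows where at most 20% of the values are NA
mydf[rowMeans(is.na(mydf)) <= 0.2, ]
#    A B C D E
# 3 NA 3 3 3 3
# 4  4 4 4 4 4
```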
At the other extreme, if you want to delete all rows that have any NA values at all, just use complete.cases:
mydf[complete.cases(mydf), ]
# A B C D E
# 4 4 4 4 4 4
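na.omit is a common shorthand for the same operation; it additionally records the dropped row indices in an "na.action" attribute. A sketch with the same data:

```r
mydf <- data.frame(A = c(1, 2, NA, 4), B = c(1, NA, 3, 4),
                   C = c(1, NA, 3, 4), D = c(NA, 2, 3, 4),
                   E = c(NA, 2, 3, 4))

na.omit(mydf)
#   A B C D E
# 4 4 4 4 4 4
```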