Merge rows dependent on Column (delete NA values) [duplicate] - r

This question already has answers here:
Combining rows based on a column
(1 answer)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
I have a problem that I could not find a solution to yet.
I have a dataframe in R that looks like this:
p v1 v2 v3 v4 v5 v6 v7 v8 v9 <- Header
V1 1 2 3 NA NA NA NA NA NA
V2 1 2 3 NA NA NA NA NA NA
V3 1 2 3 NA NA NA NA NA NA
V1 NA NA NA 4 5 6 NA NA NA
V2 NA NA NA 4 5 6 NA NA NA
V3 NA NA NA 4 5 6 NA NA NA
V1 NA NA NA NA NA NA 7 8 9
V2 NA NA NA NA NA NA 7 8 9
V3 NA NA NA NA NA NA 7 8 9
How can I merge all the rows depending on the first column to get the following output:
V1 1 2 3 4 5 6 7 8 9
V2 1 2 3 4 5 6 7 8 9
V3 1 2 3 4 5 6 7 8 9
Thank you very much!

We can group by the first column and then get the sum:
library(dplyr)
df1 %>%
  group_by(p) %>%
  summarise_all(sum, na.rm = TRUE)
# A tibble: 3 x 10
# p v1 v2 v3 v4 v5 v6 v7 v8 v9
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 V1 1 2 3 4 5 6 7 8 9
#2 V2 1 2 3 4 5 6 7 8 9
#3 V3 1 2 3 4 5 6 7 8 9
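For readers without dplyr, the same result can be sketched in base R with `aggregate`. The data frame below is reconstructed from the printed output in the question, so its construction is an assumption:

```r
# Reconstruct the example data from the printed output above (an assumption)
vals <- matrix(NA_integer_, nrow = 9, ncol = 9)
for (b in 0:2) vals[3 * b + 1:3, 3 * b + 1:3] <- rep(3L * b + 1:3, each = 3)
df1 <- cbind(data.frame(p = rep(c("V1", "V2", "V3"), 3)), as.data.frame(vals))
names(df1)[-1] <- paste0("v", 1:9)

# na.action = na.pass keeps rows with NA cells so that
# sum(..., na.rm = TRUE) can drop the NAs per group itself
res <- aggregate(. ~ p, data = df1, FUN = sum, na.rm = TRUE, na.action = na.pass)
res
```

This should reproduce the `summarise_all` result above, one row per value of `p`.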

Related

r - Lag a data.frame by the number of NAs

I am trying to lag a data.frame that looks like this:
V1 V2 V3 V4 V5 V6
1 1 1 1 1 1
2 2 2 2 2 NA
3 3 3 3 NA NA
4 4 4 NA NA NA
5 5 NA NA NA NA
6 NA NA NA NA NA
To something that looks like this:
V1 V2 V3 V4 V5 V6
1 NA NA NA NA NA
2 1 NA NA NA NA
3 2 1 NA NA NA
4 3 2 1 NA NA
5 4 3 2 1 NA
6 5 4 3 2 1
So far, I have used a function that counts the number of NAs, and have tried to lag each column of my data.frame by the corresponding number of NAs in that column.
V1 <- c(1,2,3,4,5,6)
V2 <- c(1,2,3,4,5,NA)
V3 <- c(1,2,3,4,NA,NA)
V4 <- c(1,2,3,NA,NA,NA)
V5 <- c(1,2,NA,NA,NA,NA)
V6 <- c(1,NA,NA,NA,NA,NA)
mydata <- cbind(V1,V2,V3,V4,V5,V6)
na.count <- colSums(is.na(mydata))
lag.by <- function(mydata, na.count){lag(mydata, k = na.count)}
lagged.df <- apply(mydata, 2, lag.by)
But this code just lags the entire data.frame by one...
One option would be to loop through the columns with apply, putting the NA elements first (subset with is.na) followed by the non-NA elements (subset with the negated logical vector, !is.na):
apply(mydata, 2, function(x) c(x[is.na(x)], x[!is.na(x)]))
# V1 V2 V3 V4 V5 V6
#[1,] 1 NA NA NA NA NA
#[2,] 2 1 NA NA NA NA
#[3,] 3 2 1 NA NA NA
#[4,] 4 3 2 1 NA NA
#[5,] 5 4 3 2 1 NA
#[6,] 6 5 4 3 2 1
You could use the sort function with option na.last = FALSE like this:
edit:
Akrun's comment is a valid one. If the values need to stay in the order they appear in the data.frame, then Akrun's answer is the best; sort will put everything in order from low to high, with the NAs in front.
library(purrr)
map_df(as.data.frame(mydata), sort, na.last = FALSE)
# A tibble: 6 x 6
V1 V2 V3 V4 V5 V6
<int> <int> <int> <int> <int> <int>
1 1 NA NA NA NA NA
2 2 1 NA NA NA NA
3 3 2 1 NA NA NA
4 4 3 2 1 NA NA
5 5 4 3 2 1 NA
6 6 5 4 3 2 1
Or apply:
apply(mydata, 2, sort, na.last = FALSE)
V1 V2 V3 V4 V5 V6
[1,] 1 NA NA NA NA NA
[2,] 2 1 NA NA NA NA
[3,] 3 2 1 NA NA NA
[4,] 4 3 2 1 NA NA
[5,] 5 4 3 2 1 NA
[6,] 6 5 4 3 2 1
edit2:
As nicolo commented, order can preserve the original order of the values:
mydata[,3] <- c(4, 3, 1, 2, NA, NA)
map_df(as.data.frame(mydata), function(x) x[order(!is.na(x))])
# A tibble: 6 x 6
V1 V2 V3 V4 V5 V6
<int> <int> <dbl> <int> <int> <int>
1 1 NA NA NA NA NA
2 2 1 NA NA NA NA
3 3 2 4 NA NA NA
4 4 3 3 1 NA NA
5 5 4 1 2 1 NA
6 6 5 2 3 2 1
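The same `order(!is.na(x))` trick from the edit above also works without purrr, via `apply` on a matrix. A sketch using a trimmed three-column version of the question's data, with the out-of-order `V3` from the edit:

```r
# Data as in the question (trimmed to three columns), with the
# out-of-order V3 values from the edit above
V1 <- c(1, 2, 3, 4, 5, 6)
V2 <- c(1, 2, 3, 4, 5, NA)
V3 <- c(4, 3, 1, 2, NA, NA)
mydata <- cbind(V1, V2, V3)

# order() is a stable sort and FALSE sorts before TRUE, so order(!is.na(x))
# moves the NAs to the front while preserving the order of the other values
res <- apply(mydata, 2, function(x) x[order(!is.na(x))])
res
```

`V3` comes out as NA NA 4 3 1 2: the NAs lead, and 4, 3, 1, 2 keep their original relative order.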

Add new observations to a column in a dataframe in R

Let's start with two data frames:
m1 <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
df1 <- as.data.frame(m1)
df1
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 5 9 8 3 8 7 1 5 5
2 2 1 NA 6 6 NA 3 8 8 2
3 NA 5 7 2 1 10 8 6 5 7
4 8 1 1 6 8 4 5 3 5 2
5 10 4 9 9 1 NA 7 8 6 2
6 1 8 NA 6 5 7 9 9 9 3
7 1 10 2 4 NA 10 6 5 5 4
8 7 3 10 7 5 5 2 1 NA 1
9 NA NA 8 10 6 4 3 10 7 7
10 7 10 2 2 9 4 NA 1 2 10
m2 <- matrix(sample(c(NA, 2:20), 100, replace = TRUE), 10)
df2 <- as.data.frame(m2)
df2
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 5 NA NA 19 20 15 5 11 4 17
2 4 13 20 NA 9 18 7 11 5 12
3 17 3 14 4 6 2 11 16 11 7
4 14 10 9 16 NA 7 20 5 8 6
5 5 14 10 20 19 16 NA 7 NA NA
6 12 14 14 8 3 20 15 7 15 17
7 4 15 18 12 4 2 19 13 9 8
8 14 11 4 20 5 17 NA 13 19 12
9 15 3 14 16 14 19 17 8 5 NA
10 2 2 11 2 16 4 NA 18 20 NA
Now, I do not want to merge both data frames, but only some columns.
How can I move df2$V10 to df1$V4?
The resulting df would consist of 20 rows; rows 11:20 would be filled with the 10 values of df2$V10, and the remaining columns in this interval should be NA.
Extract the 'V10' column from 'df2', create a data.frame, and use bind_rows to bind the two datasets. The other column values will be filled with NAs by default.
library(dplyr)
bind_rows(df1, data.frame(V4 = df2$V10))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 2 10 NA 9 7 NA NA 8 1 5
#2 2 5 10 10 8 8 3 7 NA 2
#3 3 7 NA 5 4 5 2 5 7 2
#4 9 4 6 4 8 6 7 9 8 2
#5 3 6 2 3 3 6 10 5 9 5
#6 1 NA 3 7 5 4 6 3 7 10
#7 6 3 1 3 4 10 2 6 NA 7
#8 9 1 5 4 4 7 4 2 2 1
#9 3 1 6 6 1 7 7 6 6 1
#10 NA 6 10 9 10 10 6 4 3 9
#11 NA NA NA 10 NA NA NA NA NA NA
#12 NA NA NA 3 NA NA NA NA NA NA
#13 NA NA NA 4 NA NA NA NA NA NA
#14 NA NA NA 18 NA NA NA NA NA NA
#15 NA NA NA 20 NA NA NA NA NA NA
#16 NA NA NA 11 NA NA NA NA NA NA
#17 NA NA NA 15 NA NA NA NA NA NA
#18 NA NA NA 2 NA NA NA NA NA NA
#19 NA NA NA 3 NA NA NA NA NA NA
#20 NA NA NA 14 NA NA NA NA NA NA
For multiple columns, subset the dataset and set the column names of interest before doing the bind_rows
bind_rows(df1, setNames(df2[c('V10', 'V8')], c('V4', 'V2')))
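A base R sketch of the same stacking, without dplyr. The data frames are freshly sampled as in the question, so the exact values differ from the output shown above:

```r
set.seed(42)  # for reproducibility; the question's values were random too
df1 <- as.data.frame(matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10))
df2 <- as.data.frame(matrix(sample(c(NA, 2:20), 100, replace = TRUE), 10))

# rbind() requires matching column sets, so pad the one-column piece
# with NA columns before stacking it under df1
extra <- data.frame(V4 = df2$V10)
extra[setdiff(names(df1), names(extra))] <- NA
res <- rbind(df1, extra[names(df1)])
```

As with `bind_rows`, rows 11:20 of `res` carry `df2$V10` in column `V4` and NA everywhere else.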

Count number of NAs between 2 values by row in R

My data looks something like this:
db <- as.data.frame(matrix(ncol=10, nrow=3,
c(3,NA,NA,4,5,NA,7,NA,NA,NA,NA,NA,7,NA,8,9,NA,NA,4,6,NA,NA,7,8,11,5,10,NA,NA,NA), byrow = TRUE))
db
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 NA NA 4 5 NA 7 NA NA NA
2 NA NA 7 NA 8 9 NA NA 4 6
3 NA NA 7 8 11 5 10 NA NA NA
For each row, I'm trying to count the number of NAs that appear between the first and last non-NA element (I have numbers and characters) by row.
The output should be something like this:
db$na.tot <- c(3, 3, 0)
db
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 na.tot
1 3 NA NA 4 5 NA 7 NA NA NA 3
2 NA NA 7 NA 8 9 NA NA 4 6 3
3 NA NA 7 8 11 5 10 NA NA NA 0
Where na.tot represents the number of NAs observed between the first and last non-NA elements by row (between 3 and 7, 7 and 6 and 7 and 10 in rows 1, 2 and 3 respectively).
Does anyone have a simple solution?
Thanks!
Try this:
require(data.table)
z <- as.data.table(which(!is.na(db), arr.ind = TRUE))
setkey(z, row, col)
z[, list(NAs = last(col) - first(col) - .N + 1), by = row]
# row NAs
#1: 1 3
#2: 2 3
#3: 3 0
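A base R alternative (a sketch): locate the first and last non-NA position in each row, then count the NAs that fall between them:

```r
# Example data as in the question
db <- as.data.frame(matrix(
  c(3, NA, NA, 4, 5, NA, 7, NA, NA, NA,
    NA, NA, 7, NA, 8, 9, NA, NA, 4, 6,
    NA, NA, 7, 8, 11, 5, 10, NA, NA, NA),
  nrow = 3, ncol = 10, byrow = TRUE))

# For each row: take the indices of the non-NA cells, then count the NAs
# inside the first..last span (apply's coercion still leaves NA as NA,
# so this also works with mixed numeric/character columns)
db$na.tot <- apply(db, 1, function(x) {
  idx <- which(!is.na(x))
  sum(is.na(x[idx[1]:idx[length(idx)]]))
})
db$na.tot
```

This yields 3, 3, 0, matching the data.table result above.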

Remove rows with missing data conditionally [duplicate]

This question already has answers here:
Remove rows that have more than a threshold of missing values [closed]
(1 answer)
Delete columns/rows with more than x% missing
(5 answers)
Closed 5 years ago.
I have a dataframe with some missing values, displayed as NA.
For example:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 6 7 2 1 2 3 4 1
2 5 5 4 3 2 1 3 7 6 7
3 6 6 NA NA NA NA NA NA NA NA
4 5 2 2 1 7 NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA
I would like to remove rows that contain at least 80% missing data. In this example those are clearly rows 3 and 5. I know how to remove rows manually, but I would like some help with the code: my original dataframe contains 480 variables and more than 1000 rows, so code for automatically identifying and removing rows with >80% NA would be extremely useful.
Thanks in advance!
You could use rowMeans:
df = read.table(text=' V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 6 7 2 1 2 3 4 1
2 5 5 4 3 2 1 3 7 6 7
3 6 6 NA NA NA NA NA NA NA NA
4 5 2 2 1 7 NA NA NA NA NA
5 7 NA NA NA NA NA NA NA NA NA')
df[rowMeans(is.na(df)) < .8, ]
Output:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 6 7 2 1 2 3 4 1
2 5 5 4 3 2 1 3 7 6 7
4 5 2 2 1 7 NA NA NA NA NA
Hope this helps!
We can use rowSums on the logical matrix
df1[rowSums(is.na(df1)) / ncol(df1) < 0.8, ]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 4 3 6 7 2 1 2 3 4 1
#2 5 5 4 3 2 1 3 7 6 7
#4 5 2 2 1 7 NA NA NA NA NA

how to calculate the proportion of certain observations in each variable in r?

I have a dataframe (population1) which consists of 11 million rows (observations) and 11 columns (individuals). The first few rows of my dataframe look like this:
> head(population1)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 7 3 NA NA 10 NA NA NA NA NA NA
2 14 11 7 NA 12 3 4 5 14 3 6
3 13 11 7 NA 11 4 NA 4 13 3 4
4 3 NA 4 5 4 NA NA 6 17 NA 7
5 3 NA 5 5 4 NA NA 7 20 NA 8
6 6 NA 3 6 NA NA NA 5 16 NA 10
For each individual, I want to estimate the proportion of observations with values greater than 5. Is there an easy way to do this in R?
Here is a solution that uses sapply to apply a function to each column. The function counts how many observations are larger than 5 and then divides by the length of x.
sapply(dt, function(x) sum(x > 5, na.rm = TRUE)/length(x))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
0.6666667 0.3333333 0.3333333 0.1666667 0.5000000 0.0000000 0.0000000 0.3333333 0.8333333 0.0000000
V11
0.6666667
DATA
dt <- read.table(text = " V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 7 3 NA NA 10 NA NA NA NA NA NA
2 14 11 7 NA 12 3 4 5 14 3 6
3 13 11 7 NA 11 4 NA 4 13 3 4
4 3 NA 4 5 4 NA NA 6 17 NA 7
5 3 NA 5 5 4 NA NA 7 20 NA 8
6 6 NA 3 6 NA NA NA 5 16 NA 10",
header = TRUE)
Here is an option using tidyverse
library(dplyr)
pop1 %>%
  summarise_all(funs(sum(. > 5, na.rm = TRUE) / n()))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
#1 0.6666667 0.3333333 0.3333333 0.1666667 0.5 0 0 0.3333333 0.8333333 0 0.6666667
If we need as a vector then unlist it
pop1 %>%
  summarise_all(funs(sum(. > 5, na.rm = TRUE) / n())) %>%
  unlist(., use.names = FALSE)
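For completeness, the same proportions are available in one line of base R (a sketch reusing the `dt` from the DATA section above). Note also that `funs()` has since been deprecated in dplyr in favor of `across()`.

```r
# Data as in the sapply answer above
dt <- read.table(text = "V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 7 3 NA NA 10 NA NA NA NA NA NA
2 14 11 7 NA 12 3 4 5 14 3 6
3 13 11 7 NA 11 4 NA 4 13 3 4
4 3 NA 4 5 4 NA NA 6 17 NA 7
5 3 NA 5 5 4 NA NA 7 20 NA 8
6 6 NA 3 6 NA NA NA 5 16 NA 10", header = TRUE)

# dt > 5 is a logical matrix with NAs; na.rm = TRUE drops the NAs from the
# count, while nrow(dt) keeps the denominator at the full row count
res <- colSums(dt > 5, na.rm = TRUE) / nrow(dt)
res
```

This matches the sapply and dplyr results above (e.g. 0.667 for V1, 0 for V6).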
