Select last value in a row & place it in another column - r

I have a data table like this:
Col1 | Col2 | Colx
12 | 13 | 19
34 | NA | NA
13 | 33 | NA
To determine the last value in each row, I used Andrie's suggestion from a previous question on the same subject.
However, I'd like the output to be in a separate column. The expected output for the above example is:
Column
19
34
33
The original question linked above didn't solve my problem, because its output does not go into a new column.

We can do
apply(df, 1, function(x) tail(x[!is.na(x)], 1))
If you want the result in a new column, you can do:
df$newColumn <- apply(df, 1, function(x) tail(x[!is.na(x)], 1))

Another option is
i1 <- which(!is.na(df1), arr.ind=TRUE)
unname(tapply(df1[i1], i1[,1], FUN=tail,1))
#[1] 19 34 33
data
df1 <- structure(list(Col1 = c(12, 34, 13), Col2 = c(13,
NA, 33), Colx = c(19,
NA, NA)), .Names = c("Col1", "Col2", "Colx"),
row.names = c(NA, -3L), class = "data.frame")
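A vectorised alternative, offered only as a sketch: max.col() with ties.method = "last" gives, for each row, the position of the last non-NA entry, and matrix indexing then extracts those values. It assumes the same df1 as above; the column name Last is illustrative. A row consisting entirely of NA would yield NA.

```r
# Same example data as in the answer above
df1 <- data.frame(Col1 = c(12, 34, 13),
                  Col2 = c(13, NA, 33),
                  Colx = c(19, NA, NA))

# TRUE where a value is present
not_na <- !is.na(df1)

# For each row, the column index of the last TRUE
last_col <- max.col(not_na, ties.method = "last")

# Pick one (row, column) cell per row and store it in a new column
df1$Last <- as.matrix(df1)[cbind(seq_len(nrow(df1)), last_col)]
df1$Last
```

This avoids calling an R function once per row, which may matter on large tables.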

Related

set a threshold for complete cases to remove NA from multiple columns in R

There might be an easy answer to this, but I am not able to make it work. I have a data table that looks like this:
df <- data.table(t = c(1, 2, 3), a = c(NA, NA, 4), b = c(NA, 4, NA), c = c(NA, 4, NA))
How can I remove only the rows where all columns except "t" are NA? It should be fast because my data files are large, so I would especially like to do it with complete.cases. I couldn't find a solution to this problem yet.
The result should look like this
dfRes <- data.table(t = c(2, 3), a = c(NA, 4), b = c(4, NA), c = c(4, NA))
We can use complete.cases with Reduce
library(data.table)
df[df[, Reduce(`|`, lapply(.SD, complete.cases)), .SDcols = a:c]]
# t a b c
#1: 2 NA 4 4
#2: 3 4 NA NA
We can use rowSums on columns other than "t".
library(data.table)
cols <- which(names(df) != 't')
df[rowSums(!is.na(df[, ..cols])) > 0, ]
# t a b c
#1: 2 NA 4 4
#2: 3 4 NA NA
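Since the title mentions a threshold, the rowSums approach can be extended to require a minimum number of non-NA values per row. This is a sketch under that reading of the question; min_non_na, non_na_count and kept are hypothetical names.

```r
library(data.table)

df <- data.table(t = c(1, 2, 3), a = c(NA, NA, 4),
                 b = c(NA, 4, NA), c = c(NA, 4, NA))

# Columns to inspect: everything except "t"
cols <- setdiff(names(df), "t")

# Keep rows with at least this many non-NA values in `cols` (assumed threshold)
min_non_na <- 1

non_na_count <- rowSums(!is.na(df[, ..cols]))
kept <- df[non_na_count >= min_non_na]
kept
```

With min_non_na set to 1 this reproduces the result above; raising it to 2 would keep only the row with t equal to 2.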

Remove columns from a dataframe based on number of rows with valid values

I have a dataframe:
df = data.frame(gene = c("a", "b", "c", "d", "e"),
value1 = c(NA, NA, NA, 2, 1),
value2 = c(NA, 1, 2, 3, 4),
value3 = c(NA, NA, NA, NA, 1))
I would like to keep all columns (plus the first, gene) that have at least 2 valid (i.e., non-NA) values. How do I do this?
I am thinking something like this ...
df1 = df %>% select_if(function(.) ...)
Thanks
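We can count the non-NA elements in each column and use that as a logical condition to select the columns of interest. Note that the condition > 2 used below keeps columns with more than two non-NA values; to keep columns with at least two, use >= 2 instead, which would also retain value1.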
library(dplyr)
df1 <- df %>%
select_if(~ sum(!is.na(.)) > 2)
df1
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or another option is keep
library(purrr)
keep(df, ~ sum(!is.na(.x)) > 2)
Or create the condition based on the proportion of non-NA rows
df %>%
select_if(~ mean(!is.na(.)) > 0.5)
Or use Filter from base R
Filter(function(x) sum(!is.na(x)) > 2, df)
We can use colSums in base R to count the non-NA value per column
df[colSums(!is.na(df)) > 2]
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or using apply
df[apply(!is.na(df), 2, sum) > 2]
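To make the effect of the cutoff concrete, the following sketch counts the non-NA values per column and compares a strict and a non-strict threshold. The variable names are illustrative only.

```r
df <- data.frame(gene = c("a", "b", "c", "d", "e"),
                 value1 = c(NA, NA, NA, 2, 1),
                 value2 = c(NA, 1, 2, 3, 4),
                 value3 = c(NA, NA, NA, NA, 1))

# Number of non-NA values in each column
non_na <- colSums(!is.na(df))

# "At least 2" keeps value1, which has exactly two non-NA values
at_least_two <- df[non_na >= 2]

# "More than 2" drops value1
more_than_two <- df[non_na > 2]

names(at_least_two)
names(more_than_two)
```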

Subtracting list inside data.frame with list from another data.frame

I have two data frames that display a month and a list of ids in each row. They look like this:
dataframe A:
Month ID
2016-03 1,2,3
2016-04 4,5,6
2016-05 7,8,9
dataframe B:
Month ID
2016-03 2,3,4
2016-04 5,6,7
2016-05 8,9,10
Seems simple, and perhaps I'm overthinking it, but I'm having trouble subtracting the corresponding rows from dataframe B from dataframe A.
Ultimate goal is to get the count of ids per row from dataframe A after dataframe B is removed.
So the resulting dataframe would look like:
Month ID
2016-03 1
2016-04 4
2016-05 7
and my count would be 1, 1, 1.
Thanks in advance for the help!
Update:
The values in the "ID" column are list objects like:
c("1", "2", "3")
Use setdiff once you have appropriate vectors for each Month:
result <- Map(setdiff, A$ID, B$ID[match(A$Month, B$Month)])
result
#[[1]]
#[1] 1
#
#[[2]]
#[1] 4
#
#[[3]]
#[1] 7
If you need the lengths you can easily do:
lengths(result)
#[1] 1 1 1
Where, the data used was:
A <- structure(list(Month = c("2016-03", "2016-04", "2016-05"), ID = list(
c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))), .Names = c("Month",
"ID"), row.names = c(NA, -3L), class = "data.frame")
B <- structure(list(Month = c("2016-03", "2016-04", "2016-05"), ID = list(
c(2, 3, 4), c(5, 6, 7), c(8, 9, 10))), .Names = c("Month",
"ID"), row.names = c(NA, -3L), class = "data.frame")

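The match() call matters when the two data frames list the months in a different order. A small sketch, assuming B's rows are shuffled relative to A; the column names Remaining and Count are hypothetical.

```r
A <- data.frame(Month = c("2016-03", "2016-04", "2016-05"))
A$ID <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))

# Same content as B in the question, but with rows in a different order
B <- data.frame(Month = c("2016-05", "2016-03", "2016-04"))
B$ID <- list(c(8, 9, 10), c(2, 3, 4), c(5, 6, 7))

# Row of B that corresponds to each row of A
idx <- match(A$Month, B$Month)

# IDs in A that do not appear in B for the same month, stored as a list column
A$Remaining <- Map(setdiff, A$ID, B$ID[idx])

# Number of remaining IDs per month
A$Count <- lengths(A$Remaining)
A
```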
Calculate column medians with NA's

I am trying to calculate the median of individual columns in R and then subtract that median from every value in the column. The problem is that my columns contain NAs that I don't want to remove; they should simply be returned as NA, without the median subtracted. For example:
ID <- c("A","B","C","D","E")
Point_A <- c(1, NA, 3, NA, 5)
Point_B <- c(NA, NA, 1, 3, 2)
df <- data.frame(ID,Point_A ,Point_B)
Is it possible to calculate the median of a column containing NAs? My resulting output would be
+----+---------+---------+
| ID | Point_A | Point_B |
+----+---------+---------+
| A | -2 | NA |
| B | NA | NA |
| C | 0 | -1 |
| D | NA | 1 |
| E | 2 | 0 |
+----+---------+---------+
If we are talking about real NA values (as per the OP's comment), one could do
df[-1] <- lapply(df[-1], function(x) x - median(x, na.rm = TRUE))
df
# ID Point_A Point_B
# 1 A -2 NA
# 2 B NA NA
# 3 C 0 -1
# 4 D NA 1
# 5 E 2 0
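Or using the matrixStats package. Subtracting the vector of medians directly from the data frame would recycle it down the rows rather than across the columns, so sweep() is used to align each median with its column.
library(matrixStats)
m <- as.matrix(df[-1])
# sweep() subtracts the j-th median from the j-th column
df[-1] <- sweep(m, 2, colMedians(m, na.rm = TRUE))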
When original df is
df <- structure(list(ID = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), Point_A = c(1, NA, 3, NA, 5), Point_B = c(NA,
NA, 1, 3, 2)), .Names = c("ID", "Point_A", "Point_B"), row.names = c(NA,
-5L), class = "data.frame")
Another option is
library(dplyr)
# mutate_each() and funs() are deprecated in current dplyr; across() replaces them
df %>%
mutate(across(-ID, ~ .x - median(.x, na.rm = TRUE)))
Of course it is possible.
median(df$Point_A, na.rm = TRUE)
Here df is the data frame, $Point_A selects the column, and na.rm = TRUE drops the NA values before the median is computed. The same can be written in bracket notation:
median(df[,"Point_A"], na.rm = TRUE)
where df[,"Point_A"] means all rows of the column Point_A.
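Putting the two steps together in base R, as a sketch only: compute one median per numeric column with na.rm = TRUE, then subtract each median from its own column. NA values stay NA because arithmetic on NA returns NA. The names num_cols, meds and centered are illustrative.

```r
df <- data.frame(ID = c("A", "B", "C", "D", "E"),
                 Point_A = c(1, NA, 3, NA, 5),
                 Point_B = c(NA, NA, 1, 3, 2))

# Names of the numeric columns (everything except ID here)
num_cols <- names(df)[sapply(df, is.numeric)]

# One median per column, ignoring NA
meds <- sapply(df[num_cols], median, na.rm = TRUE)

# Pair each column with its own median and subtract
centered <- df
centered[num_cols] <- Map(function(col, m) col - m, df[num_cols], meds)
centered
```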

Replace NA-s with mean values

I have a data frame with missing values. How can I replace each NA with the mean of the values in the previous and next rows? In the example below, the NA would become (30+10)/2 = 20.
id value
1 30
2 NA
3 10
4 20
Try
library(zoo)
na.approx(df$value)
#[1] 30 20 10 20
If the data has NA values in the first or last rows, or consecutive NAs (this is not clear from the post), na.rm = FALSE keeps the leading and trailing NAs in place, and the function returns
na.approx(df1$value, na.rm=FALSE)
#[1] NA 20 25 24 23 22 27 28 29 NA
data
df <- structure(list(id = 1:4, value = c(30L, NA, 10L, 20L)), .Names = c("id",
"value"), class = "data.frame", row.names = c(NA, -4L))
df1 <- data.frame(id=1:10, value=c(NA, 20,25, NA, NA, 22, 27, 28, 29, NA))
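For the literal rule in the question (mean of the previous and next row), a base R sketch without zoo is shown below. This is a different technique from na.approx: it fills only isolated NAs whose two neighbours are both present, and leaves leading, trailing, and consecutive NAs untouched. The function name fill_neighbour_mean is hypothetical.

```r
# Replace an NA by the mean of its immediate neighbours, when both exist
fill_neighbour_mean <- function(value) {
  n <- length(value)
  prev_val <- c(NA, value[-n])   # value from the previous row
  next_val <- c(value[-1], NA)   # value from the next row
  fill <- is.na(value) & !is.na(prev_val) & !is.na(next_val)
  value[fill] <- (prev_val[fill] + next_val[fill]) / 2
  value
}

filled <- fill_neighbour_mean(c(30, NA, 10, 20))
filled

# Runs of NA and NAs at the ends are left as they are
unchanged <- fill_neighbour_mean(c(NA, 20, 25, NA, NA, 22, 27, 28, 29, NA))
unchanged
```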
