I am very green to R so please bear with my wording. I have a df from a csv that has 106 obs of 11 variables. I only care about 2 of those variables so I made a new df called "df."
bc=read.csv("---.csv")
df=cbind.data.frame('A'=bc$A,'B'=bc$B)
#Example of the new df:
A B
mass 0.1
mass 0.2
height 0.5
height 0.3
color 0.9
color 0.1
Then I made four subsets, each counting the rows that satisfy two simultaneous conditions: B greater than or equal to the threshold (OET) or less than or equal to it, AND type equal to "mass" or type not equal to "mass."
TP= df[df$B>=i & df$A=="mass",]
TN= df[df$B<=i & df$A!="mass",]
FP= df[df$B>=i & df$A!="mass",]
FN= df[df$B<=i & df$A=="mass",]
I think I want to use a for loop so I can get a set of counts for every B condition, every i. If I set i to a value, the subsets give me all the rows that fit, and then nrow() tells me how many rows that is, but I cannot type all 106 df$B values into i by hand. I used print to check that my loop would visit every value of df$B, and it did. Then I tried half of the TP condition, the df$A part; that worked. Then I tried the df$B part alone, but it gave me all 106 obs, which I know is wrong because the non-looped TP gave me 21 obs. The end goal of the code is a count of TP and TN for every df$B that meets my two conditions, so that I can plug them into another function to ggplot [like Y=TP/TP-TN].
N=c(df$B)
for(i in N){
print(paste(i))
}
# worked
for(i in N){
TPA=df[df$A=="mass",]
TP=nrow(TPA)
}
# worked
for(i in N){
TPB=df[df$B>=i,]
TP=nrow(TPB)
}
#ran but did not do what I wanted
I guess my question is: how do I test every row of df$B against each value of df$B, all 106 of them, and store the results?
When i = df$B[1], how many rows of df$B are >i
When i= df$B[2], how many rows of df$B are >i
From a formula like this, I would like an output like below:
results=data.frame(matrix(nrow=2*106,ncol=4))
colnames(results)=c("A","B","TP","TN")
B=rep(c("mass","not mass"),each=106)
N=c(df$B)
for(i in N){
TPC=df[df$A=='mass' & df$B>=i,]
TP=nrow(TPC)
TNC=df[df$A!='mass' & df$B<=i,]
TN=nrow(TNC)
}
results=cbind.data.frame(B,A,results)
B A TP TN
mass df$B[1] 21 0
mass df$B[2] 18 12
...
notmass df$B[1] 1 11
notmass df$B[2] 3 10
...
If you read this far, thank you! Any direction or answer would be most appreciated!
I'm not sure I'm understanding the terms of your confusion matrix properly, but here's a suggestion for a general approach that seems to me more idiomatic to R, using in this case dplyr and tidyr.
Starting with your data:
df1 <- data.frame(
stringsAsFactors = FALSE,
A = c("mass", "mass", "height", "height", "color", "color"),
B = c(0.1, 0.2, 0.5, 0.3, 0.9, 0.1)
)
We can add a logical mass variable to capture whether A is or isn't equal to "mass". We can also make a vector of the unique values of B to use later.
df1$mass = df1$A == "mass"
B_val = sort(unique(df1$B))
Below, I make a copy of the data for each value of B_val and use dplyr::case_when to define the values of the confusion matrix. (I suspect I don't have these exactly right, but it should be simple to fix.)
Finally, at the bottom I count how many combinations arise, and then reshape the data into wider format with columns named for each conclusion.
library(dplyr); library(tidyr)
df1 %>%
crossing(B_val) %>%
mutate(type = case_when(
B >= B_val & mass ~ "TP",
B <= B_val & !mass ~ "TN",
B <= B_val & mass ~ "FP",
B >= B_val & !mass ~ "FN",
TRUE ~ "undefined"
)) %>%
count(mass, B_val, type) %>%
# group_by(mass, B_val) %>% #un-comment these lines for proportions
# mutate(n = n / sum(n)) %>%
pivot_wider(names_from = type, values_from = n)
This produces the output below:
# A tibble: 10 x 6
mass B_val FN TN TP FP
<lgl> <dbl> <int> <int> <int> <int>
1 FALSE 0.1 3 1 NA NA
2 FALSE 0.2 3 1 NA NA
3 FALSE 0.3 2 2 NA NA
4 FALSE 0.5 1 3 NA NA
5 FALSE 0.9 NA 4 NA NA
6 TRUE 0.1 NA NA 2 NA
7 TRUE 0.2 NA NA 1 1
8 TRUE 0.3 NA NA NA 2
9 TRUE 0.5 NA NA NA 2
10 TRUE 0.9 NA NA NA 2
Or if looking at proportions:
mass B_val FN TN TP FP
<lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 FALSE 0.1 0.75 0.25 NA NA
2 FALSE 0.2 0.75 0.25 NA NA
3 FALSE 0.3 0.5 0.5 NA NA
4 FALSE 0.5 0.25 0.75 NA NA
5 FALSE 0.9 NA 1 NA NA
6 TRUE 0.1 NA NA 1 NA
7 TRUE 0.2 NA NA 0.5 0.5
8 TRUE 0.3 NA NA NA 1
9 TRUE 0.5 NA NA NA 1
10 TRUE 0.9 NA NA NA 1
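On the loop question itself: the original for loops appeared to fail because TP was overwritten on every iteration rather than stored. A minimal base R sketch (the names thresholds and results are mine; it reuses df1 and the OP's TP/TN conditions, with sum() counting rows the same way nrow() on a subset would):
# Store one TP/TN count per threshold instead of overwriting
thresholds <- sort(unique(df1$B))
results <- data.frame(
  i  = thresholds,
  TP = sapply(thresholds, function(i) sum(df1$A == "mass" & df1$B >= i)),
  TN = sapply(thresholds, function(i) sum(df1$A != "mass" & df1$B <= i))
)
results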
I'm trying to drop columns that have more than 90% NA values. I've followed the linked post below, but I only get bare values in return, and I'm not sure what I'm doing wrong; I would expect an actual data frame. I tried putting as.data.frame in front, but that is also erroneous.
Linked Post: Delete columns/rows with more than x% missing
Example DF
gene cell1 cell2 cell3
A 0.4 0.1 NA
B NA NA 0.1
C 0.4 NA 0.5
D NA NA 0.5
E 0.5 NA 0.6
F 0.6 NA NA
Desired DF
gene cell1 cell3
A 0.4 NA
B NA 0.1
C 0.4 0.5
D NA 0.5
E 0.5 0.6
F 0.6 NA
Code
#Select Genes that have NA values for 90% of a given cell line
df_col <- df[,2:ncol(df)]
df_col <-df_col[, which(colMeans(!is.na(df_col)) > 0.9)]
df <- cbind(df[,1], df_col)
I would use dplyr here.
If you want to use select() with logical conditions, you are probably looking for the where() selection helper in dplyr.
It can be used like this: select(where(condition))
I used an 80% threshold because 90% would keep all columns and would therefore not illustrate the solution as well.
library(dplyr)
df %>% select(where(~mean(is.na(.))<0.8))
It can also be done with base R and colMeans:
df[, c(TRUE, colMeans(is.na(df[-1]))<0.8)]
or with purrr:
library(purrr)
df %>% keep(~mean(is.na(.))<0.8)
Output:
gene cell1 cell3
1 a 0.4 NA
2 b NA 0.1
3 c 0.4 0.5
4 d NA 0.5
5 e 0.5 0.6
6 f 0.6 NA
Data
df<-data.frame(gene=letters[1:6],
cell1=c(0.4, NA, 0.4, NA, 0.5, 0.6),
cell2=c(0.1, rep(NA, 5)),
cell3=c(NA, 0.1, 0.5, 0.5, 0.6, NA))
Well, cell2 has 83% NA values (5/6), but anyway you can do -
ignore <- 1
perc <- 0.8 #80 %
df <- cbind(df[ignore], df[-ignore][colMeans(is.na(df[-ignore])) < perc])
df
# gene cell1 cell3
#1 A 0.4 NA
#2 B NA 0.1
#3 C 0.4 0.5
#4 D NA 0.5
#5 E 0.5 0.6
#6 F 0.6 NA
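As an aside on why the original attempt returned bare values rather than a data frame: when exactly one column survives the filter, [ on a data frame simplifies the result to a vector by default. A sketch of the question's own steps with drop = FALSE added:
df_col <- df[, 2:ncol(df)]
# drop = FALSE keeps a single surviving column as a data frame
# instead of letting `[` simplify it to a bare vector
df_col <- df_col[, colMeans(!is.na(df_col)) > 0.9, drop = FALSE]
df <- cbind(df[1], df_col)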
I already asked a similar question, but now I just want to restrict the new values of the NAs.
I have some data like this:
Date 1 Date 2 Date 3 Date 4 Date 5 Date 6
A NA 0.1 0.2 NA 0.3 0.2
B 0.1 NA NA 0.3 0.2 0.1
C NA NA NA NA 0.3 NA
D 0.1 0.2 0.3 NA 0.1 NA
E NA NA 0.1 0.2 0.1 0.3
I would like to change the NA values of my data based on the first date a value is registered. So for example for A, the first registration is Date 2. Then I want that before that registration the values of NA in A are 0, and after the first registration the values of NA become the mean of the nearest values (mean of date 3 and 5).
In case the last value is an NA, transform it into the last registered value (as in C and D). In the case of E all NA values will become 0.
I want to get something like this:
Date 1 Date 2 Date 3 Date 4 Date 5 Date 6
A 0 0.1 0.2 0.25 0.3 0.2
B 0.1 0.2 0.2 0.3 0.2 0.1
C 0 0 0 0 0.3 0.3
D 0.1 0.2 0.3 0.2 0.1 0.1
E 0 0 0.1 0.2 0.1 0.3
Can you help me? I'm not sure how to do it in R.
Here is a way using na.approx from the zoo package and apply with MARGIN = 1 (so this is probably not very efficient, but it gets the job done).
library(zoo)
df1 <- as.data.frame(t(apply(dat, 1, na.approx, method = "constant", f = .5, na.rm = FALSE)))
This results in
df1
# V1 V2 V3 V4 V5
#A NA 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C NA NA NA NA 0.3
#E NA NA 0.1 0.20 0.1
Replace NAs and rename columns.
df1[is.na(df1)] <- 0
names(df1) <- names(dat)
df1
# Date_1 Date_2 Date_3 Date_4 Date_5
#A 0.0 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C 0.0 0.0 0.0 0.00 0.3
#E 0.0 0.0 0.1 0.20 0.1
explanation
Given a vector
x <- c(0.1, NA, NA, 0.3, 0.2)
na.approx(x)
returns x with linear interpolated values
#[1] 0.1000000 0.1666667 0.2333333 0.3000000 0.2000000
But OP asked for constant values so we need the argument method = "constant" from the approx function.
na.approx(x, method = "constant")
# [1] 0.1 0.1 0.1 0.3 0.2
But this is still not what OP asked for because it carries the last observation forward while you want the mean for the closest non-NA values. Therefore we need the argument f (also from approx)
na.approx(x, method = "constant", f = .5)
# [1] 0.1 0.2 0.2 0.3 0.2 # looks good
From ?approx
f : for method = "constant" a number between 0 and 1 inclusive, indicating a compromise between left- and right-continuous step functions. If y0 and y1 are the values to the left and right of the point then the value is y0 if f == 0, y1 if f == 1, and y0*(1-f)+y1*f for intermediate values. In this way the result is right-continuous for f == 0 and left-continuous for f == 1, even for non-finite y values.
Lastly, if we don't want to replace the NAs at the beginning and end of each row we need na.rm = FALSE.
From ?na.approx
na.rm : logical. If the result of the (spline) interpolation still results in NAs, should these be removed?
data
dat <- structure(list(Date_1 = c(NA, 0.1, NA, NA), Date_2 = c(0.1, NA,
NA, NA), Date_3 = c(0.2, NA, NA, 0.1), Date_4 = c(NA, 0.3, NA,
0.2), Date_5 = c(0.3, 0.2, 0.3, 0.1)), .Names = c("Date_1", "Date_2",
"Date_3", "Date_4", "Date_5"), class = "data.frame", row.names = c("A",
"B", "C", "E"))
EDIT
If there are NAs in the last column we can replace these with the last non-NAs before we apply na.approx as shown above.
dat$Date_6[is.na(dat$Date_6)] <- dat[cbind(1:nrow(dat),
max.col(!is.na(dat), ties.method = "last"))][is.na(dat$Date_6)]
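To unpack that one-liner, here is the same logic in steps. (A sketch only: the dat above has just five columns, so the Date_6 values below are hypothetical.)
# Hypothetical sixth column with trailing NAs, for illustration only
dat$Date_6 <- c(0.2, NA, NA, NA)
# Column index of the last non-NA entry in each row
last_obs <- max.col(!is.na(dat), ties.method = "last")
# Matrix indexing extracts that last observed value per row
last_val <- dat[cbind(1:nrow(dat), last_obs)]
# Fill only the rows whose Date_6 is missing
dat$Date_6[is.na(dat$Date_6)] <- last_val[is.na(dat$Date_6)]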
This is another possible answer, using na.locf from the zoo package.
Edit: apply is actually not required; this solution fills in the last observed value if it is missing.
# create the dataframe
Date1 <- c(NA,.1,NA,NA)
Date2 <- c(.1, NA,NA,NA)
Date3 <- c(.2,NA,NA,.1)
Date4 <- c(NA,.3,NA,.2)
Date5 <- c(.3,.2,.3,.1)
Date6 <- c(.1,NA,NA,NA)
df <- as.data.frame(cbind(Date1,Date2,Date3,Date4,Date5,Date6))
rownames(df) <- c('A','B','C','D')
> df
Date1 Date2 Date3 Date4 Date5 Date6
A NA 0.1 0.2 NA 0.3 0.1
B 0.1 NA NA 0.3 0.2 NA
C NA NA NA NA 0.3 NA
D NA NA 0.1 0.2 0.1 NA
# Load library
library(zoo)
df2 <- t(na.locf(t(df),na.rm = F)) # fill last observation carried forward
df3 <- t(na.locf(t(df),na.rm = F, fromLast = T)) # last obs carried backward
df4 <- (df2 + df3)/2 # mean of both dataframes
df4 <- t(na.locf(t(df4),na.rm = F)) # fill last observation carried forward
df4[is.na(df4)] <- 0 # NA values are 0
Date1 Date2 Date3 Date4 Date5 Date6
A 0.0 0.1 0.2 0.25 0.3 0.1
B 0.1 0.2 0.2 0.30 0.2 0.2
C 0.0 0.0 0.0 0.00 0.3 0.3
D 0.0 0.0 0.1 0.20 0.1 0.1
Here's another option with base R + rollmean from zoo (clearly easy to rewrite in base R for this case with window size k = 2).
t(apply(df, 1, function(x) {
means <- c(0, rollmean(na.omit(x), 2), tail(na.omit(x), 1))
replace(x, is.na(x), means[1 + cumsum(!is.na(x))[is.na(x)]])
}))
# Date1 Date2 Date3 Date4 Date5 Date6
# A 0.0 0.1 0.2 0.25 0.3 0.2
# B 0.1 0.2 0.2 0.30 0.2 0.1
# C 0.0 0.0 0.0 0.00 0.3 0.3
# D 0.1 0.2 0.3 0.20 0.1 0.1
# E 0.0 0.0 0.1 0.20 0.1 0.3
Explanation. Suppose that x is the first row of df:
# Date1 Date2 Date3 Date4 Date5 Date6
# A NA 0.1 0.2 NA 0.3 0.2
Then
means
# [1] 0.00 0.15 0.25 0.25 0.20
is a vector made of 0, the rolling means of each pair of consecutive non-NA elements, and the last non-NA element. Then all we need to do is replace those elements of x that are is.na(x). We replace them with the elements of means at indices 1 + cumsum(!is.na(x))[is.na(x)]. That's the trickier part. Here
cumsum(!is.na(x))
# [1] 0 1 2 2 3 4
Meaning that the first element of x has seen 0 non-NA elements, while, say, the last one has seen 4 non-NA elements so far. Then
cumsum(!is.na(x))[is.na(x)]
# [1] 0 2
gives, for each NA element of x that we want to replace, how many non-NA elements precede it. Notice that then
1 + cumsum(!is.na(x))[is.na(x)]
# [1] 1 3
corresponds to the elements of means that we want to use for replacement.
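As noted at the top, the zoo dependency is easy to drop when the window size is 2, since rollmean(v, 2) is just the mean of consecutive pairs. A base R sketch of the same row-wise function (the name fill_row is mine):
# Same approach without zoo: pairwise means of consecutive non-NA values
fill_row <- function(x) {
  v <- na.omit(x)
  means <- c(0, (head(v, -1) + tail(v, -1)) / 2, tail(v, 1))
  replace(x, is.na(x), means[1 + cumsum(!is.na(x))[is.na(x)]])
}
t(apply(df, 1, fill_row))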
I find the function below too complicated, but it works, so here goes.
fun <- function(x){
  if(anyNA(x)){
    inx <- which(!is.na(x))
    # zeros before the first registered value
    if(inx[1] > 1) x[seq_len(inx[1] - 1)] <- 0
    prev <- inx[1]
    for(i in inx[-1]){
      if(i - prev > 1){
        # fill the interior gap with the mean of its two endpoints
        m <- mean(c(x[i], x[prev]))
        x[(prev + 1):(i - 1)] <- m
      }
      prev <- i
    }
  }
  x
}
res <- t(apply(df1, 1, fun))
res <- as.data.frame(res)
res
# Date.1 Date.2 Date.3 Date.4 Date.5
#A 0.0 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C 0.0 0.0 0.0 0.00 0.3
#E 0.0 0.0 0.1 0.20 0.1
Data.
df1 <- read.table(text = "
Date.1 Date.2 Date.3 Date.4 Date.5
A NA 0.1 0.2 NA 0.3
B 0.1 NA NA 0.3 0.2
C NA NA NA NA 0.3
E NA NA 0.1 0.2 0.1
", header = TRUE)
I need help with column-wise subtraction and division across N columns.
Input data frame:
> df
A B C D
1 1 3 6 2
2 3 3 3 4
3 1 2 2 2
4 4 4 4 4
5 5 2 3 2
Formula: the first column a stays as-is; each subsequent column b becomes (b - a) / (1 - a), where a is the column directly to its left.
MY CODE
ABC <- cbind.data.frame(DF[1], (DF[-1] - DF[-ncol(DF)])/(1 - DF[-ncol(DF)]))
Expected output:
A B C D
1 Inf -1.5 0.8
3 0.00 0.0 -0.5
1 Inf 0.0 0.0
4 0.00 0.0 0.0
5 0.75 -1.0 0.5
But I don't want to use ncol here, because there is another column after column D in the actual data frame.
I want to apply this formula only to the first 4 columns; if I use ncol, it will traverse to the last column in the data frame.
Please help, thanks.
What about trying:
df <- matrix(c(1,3,6,2,3,3,3,4,1,2,2,2,4,4,4,4,5,2,3,2), nrow = 5, byrow = TRUE)
df_2 <- matrix((df[,2]-df[,1])/(1-df[,1]),5,1)
df_3 <- matrix((df[,3]-df[,2])/(1-df[,2]),5,1)
df_4 <- matrix((df[,4]-df[,3])/(1-df[,3]),5,1)
cbind(df[,1],df_2,df_3,df_4)
Edit: a loop version
df <- matrix(c(1,3,6,2,3,3,3,4,1,2,2,2,4,4,4,4,5,2,3,2), nrow = 5, byrow = TRUE)
test_bind <- c()
test_bind <- cbind(test_bind, df[,1])
for (i in 1:3) {
  df_1 <- matrix((df[, i + 1] - df[, i]) / (1 - df[, i]), 5, 1)
  test_bind <- cbind(test_bind, df_1)
}
test_bind
Here is one option with the tidyverse:
library(dplyr)
library(purrr)
map2_df(DF[2:4], DF[1:3], ~ (.x - .y)/(1- .y)) %>%
bind_cols(DF[1], .)
# A B C D
#1 1 Inf -1.5 0.8
#2 3 0.00 0.0 -0.5
#3 1 Inf 0.0 0.0
#4 4 0.00 0.0 0.0
#5 5 0.75 -1.0 0.5
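For what it's worth, the OP's original one-liner also works once the data is restricted to the first four columns, so ncol can no longer reach the trailing columns (a small sketch; the name DF4 is mine, and the column positions are taken from the example):
# Subset to A:D first, then apply the formula exactly as before
DF4 <- DF[1:4]
cbind.data.frame(DF4[1], (DF4[-1] - DF4[-ncol(DF4)]) / (1 - DF4[-ncol(DF4)]))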
I have a dataframe df with columns ID, X and Y
ID = c(1,1,2,2)
X = c(1,0.4,0.8,0.1)
Y = c(0.5,0.5,0.7,0.7)
df <- data.frame(ID,X,Y)
ID X Y
1 1.0 0.5
1 0.4 0.5
2 0.8 0.7
2 0.1 0.7
I would like to obtain two new columns:
Xg equal to X when X is greater than Y and NA otherwise
Xl equal to X when X is less than Y and NA otherwise. That is,
ID X Y Xg Xl
1 1.0 0.5 1.0 NA
1 0.4 0.5 NA 0.4
2 0.8 0.7 0.8 NA
2 0.1 0.7 NA 0.1
Below should work, even if there are NA's in X or Y:
library(dplyr)
df %>%
mutate(Xg = ifelse(X > Y, X, NA),
Xl = ifelse(X < Y, X, NA))
If you want to use if_else from dplyr, you have to convert NA to numeric. if_else is stricter than ifelse in that it checks whether the TRUE and FALSE values are the same type:
df %>%
mutate(Xg = if_else(X > Y, X, as.numeric(NA)),
Xl = if_else(X < Y, X, as.numeric(NA)))
Result:
ID X Y Xg Xl
1 1 1.0 0.5 1.0 NA
2 1 0.4 0.5 NA 0.4
3 2 0.8 0.7 0.8 NA
4 2 0.1 0.7 NA 0.1
5 3 NA 1.0 NA NA
6 3 3.0 NA NA NA
Data:
ID = c(1,1,2,2,3,3)
X = c(1,0.4,0.8,0.1,NA,3)
Y = c(0.5,0.5,0.7,0.7,1,NA)
df <- data.frame(ID,X,Y)
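As an aside, the typed missing value NA_real_ is a common shorthand that satisfies if_else's type check without the as.numeric(NA) conversion:
# NA_real_ is already a double NA, so no coercion call is needed
df %>%
  mutate(Xg = if_else(X > Y, X, NA_real_),
         Xl = if_else(X < Y, X, NA_real_))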
What about some plain old R indexing and subsetting?
ID <- c(1,1,2,2, 3, 3)
X <- c(1,0.4,0.8,0.1, NA, 2)
Y <- c(0.5,0.5,0.7,0.7, 2, NA)
Xg <- Xl <- rep(NA_real_, length(ID))
# which() drops NAs from the comparison, so rows where X or Y is
# missing simply stay NA in Xg and Xl
Xg[which(X > Y)] <- X[which(X > Y)]
Xl[which(X < Y)] <- X[which(X < Y)]
data.frame(ID, X, Y, Xg, Xl)
Note: I assume that if X or Y is missing, Xg and Xl should be NA.
For the sake of completeness, and since the question originally used data.table() before it was edited (and because I like the concise code), here is a "one-liner" using data.table's update by reference:
library(data.table)
setDT(df)[X > Y, Xg := X][X < Y, Xl := X][]
ID X Y Xg Xl
1: 1 1.0 0.5 1.0 NA
2: 1 0.4 0.5 NA 0.4
3: 2 0.8 0.7 0.8 NA
4: 2 0.1 0.7 NA 0.1
5: 3 NA 1.0 NA NA
6: 3 3.0 NA NA NA
(Using the data from useR's answer.)
NA's are handled automatically as only matching rows are updated.