Automated fill in columns in r - r

I have a dataframe (shown below) where there are some asterisks in the "sig" column.
I want to fill in asterisks in the empty cells in the sig column everywhere above the furthest down row where there is an asterisk, which in this case would be everywhere from row "H" up to get something like this:
I'm thinking some sort of a for loop where it identifies the furthest down row where there is an asterisk and then fills in asterisks in empty cells above might be the way to go, but I'm not sure how to code this.
For debugging purposes, I make the data frame in R with
df<- data.frame("variable"= c("a","b","c","d","e","f","g","h","i","j","k"),
"value" = c(0.04,0.03,0.04,0.02,0.03,0.02,0.02,0.01,0.04,0.1,0.02),
"sig" = c("*","*","*","","*","*","","*","","",""))
Any help would be greatly appreciated - thanks!

Another way:
df[1:max(which(df$sig == "*")), "sig"] = "*"
Gives:
variable value sig
1 a 0.04 *
2 b 0.03 *
3 c 0.04 *
4 d 0.02 *
5 e 0.03 *
6 f 0.02 *
7 g 0.02 *
8 h 0.01 *
9 i 0.04
10 j 0.10
11 k 0.02

We could use replace based on finding the index of the last element having *
library(dplyr)
df <- df %>%
mutate(sig = replace(sig, seq(tail(which(sig == "*"), 1)), "*"))
-output
df
variable value sig
1 a 0.04 *
2 b 0.03 *
3 c 0.04 *
4 d 0.02 *
5 e 0.03 *
6 f 0.02 *
7 g 0.02 *
8 h 0.01 *
9 i 0.04
10 j 0.10
11 k 0.02

Another solution would be using fill, but you need to change "" to NA
Libraries
library(tidyverse)
Data
df <-
data.frame("variable"= c("a","b","c","d","e","f","g","h","i","j","k"),
"value" = c(0.04,0.03,0.04,0.02,0.03,0.02,0.02,0.01,0.04,0.1,0.02),
"sig" = c("*","*","*","","*","*","","*","","",""))
Code
df %>%
mutate(sig = if_else(sig == "",NA_character_,sig)) %>%
fill(sig,.direction = "up")
Output
variable value sig
1 a 0.04 *
2 b 0.03 *
3 c 0.04 *
4 d 0.02 *
5 e 0.03 *
6 f 0.02 *
7 g 0.02 *
8 h 0.01 *
9 i 0.04 <NA>
10 j 0.10 <NA>
11 k 0.02 <NA>

Related

How to use mutate result as input to calc another column in R dplyr

I'd like to calculate data for two new columns in a data.frame where the results are based on the value of the previous row. However, the previous row also needs to be calculated, which means that there is a dependency between the two columns (the input for one calculation is based on the output of another calculation). I could do it through a for, but maybe it's not the right way.
This is a sample for this case:
df <- data.frame(A=c(0.91,0.98,1,1.1), B=c(0.81, 1.11, 0.83, 0.92), C=c(0.09,0.06,0.09,0.08))
df$D <- NA
df$E <- NA
df[1,]$D <- 0.0
I've been trying it through dplyr::mutate.
df %>%
mutate(D = ifelse( lag(A) < 1, lag(E), lag(E) - lag(E) * lag(A)),
E = B - (B - D) * exp(-C)
)
This is how the output should be:
> df
A B C D E
1 0.91 0.81 0.09 0.00000000 0.06971574
2 0.98 1.11 0.06 0.06971574 0.13029718
3 1.00 0.83 0.09 0.13029718 0.19051977
4 1.10 0.92 0.08 0.00000000 0.07073296

Find the nth largest values in the top row and omit the rest of the columns in R

I am trying to change a data frame such that I only include those columns where the first value of the row is the nth largest.
For example, here let's assume I want to only include the columns where the top value in row 1 is the 2nd largest (top 2 largest).
dat1 = data.frame(a = c(0.1,0.2,0.3,0.4,0.5), b = c(0.6,0.7,0.8,0.9,0.10), c = c(0.12,0.13,0.14,0.15,0.16), d = c(NA, NA, NA, NA, 0.5))
a b c d
1 0.1 0.6 0.12 NA
2 0.2 0.7 0.13 NA
3 0.3 0.8 0.14 NA
4 0.4 0.9 0.15 NA
5 0.5 0.1 0.16 0.5
such that a and d are removed, because 0.1 and NA are not the 2nd largest values in
row 1. Here 0.6 and 0.12 are larger than 0.1 and NA in column a and d respectively.
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
Is there a simple way to subset this? I do not want to order it, because that will create problems with other data frames I have that are related.
Complementing pieca's answer, you can encapsulate that into a function.
Also, this way, the returning data.frame won't be sorted.
get_nth <- function(df, n) {
df[] <- lapply(df, as.numeric) # edit
cols <- names(sort(df[1, ], na.last = NA, decreasing = TRUE))
cols <- cols[seq(n)]
df <- df[names(df) %in% cols]
return(df)
}
Hope this works for you.
Sort the first row of your data.frame, and then subset by names:
cols <- names(sort(dat1[1,], na.last = NA, decreasing = TRUE))
> dat1[,cols[1:2]]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
You can get an inverted rank of the first row and take the top nth columns:
> r <- rank(-dat1[1,], na.last=T)
> r <- r <= 2
> dat1[,r]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16

replacing values in dataframe with another dataframe r

I have a dataframe of values that represent fold changes as such:
> df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
A B C
1 1.74 1.50 1.10
2 -1.30 0.90 3.01
3 3.10 0.71 1.40
And a dataframe of pvalues as such that matches rows and columns identically:
> df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
A B C
1 0.02 NA 0.01
2 0.01 0.01 0.01
3 0.80 0.06 0.03
What I want is to modify the values in df1 so that only retain the values that had a correponding pvalue in df2 < .05, and replace with NA otherwise. Note there are also NA in df2.
> desired <- data.frame(A=c(1.74,-1.3,NA), B=c(NA,.9,NA), C=c(1.1,3.01,1.4))
> desired
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
I first tried to use vector syntax on these dataframes and that didn't work. Then I tried a for loop by columns and that also failed.
I don't think i understand how to index each i,j position and then replace df1 values with df2 values based on a logical.
Or if there is a better way in R.
You can try this:
df1[!df2 < 0.05 | is.na(df2)] <- NA
Out:
> df1
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
ifelse and as.matrix seem to do the trick.
df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
x1 <- as.matrix(df1)
x2 <- as.matrix(df2)
as.data.frame( ifelse( x2 >= 0.05 | is.na(x2), NA, x1) )
Result
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40

Create factor labels in a DF using a sequence of numbers

I have a data.frame containing numerics. I want to create a new column within that data.frame that will house factor labels using (letters[]). I want these factor labels to be built from a sequence of numbers that I have, and can change every time.
For example, my original DF has 1 column x containing numerics, I then have a sequence of numbers (3,7,9). So I need the new FLABEL column to populate according to the number sequence, i.e. first 3 lines are a, next 4 lines b and so on.
x FLABEL
0.23 a
0.21 a
0.19 a
0.27 b
0.25 b
0.22 b
0.15 b
0.09 c
0.32 c
0.19 d
0.17 d
I'm struggling with how to do this, I'm assuming some form of for-loop given that my number sequence can vary in length every time I run it So I could be populating letters a & b...or many more.
Based on the comment by #scoa, I suggest the following modified approach:
series <- c(3, 7, 9)
series <- c(series, nrow(DF)) # This ensures that the sequence extends to the last row of DF
series2 <- c(series[1] ,diff(series))
DF$FLABEL <- rep(letters[1:length(series2)], series2)
#> DF
# x FLABEL
#1 0.23 a
#2 0.21 a
#3 0.19 a
#4 0.27 b
#5 0.25 b
#6 0.22 b
#7 0.15 b
#8 0.09 c
#9 0.32 c
#10 0.19 d
#11 0.17 d
By using diff() the length of each sequence is calculated based on the index numbers in the input vector series. In this case, the index values 3, 7, 9 are converted into the number of repetitions of subsequent letters up to the last row of the data frame and stored in series2: 3, 4, 2, 2.
data
text <- "x FLABEL
0.23 x
0.21 x
0.19 x
0.27 x
0.25 x
0.22 x
0.15 x
0.09 x
0.32 x
0.19 x
0.17 x"
DF <- read.table(text = text, header=T)

Subsetting a dataframe based on another dataframe in R

df:
y x
F T
F F
T T
T F
df1:
y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100
df and df1 are the two data frames. And for each row of df, I want to select the matching rows in df1, add the values in column “new”, and store output in a new data frame like this.
df_res:
y x new
F T .104
F F .312
T T .4
T F .2
Kindly help me out! I have been toiling over this for a long time now. The table headers will change according to the variables, so please do do not hard code the table headers.
Thanks.
I don't know how long is your data but this can be one approach.
df<- read.table(text="y x
F T
F F
T T
T F",header=T,sep="")
df1 <- read.table(text="y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100", header=T, sep="")
df$yx <- paste0(df$y,df$x)
df1$yx <- paste0(df1$y, df1$x)
# Update automatically using the for loop
for (i in 1:4){
new[i] <- sum(df1[which(df1[,7]==df[i,3]),6])
}
df$new <- new
df
y x yx new
1 FALSE TRUE FALSETRUE 0.104
2 FALSE FALSE FALSEFALSE 0.312
3 TRUE TRUE TRUETRUE 0.400
4 TRUE FALSE TRUEFALSE 0.200
Using sapply
new <- sapply(1:4, function(x) sum(df1[which(df1[,7]==df[x,3]),6]))
it seems like if all you want is F,T combination. this works. otherwise you have to write more clearly.
text=" y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100"
df<-read.table(text=text, header=T)
df_res<-aggregate(data=df, new~interaction(y,x),sum)
interaction(y, x) new
1 FALSE.FALSE 0.312
2 TRUE.FALSE 0.200
3 FALSE.TRUE 0.104
4 TRUE.TRUE 0.400
Here's an answer using merge and plyr.
Read in your example data.frame:
df1 <- read.table(text="y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100", header=T, sep="")
If I understand, there are 2 steps to what your asking. First is to select rows in df1 that match patterns in df. That can be done with merge. The df you gave has all combinations of True and False for x and y. Let's leave one out so we can see the effect:
df <- read.table(text="y x
F T
T T
T F",header=T,sep="")
df_merged <- merge(df, df1, all.y=F)
The results are a new data.frame the omits the rows where both x and y are F. This is equivalent to a left join in a SQL database.
y x z probs.x probs.y new
1 FALSE TRUE FALSE 0.08 0.4 0.032
2 FALSE TRUE TRUE 0.12 0.6 0.072
3 TRUE FALSE FALSE 0.20 0.5 0.100
4 TRUE FALSE TRUE 0.20 0.5 0.100
5 TRUE TRUE FALSE 0.40 0.5 0.200
6 TRUE TRUE TRUE 0.40 0.5 0.200
The second part of the question is to group the data and apply a sum to the groups. Plyr is a great tool for this kind of data manipulation:
library(plyr)
ddply(df_merged, .(y,x), function(df) c(new=sum(df$new)))
The dd means we are giving a data.frame and want a data.frame as a result. The next argument .(y,x) is a quoted expression and names the variables we're grouping by. The result is this:
y x new
1 FALSE TRUE 0.104
2 TRUE FALSE 0.200
3 TRUE TRUE 0.400

Resources