attempting to combine (mutate) two rows into a column - r

I'm trying to mutate a column by dividing the value in the row above by the value in the current row. For example, let's say I have this data frame:
V1
A 4
B 2
C 8
Using something like:
df <- mutate(df, V2 = V1[row+1] / V1[row])
I want to get:
V1 V2
A 4 NA
B 2 2
C 8 0.25
I can't find any way to do this...does anyone have any info?

Try with:
library(dplyr)
df <- mutate(df, v2 = lag(V1) / V1)
Output:
V1 v2
A 4 NA
B 2 2.00
C 8 0.25

In base R, we can drop the last element for the numerator and the first element for the denominator, divide, and prepend NA:
df$V2 <- with(df, c(NA, V1[-length(V1)]/V1[-1]))
data
df <- structure(list(V1 = c(4, 2, 8)), class = "data.frame",
row.names = c("A",
"B", "C"))
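As a cross-check, the same lagged division can be sketched in base R by building the shifted vector explicitly. This is only an illustration on the example data from the question; prev is a name introduced here for the example.

```r
# Rebuild the example data so the snippet runs on its own.
df <- data.frame(V1 = c(4, 2, 8), row.names = c("A", "B", "C"))

# prev (hypothetical name) holds the value from the row above;
# the first row has no predecessor, so it gets NA.
prev <- c(NA, head(df$V1, -1))

# Divide the previous value by the current one.
df$V2 <- prev / df$V1
print(df)
```

Writing the shift out this way makes it easy to see which value ends up in the numerator and which in the denominator.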

Related

R - best way to apply an if statement on multiple arguments

Example: Let's say I have the two dataframes
DF1 = data.frame(V1 = c("","A", "B"), V2 = c("x",0,1), V3 = c("y",2,3), V4 = c("z",4,5))
DF2 = data.frame(V1 = c("","A", "B"), V2 = c("x",6,7), V3 = c("y",8,9), V4 = c("z",0,0))
so
> DF1
V1 V2 V3 V4
1 x y z
2 A 0 2 4
3 B 1 3 5
> DF2
V1 V2 V3 V4
1 x y z
2 A 6 8 0
3 B 7 9 0
and I want to have the first row as column names here, so
> DF1
x y z
1 A 0 2 4
2 B 1 3 5
> DF2
x y z
1 A 6 8 0
2 B 7 9 0
What I do to achieve this is
if("V2" %in% names(DF1)){
names(DF1) = as.character(unlist(DF1[1,]))
DF1 = DF1[-1, ]
}
if("V2" %in% names(DF2)){
names(DF2) = as.character(unlist(DF2[1,]))
DF2 = DF2[-1, ]
}
which does what we want in this example.
QUESTION: What's the best way to avoid having two if statements here? The first thing that came to my mind was iterating over the two DFs in a loop, but that doesn't work because you have to reassign the DFs by name (at least it didn't work for me).
Or, more generally: how do you avoid doing the same thing for multiple arguments when a loop doesn't work?
We could use row_to_names from janitor
library(janitor)
DF1 <- row_to_names(DF1, 1)
DF2 <- row_to_names(DF2, 1)
There must be a better way but until someone gives that, I think this chunk of code works:
DF1 = data.frame(V1 = c("","A", "B"), V2 = c("x",0,1), V3 = c("y",2,3), V4 = c("z",4,5))
DF2 = data.frame(V1 = c("","A", "B"), V2 = c("x",6,7), V3 = c("y",8,9), V4 = c("z",0,0))
FUNK = function(x){
if("V2" %in% names(x)){
names(x) = as.character(unlist(x[1,]))
x = x[-1, ]
}
return(x)
}
list1 = list(DF1,DF2)
list2 = lapply(list1, FUNK)
for (i in 1:2){
assign(paste0("DF",i),list2[[i]])
}
The function just wraps the if statement; it is applied to a list of data frames, and the resulting data frames are then assigned back to their original names using the assign function.
You can avoid the if condition entirely in this problem. Is this what you want?
library(dplyr) # for %>% and slice()
dfs <- list(DF1,DF2) # not named ls, which would mask base::ls()
for (k in seq_along(dfs)) {
names(dfs[[k]]) <- dfs[[k]] %>% slice(1) %>% unlist()
assign(paste0("DF",k),dfs[[k]][-1,])
}
row.names(DF1) <- NULL
row.names(DF2) <- NULL
output
> DF1
x y z
1 A 0 2 4
2 B 1 3 5
> DF2
x y z
1 A 6 8 0
2 B 7 9 0
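On the more general question (doing the same thing to several objects), one common pattern is to keep the data frames in a named list and apply a single function to all of them, rather than assigning back to separate variables. The following is a minimal base R sketch under that assumption; promote_first_row is a hypothetical helper name, not a function from any package.

```r
DF1 <- data.frame(V1 = c("", "A", "B"), V2 = c("x", 0, 1),
                  V3 = c("y", 2, 3), V4 = c("z", 4, 5),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(V1 = c("", "A", "B"), V2 = c("x", 6, 7),
                  V3 = c("y", 8, 9), V4 = c("z", 0, 0),
                  stringsAsFactors = FALSE)

# Hypothetical helper: use the first row as column names, then drop it.
promote_first_row <- function(x) {
  if ("V2" %in% names(x)) {
    names(x) <- as.character(unlist(x[1, ]))
    x <- x[-1, , drop = FALSE]
    row.names(x) <- NULL  # renumber rows from 1
  }
  x
}

# One call handles every data frame in the list.
dfs <- lapply(list(DF1 = DF1, DF2 = DF2), promote_first_row)
print(dfs$DF1)
print(dfs$DF2)
```

If separate variables are really needed afterwards, list2env(dfs, envir = globalenv()) would write the list elements back under their names, but working with the list directly is usually simpler.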

Checking conditions and assigning values by row in R

I have a dataset that has one row per subject, and there is a variable for which I want to reassign values based on a condition. For example, if the value of the variable is 6, I want to change the value to the mean of the other variables in the dataset.
Subject V1 V2 V3 V4
123 2 2 2 3
234 1 5 4 4
345 1 4 3 6
In the above dataset, for each patient, I want to reassign all 6's for V4 with the mean of that patient's V1, V2, V3. Thus, for subject 345, V4 would take on the new value 8/3 or ((1+4+3)/3). I was thinking of using an ifelse statement, but I haven't been able to get it to work. Any help would be greatly appreciated.
Given:
library(dplyr)
library(tibble)
data <- tibble(
Subject = c("123", "234", "345"),
V1 = c(2, 1, 1),
V2 = c(2, 5, 4),
V3 = c(2, 4, 3),
V4 = c(3, 4, 6)
)
You could do this using base-R:
data$V4 <- ifelse(data$V4 == 6,(data$V1 + data$V2 + data$V3)/3, data$V4)
Or using a dplyr chain:
data <- data %>%
mutate(V4 = ifelse(V4 == 6,(V1 + V2 + V3)/3, V4))
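Alternatively, turn the V4 values equal to 6 into NA and then replace them with the row means of the remaining columns (df is the data frame from the question; df[-1] drops the Subject column):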
df$V4[df$V4 == 6] <- NA
df$V4 <- ifelse(is.na(df$V4), rowMeans(df[-1], na.rm = TRUE), df$V4)
df
# Subject V1 V2 V3 V4
#1 123 2 2 2 3.00
#2 234 1 5 4 4.00
#3 345 1 4 3 2.67
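You can use either of the formulas below. They index columns by position, so they are only correct if d holds just V1 to V4 (without the Subject column).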
d[,4]<-ifelse(d[,4]==6,(d[,1]+d[,2]+d[,3])/3,d[,4])
d[,4]<-ifelse(d[,4]==6,rowMeans(d[,1:3]),d[,4])
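To avoid depending on column positions, the same replacement can be sketched with column names and a row index. This is an illustrative variant, not any answerer's exact method; other_cols and rows_to_fix are names introduced here.

```r
data <- data.frame(
  Subject = c("123", "234", "345"),
  V1 = c(2, 1, 1),
  V2 = c(2, 5, 4),
  V3 = c(2, 4, 3),
  V4 = c(3, 4, 6),
  stringsAsFactors = FALSE
)

# Columns whose mean replaces V4 (assumed here to be V1 to V3).
other_cols <- c("V1", "V2", "V3")

# which() skips NA comparisons, so missing V4 values are left alone.
rows_to_fix <- which(data$V4 == 6)

# Overwrite only the affected rows with the mean of the other columns.
data$V4[rows_to_fix] <- rowMeans(data[rows_to_fix, other_cols, drop = FALSE])
print(data)
```

Only the rows that meet the condition are touched, and drop = FALSE keeps the selection a data frame even when a single row matches.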

How to use a dataset to extract specific columns from another dataset?

Use intersect to find common names between two data sets.
snp.common <- intersect(data1$snp, colnames(data2))
data2.separated <- data2[,snp.common]
It's always better to supply a minimal reproducible example:
df1 <- data.frame(V1 = 1:3,
V2 = 4:6,
V3 = 7:9)
df2 <- data.frame(snp = c("V2", "V3"),
stringsAsFactors=FALSE)
Now we can use a character vector to index the columns we want:
df1[, df2$snp]
Returns:
V2 V3
1 4 7
2 5 8
3 6 9
Edit:
Would you know how to do this so that it retains the "i..POP" column in data2?
df1 <- data.frame(ID = letters[1:3],
V1 = 1:3,
V2 = 4:6,
V3 = 7:9)
names(df1)[1] <- "ï..POP"
df2 <- data.frame(snp = c("V2", "V3"),
stringsAsFactors=FALSE)
We can use c to combine the names of the columns:
df1[, c("ï..POP", df2$snp)]
ï..POP V2 V3
1 a 4 7
2 b 5 8
3 c 6 9
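The two ideas above (intersect and indexing by a character vector) can be combined so that names missing from the data are skipped instead of raising an "undefined columns selected" error. A small sketch, assuming an ID column called "ID" and a made-up name "V9" that does not exist in df1:

```r
df1 <- data.frame(ID = letters[1:3],
                  V1 = 1:3,
                  V2 = 4:6,
                  V3 = 7:9)
# "V9" is a hypothetical name that is absent from df1.
df2 <- data.frame(snp = c("V2", "V3", "V9"),
                  stringsAsFactors = FALSE)

# Keep the ID column plus every requested column that actually exists.
keep <- intersect(c("ID", df2$snp), names(df1))

# drop = FALSE keeps a data frame even if only one column is selected.
result <- df1[, keep, drop = FALSE]
print(result)
```

intersect() returns the names in the order of its first argument, so the ID column stays first.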

Apply na.fill to every column

I have a dataset that looks like this:
Col1 Col2 Col3 Col4 Col5
A B 4 5 7
G H 5 6 NA
H I NA 9 8
K F 9 NA NA
E L NA 8 9
H I 1 0 10
How do I apply the na.fill() function to all the columns after Col2?
If I were to do it individually, it would be something like this:
df$Col3<-na.fill(df$Col3, c(NA, "extend", NA))
df$Col4<-na.fill(df$Col4, c(NA, "extend", NA))
df$Col5<-na.fill(df$Col5, c(NA, "extend", NA))
The problem is that my actual dataframe has over 100 columns. Is there a quick way to apply this function to all the columns after the first 2?
na.fill handles multiple columns itself, so there is really no need for lapply, mutate, etc. Just replace the relevant columns with the result of running na.fill on those same columns. If you already know which columns to fill, you can set ix directly instead of computing it in the first line; in this example, ix <- 3:5 or ix <- -(1:2) would work as well.
library(zoo)
ix <- sapply(DF, is.numeric)
replace(DF, ix, na.fill(DF[ix], c(NA, "extend", NA)))
giving:
Col1 Col2 Col3 Col4 Col5
1 A B 4 5.0 7.0
2 G H 5 6.0 7.5
3 H I 7 9.0 8.0
4 K F 9 8.5 8.5
5 E L 5 8.0 9.0
6 H I 1 0.0 10.0
Note that you could alternately use na.approx:
replace(DF, ix, na.approx(DF[ix], na.rm = FALSE))
Note: the input DF used above, in reproducible form:
Lines <- "Col1 Col2 Col3 Col4 Col5
A B 4 5 7
G H 5 6 NA
H I NA 9 8
K F 9 NA NA
E L NA 8 9
H I 1 0 10"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, strip.white = TRUE)
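To illustrate what the fill specification c(NA, "extend", NA) amounts to (as I read the zoo documentation: NA on the left, linear interpolation in the interior, NA on the right), here is a base R sketch using stats::approx. It is an approximation of the behaviour for illustration, not zoo's implementation; fill_interior is a hypothetical helper name.

```r
# Hypothetical helper: linearly interpolate interior NAs,
# leaving leading and trailing NAs untouched.
fill_interior <- function(x) {
  ok <- !is.na(x)
  if (sum(ok) < 2) return(x)  # nothing to interpolate between
  # rule = 1 returns NA outside the range of the observed positions.
  approx(which(ok), x[ok], xout = seq_along(x), rule = 1)$y
}

DF <- data.frame(
  Col1 = c("A", "G", "H", "K", "E", "H"),
  Col2 = c("B", "H", "I", "F", "L", "I"),
  Col3 = c(4, 5, NA, 9, NA, 1),
  Col4 = c(5, 6, 9, NA, 8, 0),
  Col5 = c(7, NA, 8, NA, 9, 10),
  stringsAsFactors = FALSE
)

# Apply the helper to every numeric column in one step.
num <- sapply(DF, is.numeric)
DF[num] <- lapply(DF[num], fill_interior)
print(DF)
```

On this data the interpolated values agree with the na.fill output shown above.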
The mutate_-family of functions in the dplyr package would do the trick.
There are a few ways to do this. Some may work better than others depending on what your other columns look like. Here are three versions that would work better in different circumstances.
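# Load the packages used below and make dummy data.
library(dplyr) # %>%, mutate_at(), mutate_if(), mutate_all()
library(stringr) # str_subset()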
df <- data.frame(
Col1 = LETTERS[1:6],
Col2 = LETTERS[7:12],
Col3 = c(4, 5, NA, 9, NA, 1),
Col4 = c(5,6,9,NA,8,0),
Col5 = c(7,NA,8,NA,9,10)
)
You can apply the na.fill function to columns specified by a vector of names. This is useful if you want to use a regular expression to select columns whose names contain certain parts.
cn <- names(df) %>%
str_subset("[345]") # Column names with 3, 4 or 5 in them.
result_1 <- df %>%
mutate_at(cn, # A character vector of column names can be passed directly.
zoo::na.fill, c(NA, 'extend', NA)
)
You can apply the na.fill function to any numeric column.
result_2 <- df %>%
mutate_if(is.numeric, # First argument is function that returns a logical vector.
zoo::na.fill, c(NA, 'extend', NA)
)
You can apply the function to columns specified in a numeric index vector.
result_3 <- df
result_3[ , 3:5] <- result_3[ , 3:5] %>% # Just replace columns 3 through 5
mutate_all(
zoo::na.fill, c(NA, 'extend', NA)
)
In this case, all three versions give the same result.
all.equal(result_1, result_2) # TRUE
all.equal(result_1, result_3) # TRUE
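In dplyr 1.0.0 and later, the scoped verbs (mutate_at, mutate_if, mutate_all) are superseded by across(). A sketch of the numeric-column version in that style, assuming dplyr >= 1.0.0 and zoo are installed; result_4 is a name introduced here for the example.

```r
library(dplyr)

df <- data.frame(
  Col1 = LETTERS[1:6],
  Col2 = LETTERS[7:12],
  Col3 = c(4, 5, NA, 9, NA, 1),
  Col4 = c(5, 6, 9, NA, 8, 0),
  Col5 = c(7, NA, 8, NA, 9, 10)
)

# across() applies the function to every column selected by where().
result_4 <- df %>%
  mutate(across(where(is.numeric),
                ~ zoo::na.fill(.x, c(NA, "extend", NA))))
print(result_4)
```

The selection inside across() can also be a set of names or positions, for example across(Col3:Col5, ...) or across(3:5, ...).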

subset a dataframe based on sum of a column

I have a df that looks like this:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
5 e 0.11258865
6 f 0.07228970
7 g 0.05673759
8 h 0.05319149
9 i 0.03989362
I would like to subset it using the cumulative sum of the column value, i.e., I want to keep rows starting from the first one until the running sum of value first exceeds 0.6. My desired output is:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
I have tried df2[, colSums[,5]>=0.6] but obviously colSums is expecting an array
Thanks in advance
Here's an approach:
df2[seq(which(cumsum(df2$value) >= 0.6)[1]), ]
The result:
name value
1 a 0.2001942
2 b 0.1799645
3 c 0.1425701
4 d 0.1425701
I'm not sure I understand exactly what you are trying to do, but I think cumsum should be able to help.
First to make this reproducible, let's use dput so others can help:
df <- structure(list(name = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), value = c(0.20019421,
0.17996454, 0.1425701, 0.1425701, 0.11258865, 0.0722897, 0.05673759,
0.05319149, 0.03989362)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-9L))
Then look at what cumsum(df$value) provides:
cumsum(df$value)
# [1] 0.2001942 0.3801587 0.5227289 0.6652990 0.7778876 0.8501773 0.9069149 0.9601064 1.0000000
Finally, subset accordingly:
subset(df, cumsum(df$value) <= 0.6)
# name value
# 1 a 0.2001942
# 2 b 0.1799645
# 3 c 0.1425701
subset(df, cumsum(df$value) >= 0.6)
# name value
# 4 d 0.14257010
# 5 e 0.11258865
# 6 f 0.07228970
# 7 g 0.05673759
# 8 h 0.05319149
# 9 i 0.03989362
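Neither subset() call above returns exactly rows a to d from the question: the first stops one row short and the second keeps the tail. A sketch that keeps every row up to and including the one that pushes the running sum to 0.6 compares the total accumulated before each row with the threshold. It assumes non-negative values; sum_before is a name introduced here.

```r
df <- data.frame(
  name = letters[1:9],
  value = c(0.20019421, 0.17996454, 0.1425701, 0.1425701, 0.11258865,
            0.0722897, 0.05673759, 0.05319149, 0.03989362),
  stringsAsFactors = FALSE
)

# Running total accumulated before each row (0 for the first row).
sum_before <- c(0, head(cumsum(df$value), -1))

# Keep a row as long as the threshold had not been reached before it.
result <- df[sum_before < 0.6, ]
print(result)
```

Unlike indexing with which(...)[1], this also behaves sensibly when the total never reaches the threshold: all rows are kept.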
