How to use a dataset to extract specific columns from another dataset? - r

Use intersect to find common names between two data sets.
snp.common <- intersect(data1$snp, colnames(data2))
data2.separated <- data2[,snp.common]

It's always better to supply a minimal reproducible example:
df1 <- data.frame(V1 = 1:3,
                  V2 = 4:6,
                  V3 = 7:9)
df2 <- data.frame(snp = c("V2", "V3"),
                  stringsAsFactors = FALSE)
Now we can use a character vector to index the columns we want:
df1[, df2$snp]
Returns:
V2 V3
1 4 7
2 5 8
3 6 9
Edit:
Would you know how to do this so that it retains the "i..POP" column in data2?
df1 <- data.frame(ID = letters[1:3],
                  V1 = 1:3,
                  V2 = 4:6,
                  V3 = 7:9)
names(df1)[1] <- "ï..POP"
df2 <- data.frame(snp = c("V2", "V3"),
                  stringsAsFactors = FALSE)
We can use c to combine the names of the columns:
df1[, c("ï..POP", df2$snp)]
ï..POP V2 V3
1 a 4 7
2 b 5 8
3 c 6 9
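If df2$snp might contain names that are not columns of df1, indexing by name stops with an "undefined columns selected" error. A minimal sketch that combines the intersect idea with the extra ID column; the name "V9" and the variables keep and result are made up for illustration, and the ID column is called plain "ID" here:

```r
df1 <- data.frame(ID = letters[1:3], V1 = 1:3, V2 = 4:6, V3 = 7:9)
# "V9" is a hypothetical name that does not exist in df1
df2 <- data.frame(snp = c("V2", "V3", "V9"), stringsAsFactors = FALSE)

# Keep only the requested names that really exist in df1
keep <- intersect(df2$snp, names(df1))

# drop = FALSE keeps a data frame even if only one column survives
result <- df1[, c("ID", keep), drop = FALSE]
result
```

The order of the columns follows df2$snp, because intersect keeps the order of its first argument.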

Related

R - best way to apply an if statement on multiple arguments

Example: Let's say I have the two dataframes
DF1 = data.frame(V1 = c("","A", "B"), V2 = c("x",0,1), V3 = c("y",2,3), V4 = c("z",4,5))
DF2 = data.frame(V1 = c("","A", "B"), V2 = c("x",6,7), V3 = c("y",8,9), V4 = c("z",0,0))
so
> DF1              > DF2
  V1 V2 V3 V4        V1 V2 V3 V4
1     x  y  z      1     x  y  z
2  A  0  2  4      2  A  6  8  0
3  B  1  3  5      3  B  7  9  0
and I want to have the first row as column names here, so
> DF1           > DF2
     x y z           x y z
1 A  0 2 4      1 A  6 8 0
2 B  1 3 5      2 B  7 9 0
What I do to achieve this is
if("V2" %in% names(DF1)){
  names(DF1) = as.character(unlist(DF1[1,]))
  DF1 = DF1[-1, ]
}
if("V2" %in% names(DF2)){
  names(DF2) = as.character(unlist(DF2[1,]))
  DF2 = DF2[-1, ]
}
which does what we want in this example.
QUESTION: What's the best way to avoid writing two if statements here? My first thought was to iterate over the two DFs in a loop, but that doesn't work because the modified DFs have to be assigned back to their original names (at least it didn't work for me).
Or, more generally: how can I avoid repeating the same operation for multiple arguments when a loop doesn't work?
We could use row_to_names from janitor
library(janitor)
DF1 <- row_to_names(DF1, 1)
DF2 <- row_to_names(DF2, 1)
There must be a better way but until someone gives that, I think this chunk of code works:
DF1 = data.frame(V1 = c("","A", "B"), V2 = c("x",0,1), V3 = c("y",2,3), V4 = c("z",4,5))
DF2 = data.frame(V1 = c("","A", "B"), V2 = c("x",6,7), V3 = c("y",8,9), V4 = c("z",0,0))
FUNK = function(x){
  if("V2" %in% names(x)){
    names(x) = as.character(unlist(x[1,]))
    x = x[-1, ]
  }
  return(x)
}
list1 = list(DF1,DF2)
list2 = lapply(1:2,FUN = function(x) FUNK(list1[[x]]))
for (i in 1:2){
  assign(paste0("DF",i),list2[[i]])
}
The function just wraps the if statement. It is applied to a list of the data frames, and the resulting data frames are then assigned back to the original names using the "assign" function.
You can avoid using if condition at all in this problem. Is this what you desire?
library(dplyr)  # provides %>% and slice()
dfs <- list(DF1, DF2)
for (k in seq_along(dfs)) {
  names(dfs[[k]]) <- dfs[[k]] %>% slice(1) %>% unlist()
  assign(paste0("DF", k), dfs[[k]][-1, ])
}
row.names(DF1) <- NULL
row.names(DF2) <- NULL
output
> DF1
x y z
1 A 0 2 4
2 B 1 3 5
> DF2
x y z
1 A 6 8 0
2 B 7 9 0
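The approaches above can also be written without assign, by keeping the data frames in a named list. This is only a sketch in base R; the helper name promote_header and the list name dfs are made up for illustration, and the data is created with stringsAsFactors = FALSE so that the first row is plain text:

```r
DF1 <- data.frame(V1 = c("", "A", "B"), V2 = c("x", 0, 1),
                  V3 = c("y", 2, 3), V4 = c("z", 4, 5),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(V1 = c("", "A", "B"), V2 = c("x", 6, 7),
                  V3 = c("y", 8, 9), V4 = c("z", 0, 0),
                  stringsAsFactors = FALSE)

# Hypothetical helper: use the first row as column names, then drop that row
promote_header <- function(d) {
  names(d) <- as.character(unlist(d[1, ]))
  d <- d[-1, , drop = FALSE]
  rownames(d) <- NULL  # renumber the remaining rows from 1
  d
}

# One call handles every data frame in the list
dfs <- lapply(list(DF1 = DF1, DF2 = DF2), promote_header)
dfs$DF1
```

If separate objects are still wanted afterwards, list2env(dfs, envir = globalenv()) copies each list element back to a variable of the same name.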

attempting to combine (mutate) two rows into a column

I'm trying to mutate a column by dividing the value in the row above by the value in the current row. For example, let's say I have this data frame:
V1
A 4
B 2
C 8
Using something like:
df <- mutate(df, V2 = V1[row+1] / V1[row])
I want to get:
V1 v2
A 4 NA
B 2 2
C 8 0.25
I can't find any way to do this...does anyone have any info?
edit: clarity
Try with:
library(dplyr)
df <- mutate(df, v2 = lag(V1) / V1)
Output:
V1 v2
A 4 NA
B 2 2.00
C 8 0.25
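To make the shift explicit, here is a sketch of what lag(V1) contributes, written in base R so it runs without dplyr. The variable name prev is made up for illustration:

```r
df <- data.frame(V1 = c(4, 2, 8), row.names = c("A", "B", "C"))

# Shift V1 down by one position: the first row has no predecessor, so it gets NA
prev <- c(NA, head(df$V1, -1))

# Value from the row above divided by the current value
df$v2 <- prev / df$V1
df
```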
In base R, we can remove the first and last element and do the division
df$V2 <- with(df, c(NA, V1[-length(V1)]/V1[-1]))
data
df <- structure(list(V1 = c(4, 2, 8)), class = "data.frame",
                row.names = c("A", "B", "C"))

replace values of data frame column when matching value exists in 2nd data frame

I want to replace those values of df1$colB with values from df2$replacement where df1$colB is equal to df2$matches.
df1 <- data.frame(colA = 1:10, colB = letters[1:10])
df2 <- data.frame(matches= letters[4:1], replacement= LETTERS [4:1])
The result should look like df3:
df3 <- data.frame(colA =1:10, colB = c(LETTERS[1:4],letters[5:10]))
I'd like to avoid a for-loop solution for this task.
You could use the chartr function in base R.
# read in data with character vectors, not factors
df1 <- data.frame(colA = 1:10, colB = letters[1:10], stringsAsFactors = FALSE)
df2 <- data.frame(matches = letters[4:1], replacement = LETTERS[4:1], stringsAsFactors = FALSE)
df3 <- data.frame(colA = 1:10, colB = c(LETTERS[1:4], letters[5:10]), stringsAsFactors = FALSE)
# replace the characters with the desired characters
df1$colB <- chartr(paste(df2$matches, collapse=""),
paste(df2$replacement, collapse=""), df1$colB)
According to the help file, ?chartr, the function
Translate(s) characters in character vectors
You can do a merge on df1 and df2, and then replace the colB value by the replacement:
library(dplyr)
merge(df1, df2, by.x = "colB", by.y = "matches", all.x = TRUE) %>%
  mutate(colB = ifelse(!is.na(replacement), replacement, colB)) %>%
  select(colA, colB)
colA colB
1 1 A
2 2 B
3 3 C
4 4 D
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
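Another loop-free option is a lookup with base R's match(). Unlike chartr, which translates single characters, this works for values of any length. A minimal sketch, assuming the columns are character vectors rather than factors; idx and hit are illustrative names:

```r
df1 <- data.frame(colA = 1:10, colB = letters[1:10], stringsAsFactors = FALSE)
df2 <- data.frame(matches = letters[4:1], replacement = LETTERS[4:1],
                  stringsAsFactors = FALSE)

# Position of each df1$colB value in df2$matches (NA where there is no match)
idx <- match(df1$colB, df2$matches)
hit <- !is.na(idx)

# Overwrite only the matched entries; everything else stays as it was
df1$colB[hit] <- df2$replacement[idx[hit]]
df1
```

Because nothing is merged or sorted, the row order of df1 is left untouched.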

Append 0 to missing observations in a dataframe.

I have a dataset where I expect a fixed number of observations in a data-frame
A 20
B 10
C 5
However, after running my analysis this is not always the case: sometimes observations are missing, and the resulting data frame looks like this
A 10
C 5
In this case there are no observations for B. I would like to add a row with the value 0 to the final data frame before plotting, so that the missing observation is shown explicitly.
final data frame should look like this
A 10
B 0
C 5
How can I accomplish this in R?
If you define the ID column (with A,B,C) as factor which seems appropriate here, you could plot the data and even those factor levels which are not in the data (but in the defined factor levels) will be plotted. Here's a small example:
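df <- data.frame(ID = factor(LETTERS[1:3]), x = rnorm(3))  # factor() is needed in R >= 4.0, where strings are no longer converted to factors by default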
df
# ID x
#1 A 1.350458
#2 B 1.340855
#3 C 1.311329
subdf <- df[c(1,3),]
subdf
# ID x
#1 A 1.350458
#3 C 1.311329
with(subdf, plot(x ~ ID))
You'll find that "B" is also present in the plot although it's not in the subsetted data.
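The same property of factors can produce the zero as a number rather than only in a plot, if the values are counts of raw observations. A small sketch, assuming the full set of expected IDs is known in advance; the names ids and counts are illustrative:

```r
# Observed IDs, declared as a factor with all expected levels
ids <- factor(c("A", "C"), levels = c("A", "B", "C"))

# table() counts every level, including levels with no observations
counts <- table(ids)
counts
```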
Maybe you can do something with melt and dcast from "reshape2".
Here's what I had in mind:
library(reshape2)
out <- dcast(
  melt(                           # Makes a data.frame from a list
    mget(ls(pattern = "df\\d")),  # Collects the relevant df in a list
    id.vars = "V1"),              # The variable to melt by
  L1 ~ V1, value.var = "value", fill = 0)  # Other options for dcast
out
# L1 A B C
# 1 df1 20 10 5
# 2 df2 10 0 5
From there, you could go back to a long data form.
melt(out, id.vars = "L1")
# L1 variable value
# 1 df1 A 20
# 2 df2 A 10
# 3 df1 B 10
# 4 df2 B 0
# 5 df1 C 5
# 6 df2 C 5
If separate data.frames are required, then you can also look at using split, but if you are just going to be plotting, this format should work just fine.
Sample data
df1 <- structure(list(V1 = c("A", "B", "C"), V2 = c(20L, 10L, 5L)),
                 .Names = c("V1", "V2"), class = "data.frame",
                 row.names = c(NA, -3L))
df2 <- structure(list(V1 = c("A", "C"), V2 = c(10L, 5L)),
                 .Names = c("V1", "V2"), class = "data.frame",
                 row.names = c(NA, -2L))
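For a single data frame, a base R alternative is to merge the result against the full list of expected IDs and fill the gaps with 0. This is a sketch under the assumption that the expected IDs are known up front; expected, observed and filled are illustrative names:

```r
# All IDs that should appear in the final result
expected <- data.frame(V1 = c("A", "B", "C"), stringsAsFactors = FALSE)

# What the analysis actually returned (B is missing)
observed <- data.frame(V1 = c("A", "C"), V2 = c(10L, 5L),
                       stringsAsFactors = FALSE)

# all.x = TRUE keeps every expected ID; unmatched ones get NA in V2
filled <- merge(expected, observed, by = "V1", all.x = TRUE)

# Replace the NA introduced by the merge with 0
filled$V2[is.na(filled$V2)] <- 0L
filled
```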

Create a subsetting function according to one or more couple of values for a data.frame

How can I write a function that uses one or more couples of values (x1,y1 ; x2,y2 ; ... as needed) to subset a data frame, like
selection <- function(x1,y1, ...){
dfselected <- subset(df, V1 == "x1" & V2 == "y1"
## MAY OR MAY NOT BE PRESENT ##
| V1 == "x2" & V2 == "y2")
return(dfselected)
}
I can do it with subset() for a single indexing. Example:
df <- data.frame(
  V1 = c(rep("a",5), rep("b",5)),
  V2 = rep(c(1:5),2),
  V3 = c(101:110)
)
ie
V1 V2 V3
a 1 101
a 2 102
a 3 103
a 4 104
a 5 105
b 1 106
b 2 107
b 3 108
b 4 109
b 5 110
And the subsetting for the couples ("a","3") and ("b","4") looks like
dfselected <- subset(df, V1 == "a" & V2 == 3 | V1 == "b" & V2 == 4 )
I couldn't find a similar function. I don't know whether I have to pass an unspecified number of parameters to a function (the so-called "three dots") or use if/else. I'm a beginner with functions, so links or examples are welcome too.
I started mostly with that: http://www.ats.ucla.edu/stat/r/library/intro_function.htm
------------------------------ Solution after hadley's answer
selection <- function (x,y){
  match <- data.frame(
    V1 = x,
    V2 = y,
    stringsAsFactors = FALSE
  )
  return(dplyr::semi_join(df, match))
}
It sounds like you want a semi-join: find all rows in x that have matching entries in y:
df <- data.frame(
  V1 = c(rep("a",5), rep("b",5)),
  V2 = rep(c(1:5), 2),
  V3 = c(101:110),
  stringsAsFactors = FALSE
)
match <- data.frame(
  V1 = c("a", "b"),
  V2 = c(3L, 4L),
  stringsAsFactors = FALSE
)
library(dplyr)
semi_join(df, match)
Unless I'm missing something, you could just use base R's merge().
With the two example data.frames Hadley provided,
merge(df, match)
# V1 V2 V3
# 1 a 3 103
# 2 b 4 109
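The merge idea can be wrapped into the kind of function the question asks for. This is only a sketch: select_pairs and wanted are made-up names, and the data frame is passed in as an argument instead of being read from the global environment:

```r
# Hypothetical helper: return the rows of `data` whose (V1, V2) values
# match one of the pairs given by the parallel vectors x and y
select_pairs <- function(data, x, y) {
  wanted <- data.frame(V1 = x, V2 = y, stringsAsFactors = FALSE)
  merge(data, wanted, by = c("V1", "V2"))
}

df <- data.frame(
  V1 = c(rep("a", 5), rep("b", 5)),
  V2 = rep(1:5, 2),
  V3 = 101:110,
  stringsAsFactors = FALSE
)

res <- select_pairs(df, x = c("a", "b"), y = c(3L, 4L))
res
```

Passing the pairs as two vectors of equal length avoids the need for the "three dots": any number of couples fits in the same two arguments.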
