Example: Let's say I have the two dataframes
DF1 = data.frame(V1 = c("","A", "B"), V2 = c("x",0,1), V3 = c("y",2,3), V4 = c("z",4,5))
DF2 = data.frame(V1 = c("","A", "B"), V2 = c("x",6,7), V3 = c("y",8,9), V4 = c("z",0,0))
so
> DF1
  V1 V2 V3 V4
1     x  y  z
2  A  0  2  4
3  B  1  3  5
> DF2
  V1 V2 V3 V4
1     x  y  z
2  A  6  8  0
3  B  7  9  0
and I want to have the first row as column names here, so
> DF1
    x y z
1 A 0 2 4
2 B 1 3 5
> DF2
    x y z
1 A 6 8 0
2 B 7 9 0
What I do to achieve this is
if("V2" %in% names(DF1)){
names(DF1) = as.character(unlist(DF1[1,]))
DF1 = DF1[-1, ]
}
if("V2" %in% names(DF2)){
names(DF2) = as.character(unlist(DF2[1,]))
DF2 = DF2[-1, ]
}
which does what we want in this example.
QUESTION: What's the best way to avoid having two nearly identical if statements here? The first thing that came to my mind was iterating over the two data frames in a loop, but that doesn't work directly because you have to write the results back to the original names (at least it didn't work for me).
Or, more generally: how do you avoid doing the same thing for multiple objects when a plain loop doesn't work?
We could use row_to_names from janitor
library(janitor)
DF1 <- row_to_names(DF1, 1)
DF2 <- row_to_names(DF2, 1)
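If the goal is also to avoid repeating that call per data frame, one possible sketch (using base R's mget() and list2env(); it assumes both frames live in the global environment) is:
library(janitor)
# Apply row_to_names() to each frame in one pass, then write the cleaned
# frames back under their original names.
cleaned <- lapply(mget(c("DF1", "DF2")), row_to_names, row_number = 1)
list2env(cleaned, envir = .GlobalEnv)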
There must be a better way, but until someone posts it, I think this chunk of code works:
DF1 = data.frame(V1 = c("","A", "B"), V2 = c("x",0,1), V3 = c("y",2,3), V4 = c("z",4,5))
DF2 = data.frame(V1 = c("","A", "B"), V2 = c("x",6,7), V3 = c("y",8,9), V4 = c("z",0,0))
FUNK = function(x) {
  if ("V2" %in% names(x)) {
    names(x) = as.character(unlist(x[1, ]))
    x = x[-1, ]
  }
  return(x)
}
list1 = list(DF1, DF2)
list2 = lapply(list1, FUNK)
for (i in 1:2) {
  assign(paste0("DF", i), list2[[i]])
}
The function just wraps the if statement; it is then applied to a list of data frames, and the new data frames are assigned back to their original names using the assign() function.
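A slightly more compact variant of the same idea (a sketch; the named-list plus base R list2env() trick is mine, not part of the answer above) avoids the assign() loop entirely:
# Name the list elements after the original objects, apply FUNK to each,
# and write the results back to the global environment in one call.
list1 <- list(DF1 = DF1, DF2 = DF2)
list2env(lapply(list1, FUNK), envir = .GlobalEnv)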
You can avoid using the if condition at all in this problem. Is this what you want?
library(dplyr)  # for %>% and slice()
ls <- list(DF1, DF2)
for (k in seq_along(ls)) {
  names(ls[[k]]) <- ls[[k]] %>% slice(1) %>% unlist()
  assign(paste0("DF", k), ls[[k]][-1, ])
}
row.names(DF1) <- NULL
row.names(DF2) <- NULL
output
> DF1
x y z
1 A 0 2 4
2 B 1 3 5
> DF2
x y z
1 A 6 8 0
2 B 7 9 0
Related
How do you save rows of outputs from a for loop to an R object?
Below I am trying to iterate by group, and save the outputs to a new data frame in R.
I can get the output to print in the console, but am unable to save it to an object. What is the best way to do this?
#dataset
mat = matrix(1:30, nrow = 10, ncol = 3) #dataset values
group = c(1,1,1,2,2,2,3,3,3,3) #three groups
df = as.data.frame(cbind(mat, group)) #dataset to use
df[,1] = as.numeric(df[,1])
df[,2] = as.numeric(df[,2])
df[,3] = as.numeric(df[,3])
df[,4] = as.factor(df[,4])
#function
funk = function(x) {
  V1 = mean(x[,1])
  V2 = min(x[,2])
  V3 = max(x[,3])
  c(V1, V2, V3)
}
For loop
#for loop
iterate <- function(x) {
  z = levels(x[,4])
  z = as.numeric(z)
  for (i in z) {
    y <- subset(x, x[,4] == i)
    out <- funk(y)
    out = as.data.frame(t(out))
    print(out)
    #!! how to save to an object with three rows (one for each group)
  }
}
iterate(df)
You might use dplyr for this,
df %>%
  group_by(group) %>%
  summarize(V1 = mean(V1), V2 = min(V2), V3 = max(V3))
# # A tibble: 3 x 4
# group V1 V2 V3
# <fct> <dbl> <dbl> <dbl>
# 1 1 2 11 23
# 2 2 5 14 26
# 3 3 8.5 17 30
Or you can do it with base R, not necessarily as elegantly:
do.call(
rbind,
by(df, df$group, FUN = function(z) data.frame(group = z$group[1], V1=mean(z$V1), V2=min(z$V2), V3=max(z$V3)))
)
# group V1 V2 V3
# 1 1 2.0 11 23
# 2 2 5.0 14 26
# 3 3 8.5 17 30
You can use assign() inside the loop to save each iteration's result as its own object: assign(paste0("out_", i), out) creates an object called out_i for each iteration i. (Note that paste0() has no sep argument, so the original paste0("out", i, sep = "_") would actually produce names like "out1_".)
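For completeness, a minimal sketch of that idea applied to the iterate() function from the question (the names out_1, out_2, ... and results are illustrative); it also collects the per-group rows into the single three-row object the question asks for:
iterate <- function(x) {
  z <- as.numeric(levels(x[, 4]))
  results <- vector("list", length(z))                    # one slot per group
  for (i in seq_along(z)) {
    y <- subset(x, x[, 4] == z[i])
    out <- as.data.frame(t(funk(y)))
    assign(paste0("out_", z[i]), out, envir = .GlobalEnv) # creates out_1, out_2, out_3
    results[[i]] <- out
  }
  do.call(rbind, results)                                 # one row per group
}
iterate(df)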
Consider the following data frame:
df <- setNames(data.frame(1:5,rep(1,5)), c("id", "value"))
I want to change the values of multiple cells in the column "id". Let's say I want to make the following changes:
df$id[df$id %in% 2:3] <- 1
df$id[df$id == 4] <- 3
However, instead of using the code above, I want to create a function that does the transformation more smoothly (because I have a lot of data frames where I need to change the cell values). I want to create a function:
mapping <- function(...) {
...
}
where afterwards I can create a simple mapping function for my df in which I only have to specify the "old" and the "new" values for the cells. Something like this:
df_mapping <- function(...) {
  # 2 -> 1
  # 3 -> 1
  # 4 -> 3
}
And then I can apply the function to my data and specify which column it should operate on, and it will work in the same way as a gsub-style replacement:
df <- df_mapping(df,id)
Is it possible to create that mapping function?
If we need a function, it can have a 'data' argument, a column name, the values to replace, and a replacement value; it then creates the logical condition, subsets the column, assigns replacer_val, and returns the data set after the assignment:
f1 <- function(dat, colnm, values_to_replace, replacer_val) {
  dat[[colnm]][dat[[colnm]] %in% values_to_replace] <- replacer_val
  return(dat)
}
f1(df, "id", c(2, 3), 1)
Output:
# id value
#1 1 1
#2 1 1
#3 1 1
#4 4 1
#5 5 1
To replace values with a corresponding set of replacement values:
f2 <- function(dat, colnm, values_to_replace, replacer_vals) {
  nm1 <- setNames(replacer_vals, values_to_replace)
  v1 <- nm1[as.character(dat[[colnm]])]
  i1 <- !is.na(v1)
  dat[[colnm]][i1] <- v1[i1]
  return(dat)
}
f2(df, "id", c(2, 3), c(5, 6))
# id value
#1 1 1
#2 5 1
#3 6 1
#4 4 1
#5 5 1
Or another option is to create a key/value dataset and use merge or join
library(data.table)
f3 <- function(dat, colnm, values_to_replace, replacer_vals) {
  keydat <- data.frame(key = values_to_replace, val = replacer_vals)
  names(keydat)[1] <- colnm
  dt <- as.data.table(dat)
  dt[keydat, (colnm) := val, on = colnm][]
  return(dt)
}
f3(df, "id", c(2, 5), c(3, 6))
Maybe a mapping like below could help
mapping <- function(df, id, to_replace, obj_value) {
  transform(df, id = replace(id, id %in% to_replace, obj_value))
}
e.g.,
> mapping(df, id, c(2, 3), 1)
id value
1 1 1
2 1 1
3 1 1
4 4 1
5 5 1
You can use dplyr's recode function
mapping <- function(data, col, old, new) {
  data[[col]] <- dplyr::recode(data[[col]], !!!setNames(new, old))
  data
}
mapping(df, "id", c(2, 3), c(7L, 8L))
# id value
#1 1 1
#2 7 1
#3 8 1
#4 4 1
#5 5 1
How to use a dataset to extract specific columns from another dataset?
Use intersect to find the SNP names that are listed in data1 and present as columns of data2, then index data2 with that character vector:
snp.common <- intersect(data1$snp, colnames(data2))
data2.separated <- data2[, snp.common]
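A hypothetical reproducible illustration (data1 and data2 are made up here, with data1$snp listing SNP names and data2 having one column per SNP):
data1 <- data.frame(snp = c("rs1", "rs3"), stringsAsFactors = FALSE)
data2 <- data.frame(rs1 = 1:3, rs2 = 4:6, rs3 = 7:9)
snp.common <- intersect(data1$snp, colnames(data2))
data2.separated <- data2[, snp.common]
data2.separated
#   rs1 rs3
# 1   1   7
# 2   2   8
# 3   3   9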
It's always better to supply a minimal reproducible example:
df1 <- data.frame(V1 = 1:3,
V2 = 4:6,
V3 = 7:9)
df2 <- data.frame(snp = c("V2", "V3"),
stringsAsFactors=FALSE)
Now we can use a character vector to index the columns we want:
df1[, df2$snp]
Returns:
V2 V3
1 4 7
2 5 8
3 6 9
Edit:
Would you know how to do this so that it retains the "ï..POP" column in data2?
df1 <- data.frame(ID = letters[1:3],
V1 = 1:3,
V2 = 4:6,
V3 = 7:9)
names(df1)[1] <- "ï..POP"
df2 <- data.frame(snp = c("V2", "V3"),
stringsAsFactors=FALSE)
We can use c to combine the names of the columns:
df1[, c("ï..POP", df2$snp)]
ï..POP V2 V3
1 a 4 7
2 b 5 8
3 c 6 9
I want to efficiently sum the entries of two data frames, though the data frames are not guaranteed to have the same dimensions or column names. Merge isn't really what I'm after here. Instead I want to create an output object with all of the row and column names that belong to either of the added data frames. In each position of that output, I want to use the following logic for the computed value:
If a row/column pairing belongs to both input data frames, I want the output to include their sum.
If a row/column pairing belongs to just one input data frame, I want to include that value in the output.
If a row/column pairing does not belong to either input data frame, I want 0 in that position of the output.
As an example, consider the following input data frames:
df1 = data.frame(x = c(1,2,3), y = c(4,5,6))
rownames(df1) = c("a", "b", "c")
df2 = data.frame(x = c(7,8), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")
> df1
x y
a 1 4
b 2 5
c 3 6
> df2
x z w
a 7 9 2
d 8 10 3
I want the final result to be
> df2
x y z w
a 8 4 9 2
b 2 5 0 0
c 3 6 0 0
d 8 0 10 3
What I've done so far -
bind_rows / bind_cols in dplyr can throw the following:
"Error: incompatible number of rows (3, expecting 2)"
I have duplicated column names, so 'merge' isn't working for my purposes either - returns an empty df for some reason.
Seems like you could merge on the rownames, then take care of the sums and conversion of NA to zero with some additional munging:
library(dplyr)
df.new = df1 %>% add_rownames %>%
  full_join(df2 %>% add_rownames, by = "rowname") %>%
  mutate_each(funs(replace(., which(is.na(.)), 0))) %>%
  mutate(x = x.x + x.y) %>%
  select(rowname, x, y, z, w)
Or, with @DavidArenburg's much more elegant and extensible solution:
df.new = df1 %>% add_rownames %>%
  full_join(df2 %>% add_rownames) %>%
  group_by(rowname) %>%
  summarise_each(funs(sum(., na.rm = TRUE)))
df.new
rowname x y z w
1 a 8 4 9 2
2 b 2 5 0 0
3 c 3 6 0 0
4 d 8 0 10 3
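On current dplyr versions, add_rownames() and summarise_each() are deprecated; a rough equivalent of the second approach (a sketch, assuming dplyr >= 1.0 and tibble are available) would be:
library(dplyr)
library(tibble)
# rownames_to_column() replaces add_rownames(), and across() replaces
# summarise_each(); the join-then-sum logic is the same as above.
df.new <- df1 %>%
  rownames_to_column("rowname") %>%
  full_join(df2 %>% rownames_to_column("rowname")) %>%
  group_by(rowname) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))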
This looks like a simple merge on the common column names (plus the row names) followed by a simple aggregation; here is how I would tackle it:
library(data.table)
merge(setDT(df1, keep.rownames = TRUE),        # Convert to data.table + keep rows
      setDT(df2, keep.rownames = TRUE),        # Convert to data.table + keep rows
      by = intersect(names(df1), names(df2)),  # merge on common column names
      all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn]  # Sum all columns by group
# rn x y z w
# 1: a 8 4 9 2
# 2: b 2 5 0 0
# 3: c 3 6 0 0
# 4: d 8 0 10 3
A pretty straightforward base R solution:
df1$rn <- row.names(df1)
df2$rn <- row.names(df2)
res <- merge(df1, df2, all = TRUE)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)
# x y z w
# a 8 4 9 2
# b 2 5 0 0
# c 3 6 0 0
# d 8 0 10 3
First, I would grab the names of all the rows and columns of the new entity:
(all.rows <- unique(c(row.names(df1), row.names(df2))))
# [1] "a" "b" "c" "d"
(all.cols <- unique(c(names(df1), names(df2))))
# [1] "x" "y" "z" "w"
Then I would construct an output matrix with those rows and column names (with matrix data initialized to all 0s), adding df1 and df2 to the relevant parts of that matrix.
out <- matrix(0, nrow=length(all.rows), ncol=length(all.cols))
rownames(out) <- all.rows
colnames(out) <- all.cols
out[row.names(df1),names(df1)] <- unlist(df1)
out[row.names(df2),names(df2)] <- out[row.names(df2),names(df2)] + unlist(df2)
out
# x y z w
# a 8 4 9 2
# b 2 5 0 0
# c 3 6 0 0
# d 8 0 10 3
Using xtabs on melted / stacked data frames:
out <- rbind(cbind(rn=rownames(df1),stack(df1)), cbind(rn=rownames(df2),stack(df2)))
as.data.frame.matrix(xtabs(values ~ rn + ind, data=out))
# x y w z
#a 8 4 2 9
#b 2 5 0 0
#c 3 6 0 0
#d 8 0 3 10
I'm not convinced the accepted (or the alternative merge) method is the best. It will give incorrect results if the two frames share common rows: those get joined rather than summed.
This can be shown trivially by changing df2 to:
df2 = data.frame(x = c(1,2), y = c(4,5), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")
Expected results:
rn x y z w
1: a 2 8 9 2
2: b 2 5 0 0
3: c 3 6 0 0
4: d 2 5 10 3
Actual results:
merge(setDT(df1, keep.rownames = TRUE),
      setDT(df2, keep.rownames = TRUE),
      by = intersect(names(df1), names(df2)),
      all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn]
rn x y z w
1: a 1 4 9 2
2: b 2 5 0 0
3: c 3 6 0 0
4: d 2 5 10 3
You need to combine an outer join with an inner join (or left/right joins, i.e. merge with all = TRUE / all = FALSE). Alternatively, use plyr's rbind.fill:
Base R solution (df1 and df2 need the rn row-name column added, as above):
library(plyr)  # for rbind.fill()
res <- rbind.fill(df1, df2)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)
data.table solution:
as.data.table(rbind.fill(
  setDT(df1, keep.rownames = TRUE),
  setDT(df2, keep.rownames = TRUE)
))[, lapply(.SD, sum, na.rm = TRUE), by = rn]
I prefer the rbind.fill method as you can "merge" > 2 data frames using the same syntax.
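For instance, a sketch of that point (df3 here is a hypothetical third data frame that, like the others, carries its row names in an "rn" column):
# rbind.fill() accepts any number of data frames; the aggregation step is unchanged.
res <- rbind.fill(df1, df2, df3)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)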
How do you make a function that uses one or more pairs of values (x1,y1; x2,y2; ... as needed) to subset a data frame, like:
selection <- function(x1,y1, ...){
dfselected <- subset(df, V1 == "x1" & V2 == "y1"
## MAY OR MAY NOT BE PRESENT ##
| V1 == "x2" & V2 == "y2")
return(dfselected)
}
I can do it with subset() for a single, fixed set of conditions. Example:
df <- data.frame(
V1 = c(rep("a",5), rep("b",5)),
V2 = rep(c(1:5),2),
V3 = c(101:110)
)
i.e.
V1 V2 V3
a 1 101
a 2 102
a 3 103
a 4 104
a 5 105
b 1 106
b 2 107
b 3 108
b 4 109
b 5 110
And the subsetting for the pairs ("a", 3) and ("b", 4) looks like
dfselected <- subset(df, V1 == "a" & V2 == 3 | V1 == "b" & V2 == 4 )
I couldn't find a similar function. I don't know whether I have to pass an unspecified number of parameters to the function (the so-called "three dots") or use if/else. I'm a beginner with functions, so links or examples are welcome too.
I started mostly with that: http://www.ats.ucla.edu/stat/r/library/intro_function.htm
------------------------------ Solution after hadley's answer
selection <- function(x, y) {
  match <- data.frame(
    V1 = x,
    V2 = y,
    stringsAsFactors = FALSE
  )
  return(dplyr::semi_join(df, match))
}
It sounds like you want a semi-join: find all rows in x that have matching entries in y:
df <- data.frame(
V1 = c(rep("a",5), rep("b",5)),
V2 = rep(c(1:5), 2),
V3 = c(101:110),
stringsAsFactors = FALSE
)
match <- data.frame(
V1 = c("a", "b"),
V2 = c(3L, 4L),
stringsAsFactors = FALSE
)
library(dplyr)
semi_join(df, match)
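If you want the function itself to accept a variable number of (x, y) pairs via ..., a sketch along the same semi-join lines (selection_dots and the c("a", 3) pair format are my own invention, assuming V1 is character as in the example above):
selection_dots <- function(df, ...) {
  # Each ... argument is a length-2 vector c(V1_value, V2_value); stack them
  # into a lookup frame and keep the matching rows with a semi-join.
  pairs <- list(...)
  match <- data.frame(
    V1 = vapply(pairs, function(p) as.character(p[1]), character(1)),
    V2 = as.numeric(vapply(pairs, function(p) as.character(p[2]), character(1))),
    stringsAsFactors = FALSE
  )
  dplyr::semi_join(df, match, by = c("V1", "V2"))
}
selection_dots(df, c("a", 3), c("b", 4))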
Unless I'm missing something, you could just use base R's merge().
With the two example data.frames Hadley provided,
merge(df, match)
# V1 V2 V3
# 1 a 3 103
# 2 b 4 109