I am working with two different data frames and would like to complete one using information contained in the other. The first data frame lists observations of individual young animals whose birthdate and natal territory are known. The second contains observations of adult animals that were present in given territories within given time intervals.
Here is a reproducible example:
#First dataframe:
ID_young <- c(rep(c("a", "b", "c"), each=3), "d") # All individuals observed three times except "d", observed once
Territory_young <- c(rep(c("x", "y", "z"), each=3), "x") # All individuals are from different territories, except "a" and "d" who are from the same territory, namely "x".
Birthdate <- c(rep(c("2014-01-29", "2014-12-17", "2013-11-19"), each=3), "2012-12-04")
Birthdate <- as.Date(Birthdate)
# Second dataframe:
ID_adult <- c("e", "f", "g", "h", "i", "j", "e","f")
Territory_adult <- c("x", "x", "y", "z", "z", "z", "z", "w")
First_date <- as.Date(c("2014-01-01", "2014-01-15", "2013-12-14", "2013-05-17", "2013-05-09", "2012-09-01", "2013-06-18", "2011-04-17"))
Last_date <- as.Date(c("2014-02-28", "2014-04-17", "2014-11-02", "2014-01-13", "2015-01-03", "2013-04-17", "2013-12-25", "2014-11-11"))
# Data frames complete:
df1 <- data.frame(ID_young, Territory_young, Birthdate)
df2 <- data.frame(ID_adult, Territory_adult, First_date, Last_date)
My goal is to create a new column in df1 that consists of the number of adult animals present in the young animal's territory at the time of its birth.
In other words,
For each line of df1:
find the corresponding territory in df2
count the number of lines in df2 where the interval between df2$First_date and df2$Last_date includes df1$Birthdate
fill in that number in the new column of df1
For example, for the first three lines of df1 (corresponding to the young animal "a"), that count would be 2, because adults "e" and "f" were present in territory "x" when young "a" was born (2014-01-29).
Could someone help me formulate the right combination of conditional statements that would allow me to do that? I am trying for and if statements at the moment but have nothing worth showing.
Thanks!
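# apply() passes each row of df1 as a character vector, so the birthdate
# is converted back to a Date before comparing
nb.adults = apply(df1, 1, function(row, df2) {
  terr = as.character(row[2])
  bd = as.Date(row[3])
  # inclusive bounds, so an adult first or last seen on the birthdate counts
  nb.adults = length(which(df2$First_date <= bd & bd <= df2$Last_date &
                           df2$Territory_adult == terr))
  return(nb.adults)
}, df2)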
df1 = cbind(df1, nb.adults)
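As a variation on the same idea, here is a minimal base R sketch that uses mapply() instead of apply(), so the dates stay Date objects rather than being coerced to character. The helper name count_adults and the column name Count_adult are placeholders of mine, and the interval is treated as inclusive at both ends, which is an assumption about what "includes" means in the question.

```r
# Rebuild the example data from the question
df1 <- data.frame(
  ID_young = c(rep(c("a", "b", "c"), each = 3), "d"),
  Territory_young = c(rep(c("x", "y", "z"), each = 3), "x"),
  Birthdate = as.Date(c(rep(c("2014-01-29", "2014-12-17", "2013-11-19"), each = 3),
                        "2012-12-04")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID_adult = c("e", "f", "g", "h", "i", "j", "e", "f"),
  Territory_adult = c("x", "x", "y", "z", "z", "z", "z", "w"),
  First_date = as.Date(c("2014-01-01", "2014-01-15", "2013-12-14", "2013-05-17",
                         "2013-05-09", "2012-09-01", "2013-06-18", "2011-04-17")),
  Last_date = as.Date(c("2014-02-28", "2014-04-17", "2014-11-02", "2014-01-13",
                        "2015-01-03", "2013-04-17", "2013-12-25", "2014-11-11")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of df2 rows in territory `terr`
# whose [First_date, Last_date] interval contains `bd`
count_adults <- function(terr, bd) {
  sum(df2$Territory_adult == terr &
        df2$First_date <= bd &
        df2$Last_date >= bd)
}

# One call per row of df1, pairing each territory with its birthdate
df1$Count_adult <- mapply(count_adults, df1$Territory_young, df1$Birthdate,
                          USE.NAMES = FALSE)
df1$Count_adult
# 2 2 2 0 0 0 3 3 3 0
```

Because the count is computed row by row, the duplicated rows of df1 need no special handling here, at the cost of rescanning df2 for every row.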
The recent versions of data.table support non-equi joins which can be used for this purpose:
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table
DT1 <- data.table(df1)
DT2 <- data.table(df2)
# right non-equi join to find any adults present in the territory at birth
DT2[unique(DT1),
on = c("Territory_adult==Territory_young",
"First_date<=Birthdate",
"Last_date>=Birthdate")][
# count adults for each young
, .(Count_adult = sum(!is.na(ID_adult))), by = ID_young][
# join counts into each matching row of first data.table
DT1, on = "ID_young"]
ID_young Count_adult Territory_young Birthdate
1: a 2 x 2014-01-29
2: a 2 x 2014-01-29
3: a 2 x 2014-01-29
4: b 0 y 2014-12-17
5: b 0 y 2014-12-17
6: b 0 y 2014-12-17
7: c 3 z 2013-11-19
8: c 3 z 2013-11-19
9: c 3 z 2013-11-19
10: d 0 x 2012-12-04
Note that df1 and DT1 contain duplicate rows. This is why unique() is needed in the non-equi join with the adults, and why a final join is used to make sure that the adult count appears on every row.
I have 2 data frames with account numbers and amounts plus some other irrelevant columns. I would like to compare the output with a Y or N if they match or not.
I need to compare the account number in row 1 in dataframe A to the account number in row 1 in dataframe B and if they match put a Y in a column or an N if they don't. I've managed to get the code to check if there is a match in the entire dataframe but I need to check each row individually.
E.g.
df1
|account.num|x1|x2|x3|
|100|a|b|c|
|101|a|b|c|
|102|a|b|c|
|103|a|b|c|
df2
|account.num|x1|x2|x3|
|100|a|b|c|
|102|a|b|c|
|101|a|b|c|
|103|a|b|c|
output
|account.num|x1|x2|x3|match|
|100|a|b|c|Y|
|101|a|b|c|N|
|102|a|b|c|N|
|103|a|b|c|Y|
So, row 1 matches as they have the same account number, but row 2 doesn't because they are different. However, the other data in the dataframe doesn't matter just that column. Can I do this without merging the data frames? (I did have tables, but they won't work. I don't know why. So sorry if that's hard to follow).
You can use == to test whether account.num is equal row by row, then use the resulting logical vector to index c("N", "Y"):
df1$match <- c("N", "Y")[1 + (df1[[1]] == df2[[1]])]
df1
# account.num x1 x2 x3 match
#1 100 a b c Y
#2 101 a b c N
#3 102 a b c N
#4 103 a b c Y
Data:
df1 <- data.frame(account.num=100:103, x1="a", x2="b", x3="c")
df2 <- data.frame(account.num=c(100,102,101,103), x1="a", x2="b", x3="c")
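The indexing trick works because adding 1 to a logical turns FALSE into 1 and TRUE into 2, which then pick "N" and "Y" respectively. If that reads as too terse, the following sketch does the same thing with ifelse(); the intermediate name same is just for illustration.

```r
df1 <- data.frame(account.num = 100:103, x1 = "a", x2 = "b", x3 = "c")
df2 <- data.frame(account.num = c(100, 102, 101, 103), x1 = "a", x2 = "b", x3 = "c")

# Element-wise comparison: row i of df1 against row i of df2
same <- df1$account.num == df2$account.num

# Map TRUE/FALSE to "Y"/"N"
df1$match <- ifelse(same, "Y", "N")
df1$match
# "Y" "N" "N" "Y"
```

Both versions assume the two data frames have the same number of rows; otherwise == would recycle the shorter vector.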
If you want a base R solution, here is a quick sketch. Assuming both data frames have the same number of rows, it should work with your data.
# example dataframes
a <- data.frame(A=c(1,2,3), B=c("one","two","three"))
b <- data.frame(A=c(3,2,1), B=c("three","two","one"))
res <- logical(nrow(a)) # preallocate the result vector
for (rownum in seq_len(nrow(a))) {
# iterate over all numbers of rows
res[rownum] <- all(a[rownum,]==b[rownum,])
}
res # result vector
# [1] FALSE TRUE FALSE
# you can put it in frame a like this. example colname is "equalB"
a$equalB <- res
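The loop can also be replaced by a vectorised comparison. This is only a sketch under the same assumption (both data frames have identical dimensions and column order); stringsAsFactors = FALSE is set so that the text columns are compared as plain strings.

```r
a <- data.frame(A = c(1, 2, 3), B = c("one", "two", "three"),
                stringsAsFactors = FALSE)
b <- data.frame(A = c(3, 2, 1), B = c("three", "two", "one"),
                stringsAsFactors = FALSE)

# a == b gives a logical matrix with one cell per value;
# a row matches when all of its cells are TRUE
res <- unname(rowSums(a == b) == ncol(a))
res
# FALSE TRUE FALSE

a$equalB <- res
```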
If you want a tidyverse solution, you can use left_join.
The principle here is to try to match the data from df2 to the data from df1. Where a row matches, it gets TRUE in a match column. Then the code replaces the remaining NA values with FALSE.
I'm also adding code to create the data frames from the example.
library(tidyverse)
df1 <-
tribble(~account_num, ~x1, ~x2, ~x3,
100, "a", "b", "c",
101, "a", "b", "c",
102, "a", "b", "c",
103, "a", "b", "c") %>%
rowid_to_column() # row position matters for the comparison,
# so store it explicitly as a column
df2 <-
tribble(~account_num, ~x1, ~x2, ~x3,
100, "a", "b", "c",
102, "a", "b", "c",
101, "a", "b", "c",
103, "a", "b", "c") %>%
rowid_to_column()
# take df1
df1 %>%
# try to match df1 with version of df2 with a new column where `match` = TRUE
# according to `rowid`, `account_num`, `x1`, `x2`, and `x3`
left_join(df2 %>%
tibble::add_column(match = TRUE),
by = c("rowid", "account_num", "x1", "x2", "x3")
) %>%
# replace the NA in `match` with FALSE in the df
replace_na(list(match = FALSE))
I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
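To make the role of match() concrete, here is a small sketch with the rows of df2 deliberately reversed (my own variation on the example). match() returns, for each df1$x, the position of the same value in df2$x, so the replacement follows the values rather than the row order. stringsAsFactors = FALSE is an assumption that avoids factor-level problems on older R versions.

```r
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
# Same content as in the question, but with the two rows reversed
df2 <- data.frame(x = c(2, 1), y = c("g", "f"), stringsAsFactors = FALSE)

# Position in df2$x of each value of df1$x (NA when absent)
inds <- match(df1$x, df2$x)
inds
# 2 1 NA NA

# Overwrite only the rows that have a partner in df2
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1$y
# "f" "g" "c" "d"
```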
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to import new factors (labels) into df1 that don't exist. And R doesn't know what to do with them, so it just inserts NA-values.
As in r2evans' answer, you can set stringsAsFactors = FALSE to stop strings being converted to factors. You can even disable it session-wide with options(stringsAsFactors=FALSE) (and I've heard it will be disabled by default in the forthcoming R 4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
Calling df1$x %in% df2$x returns a logical vector (boolean) indicating which elements of df1$x are found in df2$x - i.e. the first two and not the last two. But it doesn't say which position in the first vector corresponds to which position in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
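The intermediate values can be inspected directly. The sketch below uses the swapped df2 and assumes stringsAsFactors = FALSE; the names hits and pos are only for illustration.

```r
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c(2, 1), y = c("g", "f"), stringsAsFactors = FALSE)

hits <- df1$x %in% df2$x   # is each df1$x present anywhere in df2$x?
hits
# TRUE TRUE FALSE FALSE

pos <- which(hits)         # positions in df1, not in df2
pos
# 1 2

# Using df1 positions to index df2 pairs rows by order, not by value
df1$y[pos] <- df2$y[pos]
df1$y
# "g" "f" "c" "d"   (x = 1 received "g", which belongs to x = 2)
```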
For solutions, I would recommend r2evans' answer, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (this is the default from R 4.0.0 onwards); otherwise all the factor levels need to be common to both datasets.
I have a data frame of 58207 x 6. It is produced as a result of different combination of values. Using tidyverse I have grouped by the first column and used do() to assign each unique 1st column value to its specific dataframe from column 3 to 6. However, I cannot figure out how to do the same for column 2 with the difference that I only need unique values stored in a list and not the repeats.
Here is the head of the data frame.
# A tibble: 58,207 x 6
id pfam go_id name nmspace linkage_type
<chr> <fct> <fct> <fct> <fct> <fct>
1 O00273_~ PF020~ GO:000~ cytoplasm cellular_compo~ IEA
2 O00273_~ PF020~ GO:000~ cytosol cellular_compo~ IDA
3 O00273_~ PF020~ GO:000~ plasma membrane cellular_compo~ IDA
4 O00273_~ PF020~ GO:000~ nuclear chromatin cellular_compo~ IDA
5 O00273_~ PF020~ GO:000~ apoptotic process biological_pro~ IEA
6 O00273_~ PF020~ GO:000~ protein binding molecular_func~ IPI
Any suggestions on how to get the levels() values of the second column for each group_by(id), and store them in a list corresponding to the id, would be appreciated.
And I am new in this. If you have any suggestions on how to handle data such as this please do let me know. Basically I'm hoping to do comparisons between different IDs after.
Does this work ok for you?
# dummy data, using data.table package, converting from tibble
library(data.table)
library(tibble)
library(gtools)
df <- tibble(id = rep(c("id1", "id2", "id3"), each=3),
X1 = c("a", "f", "b",
"b", "a", "e",
"a", "f", "f"))
dt <- as.data.table(df)
dt[]
# retaining data structure
out1 <- dt[, .(unique.X1 = unique(X1)), by = id]
out1[]
# as a list
out2 <- dt[, .(unique.X1 = list(unique(X1))), by = id]
out2[]
# back to original format
out2.df <- as_tibble(out2) # as.tibble() is deprecated in favour of as_tibble()
out2.df
# EDIT: getting unique combinations
ids <- unique(df$id)
lookup <- as.data.table(gtools::combinations(length(ids), 2))
lookup[, V1 := ids[lookup$V1]][, V2 := ids[lookup$V2]]
setnames(lookup, c("V1", "V2"), c("ID1", "ID2"))
lookup[, index := .I]
setkey(dt, id)
joined <- lookup[, .(intersect = list(intersect(dt[J(ID1), X1], dt[J(ID2), X1]))), by=index]
out <- merge(joined, lookup, by="index")
out[, index := NULL]
out[]
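If you would rather avoid extra packages, the same two steps (unique values per id, then pairwise intersections) can be sketched in base R. This is an alternative to the data.table code above, not a translation of it; the names unique_by_id, pairs and shared are placeholders.

```r
df <- data.frame(id = rep(c("id1", "id2", "id3"), each = 3),
                 X1 = c("a", "f", "b",
                        "b", "a", "e",
                        "a", "f", "f"),
                 stringsAsFactors = FALSE)

# Named list: one vector of unique X1 values per id
unique_by_id <- lapply(split(df$X1, df$id), unique)

# All unordered pairs of ids
pairs <- combn(names(unique_by_id), 2, simplify = FALSE)

# Values shared by the two ids in each pair
shared <- lapply(pairs, function(p) {
  intersect(unique_by_id[[p[1]]], unique_by_id[[p[2]]])
})
names(shared) <- vapply(pairs, paste, character(1), collapse = "-")

shared[["id1-id2"]]
# "a" "b"
```

With many ids the number of pairs grows quadratically, so for the full 58,207-row table it may be worth restricting the comparison to the ids of interest.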
Update:
I realized that the dummy data frame I created originally does not reflect the structure of the data frame that I am working with. Allow me to rephrase my question here.
Data frame that I'm starting with:
StudentAndClass <- c("Anthropology College_Name","x","y",
"Geology College_Name","z","History College_Name", "x","y","z")
df <- data.frame(StudentAndClass)
Students ("x","y","z") are enrolled in classes that they are listed under. e.g. "x" and "y" are in Anthropology, while "x", "y", "z" are in History.
How can I create the desired data frame below?
Student <- c("x", "y", "z", "x", "y","z")
Class <- c("Anthropology College_Name", "Anthropology College_Name",
"Geology College_Name", "History College_Name",
"History College_Name", "History College_Name")
df_tidy <- data.frame(Student, Class)
Original post:
I have a data frame with observations of two variables merged in a single column like so:
StudentAndClass <- c("A","x","y","A","B","z","B","C","x","y","z","C")
df <- data.frame(StudentAndClass)
where "A", "B", "C" represent classes, and "x", "y", "z" students who are taking these classes. Notice that observations of students are wedged between observations of classes.
I'm wondering how I can create a new data frame with the following format:
Student <- c("x", "y", "z", "x", "y","z")
Class <- c("A", "A", "B", "C", "C", "C")
df_tidy <- data.frame(Student, Class)
I want to extract the rows containing observations of students and put them in a new column, while making sure that each Student observation is paired with the corresponding Class observation in the Class column.
One option is to create a vector
v1 <- c('x', 'y', 'z')
Then split the data based on a logical vector and cbind
setNames(do.call(cbind, split(df, !df[,1] %in% v1)), c('Student', 'Class'))
# Student Class
#2 x A
#3 y A
#6 z B
#9 x B
#10 y C
#11 z C
Or with tidyverse
library(tidyverse)
df %>%
group_by(grp = c('Class', 'Student')[(StudentAndClass %in% v1) + 1]) %>%
mutate(n = row_number()) %>%
spread(grp, StudentAndClass) %>%
select(-n)
# A tibble: 6 x 2
# Class Student
#* <fctr> <fctr>
#1 A x
#2 A y
#3 B z
#4 B x
#5 C y
#6 C z
Update
If we need this based on elements between each pair of same 'LETTERS'
grp <- with(df, cummax(match(StudentAndClass, LETTERS[1:3], nomatch = 0)))
do.call(rbind, lapply(split(df, grp), function(x)
data.frame(Student = x[,1][2:(nrow(x)-1)], Class = x[[1]][1], stringsAsFactors=FALSE)))
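To see what the grouping vector looks like, here is a self-contained sketch of the same cummax() idea applied to the plain character vector from the original post. It assumes the class labels are exactly "A", "B", "C" and appear in that order; the names classes and blocks are mine.

```r
StudentAndClass <- c("A", "x", "y", "A", "B", "z", "B", "C", "x", "y", "z", "C")
classes <- LETTERS[1:3]

# Class labels map to 1, 2, 3 and students to 0; the running maximum then
# carries the current class index forward over the students
grp <- cummax(match(StudentAndClass, classes, nomatch = 0))
grp
# 1 1 1 1 2 2 2 3 3 3 3 3

# One block per class: keep the non-class entries as students
blocks <- split(StudentAndClass, grp)
df_tidy <- do.call(rbind, lapply(blocks, function(b) {
  students <- b[!b %in% classes]
  data.frame(Student = students,
             Class = rep(b[1], length(students)),
             stringsAsFactors = FALSE)
}))
rownames(df_tidy) <- NULL
df_tidy$Class
# "A" "A" "B" "C" "C" "C"
```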
Updated
In essence, you just need to find which indexes have college names, use those to get the range of students in each college, then subset the main vector by those ranges. Since students aren't guaranteed to be nested between two similar values, you have to be careful about any "empty" colleges.
college_indices <- which(endsWith(StudentAndClass, 'College_Name'))
colleges <- StudentAndClass[college_indices]
bounds_mat <- rbind(
start = college_indices,
# the last college runs to the end of the vector; use length + 1 because 1 is subtracted below
end = c(college_indices[-1], length(StudentAndClass) + 1)
)
colnames(bounds_mat) <- colleges
bounds_mat['start', ] <- bounds_mat['start', ] + 1
bounds_mat['end', ] <- bounds_mat['end', ] - 1
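# This prevents any problems if a college has no listed students
empty_college <- bounds_mat['start', ] > bounds_mat['end', ]
# drop = FALSE keeps the matrix shape even if only one college remains
bounds_mat <- bounds_mat[, !empty_college, drop = FALSE]
# lapply always returns a list, whereas apply would simplify to a matrix
# when every college has the same number of students
class_listing <- lapply(
  seq_len(ncol(bounds_mat)),
  function(j) {
    StudentAndClass[bounds_mat['start', j]:bounds_mat['end', j]]
  }
)
names(class_listing) <- colnames(bounds_mat)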
df_tidy <- data.frame(
Student = unlist(class_listing),
Class = rep(names(class_listing), lengths(class_listing)),
row.names = NULL
)
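The same idea, labelling every entry with the most recent college name that precedes it, can also be sketched more compactly with cumsum(). This is an alternative formulation rather than the code above, and it assumes that the first entry of the vector is a college name and that every college name ends in "College_Name".

```r
StudentAndClass <- c("Anthropology College_Name", "x", "y",
                     "Geology College_Name", "z",
                     "History College_Name", "x", "y", "z")

# TRUE for college rows, FALSE for student rows
is_class <- endsWith(StudentAndClass, "College_Name")

# cumsum() counts how many colleges have been seen so far, which is the
# index of the college each row belongs to
class_of_row <- StudentAndClass[is_class][cumsum(is_class)]

# Keep only the student rows, paired with their college
df_tidy <- data.frame(Student = StudentAndClass[!is_class],
                      Class = class_of_row[!is_class],
                      stringsAsFactors = FALSE)
df_tidy$Student
# "x" "y" "z" "x" "y" "z"
```

A college with no students simply contributes no rows, so no special case is needed for it.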
Must one melt a data frame prior to having it cast? From ?dcast:
data: molten data frame, see melt.
In other words, is it absolutely necessary to have a data frame molten prior to any acast or dcast operation?
Consider the following:
library("reshape2")
library("MASS")
xb <- dcast(Cars93, Manufacturer ~ Type, mean, value.var="Price")
m.Cars93 <- melt(Cars93, id.vars=c("Manufacturer", "Type"), measure.vars="Price")
xc <- dcast(m.Cars93, Manufacturer ~ Type, mean, value.var="value")
Then:
> identical(xb, xc)
[1] TRUE
So in this case the melt operation seems to have been redundant.
What are the general guiding rules in these cases? How do you decide when a data frame needs to be molten prior to a *cast operation?
Whether or not you need to melt your dataset depends on what form you want the final data to be in and how that relates to what you currently have.
The way I generally think of it is:
1. For the LHS of the formula, I should have one or more columns that will become my "id" rows. These will remain as separate columns in the final output.
2. For the RHS of the formula, I should have one or more columns whose values will become the new columns across which I "spread" my values. When there is more than one such column, dcast will create new columns based on the combinations of their values.
3. I must have just one column that feeds the values to fill in the resulting "grid" created by these rows and columns.
To illustrate with a small example, consider this tiny dataset:
mydf <- data.frame(
A = c("A", "A", "B", "B", "B"),
B = c("a", "b", "a", "b", "c"),
C = c(1, 1, 2, 2, 3),
D = c(1, 2, 3, 4, 5),
E = c(6, 7, 8, 9, 10)
)
Imagine that our possible value variables are columns "D" or "E", but we are only interested in the values from "E". Imagine also that our primary "id" is column "A", and we want to spread the values out according to column "B". Column "C" is irrelevant at this point.
With that scenario, we would not need to melt the data first. We could simply do:
library(reshape2)
dcast(mydf, A ~ B, value.var = "E")
# A a b c
# 1 A 6 7 NA
# 2 B 8 9 10
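For readers who want to see the "grid" idea without reshape2, a rough base R analogue is tapply(), which also takes row ids, column ids and a single value column. This is a different function from dcast and returns a matrix rather than a data frame; sum is used only because each cell holds at most one value here.

```r
mydf <- data.frame(
  A = c("A", "A", "B", "B", "B"),
  B = c("a", "b", "a", "b", "c"),
  C = c(1, 1, 2, 2, 3),
  D = c(1, 2, 3, 4, 5),
  E = c(6, 7, 8, 9, 10)
)

# Rows come from A, columns from B, cell values from E;
# combinations that never occur are filled with NA
grid <- tapply(mydf$E, list(A = mydf$A, B = mydf$B), sum)
grid
#    B
# A   a b  c
#   A 6 7 NA
#   B 8 9 10
```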
Compare what happens when you do the following, keeping in mind my three points above:
dcast(mydf, A ~ C, value.var = "E")
dcast(mydf, A ~ B + C, value.var = "E")
dcast(mydf, A + B ~ C, value.var = "E")
When is melt required?
Now, let's make one small adjustment to the scenario: We want to spread out the values from both columns "D" and "E" with no actual aggregation taking place. With this change, we need to melt the data first so that the relevant values that need to be spread out are in a single column (point 3 above).
dfL <- melt(mydf, measure.vars = c("D", "E"))
dcast(dfL, A ~ B + variable, value.var = "value")
# A a_D a_E b_D b_E c_D c_E
# 1 A 1 6 2 7 NA NA
# 2 B 3 8 4 9 5 10